Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2665
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Hsinchun Chen Richard Miranda Daniel D. Zeng Chris Demchak Jenny Schroeder Therani Madhusudan (Eds.)
Intelligence and Security Informatics First NSF/NIJ Symposium, ISI 2003 Tucson, AZ, USA, June 2-3, 2003 Proceedings
Volume Editors

Hsinchun Chen, Daniel D. Zeng, Therani Madhusudan
University of Arizona, Department of Management Information Systems
Tucson, AZ 85721, USA
E-mail: {hchen/zeng/madhu}@eller.arizona.edu

Richard Miranda, Jenny Schroeder
Tucson Police Department
270 S. Stone Ave., Tucson, AZ 85701, USA
E-mail: [email protected]

Chris Demchak
University of Arizona, School of Public Administration and Policy
Tucson, AZ 85721, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
Since the tragic events of September 11, 2001, academics have been called on for possible contributions to research relating to national (and possibly international) security. As one of the original founding mandates of the National Science Foundation, mid- to long-term national security research in the areas of information technologies, organizational studies, and security-related public policy is critically needed. In a way similar to how medical and biological research has faced significant information overload and yet also tremendous opportunities for new innovation, law enforcement, criminal analysis, and intelligence communities are facing the same challenge. We believe, similar to “medical informatics” and “bioinformatics,” that there is a pressing need to develop the science of “intelligence and security informatics” – the study of the use and development of advanced information technologies, systems, algorithms and databases for national security related applications, through an integrated technological, organizational, and policy-based approach. We believe active “intelligence and security informatics” research will help improve knowledge discovery and dissemination and enhance information sharing and collaboration across law enforcement communities and among academics, local, state, and federal agencies, and industry. Many existing computer and information science techniques need to be reexamined and adapted for national security applications. New insights from this unique domain could result in significant breakthroughs in new data mining, visualization, knowledge management, and information security techniques and systems. This first NSF/NIJ Symposium on Intelligence and Security Informatics (ISI 2003) aims to provide an intellectual forum for discussions among previously disparate communities: academic researchers (in information technologies, computer science, public policy, and social studies), local, state, and federal law enforcement and intelligence experts, and information technology industry consultants and practitioners. Several federal research programs are also seeking new research ideas and projects that can contribute to national security. Jointly hosted by the University of Arizona and the Tucson Police Department, the NSF/NIJ ISI Symposium program committee was composed of 44 internationally renowned researchers and practitioners in intelligence and security informatics research. The 2-day program also included 5 keynote speakers, 14 invited speakers, 34 regular papers, and 6 posters. In addition to the main sponsorship from the National Science Foundation and the National Institute of Justice, the meeting was also cosponsored by several units within the University of Arizona, including the Eller College of Business and Public Administration, the Management Information Systems Department, the Internet Technology, Commerce, and Design Institute, the NSF COPLINK Center of Excellence, the Mark and Susan Hoffman E-Commerce Lab, the Center for the Management of
Information, and the Artificial Intelligence Lab, and several other organizations including the Air Force Office of Scientific Research, SAP, and CISCO. We wish to express our gratitude to all members of the conference Program Committee and the Organizing Committee. Our special thanks go to Mohan Tanniru and Joe Hindman (Publicity Committee Co-chairs), Kurt Fenstermacher, Mark Patton, and Bill Neumann (Sponsorship Committee Co-chairs), Homa Atabakhsh and David Gonzalez (Local Arrangements Co-chairs), Ann Lally and Leon Zhao (Publication Co-chairs), and Kathy Kennedy (Conference Management). Our sincere gratitude goes to all of the sponsors. Last, but not least, we thank Gary Strong, Art Becker, Larry Brandt, Valerie Gregg, and Mike O’Shea for their strong and continuous support of this meeting and other related intelligence and security informatics research.
June 2003
Hsinchun Chen, Richard Miranda, Daniel Zeng, Chris Demchak, Jenny Schroeder, Therani Madhusudan
ISI 2003 Organizing Committee
General Co-chairs:
Hsinchun Chen (University of Arizona)
Richard Miranda (Tucson Police Department)

Program Co-chairs:
Daniel Zeng (University of Arizona)
Chris Demchak (University of Arizona)
Jenny Schroeder (Tucson Police Department)
Therani Madhusudan (University of Arizona)

Publicity Co-chairs:
Mohan Tanniru (University of Arizona)
Joe Hindman (Phoenix Police Department)

Sponsorship Co-chairs:
Kurt Fenstermacher (University of Arizona)
Mark Patton (University of Arizona)
Bill Neumann (University of Arizona)

Local Arrangements Co-chairs:
Homa Atabakhsh (University of Arizona)
David Gonzalez (University of Arizona)

Publication Co-chairs:
Ann Lally (University of Arizona)
Leon Zhao (University of Arizona)
ISI 2003 Program Committee
Yigal Arens (University of Southern California)
Art Becker (Knowledge Discovery and Dissemination Program)
Larry Brandt (National Science Foundation)
Donald Brown (University of Virginia)
Judee Burgoon (University of Arizona)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Andy Chen (National Taiwan University)
Lee-Feng Chien (Academia Sinica, Taiwan)
Bill Chu (University of North Carolina, Charlotte)
Christian Collberg (University of Arizona)
Ed Fox (Virginia Tech)
Susan Gauch (University of Kansas)
Johannes Gehrke (Cornell University)
Valerie Gregg (National Science Foundation)
Bob Grossman (University of Illinois, Chicago)
Steve Griffin (National Science Foundation)
Eduard Hovy (University of Southern California)
John Hoyt (South Carolina Research Authority)
David Jensen (University of Massachusetts, Amherst)
Judith Klavans (Columbia University)
Don Kraft (Louisiana State University)
Ee-Peng Lim (Nanyang Technological University, Singapore)
Ralph Martinez (University of Arizona)
Reagan Moore (San Diego Supercomputing Center)
Clifford Neuman (University of Southern California)
David Neri (Tucson Police Department)
Greg Newby (University of North Carolina, Chapel Hill)
Jay Nunamaker (University of Arizona)
Mirek Riedewald (Cornell University)
Kathleen Robinson (Tucson Police Department)
Allen Sears (Corporation for National Research Initiatives)
Elizabeth Shriberg (SRI International)
Mike O'Shea (National Institute of Justice)
Craig Stender (State of Arizona)
Gary Strong (National Science Foundation)
Paul Thompson (Dartmouth College)
Alex Tuzhilin (New York University)
Bhavani Thuraisingham (National Science Foundation)
Howard Wactlar (Carnegie Mellon University)
Andrew Whinston (University of Texas at Austin)
Karen White (University of Arizona)
Jerome Yen (Chinese University of Hong Kong)
Chris Yang (Chinese University of Hong Kong)
Mohammed Zaki (Rensselaer Polytechnic Institute)

Keynote Speakers
Richard Carmona (Surgeon General of the United States)
Gary Strong (National Science Foundation)
Lawrence E. Brandt (National Science Foundation)
Mike O'Shea (National Institute of Justice)
Art Becker (Knowledge Discovery and Dissemination Program)

Invited Speakers
Paul Kantor (Rutgers University)
Lee Strickland (University of Maryland)
Donald Brown (University of Virginia)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Pamela Scanlon (Automated Regional Justice Information Systems)
Kelcy Allwein (Defense Intelligence Agency)
Gene Rochlin (University of California, Berkeley)
Jane Fountain (Harvard University)
John Landry (Central Intelligence Agency)
John Hoyt (South Carolina Research Authority)
Bruce Baicar (South Carolina Research Authority and National Institute of Justice)
Matt Begert (National Law Enforcement & Corrections Technology)
John Cunningham (Montgomery County Police Department)
Victor Goldsmith (City University of New York)
Table of Contents
Part I: Full Papers

Data Management and Mining

Using Support Vector Machines for Terrorism Information Extraction . . . . . 1
Aixin Sun, Myo-Myo Naing, Ee-Peng Lim, Wai Lam

Criminal Incident Data Association Using the OLAP Technology . . . . . 13
Song Lin, Donald E. Brown

Names: A New Frontier in Text Mining
Frankie Patman, Paul Thompson

Detecting Deception through Linguistic Analysis . . . . . 91
Judee K. Burgoon, J.P. Blair, Tiantian Qin, Jay F. Nunamaker, Jr.
A Longitudinal Analysis of Language Behavior of Deception in E-mail . . . . . 102
Lina Zhou, Judee K. Burgoon, Douglas P. Twitchell

Analytical Techniques

Evacuation Planning: A Capacity Constrained Routing Approach . . . . . 111
Qingsong Lu, Yan Huang, Shashi Shekhar

Locating Hidden Groups in Communication Networks Using Hidden Markov Models . . . . . 126
Malik Magdon-Ismail, Mark Goldberg, William Wallace, David Siebecker
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department . . . . . 138
Kar Wing Li, Christopher C. Yang

Decision Based Spatial Analysis of Crime . . . . . 153
Yifei Xue, Donald E. Brown

Visualization

CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis . . . . . 168
Jennifer Schroeder, Jennifer Xu, Hsinchun Chen

A Spatio Temporal Visualizer for Law Enforcement . . . . . 181
Ty Buetow, Luis Chaboya, Christopher O'Toole, Tom Cushna, Damien Daspit, Tim Petersen, Homa Atabakhsh, Hsinchun Chen

Tracking Hidden Groups Using Communications . . . . . 195
Sudarshan S. Chawathe

Knowledge Management and Adoption

Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study . . . . . 209
Paul Jen-Hwa Hu, Chienting Lin, Hsinchun Chen

"Atrium" – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age . . . . . 223
Chris C. Demchak

Untangling Criminal Networks: A Case Study . . . . . 232
Jennifer Xu, Hsinchun Chen

Collaborative Systems and Methodologies

Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework . . . . . 249
T.S. Raghu, R. Ramesh, Andrew B. Whinston

Collaborative Workflow Management for Interagency Crime Analysis . . . . . 266
J. Leon Zhao, Henry H. Bi, Hsinchun Chen

COPLINK Agent: An Architecture for Information Monitoring and Sharing in Law Enforcement . . . . . 281
Daniel Zeng, Hsinchun Chen, Damien Daspit, Fu Shan, Suresh Nandiraju, Michael Chau, Chienting Lin
Monitoring and Surveillance

Active Database Systems for Monitoring and Surveillance . . . . . 296
Antonio Badia

Integrated "Mixed" Networks Security Monitoring – A Proposed Framework . . . . . 308
William T. Scherer, Leah L. Spradley, Marc H. Evans

Bioterrorism Surveillance with Real-Time Data Warehousing . . . . . 322
Donald J. Berndt, Alan R. Hevner, James Studnicki

Part II: Short Papers

Data Management and Mining

Privacy Sensitive Distributed Data Mining from Multi-party Data . . . . . 336
Hillol Kargupta, Kun Liu, Jessica Ryan

ProGenIE: Biographical Descriptions for Intelligence Analysis . . . . . 343
Pablo A. Duboue, Kathleen R. McKeown, Vasileios Hatzivassiloglou

Scalable Knowledge Extraction from Legacy Sources with SEEK . . . . . 346
Joachim Hammer, William O'Brien, Mark Schmalz

"TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features . . . . . 350
Sachin Kajarekar, Kemal Sönmez, Luciana Ferrer, Venkata Gadde, Anand Venkataraman, Elizabeth Shriberg, Andreas Stolcke, Harry Bratt

Emergent Semantics from Users' Browsing Paths . . . . . 355
D.V. Sreenath, W.I. Grosky, F. Fotouhi

Deception Detection

Designing Agent99 Trainer: A Learner-Centered, Web-Based Training System for Deception Detection . . . . . 358
Jinwei Cao, Janna M. Crews, Ming Lin, Judee Burgoon, Jay F. Nunamaker

Training Professionals to Detect Deception . . . . . 366
Joey F. George, David P. Biros, Judee K. Burgoon, Jay F. Nunamaker, Jr.

An E-mail Monitoring System for Detecting Outflow of Confidential Documents . . . . . 371
Bogju Lee, Youna Park
Using Support Vector Machines for Terrorism Information Extraction

Aixin Sun¹, Myo-Myo Naing¹, Ee-Peng Lim¹, and Wai Lam²

¹ Centre for Advanced Information Systems, School of Computer Engineering,
Nanyang Technological University, Singapore 639798, Singapore
[email protected]
² Department of Systems Engineering and Engineering Management,
Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR
[email protected]
Abstract. Information extraction (IE) is of great importance in many applications including web intelligence, search engines, text understanding, etc. To extract information from text documents, most IE systems rely on a set of extraction patterns. Each extraction pattern is defined based on the syntactic and/or semantic constraints on the positions of desired entities within natural language sentences. The IE systems also provide a set of pattern templates that determines the kind of syntactic and semantic constraints to be considered. In this paper, we argue that such pattern templates restrict the kind of extraction patterns that can be learned by IE systems. To allow a wider range of context information to be considered in learning extraction patterns, we first propose to model the content and context information of a candidate entity to be extracted as a set of features. A classification model is then built for each category of entities using Support Vector Machines (SVM). We have conducted IE experiments to evaluate our proposed method on a text collection in the terrorism domain. From the preliminary experimental results, we conclude that our proposed method can deliver reasonable accuracies.

Keywords: Information extraction, terrorism-related knowledge discovery.
1 Introduction

1.1 Motivation
Information extraction (IE) is a task that extracts relevant information from a set of documents. IE techniques can be applied to many different areas. In the intelligence and security domains, IE can allow one to extract terrorism-related information from email messages, or identify sensitive business information from
This work is partially supported by the SingAREN 21 research grant M48020004. Dr. Ee-Peng Lim is currently a visiting professor at Dept. of SEEM, Chinese University of Hong Kong, Hong Kong, China.
news documents. In some cases where perfect extraction accuracy is not essential, automated IE methods can replace the manual extraction efforts completely. In other cases, IE may produce first-cut results that reduce the manual extraction efforts. As reported in the survey by Muslea [9], the IE methods for free text documents are largely based on extraction patterns specifying the syntactic and/or semantic constraints on the positions of desired entities within sentences. For example, from the sentence "Guerrillas attacked the 1st infantry brigade garrison", one can define the extraction pattern subject active-attack to extract "Guerrillas" as a perpetrator, and active-attack direct-object to extract "1st infantry brigade garrison" as a victim¹. The extraction pattern definitions currently used are very much based on some pre-defined pattern templates. For example, in AutoSlog [12], the above subject active-attack extraction pattern is an instantiation of the subject active-verb template. While pattern templates reduce the combinations of extraction patterns to be considered in rule learning, they may potentially pose obstacles to deriving other more expressive and accurate extraction patterns. For example, IBM acquired direct-object is a very pertinent extraction pattern for extracting company information but cannot be instantiated by any of AutoSlog's 13 pattern templates. Since it will be quite difficult to derive one standard set of pattern templates that works well for any given domain, IE methods that do not rely on templates will become necessary. In this paper, we propose the use of Support Vector Machines (SVMs) for information extraction. SVM was proposed by Vapnik [16] and has been widely used in image processing and classification problems [5]. The SVM technique finds the best surface that can separate the positive examples from the negative ones. Positive and negative examples are separated by the maximum margin measured by a normal vector w. SVM classifiers have been used in various text classification experiments [2,5] and have been shown to deliver good classification accuracy. When SVM classifiers are used to solve an IE problem, two major research challenges must be considered.
– Large number of instances: IE for free text involves extracting from document sentences target entities (or instances) that belong to some pre-defined semantic category(ies). A classification task, on the other hand, is to identify candidate entities from the document sentences, usually in the form of noun phrases or verb phrases, and assign each candidate entity to zero, one, or more pre-defined semantic categories. As a large number of candidate entities can potentially be extracted from document sentences, this could lead to overheads in both the learning and classification steps.
– Choice of features: The success of SVM very much depends on whether a good set of features is given in the learning and classification steps. There should be adequate features that distinguish entities belonging to a semantic category from those outside the category.
¹ Both extraction patterns have been used in the AutoSlog system [12].
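To make the idea of template-instantiated extraction patterns concrete, the following toy Python sketch (ours, not from the paper; the pre-parsed clause representation and helper names are assumptions) shows how the subject active-verb and active-verb direct-object templates, instantiated with the trigger word "attacked", would extract the perpetrator and the victim from the example sentence above.

```python
# Toy illustration of AutoSlog-style pattern templates (not the paper's code).
# A clause is assumed to be pre-parsed into (subject, verb, direct_object).
from typing import NamedTuple, Optional

class Clause(NamedTuple):
    subject: str
    verb: str
    direct_object: str

def subject_active_verb(clause: Clause, trigger: str) -> Optional[str]:
    """Instantiation of the 'subject active-verb' template: if the clause's
    verb matches the trigger word, extract the subject as the target entity."""
    return clause.subject if clause.verb == trigger else None

def active_verb_dobj(clause: Clause, trigger: str) -> Optional[str]:
    """Instantiation of the 'active-verb direct-object' template."""
    return clause.direct_object if clause.verb == trigger else None

clause = Clause(subject="Guerrillas", verb="attacked",
                direct_object="the 1st infantry brigade garrison")
print(subject_active_verb(clause, "attacked"))  # perpetrator: "Guerrillas"
print(active_verb_dobj(clause, "attacked"))     # victim/target: "the 1st infantry brigade garrison"
```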
In our approach, we attempt to establish the links between the semantic category of a target entity and its syntactic properties, and reduce the number of instances to be classified based on their syntactic and semantic properties. A natural language parser is first used to identify the syntactic parts of sentences, and only those parts that are desired are used as candidate instances. We then use both the content and syntax of a candidate instance and its surrounding context as features.
1.2 Research Objectives and Contributions
Our research aims to develop new IE methods that use classification techniques to extract target entities, without using pattern templates and extraction patterns. Among the different types of IE tasks, we have chosen to address the template element extraction (TE) task, which refers to extracting entities or instances in a free text that belong to some semantic categories². We apply our new IE method to free documents in the terrorism domain. In the terrorism domain, the semantic categories that are interesting include victim, perpetrator, witness, etc. In the following, we summarize our main research contributions.
– IE using Support Vector Machines (SVM): We have successfully transformed IE into a classification problem and adopted SVM to extract target entities. We have not come across any previous papers reporting such an IE approach. As an early exploratory research effort, we only try to extract the entities falling under the perpetrator role. Our proposed IE method, nevertheless, can be easily generalized to extract other types of entities.
– Feature selection: We have defined the content and context features that can be derived for the entities to be extracted/classified. The content features refer to words found in the entities. The context features refer to those derived from the sentence constituents surrounding the entities. In particular, we propose a weighting scheme to derive context features for a given entity.
– Performance evaluation: We have conducted experiments on the MUC text collection in the terrorism domain. In our preliminary experiments, the SVM approach to IE has been shown to deliver performance comparable to the published results of AutoSlog, a well-known extraction pattern-based IE system.
1.3 Paper Outline
The rest of the paper is structured as follows. Section 2 provides a survey of the related IE work and distinguishes our work from them. Section 3 defines our IE problem and the performance measures. Our proposed method is described in Section 4. The experimental results are given in Section 5. Section 6 concludes the paper.

² The template element extraction (TE) task has been defined in the Message Understanding Conference (MUC) series sponsored by DARPA [8].
2 Related Work
As our research deals with IE for free text collections, we only examine related work in this area. Broadly, the related work can be divided into extraction pattern-based and non-extraction pattern-based approaches. The former refers to approaches that first acquire a set of extraction patterns from the training text collections. The extraction patterns use the syntactic structure of a sentence and semantic knowledge of words to identify the target entities. The extraction process is very much a template matching task between the extraction patterns and the sentences. The non-extraction pattern-based approaches are those that use some machine learning techniques to acquire extraction models. The extraction models identify target entities by examining their feature mix, which includes features based on syntax, semantics, and others. The extraction process is very much a classification task that involves accepting or rejecting an entity (e.g., a word or phrase) as a target entity. Many extraction pattern-based IE approaches have been proposed in the Message Understanding Conference (MUC) series. Based on 13 pre-defined pattern templates, Riloff developed the AutoSlog system capable of learning extraction patterns [12]. Each extraction pattern consists of a trigger word (a verb or a noun) to activate its use. AutoSlog also requires a manual filtering step to discard some 74% of the learned extraction patterns as they may not be relevant. PALKA is another representative IE system that learns extraction patterns in the form of frame-phrasal pattern structures [7]. It requires each sentence to be first parsed and grouped into multiple simple clauses before deriving the extraction patterns. Both PALKA and AutoSlog require the training text collections to be tagged. Such tagging requires considerable manual effort. AutoSlog-TS, an improved version of AutoSlog, is able to generate extraction patterns without a tagged training dataset [11]. An overall F1 measure of 0.38 was reported for both AutoSlog and AutoSlog-TS for entities in the perpetrator category, and around 0.45 for the victim and target object categories in the MUC-4 text collection (terrorism domain). Riloff also demonstrated that the best extraction patterns can be further selected using a bootstrapping technique [13]. WHISK is an IE system that uses extraction patterns in the form of regular expressions. Each regular expression can extract either a single target entity or multiple target entities [15]. WHISK has been evaluated on a text collection in the management succession domain. SRV, another IE system, constructs first-order logical formulas as extraction patterns [3]. The extraction patterns also allow relational structures between target entities to be expressed. There has been very little IE research on non-extraction pattern-based approaches. Freitag and McCallum developed an IE method based on Hidden Markov models (HMMs), a kind of probabilistic finite state machine [4]. Their experiments showed that the HMM method outperformed the IE method using SRV on two text collections in the seminar announcements and corporate acquisitions domains.
TST1-MUC3-0002 SAN SALVADOR, 18 FEB 90 (DPA) -- [TEXT] HEAVY FIGHTING WITH AIR SUPPORT RAGED LAST NIGHT IN NORTHWESTERN SAN SALVADOR WHEN MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] ATTACKED AN ELECTRIC POWER SUBSTATION. ACCORDING TO PRELIMINARY REPORTS, A SOLDIER GUARDING THE SUBSTATION WAS WOUNDED. THE FIRST EXPLOSIONS BEGAN AT 2330 [0530 GMT] AND CONTINUED UNTIL EARLY THIS MORNING, WHEN GOVERNMENT TROOPS REQUESTED AIR SUPPORT AND THE GUERRILLAS WITHDREW TO THE SLOPES OF THE SAN SALVADOR VOLCANO, WHERE THEY ARE NOW BEING PURSUED. THE NOISE FROM THE ARTILLERY FIRE AND HELICOPTER GUNSHIPS WAS HEARD THROUGHOUT THE CAPITAL AND ITS OUTSKIRTS, ESPECIALLY IN THE CROWDED NEIGHBORHOODS OF NORTHERN AND NORTHWESTERN SAN SALVADOR, SUCH AS MIRALVALLE, SATELITE, MONTEBELLO, AND SAN RAMON. SOME EXPLOSIONS COULD STILL BE HEARD THIS MORNING. MEANWHILE, IT WAS REPORTED THAT THE CITIES OF SAN MIGUEL AND USULUTAN, THE LARGEST CITIES IN EASTERN EL SALVADOR, HAVE NO ELECTRICITY BECAUSE OF GUERRILLA SABOTAGE ACTIVITY.
Fig. 1. Example Newswire Document
Research on applying machine learning techniques to named-entity extraction, a subproblem of information extraction, has been reported in [1]. Baluja et al. proposed the use of four different types of features to represent an entity to be extracted. They are word-level features, dictionary features, part-of-speech tag features, and punctuation features (surrounding the entity to be extracted). Except for the last feature type, the other three types of features are derived from the entities to be extracted. To the best of our knowledge, our research is the first that explores the use of classification techniques in extracting terrorism-related information. Unlike [4], we represent each entity to be extracted as a set of features derived from the syntactic structure of the sentence in which the entity is found, as well as the words found in the entity.
3 Problem Definition
Our IE task is similar to the template element (TE) task in the Message Understanding Conference (MUC) series. The TE task was to extract different types of target entities from each document, including perpetrators, victims, physical targets, event locations, etc. In MUC-4, a text collection containing newswire documents related to terrorist events in Latin America was used as the evaluation dataset. An example document is shown in Figure 1. In the above document, we could extract several interesting entities about the terrorist event, namely the location ("SAN SALVADOR"), the perpetrator ("MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN]"), and the victim ("SOLDIER"). The MUC-4 text collection consists of a training set (with 1500 documents) and two test sets (each with 100 documents). For each document, MUC-4 specifies for each semantic category the target entity(ies) to be extracted.
In this paper, we choose to focus on extracting target entities in the perpetrator category. The input of our IE method consists of the training set (1500 documents) and the perpetrator(s) of each training document. The training documents are not tagged with the perpetrators. Instead, the perpetrators are stored in a separate file known as the answer key file. Our IE method therefore has to locate the perpetrators within the corresponding documents. Should a perpetrator appear in multiple sentences in a document, his or her role may be obscured by features from these sentences, making it more difficult to perform extraction. Once trained, our IE method has to extract perpetrators from the test collections. As the test collections are not tagged with candidate entities, our IE method has to first identify candidate entities in the documents before classifying them. The performance of our IE task is measured by three important metrics: precision, recall, and the F1 measure. Let $n_{tp}$, $n_{fp}$, and $n_{fn}$ be the number of entities correctly extracted, the number of entities wrongly extracted, and the number of entities missed, respectively. Precision, recall, and F1 are defined as follows:

$$Precision = \frac{n_{tp}}{n_{tp} + n_{fp}}, \qquad Recall = \frac{n_{tp}}{n_{tp} + n_{fn}}, \qquad F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
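As a quick illustration of these definitions, here is a small Python helper (ours, not part of the paper) that computes the three measures from raw counts; the example counts are approximate values consistent with the Tst3 column of Table 3 later in the paper.

```python
def precision_recall_f1(n_tp: int, n_fp: int, n_fn: int):
    """Precision, recall and F1 from the counts of correctly extracted (n_tp),
    wrongly extracted (n_fp) and missed (n_fn) target entities."""
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts roughly consistent with the Tst3 column of Table 3
# (117 positive entities, recall ~ 0.44, precision ~ 0.31):
print(precision_recall_f1(n_tp=51, n_fp=116, n_fn=66))
```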
4 Proposed Method

4.1 Overview
Like other IE methods, we divide our proposed IE method into two steps: the learning step and the extraction step. The former learns the extraction model for the target entities in the desired semantic category using the training documents and their target entities. The latter applies the learnt extraction model to other documents and extracts new target entities. The learning step consists of the following smaller steps.
1. Document parsing: As the target entities are perpetrators, they usually appear as noun-phrases in the documents. We therefore parse all the sentences in the document. To break a document into sentences, we use the SATZ software [10]. As a noun-phrase could be nested within another noun-phrase in the parse tree, we only select the simple noun-phrases as candidate entities. The candidate entities from the training documents are further grouped as positive entities if their corresponding noun-phrases match the perpetrator answer keys. The rest are used as negative entities.
2. Feature acquisition: This step refers to deriving features for the training target entities, i.e., the noun-phrases. We will elaborate on this step in Section 4.2.
3. Extraction model construction: This step refers to constructing the extraction model using some machine learning technique. In this paper, we explore the use of SVM to construct the extraction model (or classification model).
The classification step performs extraction using the learnt extraction model following the steps below:
1. Document parsing: The sentences in every test document are parsed and the simple noun phrases in the parse trees are used as candidate entities.
2. Feature acquisition: This step is similar to that in the learning step.
3. Classification: This step applies the SVM classifier to extract the candidate entities.
By identifying all the noun-phrases and classifying them into positive entities or negative entities, we transform the IE problem into a classification problem. To keep our method simple, we do not use co-referencing to identify pronouns that refer to the positive or negative entities.
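A minimal sketch of this two-step pipeline is given below (ours, not the authors' code). The sentence splitter, the simple-noun-phrase extractor, the feature function, and the scoring function stand in for SATZ, the Link Grammar parser, the features of Section 4.2, and the SVM classifier; they are passed in as parameters and are assumptions of this sketch.

```python
# Hedged sketch of the learning and classification steps described above.
from typing import Callable, List, Sequence, Tuple

def candidate_entities(document: str,
                       split_sentences: Callable[[str], List[str]],
                       simple_noun_phrases: Callable[[str], List[str]]
                       ) -> List[Tuple[str, str]]:
    """Return (noun_phrase, sentence) pairs for all simple noun phrases."""
    pairs = []
    for sentence in split_sentences(document):
        for np in simple_noun_phrases(sentence):
            pairs.append((np, sentence))
    return pairs

def build_training_set(documents: Sequence[str],
                       answer_keys: Sequence[Sequence[str]],
                       featurize, split_sentences, simple_noun_phrases):
    """Label a candidate NP positive if it matches a perpetrator answer key."""
    X, y = [], []
    for doc, keys in zip(documents, answer_keys):
        for np, sent in candidate_entities(doc, split_sentences, simple_noun_phrases):
            match = any(k.lower() in np.lower() or np.lower() in k.lower() for k in keys)
            X.append(featurize(np, sent))
            y.append(1 if match else 0)
    return X, y    # feed these to the SVM learner

def extract(document: str, score, featurize, split_sentences,
            simple_noun_phrases, threshold: float = 0.0) -> List[str]:
    """Classification step: keep candidates whose SVM decision score is positive."""
    return [np
            for np, sent in candidate_entities(document, split_sentences, simple_noun_phrases)
            if score(featurize(np, sent)) > threshold]
```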
4.2 Feature Acquisition
We acquire for each candidate entity the features required for constructing the extraction model and for classification. To ensure that the extraction model will be able to distinguish entities belonging to a semantic category from those that do not, it is necessary to acquire a wide spectrum of features. Unlike the earlier works that focus on features mainly derived from within the entities [1] or from the linear sequence of words surrounding the entities [4], our method derives features from the syntactic structures of the sentences in which the candidate entities are found. We divide the entity features into two categories:
– Content features: These refer to the features derived from the candidate entities themselves. At present, we only consider terms appearing in the candidate entities. Given an entity $e = w_1 w_2 \cdots w_n$, we assign the content feature $f_i(w) = 1$ if word $w$ is found in $e$.
– Context features: These features are obtained by first parsing the sentences containing a candidate entity. Each context feature is defined by a fragment of the syntactic structure in which the entity is found and the words associated with that fragment.
In the following, we elaborate the way our context features are obtained. We first use CMU's Link Grammar Parser to parse a sentence [14]. The parser generates a parse tree such as the one shown in Figure 2. A parse tree represents the syntactic structure of a given sentence. Its leaf nodes are the word tokens of the sentence and its internal nodes represent the syntactic constituents of the sentence. The possible syntactic constituents are S (clause), VP (verb phrase), NP (noun phrase), PP (prepositional phrase), etc.
(S (NP Two terrorists)
   (VP (VP destroyed (NP several power poles) (PP on (NP 29th street)))
       and
       (VP machinegunned (NP several transformers)))
   .)

Fig. 2. Parse Tree Example
For each candidate entity, we can derive its context features as a vector of term weights for the terms that appear in the sentences containing the noun-phrase. Given a sentence parse tree, the weight of a term is assigned as follows. Terms appearing in the sibling nodes are assigned a weight of 1.0. Terms appearing at a higher or lower level of the parse tree are assigned smaller weights as they are further away from the candidate entity. The feature weights are reduced by half for every level further away from the candidate entity in our experiments. The 50% reduction factor has been chosen arbitrarily; a careful study needs to be further conducted to determine the optimal reduction factor. For example, the context features of the candidate entity "several power poles" are derived as follows³.

Table 1. Context features and feature weights for "several power poles"

Label  Terms           Weight
PP     on              1.00
NP     29th street     0.50
VP     destroyed       0.50
NP     Two terrorists  0.25
To ensure that the included context features are closely related to the candidate entity, we do not consider terms found in the sibling nodes (and their subtrees) of the ancestor(s) of the entity. Intuitively, these terms are not syntactically very related to the candidate entity and are therefore excluded. For example, for the candidate entity "several power poles", the terms in the subtree "and machinegunned several transformers" are excluded from the context feature set.³

³ More precisely, stopword removal and stemming are performed on the terms. Some of them will be discarded during this process.
Using Support Vector Machines for Terrorism Information Extraction
9
If an entity appears in multiple sentences in the same document, and the same term is included as a context feature from different parse trees, we combine the context features into one and assign it the highest weight among the original weights. This is necessary to keep one unique weight for each term.
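A partial sketch of this weighting scheme is shown below (ours; the tuple encoding of parse trees and the function names are assumptions). It only implements the sibling part of the scheme, with weights halving per level of nesting and the maximum kept when a term occurs more than once; note that Table 1 attributes "destroyed" to the parent VP level (weight 0.50), so the authors' exact bookkeeping for bare-word siblings differs slightly from this simplification.

```python
# Partial sketch of the context-feature weighting. Parse trees are nested
# tuples (label, child1, child2, ...); leaves are plain word strings.
# Terms under the candidate NP's sibling nodes start at weight 1.0 and the
# weight halves for each additional level of nesting; walking further up the
# tree (parent, grandparent, ...) would halve the starting weight again,
# subject to the exclusion rule discussed in the text.

def terms_with_weights(node, weight, out):
    """Collect leaf terms below `node`; the weight halves one level down."""
    if isinstance(node, str):
        out[node] = max(weight, out.get(node, 0.0))   # keep the highest weight
    else:
        for child in node[1:]:
            child_weight = weight / 2.0 if isinstance(child, tuple) else weight
            terms_with_weights(child, child_weight, out)

def sibling_context_features(parent, candidate):
    """Weights for terms in the candidate's sibling constituents."""
    out = {}
    for sibling in parent[1:]:
        if sibling is not candidate:
            terms_with_weights(sibling, 1.0, out)
    return out

np_star = ("NP", "several", "power", "poles")
inner_vp = ("VP", "destroyed", np_star, ("PP", "on", ("NP", "29th", "street")))
print(sibling_context_features(inner_vp, np_star))
# {'destroyed': 1.0, 'on': 1.0, '29th': 0.5, 'street': 0.5}
```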
4.3 Extraction Model Construction
To construct an extraction model, we require both positive training data and negative training data. While the positive training entities are available from the answer key file, the negative training entities can be obtained from the noun phrases that do not contain any target entities. Since pronouns such as "he", "she", "they", etc. may possibly be co-referenced with some target entities, we do not use them as either positive or negative training entities. From the training set, we also obtain an entity filter dictionary that consists of noun-phrases that cannot be perpetrators. These are non-target noun-phrases that appear more than five times in the training set, e.g., "dictionary", "desk" and "tree". With this filter, the number of negative entities is reduced dramatically. If a larger threshold is used, fewer noun-phrases will be filtered, causing a degradation of precision. On the other hand, a smaller threshold may increase the risk of getting a lower recall. Once an extraction model is constructed, it can perform extraction on a given document by classifying candidate entities in the document into the perpetrator or non-perpetrator category. In the extraction step, a candidate entity is classified as a perpetrator when the SVM classifier returns a positive score value.
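A small sketch of how the entity filter dictionary and the negative-example filtering could be implemented (ours; only the more-than-five-occurrences rule and the pronoun exclusion come from the text above, everything else is an assumption):

```python
from collections import Counter

def build_filter_dictionary(training_nps, positive_nps, min_count=5):
    """Noun phrases that never match a target entity and occur more than
    `min_count` times in the training set are assumed to be non-perpetrators."""
    positives = {np.lower() for np in positive_nps}
    counts = Counter(np.lower() for np in training_nps)
    return {np for np, c in counts.items() if c > min_count and np not in positives}

def select_negatives(candidate_nps, positive_nps, filter_dict):
    """Drop pronouns and filter-dictionary entries before forming negatives."""
    pronouns = {"he", "she", "they", "him", "her", "them"}
    positives = {np.lower() for np in positive_nps}
    return [np for np in candidate_nps
            if np.lower() not in positives
            and np.lower() not in filter_dict
            and np.lower() not in pronouns]
```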
5 Experiments and Results

5.1 Datasets
We used the MUC-4 dataset in our experiments. Three files (muc34dev, muc34tst1 and muc34tst2) were used as the training set and the remaining two files (muc34tst3 and muc34tst4) were used as the test set. There are 1500 news documents in total in the training set and 100 documents in each of the two test files. For each news document, there are zero, one or two perpetrators defined in the answer key file. Therefore, most of the noun phrases are negative candidate entities. To avoid severely unbalanced training examples, we only considered the training documents that have at least one perpetrator defined in the answer key files. There are 466 training documents containing some perpetrators. We used all the 100 news documents in each test set, since the classifier should not know whether a test document contains a perpetrator. The number of documents used and the numbers of positive and negative entities for the training and test sets are listed in Table 2. From the table, we observe that negative entities contribute about 90% of the entities of the training set, and around 95% of the test set.
5.2 Results
We used SVMlight as our classifier in our experiments [6]. SVMlight is an implementation of Support Vector Machines (SVMs) in C and has been widely used in text classification and web classification research. Due to the unbalanced training examples, we set the cost-factor (parameter j) of SVMlight to be the ratio of the number of negative entities over the number of positive ones. The cost-factor denotes the proportion of cost allocated to training errors on positive entities against errors on negative entities. We used the polynomial kernel function instead of the default linear kernel function. We also set our threshold to be 0.0 as suggested. The results are reported in Table 3.

Table 2. Documents and positive/negative entities in the training/test data sets

Dataset  Documents  Positive Entities  Negative Entities
Train    466        1003               9435
Tst3     100        117                2336
Tst4     100        77                 1943

Table 3. Results on the training and test data sets

Dataset  Precision  Recall  F1 measure
Train    0.7752     0.9661  0.8602
Tst3     0.3054     0.4359  0.3592
Tst4     0.2360     0.5455  0.3295
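The class-imbalance handling described above can be approximated with any SVM package that exposes per-class costs. The sketch below uses scikit-learn's SVC as a stand-in for SVMlight; the exact semantics of SVMlight's cost-factor j and its polynomial-kernel defaults are not identical, so this is an approximation of the setup, not the authors' configuration.

```python
from sklearn.svm import SVC

def train_imbalanced_svm(X, y):
    """Polynomial-kernel SVM with the positive-class cost scaled by the
    negative/positive ratio, mimicking the cost-factor j used with SVM-light."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    clf = SVC(kernel="poly", class_weight={1: n_neg / max(n_pos, 1), 0: 1.0})
    clf.fit(X, y)
    return clf

# At extraction time, candidates with clf.decision_function([x])[0] > 0.0
# are classified as perpetrators (threshold 0.0, as in the experiments).
```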
As shown in the table, the SVM classifier performed very well on the training data. It achieved both high precision and high recall values. Nevertheless, the classifier did not perform equally well on the two test data sets. About 43% and 54% of the target entities were extracted for Tst3 and Tst4, respectively. The results also indicate that many other non-target entities were extracted, causing the low precision values. The overall F1 measures are 0.36 and 0.33 for Tst3 and Tst4, respectively. The above results, compared to the known results given in [11], are reasonable, as the latter also showed no more than 30% precision for both AutoSlog and AutoSlog-TS⁴. [11] reported an F1 measure of 0.38, which is not very different from ours. The rather low F1 measures suggest that this IE problem is quite a difficult one. We, nevertheless, are quite optimistic about our preliminary results as they clearly show that the IE problem can be handled as a classification problem.
⁴ The comparison cannot be taken in absolute terms, since [11] used a slightly different experimental setup for the MUC-4 dataset.
6 Conclusions
In this paper, we attempt to extract perpetrator entities from a collection of untagged news documents in the terrorism domain. We propose a classification-based method to handle the IE problem. The method segments each document into sentences, parses the sentences into parse trees, and derives features for the entities within the documents. The features of each entity are derived from both its content and its context. Based on SVM classifiers, our method was applied to the MUC-4 dataset. Our experimental results showed that the method performs at a level comparable to some well-known published results. As part of our future work, we would like to continue our preliminary work and explore additional features in training the SVM classifiers. Since the number of training entities is usually small in real applications, we will also try to extend our classification-based method to handle IE problems with a small number of seed training entities.
References

1. S. Baluja, V. Mittal, and R. Sukthankar. Applying machine learning for high performance named-entity extraction. Computational Intelligence, 16(4):586–595, November 2000.
2. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148–155, Bethesda, Maryland, November 1998.
3. D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98) / 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 517–523, Madison, Wisconsin, July 1998.
4. D. Freitag and A. K. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, Orlando, FL, July 1999.
5. T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142, Chemnitz, DE, 1998.
6. T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
7. J.-T. Kim and D. I. Moldovan. Acquisition of linguistic patterns for knowledge-based information extraction. IEEE Transactions on Knowledge and Data Engineering, 7(5):713–724, 1995.
8. MUC. Proceedings of the 4th Message Understanding Conference (MUC-4), 1992.
9. I. Muslea. Extraction patterns for information extraction tasks: A survey. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 1–6, Orlando, Florida, July 1999.
10. D. D. Palmer and M. A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the 4th Conference on Applied Natural Language Processing, pages 78–83, Stuttgart, Germany, October 1994.
11. E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, Portland, Oregon, 1996.
12. E. Riloff. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85(1-2):101–134, 1996.
13. E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence, pages 1044–1049, 1999.
14. D. Sleator and D. Temperley. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Computer Science, Carnegie Mellon University, October 1991.
15. S. Soderland. Learning information extraction rules for semi-structured and free text. Journal of Machine Learning, 34(1-3):233–272, 1999.
16. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, Heidelberg, DE, 1995.
Criminal Incident Data Association Using the OLAP Technology

Song Lin and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, VA 22904, USA
{sl7h, brown}@virginia.edu
Abstract. Associating criminal incidents committed by the same person is important in crime analysis. In this paper, we introduce concepts from OLAP (online-analytical processing) and data-mining to resolve this issue. The criminal incidents are modeled into an OLAP data cube; a measurement function, called the outlier score function is defined on the cube cells. When the score is significant enough, we say that the incidents contained in the cell are associated with each other. The method can be used with a variety of criminal incident features to include the locations of the crimes for spatial analysis. We applied this association method to the robbery dataset of Richmond, Virginia. Results show that this method can effectively solve the problem of criminal incident association. Keywords. Criminal incident association, OLAP, outlier
of a decision process involving a multi-staged search in the awareness space. During the search phase, the criminal associates these cues, clusters of cues, or cue sequences with a "good" target. These cues form a template of the criminal, and once the template is built, it is self-reinforcing and relatively enduring. Due to the limits of the searching ability of a human being, a criminal normally does not have many decision templates. Therefore, we can observe criminal incidents with similar temporal, spatial, and modus operandi (MO) features, which possibly come from the same template of the same criminal. It is possible to identify the serial criminal by associating these similar incidents. Different approaches have been proposed and several software programs have been developed to resolve the crime association problem. They can be classified into two major categories: suspect association and incident association. The Integrated Criminal Apprehension Program (ICAP) developed by Heck [12] enables police officers to match suspects against arrested criminals using MO features; the Armed Robbery Eidetic Suspect Typing (AREST) program [1] employs an expert approach to perform suspect association and classify a potential offender into three categories: probable, possible, or non-suspect. The Violent Criminal Apprehension Program (ViCAP) developed by the Federal Bureau of Investigation (FBI) [13] is an incident association system. MO features are primarily considered in ViCAP. In the COPLINK [10] project undertaken by researchers at the University of Arizona, a novel concept space model is built and can be used to associate search terms with suspects in the database. A total similarity method was proposed by Brown and Hagen [3], and it can solve problems for both incident association and suspect association. Besides these theoretical methods, crime analysts normally use SQL (Structured Query Language) in practice. They build a SQL string and make the system return all records that match their search criteria. In this paper, we describe a crime association method that combines OLAP concepts from the data warehousing area and outlier detection ideas from the data mining field. Before presenting our method, let us briefly review some concepts in OLAP and data mining.
2 Brief Review of OLAP and OLAP-Based Data Mining

OLAP is a key aspect of many data warehousing systems [6]. Unlike its ancestor, the OLTP (online transaction processing) system, OLAP focuses on providing summary information to the decision-makers of an organization. Aggregated data, such as sums, averages, maxima, or minima, are pre-calculated and stored in a multi-dimensional database called a data cube. Each dimension of the data cube consists of one or more categorical attributes. Hierarchical structures generally exist in the dimensions. Most existing OLAP systems concentrate on the efficiency of retrieving the summary data in the cube. In many cases, the decision-maker still needs to apply his or her domain knowledge and sometimes common sense to make the final decision. Data mining is a collection of techniques that detect patterns in large amounts of data. Quantitative approaches, including statistical methods, are generally used in data mining. Traditionally, data mining algorithms have been developed for two-way datasets. More recently, researchers have generalized some data mining methods for
multi-dimensional OLAP data structures. Imielinski et al. proposed the "cubegrade" problem [14]. The cubegrade problem can be treated as a generalized version of the association rule. Imielinski et al. claim that the association rule can be viewed as the change of count aggregates when imposing another constraint, or in OLAP terminology, making a drill-down operation on an existing cube cell. They think that other aggregates like sum, average, max, or min can also be incorporated, and that the cubegrade could better support "what if" analysis. Similar to the cubegrade problem, the constrained gradient analysis was proposed by Dong et al. [7]. The constrained gradient analysis focuses on retrieving pairs of OLAP cells that are quite different in aggregates and similar in dimensions (usually one cell is the ascendant, descendent, or sibling of the other cell). More than one aggregate can be considered simultaneously in the constrained gradient analysis. The discovery-driven exploration problem was proposed by Sarawagi et al. [18]. It aims at finding exceptions in the cube cells. They build a formula to estimate the anticipated value and the standard deviation (σ) of a cell. When the difference between the actual value of the cell and the anticipated value is greater than 2.5σ, the cell is selected as an exception. Similar to the above approaches, our crime association method also focuses on the cells of the OLAP data cube. We define an outlier score function to measure the distinctiveness of a cell. Incidents contained in the same cell are determined to be associated with each other when the score is significant. The definition of the outlier score function and the association method is given in Section 3.
3 Method

3.1 Rationale

The rationale of this method is explained as follows: although theoretically the template (see Section 1) is unique for each serial criminal, the data collected in the police department does not contain every aspect of the template. Some observed parts of the templates are "common", so that we may see a large overlap in these common templates. The creators (criminals) of those "common" templates are not separable. Some templates are "special". For these "special" templates, we are more confident in saying that the incidents come from the same criminal. For example, consider the weapon used in a robbery incident. We may observe many incidents with the value "gun" for the weapon used. However, no crime analyst would say that the same person commits all these robberies, because "gun" is a common template shared by many criminals. If we observe several robberies with a "Japanese sword" (an uncommon template), we are more confident in asserting that these incidents result from the same criminal. (This "Japanese sword" claim was first proposed by Brown and Hagen [4].) In this paper, we describe an outlier score function to measure this distinctiveness of the template.
3.2 Definitions

In this section, we give the mathematical definitions used to build the outlier score function. People familiar with OLAP concepts can see that our notation derives from terms used in the OLAP field. $A_1, A_2, \ldots, A_m$ are $m$ attributes that we consider relevant to our study, and $D_1, D_2, \ldots, D_m$ are their domains, respectively. Currently, these attributes are confined to be categorical (categorical attributes like MO are important in crime association analysis). Let $z^{(i)}$ be the $i$-th incident, and $z^{(i)}.A_j$ be the value of incident $i$ on the $j$-th attribute. $z^{(i)}$ can be represented as $z^{(i)} = (z_1^{(i)}, z_2^{(i)}, \ldots, z_m^{(i)})$, where $z_k^{(i)} = z^{(i)}.A_k \in D_k$, $k \in \{1, \ldots, m\}$. $Z$ is the set of all incidents.

Definition 1. Cell
A cell $c$ is a vector of attribute values with dimension $t$, where $t \le m$. A cell can be represented as $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$. In order to standardize the definition of a cell, for each $D_i$ we add a "wildcard" element $*$ and allow $D'_i = D_i \cup \{*\}$. A cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ can then be represented as $c = (c_1, c_2, \ldots, c_m)$, where $c_j \in D'_j$, and $c_j = *$ if and only if $j \notin \{i_1, i_2, \ldots, i_t\}$. $C$ denotes the set of all cells. Since each incident can also be treated as a cell, we define a function $Cell: Z \to C$, with $Cell(z) = (z_1, z_2, \ldots, z_m)$ if $z = (z_1, z_2, \ldots, z_m)$.

Definition 2. Contains relation
We say that cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ contains incident $z$ if and only if $z.A_j = c_j$ or $c_j = *$, for $j = 1, 2, \ldots, m$. For two cells, we say that cell $c' = (c'_1, c'_2, \ldots, c'_m)$ contains cell $c = (c_1, c_2, \ldots, c_m)$ if and only if $c'_j = c_j$ or $c'_j = *$, for $j = 1, 2, \ldots, m$.

Definition 3. Count of a cell
The function $count$ is defined on a cell and returns the number of incidents that cell $c$ contains.

Definition 4. Parent cell
Cell $c' = (c'_1, c'_2, \ldots, c'_m)$ is the parent cell of cell $c$ on the $k$-th attribute when $c'_k = *$ and $c'_j = c_j$ for $j \ne k$. The function $parent(c, k)$ returns the parent cell of cell $c$ on the $k$-th attribute.
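A minimal Python sketch of these cell operations, assuming incidents are encoded as tuples of categorical values and the string "*" plays the role of the wildcard (the encoding and helper names are ours):

```python
WILDCARD = "*"

def contains(cell, incident):
    """Definition 2: every non-wildcard component of the cell must match."""
    return all(c == WILDCARD or c == z for c, z in zip(cell, incident))

def count(cell, incidents):
    """Definition 3: number of incidents the cell contains."""
    return sum(1 for z in incidents if contains(cell, z))

def parent(cell, k):
    """Definition 4: the same cell with the k-th attribute replaced by *."""
    return cell[:k] + (WILDCARD,) + cell[k + 1:]

# Example over two attributes (weapon used, method of escape):
incidents = [("gun", "by car"), ("gun", "by foot"), ("Japanese sword", "by car")]
cell = ("gun", WILDCARD)
print(count(cell, incidents))   # 2
print(parent(cell, 0))          # ('*', '*')
```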
Definition 5. Neighborhood
$P$ is called the neighborhood of cell $c$ on the $k$-th attribute when $P$ is the set of cells that take the same values as cell $c$ in all attributes but $k$, and do not take the wildcard value $*$ on the $k$-th attribute, i.e., $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ where $c_l^{(i)} = c_l^{(j)}$ for all $l \ne k$, and $c_k^{(i)} \ne *$ for all $i = 1, 2, \ldots, |P|$. The function $neighbor(c, k)$ returns the neighborhood of cell $c$ on attribute $k$. (In the OLAP field, the neighborhood is sometimes called the siblings.)

Definition 6. Relative frequency
We call $freq(c, k) = \dfrac{count(c)}{count(parent(c, k))}$ the relative frequency of cell $c$ with respect to attribute $k$.

Definition 7. Uncertainty function
We use a function $U$ to measure the uncertainty of a neighborhood. This uncertainty measure is defined on the relative frequencies. If we use $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ to denote the neighborhood of cell $c$ on attribute $k$, then

$$U(c, k) = U(freq(c^{(1)}, k), freq(c^{(2)}, k), \ldots, freq(c^{(|P|)}, k))$$
Obviously, $U$ should be symmetric in $c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}$. $U$ takes a smaller value if the "uncertainty" in the neighborhood is low. One candidate uncertainty function is entropy, which comes from information theory:

$$U(c, k) = H(c, k) = -\sum_{c' \in neighbor(c, k)} freq(c', k) \log(freq(c', k))$$

For $freq = 0$, we define $0 \cdot \log(0) = 0$, as is common in information theory.

3.3 Outlier Score Function (OSF) and the Crime Association Method

Our goal is to build a function to measure the confidence, or the significance level, of associating crimes. This function is built over OLAP cube cells. We start building this function by analyzing the requirements that it needs to satisfy. Consider the following three scenarios:
I. We have 100 robberies. 5 take the value of "Japanese sword" for the weapon used attribute, and 95 take "gun". Obviously, the 5 "Japanese swords" are of more interest than the 95 "guns".
II. Now we add another attribute: method of escape. Assume we have 20 different values for the method of escape attribute: "by car", "by foot", etc. Each of them has 5 incidents. Although both "Japanese sword" and "by car" have 5 incidents, they should not be treated equally. "Japanese sword" highlights itself because all other incidents are "guns", or in other words, the uncertainty level of the weapon used attribute is smaller.
III. If some incidents take "Japanese sword" on the weapon used attribute and "by car" on the method of escape attribute, then the combination of "Japanese sword" and "by car" is more significant than either "Japanese sword" alone or "by car" alone. The reason is that we have more "evidence".
Now we define function f as follows: − log( freq(c, k )) ) max ( f ( parent(c, k )) + f (c) = k takes all non−* dim ensionof c H (c, k ) 0 c = (*,*,...,*) When H(c,k) = 0, we say − log( freq (c, k )) = 0. H (c , k )
(1)
It is simple to verify that f satisfies above three requirements. We call f the outlier score function. (The term “outlier” is commonly used in the field of statistics. Outliers are observations significantly different that other observations and possibly are generated from a unique mechanism [11].) Based on the outlier score function, we give the following rule to associate criminal incidents: Given a pair of incidents, if there exists a cell containing both these incidents, and the outlier score of the cell is greater than some threshold value τ, we say that these two incidents are associated with each other. This association method is called an OLAP-outlier-based association method, or outlier-based method for abbreviation.
4 Application We applied this criminal incident association method to a real-world dataset. The dataset contained information on robbery incidents that occurred in Richmond, Virginia in 1998. The dataset consisted of two parts: the incident dataset and the suspect dataset. The incident dataset had 1198 records, and the temporal, spatial, and MO information were stored in the incident database. The name (if known), height, and weight information of the suspect were recorded in the suspect database. We applied our method to the incident dataset and used the suspect dataset for verification. Robbery was selected for two reasons: first, compared with some violent crime such as murder or sexual attack, serial robberies were more common; second, compared with breaking and entering crimes, more robbery incidents were “solved” (criminal arrested) or “partially solved” (the suspect’s name is known). These two points made the robbery favorable for evaluation purposes.
Criminal Incident Data Association Using the OLAP Technology
19
4.1 Attribute Selection We used three types of attributes in our analysis. The first set of attributes consisted of MO features. MO was primarily considered in crime association analysis. 6 MO attributes were picked. The second set of attributes was census attributes (the census data was obtained directly from the census CD held in library of the University of Virginia). Census data represented the spatial characteristics of the location where the criminal incident occurred, and it might help to reveal the spatial aspect of the criminals’ templates. For example, some criminals preferred to attack “high-income” areas. Lastly, we chose some distance attributes. They were distances from the incident location to some spatial landmarks such as a major highway or a church. Distance features were also important in analyzing criminals’ behaviors. For example, a criminal might preferred to initiate an attack from a certain distance range from a major highway so that the offense could not be observed during the attack, and he or she could leave the crime scene as soon as possible after the attack. There were a total of 5 distances. The names of all attributes and their descriptions are given in appendix I. They have also been used in a previous study on predicting breaking and entering crimes by Brown et al. [4]. An attribute selection was performed on all numerical attributes (census and distance attributes) before using the association method. The reason was that some attributes were redundant. These redundant attributes were unfavorable to the association algorithm in terms of both accuracy and efficiency. We adopted a featureselection-by-clustering methodology to pick the attributes. According to this method, we used the correlation coefficient to measure how similar or close two attributes were, and then we clustered the attributes into a number of groups according to this similarity measure. The attributes in the same group were similar to each other, and were quite different from attributes in other groups. For each group, we picked a representative. The final set of all representative attributes was considered to capture the major characteristics of the dataset. A similar methodology was used by Mitra et al. [16]. We picked the k-medoid clustering algorithm. (For more details about the kmedoid algorithm and other clustering algorithm, see [8].) The reason was that kmedoid method works on similarity / distance matrix (some other methods only work on coordinate data), and it tends to return spherical clusters. In addition, k-medoid returns a medoid for each cluster, based upon which we could select the representative attributes. After making a few slight adjustments and checking the silhouette plot [15], we finally got three clusters, as given in Fig. 1. The algorithm returned three medoids: HUNT_DST (housing unit density), ENRL3_DST (public school enrollment density), and TRAN_PC (expenses on transportation: per capita). We made some adjustments here. We replaced ENRL3_DST with another attribute POP3_DST (population density: age 12-17). The attackers and victims. For similar reasons, we replaced TRAN_PC with MHINC (median household income).
20
S. Lin and D.E. Brown
Fig. 1. Result of k-medoid clustering
There were a total of 9 attributes used in our analysis: 6 MO attributes (categorical) and 3 numerical attributes picked by applying the attributes selection procedure. Since our method was developed on categorical attributes, we converted the numerical attributes to categorical ones by dividing them into 11 equally sized bins. The number was determined by Sturge’s number of bins rule [19][20].
4.2 Evaluation Criteria We wanted to evaluate whether the association determined by our method corresponded to the true result. The information in the suspect database was considered as the “true result”. 170 incidents with the names of the suspects were used for evaluation. We generated all incident pairs. If two incidents in a pair had the suspects with the same name and date of birth, we said that the “true result” for this incident pair was a “true association”. There were 33 true associations. We used two measures to evaluate our method. The first measure was called “detected true associations”. We expected that the association method would be able to detect a large portion of “true associations”. The second measure was called “average number of relevant records”. This measure was built on the analogy of the search engine. Consider a search engine as Google. For each searching string(s) we give, it returns a list of documents considered to be “relevant” to the searching criterion. Similarly, for the crime association problem, if we give an incident, the algorithm will return a list of records that are considered as “associated” with the given incident. A shorter list is always preferred in both cases. The average “length” of the lists provided the second measure and we called it the “average number of relevant records”. The algorithm is more accurate when this measure has a smaller
Criminal Incident Data Association Using the OLAP Technology
21
value. In the information retrieval area [17], two commonly used criteria in evaluating a retrieval system are recall and precision. The former is the ability for a system to present relevant items, and the latter is the ability to present only the relevant items. Our first measure was a recall measure, and our second measure was equivalent to a precision measure. The above two measures do not work for our approach only; they can be used in evaluating any association algorithms. Therefore, we can use these two measures to compare the performances of different association methods. 4.3 Result and Comparison Different threshold values were set to test our method. Obvious if we set it to 0, we would expect that the method can detect all “true associations” and the average number of relevant records was 169 (given 170 incidents for evaluation). If we set the threshold, τ, to infinity, we would expect the method to return 0 for both “detected true associations” and “average number of relevant records”. As the threshold increased, we expected a decrease in both number of detected true associations and average number of relevant records. The result is given in Table 1. Table 1. Result of outlier-based method
Avg. number of relevant records 169.00 121.04 62.54 28.38 13.96 7.51 4.25 2.29 0.00
We compared this outlier-based method with a similarity-based crime association method. The similarity-based method was proposed by Brown and Hagen (Brown and Hagen, 2003). Given a pair of incidents, the similarity-based method first calculates a similarity score for each attribute, and then computes a total similarity score using the weighted average of all individual similarity scores. The total similarity score is used to determine whether the incidents are associated. Using the same evaluation criteria, the result of the similarity-based method is given in Table 2. If we set the average number of relevant records as the X-axis and set the detected true associations as the Y-axis, the comparisons can be illustrated as in Fig. 2. In Fig. 2, the outlier-based method lies above the similarity-based method for most cases. That means given the same “accuracy” (detected true associations) level, the outlier-based method returns fewer relevant records. Also if we keep the number
22
S. Lin and D.E. Brown Table 2. Result of similarity-based method
Threshold 0 0.5 0.6 0.7 0.8 0.9
Detected true associations 33 33 25 15 7 0
∞
Avg. number of relevant records 169.00 112.98 80.05 45.52 19.38 3.97
0
0.00
of relevant records (average length of the returned list) for both methods, the outlierbased method is more accurate. The curve of the similarity-based method sits slightly above the outlier-based method when the average number of relevant records is above 100. Since the size of the evaluation incident set is 170, no crime analyst would consider putting further investigation on any set of over 100 incidents. The outlierbased method is generally more effective.
35
30
Detected Associations
25
20 Similarity Outlier 15
10
5
0 0
20
40
60
80
100
120
140
160
180
Avg. relevant records
Fig. 2. Comparison: the outlier-based method vs. the similarity-based method
5 Conclusion In this paper, an OLAP-outlier-based method is introduced to solve the crime association problem. The criminal incidents are modeled into an OLAP cube and an outlier-score function is defined over the cube cells. The incidents contained in the
Criminal Incident Data Association Using the OLAP Technology
23
cell are determined to be associated with each other when the outlier score is large enough. The method was applied to a robbery dataset and results show that this method can provide significant improvements for crime analysts who need to link incidents in large databases.
References 1.
2. 3. 4. 5.
6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16.
17. 18.
19. 20.
Badiru, A.B., Karasz, J.M. and Holloway, B.T., “AREST: Armed Robbery Eidetic Suspect Typing Expert System”, Journal of Police Science and Administration, 16, 210–216 (1988) Brantingham, P. J. and Brantingham, P. L., Patterns in Crimes, New York: Macmillan (1984) Brown D.E. and Hagen S.C., “Data Association Methods with Applications to Law Enforcement”, Decision Support Systems, 34, 369–378 (2003) Brown, D. E., Liu, H. and Xue, Y., “Mining Preference from Spatial-temporal Data”, Proc. of the First SIAM International Conference of Data Mining (2001) Clarke, R.V. and Cornish, D.B., “Modeling Offender’s Decisions: A Framework for Research and Policy”, Crime Justice: An Annual Review of Research, Vol. 6, Ed. by Tonry, M. and Morris, N. University of Chicago Press (1985) Chaudhuri, S. and Dayal, U., “An Overview of Data Warehousing and OLAP Technology”, ACM SIGMOD Record, 26 (1997) Dong, G., Han, J., Lam, J. Pei, J., and Wang, K., “Mining Multi-Dimensional Constrained Gradients in Data Cubes”, Proc. of the 27th VLDB Conference, Roma, Italy (2001) Everitt, B. Cluster Analysis, John Wiley & Sons, Inc. (1993) Felson, M., “Routine Activities and Crime Prevention in the Developing Metropolis”, Criminology, 25, 911–931 (1987) Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., and Chen, H., “Using Coplink to Analyse Criminal-Justice Data”, IEEE Computer, 35, 30–37 (2002) Hawkins, D., Identifications of Outliers, Chapman and Hall, London, (1980) Heck, R.O., Career Criminal Apprehesion Program: Annual Report (Sacramento, CA: Office of Criminal Justice Planning) (1991) Icove, D. J., “Automated Crime Profiling”, Law Enforcement Bulletin, 55, 27–30 (1986) Imielinski, T., Khachiyan, L., and Abdul-ghani, A., Cubegrades: “Generalizing association rules”, Technical report, Dept. Computer Science, Rutgers Univ., Aug. (2000) Kaufman, L. and Rousseeuw, P. Finding Groups in Data, Wiley (1990) Mitra, P., Murthy, C.A., and Pal, S.K., “Unsupervised Feature Selection Using Feature Similarity”, IEEE Trans. On Pattern Analysis and Machine Intelligence, 24, 301–312 (2002) Salton, G. and McGill, M. Introduction to Modern Information Retrieval, McGraw-Hill Book Company, New York (1983) Sarawagi, S., Agrawal, R., and Megiddo. N., “Discovery-driven exploration of OLAP data cubes”, Proc. of the Sixth Int’l Conference on Extending Database Technology (EDBT), Valencia, Spain (1998) Scott, D. Multivariate Density Estimation: Theory, Practice and Visualization, New York, NY: Wiley (1992) Sturges, H.A., “The Choice of a Class Interval”, Journal of American Statistician Association, 21, 65–66 (1926)
24
S. Lin and D.E. Brown
Appendix I. Attributes used in the analysis (a) MO attributes Name Description Rsus_Acts Actions taken by the suspects R_Threats Method used by the suspects to threat the victim R_Force Actions that suspects force the victim to do Rvic_Loc Location type of the victim when robbery was committed Method_Esc Method of escape the scene Premise Premise to commit the crime (b) Census attributes Attribute name Description General POP_DST Population density (density means that the statistic is divided by the area) HH_DST Household density FAM_DST Family density MALE_DST Male population density FEM_DST Female population density Race RACE1_DST RACE2_DST RACE3_DST RACE4_DST RACE5_DST HISP_DST
White population density Black population density American Indian population density Asian population density Other population density Hispanic origin population density
Population Age POP1_DST POP2_DST POP3_DST POP4_DST POP5_DST POP6_DST POP7_DST POP8_DST POP9_DST POP10_DST
Population density (0-5 years) Population density (6-11 years) Population density (12-17 years) Population density (18-24 years) Population density (25-34 years) Population density (35-44 years) Population density (45-54 years) Population density (55-64 years) Population density (65-74 years) Population density (over 75 years)
Householder Age AGEH1_DST AGEH2_DST AGEH3_DST
Density: age of householder under 25 years Density: age of householder under 25-34 years Density: age of householder under 35-44 years
Criminal Incident Data Association Using the OLAP Technology
Attribute name AGEH4_DST AGEH5_DST AGEH6_DST
Description Density: age of householder under 45-54 years Density: age of householder under 55-64 years Density: age of householder over 65 years
Housing units density Occupied housing units density Vacant housing units density Density: owner occupied housing unit with mortgage Density: owner occupied housing unit without mortgage Density: owner occupied condominiums Density: housing unit occupied by owner Density: housing unit occupied by renter
Density: occupied structure with 1 unit detached Density: occupied structure with 1 unit attached Density: occupied structure with 2 unit Density: occupied structure with 3-9 unit Density: occupied structure with 10+ unit Density: occupied structure trailer Density: occupied structure other
Income PCINC_97 MHINC_97 AHINC_97
Per capita income Median household income Average household income
School Enrollment ENRL1_DST ENRL2_DST ENRL3_DST ENRL4_DST ENRL5_DST ENRL6_DST ENRL7_DST
School enrollment density: public preprimary School enrollment density: private preprimary School enrollment density: public school School enrollment density: private school School enrollment density: public college School enrollment density: private college School enrollment density: not enrolled in school
Work Force CLS1_DST CLS2_DST
Density: private for profit wage and salary worker Density: private for non-profit wage and salary worker
25
26
S. Lin and D.E. Brown
Attribute name CLS3_DST CLS4_DST CLS5_DST CLS6_DST CLS7_DST
Description Density: local government workers Density: state government workers Density: federal government workers Density: self-employed workers Density: unpaid family workers
Expenses on alcohol and tobacco: per household Expenses on apparel: per household Expenses on education: per household Expenses on entertainment: per household Expenses on food: per household Expenses on medicine and health: per household Expenses on housing: per household Expenses on personal care: per household Expenses on reading: per household Expenses on transportation: per household Expenses on alcohol and tobacco: per capita Expenses on apparel: per capita Expenses on education: per capita Expenses on entertainment: per capita Expenses on food: per capita Expenses on medicine and health: per capita Expenses on housing: per capita Expenses on personal care: per capita Expenses on reading: per capita Expenses on transportation: per capita
(c) Distance attributes Name D_Church D_Hospital D_Highway D_Park D_School
Description Distance to the nearest church Distance to the nearest hospital Distance to the nearest highway Distance to the nearest park Distance to the nearest school
Names: A New Frontier in Text Mining 1
2
Frankie Patman and Paul Thompson 1
Language Analysis Systems, Inc. 2214 Rock Hill Rd., Herndon, VA 20170 [email protected] 2 Institute for Security Technology Studies Dartmouth College, Hanover, NH 03755 [email protected]
Abstract. Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
1 Introduction Database name matching technology has long been used in criminal investigations [1], counter-terrorism efforts [2], and in a wide variety of government processes, e.g., the processing of applications for visas. With this technology a name is compared to names contained in one or more databases to determine whether there is a match. Sometimes this matching operation may be a straightforward exact match, but often the process is more complicated. Two names may not match exactly for a wide variety of reasons and yet still refer to the same individual [3]. Often a name in a database comes from one field of a more complete database record. The values in other fields, e.g., social security number, or address, can be used to help match names which are not exact matches. The context from the complete record helps the matching process. In this paper we propose the design of a system that would extend database name matching technology to the unstructured realm of free text. Over the past 15 or so years the federal government has funded research in information extraction, e.g., the Message Understanding Conferences [4], Tipster [5], and Automatic Content
Extraction [6]. The goal of this research has been to develop the technology to extract entities, events, and their interrelationships, from free text so that the extracted entities and relationships can be stored in a relational database, or knowledgebase, to be more readily analyzed. One subtask during the last few years of the Message Understanding Conference was the Named Entity Task in which personal and company names, as well as other formatted information, was extracted from free text. The system proposed in this paper would extract personal and company names from free text for inclusion in a database, an information extraction template, or automatically marked up XML text [7]. It would expand link analysis capabilities by taking into account a broad and more realistic view of the types of name variation found in texts from diverse sources. The sophisticated name matching algorithms currently available for matching names in databases are equally suited to matching name strings drawn from text. Analogous to the way in which the context of a full database record can assist in the name matching process, in the free text application, the context of the full text of the document can be used not only to help identify and extract names, but also to match names, both within a single document and across multiple documents.
2 Database Name Matching Name matching can be defined as the process of determining whether two name strings are instances of the same name. It is a component of entity matching but is distinct from that larger task, which in many cases requires more information than a name alone. Name matching serves to create a set of candidate names for further consideration—those that are variants of the query name. ‘Al Jones’, for example, is a legitimate variant of ‘Alfred Jones,’ ‘Alan Jones,’ and ‘Albert Jones.’ Different processes from those involved in name matching will often be required to equate entities, perhaps relation to a particular place, organization, event, or numeric identifier. However, without a sufficient representation of a name (the set of variants of the name likely to occur in the data), different mentions of the same entity may not be recognized. Matching names in databases has been a persistent and well-known problem for years [8]. In the context of the English-speaking world alone, where the predominant model for names is a given name, an optional middle name, and a surname of AngloSaxon or Western European origin, a name can have any number of variant forms, and any or all of these forms may turn up in database entries. For example, Alfred James Martin can also be A. J. Martin; Mary Douglas McConnell may also be Mary Douglas or Mary McConnell or Mary Douglas-McConnell; Jack Crowley and John Crowley may both refer to the same person; the surnames Laury and Lowrie can have the same pronunciation and may be confused when names are taken orally; jSmith is a common typographical error entered for the name Smith. These familiar types of name variation pose non-trivial difficulties for automatic name matching, and numerous systems have been devised to deal with them (see [3]). The challenges to name matching are greatly increased when databases contain names from outside the Anglo-American context. Consider some common issues that arise with names from around the world.
Names: A New Frontier in Text Mining
29
In China or Korea, the surname comes first, before the given name. Some people may maintain this format in Western contexts, others may reverse the name order to fit the Western model, and still others may use either. The problem is compounded further if a Western given name is added, since there is no one place in the string of names where the additional name is required to appear. Ex: Yi Kyung Hee ~ Kyung Hee Yi ~ Kathy Yi Kyung Hee ~ Yi Kathy Kyung Hee ~ Kathy Kyung Hee Yi In some Asian countries, such as Indonesia, many people have only one name; what appears to be a surname is actually the name of the father. Names are normally indexed by the given name. Ex: former Indonesian president Abdurrahman Wahid is Mr. Abdurrahman (Wahid being the name of his father). A name from some places in the Arab world may have many components showing the bearer’s lineage, and none of these is a family name. Any one of the name elements other than the given name can be dropped. Ex: Aziz Hamid Salim Sabah ~ Aziz Hamid ~ Aziz Sabah ~ Aziz Hispanic names commonly have two surnames, but it is the first of these rather than the last that is the family name. The final surname (which is the mother’s family name) may or may not be used. Ex: Jose Felipe Ortega Ballesteros ~ Jose Felipe Ortega, but is less likely to refer to the same person as Jose Felipe Ballesteros There may be multiple standard systems for transliterating a name from a native script (e.g. Arabic, Chinese, Hangul, Cyrillic) into the Roman alphabet, individuals may make up their own Roman spelling on the fly, or database entry operators may spell an unfamiliar name according to their own understanding of how it sounds. Ex: Yi ~ Lee ~ I ~ Lie ~ Ee ~ Rhee Names may contain various kinds of affixes, which may be conjoined to the rest of the name, separated from it by white space or hyphens, or dropped altogether. Ex: Abdalsharif ~ Abd al-Sharif ~ Abd-Al-Sharif ~ Abdal Sharif; al-Qaddafi ~ Qaddafi Systems for overcoming name variation search problems typically incorporate one or more of (1) a non-culture-specific phonetic algorithm (like Soundex1 or one of its refinements, e.g. [9]); (2) allowances for transposed, additional, or missing characters; (3) allowances for transposed, additional or missing name elements and for initials and abbreviations; and (4) nickname recognition. See [10] for a recent example. Less commonly, culture-specific phonetic rules may be used. The most serious problem for name-matching software is the wide variety of naming conventions represented in modern databases, which reflects the multicultural composition of many societies. Name-matching algorithms tend to take a one-size-fits-all approach, either by underestimating the effects of cultural variation, 1
Soundex, the most well-known algorithm for variant name searching in databases, is a phonetics-based system patented in 1918. It was devised for use in indexing the 1910 U.S. census data. The system groups consonants into sets of similar sounds (based on American names reported at the time) and assigns a common code to all names beginning with the same letter and sharing the same sequence of consonant groups. Soundex does not accommodate certain errors very well, and groups many highly dissimilar names under the same code. See [11].
30
F. Patman and P. Thompson
or by assuming that names in any particular data source will be homogenous. This may give reasonable results for names that fit one model, but may perform very poorly with names that follow different conventions. In the area of spelling variation alone, which letters are considered variants of which others differs from one culture to the next. In transcribed Arabic names, for example, the letters “K” and “Q” can be used interchangeably; “Qadafi” and “Kadafi” are variants of the same name. This is not the case in Chinese transcriptions, however, where “Kuan” and “Quan” are most likely to be entirely different names. What constitutes similarity between two name strings depends on the culture of origin of the names, and typically this must be determined on a case-by-case basis rather than across an entire data set. Language Analysis Systems, Inc. (LAS) has implemented a number of approaches to coping with the wide array of multi-cultural name forms found in databases. Names are first submitted to an automatic analysis process, which determines the most likely cultural/linguistic origin of the name (or, at the discretion of the user, the culture of origin can be manually chosen). Based on this determination, an appropriate algorithm or set of rules is applied to the matching process. LAS technologies include culturally sensitive search systems and processes for generating variants of names, among others. Some of the LAS technologies are briefly discussed below. Automatic Name Analysis: The name analysis system (NameClassifier¹) contains a knowledge base of information about name strings from various cultures. An input name is compared to what is known about name strings from each of the included cultures, and the probability of the name’s being derived from each of the cultures is computed. The culture with the highest score is assigned to the input name. The culture assignment is then used by other technologies to determine the most appropriate name-matching strategy. NameVariantGenerator¹: Name variant generation produces orthographic and syntactic variants of an input string. The string is first assigned a culture of origin through automatic name analysis. Culture-specific rules are then applied to the string to produce a regular expression. The regular expression is compared to a knowledge base of frequency information about names drawn from a database of over 750,000,000 names. Variant strings with a high enough frequency score are returned in frequency-ranked order. This process creates a set of likely variants of a name, which can then be used for further querying and matching. NameHunter¹: NameHunter¹ is a search engine that computes the similarity of two name strings based on orthography, word order, and number of elements in the string. The thresholds and parameters for comparison differ depending on the culture assignment of the input string. If a string from the database has a score that exceeds the thresholds for the input name culture, the name is returned. Returns are ranked relative to each other, so that the highest scoring strings are presented first. NameHunter allows for noisy data; thresholds can be tweaked by the user to control the degree of noise in returns. MetaMatch¹: MetaMatch¹ is a phonetic-based name retrieval system. Entry strings are first submitted to automatic name analysis for a culture assignment. Strings are then transformed to phonetic representations based on culture-specific rules, which are then stored in the database along with the original entry. 
Query strings are similarly processed, and the culture assignment is retained to determine the particular
Names: A New Frontier in Text Mining
31
parameters and thresholds for comparison. A similarity algorithm based on linguistic principles is used to determine the degree of similarity between query and entry strings [12]. Returns are presented in ranked order. This approach is particularly effective when name entries have been drawn from oral sources, such as telephone conversations. NameGenderizer¹: This module returns the most likely gender for a given name based on frequency of assignment of the name to males or females. A major advantage of the technologies developed by LAS is that a measure of similarity between name forms is computed and used to return names in order of their degree of similarity to the query term. An example of the effectiveness of this approach over a Soundex search is provided in Fig.1 in the Appendix.
3 Named Entity Extraction The task of named entity recognition and extraction is to identify strings in text that represent names of people, organizations, and places. Work in this area began in earnest in the mid-eighties, with the initiation of the Message Understanding Conferences (MUC). MUC is largely responsible for the definition of and specifications for the named entity extraction task as it is understood today [4]. Through MUC-6 in 1995, most systems performing named entity extraction were based on hand-built patterns that recognized various features and structures in the text. These were found to be highly successful, with precision and recall figures reaching 97% and 96%, respectively [4]. However, the systems were trained exclusively on English-language newspaper articles with a fixed set of domains, leaving open the question of how they would perform on other text sources. Bikel et al. [13] found that rules developed for one newswire source had to be adapted for application to a different newswire service, and that English-language rules were of little use as a starting point for developing rules for an unrelated language like Chinese. These systems are labor-intensive and require people trained in text analysis and pattern writing to develop and maintain rule sets. Much recent work in named entity extraction has focused on statistical/ probabilistic approaches (e.g., [14], [15], [13], [16]). Results in some cases have been very good, with F-measure scores exceeding 94%, even for systems gathering information from the least computationally expensive sources, such as punctuation, dictionary look-up, and part-of-speech taggers [15]. Borthwick et al. [14] found that by training their system on outputs tagged by hand-built systems (such as SRA’s NameTag extractor), scores improved to better than 97%, exceeding the F-measure scores of hand-built systems alone, and rivaling scores of human annotators. These results are very promising and suggest that named entity extraction can be usefully applied to larger tasks such as relation detection and link analysis (see, for example, [17]).
32
F. Patman and P. Thompson
4 Intra- and Inter-document Coreference The task of determining coreference can be defined as “the process of determining whether two expressions in natural language refer to the same entity in the world,” [18]. Expressions handled by coreference systems are typically limited to noun phrases of various types—including proper names—and pronouns. This paper will consider only coreference between proper names. For a human reader, coreference processes take place within a single document as well as across multiple documents when more than one text is read. Most coreference systems deal only with coreference within a document (see [19], [20], [21], [18], [22]). Recently, researchers have also begun work on the more difficult task of crossdocument coreference ([23], [24], [25]). Bagga [26] offers a classification scheme for evaluating coreference types and systems for performing coreference resolution, based in part on the amount of processing required. Establishing coreference between proper names was determined to require named entity recognition and generation of syntactic variants of names. Indeed, the coreference systems surveyed for this paper treat proper name variation (apart from synonyms, acronyms, and abbreviations) largely as a syntactic problem. Bontcheva et al., for example, allow name variants to be an exact match, a word token match that ignores punctuation and word order (e.g., “John Smith” and “Smith, John”), a first token match for cases like “Peter Smith” and “Peter,” a last token match for e.g., “John Smith” and “Smith,” a possessive form like “John’s,” or a substring in which all word tokens in the shorter name are included in the longer one (e.g., “John J. Smith” and “John Smith”). Depending on the text source, name variants within a single document are likely to be consistent and limited to syntactic variants, shortened forms, and synonyms, such as nicknames.2 One would expect intra-document coreference results for proper names under these circumstances to be fairly good. Bontcheva et al. [19] obtained precision and recall figures ranging from 94%-98% and 92%-95%, respectively, for proper name coreferences in texts drawn from broadcast news, newswire, and newspaper sources.3 Bagga and Baldwin [23] also report very good results (F-measures up to 84.6%) for tests of their cross-document coreference system, which compares summaries created for extracted coreference chains. Note, however, that their reported research looked only for references to entities named "John Smith," and that the focus of the cross-document coreference task was maintaining distinctions between different entities with the same name. Research was conducted exclusively on texts from the New York Times. Nevertheless, their work demonstrates that context can be effectively used for disambiguation across documents. Ravin and Kazi [24] focus on both distinguishing different entities with the same name and merging variant names 2
3
Note, however, that even within a document inconsistencies are not uncommon, especially when dealing with names of non-European origin. A Wall Street Journal article appearing in January 2003 referred to Mohammed Mansour Jabarah as Mr. Jabarah, while Khalid Sheikh Mohammed was called Mr. Khalid. When items other than proper names are considered for coreference, scores are much lower than those reported by Bontcheva et al. for proper names. The highest F-measure score for coreference at the MUC-7 competition was 61.8%. This figure includes coreference between proper names, various types of noun phrases, and pronouns.
Names: A New Frontier in Text Mining
33
referring to a single entity. They use the IBM Context Thesaurus to compare the contexts in which similar names from different documents are found. If there is enough overlap in the contextual information, the names are assumed to refer to the same entity. Their work was also limited to articles from the New York Times and the Wall Street Journal, both of which are edited publications with a high degree of internal consistency. Across documents from a wide variety of sources, consistent name variants cannot be counted on, especially for names originating outside the Anglo/Western European tradition. In fact, the many types of name variation commonly found in databases can be expected. A recent web search on Google for texts about Muammar Qaddafi, for example, turned up thousands of relevant pages under the spellings Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi (and these are only a few of the variants of this name known to occur). A coreference system that can be of use to agencies dealing with international names must be able to recognize name strings with this degree of variation as potential instances of a single name. Cross-document coreference systems currently suffer from the same weakness as most database name search systems. They assume a much higher degree of source homogeneity than can be expected in the world outside the laboratory, and their analysis of name variation is based on an Anglo/Western European model. For the coreference systems surveyed here, recall would be a considerable problem within a multi-source document collection containing non-Western names. However, with an expanded definition of name variation, constrained and supplemented by contextual information, these coreference technologies can serve as a starting point for linking and disambiguating entities across documents from widely varying sources.
5 Name Text Mining Support for Visualization, Link Analysis, and Deception Detection Commercial and research products for visualization and link analysis have become widely available in recent years, e.g., Hyperbolic Tree, or Star Tree [27], SPIRE [28], COPLINK [29], and InfoGlide [30]. Visualization and link analysis continues to be an active area of on-going research [31]. Some current tools have been incorporated into systems supporting intelligence and security informatics. For example, COPLINK [29] makes use of several visualization and link analysis packages, including i2’s [32] Analyst Notebook. Products such as COPLINK and InfoGlide also support name matching and deception detection. These tools make use of sophisticated statistical record linkage, e.g. [33], and have well developed interfaces to support analysts [32, 29]. Chen et al. [29] note that COPLINK Connect has the built-in capability for partial and phonetic-based name searches. It is not clear from the paper, however, what the scope of coverage is for phonetically spelled names, or how this is implemented. Research software and commercial products have been developed, such as those presented in [34, 30], which include modules that detect fraud in database records. These applications’ foci model ways that criminals, or terrorists, typically alter records to disguise their identity. The algorithms used by these systems could be
34
F. Patman and P. Thompson
augmented by taking into account a deeper multi-cultural analysis of names, as discussed in section 2.
6 Procedure for a Name Extraction and Matching Text Mining Module In this section a procedure is presented for name extraction and matching within and across documents. This algorithm could be incorporated in a module that would work with an environment such as COPLINK. The basic algorithm is as follows. Within document: 1. Perform named entity extraction. 2. Establish coreference between name mentions within a single document, creating an equivalence class for each named entity. 3. Discover relations between equivalence classes within each document 4. Find the longest canonical name string in each equivalence class. 5. Perform automatic name analysis on canonical names using NameClassifier; retain culture assignment. 6. Generate variant forms of canonical names according to culture-specific criteria using NameVariantGenerator. Across documents: 7. For each culture identified during name analysis, match sets of canonical name variants belonging to that culture against each other; for each pair of variant sets considered, if there are no incompatible (non-matching) members in the sets, mark as potential matches (e.g., Khalid bin (son of) Jamal and Khalid abu (father of) Jamal would be incompatible). 8. For potential name set matches, use a context thesaurus like that described in [24] to compare contexts where the names in the equivalence classes are found; if there are enough overlapping descriptions, merge the equivalence classes for the name sets (which will also expand the set of relations for the class to include those found in both documents); combine variant sets for the two canonical name strings into a single set, pruning redundancies. 9. For potential name set matches where overlapping contextual descriptions do not meet the minimum threshold, mark as a potential link, but do not merge. 10. Repeat process from #7 on for each pair of variant sets, until no further comparisons are possible. This algorithm could be implemented within a software module of a larger text mining application. The simplest integration of this algorithm would be as a module that extracted personal names from free text and stored the extracted names and relationships in a database. As discussed by [7], it would also be possible to use this algorithm to annotate the free text, in addition to creating database entries. This automatic markup would provide an interface for an analyst which would show not only the entities and their relationships, but also preserve the context of the surrounding text.
Names: A New Frontier in Text Mining
35
7 Research Issues This paper proposes an extension of linguistically-based, multi-cultural database name matching functionality to the extraction and matching of names from full text documents. To accomplish such an extension implies an effective integration of database and document retrieval technology. While this has been an on-going research topic in academic research [35, 36] and has received attention from major relational database vendors such as Oracle, Sybase, and IBM, effective integration has not yet been achieved, in particular in the area of intelligence and security informatics [37]. Achieving the sophistication of database record matching for names extracted from free text implies advances in text mining [38, 39, 40, 41]. One useful structure for supporting cross document name matching would be an authority file for named entities. Library catalogs maintain authority files which have a record for each author, showing variant names, pseudonyms, and so on. An authority file for named entity extraction could be built which would maintain a record for each entity. The record could start with information about the entity extracted from database records. When the named entity was found in free text, contextual information about the entity could be extracted and stored in the authority file with an appropriate degree of probability in the accuracy of the information included. For example, a name followed by a comma-delimited parenthetical expression, is a reasonably accurate source of contextual information about an entity, e.g., “X, president of Y, resigned yesterday”. A further application of linguistic/cultural classification of names could be to tracking interactions between groups of people where there is a strong association between group membership and language. For example, an increasing number of police reports in which both Korean and Cambodian names are found in the same documents might indicate a pattern in Asian crime ring interactions. Finally, automatic recognition of name gender could be used to support the process of pronominal coreference. Work is underway to provide a quantitative comparison of key-based name matching systems (such as Soundex) with other approaches to name matching. One of the hindrances to effective name matching system comparisons is the lack of generally accepted standards for what constitutes similarity between names. Such standards are difficult to establish in part because the definition of similarity changes from one user community to the next. A standardized metric for the evaluation of degrees of correlation of name search results, and a means for using this metric to measure the usefulness of different name search technologies is sorely needed. This paper has focused on personal name matching. Matching of other named entities, such as organizations, is also of interest for intelligence and security informatics. While different matching algorithms are needed, extending company name matching, or other entity matching, to free text will also be useful. One promising research direction integrating database, information extraction, and document retrieval that could support effective text mining of names is provided by work on XIRQL [7].
36
F. Patman and P. Thompson
8 Conclusion Effective tools exist for multi-cultural database name matching and this technology is becoming available in analytic tool kits supporting intelligence and security informatics. The proportion of data of interest to intelligence and security analysts that is contained in databases, however, is very small compared to the amount of data available in free text and audio formats. The extension of name extraction and matching to free text and audio will add important text mining functionality for intelligence and security informatics toolkits.
Taft, R.L.: Name Search Techniques. Special Rep. No. 1. Bureau of Systems Development, New York State Identification and Intelligence System, Albany (1970) Verton, D.: Technology Aids Hunt for Terrorists. Computer World, 9 September (2002) Borgman, C.L., Siegfried, S.L.: Getty’s Synoname and Its Cousins: A Survey of Applications of Personal Name-Matching Algorithms. Journal of the American Society for Information Science, Vol. 43 No. 7. (1992) 459–476 Grishman, R., Sundheim, B.: Message Understanding Conference – 6: A Brief History. In: th Proceedings of the 16 International Conference on Computational Linguistics. Copenhagen (1999) DARPA. Tipster Text Program Phase III Proceedings. Morgan Kaufmann, San Francisco (1999) National Institute of Standards and Technology. ACE-Automatic Content Extraction Information Technology Laboratories. http://www.itl.nist.gov/iad/894.01/tests/ace/index.htm (2000) Fuhr, N.: XML Information Retrieval and Extraction [to appear] Hermansen, J.C.: Automatic Name Searching in Large Databases of International Names. Georgetown University Dissertation, Washington, DC (1985) Holmes, D., McCabe, M.C.: Improving Precision and Recall for Soundex Retrieval. In: Proceedings of the 2002 IEEE International Conference on Information Technology – Coding and Computing. Las Vegas (2002) Navarro, G., Baeza-Yates, R., Azevedo Arcoverde, J.M.: Matchsimile: A Flexible Approximate Matching Tool for Searching Proper Names. Journal of the American Society for Information Science and Technology, Vol. 54 No. 1 (2003) 3–15 Patman, F., Shaefer, L.: Is Soundex Good Enough for You? On the Hidden Risks of Soundex-Based Name Searching. Language Analysis Systems, Inc., Herndon (2001) Lutz, R., Greene, S.: Measuring Phonological Similarity: The Case of Personal Names. Language Analysis Systems, Inc., Herndon (2002) Bikel, D.M., Schwartz, R., Weischedel, R.M.: An Algorithm that Learns What’s in a Name. Machine Learning, Vol. 34 No. 1-3. (1999) 211–231 Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: NYU: Description of the MENE Named Entity System as Used in MUC-7. In: Proceedings of the Seventh Message Understanding Conference. Fairfax (1998) Baluja, S., Mittal, V.O., Sukthankar, R.: Applying Machine Learning for High Performance Named-Entity Extraction. Pacific Association for Computational Linguistics (1999) Collins, M.,: Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted th Perceptron. In: Proceedings of the 40 Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 489–496
Names: A New Frontier in Text Mining
37
17. Zelenko, D., Aone, C., Richardella, A.: Kernel Methods for Relation Detection Extraction. Journal of Machine Learning Research [to appear] 18. Soon, W.M., Ng, H.T., Lim, D.C.Y.: A Machine Learning Approach to Coreference Resolution of Noun Phrases. Association for Computational Linguistics (2001) 19. Bontcheva, K., Dimitrov, M., Maynard, D., Tablin, V., Cunningham, H.: Shallow Methods for Named Entity Coreference Resolution. TALN (2002) 20. Hartrumpf, S.: Coreference Resolution with Syntactico-Semantic Rules and Corpus Statistics. In: Proceedings of CoNLL-2001. Toulouse (2001) 137–144 21. Ng, V., Cardie, C.: Improving Machine Learning Approaches to Coreference Resolution. th In: Proceedings of the 40 Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 104–111 22. McCarthy, J.F., Lehnert, W.G.: Using Decision Trees for Coreference Resolution. In: Mellish, C. (ed.): Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (1995) 1050–1055 23. Bagga, A., Baldwin, B.: Entity-Based Cross-Document Coreferencing Using the Vector th Space Model. In: Proceedings of the 36 Annual Meeting of the Association for th Computational Linguistics and the 17 International Conference on Computational Linguistics (1998) 79–85 24. Ravin, Y., Kazi, Z. Is Hillary Rodham Clinton the President? Disambiguating Names Across Documents. In: Proceedings of the ACL’99 Workshop on Coreference and Its Applications (1999) 25. Schiffman, B., Mani, I., Concepcion, K.J. : Producing Biographical Summaries : th Combining Linguistic Knowledge with Corpus Statistics. In: Proceedings of the 39 Annual Meeting of the Association for Computational Linguistics (2001) 450–457 26. Bagga, A.: Evaluation of Coreferences and Coreference Resolution Systems. In: Proceedings of the First International Conference on Language Resources and Evaluation (1998) 563–566 27. Inxight. A Research Engine for the Pharmaceutical Industry. http://www.inxight.com 28. Hetzler, B., Harris, W.M., Havre, S., Whitney, P.: Visualizing the Full Spectrum of Document Relationships. In: Structures and Relations in Knowledge Organization. th Proceedings of the 5 International ISKO Conference. ERGON Verlag, Wurzburg (1998) 168–175 29. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, Vol. 46 No. 1 (2003) 30. InfoGlide Software. Similarity Search Engine: The Power of Similarity Searching. http://www.infoglide.com/content/images/whitepapers.pdf(2002) 31. American Association for Artificial Intelligence Fall Symposium on Artificial Intelligence and Link Analysis (1998) 32. i2. Analyst’s Notebook. http://www.i2.co.uk/Products/Analysts_Notebook (2002) 33. Winkler, W.E.: The State of Record Linkage and Current Research Problems. Technical Report RR99/04. U.S. Census Bureau, http://www.census.gov/srd/papers/pdf/rr99-04.pdf 34. Wang, G., Chen, H., Atabakhsh, H.: Automatically Detecting Deceptive Criminal Identities [to appear] 35. Fuhr, N.: Probabilistic Datalog – A Logic for Powerful Retrieval Methods. In: Proceedings th of SIGIR-95, 18 ACM International Conference on Research and Development in Information Retrieval (1995) 282–290 36. Fuhr, N.: Models for Integrated Information Retrieval and Database Systems. IEEE Data Engineering Bulletin, Vol. 19 No. 1. (1996) 37. Hoogeveen, M., van der Meer, K.: Integration of Information Retrieval and Database Management in Support of Multimedia Police Work. 
Journal of Information Science, Vol. 20 No. 2 (1994) 38. Institute for Mathematics and Its Applications. IMA Hot Topics Workshop: Text Mining. http://www.ima.umn.edu/reactive/spring/tm.html (2000)
38
F. Patman and P. Thompson
39. KDD-2000 Workshop on Text Mining. The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston (2000) http:www2.cs.cmu.edu/~dunja/WshKDD2000.html 40. SIAM Text Mining Workshop. http://www.cs.utk.edu/tmw02 (2002) 41. Text-ML 2002 Workshop on Text Learning. The Nineteenth International Conference on Machine Learning ICML-2002. Sydney (2002)
Appendix: Comparison of LAS MetaMatch¹ Search Engine Returns with SQL-Soundex Returns
Fig. 1. These searches were conducted in databases containing common surnames found in the 1990 U.S. Census data. The surnames in the databases are identical. The MetaMatch database differs only in that the phonetic form of each surname is also stored. The exact match “Sadiq” th th was 54 in the list of Soundex returns. “Siddiqui” was returned by Soundex in 26 place. th “Sadik” was 109 .
Web-Based Intelligence Reports System Alexander Dolotov and Mary Strickler Phoenix Police Department 620 W. Washington Street, Phoenix, Arizona 85003 {alex.dolotov, mary.strickler}@phoenix.gov
Abstract. Two areas for discussion will be included in this paper. The first area targets a conceptual design of a Group Detection and Activity Prediction System (GDAPS). The second area describes the implementation of the WEBbased intelligence and monitoring reports system called the Phoenix Police Department Reports (PPDR). The PPDR System could be considered the first phase of a GDAPS System. The already operational PPDR system’s goal is to support data access to heterogeneous databases, provide a means to mine data using search engines, and to provide statistical data analysis with reporting capabilities. A variety of static and ad hoc statistical reports are produced with the use of this system for interdepartmental and public use. The system is scalable, reliable, portable and secured. Performance is supported on all system levels using a variety of effective software designs, statistical processing and heterogeneous databases/data storage access.
named to the “Data Maintaining and Reporting Subsystem” (DmRs), the “Group Detection Subsystem” (GTeS), and the “Activity Prediction Subsystem” (APreS). The first phase of the GDAPS System would be the PPDR System, which is currently operational within the Phoenix Police Department. This system is explained in detail in the remaining sections of this document. The DmRs subsystem (renamed from PPDR) supports access to heterogeneous databases using data mining search engines to perform statistical data analysis. Ultimately, the results are generated in report form. The GTeS subsystem would be designed to detect members of the targeted group or groups. In order to accomplish this, it would require monitoring communications between individuals using all available means. Intensity and duration of these communications can define relationships inside the group and possibly define the hierarchy of the group members. The GTeS subsystem would have to be adaptive enough to constantly upgrade information related to each controlled group, since every group has a life of its own. GTeS would provide the basic foundation for GDAPS. The purpose of the APreS subsystem is to monitor, in time, the intensity and modes of multiple groups’ communications by maintaining a database of all types of communications. The value of this subsystem would be the ability to predict groups’ activities based upon the historical correlation between abnormalities in the groups’ communication modes and intensities, along with any previous activities. APreS is the dynamic subsystem of GDAPS. To accelerate the GDAPS development, methodologies, already created for other industries, can be modified for use [1], [2], [3], [7]. Because of the complexity of the GDAPS system, a multi-phase approach to system development should be considered. Taking into account time and resources, this project can be broken down into manageable sub-projects with realistic development and implementation goals. The use of a multi-dimensional mathematical model will enable developers to assign values to different components, and to determine relationship between them. By using specific criteria, these values can be manipulated to determine the outcome under varying circumstances. The mathematical model, when optimized, will produce results that could be interpreted as “a high potential for criminal activity”. The multi-dimensional mathematical model is a powerful “forecasting” tool. It provides the ability to make decisions before a critical situation or uncertain conditions arise [4], [5], [6], [8]. Lastly, accumulated information must be stored in a database that is supported/serviced by a specific set of business applications. The following is a description of the PPDR system, the first phase of the Group Detection and Activity Prediction System (GDAPS).
2 Objectives A WEB-based intelligence and monitoring reports system called Phoenix Police Department Reports (PPDR) was designed in-house for use by the Phoenix Police Department (PPD). Even though this system was designed specifically for the Phoenix Police Department, it could easily be ported for use by other law enforcement agencies. Within seconds, this system provides detailed, comprehensive, and informative statistical reports reflecting the effectiveness and responsiveness of any division, for any date/time period, within the Phoenix Police Department. These reports are designed for use by all levels of management, both sworn and civilian, from police
chiefs' requests to public record requests. The statistical data from these reports provides information for use in making departmental decisions concerning such issues as manpower allocation, restructuring and measurement of work. Additionally, PPDR uses a powerful database mining mechanism, which would be valuable for the future development of the GDAPS System. In order to satisfy the needs of all users, the PPDR system is designed to meet the following requirements:
- maintain accurate and precise up-to-date information;
- use a specific mathematical model for statistical analysis and optimization [5], [6];
- perform at a high level with quick response times;
- support different security levels for different categories of users;
- be scalable and expandable;
- have a user-friendly presentation; and
- be able to easily maintain reliable and optimized databases and other information storage.
The PPDR system went into production in February 2002. This system contains original and effective solutions. It provides the capability to make decisions which will ultimately have an impact on the short- and long-term plans for the department, the level of customer service provided to the public, overall employee satisfaction and the organizational changes needed to achieve future goals. The PPDR system could be considered the first phase of a complex Intelligence Group Detection and Activity Prediction System.
3 Relationships to Other Systems and Sources of Information
3.1 Calls for Service
There are two categories of information used for the PPDR: calls for service data and text messages sent by Mobile Data Terminal (MDT) and Computer Aided Dispatch (CAD) users. Both sources of information are obtained from the Department's Computer Aided Dispatch and Mobile Data Terminal (CAD/MDT) System. The CAD/MDT System operates on three redundant Hewlett Packard (HP) 3000 N-Series computers. The data is stored in HP's proprietary Image database for six months. The Phoenix Police Department's CAD/MDT System handles over 7,000 calls for service daily from citizens of Phoenix. Approximately half of these calls require an officer to respond. The other half are either duplicates or calls where the caller is asking for general information or wishing to report a non-emergency incident.
Calls for Service data is collected when a citizen calls the emergency 911 number or the Department's crime stop number for service. A call entry clerk enters the initial call information into CAD. The address is validated against a street geobase, which provides information required for dispatching, such as the grid, the beat and the responsible precinct where the call originated. After all information is collected, the call is automatically forwarded to a dispatcher for distribution to an officer or officers in the field. Officers receive the call information on their Mobile Data Terminals (MDT). They enter the time they start on the call, arrive at the scene and the time they
complete the call. Each call for service incident is given a disposition code that relates to how an officer or officers handled the incident. Calls for service data for completed incidents are transferred to a SQL database on a daily basis for use in the PPDR System. Police officers and detectives use calls for service information for investigative purposes. It is often requested by outside agencies for court purposes or by the general public for their personal information. It is also used internally for statistical analysis.
3.2 Messages
Messages are text sent between MDT users, between MDT users and CAD users, and between CAD users. The MDT system uses a Motorola radio system for communications, which interfaces to the CAD system through a programmable interface computer. The CAD system resides on a local area network within the Police Department. The message database also contains the results of inquiries on persons, vehicles, or articles requested by officers in the field from their MDTs or by CAD users from any CAD workstation within the Department. Each message stored by the CAD system contains structured data, such as the identification of the message sender and the date and time sent, along with the free-form body of the message. Every twenty-four hours, more than 15,000 messages pass through the CAD System. Copies of messages are requested by detectives, police officers, the general public and court systems, as well as outside law enforcement agencies.
4 PPDR System Architecture The system architecture of the PPDR system is shown in Figure 1.
5 PPDR Structural WEB Design
PPDR has been designed with seven distinctive subsystems incorporated within one easy-to-access location. The subsystems are as follows: Interdepartmental Reports; Ad Hoc Reports; Public Reports; Messages Presentation; Update Functionality; Administrative Functionality; and System Security. Each subsystem is designed to be flexible as well as scalable, and each has the capability of being easily expanded or modified to satisfy user enhancement requests.
Fig. 1. PPDR Architecture
5.1 System Security
Security begins when a user logs into the system and is continuously monitored until the user logs off. The PPDR security system is based on the assignment of roles to each user through the Administrative function. Role assignment is maintained across multiple databases; each database maintains a set of roles for the PPDR system. Each role can be assigned to both a database object and a WEB functionality. This results in a user being able to perform only those WEB and database functions that are available to his/her assigned role. When a user logs onto the system, the userid and password are validated against the database security information. Database security does not use custom tables but rather database tables that contain encrypted roles, passwords, userids and logins. After a successful login, the role assignment is maintained at the WEB level in a secure state and remains intact during the user's session.
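A minimal sketch of the role-based check just described, written in Python against a generic SQL database; the users and user_roles tables and their columns are hypothetical stand-ins, since the production system keeps this information in the database engine's own encrypted security tables rather than custom tables:

import hashlib
import sqlite3

def authenticate(conn: sqlite3.Connection, userid: str, password: str) -> list:
    """Validate a login and return the roles granted to the user."""
    pw_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM users WHERE userid = ? AND password_hash = ?",
        (userid, pw_hash)).fetchone()
    if row is None:
        raise PermissionError("invalid userid or password")
    # The roles drive both database-object access and which WEB pages are shown.
    return [r[0] for r in conn.execute(
        "SELECT role FROM user_roles WHERE userid = ?", (userid,))]

def require_role(session_roles: list, needed: str) -> None:
    """Guard a WEB or database function behind an assigned role."""
    if needed not in session_roles:
        raise PermissionError("role '%s' required" % needed)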
The PPDR System has two groups of users: those that use the Computer Aided Dispatch System (CAD) and those that do not. Since most of the PPDR users are CAD users, it makes sense to keep the same userids and passwords for both CAD and PPDR. Using a scheduled Data Transfer System (DTS) process, CAD userids and passwords are relayed to the PPDR system on a daily basis, automatically initiating a database update process in PPDR. Non-CAD users are entered into the PPDR system through the Administrative (ADMIN) function. This process does not involve the DTS transfer, but is performed in real time by a designated person or persons with ADMIN privileges. Security for non-CAD users is identical to that of CAD users, including transaction logging that captures each WEB page access. In addition to transaction logging, another useful security feature is the storage of user history information at the database level. Anyone with ADMIN privileges can produce user statistics and historical reports upon request.

5.2 Regular Reports
In general, Regular Reports are reports that have a predefined structure based on input parameters entered by the user. In order to obtain the best performance and accuracy for these reports, the following technology has been applied: a special design of multiple databases which includes "summary" tables (see Section 6, Database Solutions); the use of cross-table reporting functionality, which allows a cross-table recordset to be created at the database level; and the use of a generic XML stream with client-side XSLT transformation, instead of ActiveX controls, for the creation of reports. Three groups of Regular Reports are available within the PPDR system: Response Time Reports, Calls for Service Reports and Details Reports.

Response Time Reports. Response Time Reports present statistical information regarding the average response time for calls for service, using data obtained from the CAD System. Response time is the period between the time an officer was dispatched on a call for service and the time the officer actually arrived on the scene. Response time reports can be produced on several levels, including but not limited to the beat, squad, precinct and even citywide level. Using input parameters such as date, time, shift, and squad area, a semi-custom report is produced within seconds. Figure 2 shows an example of the "Average Quarterly Response Time By Precinct" report for the first quarter of 2002. This report calculates the average quarterly response time for each police precinct based on the priorities assigned to the calls for service. The rightmost column (PPD) is the citywide average, again broken down by priority.

Fig. 2. Response Time Reports

Calls for Service Reports. Calls for Service Reports document the number of calls for service in a particular beat, squad, precinct area or citywide. These reports have many of the same parameters as the Response Time Reports. Some reports in this group are combination reports, displaying both the counts for calls for service and the average response time. Figure 3 shows an example of a "Monthly Calls for Service by Squad" report for the month of January 2002. This report shows a count of the calls for service for each squad area in the South Mountain precinct, broken down by calls that are dispatched and calls that are handled by a phone call made by the Callback Unit.
Fig. 3. Calls For Service Report
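As a rough illustration of the aggregation behind reports such as "Average Quarterly Response Time By Precinct", the sketch below runs one grouped SQL query in Python; the calls_for_service table and its columns are hypothetical stand-ins, not the actual PPDR schema:

import sqlite3

AVG_RESPONSE_SQL = """
    SELECT precinct,
           priority,
           AVG(strftime('%s', arrived_time) - strftime('%s', dispatched_time))
               AS avg_response_seconds
    FROM calls_for_service
    WHERE received_date BETWEEN :start AND :end
    GROUP BY precinct, priority
    ORDER BY precinct, priority
"""

def average_response_time(conn: sqlite3.Connection, start: str, end: str) -> list:
    """Average response time (dispatch to arrival) per precinct and priority."""
    return conn.execute(AVG_RESPONSE_SQL, {"start": start, "end": end}).fetchall()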
Details Reports. These reports are designed to present important details for a particular call for service. Details for a call for service include such information as call location, disposition code (action taken by the officer), radio code (type of call for service - burglary, theft, etc.), received time and responding officer(s). From a Detail Report, other pertinent information related to a call for service is obtained quickly with a click of the mouse. Other available information includes unit history information, a collection of data for all the units that responded to a particular call for service, such as the time the unit was dispatched, the time the unit arrived and what people or plates were checked.

5.3 AD HOC Reports
The AD HOC Reports subsystem provides the ability to produce "custom" reports from the calls for service data. To generate an AD HOC report, a user should have basic knowledge of SQL query search criteria as well as basic knowledge of the calls for service data. There are three major steps involved in producing an AD HOC report:
selecting report columns and selecting search criteria, both of which use an active dialog, and report generation, which uses XML/XSLT transformation.

Selecting Report Columns. The first page that is presented when entering the AD HOC Reports subsystem allows the user to choose, from a list of tables, the fields that are to be displayed in the desired report. OLAP functionality is used for accessing the database's schema, such as available tables and their characteristics, column names, formats, aliases and data types. The first page of the AD HOC Reports subsystem is shown in Fig. 4. A selection can be made for any required field by checking the left check box. Other options such as original value (Orig), count, average (Averg), minimum (Minim), and maximum (Maxim) are also available to the user. Count, average, minimum and maximum are only available for numeric fields. As an example, if a user requires a count of the number of calls for service, a check is required in the Count field. When the boxes are checked, the SELECT clause is generated as a DHTML script. For instance, if the selected fields for an Ad Hoc report are 'Incident Number', 'Date', 'Address' and 'Average of the Response Time' (all members of the Incidents table), the following SELECT clause will be generated:
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address', Incidents.Inc_Time_Rec AS 'Received time', Avg(Incidents.Inc_Time_Rec) AS 'Avg Of Received time' FROM Incidents
Syntax is maintained in the SELECT clause generation at the business logic level using COM+ objects.
Selecting Search Criteria. When all desired fields have been selected, the user clicks on "Submit" and the search criteria page (Fig. 5) is presented.
Fig. 4. Selecting Report Columns
The search criteria page allows the user to build the search criteria necessary for the generation of the desired report. Most available criteria and their combinations are available to the user (e.g., >, <, =), with valid values presented in the drop-down boxes. Options such as Grouped By, Ordered By, and Ascending and Descending order are also available to the user. The final query statement generated from the example above is as follows:
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address' FROM Incidents WHERE Incidents.Inc_Number = 20000077
All syntax and logic rules are applied to this page through COM+ objects and Microsoft Transaction Server (MTS) interfaces. When using the AD HOC report feature, a user may need to cross years to create the desired report. When requested, this requires accessing data stored in multiple databases: each year's worth of calls for service data is stored in a separate database. Generally, the date field is used to determine the correct database to access, but if the date is not part of the search criteria, the incident number (the number assigned to the calls for service record by the CAD system) can be used. The first character of the incident number determines the year of the call for service. In the example above, the incident number was used to determine the correct database to access by using special database validation procedures. These procedures return the correct FROM statement with the necessary modifications to capture the valid calls for service records. The modified SELECT statement for this query is as follows:
Fig. 5. Selecting Search Criteria
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address' FROM IncDB2002..Incidents WHERE Incidents.Inc_Number = 20000077
Section 6, Database Solutions, provides a more detailed discussion of how yearly data is stored in multiple databases and how these databases are accessed when creating a report that crosses years or accesses a previous year's data.
Report Generation. Report generation processing is similar to what was previously described. The returned record set is converted into an XML stream, which is then converted to DHTML using an XSLT script. The only difference is that reports generated using the AD HOC feature may result in a multiple-page response. Special page services for the client have been added to handle multiple pages. A final AD HOC report using the special page services, along with the required disclaimer, is shown in Fig. 6.

5.4 Public Reports
This group of reports is designed for public dissemination. The creation of these reports is the same as described in Section 5.2 (Details Reports). Users with a "public" role assignment can only access the reports in this group and no others. A "security" filter is applied to these reports to protect sensitive information from being distributed to the general public. This "security" filter is adjustable and can be modified as necessary if and when public information laws are changed.
Fig. 6. Final AD HOC Report
5.5 CAD/MDT Messages Reports The Computer Aided Dispatch/Mobile Data System (CAD/MDT) captures text messages sent between mobile units in the field and messages sent to and from other CAD users such as dispatchers and desk aides. In addition to text messages, every vehicle and person query is also captured. These messages are transferred from the CAD/MDT System to the PPDR System on a daily basis. CAD/MDT only stores one day’s worth of messages (about 200 MB) at one time while PPDR retains one full year of these messages. Each message block includes message type, identification of the sender and receiver of the message, date/time when the message was generated and the body of the message, which is usually unstructured text. Single or multiple messages can be requested for reviewing and reporting for investigative purposes. Messages can be retrieved by a number of input parameters such as date, mobile unit id, CAD user id and all units in a squad. Maintaining this data on the PPDR system presented a few major obstacles. These obstacles included providing a system design with minimal data storage, while maintaining acceptable database access response time. In addition, the system had to maximize the performance of the WEB page presentation of the results. The system design techniques used to minimize storage requirements consisted of the following:
- A SQL metadatabase contains consolidated tables with descriptors and pointers to each record of the stored message file.
- A database was built that maintains the relationships between the searchable subjects in the unstructured text records and the associated pointers. A searchable subject could be the sender or receiver of a message, a vehicle identification number (VIN), a last name, a first name or a date of birth. A "subject" table was created in the database for each searchable subject; the number of these tables can be changed, depending on the number of desired searchable subjects. All subject tables are related to a pivot table with the file descriptors and are populated at the same time the transition process is run (a star schema).
- The daily message file created from CAD/MDT is broken down into twelve (12) separate files that are compressed and stored without any internal changes.
- DTS is used to load the data from CAD/MDT (see details in Section 7, Data Transition).
In the current version of the PPDR system, only the senders and receivers of messages are searchable subjects. Future development is planned to include the capability to search using the other searchable subjects mentioned above. Reports are obtained using input parameters such as a date and time range, a single sender/receiver and/or a group of senders/receivers. The result of any search includes all text messages with complete details. An example of the Messages Search screen is shown in Fig. 7. Suppose a user wishes to view all messages between a CAD dispatcher and a mobile unit id of 512G for a 24-hour period on January 1, 2002. The following procedures occur in order to retrieve the requested records:
- After all parameters are entered, clicking on "Submit" creates and populates two session-specific global temporary tables. One of the tables is populated with the pointers and the relationships to the text message files.
- The text message files for the requested data are expanded into the designated expansion area.
- Only those messages that meet the search criteria are extracted from the expanded files and placed in a structured text file.
- The structured text file is then bulk copied into the second global temporary table.
The final step is obtaining a record set from the joined global temporary tables. This record set is retrieved as an XML stream using a combination of DHTML and XSLT scripting. The results for the above query are shown in Fig. 8.
Fig. 7. Messages Search Screen
Fig. 8. Messages by Date Time Period
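The retrieval steps above can be sketched in Python; the metadatabase table (message_index) and the archive layout are hypothetical stand-ins for the PPDR structures, and the real system returns the result as an XML stream rendered with XSLT rather than a Python list:

import sqlite3
import zipfile
from pathlib import Path

def search_messages(conn: sqlite3.Connection, archive_dir: Path,
                    sender: str, receiver: str, day: str) -> list:
    """Return all messages between sender and receiver for one day."""
    # Step 1: a session-scoped temporary table of pointers taken from the
    # message metadatabase (which archived file holds each message, and where).
    conn.execute("CREATE TEMP TABLE hits (file_name TEXT, line_no INTEGER)")
    conn.execute("INSERT INTO hits SELECT file_name, line_no FROM message_index "
                 "WHERE sender = ? AND receiver = ? AND msg_date = ?",
                 (sender, receiver, day))
    # Step 2: expand only the archived files that actually contain hits and
    # copy the matching lines into a second temporary table.
    conn.execute("CREATE TEMP TABLE found (msg_text TEXT)")
    for (file_name,) in conn.execute("SELECT DISTINCT file_name FROM hits").fetchall():
        with zipfile.ZipFile(archive_dir / (file_name + ".zip")) as zf:
            lines = zf.read(file_name).decode("utf-8", "replace").splitlines()
        wanted = [n for (n,) in conn.execute(
            "SELECT line_no FROM hits WHERE file_name = ?", (file_name,))]
        conn.executemany("INSERT INTO found VALUES (?)",
                         [(lines[n],) for n in wanted])
    # Step 3: return the joined record set; PPDR streams this as XML to the browser.
    return [r[0] for r in conn.execute("SELECT msg_text FROM found")]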
The above solution provides many benefits, such as the combination of the database with file storage for rapid retrieval; scalability, so that a search can be performed using multiple parameters; high performance, with as many as 10,000 records retrieved and presented within 30 seconds; compact storage, taking less than 350 MB for 365 files; and, lastly, convenient database and file maintenance.

5.6 Updater's Block
Data that is transferred from the CAD/MDT System to the PPDR System may require updating to correct erroneous fields. For example, an officer dispatched on a high-priority call for service does not always depress the arrival button on the MDT to record the time he/she arrives on the scene, due to the criticality of the call. Other times may not have been recorded correctly because high-priority calls are handled in an urgent manner. These inaccurate times may have a drastic effect on department statistics. The actual type of call and/or the priority of the call may also have changed from the time the call was received to its completion. A daily report is generated in the CAD/MDT System and given to the officer so that the correct associated times, call type, and priority can be recorded for the call. Since the data has already been transferred to the PPDR System, a special feature was designed to give a select group of users (referred to as "Updaters") the ability to update the incident after all the data on the report has been verified for accuracy. After the submission of the corrections, all recalculations are performed in the background for the database correction. The corrected data is displayed to the user on the "updater" screen. Each transaction is recorded as a special entry in the log file for future reference. An example of the "Edit and Update Response Time" screen that is used by the "updater" to make corrections to a call for service is shown in Fig. 9.

5.7 Administrative Functionality
The PPDR System has special Administrative (ADMIN) functionality for maintaining users and security. An Administrator is the only person who has the capability to add or delete users and to make changes to logins, names, roles and passwords. Users can change their passwords when they expire. User authenticity is synchronized with the CAD System: on a daily basis, valid user profile data from the CAD System is transferred to the PPDR System using DTS, and any changes made to a user's profile in the CAD System are automatically updated in the PPDR System. Any users requiring access to the PPDR System who are not CAD users can be added by an Administrator.
Fig. 9. Edit and Update Response Time
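Behind the screen in Fig. 9, an "updater" correction boils down to an update plus an audit entry, followed by background recalculations; a simplified sketch with hypothetical table names, not the actual PPDR routines:

import sqlite3
from datetime import datetime

def apply_correction(conn: sqlite3.Connection, incident_no: int, updater: str,
                     arrived_time: str, call_type: str, priority: int) -> None:
    """Correct a verified incident and record the change for future reference."""
    with conn:  # one transaction: the update and its log entry succeed or fail together
        conn.execute(
            "UPDATE calls_for_service "
            "SET arrived_time = ?, call_type = ?, priority = ? "
            "WHERE incident_no = ?",
            (arrived_time, call_type, priority, incident_no))
        conn.execute(
            "INSERT INTO update_log (incident_no, updated_by, updated_at, note) "
            "VALUES (?, ?, ?, ?)",
            (incident_no, updater, datetime.now().isoformat(),
             "corrected times/type/priority"))
        # The affected summary-table rows would be recalculated here in the background.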
6 Database Solutions
6.1 Multiple Databases
The PPDR System maintains multiple databases for the Regular, AD HOC and Details reports. The system archives 10 years' worth of data, and if all the data were consolidated in one database, WEB performance and reliability would be negatively affected. The multiple database schema is as follows:
- Data is accumulated annually, with each year's data stored in a separate database.
- All the schemas for each yearly database are identical.
- Multiple databases are named using the same naming convention (see the routing sketch after this list).
- All databases, with the exception of the current database, are static and do not require modifications or updates.
- The current year's database is dynamic and is populated on a daily basis using a DTS process. The current year database is the only database requiring maintenance.
- Previous years' databases can be restored from any available backup in case of system failure. The only data that needs to be maintained and backed up on a regular basis is the current year database.
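Under this naming convention, picking the yearly database for a query is a small helper; the sketch below mirrors the INCDB<year> convention and the incident-number rule from Section 5.3, with the digit-to-year mapping shown purely as an assumption:

from datetime import date

def database_for(requested=None, incident_no=None, current_year=2002):
    """Resolve which yearly database (INCDB<year>) a query should run against."""
    if requested is not None:
        return "INCDB%d" % requested.year
    if incident_no is not None:
        # The first character of the incident number encodes the year; the
        # mapping below is illustrative only - the real one lives in stored procedures.
        year_digit_map = {"1": current_year - 1, "2": current_year}
        return "INCDB%d" % year_digit_map[str(incident_no)[0]]
    raise ValueError("need a requested date or an incident number")

def union_view(years):
    """For a date range crossing years, union the identical yearly schemas."""
    return "\nUNION ALL\n".join(
        "SELECT * FROM INCDB%d..Incidents" % y for y in years)

print(database_for(requested=date(2001, 1, 1)))   # INCDB2001
print(database_for(incident_no=20000077))         # INCDB2002 under the assumed mapping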
Using the above schema, a user's query should be directed only to the database and associated tables for the year in which the data resides. If the query crosses databases, the tables are joined from the appropriate databases. Before a query is processed, the appropriate database is determined from the date or dates requested by the user. For instance, if the requested date is 01/01/2001, a stored procedure is called that determines that the database named INCDB2001 will be accessed. If the requested date is a date range crossing years, a view of the union of all existing databases is processed. Every January 1st, a new database is created and the previous database is renamed appropriately. In addition, all associated views and tables from the previous year are updated. A special procedure automates this yearly renaming process.

6.2 Summary Tables
Most of the user-requested queries retrieve statistical information related to calls for service data, such as average response time, weighted averages by police precinct and citywide comparisons. Approximately 7,000 calls for service are added to the PPDR system on a daily basis. Performing the many calculations on a month's worth, or even a year's worth, of data could pose severe performance issues and affect WEB response time. To overcome any potential performance issues, special summary tables are created and maintained for each database. Most statistical information that is requested by the user is pre-calculated on a daily basis, while the calls for service data is loaded using DTS and time is not a critical issue. The pre-calculated data is grouped by various keys, such as police precinct, shift, type of call, and priority, into a summary table. This table is four-dimensional. By using the summary table, WEB response time remains extremely fast for complicated statistical queries. In the summary table, each numbered field represents a value for an appropriate shift. Each record is then related to the appropriate priority for the calls for service. Since there are three possible priorities, each subdivision could have up to three records for a particular date. Below is an example of the multidimensional summary table structure:
Fig. 10. Multidimensional Summary Table Structure
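The daily pre-calculation that fills such a summary table can be sketched as one grouped insert, run while the DTS load completes; the table and column names are hypothetical stand-ins for the real four-dimensional structure:

import sqlite3

SUMMARY_SQL = """
    INSERT INTO summary (report_date, precinct, call_type, priority, shift,
                         call_count, avg_response_seconds)
    SELECT received_date, precinct, call_type, priority, shift,
           COUNT(*),
           AVG(strftime('%s', arrived_time) - strftime('%s', dispatched_time))
    FROM calls_for_service
    WHERE received_date = :day
    GROUP BY received_date, precinct, call_type, priority, shift
"""

def precalculate_day(conn: sqlite3.Connection, day: str) -> None:
    """Run once per daily load so that report queries read the small summary
    table instead of re-aggregating thousands of raw call records."""
    with conn:
        conn.execute("DELETE FROM summary WHERE report_date = :day", {"day": day})
        conn.execute(SUMMARY_SQL, {"day": day})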
6.3 File Storage and Messages Metadatabase
The PPDR System contains text messages sent between officers in the field, between officers and dispatchers, and between other CAD users. If all the messages were to be kept in the database, it would require gigabytes of database storage, and maintenance on such a large database would be difficult. In order to overcome some of the problems with keeping and maintaining such a massive amount of data, a special solution using a database in combination with file storage was created. The file storage could reside anywhere on the network and would not have to reside on the same server as the database. On a daily basis, the message file is transferred from the CAD/MDT System to the PPDR System. The file is broken down into twelve separate files, which are zipped and compressed with an average ratio of 10. Breaking the main file into twelve files allows for parallel processing when loading the data. The original daily file has an average size of approximately 200MB; the twelve zipped files are approximately 650KB. Relationships are built between the records and file entities, such as date, time, vehicle identification and person's name, while loading using DTS. Each entity creates a relational record in the message metadatabase. When a user performs a query, three steps are involved in returning the results. First, the relationship is determined in the metadatabase. Secondly, all appropriate messages are extracted from the multiple archived files. Lastly, the output is presented using an XML output stream. There are many benefits to using a solution involving the combination of a database with file storage. These benefits include:
- Data can be searched by any of the consolidated entities;
- Improved performance, in that as many as 10,000 records can be retrieved and displayed in as little as 30 seconds;
- Compact file storage, with 365 daily files (originally about 200MB each) using less than 350 MB of space; and
- More convenient maintenance of the database and stored files.
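The daily handling of the message file just described - split it into twelve parts, compress each part, and record the searchable entities in the metadatabase - can be sketched as follows; the paths, the pipe-delimited message format and the message_index table are assumptions for illustration:

import sqlite3
import zipfile
from pathlib import Path

def archive_daily_messages(raw_file: Path, out_dir: Path,
                           conn: sqlite3.Connection, parts: int = 12) -> None:
    """Split the daily CAD/MDT message file, compress each part, and index it."""
    lines = raw_file.read_text(encoding="utf-8", errors="replace").splitlines()
    chunk = -(-len(lines) // parts)  # ceiling division
    for i in range(parts):
        part_lines = lines[i * chunk:(i + 1) * chunk]
        part_name = "%s_part%02d.txt" % (raw_file.stem, i)
        with zipfile.ZipFile(out_dir / (part_name + ".zip"), "w",
                             zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(part_name, "\n".join(part_lines))
        # Record the searchable entities (here just sender and receiver, assumed
        # to be the first two pipe-delimited fields) against file and line number.
        rows = [(ln.split("|")[0], ln.split("|")[1], part_name, n, raw_file.stem)
                for n, ln in enumerate(part_lines) if ln.count("|") >= 2]
        conn.executemany(
            "INSERT INTO message_index (sender, receiver, file_name, line_no, msg_date) "
            "VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()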
7 Data Transition
The PPDR system contains CAD/MDT data from two different sources: the Incidents data source and the Messages data source. PPDR uses a different data transition process for each data source; both are performed using DTS. The data transition process for both data sources incorporates the idea of "pre-calculation". "Pre-calculation" refers to a process in which operations on the data, such as grouping, relationship building and calculations, occur while the data is loaded into the PPDR database. This "pre-calculation" is absolutely necessary for maintaining superior WEB performance. The diagram below (Fig. 11) depicts the DTS process loading data from the Incidents data source into the multiple databases:
Fig. 11. Incidents Data Loading Diagram
The above DTS process creates multidimensional summary tables that contain the “pre-calculations” as the data is loaded. These summary tables reduce the need to perform calculations with every user request, thus greatly reducing system response time.
Fig. 12. Messages Data Loading Diagram
The second DTS process (Fig. 12) loads the Messages data source. This process breaks the daily file into twelve separate files, archives the twelve files and builds the relationships necessary to retrieve the files when requested. This solution of breaking the main message file into twelve files, all loaded at the same time by parallel processes, greatly reduces total load time.
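The parallel load of the twelve daily files can be sketched with a process pool standing in for the parallel DTS tasks; load_one_part is a placeholder for the per-file decompress/parse/bulk-copy work, not an actual PPDR routine:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_one_part(part: Path) -> int:
    """Placeholder: decompress, parse, build metadatabase relationships and
    bulk copy one of the twelve archived files; returns the rows loaded."""
    return 0

def load_all_parts(archive_dir: Path) -> int:
    """Load the twelve parts concurrently instead of one after another."""
    parts = sorted(archive_dir.glob("*.zip"))
    with ProcessPoolExecutor(max_workers=len(parts) or 1) as pool:
        return sum(pool.map(load_one_part, parts))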
8 Conclusion
The PPDR System could be considered the first phase of the Group Detection and Activity Prediction System (GDAPS). PPDR supports data access to heterogeneous databases, data mining using search engines, and the ability to produce diverse statistical reports. Using PPDR as a base, development of GDAPS becomes realistic. By itself, the PPDR System is a powerful WEB-based monitoring and decision-making support system. It produces a variety of statistical, interdepartmental, public, ad hoc and other informative reports. The system is scalable, reliable and portable. Performance is supported on all system levels, including:
- the presentation level – XML/XSLT to DHTML transformation;
- the business level – the use of COM+ objects on the WEB server level;
- effective statistical processing algorithms [5], [6], [7];
- the database level – the use of multiple databases with multi-dimensional table combinations, along with compressed file storage; and
- the transition level – the use of parallelism and business calculations.
The PPDR system is highly secure, with database-driven security, and has administrative and versatile logging functionality. The architecture of the PPDR system is easily expanded to add new features and functionality for future enhancements.
References
1. Flikop, Ziny. "Uncertainty and Management of Cellular Telephone Networks," Proceedings of the International Fuzzy Engineering Symposium '91, Yokohama, Japan.
2. Flikop, Ziny. "Management System for Cellular Telephone Network," Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Communications, September 1991, London, UK.
3. Flikop, Ziny. "Some Problems with the Design of Self-Learning Open-Loop Control Systems," European Journal of Operational Research, vol. 81, 1995.
4. Flikop, Ziny. "Input Set Decomposition and Open-Loop Control in Telecommunications Networks," Proceedings of the 1995 American Control Conference, Seattle, 1995.
5. Dolotov, Alexander. "Effective Algorithms for Statistical Processing," Proceedings of the "Heterogeneous Systems Controls" Conference, Kiev, Ukraine, 1983.
6. Dolotov, Alexander. "Experiments Design in a Process of Statistical Modeling Optimization," Journal "Systems and Machines Control," Ukraine Academy of Science, vol. 4, 1973.
7. Dolotov, Alexander, Sadovskiy, Vladimir. "Integrated Information Support System for Design & Management in a Construction Industry," Proceedings of "Computer Methods in Civil Engineering," No. 3, Warsaw, Poland, 1997.
8. Dolotov, Alexander. "A Method for the Distribution and Allocation Tasks Resolving," Articles "Operations Research and Computing Systems," Vol. 24, Kiev State University, Kiev, Ukraine, 1984.
Authorship Analysis in Cybercrime Investigation
Rong Zheng, Yi Qin, Zan Huang, and Hsinchun Chen
Artificial Intelligence Lab, Department of Management Information Systems
The University of Arizona, Tucson, Arizona 85721, USA
{rong, yiqin, zhuang, hchen}@eller.arizona.edu
Abstract. Criminals have been using the Internet to distribute a wide range of illegal materials globally in an anonymous manner, making criminal identity tracing difficult in the cybercrime investigation process. In this study we propose to adopt the authorship analysis framework to automatically trace identities of cyber criminals through messages they post on the Internet. Under this framework, three types of message features, including style markers, structural features, and content-specific features, are extracted and inductive learning algorithms are used to build feature-based models to identify authorship of illegal messages. To evaluate the effectiveness of this framework, we conducted an experimental study on data sets of English and Chinese email and online newsgroup messages. We experimented with all three types of message features and three inductive learning algorithms. The results indicate that the proposed approach can discover real identities of authors of both English and Chinese Internet messages with relatively high accuracies.
criminal identity tracing in cyberspace and allow investigators to prioritize their tasks and focus on the major criminals. In this paper we propose to adopt the authorship analysis framework in the context of cybercrime investigation to help law enforcement agencies deal with the identity-tracing problem. We extract three types of features that are identified in authorship analysis research from online illegal messages and use inductive learning techniques to build feature-based models to perform automatic message author identification. We are specifically interested in evaluating the general effectiveness of this approach and the effects of using different types of features in the cybercrime investigation context. Because of the multinational nature of cybercrime, we are also interested in evaluating the applicability of the proposed framework in a multilingual context. The remainder of the paper is organized as follows. Section 2 surveys the existing work on authorship analysis and summarizes major types of text features and techniques. Section 3 describes our proposed cyber criminal identity-tracing framework in detail and presents the specific research questions that we aim to address. Section 4 presents an experimental study that answers the research questions raised in Section 3, based on several experimental data sets. We conclude the article in Section 5 by summarizing our research contributions and pointing out future directions.
2 Literature Review
2.1 Authorship Analysis
Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship. More specifically, the problem can be broken down into three sub-fields [35]:
• Author Identification determines the likelihood of a particular author having written a piece of work by examining other works produced by that author.
• Author Characterization summarizes the characteristics of an author and generates the author profile based on his/her work. Some of these characteristics include gender, educational and cultural background, and language familiarity.
• Similarity Detection compares multiple pieces of work and determines whether or not they are produced by a single author, without actually identifying the author.
Authorship analysis has many applications. It is rooted in the author attribution problem of historical literature. The most famous example is its success in resolving the debate on Shakespeare's work [10]. Similarly, authorship analysis techniques have assisted in settling the authorship debates over the Federalist Papers [23] and the Unabomber Manifesto [13]. Another application domain is software forensics [14], in which the author of a malicious program is identified or characterized by analyzing executable code or source code in order to investigate the crime and prevent future attacks. Since our work is mainly concerned with text, we will not discuss software forensics in this paper.
Generally, the major topics in past authorship analysis research are feature selection and the techniques used to facilitate the analysis process. In the following subsections we review the literature from these two perspectives.
2.2 Feature Selection
The essence of authorship analysis is the formation of a set of features, or metrics, that remain relatively constant for a large number of writings created by the same person. In other words, a set of writings from one author would exhibit greater similarity in terms of these features than a set of writings from different authors.
Initially researchers identified authors by categorizing different sets of words used by different authors. One example is the authorship analysis of Shakespeare's work [10]. Elliot and Valenza [10] conducted a study that compared the poems of Shakespeare with those of Edward de Vere, the leading candidate as the true author of the works credited to Shakespeare. Modal testing based on keyword usage was conducted. However, the effectiveness of this approach is limited by the fact that word usage is highly dependent on the text topic. For discrimination purposes we need "content-free" features; features of this kind are also called style markers. The basic idea came from Yule's work, in which features such as sentence length [39] and vocabulary richness [40] were proposed. Mosteller and Wallace [23] extracted function words (or word-based style markers) such as 'while' and 'upon' to clarify the disputed work, the Federalist Papers. Later Burrows developed a set of more than 50 high-frequency words, which were also tested on the Federalist Papers. Tomoji [32] used a 74-word set to analyze Dickens's narrative style. Binongo and Smith [2] used the frequency of occurrence of 25 prepositions to discriminate between Oscar Wilde's plays and essays. Holmes [17] analyzed the use of "shorter" words (words of 2 or 3 letters) and "vowel words" (words beginning with a vowel). Such word-based methods can require intensive effort to select the most appropriate set of words that best distinguishes a given set of authors [16]. In summary, the word-based approach is highly author and language dependent and is difficult to apply to a wide range of applications. In order to avoid these problems, Baayen [4] proposed the use of syntax-based features. This approach is based on statistical measures and methods applied to rewrite rules that appear in a syntactically annotated corpus. The authors demonstrated that syntax-based features can be more reliable in authorship identification problems than word-based features. Chaniak [8] discussed some statistical techniques for processing such syntactic information. Rudmen [29] concluded that almost 1,000 style markers had been used in authorship analysis applications. There is no agreement on a best set of style markers. As feature sets became larger, conventional methods gave way to more powerful analytical methods such as machine learning.
2.3 Techniques for Authorship Analysis
In early studies most analytical methods used in authorship analysis were statistical. The basic idea is that different authors have different text compositions, which are characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be considered a statistical hypothesis test or a classification problem. Most early work used statistical methods to facilitate authorship analysis. Brainerd [1] used Chi-squared and related distributions to perform lexical data analysis. An important statistical test was introduced in Thisted and Efron's paper [30]. Farringdon [12] first applied
the CUSUM technique in authorship analysis. Francis [11] gave a summary of early statistical approaches used to resolve the Federalist Papers dispute. Baayen [3] proposed a linguistic evaluation of diverse statistical models of word frequency. Although statistical methods achieved much success in authorship analysis, particular methods have their constraints. For example, Holmes [17] found that the CUSUM analysis was unreliable because the stability of those characteristics over multiple texts is not warranted. Moreover, the prediction capability of statistical methods, such as attributing a new text to a certain author, is limited.
The advent of powerful computers instigated the extensive use of machine learning techniques in authorship analysis. A Bayesian model was applied by Mosteller and Wallace [24] to test the Federalist Papers. Based on their work, McCallum and Nigam [25] compared two different naïve Bayesian models for text classification. While naïve Bayesian models for text classification still have structural limitations, a number of more powerful methods have also been applied in text categorization and authorship analysis. The most representative one is the neural network. Tweedie [33] used a standard feedforward artificial neural network, also called a multi-layer perceptron, to attribute authorship to the disputed Federalist Papers. The network they used had three hidden layers and two output layers; it was trained with a conjugate gradient method and was tested with the k-fold cross-validation approach. The result was consistent with the results of previous work on this topic. Another neural network, the radial basis function (RBF) network, was used by Lowe and Matthews [21]. They applied RBF to investigate the extent of Shakespeare's collaboration with his contemporary, John Fletcher, on various plays. More recently, Khmelev [19] presented a technique for authorship attribution based on a simple Markov chain, the key idea of which is using the probabilities of the subsequent letters as features. Diederich [9] introduced the Support Vector Machine (SVM) to this problem. Experiments were carried out to identify the writings of 7 target authors from a set of 2,652 newspaper articles written by several authors covering three topic areas. This method detected the target authors in 60%-80% of the cases. A new area of study is the identification of electronic message authors based on message contents. de Vel et al. [35] used SVM as a learning algorithm to classify 150 email documents from 3 authors. In this experiment an average accuracy of 80% was achieved. Generally speaking, machine learning methods achieved higher accuracies than statistical methods; they can model the underlying distribution of personal word usage with a large set of features.
Based on the previous review, we present a taxonomy for authorship analysis research in Table 1. Table 2 shows some example studies in the field. Some general conclusions can be drawn from Table 2. First, most previous studies addressed an authorship identification problem, which initiated this research domain and has kept attracting researchers and new techniques (e.g., the disputes on Shakespeare's work and the Federalist Papers). Second, style markers were used most frequently as features. The reason is that style markers are general content-free features in most types of literature.
Finally, statistical approaches were used extensively in this field, and machine learning methods have been introduced more recently.
Table 1. Taxonomy for Authorship Analysis

Problems
P1 - Author identification: determines the likelihood of a particular author having written a piece of work by examining other works produced by the same author.
P2 - Author characterization: summarizes the characteristics of an author and determines the author profile based on his/her works.
P3 - Similarity detection: compares multiple pieces of work and determines whether or not they are produced by a single author, without actually identifying the author.

Features
M1 - Style markers: content-free features such as frequency of function words, total number of punctuation marks, average sentence length and vocabulary richness.
M2 - Structural features: such as use of a greeting statement, position of requoted text, use of a farewell statement, etc.
M3 - Content-specific features: such as frequency of keywords, special characters for special content, etc.

Analysis techniques
A1 - Manual analysis: uses manual examination and analysis of a set of works to draw conclusions about the author's characteristics such as background, personality, and technical skill.
A3 - Statistical analysis: uses statistical methods for calculating document statistics based on metrics, in order to analyze the characteristics of the author or to examine the similarity between various pieces of work.
A4 - Machine learning: uses classification methods to predict the author of a piece of work based on a set of metrics.

Table 2. Previous Studies on Authorship Analysis
3 Applying Authorship Analysis in Cybercrime Investigation
The large volume of cyberspace activity and its anonymous nature make cybercrime investigation extremely difficult. One of the major tasks in cybercrime investigation is tracing the real identity behind an illegal document. Normally the investigator tries to attribute a new illegal message to a particular criminal in order to obtain new clues. Conventional ways of dealing with this problem rely on manual work, which is largely limited by the sheer volume of messages and constantly changing author IDs. Automatic authorship analysis should therefore be highly valuable to cybercrime investigators. Figure 1 depicts the typical process of cybercrime identity tracing using the authorship analysis approach.
Fig. 1. A Framework of Cybercrime Investigation with Authorship Analysis
Assume that an investigator has a collection of illegal documents created by a particular suspected cyber criminal. In the first step the feature extractor runs on those documents and generates a set of style features, which are used as the input to the learning engine. A feature-based model is then created as the outcome of the learning engine. This model can identify whether a newly found illegal document was written by that suspect under different IDs or names. This information helps the investigator focus his/her effort on a small scope of illegal documents and effectively keep track of the more important cyber criminals.
Cyberspace texts have several characteristics which differ from those of literary works or published articles and which make authorship analysis in cyberspace a challenge to researchers. One big problem is that cyber documents are generally short. This means that many language-based features successfully used in previous studies may not be appropriate (e.g., vocabulary richness). This may also give rise to the weak performance
of some techniques such as the Naïve Bayesian approach [35]. Also, the structure or composition style used in a cyber document is often different from that of normal text documents, possibly because of the different purposes of these two kinds of writing. In other words, the style of cyber documents is less formal and the vocabulary is limited and less stable. These factors might also lead to the ineffectiveness of previous feature selection heuristics. However, as a user spends more time in cyberspace, a more stable writing style will be formed. Some particular features, such as structural layout traits, unusual language usage, illegal content markers, and sub-stylistic features, may be useful in forming a suitable feature collection in the cybercrime investigation context.
Another new challenge is that cyber criminals can use any language to conduct crime. In fact, most large crime groups and terrorist organizations operate internationally. They use the Internet to formulate plans, raise funds, spread propaganda, and communicate. For example, Osama bin Laden was known to use the Internet as his communication medium. Applying authorship analysis in a multilingual context is therefore becoming an important issue.
Our study aimed to answer the following research questions:
1. Will authorship analysis techniques be applicable in identifying authors in cyberspace?
2. What are the effects of using different types of features in identifying authors in cyberspace?
3. Will the authorship analysis framework be applicable in a multilingual context?
4 Experiment Evaluation To address the proposed research questions, we created a testbed and conducted several experiments which are described in detail in this section. 4.1 Testbed Two English data sets and one Chinese data set were collected for the purpose of this study. The English data sets consist of an email message collection and an Internet newsgroup message collection. The Chinese data set consists of a Bulletin Board System (BBS) message collection. English Email Messages. The first dataset contains 70 email messages provided by 3 students. Each of the students randomly selected 20-30 messages from their primary email account. The content of these messages covered a variety of topics, ranging from school work to research activities to personal interests. The purpose of introducing different topics is to minimize the impact of content similarity which may contribute to high accuracy. English Internet Newsgroup Messages. The second dataset contains 153 Internet newsgroup messages. Over a time period of two weeks, we observed the activities of several USENET newsgroups involving computer software trading. Based on average
number of reads, posts, and unique user IDs per day, we identified the three most popular newsgroups relevant to our research. Through observation we were able to spot illegal sales of pirated software in all three newsgroups. Figure 2 is an example of such a message.
From: "The Collectaholic" <[email protected]> Subject: Software Titles - Only $3.00 Newsgroups: misc.forsale.computers.other.software Date: 2002-10-04 12:07:22 PST All CDs are the original CDs in working condition and come with all theoriginal documentation. Shipping is $3.00 for first title and $.50 for each additional title. $1.00 Titles PC World The Best of MediaClips: sounds and graphics that can be used onmedia projects… $3.00 Titles Boggle: classic word game Canon Publishing Suite: layout, drawing & photo editing tools
Fig. 2. Illegal Internet Newsgroup Message
We then identified the 9 most active users (each represented by a unique ID and email address) who frequently posted messages in these newsgroups. Messages posted by these users were carefully checked to determine whether or not they indicated illegal activities. Between 8 and 30 illegal messages per user were downloaded for use in the experiment.
Chinese BBS Messages. The Chinese BBS dataset consisted of 70 messages downloaded from the most famous Chinese BBS in the US, bbs.mit.edu. These messages were randomly selected from messages posted by three authors. Tables 3, 4 and 5 summarize the composition of the three datasets.
Table 3. English Email Dataset

Author    T1    T2    T3    Number of Messages
RZ         8     9     3    20
JX         2    18     8    28
YQ         3     5    14    22
Grand Total Number of Messages    70

T1 = number of messages under school work
T2 = number of messages under research activity
T3 = number of messages under personal interest
Table 4. English Internet Newsgroup Dataset

Author    N1    N2    N3    Number of Messages
DLW        1    28     1    30
KD        10     9     1    20
dCN        3    17     0    20
DB         0    16     4    20
SW        18     0     2    20
DLB        0     6     2     8
DLM        0    17     0    17
JKYS       9     0     0     9
JZ         0     9     0     9
Grand Total Number of Messages    153

N1 = number of messages from misc.forsale.computers.other.software
N2 = number of messages from misc.forsale.computers.pc-specific.software
N3 = number of messages from misc.forsale.computers.mac-specific.software
Table 5. Chinese BBS Dataset

Author    Total Number of Messages
QQ        20
SKY       28
SEMA      22
Grand Total Number of Messages    70
4.2 Implementation
We describe the implementation details of the two core components of our proposed authorship analysis framework: feature selection and inductive learning techniques.
Feature selection. Based on the review of previous studies on text and email authorship analysis, along with the specific characteristics of the messages in our datasets, we selected a large number of features that were potentially useful for identifying message authors. Three types of features were used: style markers, structural features, and content-specific features. We used the 122 function words and 48 markers suggested by de Vel [35]. Another 28 of the most common function words from the Oxford English Dictionary and 7 other markers were also included. Two additional structural features and a set of content-specific features were also added in our experiment; these are shown in Table 6.
Techniques. We adopted a classification approach to predict the authorship of each message. Three learning algorithms (classifiers) were used in the experiments for comparison purposes: decision trees [28], backpropagation neural networks [22], and support vector machines [7]. Among the various symbolic learning algorithms developed over the past decade, ID3 and its variants have been tested extensively and shown to rival other machine learning techniques in predictive power [6].
Table 6. Feature selection for authorship analysis in our experiment

Additional style markers:
- Total number of words in subject
- Total number of characters in subject (S)
- Total number of upper-case characters in words in subject/S
- Total number of punctuations in subject/S
- Total number of whitespace characters in subject/S
- Total number of lines
- Total number of characters

Additional structural features:
- Types of signature (name, title, organization, email, URL, phone number)
- Uses special characters (e.g. --------) to separate message body and signature

Content-specific features:
- Has a price in subject
- Position of price in message body
- Has a contact email address in message body
- Has a contact URL in message body
- Has a contact phone number
- Uses a list of products
- Position of product list in message body
- Indicates product categories in list
- Format of product list
ID3 is a decision-tree building algorithm developed by Quinlan [28]. It adopts a divide-and-conquer strategy and the entropy measure for object classification. In this experiment, we implemented an extension of the ID3 algorithm, the C4.5 algorithm, to deal with attributes with continuous values.
Backpropagation neural networks have been extremely popular for their unique learning capability [38] and have been shown to perform well in different applications such as medical applications [34]. They were also introduced to authorship analysis by Kjell [20] and Tweedie [33]. We implemented a typical backpropagation neural network consisting of three layers: an input layer, an output layer and a hidden layer [26], in which the input layer nodes are style features and the output nodes are author identities. Based on the general heuristic, the number of hidden layer nodes is typically set to (number of input nodes + number of output nodes)/2. In this study, because the number of input nodes is quite large, we modified the heuristic to (number of input nodes + number of output nodes)/10 and achieved relatively high accuracies in our experiments.
Support vector machine (SVM) is a novel learning machine first introduced by Vapnik [37]. It is based on the Structural Risk Minimization principle from computational learning theory. Because SVM is capable of handling millions of inputs and does not require feature selection [7], it has been used extensively in authorship analysis, which normally involves hundreds or thousands of input features [9]. For the experiment we used an SVM program written by Hsu and Lin [15] which was publicly available on the Internet.
All three algorithms have been applied in authorship analysis. In general, SVM and neural networks have shown better performance than decision trees [9], but most testbeds have been newspaper articles, such as the Federalist Papers. Because of the differences between online messages and formal articles mentioned in Section 3, we still needed to test the performance of these three algorithms on our testbed.
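The experimental setup can be reproduced in outline with present-day tools; the sketch below substitutes scikit-learn's decision tree, multi-layer perceptron and SVM classifiers for the authors' C4.5 implementation, custom backpropagation network and the Hsu-Lin SVM program, and assumes a numeric feature matrix X (style markers, structural and content-specific features per message) and author labels y have already been extracted:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X: np.ndarray, y: np.ndarray, folds: int = 30) -> dict:
    """Cross-validated accuracy for the three classifier families compared in the paper."""
    hidden = max(1, (X.shape[1] + len(set(y))) // 10)  # the modified (inputs + outputs)/10 heuristic
    models = {
        "decision tree": DecisionTreeClassifier(),
        "neural network": make_pipeline(StandardScaler(),
                                        MLPClassifier(hidden_layer_sizes=(hidden,),
                                                      max_iter=2000)),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    }
    return {name: cross_val_score(model, X, y, cv=folds).mean()
            for name, model in models.items()}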
4.3 Experiment Design We designed the experimental procedure as follows. Three experiments were conducted on the newsgroup dataset with one classifier at a time: first, 205 style markers were used; 9 structural features were added in the second run; and 9 content-specific features were added in the third run. For the email dataset and the Chinese BBS dataset, two experiments were conducted with one classifier at a time: 205 style markers (67 for the Chinese BBS dataset) were first used as input to the classifiers, and 9 structural features were then added for a second run. A 30-fold cross-validation testing method was used in all experiments. To evaluate prediction performance we use the accuracy, recall, and precision measures commonly adopted in the information retrieval and authorship analysis literature [36]. Accuracy indicates the overall prediction performance of a particular classifier and is defined for our experiments as in (1):
Accuracy = (number of messages whose author was correctly identified) / (total number of messages)    (1)
For a particular author, we use precision and recall to measure the effectiveness of our approach for identifying messages that were written by that author. We report the average precision and recall for all authors in a data set. The precision and recall are defined as in (2) and (3):
Precision = (number of messages correctly assigned to the author) / (total number of messages assigned to the author)    (2)

Recall = (number of messages correctly assigned to the author) / (total number of messages written by the author)    (3)
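A minimal sketch of how these measures could be computed under a cross-validation protocol follows. It uses scikit-learn and macro-averages the per-author precision and recall of equations (2) and (3); the data, variable names, and use of cross_val_predict are illustrative assumptions, not the authors' implementation.

# Sketch: accuracy and macro-averaged per-author precision/recall under
# 30-fold cross-validation, mirroring equations (1)-(3). Placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_authors, msgs_per_author, n_features = 10, 60, 205
y = np.repeat(np.arange(n_authors), msgs_per_author)   # placeholder author labels
X = rng.random((y.size, n_features))                    # placeholder style-marker matrix

clf = SVC(kernel="linear")
y_pred = cross_val_predict(clf, X, y, cv=30)            # 30-fold cross-validation

print("accuracy :", accuracy_score(y, y_pred))                                       # eq. (1)
print("precision:", precision_score(y, y_pred, average="macro", zero_division=0))    # average of eq. (2)
print("recall   :", recall_score(y, y_pred, average="macro", zero_division=0))       # average of eq. (3)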
4.4 Results & Analysis Based on the three datasets we prepared, we conducted experiments according to this design. The results are presented in Table 7, and detailed discussions follow in this sub-section. Techniques comparison. We observed that SVM and neural networks achieved better performance than the C4.5 decision tree algorithm in terms of precision, recall, and accuracy for all three datasets in our experiment. For example, using style markers on the email dataset, C4.5, neural networks, and SVM achieved accuracies of 74.29%, 81.11%, and 82.86%, respectively. SVM also achieved consistently higher accuracy, precision, and recall than the neural networks, although the performance differences between SVM and neural networks were relatively small. Our results were
generally consistent with previous studies, in that neural networks and SVM typically had better performance than decision tree algorithms [9]. The good performance of SVM is also consistent with its success in many other fields [18, 27]. Feature selection. As illustrated in Table 7, the authorship prediction performance varied significantly with different combinations of features. Pair-wise t-test results indicated that:
• Using style markers and structural features outperformed using style markers only: we achieved significantly higher accuracies for all three datasets (p-values all below 0.05) by adopting the structural features. This result might be explained by the fact that an author's consistent writing patterns show up in a message's structural features.
• Using style markers, structural features, and content-specific features did not outperform using style markers and structural features: using content-specific features as additional features did not improve the authorship prediction performance significantly (p-value of 0.3086). We think this is because authors of illegal messages typically deliver diverse content in their messages, and little additional information can be derived from the message contents to determine authorship.
In response to our second research question, we conclude that the structural features help to achieve higher accuracies, while content-specific features do not improve the performance of online message authorship identification. We also observed that high accuracies, ranging from 71% to 89%, were obtained using only style markers as input features for the English datasets. These results indicate that style markers contain a large amount of information about the writing styles of online messages and were surprisingly robust in predicting authorship. Chinese dataset performance. We noticed a significant drop in prediction performance for the Chinese BBS dataset compared with the English datasets. For example, when using style markers only, C4.5 achieved average accuracies of 86.28% and 74.29% for the English newsgroup and email datasets, while for the Chinese dataset it achieved an average accuracy of only 54.83%. The reason is that only 67 Chinese style markers were used in our current experiments, significantly fewer than the 205 style markers used with the English datasets. We also observed that when structural features were added, all three algorithms achieved relatively high precision, recall, and accuracy (from 71% to 83%) for the Chinese dataset. Considering the significant language differences, our proposed approach to the problem of online message identity tracing appears promising in a multilingual context.
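The pair-wise t-tests reported above compare the accuracies obtained with different feature combinations for the same classifier and dataset. A sketch of how such a comparison could be run is shown below; the two arrays of per-fold accuracies are made-up placeholders, not results from the paper.

# Sketch: paired, directional t-test on per-fold accuracies for two feature sets,
# in the spirit of the comparison above. Placeholder values stand in for 30 folds.
import numpy as np
from scipy import stats

acc_style_only = np.array([0.83, 0.86, 0.84, 0.88, 0.85, 0.87] * 5)
acc_style_plus_struct = np.array([0.90, 0.91, 0.89, 0.93, 0.92, 0.90] * 5)

t_stat, p_two_sided = stats.ttest_rel(acc_style_plus_struct, acc_style_only)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t={t_stat:.2f}, one-sided p={p_one_sided:.4f}")   # p < 0.05 suggests a real improvement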
5 Conclusion & Future Work Our experiments demonstrated that, with a set of carefully selected features and an effective learning algorithm, we were able to identify the authors of Internet newsgroup and email messages with reasonably high accuracy. We achieved average prediction accuracies of 80%–90% for email messages, 90%–97% for newsgroup messages, and 70%–85% for Chinese Bulletin Board System (BBS) messages. Significant performance improvement was observed when structural features were added on top of style markers. We also observed that SVM outperformed the other two classifiers on all occasions. The experimental results indicate a promising future for applying automatic authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities. This study will be expanded in the future to include more authors and messages to further demonstrate the scalability and feasibility of our proposed approach. Also, more illegal messages will be incorporated into our testbed. The current approach will also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child-pornography images. Another, more challenging, future direction is to automatically generate an optimal feature set specifically suited to a given dataset. We believe this will yield better performance across the different datasets.
Acknowledgment. This project has primarily been funded by the following grants:
• National Science Foundation, Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003;
• National Institute of Justice, "COPLINK: Database Integration and Access for a Law Enforcement Intranet," #97-LB-VX-K023, July 1997-January 2000.
We would like to thank Robert Chang from the Taiwan National Intelligence Office for initiating this project. We would also like to thank the officers from the Tucson Police Department (Detective Tim Petersen, Sergeant Jennifer Schroeder, and Detective Daniel Casey) for their assistance with the project. Members of the Artificial Intelligence Laboratory who directly contributed to this paper are Michael Chau, Jie Xu, and Wingyan Chung.
References
1. B. Brainerd, Statistical analysis of lexical data using Chi-squared and related distributions. Computers and the Humanities, 9, 161–178, (1975).
2. Binongo and Smith, A Study of Oscar Wilde's Writings, Journal of Applied Statistics, vol. 26-7, p. 781, (1999).
3. R. H. Baayen, Statistical Models for Word Frequency Distributions: A Linguistic Evaluation. Computers and the Humanities, 26, 347–363, (1993).
4. R. H. Baayen, H. van Halteren, and F. J. Tweedie, Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. Literary and Linguistic Computing, 2, 110–120, (1996).
5. R. Bosch and J. Smith, Separating hyperplanes and the authorship of the disputed Federalist papers, American Mathematical Monthly, 105(7): 601–608, (1998).
6. H. Chen, G. Shankaranarayanan, A. Iyer, and L. She, A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing, Journal of the American Society for Information Science, Volume 49, Number 8, Pages 693–705, (1998).
7. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, (2000).
8. E. Charniak, Statistical Language Learning. MIT Press, Cambridge, (1993).
9. J. Diederich, J. Kindermann, E. Leopold, and G. Paass, Authorship Attribution with Support Vector Machines, Applied Intelligence, (2000).
10. W. Elliot and R. Valenza, Was the Earl of Oxford the True Shakespeare? Notes and Queries, 38: 501–506, (1991).
11. I. S. Francis, An Exposition of a Statistical Approach to the Federalist Dispute. In J. Leed (Ed.), The Computer and Literary Style (pp. 38–79). Kent, Ohio: Kent State University Press, (1966).
12. J. M. Farringdon, Analyzing for Authorship: A Guide to the Cusum Technique. Cardiff: University of Wales Press, (1996).
13. D. Foster, Author Unknown: On the Trail of Anonymous, Henry Holt, New York, (2000).
14. A. Gray, P. Sallis, and S. MacDonell, Software forensics: Extending authorship analysis techniques to computer programs, in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pages 1–8, (1997).
15. C. W. Hsu and C. J. Lin, A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13, pages 415–425, (2002).
16. D. I. Holmes and R. S. Forsyth, The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10, 111–127, (1995).
17. D. I. Holmes, The Evolution of Stylometry in Humanities. Literary and Linguistic Computing, 13, 3, (1998).
18. T. Joachims, Text Categorization with Support Vector Machines, in: Proceedings of the European Conference on Machine Learning (ECML), (1998).
19. D. V. Khmelev and F. J. Tweedie, Using Markov Chains for Identification of Writers, Literary and Linguistic Computing, vol. 16, no. 4, pp. 299–307, (2001).
20. B. Kjell, Authorship Determination Using Letter-pair Frequency Features with Neural Network Classifiers. Literary and Linguistic Computing, 9, 119–124, (1994).
21. D. Lowe and R. Matthews, Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions. Computers and the Humanities, 29, 449–461, (1995).
22. R. P. Lippmann, An Introduction to Computing with Neural Networks, IEEE Acoustics, Speech and Signal Processing Magazine, 4(2): 4–22, (1987).
23. F. Mosteller and D. L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, Mass., (1964).
24. F. Mosteller and D. L. Wallace, Applied Bayesian and Classical Inference: The Case of the Federalist Papers, in the 2nd edition of Inference and Disputed Authorship: The Federalist, Springer-Verlag, (1964).
25. A. McCallum and K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on "Learning for Text Categorization", (1998).
26. J. Moody and J. Utans, Architecture Selection Strategies for Neural Networks: Application to Corporate Bond Rating, Neural Networks in the Capital Markets, (1995).
27. E. Osuna, R. Freund and F. Girosi, Training Support Vector Machines: An Application to Face Detection, Proceedings of Computer Vision and Pattern Recognition, 130–136, (1997).
28. J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1(1): 81–106, (1986).
29. J. Rudman, The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities, 31, 351–365, (1998).
30. R. Thisted and B. Efron, Did Shakespeare Write a Newly Discovered Poem? Biometrika, 74, 445–455, (1987).
31. D. Thomas and B. D. Loader, Introduction – Cyber Crime: law enforcement, security and surveillance in the information age, Taylor & Francis Group, New York, NY, (2000).
32. T. Tomoji, Dickens's Narrative Style: A Statistical Approach to Chronological Variation. Revue, Informatique et Statistique dans les Sciences Humaines (RISSH, Centre Informatique de Philosophie et Lettres, Universite de Liege, Belgique), 30, 165–182, (1994).
33. F. J. Tweedie, S. Singh, and D. I. Holmes, Neural Network Applications in Stylometry: The Federalist Papers. Computers and the Humanities, 30(1), 1–10, (1996).
34. K. M. Tolle, H. Chen and H. Chow, Estimating Drug/Plasma Concentration Levels by Applying Neural Networks to Pharmacokinetic Data Sets, Decision Support Systems, Special Issue on Decision Support for Health Care in a New Information Age, 30(2), 139–152, (2000).
35. O. de Vel, A. Anderson, M. Corney and G. Mohay, Mining E-mail Content for Author Identification Forensics, SIGMOD Record, 30(4): 55–64, (2001).
36. O. de Vel, Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD'2000), (2000).
37. V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, (1995).
38. B. Widrow, D. E. Rumelhart and M. A. Lehr, Neural Networks: Applications in Industry, Business, and Science, Communications of the ACM, 37, 93–105, (1994).
39. G. U. Yule, On sentence length as a statistical characteristic of style in prose, Biometrika, 30, (1938).
40. G. U. Yule, The statistical study of literary vocabulary, Cambridge University Press, (1944).
Behavior Profiling of Email

Salvatore J. Stolfo, Shlomo Hershkop, Ke Wang, Olivier Nimeskern, and Chia-Wei Hu

Columbia University, New York, NY 10027, USA
{sal,shlomo,kewang,on2005,charlie}@cs.columbia.edu
Abstract. This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia Intrusion Detection (IDS) Lab. EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques have been available from the fields of Information Retrieval (IR) and Natural Language Processing (NLP) for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model “user behavior”. EMT thus models the behavior of individual user email accounts, or groups of accounts, including the “social cliques” revealed by a user’s email behavior.
1 Introduction
This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia IDS Lab. EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques have been available from the fields of IR and NLP for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model "user behavior". EMT thus models the behavior of individual user email accounts, or groups of accounts, including the "social cliques" revealed by a user's email behavior. EMT's design has been driven by the core security application to detect virus propagations, spambot activity and security policy violations. However, the technology also provides critical intelligence gathering and forensic analysis capabilities for agencies to analyze disparate Internet data sources for the detection of malicious users, attackers, and other targets of interest. This dual use is graphically displayed in Figure 1. For example, one target application for intelligence gathering supported by EMT is the identification of likely "proxy email accounts", email accounts that exhibit similar behavior and thus may be used by a single person. Although EMT has been designed specifically for email analysis, the principles of its operation are equally relevant to other Internet audit sources. This data mining technology, previously reported in [4,6,7] and graphically displayed in Figure 2, has been shown to automatically compute or create both signature-based misuse detection and anomaly-detection-based misuse discovery.
Fig. 1. User account profiling, dual use: online detection and offline analysis.
The application of this technology to diverse Internet objects and events (e.g., email and web transactions) allows for a broad range of behavior-based analyses, including the detection of proxy email accounts and of groups of user accounts that communicate with one another, including covert group activities. Data mining applies machine learning and statistical techniques to automatically discover and detect misuse patterns, as well as anomalous activities in general. When applied to network-based activities and user account observations for the detection of errant or misuse behavior, these methods are referred to as behavior-based misuse detection. Behavior-based misuse detection can provide important new assistance for counter-terrorism intelligence. In addition to standard Internet misuse detection, these techniques will automatically detect certain patterns across user accounts that are indicative of covert, malicious or counter-intelligence activities. Moreover, behavior-based detection provides workbench functionalities to interactively assist an intelligence agent with targeted investigations and off-line forensic analyses. Intelligence officers have a myriad of tasks and problems confronting them each day. The sheer volume of source materials requires a means of homing in on those sources of maximal value to their mission. A variety of techniques can be applied drawing upon the research and technology developed in the field of Information Retrieval. There is, however, an additional source of information available that can be used to aid even the simplest task of rank-ordering and sorting documents for inspection: behavior models associated with the documents can be used to identify and group sources in interesting new ways.
Fig. 2. Overview of data mining based detection system.
This is demonstrated by the Email Mining Toolkit, which applies a variety of data mining techniques for profiling and behavior modeling of email sources. The deployment of behavior-based techniques for intelligence investigation and tracking tasks represents a significant qualitative step in the counterintelligence "arms race". Because there is no way to predict what data mining will discover over any given data set, "counter-escalation" is particularly difficult. Behavior-based misuse detection is more robust than standard knowledge-based techniques. Behavior-based detection has the capability to detect new patterns (i.e., patterns that have not been previously observed), provide early warning alerts to users and analysts, and automatically adapt to both normal and misuse behavior. By applying statistical techniques over actual system and user account behavior measurements, automatically generated models and rules are tuned to the particular source material. This process, in turn, avoids the human bias that is intrinsic when misuse signatures, patterns and other knowledge-based models are designed by hand, as is the norm. Despite this, no general infrastructure has been developed for the systematic application of behavior-based (misuse) detection across a broad set of detection and intelligence analysis tasks such as fraudulent Internet activities, virus detection, intrusion detection and user account profiling. Today's Internet security systems are specialized to apply a small range of techniques, usually knowledge-based, to an individual misuse detection problem, such as intrusion, virus or SPAM detection. Moreover, these systems are designed for one particular network environment, such as medium-sized network enclaves, and only tap into an individual cross-section of network activity such as email activity or TCP/IP activity.
Behavior-based detection technology as proposed herein will likely provide a quantum leap in security and in intelligence analysis in both offline and online task environments. EMT has been described in another publication, focusing on its use for security applications, including virus and spam detection, as well as security policy violations. In this paper, we focus on several of its features specific to intelligence applications, namely the means of clustering email by content-based analyses, the identification of "similar email accounts" based upon measuring similarity between account profiles represented by histograms, and the clique analyses that are supported by EMT.

Table 1. Behavior-Based Internet Applications for Security and Beyond (columns: Application; Description and Variations; Examples; Audit Sources)
- Fraud detection: unauthorized outgoing email (console usurped, child attacks teacher), unauthenticated email (deceptive source), unauthorized transactions (purchase/credit fraud); audit sources: email, HTTP, transaction services
- Malicious email detection: viruses, worms, "SPAM"; audit source: email
- Intrusion detection: network-based detection (standard IDS; TCP/IP), host-based detection (less standard IDS; system logs), application-based detection (future IDS; application logs)
- User community discovery: closely connected user-base, email "circles"; audit source: email
- Behavior-pattern discovery: account-based and community-based patterns (suspect and clandestine activities); audit sources: all sources (email, HTTP, transaction services, TCP/IP, Telnet traffic, FTP traffic, cookies)
- Analyst workbench: interactive forensic analysis, targeted intelligence investigations; audit sources: all sources
- Account proxy detection: accounts used by the same user (clandestine activities); audit sources: all sources
- Collaborative filtering: website recommendations (pageview prediction; HTTP), purchase recommendations (music/movie choices; transaction services)
- Policy violation detection: ISP or email enclave security policies (user espionage, outgoing SPAM); audit sources: all sources
- Web-bot detection: statistics/knowledge gathering (competitive analysis), site maintenance (finding broken links), search-engine spiders (Google, AltaVista); audit source: HTTP
1.1 Applying Behavior-Based Detection to Email Sources
Table 1 enumerates a range of behavior-based Internet applications. These applications cover a set of detection, security and marketing applications that exist within the government, commercial and private sectors. Each of these applications is within the capabilities of behavior-based techniques, by applying data mining algorithms over appropriate audit data sources. Our current research has applied behavior-based methods directly to the first six applications listed in Table 1: fraud detection, malicious email detection, intrusion detection, user community discovery, behavior-pattern discovery, and the analyst workbench. Each of these is an Internet security application, applying to both outbound and inbound network- and email-based traffic. Solving Internet security problems greatly assists surveillance intelligence activities. For example, the discovery of user account communities and the discovery and detection of certain community behavior patterns can be directed to uncover certain classes of covert, clandestine or espionage behavior performed with Internet resources. Furthermore, fraud detection in particular has direct
benefit for an intelligence agency by profiling and identifying users and clusters of users that participate in malicious Internet activities such as fraud. Behavior-based detection has been proven against similar, analogous security applications. The finance, telecom and energy industries have protected their customers from fraudulent misuse of their services (e.g., fraudulent misuse of credit card accounts, telephone calling cards, stealing of utility service, etc.) by modeling their individual customer accounts and detecting deviations from this model for each of their customers. The behavior-based protection paradigm applied to the Internet thus has an historical precedent that is now ubiquitous and transparent, as exemplified by the credit card in the reader's wallet or purse.
1.2 EMT as an Analyst Workbench for Interactive Intelligence Investigations
The "Malicious Email Tracking" (MET) system [1] is an online system that uses email flow statistics to capture new viruses, which are largely undetectable by the "signature" detection methods of today's state-of-the-art commercial virus detection systems. Specifically, all email attachments are tracked by computing a private hash value; temporal statistics such as replication rate are recorded to trace each attachment's trajectory, e.g., across LANs; and these statistics directly inform the detection of self-replicating, malicious software attachments. MET has been developed and deployed as an extension to mail servers and is fully described elsewhere. MET is an example of an online "behavior-based" security system that defends and protects a system not solely by attempting to identify known attacks against it, but rather by detecting deviations from the system's normal behavior. Many approaches to "anomaly detection" have been proposed, including research systems that aim to detect masqueraders by modeling user behavior in command-line sequences, or even keystrokes. In this case, however, MET is architected to protect user accounts by modeling user email flows to detect malicious email attachments, especially polymorphic viruses that are not detectable or traceable via signature-based detection methods. The "Email Mining Toolkit" (EMT), on the other hand, is an offline system applied to email files gathered from server logs or client email programs. EMT computes information about email flows from and to email accounts, aggregates statistical information from groups of accounts, and analyzes the content fields of emails. The EMT system provides temporal statistical feature computations and behavior-based modeling techniques, through an interactive user interface, to enable targeted intelligence investigations and semi-manual forensic analysis of email files. Figure 1 illustrates the general architecture of a behavior-based system deploying dual functionality:
1. an online security detection application (in this case, MET for malicious email detection), and
2. a general analyst workbench for intelligence investigations (EMT, for email source analysis).
As this figure illustrates, these functionalities share a great deal of infrastructure. With regard to the implementation, by deploying these dual functionalities, the audit module, the computation of temporal statistics, the user modeler, and the database of user models each serve both functionalities. Moreover, with regard to the conceptual design, the particular set of temporal statistics and user model processes designed for one can improve the performance of the other. In particular, temporal features, as well as user account models and clusters, are general "fundamental building blocks." EMT provides the following functionalities, interactively:
– Querying a database (warehouse) of email data and computed feature values, including:
  • ordering and sorting emails on the basis of content analysis (n-gram analysis, keyword spotting, and classification of email supported by an integrated supervised learning feature using a Naïve Bayes classifier trained on user-selected features);
  • historical features that profile user groups by statistically measuring behavior characteristics;
  • user models that group users according to features such as typical emailing patterns (as represented by histograms over different selectable statistics) and email communities (including the "social cliques" revealed in email exchanges between email accounts).
– Applying statistical models to email data to alert on abnormal or unusual email events.
EMT is also designed as a plug-in to a data mining platform, originally designed and implemented at Columbia, called the DW/AMG architecture (Data Warehouse/Adaptive Model Generation system). That work has been transferred to System Detection Inc. (SysD, http://www.sysd.com), a DARPA spin-out from Columbia, which has commercialized the system as the Hawkeye Security Platform.
2 EMT Features
The full range of EMT features has been described elsewhere. For the present paper, we provide a brief overview of several of its key features of direct relevance to security analysis and intelligence applications, along with descriptive screenshots of EMT in operation.
2.1 Attachment Models
MET was initially conceived to statistically model the behavior of email attachments flowing in real time through an enclave's email server, and to support the coordinated sharing of information among a wide area of email servers in order to identify malicious attachments and halt their propagation before saturation. In order to share such information properly, each attachment must be uniquely identified,
which is accomplished through the computation of an MD5 hash of the entire attachment. EMT runs an analysis on each attachment in the database to calculate a number of metrics. These include birth rate, lifespan, incident rate, prevalence, threat, spread, and death rate. They are explained fully in a companion paper¹ and are displayed graphically in Figure 3.
Fig. 3. Attachment Statistics
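A minimal sketch of the kind of bookkeeping involved: each attachment is identified by its MD5 hash, and a crude flow statistic (here, a rough "birth rate" in new recipients per day) is derived from the times at which that hash is observed. The record format and the metric definition below are assumptions for illustration only; EMT's actual schema and metric definitions are given in the companion paper referenced above.

# Sketch: identify attachments by MD5 hash and compute a crude per-hash birth rate.
# "observations" is a made-up log format: (attachment_bytes, recipient, timestamp).
import hashlib
from collections import defaultdict
from datetime import datetime

observations = [
    (b"...attachment bytes...", "alice@example.com", datetime(2003, 1, 2, 9, 0)),
    (b"...attachment bytes...", "bob@example.com", datetime(2003, 1, 2, 13, 0)),
    (b"...attachment bytes...", "carol@example.com", datetime(2003, 1, 4, 8, 30)),
]

seen = defaultdict(lambda: {"recipients": set(), "first": None, "last": None})
for payload, recipient, ts in observations:
    digest = hashlib.md5(payload).hexdigest()       # unique identifier for the attachment
    rec = seen[digest]
    rec["recipients"].add(recipient)
    rec["first"] = ts if rec["first"] is None else min(rec["first"], ts)
    rec["last"] = ts if rec["last"] is None else max(rec["last"], ts)

for digest, rec in seen.items():
    days = max((rec["last"] - rec["first"]).days, 1)
    print(digest[:8], "birth rate ~", len(rec["recipients"]) / days, "recipients/day")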
Rules specified by a security analyst using the alert logic section of EMT are evaluated over the attachment metrics to issue alerts to the analyst. This analysis may be done over archived email logs by EMT offline, or at runtime in MET while sniffing real-time email flows. The initial version of MET provides the means of specifying alerts in rule form as a collection of Boolean expressions applied to thresholds compared to each of the calculated statistics. As an example, a basic rule might check, for each attachment seen, whether its birth rate is greater than some specified threshold AND it was sent from at least a specified number of users. The flow statistics of each email attachment are computed by EMT, as well as the list of specific emails the attachment appears in, to identify recipients of those attachments.
¹ A paper entitled "A Behavior-based Approach to Securing Email Systems" has been prepared for submission to a technical conference and is under review. That paper describes the use of EMT for virus and spam detection. There is a minor overlap with that paper in the presentation material of some of EMT's features described herein.
The primary detection tasks MET was designed for include virus propagation detection and mitigation. Intelligence applications of this particular feature would include infosec security policy violations and general evidence gathering in forensic analyses.
Fig. 4. Main analyst window to sort and inspect specific emails.
2.2 Email Content and Classification
Figure 4 illustrates EMT's main messages tab, which provides an analyst with the means to inspect, cluster, and sort email messages under analysis. Emails can be selected for review and analysis on the basis of time, sender, or recipient account. This data may be labeled directly by an analyst for further data mining analysis supported by other feature tabs in EMT. Interestingly, EMT also provides the means of classifying attachments by way of the fully embedded MEF system, a supervised machine learning feature. In the earliest work on MEF (Malicious Email Filter [7]), the Naïve Bayes classifier was computed over user-selected training sets of attachments. The features extracted include "n-grams" and their frequencies, extracted and computed directly from the attachment
irrespective of its mime type. Hence, in addition to using flow statistics and attachment classifications to classify an email message, EMT uses the email body as a content-based feature. The two features supported are n-gram [8] modeling and a calculation of the frequency of a set of words [9] from the body of the email. An n-gram represents the sequence of any n adjacent characters or tokens that appear in a document. An n-character-wide window is passed over the entire email body, one character at a time, and a count is computed of the number of occurrences of each n-gram. This results in a hash table that uses the n-gram as a key and the number of occurrences as the value for each email; we refer to this as the document vector. Given a set of training emails, the arithmetic average of the document vectors can be computed as the centroid for the set. Given an instance of an email, we compute the cosine distance [8] against the centroid created during training. If the cosine distance is equal to 1, then the two documents are deemed identical; the smaller the value of the cosine distance, the more different the two documents are. These content-based methods are integrated into the machine learning models for classifying sets of emails for further inspection and analysis. An analyst therefore has the means of homing in on a set of potentially relevant emails by first classifying and clustering sets of emails using the EMT GUI. Using a set of normal email and spam we collected, we ran some initial experiments over our own email sets to test the efficacy of the approach. We used half of the labeled emails, both normal and spam, as training data, and used the other half as the test set. The accuracy of the classification using n-grams and word tokens varies from 70% to 94% when using different parts as training and testing sets. In the spam classification experiment, we noticed that some spam emails did not vary much from normal emails; for example, a spam email that consists of a single link to a non-threatening website. To improve accuracy we also used weighted keywords and removal of stop-words. For example, the spam email set noticeably contains words such as free, money, big, and lose weight at a much higher frequency than regular emails. Users can empirically assign stop-words and keywords and give higher weight to their frequency count. We continue to evaluate these content-based approaches further; experiments and analysis are ongoing.
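A small sketch of the n-gram/centroid/cosine procedure just described: character n-grams counted with a sliding window, an arithmetic-mean centroid over a training set, and a cosine score of a new message against that centroid. The window width and sample texts are illustrative assumptions, not EMT's configuration.

# Sketch: character n-gram document vectors, a centroid over training emails,
# and cosine similarity of a new email body against that centroid.
import math
from collections import Counter

def ngram_vector(text, n=3):
    """Slide an n-character window over the text and count each n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def centroid(vectors):
    """Arithmetic average of a list of count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

training = ["free money, lose weight fast!!!", "big prizes, click this free link"]
profile = centroid([ngram_vector(t) for t in training])
print(cosine(ngram_vector("free money offer, lose weight"), profile))  # closer to 1 = more similar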
2.3 Account Statistics and Alerts
This mechanism has been extended to provide alerts based upon deviation from other baseline user and group models. EMT computes and displays three tables of statistical information for any selected email account. The first is a set of stationary email account models, i.e. statistical data represented as a histogram of the average number of messages sent over all days of the week, divided into three periods: day, evening, and night. EMT also gathers information on the average size of messages for these time periods, and the average number of recipients and attachments for these periods. These statistics can generate alerts
when values are above a set threshold as specified by the rule-based alert logic section of EMT. Stationary User Profiles – Histograms over discrete time intervals. Histograms are used to model the stationary behavior of a user's email account; Figure 8 displays an example for one particular user account. Histograms are compared to find similar or abnormal behavior between different accounts, and within the same account (between a long-term profile histogram and a recent, short-term histogram). A histogram depicts the distribution of items in a given sample. EMT employs a histogram of 24 bins, one for each hour of the day. Email statistics are allocated to different bins according to their outbound time. The value of each bin can represent the daily average number of emails sent out in that hour, the daily average total size of attachments sent out in that hour, or other features of the email account computed over some specified period of time. Two histogram comparison functions are implemented in the current version of EMT, each providing a user-selectable distance function. The first comparison function is used to identify groups of email accounts that have similar usage behavior. The other function is used to compare an account's recent behavior to the long-term profile of that account. The histogram comparison functions may also be run "unanchored", meaning the histograms are shifted to find the best alignment with minimum distance, thus accounting for time zone changes. Similar Users – Histogram distance. Similarly behaving user accounts may be identified by computing the pair-wise distances of their histograms (e.g., a set of accounts may be inferred as similar to a given known or suspect account that serves as a model). The histogram distance functions were modified for this detection task. First, we balance and weigh the information in the histogram representing hourly behavior with the information provided by the histogram representing behavior over different aggregate periods of a day. This is done because measures of hourly behavior may be at too low a level of resolution to find proper groupings of similar accounts. For example, an account that sends most of its email between 9am and 10am should be considered similar to one that sends emails between 10am and 11am, but perhaps not to an account that emails at 5pm. Given two histograms representing a heavy 9am user and a heavy 10am user, a straightforward application of any of the histogram distance functions will produce erroneous results. Thus, we divide a day into four periods: morning (7am-1pm), afternoon (1pm-7pm), night (7pm-1am), and late night (1am-7am). The final distance computed is the average of the distance of the 24-hour histogram and that of the 4-bin histogram, which is obtained by regrouping the bins in the 24-hour histogram. Second, because some of the distance functions require normalizing the histograms before computing the distance, we also take into account the volume of emails. Even with the exact distribution after normalization, a bin
representing 20 emails per day should be considered quite different from an account exhibiting the emission of 200 emails per day. Figure 6 graphically displays the EMT analysis showing the target user account and a list of the most similar accounts found by EMT’s histogram analysis.
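The following sketch illustrates this two-resolution comparison under simple assumptions of our own: 24-hour histograms built from outbound send hours, regrouped into the four periods above, with the two L1 distances averaged and the histograms left unnormalized so that volume still matters. It is not EMT's implementation; the distance choice and weighting are placeholders.

# Sketch: compare two accounts' sending behavior with a 24-bin hourly histogram
# and a 4-bin period histogram (morning/afternoon/night/late night), averaging
# the two L1 distances. Send-hour data are illustrative.
import numpy as np

def hourly_histogram(send_hours):
    """Counts of outbound emails per hour of day (unnormalized, so volume counts)."""
    hist = np.zeros(24)
    for h in send_hours:
        hist[h] += 1
    return hist

def period_histogram(hourly):
    # morning 7am-1pm, afternoon 1pm-7pm, night 7pm-1am, late night 1am-7am
    periods = [range(7, 13), range(13, 19), [19, 20, 21, 22, 23, 0], range(1, 7)]
    return np.array([hourly[list(p)].sum() for p in periods])

def account_distance(hours_a, hours_b):
    ha, hb = hourly_histogram(hours_a), hourly_histogram(hours_b)
    d24 = np.abs(ha - hb).sum()
    d4 = np.abs(period_histogram(ha) - period_histogram(hb)).sum()
    return (d24 + d4) / 2.0

heavy_9am = [9] * 30 + [10] * 5
heavy_10am = [10] * 28 + [11] * 6
evening_user = [20] * 25 + [21] * 10
print(account_distance(heavy_9am, heavy_10am))    # relatively small distance
print(account_distance(heavy_9am, evening_user))  # much larger distance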
Fig. 5. Chi Square Test of recipient frequency
Abnormal User Account Behavior. EMT may apply these distance functions to one target email account. (See Figure 6.) A long term profile period is first selected by an analyst as the “normal” behavior period. The histogram computed for this period is then compared to another histogram computed for a more recent period of email behavior. If the histograms are very different (i.e., they have a high distance), an alert is generated indicating possible account misuse. We use the weighted Mahalanobis distance function for these profiles. The long term profile period is used as the training set, for example, a single month. We assume the bins in the histogram are random variables that are statistically independent. When the distance between the histogram of the selected recent period and that of the longer term profile is larger than a threshold, an alert will be generated to warn the analyst that the behavior “might be abnormal” or is deemed “abnormal”. The alert is also put into the alert log of EMT.
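A toy version of this long-term versus recent-period comparison is sketched below, assuming statistically independent bins (so the weighted Mahalanobis distance reduces to a per-bin variance-scaled distance) and a hand-picked alert threshold; both the synthetic data and the threshold are our own placeholders, not EMT's.

# Sketch: compare a recent 24-bin histogram against a long-term profile using a
# diagonal (independent-bin) Mahalanobis-style distance and alert on large distances.
import numpy as np

long_term_days = np.random.default_rng(1).poisson(lam=2.0, size=(30, 24))  # 30 days of hourly counts
recent_days = np.random.default_rng(2).poisson(lam=2.0, size=(7, 24))
recent_days[:, 3] += 15                             # inject unusual 3am activity

mu = long_term_days.mean(axis=0)                    # long-term profile
var = long_term_days.var(axis=0) + 1e-6             # avoid division by zero
recent = recent_days.mean(axis=0)

distance = np.sqrt(np.sum((recent - mu) ** 2 / var))
THRESHOLD = 10.0                                    # would be tuned by the analyst
if distance > THRESHOLD:
    print(f"ALERT: behavior might be abnormal (distance={distance:.1f})")
else:
    print(f"within profile (distance={distance:.1f})")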
Fig. 6. Histogram Comparison to Detect Similar users
The histograms described here are stationary models; they represent statistics at discrete time frames. Other non-stationary account profiles are provided by EMT, as described next. Non-stationary User Profiles – Histograms over blocks of emails. Another type of modeling considers the changing conditions over time of an email account. Most email accounts follow certain trends, which can be modeled by some underlying distribution. As an example of what this means, many people will typically email a few addresses very frequently, while emailing many others infrequently. Day to day interaction with a limited number of peers usually results in some predefined groups of emails being sent. Other contacts with whom the email account owner interacts with on less than a day to day basis have a more infrequent email exchange behavior. The recipient frequency is used as a feature to study this concept of underlying distributions. Four behavior analysis graphs for any selected e-mail account are created by EMT for this model. These graphs display the address list size and average outgoing e-mail account spread over time, as well as the number of outgoing e-mails to each destination account. Every user of an email system develops a unique pattern of email emission to a specific list of recipients, each having their own frequency. Modeling every
user's idiosyncrasies enables the EMT system to detect malicious or anomalous activity in the account. This is similar to what happens in credit card fraud detection, where current behavior violates some past behavior pattern. Figure 5 provides a screenshot of the non-stationary model features in EMT, which are fully described elsewhere. In a nutshell, the Profile tab in Figure 5 provides a snapshot of the account's activity in terms of recipient frequency. It contains three charts and one table. The various profile statistics selected by the analyst specify an empirical distribution that may then be compared by the analyst with a set of built-in metrics, including Chi-square and Hellinger distance [10]. Rapid changes in email emissions among accounts can then be discerned, which may have particular intelligence value.
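As an illustration of the recipient-frequency idea (not EMT's exact procedure), one can tabulate how often an account emails each recipient in a long-term training block and in a recent block, then apply a Chi-square goodness-of-fit test to ask whether the recent emission frequencies still follow the trained distribution. The counts below are invented placeholders.

# Sketch: Chi-square test of whether an account's recent recipient frequencies
# match its longer-term distribution.
from scipy.stats import chisquare

recipients = ["alice", "bob", "carol", "dave"]
train_counts = [120, 60, 15, 5]                     # long-term emission counts
recent_counts = [10, 8, 30, 2]                      # recent block: carol suddenly dominant

total_recent = sum(recent_counts)
expected = [total_recent * c / sum(train_counts) for c in train_counts]

stat, p_value = chisquare(f_obs=recent_counts, f_exp=expected)
print(f"chi-square={stat:.1f}, p={p_value:.4f}")    # small p -> frequencies have shifted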
2.4 Group Communication Models: Cliques
In order to study the email flows between groups of users, EMT provides a feature that computes the set of cliques in an email archive. We seek to identify clusters or groups of related email accounts that frequently communicate with each other, and then use this information to identify unusual email behavior that violates typical group behavior, or to identify similar behaviors among different user accounts on the basis of group communication activities. Clique violations may also indicate internal email security policy violations. For example, members of the legal department of a company might be expected to exchange many Word attachments containing patent applications; it would be highly unusual if members of the marketing department and HR services likewise received these attachments. EMT can infer the composition of related groups by analyzing normal email flows and computing cliques (see Figure 7), and use the learned cliques to alert when emails violate clique behavior. An analyst may simply wish to compute these cliques and rank-order all associated emails of the clique members for direct inspection. EMT implements clique finding using the branch-and-bound algorithm described in [2]. We treat an email account as a node and establish an edge between two nodes if the number of emails exchanged between them is greater than a user-defined threshold, which is taken as a parameter (Figure 7 is displayed with a setting of 100). The cliques found are the fully connected subgraphs. For every clique, EMT computes the most frequently occurring words appearing in the subjects of the emails in question, which often reveals the clique's typical subject matter under discussion. Chi Square + cliques. The Chi Square + cliques (CS + cliques) feature in EMT is the same as the Profile window described above in Section 2.3, with the addition of the calculation of clique frequencies. In summary, the clique algorithm is based on graph theory. It finds the largest cliques (groups of users) that are fully connected with a minimum number of emails per connection at least equal to the threshold (set at 50 by default).
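A compact sketch of this construction using the networkx library: nodes are email accounts, an edge is added when two accounts exchanged more than a threshold number of emails, and maximal cliques are enumerated. The original EMT uses the branch-and-bound algorithm of Bron and Kerbosch [2]; networkx's find_cliques is likewise a Bron-Kerbosch variant. The exchange counts and threshold below are illustrative.

# Sketch: build an account graph thresholded on exchanged-email counts and list
# the maximal cliques. Exchange counts are made-up placeholders.
import networkx as nx

exchange_counts = {
    ("alice", "bob"): 140, ("alice", "carol"): 120, ("bob", "carol"): 105,
    ("carol", "dave"): 30, ("dave", "erin"): 150,
}
threshold = 100

G = nx.Graph()
for (a, b), count in exchange_counts.items():
    if count > threshold:
        G.add_edge(a, b, weight=count)

for clique in nx.find_cliques(G):                   # Bron-Kerbosch maximal cliques
    if len(clique) >= 2:
        print(sorted(clique))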
Fig. 7. Clique generation for 100 messages
In this window, each clique is treated as if it were a single recipient, so that each clique has a frequency associated with it. Only the cliques to which the selected user belongs will be displayed. Some users do not belong to any clique, and for those, this window is identical to the normal Chi Square window. If the selected user belongs to one or more cliques, each clique appears under the name clique_i (i = 1, 2, ...) and is displayed in a cell with a green color in order to be distinguishable from individual email account recipients. (One can double-click on each clique's green cell, and a window pops up with the list of the members of the clique.) Cliques tend to have high ranks in the frequency table, as the number of emails corresponding to a clique is the aggregate total for several recipients. These metrics are a first step toward modeling a user's behavior in terms of group email emission frequency. A larger database will enable us to refine them, and to better understand the time-continuous stochastic process taking place. The Chi-square test may be modified or complemented with finer measures. The Chi-square test checks whether the frequencies of emission are constant for a given user. In the preliminary results obtained on our collected database, the Chi-square test tended to reject quite often the hypothesis that the frequencies were the same between training and testing periods, indicating that the frequencies are not stable. They change quite dynamically over short time frames, as new recipients and cliques become more or less popular over time. Any new model should take this dynamic evolution into account.
Fig. 8. Anomalous user behavior detected by histogram comparison
Enclave cliques vs. User cliques. Conceptually, two types of cliques can be formulated, and both are supported by EMT. The type described in the previous section can be called enclave cliques because these cliques are inferred by looking at the email exchange patterns of an enclave of accounts. In this regard, no account is treated as special, and we are interested in email flow patterns at the enclave level. Any flow violation or new flow pattern pertains to the entire enclave. On the other hand, it is possible to look at email traffic patterns from a different viewpoint altogether. Suppose we are focusing on a specific account and have access to its outbound traffic log. Since an email can have multiple recipients, these recipients can be viewed as a clique associated with this account. Because a clique could be subsumed by another clique, we define a user clique as one that is not a subset of any other clique. In other words, user cliques of an account are its recipient lists that are not subsets of other recipient lists. User clique computation provides an intelligence analyst with the means of quickly identifying groups directly associated with a target email account, and may be used to group emails for inspection based upon various clique analyses. This is an active area of our ongoing research. Preliminary experiments have been performed using these graph-theoretic features for spam and virus detection. In both cases, the clique models provide interesting new evidence to improve the accuracy of detection beyond what is achievable with pure content-based features of emails.
3 Conclusion
It is important to note that testing EMT and MET in a laboratory environment is not particularly informative of their performance on specific tasks and source material. The behavior models are naturally specific to a site or particular account(s), and thus performance will vary depending upon the quality of data available for modeling and the parameter settings and thresholds employed. EMT is designed to be as flexible as possible so an analyst can effectively explore the space of models and parameters appropriate for their mission. An analyst simply has to take it for a test spin. (EMT has been deployed and is being tested and evaluated by external organizations.) One of the core principles behind EMT's design may be stated succinctly: there is no single monolithic model appropriate for any detection or forensic analysis task. Hence, EMT provides a palette of models and profiling techniques (specialized to email log files) that may be combined in interesting ways by an analyst to meet their own mission objectives. It is also important to recognize that no single modeling technique in EMT's repertoire can be guaranteed to have no false negatives, or few false positives. Rather, EMT is designed to assist an analyst or security staff member in architecting a set of models whose outcomes provide evidence for some particular detection task. The combination of this evidence is specified in the alert logic section as simple Boolean combinations of model outputs, and the overall detection rates will clearly be adjusted and will vary depending upon the user-supplied specifications of threshold logic. The Email Mining Toolkit is a work in progress. This paper has described the core concepts underlying EMT, its related Malicious Email Tracking system, and the Malicious Email Filtering system. We have presented the features of the system currently implemented and available to an analyst for various security and intelligence applications. The GUI allows the user to easily automate many complex analyses. We believe the various behavior-based profiles computed by EMT will significantly improve analyst productivity. We are continuing our research to broaden the range of features and models one may compute over email logs. For example, the notion of clique may be over-constrained, and may be relaxed in favor of other kinds of models of communication groups. Further, we are actively exploring stochastic models of long-term user profiles, with the aim of computing these models efficiently when training such profiles. Computing histograms over fixed time periods is very efficient, but likely insufficient to model a user's true dynamic behavior.
References
1. M. Bhattacharyya, S. Hershkop, E. Eskin, and S. J. Stolfo, MET: An Experimental System for Malicious Email Tracking. In Proceedings of the 2002 New Security Paradigms Workshop (NSPW-2002), Virginia Beach, VA, September 2002.
2. C. Bron and J. Kerbosch, Finding all cliques of an undirected graph, Comm. ACM 16(9) (1973) 575–577.
3. E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. J. Stolfo, A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In Data Mining for Security Applications, Kluwer, 2002.
4. G. H. John and P. Langley, Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, 1995.
5. W. Lee, S. Stolfo, and K. Mok, Mining Audit Data to Build Intrusion Detection Models. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD '98), New York, NY, August 1998.
6. W. Lee, S. Stolfo, and P. Chan, Learning Patterns from Unix Process Execution Traces for Intrusion Detection. AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, July 1997.
7. M. G. Schultz, E. Eskin, and S. J. Stolfo, Malicious Email Filter – A UNIX Mail Filter that Detects Malicious Windows Executables. Proceedings of the USENIX Annual Technical Conference – FREENIX Track, Boston, MA, June 2001.
8. M. Damashek, Gauging Similarity with n-grams: Language-Independent Categorization of Text. Science, 267(5199), 843–848, 1995.
9. T. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 180–183.
10. R. V. Hogg, Introduction to Mathematical Statistics, Prentice Hall, 1994.
Detecting Deception through Linguistic Analysis

Judee K. Burgoon¹, J.P. Blair², Tiantian Qin¹, and Jay F. Nunamaker, Jr.¹

¹ Center for the Management of Information, University of Arizona
{jburgoon,tqin,nunamaker}@cmi.arizona.edu
² Department of Criminal Justice, Michigan State University
[email protected]
Abstract. Tools to detect deceit from language use pose a promising avenue for increasing the ability to distinguish truthful transmissions, transcripts, intercepted messages, informant reports, and the like from deceptive ones. This investigation presents preliminary tests of 16 linguistic features that can be automated to return assessments of the likely truthfulness or deceptiveness of a piece of text. Results from a mock theft experiment demonstrate that deceivers do utilize language differently than truth tellers and that combinations of cues can improve the ability to predict which texts may contain deception.
ability in specified contexts [6]. The research to be reported here was guided by the objectives of identifying those indicators that are (1) the least context-sensitive and (2) the most amenable to automation. We present preliminary results from a mock theft experiment that is still in progress, our purposes being to illustrate the promise of examining text-based linguistic indicators and of examining such indicators using a particular statistical approach that examines combinations of cues.
2 Background Two experiments from our research program predate the one to be reported here. One was modeled after the familiar Desert Survival Problem, in which pairs of participants were given a scenario in which their jeep had crashed in the Kuwaiti desert, read material we developed from an Army field manual that we entitled "Imperative Information for Surviving in the Desert," and then were asked to arrive at a consensus on the rank-ordering of salvageable items in terms of their importance to survival. The task was conducted via email over the course of several days. In half of the pairs, one person was asked to deceive the partner by advocating choices opposite of what the experts recommend (e.g., discarding bulky clothing and protective materials so as to make walking more manageable). Partners discussed the rankings and their recommendations either face-to-face or using a computer-mediated form of communication such as text chat, audioconferencing, or videoconferencing. All discussions were recorded and transcribed, then subjected to linguistic analysis of such features as number of words, number of sentences, number of unique words (lexical diversity), emotiveness, and pronoun usage. Of the 27 indicators that were examined, several proved to reliably distinguish truth tellers from deceivers. Deceivers were more likely to use longer messages but with less diversity and complexity, and greater uncertainty and "distancing" in language use than truth tellers. These results revealed that systematic differences in language use could help predict which messages originated from deceivers and which from those telling the truth. The second experiment was designed as a pilot effort for the experiment to be reported below. In this experiment, participants staged a mock theft and were subsequently interviewed by untrained and trained interviewers via text chat or face-to-face (FtF) interaction [4, 5]. The FtF interactions were later transcribed, and the transcripts and chats were submitted to linguistic analysis on the same features as noted above, plus several others that are available in the Grammatik tool within WordPerfect. Due to the small sample size, none of the differences between innocents (truth tellers) and thieves (deceivers) were statistically significant, but patterns were suggestive of deceivers tending toward briefer messages (fewer syllables, words, and sentences; shorter and simpler sentences) of greater complexity (e.g., greater vocabulary and sentence complexity, lower readability scores) than truth tellers (higher Flesch-Kincaid grade level, sentence complexity, vocabulary complexity, and syllables per word). The patterns found in these first efforts suggested that we should expect to find many linguistic differences between deceivers and truth tellers with a larger, well-designed experiment. We therefore hypothesized that deceptive senders display higher (a) quantity, (b) nonimmediacy, (c) expressiveness, (d) informality, and (e) affect; and less (f) complexity, (g) diversity, and (h) specificity of language in their messages than truthful senders.
3 Method Students were recruited from a multi-sectioned communication class by offering them credit for participation and the chance to win money if they were successful at their task. Half of the students were randomly assigned to be "thieves," i.e., those who would be deceiving about a theft, and the other half became "innocents," i.e., those who would be telling the truth. Interviewees in the deceptive condition were assigned to "steal" a wallet that was left in a classroom. In the truthful condition, interviewees were told that a "theft" would occur in class on an assigned day. All of the interviewees and interviewers then appeared for interviews according to a pre-assigned schedule. We attempted to motivate serious engagement in the task by offering interviewers $10 if they could successfully detect whether their interviewee was innocent or guilty and successfully detect whether they were deceiving or telling the truth on a series of the interview questions. In turn, we offered interviewees $10 if they convinced a trained interviewer that they were innocent and that their answers to several questions were truthful. An additional incentive was a $50 prize to be awarded to the most successful interviewee. Interviewees were then interviewed by one of three trained interviewers under one of three modalities: face to face (FtF), text chat, or audioconferencing. The interviews followed a standardized Behavioral Analysis Interview format that is taught to criminal investigators [7]. Interviews were subsequently transcribed and submitted to linguistic analysis. Clusters of potential indicators, all of which could be automatically calculated with a shallow parser (Grok or Iskim) or could use a look-up dictionary, were included. The specific classes of cues and respective indicators were as follows:
1. Quantity (number of syllables, number of words, number of sentences)
2. Vocabulary complexity (number of big words, number of syllables per word)
3. Grammatical complexity (number of short sentences, number of long sentences, Flesch-Kincaid grade level, average number of words per sentence, sentence complexity, number of conjunctions)
4. Specificity and expressiveness (emotiveness index, rate of adjectives and adverbs, number of affective terms)
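A few of these cues, particularly the quantity and vocabulary-complexity indicators, can be approximated with very simple counting, as in the sketch below. This is an illustration only, not the shallow-parser pipeline (Grok or Iskim) or the look-up dictionaries used in the study; the "big word" heuristic is our own assumption.

# Sketch: rough counts for a handful of the linguistic cues listed above
# (words, sentences, big words, average words per sentence, lexical diversity).
import re

def cue_profile(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    big_words = [w for w in words if len(w) >= 7]   # crude "big word" proxy
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_big_words": len(big_words),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "lexical_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

print(cue_profile("I never touched the wallet. I was in the library all afternoon."))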
4 Results
In the following two subsections, we examine the data from two perspectives: analysis of individual cues and cluster analysis. To analyze how well individual cues distinguish messages of deceivers from those of truth tellers, we conducted multivariate analyses of related groups of cues, followed by directional t-tests on individual cues to identify which ones contribute most to differentiating deceivers from truth tellers. The cluster analysis addresses whether combinations of cues, arranged in a hierarchical structure, can improve the overall ability to differentiate deceivers from truth tellers. Furthermore, unlike traditional statistical cluster analysis, we used a data-mining algorithm, C4.5 [8], to cluster the cues and obtain a hierarchical tree structure. In this way, we fulfilled the requirement that deception detection be automated.
4.1 Individual Cue Analysis
Results were based on data from the 49 subjects whose modality was text chat (txt) or audio. (Data for the face-to-face (FtF) condition will be added once those sessions have been completed and all video files transcribed.) Among these subjects, 29 interacted via text and 20 via audio; 26 were "thieves" (i.e., deceivers) and 23 were "innocents" (i.e., truth tellers). Table 1 presents descriptive statistics for the 16 cues that were analyzed.
Table 1. Means (standard deviations) for the 16 cues
Results of the multivariate tests and t-tests are shown in Table 2. (For plots of means by deception condition and modality, see the figures in the appendix.) The multivariate analysis of the indicators of quantity of language produced a significant multivariate effect for deception (p = .033) and no modality-by-deception interaction. Deceivers said or wrote less than truth tellers.
The multivariate analyses of the complexity indicators at the sentence level (simple sentences, long sentences, short sentences, sentence complexity, Flesch-Kincaid grade level, number of conjunctions, average words per sentence (AWS)) and at the vocabulary level (vocabulary complexity, number of big words, average syllables per word (ASW)) did not produce overall multivariate effects, but several individual variables did show effects of the deception condition. Deceivers produced significantly fewer long sentences, lower AWS, lower sentence complexity, and a lower Flesch-Kincaid grade level than truth tellers. This means their language was less complex and easier to comprehend. The t-tests also provided weak support for deceivers having lower ASW (p = .102) and fewer conjunctions (p = .149) than truth tellers. Thus, deceivers used less complex language at both the lexical (vocabulary) and grammatical (sentence and phrase) levels. A modality effect also showed that subjects in text chat used fewer conjunctions than those in audio, indicating that text chat was less likely to exhibit compound and complex sentences.
For the analyses of message specificity and expressiveness (adjectives and adverbs, emotiveness, and affect), the multivariate test showed a trend toward a main effect for the deception condition (p = .101). There was a significant univariate difference
on affect, such that deceivers used less language referring to emotions and feelings than did truth tellers.

Table 2. Univariate F-tests (p-values) for the between-subject effects and independent-samples t-tests for individual cue analysis

Cues                          Modality        Condition       Modality*Condition   t-Test
Syllables                     2.054 (.159)    1.842 (.182)    .156 (.695)          1.502 (.140)
Words                         2.363 (.131)    2.407 (.128)    .162 (.689)          1.702 (.096)*
Sentences                     .810 (.373)     .001 (.972)     .018 (.894)          .111 (.912)
Short sentences               .122 (.016)*    .588 (.447)     .225 (.637)          -.725 (.472)
Long sentences                .547 (.464)     6.566 (.014)*   .005 (.947)          2.781 (.008)*
Simple sentences              .029 (.886)     .002 (.969)     1.874 (.178)         .061 (.951)
Big words                     .462 (.500)     .288 (.594)     .146 (.704)          .616 (.541)
Average syllables per word    1.949 (.17)     1.703 (.199)    .413 (.524)          -1.668 (.102)
Average words per sentence    .374 (.544)     4.368 (.042)*   .006 (.936)          2.414 (.021)*
Flesch-Kincaid grade level    .001 (.979)     2.690 (.108)    .005 (.943)          1.958 (.056)*
Sentence complexity           .006 (.940)     2.055 (.159)    .181 (.673)          1.779 (.082)*
Vocabulary complexity         .657 (.422)     .512 (.478)     .657 (.422)          -.997 (.324)
# of Conjunctions             3.393 (.072)*   2.569 (.116)    1.496 (.228)         1.426 (.163)
Rate Adjectives and Adverbs   0.150 (.700)    .329 (.569)     .301 (.586)          -.596 (.554)
Emotiveness                   0.020 (.889)    .054 (.818)     .060 (.808)          -.233 (.817)
Affect                        1.591 (.214)    3.291 (.214)    .004 (.948)          1.630 (.110)

* p < .05, one-tailed.
4.2 Cluster Analysis by C4.5
Although many linguistic cues were not significant individually, as shown in Section 4.1, together they can form a hierarchical tree that performs relatively well in discriminating deceptive communicators from truthful ones. Among the many available data-mining algorithms, we chose C4.5 because it provides a clear cluster structure (compared with a neural network) as well as satisfactory precision [9]. C4.5 uses a pruned tree to cluster the cues; the algorithm cuts off redundant branches while constraining error rates. We used the Weka software (University of Waikato, New Zealand; Witten and Frank, 2000) to run C4.5. Figure 1 shows the output of a pruned tree, where "1" stands for the truthful condition and "2" stands for the deceptive condition. The correct prediction rate using 15-fold cross-validation is 60.72%, which is reasonably satisfactory given the small size of the data set.
As shown in Figure 1, a combination of linguistic cues can categorize deceptive behavior well. For example, sentence-level complexity combined with vocabulary or affect acted as a good classifier. The linguistic cues that were significant in Section 4.1 also played important roles in the cluster classification: number of conjunctions, Flesch-Kincaid grade level, AWS, and affect. On the other hand, the cluster structure was also consistent with the multivariate tests in that not all linguistic cues contribute to identifying deception. There were "unhelpful" cues, such as emotiveness, which showed no significance in either the single-level analysis or the cluster (hierarchical) analysis. However, it is premature to conclude that any linguistic cue is ineffective at this point; further investigation with larger data sets will give us deeper insight into the interrelations of cues.
The confusion matrix shows the number of misclassifications: 10 out of 37 truthful cases were misclassified as deceptive, and 19 out of 35 deceptive cases were misclassified as truthful. The tree thus produced fewer misclassifications for the truthful condition than for the deceptive condition.
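As an illustration only, the following sketch shows how a cue-based decision tree with cross-validation might be reproduced in Python. It substitutes scikit-learn's CART-style DecisionTreeClassifier for the C4.5/J48 learner used with Weka in the study, and the feature matrix, labels, and feature names are placeholders rather than the study's data.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: rows are transcripts, columns are cue values such as number of
# conjunctions, Flesch-Kincaid grade level, average words per sentence, and affect terms.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 4))
y = rng.integers(1, 3, size=49)   # 1 = truthful condition, 2 = deceptive condition
feature_names = ["conjunctions", "fk_grade", "avg_words_per_sentence", "affect"]

# CART with a depth limit stands in for the pruned C4.5/J48 tree used in the study.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
print(export_text(clf, feature_names=feature_names))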
5 Discussion
This investigation was undertaken largely to demonstrate the efficacy of using linguistic cues, especially ones that can be automated, to flag potentially deceptive discourse, and to use statistical clustering techniques to select the best set of cues for reliably distinguishing truthful from deceptive communication. It demonstrates the potential of both the general focus on language indicators and the use of hierarchical clustering techniques to improve the ability to predict which texts might be deceptive.
As for the specific indicators that might prove promising, these results provide some evidence for the hypothesis that deceivers behave differently than truth tellers in communication via text chat and/or audio chat. Although many tests were not significant due to the small sample size, the profile plots showed a trend: deceivers' messages were briefer (i.e., lower in quantity of language), were less complex in their choice of vocabulary and sentence structure, and lacked specificity and expressiveness in their text-based chats. This is consistent with
profiles found in nonverbal deception research showing that deceivers tend to adopt, at least initially, a fairly inexpressive, rigid communication style with "flat" affect. It appears that their linguistic behavior follows suit and likewise reflects their inability to create messages rich with the details and complexities that characterize truthful discourse. Over time, deceivers may alter these patterns, more closely approximating normal speech in many respects. But language choice and complexity may fail to show such changes because deceivers are not accessing real memories and real details, and thus do not have the same resources in memory upon which to draw.
Unlike asynchronous experiments such as the Desert Survival Problem (DSP) experiment, subjects here did not have sufficient time to construct detailed lies with greater quantity and complexity [10]. The difference in synchronicity between these two tasks points to time for planning, rehearsal, and editing as a major factor that may alter the linguistic patterns of deceivers and truth tellers. As a consequence, no single profile of deceptive language across tasks is likely to emerge. Rather, different cue models will likely be required for different tasks. Consistent with interpersonal deception theory [11], deceivers may adapt their language style deliberately according to the task at hand and their interpersonal goals. If the situation does not afford adequate time for more elaborate deceits, one should expect deceivers to say less. But if time permits elaboration, and/or the situation is one in which persuasive efforts may prove beneficial, deceivers may actually produce longer messages. What may not change, however, is their ability to draw upon more complex representations of reality, because they are not accessing reality. In this respect, complexity measures may prove less variant across tasks and other contextual features. The issue of context invariance thus becomes an extremely important one to investigate as this line of work proceeds.
Modality also plays a role in communication. Subjects talked more than they wrote, but message complexity did not seem to differ much between the text and audio modalities. Future research will explore the effect of different communication modalities on the characteristics of truthful and deceptive messages. Although the clustering analysis did not consider modality effects, it provided a hierarchical tree structure that captures the combined characteristics of the cues. It also provided exploratory threshold values for separating deceptive and truthful messages.
It should also be noted that the analysis in this study used the absolute values of linguistic characteristics to classify statements as truthful or deceptive. Because people vary greatly in their language use (e.g., some people naturally use more or less complex language than others), cue values that are relative to the sender of the message may yield greater classification accuracy. This would require building a model of an individual's baseline speech patterns and then comparing an individual message to this model (a sketch of this idea appears below). Future research will consider the interconnections among linguistic cues, tasks, and modalities. More data will also enhance the reliability of the current results, but it is clear from these results alone that linguistic cues that are amenable to automation may prove valuable in the arsenal of tools to detect deceit.
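A minimal sketch of the sender-baseline idea mentioned above, assuming a table of per-message cue values with a sender identifier (the column names and values here are hypothetical): each cue is re-expressed as a z-score relative to that sender's own messages before classification.

import pandas as pd

# Hypothetical per-message cue table; 'sender' identifies the message author and the
# cue columns would come from the linguistic analysis step.
df = pd.DataFrame({
    "sender":   ["a", "a", "a", "b", "b", "b"],
    "words":    [120, 98, 143, 45, 60, 52],
    "fk_grade": [8.1, 7.5, 9.0, 5.2, 6.1, 5.8],
})

# Express each cue relative to the sender's own baseline (a within-sender z-score), so
# that naturally verbose or complex writers are not flagged simply for being so.
cues = ["words", "fk_grade"]
adjusted = df.groupby("sender")[cues].transform(lambda c: (c - c.mean()) / c.std(ddof=0))
print(df[["sender"]].join(adjusted.add_suffix("_z")))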
Acknowledgement. Portions of this research were supported by funding from the U.S. Air Force Office of Scientific Research under the U.S. Department of Defense University Research Initiative (Grant #F49620-01-1-0394). The views, opinions, and/or findings in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
References
1. Burgoon, J. K., Buller, D. B., Ebesu, A., Rockwell, P.: Interpersonal Deception: V. Accuracy in Deception Detection. Communication Monographs 61 (1994) 303–325
2. Levine, T., McCornack, S.: Linking Love and Lies: A Formal Test of the McCornack and Parks Model of Deception Detection. J. of Social and Personal Relationships 9 (1992) 143–154
3. Zuckerman, M., DePaulo, B., Rosenthal, R.: Verbal and Nonverbal Communication of Deception. In: Berkowitz, L. (ed.): Advances in Experimental Social Psychology, Vol. 14. Academic Press, New York (1981) 1–59
4. Burgoon, J., Blair, J. P., Moyer, E.: Effects of Communication Modality on Arousal, Cognitive Complexity, Behavioral Control and Deception Detection during Deceptive Episodes. Paper submitted to the Annual Meeting of the National Communication Association, Miami (2003, November)
5. Burgoon, J., Marett, K., Blair, J. P.: Detecting Deception in Computer-Mediated Communication. In: George, J. F. (ed.): Computers in Society: Privacy, Ethics & the Internet. Prentice-Hall, Upper Saddle River, NJ (in press)
6. Vrij, A.: Detecting Lies and Deceit. John Wiley and Sons, New York (2000)
7. Inbau, F. E., Reid, J. E., Buckley, J. P., Jayne, B. C.: Criminal Interrogations and Confessions. 4th edn. Aspen, Gaithersburg, MD (2001)
8. Quinlan, J. R.: C4.5. Morgan Kaufmann Publishers, San Mateo, CA (1993)
9. Spangler, W., May, J., Vargas, L.: Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support. J. Management Information Systems 16 (1999) 37–62
10. Zhou, L., Twitchell, D., Qin, T., Burgoon, J. K., Nunamaker, J. F., Jr.: An Exploratory Study into Deception Detection in Text-based Computer-Mediated Communication. In: Proceedings of the 36th Annual Hawaii International Conference on System Sciences. Big Island, Los Alamitos, CA (2003)
11. Buller, D. B., Burgoon, J. K.: Interpersonal Deception Theory. Communication Theory 6 (1996) 203–242
Appendix: Individual Cue Comparisons by Modality and Deception Condition*
[Profile plots of estimated marginal means by modality and deception condition (guilty vs. innocent) for each cue: total number of syllables, total number of words, number of sentences, short sentences, long sentences, simple sentences, big words, average syllables per word, average words per sentence, Flesch-Kincaid grade level, sentence complexity, vocabulary complexity, rate of adjectives and adverbs, number of conjunctions, affect, and emotiveness.]
* Modality 1 = Text, Modality 2 = Audio
A Longitudinal Analysis of Language Behavior of Deception in E-mail

Lina Zhou 1, Judee K. Burgoon 2, and Douglas P. Twitchell 2

1 Department of Information Systems, University of Maryland, Baltimore County
[email protected]
2 Center for the Management of Information, University of Arizona
{jburgoon, dtwitchell}@cmi.arizona.edu
Abstract. The detection of deception is a promising but challenging task. Previous exploratory research on deception in computer-mediated communication found that language cues were effective in differentiating deceivers from truth-tellers. However, whether and how these language cues change over time remains an open issue. In this paper, we investigate the effect of time on cues to deception in an empirical study. The preliminary results showed that some cues to deception change over time, while others do not. An explanation for the lack of change in the latter cases is provided. In addition, we show that the number and type of cues to deception vary from time to time. We also suggest what could be the best time to investigate cues to deception in a continuous email communication.
in one of the most popular types of CMC (www.info.isoc.org), but it also allows us to focus our attention on language behavior in deception.
Most past deception research focuses on nonverbal cues or a mix of verbal and other types of cues to deception in face-to-face settings [10,11,15,18]. What remains theoretically challenging is how effective verbal indicators of deception can be in email and other types of CMC. Although some text-based cues from prior studies are potentially applicable to email, deception research in email is still rare, mainly because it must address the following challenges: 1) the high dynamics of messages, especially in message length, language style, and message structure (email is expressed through the medium of writing, though it displays several of the core properties of speech, such as the expectation of responses, transience, and time-governed interactions [8]); 2) low media richness, since email lacks a true ability to signal meaning through kinetic and proxemic features; and 3) the lack of other linguistic features typical of conversational speech, which makes it difficult for language to be used in a truly conversational way [8]. Among existing text-based cues [13,16,17], we selected language cues, which are less dependent upon domain experts and can potentially be automated as a result of progress in natural language processing technologies.
In this paper, we aim to study the effect of time on language cues to deception in email. We examine what kinds of language cues vary over time and which cues remain consistent. A secondary objective is to explore during what period of a continuous email communication deceivers display language cues to deception most evidently. These results are expected to shed light on how deceivers adjust their deception strategies over time and on what part of a conversation is best for detecting deception.
2 Theoretical Foundation and Hypotheses
2.1 Media Richness Theory
As one of the least rich media, email does not have the same ability to transmit information, meaning, and emotion as do richer media, such as face-to-face interaction [9]. Because e-mail is a less rich medium, deception is claimed to be more difficult to detect over e-mail than over richer media [12]. However, if users of text-based systems perceive the channel being used as able to convey richer information than it really does, they may use the system in a way that begins to mimic the use of richer systems. Although the theory does not address only one modality, many of its findings are applicable to other mediated channels that are low in richness.
2.2 Interpersonal Deception Theory (IDT)
IDT attempts to explain deception from an interpersonal, conversational perspective rather than from a strictly physiological one [2]. IDT posits that, within the context and relationship of the sender and receiver of deception, the deceiver will both engage in strategic modifications of behavior in response to the receiver's suspicions and display non-strategic leakage cues or indicators of deception. Tests of this theory have
confirmed the existence of brevity and nonimmediacy along with other identifiable cues, which may be useful in detecting deception within any modality [4, 6]. The theory not only applies to physiological or nonverbal indicators but also pertains to verbal indicators. Information management, one of the strategic behaviors of deceivers posited in IDT, is closely related to the modification or manipulation of central message content and its language style [5]. Most of all, the idea that interaction influences participants' subsequent behaviors indicates that language behavior may change during different phases of communication.
2.3 Interpersonal Adaptation Theory (IAT)
IAT clarified and described the interaction patterns of reciprocity and compensation in dyadic interaction [7]. Among other propositions, it implies a focus on longitudinal analyses of interaction. Deception is likely to be a continuous event that occurs over time [20]. Even with deception goals in mind, deceivers may manage to embed their intention in other messages that seem truthful to their partners. One underlying motivation for deceivers is to prevent their partners from suspecting them, which may lead to cognitive arousal. In addition, the adaptation within dyadic interaction may occasionally lead deceivers to display behavior similar to that of truth-tellers. Therefore, we expected to see dynamics over time in language cues to deception.
2.4 Hypotheses
We first extracted a set of effective linguistic cues to deception based on the findings of a previous study [21]. That study found that deceivers and truth-tellers differ significantly in the quantity and diversity of their language. In addition, it revealed that deceivers display different informality and affect in their language than truth-tellers do, and it partially supported the effect of non-immediacy on deception. However, it did not examine whether the above differences hold consistently in continuous communication. In this study, therefore, we focus on the change of cues to deception over time.
To balance the conflicting goals of achieving their communication objectives and managing the potential arousal caused by deceiving, deceivers may not exhibit the same language behaviors all the time. They may intentionally manage themselves to be less deceptive at some times than at others. The proposition that cues change along the time dimension is captured in Hypothesis 1.
HYPOTHESIS 1. Deceivers change (a) quantity, (b) diversity, (c) informality, (d) affect, and (e) non-immediacy of language over time.
To remove the potential effect of the task, we compare deceivers with truth-tellers who perform the same task, examining how the significance of cues changes over time. Thus, we are also interested in Hypothesis 2.
HYPOTHESIS 2. Differences between deceivers’ language and truth tellers’ language on (a) quantity, (b) diversity, (c) informality, (d) affect, and (e) nonimmediacy vary across time.
3 Method
The research experiment was a 2×2×3 doubly repeated measures design varying experimental condition (0: truthful, 1: deceptive), dyad role (0: sender, 1: receiver), and time (1: time 1, 2: time 2, 3: time 3), with the last two factors serving as within-dyad repeated factors. Subjects were randomly assigned to one of the two roles in one of the two experimental conditions and performed a task for 3 consecutive days under the same condition. Truthful senders served as the control condition. A series of repeated measures analyses and analyses of variance were conducted to test Hypotheses 1 and 2. In all analyses, day was treated as a within-dyads factor, and repeated contrasts were performed for day to test for potential trends.
3.1 Experiment Design
Subjects. Subjects (N = 60; 57% female) were pairs of freshman, sophomore, junior, and senior students recruited from an MIS course with extra credit for experimental participation.
Tasks. The task involved decision making in the Desert Survival Problem. The subjects were given a scenario in which they were stranded in the desert, and their primary goal was to agree on a ranking of a given list of items in order of their usefulness to survival.
Procedures. Senders (both truthful and deceptive) first ranked the given list of items and emailed their partner their rankings and explanations. Then, each naïve partner responded to the ranking with his or her own re-ranking and explanations. This procedure was repeated for three days, the only difference being that on day 2 and day 3, additional items on the list were rendered unsalvageable (e.g., the flashlight broken), thus forcing a reconsideration of the remaining items (see details in [21]).
3.2 Independent Variables
Deceptive Condition. There were two conditions: deception and truth. In the deception condition, a sender in a dyad was instructed to mislead the partner toward a ranking that differed from the sender's actual opinion; in the truth condition, a sender offered his or her true opinions. Of the 30 pairs, 14 were in the truth condition and 16 in the deception condition.
Time. Each of the three days in the experiment was treated as a time point. Thus, there were three times, labeled time 1, time 2, and time 3.
3.3 Dependent Variables
Nineteen dependent variables are grouped into five constructs: quantity, diversity, informality, affect, and non-immediacy, as shown in Table 1. Quantity represents the amount of message content produced, diversity indicates the diversity of wording, informality expresses the degree of informality of the messages produced, affect indicates the display of emotional affect in messages, and non-immediacy captures the indirectness of messages, which may prevent recipients from obtaining definite or affirmative information. The dependent variables were measured with the aid of a natural language processing tool [19].

Table 1. Summary of linguistic constructs and their component dependent variables

Quantity:        word, verb, modifier, noun phrase, sentence
Diversity:       lexical diversity, content diversity, redundancy
Informality:     typo ratio
Affect:          positive affect, negative affect
Non-immediacy:   passive voice, modal verb, objectification, uncertainty, generalizing term, self reference, group reference, other reference
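As a rough illustration of what such measures involve, the sketch below approximates a few of them (quantity, lexical diversity, and the three reference measures) with simple word lists and counts; the pronoun lists and regular expressions are assumptions, not the rules of the tool actually used in the study.

import re

SELF = {"i", "me", "my", "mine", "myself"}
GROUP = {"we", "us", "our", "ours", "ourselves"}
OTHER = {"he", "she", "they", "him", "her", "them", "his", "hers", "their", "theirs"}

def email_cues(text):
    # Approximate quantity, diversity, and the reference measures of non-immediacy.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(1, len(words))
    return {
        "words": len(words),                                  # quantity
        "sentences": len(sentences),                          # quantity
        "lexical_diversity": len(set(words)) / n,             # diversity
        "self_reference": sum(w in SELF for w in words) / n,  # non-immediacy
        "group_reference": sum(w in GROUP for w in words) / n,
        "other_reference": sum(w in OTHER for w in words) / n,
    }

print(email_cues("We should keep the mirror. I think it is more useful than the flashlight."))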
4 Results 4.1 Repeated Measures Analyses
A series of repeated measures analyses were conducted on deceivers' email messages for each of the five linguistic constructs to test Hypothesis 1, that deceivers' language changes over time. The results showed that deceivers changed quantity (Wilks' Λ = 0.0838; F(10, 6) = 6.645; partial η² = 91.7%) and diversity (Wilks' Λ = 0.166; F(6, 10) = 8.36; partial η² = 83.4%) over time; non-immediacy approached significance (partial η² = 39.2%) in showing change over time. The follow-up univariate analyses and post-hoc contrast analyses revealed that all individual measures of quantity decreased significantly (p < 0.005) and two measures in the diversity construct increased significantly (p < 0.001) over time. In addition, other reference in the non-immediacy construct showed a decreasing pattern (p < 0.1), declining continuously from time 1 to time 3. Thus, deceivers' language became briefer and more complex over time, with fewer pronouns referencing others as time passed. Therefore, Hypotheses 1(a) and 1(b) were strongly supported, Hypothesis 1(e) was weakly supported, and Hypotheses 1(c) and 1(d) were not supported.
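For readers who wish to reproduce this style of analysis, the following sketch runs a one-way repeated-measures ANOVA on a single cue across the three times using statsmodels' AnovaRM. The long-format table, column names, and values are hypothetical, and the study's actual analysis was multivariate with repeated contrasts.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per deceptive sender per day, with 'words'
# standing in for one of the quantity measures.
df = pd.DataFrame({
    "sender": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":   [1, 2, 3] * 4,
    "words":  [210, 150, 120, 180, 160, 110, 240, 170, 130, 200, 140, 115],
})

# One-way repeated-measures ANOVA: does word count change across the three days?
print(AnovaRM(data=df, depvar="words", subject="sender", within=["time"]).fit())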
4.2 Analyses of Variance
We conducted multiple ANOVAs of the effect of deceptive condition on each of the five linguistic constructs at each of the three times separately, in order to analyze whether the same cues distinguish deceivers from truth tellers over time.

Table 2. Univariate analysis results (p-values and direction of relationship) at the three times for the five constructs

Construct        Component variables
Quantity         word, verb, modifier, noun phrase, sentence
Diversity        lexical diversity, content diversity, redundancy
Informality      typo ratio
Affect           positive affect, negative affect
Non-immediacy    passive voice, modal verb, objectification, uncertainty, generalizing terms, self reference, group reference, other reference

+ = higher for deceivers; - = lower for deceivers
The results revealed that none of the constructs was effective in differentiating deceivers from truth tellers at all three times. At best, diversity was significant at time 1 (partial η² = 39%) and time 2 (Wilks' Λ = 0.712; F(3, 26) = 3.506; partial η² = 28.8%); quantity was significant at time 2 (partial η² = 35.5%) but only approached significance at time 1 (p = 0.093, partial η² = 31%); informality was significant only at time 2 (F(1, 28) = 6.824, p = 0.014, partial η² = 19.6%); affect approached significance at time 1 (F = 2.661, partial η² = 16.5%) and time 3 (partial η² = 16.1%); and non-immediacy was significant only at time 2 (F(1, 28) = 6.824, partial η² = 19.6%). Therefore, Hypothesis 2 was well supported.
As shown in Table 2, the follow-up univariate analyses for time 1 revealed that deceivers were higher than truth tellers on all quantity measures (p < 0.05), with verbs the most significant (p < 0.01) and sentences the least (p < 0.1); they were lower than truth tellers on lexical diversity (p < 0.01) and content diversity (p < 0.05); and they were higher than truth tellers on negative affect (p = 0.05). At time 2, the same cues
remained significant except for negative affect. In addition, more significant cues (all with p < 0.05) emerged: for example, deceivers used more modal verbs, used more group references, had a greater typo ratio, and tended to be lower on self-references. The only cue significant at time 3 was positive affect (p < 0.05), with deceivers being higher.
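The per-time univariate comparisons amount to testing each cue between conditions at a fixed time. A minimal sketch with scipy is shown below; the cue values are illustrative, and the study's analysis ran multivariate tests per construct before such univariate follow-ups.

import numpy as np
from scipy import stats

# Illustrative cue values (e.g., lexical diversity) at a single time point.
deceivers = np.array([0.52, 0.48, 0.55, 0.50, 0.47, 0.53])
truth_tellers = np.array([0.61, 0.58, 0.64, 0.60, 0.57, 0.63])

# Independent-samples t-test for one cue at one time.
t, p = stats.ttest_ind(deceivers, truth_tellers)
print("t = %.3f, p = %.3f" % (t, p))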
5 Discussion
The results in Section 4 reveal a pattern: the number of constructs and individual cues to deception started at a relatively high value at time 1, peaked at time 2 (the middle phase of the communication), and dropped abruptly at time 3 (the final phase of the communication). This indicates that: 1) the time of communication and/or task matters in differentiating deceivers from truth-tellers; and 2) deceivers tend to expose a fair number of deception cues at the beginning of communication, expose more over time, and then finish the communication with few cues exposed. Therefore, in order to identify deception in a continuous communication, we should either merge all the exchanged messages from the same person to look for overall deception patterns, or select certain times in the middle of the communication for further investigation.
It is also evident in Table 2 that deceivers showed more negative affect at time 1 but more positive affect at time 3, which suggests that deceivers may initially be overtaken by the negative arousal and cognitive load caused by deception. However, they gained better control of their affective display as communication continued. Finally, they were able to assume positive affect and leave a pleasant impression with their partners.
Deceivers changed the quantity, diversity, and non-immediacy of their language significantly over time; however, they maintained informality and affect at about the same level. A direct speculation on the latter is that either these two measures are too difficult for deceivers to manage strategically, or deceivers are very cautious to keep them consistent over time. It is reasonable to expect that deceivers do not intentionally produce typos, so the stability of informality is probably explained by the first possibility. In contrast, we would expect affect to change over time, since it is not natural for people to display the same level of affect all the time; the lack of change in affect may therefore result from deceivers' intentional control.
Taking together the significant effect of time on most of the dependent constructs found in this study and the effect of deception condition found in the prior study [21], we can clearly see that detecting deception is an extremely complex task involving many dynamics and contextual factors. However, automatic deception detection with accuracy beyond the level of chance is still a reachable goal as more effective cues become available.
Questions remain as to the external validity of these results for other tasks or contexts. This study suffers from the weaknesses associated with laboratory research and student samples. However, given the email setting, the laboratory condition is a close approximation of a real-life situation, for real deceivers would also have sufficient time to compose messages asynchronously and would not show their partners anything other than the text of the messages themselves.
The lack of common standards and structures in email language was fully reflected in this study. Some subjects did not give a full stop to their sentences until reaching the end of their messages, while others simply used phrases and fragments rather than complete sentences in their messages. This lack of structure is an important fact of email that CMC researchers have to face.
6 Conclusion
Our results confirmed that some cues to deception change over time. Others remained unchanged, which may be due to either lack of control or intentional control. None of the cues was effective in differentiating truth from deception across all time periods. In other words, the number and type of cues that can reliably distinguish deceivers from truth-tellers vary from time to time. Differentiation was relatively high at the beginning, peaked in the middle, and plummeted at the end of communication. These results are consistent with Interpersonal Deception Theory and Interaction Adaptation Theory, which postulate that deceivers intentionally adapt their communication over time as they gain greater control of their internal arousal and external behavior patterns and as they adapt to receiver feedback.
This study indicates that affect needs to be considered in identifying deception in email, even though it is implicitly embedded in messages rather than explicitly displayed as in face-to-face communication. With communicators physically distributed, deceivers may find it much easier to adjust their affective display over time. The study also suggests that cues to deception do matter, but that they interact with the time or phase of communication. Based on this study and the prior research, we conclude that matching cues to deception to the phase of communication is important for improving the performance of deception detection. Email is a popular communication medium with features that distinguish it from other communication types, so the importance of understanding deception patterns in email will be widely recognized. We believe that this study provides empirical evidence to support intelligent deception detection in cyberspace.
Acknowledgement & Disclaimer. Portions of this research were supported by funding from the U.S. Air Force Office of Scientific Research under the U.S. Department of Defense University Research Initiative (Grant #F49620-01-1-0394). The views, opinions, and/or findings in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
References 1. D. B. Buller and J. K. Burgoon, "Deception: Strategic and nonstrategic communication," in Strategic interpersonal communication, J. A. Daly and J. M. Wiemann, Eds. Hillsdale, NJ: Erlbaum, 1994, pp. 191–223.
2. D. B. Buller and J. K. Burgoon, "Interpersonal Deception Theory," Communication Theory, vol. 6, pp. 203–242, 1996.
3. J. K. Burgoon, D. B. Buller, A. S. Ebesu, and P. Rockwell, "Interpersonal deception V: Accuracy in deception detection," Communication Monographs, vol. 61, pp. 303–325, 1994.
4. J. Burgoon and D. B. Buller, "Interpersonal deception: XI. Effects of deceit on perceived communication and non-verbal behavior dynamics," Journal of Nonverbal Behavior, vol. 18, pp. 155–184, 1994.
5. J. K. Burgoon, D. E. Buller, L. K. Guerrero, W. A. Afifi, and C. M. Feldman, "Interpersonal Deception: XII. Information management dimensions underlying deceptive and truthful messages," Communication Monographs, vol. 63, pp. 52–69, 1996.
6. J. K. Burgoon, D. B. Buller, C. H. White, W. Afifi, and A. L. S. Buslig, "The role of conversational involvement in deceptive interpersonal interactions," Personality & Social Psychology Bulletin, vol. 25, pp. 669–685, 1999.
7. J. K. Burgoon, N. Miczo, and L. A. Miczo, "Adaptation during deceptive interactions: testing the effects of time and partner communication style," presented at the National Communication Association Convention, Atlanta, 2001.
8. D. Crystal, Language and the Internet. Cambridge: Cambridge University Press, 2001.
9. R. Daft and R. Lengel, "Organizational information, message richness and structural design," Management Science, vol. 32, pp. 554–571, 1986.
10. B. M. DePaulo, J. T. Stone, and G. D. Lassiter, "Deceiving and detecting deceit," in The Self and Social Life, B. R. Schlenker, Ed. New York: McGraw-Hill, 1985.
11. P. Ekman and M. O'Sullivan, "Who Can Catch a Liar?," American Psychologist, vol. 46, pp. 913–920, 1991.
12. J. F. George and J. R. Carlson, "Group support systems and deceptive communication," presented at HICSS-32, Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences, 1999.
13. E. Höfer, L. Akehurst, and G. Metzger, "Reality monitoring: a chance for further development of CBCA?," presented at the Annual Meeting of the European Association on Psychology and Law, Sienna, Italy, 1996.
14. P. E. Johnson, S. Grazioli, K. Jamal, and R. G. Berryman, "Detecting deception: adversarial problem solving in a low base-rate world," Cognitive Science, vol. 25, pp. 355–392, 2001.
15. R. E. Kraut, "Verbal and nonverbal cues in the perception of lying," Journal of Personality and Social Psychology, pp. 380–391, 1978.
16. S. Porter and J. C. Yuille, "The language of deceit: An investigation of the verbal clues to deception in the interrogation context," Law and Human Behavior, vol. 20, pp. 443–458, 1996.
17. M. Steller and G. Köhnken, "Criteria-Based Content Analysis," in Psychological Methods in Criminal Investigation and Evidence, D. C. Raskin, Ed. New York: Springer Verlag, 1989, pp. 217–245.
18. A. Vrij, K. Edward, K. P. Robert, and R. Bull, "Detecting deceit via analysis of verbal and nonverbal behavior," Journal of Nonverbal Behavior, pp. 239–264, 2000.
19. A. Voutilainen, "Helsinki taggers and parsers for English," in Corpora Galore: Analysis and Techniques in Describing English, J. M. Kirk, Ed. Amsterdam & Atlanta: Rodopi, 2000.
20. C. H. White and J. K. Burgoon, "Adaptation and communicative design: Patterns of interaction in truthful and deceptive conversation," Human Communication Research, vol. 27, pp. 9–37, 2001.
21. L. Zhou, D. Twitchell, T. Qin, J. Burgoon, and J. Nunamaker, "An Exploratory Study into Deception Detection in Text-based Computer-Mediated Communication," presented at the 36th Hawaii International Conference on System Sciences, Big Island, Hawaii, 2003.
Evacuation Planning: A Capacity Constrained Routing Approach

Qingsong Lu, Yan Huang, and Shashi Shekhar

Department of Computer Science and Engineering, University of Minnesota,
200 Union St SE, Minneapolis, MN 55455, USA
{lqingson,huangyan,shekhar}@cs.umn.edu
http://www.cs.umn.edu/research/shashi-group
Abstract. Evacuation planning is critical for applications such as disaster management and homeland defense preparation. Efficient tools are needed to produce evacuation plans that move populations to safety in the event of catastrophes, natural disasters, and terrorist attacks. Current optimal methods suffer from computational complexity and may not scale up to large transportation networks. Current naive heuristic methods do not consider the capacity constraints of the evacuation network and may not produce feasible evacuation plans. In this paper, we model capacity as a time series and use a capacity constrained heuristic routing approach to solve the evacuation planning problem. We propose two heuristic algorithms, the Single-Route Capacity Constrained Planner and the Multiple-Route Capacity Constrained Planner, that incorporate the capacity constraints of the routes. Experiments on a real building dataset show that our proposed algorithms can produce close-to-optimal solutions, with total evacuation time within 10 percent of the optimal solution, while reducing the computational cost to only half that of the optimal algorithm. The experiments also show that our algorithms are scalable with respect to the number of evacuees.
1 Introduction
Evacuation planning is critical for numerous important applications, e.g., emergency building evacuation, disaster management and recovery, and homeland defense preparation. Efficient tools are needed to produce evacuation plans that identify routes and schedules to evacuate populations to safety in the event of catastrophes, natural disasters, and terrorist attacks [8,3,4]. The current methods of evacuation planning can be divided into three categories, namely warning systems, linear programming approaches, and heuristic approaches. Warning
This work is supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory under contract number DAAD19-01-2-0014. The content does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. AHPCRC and the Minnesota Supercomputer Institute provided access to computing facilities.
systems simply convey threat descriptions and the need for evacuation to the affected people via mass-media communication. Such systems can have unanticipated effects on the evacuation process. For example, when Hurricane Andrew was approaching Florida and Louisiana in 1992, the affected population was simply asked to leave the area as soon as possible. This caused tremendous traffic congestion on highways and led to great confusion and chaos [1].
The second type of evacuation planning uses network flow and linear programming approaches. EVACNET [9,12,13] produces an optimal solution using linear programming methods. It has exponential running time and cannot be applied to large transportation networks. Hoppe and Tardos [10,11] gave the first, and so far the only, polynomial-time algorithm to compute an optimal solution for the evacuation problem. However, their algorithm uses the ellipsoid method, which suffers from high computational complexity and is therefore not practical to implement.
The third type of evacuation planning uses heuristic approaches to find evacuation plans. However, current naive heuristic approaches only compute the shortest-distance path from a source to the nearest exit, without considering route capacity constraints or traffic from other sources. They cannot produce efficient plans when the number of people to be evacuated is large and the route network is complex. New heuristic approaches are needed to account for the capacity constraints of the evacuation network.
A capacity constrained routing approach reserves route capacities, subject to capacity constraints, in an order specified by heuristics. We propose two new heuristic algorithms for capacity constrained routing, namely a single-route approach and a multiple-route approach. The first algorithm evacuates all the people from the same source via a single route by reserving route capacity in an order determined by pre-computed shortest path lengths. The second algorithm can assign multiple routes to groups of people from the same source, based on an order prioritized by shortest-travel-time path lengths re-calculated in each iteration. The multiple-route approach produces close-to-optimal solutions with significantly reduced computational time compared to optimal solution algorithms. It outperforms the single-route approach in solution quality because of its flexibility in choosing multiple routes, although it is computationally more expensive; the single-route approach can produce a solution for a large network in seconds. Experimental results on a large building dataset show that our proposed algorithms can produce close-to-optimal solutions, with total evacuation time within 10% of the optimal solution, while reducing the computational cost to only half that of the optimal algorithm. Our algorithms are also scalable with respect to the total number of people to be evacuated. To the best of our knowledge, this is the first paper exploring heuristic algorithms using capacity constrained routing for evacuation planning.
Outline: The rest of the paper is organized as follows. In Section 2, the problem formulation is provided and related concepts are illustrated with an example. Section 3 proposes two capacity constrained heuristic algorithms. The algorithm comparison and cost models are given in Section 4. In Section 5, we present
the experimental design and results. We summarize our work and discuss future directions in Section 6. Scope: The proposed algorithms cannot be applied directly to vehicle routing models in transportation networks that have intersection queuing delays and turn penalties.
2 Problem Formulation
The capacity constrained routing problem can be formulated as follows. Given a transportation network with capacity constraints, the initial number of people to be evacuated, their initial locations, and evacuation destinations, we need to produce evacuation route plans consisting of a set of origin-destination routes and a scheduling of people to be evacuated via the routes. The objective is to minimize the total time needed for evacuation. The scheduling of people onto the routes should observe the route capacity constraints. A secondary objective is to minimize the computational overhead of producing the evacuation plan. We illustrate the problem formulation and a solution with the following example. Suppose we have a simple two-story building, as shown in Figure 1 (floor map from [13]). In this building, there are two rooms on the second floor, two staircases, and one room and two exits on the first floor.
Fig. 1. Building Floor Map with Node and Edge Definition
This building is modelled as a node-edge graph, as shown in Figure 2. In this model, each room, corridor, staircase, and exit of the building is represented as a node, shown as an ellipse. Each node has two attributes: maximum node capacity and initial node occupancy. For example, node N1, which represents Room 201 in the building, has a maximum capacity of 50, which means Room 201 can hold at most 50 people, and an initial occupancy of 10, which means there are initially 10 people in this room to be evacuated. Each pathway from one node to another is represented as an edge, shown by arrows between two nodes in Figure 2. Each edge also has two attributes: maximum edge capacity and travel time. For example, edge N1-N3, which represents the path linking Room 201 and the corridor, has a maximum capacity of 7, which means at most 7 people can travel from Room 201 to the corridor simultaneously, and a travel time of 1, which means it takes 1 time unit to travel from the room to the corridor. This approach of modelling a building floor map with capacities as a node-edge graph is similar to those presented in [13,5].
Fig. 2. Node-Edge Graph Model of Example Building
As shown in Figure 2, suppose we initially have 10 people at node N1, 5 at node N2, and 15 at node N8. The task is to compute an evacuation plan that evacuates the 30 people to the exits (N13 and N14) using the least amount of time.
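Before describing the algorithms, it may help to see how such a network and its time-series capacities can be represented. The sketch below uses the attribute values given in the text for N1, edge N1-N3, and the N8-N10-N13 route of the worked example; the remaining capacities are placeholders, and treating exits as having unlimited node capacity is an assumption.

from collections import defaultdict

HORIZON = 30  # number of discrete time units tracked

# Node attributes: (maximum capacity, initial occupancy). N1 is from the text; the
# other maximum capacities are placeholders, and exits are treated as unlimited.
nodes = {
    "N1": (50, 10),     # Room 201: capacity 50, 10 evacuees initially
    "N2": (50, 5),      # capacity is a placeholder; 5 evacuees initially
    "N8": (50, 15),     # capacity is a placeholder; 15 evacuees initially
    "N13": (None, 0),   # EXIT1
    "N14": (None, 0),   # EXIT2
}

# Edge attributes: (maximum capacity, travel time). N1-N3 is from the text; the
# N8-N10 capacity and the travel times along N8-N10-N13 follow the worked example,
# and the N10-N13 capacity is a placeholder.
edges = {
    ("N1", "N3"): (7, 1),
    ("N8", "N10"): (6, 3),
    ("N10", "N13"): (6, 1),
}

# Available capacity modelled as a time series: one entry per time unit.
avail_edge = {e: [cap] * HORIZON for e, (cap, _) in edges.items()}
avail_node = defaultdict(lambda: [float("inf")] * HORIZON)
for name, (cap, _) in nodes.items():
    if cap is not None:
        avail_node[name] = [cap] * HORIZON

travel_time = {e: tt for e, (_, tt) in edges.items()}
print(avail_edge[("N8", "N10")][:5])   # available capacity of edge N8-N10 at times 0..4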
Example 1 (An Evacuation Plan). Table 1 shows an evacuation plan. In the table, each row shows one group of people moving together during the evacuation, with a group ID, the number of people in the group, the origin node, the start time, the evacuation route, and the exit time. Take node N8 for example: initially there are 15 people at N8. They are divided into 3 groups: Group A with 6 people, Group B with 6 people, and Group C with 3 people. Group A starts at time 0, follows route N8-N10-N13, and reaches EXIT1 (N13) at time 4. Group B starts at time 1, also follows route N8-N10-N13, and reaches EXIT1 (N13) at time 5. Group C starts at time 0, follows route N8-N11-N14, and reaches EXIT2 (N14) at time 4. The procedure is similar for the people from N1 and N2. The whole evacuation takes 16 time units, since the last groups of people (Groups F and J) reach an exit at time 16.
Table 1. Evacuation Plan Example

ID   Origin   No. of People   Start Time
A    N8       6               0
B    N8       6               1
C    N8       3               0
D    N1       3               0
E    N1       3               1
F    N1       3               2
G    N1       1               0
H    N2       3               0
I    N2       2               1
We use a capacity constrained routing approach to conduct the evacuation planning. We model available edge capacity and available node capacity as time series instead of fixed numbers. A time series represents the available capacity at each time instant for a given edge or node. We propose an approach based on extensions of shortest path algorithms [7,6] to account for route scheduling with capacity constraints, and we propose two heuristic algorithms to compute the evacuation plan.
3.1 Single-Route Capacity Constrained Planner (SRCCP)
In the Single-Route Capacity Constrained Planner (SRCCP) algorithm, first, the shortest routes from each source to any destination are pre-computed. Next, capacities are reserved along the pre-computed routes by reducing available node
and edge capacities at certain time points along the route. The detailed pseudo-code and algorithm description are as follows.

Algorithm 1 Single-Route Capacity Constrained Planner (SRCCP)
Input:
1) G(N, E): a graph G with a set of nodes N and a set of edges E;
   Each node n ∈ N has two properties:
      Maximum_Node_Capacity(n): non-negative integer
      Initial_Node_Occupancy(n): non-negative integer
   Each edge e ∈ E has two properties:
      Maximum_Edge_Capacity(e): non-negative integer
      Travel_time(e): non-negative integer
2) S: set of source nodes, S ⊆ N;
3) D: set of destination nodes, D ⊆ N;
Output: Evacuation plan
Method:
for each source node s ∈ S do                                                   (1)
   find the shortest-time route R_s = <n_0, n_1, ..., n_k> among routes from s
      to all destinations d ∈ D (where n_0 = s and n_k = d);                    (2)
Sort routes R_s by total travel time, in increasing order;                      (3)
for each route R_s in sorted order do {                                         (4)
   Initialize next start node on route R_s to move: st = 0;                     (5)
   while not all evacuees from n_0 have reached n_k do {                        (6)
      t = next available time to start moving from node n_st;                   (7)
      n_end = furthest node that can be reached from n_st without stopping;     (8)
      flow = min( number of evacuees at node n_st,
                  Available_Edge_Capacity(all edges between n_st and n_end on R_s),
                  Available_Node_Capacity(all nodes from n_st+1 to n_end on R_s) );   (9)
      for i = st to end - 1 do {                                                (10)
         t' = t + Travel_time(e(n_i, n_i+1));                                   (11)
         Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;             (12)
         Available_Node_Capacity(n_i+1, t') reduced by flow;                    (13)
         t = t';                                                                (14)
      }                                                                         (15)
      st = closest node to the destination on route R_s that still has evacuees;   (16)
   }                                                                            (17)
}                                                                               (18)
Postprocess results and output evacuation plan;                                 (19)

In the first step (lines 1-2), for each source node s, we find the route R_s with the shortest total travel time among the routes between s and all the destination nodes. The total travel time of route R_s is the sum of the travel times of all edges on R_s. For example, in Figure 2, R_N1 is N1-N3-N4-N6-N10-N13 with a total travel time of 14 time units, R_N2 is N2-N3-N4-N6-N10-N13 with a total travel time of 14 time units, and R_N8 is N8-N10-N13 with a total travel time of 4 time units. This step is done with a variation of Dijkstra's algorithm [7] in which edge travel time
is treated as edge weight and the algorithm terminates when the shortest route from s to one destination node is determined.
The second step (line 3) is to sort the routes obtained in step 1 in increasing order of total travel time. Thus, in our example, the order of the routes will be R_N8, R_N1, R_N2.
The third step (lines 4-18) is to reserve capacities for each route in the sorted order. The reservation for route R_s is done by sending all the people initially at node s to the exit along the route in the least amount of time. The people may need to be divided into groups and sent in waves due to the capacity constraints of the nodes and edges on R_s. For example, for R_N8, the first group of people that starts from N8 at time 0 contains at most 6 people because the available edge capacity of N8-N10 at time 0 is 6. The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they occupy that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, because the 6 people travel through this edge starting at time 0; for node N10 at time 3, because they arrive at N10 at time 3; and for edge N10-N13 at time 3, because they travel through this edge starting at time 3. They finally arrive at N13 (EXIT1) at time 4. The second group of people leaving N8 has to wait until time 1, since the first group has reserved all the capacity of edge N8-N10 at time 0. Therefore, the second group leaves N8 at time 1 and reaches N13 at time 5. Similarly, the last group of 3 people leaves N8 at time 2 and reaches N13 at time 6. Thus all people from N8 are sent to exit N13. The next two routes, R_N1 and R_N2, make their reservations based on the available capacities left by the previous routes. The final step of the algorithm is to output the entire evacuation plan, as shown in Table 2, which takes 18 time units.

Table 2. Result Evacuation Plan of the Single-Route Capacity Constrained Planner

ID   Origin   No. of People   Start Time
A    N8       6               0
B    N8       6               1
C    N8       3               2
D    N1       3               0
E    N1       3               0
F    N1       1               0
G    N1       2               1
H    N1       1               1
I    N2       2               0
J    N2       3               0
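The reservation bookkeeping used by both planners can be sketched as follows, using the same dictionary-based time-series structures as the earlier sketch. This is a simplified illustration of lines 9-14 of Algorithm 1, not the authors' implementation.

# Minimal setup (same shape as the earlier sketch): capacities indexed by time unit.
HORIZON = 30
travel_time = {("N8", "N10"): 3, ("N10", "N13"): 1}
avail_edge = {e: [6] * HORIZON for e in travel_time}
avail_node = {n: [float("inf")] * HORIZON for n in ("N8", "N10", "N13")}

def max_flow_on_route(route, start_time, at_source):
    # Largest group that can leave at start_time, limited by the people waiting at the
    # source and by the edge/node capacities encountered along the route.
    flow, t = at_source, start_time
    for u, v in zip(route, route[1:]):
        flow = min(flow, avail_edge[(u, v)][t])
        t += travel_time[(u, v)]
        flow = min(flow, avail_node[v][t])
    return max(0, flow)

def reserve_route(route, start_time, flow):
    # Edge capacity is consumed at the time the group enters each edge, node capacity
    # at the time the group arrives at the next node (pseudocode lines 11-14).
    t = start_time
    for u, v in zip(route, route[1:]):
        avail_edge[(u, v)][t] -= flow
        t_arrive = t + travel_time[(u, v)]
        avail_node[v][t_arrive] -= flow
        t = t_arrive
    return t   # time at which the group reaches the end of the route

# Example: send the first wave of evacuees from N8 along N8-N10-N13 starting at time 0.
route = ("N8", "N10", "N13")
group = max_flow_on_route(route, 0, 15)
print(group, reserve_route(route, 0, group))   # expected: group of 6, arriving at time 4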
3.2 Multiple-Route Capacity Constrained Planner (MRCCP)
The Multiple-Route Capacity Constrained Planner (MRCCP) is an iterative approach. In each iteration, the algorithm re-computes the earliest-arrival-time route from any source to any destination, taking the previous reservations and possible on-route waiting time into consideration. It then reserves the capacities for the route chosen in the current iteration. The detailed pseudo-code and algorithm description are as follows.

Algorithm 2 Multiple-Route Capacity Constrained Planner (MRCCP)
Input:
1) G(N, E): a graph G with a set of nodes N and a set of edges E;
   Each node n ∈ N has two properties:
      Maximum_Node_Capacity(n): non-negative integer
      Initial_Node_Occupancy(n): non-negative integer
   Each edge e ∈ E has two properties:
      Maximum_Edge_Capacity(e): non-negative integer
      Travel_time(e): non-negative integer
2) S: set of source nodes, S ⊆ N;
3) D: set of destination nodes, D ⊆ N;
Output: Evacuation plan
Method:
while any source node s ∈ S has evacuees do {                                   (1)
   find the route R = <n_0, n_1, ..., n_k> with the earliest destination arrival time
      among routes between all s,d pairs, where s ∈ S, d ∈ D, n_0 = s, n_k = d;  (2)
   flow = min( number of evacuees still at source node s,
               Available_Edge_Capacity(all edges on route R),
               Available_Node_Capacity(all nodes from n_1 to n_k on route R) );  (3)
   for i = 0 to k - 1 do {                                                      (4)
      t' = t + Travel_time(e(n_i, n_i+1));                                      (5)
      Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;                (6)
      Available_Node_Capacity(n_i+1, t') reduced by flow;                       (7)
      t = t';                                                                   (8)
   }                                                                            (9)
}                                                                               (10)
Postprocess results and output evacuation plan;                                 (11)
The MRCCP algorithm keeps iterating as long as there are still evacuees at any source node (line 1). Each iteration starts by finding the route R with the earliest destination arrival time from any source node to any exit node based on the current available capacities (line 2). This is done by generalizing Dijkstra's shortest path algorithm [7] to work with the time-series capacities and edge travel times. Route R is the route that reaches an exit in the least
Table 3. Result Evacuation Plan of the Multiple-Route Capacity Constrained Planner

Group ID   Origin   No. of People   Start Time
A          N8       6               0
B          N8       6               1
C          N8       3               0
D          N1       3               0
E          N1       3               1
F          N1       3               0
G          N1       1               2
H          N1       3               1
I          N2       2               2
amount of time and through which at least one person can be sent to the exit. For example, at the very first iteration, R will be N8-N10-N13, which reaches N13 at time 4. The actual number of people that will travel through R is the smallest number among the number of evacuees at the source node and the available capacities of each of the nodes and edges on route R (line 3). Thus, in the example, this amount will be 6, which is the available edge capacity of N8-N10 at time 0. The next step is to reserve capacities for these people on each node and edge of route R (lines 4-9). The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they occupy that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, for node N10 at time 3, and for edge N10-N13 at time 3. They finally arrive at N13 (EXIT1) at time 4. Then, the algorithm goes back to line 2 for the next iteration. The iterations terminate when the occupancy of all source nodes is reduced to zero, which means all evacuees have been sent to exits. Line 11 outputs the evacuation plan, as shown in Table 3.
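A compact sketch of how such an iteration could be implemented is given below. This is not the authors' code: it uses a Dijkstra-style search over (node, time) states that allows waiting, it ignores the node capacity consumed while waiting (which the full algorithm accounts for), and it represents available capacities as time-indexed dictionaries; all names are hypothetical.

```python
import heapq
from collections import defaultdict

def earliest_arrival_route(graph, travel_time, edge_cap, node_cap,
                           avail_edge, avail_node, sources, exits, horizon=500):
    """Generalized Dijkstra over (node, time) states: a group may traverse an
    edge if it still has capacity at the departure time and the next node has
    room at the arrival time, or wait one time unit. Returns the earliest
    arrival path as [(node, time), ...] ending at an exit, or None."""
    heap = [(0, s, [(s, 0)]) for s in sources]
    heapq.heapify(heap)
    seen = set()
    while heap:
        t, u, path = heapq.heappop(heap)
        if u in exits:
            return path
        if (u, t) in seen or t > horizon:
            continue
        seen.add((u, t))
        heapq.heappush(heap, (t + 1, u, path[:-1] + [(u, t + 1)]))  # wait at u
        for v in graph[u]:
            if avail_edge[(u, v)].get(t, edge_cap[(u, v)]) <= 0:
                continue                       # edge fully reserved at time t
            ta = t + travel_time[(u, v)]
            if avail_node[v].get(ta, node_cap[v]) <= 0:
                continue                       # no room at v on arrival
            heapq.heappush(heap, (ta, v, path + [(v, ta)]))
    return None

def mrccp(graph, travel_time, edge_cap, node_cap, occupancy, exits):
    """Iterate: pick the earliest-arrival route under current reservations,
    send as many evacuees as its bottleneck allows, book the capacities, and
    repeat until all sources are empty (cf. lines 1-10 of Algorithm 2)."""
    avail_edge, avail_node = defaultdict(dict), defaultdict(dict)
    plan = []
    while any(occupancy.values()):
        sources = [s for s, k in occupancy.items() if k > 0]
        path = earliest_arrival_route(graph, travel_time, edge_cap, node_cap,
                                      avail_edge, avail_node, sources, exits)
        if path is None:
            break
        s = path[0][0]
        flow = occupancy[s]
        for (u, tu), (v, tv) in zip(path, path[1:]):
            flow = min(flow, avail_edge[(u, v)].get(tu, edge_cap[(u, v)]),
                       avail_node[v].get(tv, node_cap[v]))
        for (u, tu), (v, tv) in zip(path, path[1:]):
            avail_edge[(u, v)][tu] = avail_edge[(u, v)].get(tu, edge_cap[(u, v)]) - flow
            avail_node[v][tv] = avail_node[v].get(tv, node_cap[v]) - flow
        occupancy[s] -= flow
        plan.append((s, flow, path))
    return plan
```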
4 Comparison and Cost Models of the Two Algorithms
The key difference between the two algorithms is that the SRCCP algorithm produces only a single route for each source node, while MRCCP can produce multiple routes for the groups of people at each source node. MRCCP can produce evacuation plans with shorter evacuation time than SRCCP because of its flexibility in adapting to the capacities still available after previous reservations. However, MRCCP needs to re-compute the earliest-arrival route in each iteration, which incurs more computational cost than SRCCP. We now provide simple algebraic cost models for the computational cost of the two proposed heuristic algorithms. We assume the total number of nodes in the graph is n, the number of source nodes is n_s, and the number of groups generated in the resulting evacuation plan is n_g.
The cost of the SRCCP algorithm consists of three parts: the cost of computing the shortest-time route from each source node to any exit node, denoted by C_sp; the cost of sorting all the pre-computed routes by their total travel time, denoted by C_ss; and the cost of reserving capacities along each route for each group of people, denoted by C_sr. The cost model of the SRCCP algorithm is given as follows:

Cost_SRCCP = C_sp + C_ss + C_sr = O(n_s · n log n) + O(n_s log n_s) + O(n · n_g)    (1)

The MRCCP algorithm is an iterative approach. In each iteration, the route for one group of people is chosen and the capacities along the route are reserved. The total number of iterations is determined by the number of groups generated. In each iteration, the route with the earliest destination arrival time from each source node to any exit node is re-computed with cost O(n_s · n log n), and reservations are made for the node and edge capacities along the chosen route with cost O(n). The cost model of the MRCCP algorithm is given as follows:

Cost_MRCCP = O((n_s · n log n + n) · n_g)    (2)
In both cost models, the number of groups generated for the evacuation plan depends on the network configuration, which includes the maximum capacities of nodes and edges and the number of people to be evacuated at each source node.
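To make the comparison concrete, the two cost expressions can be evaluated for illustrative problem sizes. The constants dropped by the O-notation and the particular numbers below are hypothetical, chosen only to show that the MRCCP estimate exceeds the SRCCP estimate by roughly a factor of n_g (up to lower-order terms).

```python
import math

def cost_srccp(n, ns, ng):
    # O(ns * n log n) + O(ns log ns) + O(n * ng), with all constants set to 1
    return ns * n * math.log2(n) + ns * math.log2(ns) + n * ng

def cost_mrccp(n, ns, ng):
    # O((ns * n log n + n) * ng), with all constants set to 1
    return (ns * n * math.log2(n) + n) * ng

# Hypothetical sizes loosely inspired by the building dataset in Section 5.
n, ns, ng = 444, 130, 300
print(cost_srccp(n, ns, ng), cost_mrccp(n, ns, ng))
```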
5 Solution Quality and Performance Evaluation
In this section, we present the experiment design, our experiment setup, and the results of our experiments on a building dataset.
5.1 Experiment Design
Figure 3 describes the experimental design used to evaluate the impact of parameters on the algorithms. The purpose is to compare the solution quality and the computational cost of the two proposed algorithms with those of EVACNET, which produces an optimal solution. First, a test dataset representing a building layout or road network is chosen or generated. The dataset is an evacuation network characterized by its route capacities and its size (number of nodes and edges). Next, a generator is used to produce the initial state of the evacuation by populating the network with a distribution model that assigns people to source nodes. The initial state is converted to the EVACNET input format to produce an optimal solution via EVACNET, and to the node-edge graph format to evaluate the two proposed heuristic algorithms. The solution qualities and algorithm running times are then compared in the analysis module.
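The generator step of this design can be sketched as follows. This is a hypothetical illustration rather than the authors' tool; it uses a uniform random distribution model, and the source node ratio and occupancy ratio it takes as input are defined in Section 5.2.

```python
import random

def generate_initial_state(node_capacity, source_node_ratio, occupancy_ratio, seed=0):
    """Randomly choose source nodes and distribute people among them.
    node_capacity: dict mapping node id -> maximum node capacity."""
    rng = random.Random(seed)
    nodes = list(node_capacity)
    n_sources = max(1, round(source_node_ratio * len(nodes)))
    sources = rng.sample(nodes, n_sources)
    total_people = round(occupancy_ratio * sum(node_capacity.values()))
    occupancy = {v: 0 for v in nodes}
    for _ in range(total_people):
        # place each person at a random source node that still has room
        candidates = [s for s in sources if occupancy[s] < node_capacity[s]]
        if not candidates:
            break
        occupancy[rng.choice(candidates)] += 1
    return sources, occupancy
```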
Fig. 3. Experiment Design. (Block diagram: a test dataset representing a building layout or road network, characterized by route capacity and the number of nodes and edges, is fed, together with the number of people and an initial people location distribution model, into a generator that produces the initial state of the building or road network; the initial state is converted to the node-edge model for Algorithm 1 (Solution 1, Running Time 1) and Algorithm 2 (Solution 2, Running Time 2), and to the EVACNET model for the optimal solution (Running Time 3); all solutions and running times feed into the analysis module.)
5.2 Experiment Setup and Results
The test dataset used in the following experiments is the floor-map of Elliott Hall, a 6-story building on the University of Minnesota campus. The dataset network consists of 444 nodes (including 5 exit nodes), 475 edges, and a total node capacity of 3783 people. The generator produces initial states by varying the source node ratio and the occupancy ratio from 10% to 100%. The experiments were conducted on a workstation with an Intel Pentium III 1.2GHz CPU, 256MB RAM, and the Windows 2000 Professional operating system. The initial state generator distributes P_n people to S_n randomly chosen source nodes. The source node ratio is defined as S_n divided by the total number of nodes, and the occupancy ratio is defined as P_n divided by the total capacity of all nodes. We want to answer two questions: (1) How does the distribution of people affect the performance and solution quality of the algorithms? (2) Are the algorithms scalable with respect to the number of people to be evacuated?

Experiment 1: Effect of People Distribution. The purpose of the first experiment is to evaluate how the distribution of people affects the quality of the solution and the performance of the algorithms. We fixed the occupancy ratio and varied the source node ratio to observe the solution quality and the running time of the two proposed algorithms and EVACNET. The experiment was repeated with the occupancy ratio fixed at values from 10% to 100% of total capacity. Here we present the results with the occupancy ratio fixed at 30% and the source node ratio varying from 30% to 100%, which are typical of all test cases. Figure 4 shows the total evacuation time given by the three algorithms and Figure 5 shows their running times. As seen in Figure 4, at each source node ratio, MRCCP produces a solution whose total evacuation time is at most 10% longer than the optimal solution produced by EVACNET. The solution quality of MRCCP is not affected by the distribution of people when the total number of people is fixed. For SRCCP, the solution is 59% longer than the EVACNET optimal solution when the source node ratio is 30%, and drops to 29% longer when the source node ratio increases to 100%. This shows that the solution quality of SRCCP improves as the source node ratio increases. In Figure 5, we can see that the running time of EVACNET grows
Fig. 4. Quality of Solution With Respect to Source Node Ratio. (Total evacuation time vs. source node ratio, 30% to 100%, for SRCCP, MRCCP, and EVACNET.)
Fig. 5. Running Time With Respect to Source Node Ratio. (Running time in seconds vs. source node ratio, 30% to 100%, for SRCCP, MRCCP, and EVACNET.)
much faster than the running time of SRCCP and MRCCP when the source node ratio increases. This experiment shows: (1) SRCCP produces solutions closer to the optimal solution when the source node ratio is higher; (2) MRCCP produces close-to-optimal solutions (less than 10% longer than optimal) in less than half the running time of EVACNET; (3) the distribution of people does not affect the performance of the two proposed algorithms when the total number of people is fixed.

Experiment 2: Scalability with Respect to Occupancy Ratio. In this experiment, we evaluated the performance of the algorithms when the source node ratio is fixed and the occupancy ratio increases. Figure 6 and Figure 7 show the total evacuation time and the running time of the three algorithms when the source node ratio is fixed at 70% and the occupancy ratio varies from 10% to 70%, which is a typical case among all test cases. As seen in Figure 6, compared with the optimal solution by EVACNET, the solution quality of SRCCP decreases when the occupancy ratio increases, while the solution quality of MRCCP remains within 10% of the optimal solution. In Figure 7, the running time of EVACNET grows significantly when the occupancy
Fig. 6. Quality of Solution With Respect to Occupancy Ratio. (Total evacuation time vs. occupancy ratio, 10% to 70%, for SRCCP, MRCCP, and EVACNET.)
Fig. 7. Running Time With Respect to Occupancy Ratio. (Running time in seconds vs. occupancy ratio, 10% to 70%, for SRCCP, MRCCP, and EVACNET.)
ratio grows, while the running time of MRCCP remains less than half that of EVACNET and grows only linearly. This experiment shows: (1) the solution quality of SRCCP goes down when the total number of people increases; (2) MRCCP is scalable with respect to the number of people.
6 Conclusion and Future Work
In this paper, we proposed and evaluated two heuristic algorithms for the capacity constrained routing approach. Cost models and experimental evaluations using a real building dataset are presented. The proposed SRCCP algorithm produces a plan instantly, but the quality of its solution suffers as the number of evacuees grows. The MRCCP algorithm produces solutions within 10% of the optimal solution, while its running time is scalable with respect to the number of evacuees and is less than half that of the optimal algorithm. Both algorithms are scalable with respect to the number of evacuees. Currently, we choose the shortest travel time route without considering the available capacity of the route. In many cases, a longer route with larger available capacity may be a better choice. In our future work, we
would like to explore heuristics that rank routes based on weighted available capacity and travel time when choosing the best routes. We also want to extend and apply our approach to vehicle evacuation in transportation road networks. Modelling vehicle traffic during evacuation is more complicated than modelling pedestrian movements in building evacuation, because modelling vehicle traffic at intersections and the cost of taking turns are challenging tasks. Current vehicle traffic simulation tools, such as DYNASMART [14] and DYNAMIT [2], use an assignment-simulation method to simulate the traffic based on origin-destination routes. We plan to extend our approach to work with such traffic simulation tools to address vehicle evacuation problems.

Acknowledgment. We are particularly grateful to the Spatial Database Group members for their helpful comments and valuable discussions. We would also like to express our thanks to Kim Koffolt for improving the readability of this paper. This work is supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory, under contract number DAAD19-01-2-0014. The content does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. AHPCRC and the Minnesota Supercomputer Institute provided access to computing facilities.
References
1. Hurricane Evacuation web page. http://i49south.com/hurricane.htm, 2002.
2. M. Ben-Akiva et al. Development of Dynamic Traffic Assignment System for Planning Purposes: DynaMIT User's Guide. ITS Program, MIT, 2002.
3. S. Brown. Building America's Anti-Terror Machine: How Infotech Can Combat Homeland Insecurity. Fortune, pages 99–104, July 2002.
4. The Volpe National Transportation Systems Center. Improving Regional Transportation Planning for Catastrophic Events (FHWA). Volpe Center Highlights, pages 1–3, July/August 2002.
5. L. Chalmet, R. Francis, and P. Saunders. Network Model for Building Evacuation. Management Science, 28:86–105, 1982.
6. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
7. E.W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
8. ESRI. GIS for Homeland Security, An ESRI white paper. http://www.esri.com/library/whitepapers/pdfs/homeland security wp.pdf, November 2001.
9. R. Francis and L. Chalmet. A Negative Exponential Solution To An Evacuation Problem. Research Report No. 84-86, National Bureau of Standards, Center for Fire Research, October 1984.
10. B. Hoppe and E. Tardos. Polynomial Time Algorithms For Some Evacuation Problems. Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 433–441, 1994.
11. B. Hoppe and E. Tardos. The Quickest Transshipment Problem. Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 512–521, January 1995.
12. T. Kisko and R. Francis. EVACNET+: A Computer Program to Determine Optimal Building Evacuation Plans. Fire Safety Journal, 9:211–222, 1985.
13. T. Kisko, R. Francis, and C. Nobel. EVACNET4 User's Guide. University of Florida, http://www.ise.ufl.edu/kisko/files/evacnet/, 1998.
14. H.S. Mahmassani et al. Development and Testing of Dynamic Traffic Assignment and Simulation Procedures for ATIS/ATMS Applications. Technical Report DTFH61-90-R-00074-FG, CTR, University of Texas at Austin, 1994.
Locating Hidden Groups in Communication Networks Using Hidden Markov Models

Malik Magdon-Ismail(1), Mark Goldberg(1), William Wallace(2), and David Siebecker(1)

(1) CS Department, RPI, Rm 207 Lally, 110 8th Street, Troy, NY 12180, USA. {magdon,goldberg,siebed}@cs.rpi.edu
(2) DSES Department, RPI, 110 8th Street, Troy, NY 12180, USA. [email protected]
Abstract. A communication network is a collection of social groups that communicate via an underlying communication medium (for example, newsgroups over the Internet). In such a network, a hidden group may try to camouflage its communications amongst the typical communications of the network. We study the task of detecting such hidden groups given only the history of the communications for the entire communication network. We develop a probabilistic approach using a Hidden Markov model of the communication network. Our approach does not require the use of any semantic information regarding the communications. We present the general probabilistic model, and show the results of applying this framework to a simplified society. For 50 time steps of communication data, we can obtain greater than 90% accuracy in detecting both whether or not there is a hidden group and who the hidden group members are.
1 Introduction
The tragic events of September 11, 2001 underline the need for a tool which is capable of detecting groups that hide their existence and functionality within a large and complicated communication network such as the Internet. In this paper, we present an approach to identifying such groups. Our approach does not require the use of any semantic information pertaining to the communications. This is preferable because communication within a hidden group is usually encrypted in some way, hence the semantic information will be misleading, or unavailable. Social science literature has developed a number of theories regarding how social groups evolve and communicate, [1,2,3]. For example, individuals have a higher tendency to communicate if they are members of the same group, in accordance with homophily theory. Given some of the basic laws of how social groups evolve and communicate, one can construct a model of how the communications within the society should evolve, given the (assumed) group structure. If the group structure does not adequately explain the observed communications, but the addition of an extra, hidden, group does explain them, then we
have grounds to believe that there is a hidden group attempting to camouflage its communications within the existing communication network. The task is to determine whether such a group exists, and identify its members. We use a maximum likelihood approach to solving this task. Our approach is to model the evolution of a communication network using a Hidden Markov Model. A Hidden Markov model is appropriate when an observed process (in our case the macroscopic communication structure) is naturally driven by an unobserved, or hidden, Markov process (in our case the microscopic group evolution). Hidden Markov models have been used extensively in such diverse areas as: speech recognition, [4,5]; inferring the language of simple grammars [6]; computer vision, [7]; time series analysis, [8]; biological sequence analysis and protein structure prediction, [9,10,11,12,13]. Our interpretation of the group evolution giving rise to the observed macroscopic communications evolution makes it natural to model the evolution of communication networks using a Hidden Markov model as well. Details about the general theory of Hidden Markov models can be found in [4,14,15]. In social network analysis there are many static models of, and static metrics for, the measurement and evaluation of social networks [16]. These models range from graph structures to large simulations of agent behavior. The models have been used to discover a wide array of important communication and sociological phenomena, from the small world principle [17] to communication theories such as homophily and contagion [1]. These models, as good as they are, are not sufficient to study the evolution of social groups and the communication networks that they use; most focus on the study of the evolution of the network itself. Few attempt to explain how the use of the network shapes its evolution [18]. Few can be used to predict the future of the network and communication behavior over that network. Though there is an abundance of simulation work in the field of computational analysis of social and organizational systems [2,19,3] that attempts to develop dynamic models for social networks, none have employed the proposed approach and few incorporate sound probability theory or statistics [20] as the underlying model. The outline of the paper is as follows. First we consider a simplified example, followed by a description of the general framework. We also present some results to illustrate proof of concept on an example, and we end with some concluding remarks.
1.1 Example
A simple, concrete example will help to convey the details of our method. A more detailed formulation will follow. Consider newsgroups, for example alt.revisionism or alt.movies. A posting to a newsgroup in reply to a previous posting is a communication between two parties. Now imagine the existence of a hidden group that attempts to hide its communications, as illustrated in the figure below. Figure 1(a) shows the group structure. There are 4 observed groups. A fifth, hidden group also exists, whose members are unshaded. We do not observe the actual group composition, but rather the communications (who is posting and
Fig. 1. Illustration of a society. (Panels (a)-(c): (a) the group structure, with four observed groups labelled 1-4 and a fifth hidden group whose members are unshaded, X being the member who initiates a broadcast; (b) communications between members of the same observed group; (c) the hidden group's broadcast communications.)

Fig. 2. Communication time series of two societies. (Communication graphs at t = 1, . . . , 5 for a society with one hidden group and for a society with no hidden group.)
replying to posts in a given newsgroup). This is illustrated in Figure 1(b), where all the communications are between members of the same group. Figure 1(c) illustrates the situation when the hidden group members need to broadcast some information among themselves. The hidden group member who initiates the broadcast (say X) communicates with all the other hidden group members who are in the same visible groups as X. The message is then passed on in a similar manner until all the hidden members have received the broadcast. Notice that no communication needs to occur between members who are not in the same group, yet a message can be broadcast across the whole group. In order to maintain the appearance of being a bona-fide member of a particular newsgroup, a hidden node will participate in the "normal" communications of that group as well. Only occasionally will a message need to be broadcast through the hidden group, resulting in a communication graph as in Figure 1(c). The matter is complicated by the fact that the communications in Figure 1(c) will be overlayed onto the normal group communications, Figure 1(b). What we observe is a time
series of node-to-node communications, as illustrated in Figure 2, which shows the evolving communications of two hypothetical communities. The individuals are represented by nodes in the graph. An edge between two nodes represents communication during that time period. The thickness of the edge indicates the intensity of the communications. The dotted lines indicate communications between the hidden group members. The task is to take the communication history of the community (for example the one above) and to determine whether or not there exists a hidden group functioning within this community, and to identify its members. It would also be useful to identify which members belong to which groups. The hidden community may or may not be functioning as an aberrant group trying to camouflage its communications. In the above example, the hidden community is trying to camouflage its broadcasts. However, the hidden group could just as well be a new group that has suddenly arisen, and we would like to discover its existence. We assume that we know the number of observed groups (for example, the newsgroups are known), and that we have a model of how the society evolves. We do not know who belongs to which newsgroup, and all communications are aggregated into the communication graph for a given time period. We will develop a framework to determine the presence of a hidden group that does not rely on any semantic information regarding the communications. The motivation for this approach is that even if the semantics are available (which is not likely), the hidden communications will usually be encrypted and designed so as to mimic the regular communications anyway.
2 Probabilistic Setup
We will illustrate our general methodology by first developing the solution of the simplified example discussed above. The general case is similar, with only minor technical differences. The first step is to build a model for how individuals move from group to group. More specifically, let Ng be the number of observed groups in the society, and denote the groups by F_1, . . . , F_Ng. Let n be the number of individuals in the society, and denote the individuals by x_1, . . . , x_n. We denote by F(t) the micro-state of the society at time t; the micro-state represents the state of the society. In our case, F(t) is the membership matrix at time t, which is a binary n × Ng matrix that specifies who is in which group,

F_ij(t) = 1 if node x_i is in group F_j, and 0 otherwise.    (1)

The group membership may change with time. We assume that F(t) is a Markov chain; in other words, the members decide which groups to belong to at time t + 1 based solely on the group structure at time t. In determining which groups to join in the next period, the individuals may have their own preferences, thus there is some transition probability distribution

P[F(t + 1) | F(t), θ],    (2)
where θ is a set of (fixed) parameters that determine, for example, the individual preferences. This transition matrix represents what we define as the micro-laws of the society, which determine how its group structure evolves. A particular setting of the parameters θ is a particular realization of the micro-laws. We will assume that the group membership is static, which is a trivial special case of a Markov chain where the transition matrix is the identity matrix. In the general case, this need not be so, and we pick this simplified case to illustrate the mechanics of determining the hidden group without complicating it with the group dynamics. Thus, the group structure F(t) is fixed, so we will drop the t dependence. We do not observe the group structure, but rather the communications that are a result of this structure. We thus need a model for how the communications arise out of the groups. Let C(t) denote the communication graph at time t. C_ij(t) is the intensity of the communication between node x_i and node x_j at time t. C(t) is the "expression" of the micro-state F. Thus, there is some probability distribution

P[C(t) | F(t), λ],    (3)
where λ is a set of parameters governing how the group structure gets expressed in the communications. Since F(t) is a Markov chain, C(t) follows a Hidden Markov process governed by the two probability distributions P[F(t + 1)|F(t), θ] and P[C(t)|F(t), λ]. In particular, we will assume that there is some parameter 0 < λ < 1 that governs how nodes in the same group communicate. We assume that the communication intensity C_ij(t) has a Poisson distribution with parameter Kλ, where K is the number of groups that both nodes are members of. If K = 0, we set the Poisson parameter to λ² (which is smaller than λ since 0 < λ < 1); otherwise it is Kλ. Thus, nodes that are not in any groups together will tend not to communicate. The Poisson distribution is often used to model such "arrival" processes. Thus,

P[C_ij = k] = P(k; Kλ)  if x_i and x_j are in K > 0 groups together,
            = P(k; λ²)  if x_i and x_j are in no groups together,    (4)

where P(k; λ) is the Poisson probability distribution function,

P(k; λ) = e^{−λ} λ^k / k!.    (5)
We will assume that the communications between different pairs of nodes are independent of each other, as are communications at different time steps. Suppose we have a broadcast hidden group in the society as well, as illustrated in Figure 1(c). We assume a particular model for the communications within the hidden group, namely that every pair of nodes that are in the same visible group communicate. The intensity of the communications, B, is assumed to follow a Poisson distribution with parameter β, thus

P[B = k] = P(k; β).    (6)
We have thus fully specified the model for the society, and how the communications will evolve. The task is to use this model to determine, from communication history (as in Figure 2), whether or not there exists a hidden group, and if so, who the hidden group members are.
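Before turning to inference, note that the generative model just specified can be written down directly. The sketch below is a hypothetical simulator consistent with equations (4)-(6): pairwise communications are Poisson with parameter Kλ (or λ² for pairs sharing no group), and occasionally a Poisson(β) broadcast is overlaid on the edges between hidden members who share a visible group. How often broadcasts occur (p_broadcast) is not specified in the text and is an assumption.

```python
import numpy as np

def simulate_communications(F, hidden, lam, beta, T, p_broadcast=0.2, seed=0):
    """F: binary n x Ng membership matrix; hidden: boolean length-n array.
    Returns a length-T list of symmetric integer communication matrices C(t)."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    shared = F @ F.T                          # K_ij = number of common groups
    rate = np.where(shared > 0, lam * shared, lam ** 2)
    series = []
    for _ in range(T):
        C = np.triu(rng.poisson(rate), k=1)   # independent Poisson per pair
        if rng.random() < p_broadcast:        # hidden group broadcasts this step
            B = rng.poisson(beta)
            for i in range(n):
                for j in range(i + 1, n):
                    # the broadcast passes between hidden members who share a group
                    if hidden[i] and hidden[j] and shared[i, j] > 0:
                        C[i, j] += B
        series.append(C + C.T)
    return series
```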
2.1 The Maximum Likelihood Approach
For simplicity we will assume that the only unknown is F, the group structure. Thus, F is static and unknown, and λ and β are known. Let H be a binary indicator variable that is 1 if a hidden group is present, and 0 if not. Our approach is to determine how likely the observed communications would be if there is a hidden group, l1, and compare this with how likely the observed communications would be if there was no hidden group, l0. To do this, we use the model describing the communications evolution with a hidden group (resp. without a hidden group) to find what the best group structure F would be if this model were true, and compute the likelihood of the communications given this group structure and the model. Thus, we have two optimization problems,

l1 = max_{F,v} P[Data | F, v, λ, β, H = 1],    (7)
l0 = max_F P[Data | F, λ, H = 0],    (8)
where Data represents the communication history of the society, namely {C(t)} for t = 1, . . . , T, and v is a binary indicator variable that indicates who the hidden and visible members of the society are. If l1 > l0, then the communications are more likely if there is a hidden group, and we declare that there is a hidden group. As a by-product of the optimization, we obtain F and v; hence we identify not only who the hidden group members are, but also the remaining group structure of the society. In what follows, we derive the likelihood function that needs to be optimized for our example society. What remains is then to solve the two optimization problems to obtain l1, l0. The simpler case is when there is no hidden group, which we analyze first. Suppose that F is given. Let f_ij be the number of groups that nodes x_i and x_j are both members of,

f_ij = Σ_k F_ik F_jk.    (9)
Let λ_ij be the Poisson parameter for the intensity of the communication between nodes x_i and x_j,

λ_ij = λ²  if f_ij = 0;  λ_ij = λ f_ij  if f_ij > 0.    (10)

Let P(t) be the probability of obtaining the observed communications C(t) at time t. Since the communications between nodes are assumed independent, and
each is distributed according to a Poisson process with parameter λ_ij, we have that

P(t) = P[C(t) | F, λ, H = 0]    (11)
     = ∏_{i<j}^n P(C_ij(t); λ_ij).    (12)

Since the communications at different times are independent (given the group structure at that time), we have that

P[Data | F, λ, H = 0] = ∏_{t=1}^T ∏_{i<j}^n P(C_ij(t); λ_ij).    (13)
Since l0 is given by the maximum value of this function, we can equivalently maximize the logarithm. Further, the value of F that attains this maximum is the estimate of the group structure, assuming that there is no hidden group,

log l0 = max_F Σ_{t=1}^T Σ_{i<j}^n log P(C_ij(t); λ_ij),    (14)
F_0 = argmax_F Σ_{t=1}^T Σ_{i<j}^n log P(C_ij(t); λ_ij).    (15)
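Given a candidate group structure F, the objective in (14) is simply a sum of Poisson log-probabilities over node pairs and time steps. A direct, hypothetical sketch (using SciPy's Poisson log-pmf):

```python
import numpy as np
from scipy.stats import poisson

def log_l0_objective(C_series, F, lam):
    """Log-likelihood of the observed communications {C(t)} for a candidate
    binary membership matrix F with no hidden group (the objective in (14),
    before maximizing over F)."""
    shared = F @ F.T
    rate = np.where(shared > 0, lam * shared, lam ** 2)
    iu = np.triu_indices(F.shape[0], k=1)          # all pairs i < j
    return sum(poisson.logpmf(C[iu], rate[iu]).sum() for C in C_series)
```

A search over binary membership matrices F, for example the simulated annealing used later in this section, would then maximize this quantity; the hidden-group objective in (21) differs only in the additional sum over the broadcast intensity B.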
When a hidden group is present, we use an indicator variable v to identify who the hidden group members are. For illustration, we assume that there are as many hidden members as there are groups, as in Figure 1. At time t, the observed communications can be broken down into the regular communications plus a hidden group broadcast. The intensity B of the broadcast can be B = 0, 1, . . . , with B distributed according to a Poisson distribution, P[B | β] = P(B; β). This is our model of the hidden group communications. We can thus decompose the communication graph at time t into two components, a residual graph R(t) and the hidden graph H(t),

C(t) = R(t) + H(t).    (16)
The hidden graph will be of the form in Figure 1(c) and is completely specified by the broadcast intensity B. This then specifies R(t) by R(t) = C(t) − H(t). Suppose that we know F, v. In this case, P(t) is given by

P(t) = P[C(t) | F, v, λ, β, H = 1]    (17)
     = Σ_{B=0}^∞ P[R(t) | B] P[B],    (18)

where P[R(t) | B] is given by an expression exactly analogous to (12),

P[R(t) | B] = ∏_{i<j}^n P(R_ij(t; B); λ_ij),    (19)
where R(t; B) is the residual graph depending on B, and λ_ij is defined exactly analogously to (10) with f_ij = Σ_k F_ik F_jk. v places a constraint on what F can be, and serves to determine what the hidden group broadcast graph can be. Note that the sum in (18) gets truncated when B gets large enough that the residual graph has negative edges, which is impossible, since it must be a communication graph. We will denote this maximum possible value of B by B_max^t. Then, using the fact that P[B] = P(B; β), we get that

P(t) = Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij).    (20)
Taking the logarithm and summing over t, we get that

log l1 = max_{F,v} Σ_{t=1}^T log Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij),    (21)

{F_1, v_1} = argmax_{F,v} Σ_{t=1}^T log Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij).    (22)
Thus, in order to obtain l0, l1, F_0, F_1, v_1, we need to solve two combinatorial optimization problems. Notice that the size of the search space is huge. When there is no hidden group, the size of the search space is 2^{n Ng}, and the evaluation of the objective function is O(T n²). When there is a hidden group, the size of the search space is 2^{(n−Ng) Ng} · n!/(n − Ng)!, and the evaluation of the objective function is O(C_max T n²), where C_max is the maximum communication intensity between any two nodes. If, in addition, the parameters of the model, namely λ and β, are also not known, then we have to optimize with respect to these parameters as well, in which case we have a mixed continuous/discrete optimization problem. Some algorithms for discrete/combinatorial optimization problems are reactive search [21,22] and randomized approaches, see for example [23]. Continuous problems are often approached using derivative-based methods such as gradient descent, conjugate gradients, Levenberg-Marquardt, etc. [24]. Mixed discrete/continuous problems have not been studied as intensely, and most methods are based upon simulated annealing [25] or genetic algorithms [26]. For illustration, we assume that the parameters are known; the purpose here is to set the framework for the problem. To illustrate, we have implemented a simulated annealing approach to the combinatorial optimization. We used 10,000 Monte Carlo steps, where at each step the current group structure F was randomly perturbed. The probability of perturbation decreased as a function of the step number.

Results. We show results on a small society (9 nodes) with 3 groups. We picked this society so that it would be computationally efficient to run many simulations. We ran simulations to test both the false positive (declaring a hidden group when there isn't one) and false negative (declaring no hidden group when there
is one) errors. For each, we generated a society group structure randomly, and then generated the communication time series. These communication time series were fed into the optimization algorithm to obtain l0, l1, F_0, F_1, v_1. If l1 > l0 we declare a hidden group to be present and identify its members in v_1 and the group structure in F_1. If not, we declare no hidden group and identify the group structure in F_0. The results are summarized in Table 1.

Table 1. Error matrices for different time periods. % correct is the percentage of nodes identified correctly (hidden or not) when a hidden group is present and is predicted correctly.

10 time steps (% correct = 84%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.73              0.19
True H = 0    0.27              0.81

20 time steps (% correct = 89%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.78              0.04
True H = 0    0.28              0.96

50 time steps (% correct = 94%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.88              0.03
True H = 0    0.12              0.97
As can be seen, with just 50 time steps of data, the error rate in predicting the presence of a hidden group is lower than 0.1.

2.2 General Maximum Likelihood Formulation
In general, the group structure evolves according to the micro-law transition matrix for the Markov chain, P[F(t + 1)|F(t), θ], and the group structure gets expressed as a communication graph according to P[C(t)|F(t), λ]. In our example, P[F(t + 1)|F(t), θ] was the identity matrix, and P[C(t)|F(t), λ] was based on modeling the communications using Poisson processes. A detailed description of a general model that describes an evolving society over a communication network is given in [27]. Let N = {x_1, . . . , x_n} be the set of nodes and let H ⊂ N be the subset of nodes that forms the hidden group. We assume that H does not change with time. The hidden group may have a communication pattern governed by a different probability distribution, P[H(t)|H, β], where β is a set of parameters that governs this distribution. The group structure of the society from t = 1, . . . , T is given by the time series of matrices {F(t)}. In our example, this time series was specified by the constant matrix F. If there is no hidden group, we can compute the likelihood of observing the communication data {C(t)} as follows. The probability of obtaining the evolution F(1), F(2), . . . , F(T) is given by

P[{F(t)} | θ] = P[F(1)] ∏_{t=2}^T P[F(t) | F(t − 1), θ].    (23)
The likelihood of obtaining the observed communications given this evolution is then given by

P[{C(t)} | {F(t)}, θ, λ] = ∏_{t=1}^T P[C(t) | F(t), λ].    (24)
Ideally, we would like to compute

l0 = P[{C(t)} | θ, λ] = Σ_{{F(t)}} P[{C(t)}, {F(t)} | θ, λ]    (25)
   = Σ_{{F(t)}} P[{F(t)} | θ] P[{C(t)} | {F(t)}, θ, λ]    (26)
   = Σ_{{F(t)}} P[F(1)] P[C(1) | F(1), λ] ∏_{t=2}^T P[F(t) | F(t − 1), θ] P[C(t) | F(t), λ].    (27)
If θ, λ are known, then this summation can be computed using a Monte Carlo simulation. If not, then we find the values of θ, λ that maximize l0. In this case, the optimization is computationally costly, and an alternative is to simultaneously optimize with respect to {F(t)}, θ, λ, which is itself a non-trivial mixed discrete/continuous optimization problem. When a hidden group H is present, we decompose the communications at time t into the hidden communications H(t) and the residual communications R(t), with C(t) = R(t) + H(t). Then,

P[C(t) | {F(t)}, H, θ, λ, β] = Σ_{H(t)} P[R(t) | {F(t)}, θ, λ] P[H(t) | H, β],    (28)
where this summation is finite because both R(t) and H(t) must have nonnegative edges. Taking the product over t gives us

P[{C(t)} | {F(t)}, H, θ, λ, β] = ∏_{t=1}^T Σ_{H(t)} P[R(t) | {F(t)}, θ, λ] P[H(t) | H, β],    (29)
and finally multiplying by P[{F(t)} | θ] and summing over {F(t)}, we get that

l1 = max_H Σ_{{F(t)}} P[F(1)] ∏_{t=1}^T P[F(t + 1) | F(t), θ] P[{C(t)} | {F(t)}, H, θ, λ, β],    (30)

where P[{C(t)} | {F(t)}, H, θ, λ, β] is given in (29). The hidden group H at which the maximum is attained identifies who the hidden group members are. We assume that the Hidden Markov model and its parameters (θ, λ, β) are known. If the parameters are not known, then they have to be optimized as well. For a relatively simple hidden group communication structure, for example the broadcast hidden group as in our example, the computation of the likelihood is tractable. For more complicated examples, one may need to use heuristic approaches to these combinatorial optimization problems.
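As a rough illustration of the Monte Carlo computation of (25)-(27) mentioned above (with θ and λ known), one can sample group-structure trajectories from the Markov chain and average the conditional likelihood of the observed communications. The sampler and emission functions below are placeholders for whatever concrete micro-law and communication model are used; this is a sketch, not the authors' implementation.

```python
import math

def mc_log_l0(C_series, sample_F1, sample_transition, log_emission, n_samples=1000):
    """Monte Carlo estimate of log l0 = log P[{C(t)} | theta, lambda] in (25)-(27).
    sample_F1(): draws F(1); sample_transition(F): draws F(t+1) given F(t);
    log_emission(C, F): returns log P[C(t) | F(t), lambda]. All are placeholders."""
    log_terms = []
    for _ in range(n_samples):
        F = sample_F1()
        log_w = log_emission(C_series[0], F)
        for C in C_series[1:]:
            F = sample_transition(F)
            log_w += log_emission(C, F)
        log_terms.append(log_w)
    m = max(log_terms)                     # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in log_terms) / n_samples)
```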
3 Concluding Remarks
We have presented a framework for determining the members of a hidden group that attempts to camouflage its broadcasts within a functioning communication network. The basic idea is to first have a model for the society's evolution. Then, by examining the discrepancy between the observed and expected communications, one can draw conclusions regarding the presence or absence of a hidden group. We focussed on a specific example, where we made a number of assumptions: a static group structure; a Poisson communication model; independence between communications at different times; hidden group communications consisting only of broadcasts; and a maximum likelihood formulation. These restrictions were made primarily for expository and computational reasons, and are dropped in the general framework (resulting in more computationally intensive and complex optimization problems). Ongoing research involves developing efficient heuristic algorithms that solve the combinatorial optimization problems faced in the more general framework, as well as applying our methodology toward finding hidden groups in real societies.
References
1. Monge, P., Contractor, N.: Theories of Communication Networks. Oxford University Press (2002)
2. Carley, K., Prietula, M., eds.: Computational Organization Theory. Lawrence Erlbaum Associates, Hillsdale, NJ (2001)
3. Sanil, A., Banks, D., Carley, K.: Models for evolving fixed node networks: Model fitting and model testing. Journal of Mathematical Sociology 21 (1996) 173–196
4. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (1989) 257–286
5. Rabiner, L.R., Juang, B.H.: An introduction to hidden Markov models. IEEE ASSP Magazine (1986) 4–15
6. Georgeff, M.P., Wallace, C.S.: A general selection criterion for inductive inference. European Conference on Artificial Intelligence (ECAI84) (1984) 473–482
7. Bunke, H., Caelli, T., eds.: Hidden Markov Models. Series in Machine Perception and Artificial Intelligence, Vol. 45. World Scientific (2001)
8. Edgoose, T., Allison, L.: MML Markov classification of sequential data. Stats. and Comp. 9 (1999) 269–278
9. Allison, L., Wallace, C.S., Yee, C.N.: Finite-state models in the alignment of macro-molecules. J. Molec. Evol. 35 (1992) 77–89
10. Allison, L., Wallace, C.S., Yee, C.N.: Normalization of affine gap costs used in optimal sequence alignment. J. Theor. Biol. 161 (1993) 263–269
11. Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology 301 (2000) 173–90
12. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281 (1998) 565–77
13. Bystroff, C., Shao, Y.: Fully automated ab initio protein structure prediction using I-sites, HMMSTR and ROSETTA. Bioinformatics 18 (2002) S54–S61
14. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA (1998)
15. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge, New York (2001)
16. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press (1994)
17. Watts, D.J.: Small Worlds: The dynamics of networks between order and randomness. Princeton University Press, Princeton, NJ (1999)
18. Butler, B.: The dynamics of cyberspace: Examining and modelling online social structure. Technical report, Carnegie Mellon University, Pittsburgh, PA (1999)
19. Carley, K., Wallace, A.: Computational organization theory: A new perspective. In Gass, S., Harris, C., eds.: Encyclopedia of Operations Research and Management Science. Kluwer Academic Publishers, Norwell, MA (2001)
20. Snijders, T.: The statistical evaluation of social network dynamics. In Sobel, M., Becker, M., eds.: Sociological Methodology. Basil Blackwell, Boston & London (2001) 361–395
21. Battiti, R.: Reactive search: Toward self-tuning heuristics. Modern Heuristic Search Methods, Chapter 4 (1996) 61–83
22. Battiti, R., Protasi, M.: Reactive local search for the maximum clique problem. Technical Report TR-95-052, Berkeley, ICSI, 1947 Center St. Suite 600 (1995)
23. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (2000)
24. Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
25. Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley & Sons Ltd., New York (1989)
26. Stelmack, M., N., N., Batill, S.: Genetic algorithms for mixed discrete/continuous optimization in multidisciplinary design. In: AIAA Paper 98-4771, AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, Missouri (1998)
27. Siebecker, D., Goldberg, M., Magdon-Ismail, M., Wallace, W.: A Hidden Markov Model for describing the statistical evolution of social groups over communication networks. Technical report, Rensselaer Polytechnic Institute (2003) Forthcoming.
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department

Kar Wing Li and Christopher C. Yang

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
{kwli, yang}@se.cuhk.edu.hk
Abstract. The tragic event of September 11 has prompted rapid growth in attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information, written in different languages and stored in different locations, may be seemingly unconnected. Therefore, cross-lingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analysed and searched. Traditional information retrieval (IR) approaches normally require a document to share some keywords with the query. In reality, the users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents, even though they contain different terms. Apart from this, terrorists and criminals may communicate through letters, e-mails and faxes in languages other than English. Translation ambiguity significantly exacerbates the retrieval problem. To facilitate cross-lingual information retrieval, a corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model to cross the language boundary. However, collecting parallel corpora between European and Oriental languages is not an easy task, due to the unique linguistic and grammatical structures of Oriental languages. In this paper, a text-based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. This article then reports an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consists of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval.
and criminal analysis. However, Osama bin Laden's al Qaeda terrorists are not the only threat. We also need to effectively predict and prevent other criminal activities. These include religious, racist and fascist terrorists, opportunistic crime, organized crime (narcocriminal, Mafia, Russian mob, Triads, etc.), political espionage and sabotage, anarchists and vandals. An intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data to known patterns to detect important anomalies. For example, historical cases of tax fraud can disclose patterns of taxpayers' behaviors and provide indicators for potential fraud. Customers' credit card data can reveal patterns of transactions and help to detect credit card theft. The system should also allow the user to retrieve what persons, organizations, projects, and topics are relevant to a particular event of interest, e.g. the car bombing in Bali. However, information stored in the repositories is often fragmented and unstructured, especially in on-line catalogs. Also, the man-made fog of deliberate deception militates against normal pattern learning from databases and causes much crucial information, and the knowledge underlying it, to be buried. Therefore this information has become inaccessible. Developing systems that can retrieve relevant information has long been the goal of many researchers, since important domain knowledge or information resides in the databases. Many information retrieval systems have been created in the past for medical diagnosis and business applications. The major difficulties in retrieving relevant information are the lack of explicit semantic clustering of relevant information and the limits of conventional keyword-driven search techniques (either full-text or index-based) [2]. The traditional approaches normally require a document to share some keywords with the query. In reality, it is known that users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. How to create relationships for the related terms between the two spaces is an important issue. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents, even though they contain different terms. Language boundaries are another problem for criminal analysis. In criminal analysis, we need to find out how to frame questions, or create search patterns, that would help an analyst. If the right questions are not posed, the analyst may head down a path with no conclusions. In addition, terrorists and criminals may communicate openly and less openly through letters, e-mails, faxes, bulletin boards, etc. in languages other than English. The translation ambiguity significantly exacerbates the retrieval problem. Use of every possible translation for a single term can greatly expand the set of possible meanings, because some of those translations are likely to introduce additional homonymous or polysemous word senses in the second language. Also, users can have different abilities in different languages, affecting their ability to form queries and refine results. The human expertise needed to decompose an information need into queries may take a person several years to acquire.
However, knowledge-based systems aim to capture human expertise or knowledge by means of computational models. Knowledge acquisition was defined by Buchanan [10] as “the transfer and transformation of potential problem-solving expertise from some knowledge source to a program”. The approach to knowledge elicitation is referred to as “knowledge mining” or
"knowledge discovery in databases" [2]. The "knowledge discovery" approach is believed by many Artificial Intelligence experts and database researchers to be useful for resolving the information overload and knowledge acquisition bottleneck problems. In this research, our aim is to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the English/Chinese daily press release documents issued by the Hong Kong Police Department. The research output consists of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval. Before generating the thesaurus-like, semantic network knowledge base, we first propose a text-based approach to collect the parallel press release documents from the Web.
2 Automatic Construction of Parallel Corpus

Cross-lingual semantic interoperability has drawn significant attention in recent criminal analysis, as the amount of information on criminal activities written in languages other than English has grown exponentially. Since it is impractical to construct bilingual dictionaries or sophisticated multilingual thesauri manually for large applications, the corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model for cross-lingual information retrieval. Many corpora are domain-specific. To deal with criminal analysis, we use the English/Chinese daily press release articles issued by the Hong Kong SAR Police Department. Bates [1] stressed the importance of building domain-specific lexicons for retrieval purposes, since a domain-specific, controlled list of keywords can help identify legitimate search vocabularies and help searchers "dock" onto the retrieval system. For most domain-specific databases, there appear to be lists of subject descriptors (e.g., the subject indexes at the back of a textbook), people's names (e.g., author indexes), and other domain-specific objects (e.g., organizational names, procedures, location names, etc.). These domain-specific keywords can be used to identify important concepts in documents. In the criminal analysis world, this information can help the analyst identify which people belong to which groups or organizations, what methods they use to conduct criminal activities, and where. In addition, the online bilingual newswire articles used in this experiment are dynamic. They provide a continuous, large amount of information for relieving the lag between new information and the information incorporated into a reference work. To continuously collect English/Chinese daily Police press release articles from the data stream, we investigate a text-based approach to align English/Chinese parallel documents from the Web. A parallel corpus can be generated using overt translation or covert translation. Overt translation [20] possesses a directional relationship between the pair of texts in two languages, which means a text in language A (source text) is translated into a text in language B (translated text) [25]. Covert translation [13] is non-directional, e.g. press releases from the government, or commentaries on a sports event broadcast live in several languages by a broadcasting organization. There are two major approaches to document alignment, namely length-based and text-based alignment. The length-based approach makes use of the total number of characters or
words in a sentence, while text-based approaches use linguistic information in the sentence alignment [9]. Many parallel text alignment techniques have been developed in the past. These techniques attempt to map various textual units to their translations and have been proven useful for a wide range of applications and tools, e.g. cross-lingual information retrieval [18], bilingual lexicography, automatic translation verification, and the automatic acquisition of knowledge about translation [22]. Translation alignment techniques have been used in automatic corpus construction to align two documents [16]. There are three major structures of parallel documents on the World Wide Web: the parent page structure, the sibling page structure, and the monolingual sub-tree structure [24]. Resnik [19] noticed that the parent page of a Web page may contain links to different versions of the Web page. The sibling page structure refers to the cases where the page in one language contains a link directly to the translated page in the other language. The third structure contains a completely separate monolingual sub-tree for each language, with only the single top-level Web page pointing off to the root page of the single-language version of the site. Parallel corpora generated by overt translation usually use the parent page structure and the sibling page structure. However, parallel corpora generated by covert translation use the monolingual sub-tree structure. Each sub-tree is generated independently [24]. The press release issued by the HKSAR Police Department is an example.
Fig. 1. Organization of Hong Kong SAR Police Department's press release articles in the Hong Kong SAR Police Department Web site. (Two independent monolingual sub-trees: the Chinese and English department home pages each link to their own press news archives, dated pages such as 1/1/1999, and individual articles, e.g. Article 0001 through Article 0019; corresponding Chinese and English articles form the parallel articles.)
2.1 Title Alignment

The titles of two texts can be treated as representations of those texts. According to He [11], titles present "micro-summaries of texts" that contain "the most important focal information in the whole representation" and are "the most concise statement of the content of a document". In other words, titles function as condensed summaries of the information and content of the articles. In our proposed text-based approach, the longest common subsequence is utilized to optimize the alignment of English and Chinese titles [24]. Our alignment algorithm has three major steps: 1) alignment at the word and character level, 2) redundancy reduction, and 3) a score function.

An English title, E, is formed by a sequence of English simple words, i.e., E = e_1 e_2 e_3 ... e_i ..., where e_i is the ith English word in E. A Chinese title, C, is formed by a sequence of Chinese characters, i.e., C = char_1 char_2 char_3 ... char_q ..., where char_q is a Chinese character in C. An English word e_i in E can be translated into a set of possible Chinese translations, Translated(e_i), by dictionary lookup: Translated(e_i) = {T_{e_i}^1, T_{e_i}^2, T_{e_i}^3, ..., T_{e_i}^j, ...}, where T_{e_i}^j is the jth Chinese translation of e_i. Each Chinese translation is formed by a sequence of Chinese characters. The set of longest common subsequences (LCS) of a Chinese translation T_{e_i}^j and C is LCS(T_{e_i}^j, C). MatchList(e_i) is a set that holds all the unique longest common subsequences of T_{e_i}^j and C over all Chinese translations of e_i. We adopt the hypothesis that if the characters of a Chinese translation of an English word appear adjacently in the Chinese sentence, that translation is more reliable than translations whose characters do not appear adjacently. Contiguous(e_i) determines the most reliable translations based on adjacency. The second criterion for the most reliable Chinese translation is its length: Reliable(e_i) identifies the longest sequence in Contiguous(e_i).

Due to redundancy, the translations of an English word may be repeated completely or partially in Chinese. To deal with redundancy, Dele(x, y) is an edit operation that removes LCS(x, y) from x. WaitList is a list that saves all the sequences obtained by removing the overlap between the elements of MatchList(e_i) and Reliable(e_i). MatchList(e_i) is initialized to ∅ and Reliable(e_i) is initialized to ε. Remain is a sequence initialized as C; the sequences Reliable(e_i) are removed from Remain starting from e_1 up to the last English word, and WaitList is updated for each e_i. When all Reliable(e_i) have been removed from Remain, the elements in WaitList are also removed from Remain in order to eliminate the redundancy. Given E and C, the matching ratio is determined by the portion of C that matches the reliable translations of the English words in E. Given an English title, the Chinese title with the highest Matching_Ratio among all Chinese titles is considered its counterpart. However, it is possible that more than one Chinese title has the highest Matching_Ratio; in such cases, we also consider the ratio determined by the portion of the English title for which a reliable translation can be identified in the Chinese title.
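To make the matching procedure concrete, the following sketch implements the core of the LCS-based comparison under stated assumptions: the bilingual dictionary `en_to_zh` is a hypothetical placeholder, the redundancy handling via WaitList is omitted, and the ratio is computed directly on the remaining characters. It illustrates the idea rather than reproducing the authors' implementation.

```python
from itertools import product

def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two character strings (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i, j in product(range(m), range(n)):
        if a[i] == b[j]:
            dp[i + 1][j + 1] = dp[i][j] + a[i]
        else:
            dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def reliable_translation(word: str, zh_title: str, en_to_zh: dict) -> str:
    """Pick the longest candidate translation whose characters appear contiguously in zh_title."""
    candidates = [lcs(t, zh_title) for t in en_to_zh.get(word, [])]
    contiguous = [c for c in candidates if c and c in zh_title]   # adjacency test
    return max(contiguous, key=len) if contiguous else ""

def matching_ratio(en_title: str, zh_title: str, en_to_zh: dict) -> float:
    """Fraction of the Chinese title covered by reliable translations of the English words."""
    remain = zh_title
    for word in en_title.lower().split():
        r = reliable_translation(word, remain, en_to_zh)
        if r:
            remain = remain.replace(r, "", 1)   # crude stand-in for the Dele(x, y) operation
    return 1.0 - len(remain) / len(zh_title) if zh_title else 0.0
```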
2.2 Experiment

An experiment was conducted to measure the precision and recall of the aligned parallel Chinese/English documents from the HKSAR Police press releases using the text-based approach described in Section 2.1; the results are shown in Table 1. The Hong Kong SAR Police press releases are produced by covert translation. From 1 January 2001 to 31 October 2002, there were 2,698 press articles in Chinese and 2,695 press articles in English, but only 2,664 pairs of Chinese/English parallel articles. The experimental results show that the proposed text-based title alignment approach can effectively align the Chinese and English titles.

Table 1. Experimental results

                                Precision   Recall
Proposed text-based approach    1.00        1.00
3 A Corpus-Based Approach: Automatic Cross-Lingual Concept Space Generation

The semantic network knowledge base approach to automatic thesaurus generation is also referred to as a concept space approach [4], because a meaningful and understandable concept space (a network of terms and weighted associations) can represent the concepts (terms) and their associations for the underlying information space (i.e., the documents in the database). In terms of criminal analysis, recent terrorist events have demonstrated that terrorist and other criminal activities are connected, in particular terrorism, money laundering, drug smuggling, illegal arms trading, and illegal biological and chemical weapons smuggling. In addition, hacker activities may be connected to these other criminal activities. Information in the concept space can be split into concepts and links. Concepts include real people, aliases, groups, organizations, companies (including banks and shell companies), countries, towns, regions, religious groups, families, attacks (hacker, terrorist), etc. The associated concepts in the concept space can provide links about persons who generally remain hidden, unknown, and use aliases, who in turn belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate openly and less openly through bulletin boards, e-mail, phone calls, letters, word of mouth, etc. – encrypted or not. This helps the analyst detect important anomalies.

The cross-lingual concept space clustering model was originally suggested by Lin and Chen [15] and is based on the Hopfield network. The cross-lingual concept space includes the concepts themselves, their translations, and their associated concepts. The automatic Chinese-English concept space generation system consists of four components: 1) English phrase extraction; 2) Chinese phrase extraction; 3) the Hopfield network; and 4) the parallel Chinese/English Police press release corpus. The Chinese and English phrase extraction identifies important conceptual phrases in the corpora. The Hopfield network generates the cross-lingual concept space with the important Chinese and English conceptual phrases as input. A press release parallel corpus was dynamically collected from the Hong Kong Police website in order to capture the relationships between Chinese terms and English terms.
3.1 Automatic English Phrase Extraction

Automatic phrase extraction is a fundamental and important phase in concept space clustering. The clustering result will be degraded significantly if the quality of term extraction is low. Salton [21] presents a blueprint for automatic indexing, which typically includes stop-wording and term-phrase formation. A stop-word list is used to remove non-semantic-bearing words such as the, a, on, in, etc. After removing the stop words, term-phrase formation, which builds phrases by combining only adjacent words, is performed [4].

3.2 Chinese Phrase Extraction

Unlike English, Chinese has no natural delimiters to mark word boundaries. In our previous work, we developed boundary detection [23] and heuristic techniques to segment Chinese sentences based on mutual information and significance estimation [5]. The accuracy is over 90%.

3.2.1 Automatic Phrase Selection

To generate the concept space, the relevance weights between the English and Chinese term phrases are first computed in order to select significant concepts from the collection.

    d_ij = tf_ij × log((N / df_j) × w_j)                                         (1)
Equation 1 shows how the combined weight of term j in document i is calculated. tf_ij is the occurrence frequency of term j in document i, N is the total number of documents in the collection, df_j is the number of documents containing term j, and w_j is the length of term j. For an English term, the length is the number of words; for a Chinese term, it is the number of characters. The weight is directly proportional to the occurrence frequency of the term, because a term that appears many times in a document tends to carry an important idea. On the other hand, it is inversely proportional to the number of documents containing the term, because a term that appears in many documents may be too general. For example, "Hong Kong" frequently appears in the collection of documents from the HKSAR Police; it is a common term in the collection and does not carry specific meaning in any particular document. The length of a term also plays an important role in the weight, since a longer term usually carries more specific meaning; for example, names of places and organizations often consist of multiple words (for English) or characters (for Chinese). Terms that significantly represent a document are selected for clustering. Based on the combined weights calculated using Equation 1, a number of terms with the largest combined weights in each document are selected for clustering. This number depends on the average length of documents in the collection: the longer the average length, the more terms are selected. Terms with common meanings that are not representative are filtered out.
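As an illustration of Equation 1, the sketch below computes the combined weights for one document and keeps the top-weighted terms. The corpus representation (term lists and a document-frequency table) and the top-k rule are our assumptions, not part of the original system.

```python
import math
from collections import Counter

def combined_weights(doc_terms, collection_df, n_docs):
    """Equation 1: d_ij = tf_ij * log((N / df_j) * w_j) for one document.

    doc_terms     : list of term phrases extracted from document i
    collection_df : dict mapping term -> document frequency in the collection
    n_docs        : N, total number of documents in the collection
    """
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        # English phrase length = word count; Chinese phrase length = character count
        w = len(term.split()) if term.isascii() else len(term)
        df = collection_df.get(term, 1)
        weights[term] = freq * math.log((n_docs / df) * w)
    return weights

def select_terms(weights, k):
    """Keep the k terms with the largest combined weights (k tied to average document length)."""
    return [t for t, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]]
```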
3.2.2 Co-occurrence Weight

After calculating d_ij, an asymmetric co-occurrence function [2] is used to evaluate the relevance weights among concepts. For a pair of relevant terms A and B, the weight of the link from term A to term B and that of the link from term B to term A are different. This asymmetry reflects how humans naturally associate terms. For example, "Ford" and "car" are relevant: when a person thinks of "Ford", he can think of "car"; however, when a person thinks of "car", he may not think of "Ford". This example shows that the association between two terms is not symmetric. Therefore, we adopt the co-occurrence weight to calculate the relevance weights.

    d_ijk = tf_ijk × log((N / df_jk) × w_j)                                      (2)
The co-occurrence weight d_ijk in Equation 2 is the weight between term j and term k when both appear in document i. tf_ijk is the minimum of the occurrence frequency of term j and that of term k in document i. The weight is zero if either term j or term k does not appear in the document. The calculation is similar to that in Equation 1; the co-occurrence weight is therefore a measure of the combined weight between term j and term k.

    Weight(T_j, T_k) = ( Σ_{i=1}^{n} d_ijk / Σ_{i=1}^{n} d_ij ) × WeightingFactor(T_k)     (3)

    Weight(T_k, T_j) = ( Σ_{i=1}^{n} d_ikj / Σ_{i=1}^{n} d_ik ) × WeightingFactor(T_j)     (4)
Equation 3 shows the relevance weight from term j to term k, and Equation 4 shows the relevance weight from term k to term j. The relevance weight measures the association between two terms in the collection. The combined weights and co-occurrence weights of terms in all documents are summed to derive the global association between terms in the collection.

    WeightingFactor(T_j) = log(N / df_j) / log N                                 (5)

    WeightingFactor(T_k) = log(N / df_k) / log N                                 (6)
Equation 5 shows the weighting factor of term j, and Equation 6 shows the weighting factor of term k. The weighting factor is used to penalize general terms, which always affect the result of clustering: many terms are associated with a general term, so if a general term is activated during clustering, the other terms associated with it will also be activated. The resulting concept space becomes large and its precision will unavoidably be low. The weighting factor is a value between 0 and 1 and carries the idea of inverse document frequency: the more documents contain a concept, the smaller its weighting factor.
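The following sketch shows how Equations 2-6 fit together for a single ordered pair of terms. The document representation (per-document term-frequency dictionaries and precomputed document frequencies) is assumed for illustration.

```python
import math

def cooccurrence_weight(tf_i, term_j, term_k, df_jk, n_docs, w_j):
    """Equation 2: d_ijk = tf_ijk * log((N / df_jk) * w_j), with tf_ijk = min(tf_ij, tf_ik)."""
    if term_j not in tf_i or term_k not in tf_i:
        return 0.0
    tf_ijk = min(tf_i[term_j], tf_i[term_k])
    return tf_ijk * math.log((n_docs / df_jk) * w_j)

def weighting_factor(df, n_docs):
    """Equations 5-6: log(N / df) / log N, penalizing terms that occur in many documents."""
    return math.log(n_docs / df) / math.log(n_docs)

def relevance_weight(docs_tf, term_j, term_k, df, df_jk, term_len, n_docs):
    """Equation 3: Weight(T_j, T_k) = (sum_i d_ijk / sum_i d_ij) * WeightingFactor(T_k)."""
    num = sum(cooccurrence_weight(tf_i, term_j, term_k, df_jk, n_docs, term_len[term_j])
              for tf_i in docs_tf)
    den = sum(tf_i.get(term_j, 0) * math.log((n_docs / df[term_j]) * term_len[term_j])
              for tf_i in docs_tf)
    return (num / den) * weighting_factor(df[term_k], n_docs) if den else 0.0
```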
3.2.3 The Hopfield Network Algorithm

Given the relevance weights between the extracted Chinese and English term phrases in the parallel corpus, we employ the Hopfield network to generate the concept space. The Hopfield network models the associative network and transforms a noisy pattern into a stable state representation. When a searcher starts with an English term phrase, the Hopfield network spreading activation process identifies other relevant English term phrases and gradually converges towards heavily linked Chinese term phrases through association (or vice versa). Each term is represented by a node in the network. The algorithm is shown below:

    u_j(t+1) = f_s[ Σ_{i=0}^{n-1} t_ij u_i(t) ],   0 ≤ j ≤ n − 1                  (7)
where u_j(t+1) denotes the value of node j in iteration t+1, n is the total number of nodes in the network, and t_ij denotes the relevance weight from node i to node j.

    f_s(x) = 1 / (1 + exp(−(x − θ_j) / θ_0))                                      (8)

Equation 8 is the continuous sigmoid transformation function, which normalizes any given value to a value between 0 and 1 [4].
    Σ_{j=0}^{n-1} [ u_j(t+1) − u_j(t) ]² ≤ ε                                      (9)
where ε is the maximal allowable difference between two iterations; the left-hand side of Equation 9 measures the total change in node values from iteration t to t+1. After several iterations, more nodes become activated, and the nodes with strong connections to the target node are those with high values. The total change in node values is evaluated at the end of each iteration; when the change is smaller than the threshold ε, the Hopfield network has converged and the iteration process stops. Once the network has converged, the final output represents the set of terms relevant to the starting term. In our system the following values were used: θ_j = 0.1, θ_0 = 0.01, ε = 1.
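A minimal sketch of the spreading activation procedure of Equations 7-9 is given below, assuming the relevance weights are already available as a matrix. Clamping the query terms to 1 during the iterations is our assumption, as the paper does not spell out this detail; the default thresholds are the values quoted above.

```python
import numpy as np

def hopfield_activate(t, seeds, theta_j=0.1, theta_0=0.01, eps=1.0, max_iter=100):
    """Spreading activation over the concept network (Equations 7-9).

    t     : n x n matrix of relevance weights, t[i, j] = weight from node i to node j
    seeds : indices of the starting term(s), activated with value 1.0
    Returns the final activation value of every node.
    """
    n = t.shape[0]
    u = np.zeros(n)
    u[list(seeds)] = 1.0
    for _ in range(max_iter):
        net = t.T @ u                                                # sum_i t_ij * u_i(t) per node j
        u_next = 1.0 / (1.0 + np.exp(-(net - theta_j) / theta_0))    # sigmoid transform, Eq. 8
        u_next[list(seeds)] = 1.0                                    # keep query terms clamped (assumption)
        if np.sum((u_next - u) ** 2) <= eps:                         # convergence test, Eq. 9
            return u_next
        u = u_next
    return u
```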
4 Concept Space Evaluation

Ten students of the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, were invited to examine the performance of the concept space. The concept space is a robust, domain-specific Hong Kong Police press release thesaurus that contains 9222 Chinese/English concepts. The thesaurus includes many social, political, and legislative terms, abbreviations, and names of government departments and agencies. Each concept in the thesaurus may be associated with up to 46 concepts. It is generated from 2548 parallel Hong Kong Police press release article pairs. The goal of this experiment is to determine whether the concept space captures meaningful conceptual associations between concepts; such associations form the basis for the decisions and inferences users make when searching criminal information about Hong Kong.
4.1 Experimental Design

Among these 10 graduate students, 5 subjects are from Hong Kong and the other 5 are from Mainland China. All of them have been living in Hong Kong for more than one year. They used their knowledge and experience of both the Hong Kong SAR Police system and the living environment in Hong Kong to evaluate the concept space. Fifty of the 9222 concepts were randomly selected as test descriptors: 25 of these test descriptors are English concepts and the other 25 are Chinese concepts. Each test descriptor, together with its associated concepts, was presented to the 10 subjects. A small portion of noise terms (about 10% of the total number of associated concepts for each test descriptor) was added to reduce the subjects' bias toward the concept space.

The experiment is divided into two phases: a recall phase and a recognition phase. In the recall phase, each subject (Hong Kong graduate students and graduate students from Mainland China) was asked to generate as many related terms as possible in response to each test descriptor presented. In the recognition phase, the subjects had to judge each associated concept as either "irrelevant" or "relevant" to the test descriptor; terms considered too general were to be ranked as "irrelevant". This phase tested the subjects' ability to recognize relevant terms. If the subjects felt that the definition of a concept needed clarification, or wished to add comments on a concept, they were asked to write them on a piece of paper. After the experiment, we found that the subjects spent more time on the recognition phase than on the recall phase. This confirms the statement made by Chen et al. [3] that human beings are more likely to recognize than to recall.

Apart from the 10 students, the 50 concepts in the concept space were also carefully evaluated by two experimenters, with no noise terms added in this case. One of them is a graduate student of the Department of Systems Engineering and Engineering Management; the other is a graduate student of the Department of Translation. Both have been living in Hong Kong for more than 10 years and have done research on Chinese-to-English and English-to-Chinese translation for more than two years. Since there is no tailored bilingual thesaurus for Hong Kong government press release articles, the experimental result provided by these two senior subjects is treated as a benchmark, or human-verified thesaurus, for comparison with the result provided by the 10 subjects. The additional associated concepts provided by the 10 subjects in the recall phase were examined by the two senior judges before being treated as relevant terms.

4.2 Experimental Result

We adopted concept recall and concept precision for evaluation, based on the following equations:

    Concept Recall = Number of Retrieved Relevant Concepts / Number of Total Relevant Concepts       (10)

    Concept Precision = Number of Retrieved Relevant Concepts / Number of Total Retrieved Concepts   (11)

The number of retrieved relevant concepts is the number of concepts in the concept space judged as "relevant". The number of total relevant concepts
includes the concepts in the concept space judged as "relevant" plus the additional relevant concepts provided by the subjects. The number of total retrieved concepts is the number of concepts suggested by the concept space and the human-verified thesaurus.

4.3 Evaluation Provided by the 10 Graduate Subjects

The 10 graduate students provided 12 to 73 new associated concepts during the experiment; the analysis is listed in Table 2. It is interesting to note that all the Hong Kong graduate subjects have been living in Hong Kong for at least six years, whereas the graduate subjects from Mainland China have been living in Hong Kong for around one year. The Hong Kong graduate subjects are therefore more familiar with the Hong Kong Police system, and they added more new concepts to the concept space. In addition, the Hong Kong graduate students added more English concepts to the concept space than the graduate students from Mainland China did. This confirms that, even though the first language of all these graduate students is Chinese, the working language of the Hong Kong graduate students is English.

Table 2. The statistics of new associated concepts added by the 10 graduate students
Table 3. Precision and recall

                        Precision   Recall
10 graduate students    0.835       0.795
2 experimenters         0.86        0.83

Table 4. The new concepts added by the 10 graduate students

                        Chinese concepts added   English concepts added
10 graduate students    222                      220
Hong Kong is a bilingual community. Even though the Police concept space contains many technical, political, and geographical English vocabulary items, the Hong Kong graduate students frequently encounter these terms in their daily lives. As a result, they naturally added more English terms to the concept space. A similar observation has been made in the Welsh and English community [7]. Also, even though Chinese technical terms do exist, they may not be in common use. The Hong Kong graduates may therefore have a limited Chinese technical vocabulary
even though Chinese is their first language, and they use English terms when necessary. As a result, the Hong Kong graduate subjects judged more English concepts to be relevant and added more English terms to the concept space. On the other hand, the graduate students from Mainland China have a higher degree of Chinese fluency than the Hong Kong graduate students, and they know more of the Chinese translations of English technical vocabulary used in Mainland China. This led them to add more Chinese concepts.

We also observed that some associated concepts were judged as irrelevant because they do not show a clear association with their test descriptor. For example, one of the associated concepts for the Chinese test descriptor for "smuggling" is "Mr Mark Steeple", because the Chief Inspector of the Anti-smuggling Task Force in Hong Kong is Mr Mark Steeple. Another associated concept is "Mirs Bay", because of the recent trend of smuggling by small craft in the Mirs Bay area. However, the graduate students had no prior knowledge of these facts and judged the concepts as irrelevant. Since the corpus is a dynamic resource, it is not surprising that the students lacked such prior knowledge; for a criminal analyst, however, this information is important for identifying the recent trend of smuggling by small craft in the Mirs Bay area. In addition, one of the associated concepts for "Golden Bauhinia Square" is the Chinese term for "Police". We know that the flag-raising ceremony begins promptly at 8 a.m. with the Flag Raising Parade at the twin flagpoles at Golden Bauhinia Square, and that the flag party, provided by the Hong Kong Police Force, comprises a Senior Inspector of Police and four flag raisers. Without knowing this, the subjects only read the concept space and judged that there is no clear association between "Police" and "Golden Bauhinia Square". This phenomenon shows that the clustering process using the Hopfield network induces relevant concepts based on the contents of the documents.

Apart from this, a lexical item (word) in a sentence may be a concept in one language [12], where a concept is a recognizable unit of meaning in any given language [11]. A concept represented by a word in one language may be translated into a word, two words, a phrase, or even a sentence in another language [11]. A concept in one language can be a broader concept encompassing several narrower concepts, and the translation of such a concept may result in an altered concept in another language; conversely, a narrower concept in one language may be translated as a broader concept in another language. Such a relationship is known as a generic-specific relationship [12]. For example, the word "China" may be narrowed to the specific word "Beijing", a city in China. Omission, addition, and deviation are also common phenomena; "Closure", for example, is translated by the dictionary into one Chinese term, but in some cases it corresponds to a different Chinese expression meaning "stop service" (deviation). Conceptual alternation may therefore occur in translation, and this also caused the judges to rate some associated concepts as irrelevant. Nida [11] explains that conceptual alteration has three major causes: 1) no two languages are completely isomorphic; 2) different languages may have different domain vocabularies; and 3) some languages are more rhetorical than others.
Courtial and Pomian [6] argued that searches performed in the realms of science and technology frequently involve associations of concepts that lie outside the traditional associations represented in thesauri. Associative networks gleaned through textual analysis, they argued, facilitate innovation by making apparent associations that would otherwise be impossible for humans to find on their own. In early research,
Lesk [14] found little overlap between term relationships generated through term associations and those presented in existing thesauri. This kind of term relationship is especially important for criminal analysis: the associated concepts in the concept space can provide links about persons who generally remain hidden, unknown, and use aliases, who in turn belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate through bulletin boards, e-mail, phone calls, letters, word of mouth, etc. – encrypted or not. Ekmekcioglu, Robertson, and Willett [8] tested retrieval performance for 110 queries on a database of 26,280 bibliographic records using four approaches. Their results suggested that performance may be greatly improved if a searcher can select and use the terms suggested by a co-occurrence thesaurus in addition to the terms he has generated [4].

4.4 Translation Ability of the Concept Space

The 46683 associated concepts were also examined. For those test descriptors associated with two relevant associated concepts, 47.64% of the associated concepts are Chinese concepts and 52.36% are English concepts. Among the 9222 test descriptors, 87.7% obtain their translations from the associated concepts. This shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
5 Conclusion

The tragic events of September 11 have prompted rapidly growing attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information is written in different languages, stored in different locations, and may be seemingly unconnected. Cross-lingual semantic interoperability is therefore a major challenge in generating an overview of this disparate data and information so that it can be analyzed, shared, and searched. To effectively predict and prevent criminal activities, an intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data with known patterns to detect important anomalies. However, information retrieval (IR) systems present two main interface challenges: first, how to permit a user to input a query in a natural and intuitive way, and second, how to enable the user to interpret the returned results. A component of the latter encompasses ways to permit a user to comment and provide feedback on results and to iteratively improve and refine them. The vocabulary difference problem has been widely recognized: users tend to use different terms for the same information sought. Moreover, in criminal analysis the man-made fog of deliberate deception works against normal pattern learning from databases and causes much crucial information and underlying knowledge to be buried. As a result, an exact match between the user's terms and those of the indexer is unlikely, and an advanced tool is required to understand the user's needs. Cross-lingual information retrieval brings an added complexity to the standard
IR task: users can have different abilities in different languages, affecting their ability to form queries and interpret results. This highlights the importance of automated assistance for refining a query in cross-lingual information retrieval. This article has presented a bilingual concept space approach using the Hopfield network to relieve the vocabulary problem in national security information sharing, using the Hong Kong Police press release bilingual pairs as an example. The concept space allows the user to interactively refine a search by selecting concepts that have been automatically generated and presented, and to descend to the level of actual objects in a collection at any time. Some information may seem unconnected but can actually help the analyst identify important anomalies, such as traffic accidents that frequently happen at a particular location. Since the press release collection is dynamically generated, the subjects may not have full prior knowledge of it. Nevertheless, the experimental results show that the precision and recall of the bilingual concept space are over 78% in all cases. Among the 9222 test descriptors, 87.7% obtain their translations from the associated concepts, which shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
References

1. Bates, M. J., "Subject access in online catalogs: A design model", Journal of the American Society for Information Science, 37, 357–376 (1986)
2. Chen, H., Lynch, K. J., "Automatic construction of networks of concepts characterizing document databases", IEEE Transactions on Systems, Man and Cybernetics, 22(5), 885–902 (1992)
3. Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A., Lin, C., "A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 771–782 (1996)
4. Chen, H., Ng, T., Martinez, J., Schatz, B., "A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System", Journal of the American Society for Information Science, 48(1), 17–31 (1997)
5. Chien, L. F., "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", Proceedings of ACM SIGIR, pp. 50–58, Philadelphia, PA (1997)
6. Courtial, J. P., Pomian, J., "A system based on associational logic for the interrogation of databases", Journal of Information Science, 13, 91–97 (1987)
7. Cunliffe, D., Jones, H., Jarvis, M., Egan, K., Huws, R., Munro, S., "Information Architecture for Bilingual Web Sites", Journal of the American Society for Information Science, 53(10), 866–873 (2002)
8. Ekmekcioglu, F. C., Robertson, A. M., Willett, P., "Effectiveness of query expansion in ranked-output document retrieval systems", Journal of Information Science, 18, 139–147 (1992)
9. Fung, P., McKeown, K., "A technical word- and term-translation aid using noisy parallel corpora across language groups", Machine Translation, 12, 53–87 (1997)
10. Hayes-Roth, F., Waterman, D. A., Lenat, D., Building Expert Systems. Addison-Wesley, Reading, MA (1983)
11. He, S., "Translingual Alteration of Conceptual Information in Medical Translation: A Cross-Language Analysis between English and Chinese", Journal of the American Society for Information Science, 51(11), 1047–1060 (2000)
12. Larson, M. L., Meaning-Based Translation: A Guide to Cross-Language Equivalence. University Press of America, Lanham, MD
13. Leonardi, V., "Equivalence in Translation: Between Myth and Reality", Translation Journal, 4(4) (2000)
14. Lesk, M. E., "Word-word associations in document retrieval systems", American Documentation, 20(1), 27–38 (1969)
15. Lin, C. H., Chen, H., "An Automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents", IEEE Transactions on Systems, Man and Cybernetics, 26(1), 75–88 (1996)
16. Ma, X., Liberman, M., "BITS: A Method for Bilingual Text Search over the Web", Machine Translation Summit VII, Kent Ridge Digital Labs, National University of Singapore (1999)
17. Oard, D. W., Dorr, B. J., "A Survey of Multilingual Text Retrieval", UMIACS-TR-96-19, CS-TR-3815 (1996)
18. Oard, D. W., "Alternative approaches for cross-language text retrieval", in Hull, D., Oard, D. (eds.), 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval, American Association for Artificial Intelligence (1997)
19. Resnik, P., "Mining the Web for Bilingual Text", 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland (1999)
20. Rose, M. G., "Translation Types and Conventions", in Rose, M. G. (ed.), Translation Spectrum: Essays in Theory and Practice, State University of New York Press, pp. 31–33 (1981)
21. Salton, G., Automatic Text Processing. Addison-Wesley, Reading, MA (1989)
22. Simard, M., "Text-translation Alignment: Three Languages Are Better Than Two", Proceedings of EMNLP/VLC-99, College Park, MD (1999)
23. Yang, C. C., Luk, J., Yung, S., Yen, J., "Combination and Boundary Detection Approach for Chinese Indexing", Journal of the American Society for Information Science, Special Topic Issue on Digital Libraries, 51(4), 340–351 (2000)
24. Yang, C. C., Li, K. W., "Automatic Construction of English/Chinese Parallel Corpora", Journal of the American Society for Information Science and Technology, 54(7) (2003)
25. Zanettin, F., "Bilingual comparable corpora and the training of translators", in Laviosa, S. (ed.), META, 43(4), Special Issue: The corpus-based approach: a new paradigm in translation studies, 616–630 (1998)
Decision Based Spatial Analysis of Crime

Yifei Xue and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904, USA
{yx8d,brown}@virginia.edu
Abstract. Spatial analysis of criminal incidents is an old and important technique used by crime analysts. However, most of this analysis considers the aggregate behavior of criminals rather than individual spatial behavior. Recent advances in the modeling of spatial choice and data mining now enable us to better understand and predict individual criminal behavior in the context of the criminals' environment. In this paper, we provide a methodology to analyze and predict the spatial behavior of criminals by combining data mining techniques and the theory of discrete choice. The models based on this approach are shown to improve the prediction of future crime locations when compared to traditional hot spot analysis.

Keywords. Spatial choice, feature selection, preference specification, model-based clustering
incidents, like many other human-initiated events, involve a decision-making and choice process. Much criminological work has treated criminals or offenders as decision makers who want to benefit from their criminal behavior and avoid exposure to the risk of law enforcement [8]. We take advantage of the fact that the selection of crime targets indicates the criminals' preferences for specific sites in terms of spatial attributes. While interest in these criminal preferences is unique to law enforcement, we can exploit work in economics on spatial choice among consumers to better understand criminal preferences and then use this understanding to predict criminal behavior. This paper develops a spatial choice methodology based on these ideas to analyze location-based crime data.

Spatial choice theory describes human behavior in space as rational decisions among the available spatial alternatives. The choices indicate certain spatial patterns and represent the decision makers' preferences. At the heart of recent work in this area is the pioneering development by McFadden, which led to the formal modeling of discrete spatial choice [1], [18]. Discrete choice models are used for the analysis and prediction of spatial decision making under uncertainty with multiple alternatives. They have been extended to a number of areas, such as consumer destination selection [12], [24], travel mode analysis [3], [19], and recreational demand models [25]. These analyses examine the spatial decisions of a large number of individuals. In general, these decisions have been studied through surveys that address a rather limited set of spatial alternatives for each decision maker. Clearly, spatial choice analysis for crime data breaks new ground. The alternatives are the commercial properties, buildings, and houses in the study area, and while the number of spatial alternatives is finite, it is nonetheless very large compared to other spatial choice problems. Also, the preferences of the criminal decision makers cannot be directly or accurately assessed through interviews, surveys, or questionnaires.

In the rest of this paper, we first formally define the criminal spatial choice problem in Section 2. Section 3 presents the spatial choice models derived from these formal definitions. In Section 4, the models are applied to actual crime data and the locations of future criminal incidents are predicted; comparison results with these new models are reported and summarized. Section 5 contains the conclusions.
2 Problem Statement

Data items for spatial or crime analysis have two components: a location component and an attribute component. They can be represented by a vector {Q, S, k}. Q is the universe of the location component, which is discrete and indexes all spatial alternatives by an ordered pair of coordinates {x, y}. S is the attribute component associated with the spatial alternatives, indicating S different attributes, S = {s_1, s_2, ..., s_S}. k : Q → S is a mapping function specifying the observed attributes of the alternatives.
The spatial decision process can be represented by a vector {Q, S, k, A, D, u, P}. The set A is a subset of Q indicating the finite choices available to all individuals D; A = {a_1, a_2, ..., a_N} represents the N available alternatives for decision makers to choose from. For the spatial analysis of crimes, N is a very large number. D is the universe of individuals who make choices over the available alternative set A; each individual makes choices based on a decision process. u is the utility function mapping the preferences of individuals D over the alternative set A to a utility value U. For an individual d, if the choice sets A_d = {a_1, a_2, ..., a_N} and A_d' = {a_1', a_2', ..., a_N'} have the same attribute values, then they have the same utility, U = u(A_d) = u(A_d'). According to the rational decision making assumption, individuals make choices that maximize their utility. The probability that an individual d from D will choose alternative a_i from an available choice set A_d is specified as P{a_i | A_d, d}, which is produced by the choice process {Q, S, k, A, D, u, P}. The probability P{a_i | A_d, d} is a mapping based on the preferences of individual d and the attributes of all alternatives in the set A_d. The mapping can be stated as P : A × S × D → (0, 1), or indicated by the utility-based function

    P{a_i | A_d, d} = P{u(a_i) ≥ u(a_j) | d, a_j ∈ A_d}.

The utility of alternative a_i to individual d can be divided into two parts:

    U_id = V(d, s_i) + ε(d, s_i).

V(d, s_i) = Σ_l β_l x_l^i is the deterministic part of the utility value, expressed as a linear additive function of all attributes, where x_l^i ∈ X = (S, D) represents the lth component of the combination of the attribute values s_i and the characteristics of individual d. ε(d, s_i) is the error term of the utility function, indicating its unobservable components.
3 Model Development

3.1 Spatial Choice Patterns

Spatial choice theory describes how individuals choose a specific site in space as their target. Their choices show certain patterns in space. The geographical sites form a spatial alternative set A, and individuals make selections from this choice set. Since the number of alternatives in a spatial choice process is very large, individuals are unable to evaluate all spatial alternatives before making their selections. They can only compare part of the choice set and pick the spatial alternative with the highest utility value; this can be stated as a sub-optimal or locally optimal problem. According to Fotheringham's framework of individuals' hierarchical information processing [13], individuals make spatial choices from the alternatives they have evaluated.
For individual d, the choice set is A_d ⊆ A, which contains all spatial alternatives that individual d really considered. The choice that individual d makes will probably have the highest utility among all alternatives in the choice set A_d. Unlike previous work in discrete choice theory, the real choice set A_d in crime analysis is not known to the analysts. Some methods have been proposed to identify or estimate the probability P(a_i ∈ A_d) that an alternative a_i is considered by individual d. After the identification of the individuals' choice set, two factors are considered in people's spatial choice process: i) the utility of alternative a_i to individual d, and ii) the probability that alternative a_i is available to or considered by individual d. Since the number of spatial alternatives is very large, it is possible that some alternatives would give higher utility values but are never considered. In order to reveal the individuals' preferences, we make an assumption here.

Assumption 1: The two factors (i and ii) mentioned above are equally important to the individuals' choice decisions.

The combination of P(a_i ∈ A_d) and the utility U_id of alternative a_i to individual d gives a better estimation of the probability of choices. With Assumption 1, the probability that individual d chooses alternative a_i from A_d can be stated as P(U_id > U_jd + ln P(a_j ∈ A_d), for all a_j ∈ A_d) · P(a_i ∈ A_d) [13]. In order to obtain the spatial choice model, we make another assumption.

Assumption 2: The error term ε(d, s_i) of the individuals' utility function is independently and identically distributed with a Weibull distribution [18].

The spatial choice model is then derived with the same method as McFadden used [13], [18]:

    P(a_i | A_d, d) = exp(V(d, s_i)) · P(a_i ∈ A_d) / Σ_{j∈A} exp(V(d, s_j)) · P(a_j ∈ A_d)        (1)
This model is a multinomial logit model in which each alternative's observable utility is weighted by the probability that the alternative is evaluated.

3.2 Specification of Prior Probability

We assume that the hierarchical information process takes place before the individuals' spatial choices: individuals first evaluate sets of alternatives, and only alternatives within those sets can be selected. We can either define the choice set A_d or give the probability P(a_i ∈ A_d) that an individual will evaluate a certain alternative.
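A small sketch of the probability-weighted multinomial logit of Equations 1-2 is shown below; the coefficient vector beta and the attribute matrix are hypothetical inputs that would come from model estimation and the data set.

```python
import numpy as np

def choice_probabilities(x, beta, p_eval):
    """Probability-weighted multinomial logit (Equations 1-2).

    x      : N x L matrix of attribute values for the N spatial alternatives
    beta   : length-L vector of preference coefficients (estimated elsewhere)
    p_eval : length-N vector of P(a_i in M), the chance each alternative is even considered
    Returns P(a_i | A_d, d) for every alternative.
    """
    v = x @ beta                        # deterministic utility V(d, s_i)
    v = v - v.max()                     # numerical stabilization before exponentiation
    weighted = np.exp(v) * p_eval       # utility weighted by the evaluation probability
    return weighted / weighted.sum()
```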
For the spatial analysis of crime data, it is not easy to know individuals' preferences, so we have to make an assumption to simplify the model derivation.

Assumption 3: During the process of individuals' spatial choices, the preferences of all individuals d ∈ D are the same. The pre-evaluated spatial alternative set A_d is also the same for different individuals.

We use M to represent the set of pre-evaluated spatial alternatives for all individuals. Under Assumption 3, the spatial site selection model becomes

    P(a_i | A_d, d) = exp(V(d, s_i)) · P(a_i ∈ M) / Σ_{j∈A} exp(V(d, s_j)) · P(a_j ∈ M)            (2)

The definition of P(a_i ∈ M) is important here. We use a kernel density estimation method to obtain the probability that spatial alternative a_i is evaluated by criminals. From the study of Brown et al. [7], we know that the location components of the spatial alternatives alone do not provide enough information about the criminals' preferences. There are many feature values attached to the spatial alternatives, and a part of these values is believed to be relevant to the occurrence of criminal incidents; unfortunately, we do not know which part. However, we can mine the criminals' preferences from all feature values of past crime incidents. We use a feature selection process to find the smallest feature subset of the universal feature space, called the key feature set or key feature space, in which the past criminal incidents indicate clear patterns. These patterns represent possible preferences in the criminals' pre-evaluation. Using the selected key features, we obtain the prior evaluation probability P(a_i ∈ M) as follows.
    P(a_i ∈ M) = (1/K) Σ_{k=1}^{K} L( (s_i^1 − s_k^1)/h_1, (s_i^2 − s_k^2)/h_2, (s_i^3 − s_k^3)/h_3, ... )        (3)
where s_i^1, s_i^2, s_i^3, ... are the key features of spatial alternative a_i, K is the total number of observations, and L is a function specifying the kernel estimator; we use a Gaussian function here. The h's are the bandwidths used in the kernel estimation. Changing the bandwidths influences the quality of the density estimation, so the choice of bandwidths is important, and the literature in this area offers a great deal of discussion. We use a recommended bandwidth selection method from Bowman and Azzalini [5],
    h_i = ( 4 / ((p + 2) K) )^{1/(p+2)} × σ_i

for the ith dimension, where p is the number of dimensions of the density estimation. The model adjusted with the estimated prior probability P(a_i ∈ M) is called the key feature adjusted spatial choice model.

3.3 Spatial Misspecification

Both the spatial choice model and other discrete choice models try to include all related predictor variables to estimate decision makers' preferences and predict their future
choices. However, it is practically impossible to include all relevant variables that affect people's decisions in a spatial choice model. First, some variables may be very difficult to measure. Second, some variables that affect choices may not have been conceptualized or identified by analysts. Third, even if it were possible to identify and include all relevant variables, some of them would be redundant and correlated with one another, and too many predictor variables make the estimated parameters unstable and reduce the model's predictive accuracy. It is therefore necessary and inevitable to omit many predictor variables, which leads to the misspecification of choice models.

During the development of our spatial choice model, Assumption 3 states that the preferences of all individuals d ∈ D are the same and that the pre-evaluated choice set A_d is the same for all individuals. This makes it easy to estimate the pre-evaluated choice set A_d, but it also biases the estimated individual preferences due to the lack of related information about the decision makers. For crime analysis, it is impossible to include all preference information in the spatial choice model, but the preferences of decision makers can be specified from their past choices. To avoid this bias and increase the predictive accuracy of the spatial choice model, it is necessary to account for the bias introduced by the absence of important factors and to discover the preferences of individuals. In our spatial choice model, we specify the pre-evaluated choice set for individuals with different preferences. One solution is to classify all decision makers by their preferences inferred from past choice incidents. With well-selected key features, the past choices indicate certain patterns in the key feature space. We use clustering methods to identify the different classes of decision makers and to identify their preferences by defining the pre-evaluated choice sets. The adjusted spatial choice model is called the Preference Specified Spatial Choice Model (PSSCM).

3.4 Clustering Methods

Clustering is one of the most useful tasks in data mining for discovering groups and identifying interesting distributions and patterns in an underlying data set. Clustering involves partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters. Researchers have studied clustering extensively, since it occurs in many applications in engineering and science, and clustering may result in different partitionings of a data set depending on the specific criterion used. The basic steps in developing a clustering can be summarized as feature selection, the clustering algorithm, validation of the results, and interpretation of the results. Feature selection chooses the features on which clustering is to be performed so as to encode as much information as possible. We have used the feature selection step to find the key features. By removing all features that are irrelevant to classification, the small feature space subset provides enough
information for pattern recognition, thereby reducing the cost and improving the quality of classification [23]. The clustering algorithm is the most important part of the clustering process, and it includes similarity measures, partitioning methods, and stopping criteria, each of which is described by a variety of sources [11], [16]. No matter what clustering algorithm is used, it is important to define a stopping criterion or to determine how many clusters are in the data set. Various strategies for the simultaneous determination of the number of clusters and the cluster memberships have been proposed, for example by Engelman and Hartigan [10], Bock [4], Bozdogan [6], and Fraley and Raftery [14]. Fraley and Raftery use a model-based strategy and the Bayesian Information Criterion (BIC) to perform clustering and determine the number of clusters. In this approach, the data are viewed as coming from a mixture of probability distributions, each representing a different cluster. Methods of this type have been applied in a number of practical applications.

In model-based clustering, it is assumed that the data are generated by a mixture of underlying probability distributions in which each component represents a different group or cluster. Let f_k(a_i | θ_k) be the density of an observation a_i from the kth component, where θ_k are the corresponding parameters. The density function f_k(a_i | θ_k) is generally assumed to be a multivariate normal distribution of the form

    f_k(a_i | μ_k, Σ_k) = exp{ −(1/2) (a_i − μ_k)^T Σ_k^{-1} (a_i − μ_k) } / ( (2π)^{p/2} |Σ_k|^{1/2} )        (4)

where μ_k is the mean vector and Σ_k is the covariance matrix of the observations; these are the parameters of the density distribution. The parameterization of the covariance matrix Σ_k decides the characteristics (orientation, volume, and shape) of the cluster distributions; these characteristics can be allowed to vary between clusters or constrained to be the same for all clusters. Expectation maximization (EM) is then used to find the clusters, and the Bayesian Information Criterion (BIC) is used as a criterion to compare different models.
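The original analysis used Fraley and Raftery's model-based clustering software; the sketch below is a rough equivalent built on scikit-learn's GaussianMixture, with BIC used to choose the number of components (scikit-learn defines BIC so that lower is better, the opposite sign convention from mclust).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_model_based_clusters(x, max_k=9, covariance_type="full", seed=0):
    """Gaussian mixture clustering with the number of clusters chosen by BIC.

    x : K x p array of key-feature vectors for the observed incidents.
    """
    fits = [GaussianMixture(n_components=k, covariance_type=covariance_type,
                            random_state=seed).fit(x)
            for k in range(1, max_k + 1)]
    bics = [m.bic(x) for m in fits]          # lower BIC is better in scikit-learn
    best = fits[int(np.argmin(bics))]
    labels = best.predict(x)                 # cluster membership for each incident
    return best, labels, bics
```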
4 Application of the Spatial Choice Model to Real Crime Analysis

4.1 Crime Data Set

The data for model estimation came from the ReCAP (Regional Crime Analysis Program) system. The ReCAP system is an interactive shared information and decision support system that uses databases, a geographic information system (GIS), and statistical tools to analyze, predict, and display future crime patterns.
Our crime analysis was based on crime incidents between July 1, 1997 and September 30, 1997 in the city of Richmond, Virginia. We used residential "Breaking and Entering" (B & E) crime incidents for model estimation and validation. Using the crime incidents in the training data set, we placed all incidents on a geographic map. The sub-regions shown in Fig. 1 are block groups, which are the smallest areas for which census counts are recorded.
Fig. 1. Breaking and Entering criminal incidents between July 01, 1997 and September 30, 1997 in Richmond, Virginia.
The analysis of B & E is related to the locations of households in a city. However, it is difficult to represent all locations of individual houses in even a modest-sized city such as Richmond. Therefore, we aggregated the alternatives using 2517 regular grid cells, which were assumed to be fine enough to represent all spatial alternatives within this area. The features of each spatial alternative came from the combination of census data (from the "CensusCD + Maps" compact disk held at the University of Virginia's Geospatial and Statistical Data Center) and calculated distance values. All features were possibly related to the decision process of criminals.

4.2 Feature Selection by Similarities

Since the attributes of the spatial alternatives came from census data and calculated distance values, it is possible that some of these attribute values are correlated. Using the calculated correlation values as similarities, we performed hierarchical clustering on all features of the observed spatial incidents. The resulting clustering of features is shown in Fig. 2.
Fig. 2. Clusters of features of observed spatial alternatives
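A sketch of this similarity-based grouping of features is shown below; it assumes the features sit in a pandas DataFrame and uses 1 − |correlation| as the distance, which is our choice rather than a detail stated in the paper.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(features: pd.DataFrame, n_clusters: int = 5):
    """Group correlated features using their pairwise correlations as similarities."""
    corr = features.corr().abs()
    dist = 1.0 - corr                            # turn similarity into a distance (assumption)
    condensed = squareform(dist.values, checks=False)
    tree = linkage(condensed, method="average")  # hierarchical clustering of the features
    groups = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(features.columns, groups))
```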
From the clustering tree, we divided the features into five clusters, each containing correlated features. After checking the distributions of the feature values, we found that COND1.DST is almost uniformly distributed and is therefore not a good feature for our analysis. For the remaining features there are two options: randomly picking one feature from each cluster, or combining the features within each cluster. We picked the features D.HIGHWAY (distance to highway), FAM.DENSITY (family density per unit area), P.CARE.PH (personal care expenditure per household), and D.HOSPITAL (distance to hospital); the first three were used by Brown et al. [7]. These are the key features and are assumed to be good enough to represent the other features in their clusters.

Based on the selected features, we applied Fraley and Raftery's clustering method [14] to the crime data. The number of clusters was decided by the calculated BIC values, whose trends are shown in Fig. 3. According to Fig. 3, we decided that there are 6 clusters in the crime data set. Each cluster corresponds to a group of criminals that have similar preferences in their choices of spatial alternatives. The distribution of crime incidents within the different clusters is listed in Table 1.

4.3 Model Estimation and Prediction

The number of spatial alternatives for crime spatial analysis is very large, which makes data preparation and computation prohibitively expensive. To handle this problem, we adopted an importance sampling technique suggested by Ben-Akiva
[1]. Sampling alternatives is a commonly applied technique for reducing the computational burden involved in the model estimation process.
[Figure 3: BIC values (y-axis) plotted against the number of clusters (x-axis) for the four parameterizations labeled 1-4.]
Fig. 3. The trends of BIC values for the different parameterized model-based clustering algorithms. 1: equal volume, equal shape, no orientation; 2: variable volume, equal shape, no orientation; 3: equal volume, equal shape, equal orientation; 4: variable volume, variable shape, equal orientation
Table 1. Distribution of crime incidents in clusters
                  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
Crime incidents   109        180        200        202        133        55
Next we consider the model estimation and prediction step. The prior probability P(a_i ∈ M) of the adjusted spatial choice model was calculated as in Section 3.2 for each cluster, using the key features produced by the feature selection process. Using the training data set of B & E incidents of each cluster, we obtained the estimate of the preference specified spatial choice model for each cluster, P(a_i | A_d, d ∈ M_l), where M_l indicates the presence of criminals with preferences in the lth cluster. The final prediction of the future crimes' spatial distribution is
the combination of the predicted probabilities of all clusters. The combination method is also very important. Given the conditional probability P(a_i | A_d, d ∈ M_l) that spatial alternative a_i will be picked by criminals within cluster M_l, and the chance P(M_l) that criminals d ∈ M_l will commit the next crime within the study region, the probability that spatial alternative a_i is picked by any criminal is

    P(a_i | A_d, d ∈ M) = Σ_{l=1}^{L} P(a_i | A_d, d ∈ M_l) P(M_l)

where L is the total number of clusters within the crime data set. The probability P(M_l) can be defined by many methods; here we used the ratio

    P(M_l) = P(a_i ∈ M_l) / Σ_{j=1}^{L} P(a_i ∈ M_j)                                               (5)
where P(a_i ∈ M_l) is the probability that an individual d ∈ M_l pre-evaluates spatial alternative a_i. With the preference specified spatial choice model described above, we made our predictions. We also use the hot spot model as a comparison model to test the two models provided in this paper, the key feature adjusted spatial choice model and the preference specified spatial choice model. The residential B & E incidents between October 1, 1997 and October 31, 1997 were used as the testing data set. The predictions of the future crimes' spatial distribution and the testing incidents are shown in Figs. 4-6.
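The sketch below illustrates how the per-cluster predictions can be combined according to Equation 5. It uses scipy's gaussian_kde, whose default bandwidth rule stands in for the Bowman-Azzalini choice used in the paper, and the input array shapes are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def combined_prediction(per_cluster_probs, cluster_incident_features, alt_features):
    """Combine per-cluster choice probabilities into one surface (Section 4.3, Eq. 5).

    per_cluster_probs         : list of length-N arrays, P(a_i | A_d, d in M_l) for each cluster l
    cluster_incident_features : list of (K_l x p) arrays of key features of past incidents per cluster
    alt_features              : N x p array of key features of the N spatial alternatives
    """
    # P(a_i in M_l): kernel density of cluster l's past incidents, evaluated at each alternative
    pre_eval = np.array([gaussian_kde(inc.T)(alt_features.T)
                         for inc in cluster_incident_features])      # shape L x N
    p_ml = pre_eval / pre_eval.sum(axis=0, keepdims=True)            # Equation 5, per alternative
    combined = sum(p * w for p, w in zip(per_cluster_probs, p_ml))   # mixture over clusters
    return combined
```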
Fig. 4. Prediction of hot spot model with crime incidents from 10/01/97 to 10/31/97
Fig. 5. Prediction of key feature adjusted spatial choice model with crime incidents from 10/01/97 to 10/31/97
Fig. 6. Prediction of preference specified spatial choice model with crime incidents from 10/01/97 to 10/31/97
4.4 Model Comparisons

To compare the different models, we standardized all predictions of the adjusted models and the comparison model. The hypothesis is that, for the population of all future crime incidents, the proposed model will outperform the comparison model.
We assume that the testing data set contains m incidents that occurred at the locations a′_1, a′_2, ..., a′_m, respectively. For incident a′_i, let the predicted probability given by the proposed model be p_sp,i and that given by the comparison model be p_sc,i. The hypothesis test was built around µ, which denotes the mean of the difference between the predicted probability given by the proposed model and that given by the comparison model. Assume that the proposed model gives a better prediction of future crimes than the comparison model. Then the null hypothesis is that the predicted probability difference µ between the two models for future crime incident locations is less than or equal to 0, and the alternative hypothesis is that the predicted probability for the proposed model is significantly higher than that for the comparison model. We performed the hypothesis test as

H_0: µ ≤ 0,  H_a: µ > 0.    (6)

Using the testing data set with m crime incidents, we obtained the estimated probability difference µ̂:

µ̂ = (1/m) Σ_{i=1}^{m} (p_sp,i − p_sc,i).    (7)

The standard deviation of the difference qs_i = p_sp,i − p_sc,i was estimated by

σ̂ = sqrt( (1/(m−1)) Σ_{i=1}^{m} (qs_i − µ̂)² ).    (8)
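The z-statistics and p-values reported in the following table follow from a one-sided z-test on this mean paired difference. A minimal Python sketch of the computation (an illustration built from Eqs. (6)-(8), not the authors' code; the prediction arrays are assumed):

```python
import math

def paired_one_sided_z_test(p_proposed, p_comparison):
    """One-sided test of H0: mu <= 0 vs Ha: mu > 0 on paired prediction differences."""
    m = len(p_proposed)
    diffs = [p - c for p, c in zip(p_proposed, p_comparison)]
    mu_hat = sum(diffs) / m                                                  # Eq. (7)
    sigma_hat = math.sqrt(sum((d - mu_hat) ** 2 for d in diffs) / (m - 1))   # Eq. (8)
    z = mu_hat / (sigma_hat / math.sqrt(m))      # standardized mean difference
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return mu_hat, sigma_hat, z, p_value
```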
The results of these tests are shown in Table 2. In the testing results, "Mean" and "Std. Dev." stand for µ̂ and σ̂, respectively, and the p-value gives the probability of observing a difference at least as large as µ̂ under the null hypothesis.

Table 2. The comparison results for the testing data set (10/01/97 - 10/31/97)

Comparison                                        Mean         Std. Dev.    z-Statistic  p-Value
Preference Specified vs. Hot Spot                 7.757×10^-4  5.051×10^-3  2.624        0.004
Key feature adjusted vs. Hot Spot                 2.861×10^-5  2.603×10^-4  1.878        0.030
Preference Specified vs. Key feature adjusted     7.471×10^-4  4.922×10^-3  2.593        0.005
The comparison results indicate that the two spatial choice models significantly outperform the comparison hot spot model. The preference specified spatial choice model also significantly outperforms the key feature adjusted spatial choice model. These results show that analyzing the feature values attached to the spatial alternatives and the specified preferences of the decision makers improves the prediction of the locations of future crimes. Based on the estimation of criminals' preferences over the feature space, we provide a more efficient and accurate prediction method for the analysis of the spatial information of crimes.
5 Conclusion

Spatial analysis is of critical importance to law enforcement. It enables better planning and use of scarce resources and is particularly useful when addressing the variety of threats facing modern communities. Past work in this area has concentrated on aggregated approaches to understanding criminal behavior and has displayed the results of this analysis as hot spots. In this paper, a new preference specified spatial choice model is provided that shows how the preferences of criminals can be modeled to better understand the spatial patterns of crime. When used with actual breaking and entering data, this method increased the accuracy of predicting future crime locations by a statistically significant amount. In addition, the method also provides a way to interpret the relationship between criminal decision making and spatial attributes.
CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis

Jennifer Schroeder¹, Jennifer Xu², and Hsinchun Chen²

¹ Tucson Police Department, 270 S. Stone Avenue, Tucson, AZ 85701
[email protected]
² Department of MIS, University of Arizona, Tucson, Arizona 85721
{jxu, hchen}@eller.arizona.edu
Abstract. Link (association) analysis has been used in law enforcement and intelligence domains to extract and search associations between people from large datasets. Nonetheless, link analysis still faces many challenging problems, such as information overload, high search complexity, and heavy reliance on domain knowledge. To address these challenges and enable crime investigators to conduct automated, effective, and efficient link analysis, we proposed three techniques: the concept space approach, a shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining the importance of associations. We implemented a system called CrimeLink Explorer based on the proposed techniques. Results from our user study involving ten crime investigators from the Tucson Police Department showed that our system could help subjects conduct link analysis more efficiently. Additionally, subjects concluded that association paths found based on the heuristic approach were more accurate than those found based on the concept space approach.
1 Introduction

Link analysis faces several challenging problems. First, information about associations between crime entities (person, location, organization, property, etc.) is often buried in large volumes of raw data collected from multiple sources (e.g., crime incident reports, surveillance logs, telephone records, financial transaction records, etc.), creating an information overload problem. Usually, link analysis entails an investigator manually expanding known entities by reading each document in which the entities in question appear. If two entities appear in the same document, they may have some association with each other. If no association is found, the investigator has to iteratively expand more documents until a significant path of associations between the entities is found. This process can be tremendously time-consuming.

Second, high branching factors (the number of direct links an entity has) increase the search complexity of link analysis dramatically. A high branching factor can lead to a large number of associations that need to be evaluated when two crime entities are not directly associated. In a breadth-first search of depth 4, for instance, an average branching factor of 7 can result in 2,401 associations that need to be evaluated. In reality, criminals who have repeated police contacts and arrests tend to commit many crimes with many people, causing high branching factors. The branching factor of an association search can be further inflated if associations with many other entity types (e.g., addresses, organizations, property, or vehicles) are considered.

Third, determining the importance of associations for uncovering investigative leads relies heavily on domain knowledge. Crime investigators often focus only on strong and important associations and paths because different types of crimes usually have different characteristics. Associations between crime entities may carry different weights in the investigation of different types of crimes. For example, the relationship between a suspect and a victim may not be as important for uncovering investigative leads in a burglary case as in a homicide case. Link analysis may distract or mislead an investigation if not guided by domain knowledge.

Some link analysis software packages are available for use in crime investigation. However, most of these packages do not help extract, search, and analyze associations beyond mere visualization of analysis results. Some tools facilitate only single-level association searches, finding only directly related entities. Automated, effective, and efficient link analysis techniques are needed to assist law enforcement and intelligence investigators in carrying out crime investigation [21, 24].

To address the challenges of link analysis, we proposed and implemented several techniques for automated link analysis. These techniques include the concept space approach [4] to extracting associations from crime data, a heuristic-based approach to incorporating domain knowledge, and a shortest-path algorithm [7] to search association paths and reduce the search complexity imposed by high branching factors.

The rest of the paper is organized as follows. We review prior literature in Section 2 and discuss system design in Section 3. In Section 4 we present results of a system evaluation study conducted at the Tucson Police Department (TPD). Section 5 concludes the paper and suggests future directions.
2 Literature Review

In this section, we review related work in link analysis, domain knowledge incorporation approaches, and shortest-path algorithms.

2.1 Link Analysis

The earliest approach to link analysis is the Anacapa charting system [16]. In this approach, an investigator first constructs an association matrix by examining documents to identify associations between crime entities. Based on this association matrix, a link chart can be drawn for visualization purposes. In a link chart, different symbols represent different types of entities, such as individuals, organizations, vehicles, or locations. Based on this chart, an investigator may discover new investigative directions or confirm initial suspicions about specific suspects [24]. However, this approach is primarily manual and depends on human investigators to extract, search, and analyze association data. It offers little help with the information overload and high search complexity problems.

Some automated approaches have been proposed for link analysis. Lee [20] developed a technique to extract association information from free text. Relying heavily on Natural Language Processing (NLP) techniques, this approach can extract entities and events from textual documents by applying large collections of predefined patterns. Associations among extracted entities and events are formed using relation-specifying words and phrases such as "member of" and "owned by". The heavy dependence of this approach on hand-crafted language rules and patterns limits its application to crime data in diverse formats.

There have been some link analysis tools that allow for "single-level" or direct association searches. Watson [2] can identify possible links and associations between entities by querying databases. Given a specific entity such as a person's name, Watson can automatically form a database query to search for other related records. The related records found are linked to the given entity and the result is presented in a link chart. The COPLINK Detect system [5] applied a concept space approach developed by Chen and Lynch [4] for exploring associations. This approach was originally designed to generate thesauri from textual documents automatically by measuring the co-occurrence weight, the frequency with which two phrases appear in the same document. When applied to crime incident reports, this approach can automatically extract association information between crime entities and has been found to be efficient and useful for crime investigation [17].

However, both Watson and COPLINK Detect allow users to search for only direct ("single-level") associations and do not facilitate the search for association paths consisting of multiple intermediate links. Moreover, association strengths obtained using the concept space approach are based merely on co-occurrence weights. No domain knowledge is utilized to determine the importance of associations or to consider other information that can potentially suggest associations between entities.
In the next section, we review prior research on domain knowledge incorporation approaches.

2.2 Domain Knowledge Incorporation Approaches

Domain knowledge is often important for solving domain-specific problems. In the broader fields of artificial intelligence and data mining research, expert systems and Bayesian networks are typical techniques for incorporating domain knowledge. During the knowledge acquisition phase of expert system construction, domain experts' knowledge and experience, in addition to some common sense rules, are collected and recorded. The knowledge generated is usually represented as a set of rules and stored in a knowledge base [26]. Expert systems have been employed in domains such as factory scheduling [11], telephone switch maintenance [14], and disease diagnosis [23]. Because of the high expense of building knowledge bases and other issues such as low scalability and accuracy, expert systems have not been widely used.

The Bayesian network is another approach to incorporating the knowledge of domain experts [18]. It encodes existing knowledge in a probability network with each node representing a variable and each link representing a dependency relationship between two variables. Some variables in a Bayesian network representing auditors' knowledge of bank performance, for instance, can be the financial ratios indicating banks' financial health. Other variables can be indicators of bank failure or other risks. Links between these variables specify the dependency relationships [25]. In addition to incorporating existing knowledge, Bayesian networks can learn new knowledge from data [18] and have been shown to be effective in domains such as gene regulation function prediction [6, 10].

In the domain of law enforcement and intelligence, the approaches for incorporating expert knowledge have been primarily ad hoc. Goldberg and Senator [12] used a heuristic-based approach in the FinCEN system to form associations between individuals who had a shared address, a shared bank account, or related transactions. Money laundering and other illegal financial activities could be detected based on the associations discovered. However, these heuristics were used by investigators to manually uncover associations and have not really been incorporated into a system for automated link analysis. With large datasets, investigators still face the problems of information overload and high search complexity.

The next section reviews shortest-path algorithms, which can help reduce search complexity for human investigators. Although they have been studied and employed widely in other domains, shortest-path algorithms have not yet been widely adopted in the law enforcement domain.

2.3 Shortest-Path Algorithms

Shortest-path algorithms can find optimal paths between given nodes by evaluating link weights in a graph. One can focus on only the optimal path without being distracted by a large number of other possible paths. The Dijkstra algorithm [7] is the
classical method for computing the shortest paths from a single source node to every other node in a weighted graph. Most other algorithms for solving shortest-path problems are based on the Dijkstra algorithm but use improved data structures in their implementation [8]. Some researchers have proposed neural network approaches to solving the shortest-path problem [1]. The shortest-path algorithm has been used to find the strongest association paths between two or more crime entities [27]. Another tool that employs the shortest-path algorithm is the Link Discovery Tool [19]. It is able to search for association paths between two individuals who on the surface appear to be unrelated.

In summary, prior work related to link analysis has proposed some approaches to addressing the challenges. However, link analysis remains a difficult problem for crime investigators facing large volumes of data. In the next section we present the system design of our CrimeLink Explorer, which addresses the three challenges of link analysis.
3 System Design

We designed and implemented CrimeLink Explorer for automated link analysis. The system contained a set of crime incident data originating from the Tucson Police Department (TPD) Records Management System. The concept space approach was used to identify and extract associations between all criminals in the dataset based on co-occurrence weights. Alternatively, a number of heuristics captured expert knowledge for identifying criminal associations and determining the importance of associations for investigation. To facilitate the search for the strongest association paths between individuals of interest, we implemented Dijkstra's shortest-path algorithm with logarithmic transformations on association weights (co-occurrence weights or heuristic weights). A graphical user interface was provided to allow users to input names of interest and visualize association paths found based either on the concept space approach or on the heuristic approach.

3.1 Crime Incident Reports

Law enforcement databases usually store crime incident reports, which are a rich source of data about both criminal and non-criminal incidents over extended time periods. Incident reports may document serious crimes such as homicides or trivial incidents such as suspicious activity calls or neighbor disputes. A trivial incident may provide important information about associations that can later be used to solve serious crimes. Individuals involved in criminal activities may have repeated contacts with police, resulting in their presence in multiple incident reports.
Fig. 1. CrimeLink Explorer system architecture (components: crime incident reports; concept space; heuristics based on crime types, shared address, and shared phone; co-occurrence weights; heuristic weights; association path search using the shortest-path algorithm; graphical user interface)
All crime incidents are classified into different types (e.g., Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, etc.), usually based on the Uniform Crime Reporting (UCR) standard, which has been the national standard for case classification and crime reporting since 1930 [22]. The successor to UCR, the National Incident-Based Reporting System [9], has not yet been universally adopted by U.S. law enforcement agencies. Thus, the crime incident reports in this research are UCR based. These incident report records formed the source for automating link analysis in this research.

3.2 Concept Space Approach

We used the concept space approach to automatically identify and extract associations from crime incident reports. We treated each incident report as a document and each crime entity as a phrase. To reduce complexity, we focused only on associations between persons and did not consider possible associations between other types of entities such as locations and property. We then calculated the co-occurrence weights based on the frequency with which two persons appeared together in the same crime incidents. Ideally, the value of a co-occurrence weight not only implies the presence of an association between two persons but also indicates the importance of the association for uncovering investigative leads [17].

However, this approach has its limitations when used in link analysis. An example is a burglary investigation where the victim and the suspect appear together in the incident report but have never met and are not even casual acquaintances. Moreover, co-occurrence weights obtained by the concept space algorithm had been found to be of only minor assistance when subjected to user evaluation. In previous user studies, investigators tended to make judgments about the associations independent of the co-occurrence weights provided by the system. Crime investigators still faced the information overload problem because they had to make the final determination as to the importance of associations.
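For illustration, the person-to-person co-occurrence counting that underlies this step can be sketched in a few lines of Python. This is a simplification: the full concept space algorithm of Chen and Lynch also normalizes these raw counts, and the incident structure shown here is hypothetical.

```python
from itertools import combinations
from collections import defaultdict

def cooccurrence_weights(incident_persons):
    """Count how often each pair of persons appears in the same incident.

    incident_persons: dict mapping incident_id -> iterable of person names.
    Returns a dict mapping a sorted (person_a, person_b) tuple -> co-occurrence count.
    """
    weights = defaultdict(int)
    for persons in incident_persons.values():
        for a, b in combinations(sorted(set(persons)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical incidents, each listing the persons named in the report.
incidents = {
    "I-001": ["SMITH J", "DOE R"],
    "I-002": ["SMITH J", "DOE R", "LEE K"],
    "I-003": ["DOE R", "LEE K"],
}
print(cooccurrence_weights(incidents))
# {('DOE R', 'SMITH J'): 2, ('DOE R', 'LEE K'): 2, ('LEE K', 'SMITH J'): 1}
```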
In the next section we discuss the heuristic approach as an alternative to the concept space approach.

3.3 Heuristic Approach

We collected heuristics that domain experts often use when analyzing crime data to make judgments about the strength of associations between people. We interviewed several crime analysts and detective sergeants at the TPD. Three criteria were identified as the most important heuristics: (a) the relationship between crime type and person roles, (b) shared addresses or telephone numbers, and (c) repeated co-occurrence in incident reports. Rather than employing expert systems or Bayesian network approaches to incorporating expert knowledge, we represented the heuristics collected on a 1-100 percentage scale indicating the strength of associations, ranging from weak to strong. A weak association, such as the relationship between a victim and a suspect in a burglary incident, was assigned a value of 1, and a strong association, such as a person and his close friend and criminal associate who have been arrested together repeatedly, was assigned a value near 100.

Crime-Type and Person-Role Relationships. The crime investigators we interviewed specialized in the investigation of one or more types of crime: Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, Child Sexual Abuse, Domestic Violence, and many others. Person roles used in the TPD dataset included: Victim, Witness, Suspect, Arrestee, and Other. We constructed a matrix and assigned scores to role combinations in each of the crime types. All of the crime investigators agreed that most co-arrestees or suspects in an incident had a strong association. Other role combinations, however, varied considerably depending on the type of crime. The score for a specific role combination was based on an estimate of how often a true association occurs for that role combination and crime type out of every 100 incidents. For instance, the homicide detective sergeant estimated that at least 98 out of 100 homicide incidents included a victim and a suspect who were acquaintances. Thus, the corresponding score for the victim-suspect combination for homicide crimes in the heuristic matrix was set to 98.

This method of assigning heuristic scores was somewhat arbitrary and could be enhanced by including a statistical analysis of the crime-type/person-role relationship. However, capturing such statistics by manually reading a large number of incident reports from each crime type would be time prohibitive. We therefore relied on domain experts' estimation based on their past experience rather than statistical analysis.

Although informative, heuristics based on the relationship between crime type and person role could not necessarily provide complete information about criminal associations. For instance, the association score between two arrestees in narcotics sale incidents was assigned 95. This accurately reflected the high likelihood that the two arrestees knew each other, but did not capture the fine gradient from acquaintances to
close friends. Shared telephone and address associations and repeated appearances together in incidents could provide additional information to distinguish links from weak to strong. To allow a point spread for this additional information, the heuristic scores based on crime type and person role were reduced to account for 85% of the final heuristic weight.

Shared Address/Phone. Our domain experts stated that shared phone numbers and addresses were often important indicators of associations. We therefore assigned an additional score to an association when two persons shared a common phone number or address. Since phone number data were often subject to various errors in the TPD databases, they added only 5% of the final heuristic weight. Shared addresses added an additional 10% to the final heuristic weight since they were often more significant and less erroneous than phone number data.

Co-occurrence. In the absence of other information suggesting an association, the fact that two persons appeared together in multiple incidents might imply a strong relationship. This was the same rationale behind the concept space approach. However, rather than using the co-occurrence weight, we estimated the strength of an association resulting from multiple co-occurrences in incidents based on an empirically derived probability distribution. We obtained the empirical distribution by analyzing a random sample of 40 incident reports of various crime types and counting the number of times each pair of persons co-occurred. We read the supporting narrative reports for each incident to determine whether an association was important. We found that the more times two persons appeared together, the more likely they were involved in family-related crimes. That is, a large number of co-occurrences between two persons implied a high likelihood of a close relationship. For example, of the 21 incidents (out of the 40 sampled) that contained persons who appeared together four times, 15 were domestic violence incidents, custodial interferences, or family fights, and six were court order enforcements or civil matters often related to domestic situations. Court orders and civil matters that were not family related overwhelmingly concerned persons who had some prior association. Based on our analysis, we constructed the probability distribution by assigning 1 to a single co-occurrence, indicating that it could be completely random with no other facts to support a stronger association. From two to three co-occurrences the probability increased rapidly. At four or more co-occurrences the probability exceeded 99%, so all pairs of subjects who co-occurred four or more times were given a probability of 100.

Table 1. Empirically derived probability distribution

Co-occurrence count    Association probability (%)
1                      1
2                      45
3                      98
≥4                     100
The final heuristic weight for a specific association was calculated as the maximum of (i) the weighted sum of the crime-type/person-role score, the shared phone score, and the shared address score and (ii) the association probability based on co-occurrence counts:

final weight = MAX(0.85 × (crime-type/person-role score) + 0.05 × (shared phone score) + 0.10 × (shared address score), 1.00 × (association probability based on co-occurrence counts)).

3.4 Association Path Search

For this system, we used Dijkstra's shortest-path algorithm [7] to address the search complexity problem. A logarithmic transformation was applied to the association weights because conventional shortest-path algorithms cannot be used directly to identify the strongest association between a pair of persons [27]. With this transformation, a user can find the strongest association paths among two or more persons of interest.

3.5 User Interface

A graphical user interface was implemented to allow a user to interact with the system. Figure 2 shows the user interface after the user has conducted a search for a path between three persons. Names are scrubbed for data confidentiality. The user entered the names of interest in the text field and then pressed the "Show Associations" button. The system conducted the shortest-path search based on either the co-occurrence weights or the heuristic weights, depending on the user's choice. The user could then double-click on any node to see additional information (sex, date of birth, and Social Security number) about the person represented. The user could also double-click on a link to see information about the origin of the link, shared phone numbers or addresses, the weights from the concept space approach or from the heuristics, and the descriptions of incidents in which the two persons were involved.
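Before turning to the evaluation, the following Python sketch ties together the heuristic weighting of Section 3.3 and the log-transformed shortest-path search of Section 3.4. It is an illustration under assumed data structures, not the CrimeLink Explorer implementation; the names, scores, and example graph are hypothetical.

```python
import heapq
import math

COOCCUR_PROB = {1: 1, 2: 45, 3: 98}  # Table 1; four or more co-occurrences -> 100

def heuristic_weight(role_score, shared_phone, shared_address, cooccurrences):
    """Final association strength on a 1-100 scale (Section 3.3)."""
    weighted_sum = 0.85 * role_score + 0.05 * shared_phone + 0.10 * shared_address
    cooccur = 100 if cooccurrences >= 4 else COOCCUR_PROB.get(cooccurrences, 0)
    return max(weighted_sum, cooccur)

def strongest_path(graph, source, target):
    """Dijkstra over -log(strength/100): the shortest transformed path is the
    strongest association path (Section 3.4)."""
    # Build an undirected adjacency list with transformed edge costs.
    adj = {}
    for (a, b), strength in graph.items():
        cost = -math.log(strength / 100.0)
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, math.inf):
                dist[neighbor], prev[neighbor] = nd, node
                heapq.heappush(heap, (nd, neighbor))
    # Reconstruct the path from the target back to the source.
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# Hypothetical association graph; edge values are heuristic weights (1-100).
graph = {
    ("SMITH J", "DOE R"): heuristic_weight(98, 0, 100, 2),
    ("DOE R", "LEE K"): heuristic_weight(10, 100, 0, 1),
    ("SMITH J", "LEE K"): heuristic_weight(1, 0, 0, 1),
}
print(strongest_path(graph, "SMITH J", "LEE K"))  # ['SMITH J', 'DOE R', 'LEE K']
```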
4 System Evaluation

We conducted a user study at the TPD to evaluate our system's performance. We wanted to find out whether the automated link analysis approaches we proposed (the concept space approach, the heuristic approach, and the shortest-path algorithm) help address the information overload and search complexity problems, and whether domain knowledge helps identify associations between crime entities more accurately than the concept space approach. We extracted approximately 20 months of incident reports from the TPD database. The resulting dataset contained 239,780 incident reports in which 229,938 persons were involved. Information about those persons, such as age, gender, race, address, and phone number, was also extracted.
Fig. 2. CrimeLink Explorer user interface
Ten crime analysts and criminal intelligence officers at the TPD participated in the study. Several subjects were very experienced in link analysis. Each subject was asked to perform three tasks using CrimeLink Explorer and COPLINK Detect (a "single-level" link analysis tool that finds only crime entities directly associated with a given entity): (a) use COPLINK Detect to find the strongest association paths among three given person names, (b) use the concept space approach provided by CrimeLink Explorer to find the strongest association paths among three given persons, and (c) use the heuristic approach provided by CrimeLink Explorer to find the strongest association paths among three given persons. The name sets used in the tasks were different but equally difficult. We summarize the results as follows:

Subjects could conduct link analysis more efficiently using CrimeLink Explorer than using COPLINK Detect. Because COPLINK Detect did not facilitate the search for association paths between crime entities that were indirectly connected, subjects had to expand links manually to find possible criminal associations. CrimeLink Explorer, in contrast, provided the functionality of searching for the strongest association paths between crime entities across multiple levels. Most subjects were able to find direct associations of the three given names using COPLINK Detect, but could not keep track of all the associations that could be generated as they traversed to the second and third levels of the search. They said it would take them hours or possibly more than a day to find the paths between the names. However, all subjects could quickly find association paths for tasks (b) and (c) using CrimeLink Explorer. This
result showed that the automated path search functionality, based on the shortest-path algorithm, significantly increased the efficiency of link analysis.

Subjects believed that association paths found using the heuristic approach were more accurate than those found using the concept space approach. This was because the heuristics captured the domain knowledge crime investigators rely on to determine the importance of associations between crime entities. The heuristic weights included not only co-occurrence information but also person roles in different types of crimes, shared phones, and shared addresses. As some subjects commented, "That makes more sense, since it takes into account the kind of case."

Subjects were also asked to indicate how useful the system was as an investigative tool. All subjects gave positive feedback and expressed enthusiasm about the tool. Several subjects asked when they would be able to use the system for their daily work.

The results of the user study were quite encouraging. The automated link analysis approaches we proposed in this research can greatly reduce crime investigators' time and effort when conducting link analysis. Moreover, the domain knowledge incorporated in the system reflects human judgment about the strength of associations between criminals more accurately.
5 Conclusions and Future Work

Link analysis has faced challenges such as information overload, search complexity, and the reliance on domain knowledge. Several techniques were proposed in this paper for automated link analysis, including the concept space approach, the shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining the importance of associations. We implemented the proposed techniques in a system called CrimeLink Explorer. The system evaluation focused on the approaches' efficiency and accuracy, both of which are desirable features of a sophisticated link analysis system. The user study results demonstrated the potential of our approaches to achieve these features using domain-specific heuristics.

Rather than using estimated heuristic weights, we plan in the future to apply statistical analysis to NIBRS (National Incident-Based Reporting System) data [9], which captures specific information about the nature of associations between individuals involved in an incident, to validate the weights in the heuristic table. The heuristics can also be extended to include common vehicle and common organization associations. We also plan to encode expert knowledge in Bayesian networks and incrementally learn new knowledge from crime data. Variables in such a Bayesian network may specify whether two persons were family members, were good friends, or went to the same school. Other variables may represent the likelihood of these pieces of information being important to uncovering investigative leads. Links between these variables can indicate the dependency relationships.
Acknowledgement. This project has primarily been funded by the National Science Foundation (NSF), Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July, 2000-June, 2003 and the NSF Knowledge Discovery and Dissemination (KDD) Initiative. We appreciate the critical and important comments, suggestions, and assistance from Detective Tim Petersen and other personnel from the Tucson Police Department.
References

1. Ali, M., Kamoun, F.: Neural networks for shortest path computation and routing in computer networks. IEEE Transactions on Neural Networks, Vol. 4, No. 5. (1993) 941–953.
2. Anderson, T., Arbetter, L., Benawides, A., Longmore-Etheridge, A.: Security works. Security Management, Vol. 38, No. 17. (1994) 17–20.
3. Blair, D. C., Maron, M. E.: An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, Vol. 28, No. 3. (1985) 289–299.
4. Chen, H., Lynch, K. J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5. (1992) 885–902.
5. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, Vol. 46, No. 1. (2003) 28–34.
6. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning, Vol. 9. (1992) 309–347.
7. Dijkstra, E.: A note on two problems in connection with graphs. Numerische Mathematik, Vol. 1. (1959) 269–271.
8. Evans, J., Minieka, E.: Optimization Algorithms for Networks and Graphs, 2nd edn. Marcel Dekker, New York (1992).
9. Federal Bureau of Investigation: Uniform Crime Reporting Handbook: National Incident-Based Reporting System (NIBRS). Edition NCJ 152368. U.S. Department of Justice, Federal Bureau of Investigation (1992).
10. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB00) (2000).
11. Fox, M. S., Smith, S. F.: ISIS: A knowledge-based system for factory scheduling. Expert Systems, Vol. 1, No. 1. (1984).
12. Goldberg, H. G., Senator, T. E.: Restructuring databases for knowledge discovery by consolidation and link formation. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
13. Goldberg, H. G., Wong, R. W. H.: Restructuring transactional data for link analysis in the FinCEN AI system. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
14. Goyal, S. K., et al.: COMPASS: An expert system for telephone switch maintenance. Expert Systems, July 1985.
15. Grady, N. W., Tufano, D. R., Flanery, R. E., Jr.: Immersive visualization for link analysis. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
16. Harper, W. R., Harris, D. H.: The application of link analysis to police intelligence. Human Factors, Vol. 17, No. 2. (1975) 157–164.
17. Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., Chen, H.: Using Coplink to analyze criminal-justice data. IEEE Computer, Vol. 35. (2002) 30–37.
18. Heckerman, D.: A tutorial on learning with Bayesian networks. Microsoft Research Report, MSR-TR-95-06 (1995).
19. Horn, R. D., Birdwell, J. D., Leedy, L. W.: Link discovery tool. In: Proceedings of the Counterdrug Technology Assessment Center's ONDCP/CTAC International Symposium, Chicago, IL (1997).
20. Lee, R.: Automatic information extraction from documents: A tool for intelligence and law enforcement analysts. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
21. McAndrew, D.: The structural analysis of criminal networks. In: Canter, D., Alison, L. (eds.), The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III, Aldershot, Dartmouth (1999).
22. National Archive of Criminal Justice Data: Uniform Crime Reporting Program Data [United States] Series. http://www.icpsr.umich.edu:8080/NACJD-SERIES/00057.xml
23. Shortliffe, E. H.: Computer-Based Medical Consultations: MYCIN. Elsevier, North-Holland (1976).
24. Sparrow, M. K.: The application of network analysis to criminal intelligence: an assessment of the prospects. Social Networks, Vol. 13. (1991) 251–274.
25. Sarkar, S., Sriram, R. S.: Bayesian models for early warning of bank failures. Management Science, Vol. 47, No. 11. (2001) 1457–1475.
26. Turban, E.: Review of expert systems technology. IEEE Transactions on Engineering Management, Vol. 35, No. 2. (1988) 71–81.
27. Xu, J., Chen, H.: Using shortest-path algorithms to identify criminal associations. In: Proceedings of the National Conference for Digital Government Research (dg.o 2002), Los Angeles, CA (2002).
A Spatio Temporal Visualizer for Law Enforcement

Ty Buetow¹, Luis Chaboya¹, Christopher O'Toole¹, Tom Cushna¹, Damien Daspit¹, Tim Petersen², Homa Atabakhsh¹, and Hsinchun Chen¹

¹ University of Arizona, MIS Department, AI Lab
{tbuetow, chaboyal, otoolec}@cs.arizona.edu
{tcushna, damien, homa, hchen}@bpa.arizona.edu
² Tucson Police Department, Tucson, Arizona 85701
[email protected]
Abstract. Analysis of crime data has long been a labor-intensive effort. Crime analysts are required to query numerous databases and sort through results manually. To alleviate this, we have integrated three different visualization techniques into one application called the Spatio Temporal Visualizer (STV). STV includes three views: a timeline; a periodic display; and a Geographic Information System (GIS). This allows for the dynamic exploration of criminal data and provides a visualization tool for our ongoing COPLINK project. This paper describes STV, its various components, and some of the lessons learned through interviews with target users at the Tucson Police Department.
1 Introduction

Information visualization techniques have proven useful for presenting large amounts of data. In the law enforcement domain specifically, visualization techniques can be very helpful for tasks such as crime investigation, as well as for presenting findings to supervisors and even in court. Law enforcement agencies currently use a combination of technological and manual techniques for crime analysis. However, these methods are very time consuming. We have developed the Spatio Temporal Visualizer (STV) to assist crime analysts in their search for information and in presenting their results.

To visualize the data needed by crime analysts, we use three types of visualization techniques: a periodic view, a timeline view, and a GIS view. Each technique has its own strength: periodic visualization displays patterns with respect to time; timeline visualization displays characteristics of temporal data in a linear manner; and GIS visualization displays information on a map and allows for spatial analysis of the data. We combine these techniques into one tool so that the same data can be examined from three different views simultaneously.

In this paper we present the motivation behind STV, followed by a literature review of relevant visualization techniques. Next, we demonstrate how STV provides dynamic access to data and presents three different views. We illustrate STV's functionality with an example of how it would be used by a crime analyst or police officer. Finally, we discuss some of the lessons learned after interviews and discussions with
potential users from the Tucson Police Department (TPD), and conclude with some future directions.
2 Background and Motivation

Historically, law enforcement agencies have attempted to maintain records of criminal events to solve crimes, aid prosecution, document responses, detect serial crimes, and identify trends. Solving a crime often depends on identifying characteristics of the incident and then matching those characteristics to a known criminal or to suspects whose past actions, motives, or opportunities most closely correlate to the incident at hand. This matching process can occur in an individual officer's memory or in a multimillion-record database. The efficiency of the individual officer is adversely affected when the amount of information exceeds his memory capacity or his ability to process that information. In addition, the usefulness of a large database is dependent on the ability to display appropriate and adequate information in a manner that can be efficiently utilized by an investigator.

In the past, crime analysts have dealt with some of these issues through the use of pin maps, graphs, timelines, and summarizations. All of these tend to be somewhat subjective and dependent upon the ability and understanding of the analyst doing the preparation and the quality of the data being analyzed.

For a better understanding of the problem, imagine a situation in which an analyst is tasked with enlightening a group of police managers on the state of burglaries in a city. He would first need to decide how to approach this task, whether by comparing the number of incidents over several years, from year to year, or from month to month. He would also need to decide whether to analyze the occurrence of these crimes between areas of the city, or by time of day, day of week, type of victim, or any other factors or combination of factors. He would extract data for the period (or periods) he considers appropriate and then, through the use of various tools, construct graphs, charts, and maps to depict the information in the manner he chooses. An undertaking of this nature often takes several days for an experienced analyst to complete.

The problem is that the information the analyst chooses to survey is quite dependent upon the training and experience of the analyst, or perhaps the input of his immediate supervisor. Considering the time and effort needed to compile the project, if the group of police managers had concerns or questions different from those which the analyst chose to address, a second or third separate project would be required. The current state of the art in crime analysis is hindered by limited objectivity and a lack of tools that allow for dynamic review. STV aims to remedy this analysis deficiency by providing an easy, dynamic workspace.
3 Literature Review

Research in the areas of the three views implemented in STV has been done extensively and has been applied to various application domains. In the area of crime analysis, GIS software allowing users to view crimes on a map is quite common. There are few tools that allow users to view law enforcement related data in a
temporal context or in a periodic pattern. Analysis of these techniques shows that they are largely segregated and miss the synergy that is created when multiple views of the same data can be seen simultaneously. To the best of our knowledge, there are currently few tools that harness the power to examine a single data set from multiple perspectives. As will be described in Section 4, STV incorporates three different views of the same data set into one tool.

3.1 Periodic Data Visualization Tools

Common methods for viewing periodic data include sequence charts, point charts, bar charts, line graphs, and spiral graphs, which can all be displayed in 2D or 3D [7, 16]. We use the spiral graph method in STV due to its ability to visualize periodic patterns better than the other methods. The Spiral Graph [17] developed at the Technical University of Darmstadt is an excellent example in which the spiral method of visualization was used. Using the Spiral Graph, different kinds of periodic information can be visualized [17]. The main method of mapping data to the Spiral Graph relies upon the thickness of lines to represent the amount of data and different colors to represent different types of data. The University of Minnesota has also developed different implementations of the spiral method of visualization using the spiral of Archimedes [1]. These provide good examples of how data can be mapped in different ways: for instance, both 2D and 3D spiral graphs in which the thickness of dots along the spokes of a spiral represents the amount of data [3]. The advantage of a 3D representation is that several data sets can be shown simultaneously. However, a 3D representation can become confusing and make it difficult to see a developing pattern.

The spiral method that STV most closely resembles is the ReCAP implementation known as the Time Chart [2]. The disadvantage of the Time Chart is that it only plots data in monthly, 24-hour, or 7-day time periods. Therefore, the user does not have the ability to see yearly patterns. In addition, using the Time Chart the user is unable to see how many incidents took place in a certain time period. As will be discussed in Section 4.2, STV's periodic pattern view overcomes these shortcomings.

3.2 Timeline Tools

A timeline is a linear or graphical representation of a sequence of events. In general, timelines are a temporal ordering of a subject of interest. Events, entities, or topics of interest are displayed along an axis. Many projects have explored visualization through timeline techniques, and the desire to visualize time relationships and patterns in data has been an ongoing area of research. One issue addressed in visualization is the desire to see the big picture and to be able to drill down to examine events in detail. In this regard, Snap [6] attempts to increase the total amount of data that can be displayed by placing a large number of entities into a single "aggregate". This new collection can then be displayed for summary information or drilled down into for closer inspection.

Lifelines [14] displays legal or medical data to professionals in those fields. Here, the goal is the visualization of a patient or case history, allowing users access to data
from one screen. In addition, this project aims to enhance anomaly and trend spotting and to streamline access to data. In an attempt to give timelines more relational querying power, Hibino [8] developed Multimedia Visual Information Seeking, which allows users to interactively select two subsets of events and dynamically query for temporal relationships. In short, this allows a user to ask, "How often is event type A followed by event type B?" Others, such as Kullberg [10], attempted to reinvent the 2D timeline in three dimensions. Holly [9] has proposed timelines to view program hotspots during execution. In a more general approach, Kumar [11] developed the ITER model as the basis for developing timeline applications. All of these applications offer different temporal views of their respective data sets. In addition, with the proliferation of the Internet, many forms of informal timelines have appeared, many of which communicate personal histories and the like. Several private companies also offer timeline tools for analysis in various professional fields. Although there are many existing timeline tools, to the best of our knowledge very few incorporate a timeline view simultaneously with other views of the same data set, as we have implemented in STV.

3.3 Crime Mapping Tools

The use of Geographic Information Systems (GIS) in law enforcement applications is becoming increasingly important in supporting crime analysts' capabilities. This field is split between two main areas: finding better ways to display the data available and finding better ways to mine the data to help crime analysts save time. One tool might be used to mine data and another tool to display the information gleaned from the data mining. The crime analyst would still have to manually run the data mining program and then manually move the data into GIS software for display. This process can be painfully slow.

One tool that combines these two areas is the Regional Crime Analysis Program (ReCAP) developed by Dr. Brown at the University of Virginia [2]. Brown observed that current systems have three main shortcomings: they did not allow the user to run a spatial query to obtain the data set in which the user is interested, they did not automate the process of analysis through data mining, and they required users to be proficient in GIS and mapping technologies. ReCAP was developed to address these three shortcomings.

A tool that deals mainly with data mining for GIS is the CrimeStat Spatial Statistics Program developed by Ned Levine & Associates [12]. This tool has an impressive number of data mining options available, including spatial distribution analysis, distance analysis, hot spot analysis, interpolation (kernel density estimation), and space-time analysis (Knox and Mantel) tools. The user must manually import the data into these tools, and the analyzed data can then be saved for later use. There are many examples of how other organizations have created tools to display the data they have mined using CrimeStat [12].

Two commercial tools that are popular in law enforcement for viewing crime data on a map are ArcView, developed by ESRI, and MapInfo, developed by the MapInfo Corporation [5, 13]. These tools allow the user to import data from various file types and even perform sophisticated database operations on the imported data.
Their popularity has the advantage that many people in the industry are already familiar with them.
4 Features of STV

STV is a data visualization tool built on top of our ongoing COPLINK project [4]. COPLINK provides one-stop data access and search capabilities through an easy-to-use interface for local law enforcement agencies such as the Tucson Police Department (TPD). STV is intended to take COPLINK one step further by providing an interactive environment where analysts can load, save, and print police data in a dynamic fashion for exploration and dissemination. For instance, an analyst can search all robberies that have taken place over the past two years and visualize them. In addition, the analyst may wish to visualize all drug arrests simultaneously with the robberies and see if there is any correlation between the two.

4.1 Technologies Used

STV is built as a Java applet in a modular fashion. This was done with the intent that other types of views could be added in the future with relatively little work by taking advantage of object-oriented inheritance. One key advantage of an applet is that no software needs to be installed or maintained on analysts' machines. Queries are performed using applet-to-servlet communication to connect to an Oracle database. Results are stored by a controller class and accessed by each STV view. On the back end, JDBC is used to connect to the COPLINK database. One addition specifically required by the STV project was an area to save user preferences and past queries specific to each of the views. Although this information is saved in the same database, it is independent of the COPLINK schema. This addition saves police officers valuable time by storing the search information they gather in the application's database.

4.2 Components

STV overcomes some of the disadvantages of existing crime visualization tools by providing three perspectives on the same data. The details of each view are described in the following sections. In addition, two screenshots of STV, in Figures 1 and 2, illustrate its functionality by displaying an example of bank robbery data from 1996-2002, described in Section 5.

Control Panel. The control panel (figure 1.c) maintains central control over temporal aspects of the data.
• The time-slider controls the range of time viewed. Thus, the data may span six years, but the time-slider may be narrowed to focus on one year or one month. This time window into the data may then be moved like a typical slider to incorporate
new data points and exclude others. This slider was inspired by Lifelines [14] and by Richter [15].
• Granularity, referring to the unit of time, is controlled through a drop-down menu. Currently, years, months, weeks, and days are implemented. Changing this option has the effect of re-labeling the timeline and altering the periodic patterns being examined.
• The overall time bounds are controlled through a series of drop-down menus. Thus, while all data points may lie in a particular time span, a user can narrow the focus to a subset of the data based on time bounds.

Periodic View. The main purpose of the periodic view (figure 1.d) is to give the crime analyst a quick and easy way to search for crime patterns.
• The circle represents time in the granularity the user chooses. For instance, it may represent a year, month, week, or day.
• Within the circle there are sectors which divide it into different time periods within the granularity selected. The analyst also has the ability to change the granularity of the sectors. For example, the circle could be set to year granularity and the sectors could be set to represent months, weeks, or even days. The advantage of this is that the analyst may see different patterns developing over the different time periods.
• Sectors are labeled to indicate their specific time interval.
• Data is represented by spikes within each time period.
• Rings with labels inside the circle represent the quantity of data.
• Using the box plot method, a crime analyst can easily determine whether any spikes are outliers.

Timeline View. The timeline view (figure 1.a) is a 2D timeline with a hierarchical display of the data in the form of a tree.
• A specific time instant may be highlighted. When combined with the current granularity, all points in that time period are highlighted. For example, if the granularity is month and a point in June 1999 is selected, all data in June 1999 are highlighted.
• The tree view and timeline view of the data are coordinated such that expanding a node in the tree expands the data points viewed on the timeline. At the same time, data under a particular node in the tree is summarized in the timeline at that node's corresponding y-coordinate.
• The time-slider controls the current timeframe viewed, allowing the user to slide across the timeline at various levels of detail.
• The tree view allows the user to see the data in a traditional, organized way.
GIS View. The GIS view (figure 1.b) displays a map of the city of Tucson on which incidents can be represented as points of a specific color.
• The user can zoom in and out of the map. Zooming in allows for more streets to be displayed.
• Incidents may be selected by dragging a box around points on the map. This narrows the information displayed by all views, focusing on the selected incidents.
• The user can move backward and forward in the zoom history, similar to an Internet browser.
• The GIS view emphasizes data points within the time period specified by the time-slider. Data points outside this period are faded.
• Data points highlighted in the timeline view are highlighted in the GIS view.
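The coordination just described, in which a controller class holds the query results and a time window shared by the timeline, periodic, and GIS views, can be pictured with a minimal Java sketch. The class and method names (StvController, StvView, setTimeWindow) are illustrative assumptions, not the actual STV source code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical incident record returned by a query (field names are assumptions).
class Incident {
    final long timestamp;   // time of the incident, in milliseconds
    final double x, y;      // geographic coordinates for the GIS view
    final String type;      // e.g., "ROBBERY"
    Incident(long timestamp, double x, double y, String type) {
        this.timestamp = timestamp; this.x = x; this.y = y; this.type = type;
    }
}

// Each STV view (timeline, periodic, GIS) would implement this interface.
interface StvView {
    void refresh(List<Incident> visible);
}

// Controller that stores query results and pushes the current time window
// to every registered view, mimicking the time-slider behavior.
class StvController {
    private final List<Incident> results = new ArrayList<>();
    private final List<StvView> views = new ArrayList<>();
    private long windowStart, windowEnd;

    void register(StvView view) { views.add(view); }

    void setResults(List<Incident> queryResults) {
        results.clear();
        results.addAll(queryResults);
        notifyViews();
    }

    // Called when the user drags the time-slider in the control panel.
    void setTimeWindow(long start, long end) {
        this.windowStart = start;
        this.windowEnd = end;
        notifyViews();
    }

    private void notifyViews() {
        List<Incident> visible = new ArrayList<>();
        for (Incident i : results) {
            if (i.timestamp >= windowStart && i.timestamp <= windowEnd) {
                visible.add(i);
            }
        }
        for (StvView v : views) {
            v.refresh(visible);
        }
    }
}
```

In this sketch, dragging the time-slider simply re-filters the stored results and asks each registered view to redraw, which is one straightforward way to keep the three perspectives synchronized.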
Fig. 1. STV. In this case, bank robberies for the last six years are displayed in the timeline, GIS and periodic views. From here, users may narrow focus through granularities and time bounds as well as geographic parameters
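Before turning to the example, the applet-to-servlet query path described in Section 4.1 can be sketched roughly as follows. The servlet name, table and column names, and connection details below are assumptions made for illustration; they do not reproduce the COPLINK schema or the deployed code.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that the STV applet calls to run an incident query.
public class StvQueryServlet extends HttpServlet {

    private static final String DB_URL = "jdbc:oracle:thin:@dbhost:1521:coplink"; // assumed
    private static final String DB_USER = "stv_user";      // assumed
    private static final String DB_PASS = "stv_password";  // assumed

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String crimeType = req.getParameter("crimeType"); // e.g., "ROBBERY"
        try (Connection conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 // "incident" is a placeholder for whatever table is actually queried.
                 "SELECT incident_id, incident_time, latitude, longitude "
                 + "FROM incident WHERE crime_type = ?")) {
            ps.setString(1, crimeType);
            resp.setContentType("text/plain");
            PrintWriter out = resp.getWriter();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // One line per incident; the applet parses these into objects.
                    out.println(rs.getLong(1) + "," + rs.getTimestamp(2)
                            + "," + rs.getDouble(3) + "," + rs.getDouble(4));
                }
            }
        } catch (SQLException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```

The applet would open a URL connection to such a servlet and parse the returned rows; saved searches and user preferences would be written to separate tables that sit alongside, but do not modify, the COPLINK schema.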
5 A Crime Analysis Example
To illustrate STV functionality, we explore a hypothetical scenario in which a police officer has been assigned the task of examining bank robbery data. The officer begins by logging into COPLINK as described in figures 1 and 2. He performs a search for bank robberies in Tucson and selects the results he is interested in. STV starts by visualizing the 280 bank robberies selected. The officer looks for trends using the three views. Upon expanding the periodic view, he notices that October through December are the peak months for bank robberies in Tucson. Deciding to compare this trend with the previous year, he narrows the data being viewed by entering September 1, 2001 as a start date and December 31, 2001 as an end date (figure 3).
Fig. 2. Functionality. Views may be moved to provide better focus or because of user preference. Here, the GIS view is centered and a geographic query is performed. The data set is narrowed to those incidents selected by the user, with corresponding updates in the other tools. In the timeline view, points within the geo-search are emphasized, while other points are faded. The periodic view displays summary data on the selected points, indicating that June, April, November, and December have a higher incidence of bank robberies. The control panel allows focus on a specific period of time within the global time frame selected. Granularity (viewing in terms of days, weeks, months, years) and global time bounds may also be altered
At this point, the data has been narrowed to 31 bank robberies. By looking at the timeline view, the officer sees three gaps in bank robbery occurrences (figure 4). He notices that at the beginning of September and October, no bank robberies occurred. More striking is the fact that after approximately Thanksgiving, only two robberies occurred. The officer decides to examine geographic aspects of the data to see if further trends are apparent (figure 5). He notices a cluster of robberies on the Northwest side of town. Zooming in, he sees that the area north of Broadway Avenue is where the vast majority of bank robberies occurred during the selected time interval, with some locations being robbed multiple times in four months. Additionally, an area around the intersection of Euclid Avenue and Grant Road appears to be the center of a concentration of activity. The officer selects points on the Northwest side of town by dragging a box around them to see if other trends become apparent. He then moves the periodic view to the center, bringing several trends to light. None of the 17 robberies occurring in this geographic region during the four-month period took place within the first week of a month, while the third week of the month was the most frequently hit. In addition, the periodic tool reveals that more robberies occur on Fridays than on other days of the week (figure 6).
Returning to the timeline view, he notices that several robberies have occurred on the same day. The officer highlights November 15. This automatically highlights the robberies on the geographic view as well (figure 7). In addition, this helps the officer realize that two days earlier, two other banks were robbed in this same area. For a police officer or crime analyst, many questions arise. Why the sudden disappearance of robberies after Thanksgiving? Why was the first week of each month devoid of robberies? Why were so many banks hit in the same area at the same time? A crime analyst could use STV for further queries, for example concerning arrests that occurred immediately after these robberies. Although further queries and exploration may be necessary, points of interest were discovered. It may now be advisable to increase patrols in the areas where increased incidents of bank robbery occurred, particularly during the time periods that became apparent. By cutting and slicing the data and zooming in and out, several trends were revealed in less than 20 minutes of data manipulation.
Fig. 3. The periodic view displaying bank robberies for each month from 1996-2002. The period from October to December has more events than other months
6 Lessons Learned
Although the STV tool has not yet been deployed at TPD, we have received feedback on the tool from ten TPD crime analysts and a seasoned detective. Assessment by these sources is important because detectives and crime analysts will be the primary users of STV. Comments made by the detective and analysts throughout the initial development are summarized below.
Fig. 4. Robberies from September 1, 2001 to December 31, 2001
Fig. 5. Selecting points in the GIS view narrows focus
Fig. 6. The periodic view reveals week-per-month and day-per-week trends
Fig. 7. Highlights in the timeline view appear automatically in the GIS view
6.1 Current Strengths of STV
From our first meeting with analysts, the options to load, save, and print projects were expressed as high priorities. Once implemented, projects no longer needed to be recreated each time a user logged onto COPLINK. Similarly, the ability to produce a hard copy of information is often very desirable. These functions enable users to incorporate STV into their analysis more easily.
Potential users of STV at the TPD have indicated that the ability to expand and constrict the data being displayed is important in searching for different crime patterns. For instance, an analyst may begin with a large number of incidents being displayed and then narrow them down to relevant incidents, or vice versa. They feel that the STV tool does this quickly and efficiently by means of the control panel and the GIS view.
The STV tool will also allow police managers, with the help of analysts, to discuss ongoing problems and trends. For example, TPD has a meeting known as the Targeted Operational Planning Meeting (TOP) in which Police Chiefs and other managers analyze problems and address them. Having STV available during this brainstorming session would allow these TPD officials to view additional crime trends that may not have been considered. The analysts indicated this as an important strength because the Police Chiefs and managers often want to see different aspects of crime trends "on the fly".
A final strength that cannot be overestimated is STV's ability to abstract away the tedious details of database searches and displays. Computers are excellent at these types of processes. By shifting an analyst's focus from a low level of computer interaction to the much higher level of patterns, causes, and effects of crime, STV increases the efficiency of analysis.
6.2 Areas of Improvement for STV
While most of the feedback we received from TPD was favorable, users have indicated certain areas of potential improvement for STV. The biggest concern is the limited customization that the tool currently supports. For instance, crime analysts may wish to attach a note to an incident that is being visualized. They may also wish to add events to the data set that are not present in the databases.
A second area of concern is the customization of colors and shapes. For example, officers may want all robberies displayed as a green triangle and all homicides displayed as a red circle. The size of data points was also expressed as a concern.
A problem common to virtually all visualization techniques is that of labeling. Analysts recommended a variety of labels for data points, from standard text labels to balloon labels that appear on mouse hover. The size and content of labels were also of interest.
Crime analysts have also expressed interest in having STV communicate with COPLINK Connect/Detect [4], which has already been deployed at TPD. For instance, if a group of incidents such as robberies is visualized, an analyst may wish to select a particular incident and see the corresponding information from COPLINK Connect/Detect displayed.
Finally, STV lacks automatic analysis functionality. This means that users cannot click a button and have an algorithm applied to their data set to solve a problem. Features such as hot spot algorithms, which determine clusters of activity, or algorithms that detect anomalies in data sets are currently not present but would be desirable.
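As an illustration of the kind of hot spot feature requested, the sketch below counts incidents in a coarse geographic grid and flags cells whose counts exceed a threshold. This is one simple heuristic assumed for illustration only; it is not a feature of STV or COPLINK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simple grid-density hot-spot detector over (x, y) incident coordinates.
class HotSpotFinder {
    // Returns the grid cells containing more than `threshold` incidents.
    static List<String> findHotSpots(List<double[]> points, double cellSize, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            long cx = (long) Math.floor(p[0] / cellSize);
            long cy = (long) Math.floor(p[1] / cellSize);
            counts.merge(cx + "," + cy, 1, Integer::sum);
        }
        List<String> hotCells = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                hotCells.add(e.getKey());
            }
        }
        return hotCells;
    }
}
```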
7 Conclusions and Future Directions
The STV tool is scheduled to begin user studies at the TPD in March 2003. The plan is to have crime analysts use the STV tool in their daily activities in order to discover other strengths and areas for improvement. The experiences of crime analysts will provide valuable insights into future directions for the STV project. The ability provided in STV to synchronize three different views for visualizing crime-related data gives law enforcement an advantage in crime analysis. This, combined with dynamic access to data and STV's user-friendly interface, presents advantages over traditional methods. As the veteran detective said, "This application has the potential to revolutionize the manner in which we examine crime trends and pursue criminals."
Acknowledgements. This project has primarily been funded by the following grants:
• NSF, Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003.
• National Institute of Justice, "COPLINK: Database Integration and Access for A Law Enforcement Intranet," #97-LB-VX-K023, July 1997-Jan. 2000.
• National Institute of Justice, "Distributed COPLINK Database and Concept Space Development," #0308-01, Jan. 2001-Dec. 2001.
• NSF, Information Technology Research, "Developing A Collaborative Information and Knowledge Management Infrastructure," NSF/IIS #0114011, Sept. 2001-Aug. 2004.
We would like to thank the following people for their support and assistance during the entire project development and evaluation process:
• All members of the University of Arizona Artificial Intelligence Lab and COPLINK staff,
• Lt. Jenny Schroeder, Dan Casey, and other contributing personnel from the Tucson Police Department,
• The Phoenix Police Department.
References
1. Archimedean spiral, http://www.2dcurves.com/spiral/spirala.html
2. Brown, D.E. (1998). "The Regional Crime Analysis Program (RECAP): A Framework for Mining Data to Catch Criminals," Proceedings of the 1998 IEEE International Conference on Systems, Man, and Cybernetics (San Diego, CA, USA, Oct. 11-14), IEEE, Piscataway, NJ, pp 2848-2853.
3. Carlis, J. (1998). "Interactive Visualization of Serial Periodic Data," Proceedings of User Interface Software and Technology.
4. Chen, H., D. Zeng, H. Atabakhsh, W. Wyzga & J. Schroeder (2003). "COPLINK: Managing Law Enforcement Data and Knowledge," Communications of the ACM, pp 28-34.
5. Environmental Systems Research Institute (ESRI), http://www.esri.com
6. Fredrikson, A., C. North, C. Plaisant & B. Shneiderman (1999). "Temporal, Geographical and Categorical Aggregations Viewed Through Coordinated Displays: A Case Study with Highway Incident Data," Human-Computer Interaction Laboratory Technical Report No. 99-31, December 1999, NPIVM, pp 26-34.
7. Harris, R. (1996). "Information Graphics – A Comprehensive Illustrated Reference," Management Graphics.
8. Hibino, S. & E.A. Rundensteiner (1998). "Comparing MMVIS to a Timeline for Temporal Trend Analysis of Video Data," Proceedings of Advanced Visual Interfaces.
9. Holly, M. (2001). "Temporal and Spatial Program Hot Spot Visualization," Technical Report SOCS-01.6.
10. Kullberg, R.L. (1996). "Dynamic Timelines: Visualizing Historical Information in Three Dimensions," Proceedings of CHI '96, pp 386-387.
11. Kumar, V. & R. Furuta (1998). "Metadata Visualization for Digital Libraries: Interactive Timeline Editing and Review," Proceedings of the Third ACM Conference on Digital Libraries, pp 126-133.
12. Levine, N. (2000). "CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident Locations (v 1.1)," http://www.icpsr.umich.edu/NACJD/crimestat.html
13. MapInfo, http://www.mapinfo.com
14. Plaisant, C., B. Milash, A. Rose, S. Widoff & B. Shneiderman (1996). "Lifelines: Visualizing Personal Histories," ACM CHI '96 Conference Proceedings, pp 221-227.
15. Richter, H., J. Brotherton, G.D. Abowd & K. Truong (1999). "A Multi-Scale Timeline Slider for Stream Visualization and Control," GVU Technical Report GIT-GVU-99-30.
16. Tufte, E. (1983). "The Visual Display of Quantitative Information," Graphics Press.
17. Weber, M., M. Alexa & W. Müller (2000). "Visualizing Time-Series on Spirals," Technical University of Darmstadt.
Tracking Hidden Groups Using Communications
Sudarshan S. Chawathe
Computer Science Department, University of Maryland, College Park, Maryland 20742, USA
[email protected]
Abstract. We address the problem of tracking a group of agents based on their communications over a network when the network devices used for communication (e.g., phones for telephony, IP addresses for the Internet) change continually. We present a system design and describe our work on its key modules. Our methods are based on detecting frequent patterns in graphs and on visual exploration of large amounts of raw and processed data using a zooming interface.
1 Introduction
Suppose a group of suspicious agents (henceforth, suspects) has been identified based on some a priori knowledge. Instead of taking immediate action to stop the suspicious activities, it is often prudent to carefully monitor the suspects and their communications in order to maximize the detection of suspects (expand the group) and uncover the nexus of activity (locate the key or controlling agents). Unfortunately, the suspects typically do not communicate using easily identifiable sources. For example, a ring of car thieves may continually change phone numbers (using prepaid cellular phones, short-term pager numbers, etc.). Similarly, globally dispersed agents planning a distributed denial-of-service attack on the cyber-infrastructure typically do not use the same IP address for very long. Such behavior makes it very difficult to accurately and efficiently track groups of suspects over extended periods of time. In this paper, we describe a strategy to solve this problem by using a combination of automated and human-directed techniques. We begin by describing the problem more precisely.
Problem Development. We will use the term agents to denote real-world entities (typically, humans) that we are interested in monitoring. However, these agents are not directly observable and their real-world identities are, in general, unknown. That is, we do not have any method to directly track the actions of the agents. Instead, all we can observe is the communications between such agents. The medium used for such communication may be a phone network, the Internet, physical mail, etc. We refer to it as the network in general. We will use the term nodes to denote the devices used to communicate using this network (e.g., phone numbers in a telephone network, IP addresses on the Internet). A key feature of nodes is that they are, by virtue of their connections to the network,
Fig. 1. The tracking problem
easily identifiable and observable. Agents use nodes to communicate on the network. (For example, people use phone numbers to communicate using the phone network, and IP addresses to communicate using the Internet.) A group of communicating suspects is called an s-group. Note that since suspects are, in general, not directly observable, neither are s-groups. At a given point in time, there is a group of nodes (in the communication network) corresponding to the agents in an s-group; we refer to this group of nodes as an n-group. In contrast with s-groups, n-groups are easily observable. For example, the group of phone numbers used by a ring of car thieves in the past few days forms an n-group. Over time, the n-group corresponding to a given s-group changes. For example, the ring of thieves is likely to be using a completely different set of phone numbers two months from now. The problem at hand is then the problem of tracking s-groups by observing only the n-groups. By observing an n-group, we mean tracking the communications between the nodes in the group. In this paper, we assume that the only information we can obtain from the communication network is a timestamped list of inter-node messages. We use the term messages in a general sense. In a phone network, a message is a phone call; on the Internet, a message may be a TCP connection. More precisely, monitoring the network yields a list of tuples of the form (n1, n2, t, A) indicating a message from n1 to n2 at time t. We use A to denote a list of additional attributes, which depend on the particulars of the communication network and the monitoring methods. In a phone network, A includes attributes such as the length of the call. On the Internet, A includes the source and destination ports associated with a TCP connection and other connection parameters. It is convenient to regard this stream of tuples as the edges of a connection multigraph whose nodes represent communication network nodes (e.g., phone numbers) and whose edges represent messages annotated with additional attributes (e.g., phone calls with durations). In most networks, such a list is never-ending and therefore better modeled as a stream of tuples. Another characteristic of the data from network monitoring is that it is typically produced at a very high rate. For example, call records
Fig. 2. System architecture
on a phone network and TCP connection build-ups and break-downs occur at a very high rate. It is important to analyze such stream data using online methods that detect important patterns as early as possible. (For example, detecting that a ring of thieves is about to move to another state or country may prompt immediate action if the detection is timely.) Further, indiscriminately storing such stream data can exhaust even the large amounts of inexpensive storage currently available. Storing the data indiscriminately also makes it more difficult to operate on the data, as less interesting data is likely to slow access to the interesting data. On the other hand, many of the kinds of operations required by this application are not likely to yield to purely online methods. For example, many data mining algorithms require random access to data on disk and cannot be easily modified for the restrictions of stream data. Thus, a practical solution is likely to require both online and offline analysis methods that operate cooperatively. So far, we have not indicated how the results of the automated or semi-automated methods suggested above are presented to the analyst responsible for decisions, nor have we indicated how such analysts may use their knowledge to direct and guide the tracking process. A simple solution here is to process data in batches, and provide input in batches. For example, a detective may analyze the output of the tracking method from yesterday and adjust the input parameters for guiding the method when it is run on today's data. This solution has problems analogous to those encountered by batch-based solutions to the tracking problem. Again, it is desirable to provide methods that permit online viewing of the results of tracking and immediate fine-tuning of the tracking process. Assuming we have at hand streaming methods for tracking s-groups, we need methods for visualizing, searching, and manipulating the streaming and dynamic data generated by these methods.
System Architecture. Figure 2 depicts the high-level architecture of our system for tracking s-groups. The monitoring devices on the network (e.g., instrumented routers on the Internet) produce a stream of tuples, each of which describes a
message between nodes. This stream of tuples is sent to both the online analysis module and the storage module. The storage module is responsible for recording the stream and merging it with the archived data at suitable intervals (say, every 24 hours). The online analysis module uses the stream to trigger detection features based on the archived data and input from the analyst. The offline analysis module is where methods that are not suited to stream processing are implemented. These methods can be classified as data mining or pattern detection methods that require random access to data. The exploration module includes a graphical user interface and, more important, implementations of methods for quickly assimilating vast amounts of data at varying levels of detail. The data includes the stream data processed to varying degrees, the results of the online and offline analysis modules, and an integration with external data sources that are relevant to an analyst's decision making process (e.g., newswire articles, police reports, memos). In Section 2, we describe methods for detecting frequent patterns in the connection graph. These methods form the building blocks of the offline analysis module. Section 3 describes methods for exploring large volumes of graph data using a zooming interface that form the basis of the exploration module. We discuss related work briefly in Section 4 and conclude in Section 5. Due to space constraints, we do not discuss the online analysis module here, and refer the interested reader to [5] for details.
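The tuple stream at the heart of this architecture can be sketched in a few lines of Java. The class names and the consumer interface below are assumptions made for illustration; they are not part of the actual system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One monitored message: a tuple (n1, n2, t, A) as described in the text.
class MessageTuple {
    final String source;              // n1, e.g., a phone number or IP address
    final String destination;         // n2
    final long timestamp;             // t
    final Map<String, String> attrs;  // A: call length, ports, etc.
    MessageTuple(String source, String destination, long timestamp,
                 Map<String, String> attrs) {
        this.source = source; this.destination = destination;
        this.timestamp = timestamp; this.attrs = attrs;
    }
}

// Consumers corresponding to the online-analysis and storage modules of Fig. 2.
interface TupleConsumer {
    void consume(MessageTuple tuple);
}

// Dispatcher that forwards each incoming tuple to every registered module,
// mirroring the fan-out from the monitoring devices to analysis and storage.
class EdgeStreamDispatcher {
    private final List<TupleConsumer> consumers = new ArrayList<>();

    void register(TupleConsumer consumer) { consumers.add(consumer); }

    void onTuple(MessageTuple tuple) {
        for (TupleConsumer c : consumers) {
            c.consume(tuple);
        }
    }
}
```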
2 Detecting Frequent Patterns
In this section, we describe our method for detecting hidden groups by analyzing large volumes of historical connection data obtained by network monitoring. This method is part of the offline analysis module of Figure 2. Recall that in this module, we are given a database consisting of a communication graph that forms a historical record of messages between nodes and we wish to detect potential s-groups for further investigation (and to serve as inputs for the online analysis module). The goal is to help an analyst detect s-groups by highlighting patterns in the data. The kinds of patterns of interest to analysts are likely to be varied and complex, and we do not attempt to completely automate the task of detecting them. Instead, our approach is to provide efficient implementations of a few key operations that the analyst may use to investigate the data based on real-world knowledge. In particular, we focus on the efficient implementation of an operation that is not only useful on its own, but also forms the building block for more sophisticated analysis methods (both automated and human-directed). This operation is the detection and enumeration of frequently occurring patterns, which are, informally, patterns of communicating nodes that occur frequently enough to be of potential interest for a detailed data analysis. (Such frequently occurring patterns are to our problem what frequent itemsets are to the problem of mining market basket data [1].) The main idea behind our method, which is called SEuS (Structure Extraction using Summaries), is the following three-phase process: In the first phase
(summarization), we preprocess the given dataset to produce a concise summary. This summary is an abstraction of the underlying graph data. Our summary is similar to data guides and other (approximate) typing mechanisms for semistructured data [12,15,4]. In the second phase (candidate generation), our method interacts with a human analyst to iteratively search for frequent structures and refine the support threshold parameter. Since the search uses only the summary, which typically fits in main memory, it can be performed very rapidly (interactive response times) without any additional disk accesses. Although the results in this phase are approximate (a superset of final results), they are accurate enough to permit uninteresting structures to be conservatively filtered out. When the analyst has filtered potential structures using the approximate results of the search phase, an accurate count of the number of occurrences of each potential structure is produced by the third phase (counting).
Fig. 3. Example input graph
Users are often willing to sacrifice quality for a faster response. For example, during the preliminary exploration of a dataset, one might prefer to get a quick and approximate insight into the data and base further exploration decisions on this insight. In order to address this need, we introduce an approximate version of our method, called L-SEuS. This method returns only the top-n frequent structures rather than all frequent structures. We present only a brief discussion of SEuS below, and refer the reader to [11] for a detailed discussion of both SEuS and L-SEuS.
Summarization. We use a data summary to estimate the support of a structure (i.e., the number of subgraphs in the database that are isomorphic to the structure). The summary is a graph with the following characteristics. For each
Fig. 4. A structure and its three instances
distinct vertex label l in the original graph G, the summary graph X has an l-labeled vertex. For each m-labeled edge (v1, v2) in the original graph there is an m-labeled edge (l1, l2) in X, where l1 and l2 are the labels of v1 and v2, respectively. The summary X also associates a counter with each vertex (and edge) indicating the number of vertices (respectively, edges) in the original graph that it represents. For example, Figure 5 depicts the summary generated for the input graph of Figure 3.
Fig. 5. Summary graph
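A minimal sketch of the summarization phase, and of the support estimate discussed in the next paragraph, is shown below. It assumes the data graph is supplied as labeled vertices and labeled edges; the class and method names are illustrative, not the SEuS implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Summary of a labeled graph: one counter per vertex label and one counter
// per (sourceLabel, edgeLabel, destLabel) triple, as described in the text.
class GraphSummary {
    final Map<String, Integer> vertexCounts = new HashMap<>();
    final Map<String, Integer> edgeCounts = new HashMap<>();

    void addVertex(String label) {
        vertexCounts.merge(label, 1, Integer::sum);
    }

    void addEdge(String sourceLabel, String edgeLabel, String destLabel) {
        String key = sourceLabel + "->" + edgeLabel + "->" + destLabel;
        edgeCounts.merge(key, 1, Integer::sum);
    }

    // Upper-bound estimate of a structure's support: the minimum counter over
    // the structure's vertex labels and edge-label triples (0 if any is absent).
    int estimateSupport(Iterable<String> vertexLabels, Iterable<String> edgeKeys) {
        int min = Integer.MAX_VALUE;
        for (String v : vertexLabels) {
            min = Math.min(min, vertexCounts.getOrDefault(v, 0));
        }
        for (String e : edgeKeys) {
            min = Math.min(min, edgeCounts.getOrDefault(e, 0));
        }
        return min == Integer.MAX_VALUE ? 0 : min;
    }
}
```

Here estimateSupport returns the minimum counter over the labels appearing in the structure, which corresponds to the upper-bound estimate described next.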
We use the summary X to estimate the support of a structure S as follows: By construction, there is at most one subgraph of X (say, S′) that is isomorphic to S. If no such subgraph exists, then the estimated (and actual) support of S is 0. Otherwise, let C be the set of counters on S′ (i.e., C consists of counters
on the nodes and edges of S′). The support of S is estimated by the minimum value in C. Given our construction of the summary, this estimate is an upper bound on the true support of S.
Candidate Generation. The candidate generation phase is a simple search in the space of structures isomorphic to at least one subgraph of the database. We maintain two lists of structures: open and candidate. In the open list we store structures that have not been processed yet (and that will be checked later). The algorithm begins by adding all structures that consist of only one vertex and pass the support threshold test to the open list. The rest of the algorithm is a loop that repeats until there are no more structures to consider (i.e., the open list is empty). In each iteration, we select a structure S from the open list and use it to generate larger structures (called S's children) by calling the expand subroutine, described below. New child structures that have an estimated support greater than the threshold are added to the open list. The qualifying structures are accumulated in the candidate list, which is returned as the output when the algorithm terminates.
Given a structure S, the expand subroutine produces the set of structures generated by adding a single edge to S (termed the children of S). In the following description of the expand(S) subroutine, we use S(v) to denote the set of vertices in S that have the same label as vertex v in the data graph and V(s) to denote the set of data vertices that have the same label as a vertex s in S. For each vertex s in S, we create the set addable(S, s) of edges leaving some vertex in V(s). This set is easily determined from the data summary: it is the set of out-edges for the summary vertex representing s. Each edge e = (s, v, l) in addable(S, s) that is not already in S is a candidate for expanding S. If S(v) (the set of vertices with the same label as e's destination vertex) is empty, we add a new vertex x with the same label as v and a new edge (s, x, l) to S. Otherwise, for each x ∈ S(v), if (s, x, l) is not in S, a new structure is created from S and e by adding the edge (s, x, l) (an edge between vertices already in S). If s does not have an l-labeled edge to any of the vertices in S(v), we also add a new structure obtained from S by adding a vertex x′ with the same label as v and an edge (s, x′, l).
Support Counting. Once the analyst is satisfied with the structures discovered in the candidate generation phase, she may be interested in finalizing the frequent structure list and getting the exact support of the structures. This task is performed in the support counting phase. Let us define the size of a structure to be the number of nodes and edges it contains; we refer to a structure of size k as a k-structure. From the method used for generating candidates (Section 2), it follows that for every k-structure S in the candidate list there exists a structure Sp of size k−1 or k−2 in the candidate list such that Sp is a subgraph of S. We refer to Sp as the parent of S in this context. Clearly, every instance I of S has a subgraph I′ that is an instance of Sp. Further, I′ differs from I only in having one fewer edge and, optionally, one fewer vertex. We use these properties in the support counting process.
Determining the support of a 1-structure (single vertex) consists of simply counting the number of instances of a like-labeled vertex in the database. During the counting phase, we store not only the support of each structure (as it is determined), but also a set of pointers to that structure's instances on disk. To determine the support of a k-structure S for k > 1, we revisit the instances of its parent Sp using the saved pointers. For each such instance I′, we check whether there is a neighboring edge and, optionally, a node that, when added to I′, generates an instance I of S. If so, I is recorded as an instance of S.
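To make the search loop concrete, here is a highly simplified sketch of the candidate generation phase described above. The Structure and SupportEstimator interfaces are placeholders for the corresponding pieces of SEuS; duplicate elimination and the details of edge expansion are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Placeholder for a candidate structure; expand() would add a single edge in
// all legal ways permitted by the data summary (the "children" of the structure).
interface Structure {
    List<Structure> expand();
}

// Placeholder for the summary-based support estimate (an upper bound).
interface SupportEstimator {
    int estimate(Structure s);
}

class CandidateGenerator {
    // Returns all structures whose estimated support passes the threshold,
    // following the open/candidate list search described in the text.
    static List<Structure> generate(List<Structure> singleVertexStructures,
                                    SupportEstimator estimator, int threshold) {
        Deque<Structure> open = new ArrayDeque<>();
        List<Structure> candidates = new ArrayList<>();

        // Seed with all 1-vertex structures that pass the threshold test.
        for (Structure s : singleVertexStructures) {
            if (estimator.estimate(s) >= threshold) {
                open.push(s);
            }
        }
        // Repeatedly expand structures until no unprocessed ones remain.
        while (!open.isEmpty()) {
            Structure s = open.pop();
            candidates.add(s);
            for (Structure child : s.expand()) {
                if (estimator.estimate(child) >= threshold) {
                    open.push(child);
                }
            }
        }
        return candidates;
    }
}
```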
Fig. 6. A screenshot of the SEuS system
3 Visual Exploration
In this section, we describe methods for implementing the exploration module of Figure 2. Recall that the task of this module is to help the analyst assimilate the output of the automated analysis modules (offline and online) as well as the external data feed (newswire articles, intelligence reports, etc.). The interconnections between data items from different sources are of particular interest. In this module, we model data as a multiscale graph in which nodes represent data items and edges represent the relationships among them. At a high level, this graph aggregates many data items into one node; at the lowest level, each node
Fig. 7. Two kinds of logical zooming
represents a single data item or concept (e.g., a phone number). This representation allows the analyst to work at a level of abstraction best suited to the task at hand. We have implemented methods for exploring such graphical data at varying levels of detail as part of our VQBD system [6], and we describe the key ideas below. Although VQBD is extensible and incorporates many features for the power user, it is designed to be accessible to a casual user. To this end, the basic modes of interacting with the system are very simple. At all times, the VQBD display consists of a single window with a graphical representation of the XML data. Although, as we shall see below, this representation may be the result of some complex operations, the user interface is always the same: there are nodes (boxes) representing data elements (often summarized) and arcs (lines) representing relationships among them. There are no tool-bars, scroll-bars, sliders, or other widgets. We believe this simplicity is key to usability by a casual user. The basic modes of controlling VQBD, described below, are also simple and unchanging. The first three are meant for the casual user, while the next two are for users who have gained more experience with the system.
Panning. The displayed objects can be moved in any direction relative to the canvas by a dragging motion with the left button of the mouse.
Zooming. The display may be zoomed in (or out) by a right- (respectively, left-) dragging motion with the right mouse button. VQBD uses the position of the
pointer to determine the type of zooming. If the pointer is outside all graphical objects, then the result is simple graphical zooming (e.g., larger objects, bigger fonts). If the pointer is inside a graphical object, then the data resolution of that object, and any others of a similar type, is increased. For example, consider the screenshot in Figure 8(b). The lower part represents speech and line objects and includes sample values from the input document. Zooming in with the pointer inside the larger box (representing the collection of line objects) results in the display of a larger number of sample speech objects. Zooming in with the pointer inside one of the smaller boxes representing an individual line object displays that object in more detail (more text). Figure 7 illustrates these two modes of zooming. In the case of other visualization modules (e.g., histograms), zooming results in actions appropriate to that module (e.g., histogram refinement).
Link Navigation. Clicking on a link causes the display to recenter itself around the target of the link at an appropriate zoom level. Following the design method of the Jazz toolkit, such link navigation is not instantaneous; instead it occurs at a speed that allows the viewer to discern the relative positions of the referencing and referenced objects. In addition to selecting an appropriate graphical zoom level, VQBD automatically picks a suitable logical zoom level. For example, a collection of numbers that is too large to display in its entirety is often presented as a histogram.
View Change. While VQBD automatically selects an appropriate method for visualizing data at the available resolution, the user may override this selection using a pop-up menu bound to the middle mouse button. For example, a user interested in the highest values in a collection of numbers may force VQBD to change the view from histogram to sorted list.
Querying. The XML document may be queried using a query-by-example interface. This interface permits users to specify selection conditions as annotations on displayed objects. In addition, the user may mark objects as distinguished objects for use in queries. Intuitively, these objects can be used as the starting points for query-based exploration. VQBD has built-in query modules for regular expressions and XPath. Additional query modules can be easily added using the plug-in interface. More precisely, these objects are logically inserted into a table that can be used in the from clause of OQL-like queries.
Since we do not have access to realistic monitoring data, we illustrate the key features of VQBD using a sample user session based on Jon Bosak's XML rendition of Shakespeare's A Midsummer Night's Dream, available at http://www.ibiblio.org/xml/examples/shakespeare/. The system parses the data and graphically presents a summary of its implicit structure with objects representing the play, acts, scenes, and lines. This structural summary is the default view presented by VQBD. A screenshot appears as Figure 8(a). Note that the screenshots in Figure 8 are based on a rather small VQBD display (approximately 350x350 pixels). While we picked this size primarily to fit the space
Fig. 8. Two screenshots of VQBD in action: (a) zoomed out, structural summary; (b) zoomed in, instances
constraints of this report, it also illustrates how VQBD's zooming interface allows it to function effectively at this size. In this example, the summary is small enough to be displayed in its entirety. However, when the summary is larger (or the screen smaller), the panning and graphical zooming features of VQBD are used to view the summary. Now suppose the analyst zooms in on the speech object using a dragging motion with the right mouse button. Initially, the zooming produces standard graphical effects (larger objects, higher resolution text, etc.). However, as soon as the object becomes large enough to display graphical elements within it, the graphical zooming is accompanied by a logical zooming: a few sample elements are displayed. VQBD displays randomly sampled elements, with the number of displayed elements increasing as the available space increases as a result of the zooming-in operation. Figure 8(b) is a screenshot at this stage of exploration. In addition to details of the speech and line elements, details of scene elements (appearing above the speech elements in Figure 8(b)) are partially visible, providing a useful context. These figures do not convey the colors used by VQBD for indicating many relationships, including grouping elements based on parents (enclosing elements). When a sample element is displayed in this manner, VQBD reads its attributes and sub-elements to pick a short string that distinguishes the element from others with the same tag. This string is displayed within the object representing the element on screen. In our example,
VQBD uses the scene titles to identify scene elements on screen. At this stage, the analyst also has the option of single-clicking on any of the displayed objects, causing VQBD to display all details of the selected object. For example, clicking on the scene object labeled A hall in the castle results in displaying the scene in greater detail (as much as will fit in the VQBD window). Note that this clicking action is simply an accelerated form of zooming; the same result could be achieved by zooming in to the scene object. Subelements of the scene element are displayed as active links that can be activated in order to smoothly transport the display to the referenced object. This link-based navigation can be freely interleaved with zooming. Zooming out at this point results in VQBD retracing its steps, displaying data in progressively less detail until we are back at the original structural summary view. In addition to browsing data in this manner, an analyst may also query data using the VQBD interface. For example, if a scene object is selected as the origin of a search for the string Lysander, VQBD executes the query and highlights objects in the query result. In our sample data, the query string matches elements of different types (two persona elements, one stagedir element, and several speaker and line elements). If the current resolution is insufficient to display individual objects, only the structural summary objects corresponding to the individual objects are highlighted. To view the query results in detail, one may zoom in as before. Unlike the earlier zooming action, which displayed a random sample of all elements corresponding to the summary object, VQBD now displays a sample chosen only from the elements in the query result. When all elements in the query result have been displayed, further zooming results in a random selection from the remaining elements (as before). (Colors are used to distinguish the elements in the query result from the rest of the elements.) This exploration of query results may be interleaved with zooming, panning, query refinement, and other VQBD operations.
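The split between graphical and logical zooming that drives this interaction can be summarized in a short sketch. The types and method names are assumptions made for illustration; VQBD's actual implementation, built on a zooming toolkit, is not reproduced here.

```java
import java.awt.Point;
import java.awt.Rectangle;
import java.util.List;

// One displayed object (a summary box or a sample element) with its screen bounds.
class DisplayObject {
    final Rectangle bounds;
    int sampleCount;          // how many underlying elements are currently shown
    double scale = 1.0;       // graphical scale factor

    DisplayObject(Rectangle bounds) { this.bounds = bounds; }

    boolean contains(Point p) { return bounds.contains(p); }
}

// Decides between graphical and logical zooming based on pointer position,
// mirroring the behavior described for VQBD's right-button drag.
class ZoomController {
    private final List<DisplayObject> objects;

    ZoomController(List<DisplayObject> objects) { this.objects = objects; }

    void zoomIn(Point pointer) {
        DisplayObject target = null;
        for (DisplayObject o : objects) {
            if (o.contains(pointer)) { target = o; break; }
        }
        if (target == null) {
            // Pointer outside all objects: plain graphical zoom.
            for (DisplayObject o : objects) { o.scale *= 1.25; }
        } else {
            // Pointer inside an object: increase that object's data resolution,
            // e.g., show more randomly sampled elements of its type.
            target.sampleCount += 5;
            target.scale *= 1.25;
        }
    }
}
```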
4 Related Work
There is a long history of work on network and graph analysis. However, many of the methods do not scale to the amount of data generated by the network monitoring situations that interest us. For high-volume data, work on Communities of Interest [10,9] is perhaps the closest to our work. A method for managing high-volume call-graph data from a phone network based on daily merging of records is described in [10]. There is work on structure discovery in specific domains; a detailed comparison of several such methods appears in [7]. We are more interested in domain independent methods such as CLIP and Subdue [16,8]. The method of Section 2 differs from these in its use of a summary structure to yield an interactive system with high throughput. A detailed discussion and performance study appears in [11]. AGM [13] is an algorithm for finding frequent structures that uses an algorithm similar to the apriori algorithm for market basket data [2]. The FSG [14] is similar to AGM but uses a sparse graph representation that minimizes storage
and computation costs. The FREQT algorithm is based on the idea of discovering tree structures by attaching nodes only to the rightmost branches of trees [3]. The general idea of using a succinct summary of a graph for various purposes has a large body of work associated with it. For example, this idea is developed in semistructured databases as graph schemas, representative objects, and data guides, which are used for constraint enforcement, query optimization, and query-by-example interfaces [4,15,12].
5 Conclusion
We described and formalized the problem of tracking hidden groups of entities using only their communications, without a priori knowledge of the communication device identifiers (e.g., phone numbers) used by the entities. We discussed the practical constraints on the environment in which this problem must be solved and presented a system architecture that combines offline analysis, online analysis, and interactive exploration of both raw and processed data. We described our work on methods that form the basis of some of the system modules. We have conducted detailed evaluations of these methods by themselves and are now working on assembling and evaluating the system as a whole.
Acknowledgments. Shayan Ghazizadeh helped design and implement the SEuS system. Jihwang Yeo and Thomas Baby implemented parts of the VQBD system. This work was supported by National Science Foundation grants in the CAREER (IIS-9984296) and ITR (IIS-0081860) programs.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. SIGMOD Record, 22(2):207–216, June 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
3. Tatsuya Asai, Kenji Abe, Shinji Kawasoe, et al. Efficient substructure discovery from large semi-structured data. In Proc. of the Second SIAM International Conference on Data Mining, 2002.
4. P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory, 1997.
5. Sudarshan S. Chawathe. Tracking moving clutches in streaming graphs. Technical Report CS-TR-4376 (UMIACS-TR-2002-56), Computer Science Department, University of Maryland, College Park, Maryland 20742, May 2002.
6. Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. VQBD: Exploring semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001. Demonstration Description.
7. D. Conklin. Structured concept discovery: Theory and methods. Technical Report 94-366, Queen's University, 1994.
8. D. J. Cook and L. B. Holder. Graph-based data mining. ISTA: Intelligent Systems & their Applications, 15, 2000.
9. Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5:167–182, 2001.
10. Corinna Cortes, Daryl Pregibon, and Chris Volinsky. Communities of interest. In Fourth International Symposium on Intelligent Data Analysis (IDA 2001), Lisbon, Portugal, 2001.
11. Shayan Ghazizadeh and Sudarshan S. Chawathe. SEuS: Structure extraction using summaries. In Steffen Lange, Ken Satoh, and Carl H. Smith, editors, Proceedings of the 5th International Conference on Discovery Science, volume 2534 of Lecture Notes in Computer Science (LNCS), pages 71–85, Lubeck, Germany, November 2002. Springer-Verlag.
12. R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the Twenty-third International Conference on Very Large Data Bases, Athens, Greece, 1997.
13. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 13–23, 2000.
14. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of the 1st IEEE Conference on Data Mining, 2001.
15. S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of the International Conference on Data Engineering, pages 79–90, 1997.
16. K. Yoshida, H. Motoda, and N. Indurkhya. Unifying learning methods by colored digraphs. In Proc. of the International Workshop on Algorithmic Learning Theory, volume 744, pages 342–355, 1993.
Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study
Paul Jen-Hwa Hu¹, Chienting Lin², and Hsinchun Chen²
¹ Accounting and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, Utah 84112
[email protected]
² Management Information Systems, Eller College of Management, University of Arizona, Tucson, Arizona 85721
{linc,hchen}@eller.arizona.edu
Abstract. Management of technology implementation has been a critical challenge to organizations, public or private. In particular, user acceptance is paramount to the ultimate success of a newly implemented technology in adopting organizations. This study examined acceptance of COPLINK, a suite of IT applications designed to support law enforcement officers’ analyses of criminal activities. We developed a factor model that explains or predicts individual officers’ acceptance decision-making and empirically tested this model using a survey study that involved more than 280 police officers. Overall, our model shows a reasonably good fit to officers’ acceptance assessments and exhibits satisfactory explanatory power. Our analysis suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Subjective norm also appears to have a significant effect on user acceptance through the mediation of perceived usefulness. Several managerial implications derived from our study findings are also discussed.
highly specialized tasks and often have considerable autonomy. As Chau and Hu commented [5], the fast-growing investment in and deployment of innovative technologies that support individual professionals demand additional investigations of their technology acceptance decision-making. Law enforcement is a fundamental and critical aspect of government services, as measured by its profound impacts on homeland security. By and large, law enforcement agencies are in the intelligence business and their crime fighting/prevention capability depends on individual officers' timely access to relevant and accurate information presented in an effective and easily assimilated manner. When investigating a criminal case or monitoring an organized gang ring, a police detective usually has to access, scrutinize, and integrate relevant information from various sources, internal and external. Because of its stringent information/knowledge support requirements, law enforcement indeed represents a service sector in which applications of information systems (IS) research and practice are inherently appealing and increasingly important. Our observations also suggest that individual officers usually have considerable autonomy in their case analysis and investigative tasks, thus manifesting or resembling a professional work arrangement. Together, the specialized and critical services in law enforcement, the extensive information/knowledge management support requirements, and individual autonomy demand further examinations of user technology acceptance in law enforcement settings. Investigations of technology acceptance by individual law enforcement officers, nonetheless, have received limited attention from IS researchers. In response, this study aims at examining user acceptance of COPLINK [6-7], a suite of applications designed to provide enhanced information sharing and knowledge management support to officers within and across law enforcement agencies. Specifically, we developed a factor model that explains or predicts individual officers' acceptance decision-making and then empirically tested the model using a survey study that involved more than 280 police officers. The current research purports to identify key technology acceptance drivers in law enforcement settings and investigate how these drivers and their effects might differ from those commonly observed in business contexts. The following section reviews relevant prior research and highlights our motivation.
2 Literature Review and Motivation
In this study, technology acceptance broadly refers to an individual's psychological state with regard to his or her voluntary and intentional use of a technology [13]. User technology acceptance has been examined extensively in IS research. A review of relevant previous studies suggests the dominance of a cognitive/behavioral anchor in conceptualizing and analyzing individual technology acceptance. According to this approach, an individual is conscious about his or her acceptance of a technology, which can be sufficiently explained or mediated by the underlying behavioral intention. Substantial empirical support for the explanatory/mediating power of behavioral intention for actual technology use has been established. As Mathieson [17] concluded, "given the strong causal link between intention and actual behavior, the fact that behavior was not directly assessed is not a serious limitation." Several theories that anchor at behavioral intention have prevailed, including the Theory of Reasoned Action
[12], the Theory of Planned Behavior [1]-[2], the Diffusion of Innovations Theory [20], and the Technology Acceptance Model [11]. Rooted in social psychology, the Theory of Reasoned Action (TRA) suggests that an individual's acceptance of a technology can be explained by his or her intention, which is jointly determined by attitudinal beliefs and (perceived) subjective norm. The Theory of Planned Behavior (TPB) extends TRA by incorporating an additional construct (i.e., perceived behavioral control) to account for situations where an individual lacks the capability or resources necessary for performing the behavior under discussion. The Diffusion of Innovations (DOI) theory also has premises established in social psychology, positing that the diffusion of an innovation in a social system is jointly affected by the communication of key innovation attributes that include relative advantage, complexity, compatibility, demonstrability, and trialability. Overall, these theories are generic and have been applied to explain a wide array of individual behaviors, including technology acceptance. Previous individual technology acceptance studies that used TRA, TPB, or DOI as a theoretical foundation have garnered considerable empirical support for the respective theories. The Technology Acceptance Model (TAM) is adapted from TRA and is developed specifically for explaining individual technology acceptance across different technologies, user groups, and contexts. According to TAM, an individual's decision on whether or not to accept a technology can be sufficiently explained by behavioral intention which, in turn, is determined by his or her perception of the technology's usefulness and ease of use. Judged by its frequent use in prior studies, TAM has emerged as a predominant model for individual technology acceptance. This model, however, has been criticized for its parsimonious structure, which limits its use for designing effective organizational interventions that foster technology acceptance. As Mathieson commented [17], "TAM is predictive, but its generality does not offer sufficient understanding to provide system designers with information needed for creating and promoting user acceptance of new systems." Nevertheless, TAM offers a valid and generic framework upon which extended or detailed models can be developed for specific user acceptance scenarios. Collectively, findings from previous research suggest that analysis of user technology acceptance in an organizational setting should consider key characteristics pertaining to multiple fundamental contexts. For instance, Tornatzky and Klein [24] suggested that an individual's acceptance decision in an organizational setting is jointly affected by factors pertaining to the technological context, the organizational context, and the external environment. Similarly, Chau and Hu [5] examined individual technology acceptance in a professional setting and singled out the importance of the technological, individual, and (organizational) implementation contexts. Igbaria et al. [15] highlighted the importance of the management context. Goodhue and Thompson [14] discussed the importance of the technology and task contexts, advocating a contingency fit between them. A review of the literature suggests that conceptualization of user technology acceptance needs to include multiple fundamental contexts, and that model development should proceed from identifying important characteristics of these contexts, based on the user acceptance phenomenon examined.
In addition, our literature review suggests the value of developing and empirically evaluating specific models that extend generic theories or models; e.g., [4], [23], [26], and [27]. According to this approach, a generic theory or model is used as a grounded framework upon which a detailed model is developed for a targeted user acceptance scenario, e.g., via inclusion of additional constructs or antecedents of key
acceptance drivers. The current research used both TAM and TPB as a theoretical framework for anchoring our analysis of key determinants of individual officers’ acceptance of COPLINK. Our model contained major TAM constructs (e.g., perceived usefulness and perceived ease of use), as well as their key antecedents and other constructs from TPB. During our model development, we also took into consideration important characteristics pertinent to our targeted technology, user group, and organizational (implementation) context.
3 Overview of COPLINK Technology

The COPLINK project was initiated and undertaken by the Artificial Intelligence Lab at the University of Arizona, in collaboration with the Tucson Police Department (TPD). An important project objective was to design, develop, and deploy innovative technology solutions to support and enhance information sharing and collaborative investigation within and across regional law enforcement agencies. Funded by the National Institute of Justice (NIJ) and the Digital Government Initiative of the National Science Foundation (NSF), the project has delivered COPLINK [6]-[7], which currently consists of two distinct but complementary applications: COPLINK Connect and COPLINK Detect. COPLINK Connect allows detectives and field officers to access data in other jurisdictions or government agencies, beyond the constraints of system or platform heterogeneity. COPLINK Detect extends the capabilities of Connect by supporting individual officers’ analysis of sophisticated criminal links and networks, using integrated and shared data. At the time of the study, a large-scale deployment of COPLINK had just been completed at TPD and implementation planning was underway in other jurisdictions in the states of Arizona and Texas. In parallel, technology development in COPLINK also continued, aiming at further enhanced information/knowledge management support and extended functionality through the use of agent and wireless technologies.
4 Research Model and Hypotheses

As shown in Figure 1, our research model suggests that an individual officer’s decision to accept or not to accept a technology can be explained by important characteristics pertaining to the technological, individual, and organizational contexts. Specifically, perceived usefulness, perceived ease of use, and efficiency gain are fundamental determinants of the technological context. Consistent with the propositions of TAM, our model states that perceived usefulness and perceived ease of use jointly determine attitude, and that perceived ease of use has a direct positive effect on perceived usefulness. All other factors being equal, an officer is more likely to consider COPLINK to be useful when it is easy to use. Efficiency gain refers to the degree to which an officer perceives his or her task performance efficiency would be improved through the use of COPLINK. Agility is critical in law enforcement, where individual officers are in a constant competition against time. In most cases, officers must respond to crime fighting/prevention challenges in a timely manner. Results from our preliminary evaluation of COPLINK showed that individual officers had
placed great importance on task performance efficiency resulting from their use of the technology. Accordingly, we tested the following hypotheses.
H1: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H2: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her intention to accept the technology.
H3: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H4: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her perception of the technology’s usefulness.
H5: An officer’s perceived efficiency gain through the use of COPLINK has a positive effect on his or her perception of the technology’s usefulness.
Within a law enforcement setting, attitude is critical to the individual context and refers to an individual officer’s positive or negative attitudinal beliefs about the use of COPLINK. Through previous technology demonstrations and recently completed user training, officers at TPD were expected or likely to have developed personal assessments of and attitudinal beliefs about COPLINK. According to TAM and TPB, an individual who has a positive attitude towards a technology is likely to exhibit a strong intention to accept the technology. Venkatesh and Davis [25] and others (e.g., [11]) have questioned the effectiveness of attitude in mediating the impact of perceived usefulness and perceived ease of use on behavioral intention, thus suggesting its removal
from TAM and its extensions. In this study, we retained attitude in our model as a key intention determinant, partially because of the described autonomy of individual law enforcement officers, including their technology choice and use. Thus, we tested the following hypothesis.
H6: An officer is likely to have a strong intention to accept COPLINK when he or she has a positive attitude towards the technology.
Subjective norm and availability are key characteristics of the organizational (implementation) context. Consistent with TPB, subjective norm refers to an officer’s assessment or perception of significant referents’ desire or opinion on whether or not he or she should accept COPLINK [1]-[2]. In this study, the organizational context includes the communication of COPLINK assessments by administrators and individual officers in an adopting agency and therefore encompasses the management context discussed by Igbaria et al. [15]. Specifically, we posit that subjective norm has a direct positive effect on both perceived usefulness and behavioral intention. Within the social system common to law enforcement agencies, an officer’s behavior might be somewhat affected by significant referents’ opinions or suggestions. Consequently, an officer is likely to consider COPLINK to be useful, and thus to develop a strong intention for its acceptance, when his or her significant referents are in favor of the technology. By and large, officers appear to have a relatively strong psychological attachment to their agency and the social system within it; therefore, they are likely to develop and exhibit a close bond with colleagues and administrative commanders. Such psychological attachment and personal bond might be partially attributed to several factors that include an agency’s non-profit nature, less direct peer competition for resources or promotion (as compared with business organizations), personal commitment to public services, relatively long-term career pursuit, and the closed community common to most agencies. Therefore, we tested the following hypotheses.
H7: An officer is likely to perceive COPLINK to be useful when his or her significant referents are in favor of the technology.
H8: An officer is likely to have a strong intention to accept COPLINK when his or her significant referents are in favor of the technology.
Availability is also essential to the organizational context. In this study, availability refers to an officer’s perception of the availability of the computing equipment necessary for using COPLINK. Availability is a fundamental aspect of perceived behavioral control (from TPB). As noted by Ajzen [1]-[2], perceived behavioral control embraces internal conditions (e.g., self-efficacy [3], [8]) and external conditions (e.g., facilitating conditions [23]). In their comparative examination of competing models, Taylor and Todd [23] explicitly separated the internal and external aspects of control beliefs. Similarly, Venkatesh [27] also argued that the availability of resources and opportunities required to perform a target behavior is an important aspect of perceived ease of use. Availability of the computing equipment necessary for using COPLINK has been singled out as a potential concern to many officers, particularly those routinely working on criminal case analysis or away from the department offices. Results from multiple focus group discussions and interviews with individual officers consistently suggested the importance of making available the necessary computing equipment. All other factors being equal, the greater the availability as perceived by an officer,
the stronger his or her intention to accept the COPLINK technology. Hence, we tested the following hypothesis.
H9: Availability of the computing equipment necessary for using COPLINK has a positive effect on an officer’s intention to accept the technology.
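To make the hypothesized structure concrete, the following is a minimal, hypothetical sketch of how the structural paths implied by H1–H9 could be specified in a structural equation modeling tool. It assumes the Python package semopy and illustrative item and construct names (PU, PEOU, EG, SN, AV, ATT, BI); it is not the LISREL specification actually used in the study.

# Hypothetical sketch only: the study itself used LISREL, not this code.
import pandas as pd
from semopy import Model

MODEL_DESC = """
PU =~ PU1 + PU2 + PU3 + PU4
PEOU =~ PEOU1 + PEOU2 + PEOU3 + PEOU4
EG =~ EG1 + EG2 + EG3
SN =~ SN1 + SN2
AV =~ AV1 + AV2 + AV3 + AV4
ATT =~ ATT1 + ATT2 + ATT3
BI =~ BI1 + BI2 + BI3
PU ~ PEOU + EG + SN
ATT ~ PU + PEOU
BI ~ PU + ATT + SN + AV
"""
# PU ~ PEOU + EG + SN      covers H4, H5, and H7
# ATT ~ PU + PEOU          covers H1 and H3
# BI ~ PU + ATT + SN + AV  covers H2, H6, H8, and H9

def fit_acceptance_model(responses: pd.DataFrame) -> pd.DataFrame:
    """Fit the hypothesized model to item-level survey responses (illustrative
    column names PU1..BI3) and return the estimated measurement and structural paths."""
    model = Model(MODEL_DESC)
    model.fit(responses)
    return model.inspect()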
5 Instrument Development and Validation

We empirically tested our model using a self-administered survey that involved more than 280 police officers who volunteered their technology acceptance assessments. Our research method choice was made primarily because of its broad coverage (e.g., number of respondents) and support of different quantitative analyses. All participating officers were from the Tucson Police Department. Our investigation proceeded immediately after the department had completed technology implementation (including testing) and mandatory user training.
Multiple methods were used in our survey instrument development. Candidate question items were first identified from relevant previous empirical studies. In parallel, we also conducted focus group discussions, as well as unstructured and semistructured interviews with individual officers from the participating police department and other similar agencies. Preliminary measurements for each included construct were obtained by combining our interview/discussion findings and the candidate items extracted from previously validated inventories. Three police officers then assessed the face validity of the resultant question items. Based on their comments and suggestions, several minor wording changes were made to tailor the items to the law enforcement context. All questionnaire items used a seven-point Likert scale, with anchors from “strongly agree” to “strongly disagree.” To ensure the desired balance and randomness of the questionnaire, half of the question items were worded with proper negation and all items were randomly sequenced.
A pretest was then conducted to validate the instrument in terms of reliability and construct validity. Although the question items were mostly drawn from previously validated measurements, we re-examined them to ensure the necessary validity in the law enforcement setting [21]. Our pretest included a total of 42 police officers who varied in rank and division. Using their responses, we examined the instrument’s reliability by evaluating the Cronbach’s alpha value for the respective constructs. As summarized in Table 1, all the constructs showed an alpha value greater than 0.70, a commonly suggested threshold for exploratory research [19]. In addition, we also used pretest responses to assess the instrument’s construct validity in terms of convergent and discriminant validity [21]. Specifically, we performed a principal component factor analysis, which yielded a total of seven components, matching the exact number of constructs specified in our model. As shown in Table 2, items intended to measure a particular construct exhibited a distinctly higher factor loading on a single component than on other components, suggesting the measurements were of adequate convergent and discriminant validity. The validated measurements were subsequently used in the survey study, from which individuals who had participated in the instrument development or pretest were excluded. The question items used in the study are listed in the Appendix.
Table 1. Reliability analysis – Cronbach’s alpha by construct: Perceived Usefulness (PU), Perceived Ease of Use (PEOU), Subjective Norm (SN), Attitude (ATT), Behavioral Intention (BI); as noted above, every construct showed an alpha value greater than 0.70.
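A minimal, hypothetical sketch of the two pretest checks described above – Cronbach’s alpha per construct and a principal component analysis of the item pool – using pandas and scikit-learn. The item column names are illustrative, and the sketch omits the factor rotation that a fuller replication of the reported analysis might apply; it is not the software or procedure the authors used.

# Hypothetical sketch of the pretest reliability and validity checks described above.
import pandas as pd
from sklearn.decomposition import PCA

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for the items measuring one construct."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def pretest_checks(responses: pd.DataFrame, constructs: dict) -> pd.DataFrame:
    """Print per-construct alpha (0.70 exploratory threshold) and return item
    loadings on as many principal components as there are constructs."""
    for name, cols in constructs.items():
        print(f"{name}: alpha = {cronbach_alpha(responses[cols]):.2f}")
    all_items = [c for cols in constructs.values() for c in cols]
    pca = PCA(n_components=len(constructs))
    pca.fit(responses[all_items])
    return pd.DataFrame(pca.components_.T, index=all_items).round(2)

Items that load distinctly on a single component, as reported in Table 2, would support convergent and discriminant validity.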
6 Data Analysis Results

A self-administered survey study was conducted to test our research model and hypotheses. With the assistance of multiple assistant chiefs and captains, questionnaires were distributed through the line of command using an email attachment. Our subjects were individual officers who had been identified as target users of COPLINK and had completed the mandatory user training. The participating officers were from investigative and field operations divisions, and each of them was given two weeks to complete and return the questionnaire. Officers who had failed to complete and return the survey within the initial time window were reminded and given another two weeks to do so. A final one-week time window was then offered to those who still failed to respond.
Of the 411 questionnaires distributed, a total of 283 complete and effective responses were received, a 68.9% response rate. Analysis of the respondents’ gender distribution showed an approximate 4:1 ratio in favor of males. Most respondents were from the field operations divisions (60%), followed by the Criminal Investigative Division and Special Investigative Division (35%). Most of the respondents had a two-year college or associate’s degree (41%), followed by those having a high school diploma (30%) and those holding a four-year college degree (29%). On average, the responding officers were 38.4 years of age and had had 12.1 years of experience in law enforcement services. Comparative analysis of the
officers who completed and returned the survey within the initial response period versus those who needed the extended response time window(s) showed no significant differences in gender or home division distribution, educational background, age, or experience in law enforcement. Table 3 summarizes the demographic profile of the 283 respondents in our survey.
Table 2. Examination of convergent and discriminant validity – factor analysis results (factor loadings of items PU-1 to PU-4, PEOU-1 to PEOU-4, BI-1 to BI-3, ATT-1 to ATT-3, SN-1 and SN-2, AV-1 to AV-4, and EG-1 to EG-3 on seven components, with eigenvalues and percentage of variance explained).
Model Testing Results. We tested our research model using LISREL. Analysis results showed our model exhibiting a reasonable fit to the data; e.g., the Comparative Fit Index (CFI) was 0.91, the Non-Normed Fit Index (NNFI) was 0.89, and the Standardized Root Mean Square Residual (SRMSR) was 0.06. We also assessed the model’s explanatory power. As shown in Figure 1, our model exhibited satisfactory explanatory utility, accounting for 58% of the variance in intention, 66% of the variance in attitude, and 60% of the variance in perceived usefulness.
Individual Causal Paths. Six of the nine hypothesized causal paths were statistically significant; i.e., p-values of 0.05 or lower. As suggested by our analysis results, efficiency gain and subjective norm appeared to be significant determinants of perceived
usefulness, which, in turn, showed a significant effect on both attitude and behavioral intention. Perceived ease of use significantly affected attitude, which, however, was not a significant intention determinant. In addition, subjective norm appeared to have a significant effect on intention, but in direct opposition to our hypothesis. The remaining hypotheses were not supported by our data; i.e., perceived ease of use on perceived usefulness, availability on intention, and attitude on intention (which might have been somewhat significant).

Table 3. Summary of respondents’ demographic profile
Demographic Dimension: Descriptive Statistics
Average Age: 38.4 years
Average Experience in Law Enforcement: 12.1 years
Gender: Male 81%; Female 19%
Home Division: Criminal/Special Investigative 35%; Field Operations 60%; Other 5%
Education Background: 4-Year College or University 29%; 2-Year College 41%; High School 30%
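The early-versus-late respondent comparison mentioned above is a common check for nonresponse bias. The following is a minimal, hypothetical sketch of such a comparison using scipy; the column names (“age”, “experience”, “gender”, “division”, “education”, “wave”) are illustrative and are not taken from the authors’ data set.

# Hypothetical sketch of an early- vs. late-respondent (nonresponse bias) check.
import pandas as pd
from scipy import stats

def nonresponse_bias_check(df: pd.DataFrame) -> None:
    early = df[df["wave"] == "initial"]
    late = df[df["wave"] == "extended"]
    # Continuous demographics: two-sample t-tests (unequal variances).
    for col in ("age", "experience"):
        t, p = stats.ttest_ind(early[col], late[col], equal_var=False)
        print(f"{col}: t = {t:.2f}, p = {p:.3f}")
    # Categorical demographics: chi-square tests on contingency tables.
    for col in ("gender", "division", "education"):
        table = pd.crosstab(df["wave"], df[col])
        chi2, p, dof, _ = stats.chi2_contingency(table)
        print(f"{col}: chi2 = {chi2:.2f}, p = {p:.3f}")

Non-significant test results across these dimensions would be consistent with the "no significant differences" finding reported above.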
7 Discussion

Overall, our model showed a reasonably good fit to the responding officers’ technology acceptance assessments and exhibited an explanatory power comparable to, if not higher than, that of representative previous studies; e.g., [17], [23]. Several research and management implications can be derived from our findings.
First, our study suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Perceived usefulness may be the single most important driver in individual officers’ technology acceptance decision-making. Based on our model testing results, perceived usefulness appears to be the only construct that has a significant direct positive effect on intention. The observed significance may suggest a tendency for an officer to anchor his or her technology acceptance decision in a utility perspective. This utility-centric view of technology is further supported by the insignificant influence of perceived ease of use on perceived usefulness. Together, our findings suggest that a law enforcement officer is not likely to consider a technology to be useful simply because it is easy to use. Efficiency gain is a critical aspect or source of utility. According to our analysis, many officers felt that the use of COPLINK would improve their task performance, and that COPLINK is useful for their work.
Second, subjective norm appears to be an important technology acceptance determinant, judged by its total effect on behavioral intention. According to our analysis,
subjective norm has a significant positive effect on individual acceptance decision-making, but this effect may be mediated by other factors; e.g., perceived usefulness. Individual officers are likely to take significant referents’ opinions into consideration when assessing a technology’s usefulness. However, such normative beliefs alone may not foster positive acceptance decisions directly. In effect, our analysis shows a negative effect of subjective norm on behavioral intention, significant at the 0.05 level. One possible interpretation is that an officer exhibiting a strong intention to use COPLINK may have developed a negative response to others’ desire that he or she should accept the technology, and vice versa. The observed negative effect might be partially attributed to individual autonomy in law enforcement, which resembles a professional setting to some degree.
Third, the influence of attitude on intention may be somewhat significant, as suggested by a p-value between 0.05 and 0.10. Perceived usefulness and perceived ease of use appear to be important determinants of an individual officer’s attitude toward COPLINK and together explain a significant portion of the variance in attitude; i.e., 66%. Our finding suggests that the importance of individual attitudes should not be underestimated. In this connection, administrators and technology providers need to proactively facilitate the cultivation and development of favorable attitudes by individual officers, particularly by means of convincing demonstrations and unambiguous communication of a technology’s utility and ease of operation. Management of individual attitude is essential in situations where law enforcement officers are relatively autonomous in task performance and technology use.
With increased understanding of key acceptance drivers and their probable causal relationships, administrators and technology providers can identify specific areas where user acceptance is likely to be hindered and tackle these barriers accordingly. In light of the prominent influence path from efficiency gain to perceived usefulness and then to intention to accept, initial demonstrations and user training should concentrate on communicating a technology’s utility for improving officers’ performance and emphasize the technology’s relevance to their routine tasks. Cultivating and promoting a favorable community assessment or view of the technology under discussion is also important and can create normative or even conformance pressure for individual acceptance decision-making. Such normative or compliant forces may not contribute directly to positive acceptance decisions, but can be so prevalent as to practically reinforce individual officers’ technology assessments. In addition, management of individual attitude towards a newly implemented technology is also relevant and deserves administrative or managerial attention in situations where individual officers have considerable autonomy in their task performance and technology choice/use.
Acknowledgement. We would like to thank the following TPD officers for their input and support: Chief Richard Miranda, Asst. Chief Kathleen Robinson, Asst. Chief Kermit Miller, Cap. David Neri, Lt. Jenny Schroeder, Det. Tim Petersen, and Daniel Casey. We also would like to thank Andy Moosmann for his invaluable assistance in data collection. The work reported in this paper was substantially supported by the Digital Government Program, National Science Foundation (NSF Grant # 9983304: “COPLINK Center: Information and Knowledge Management for Law Enforcement”).
References
1. Ajzen, I., “From Intention to Actions: A Theory of Planned Behavior,” in: Kuhl, J. and Beckmann, J. (eds): Action Control: From Cognition to Behavior, Springer Verlag, New York, 1985, pp. 11–39.
2. Ajzen, I., “The Theory of Planned Behavior,” Organizational Behavior and Human Decision Processes, Vol. 50, 1991, pp. 179–211.
3. Bandura, A., “Self-efficacy: Toward a Unifying Theory of Behavioral Change,” Psychological Review, Vol. 84, 1977, pp. 191–215.
4. Chau, P.Y.K., “An Empirical Assessment of a Modified Technology Acceptance Model,” Journal of Management Information Systems, Vol. 13, No. 2, 1996, pp. 185–204.
5. Chau, P.Y.K. and Hu, P.J., “Examining a Model for Information Technology Acceptance by Individual Professionals: An Exploratory Study,” Journal of Management Information Systems, Vol. 18, No. 4, 2002, pp. 191–229.
6. Chen, H., Schroeder, J., Hauck, R.V., Ridgeway, L., Atabakhsh, H., Gupta, H., Boarman, C., Rasmussen, K., and Clements, A.W., “COPLINK Connect: Information and Knowledge Management for Law Enforcement,” Decision Support Systems, Vol. 34, No. 3, 2003, pp. 271–285.
7. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and Schroeder, J., “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, Vol. 46, No. 1, 2003, pp. 28–34.
8. Compeau, D.R. and Higgins, C.A., “Computer Self-Efficacy: Development of a Measure and Initial Test,” MIS Quarterly, Vol. 19, 1995, pp. 189–211.
9. Cooper, R.B. and Zmud, R.W., “Information Technology Implementation Research: A Technology Diffusion Approach,” Management Science, Vol. 34, No. 2, 1990, pp. 123–139.
10. Davis, F.D., “Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology,” MIS Quarterly, Vol. 13, No. 3, September 1989, pp. 319–339.
11. Davis, F.D., Bagozzi, R.P., and Warshaw, P.R., “User Acceptance of Computer Technology: A Comparison of Two Theoretical Models,” Management Science, Vol. 35, No. 8, 1989, pp. 982–1003.
12. Fishbein, M. and Ajzen, I., Belief, Attitude, Intention and Behavior: An Introduction to Theory and Research, Addison-Wesley, Reading, MA, 1975.
13. Gattiker, U.E., “Managing Computer-based Office Information Technology: A Process Model for Management,” in H. Hendrick and O. Brown (eds), Human Factors in Organizational Design, Elsevier Science, Amsterdam, The Netherlands, 1984, pp. 395–403.
14. Goodhue, D.L. and Thompson, R.L., “Task-Technology Fit and Individual Performance,” MIS Quarterly, Vol. 19, No. 2, June 1995, pp. 213–236.
15. Igbaria, M., Guimaraes, T., and Davis, G.B., “Testing the Determinants of Microcomputer Usage via a Structural Equation Model,” Journal of Management Information Systems, Vol. 11, No. 4, 1995, pp. 87–114.
16. Keen, P., Shaping the Future: Business Design through Information Technology, Harvard Business School Press, Boston, MA, 1991.
17. Mathieson, K., “Predicting User Intention: Comparing the Technology Acceptance Model with the Theory of Planned Behavior,” Information Systems Research, Vol. 2, No. 3, 1991, pp. 173–191.
18. Moore, G.C. and Benbasat, I., “Development of an Instrument to Measure the Perceptions of Adopting an Information Technology Innovation,” Information Systems Research, Vol. 2, No. 3, 1991, pp. 192–223.
19. Nunnally, J.C., Psychometric Theory, 2nd edn, McGraw-Hill, New York, 1978.
20. Rogers, E.M., Diffusion of Innovations, 4th edn, Free Press, New York, NY, 1995.
21. Straub, D.W., “Validating Instruments in MIS Research,” MIS Quarterly, Vol. 13, No. 2, 1989, pp. 147–169.
22. Szajna, B., “Empirical Evaluation of the Revised TAM,” Management Science, Vol. 42, No. 1, 1996, pp. 85–92.
23. Taylor, S. and Todd, P.A., “Understanding Information Technology Usage: A Test of Competing Models,” Information Systems Research, Vol. 6, No. 1, 1995, pp. 144–176.
24. Tornatzky, L.G. and Klein, K.J., “Innovation Characteristics and Innovation Adoption-Implementation: A Meta-Analysis of Findings,” IEEE Transactions on Engineering Management, Vol. 29, No. 1, 1982, pp. 28–45.
25. Venkatesh, V. and Davis, F.D., “A Model of the Antecedents of Perceived Ease of Use: Development and Test,” Decision Sciences, Vol. 27, No. 3, 1996, pp. 451–482.
26. Venkatesh, V. and Davis, F.D., “A Theoretical Extension of the Technology Acceptance Model: Four Longitudinal Studies,” Management Science, Vol. 46, No. 2, 2000, pp. 186–204.
27. Venkatesh, V., “Determinants of Perceived Ease of Use: Integrating Control, Intrinsic Motivation, and Emotion into the Technology Acceptance Model,” Information Systems Research, Vol. 11, No. 4, 2000, pp. 342–365.
Appendix: Listing of Question Items

Perceived Usefulness (PU) – source: Venkatesh & Davis (1996)
PU-1: Using COPLINK would improve my job performance.
PU-2: Using COPLINK in my job would increase my productivity.
PU-3: Using COPLINK would enhance my effectiveness at work.
PU-4: Overall, I find COPLINK to be useful in my job.

Perceived Ease of Use (PEOU) – source: Venkatesh & Davis (1996)
PEOU-1: My interaction with COPLINK is clear and understandable.
PEOU-2: Interacting with COPLINK does not require a lot of mental effort.
PEOU-3: Overall, I find COPLINK easy to use.
PEOU-4: I find it easy to get COPLINK to do what I want it to do.

Attitude (ATT) – source: Taylor & Todd (1995)
ATT-1: Overall, it is a good idea to use COPLINK in my job.
ATT-2: Using COPLINK would be pleasant.
ATT-3: Using COPLINK would be beneficial to my work.

Subjective Norm (SN) – source: Taylor & Todd (1995)
SN-1: My colleagues in the department think that I should use COPLINK.
SN-2: I would use COPLINK more if I knew my boss wanted me to.

Efficiency Gain (EG) – source: Davis (1989)
EG-1: Using COPLINK reduces the time I spend completing my job-related tasks.
EG-2: COPLINK allows me to accomplish tasks more quickly.
EG-3: Using COPLINK saves me time.

Availability (AV) – source: Taylor & Todd (1995)
AV-1: There are enough computers for everyone to use COPLINK.
AV-2: I have no difficulty finding a computer to use COPLINK when I need it.
AV-3: Availability of computers for accessing COPLINK is not going to be a problem.
AV-4: There are enough computers for me to use COPLINK in the department.

Behavioral Intention (BI) – source: Venkatesh & Davis (1996)
BI-1: When I have access to COPLINK, I would use it as often as needed.
BI-2: To the extent possible, I intend to use COPLINK in my job.
BI-3: Whenever possible, I would use COPLINK for my tasks.
“Atrium” – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age

Chris C. Demchak

Cyberspace Policy Research Group, School of Public Administration and Policy, University of Arizona, Tucson, Arizona 85721
[email protected]
Abstract. Eighty percent of business process reengineering efforts have failed. This piece argues that the missing element is an ability to see the newer technical systems as conceptually integrated into an organization as well as functionally embedded. Similarly, a model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative socio-technical organizational design labeled the “Atrium” model, based on the corporate hypertext model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force.
the right amount of time to apply the correct electronic or other response. In short, IO, like the effective application of all other advanced technologies, depends as much on the organization of the people around the artifacts as on the quality of the artifacts themselves [3][8]. A model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance.1 To achieve that future focus, supportive organizational designs need to be engaged in the transition process.
In this work, I present my conclusions after several years of research taking a knowledge-centric approach in developing an alternative model to the dominant organizational models of modern security forces––in these cases, militaries––seen in several nations, including the US.2 In the research, I focused on how the current and loosely planned future organizational designs could or could not assure that explicit and implicit knowledge in a complex system could be discovered, winnowed, connected, weighted, and applied using advancing technologies when the threats were multi-layered and present in peace as well as war. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative socio-technical organizational design labeled the “Atrium” model (see Figure 1 below), based on the corporate hypertext model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force.
Before introducing the model itself, it is important to note that, like their civilian counterparts in a rapidly globalizing environment, modern military technologies across both machine and human systems need information sharing, not hoarding, both to act quickly and to counter surprises. Designed by engineers, not social scientists, however, the newer systems tend to assume knowledge will come with the automatic and comprehensive provision of data. However, knowledge is not an automatic byproduct of networks and grids unless the surrounding social system deliberately seeks to capture that knowledge. Ultimately, in military or commercial endeavors, it is the organization, not the computer network, that is the knowledge-producing entity. And it costs a great deal to develop everything one needs on one’s own. The more distinct the organization is from a supporting and surrounding knowledge base, the more expensive the internal development of knowledge for that group of people [5]. Hence, it is preferable for an organization to share and to benefit from the sharing of other organizations.
Furthermore, complex systems, including organizations, are also path-dependent on initial conditions. The more the initial organizational design facilitates absorbing and accumulating knowledge from the beginning, including more slack, redundancy, and trial and error, the more likely the design will be robust and successful in the face of surprise. Since surprise is the endemic characteristic of the systems and requirements faced by militaries, especially smaller forces, any modernizing design needs to consider these complex system realities from the outset.
1 The description of this model is drawn heavily from [2].
2 Much of the model discussion was originally presented in an earlier work developing the model for a small state and using the case of Israel [2].
Fig. 1. The Atrium
The uncertainties of the new global circumstances require a different kind of modernization of the military organization – one less tied to legacy forces and more designed to support a new social construction of the role of knowledge as a player in organizational operations. To meet these aims, I propose a military or security adaptation of the commercial “hypertext” organization described by Nonaka and Takeuchi [6:99-133]. This refinement, which I labeled the “Atrium” form of information-based organization, is a design that treats knowledge as a third and equal partner in the military organization’s peacetime and wartime operations. In the original model and in my refinement, the knowledge base is not merely an overlain tool or connecting pipelines. Rather, the knowledge base of the organization is actively nurtured both in the humans and in the digitized, integrated institutional structure. Writing for the commercial world, Nonaka and Takeuchi attempted to reconcile the competing demands and benefits of both matrix and hierarchical organizational forms. Their “hypertext” organization interweaves three structures: a matrix structure in smaller task forces specifically focused on innovative problems at hand and answering to senior managers, a hierarchical structure that supports the general operational systems and also contributes and then reabsorbs the members of task forces, and finally a large knowledge base that is intricately interwoven through the activities of both matrix and hierarchical units.3
3 As Nonaka and Takeuchi [6:106-107] aptly phrased it, “The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive… Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts…”
In both their and my models, the knowledge base is more than a library or a database on a server; it is a structure in and of itself, integrating applications and data. It reaches into the task forces, whose members use it for data mining, while also sustaining the general operations, sharing information broadly. But it is also socially constructed as a key player in the organization, such that task force members are required to download their experiences in a task force into the knowledge base before they are permitted to return to their positions in the hierarchical portion of the organization. Similarly, operations in the general hierarchy are required to interact through the knowledge base systems so that patterns in operations and actions are automatically captured for analysis [6:99-133]. The major contribution here is that the knowledge base is not a separate addition to the organization, irrelevant to the architecture of the human-machine processes, as it is in the emergent US and other western models of modernizing militaries or security forces.4 Rather, it is integral to the success of processes and the survival of the institution. Several Japanese corporations seem to operate along these lines productively, and one is struck by an interesting distinction––implicit knowledge developed by human interactions related to the job is viewed by the corporation not only as a source of value but also as key to long-term survival.5 It is this view of knowledge that distinguishes these corporations and makes them more prepared for surprise in the marketplace.
In adapting this design and social construction to a military or security setting, I have given this concept of a knowledge base a name, the “Atrium.”6 The term captures the sense of being a place to which a member of the organization can go, virtually or otherwise, to contribute and acquire essential knowledge, and that it is also a place of refuge to think out solutions. The mental image is that it is overarching, not beneath the human actors, but something that protects as well as demands inputs. Entering into and interacting with the Atrium is essentially acting with a major player in the institution. Such a conception rationalizes the efforts to ensure implicit knowledge is integrated into the long-term analyses of the organization, such as the time spent in downloads of experiences and information from the task force members before they return to the more hierarchical stem of the organization. The “Atrium” form requires an explicit embrace of what has been called the “new knowledge management”7. In particular, the new knowledge management means using network/web technologies to move from controlling information inventories as human relationship-based “controlled hoards” to web-based “trusted source” struc-
5
6
7
A close reading of JV2010 and related US transformational documents shows a broad assumption that, as fast as the new equipment becomes, the knowledge needed to make that speed, lethality, and deployability successful will automatically be there as long as raw information is moved in real time. It is a rather naïve understanding of knowledge and complex systems but not unexpected if the decision-makers have focused on target acquisition and firing weapons at single points all their professional lives. “The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive…..Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts…”, see [6:106-107]. In a manuscript under construction now “The Atrium–Refining the HyperText Organizational Form,” I more fully explain the mechanisms of integrating an Atrium into an organization. For a more modern use of this term, see [4] and [7].
tures.8 With networks, everything is dual use and sufficient technical familiarity can be found in foreign ministries as well as in basements inhabited by teenage geeks with a sociopathic attitude. Knowledge development will inevitably come through surprises that are encountered all along the spectrum of formal declaration of operations, from peace-building, through peace-making, peace-keeping, posturing, and prevailing in actual hostilities. The design of a modern knowledge-centric military must, in effect, accept 24/7 operations with all the ethical, legal, budgetary, socio-economic, and geostrategic constraints implied.
2 The “Atrium” as Colleague and Institutional Memory

Key to this model is stabilizing the locus of institutional memory and creativity in the human-Atrium processes. In principle, according to their rank, each member of the organization will have the chance to cycle in and out of task forces, core operations, or Atrium maintenance and refinement. As they cycle into a new position, gear up, operate, and then cycle out, each player does a data dump, including frustrations about process, data, and ideas, into the Atrium. Organizational members elsewhere can then apply data mining or other applications to this expanding pool of knowledge elements to guide their future processes. Explicit and implicit institutional knowledge thus becomes instinctively valued and actively retained and maintained for use in ongoing or future operations.
3 The Core – Main Operational Knowledge Creation and Application Hierarchies

With this new social construction of what one does with information in the military or security force (one creates, stores, refines, connects, weights, shares, and nurtures it), the Core then embraces the new knowledge potential of conscripts and reservists by reinforcing the trends in national digital education. That is, service involving computers is not only promoted as a benefit of conscription, but training in computers is pursued irrespective of the actual military or security function. For example, the maintainer will expect to find knowledge about diagnostic workarounds in other maintenance units in a foray into the Atrium, as well as being expected to give back one’s own personal experiences to the system. That maintainer – who could easily be a youngster of 19 years – will have been taught not just how to do that diagnostic task but also how to manipulate digital applications in general. This education in the military or security force will enhance the surprise-reducing potential of operations and also improve the soldiers’ future marketability to the economy and their long-term contribution to Atrium nurturing as reservists. As a side benefit, the growing unwillingness to serve in the military or in security services may be mitigated when all full
8
The evolution of the internet or the web is in essence a social history of information sharing among individuals embedded in organizations. There are a number of versions of the history of the internet. For one discussion, see [1]. See also the Internet Society web site.
time members (and associated part-timers or reservists) receive what is considered a valuable education in networked technology.
Furthermore, the Core also embraces the potential of part-timers or reservists for security forces by assigning tasks that further the knowledge development of the Atrium. In the United States, the role for reservists in future conflicts involving terrorism is under debate. Under this model, Core tasks can be accomplished on weekends without requiring the reservist to show up during working hours in uniform. The implicit knowledge of these experienced individuals is not lost, as they are able to spend reserve years solving puzzles or refining data, keeping their skills at usable levels while holding other employment. Reservists can then still serve physically in uniform in the Core when called up, but that period can be limited and infrequent since the reservist is not expected to do many basic security tasks in the field. Naturally, this approach sustains all the advantages of a close connection between the wider society and its part-time or reservist security forces without the disruption of a civilian job.
As described, the Core will have plenty of tasks associated with the Atrium, both in the initial creation of applications, elements, processes, and uses and in the coordination and integration of these evolutions. Its use of part-time or reservist security forces provides an essential, constant intellectual recharge from the wider community, permitting the Atrium to avoid iterating into a brittle bureaucratic equilibrium. By having the problem solving of the task forces as well as the intense attention of actively serving security force members, the members of the forces serving in the Core will come to understand the Atrium as an intelligent agent rather than a mindless amalgamation of individual databases. In short, the vibrancy of the Atrium in providing knowledge to accommodate surprise is due not to the professionalism of the small permanent Core party but to the newness of perspective and rising familiarity of both the active and part-time participants.
However, this organization will be surrounded by complex systems as well as being a complex system itself. Problems beyond the normal Core operations and Atrium knowledge analysis will emerge constantly. Some of these will be physically dangerous and immediate. Some will be prospective, such as determining why certain neighboring political leaders have allocated budget amounts to shadowy organizations. Some will be long term, such as rechanneling the design goals of key data chunk allocations within the Atrium or retargeting some of its uses in the light of wider global trends. For these kinds of problems, a matrix organization is eminently preferred, and hence we come to the final element, the task forces.
4 The Task Forces – Responses in Knowledge Creation and Security Applications

Security forces, in and out of militaries, tend to fragment into many small existing units with specialized missions. Each of them develops a broad and deep array of implicit knowledge that this model would be able to capture and put to good use. Many of the existing units can be altered to function as task force structures answering to the senior military or security force officers in a knowledge-centric organization. First, to capture the implicit information currently lost or buried, members of all field units
will rotate in from their operations to download implicit knowledge, update their understanding of the Atrium’s holdings and possible insights, and contribute to the Core. Second, some of the more elite units will be retargeted along different modalities of knowledge acquisition and use, applying such data in knowledge mining combined with other information presented in the Atrium. Some units will be left with the more physically challenging missions, such as border incursion controls and basic training, but their members will also be rotated in and out on longer cycles, perhaps a year, to accommodate exceptional physical requirements. Other units will be gradually altered into problem analysis units – moving from simply gathering data on all suspicious activity to meta-analyses of such activities over time and locations, with an eye to proactively disrupting the initiating efforts of the infiltrating threat rather than sending squads after the cell is well established. For this, the members will have to be as digitally creative as they are physically hardy. The deployed or physically demanding units will be smaller and directly answerable to senior members of the headquarters staff.
However, since rotating organization members among the three – task forces, Core, and the Atrium – is a basic tenet, even senior leaders must rotate. For example, senior leaders could spend most of their time leading each of the field divisions or commands, but they must rotate in for Atrium service, as well as heading task forces occasionally. While on rotation to the Atrium, the senior leader must be completely free from leadership duties; thus attention must be paid to a functioning deputy leader culture. Finally, the explicit assumption is that each task force is solving a problem or exploring an opportunity while also developing important nonobvious information that must be entered into the Atrium’s processes. Senior leaders, just like lowly field members, have implicit knowledge to contribute to, and skills to refine in extracting and manipulating data from, the Atrium resources.
Not all of the existing military or security force units will change in their mission; rather, the more likely change is to scale back the size of the units and attach them higher up in the hierarchy. The ones that retain the more physically dangerous missions will alter only in that their members will rotate out of Core positions for a position in the elite force and then back through an Atrium tour before returning to the Core. Fortunately, the value placed on computer skills, and possibly a civilian career to follow military or security force experience, offers a way to socially construct this change for easier acceptance, as well as continuing service on a part-time basis. Placing these units directly beneath the senior leaders also mollifies grievances over a loss of prestige. Personnel rotating in and out of these units are assured not only of interesting current problems for six months to a year but also of greater visibility at senior levels. The units will benefit from the strong advantages of a matrix structure in creativity and are likely to produce more innovative problem solutions than can be produced today.
5 Advantages – Surprise-Oriented, Scalable Knowledge-Enabled Institutions

This design has advantages in using advance knowledge to extend the limited strategic depth of a nation or community under the unknowable unknowns of the emerging information and terrorism age. Deleterious surprises by actively hostile opponents can
be countered by integrating different kinds of forces across early warning and response forces, and by the innovative combination of information accumulations. The existing, widely held model of a modern security force tends towards centralization of control by reducing slack in the organization’s time and/or redundancy in its resources. It has become an act of faith that this centralization explicitly promotes synchronicity of operations, and in due course centralization across networks is also encouraged. But a fixation on central decision-making and synchronized actions can encourage devastating ripple effects in an increasingly tightly coupled organization.
In contrast, the Atrium model is based on an understanding of complexity across large-scale systems – the environment faced by security forces today under active threats. If only trends – not specifics – can be seen in advance, then the best preparation is to have the knowledge base, and the skills in creative combinations, ready and waiting for the elements of the trend to take concrete shape. The model encourages a dampening of rippling rogue outcomes by the rotation of members and the inclusion of skilled part-timers. Its design presumes that surprise during operations is normal in complex systems and that only slack built through knowledge mechanisms can really accommodate, mitigate, or dampen the effects on a large-scale organization. Hence, the Atrium concept encourages independent thinking while permitting widespread coordination and integration across the organization, time, and operations. And this response can be mounted at any scale. Having socialized into unit members some key central themes in operations is as close as the Atrium comes to endorsing expensive centralization such as the Total Information Awareness program currently being pursued by the US Department of Defense.
Furthermore, this proposal does not assume wisdom comes automatically with 100 percent visibility of any conflict arena, or that this kind of visibility of an operation is the goal of modernization. On the contrary, the Atrium organizational model presumes that the 24/7 accumulation of information, much of it implicit and never before digitized, will use data mining techniques and a constant inflow of new pairs of eyes (in rotations through the Atrium) to construct new visions of operations. Innovative operations at any scale are enhanced when the integration of a wide variety of information is more possible. While a nation or a security service under threat still needs physically demanding forces and standoff weapons, other electronic options emerge, such as targeted disruption efforts that may overtly or covertly derail threatening postures by hostile opponents, or even a long-term, slow-roll deception goal that diverts potential hostile actors from other more dangerous choices.
Furthermore, when work is digitized, internal security can increase nonobviously. It is easier and less intrusive to scan across employee actions when work is digitized. Also, when part-timers are rotating in and out of all functions and their implicit knowledge is being accumulated in the Atrium, individual elements of knowledge are potentially spread all over the society. With so many knowing in general the overall structure and uses of the Atrium and the military or security force’s capabilities, the competition is less for secret information than for positive social assessments by chief acquisition officers.
This kind of institutional knowledge helps both in curbing corruption through database transparency and in permitting those secrets that absolutely must be kept to be buried in the data noise.
References
1. Benedikt, Michael. (1991). Cyberspace: First Steps. Boston, MA: The MIT Press.
2. Demchak, Chris C. (2001). “Knowledge Burden Management and a Networked Israeli Defense Force: Partial RMA in ‘Hyper Text Organization’?” Journal of Strategic Studies, 24:2 (June).
3. Drucker, Peter F. (1959). Technology Management and Society. San Francisco, CA: Harper and Row.
4. Gleick, James. (1987). Chaos: Making a New Science. New York: Viking.
5. Landau, Martin. (1973). “On the Concept of a Self-Correcting Organization.” Public Administration Review (November–December 1973).
6. Nonaka, Ikujiro and Takeuchi, Hirotaka. (1997). “A New Organizational Structure (HyperText Organization).” In Prusak, Laurence, ed. Knowledge in Organizations. Boston: Butterworth-Heinemann, 99–133.
7. Wheatley, Margaret J. (1992). Leadership and the New Science. San Francisco: Berrett-Koehler Publishers.
8. Wilson, James Q. (1989). Bureaucracy: What Government Agencies Do and Why They Do It. New York: Basic Books.
Untangling Criminal Networks: A Case Study

Jennifer Xu and Hsinchun Chen

Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, U.S.A.
{jxu, hchen}@eller.arizona.edu
Abstract. Knowledge about criminal networks has important implications for crime investigation and the anti-terrorism campaign. However, lack of advanced, automated techniques has limited law enforcement and intelligence agencies’ ability to combat crime by discovering structural patterns in criminal networks. In this research we used the concept space approach, clustering technology, social network analysis measures and approaches, and multidimensional scaling methods for automatic extraction, analysis, and visualization of criminal networks and their structural patterns. We conducted a case study with crime investigators from the Tucson Police Department. They validated the structural patterns discovered from gang and narcotics criminal enterprises. The results showed that the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure for a network that might be useful in the development of effective disruptive strategies for criminal networks.
ple, removal of central members in a network may effectively upset the operational network and put a criminal enterprise out of action [3, 17, 21]. Subgroups and interaction patterns between groups are helpful for finding a network’s overall structure, which often reveals points of vulnerability [9, 19]. For a centralized structure such as a star or a wheel, the point of vulnerability lies in its central members. A decentralized network such as a chain or clique, however, does not have a single point of vulnerability and thus may be more difficult to disrupt. To analyze structural patterns of criminal networks, investigators must process large volumes of crime data gathered from multiple sources. This is a nontrivial process that consumes much human time and effort. Current practice of criminal network analysis is primarily a manual process because of the lack of advanced, automated techniques. When there is a pressing need to untangle criminal networks, manual approaches may fail to generate valuable knowledge in a timely manner. To help law enforcement and intelligence agencies analyze criminal networks, we propose applying the concept space and social network analysis approaches to extract structural patterns automatically from large volumes of data. We have implemented these techniques in a prototype system, which is able to generate network representations from crime data, detect subgroups in a network, extract between-group interaction patterns, and identify central members. Multi-dimensional scaling has also been employed to visualize criminal networks and structural patterns found in them. The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the system architecture; Section 4 presents the case study in detail; Section 5 concludes the paper and suggests future research directions.
2 Related Work

The process of extracting structural network patterns from crime data usually includes three phases: network creation, structural analysis, and network visualization. We review related work for each phase.

2.1 Network Creation

To create network representations of criminal enterprises, investigators have to wade through floods of database records to search for clues of relationships between offenders. Such a task can be time-consuming and labor-intensive. A technique called link analysis has been used to detect relationships between crime entities and create network representations. Traditional link analysis is based on the Anacapa charting approach [12], in which data have to be examined manually to identify possible relationships. For visualization purposes, an association matrix is then constructed and a link chart based upon it is drawn. An investigator can study the structure of the link chart (a network representation) to discover patterns of interest. Krebs [15], for example, mapped a terrorist network comprising the 19 hijackers in the September 11 attacks on the World Trade Center, using such an approach. However,
ever, the manual link analysis approach will become extremely ineffective and inefficient for large datasets. Some automated approaches to creating representations of criminal networks based on crime data have been proposed. Goldberg and Senator [11] used a heuristic-based approach to forming links and associations between individuals who had shared addresses, bank accounts, or related transactions. The networks created were analyzed to detect money laundering and other illegal financial activities. Dombroski and Carley [8] combined multi-agent technology, a hierarchical Bayesian inference model, and biased network models to create representations of a criminal network based on prior network data and informant perceptions of the network. A different network creation method used in the COPLINK system [13] is based on the concept space approach developed by Chen and Lynch [5]. Such an approach can generate a thesaurus from documents based on co-occurrence weights that measure the frequency with which two words or phrases appear in the same document. Applying this approach to crime incident data results in a network representation in which a link between a pair of entities exists if they ever appear together in the same criminal incident report. The more frequently they appear together, the stronger the association. After a network representation has been created, the next phase is to extract structural patterns from the networks. 2.2 Structural Analysis Social Network Analysis (SNA) provides a set of measures and approaches for structural network analysis. These techniques were originally designed to discover social structures in social networks [23] and are especially appropriate for studying criminal networks [17, 18, 21]. Specifically, SNA is capable of detecting subgroups, identifying central individuals, discovering between-group interaction patterns, and uncovering a network’s organization and structure [23]. Studies involving evidence mapping in fraud and conspiracy cases have recently employed SNA measures to identify central members in criminal networks [3, 20]. Subgroup Detection. With networks represented in a matrix format, the matrix permutation approach and cluster analysis have been employed to detect underlying groupings that are not otherwise apparent in data [23]. Burt [4] proposed to apply hierarchical clustering methods based on a structural equivalence measure [16] to partition a social network into positions in which members have similar structural roles. Centrality. Centrality deals with the roles of network members. Several measures, such as degree, betweenness, and closeness, are related to centrality [10]. The degree of a particular node is its number of direct links; its betweenness is the number of geodesics (shortest paths between any two nodes) passing through it; and its closeness is the sum of all the geodesics between the particular node and every other node in the
network. Although these three measures are all intended to illustrate the importance or centrality of a node, they interpret the roles of network members differently. An individual with a high degree, for instance, may be inferred to have a leadership function, whereas an individual with a high level of betweenness may be seen as a gatekeeper in the network. Baker and Faulkner [3] employed these three measures, especially degree, to find the key individuals in a price-fixing conspiracy network in the electrical equipment industry. Krebs [15] found that, in the network consisting of the 19 hijackers, Mohamed Atta scored the highest on degree. Discovery of Patterns of Interaction. Patterns of interaction between subgroups can be discovered using an SNA approach called blockmodel analysis [2]. Given a partitioned network, blockmodel analysis determines the presence or absence of an association between a pair of subgroups by comparing the density of the links between them against a predefined threshold value. In this way, blockmodeling summarizes individual interaction details into interactions between groups so that the overall structure of the network becomes more apparent. 2.3 Network Visualization SNA includes visualization methods that present networks graphically. The Smallest Space Analysis (SSA) approach, a branch of Multi-Dimensional Scaling (MDS), is used extensively in SNA to produce two-dimensional representations of social networks. In a graphical portrayal of a network produced by SSA, the stronger the association between two nodes or two groups, the closer they appear on the graph; the weaker the association, the farther apart [17]. Several network analysis tools, such as Analyst's Notebook [14], Netmap [11], and Watson [1], can automatically draw a graphical representation of a criminal network. However, they do not provide much structural analysis functionality and continue to rely on investigators' manual examinations to extract structural patterns. Based on our review of related work, we proposed to employ the concept space approach, SNA measures and approaches, and MDS for extracting and visualizing structural patterns of criminal networks. We have developed a prototype system in which the proposed techniques have been implemented. The architecture of the system and its individual components are presented in the next section.
3 System Architecture The prototype system contains three major components: network creation, structural analysis, and network visualization. Figure 1 illustrates the system architecture.
Fig. 1. System architecture
3.1 Network Creation Component We employed the concept space approach to create networks automatically, based on crime data. We assumed that criminals who committed crimes together might be related and that the more often they appeared together the more likely it would be that they were related. We treated each incident summary (database records specifying the date, location, persons involved, and other information about a specific crime) as a document and each person’s name as a phrase. We then calculated co-occurrence weights based on the frequency with which two individuals appeared together in the same crime incident. As a result, the value of a co-occurrence weight not only implied a relationship between two criminals but also indicated the strength of the relationship.
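The paper does not spell out the exact weighting formula, but the co-occurrence idea can be made concrete with a short sketch. The Python snippet below, using made-up incident records, counts how often two names appear in the same incident and normalizes the count into a weight between 0 and 1; the Jaccard-style normalization is an illustrative assumption, not necessarily the formula used in the concept space approach of Chen and Lynch [5].

from collections import defaultdict
from itertools import combinations

# Hypothetical incident data: each incident lists the persons named in its report.
incidents = [
    {"id": 1, "persons": {"A. Smith", "B. Jones"}},
    {"id": 2, "persons": {"A. Smith", "B. Jones", "C. Lee"}},
    {"id": 3, "persons": {"B. Jones", "C. Lee"}},
]

appearances = defaultdict(int)   # how many incidents each person appears in
co_counts = defaultdict(int)     # how many incidents each pair shares

for incident in incidents:
    for person in incident["persons"]:
        appearances[person] += 1
    for a, b in combinations(sorted(incident["persons"]), 2):
        co_counts[(a, b)] += 1

# Co-occurrence weight: shared incidents divided by incidents involving either
# person (a Jaccard-style normalization, assumed here purely for illustration).
weights = {}
for (a, b), shared in co_counts.items():
    weights[(a, b)] = shared / (appearances[a] + appearances[b] - shared)

for pair, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, round(w, 2))

Pairs with higher weights would receive stronger links in the resulting network representation.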
3.2 Structural Analysis Component The structural analysis component includes three functions: network partition for detecting subgroups, centrality measures for identifying central members, and blockmodeling for extracting interaction patterns between subgroups. Network Partition. We employed hierarchical clustering, namely the complete-link algorithm [6], to partition a network into subgroups based on relational strength. Clusters obtained represent subgroups. To employ the algorithm, we first transformed co-occurrence weights generated in the previous phase into distances/dissimilarities. The
distance between two clusters was defined as the distance between the pair of nodes drawn from each cluster that were farthest apart. The algorithm worked by merging the two nearest clusters into one cluster at each step and eventually formed a cluster hierarchy. The resulting cluster hierarchy specified groupings of network members at different granularity levels. At lower levels of the hierarchy, clusters (subgroups) tended to be smaller and group members were more closely related. At higher levels of the hierarchy, subgroups are large and group members might be loosely related. Centrality Measures. We used all three centrality measures to identify central members in a given subgroup. The degree of a node could be obtained by counting the total number of links it had to all the other group members. A node’s score of betweenness and closeness required the computation of shortest paths (geodesics) using Dijkstra’s algorithm [7]. Blockmodeling. At a given level of a cluster hierarchy, we compared between-group link densities with the network’s overall link density to determine the presence or absence of between-group relationships. SNA was the key technique in our prototype system for extraction of criminal network knowledge.
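As a rough illustration of how these three functions fit together, the sketch below reproduces them on a toy weighted network using networkx and scipy, which are assumed here purely for convenience; the distance transform (1 minus weight), the clustering cut-off, and the use of unweighted centrality variants are illustrative choices rather than the prototype's actual parameters (the prototype computes geodesics with Dijkstra's algorithm over the transformed distances).

import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy network: edges carry co-occurrence weights in (0, 1].
G = nx.Graph()
G.add_weighted_edges_from([
    ("p1", "p2", 0.9), ("p1", "p3", 0.6), ("p2", "p3", 0.7),
    ("p3", "p4", 0.2), ("p4", "p5", 0.8), ("p4", "p6", 0.5), ("p5", "p6", 0.6),
])
nodes = list(G.nodes())

# 1. Network partition: complete-link hierarchical clustering on distances
#    derived from co-occurrence weights (distance = 1 - weight; unlinked pairs
#    get the maximum distance of 1.0).
n = len(nodes)
dist = np.ones((n, n)) - np.eye(n)
for u, v, d in G.edges(data=True):
    i, j = nodes.index(u), nodes.index(v)
    dist[i, j] = dist[j, i] = 1.0 - d["weight"]
Z = linkage(squareform(dist), method="complete")
labels = fcluster(Z, t=0.6, criterion="distance")   # cut the hierarchy at 0.6
groups = {c: [nodes[i] for i in range(n) if labels[i] == c] for c in set(labels)}
print("subgroups:", groups)

# 2. Centrality measures: degree, betweenness, and closeness for each member
#    (unweighted variants shown for brevity).
print("degree:", nx.degree_centrality(G))
print("betweenness:", nx.betweenness_centrality(G))
print("closeness:", nx.closeness_centrality(G))

# 3. Blockmodeling: compare the link density between each pair of subgroups
#    with the network's overall link density.
overall_density = nx.density(G)
group_ids = list(groups)
for a in range(len(group_ids)):
    for b in range(a + 1, len(group_ids)):
        ga, gb = groups[group_ids[a]], groups[group_ids[b]]
        links = sum(1 for u in ga for v in gb if G.has_edge(u, v))
        density = links / (len(ga) * len(gb))
        print(ga, "-", gb, ": density", round(density, 2),
              "(overall", round(overall_density, 2), ") ->",
              "link" if density >= overall_density else "no link")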
3.3 Network Visualization Component To map a criminal network onto a two-dimensional display, we employed MDS to generate x-y coordinates for each member in a network. We chose Torgerson’s classical metric MDS algorithm [22] since distances transformed from co-occurrence weights were quantitative data. A graphical user interface was provided to visualize criminal networks. Figure 2 shows the screenshot of our prototype system. In this example, each node was labeled with the name of the criminal it represented. Criminal names were scrubbed for data confidentiality. A straight line connecting two nodes indicated that two corresponding criminals committed crimes together and thus were related. To find subgroups and interaction patterns between groups, a user could adjust the “level of abstraction” slider at the bottom of the panel. A high level of abstraction corresponded with a high distance level in the cluster hierarchy. Group members’ rankings in centrality are listed in a table.
Fig. 2. A prototype system for criminal network analysis and visualization. (a) A 57-member criminal network. Each node is labeled using the name of the criminal it represents; lines represent the relationships between criminals. (b) The reduced structure of the network. Each circle represents one subgroup, labeled by its leader's name. The size of the circle is proportional to the number of criminals in the group. A line represents a relationship between two groups; its thickness represents the strength of the relationship. Centrality rankings of members in the biggest group are listed in a table at the right-hand side. (c) The inner structure of the biggest group (the relationships between group members).
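Torgerson's classical metric MDS [22] amounts to double-centering the squared distance matrix and taking the top two eigenvectors of the result. A compact numpy sketch of that computation is shown below on an assumed small distance matrix; the prototype's actual layout code may of course differ.

import numpy as np

def classical_mds(dist, k=2):
    """Torgerson's classical metric MDS: embed a distance matrix in k dimensions."""
    n = dist.shape[0]
    d2 = dist ** 2
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ d2 @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # indices of the largest k eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * L                 # n x k coordinate matrix

# Example: distances derived from co-occurrence weights (distance = 1 - weight).
dist = np.array([
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
])
print(classical_mds(dist))

The two columns returned would serve as the x-y coordinates used to place each member on the display.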
4 Case Study In order to examine our system's ability to reveal structural patterns from criminal networks, we conducted a case study at the Tucson Police Department (TPD). The study was intended to answer the following research questions: Can structural analysis approaches correctly detect subgroups in criminal networks? Can structural analysis approaches correctly identify central members of criminal networks? Can structural analysis approaches correctly identify interaction patterns between subgroups in criminal networks? Can structural analysis approaches help extract the overall structure of a criminal network? In this study, we focused on two types of networks: gang and narcotics, both of which involved organized crime. For each network, the Gang Unit at the TPD provided a list of names of active criminals. We extracted from the TPD database all crime incidents in which these criminals had been involved and we created two networks.
4.1 Data Preparation The gang network. The list of gang members consisted of 16 offenders who had been under investigation in the first quarter of 2002. These gang members had been involved in 72 crime incidents of various types (e.g., theft, burglary, aggravated assault, drug offense, etc.) since 1985. We used the concept space approach and generated links between criminals who had committed crimes together, resulting in a network of 164 members (Figure 3a). The narcotics network (The “Meth World”). The list for narcotics network consisted of 71 criminal names. A sergeant from the Gang Unit had been studying the activities of these criminals since 1995. Because most of them had committed crimes related to methamphetamines, the Sergeant called this network “Meth World.” These offenders had been involved in 1,206 incidents since 1983. A network of 744 members was generated (Figure 3b).
(a) The 164-member gang network
(b) The 744-member narcotics network
Fig. 3. The gang and narcotics networks
These two networks were analyzed using our prototype system. Several crime investigators including the sergeant and one detective from the Gang Unit and two detectives from the Information Section validated our results. 4.2 Result Validation The study was divided into two sessions. During each session, the crime investigators examined one network and evaluated the structural patterns discovered from it. Both sessions were tape-recorded and the results were summarized as follows.
Detection of Subgroups. Since our system could partition a network into subgroups at different levels of granularity, we selected the partition that the crime investigators considered to be closest to their knowledge of the network organizations. The result showed that our system could detect subgroups from a network correctly: Subgroups could be detected correctly using cluster analysis. Two major subgroups together with several small subgroups were found in the 164-member gang network based on the clustering results (Figure 4a). The bigger subgroup (solid circle) consisted of 99 members and the smaller subgroup (dashed circle) consisted of 24 members. In the narcotics network, no obvious subgroups except for four cliques originally could be seen because of the large network size (Figure 3b). After clustering, however, two subgroups became very obvious with the bigger one (solid circle) consisting of 397 members and the smaller one (dashed circle) consisting of 331 members (Figure 4b). Moreover, the crime investigators verified that partitions within each of the subgroups were also correct.
(a) Subgroups in the gang network
(b) Subgroups in the narcotics network
Fig. 4. Subgroups detected from the networks
Subgroups detected had different characteristics. It turned out that the subgroups found were consistent with their members' characteristics, specializations, or responsibilities in the networks. In the gang network (Figure 4a), the subgroup represented by a solid circle was identified as a set of white gang members who often were involved in murders, shootings, and aggravated assaults. "These are people who always create a lot of trouble," the sergeant said. The subgroup represented by a dashed circle, on the other hand, consisted of many white gang members who specialized in the sale of crack cocaine. The subgroup represented by a small dotted circle was a set of black gang members who were quite separate from the whole network. The two subgroups
(solid and dashed) in Figure 4b, similarly, corresponded with two criminal enterprises led by different leaders. Moreover, each subgroup could be further broken down into smaller subgroups that might be responsible for different tasks. For example, Figure 5a presents the subgroups within one of the criminal enterprises in the narcotics network. The group in the solid circle was responsible for stealing, counterfeiting, and cashing checks, and for providing money to other groups to carry out drug transactions. The group in the dashed circle, on the other hand, consisted of many drug dealers.
(a) Subgroups with different responsibilities
(b) Relationships between group members
Fig. 5. Subgroup characteristics and relationships
Incident-based relationships reflected other types of associations between group members. Two group members might have been related because they came from the same family, went to the same school, spent time together in prison, etc. Figure 5b, for example, presents connections among 24 members of the crack cocaine group in the gang network. Member 87 was member 173’s girlfriend (connected by a solid line) who often brought female dancers to purchase crack cocaine. In the narcotics network in Figure 4b, members of the dashed circle were former schoolmates. As the sergeant commented, “They knew each other in high school and at that time they were juvenile gang members. Then they got involved in methamphetamines.” Long-time relationships between group members showed a high frequency of committing crimes together, and high relational strength was captured by high co-occurrence weight. Identification of Central Members. We interpreted the highest degree score as an indicator for a leader, the highest betweenness score as an indicator for a gatekeeper, and the one with the lowest closeness (the least likely to be a central member) as an outlier. The crime investigators evaluated central members identified from six subgroups at different granularity levels in both gang network and narcotics network. The
results showed that although the system could identify important members in a subgroup, it could not necessarily identify a true leader. A member who scored the highest in degree might not necessarily be a leader. On one hand, offenders with high degree often were those who had had frequent police contacts. Such offenders may play active roles in leading a group. Three out of six leaders were identified as true leaders in their subgroups. For example, in the crack cocaine subgroup shown in Figure 5b, member 173 had the largest number of connections with other group members. This person had a lot of money, was able to buy and sell drugs frequently, and provided his house for drug transactions. As mentioned in the previous section, his girlfriend also helped bring in more people to purchase drugs. Similarly, the member with the highest degree in the murderers group (solid circle in Figure 4a) was also identified as the leader in the group. On the other hand, a high degree could not always be interpreted as an indicator of leadership for two reasons. First, in a criminal enterprise, the leader may hide behind other offenders and keep frequency of activities low by using other people to do tasks. “Especially, when they got out of prison they tended to be smarter and more educated and thus were more careful to avoid police contacts,” the sergeant commented. In Figure 6, for example, member 501 (labeled with a star) was the true leader of one subgroup from the narcotics network. However, he did not score the highest in degree in this group because he actually used other group members (along the dashed path) to sell methamphetamines for him. Second, current police databases did not capture leadership data about criminal enterprises. A crime investigator had no way to tell which group member was the leader unless he/she obtained such information from interrogation or other sources. Three out of six leaders evaluated were not the true leaders of their groups. Therefore, the degree measure should be interpreted carefully. A member who scored highest in betweenness was a gatekeeper. Our crime investigators verified that all of the six gatekeepers were correctly identified from their subgroups. These gatekeepers played important roles in maintaining the flow of money, drugs, or other illicit goods in their networks. Although not identified as a leader based on degree measure, member 501 (labeled with a star) in Figure 6a was correctly identified as a gatekeeper because he controlled and managed the flow of money and drugs in his group. The star in Figure 6b represented a gatekeeper in that group because she was responsible for cashing stolen or counterfeit checks and redistributing money to other group members. The other four gatekeepers evaluated were offenders who often rode bicycles to sell drugs on the street. “Such gatekeepers were quite important to the operation of their criminal enterprises,” a detective from the Gang Unit said. An outlier who scored the lowest in closeness might play an important role in a network. No detailed evaluation was conducted on outliers because of the long time spent on the discussion of leader and gatekeeper roles in both validation sessions. Our crime
(a) A group leader without the highest degree
(b) A gatekeeper
Fig. 6. Central members in subgroups
investigators only mentioned that it was possible that an outlier might be a true leader who stayed away from the rest of his group but actually controlled the whole group. No specific example was given, however. Identification of Interaction Patterns between Subgroups. Our crime investigators evaluated a set of between-group interaction patterns including interactions among three groups (solid, dashed, and dotted) in the gang network (Figure 4a), interactions between two major groups (solid circle and dashed circle) in the narcotics network (Figure 4b), and those between the solid and dashed groups in Figure 5a. The results showed that patterns identified using blockmodel analysis reflected the truth about interactions between criminal groups correctly. Frequency of interaction (represented by thickness of lines) between subgroups was a correct indicator of the strength of between-group relationship. In Figure 4a, for example, the blockmodeling result revealed a strong link between the murderers’ group (solid circle) and the crack cocaine group (dashed circle). When asked whether this interaction pattern was accurate, the sergeant answered: “Sure. These guys often hang together. The leaders of these two groups are best friends.” Moreover, interaction patterns might also represent flows of money and goods between groups. In Figure 5a, money and drugs flowed frequently between the dashed group (for drug sales) and the solid group (for check washing and cashing). Interaction patterns between groups might also represent problems or hatred. Frequent interactions between the two major groups in the narcotics network (Figure 4b) resulted not only from their group members’ switching back and forth but also from
problems between the two groups, whose leaders had been at odds for a long time. Their subordinates often ran into shootings and fights. Interaction patterns identified could help reveal relationships that previously had been overlooked. During the evaluation of the gang network (Figure 4a), the sergeant noticed that there was a line (dotted) connecting the murderers’ group (solid circle) and the black gang group (dotted circle): “I have never seen these black gang members having any connection with those white gang members”. When referring back to the original network in Figure 3a, we found a link (dotted line) between one member from the black group and a member from the murderers’ group. According to the sergeant, identifying such a connection would be very helpful for developing investigative leads. Extraction of Overall Network Structures. According to our crime investigators, gang and narcotics enterprises usually differed in structure: gang enterprises tended to be more centralized and narcotics organizations tended to be more decentralized. In order to assess our system’s abilities to reveal such structural differences, we extracted two datasets from the TPD database: (a) incident summaries of narcotics crimes from January 2000 to May 2002, and (b) incident summaries of gang-related crimes from January 1995 to May 2002. We selected four gang networks and nine narcotics networks from our datasets. Sizes of these networks ranged from 21 to 100. Other networks generated from our datasets were either too small or too large and were not analyzed. We found that the blockmodeling function in our system did reveal distinguishing structural patterns of the two types of criminal enterprises: Two out of four gang networks under study had a star structure similar to that presented in Figure 2. The third network was a chain of stars and the fourth had a star structure with some of its branches being a smaller star or a clique (Figure 7a-b). All nine narcotics networks had a chain structure (Figure 7c-d). Three of these networks were chains of stars. One network had a circle in the middle of the chain. 4.3 Usefulness of System All our crime investigators provided very positive comments on our system. They believed that the system could be very useful for extracting structural network patterns and discovering knowledge about criminal enterprises. In particular, our system could help them in the following ways: Saving investigation time. The sergeant and his assistants had obtained knowledge about the gang and narcotics organizations during several years of work. Using information gathered from a large number of arrests and interviews, he had built the networks incrementally by linking new criminals to known gangs in the network and then studied the organization of these networks. Because there was no structural analysis tool available, he did all this work by hand. With the help of our system, he expected substantial time could be saved in network creation and structural analysis.
(a) A 51-member gang network
(b) The star structure found in the gang network
(c) A 60-member narcotics network
(d) The chain structure in the narcotics network
Fig. 7. Overall structures of criminal networks
Saving training time for new investigators. New investigators who did not have sufficient knowledge of criminal organizations could use the system to grasp the essence of a network and its crime history quickly. They would not have to spend a significant amount of time studying hundreds of incident reports. Suggesting investigative leads that might otherwise be overlooked. For example, the link between the black gang group and the white murderers' group in the gang network had previously been overlooked and could have suggested useful investigative leads. Helping prove guilt of criminals in court. The relationships discovered between individual criminals and criminal groups would be helpful for proving guilt when presented in court for prosecution.
In summary, the structural analysis approaches we proposed showed promise for extracting important patterns in criminal networks. Specifically, subgroups, central members, and interaction patterns among subgroups usually could be identified correctly by the use of cluster analysis, centrality measures, and blockmodeling functionality.
5 Conclusions and Future Work Criminal network knowledge has important implications for crime investigation and national security. In this paper we have proposed a set of approaches that helped extract structural network patterns automatically from large volumes of data. These techniques included the concept space approach for network creation, hierarchical clustering methods for network partition, and social network analysis for structural analysis. MDS was used to visualize a criminal network and its structural patterns. We conducted a case study with crime investigators from TPD to validate the structural patterns of gang and narcotics criminal enterprises. The results were quite encouraging—the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure for a network that might help in the development of effective disruptive strategies for criminal networks. We plan to continue our criminal network analysis research in the following directions: Allowing investigators to edit a network by adding, deleting, and modifying nodes and links. Networks created using our system were based entirely on incident data. Other important information collected from multiple sources about network members, relationships between members, and member roles would help provide a more complete picture of a criminal enterprise. Especially, knowledge about group leaders that could not be obtained using incident data from typical police databases should be added to a network representation to avoid misleading interpretation of the degree measure. Including other entity types than person. Criminal networks in our current studies were limited to only person type. Criminals’ connections with other types of entities such as location, weapon, and property could also be useful. In the “Meth World”, for example, drug offenders often used a specific hotel to carry out transactions. Examining frequencies of hotel addresses associated with a set of narcotics crimes could help in understanding the operation of a narcotics organization and predicting future crimes. Studying temporal and cross-regional patterns of criminal networks. Over time criminal networks could change in size, organization, structures, member roles and many other characteristics. The “Meth World” in Tucson had expanded from a network consisting of no more than 150 members in 1995 to the one with more than 700 members in 2002. Members and their roles in the network had also changed a lot in the past eight years: some old members left the network because of arrest or death; new members had been attracted into the network in search of profit; more powerful
new leaders might have replaced old leaders, etc. It would be interesting to study how a criminal network evolves over time. Should a certain temporal pattern be discovered, it would be helpful in predicting the trend and operation of a criminal enterprise. On the other hand, a criminal enterprise can expand across several regions or nations. The "Meth World" was initially only in Tucson and was later connected with criminals from Phoenix, California, and Mexico. Cross-regional analysis could be used to study criminal enterprises on a large scale and could have significant value for combating terrorism. At the same time, we will continue to develop more techniques to further advance the research on criminal networks.
Acknowledgement. This project has primarily been funded by the National Science Foundation (NSF), Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July, 2000-June, 2003 and the NSF Knowledge Discovery and Dissemination (KDD) Initiative. Special thanks go to Dr. Ronald Breiger from the Department of Sociology at the University of Arizona for his kind help with the initial design of the research framework. We would like also to thank the following people for their support and assistance during the entire project development and evaluation processes: Dr. Daniel Zeng, Michael Chau, and other members at the University of Arizona Artificial Intelligence Lab. We also appreciate important analytical comments and suggestions from personnel from the Tucson Police Department: Lieutenant Jennifer Schroeder, Sergeant Mark Nizbet of the Gang Unit, Detective Tim Petersen, and others.
References
1. Anderson, T., Arbetter, L., Benawides, A., Longmore-Etheridge, A.: Security works. Security Management, Vol. 38, No. 17. (1994) 17–20.
2. Arabie, P., Boorman, S. A., Levitt, P. R.: Constructing blockmodels: How and why. Journal of Mathematical Psychology, Vol. 17. (1978) 21–63.
3. Baker, W. E., Faulkner, R. R.: The social organization of conspiracy: illegal networks in the heavy electrical equipment industry. American Sociological Review, Vol. 58, No. 12. (1993) 837–860.
4. Burt, R. S.: Positions in networks. Social Forces, Vol. 55, No. 1. (1976) 93–122.
5. Chen, H., Lynch, K. J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5. (1992) 885–902.
6. Defays, D.: An efficient algorithm for a complete link method. Computer Journal, Vol. 20, No. 4. (1977) 364–366.
7. Dijkstra, E.: A note on two problems in connection with graphs. Numerische Mathematik, Vol. 1. (1959) 269–271.
8. Dombroski, M. J., Carley, K. M.: NETEST: Estimating a terrorist network's structure. Computational & Mathematical Organization Theory, Vol. 8. (2002) 235–241.
9. Evan, W. M.: An organization-set model of interorganizational relations. In: Tuite, M., Chisholm, R., Radnor, M. (eds.): Interorganizational Decision-making. Aldine, Chicago (1972) 181–200.
10. Freeman, L.: Centrality in social networks: Conceptual clarification. Social Networks, Vol. 1. (1979) 215–239.
11. Goldberg, H. G., Senator, T. E.: Restructuring databases for knowledge discovery by consolidation and link formation. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
12. Harper, W. R., Harris, D. H.: The application of link analysis to police intelligence. Human Factors, Vol. 17, No. 2. (1975) 157–164.
13. Hauck, R. V., Atabakhsh, H., Ongvasith, P., Gupta, H., Chen, H.: Using Coplink to analyze criminal-justice data. IEEE Computer, Vol. 35, No. 3. (2002) 30–37.
14. Klerks, P.: The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands. Connections, Vol. 24, No. 3. (2001) 53–65.
15. Krebs, V. E.: Mapping networks of terrorist cells. Connections, Vol. 24, No. 3. (2001) 43–52.
16. Lorrain, F. P., White, H. C.: Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, Vol. 1. (1971) 49–80.
17. McAndrew, D.: The structural analysis of criminal networks. In: Canter, D., Alison, L. (eds.): The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III. Aldershot, Dartmouth (1999) 53–94.
18. McIllwain, J. S.: Organized crime: A social network approach. Crime, Law & Social Change, Vol. 32. (1999) 301–323.
19. Ronfeldt, D., Arquilla, J.: What next for networks and netwars? In: Arquilla, J., Ronfeldt, D. (eds.): Networks and Netwars: The Future of Terror, Crime, and Militancy. Rand Press (2001).
20. Saether, M., Canter, D. V.: A structural analysis of fraud and armed robbery networks in Norway. In: Proceedings of the 6th International Investigative Psychology Conference, Liverpool (2001).
21. Sparrow, M. K.: The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks, Vol. 13. (1991) 251–274.
22. Torgerson, W. S.: Multidimensional scaling: Theory and method. Psychometrika, Vol. 17. (1952) 401–419.
23. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994).
Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework

T.S. Raghu 1, R. Ramesh 2, and Andrew B. Whinston 3

1 W. P. Carey School of Business, Arizona State University, Tempe, AZ 85287
[email protected]
2 Department of Management Science & Systems, School of Management, State University of New York at Buffalo, Buffalo, NY 14260
[email protected]
3 Department of Management Science & Information Systems, University of Texas at Austin, Austin, TX 78712
[email protected]
Abstract. A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. This problem is further exacerbated by the multitude of agencies involved in the decision-making process. Thus the decision-making processes faced by the intelligence agencies are characterized by group deliberations that are highly ill-structured and yield limited analytical tractability. In this context, a collaborative approach to providing cognitive support to decision makers using a connectionist modeling approach is proposed. The connectionist modeling of such decision scenarios offers several unique and significant advantages in developing systems to support collaborative discussions. Several inference rules for augmenting the argument network and capturing implicit notions in arguments are proposed. We further explore the effects of incorporating notions of information-source reliability within arguments.
1 Introduction A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. The Office of Management and Budget (OMB) lists about 100 different federal government categories that are funded to specifically carry out anti-terrorism tasks (Office of Management and Budget, Annual Report to Congress on Combating Terrorism, available at www.whitehouse.gov/omb/legislative/nsd_annual_report2001.pdf, Oct. 2001, pages 89–100). This obviously excludes state and local government agencies that are often involved in anti-terrorism operations. Given the diversity of agencies and the diversity of information sources it is quite clear that decision-making tasks related to homeland security are highly decentralized. Effective sharing, dissemination and assimilation of information is key to successful homeland security
strategy. In this paper, a collaborative decision-making framework is proposed as a key enabler of a distributed information and decision-making backbone for homeland security. When acting upon and integrating intelligence information from several sources, decision makers have to consider and debate various possible decision and security alternatives. In general, most such decision problems are highly ill-structured and yield limited mathematical tractability. Consequently, such decision issues have to be resolved through discussions, where argumentative logic and persuasive presentation are critical. Conventional decision modeling tools may not be able to solve the decision issues as a whole, although they may be used to generate argumentative logic while discussing some of them. In a collaborative decision making process, the group members assume positions, which could be claims or endorsements or oppositions to other claims. These positions could be assumed with or without supporting arguments or evidential data gathered from various intelligence sources (summarized at various levels). The sequence of challenges and responses typically follows an evolutionary path until the decision issues are resolved. In this process, the argument logic and the supporting/contradicting evidential data could grow significantly in both size and complexity, causing a substantial cognitive load on the decision makers. The primary objective of this research is therefore to develop pragmatic and efficient support tools to ease the cognitive burden, focus the group on critical security related issues and guide creative positional and argument strategy development throughout the discussion. The Collaborative Decision-Making (CDM) framework in Figure 1 presents a broad architectural view of a CDM system for homeland security. The CDM system comprises of four broad components: Knowledge repositories, Group facilitation and coordination, Discussion strategy support and Dialectic decision support. Each of these perspectives share some requirements on basic systems components as backbone services, we have identified many of these components in our framework. A brief description of these perspectives is given below. Knowledge Repositories. Given the diversity of federal, state and local agencies involved in intelligence gathering and decision-making processes, a unifying, semantically developed structure to represent intelligence knowledge and information is the first key requirement. The volumes of data, diversity of culture, language and vocabularies exacerbate the complexity of knowledge storage and retrieval. To facilitate communication among geographically, culturally, and/or technically diverse populations of people and systems it is imperative to develop unified knowledge and data repositories. In this context, it is important to build domain ontology and taxonomies that will play a key role in shaping collaborative decision-support systems for homeland security. Group Facilitation and Coordination. Providing system support for enabling distributed teams to coordinate has been studied extensively in the literature. Under this perspective, the recent trends in the areas of group support systems, collaborative filtering and Computer Supported Cooperative Work (CSCW) are the key technological components of a CDM system.
Fig. 1. A Collaborative Decision Support System Framework for Homeland Security
Discussion Strategy Support and Dialectic Support. These two aspects of CDM are perhaps the least understood. Considerable research in the areas of argumentation analysis, natural language processing, and structured knowledge interchange has taken place over the past few years. However, application of these fundamental areas in collaborative decision-making has been scarce if not non-existent. A substantial portion of the paper will delve into how semi-structured information from several sources can be meaningfully analyzed. Arguments and positions enunciated by decision-makers are enhanced through simple inference procedures and argument coherence and dialectical assessments are carried out through connectionist procedures.
The organization of the paper is as follows. Section 2 discusses the foundations of this research. Section 3 presents the connection network architecture, and Section 4 summarizes the model elements and presents an integrated global view of dialectical support through connectionism. Section 5 presents our concluding remarks.
2 Research Foundations Intelligence communities involved in homeland security tasks represent a very complex global virtual organization. The underlying context of this domain is the geographical distribution of the strategic, tactical and operational communities and their activities over the globe. The key to achieving success and breakthroughs in homeland security lies in effective team communication, creative conflict management, sustained coordination of team efforts and continuity in collaboration, all ensured within a structured collaborative decision environment. Although the road to achieving the full potential of such teamwork is filled with challenges, both organizational and technical, advanced information technology can be used in novel ways to facilitate effective collaboration that have not even been conceived till recently. The current research is envisioned as an important milestone in this direction. We identify Information filtering as the first key challenge that would need to be overcome for effective decision-making. The objective here should be to filter the vast information base so that relevant and important intelligence information are accessible quickly to key decision makers. Most of the current filtering systems provide minimal means to classify documents and data. A common criticism of these systems is their extreme focus on information storage, and failure to capture the underlying meta-information. As a consequence, the concept of knowledge ontology has emerged, with a view to create domain level context that enable users to attach rich domain-specific semantic information and additional annotations to intelligence information and documents and employ the meta-information for information retrieval. Once information storage is augmented with knowledge ontology, it becomes easier to provide structured mechanisms for communication wherein decision makers are enabled to communicate over distributed systems. Structured communication enables one to capture the knowledge of intelligence community in easily accessible discussion archives. The underlying structure in the discussion archives would enable the provision of additional collaborative decision support to intelligence personnel. Thus, we draw upon the literature from knowledge ontology and theory of argumentation as the theoretical bases for this research. Formal ontology characterizes knowledge providing a framework binding contextual elements with the relationships that link them within the ontology, as well as the relationships with other units of knowledge [5]. The knowledge ontology consists of a conceptual model, a thesaurus, and a set of expanded attributes and axioms. Its concern is for the appropriate representation of content, which may later be augmented with a mechanistic formalism, such as UML (Unified Modeling Language), RDF (Resource Description Framework), BNF (Backus Naur Form), or formal logic [19]. The main challenge that agencies involved in homeland security face is the volume and number of different information sources that would potentially feed useful and useable information to the CDM system. For instance, the key targets that need protection include large buildings, sports arenas, nuclear facilities, airports, trains and sub-
ways, and national symbols in over 200 cities [1]. Clearly operational and intelligence information pertaining to these key targets will be varied in format, content and context. It is therefore imperative to impose uniform semantic structures where possible and define contextual meta-data on other sources of information to enable dissemination of information across federal, state and local agencies. The main contribution of this research is to demonstrate that further decision support functionalities can be embedded in a CDM system that leverage the metainformation framework of domain knowledge ontology. This would help decision makers better utilize the volumes of information collected through various sources. The basis for collaborative decision support in our system comes from argumentation theory. The logic of argumentation can be studied in terms of its two, rather classical, elements: structure and content. The two components have a symbiotic relationship in the sense that the informational content of an argument needs a logical structure for its coherence and significance. Connectionist modeling provides a way to capture both the elements in a single framework[3,4]. Several works deal primarily with representation formalisms and heuristics for argument analysis, interpretation and outcome prediction [7,9,12,13,15,18]. Given the diffuse nature of intelligence information and the uncertainties associated with the information sources, it would be difficult for any system to provide discrete decisions on security issues. Our approach is to move towards a system of argument analysis in which one is not necessarily constrained to resolving argumentation to discrete categories [16,17]. Using binary categories as a basis for rejecting or accepting arguments prevents one from assessing the relative strengths of the arguments. While connectionist models do not have the strong theoretical underpinnings of logic based defeasible graphs[6,8], using Connectionist models for this purpose has many advantages over methods that utilize simple binary categories of acceptance and rejection [11]. Connectionist modeling achieves better sensitivity in argument assessment by indicating the degree of acceptance or rejection of arguments[14,16]. In addition, one can assign different weights on the arcs connecting the different units in the model. This enables one to capture not only the relations between units but also the strength of the relation. The basic computational details of the connectionist architecture are described in [14]. Briefly, arguments in a discussion are structured into basic, atomic-level information units along with their logical and other human-intended relationships. The basic informational units are represented as the units (which is a term used to represent network nodes in the connectionist literature) and their relationships as the arcs in a network formalism for argument logic. The dialectical power of an argument is an indicator of the strength or validity of an argument, and is measured by the activation level of the unit representing the final thesis of the argument at asymptotic convergence. For example, the final thesis of the argument can be that there is an imminent threat to a key national monument in the near future. An argument derives its dialectical power by the logical coherence inherent in its structure and by the support it derives from its evidence. 
The evidence could be either observed facts, intelligence information, and previous incidents or derived conclusions from other claims and arguments. The structure and content of the supporting as well as opposing logic behind an argument together determine its dialectical power. The dialectical power of various positions in a collaborative discussion is a very useful evaluative feedback to the decision makers. This measure identifies the relative strengths and weaknesses of the positions, and points to whether a discussion is
moving towards a resolution or not. Consequently, it can be used to focus a group on critical security flaws, reexamine security measures if necessary and develop strategies to address future threats. Further, the connectionist paradigm can also be used to derive assessments on subsets of a large argument network selectively, or on higher-level meta-networks derived by aggregating argument sets from a basic network into meta-units and meta-arcs. Thus the proposed model can provide selectively local views of a comprehensive discussion as well as condensed global perspectives on an entire discussion. The dialectical support functionality can provide comprehensive and dynamic monitoring/guidance systems for collaborative discussions on the Intranets.
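The computational details of the connectionist procedure are given in [14] and are not reproduced here. Purely to illustrate the general idea, the sketch below runs a simple interactive-activation style update over a tiny argument network (positive arcs for support, negative arcs for opposition) until the activations stop changing, and reads off the converged activation of the final thesis as a rough dialectical-power score; the update rule, weights, and clamping of evidence units are assumptions made for this example, not the model of [14].

import numpy as np

# Units 0-3: 0 = final thesis, 1 = supporting claim, 2 = opposing claim,
# 3 = evidence (an observed fact).  W[i, j] is the weight of the arc j -> i:
# positive for support, negative for opposition.
W = np.array([
    [0.0,  0.8, -0.6, 0.0],   # thesis: supported by unit 1, opposed by unit 2
    [0.0,  0.0,  0.0, 0.9],   # claim 1 is supported by the evidence unit
    [0.0,  0.0,  0.0, 0.0],
    [0.0,  0.0,  0.0, 0.0],
])
a = np.zeros(4)
clamped = {3: 1.0}            # evidence units are clamped to their observed value

for step in range(200):
    net = W @ a               # net input to every unit
    new_a = np.tanh(net)      # squash into (-1, 1): degree of acceptance/rejection
    for unit, value in clamped.items():
        new_a[unit] = value
    if np.max(np.abs(new_a - a)) < 1e-6:   # asymptotic convergence
        a = new_a
        break
    a = new_a

print("activation levels:", np.round(a, 3))
print("dialectical power of the final thesis:", round(float(a[0]), 3))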
3 Argument Structure and Connectionism 3.1 Argument Structure The basic formalism for our connectionist approach is available in [14]. We briefly describe the argumentation formalism here; for a detailed discussion please refer to [14]. The discussion of inference rules and the incorporation of information source reliability are additional contributions in this paper. Let Γ denote the group of individuals in a collaborative discussion. Let Δ denote the argument structure representing the various positions, facts, and their interrelationships generated in the discussion. Clearly, Δ is a temporal entity, evolving and changing over time as the discussion proceeds. The structure Δ is basically a collection of assertions made by the individuals in the group. This is indicated as follows: Δ = {A | A is an Assertion}. An assertion A is of two types: positions and inferences. A statement of position is a claim, and is assumed to be a well-formed sentence. A statement of inference is a structural relationship among a set of positions and facts. We formalize the structure of these assertion types as follows. Let Λ denote a language from which the structure Δ is constructed. The language Λ is a triple ⟨Σ, Ρ, Θ⟩, where Σ constitutes the sentences, Ρ is a set of assertions built using sentences, and Θ is a set of assertion qualifications. Σ provides the basis for the construction of positions and statements of fact and is composed of defeasible sentences (Σd) and factual sentences (ΣF). A factual statement is any evidential data that is commonly accepted by the group, while the positions are the subject of discussion. Ρ provides the basis for the construction of positional and inferential assertions. This enables the construction of positional assertions from sentences obtained from Σ as well as inferential structures from other assertions. Θ provides the basis for the qualification of an argument on whether it is strict or defeasible. While a defeasible argument is subject to debate and possibly defeat, a strict argument is a logical inference that will not be questioned by anyone in the group. Ρ provides two constructs, <support> and <opposition>, to build inferential structures among positions and facts. Θ provides two constructs, <strict> and <defeasible>, to qualify assertions. As a result, a combination of these constructs yields the following qualified inferences: <strict support>, <defeasible support>, <strict opposition> and <defeasible opposition>.
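As a companion to the formalism, the fragment below sketches one possible in-memory representation of the assertion types and qualified inferences just described; the class and field names are invented for illustration and are not part of the authors' system, and the symbols in the comments refer to the notation above.

from dataclasses import dataclass
from enum import Enum
from typing import List, Union

class Qualification(Enum):      # the qualifications in Θ
    STRICT = "strict"
    DEFEASIBLE = "defeasible"

class Direction(Enum):          # the inferential constructs in Ρ
    SUPPORT = "support"
    OPPOSITION = "opposition"

@dataclass
class Sentence:                 # an element of Σ
    text: str
    factual: bool = False       # True for ΣF, False for Σd

@dataclass
class Position:                 # a positional assertion (a claim)
    claim: Sentence
    author: str

@dataclass
class Inference:                # an inferential assertion among assertions
    premises: List["Assertion"]
    conclusion: "Assertion"
    direction: Direction
    qualification: Qualification

Assertion = Union[Position, Inference]

# Example: a defeasible support inference from an intelligence report (a fact)
# to an analyst's claim.
report = Position(Sentence("Source reports unusual activity near the facility.",
                           factual=True), author="field office")
claim = Position(Sentence("The facility is a likely target."), author="analyst")
link = Inference([report], claim, Direction.SUPPORT, Qualification.DEFEASIBLE)
print(link.qualification.value, link.direction.value)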
Active Database Systems for Monitoring and Surveillance

A. Badia

Abstract. In many intelligence and security tasks it is necessary to monitor data in a database in order to detect certain events or changes. Currently, database systems offer triggers to provide active capabilities. Most triggers, however, are based on the Event-Condition-Action paradigm, which can express only very primitive events. In this paper we propose an extension of traditional triggers in which the Event is a complex situation expressed by a Select-Project-Join-GroupBy SQL query, and the trigger can be programmed to look for changes in the situation defined. Moreover, the trigger can be directed to check for changes on a periodic basis. After proposing a language to define changes, we sketch an implementation, based on the idea of incremental view maintenance, to support our extended triggers efficiently.
1 Introduction
In the past, databases were passive, low-level repositories of data on top of which smarter, domain-focused applications were built. Lately, databases have taken a more active role, offering more advanced services and higher functionality to other applications. In this framework, the database assumes responsibility for execution of some tasks previously left to the application, which offers several advantages: the possibility of better performance (since the database has direct access to the data and knows how the data is stored and distributed), better data quality (since the database is already in charge of basic data consistency) and better overall control. However, this trend has resulted in the database taking on more ambitious roles, and having to provide more advanced functionality than in the past. One of the areas where this trend is clear is the area of active databases. In the past, the database could monitor data and respond to certain changes via triggers (also called rules; in this paper, we use the terms rule and trigger as equivalent). However, commercial systems offer very limited capabilities in this sense. In addition to problems of performance (triggers add quite a bit of overhead) and control (because of the problems of non-terminating, non-confluent trigger sets), trigger systems are very low-level: while the events that may activate a trigger are basic database actions (insertions, deletions and updates), users are interested in complex conditions that
may depend on several database objects and their interactions. It is difficult to express these high-level, application-dependent events in triggers. In this paper, we describe an ongoing project whose goal is to add advanced monitoring and control functionality to database systems through the design and development of extended rule systems. In a nutshell, we develop triggers where more complex events can be stated, thus letting system users specify, in a high-level language, the patterns they need to monitor. Since performance is still an issue, we also develop efficient algorithms based on the idea of incremental recomputation already used in the evaluation of materialized views in data warehouses ([9]). As a result of the added functionality, a database system will be able to monitor the appearance of complex patterns and to detect changes in said patterns. Other research in active databases has not dealt with this issue. Our approach is focused on concepts that may have a practical impact; in particular, we aim at expressing more complex events, making it easier for database users to specify the conditions they are interested in monitoring, but we also propose an efficient implementation, something which is absent from most research in the area.
2 Background and Related Research
In most database systems (certainly in all commercial systems), active capabilities are incorporated through the ability to define triggers. A trigger has the form Event-Condition-Action (ECA). The typical events considered by active rules are primitives for database state changes, like insertions, deletions and updates from/to database tables. The condition is either a database predicate or a query (the query is implicitly considered true if the query returns a non-empty answer, and false otherwise). The action may include transactional commands, rollback or rule manipulation commands, or sometimes may activate externally defined procedures, including arbitrary data manipulation programs. Rules are fired when a particular event occurs; the condition is then evaluated, and if found true then the action is executed. This simple schema is found lacking for several reasons ([4, 26]). Mainly, the events used in triggers are considered too low-level to be useful for many applications; a great deal of research in active databases has focused on defining more complex events ([16, 17, 15, 12, 10, 14]). In basically all the previous research, complex events are obtained by combining primitive events in some event language, which usually includes conjunction, disjunction, negation and sequencing of events ([27]). Some approaches include time primitives ([23, 21, 18]), sometimes based on some temporal logic ([22, 6]). Although none of these projects addresses the issue we are dealing with here (active monitoring of complex conditions) we note that [24] also proposes using incremental recomputation to compute complex events (described as queries), as we do; and [2] proposes incremental computation of temporal queries. However,
these works have no concept of active monitoring. Finally, [20] also proposes a system for monitoring. It is also worth noting that such research, while containing many worthwhile ideas, has seen little practical use, possibly due to two concerns. First, even though some of it has been implemented in systems ([14, 21, 25, 17]), efficiency is not addressed in most approaches ([12, 24] are some exceptions); second, sophisticated logic-based languages, as proposed in the research literature ([8, 6, 23, 27]), are highly expressive, but probably outside the comfort zone of most programmers, and certainly of most users.

We take an approach different from previous research, based on the observation that analysts are usually interested in much higher-level events, which are application and goal oriented: in particular, they screen for conditions which deviate from normal or standard behavior, or for complex conditions which may involve several objects and their relationships. As a simplified example, assume a database with two relations, PEOPLE(name, country) and CALLS(called, caller, date), where we keep a list of suspicious people and their country of residence, as well as a list of telephone calls among them as intercepted by signal intelligence. Both called and caller are foreign keys referencing name. At some point, an analyst is following a suspected terrorist (let us call him 'X') and wants to know from which country he receives the most calls. The information can be easily obtained from the database (see below), but once it is obtained the analyst would like to follow up on this query by monitoring changes: in particular, the analyst may be interested in being alerted when the country from which 'X' receives the highest number of calls changes. Since sending an alert is an action that must be taken only under certain circumstances, a trigger is the obvious way to implement this functionality. However, the event of interest to the analyst (the moment when the country from which 'X' receives the most calls changes from the current one) cannot be expressed with trigger events, which are limited to checking for insertions, deletions and updates in relations. Note that insertions into CALLS are the only way in which the current top-calling country could change. Thus, one could simulate the desired trigger by using insertions into CALLS as events, and then computing the desired information. A simple SQL query can provide a list of the countries from which 'X' is called, ordered by the number of calls, so that the top-calling country is in the first row of the answer:

SELECT country, count(*) AS numcalls
FROM PEOPLE, CALLS
WHERE caller = name AND called = 'X'
GROUP BY country
ORDER BY numcalls DESC

There are, however, two problems with this approach: it is both conceptually hard and computationally inefficient. It is hard because the above still does not give us the answer: one should keep the name of the top-calling country in some table or variable and compare it with the name in the first row of the above query every time it is recomputed. Thus, quite a bit of programming is needed
to implement a relatively simple request. It is inefficient because the trigger is still fired for every table insertion, and therefore its complex condition (the query above) must be evaluated every time. A possible approach would be to use the above SQL query to define a view or table T, and declare the trigger over T. In some systems, views cannot have triggers and hence T needs to be a table. This is clearly undesirable, since T is, conceptually, a view (i.e., it needs to be updated whenever the tables it is based upon are updated). Even if the system allows triggers on views, there are several things that the analyst may be interested in, only some of which are expressible with regular triggers:

– Continuous monitoring, immediate reaction: this is what a trigger does. Every single change in T (insertion, deletion, update) fires the trigger; as soon as a change is detected, an action takes place. This gives us real-time monitoring and is certainly useful in certain situations. Note that some programming would still be necessary: because we are looking for changes to a situation, we need to store the current situation (which country is the top producer of calls) and compare it after every event with the new result.
– Continuous monitoring, delayed action: recheck the situation after every single change in T, but if the condition is found to be true, take action only at certain specified points in time. Delayed action is adequate for periodical reporting. This could be simulated with a trigger (storing changes in a temporary relation, for instance) at the cost of more programming.
– Periodical monitoring, immediate action: recheck the situation at certain specified periods (for instance, every month), and execute an action whenever a check detects a change. Note that this does not give us real time, since by the time the change is detected, the change itself may have taken place some time ago. It is adequate, though, when we need regular and constant monitoring of a situation but do not need to be immediately aware of every single change. Again, this could be simulated in some trigger systems, depending on what is allowed in the condition part, at the cost of quite a bit of programming.
– Periodical monitoring, delayed action: recheck the situation periodically as in the previous case and, if changes are detected, execute the action at certain specified periods. This could also be simulated in some trigger systems, depending on what exactly is allowed in the condition and action parts.

Note that all cases can be simulated with some trigger systems (a sketch of such a simulation appears below). Most systems allow arbitrary programs in the condition and action parts of a trigger; therefore, this is equivalent to writing a little program for each condition we want to monitor. As stated above, this is clearly inefficient because of both the human effort (programming) and the machine effort (trigger execution) involved. Clearly, a more flexible approach is needed.
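As an indication of the hand coding involved, the following is a minimal sketch of the first option (continuous monitoring, immediate reaction), written in PostgreSQL-flavored SQL as one concrete possibility; the tables CURRENT_TOP and ALERTS, the trigger and function names are hypothetical, and the trigger syntax will differ from system to system.

-- State table holding the country currently believed to be the top source
-- of calls to 'X'; it must hold exactly one row, seeded here with NULL.
CREATE TABLE CURRENT_TOP (country TEXT);
INSERT INTO CURRENT_TOP VALUES (NULL);

-- Hypothetical destination for alerts.
CREATE TABLE ALERTS (message TEXT, created TIMESTAMP);

-- Recompute the ranking query on every relevant insertion, compare its
-- first row with the stored state and record an alert when it changes.
-- The full recomputation on every event is exactly the inefficiency
-- discussed in the text.
CREATE FUNCTION check_top_country() RETURNS trigger AS $$
DECLARE
  new_top TEXT;
BEGIN
  SELECT p.country INTO new_top
  FROM PEOPLE p, CALLS c
  WHERE c.caller = p.name AND c.called = 'X'
  GROUP BY p.country
  ORDER BY count(*) DESC
  LIMIT 1;

  IF new_top IS DISTINCT FROM (SELECT country FROM CURRENT_TOP) THEN
    UPDATE CURRENT_TOP SET country = new_top;
    INSERT INTO ALERTS (message, created)
    VALUES ('top-calling country for X is now ' || new_top, now());
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watch_top_country
AFTER INSERT ON CALLS
FOR EACH ROW
WHEN (NEW.called = 'X')
EXECUTE FUNCTION check_top_country();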
3 The Proposal
The aim of our approach is to overcome the limitations described in the previous section. We would like to develop an approach that provides analysts with the tools needed to monitor real-life, complex events in a conceptually simple and efficient manner. Consequently, the project has two parts: the development of languages and interpreters for extended triggers, and the design of algorithms to support efficient computation of the extension. Each part is discussed next.

3.1 Extended Triggers
In our proposal, we develop extended triggers, or triggers with extended events, which correspond to high-level, semantic properties. The extended events make it possible to monitor evolution and change in the data by providing a language in which to represent changes and complex conditions. We call this active monitoring. By using extended triggers, an analyst is able to state naturally and simply, in a declarative language, which activities, changes or states are noteworthy from the analyst's point of view.

Our extensions are based on several intuitions. First, the mismatch between currently allowed events in triggers (called database events) and the events we want to monitor (called semantic events) is due to a difference in levels: semantic events are high level, related to the application; database events are low level, related to the database (in our example, top-calling country vs. insertions into CALLS). Therefore, a mechanism is needed to bridge the gap, one that will express the semantic event in terms of database events. However, expressing the semantic event is not enough, since we are interested in monitoring changes in that event (in our example, changes in the top-calling country). Hence, a language in which to express changes is also needed.

Second, even if the previous mismatch did not exist, triggers are not adequate for the task of active monitoring described above, since this task requires knowing when to start, when to stop and how often to check. This information cannot be expressed in current triggers, which are essentially one-time actions: although the trigger is fired repeatedly as the event repeats, each firing is an isolated event, unrelated to others, unless a link or history is established by adequate programming of the trigger.

Finally, and as a result of the mismatch, many database events must happen before effecting a significant change in a semantic event (in our example, many calls may have to be inserted into CALLS before the top-calling country changes). This accumulation naturally happens over time and over the size of the database. Thus, it is inefficient to check for a condition after every database event; it is more efficient to do it periodically.

We propose a language which will establish: a) a certain environment or baseline in which to express semantic events; b) the changes to the baseline that the system can monitor; and c) an interval that determines how long and how often to monitor those changes. The baseline will be established by an SQL query that will specify the context in which changes must be examined. To establish the interval, a starting point, an end point and a frequency must be defined.
Our language supports interval definitions in two dimensions, time and size (of the database), as discussed above. The following specification is proposed (keywords are all in uppercase):

BASELINE <modification>
<modification> := IN
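Since the specification is cut off at this point, the following purely hypothetical example is offered only to illustrate how the ingredients named above (an SQL baseline query, a monitored change, and an interval with a start, an end and a frequency) might fit together; apart from BASELINE, every keyword as well as the procedure notify_analyst is invented and does not come from the paper.

-- Hypothetical extended trigger: the baseline is the ranking query from
-- Section 2, the monitored change is a change in its first row, and checks
-- run monthly for one year.
BASELINE
  SELECT country, count(*) AS numcalls
  FROM PEOPLE, CALLS
  WHERE caller = name AND called = 'X'
  GROUP BY country
  ORDER BY numcalls DESC
MONITOR CHANGE IN FIRST ROW
START NOW
END AFTER 12 MONTHS
FREQUENCY EVERY 1 MONTH
ACTION notify_analyst('top-calling country for X has changed')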