Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2665
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Hsinchun Chen Richard Miranda Daniel D. Zeng Chris Demchak Jenny Schroeder Therani Madhusudan (Eds.)
Intelligence and Security Informatics First NSF/NIJ Symposium, ISI 2003 Tucson, AZ, USA, June 2-3, 2003 Proceedings
Volume Editors

Hsinchun Chen, Daniel D. Zeng, Therani Madhusudan
University of Arizona, Department of Management Information Systems
Tucson, AZ 85721, USA
E-mail: {hchen/zeng/madhu}@eller.arizona.edu

Richard Miranda, Jenny Schroeder
Tucson Police Department
270 S. Stone Ave., Tucson, AZ 85701, USA
E-mail: [email protected]

Chris Demchak
University of Arizona, School of Public Administration and Policy
Tucson, AZ 85721, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for. A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet.
Since the tragic events of September 11, 2001, academics have been called on for possible contributions to research relating to national (and possibly international) security. As one of the original founding mandates of the National Science Foundation, mid- to long-term national security research in the areas of information technologies, organizational studies, and security-related public policy is critically needed. In a way similar to how medical and biological research has faced significant information overload and yet also tremendous opportunities for new innovation, law enforcement, criminal analysis, and intelligence communities are facing the same challenge. We believe, similar to “medical informatics” and “bioinformatics,” that there is a pressing need to develop the science of “intelligence and security informatics” – the study of the use and development of advanced information technologies, systems, algorithms and databases for national security related applications, through an integrated technological, organizational, and policy-based approach. We believe active “intelligence and security informatics” research will help improve knowledge discovery and dissemination and enhance information sharing and collaboration across law enforcement communities and among academics, local, state, and federal agencies, and industry. Many existing computer and information science techniques need to be reexamined and adapted for national security applications. New insights from this unique domain could result in significant breakthroughs in new data mining, visualization, knowledge management, and information security techniques and systems. This first NSF/NIJ Symposium on Intelligence and Security Informatics (ISI 2003) aims to provide an intellectual forum for discussions among previously disparate communities: academic researchers (in information technologies, computer science, public policy, and social studies), local, state, and federal law enforcement and intelligence experts, and information technology industry consultants and practitioners. Several federal research programs are also seeking new research ideas and projects that can contribute to national security. Jointly hosted by the University of Arizona and the Tucson Police Department, the NSF/NIJ ISI Symposium program committee was composed of 44 internationally renowned researchers and practitioners in intelligence and security informatics research. The 2-day program also included 5 keynote speakers, 14 invited speakers, 34 regular papers, and 6 posters. In addition to the main sponsorship from the National Science Foundation and the National Institute of Justice, the meeting was also cosponsored by several units within the University of Arizona, including the Eller College of Business and Public Administration, the Management Information Systems Department, the Internet Technology, Commerce, and Design Institute, the NSF COPLINK Center of Excellence, the Mark and Susan Hoffman E-Commerce Lab, the Center for the Management of
Information, and the Artificial Intelligence Lab, and several other organizations including the Air Force Office of Scientific Research, SAP, and CISCO. We wish to express our gratitude to all members of the conference Program Committee and the Organizing Committee. Our special thanks go to Mohan Tanniru and Joe Hindman (Publicity Committee Co-chairs), Kurt Fenstermacher, Mark Patton, and Bill Neumann (Sponsorship Committee Co-chairs), Homa Atabakhsh and David Gonzalez (Local Arrangements Co-chairs), Ann Lally and Leon Zhao (Publication Co-chairs), and Kathy Kennedy (Conference Management). Our sincere gratitude goes to all of the sponsors. Last, but not least, we thank Gary Strong, Art Becker, Larry Brandt, Valerie Gregg, and Mike O’Shea for their strong and continuous support of this meeting and other related intelligence and security informatics research.
June 2003
Hsinchun Chen, Richard Miranda, Daniel Zeng, Chris Demchak, Jenny Schroeder, Therani Madhusudan
ISI 2003 Organizing Committee
General Co-chairs:
Hsinchun Chen (University of Arizona)
Richard Miranda (Tucson Police Department)

Program Co-chairs:
Daniel Zeng (University of Arizona)
Chris Demchak (University of Arizona)
Jenny Schroeder (Tucson Police Department)
Therani Madhusudan (University of Arizona)

Publicity Co-chairs:
Mohan Tanniru (University of Arizona)
Joe Hindman (Phoenix Police Department)

Sponsorship Co-chairs:
Kurt Fenstermacher (University of Arizona)
Mark Patton (University of Arizona)
Bill Neumann (University of Arizona)

Local Arrangements Co-chairs:
Homa Atabakhsh (University of Arizona)
David Gonzalez (University of Arizona)

Publication Co-chairs:
Ann Lally (University of Arizona)
Leon Zhao (University of Arizona)
ISI 2003 Program Committee
Yigal Arens (University of Southern California)
Art Becker (Knowledge Discovery and Dissemination Program)
Larry Brandt (National Science Foundation)
Donald Brown (University of Virginia)
Judee Burgoon (University of Arizona)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Andy Chen (National Taiwan University)
Lee-Feng Chien (Academia Sinica, Taiwan)
Bill Chu (University of North Carolina, Charlotte)
Christian Collberg (University of Arizona)
Ed Fox (Virginia Tech)
Susan Gauch (University of Kansas)
Johannes Gehrke (Cornell University)
Valerie Gregg (National Science Foundation)
Bob Grossman (University of Illinois, Chicago)
Steve Griffin (National Science Foundation)
Eduard Hovy (University of Southern California)
John Hoyt (South Carolina Research Authority)
David Jensen (University of Massachusetts, Amherst)
Judith Klavans (Columbia University)
Don Kraft (Louisiana State University)
Ee-Peng Lim (Nanyang Technological University, Singapore)
Ralph Martinez (University of Arizona)
Reagan Moore (San Diego Supercomputing Center)
Clifford Neuman (University of Southern California)
David Neri (Tucson Police Department)
Greg Newby (University of North Carolina, Chapel Hill)
Jay Nunamaker (University of Arizona)
Mirek Riedewald (Cornell University)
Kathleen Robinson (Tucson Police Department)
Allen Sears (Corporation for National Research Initiatives)
Elizabeth Shriberg (SRI International)
Mike O'Shea (National Institute of Justice)
Craig Stender (State of Arizona)
Gary Strong (National Science Foundation)
Paul Thompson (Dartmouth College)
Alex Tuzhilin (New York University)
Bhavani Thuraisingham (National Science Foundation)
Howard Wactlar (Carnegie Mellon University)
Andrew Whinston (University of Texas at Austin)
Karen White (University of Arizona)
Jerome Yen (Chinese University of Hong Kong)
Chris Yang (Chinese University of Hong Kong)
Mohammed Zaki (Rensselaer Polytechnic Institute)

Keynote Speakers
Richard Carmona (Surgeon General of the United States)
Gary Strong (National Science Foundation)
Lawrence E. Brandt (National Science Foundation)
Mike O'Shea (National Institute of Justice)
Art Becker (Knowledge Discovery and Dissemination Program)

Invited Speakers
Paul Kantor (Rutgers University)
Lee Strickland (University of Maryland)
Donald Brown (University of Virginia)
Robert Chang (Criminal Investigation Bureau, Taiwan Police)
Pamela Scanlon (Automated Regional Justice Information Systems)
Kelcy Allwein (Defense Intelligence Agency)
Gene Rochlin (University of California, Berkeley)
Jane Fountain (Harvard University)
John Landry (Central Intelligence Agency)
John Hoyt (South Carolina Research Authority)
Bruce Baicar (South Carolina Research Authority and National Institute of Justice)
Matt Begert (National Law Enforcement & Corrections Technology)
John Cunningham (Montgomery County Police Department)
Victor Goldsmith (City University of New York)
Table of Contents
Part I: Full Papers

Data Management and Mining

Using Support Vector Machines for Terrorism Information Extraction . . . . . 1
Aixin Sun, Myo-Myo Naing, Ee-Peng Lim, Wai Lam

Criminal Incident Data Association Using the OLAP Technology . . . . . 13
Song Lin, Donald E. Brown

Names: A New Frontier in Text Mining
Frankie Patman, Paul Thompson

Detecting Deception through Linguistic Analysis . . . . . 91
Judee K. Burgoon, J.P. Blair, Tiantian Qin, Jay F. Nunamaker, Jr.
A Longitudinal Analysis of Language Behavior of Deception in E-mail . . . . . 102
Lina Zhou, Judee K. Burgoon, Douglas P. Twitchell

Analytical Techniques

Evacuation Planning: A Capacity Constrained Routing Approach . . . . . 111
Qingsong Lu, Yan Huang, Shashi Shekhar

Locating Hidden Groups in Communication Networks Using Hidden Markov Models . . . . . 126
Malik Magdon-Ismail, Mark Goldberg, William Wallace, David Siebecker
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department . . . . . 138
Kar Wing Li, Christopher C. Yang

Decision Based Spatial Analysis of Crime . . . . . 153
Yifei Xue, Donald E. Brown

Visualization

CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis . . . . . 168
Jennifer Schroeder, Jennifer Xu, Hsinchun Chen

A Spatio Temporal Visualizer for Law Enforcement . . . . . 181
Ty Buetow, Luis Chaboya, Christopher O'Toole, Tom Cushna, Damien Daspit, Tim Petersen, Homa Atabakhsh, Hsinchun Chen

Tracking Hidden Groups Using Communications . . . . . 195
Sudarshan S. Chawathe

Knowledge Management and Adoption

Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study . . . . . 209
Paul Jen-Hwa Hu, Chienting Lin, Hsinchun Chen

"Atrium" – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age . . . . . 223
Chris C. Demchak

Untangling Criminal Networks: A Case Study . . . . . 232
Jennifer Xu, Hsinchun Chen

Collaborative Systems and Methodologies

Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework . . . . . 249
T.S. Raghu, R. Ramesh, Andrew B. Whinston

Collaborative Workflow Management for Interagency Crime Analysis . . . . . 266
J. Leon Zhao, Henry H. Bi, Hsinchun Chen

COPLINK Agent: An Architecture for Information Monitoring and Sharing in Law Enforcement . . . . . 281
Daniel Zeng, Hsinchun Chen, Damien Daspit, Fu Shan, Suresh Nandiraju, Michael Chau, Chienting Lin
Monitoring and Surveillance

Active Database Systems for Monitoring and Surveillance . . . . . 296
Antonio Badia

Integrated "Mixed" Networks Security Monitoring – A Proposed Framework . . . . . 308
William T. Scherer, Leah L. Spradley, Marc H. Evans

Bioterrorism Surveillance with Real-Time Data Warehousing . . . . . 322
Donald J. Berndt, Alan R. Hevner, James Studnicki

Part II: Short Papers

Data Management and Mining

Privacy Sensitive Distributed Data Mining from Multi-party Data . . . . . 336
Hillol Kargupta, Kun Liu, Jessica Ryan

ProGenIE: Biographical Descriptions for Intelligence Analysis . . . . . 343
Pablo A. Duboue, Kathleen R. McKeown, Vasileios Hatzivassiloglou

Scalable Knowledge Extraction from Legacy Sources with SEEK . . . . . 346
Joachim Hammer, William O'Brien, Mark Schmalz

"TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features . . . . . 350
Sachin Kajarekar, Kemal Sönmez, Luciana Ferrer, Venkata Gadde, Anand Venkataraman, Elizabeth Shriberg, Andreas Stolcke, Harry Bratt

Emergent Semantics from Users' Browsing Paths . . . . . 355
D.V. Sreenath, W.I. Grosky, F. Fotouhi

Deception Detection

Designing Agent99 Trainer: A Learner-Centered, Web-Based Training System for Deception Detection . . . . . 358
Jinwei Cao, Janna M. Crews, Ming Lin, Judee Burgoon, Jay F. Nunamaker

Training Professionals to Detect Deception . . . . . 366
Joey F. George, David P. Biros, Judee K. Burgoon, Jay F. Nunamaker, Jr.

An E-mail Monitoring System for Detecting Outflow of Confidential Documents . . . . . 371
Bogju Lee, Youna Park
Using Support Vector Machines for Terrorism Information Extraction

Aixin Sun¹, Myo-Myo Naing¹, Ee-Peng Lim¹, and Wai Lam²

¹ Centre for Advanced Information Systems, School of Computer Engineering,
Nanyang Technological University, Singapore 639798, Singapore
[email protected]
² Department of Systems Engineering and Engineering Management,
Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR
[email protected]
Abstract. Information extraction (IE) is of great importance in many applications including web intelligence, search engines, text understanding, etc. To extract information from text documents, most IE systems rely on a set of extraction patterns. Each extraction pattern is defined based on the syntactic and/or semantic constraints on the positions of desired entities within natural language sentences. The IE systems also provide a set of pattern templates that determines the kind of syntactic and semantic constraints to be considered. In this paper, we argue that such pattern templates restrict the kind of extraction patterns that can be learned by IE systems. To allow a wider range of context information to be considered in learning extraction patterns, we first propose to model the content and context information of a candidate entity to be extracted as a set of features. A classification model is then built for each category of entities using Support Vector Machines (SVM). We have conducted IE experiments to evaluate our proposed method on a text collection in the terrorism domain. From the preliminary experimental results, we conclude that our proposed method can deliver reasonable accuracies.

Keywords: Information extraction, terrorism-related knowledge discovery.
1 Introduction

1.1 Motivation
Information extraction (IE) is a task that extracts relevant information from a set of documents. IE techniques can be applied to many different areas. In the intelligence and security domains, IE can allow one to extract terrorism-related information from email messages, or identify sensitive business information from
This work is partially supported by the SingAREN 21 research grant M48020004. Dr. Ee-Peng Lim is currently a visiting professor at Dept. of SEEM, Chinese University of Hong Kong, Hong Kong, China.
news documents. In some cases where perfect extraction accuracy is not essential, automated IE methods can replace the manual extraction efforts completely. In other cases, IE may produce first-cut results that reduce the manual extraction efforts. As reported in the survey by Muslea [9], the IE methods for free text documents are largely based on extraction patterns specifying the syntactic and/or semantic constraints on the positions of desired entities within sentences. For example, from the sentence "Guerrillas attacked the 1st infantry brigade garrison", one can define the extraction pattern subject active-attack to extract "Guerrillas" as a perpetrator, and active-attack direct-object to extract "1st infantry brigade garrison" as a victim¹. The extraction pattern definitions currently used are very much based on some pre-defined pattern templates. For example, in AutoSlog [12], the above subject active-attack extraction pattern is an instantiation of the subject active-verb template. While pattern templates reduce the combinations of extraction patterns to be considered in rule learning, they may potentially pose obstacles to deriving other more expressive and accurate extraction patterns. For example, IBM acquired direct-object is a very pertinent extraction pattern for extracting company information but cannot be instantiated by any of AutoSlog's 13 pattern templates. Since it will be quite difficult to derive one standard set of pattern templates that works well for any given domain, IE methods that do not rely on templates will become necessary. In this paper, we propose the use of Support Vector Machines (SVMs) for information extraction. SVM was proposed by Vapnik [16] and has been widely used in image processing and classification problems [5]. The SVM technique finds the best surface that can separate the positive examples from the negative ones. Positive and negative examples are separated by the maximum margin measured by a normal vector w. SVM classifiers have been used in various text classification experiments [2,5] and have been shown to deliver good classification accuracy. When SVM classifiers are used to solve an IE problem, two major research challenges must be considered.
– Large number of instances: IE for free text involves extracting from document sentences target entities (or instances) that belong to some pre-defined semantic category(ies). A classification task, on the other hand, is to identify candidate entities from the document sentences, usually in the form of noun phrases or verb phrases, and assign each candidate entity to zero, one, or more pre-defined semantic categories. As a large number of candidate entities can potentially be extracted from document sentences, this could lead to overheads in both the learning and classification steps.
– Choice of features: The success of SVM very much depends on whether a good set of features is given in the learning and classification steps. There should be adequate features that distinguish entities belonging to a semantic category from those outside the category.
¹ Both extraction patterns have been used in the AutoSlog system [12].
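To make the idea of template-instantiated extraction patterns concrete, the following toy Python sketch (ours, not from the paper; the pre-parsed clause representation and helper names are assumptions) shows how the subject active-verb and active-verb direct-object templates, instantiated with the trigger word "attacked", would extract the perpetrator and the victim from the example sentence above.

```python
# Toy illustration of AutoSlog-style pattern templates (not the paper's code).
# A clause is assumed to be pre-parsed into (subject, verb, direct_object).
from typing import NamedTuple, Optional

class Clause(NamedTuple):
    subject: str
    verb: str
    direct_object: str

def subject_active_verb(clause: Clause, trigger: str) -> Optional[str]:
    """Instantiation of the 'subject active-verb' template: if the clause's
    verb matches the trigger word, extract the subject as the target entity."""
    return clause.subject if clause.verb == trigger else None

def active_verb_dobj(clause: Clause, trigger: str) -> Optional[str]:
    """Instantiation of the 'active-verb direct-object' template."""
    return clause.direct_object if clause.verb == trigger else None

clause = Clause(subject="Guerrillas", verb="attacked",
                direct_object="the 1st infantry brigade garrison")
print(subject_active_verb(clause, "attacked"))  # perpetrator: "Guerrillas"
print(active_verb_dobj(clause, "attacked"))     # victim/target: "the 1st infantry brigade garrison"
```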
In our approach, we attempt to establish the links between the semantic category of a target entity and its syntactic properties, and reduce the number of instances to be classified based on their syntactic and semantic properties. A natural language parser is first used to identify the syntactic parts of sentences, and only those parts that are desired are used as candidate instances. We then use both the content and syntax of a candidate instance and its surrounding context as features.
1.2 Research Objectives and Contributions
Our research aims to develop new IE methods that use classification techniques to extract target entities, without using pattern templates and extraction patterns. Among the different types of IE tasks, we have chosen to address the template element extraction (TE) task, which refers to extracting entities or instances in a free text that belong to some semantic categories². We apply our new IE method to free documents in the terrorism domain. In the terrorism domain, the semantic categories that are interesting include victim, perpetrator, witness, etc. In the following, we summarize our main research contributions.
– IE using Support Vector Machines (SVM): We have successfully transformed IE into a classification problem and adopted SVM to extract target entities. We have not come across any previous papers reporting such an IE approach. As an early exploratory research effort, we only try to extract the entities falling under the perpetrator role. Our proposed IE method, nevertheless, can be easily generalized to extract other types of entities.
– Feature selection: We have defined the content and context features that can be derived for the entities to be extracted/classified. The content features refer to words found in the entities. The context features refer to those derived from the sentence constituents surrounding the entities. In particular, we propose a weighting scheme to derive context features for a given entity.
– Performance evaluation: We have conducted experiments on the MUC text collection in the terrorism domain. In our preliminary experiments, the SVM approach to IE has been shown to deliver performance comparable to the published results of AutoSlog, a well-known extraction pattern-based IE system.
1.3 Paper Outline
The rest of the paper is structured as follows. Section 2 provides a survey of the related IE work and distinguishes our work from them. Section 3 defines our IE problem and the performance measures. Our proposed method is described in Section 4. The experimental results are given in Section 5. Section 6 concludes the paper.

² The template element extraction (TE) task has been defined in the Message Understanding Conference (MUC) series sponsored by DARPA [8].
2 Related Work
As our research deals with IE for free text collections, we only examine related work in this area. Broadly, the related work can be divided into extraction pattern-based and non-extraction pattern-based approaches. The former refers to approaches that first acquire a set of extraction patterns from the training text collections. The extraction patterns use the syntactic structure of a sentence and semantic knowledge of words to identify the target entities. The extraction process is very much a template matching task between the extraction patterns and the sentences. The non-extraction pattern-based approaches are those that use some machine learning techniques to acquire extraction models. The extraction models identify target entities by examining their feature mix, which includes features based on syntax, semantics, and others. The extraction process is very much a classification task that involves accepting or rejecting an entity (e.g., a word or phrase) as a target entity. Many extraction pattern-based IE approaches have been proposed in the Message Understanding Conference (MUC) series. Based on 13 pre-defined pattern templates, Riloff developed the AutoSlog system capable of learning extraction patterns [12]. Each extraction pattern consists of a trigger word (a verb or a noun) to activate its use. AutoSlog also requires a manual filtering step to discard some 74% of the learned extraction patterns as they may not be relevant. PALKA is another representative IE system that learns extraction patterns in the form of frame-phrasal pattern structures [7]. It requires each sentence to be first parsed and grouped into multiple simple clauses before deriving the extraction patterns. Both PALKA and AutoSlog require the training text collections to be tagged. Such tagging requires considerable manual effort. AutoSlog-TS, an improved version of AutoSlog, is able to generate extraction patterns without a tagged training dataset [11]. An overall F1 measure of 0.38 was reported for both AutoSlog and AutoSlog-TS for entities in the perpetrator category, and around 0.45 for the victim and target object categories in the MUC-4 text collection (terrorism domain). Riloff also demonstrated that the best extraction patterns can be further selected using a bootstrapping technique [13]. WHISK is an IE system that uses extraction patterns in the form of regular expressions. Each regular expression can extract either a single target entity or multiple target entities [15]. WHISK has been evaluated on a text collection in the management succession domain. SRV, another IE system, constructs first-order logical formulas as extraction patterns [3]. The extraction patterns also allow relational structures between target entities to be expressed. There has been very little IE research on non-extraction pattern-based approaches. Freitag and McCallum developed an IE method based on Hidden Markov models (HMMs), a kind of probabilistic finite state machine [4]. Their experiments showed that the HMM method outperformed the IE method using SRV on two text collections in the seminar announcements and corporate acquisitions domains.
TST1-MUC3-0002 SAN SALVADOR, 18 FEB 90 (DPA) -- [TEXT] HEAVY FIGHTING WITH AIR SUPPORT RAGED LAST NIGHT IN NORTHWESTERN SAN SALVADOR WHEN MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] ATTACKED AN ELECTRIC POWER SUBSTATION. ACCORDING TO PRELIMINARY REPORTS, A SOLDIER GUARDING THE SUBSTATION WAS WOUNDED. THE FIRST EXPLOSIONS BEGAN AT 2330 [0530 GMT] AND CONTINUED UNTIL EARLY THIS MORNING, WHEN GOVERNMENT TROOPS REQUESTED AIR SUPPORT AND THE GUERRILLAS WITHDREW TO THE SLOPES OF THE SAN SALVADOR VOLCANO, WHERE THEY ARE NOW BEING PURSUED. THE NOISE FROM THE ARTILLERY FIRE AND HELICOPTER GUNSHIPS WAS HEARD THROUGHOUT THE CAPITAL AND ITS OUTSKIRTS, ESPECIALLY IN THE CROWDED NEIGHBORHOODS OF NORTHERN AND NORTHWESTERN SAN SALVADOR, SUCH AS MIRALVALLE, SATELITE, MONTEBELLO, AND SAN RAMON. SOME EXPLOSIONS COULD STILL BE HEARD THIS MORNING. MEANWHILE, IT WAS REPORTED THAT THE CITIES OF SAN MIGUEL AND USULUTAN, THE LARGEST CITIES IN EASTERN EL SALVADOR, HAVE NO ELECTRICITY BECAUSE OF GUERRILLA SABOTAGE ACTIVITY.
Fig. 1. Example Newswire Document
Research on applying machine learning techniques to named-entity extraction, a subproblem of information extraction, has been reported in [1]. Baluja et al. proposed the use of four different types of features to represent an entity to be extracted. They are word-level features, dictionary features, part-of-speech tag features, and punctuation features (surrounding the entity to be extracted). Except for the last feature type, the other three types of features are derived from the entities to be extracted. To the best of our knowledge, our research is the first that explores the use of classification techniques in extracting terrorism-related information. Unlike [4], we represent each entity to be extracted as a set of features derived from the syntactic structure of the sentence in which the entity is found, as well as the words found in the entity.
3 Problem Definition
Our IE task is similar to the template element (TE) task in the Message Understanding Conference (MUC) series. The TE task was to extract different types of target entities from each document, including perpetrators, victims, physical targets, event locations, etc. In MUC-4, a text collection containing newswire documents related to terrorist events in Latin America was used as the evaluation dataset. An example document is shown in Figure 1. In the above document, we could extract several interesting entities about the terrorist event, namely the location ("SAN SALVADOR"), the perpetrator ("MEMBERS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN]"), and the victim ("SOLDIER"). The MUC-4 text collection consists of a training set (with 1500 documents) and two test sets (each with 100 documents). For each document, MUC-4 specifies for each semantic category the target entity(ies) to be extracted.
In this paper, we choose to focus on extracting target entities in the perpetrator category. The input of our IE method consists of the training set (1500 documents) and the perpetrator(s) of each training document. The training documents are not tagged with the perpetrators. Instead, the perpetrators are stored in a separate file known as the answer key file. Our IE method therefore has to locate the perpetrators within the corresponding documents. Should a perpetrator appear in multiple sentences in a document, his or her role may be obscured by features from these sentences, making it more difficult to perform extraction. Once trained, our IE method has to extract perpetrators from the test collections. As the test collections are not tagged with candidate entities, our IE method has to first identify candidate entities in the documents before classifying them. The performance of our IE task is measured by three important metrics: precision, recall, and the F1 measure. Let $n_{tp}$, $n_{fp}$, and $n_{fn}$ be the number of entities correctly extracted, the number of entities wrongly extracted, and the number of entities missed, respectively. Precision, recall, and F1 are defined as follows:

$$Precision = \frac{n_{tp}}{n_{tp} + n_{fp}}, \qquad Recall = \frac{n_{tp}}{n_{tp} + n_{fn}}, \qquad F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
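As a quick illustration of these definitions, here is a small Python helper (ours, not part of the paper) that computes the three measures from raw counts; the example counts are approximate values consistent with the Tst3 column of Table 3 later in the paper.

```python
def precision_recall_f1(n_tp: int, n_fp: int, n_fn: int):
    """Precision, recall and F1 from the counts of correctly extracted (n_tp),
    wrongly extracted (n_fp) and missed (n_fn) target entities."""
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts roughly consistent with the Tst3 column of Table 3
# (117 positive entities, recall ~ 0.44, precision ~ 0.31):
print(precision_recall_f1(n_tp=51, n_fp=116, n_fn=66))
```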
4 Proposed Method

4.1 Overview
Like other IE methods, we divide our proposed IE method into two steps: the learning step and the extraction step. The former learns the extraction model for the target entities in the desired semantic category using the training documents and their target entities. The latter applies the learnt extraction model to other documents and extracts new target entities. The learning step consists of the following smaller steps.
1. Document parsing: As the target entities are perpetrators, they usually appear as noun-phrases in the documents. We therefore parse all the sentences in the document. To break a document into sentences, we use the SATZ software [10]. As a noun-phrase could be nested within another noun-phrase in the parse tree, we only select the simple noun-phrases as candidate entities. The candidate entities from the training documents are further grouped as positive entities if their corresponding noun-phrases match the perpetrator answer keys. The rest are used as negative entities.
2. Feature acquisition: This step refers to deriving features for the training target entities, i.e., the noun-phrases. We will elaborate on this step in Section 4.2.
3. Extraction model construction: This step refers to constructing the extraction model using some machine learning technique. In this paper, we explore the use of SVM to construct the extraction model (or classification model).
The classification step performs extraction using the learnt extraction model following the steps below:
1. Document parsing: The sentences in every test document are parsed and the simple noun phrases in the parse trees are used as candidate entities.
2. Feature acquisition: This step is similar to that in the learning step.
3. Classification: This step applies the SVM classifier to extract the candidate entities.
By identifying all the noun-phrases and classifying them into positive entities or negative entities, we transform the IE problem into a classification problem. To keep our method simple, we do not use co-referencing to identify pronouns that refer to the positive or negative entities.
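A minimal sketch of this two-step pipeline is given below (ours, not the authors' code). The sentence splitter, the simple-noun-phrase extractor, the feature function, and the scoring function stand in for SATZ, the Link Grammar parser, the features of Section 4.2, and the SVM classifier; they are passed in as parameters and are assumptions of this sketch.

```python
# Hedged sketch of the learning and classification steps described above.
from typing import Callable, List, Sequence, Tuple

def candidate_entities(document: str,
                       split_sentences: Callable[[str], List[str]],
                       simple_noun_phrases: Callable[[str], List[str]]
                       ) -> List[Tuple[str, str]]:
    """Return (noun_phrase, sentence) pairs for all simple noun phrases."""
    pairs = []
    for sentence in split_sentences(document):
        for np in simple_noun_phrases(sentence):
            pairs.append((np, sentence))
    return pairs

def build_training_set(documents: Sequence[str],
                       answer_keys: Sequence[Sequence[str]],
                       featurize, split_sentences, simple_noun_phrases):
    """Label a candidate NP positive if it matches a perpetrator answer key."""
    X, y = [], []
    for doc, keys in zip(documents, answer_keys):
        for np, sent in candidate_entities(doc, split_sentences, simple_noun_phrases):
            match = any(k.lower() in np.lower() or np.lower() in k.lower() for k in keys)
            X.append(featurize(np, sent))
            y.append(1 if match else 0)
    return X, y    # feed these to the SVM learner

def extract(document: str, score, featurize, split_sentences,
            simple_noun_phrases, threshold: float = 0.0) -> List[str]:
    """Classification step: keep candidates whose SVM decision score is positive."""
    return [np
            for np, sent in candidate_entities(document, split_sentences, simple_noun_phrases)
            if score(featurize(np, sent)) > threshold]
```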
4.2 Feature Acquisition
We acquire for each candidate entity the features required for constructing the extraction model and for classification. To ensure that the extraction model will be able to distinguish entities belonging to a semantic category from those that do not, it is necessary to acquire a wide spectrum of features. Unlike the earlier works that focus on features mainly derived from within the entities [1] or from the linear sequence of words surrounding the entities [4], our method derives features from the syntactic structures of the sentences in which the candidate entities are found. We divide the entity features into two categories:
– Content features: These refer to the features derived from the candidate entities themselves. At present, we only consider terms appearing in the candidate entities. Given an entity $e = w_1 w_2 \cdots w_n$, we assign the content feature $f_i(w) = 1$ if word $w$ is found in $e$.
– Context features: These features are obtained by first parsing the sentences containing a candidate entity. Each context feature is defined by a fragment of the syntactic structure in which the entity is found and the words associated with that fragment.
In the following, we elaborate the way our context features are obtained. We first use CMU's Link Grammar Parser to parse a sentence [14]. The parser generates a parse tree such as the one shown in Figure 2. A parse tree represents the syntactic structure of a given sentence. Its leaf nodes are the word tokens of the sentence and its internal nodes represent the syntactic constituents of the sentence. The possible syntactic constituents are S (clause), VP (verb phrase), NP (noun phrase), PP (prepositional phrase), etc.
(S (NP Two terrorists)
   (VP (VP destroyed (NP several power poles) (PP on (NP 29th street)))
       and
       (VP machinegunned (NP several transformers)))
   .)

Fig. 2. Parse Tree Example
For each candidate entity, we can derive its context features as a vector of term weights for the terms that appear in the sentences containing the noun-phrase. Given a sentence parse tree, the weight of a term is assigned as follows. Terms appearing in the sibling nodes are assigned a weight of 1.0. Terms appearing at a higher or lower level of the parse tree are assigned smaller weights as they are further away from the candidate entity. The feature weights are reduced by half for every level further away from the candidate entity in our experiments. The 50% reduction factor has been chosen arbitrarily; a careful study needs to be further conducted to determine the optimal reduction factor. For example, the context features of the candidate entity "several power poles" are derived as follows³.

Table 1. Context features and feature weights for "several power poles"

Label  Terms           Weight
PP     on              1.00
NP     29th street     0.50
VP     destroyed       0.50
NP     Two terrorists  0.25
To ensure that the included context features are closely related to the candidate entity, we do not consider terms found in the sibling nodes (and their subtrees) of the ancestor(s) of the entity. Intuitively, these terms are not syntactically very related to the candidate entity and are therefore excluded. For example, for the candidate entity "several power poles", the terms in the subtree "and machinegunned several transformers" are excluded from the context feature set.³

³ More precisely, stopword removal and stemming are performed on the terms. Some of them will be discarded during this process.
Using Support Vector Machines for Terrorism Information Extraction
9
If an entity appears in multiple sentences in the same document, and the same term is included as a context feature from different parse trees, we combine the context features into one and assign it the highest weight among the original weights. This is necessary to keep one unique weight for each term.
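A partial sketch of this weighting scheme is shown below (ours; the tuple encoding of parse trees and the function names are assumptions). It only implements the sibling part of the scheme, with weights halving per level of nesting and the maximum kept when a term occurs more than once; note that Table 1 attributes "destroyed" to the parent VP level (weight 0.50), so the authors' exact bookkeeping for bare-word siblings differs slightly from this simplification.

```python
# Partial sketch of the context-feature weighting. Parse trees are nested
# tuples (label, child1, child2, ...); leaves are plain word strings.
# Terms under the candidate NP's sibling nodes start at weight 1.0 and the
# weight halves for each additional level of nesting; walking further up the
# tree (parent, grandparent, ...) would halve the starting weight again,
# subject to the exclusion rule discussed in the text.

def terms_with_weights(node, weight, out):
    """Collect leaf terms below `node`; the weight halves one level down."""
    if isinstance(node, str):
        out[node] = max(weight, out.get(node, 0.0))   # keep the highest weight
    else:
        for child in node[1:]:
            child_weight = weight / 2.0 if isinstance(child, tuple) else weight
            terms_with_weights(child, child_weight, out)

def sibling_context_features(parent, candidate):
    """Weights for terms in the candidate's sibling constituents."""
    out = {}
    for sibling in parent[1:]:
        if sibling is not candidate:
            terms_with_weights(sibling, 1.0, out)
    return out

np_star = ("NP", "several", "power", "poles")
inner_vp = ("VP", "destroyed", np_star, ("PP", "on", ("NP", "29th", "street")))
print(sibling_context_features(inner_vp, np_star))
# {'destroyed': 1.0, 'on': 1.0, '29th': 0.5, 'street': 0.5}
```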
4.3 Extraction Model Construction
To construct an extraction model, we require both positive training data and negative training data. While the positive training entities are available from the answer key file, the negative training entities can be obtained from the noun phrases that do not contain any target entities. Since pronouns such as "he", "she", "they", etc. may possibly be co-referenced with some target entities, we do not use them as either positive or negative training entities. From the training set, we also obtain an entity filter dictionary that consists of noun-phrases that cannot be perpetrators. These are non-target noun-phrases that appear more than five times in the training set, e.g., "dictionary", "desk" and "tree". With this filter, the number of negative entities is reduced dramatically. If a larger threshold is used, fewer noun-phrases will be filtered, causing a degradation of precision. On the other hand, a smaller threshold may increase the risk of getting a lower recall. Once an extraction model is constructed, it can perform extraction on a given document by classifying candidate entities in the document into the perpetrator or non-perpetrator category. In the extraction step, a candidate entity is classified as a perpetrator when the SVM classifier returns a positive score value.
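A small sketch of how the entity filter dictionary and the negative-example filtering could be implemented (ours; only the more-than-five-occurrences rule and the pronoun exclusion come from the text above, everything else is an assumption):

```python
from collections import Counter

def build_filter_dictionary(training_nps, positive_nps, min_count=5):
    """Noun phrases that never match a target entity and occur more than
    `min_count` times in the training set are assumed to be non-perpetrators."""
    positives = {np.lower() for np in positive_nps}
    counts = Counter(np.lower() for np in training_nps)
    return {np for np, c in counts.items() if c > min_count and np not in positives}

def select_negatives(candidate_nps, positive_nps, filter_dict):
    """Drop pronouns and filter-dictionary entries before forming negatives."""
    pronouns = {"he", "she", "they", "him", "her", "them"}
    positives = {np.lower() for np in positive_nps}
    return [np for np in candidate_nps
            if np.lower() not in positives
            and np.lower() not in filter_dict
            and np.lower() not in pronouns]
```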
5 Experiments and Results

5.1 Datasets
We used the MUC-4 dataset in our experiments. Three files (muc34dev, muc34tst1 and muc34tst2) were used as the training set and the remaining two files (muc34tst3 and muc34tst4) were used as the test set. There are 1500 news documents in total in the training set and 100 documents in each of the two test files. For each news document, there are zero, one or two perpetrators defined in the answer key file. Therefore, most of the noun phrases are negative candidate entities. To avoid severely unbalanced training examples, we only considered the training documents that have at least one perpetrator defined in the answer key files. There are 466 training documents containing some perpetrators. We used all the 100 news documents in each test set, since the classifier should not know whether a test document contains a perpetrator. The number of documents used and the numbers of positive and negative entities for the training and test sets are listed in Table 2. From the table, we observe that negative entities contribute about 90% of the entities of the training set, and around 95% of the test set.
5.2 Results
We used SVMlight as our classifier in our experiments [6]. SVMlight is an implementation of Support Vector Machines (SVMs) in C and has been widely used in text classification and web classification research. Due to the unbalanced training examples, we set the cost-factor (parameter j) of SVMlight to be the ratio of the number of negative entities over the number of positive ones. The cost-factor denotes the proportion of cost allocated to training errors on positive entities against errors on negative entities. We used the polynomial kernel function instead of the default linear kernel function. We also set our threshold to be 0.0 as suggested. The results are reported in Table 3.

Table 2. Documents and positive/negative entities in the training/test data sets

Dataset  Documents  Positive Entities  Negative Entities
Train    466        1003               9435
Tst3     100        117                2336
Tst4     100        77                 1943

Table 3. Results on the training and test data sets

Dataset  Precision  Recall  F1 measure
Train    0.7752     0.9661  0.8602
Tst3     0.3054     0.4359  0.3592
Tst4     0.2360     0.5455  0.3295
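The class-imbalance handling described above can be approximated with any SVM package that exposes per-class costs. The sketch below uses scikit-learn's SVC as a stand-in for SVMlight; the exact semantics of SVMlight's cost-factor j and its polynomial-kernel defaults are not identical, so this is an approximation of the setup, not the authors' configuration.

```python
from sklearn.svm import SVC

def train_imbalanced_svm(X, y):
    """Polynomial-kernel SVM with the positive-class cost scaled by the
    negative/positive ratio, mimicking the cost-factor j used with SVM-light."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    clf = SVC(kernel="poly", class_weight={1: n_neg / max(n_pos, 1), 0: 1.0})
    clf.fit(X, y)
    return clf

# At extraction time, candidates with clf.decision_function([x])[0] > 0.0
# are classified as perpetrators (threshold 0.0, as in the experiments).
```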
As shown in the table, the SVM classifier performed very well on the training data. It achieved both high precision and high recall values. Nevertheless, the classifier did not perform equally well on the two test data sets. About 43% and 54% of the target entities were extracted for Tst3 and Tst4, respectively. The results also indicate that many other non-target entities were extracted, causing the low precision values. The overall F1 measures are 0.36 and 0.33 for Tst3 and Tst4, respectively. The above results, compared to the known results given in [11], are reasonable, as the latter also showed no more than 30% precision for both AutoSlog and AutoSlog-TS⁴. [11] reported an F1 measure of 0.38, which is not very different from ours. The rather low F1 measures suggest that this IE problem is quite a difficult one. We, nevertheless, are quite optimistic about our preliminary results as they clearly show that the IE problem can be handled as a classification problem.
⁴ The comparison cannot be taken in absolute terms, since [11] used a slightly different experimental setup for the MUC-4 dataset.
6 Conclusions
In this paper, we attempt to extract perpetrator entities from a collection of untagged news documents in the terrorism domain. We propose a classification-based method to handle the IE problem. The method segments each document into sentences, parses the sentences into parse trees, and derives features for the entities within the documents. The features of each entity are derived from both its content and its context. Based on SVM classifiers, our method was applied to the MUC-4 dataset. Our experimental results showed that the method performs at a level comparable to some well-known published results. As part of our future work, we would like to continue our preliminary work and explore additional features in training the SVM classifiers. Since the number of training entities is usually small in real applications, we will also try to extend our classification-based method to handle IE problems with a small number of seed training entities.
References

1. S. Baluja, V. Mittal, and R. Sukthankar. Applying machine learning for high performance named-entity extraction. Computational Intelligence, 16(4):586–595, November 2000.
2. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148–155, Bethesda, Maryland, November 1998.
3. D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the 15th Conference on Artificial Intelligence (AAAI-98) / 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 517–523, Madison, Wisconsin, July 1998.
4. D. Freitag and A. K. McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 31–36, Orlando, FL, July 1999.
5. T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142, Chemnitz, DE, 1998.
6. T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
7. J.-T. Kim and D. I. Moldovan. Acquisition of linguistic patterns for knowledge-based information extraction. IEEE Transactions on Knowledge and Data Engineering, 7(5):713–724, 1995.
8. MUC. Proceedings of the 4th Message Understanding Conference (MUC-4), 1992.
9. I. Muslea. Extraction patterns for information extraction tasks: A survey. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 1–6, Orlando, Florida, July 1999.
10. D. D. Palmer and M. A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the 4th Conference on Applied Natural Language Processing, pages 78–83, Stuttgart, Germany, October 1994.
11. E. Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, Portland, Oregon, 1996.
12. E. Riloff. An empirical study of automated dictionary construction for information extraction in three domains. Artificial Intelligence, 85(1-2):101–134, 1996.
13. E. Riloff and R. Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the 16th National Conference on Artificial Intelligence, pages 1044–1049, 1999.
14. D. Sleator and D. Temperley. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Computer Science, Carnegie Mellon University, October 1991.
15. S. Soderland. Learning information extraction rules for semi-structured and free text. Journal of Machine Learning, 34(1-3):233–272, 1999.
16. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, Heidelberg, DE, 1995.
Criminal Incident Data Association Using the OLAP Technology

Song Lin and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, VA 22904, USA
{sl7h, brown}@virginia.edu
Abstract. Associating criminal incidents committed by the same person is important in crime analysis. In this paper, we introduce concepts from OLAP (online-analytical processing) and data-mining to resolve this issue. The criminal incidents are modeled into an OLAP data cube; a measurement function, called the outlier score function is defined on the cube cells. When the score is significant enough, we say that the incidents contained in the cell are associated with each other. The method can be used with a variety of criminal incident features to include the locations of the crimes for spatial analysis. We applied this association method to the robbery dataset of Richmond, Virginia. Results show that this method can effectively solve the problem of criminal incident association. Keywords. Criminal incident association, OLAP, outlier
of a decision process involving a multi-staged search in the awareness space. During the search phase, the criminal associates these cues, clusters of cues, or cue sequences with a "good" target. These cues form a template of the criminal, and once the template is built, it is self-reinforcing and relatively enduring. Due to the limits of the searching ability of a human being, a criminal normally does not have many decision templates. Therefore, we can observe criminal incidents with similar temporal, spatial, and modus operandi (MO) features, which possibly come from the same template of the same criminal. It is possible to identify the serial criminal by associating these similar incidents. Different approaches have been proposed and several software programs have been developed to resolve the crime association problem. They can be classified into two major categories: suspect association and incident association. The Integrated Criminal Apprehension Program (ICAP) developed by Heck [12] enables police officers to match suspects against arrested criminals using MO features; the Armed Robbery Eidetic Suspect Typing (AREST) program [1] employs an expert approach to perform suspect association and classify a potential offender into three categories: probable, possible, or non-suspect. The Violent Criminal Apprehension Program (ViCAP) developed by the Federal Bureau of Investigation (FBI) [13] is an incident association system. MO features are primarily considered in ViCAP. In the COPLINK [10] project undertaken by researchers at the University of Arizona, a novel concept space model is built and can be used to associate search terms with suspects in the database. A total similarity method was proposed by Brown and Hagen [3], and it can solve problems for both incident association and suspect association. Besides these theoretical methods, crime analysts normally use SQL (Structured Query Language) in practice. They build a SQL string and make the system return all records that match their search criteria. In this paper, we describe a crime association method that combines OLAP concepts from the data warehousing area and outlier detection ideas from the data mining field. Before presenting our method, let us briefly review some concepts in OLAP and data mining.
2 Brief Review of OLAP and OLAP-Based Data Mining

OLAP is a key aspect of many data warehousing systems [6]. Unlike its ancestor, the OLTP (online transaction processing) system, OLAP focuses on providing summary information to the decision-makers of an organization. Aggregated data, such as sums, averages, maxima, or minima, are pre-calculated and stored in a multi-dimensional database called a data cube. Each dimension of the data cube consists of one or more categorical attributes. Hierarchical structures generally exist in the dimensions. Most existing OLAP systems concentrate on the efficiency of retrieving the summary data in the cube. In many cases, the decision-maker still needs to apply his or her domain knowledge and sometimes common sense to make the final decision. Data mining is a collection of techniques that detect patterns in large amounts of data. Quantitative approaches, including statistical methods, are generally used in data mining. Traditionally, data mining algorithms have been developed for two-way datasets. More recently, researchers have generalized some data mining methods for
multi-dimensional OLAP data structures. Imielinski et al. proposed the "cubegrade" problem [14]. The cubegrade problem can be treated as a generalized version of the association rule. Imielinski et al. claim that the association rule can be viewed as the change of count aggregates when imposing another constraint, or in OLAP terminology, making a drill-down operation on an existing cube cell. They think that other aggregates like sum, average, max, or min can also be incorporated, and that the cubegrade could better support "what if" analysis. Similar to the cubegrade problem, the constrained gradient analysis was proposed by Dong et al. [7]. The constrained gradient analysis focuses on retrieving pairs of OLAP cells that are quite different in aggregates and similar in dimensions (usually one cell is the ascendant, descendent, or sibling of the other cell). More than one aggregate can be considered simultaneously in the constrained gradient analysis. The discovery-driven exploration problem was proposed by Sarawagi et al. [18]. It aims at finding exceptions in the cube cells. They build a formula to estimate the anticipated value and the standard deviation (σ) of a cell. When the difference between the actual value of the cell and the anticipated value is greater than 2.5σ, the cell is selected as an exception. Similar to the above approaches, our crime association method also focuses on the cells of the OLAP data cube. We define an outlier score function to measure the distinctiveness of a cell. Incidents contained in the same cell are determined to be associated with each other when the score is significant. The definition of the outlier score function and the association method is given in Section 3.
3 Method

3.1 Rationale

The rationale of this method is explained as follows: although theoretically the template (see Section 1) is unique for each serial criminal, the data collected in the police department does not contain every aspect of the template. Some observed parts of the templates are "common", so that we may see a large overlap in these common templates. The creators (criminals) of those "common" templates are not separable. Some templates are "special". For these "special" templates, we are more confident in saying that the incidents come from the same criminal. For example, consider the weapon used in a robbery incident. We may observe many incidents with the value "gun" for the weapon used. However, no crime analyst would say that the same person commits all these robberies, because "gun" is a common template shared by many criminals. If we observe several robberies with a "Japanese sword" (an uncommon template), we are more confident in asserting that these incidents result from the same criminal. (This "Japanese sword" claim was first proposed by Brown and Hagen [4].) In this paper, we describe an outlier score function to measure this distinctiveness of the template.
3.2 Definitions

In this section, we give the mathematical definitions used to build the outlier score function. People familiar with OLAP concepts can see that our notation derives from terms used in the OLAP field. $A_1, A_2, \ldots, A_m$ are $m$ attributes that we consider relevant to our study, and $D_1, D_2, \ldots, D_m$ are their domains, respectively. Currently, these attributes are confined to be categorical (categorical attributes like MO are important in crime association analysis). Let $z^{(i)}$ be the $i$-th incident, and $z^{(i)}.A_j$ be the value of incident $i$ on the $j$-th attribute. $z^{(i)}$ can be represented as $z^{(i)} = (z_1^{(i)}, z_2^{(i)}, \ldots, z_m^{(i)})$, where $z_k^{(i)} = z^{(i)}.A_k \in D_k$, $k \in \{1, \ldots, m\}$. $Z$ is the set of all incidents.

Definition 1. Cell
A cell $c$ is a vector of attribute values with dimension $t$, where $t \le m$. A cell can be represented as $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$. In order to standardize the definition of a cell, for each $D_i$ we add a "wildcard" element $*$ and allow $D'_i = D_i \cup \{*\}$. A cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ can then be represented as $c = (c_1, c_2, \ldots, c_m)$, where $c_j \in D'_j$, and $c_j = *$ if and only if $j \notin \{i_1, i_2, \ldots, i_t\}$. $C$ denotes the set of all cells. Since each incident can also be treated as a cell, we define a function $Cell: Z \to C$, with $Cell(z) = (z_1, z_2, \ldots, z_m)$ if $z = (z_1, z_2, \ldots, z_m)$.

Definition 2. Contains relation
We say that cell $c = (c_{i_1}, c_{i_2}, \ldots, c_{i_t})$ contains incident $z$ if and only if $z.A_j = c_j$ or $c_j = *$, for $j = 1, 2, \ldots, m$. For two cells, we say that cell $c' = (c'_1, c'_2, \ldots, c'_m)$ contains cell $c = (c_1, c_2, \ldots, c_m)$ if and only if $c'_j = c_j$ or $c'_j = *$, for $j = 1, 2, \ldots, m$.

Definition 3. Count of a cell
The function $count$ is defined on a cell and returns the number of incidents that cell $c$ contains.

Definition 4. Parent cell
Cell $c' = (c'_1, c'_2, \ldots, c'_m)$ is the parent cell of cell $c$ on the $k$-th attribute when $c'_k = *$ and $c'_j = c_j$ for $j \ne k$. The function $parent(c, k)$ returns the parent cell of cell $c$ on the $k$-th attribute.
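A minimal Python sketch of these cell operations, assuming incidents are encoded as tuples of categorical values and the string "*" plays the role of the wildcard (the encoding and helper names are ours):

```python
WILDCARD = "*"

def contains(cell, incident):
    """Definition 2: every non-wildcard component of the cell must match."""
    return all(c == WILDCARD or c == z for c, z in zip(cell, incident))

def count(cell, incidents):
    """Definition 3: number of incidents the cell contains."""
    return sum(1 for z in incidents if contains(cell, z))

def parent(cell, k):
    """Definition 4: the same cell with the k-th attribute replaced by *."""
    return cell[:k] + (WILDCARD,) + cell[k + 1:]

# Example over two attributes (weapon used, method of escape):
incidents = [("gun", "by car"), ("gun", "by foot"), ("Japanese sword", "by car")]
cell = ("gun", WILDCARD)
print(count(cell, incidents))   # 2
print(parent(cell, 0))          # ('*', '*')
```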
Definition 5. Neighborhood
$P$ is called the neighborhood of cell $c$ on the $k$-th attribute when $P$ is the set of cells that take the same values as cell $c$ in all attributes but $k$, and do not take the wildcard value $*$ on the $k$-th attribute, i.e., $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ where $c_l^{(i)} = c_l^{(j)}$ for all $l \ne k$, and $c_k^{(i)} \ne *$ for all $i = 1, 2, \ldots, |P|$. The function $neighbor(c, k)$ returns the neighborhood of cell $c$ on attribute $k$. (In the OLAP field, the neighborhood is sometimes called the siblings.)

Definition 6. Relative frequency
We call $freq(c, k) = \dfrac{count(c)}{count(parent(c, k))}$ the relative frequency of cell $c$ with respect to attribute $k$.

Definition 7. Uncertainty function
We use a function $U$ to measure the uncertainty of a neighborhood. This uncertainty measure is defined on the relative frequencies. If we use $P = \{c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}\}$ to denote the neighborhood of cell $c$ on attribute $k$, then

$$U(c, k) = U(freq(c^{(1)}, k), freq(c^{(2)}, k), \ldots, freq(c^{(|P|)}, k))$$
Obviously, $U$ should be symmetric in $c^{(1)}, c^{(2)}, \ldots, c^{(|P|)}$. $U$ takes a smaller value if the "uncertainty" in the neighborhood is low. One candidate uncertainty function is entropy, which comes from information theory:

$$U(c, k) = H(c, k) = -\sum_{c' \in neighbor(c, k)} freq(c', k) \log(freq(c', k))$$

For $freq = 0$, we define $0 \cdot \log(0) = 0$, as is common in information theory.

3.3 Outlier Score Function (OSF) and the Crime Association Method

Our goal is to build a function to measure the confidence, or the significance level, of associating crimes. This function is built over OLAP cube cells. We start building this function by analyzing the requirements that it needs to satisfy. Consider the following three scenarios:
I. We have 100 robberies. 5 take the value of "Japanese sword" for the weapon used attribute, and 95 take "gun". Obviously, the 5 "Japanese swords" are of more interest than the 95 "guns".
II. Now we add another attribute: method of escape. Assume we have 20 different values for the method of escape attribute: "by car", "by foot", etc. Each of them has 5 incidents. Although both "Japanese sword" and "by car" have 5 incidents, they should not be treated equally. "Japanese sword" highlights itself because all other incidents are "guns", or in other words, the uncertainty level of the weapon used attribute is smaller.
III. If some incidents take "Japanese sword" on the weapon used attribute and "by car" on the method of escape attribute, then the combination of "Japanese sword" and "by car" is more significant than either "Japanese sword" alone or "by car" alone. The reason is that we have more "evidence".
Now we define function f as follows: − log( freq(c, k )) ) max ( f ( parent(c, k )) + f (c) = k takes all non−* dim ensionof c H (c, k ) 0 c = (*,*,...,*) When H(c,k) = 0, we say − log( freq (c, k )) = 0. H (c , k )
(1)
It is simple to verify that f satisfies above three requirements. We call f the outlier score function. (The term “outlier” is commonly used in the field of statistics. Outliers are observations significantly different that other observations and possibly are generated from a unique mechanism [11].) Based on the outlier score function, we give the following rule to associate criminal incidents: Given a pair of incidents, if there exists a cell containing both these incidents, and the outlier score of the cell is greater than some threshold value τ, we say that these two incidents are associated with each other. This association method is called an OLAP-outlier-based association method, or outlier-based method for abbreviation.
4 Application We applied this criminal incident association method to a real-world dataset. The dataset contained information on robbery incidents that occurred in Richmond, Virginia in 1998. The dataset consisted of two parts: the incident dataset and the suspect dataset. The incident dataset had 1198 records, and the temporal, spatial, and MO information were stored in the incident database. The name (if known), height, and weight information of the suspect were recorded in the suspect database. We applied our method to the incident dataset and used the suspect dataset for verification. Robbery was selected for two reasons: first, compared with some violent crime such as murder or sexual attack, serial robberies were more common; second, compared with breaking and entering crimes, more robbery incidents were “solved” (criminal arrested) or “partially solved” (the suspect’s name is known). These two points made the robbery favorable for evaluation purposes.
Criminal Incident Data Association Using the OLAP Technology
19
4.1 Attribute Selection We used three types of attributes in our analysis. The first set of attributes consisted of MO features. MO was primarily considered in crime association analysis. 6 MO attributes were picked. The second set of attributes was census attributes (the census data was obtained directly from the census CD held in library of the University of Virginia). Census data represented the spatial characteristics of the location where the criminal incident occurred, and it might help to reveal the spatial aspect of the criminals’ templates. For example, some criminals preferred to attack “high-income” areas. Lastly, we chose some distance attributes. They were distances from the incident location to some spatial landmarks such as a major highway or a church. Distance features were also important in analyzing criminals’ behaviors. For example, a criminal might preferred to initiate an attack from a certain distance range from a major highway so that the offense could not be observed during the attack, and he or she could leave the crime scene as soon as possible after the attack. There were a total of 5 distances. The names of all attributes and their descriptions are given in appendix I. They have also been used in a previous study on predicting breaking and entering crimes by Brown et al. [4]. An attribute selection was performed on all numerical attributes (census and distance attributes) before using the association method. The reason was that some attributes were redundant. These redundant attributes were unfavorable to the association algorithm in terms of both accuracy and efficiency. We adopted a featureselection-by-clustering methodology to pick the attributes. According to this method, we used the correlation coefficient to measure how similar or close two attributes were, and then we clustered the attributes into a number of groups according to this similarity measure. The attributes in the same group were similar to each other, and were quite different from attributes in other groups. For each group, we picked a representative. The final set of all representative attributes was considered to capture the major characteristics of the dataset. A similar methodology was used by Mitra et al. [16]. We picked the k-medoid clustering algorithm. (For more details about the kmedoid algorithm and other clustering algorithm, see [8].) The reason was that kmedoid method works on similarity / distance matrix (some other methods only work on coordinate data), and it tends to return spherical clusters. In addition, k-medoid returns a medoid for each cluster, based upon which we could select the representative attributes. After making a few slight adjustments and checking the silhouette plot [15], we finally got three clusters, as given in Fig. 1. The algorithm returned three medoids: HUNT_DST (housing unit density), ENRL3_DST (public school enrollment density), and TRAN_PC (expenses on transportation: per capita). We made some adjustments here. We replaced ENRL3_DST with another attribute POP3_DST (population density: age 12-17). The attackers and victims. For similar reasons, we replaced TRAN_PC with MHINC (median household income).
20
S. Lin and D.E. Brown
Fig. 1. Result of k-medoid clustering
There were a total of 9 attributes used in our analysis: 6 MO attributes (categorical) and 3 numerical attributes picked by applying the attributes selection procedure. Since our method was developed on categorical attributes, we converted the numerical attributes to categorical ones by dividing them into 11 equally sized bins. The number was determined by Sturge’s number of bins rule [19][20].
4.2 Evaluation Criteria We wanted to evaluate whether the association determined by our method corresponded to the true result. The information in the suspect database was considered as the “true result”. 170 incidents with the names of the suspects were used for evaluation. We generated all incident pairs. If two incidents in a pair had the suspects with the same name and date of birth, we said that the “true result” for this incident pair was a “true association”. There were 33 true associations. We used two measures to evaluate our method. The first measure was called “detected true associations”. We expected that the association method would be able to detect a large portion of “true associations”. The second measure was called “average number of relevant records”. This measure was built on the analogy of the search engine. Consider a search engine as Google. For each searching string(s) we give, it returns a list of documents considered to be “relevant” to the searching criterion. Similarly, for the crime association problem, if we give an incident, the algorithm will return a list of records that are considered as “associated” with the given incident. A shorter list is always preferred in both cases. The average “length” of the lists provided the second measure and we called it the “average number of relevant records”. The algorithm is more accurate when this measure has a smaller
Criminal Incident Data Association Using the OLAP Technology
21
value. In the information retrieval area [17], two commonly used criteria in evaluating a retrieval system are recall and precision. The former is the ability for a system to present relevant items, and the latter is the ability to present only the relevant items. Our first measure was a recall measure, and our second measure was equivalent to a precision measure. The above two measures do not work for our approach only; they can be used in evaluating any association algorithms. Therefore, we can use these two measures to compare the performances of different association methods. 4.3 Result and Comparison Different threshold values were set to test our method. Obvious if we set it to 0, we would expect that the method can detect all “true associations” and the average number of relevant records was 169 (given 170 incidents for evaluation). If we set the threshold, τ, to infinity, we would expect the method to return 0 for both “detected true associations” and “average number of relevant records”. As the threshold increased, we expected a decrease in both number of detected true associations and average number of relevant records. The result is given in Table 1. Table 1. Result of outlier-based method
Avg. number of relevant records 169.00 121.04 62.54 28.38 13.96 7.51 4.25 2.29 0.00
We compared this outlier-based method with a similarity-based crime association method. The similarity-based method was proposed by Brown and Hagen (Brown and Hagen, 2003). Given a pair of incidents, the similarity-based method first calculates a similarity score for each attribute, and then computes a total similarity score using the weighted average of all individual similarity scores. The total similarity score is used to determine whether the incidents are associated. Using the same evaluation criteria, the result of the similarity-based method is given in Table 2. If we set the average number of relevant records as the X-axis and set the detected true associations as the Y-axis, the comparisons can be illustrated as in Fig. 2. In Fig. 2, the outlier-based method lies above the similarity-based method for most cases. That means given the same “accuracy” (detected true associations) level, the outlier-based method returns fewer relevant records. Also if we keep the number
22
S. Lin and D.E. Brown Table 2. Result of similarity-based method
Threshold 0 0.5 0.6 0.7 0.8 0.9
Detected true associations 33 33 25 15 7 0
∞
Avg. number of relevant records 169.00 112.98 80.05 45.52 19.38 3.97
0
0.00
of relevant records (average length of the returned list) for both methods, the outlierbased method is more accurate. The curve of the similarity-based method sits slightly above the outlier-based method when the average number of relevant records is above 100. Since the size of the evaluation incident set is 170, no crime analyst would consider putting further investigation on any set of over 100 incidents. The outlierbased method is generally more effective.
35
30
Detected Associations
25
20 Similarity Outlier 15
10
5
0 0
20
40
60
80
100
120
140
160
180
Avg. relevant records
Fig. 2. Comparison: the outlier-based method vs. the similarity-based method
5 Conclusion In this paper, an OLAP-outlier-based method is introduced to solve the crime association problem. The criminal incidents are modeled into an OLAP cube and an outlier-score function is defined over the cube cells. The incidents contained in the
Criminal Incident Data Association Using the OLAP Technology
23
cell are determined to be associated with each other when the outlier score is large enough. The method was applied to a robbery dataset and results show that this method can provide significant improvements for crime analysts who need to link incidents in large databases.
References 1.
2. 3. 4. 5.
6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16.
17. 18.
19. 20.
Badiru, A.B., Karasz, J.M. and Holloway, B.T., “AREST: Armed Robbery Eidetic Suspect Typing Expert System”, Journal of Police Science and Administration, 16, 210–216 (1988) Brantingham, P. J. and Brantingham, P. L., Patterns in Crimes, New York: Macmillan (1984) Brown D.E. and Hagen S.C., “Data Association Methods with Applications to Law Enforcement”, Decision Support Systems, 34, 369–378 (2003) Brown, D. E., Liu, H. and Xue, Y., “Mining Preference from Spatial-temporal Data”, Proc. of the First SIAM International Conference of Data Mining (2001) Clarke, R.V. and Cornish, D.B., “Modeling Offender’s Decisions: A Framework for Research and Policy”, Crime Justice: An Annual Review of Research, Vol. 6, Ed. by Tonry, M. and Morris, N. University of Chicago Press (1985) Chaudhuri, S. and Dayal, U., “An Overview of Data Warehousing and OLAP Technology”, ACM SIGMOD Record, 26 (1997) Dong, G., Han, J., Lam, J. Pei, J., and Wang, K., “Mining Multi-Dimensional Constrained Gradients in Data Cubes”, Proc. of the 27th VLDB Conference, Roma, Italy (2001) Everitt, B. Cluster Analysis, John Wiley & Sons, Inc. (1993) Felson, M., “Routine Activities and Crime Prevention in the Developing Metropolis”, Criminology, 25, 911–931 (1987) Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., and Chen, H., “Using Coplink to Analyse Criminal-Justice Data”, IEEE Computer, 35, 30–37 (2002) Hawkins, D., Identifications of Outliers, Chapman and Hall, London, (1980) Heck, R.O., Career Criminal Apprehesion Program: Annual Report (Sacramento, CA: Office of Criminal Justice Planning) (1991) Icove, D. J., “Automated Crime Profiling”, Law Enforcement Bulletin, 55, 27–30 (1986) Imielinski, T., Khachiyan, L., and Abdul-ghani, A., Cubegrades: “Generalizing association rules”, Technical report, Dept. Computer Science, Rutgers Univ., Aug. (2000) Kaufman, L. and Rousseeuw, P. Finding Groups in Data, Wiley (1990) Mitra, P., Murthy, C.A., and Pal, S.K., “Unsupervised Feature Selection Using Feature Similarity”, IEEE Trans. On Pattern Analysis and Machine Intelligence, 24, 301–312 (2002) Salton, G. and McGill, M. Introduction to Modern Information Retrieval, McGraw-Hill Book Company, New York (1983) Sarawagi, S., Agrawal, R., and Megiddo. N., “Discovery-driven exploration of OLAP data cubes”, Proc. of the Sixth Int’l Conference on Extending Database Technology (EDBT), Valencia, Spain (1998) Scott, D. Multivariate Density Estimation: Theory, Practice and Visualization, New York, NY: Wiley (1992) Sturges, H.A., “The Choice of a Class Interval”, Journal of American Statistician Association, 21, 65–66 (1926)
24
S. Lin and D.E. Brown
Appendix I. Attributes used in the analysis (a) MO attributes Name Description Rsus_Acts Actions taken by the suspects R_Threats Method used by the suspects to threat the victim R_Force Actions that suspects force the victim to do Rvic_Loc Location type of the victim when robbery was committed Method_Esc Method of escape the scene Premise Premise to commit the crime (b) Census attributes Attribute name Description General POP_DST Population density (density means that the statistic is divided by the area) HH_DST Household density FAM_DST Family density MALE_DST Male population density FEM_DST Female population density Race RACE1_DST RACE2_DST RACE3_DST RACE4_DST RACE5_DST HISP_DST
White population density Black population density American Indian population density Asian population density Other population density Hispanic origin population density
Population Age POP1_DST POP2_DST POP3_DST POP4_DST POP5_DST POP6_DST POP7_DST POP8_DST POP9_DST POP10_DST
Population density (0-5 years) Population density (6-11 years) Population density (12-17 years) Population density (18-24 years) Population density (25-34 years) Population density (35-44 years) Population density (45-54 years) Population density (55-64 years) Population density (65-74 years) Population density (over 75 years)
Householder Age AGEH1_DST AGEH2_DST AGEH3_DST
Density: age of householder under 25 years Density: age of householder under 25-34 years Density: age of householder under 35-44 years
Criminal Incident Data Association Using the OLAP Technology
Attribute name AGEH4_DST AGEH5_DST AGEH6_DST
Description Density: age of householder under 45-54 years Density: age of householder under 55-64 years Density: age of householder over 65 years
Housing units density Occupied housing units density Vacant housing units density Density: owner occupied housing unit with mortgage Density: owner occupied housing unit without mortgage Density: owner occupied condominiums Density: housing unit occupied by owner Density: housing unit occupied by renter
Density: occupied structure with 1 unit detached Density: occupied structure with 1 unit attached Density: occupied structure with 2 unit Density: occupied structure with 3-9 unit Density: occupied structure with 10+ unit Density: occupied structure trailer Density: occupied structure other
Income PCINC_97 MHINC_97 AHINC_97
Per capita income Median household income Average household income
School Enrollment ENRL1_DST ENRL2_DST ENRL3_DST ENRL4_DST ENRL5_DST ENRL6_DST ENRL7_DST
School enrollment density: public preprimary School enrollment density: private preprimary School enrollment density: public school School enrollment density: private school School enrollment density: public college School enrollment density: private college School enrollment density: not enrolled in school
Work Force CLS1_DST CLS2_DST
Density: private for profit wage and salary worker Density: private for non-profit wage and salary worker
25
26
S. Lin and D.E. Brown
Attribute name CLS3_DST CLS4_DST CLS5_DST CLS6_DST CLS7_DST
Description Density: local government workers Density: state government workers Density: federal government workers Density: self-employed workers Density: unpaid family workers
Expenses on alcohol and tobacco: per household Expenses on apparel: per household Expenses on education: per household Expenses on entertainment: per household Expenses on food: per household Expenses on medicine and health: per household Expenses on housing: per household Expenses on personal care: per household Expenses on reading: per household Expenses on transportation: per household Expenses on alcohol and tobacco: per capita Expenses on apparel: per capita Expenses on education: per capita Expenses on entertainment: per capita Expenses on food: per capita Expenses on medicine and health: per capita Expenses on housing: per capita Expenses on personal care: per capita Expenses on reading: per capita Expenses on transportation: per capita
(c) Distance attributes Name D_Church D_Hospital D_Highway D_Park D_School
Description Distance to the nearest church Distance to the nearest hospital Distance to the nearest highway Distance to the nearest park Distance to the nearest school
Names: A New Frontier in Text Mining 1
2
Frankie Patman and Paul Thompson 1
Language Analysis Systems, Inc. 2214 Rock Hill Rd., Herndon, VA 20170 [email protected] 2 Institute for Security Technology Studies Dartmouth College, Hanover, NH 03755 [email protected]
Abstract. Over the past 15 years the government has funded research in information extraction, with the goal of developing the technology to extract entities, events, and their interrelationships from free text for further analysis. A crucial component of linking entities across documents is the ability to recognize when different name strings are potential references to the same entity. Given the extraordinary range of variation international names can take when rendered in the Roman alphabet, this is a daunting task. This paper surveys existing technologies for name matching and for accomplishing pieces of the cross-document extraction and linking task. It proposes a direction for future work in which existing entity extraction, coreference, and database name matching technologies would be harnessed for cross-document coreference and linking capabilities. The extension of name variant matching to free text will add important text mining functionality for intelligence and security informatics toolkits.
1 Introduction Database name matching technology has long been used in criminal investigations [1], counter-terrorism efforts [2], and in a wide variety of government processes, e.g., the processing of applications for visas. With this technology a name is compared to names contained in one or more databases to determine whether there is a match. Sometimes this matching operation may be a straightforward exact match, but often the process is more complicated. Two names may not match exactly for a wide variety of reasons and yet still refer to the same individual [3]. Often a name in a database comes from one field of a more complete database record. The values in other fields, e.g., social security number, or address, can be used to help match names which are not exact matches. The context from the complete record helps the matching process. In this paper we propose the design of a system that would extend database name matching technology to the unstructured realm of free text. Over the past 15 or so years the federal government has funded research in information extraction, e.g., the Message Understanding Conferences [4], Tipster [5], and Automatic Content
Extraction [6]. The goal of this research has been to develop the technology to extract entities, events, and their interrelationships, from free text so that the extracted entities and relationships can be stored in a relational database, or knowledgebase, to be more readily analyzed. One subtask during the last few years of the Message Understanding Conference was the Named Entity Task in which personal and company names, as well as other formatted information, was extracted from free text. The system proposed in this paper would extract personal and company names from free text for inclusion in a database, an information extraction template, or automatically marked up XML text [7]. It would expand link analysis capabilities by taking into account a broad and more realistic view of the types of name variation found in texts from diverse sources. The sophisticated name matching algorithms currently available for matching names in databases are equally suited to matching name strings drawn from text. Analogous to the way in which the context of a full database record can assist in the name matching process, in the free text application, the context of the full text of the document can be used not only to help identify and extract names, but also to match names, both within a single document and across multiple documents.
2 Database Name Matching Name matching can be defined as the process of determining whether two name strings are instances of the same name. It is a component of entity matching but is distinct from that larger task, which in many cases requires more information than a name alone. Name matching serves to create a set of candidate names for further consideration—those that are variants of the query name. ‘Al Jones’, for example, is a legitimate variant of ‘Alfred Jones,’ ‘Alan Jones,’ and ‘Albert Jones.’ Different processes from those involved in name matching will often be required to equate entities, perhaps relation to a particular place, organization, event, or numeric identifier. However, without a sufficient representation of a name (the set of variants of the name likely to occur in the data), different mentions of the same entity may not be recognized. Matching names in databases has been a persistent and well-known problem for years [8]. In the context of the English-speaking world alone, where the predominant model for names is a given name, an optional middle name, and a surname of AngloSaxon or Western European origin, a name can have any number of variant forms, and any or all of these forms may turn up in database entries. For example, Alfred James Martin can also be A. J. Martin; Mary Douglas McConnell may also be Mary Douglas or Mary McConnell or Mary Douglas-McConnell; Jack Crowley and John Crowley may both refer to the same person; the surnames Laury and Lowrie can have the same pronunciation and may be confused when names are taken orally; jSmith is a common typographical error entered for the name Smith. These familiar types of name variation pose non-trivial difficulties for automatic name matching, and numerous systems have been devised to deal with them (see [3]). The challenges to name matching are greatly increased when databases contain names from outside the Anglo-American context. Consider some common issues that arise with names from around the world.
Names: A New Frontier in Text Mining
29
In China or Korea, the surname comes first, before the given name. Some people may maintain this format in Western contexts, others may reverse the name order to fit the Western model, and still others may use either. The problem is compounded further if a Western given name is added, since there is no one place in the string of names where the additional name is required to appear. Ex: Yi Kyung Hee ~ Kyung Hee Yi ~ Kathy Yi Kyung Hee ~ Yi Kathy Kyung Hee ~ Kathy Kyung Hee Yi In some Asian countries, such as Indonesia, many people have only one name; what appears to be a surname is actually the name of the father. Names are normally indexed by the given name. Ex: former Indonesian president Abdurrahman Wahid is Mr. Abdurrahman (Wahid being the name of his father). A name from some places in the Arab world may have many components showing the bearer’s lineage, and none of these is a family name. Any one of the name elements other than the given name can be dropped. Ex: Aziz Hamid Salim Sabah ~ Aziz Hamid ~ Aziz Sabah ~ Aziz Hispanic names commonly have two surnames, but it is the first of these rather than the last that is the family name. The final surname (which is the mother’s family name) may or may not be used. Ex: Jose Felipe Ortega Ballesteros ~ Jose Felipe Ortega, but is less likely to refer to the same person as Jose Felipe Ballesteros There may be multiple standard systems for transliterating a name from a native script (e.g. Arabic, Chinese, Hangul, Cyrillic) into the Roman alphabet, individuals may make up their own Roman spelling on the fly, or database entry operators may spell an unfamiliar name according to their own understanding of how it sounds. Ex: Yi ~ Lee ~ I ~ Lie ~ Ee ~ Rhee Names may contain various kinds of affixes, which may be conjoined to the rest of the name, separated from it by white space or hyphens, or dropped altogether. Ex: Abdalsharif ~ Abd al-Sharif ~ Abd-Al-Sharif ~ Abdal Sharif; al-Qaddafi ~ Qaddafi Systems for overcoming name variation search problems typically incorporate one or more of (1) a non-culture-specific phonetic algorithm (like Soundex1 or one of its refinements, e.g. [9]); (2) allowances for transposed, additional, or missing characters; (3) allowances for transposed, additional or missing name elements and for initials and abbreviations; and (4) nickname recognition. See [10] for a recent example. Less commonly, culture-specific phonetic rules may be used. The most serious problem for name-matching software is the wide variety of naming conventions represented in modern databases, which reflects the multicultural composition of many societies. Name-matching algorithms tend to take a one-size-fits-all approach, either by underestimating the effects of cultural variation, 1
Soundex, the most well-known algorithm for variant name searching in databases, is a phonetics-based system patented in 1918. It was devised for use in indexing the 1910 U.S. census data. The system groups consonants into sets of similar sounds (based on American names reported at the time) and assigns a common code to all names beginning with the same letter and sharing the same sequence of consonant groups. Soundex does not accommodate certain errors very well, and groups many highly dissimilar names under the same code. See [11].
30
F. Patman and P. Thompson
or by assuming that names in any particular data source will be homogenous. This may give reasonable results for names that fit one model, but may perform very poorly with names that follow different conventions. In the area of spelling variation alone, which letters are considered variants of which others differs from one culture to the next. In transcribed Arabic names, for example, the letters “K” and “Q” can be used interchangeably; “Qadafi” and “Kadafi” are variants of the same name. This is not the case in Chinese transcriptions, however, where “Kuan” and “Quan” are most likely to be entirely different names. What constitutes similarity between two name strings depends on the culture of origin of the names, and typically this must be determined on a case-by-case basis rather than across an entire data set. Language Analysis Systems, Inc. (LAS) has implemented a number of approaches to coping with the wide array of multi-cultural name forms found in databases. Names are first submitted to an automatic analysis process, which determines the most likely cultural/linguistic origin of the name (or, at the discretion of the user, the culture of origin can be manually chosen). Based on this determination, an appropriate algorithm or set of rules is applied to the matching process. LAS technologies include culturally sensitive search systems and processes for generating variants of names, among others. Some of the LAS technologies are briefly discussed below. Automatic Name Analysis: The name analysis system (NameClassifier¹) contains a knowledge base of information about name strings from various cultures. An input name is compared to what is known about name strings from each of the included cultures, and the probability of the name’s being derived from each of the cultures is computed. The culture with the highest score is assigned to the input name. The culture assignment is then used by other technologies to determine the most appropriate name-matching strategy. NameVariantGenerator¹: Name variant generation produces orthographic and syntactic variants of an input string. The string is first assigned a culture of origin through automatic name analysis. Culture-specific rules are then applied to the string to produce a regular expression. The regular expression is compared to a knowledge base of frequency information about names drawn from a database of over 750,000,000 names. Variant strings with a high enough frequency score are returned in frequency-ranked order. This process creates a set of likely variants of a name, which can then be used for further querying and matching. NameHunter¹: NameHunter¹ is a search engine that computes the similarity of two name strings based on orthography, word order, and number of elements in the string. The thresholds and parameters for comparison differ depending on the culture assignment of the input string. If a string from the database has a score that exceeds the thresholds for the input name culture, the name is returned. Returns are ranked relative to each other, so that the highest scoring strings are presented first. NameHunter allows for noisy data; thresholds can be tweaked by the user to control the degree of noise in returns. MetaMatch¹: MetaMatch¹ is a phonetic-based name retrieval system. Entry strings are first submitted to automatic name analysis for a culture assignment. Strings are then transformed to phonetic representations based on culture-specific rules, which are then stored in the database along with the original entry. 
Query strings are similarly processed, and the culture assignment is retained to determine the particular
Names: A New Frontier in Text Mining
31
parameters and thresholds for comparison. A similarity algorithm based on linguistic principles is used to determine the degree of similarity between query and entry strings [12]. Returns are presented in ranked order. This approach is particularly effective when name entries have been drawn from oral sources, such as telephone conversations. NameGenderizer¹: This module returns the most likely gender for a given name based on frequency of assignment of the name to males or females. A major advantage of the technologies developed by LAS is that a measure of similarity between name forms is computed and used to return names in order of their degree of similarity to the query term. An example of the effectiveness of this approach over a Soundex search is provided in Fig.1 in the Appendix.
3 Named Entity Extraction The task of named entity recognition and extraction is to identify strings in text that represent names of people, organizations, and places. Work in this area began in earnest in the mid-eighties, with the initiation of the Message Understanding Conferences (MUC). MUC is largely responsible for the definition of and specifications for the named entity extraction task as it is understood today [4]. Through MUC-6 in 1995, most systems performing named entity extraction were based on hand-built patterns that recognized various features and structures in the text. These were found to be highly successful, with precision and recall figures reaching 97% and 96%, respectively [4]. However, the systems were trained exclusively on English-language newspaper articles with a fixed set of domains, leaving open the question of how they would perform on other text sources. Bikel et al. [13] found that rules developed for one newswire source had to be adapted for application to a different newswire service, and that English-language rules were of little use as a starting point for developing rules for an unrelated language like Chinese. These systems are labor-intensive and require people trained in text analysis and pattern writing to develop and maintain rule sets. Much recent work in named entity extraction has focused on statistical/ probabilistic approaches (e.g., [14], [15], [13], [16]). Results in some cases have been very good, with F-measure scores exceeding 94%, even for systems gathering information from the least computationally expensive sources, such as punctuation, dictionary look-up, and part-of-speech taggers [15]. Borthwick et al. [14] found that by training their system on outputs tagged by hand-built systems (such as SRA’s NameTag extractor), scores improved to better than 97%, exceeding the F-measure scores of hand-built systems alone, and rivaling scores of human annotators. These results are very promising and suggest that named entity extraction can be usefully applied to larger tasks such as relation detection and link analysis (see, for example, [17]).
32
F. Patman and P. Thompson
4 Intra- and Inter-document Coreference The task of determining coreference can be defined as “the process of determining whether two expressions in natural language refer to the same entity in the world,” [18]. Expressions handled by coreference systems are typically limited to noun phrases of various types—including proper names—and pronouns. This paper will consider only coreference between proper names. For a human reader, coreference processes take place within a single document as well as across multiple documents when more than one text is read. Most coreference systems deal only with coreference within a document (see [19], [20], [21], [18], [22]). Recently, researchers have also begun work on the more difficult task of crossdocument coreference ([23], [24], [25]). Bagga [26] offers a classification scheme for evaluating coreference types and systems for performing coreference resolution, based in part on the amount of processing required. Establishing coreference between proper names was determined to require named entity recognition and generation of syntactic variants of names. Indeed, the coreference systems surveyed for this paper treat proper name variation (apart from synonyms, acronyms, and abbreviations) largely as a syntactic problem. Bontcheva et al., for example, allow name variants to be an exact match, a word token match that ignores punctuation and word order (e.g., “John Smith” and “Smith, John”), a first token match for cases like “Peter Smith” and “Peter,” a last token match for e.g., “John Smith” and “Smith,” a possessive form like “John’s,” or a substring in which all word tokens in the shorter name are included in the longer one (e.g., “John J. Smith” and “John Smith”). Depending on the text source, name variants within a single document are likely to be consistent and limited to syntactic variants, shortened forms, and synonyms, such as nicknames.2 One would expect intra-document coreference results for proper names under these circumstances to be fairly good. Bontcheva et al. [19] obtained precision and recall figures ranging from 94%-98% and 92%-95%, respectively, for proper name coreferences in texts drawn from broadcast news, newswire, and newspaper sources.3 Bagga and Baldwin [23] also report very good results (F-measures up to 84.6%) for tests of their cross-document coreference system, which compares summaries created for extracted coreference chains. Note, however, that their reported research looked only for references to entities named "John Smith," and that the focus of the cross-document coreference task was maintaining distinctions between different entities with the same name. Research was conducted exclusively on texts from the New York Times. Nevertheless, their work demonstrates that context can be effectively used for disambiguation across documents. Ravin and Kazi [24] focus on both distinguishing different entities with the same name and merging variant names 2
3
Note, however, that even within a document inconsistencies are not uncommon, especially when dealing with names of non-European origin. A Wall Street Journal article appearing in January 2003 referred to Mohammed Mansour Jabarah as Mr. Jabarah, while Khalid Sheikh Mohammed was called Mr. Khalid. When items other than proper names are considered for coreference, scores are much lower than those reported by Bontcheva et al. for proper names. The highest F-measure score for coreference at the MUC-7 competition was 61.8%. This figure includes coreference between proper names, various types of noun phrases, and pronouns.
Names: A New Frontier in Text Mining
33
referring to a single entity. They use the IBM Context Thesaurus to compare the contexts in which similar names from different documents are found. If there is enough overlap in the contextual information, the names are assumed to refer to the same entity. Their work was also limited to articles from the New York Times and the Wall Street Journal, both of which are edited publications with a high degree of internal consistency. Across documents from a wide variety of sources, consistent name variants cannot be counted on, especially for names originating outside the Anglo/Western European tradition. In fact, the many types of name variation commonly found in databases can be expected. A recent web search on Google for texts about Muammar Qaddafi, for example, turned up thousands of relevant pages under the spellings Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi, and Al Qaddafi (and these are only a few of the variants of this name known to occur). A coreference system that can be of use to agencies dealing with international names must be able to recognize name strings with this degree of variation as potential instances of a single name. Cross-document coreference systems currently suffer from the same weakness as most database name search systems. They assume a much higher degree of source homogeneity than can be expected in the world outside the laboratory, and their analysis of name variation is based on an Anglo/Western European model. For the coreference systems surveyed here, recall would be a considerable problem within a multi-source document collection containing non-Western names. However, with an expanded definition of name variation, constrained and supplemented by contextual information, these coreference technologies can serve as a starting point for linking and disambiguating entities across documents from widely varying sources.
5 Name Text Mining Support for Visualization, Link Analysis, and Deception Detection Commercial and research products for visualization and link analysis have become widely available in recent years, e.g., Hyperbolic Tree, or Star Tree [27], SPIRE [28], COPLINK [29], and InfoGlide [30]. Visualization and link analysis continues to be an active area of on-going research [31]. Some current tools have been incorporated into systems supporting intelligence and security informatics. For example, COPLINK [29] makes use of several visualization and link analysis packages, including i2’s [32] Analyst Notebook. Products such as COPLINK and InfoGlide also support name matching and deception detection. These tools make use of sophisticated statistical record linkage, e.g. [33], and have well developed interfaces to support analysts [32, 29]. Chen et al. [29] note that COPLINK Connect has the built-in capability for partial and phonetic-based name searches. It is not clear from the paper, however, what the scope of coverage is for phonetically spelled names, or how this is implemented. Research software and commercial products have been developed, such as those presented in [34, 30], which include modules that detect fraud in database records. These applications’ foci model ways that criminals, or terrorists, typically alter records to disguise their identity. The algorithms used by these systems could be
34
F. Patman and P. Thompson
augmented by taking into account a deeper multi-cultural analysis of names, as discussed in section 2.
6 Procedure for a Name Extraction and Matching Text Mining Module In this section a procedure is presented for name extraction and matching within and across documents. This algorithm could be incorporated in a module that would work with an environment such as COPLINK. The basic algorithm is as follows. Within document: 1. Perform named entity extraction. 2. Establish coreference between name mentions within a single document, creating an equivalence class for each named entity. 3. Discover relations between equivalence classes within each document 4. Find the longest canonical name string in each equivalence class. 5. Perform automatic name analysis on canonical names using NameClassifier; retain culture assignment. 6. Generate variant forms of canonical names according to culture-specific criteria using NameVariantGenerator. Across documents: 7. For each culture identified during name analysis, match sets of canonical name variants belonging to that culture against each other; for each pair of variant sets considered, if there are no incompatible (non-matching) members in the sets, mark as potential matches (e.g., Khalid bin (son of) Jamal and Khalid abu (father of) Jamal would be incompatible). 8. For potential name set matches, use a context thesaurus like that described in [24] to compare contexts where the names in the equivalence classes are found; if there are enough overlapping descriptions, merge the equivalence classes for the name sets (which will also expand the set of relations for the class to include those found in both documents); combine variant sets for the two canonical name strings into a single set, pruning redundancies. 9. For potential name set matches where overlapping contextual descriptions do not meet the minimum threshold, mark as a potential link, but do not merge. 10. Repeat process from #7 on for each pair of variant sets, until no further comparisons are possible. This algorithm could be implemented within a software module of a larger text mining application. The simplest integration of this algorithm would be as a module that extracted personal names from free text and stored the extracted names and relationships in a database. As discussed by [7], it would also be possible to use this algorithm to annotate the free text, in addition to creating database entries. This automatic markup would provide an interface for an analyst which would show not only the entities and their relationships, but also preserve the context of the surrounding text.
Names: A New Frontier in Text Mining
35
7 Research Issues This paper proposes an extension of linguistically-based, multi-cultural database name matching functionality to the extraction and matching of names from full text documents. To accomplish such an extension implies an effective integration of database and document retrieval technology. While this has been an on-going research topic in academic research [35, 36] and has received attention from major relational database vendors such as Oracle, Sybase, and IBM, effective integration has not yet been achieved, in particular in the area of intelligence and security informatics [37]. Achieving the sophistication of database record matching for names extracted from free text implies advances in text mining [38, 39, 40, 41]. One useful structure for supporting cross document name matching would be an authority file for named entities. Library catalogs maintain authority files which have a record for each author, showing variant names, pseudonyms, and so on. An authority file for named entity extraction could be built which would maintain a record for each entity. The record could start with information about the entity extracted from database records. When the named entity was found in free text, contextual information about the entity could be extracted and stored in the authority file with an appropriate degree of probability in the accuracy of the information included. For example, a name followed by a comma-delimited parenthetical expression, is a reasonably accurate source of contextual information about an entity, e.g., “X, president of Y, resigned yesterday”. A further application of linguistic/cultural classification of names could be to tracking interactions between groups of people where there is a strong association between group membership and language. For example, an increasing number of police reports in which both Korean and Cambodian names are found in the same documents might indicate a pattern in Asian crime ring interactions. Finally, automatic recognition of name gender could be used to support the process of pronominal coreference. Work is underway to provide a quantitative comparison of key-based name matching systems (such as Soundex) with other approaches to name matching. One of the hindrances to effective name matching system comparisons is the lack of generally accepted standards for what constitutes similarity between names. Such standards are difficult to establish in part because the definition of similarity changes from one user community to the next. A standardized metric for the evaluation of degrees of correlation of name search results, and a means for using this metric to measure the usefulness of different name search technologies is sorely needed. This paper has focused on personal name matching. Matching of other named entities, such as organizations, is also of interest for intelligence and security informatics. While different matching algorithms are needed, extending company name matching, or other entity matching, to free text will also be useful. One promising research direction integrating database, information extraction, and document retrieval that could support effective text mining of names is provided by work on XIRQL [7].
36
F. Patman and P. Thompson
8 Conclusion Effective tools exist for multi-cultural database name matching and this technology is becoming available in analytic tool kits supporting intelligence and security informatics. The proportion of data of interest to intelligence and security analysts that is contained in databases, however, is very small compared to the amount of data available in free text and audio formats. The extension of name extraction and matching to free text and audio will add important text mining functionality for intelligence and security informatics toolkits.
Taft, R.L.: Name Search Techniques. Special Rep. No. 1. Bureau of Systems Development, New York State Identification and Intelligence System, Albany (1970) Verton, D.: Technology Aids Hunt for Terrorists. Computer World, 9 September (2002) Borgman, C.L., Siegfried, S.L.: Getty’s Synoname and Its Cousins: A Survey of Applications of Personal Name-Matching Algorithms. Journal of the American Society for Information Science, Vol. 43 No. 7. (1992) 459–476 Grishman, R., Sundheim, B.: Message Understanding Conference – 6: A Brief History. In: th Proceedings of the 16 International Conference on Computational Linguistics. Copenhagen (1999) DARPA. Tipster Text Program Phase III Proceedings. Morgan Kaufmann, San Francisco (1999) National Institute of Standards and Technology. ACE-Automatic Content Extraction Information Technology Laboratories. http://www.itl.nist.gov/iad/894.01/tests/ace/index.htm (2000) Fuhr, N.: XML Information Retrieval and Extraction [to appear] Hermansen, J.C.: Automatic Name Searching in Large Databases of International Names. Georgetown University Dissertation, Washington, DC (1985) Holmes, D., McCabe, M.C.: Improving Precision and Recall for Soundex Retrieval. In: Proceedings of the 2002 IEEE International Conference on Information Technology – Coding and Computing. Las Vegas (2002) Navarro, G., Baeza-Yates, R., Azevedo Arcoverde, J.M.: Matchsimile: A Flexible Approximate Matching Tool for Searching Proper Names. Journal of the American Society for Information Science and Technology, Vol. 54 No. 1 (2003) 3–15 Patman, F., Shaefer, L.: Is Soundex Good Enough for You? On the Hidden Risks of Soundex-Based Name Searching. Language Analysis Systems, Inc., Herndon (2001) Lutz, R., Greene, S.: Measuring Phonological Similarity: The Case of Personal Names. Language Analysis Systems, Inc., Herndon (2002) Bikel, D.M., Schwartz, R., Weischedel, R.M.: An Algorithm that Learns What’s in a Name. Machine Learning, Vol. 34 No. 1-3. (1999) 211–231 Borthwick, A., Sterling, J., Agichtein, E., Grishman, R.: NYU: Description of the MENE Named Entity System as Used in MUC-7. In: Proceedings of the Seventh Message Understanding Conference. Fairfax (1998) Baluja, S., Mittal, V.O., Sukthankar, R.: Applying Machine Learning for High Performance Named-Entity Extraction. Pacific Association for Computational Linguistics (1999) Collins, M.,: Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted th Perceptron. In: Proceedings of the 40 Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 489–496
Names: A New Frontier in Text Mining
37
17. Zelenko, D., Aone, C., Richardella, A.: Kernel Methods for Relation Detection Extraction. Journal of Machine Learning Research [to appear] 18. Soon, W.M., Ng, H.T., Lim, D.C.Y.: A Machine Learning Approach to Coreference Resolution of Noun Phrases. Association for Computational Linguistics (2001) 19. Bontcheva, K., Dimitrov, M., Maynard, D., Tablin, V., Cunningham, H.: Shallow Methods for Named Entity Coreference Resolution. TALN (2002) 20. Hartrumpf, S.: Coreference Resolution with Syntactico-Semantic Rules and Corpus Statistics. In: Proceedings of CoNLL-2001. Toulouse (2001) 137–144 21. Ng, V., Cardie, C.: Improving Machine Learning Approaches to Coreference Resolution. th In: Proceedings of the 40 Annual Meeting of the Association for Computational Linguistics. Philadelphia (2002) 104–111 22. McCarthy, J.F., Lehnert, W.G.: Using Decision Trees for Coreference Resolution. In: Mellish, C. (ed.): Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (1995) 1050–1055 23. Bagga, A., Baldwin, B.: Entity-Based Cross-Document Coreferencing Using the Vector th Space Model. In: Proceedings of the 36 Annual Meeting of the Association for th Computational Linguistics and the 17 International Conference on Computational Linguistics (1998) 79–85 24. Ravin, Y., Kazi, Z. Is Hillary Rodham Clinton the President? Disambiguating Names Across Documents. In: Proceedings of the ACL’99 Workshop on Coreference and Its Applications (1999) 25. Schiffman, B., Mani, I., Concepcion, K.J. : Producing Biographical Summaries : th Combining Linguistic Knowledge with Corpus Statistics. In: Proceedings of the 39 Annual Meeting of the Association for Computational Linguistics (2001) 450–457 26. Bagga, A.: Evaluation of Coreferences and Coreference Resolution Systems. In: Proceedings of the First International Conference on Language Resources and Evaluation (1998) 563–566 27. Inxight. A Research Engine for the Pharmaceutical Industry. http://www.inxight.com 28. Hetzler, B., Harris, W.M., Havre, S., Whitney, P.: Visualizing the Full Spectrum of Document Relationships. In: Structures and Relations in Knowledge Organization. th Proceedings of the 5 International ISKO Conference. ERGON Verlag, Wurzburg (1998) 168–175 29. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, Vol. 46 No. 1 (2003) 30. InfoGlide Software. Similarity Search Engine: The Power of Similarity Searching. http://www.infoglide.com/content/images/whitepapers.pdf(2002) 31. American Association for Artificial Intelligence Fall Symposium on Artificial Intelligence and Link Analysis (1998) 32. i2. Analyst’s Notebook. http://www.i2.co.uk/Products/Analysts_Notebook (2002) 33. Winkler, W.E.: The State of Record Linkage and Current Research Problems. Technical Report RR99/04. U.S. Census Bureau, http://www.census.gov/srd/papers/pdf/rr99-04.pdf 34. Wang, G., Chen, H., Atabakhsh, H.: Automatically Detecting Deceptive Criminal Identities [to appear] 35. Fuhr, N.: Probabilistic Datalog – A Logic for Powerful Retrieval Methods. In: Proceedings th of SIGIR-95, 18 ACM International Conference on Research and Development in Information Retrieval (1995) 282–290 36. Fuhr, N.: Models for Integrated Information Retrieval and Database Systems. IEEE Data Engineering Bulletin, Vol. 19 No. 1. (1996) 37. Hoogeveen, M., van der Meer, K.: Integration of Information Retrieval and Database Management in Support of Multimedia Police Work. 
Journal of Information Science, Vol. 20 No. 2 (1994) 38. Institute for Mathematics and Its Applications. IMA Hot Topics Workshop: Text Mining. http://www.ima.umn.edu/reactive/spring/tm.html (2000)
38
F. Patman and P. Thompson
39. KDD-2000 Workshop on Text Mining. The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston (2000) http:www2.cs.cmu.edu/~dunja/WshKDD2000.html 40. SIAM Text Mining Workshop. http://www.cs.utk.edu/tmw02 (2002) 41. Text-ML 2002 Workshop on Text Learning. The Nineteenth International Conference on Machine Learning ICML-2002. Sydney (2002)
Appendix: Comparison of LAS MetaMatch¹ Search Engine Returns with SQL-Soundex Returns
Fig. 1. These searches were conducted in databases containing common surnames found in the 1990 U.S. Census data. The surnames in the databases are identical. The MetaMatch database differs only in that the phonetic form of each surname is also stored. The exact match “Sadiq” th th was 54 in the list of Soundex returns. “Siddiqui” was returned by Soundex in 26 place. th “Sadik” was 109 .
Web-Based Intelligence Reports System Alexander Dolotov and Mary Strickler Phoenix Police Department 620 W. Washington Street, Phoenix, Arizona 85003 {alex.dolotov, mary.strickler}@phoenix.gov
Abstract. Two areas for discussion will be included in this paper. The first area targets a conceptual design of a Group Detection and Activity Prediction System (GDAPS). The second area describes the implementation of the WEBbased intelligence and monitoring reports system called the Phoenix Police Department Reports (PPDR). The PPDR System could be considered the first phase of a GDAPS System. The already operational PPDR system’s goal is to support data access to heterogeneous databases, provide a means to mine data using search engines, and to provide statistical data analysis with reporting capabilities. A variety of static and ad hoc statistical reports are produced with the use of this system for interdepartmental and public use. The system is scalable, reliable, portable and secured. Performance is supported on all system levels using a variety of effective software designs, statistical processing and heterogeneous databases/data storage access.
named to the “Data Maintaining and Reporting Subsystem” (DmRs), the “Group Detection Subsystem” (GTeS), and the “Activity Prediction Subsystem” (APreS). The first phase of the GDAPS System would be the PPDR System, which is currently operational within the Phoenix Police Department. This system is explained in detail in the remaining sections of this document. The DmRs subsystem (renamed from PPDR) supports access to heterogeneous databases using data mining search engines to perform statistical data analysis. Ultimately, the results are generated in report form. The GTeS subsystem would be designed to detect members of the targeted group or groups. In order to accomplish this, it would require monitoring communications between individuals using all available means. Intensity and duration of these communications can define relationships inside the group and possibly define the hierarchy of the group members. The GTeS subsystem would have to be adaptive enough to constantly upgrade information related to each controlled group, since every group has a life of its own. GTeS would provide the basic foundation for GDAPS. The purpose of the APreS subsystem is to monitor, in time, the intensity and modes of multiple groups’ communications by maintaining a database of all types of communications. The value of this subsystem would be the ability to predict groups’ activities based upon the historical correlation between abnormalities in the groups’ communication modes and intensities, along with any previous activities. APreS is the dynamic subsystem of GDAPS. To accelerate the GDAPS development, methodologies, already created for other industries, can be modified for use [1], [2], [3], [7]. Because of the complexity of the GDAPS system, a multi-phase approach to system development should be considered. Taking into account time and resources, this project can be broken down into manageable sub-projects with realistic development and implementation goals. The use of a multi-dimensional mathematical model will enable developers to assign values to different components, and to determine relationship between them. By using specific criteria, these values can be manipulated to determine the outcome under varying circumstances. The mathematical model, when optimized, will produce results that could be interpreted as “a high potential for criminal activity”. The multi-dimensional mathematical model is a powerful “forecasting” tool. It provides the ability to make decisions before a critical situation or uncertain conditions arise [4], [5], [6], [8]. Lastly, accumulated information must be stored in a database that is supported/serviced by a specific set of business applications. The following is a description of the PPDR system, the first phase of the Group Detection and Activity Prediction System (GDAPS).
2 Objectives A WEB-based intelligence and monitoring reports system called Phoenix Police Department Reports (PPDR) was designed in-house for use by the Phoenix Police Department (PPD). Even though this system was designed specifically for the Phoenix Police Department, it could easily be ported for use by other law enforcement agencies. Within seconds, this system provides detailed, comprehensive, and informative statistical reports reflecting the effectiveness and responsiveness of any division, for any date/time period, within the Phoenix Police Department. These reports are designed for use by all levels of management, both sworn and civilian, from police
chiefs' requests to public record requests. The statistical data from these reports provides information for use in making departmental decisions concerning such issues as manpower allocation, restructuring and measurement of work. Additionally, PPDR uses a powerful database mining mechanism, which would be valuable for the future development of the GDAPS System. In order to satisfy the needs of all users, the PPDR system is designed to meet the following requirements:
- maintain accurate and precise up-to-date information;
- use a specific mathematical model for statistical analysis and optimization [5], [6];
- perform at a high level with quick response times;
- support different security levels for different categories of users;
- be scalable and expandable;
- have a user-friendly presentation; and
- be able to easily maintain reliable and optimized databases and other information storage.
The PPDR system went into production in February 2002. This system contains original and effective solutions. It provides the capability to make decisions which will ultimately have an impact on the short- and long-term plans for the department, the level of customer service provided to the public, overall employee satisfaction and the organizational changes needed to achieve future goals. The PPDR system could be considered the first phase of a complex Intelligence Group Detection and Activity Prediction System.
3 Relationships to Other Systems and Sources of Information
3.1 Calls for Service
There are two categories of information used for the PPDR: calls for service data and text messages sent by Mobile Data Terminal (MDT) and Computer Aided Dispatch (CAD) users. Both sources of information are obtained from the Department's Computer Aided Dispatch and Mobile Data Terminal (CAD/MDT) System. The CAD/MDT System operates on three redundant Hewlett Packard (HP) 3000 N-Series computers. The data is stored in HP's proprietary Image database for six months. The Phoenix Police Department's CAD/MDT System handles over 7,000 calls for service daily from citizens of Phoenix. Approximately half of these calls require an officer to respond. The other half are either duplicates or calls where the caller is asking for general information or wishing to report a non-emergency incident.
Calls for Service data is collected when a citizen calls the emergency 911 number or the Department's crime stop number for service. A call entry clerk enters the initial call information into CAD. The address is validated against a street geobase, which provides information required for dispatching, such as the grid, the beat and the responsible precinct where the call originated. After all information is collected, the call is automatically forwarded to a dispatcher for distribution to an officer or officers in the field. Officers receive the call information on their Mobile Data Terminals (MDT). They enter the time they start on the call, arrive at the scene and the time they
complete the call. Each call for service incident is given a disposition code that relates to how an officer or officers handled the incident. Calls for service data for completed incidents are transferred to a SQL database on a daily basis for use in the PPDR System. Police officers and detectives use calls for service information for investigative purposes. It is often requested by outside agencies for court purposes or by the general public for their personal information. It is also used internally for statistical analysis.
3.2 Messages
Messages are text sent between MDT users, between MDT users and CAD users, and between CAD users. The MDT system uses a Motorola radio system for communications, which interfaces to the CAD system through a programmable interface computer. The CAD system resides on a local area network within the Police Department. The message database also contains the results of inquiries on persons, vehicles, or articles requested by officers in the field from their MDTs or by CAD users from any CAD workstation within the Department. Each message stored by the CAD system contains structured data, such as the identification of the message sender and the date and time sent, along with the free-form body of the message. Every twenty-four hours, more than 15,000 messages pass through the CAD System. Copies of messages are requested by detectives, police officers, the general public and court systems, as well as outside law enforcement agencies.
4 PPDR System Architecture The system architecture of the PPDR system is shown in Figure 1.
5 PPDR Structural WEB Design
PPDR has been designed with seven distinctive subsystems incorporated within one easy-to-access location. The subsystems are as follows: Interdepartmental Reports; Ad Hoc Reports; Public Reports; Messages Presentation; Update Functionality; Administrative Functionality; and System Security. Each subsystem is designed to be flexible as well as scalable, and each has the capability of being easily expanded or modified to satisfy user enhancement requests.
Fig. 1. PPDR Architecture
5.1 System Security
Security begins when a user logs into the system and is continuously monitored until the user logs off. The PPDR security system is based on the assignment of roles to each user through the Administrative function. Role assignment is maintained across multiple databases; each database maintains a set of roles for the PPDR system. Each role can be assigned to both a database object and a WEB functionality. This results in a user being able to perform only those WEB and database functions that are available to his/her assigned role. When a user logs onto the system, the userid and password are validated against the database security information. Database security does not use custom tables but rather database tables that contain encrypted roles, passwords, userids and logins. After a successful login, the role assignment is maintained at the WEB level in a secure state and remains intact during the user's session.
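A minimal sketch of the role-based check just described, written in Python against a generic SQL database; the users and user_roles tables and their columns are hypothetical stand-ins, since the production system keeps this information in the database engine's own encrypted security tables rather than custom tables:

import hashlib
import sqlite3

def authenticate(conn: sqlite3.Connection, userid: str, password: str) -> list:
    """Validate a login and return the roles granted to the user."""
    pw_hash = hashlib.sha256(password.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM users WHERE userid = ? AND password_hash = ?",
        (userid, pw_hash)).fetchone()
    if row is None:
        raise PermissionError("invalid userid or password")
    # The roles drive both database-object access and which WEB pages are shown.
    return [r[0] for r in conn.execute(
        "SELECT role FROM user_roles WHERE userid = ?", (userid,))]

def require_role(session_roles: list, needed: str) -> None:
    """Guard a WEB or database function behind an assigned role."""
    if needed not in session_roles:
        raise PermissionError("role '%s' required" % needed)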
The PPDR System has two groups of users: those that use the Computer Aided Dispatch System (CAD) and those that do not. Since most of the PPDR users are CAD users, it makes sense to keep the same userids and passwords for both CAD and PPDR. Using a scheduled Data Transfer System (DTS) process, CAD userids and passwords are relayed to the PPDR system on a daily basis, automatically initiating a database update process in PPDR. Non-CAD users are entered into the PPDR system through the Administrative (ADMIN) function. This process does not involve the DTS transfer, but is performed in real time by a designated person or persons with ADMIN privileges. Security for non-CAD users is identical to that of CAD users, including transaction logging that captures each WEB page access. In addition to transaction logging, another useful security feature is the storage of user history information at the database level. Anyone with ADMIN privileges can produce user statistics and historical reports upon request.

5.2 Regular Reports
In general, Regular Reports are reports that have a predefined structure based on input parameters entered by the user. In order to obtain the best performance and accuracy for these reports, the following technology has been applied: a special design of multiple databases which includes "summary" tables (see Section 6, Database Solutions); the use of cross-table reporting functionality, which allows a cross-table recordset to be created at the database level; and the use of a generic XML stream with client-side XSLT transformation, instead of ActiveX controls, for the creation of reports. Three groups of Regular Reports are available within the PPDR system: Response Time Reports, Calls for Service Reports and Details Reports.

Response Time Reports. Response Time Reports present statistical information regarding the average response time for calls for service, using data obtained from the CAD System. Response time is the period between the time an officer was dispatched on a call for service and the time the officer actually arrived on the scene. Response time reports can be produced on several levels, including but not limited to the beat, squad, precinct and even citywide level. Using input parameters such as date, time, shift, and squad area, a semi-custom report is produced within seconds. Figure 2 shows an example of the "Average Quarterly Response Time By Precinct" report for the first quarter of 2002. This report calculates the average quarterly response time for each police precinct based on the priorities assigned to the calls for service. The rightmost column (PPD) is the citywide average, again broken down by priority.

Fig. 2. Response Time Reports

Calls for Service Reports. Calls for Service Reports document the number of calls for service in a particular beat, squad, precinct area or citywide. These reports have many of the same parameters as the Response Time Reports. Some reports in this group are combination reports, displaying both the counts for calls for service and the average response time. Figure 3 shows an example of a "Monthly Calls for Service by Squad" report for the month of January 2002. This report shows a count of the calls for service for each squad area in the South Mountain precinct, broken down by calls that are dispatched and calls that are handled by a phone call made by the Callback Unit.
Fig. 3. Calls For Service Report
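As a rough illustration of the aggregation behind reports such as "Average Quarterly Response Time By Precinct", the sketch below runs one grouped SQL query in Python; the calls_for_service table and its columns are hypothetical stand-ins, not the actual PPDR schema:

import sqlite3

AVG_RESPONSE_SQL = """
    SELECT precinct,
           priority,
           AVG(strftime('%s', arrived_time) - strftime('%s', dispatched_time))
               AS avg_response_seconds
    FROM calls_for_service
    WHERE received_date BETWEEN :start AND :end
    GROUP BY precinct, priority
    ORDER BY precinct, priority
"""

def average_response_time(conn: sqlite3.Connection, start: str, end: str) -> list:
    """Average response time (dispatch to arrival) per precinct and priority."""
    return conn.execute(AVG_RESPONSE_SQL, {"start": start, "end": end}).fetchall()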
Details Reports. These reports are designed to present important details for a particular call for service. Details for a call for service include such information as call location, disposition code (action taken by the officer), radio code (type of call for service - burglary, theft, etc.), received time and responding officer(s). From a Detail Report, other pertinent information related to a call for service is obtained quickly with a click of the mouse. Other available information includes unit history information, a collection of data for all the units that responded to a particular call for service, such as the time the unit was dispatched, the time the unit arrived and what people or plates were checked.

5.3 AD HOC Reports
The AD HOC Reports subsystem provides the ability to produce "custom" reports from the calls for service data. To generate an AD HOC report, a user should have basic knowledge of SQL query search criteria as well as basic knowledge of the calls for service data. There are three major steps involved in producing an AD HOC report:
selecting report columns and selecting search criteria, both of which use an active dialog, and report generation, which uses XML/XSLT transformation.

Selecting Report Columns. The first page that is presented when entering the AD HOC Reports subsystem allows the user to choose, from a list of tables, the fields that are to be displayed in the desired report. OLAP functionality is used for accessing the database's schema, such as available tables and their characteristics, column names, formats, aliases and data types. The first page of the AD HOC Reports subsystem is shown in Fig. 4. A selection can be made for any required field by checking the left check box. Other options such as original value (Orig), count, average (Averg), minimum (Minim), and maximum (Maxim) are also available to the user. Count, average, minimum and maximum are only available for numeric fields. As an example, if a user requires a count of the number of calls for service, a check is required in the Count field. When the boxes are checked, the SELECT clause is generated as a DHTML script. For instance, if the selected fields for an Ad Hoc report are 'Incident Number', 'Date', 'Address' and 'Average of the Response Time' (all members of the Incidents table), the following SELECT clause will be generated:
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address', Incidents.Inc_Time_Rec AS 'Received time', Avg(Incidents.Inc_Time_Rec) AS 'Avg Of Received time' FROM Incidents
Syntax is maintained in the SELECT clause generation at the business logic level using COM+ objects.
Selecting Search Criteria. When all desired fields have been selected, the user clicks on "Submit" and the search criteria page (Fig. 5) is presented.
Fig. 4. Selecting Report Columns
The search criteria page allows the user to build the search criteria necessary for the generation of the desired report. Most available criteria and their combinations are available to the user (e.g., >, <, =), with valid values presented in the drop-down boxes. Options such as Grouped By, Ordered By, and Ascending and Descending order are also available to the user. The final query statement generated from the example above is as follows:
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address' FROM Incidents WHERE Incidents.Inc_Number = 20000077
All syntax and logic rules are applied to this page through COM+ objects and Microsoft Transaction Server (MTS) interfaces. When using the AD HOC report feature, a user may need to cross years to create the desired report. When requested, this requires accessing data stored in multiple databases: each year's worth of calls for service data is stored in a separate database. Generally, the date field is used to determine the correct database to access, but if the date is not part of the search criteria, the incident number (the number assigned to the calls for service record by the CAD system) can be used. The first character of the incident number determines the year of the call for service. In the example above, the incident number was used to determine the correct database to access by using special database validation procedures. These procedures return the correct FROM statement with the necessary modifications to capture the valid calls for service records. The modified SELECT statement for this query is as follows:
Fig. 5. Selecting Search Criteria
SELECT Incidents.Inc_Number AS 'Incident Number', Incidents.Inc_Date AS 'Date', Incidents.Inc_Location AS 'Address' FROM IncDB2002..Incidents WHERE Incidents.Inc_Number = 20000077
Section 6, Database Solutions, provides a more detailed discussion of how yearly data is stored in multiple databases and how these databases are accessed when creating a report that crosses years or accesses a previous year's data.
Report Generation. Report generation processing is similar to what was previously described. The returned record set is converted into an XML stream, which is then converted to DHTML using an XSLT script. The only difference is that reports generated using the AD HOC feature may result in a multiple-page response. Special page services for the client have been added to handle multiple pages. A final AD HOC report using the special page services, along with the required disclaimer, is shown in Fig. 6.

5.4 Public Reports
This group of reports is designed for public dissemination. The creation of these reports is the same as described in Section 5.2 (Details Reports). Users with a "public" role assignment can only access the reports in this group and no others. A "security" filter is applied to these reports to protect sensitive information from being distributed to the general public. This "security" filter is adjustable and can be modified as necessary if and when public information laws are changed.
Fig. 6. Final AD HOC Report
5.5 CAD/MDT Messages Reports The Computer Aided Dispatch/Mobile Data System (CAD/MDT) captures text messages sent between mobile units in the field and messages sent to and from other CAD users such as dispatchers and desk aides. In addition to text messages, every vehicle and person query is also captured. These messages are transferred from the CAD/MDT System to the PPDR System on a daily basis. CAD/MDT only stores one day’s worth of messages (about 200 MB) at one time while PPDR retains one full year of these messages. Each message block includes message type, identification of the sender and receiver of the message, date/time when the message was generated and the body of the message, which is usually unstructured text. Single or multiple messages can be requested for reviewing and reporting for investigative purposes. Messages can be retrieved by a number of input parameters such as date, mobile unit id, CAD user id and all units in a squad. Maintaining this data on the PPDR system presented a few major obstacles. These obstacles included providing a system design with minimal data storage, while maintaining acceptable database access response time. In addition, the system had to maximize the performance of the WEB page presentation of the results. The system design techniques used to minimize storage requirements consisted of the following:
- A SQL metadatabase contains consolidated tables with descriptors and pointers to each record of the stored message file.
- A database was built that maintains the relationships between the searchable subjects in the unstructured text records and the associated pointers. A searchable subject could be the sender or receiver of a message, a vehicle identification number (VIN), a last name, a first name or a date of birth. A "subject" table was created in the database for each searchable subject; the number of these tables can be changed, depending on the number of desired searchable subjects. All subject tables are related to a pivot table with the file descriptors and are populated at the same time the transition process is run (a star schema).
- The daily message file created from CAD/MDT is broken down into twelve (12) separate files that are compressed and stored without any internal changes.
- DTS is used to load the data from CAD/MDT (see details in Section 7, Data Transition).
In the current version of the PPDR system, only the senders and receivers of messages are searchable subjects. Future development is planned to include the capability to search using the other searchable subjects mentioned above. Reports are obtained using input parameters such as a date and time range, a single sender/receiver and/or a group of senders/receivers. The result of any search includes all text messages with complete details. An example of the Messages Search screen is shown in Fig. 7. Suppose a user wishes to view all messages between a CAD dispatcher and a mobile unit id of 512G for a 24-hour period on January 1, 2002. The following procedures occur in order to retrieve the requested records:
- After all parameters are entered, clicking on "Submit" creates and populates two session-specific global temporary tables. One of the tables is populated with the pointers and the relationships to the text message files.
- The text message files for the requested data are expanded into the designated expansion area.
- Only those messages that meet the search criteria are extracted from the expanded files and placed in a structured text file.
- The structured text file is then bulk copied into the second global temporary table.
The final step is obtaining a record set from the joined global temporary tables. This record set is retrieved as an XML stream using a combination of DHTML and XSLT scripting. The results for the above query are shown in Fig. 8.
Fig. 7. Messages Search Screen
Fig. 8. Messages by Date Time Period
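The retrieval steps above can be sketched in Python; the metadatabase table (message_index) and the archive layout are hypothetical stand-ins for the PPDR structures, and the real system returns the result as an XML stream rendered with XSLT rather than a Python list:

import sqlite3
import zipfile
from pathlib import Path

def search_messages(conn: sqlite3.Connection, archive_dir: Path,
                    sender: str, receiver: str, day: str) -> list:
    """Return all messages between sender and receiver for one day."""
    # Step 1: a session-scoped temporary table of pointers taken from the
    # message metadatabase (which archived file holds each message, and where).
    conn.execute("CREATE TEMP TABLE hits (file_name TEXT, line_no INTEGER)")
    conn.execute("INSERT INTO hits SELECT file_name, line_no FROM message_index "
                 "WHERE sender = ? AND receiver = ? AND msg_date = ?",
                 (sender, receiver, day))
    # Step 2: expand only the archived files that actually contain hits and
    # copy the matching lines into a second temporary table.
    conn.execute("CREATE TEMP TABLE found (msg_text TEXT)")
    for (file_name,) in conn.execute("SELECT DISTINCT file_name FROM hits").fetchall():
        with zipfile.ZipFile(archive_dir / (file_name + ".zip")) as zf:
            lines = zf.read(file_name).decode("utf-8", "replace").splitlines()
        wanted = [n for (n,) in conn.execute(
            "SELECT line_no FROM hits WHERE file_name = ?", (file_name,))]
        conn.executemany("INSERT INTO found VALUES (?)",
                         [(lines[n],) for n in wanted])
    # Step 3: return the joined record set; PPDR streams this as XML to the browser.
    return [r[0] for r in conn.execute("SELECT msg_text FROM found")]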
The above solution provides many benefits, such as the combination of the database with file storage for rapid retrieval; scalability, so that a search can be performed using multiple parameters; high performance, with as many as 10,000 records retrieved and presented within 30 seconds; compact storage, taking less than 350 MB for 365 files; and, lastly, convenient database and file maintenance.

5.6 Updater's Block
Data that is transferred from the CAD/MDT System to the PPDR System may require updating to correct erroneous fields. For example, an officer dispatched on a high-priority call for service does not always depress the arrival button on the MDT to record the time he/she arrives on the scene, due to the criticality of the call. Other times may not have been recorded correctly because high-priority calls are handled in an urgent manner. These inaccurate times may have a drastic effect on department statistics. The actual type of call and/or the priority of the call may also have changed from the time the call was received to its completion. A daily report is generated in the CAD/MDT System and given to the officer so that the correct associated times, call type, and priority can be recorded for the call. Since the data has already been transferred to the PPDR System, a special feature was designed to give a select group of users (referred to as "Updaters") the ability to update the incident after all the data on the report has been verified for accuracy. After the submission of the corrections, all recalculations are performed in the background for the database correction. The corrected data is displayed to the user on the "updater" screen. Each transaction is recorded as a special entry in the log file for future reference. An example of the "Edit and Update Response Time" screen that is used by the "updater" to make corrections to a call for service is shown in Fig. 9.

5.7 Administrative Functionality
The PPDR System has special Administrative (ADMIN) functionality for maintaining users and security. An Administrator is the only person who has the capability to add or delete users and to make changes to logins, names, roles and passwords. Users can change their passwords when they expire. User authenticity is synchronized with the CAD System: on a daily basis, valid user profile data from the CAD System is transferred to the PPDR System using DTS, and any changes made to a user's profile in the CAD System are automatically updated in the PPDR System. Any users requiring access to the PPDR System who are not CAD users can be added by an Administrator.
Fig. 9. Edit and Update Response Time
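Behind the screen in Fig. 9, an "updater" correction boils down to an update plus an audit entry, followed by background recalculations; a simplified sketch with hypothetical table names, not the actual PPDR routines:

import sqlite3
from datetime import datetime

def apply_correction(conn: sqlite3.Connection, incident_no: int, updater: str,
                     arrived_time: str, call_type: str, priority: int) -> None:
    """Correct a verified incident and record the change for future reference."""
    with conn:  # one transaction: the update and its log entry succeed or fail together
        conn.execute(
            "UPDATE calls_for_service "
            "SET arrived_time = ?, call_type = ?, priority = ? "
            "WHERE incident_no = ?",
            (arrived_time, call_type, priority, incident_no))
        conn.execute(
            "INSERT INTO update_log (incident_no, updated_by, updated_at, note) "
            "VALUES (?, ?, ?, ?)",
            (incident_no, updater, datetime.now().isoformat(),
             "corrected times/type/priority"))
        # The affected summary-table rows would be recalculated here in the background.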
6 Database Solutions
6.1 Multiple Databases
The PPDR System maintains multiple databases for the Regular, AD HOC and Details reports. The system archives 10 years' worth of data, and if all the data were consolidated in one database, WEB performance and reliability would be negatively affected. The multiple database schema is as follows:
- Data is accumulated annually, with each year's data stored in a separate database.
- All the schemas for each yearly database are identical.
- Multiple databases are named using the same naming convention (see the routing sketch after this list).
- All databases, with the exception of the current database, are static and do not require modifications or updates.
- The current year's database is dynamic and is populated on a daily basis using a DTS process. The current year database is the only database requiring maintenance.
- Previous years' databases can be restored from any available backup in case of system failure. The only data that needs to be maintained and backed up on a regular basis is the current year database.
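Under this naming convention, picking the yearly database for a query is a small helper; the sketch below mirrors the INCDB<year> convention and the incident-number rule from Section 5.3, with the digit-to-year mapping shown purely as an assumption:

from datetime import date

def database_for(requested=None, incident_no=None, current_year=2002):
    """Resolve which yearly database (INCDB<year>) a query should run against."""
    if requested is not None:
        return "INCDB%d" % requested.year
    if incident_no is not None:
        # The first character of the incident number encodes the year; the
        # mapping below is illustrative only - the real one lives in stored procedures.
        year_digit_map = {"1": current_year - 1, "2": current_year}
        return "INCDB%d" % year_digit_map[str(incident_no)[0]]
    raise ValueError("need a requested date or an incident number")

def union_view(years):
    """For a date range crossing years, union the identical yearly schemas."""
    return "\nUNION ALL\n".join(
        "SELECT * FROM INCDB%d..Incidents" % y for y in years)

print(database_for(requested=date(2001, 1, 1)))   # INCDB2001
print(database_for(incident_no=20000077))         # INCDB2002 under the assumed mapping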
Using the above schema, a user's query should be directed only to the database and associated tables for the year in which the data resides. If the query crosses databases, the tables are joined from the appropriate databases. Before a query is processed, the appropriate database is determined from the date or dates requested by the user. For instance, if the requested date is 01/01/2001, a stored procedure is called that determines that the database named INCDB2001 will be accessed. If the requested date is a date range crossing years, a view of the union of all existing databases is processed. Every January 1st, a new database is created and the previous database is renamed appropriately. In addition, all associated views and tables from the previous year are updated. A special procedure automates this yearly renaming process.

6.2 Summary Tables
Most of the user-requested queries retrieve statistical information related to calls for service data, such as average response time, weighted averages by police precinct and citywide comparisons. Approximately 7,000 calls for service are added to the PPDR system on a daily basis. Performing the many calculations on a month's worth, or even a year's worth, of data could pose severe performance issues and affect WEB response time. To overcome any potential performance issues, special summary tables are created and maintained for each database. Most statistical information that is requested by the user is pre-calculated on a daily basis, while the calls for service data is loaded using DTS and time is not a critical issue. The pre-calculated data is grouped by various keys, such as police precinct, shift, type of call, and priority, into a summary table. This table is four-dimensional. By using the summary table, WEB response time remains extremely fast for complicated statistical queries. In the summary table, each numbered field represents a value for an appropriate shift. Each record is then related to the appropriate priority for the calls for service. Since there are three possible priorities, each subdivision could have up to three records for a particular date. Below is an example of the multidimensional summary table structure:
Fig. 10. Multidimensional Summary Table Structure
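The daily pre-calculation that fills such a summary table can be sketched as one grouped insert, run while the DTS load completes; the table and column names are hypothetical stand-ins for the real four-dimensional structure:

import sqlite3

SUMMARY_SQL = """
    INSERT INTO summary (report_date, precinct, call_type, priority, shift,
                         call_count, avg_response_seconds)
    SELECT received_date, precinct, call_type, priority, shift,
           COUNT(*),
           AVG(strftime('%s', arrived_time) - strftime('%s', dispatched_time))
    FROM calls_for_service
    WHERE received_date = :day
    GROUP BY received_date, precinct, call_type, priority, shift
"""

def precalculate_day(conn: sqlite3.Connection, day: str) -> None:
    """Run once per daily load so that report queries read the small summary
    table instead of re-aggregating thousands of raw call records."""
    with conn:
        conn.execute("DELETE FROM summary WHERE report_date = :day", {"day": day})
        conn.execute(SUMMARY_SQL, {"day": day})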
6.3 File Storage and Messages Metadatabase
The PPDR System contains text messages sent between officers in the field, between officers and dispatchers, and between other CAD users. If all the messages were to be kept in the database, it would require gigabytes of database storage, and maintenance on such a large database would be difficult. In order to overcome some of the problems with keeping and maintaining such a massive amount of data, a special solution using a database in combination with file storage was created. The file storage could reside anywhere on the network and would not have to reside on the same server as the database. On a daily basis, the message file is transferred from the CAD/MDT System to the PPDR System. The file is broken down into twelve separate files, which are zipped and compressed with an average ratio of 10. Breaking the main file into twelve files allows for parallel processing when loading the data. The original daily file has an average size of approximately 200MB; the twelve zipped files are approximately 650KB. Relationships are built between the records and file entities, such as date, time, vehicle identification and person's name, while loading using DTS. Each entity creates a relational record in the message metadatabase. When a user performs a query, three steps are involved in returning the results. First, the relationship is determined in the metadatabase. Secondly, all appropriate messages are extracted from the multiple archived files. Lastly, the output is presented using an XML output stream. There are many benefits to using a solution involving the combination of a database with file storage. These benefits include:
- Data can be searched by any of the consolidated entities;
- Improved performance, in that as many as 10,000 records can be retrieved and displayed in as little as 30 seconds;
- Compact file storage, with 365 daily files (originally about 200MB each) using less than 350 MB of space; and
- More convenient maintenance of the database and stored files.
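The daily handling of the message file just described - split it into twelve parts, compress each part, and record the searchable entities in the metadatabase - can be sketched as follows; the paths, the pipe-delimited message format and the message_index table are assumptions for illustration:

import sqlite3
import zipfile
from pathlib import Path

def archive_daily_messages(raw_file: Path, out_dir: Path,
                           conn: sqlite3.Connection, parts: int = 12) -> None:
    """Split the daily CAD/MDT message file, compress each part, and index it."""
    lines = raw_file.read_text(encoding="utf-8", errors="replace").splitlines()
    chunk = -(-len(lines) // parts)  # ceiling division
    for i in range(parts):
        part_lines = lines[i * chunk:(i + 1) * chunk]
        part_name = "%s_part%02d.txt" % (raw_file.stem, i)
        with zipfile.ZipFile(out_dir / (part_name + ".zip"), "w",
                             zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(part_name, "\n".join(part_lines))
        # Record the searchable entities (here just sender and receiver, assumed
        # to be the first two pipe-delimited fields) against file and line number.
        rows = [(ln.split("|")[0], ln.split("|")[1], part_name, n, raw_file.stem)
                for n, ln in enumerate(part_lines) if ln.count("|") >= 2]
        conn.executemany(
            "INSERT INTO message_index (sender, receiver, file_name, line_no, msg_date) "
            "VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()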
7 Data Transition
The PPDR system contains CAD/MDT data from two different sources: the Incidents data source and the Messages data source. PPDR uses a different data transition process for each data source; both are performed using DTS. The data transition process for both data sources incorporates the idea of "pre-calculation". "Pre-calculation" refers to a process in which operations on the data, such as grouping, relationship building and calculations, occur while the data is loaded into the PPDR database. This "pre-calculation" is absolutely necessary for maintaining superior WEB performance. The diagram below (Fig. 11) depicts the DTS process loading data from the Incidents data source into the multiple databases:
Fig. 11. Incidents Data Loading Diagram
The above DTS process creates multidimensional summary tables that contain the “pre-calculations” as the data is loaded. These summary tables reduce the need to perform calculations with every user request, thus greatly reducing system response time.
Fig. 12. Messages Data Loading Diagram
The second DTS process (Fig. 12) loads the Messages data source. This process breaks the daily file into twelve separate files, archives the twelve files and builds the relationships necessary to retrieve the files when requested. This solution of breaking the main message file into twelve files, all loaded at the same time by parallel processes, greatly reduces total load time.
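The parallel load of the twelve daily files can be sketched with a process pool standing in for the parallel DTS tasks; load_one_part is a placeholder for the per-file decompress/parse/bulk-copy work, not an actual PPDR routine:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_one_part(part: Path) -> int:
    """Placeholder: decompress, parse, build metadatabase relationships and
    bulk copy one of the twelve archived files; returns the rows loaded."""
    return 0

def load_all_parts(archive_dir: Path) -> int:
    """Load the twelve parts concurrently instead of one after another."""
    parts = sorted(archive_dir.glob("*.zip"))
    with ProcessPoolExecutor(max_workers=len(parts) or 1) as pool:
        return sum(pool.map(load_one_part, parts))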
8 Conclusion
The PPDR System could be considered the first phase of the Group Detection and Activity Prediction System (GDAPS). PPDR supports data access to heterogeneous databases, data mining using search engines, and the ability to produce diverse statistical reports. Using PPDR as a base, development of GDAPS becomes realistic. By itself, the PPDR System is a powerful WEB-based monitoring and decision-making support system. It produces a variety of statistical, interdepartmental, public, ad hoc and other informative reports. The system is scalable, reliable and portable. Performance is supported on all system levels, including:
- the presentation level – XML/XSLT to DHTML transformation;
- the business level – the use of COM+ objects on the WEB server level;
- effective statistical processing algorithms [5], [6], [7];
- the database level – the use of multiple databases with multi-dimensional table combinations, along with compressed file storage; and
- the transition level – the use of parallelism and business calculations.
The PPDR system is highly secure, with database-driven security, and has administrative and versatile logging functionality. The architecture of the PPDR system is easily expanded to add new features and functionality for future enhancements.
References
1. Flikop, Ziny. "Uncertainty and Management of Cellular Telephone Networks," Proceedings of the International Fuzzy Engineering Symposium '91, Yokohama, Japan.
2. Flikop, Ziny. "Management System for Cellular Telephone Network," Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Communications, September 1991, London, UK.
3. Flikop, Ziny. "Some Problems with the Design of Self-Learning Open-Loop Control Systems," European Journal of Operational Research, vol. 81, 1995.
4. Flikop, Ziny. "Input Set Decomposition and Open-Loop Control in Telecommunications Networks," Proceedings of the 1995 American Control Conference, Seattle, 1995.
5. Dolotov, Alexander. "Effective Algorithms for Statistical Processing," Proceedings of the "Heterogeneous Systems Controls" Conference, Kiev, Ukraine, 1983.
6. Dolotov, Alexander. "Experiments Design in a Process of Statistical Modeling Optimization," Journal "Systems and Machines Control," Ukraine Academy of Science, vol. 4, 1973.
7. Dolotov, Alexander, Sadovskiy, Vladimir. "Integrated Information Support System for Design & Management in a Construction Industry," Proceedings of "Computer Methods in Civil Engineering," No. 3, Warsaw, Poland, 1997.
8. Dolotov, Alexander. "A Method for the Distribution and Allocation Tasks Resolving," Articles "Operations Research and Computing Systems," Vol. 24, Kiev State University, Kiev, Ukraine, 1984.
Authorship Analysis in Cybercrime Investigation
Rong Zheng, Yi Qin, Zan Huang, and Hsinchun Chen
Artificial Intelligence Lab, Department of Management Information Systems
The University of Arizona, Tucson, Arizona 85721, USA
{rong, yiqin, zhuang, hchen}@eller.arizona.edu
Abstract. Criminals have been using the Internet to distribute a wide range of illegal materials globally in an anonymous manner, making criminal identity tracing difficult in the cybercrime investigation process. In this study we propose to adopt the authorship analysis framework to automatically trace identities of cyber criminals through messages they post on the Internet. Under this framework, three types of message features, including style markers, structural features, and content-specific features, are extracted and inductive learning algorithms are used to build feature-based models to identify authorship of illegal messages. To evaluate the effectiveness of this framework, we conducted an experimental study on data sets of English and Chinese email and online newsgroup messages. We experimented with all three types of message features and three inductive learning algorithms. The results indicate that the proposed approach can discover real identities of authors of both English and Chinese Internet messages with relatively high accuracies.
criminal identity tracing in cyberspace and allow investigators to prioritize their tasks and focus on the major criminals. In this paper we propose to adopt the authorship analysis framework in the context of cybercrime investigation to help law enforcement agencies deal with the identity-tracing problem. We extract three types of features that are identified in authorship analysis research from online illegal messages and use inductive learning techniques to build feature-based models to perform automatic message author identification. We are specifically interested in evaluating the general effectiveness of this approach and the effects of using different types of features in the cybercrime investigation context. Because of the multinational nature of cybercrime, we are also interested in evaluating the applicability of the proposed framework in a multilingual context. The remainder of the paper is organized as follows. Section 2 surveys the existing work on authorship analysis and summarizes major types of text features and techniques. Section 3 describes our proposed cyber criminal identity-tracing framework in detail and presents the specific research questions that we aim to address. Section 4 presents an experimental study that answers the research questions raised in Section 3, based on several experimental data sets. We conclude the article in Section 5 by summarizing our research contributions and pointing out future directions.
2 Literature Review
2.1 Authorship Analysis
Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions on its authorship. More specifically, the problem can be broken down into three sub-fields [35]:
• Author Identification determines the likelihood of a particular author having written a piece of work by examining other works produced by that author.
• Author Characterization summarizes the characteristics of an author and generates the author profile based on his/her work. Some of these characteristics include gender, educational and cultural background, and language familiarity.
• Similarity Detection compares multiple pieces of work and determines whether or not they are produced by a single author, without actually identifying the author.
Authorship analysis has many applications. It is rooted in the author attribution problem of historical literature. The most famous example is its success in resolving the debate on Shakespeare's work [10]. Similarly, authorship analysis techniques have assisted in settling the authorship debates over the Federalist Papers [23] and the Unabomber Manifesto [13]. Another application domain is software forensics [14], in which the author of a malicious program is identified or characterized by analyzing executable code or source code in order to investigate the crime and prevent future attacks. Since our work is mainly concerned with text, we will not discuss software forensics in this paper.
Generally, the major topics in past authorship analysis research are feature selection and the techniques used to facilitate the analysis process. In the following subsections we review the literature from these two perspectives.
2.2 Feature Selection
The essence of authorship analysis is the formation of a set of features, or metrics, that remain relatively constant for a large number of writings created by the same person. In other words, a set of writings from one author would exhibit greater similarity in terms of these features than a set of writings from different authors.
Initially researchers identified authors by categorizing different sets of words used by different authors. One example is the authorship analysis of Shakespeare's work [10]. Elliot and Valenza [10] conducted a study that compared the poems of Shakespeare with those of Edward de Vere, the leading candidate as the true author of the works credited to Shakespeare. Modal testing based on keyword usage was conducted. However, the effectiveness of this approach is limited by the fact that word usage is highly dependent on the text topic. For discrimination purposes we need "content-free" features; features of this kind are also called style markers. The basic idea came from Yule's work, in which features such as sentence length [39] and vocabulary richness [40] were proposed. Mosteller and Wallace [23] extracted function words (or word-based style markers) such as 'while' and 'upon' to clarify the disputed work, the Federalist Papers. Later Burrows developed a set of more than 50 high-frequency words, which were also tested on the Federalist Papers. Tomoji [32] used a 74-word set to analyze Dickens's narrative style. Binongo and Smith [2] used the frequency of occurrence of 25 prepositions to discriminate between Oscar Wilde's plays and essays. Holmes [17] analyzed the use of "shorter" words (words of 2 or 3 letters) and "vowel words" (words beginning with a vowel). Such word-based methods can require intensive effort to select the most appropriate set of words that best distinguishes a given set of authors [16]. In summary, the word-based approach is highly author and language dependent and is difficult to apply to a wide range of applications. In order to avoid these problems, Baayen [4] proposed the use of syntax-based features. This approach is based on statistical measures and methods applied to rewrite rules that appear in a syntactically annotated corpus. The authors demonstrated that syntax-based features can be more reliable in authorship identification problems than word-based features. Chaniak [8] discussed some statistical techniques for processing such syntactic information. Rudmen [29] concluded that almost 1,000 style markers had been used in authorship analysis applications. There is no agreement on a best set of style markers. As feature sets became larger, conventional methods gave way to more powerful analytical methods such as machine learning.
2.3 Techniques for Authorship Analysis
In early studies most analytical methods used in authorship analysis were statistical. The basic idea is that different authors have different text compositions, which are characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be considered a statistical hypothesis test or a classification problem. Most early work used statistical methods to facilitate authorship analysis. Brainerd [1] used Chi-squared and related distributions to perform lexical data analysis. An important statistical test was introduced in Thisted and Efron's paper [30]. Farringdon [12] first applied
the CUSUM technique in authorship analysis. Francis [11] gave a summary of early statistical approaches used to resolve the Federalist Papers dispute. Baayen [3] proposed a linguistic evaluation of diverse statistical models of word frequency. Although statistical methods achieved much success in authorship analysis, particular methods have their constraints. For example, Holmes [17] found that the CUSUM analysis was unreliable because the stability of those characteristics over multiple texts is not warranted. Moreover, the prediction capability of statistical methods, such as attributing a new text to a certain author, is limited.
The advent of powerful computers instigated the extensive use of machine learning techniques in authorship analysis. A Bayesian model was applied by Mosteller and Wallace [24] to test the Federalist Papers. Based on their work, McCallum and Nigam [25] compared two different naïve Bayesian models for text classification. While naïve Bayesian models for text classification still have structural limitations, a number of more powerful methods have also been applied in text categorization and authorship analysis. The most representative one is the neural network. Tweedie [33] used a standard feedforward artificial neural network, also called a multi-layer perceptron, to attribute authorship to the disputed Federalist Papers. The network they used had three hidden layers and two output layers; it was trained with a conjugate gradient method and was tested with the k-fold cross-validation approach. The result was consistent with the results of previous work on this topic. Another neural network, the radial basis function (RBF) network, was used by Lowe and Matthews [21]. They applied RBF to investigate the extent of Shakespeare's collaboration with his contemporary, John Fletcher, on various plays. More recently, Khmelev [19] presented a technique for authorship attribution based on a simple Markov chain, the key idea of which is using the probabilities of the subsequent letters as features. Diederich [9] introduced the Support Vector Machine (SVM) to this problem. Experiments were carried out to identify the writings of 7 target authors from a set of 2,652 newspaper articles written by several authors covering three topic areas. This method detected the target authors in 60%-80% of the cases. A new area of study is the identification of electronic message authors based on message contents. de Vel et al. [35] used SVM as a learning algorithm to classify 150 email documents from 3 authors. In this experiment an average accuracy of 80% was achieved. Generally speaking, machine learning methods achieved higher accuracies than statistical methods; they can model the underlying distribution of personal word usage with a large set of features.
Based on the previous review, we present a taxonomy for authorship analysis research in Table 1. Table 2 shows some example studies in the field. Some general conclusions can be drawn from Table 2. First, most previous studies addressed an authorship identification problem, which initiated this research domain and has kept attracting researchers and new techniques (e.g., the disputes on Shakespeare's work and the Federalist Papers). Second, style markers were used most frequently as features. The reason is that style markers are general content-free features in most types of literature.
Finally, statistical approaches were used extensively in this field, and machine learning methods have been introduced more recently.
Table 1. Taxonomy for Authorship Analysis

Problems
P1 - Author identification: determines the likelihood of a particular author having written a piece of work by examining other works produced by the same author.
P2 - Author characterization: summarizes the characteristics of an author and determines the author profile based on his/her works.
P3 - Similarity detection: compares multiple pieces of work and determines whether or not they are produced by a single author, without actually identifying the author.

Features
M1 - Style markers: content-free features such as frequency of function words, total number of punctuation marks, average sentence length and vocabulary richness.
M2 - Structural features: such as use of a greeting statement, position of requoted text, use of a farewell statement, etc.
M3 - Content-specific features: such as frequency of keywords, special characters for special content, etc.

Analysis techniques
A1 - Manual analysis: uses manual examination and analysis of a set of works to draw conclusions about the author's characteristics such as background, personality, and technical skill.
A3 - Statistical analysis: uses statistical methods for calculating document statistics based on metrics, in order to analyze the characteristics of the author or to examine the similarity between various pieces of work.
A4 - Machine learning: uses classification methods to predict the author of a piece of work based on a set of metrics.

Table 2. Previous Studies on Authorship Analysis
3 Applying Authorship Analysis in Cybercrime Investigation
The large volume of cyberspace activity and its anonymous nature make cybercrime investigation extremely difficult. One of the major tasks in cybercrime investigation is tracing the real identity behind an illegal document. Normally the investigator tries to attribute a new illegal message to a particular criminal in order to obtain new clues. Conventional ways of dealing with this problem rely on manual work, which is largely limited by the sheer volume of messages and constantly changing author IDs. Automatic authorship analysis should therefore be highly valuable to cybercrime investigators. Figure 1 depicts the typical process of cybercrime identity tracing using the authorship analysis approach.
Fig. 1. A Framework of Cybercrime Investigation with Authorship Analysis
Assume that an investigator has a collection of illegal documents created by a particular suspected cyber criminal. In the first step the feature extractor runs on those documents and generates a set of style features, which are used as the input to the learning engine. A feature-based model is then created as the outcome of the learning engine. This model can identify whether a newly found illegal document was written by that suspect under different IDs or names. This information helps the investigator focus his/her effort on a small scope of illegal documents and effectively keep track of the more important cyber criminals.
Cyberspace texts have several characteristics which differ from those of literary works or published articles and which make authorship analysis in cyberspace a challenge to researchers. One big problem is that cyber documents are generally short. This means that many language-based features successfully used in previous studies may not be appropriate (e.g., vocabulary richness). This may also give rise to the weak performance
of some techniques such as the Naïve Bayesian approach [35]. Also, the structure or composition style used in a cyber document is often different from that of normal text documents, possibly because of the different purposes of these two kinds of writing. In other words, the style of cyber documents is less formal and the vocabulary is limited and less stable. These factors might also lead to the ineffectiveness of previous feature selection heuristics. However, as a user spends more time in cyberspace, a more stable writing style will be formed. Some particular features, such as structural layout traits, unusual language usage, illegal content markers, and sub-stylistic features, may be useful in forming a suitable feature collection in the cybercrime investigation context.
Another new challenge is that cyber criminals can use any language to conduct crime. In fact, most large crime groups and terrorist organizations operate internationally. They use the Internet to formulate plans, raise funds, spread propaganda, and communicate. For example, Osama bin Laden was known to use the Internet as his communication medium. Applying authorship analysis in a multilingual context is therefore becoming an important issue.
Our study aimed to answer the following research questions:
1. Will authorship analysis techniques be applicable in identifying authors in cyberspace?
2. What are the effects of using different types of features in identifying authors in cyberspace?
3. Will the authorship analysis framework be applicable in a multilingual context?
4 Experiment Evaluation To address the proposed research questions, we created a testbed and conducted several experiments which are described in detail in this section. 4.1 Testbed Two English data sets and one Chinese data set were collected for the purpose of this study. The English data sets consist of an email message collection and an Internet newsgroup message collection. The Chinese data set consists of a Bulletin Board System (BBS) message collection. English Email Messages. The first dataset contains 70 email messages provided by 3 students. Each of the students randomly selected 20-30 messages from their primary email account. The content of these messages covered a variety of topics, ranging from school work to research activities to personal interests. The purpose of introducing different topics is to minimize the impact of content similarity which may contribute to high accuracy. English Internet Newsgroup Messages. The second dataset contains 153 Internet newsgroup messages. Over a time period of two weeks, we observed the activities of several USENET newsgroups involving computer software trading. Based on average
number of reads, posts, and unique user IDs per day, we identified the three most popular newsgroups relevant to our research. Through observation we were able to spot illegal sales of pirated software in all three newsgroups. Figure 2 is an example of such a message.
From: "The Collectaholic" <[email protected]> Subject: Software Titles - Only $3.00 Newsgroups: misc.forsale.computers.other.software Date: 2002-10-04 12:07:22 PST All CDs are the original CDs in working condition and come with all theoriginal documentation. Shipping is $3.00 for first title and $.50 for each additional title. $1.00 Titles PC World The Best of MediaClips: sounds and graphics that can be used onmedia projects… $3.00 Titles Boggle: classic word game Canon Publishing Suite: layout, drawing & photo editing tools
Fig. 2. Illegal Internet Newsgroup Message
We then identified the 9 most active users (each represented by a unique ID and email address) who frequently posted messages in these newsgroups. Messages posted by these users were carefully checked to determine whether or not they indicated illegal activities. Between 8 and 30 illegal messages per user were downloaded for use in the experiment.
Chinese BBS Messages. The Chinese BBS dataset consisted of 70 messages downloaded from the most famous Chinese BBS in the US, bbs.mit.edu. These messages were randomly selected from messages posted by three authors. Tables 3, 4 and 5 summarize the composition of the three datasets.
Table 3. English Email Dataset

Author    T1    T2    T3    Number of Messages
RZ         8     9     3    20
JX         2    18     8    28
YQ         3     5    14    22
Grand Total Number of Messages    70

T1 = number of messages under school work
T2 = number of messages under research activity
T3 = number of messages under personal interest
Table 4. English Internet Newsgroup Dataset

Author    N1    N2    N3    Number of Messages
DLW        1    28     1    30
KD        10     9     1    20
dCN        3    17     0    20
DB         0    16     4    20
SW        18     0     2    20
DLB        0     6     2     8
DLM        0    17     0    17
JKYS       9     0     0     9
JZ         0     9     0     9
Grand Total Number of Messages    153

N1 = number of messages from misc.forsale.computers.other.software
N2 = number of messages from misc.forsale.computers.pc-specific.software
N3 = number of messages from misc.forsale.computers.mac-specific.software
Table 5. Chinese BBS Dataset

Author    Total Number of Messages
QQ        20
SKY       28
SEMA      22
Grand Total Number of Messages    70
4.2 Implementation
We describe the implementation details of the two core components of our proposed authorship analysis framework: feature selection and inductive learning techniques.
Feature selection. Based on the review of previous studies on text and email authorship analysis, along with the specific characteristics of the messages in our datasets, we selected a large number of features that were potentially useful for identifying message authors. Three types of features were used: style markers, structural features, and content-specific features. We used the 122 function words and 48 markers suggested by de Vel [35]. Another 28 of the most common function words from the Oxford English Dictionary and 7 other markers were also included. Two additional structural features and a set of content-specific features were also added in our experiment; these are shown in Table 6.
Techniques. We adopted a classification approach to predict the authorship of each message. Three learning algorithms (classifiers) were used in the experiments for comparison purposes: decision trees [28], backpropagation neural networks [22], and support vector machines [7]. Among the various symbolic learning algorithms developed over the past decade, ID3 and its variants have been tested extensively and shown to rival other machine learning techniques in predictive power [6].
Table 6. Feature selection for authorship analysis in our experiment

Additional style markers:
- Total number of words in subject
- Total number of characters in subject (S)
- Total number of upper-case characters in words in subject/S
- Total number of punctuations in subject/S
- Total number of whitespace characters in subject/S
- Total number of lines
- Total number of characters

Additional structural features:
- Types of signature (name, title, organization, email, URL, phone number)
- Uses special characters (e.g. --------) to separate message body and signature

Content-specific features:
- Has a price in subject
- Position of price in message body
- Has a contact email address in message body
- Has a contact URL in message body
- Has a contact phone number
- Uses a list of products
- Position of product list in message body
- Indicates product categories in list
- Format of product list
ID3 is a decision-tree building algorithm developed by Quinlan [28]. It adopts a divide-and-conquer strategy and the entropy measure for object classification. In this experiment, we implemented an extension of the ID3 algorithm, the C4.5 algorithm, to deal with attributes with continuous values.
Backpropagation neural networks have been extremely popular for their unique learning capability [38] and have been shown to perform well in different applications such as medical applications [34]. They were also introduced to authorship analysis by Kjell [20] and Tweedie [33]. We implemented a typical backpropagation neural network consisting of three layers: an input layer, an output layer and a hidden layer [26], in which the input layer nodes are style features and the output nodes are author identities. Based on the general heuristic, the number of hidden layer nodes is typically set to (number of input nodes + number of output nodes)/2. In this study, because the number of input nodes is quite large, we modified the heuristic to (number of input nodes + number of output nodes)/10 and achieved relatively high accuracies in our experiments.
Support vector machine (SVM) is a novel learning machine first introduced by Vapnik [37]. It is based on the Structural Risk Minimization principle from computational learning theory. Because SVM is capable of handling millions of inputs and does not require feature selection [7], it has been used extensively in authorship analysis, which normally involves hundreds or thousands of input features [9]. For the experiment we used an SVM program written by Hsu and Lin [15] which was publicly available on the Internet.
All three algorithms have been applied in authorship analysis. In general, SVM and neural networks have shown better performance than decision trees [9], but most testbeds have been newspaper articles, such as the Federalist Papers. Because of the differences between online messages and formal articles mentioned in Section 3, we still needed to test the performance of these three algorithms on our testbed.
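The experimental setup can be reproduced in outline with present-day tools; the sketch below substitutes scikit-learn's decision tree, multi-layer perceptron and SVM classifiers for the authors' C4.5 implementation, custom backpropagation network and the Hsu-Lin SVM program, and assumes a numeric feature matrix X (style markers, structural and content-specific features per message) and author labels y have already been extracted:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X: np.ndarray, y: np.ndarray, folds: int = 30) -> dict:
    """Cross-validated accuracy for the three classifier families compared in the paper."""
    hidden = max(1, (X.shape[1] + len(set(y))) // 10)  # the modified (inputs + outputs)/10 heuristic
    models = {
        "decision tree": DecisionTreeClassifier(),
        "neural network": make_pipeline(StandardScaler(),
                                        MLPClassifier(hidden_layer_sizes=(hidden,),
                                                      max_iter=2000)),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    }
    return {name: cross_val_score(model, X, y, cv=folds).mean()
            for name, model in models.items()}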
4.3 Experiment Design We designed the experimental procedure as follows. Three experiments were conducted on the newsgroup dataset with one classifier at a time: first, 205 style markers were used; 9 structural features were added in the second run; and 9 content-specific features were added in the third run. For the email dataset and the Chinese BBS dataset, two experiments were conducted with one classifier at a time: 205 style markers (67 for the Chinese BBS dataset) were first used as input to the classifiers, and 9 structural features were then added for a second run. A 30-fold cross-validation testing method was used in all experiments. To evaluate prediction performance we use the accuracy, recall, and precision measures commonly adopted in the information retrieval and authorship analysis literature [36]. Accuracy indicates the overall prediction performance of a particular classifier and is defined for our experiments as in (1):
Accuracy = (number of messages whose author was correctly identified) / (total number of messages)    (1)
For a particular author, we use precision and recall to measure the effectiveness of our approach for identifying messages that were written by that author. We report the average precision and recall for all authors in a data set. The precision and recall are defined as in (2) and (3):
Precision = (number of messages correctly assigned to the author) / (total number of messages assigned to the author)    (2)

Recall = (number of messages correctly assigned to the author) / (total number of messages written by the author)    (3)
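A minimal sketch of how these measures could be computed under a cross-validation protocol follows. It uses scikit-learn and macro-averages the per-author precision and recall of equations (2) and (3); the data, variable names, and use of cross_val_predict are illustrative assumptions, not the authors' implementation.

# Sketch: accuracy and macro-averaged per-author precision/recall under
# 30-fold cross-validation, mirroring equations (1)-(3). Placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_authors, msgs_per_author, n_features = 10, 60, 205
y = np.repeat(np.arange(n_authors), msgs_per_author)   # placeholder author labels
X = rng.random((y.size, n_features))                    # placeholder style-marker matrix

clf = SVC(kernel="linear")
y_pred = cross_val_predict(clf, X, y, cv=30)            # 30-fold cross-validation

print("accuracy :", accuracy_score(y, y_pred))                                       # eq. (1)
print("precision:", precision_score(y, y_pred, average="macro", zero_division=0))    # average of eq. (2)
print("recall   :", recall_score(y, y_pred, average="macro", zero_division=0))       # average of eq. (3)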
4.4 Results & Analysis Based on the three datasets we prepared, we conducted experiments according to this design. The results are presented in Table 7, and detailed discussions follow in this sub-section. Techniques comparison. We observed that SVM and neural networks achieved better performance than the C4.5 decision tree algorithm in terms of precision, recall, and accuracy for all three datasets in our experiment. For example, using style markers on the email dataset, C4.5, neural networks, and SVM achieved accuracies of 74.29%, 81.11%, and 82.86%, respectively. SVM also achieved consistently higher accuracy, precision, and recall than the neural networks, although the performance differences between SVM and neural networks were relatively small. Our results were
generally consistent with previous studies, in that neural networks and SVM typically had better performance than decision tree algorithms [9]. The good performance of SVM is also consistent with its success in many other fields [18, 27]. Feature selection. As illustrated in Table 7, the authorship prediction performance varied significantly with different combinations of features. Pair-wise t-test results indicated that:
• Using style markers and structural features outperformed using style markers only: we achieved significantly higher accuracies for all three datasets (p-values all below 0.05) by adopting the structural features. This result might be explained by the fact that an author's consistent writing patterns show up in a message's structural features.
• Using style markers, structural features, and content-specific features did not outperform using style markers and structural features: using content-specific features as additional features did not improve the authorship prediction performance significantly (p-value of 0.3086). We think this is because authors of illegal messages typically deliver diverse content in their messages, and little additional information can be derived from the message contents to determine authorship.
In response to our second research question, we conclude that the structural features help to achieve higher accuracies, while content-specific features do not improve the performance of online message authorship identification. We also observed that high accuracies, ranging from 71% to 89%, were obtained using only style markers as input features for the English datasets. These results indicate that style markers contain a large amount of information about the writing styles of online messages and were surprisingly robust in predicting authorship. Chinese dataset performance. We noticed a significant drop in prediction performance for the Chinese BBS dataset compared with the English datasets. For example, when using style markers only, C4.5 achieved average accuracies of 86.28% and 74.29% for the English newsgroup and email datasets, while for the Chinese dataset it achieved an average accuracy of only 54.83%. The reason is that only 67 Chinese style markers were used in our current experiments, significantly fewer than the 205 style markers used with the English datasets. We also observed that when structural features were added, all three algorithms achieved relatively high precision, recall, and accuracy (from 71% to 83%) for the Chinese dataset. Considering the significant language differences, our proposed approach to the problem of online message identity tracing appears promising in a multilingual context.
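The pair-wise t-tests reported above compare the accuracies obtained with different feature combinations for the same classifier and dataset. A sketch of how such a comparison could be run is shown below; the two arrays of per-fold accuracies are made-up placeholders, not results from the paper.

# Sketch: paired, directional t-test on per-fold accuracies for two feature sets,
# in the spirit of the comparison above. Placeholder values stand in for 30 folds.
import numpy as np
from scipy import stats

acc_style_only = np.array([0.83, 0.86, 0.84, 0.88, 0.85, 0.87] * 5)
acc_style_plus_struct = np.array([0.90, 0.91, 0.89, 0.93, 0.92, 0.90] * 5)

t_stat, p_two_sided = stats.ttest_rel(acc_style_plus_struct, acc_style_only)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t={t_stat:.2f}, one-sided p={p_one_sided:.4f}")   # p < 0.05 suggests a real improvement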
5 Conclusion & Future Work Our experiments demonstrated that, with a set of carefully selected features and an effective learning algorithm, we were able to identify the authors of Internet newsgroup and email messages with reasonably high accuracy. We achieved average prediction accuracies of 80%–90% for email messages, 90%–97% for newsgroup messages, and 70%–85% for Chinese Bulletin Board System (BBS) messages. Significant performance improvement was observed when structural features were added on top of style markers. We also observed that SVM outperformed the other two classifiers on all occasions. The experimental results indicate a promising future for applying automatic authorship analysis approaches in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities. This study will be expanded in the future to include more authors and messages to further demonstrate the scalability and feasibility of our proposed approach. Also, more illegal messages will be incorporated into our testbed. The current approach will also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child-pornography images. Another, more challenging, future direction is to automatically generate an optimal feature set specifically suited to a given dataset. We believe this will yield better performance across the different datasets.
Acknowledgment. This project has primarily been funded by the following grants:
• National Science Foundation, Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003;
• National Institute of Justice, "COPLINK: Database Integration and Access for a Law Enforcement Intranet," #97-LB-VX-K023, July 1997-January 2000.
We would like to thank Robert Chang from the Taiwan National Intelligence Office for initiating this project. We would also like to thank the officers from the Tucson Police Department (Detective Tim Petersen, Sergeant Jennifer Schroeder, and Detective Daniel Casey) for their assistance with the project. Members of the Artificial Intelligence Laboratory who directly contributed to this paper are Michael Chau, Jie Xu, and Wingyan Chung.
References
1. B. Brainerd, Statistical analysis of lexical data using Chi-squared and related distributions. Computers and the Humanities, 9, 161–178, (1975).
2. Binongo and Smith, A Study of Oscar Wilde's Writings, Journal of Applied Statistics, vol. 26-7, p. 781, (1999).
3. R. H. Baayen, Statistical Models for Word Frequency Distributions: A Linguistic Evaluation. Computers and the Humanities, 26, 347–363, (1993).
4. R. H. Baayen, H. van Halteren, and F. J. Tweedie, Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. Literary and Linguistic Computing, 2, 110–120, (1996).
5. R. Bosch and J. Smith, Separating hyperplanes and the authorship of the disputed Federalist papers, American Mathematical Monthly, 105(7): 601–608, (1998).
6. H. Chen, G. Shankaranarayanan, A. Iyer, and L. She, A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing, Journal of the American Society for Information Science, Volume 49, Number 8, Pages 693–705, (1998).
7. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, (2000).
8. E. Charniak, Statistical Language Learning. MIT Press, Cambridge, (1993).
9. J. Diederich, J. Kindermann, E. Leopold, and G. Paass, Authorship Attribution with Support Vector Machines, Applied Intelligence, (2000).
10. W. Elliot and R. Valenza, Was the Earl of Oxford the True Shakespeare? Notes and Queries, 38: 501–506, (1991).
11. I. S. Francis, An Exposition of a Statistical Approach to the Federalist Dispute. In J. Leed (Ed.), The Computer and Literary Style (pp. 38–79). Kent, Ohio: Kent State University Press, (1966).
12. J. M. Farringdon, Analyzing for Authorship: A Guide to the Cusum Technique. Cardiff: University of Wales Press, (1996).
13. D. Foster, Author Unknown: On the Trail of Anonymous, Henry Holt, New York, (2000).
14. A. Gray, P. Sallis, and S. MacDonell, Software forensics: Extending authorship analysis techniques to computer programs, in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pages 1–8, (1997).
15. C. W. Hsu and C. J. Lin, A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13, pages 415–425, (2002).
16. D. I. Holmes and R. S. Forsyth, The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10, 111–127, (1995).
17. D. I. Holmes, The Evolution of Stylometry in Humanities. Literary and Linguistic Computing, 13, 3, (1998).
18. T. Joachims, Text Categorization with Support Vector Machines, in: Proceedings of the European Conference on Machine Learning (ECML), (1998).
19. D. V. Khmelev and F. J. Tweedie, Using Markov Chains for Identification of Writers, Literary and Linguistic Computing, vol. 16, no. 4, pp. 299–307, (2001).
20. B. Kjell, Authorship Determination Using Letter-pair Frequency Features with Neural Network Classifiers. Literary and Linguistic Computing, 9, 119–124, (1994).
21. D. Lowe and R. Matthews, Shakespeare vs. Fletcher: A Stylometric Analysis by Radial Basis Functions. Computers and the Humanities, 29, 449–461, (1995).
22. R. P. Lippmann, An Introduction to Computing with Neural Networks, IEEE Acoustics, Speech and Signal Processing Magazine, 4(2): 4–22, (1987).
23. F. Mosteller and D. L. Wallace, Inference and Disputed Authorship: The Federalist, Addison-Wesley, Reading, Mass., (1964).
24. F. Mosteller and D. L. Wallace, Applied Bayesian and Classical Inference: The Case of the Federalist Papers, in the 2nd edition of Inference and Disputed Authorship: The Federalist, Springer-Verlag, (1964).
25. A. McCallum and K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop on "Learning for Text Categorization", (1998).
26. J. Moody and J. Utans, Architecture Selection Strategies for Neural Networks: Application to Corporate Bond Rating, Neural Networks in the Capital Markets, (1995).
27. E. Osuna, R. Freund and F. Girosi, Training Support Vector Machines: An Application to Face Detection, Proceedings of Computer Vision and Pattern Recognition, 130–136, (1997).
28. J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1(1): 81–106, (1986).
29. J. Rudman, The State of Authorship Attribution Studies: Some Problems and Solutions. Computers and the Humanities, 31, 351–365, (1998).
30. R. Thisted and B. Efron, Did Shakespeare Write a Newly Discovered Poem? Biometrika, 74, 445–455, (1987).
31. D. Thomas and B. D. Loader, Introduction – Cyber Crime: law enforcement, security and surveillance in the information age, Taylor & Francis Group, New York, NY, (2000).
32. T. Tomoji, Dickens's Narrative Style: A Statistical Approach to Chronological Variation. Revue, Informatique et Statistique dans les Sciences Humaines (RISSH, Centre Informatique de Philosophie et Lettres, Universite de Liege, Belgique), 30, 165–182, (1994).
33. F. J. Tweedie, S. Singh, and D. I. Holmes, Neural Network Applications in Stylometry: The Federalist Papers. Computers and the Humanities, 30(1), 1–10, (1996).
34. K. M. Tolle, H. Chen and H. Chow, Estimating Drug/Plasma Concentration Levels by Applying Neural Networks to Pharmacokinetic Data Sets, Decision Support Systems, Special Issue on Decision Support for Health Care in a New Information Age, 30(2), 139–152, (2000).
35. O. de Vel, A. Anderson, M. Corney and G. Mohay, Mining E-mail Content for Author Identification Forensics, SIGMOD Record, 30(4): 55–64, (2001).
36. O. de Vel, Mining e-mail authorship. In Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD'2000), (2000).
37. V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, (1995).
38. B. Widrow, D. E. Rumelhart and M. A. Lehr, Neural Networks: Applications in Industry, Business, and Science, Communications of the ACM, 37, 93–105, (1994).
39. G. U. Yule, On sentence length as a statistical characteristic of style in prose, Biometrika, 30, (1938).
40. G. U. Yule, The statistical study of literary vocabulary, Cambridge University Press, (1944).
Behavior Profiling of Email

Salvatore J. Stolfo, Shlomo Hershkop, Ke Wang, Olivier Nimeskern, and Chia-Wei Hu

Columbia University, New York, NY 10027, USA
{sal,shlomo,kewang,on2005,charlie}@cs.columbia.edu
Abstract. This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia Intrusion Detection (IDS) Lab. EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques have been available from the fields of Information Retrieval (IR) and Natural Language Processing (NLP) for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model “user behavior”. EMT thus models the behavior of individual user email accounts, or groups of accounts, including the “social cliques” revealed by a user’s email behavior.
1 Introduction
This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia IDS Lab. EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques have been available from the fields of IR and NLP for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model "user behavior". EMT thus models the behavior of individual user email accounts, or groups of accounts, including the "social cliques" revealed by a user's email behavior. EMT's design has been driven by the core security application to detect virus propagations, spambot activity and security policy violations. However, the technology also provides critical intelligence gathering and forensic analysis capabilities for agencies to analyze disparate Internet data sources for the detection of malicious users, attackers, and other targets of interest. This dual use is graphically displayed in Figure 1. For example, one target application for intelligence gathering supported by EMT is the identification of likely "proxy email accounts", email accounts that exhibit similar behavior and thus may be used by a single person. Although EMT has been designed specifically for email analysis, the principles of its operation are equally relevant to other Internet audit sources. This data mining technology, previously reported in [4,6,7] and graphically displayed in Figure 2, has been shown to automatically compute or create both signature-based misuse detection and anomaly-detection-based misuse discovery.
Fig. 1. User account profiling, dual use: online detection and offline analysis.
The application of this technology to diverse Internet objects and events (e.g., email and web transactions) allows for a broad range of behavior-based analyses, including the detection of proxy email accounts and of groups of user accounts that communicate with one another, including covert group activities. Data mining applies machine learning and statistical techniques to automatically discover and detect misuse patterns, as well as anomalous activities in general. When applied to network-based activities and user account observations for the detection of errant or misuse behavior, these methods are referred to as behavior-based misuse detection. Behavior-based misuse detection can provide important new assistance for counter-terrorism intelligence. In addition to standard Internet misuse detection, these techniques will automatically detect certain patterns across user accounts that are indicative of covert, malicious or counter-intelligence activities. Moreover, behavior-based detection provides workbench functionalities to interactively assist an intelligence agent with targeted investigations and off-line forensic analyses. Intelligence officers have a myriad of tasks and problems confronting them each day. The sheer volume of source materials requires a means of homing in on those sources of maximal value to their mission. A variety of techniques can be applied drawing upon the research and technology developed in the field of Information Retrieval. There is, however, an additional source of information available that can be used to aid even the simplest task of rank-ordering and sorting documents for inspection: behavior models associated with the documents can be used to identify and group sources in interesting new ways.
Fig. 2. Overview of data mining based detection system.
This is demonstrated by the Email Mining Toolkit, which applies a variety of data mining techniques for profiling and behavior modeling of email sources. The deployment of behavior-based techniques for intelligence investigation and tracking tasks represents a significant qualitative step in the counterintelligence "arms race". Because there is no way to predict what data mining will discover over any given data set, "counter-escalation" is particularly difficult. Behavior-based misuse detection is more robust than standard knowledge-based techniques. Behavior-based detection has the capability to detect new patterns (i.e., patterns that have not been previously observed), provide early warning alerts to users and analysts, and automatically adapt to both normal and misuse behavior. By applying statistical techniques over actual system and user account behavior measurements, automatically generated models and rules are tuned to the particular source material. This process, in turn, avoids the human bias that is intrinsic when misuse signatures, patterns and other knowledge-based models are designed by hand, as is the norm. Despite this, no general infrastructure has been developed for the systematic application of behavior-based (misuse) detection across a broad set of detection and intelligence analysis tasks such as fraudulent Internet activities, virus detection, intrusion detection and user account profiling. Today's Internet security systems are specialized to apply a small range of techniques, usually knowledge-based, to an individual misuse detection problem, such as intrusion, virus or SPAM detection. Moreover, these systems are designed for one particular network environment, such as medium-sized network enclaves, and only tap into an individual cross-section of network activity such as email activity or TCP/IP activity.
Behavior-based detection technology as proposed herein will likely provide a quantum leap in security and in intelligence analysis in both offline and online task environments. EMT has been described in another publication, focusing on its use for security applications, including virus and spam detection, as well as security policy violations. In this paper, we focus on several of its features specific to intelligence applications, namely the means of clustering email by content-based analyses, the identification of "similar email accounts" based upon measuring similarity between account profiles represented by histograms, and the clique analyses that are supported by EMT.

Table 1. Behavior-Based Internet Applications for Security and Beyond (columns: Application; Description and Variations; Examples; Audit Sources)
- Fraud detection: unauthorized outgoing email (console usurped, child attacks teacher), unauthenticated email (deceptive source), unauthorized transactions (purchase/credit fraud); audit sources: email, HTTP, transaction services
- Malicious email detection: viruses, worms, "SPAM"; audit source: email
- Intrusion detection: network-based detection (standard IDS; TCP/IP), host-based detection (less standard IDS; system logs), application-based detection (future IDS; application logs)
- User community discovery: closely connected user-base, email "circles"; audit source: email
- Behavior-pattern discovery: account-based and community-based patterns (suspect and clandestine activities); audit sources: all sources (email, HTTP, transaction services, TCP/IP, Telnet traffic, FTP traffic, cookies)
- Analyst workbench: interactive forensic analysis, targeted intelligence investigations; audit sources: all sources
- Account proxy detection: accounts used by the same user (clandestine activities); audit sources: all sources
- Collaborative filtering: website recommendations (pageview prediction; HTTP), purchase recommendations (music/movie choices; transaction services)
- Policy violation detection: ISP or email enclave security policies (user espionage, outgoing SPAM); audit sources: all sources
- Web-bot detection: statistics/knowledge gathering (competitive analysis), site maintenance (finding broken links), search-engine spiders (Google, AltaVista); audit source: HTTP
1.1 Applying Behavior-Based Detection to Email Sources
Table 1 enumerates a range of behavior-based Internet applications. These applications cover a set of detection, security and marketing applications that exist within the government, commercial and private sectors. Each of these applications is within the capabilities of behavior-based techniques, by applying data mining algorithms over appropriate audit data sources. Our current research has applied behavior-based methods directly to the first six applications listed in Table 1: fraud detection, malicious email detection, intrusion detection, user community discovery, behavior-pattern discovery, and the analyst workbench. Each of these is an Internet security application, applying to both outbound and inbound network- and email-based traffic. Solving Internet security problems greatly assists surveillance intelligence activities. For example, the discovery of user account communities and the discovery and detection of certain community behavior patterns can be directed to uncover certain classes of covert, clandestine or espionage behavior performed with Internet resources. Furthermore, fraud detection in particular has direct
benefit for an intelligence agency by profiling and identifying users and clusters of users that participate in malicious Internet activities such as fraud. Behavior-based detection has been proven against similar, analogous security applications. The finance, telecom and energy industries have protected their customers from fraudulent misuse of their services (e.g., fraudulent misuse of credit card accounts, telephone calling cards, stealing of utility service, etc.) by modeling their individual customer accounts and detecting deviations from this model for each of their customers. The behavior-based protection paradigm applied to the Internet thus has an historical precedent that is now ubiquitous and transparent, as exemplified by the credit card in the reader's wallet or purse.
1.2 EMT as an Analyst Workbench for Interactive Intelligence Investigations
The "Malicious Email Tracking" (MET) system [1] is an online system that uses email flow statistics to capture new viruses, which are largely undetectable by the "signature" detection methods of today's state-of-the-art commercial virus detection systems. Specifically, all email attachments are tracked by computing a private hash value; temporal statistics such as replication rate are recorded to trace each attachment's trajectory, e.g., across LANs; and these statistics directly inform the detection of self-replicating, malicious software attachments. MET has been developed and deployed as an extension to mail servers and is fully described elsewhere. MET is an example of an online "behavior-based" security system that defends and protects a system not solely by attempting to identify known attacks against it, but rather by detecting deviations from the system's normal behavior. Many approaches to "anomaly detection" have been proposed, including research systems that aim to detect masqueraders by modeling user behavior in command-line sequences, or even keystrokes. In this case, however, MET is architected to protect user accounts by modeling user email flows to detect malicious email attachments, especially polymorphic viruses that are not detectable or traceable via signature-based detection methods. The "Email Mining Toolkit" (EMT), on the other hand, is an offline system applied to email files gathered from server logs or client email programs. EMT computes information about email flows from and to email accounts, aggregates statistical information from groups of accounts, and analyzes the content fields of emails. The EMT system provides temporal statistical feature computations and behavior-based modeling techniques, through an interactive user interface, to enable targeted intelligence investigations and semi-manual forensic analysis of email files. Figure 1 illustrates the general architecture of a behavior-based system deploying dual functionality:
1. an online security detection application (in this case, MET for malicious email detection), and
2. a general analyst workbench for intelligence investigations (EMT, for email source analysis).
As this figure illustrates, these functionalities share a great deal of infrastructure. With regard to the implementation, by deploying these dual functionalities, the audit module, the computation of temporal statistics, the user modeler, and the database of user models each serve both functionalities. Moreover, with regard to the conceptual design, the particular set of temporal statistics and user model processes designed for one can improve the performance of the other. In particular, temporal features, as well as user account models and clusters, are general "fundamental building blocks." EMT provides the following functionalities, interactively:
– Querying a database (warehouse) of email data and computed feature values, including:
  • ordering and sorting emails on the basis of content analysis (n-gram analysis, keyword spotting, and classification of email supported by an integrated supervised learning feature using a Naïve Bayes classifier trained on user-selected features);
  • historical features that profile user groups by statistically measuring behavior characteristics;
  • user models that group users according to features such as typical emailing patterns (as represented by histograms over different selectable statistics) and email communities (including the "social cliques" revealed in email exchanges between email accounts).
– Applying statistical models to email data to alert on abnormal or unusual email events.
EMT is also designed as a plug-in to a data mining platform, originally designed and implemented at Columbia, called the DW/AMG architecture (Data Warehouse/Adaptive Model Generation system). That work has been transferred to System Detection Inc. (SysD, http://www.sysd.com), a DARPA spin-out from Columbia, which has commercialized the system as the Hawkeye Security Platform.
2 EMT Features
The full range of EMT features has been described elsewhere. For the present paper, we provide a brief overview of several of its key features of direct relevance to security analysis and intelligence applications, along with descriptive screenshots of EMT in operation.
2.1 Attachment Models
MET was initially conceived to statistically model the behavior of email attachments flowing in real time through an enclave's email server, and to support the coordinated sharing of information among a wide area of email servers in order to identify malicious attachments and halt their propagation before saturation. In order to share such information properly, each attachment must be uniquely identified,
which is accomplished through the computation of an MD5 hash of the entire attachment. EMT runs an analysis on each attachment in the database to calculate a number of metrics. These include birth rate, lifespan, incident rate, prevalence, threat, spread, and death rate. They are explained fully in a companion paper¹ and are displayed graphically in Figure 3.
Fig. 3. Attachment Statistics
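A minimal sketch of the kind of bookkeeping involved: each attachment is identified by its MD5 hash, and a crude flow statistic (here, a rough "birth rate" in new recipients per day) is derived from the times at which that hash is observed. The record format and the metric definition below are assumptions for illustration only; EMT's actual schema and metric definitions are given in the companion paper referenced above.

# Sketch: identify attachments by MD5 hash and compute a crude per-hash birth rate.
# "observations" is a made-up log format: (attachment_bytes, recipient, timestamp).
import hashlib
from collections import defaultdict
from datetime import datetime

observations = [
    (b"...attachment bytes...", "alice@example.com", datetime(2003, 1, 2, 9, 0)),
    (b"...attachment bytes...", "bob@example.com", datetime(2003, 1, 2, 13, 0)),
    (b"...attachment bytes...", "carol@example.com", datetime(2003, 1, 4, 8, 30)),
]

seen = defaultdict(lambda: {"recipients": set(), "first": None, "last": None})
for payload, recipient, ts in observations:
    digest = hashlib.md5(payload).hexdigest()       # unique identifier for the attachment
    rec = seen[digest]
    rec["recipients"].add(recipient)
    rec["first"] = ts if rec["first"] is None else min(rec["first"], ts)
    rec["last"] = ts if rec["last"] is None else max(rec["last"], ts)

for digest, rec in seen.items():
    days = max((rec["last"] - rec["first"]).days, 1)
    print(digest[:8], "birth rate ~", len(rec["recipients"]) / days, "recipients/day")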
Rules specified by a security analyst using the alert logic section of EMT are evaluated over the attachment metrics to issue alerts to the analyst. This analysis may be done over archived email logs by EMT offline, or at runtime in MET while sniffing real-time email flows. The initial version of MET provides the means of specifying alerts in rule form as a collection of Boolean expressions applied to thresholds compared to each of the calculated statistics. As an example, a basic rule might check, for each attachment seen, whether its birth rate is greater than some specified threshold AND it was sent from at least a specified number of users. The flow statistics of each email attachment are computed by EMT, as well as the list of specific emails the attachment appears in, to identify recipients of those attachments.
¹ A paper entitled "A Behavior-based Approach to Securing Email Systems" has been prepared for submission to a technical conference and is under review. That paper describes the use of EMT for virus and spam detection. There is a minor overlap with that paper in the presentation material of some of EMT's features described herein.
The primary detection tasks MET was designed for include virus propagation detection and mitigation. Intelligence applications of this particular feature would include infosec security policy violations and general evidence gathering in forensic analyses.
Fig. 4. Main analyst window to sort and inspect specific emails.
2.2 Email Content and Classification
Figure 4 illustrates EMT's main messages tab, which provides an analyst with the means to inspect, cluster, and sort email messages under analysis. Emails can be selected for review and analysis on the basis of time, sender, or recipient account. This data may be labeled directly by an analyst for further data mining analysis supported by other feature tabs in EMT. Interestingly, EMT also provides the means of classifying attachments by way of the fully embedded MEF system, a supervised machine learning feature. In the earliest work on MEF (Malicious Email Filter [7]), the Naïve Bayes classifier was computed over user-selected training sets of attachments. The features extracted include "n-grams" and their frequencies, extracted and computed directly from the attachment
irrespective of its mime type. Hence, in addition to using flow statistics and attachment classifications to classify an email message, EMT uses the email body as a content-based feature. The two features supported are n-gram [8] modeling and a calculation of the frequency of a set of words [9] from the body of the email. An n-gram represents the sequence of any n adjacent characters or tokens that appear in a document. An n-character-wide window is passed over the entire email body, one character at a time, and a count is computed of the number of occurrences of each n-gram. This results in a hash table that uses the n-gram as a key and the number of occurrences as the value for each email; we refer to this as the document vector. Given a set of training emails, the arithmetic average of the document vectors can be computed as the centroid for the set. Given an instance of an email, we compute the cosine distance [8] against the centroid created during training. If the cosine distance is equal to 1, then the two documents are deemed identical; the smaller the value of the cosine distance, the more different the two documents are. These content-based methods are integrated into the machine learning models for classifying sets of emails for further inspection and analysis. An analyst therefore has the means of homing in on a set of potentially relevant emails by first classifying and clustering sets of emails using the EMT GUI. Using a set of normal email and spam we collected, we ran some initial experiments over our own email sets to test the efficacy of the approach. We used half of the labeled emails, both normal and spam, as training data, and used the other half as the test set. The accuracy of the classification using n-grams and word tokens varies from 70% to 94% when using different parts as training and testing sets. In the spam classification experiment, we noticed that some spam emails did not vary much from normal emails; for example, a spam email that consists of a single link to a non-threatening website. To improve accuracy we also used weighted keywords and removal of stop-words. For example, the spam email set noticeably contains words such as free, money, big, and lose weight at a much higher frequency than regular emails. Users can empirically assign stop-words and keywords and give higher weight to their frequency count. We continue to evaluate these content-based approaches further; experiments and analysis are ongoing.
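A small sketch of the n-gram/centroid/cosine procedure just described: character n-grams counted with a sliding window, an arithmetic-mean centroid over a training set, and a cosine score of a new message against that centroid. The window width and sample texts are illustrative assumptions, not EMT's configuration.

# Sketch: character n-gram document vectors, a centroid over training emails,
# and cosine similarity of a new email body against that centroid.
import math
from collections import Counter

def ngram_vector(text, n=3):
    """Slide an n-character window over the text and count each n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def centroid(vectors):
    """Arithmetic average of a list of count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

training = ["free money, lose weight fast!!!", "big prizes, click this free link"]
profile = centroid([ngram_vector(t) for t in training])
print(cosine(ngram_vector("free money offer, lose weight"), profile))  # closer to 1 = more similar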
2.3 Account Statistics and Alerts
This mechanism has been extended to provide alerts based upon deviation from other baseline user and group models. EMT computes and displays three tables of statistical information for any selected email account. The first is a set of stationary email account models, i.e. statistical data represented as a histogram of the average number of messages sent over all days of the week, divided into three periods: day, evening, and night. EMT also gathers information on the average size of messages for these time periods, and the average number of recipients and attachments for these periods. These statistics can generate alerts
when values are above a set threshold as specified by the rule-based alert logic section of EMT. Stationary User Profiles – Histograms over discrete time intervals. Histograms are used to model the stationary behavior of a user's email account; Figure 8 displays an example for one particular user account. Histograms are compared to find similar or abnormal behavior between different accounts, and within the same account (between a long-term profile histogram and a recent, short-term histogram). A histogram depicts the distribution of items in a given sample. EMT employs a histogram of 24 bins, one for each hour of the day. Email statistics are allocated to different bins according to their outbound time. The value of each bin can represent the daily average number of emails sent out in that hour, the daily average total size of attachments sent out in that hour, or other features of the email account computed over some specified period of time. Two histogram comparison functions are implemented in the current version of EMT, each providing a user-selectable distance function. The first comparison function is used to identify groups of email accounts that have similar usage behavior. The other function is used to compare an account's recent behavior to the long-term profile of that account. The histogram comparison functions may also be run "unanchored", meaning the histograms are shifted to find the best alignment with minimum distance, thus accounting for time zone changes. Similar Users – Histogram distance. Similarly behaving user accounts may be identified by computing the pair-wise distances of their histograms (e.g., a set of accounts may be inferred as similar to a given known or suspect account that serves as a model). The histogram distance functions were modified for this detection task. First, we balance and weigh the information in the histogram representing hourly behavior with the information provided by the histogram representing behavior over different aggregate periods of a day. This is done because measures of hourly behavior may be at too low a level of resolution to find proper groupings of similar accounts. For example, an account that sends most of its email between 9am and 10am should be considered similar to one that sends emails between 10am and 11am, but perhaps not to an account that emails at 5pm. Given two histograms representing a heavy 9am user and a heavy 10am user, a straightforward application of any of the histogram distance functions will produce erroneous results. Thus, we divide a day into four periods: morning (7am-1pm), afternoon (1pm-7pm), night (7pm-1am), and late night (1am-7am). The final distance computed is the average of the distance of the 24-hour histogram and that of the 4-bin histogram, which is obtained by regrouping the bins in the 24-hour histogram. Second, because some of the distance functions require normalizing the histograms before computing the distance, we also take into account the volume of emails. Even with the exact distribution after normalization, a bin
representing 20 emails per day should be considered quite different from an account exhibiting the emission of 200 emails per day. Figure 6 graphically displays the EMT analysis showing the target user account and a list of the most similar accounts found by EMT’s histogram analysis.
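The following sketch illustrates this two-resolution comparison under simple assumptions of our own: 24-hour histograms built from outbound send hours, regrouped into the four periods above, with the two L1 distances averaged and the histograms left unnormalized so that volume still matters. It is not EMT's implementation; the distance choice and weighting are placeholders.

# Sketch: compare two accounts' sending behavior with a 24-bin hourly histogram
# and a 4-bin period histogram (morning/afternoon/night/late night), averaging
# the two L1 distances. Send-hour data are illustrative.
import numpy as np

def hourly_histogram(send_hours):
    """Counts of outbound emails per hour of day (unnormalized, so volume counts)."""
    hist = np.zeros(24)
    for h in send_hours:
        hist[h] += 1
    return hist

def period_histogram(hourly):
    # morning 7am-1pm, afternoon 1pm-7pm, night 7pm-1am, late night 1am-7am
    periods = [range(7, 13), range(13, 19), [19, 20, 21, 22, 23, 0], range(1, 7)]
    return np.array([hourly[list(p)].sum() for p in periods])

def account_distance(hours_a, hours_b):
    ha, hb = hourly_histogram(hours_a), hourly_histogram(hours_b)
    d24 = np.abs(ha - hb).sum()
    d4 = np.abs(period_histogram(ha) - period_histogram(hb)).sum()
    return (d24 + d4) / 2.0

heavy_9am = [9] * 30 + [10] * 5
heavy_10am = [10] * 28 + [11] * 6
evening_user = [20] * 25 + [21] * 10
print(account_distance(heavy_9am, heavy_10am))    # relatively small distance
print(account_distance(heavy_9am, evening_user))  # much larger distance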
Fig. 5. Chi Square Test of recipient frequency
Abnormal User Account Behavior. EMT may apply these distance functions to one target email account. (See Figure 6.) A long term profile period is first selected by an analyst as the “normal” behavior period. The histogram computed for this period is then compared to another histogram computed for a more recent period of email behavior. If the histograms are very different (i.e., they have a high distance), an alert is generated indicating possible account misuse. We use the weighted Mahalanobis distance function for these profiles. The long term profile period is used as the training set, for example, a single month. We assume the bins in the histogram are random variables that are statistically independent. When the distance between the histogram of the selected recent period and that of the longer term profile is larger than a threshold, an alert will be generated to warn the analyst that the behavior “might be abnormal” or is deemed “abnormal”. The alert is also put into the alert log of EMT.
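A toy version of this long-term versus recent-period comparison is sketched below, assuming statistically independent bins (so the weighted Mahalanobis distance reduces to a per-bin variance-scaled distance) and a hand-picked alert threshold; both the synthetic data and the threshold are our own placeholders, not EMT's.

# Sketch: compare a recent 24-bin histogram against a long-term profile using a
# diagonal (independent-bin) Mahalanobis-style distance and alert on large distances.
import numpy as np

long_term_days = np.random.default_rng(1).poisson(lam=2.0, size=(30, 24))  # 30 days of hourly counts
recent_days = np.random.default_rng(2).poisson(lam=2.0, size=(7, 24))
recent_days[:, 3] += 15                             # inject unusual 3am activity

mu = long_term_days.mean(axis=0)                    # long-term profile
var = long_term_days.var(axis=0) + 1e-6             # avoid division by zero
recent = recent_days.mean(axis=0)

distance = np.sqrt(np.sum((recent - mu) ** 2 / var))
THRESHOLD = 10.0                                    # would be tuned by the analyst
if distance > THRESHOLD:
    print(f"ALERT: behavior might be abnormal (distance={distance:.1f})")
else:
    print(f"within profile (distance={distance:.1f})")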
Fig. 6. Histogram Comparison to Detect Similar users
The histograms described here are stationary models; they represent statistics at discrete time frames. Other non-stationary account profiles are provided by EMT, as described next. Non-stationary User Profiles – Histograms over blocks of emails. Another type of modeling considers the changing conditions over time of an email account. Most email accounts follow certain trends, which can be modeled by some underlying distribution. As an example of what this means, many people will typically email a few addresses very frequently, while emailing many others infrequently. Day to day interaction with a limited number of peers usually results in some predefined groups of emails being sent. Other contacts with whom the email account owner interacts with on less than a day to day basis have a more infrequent email exchange behavior. The recipient frequency is used as a feature to study this concept of underlying distributions. Four behavior analysis graphs for any selected e-mail account are created by EMT for this model. These graphs display the address list size and average outgoing e-mail account spread over time, as well as the number of outgoing e-mails to each destination account. Every user of an email system develops a unique pattern of email emission to a specific list of recipients, each having their own frequency. Modeling every
user's idiosyncrasies enables the EMT system to detect malicious or anomalous activity in the account. This is similar to what happens in credit card fraud detection, where current behavior violates some past behavior pattern. Figure 5 provides a screenshot of the non-stationary model features in EMT, which are fully described elsewhere. In a nutshell, the Profile tab in Figure 5 provides a snapshot of the account's activity in terms of recipient frequency. It contains three charts and one table. The various profile statistics selected by the analyst specify an empirical distribution that may then be compared by the analyst with a set of built-in metrics, including Chi-square and Hellinger distance [10]. Rapid changes in email emissions among accounts can then be discerned, which may have particular intelligence value.
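As an illustration of the recipient-frequency idea (not EMT's exact procedure), one can tabulate how often an account emails each recipient in a long-term training block and in a recent block, then apply a Chi-square goodness-of-fit test to ask whether the recent emission frequencies still follow the trained distribution. The counts below are invented placeholders.

# Sketch: Chi-square test of whether an account's recent recipient frequencies
# match its longer-term distribution.
from scipy.stats import chisquare

recipients = ["alice", "bob", "carol", "dave"]
train_counts = [120, 60, 15, 5]                     # long-term emission counts
recent_counts = [10, 8, 30, 2]                      # recent block: carol suddenly dominant

total_recent = sum(recent_counts)
expected = [total_recent * c / sum(train_counts) for c in train_counts]

stat, p_value = chisquare(f_obs=recent_counts, f_exp=expected)
print(f"chi-square={stat:.1f}, p={p_value:.4f}")    # small p -> frequencies have shifted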
2.4 Group Communication Models: Cliques
In order to study the email flows between groups of users, EMT provides a feature that computes the set of cliques in an email archive. We seek to identify clusters or groups of related email accounts that frequently communicate with each other, and then use this information to identify unusual email behavior that violates typical group behavior, or to identify similar behaviors among different user accounts on the basis of group communication activities. Clique violations may also indicate internal email security policy violations. For example, members of the legal department of a company might be expected to exchange many Word attachments containing patent applications; it would be highly unusual if members of the marketing department and HR services likewise received these attachments. EMT can infer the composition of related groups by analyzing normal email flows and computing cliques (see Figure 7), and use the learned cliques to alert when emails violate clique behavior. An analyst may simply wish to compute these cliques and rank-order all associated emails of the clique members for direct inspection. EMT implements clique finding using the branch-and-bound algorithm described in [2]. We treat an email account as a node and establish an edge between two nodes if the number of emails exchanged between them is greater than a user-defined threshold, which is taken as a parameter (Figure 7 is displayed with a setting of 100). The cliques found are the fully connected subgraphs. For every clique, EMT computes the most frequently occurring words appearing in the subjects of the emails in question, which often reveals the clique's typical subject matter under discussion. Chi Square + cliques. The Chi Square + cliques (CS + cliques) feature in EMT is the same as the Profile window described above in Section 2.3, with the addition of the calculation of clique frequencies. In summary, the clique algorithm is based on graph theory. It finds the largest cliques (groups of users) that are fully connected with a minimum number of emails per connection at least equal to the threshold (set at 50 by default).
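A compact sketch of this construction using the networkx library: nodes are email accounts, an edge is added when two accounts exchanged more than a threshold number of emails, and maximal cliques are enumerated. The original EMT uses the branch-and-bound algorithm of Bron and Kerbosch [2]; networkx's find_cliques is likewise a Bron-Kerbosch variant. The exchange counts and threshold below are illustrative.

# Sketch: build an account graph thresholded on exchanged-email counts and list
# the maximal cliques. Exchange counts are made-up placeholders.
import networkx as nx

exchange_counts = {
    ("alice", "bob"): 140, ("alice", "carol"): 120, ("bob", "carol"): 105,
    ("carol", "dave"): 30, ("dave", "erin"): 150,
}
threshold = 100

G = nx.Graph()
for (a, b), count in exchange_counts.items():
    if count > threshold:
        G.add_edge(a, b, weight=count)

for clique in nx.find_cliques(G):                   # Bron-Kerbosch maximal cliques
    if len(clique) >= 2:
        print(sorted(clique))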
Fig. 7. Clique generation for 100 messages
In this window, each clique is treated as if it were a single recipient, so that each clique has a frequency associated with it. Only the cliques to which the selected user belongs will be displayed. Some users do not belong to any clique, and for those, this window is identical to the normal Chi Square window. If the selected user belongs to one or more cliques, each clique appears under the name clique_i (i = 1, 2, ...) and is displayed in a cell with a green color in order to be distinguishable from individual email account recipients. (One can double-click on each clique's green cell, and a window pops up with the list of the members of the clique.) Cliques tend to have high ranks in the frequency table, as the number of emails corresponding to a clique is the aggregate total for several recipients. These metrics are a first step toward modeling a user's behavior in terms of group email emission frequency. A larger database will enable us to refine them, and to better understand the time-continuous stochastic process taking place. The Chi-square test may be modified or complemented with finer measures. The Chi-square test checks whether the frequencies of emission are constant for a given user. In the preliminary results obtained on our collected database, the Chi-square test tended to reject quite often the hypothesis that the frequencies were the same between training and testing periods, indicating that the frequencies are not stable. They change quite dynamically over short time frames, as new recipients and cliques become more or less popular over time. Any new model should take this dynamic evolution into account.
Fig. 8. Anomalous user behavior detected by histogram comparison
Enclave cliques vs. User cliques. Conceptually, two types of cliques can be formulated, and both are supported by EMT. The type described in the previous section can be called enclave cliques because these cliques are inferred by looking at the email exchange patterns of an enclave of accounts. In this regard, no account is treated as special, and we are interested in email flow patterns at the enclave level. Any flow violation or new flow pattern pertains to the entire enclave. On the other hand, it is possible to look at email traffic patterns from a different viewpoint altogether. Suppose we are focusing on a specific account and have access to its outbound traffic log. Since an email can have multiple recipients, these recipients can be viewed as a clique associated with this account. Because a clique could be subsumed by another clique, we define a user clique as one that is not a subset of any other clique. In other words, user cliques of an account are its recipient lists that are not subsets of other recipient lists. User clique computation provides an intelligence analyst with the means of quickly identifying groups directly associated with a target email account, and may be used to group emails for inspection based upon various clique analyses. This is an active area of our ongoing research. Preliminary experiments have been performed using these graph-theoretic features for spam and virus detection. In both cases, the clique models provide interesting new evidence to improve the accuracy of detection beyond what is achievable with pure content-based features of emails.
3 Conclusion
It is important to note that testing EMT and MET in a laboratory environment is not particularly informative of their performance on specific tasks and source material. The behavior models are naturally specific to a site or particular account(s), and thus performance will vary depending upon the quality of data available for modeling and the parameter settings and thresholds employed. EMT is designed to be as flexible as possible so an analyst can effectively explore the space of models and parameters appropriate for their mission. An analyst simply has to take it for a test spin. (EMT has been deployed and is being tested and evaluated by external organizations.) One of the core principles behind EMT's design may be stated succinctly: there is no single monolithic model appropriate for any detection or forensic analysis task. Hence, EMT provides a palette of models and profiling techniques (specialized to email log files) that may be combined in interesting ways by an analyst to meet their own mission objectives. It is also important to recognize that no single modeling technique in EMT's repertoire can be guaranteed to have no false negatives, or few false positives. Rather, EMT is designed to assist an analyst or security staff member in architecting a set of models whose outcomes provide evidence for some particular detection task. The combination of this evidence is specified in the alert logic section as simple Boolean combinations of model outputs, and the overall detection rates will clearly be adjusted and will vary depending upon the user-supplied specifications of threshold logic. The Email Mining Toolkit is a work in progress. This paper has described the core concepts underlying EMT, its related Malicious Email Tracking system, and the Malicious Email Filtering system. We have presented the features of the system currently implemented and available to an analyst for various security and intelligence applications. The GUI allows the user to easily automate many complex analyses. We believe the various behavior-based profiles computed by EMT will significantly improve analyst productivity. We are continuing our research to broaden the range of features and models one may compute over email logs. For example, the notion of clique may be over-constrained, and may be relaxed in favor of other kinds of models of communication groups. Further, we are actively exploring stochastic models of long-term user profiles, with the aim of computing these models efficiently when training such profiles. Computing histograms over fixed time periods is very efficient, but likely insufficient to model a user's true dynamic behavior.
References
1. M. Bhattacharyya, S. Hershkop, E. Eskin, and S. J. Stolfo, MET: An Experimental System for Malicious Email Tracking. In Proceedings of the 2002 New Security Paradigms Workshop (NSPW-2002), Virginia Beach, VA, September 2002.
2. C. Bron and J. Kerbosch, Finding all cliques of an undirected graph, Comm. ACM 16(9) (1973) 575–577.
3. E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. J. Stolfo, A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In Data Mining for Security Applications, Kluwer, 2002.
4. G. H. John and P. Langley, Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, 1995.
5. W. Lee, S. Stolfo, and K. Mok, Mining Audit Data to Build Intrusion Detection Models. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD '98), New York, NY, August 1998.
6. W. Lee, S. Stolfo, and P. Chan, Learning Patterns from Unix Process Execution Traces for Intrusion Detection. AAAI Workshop: AI Approaches to Fraud Detection and Risk Management, July 1997.
7. M. G. Schultz, E. Eskin, and S. J. Stolfo, Malicious Email Filter – A UNIX Mail Filter that Detects Malicious Windows Executables. Proceedings of the USENIX Annual Technical Conference – FREENIX Track, Boston, MA, June 2001.
8. M. Damashek, Gauging Similarity with n-grams: Language-Independent Categorization of Text. Science, 267(5199), 843–848, 1995.
9. T. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 180–183.
10. R. V. Hogg, Introduction to Mathematical Statistics, Prentice Hall, 1994.
Detecting Deception through Linguistic Analysis

Judee K. Burgoon¹, J.P. Blair², Tiantian Qin¹, and Jay F. Nunamaker, Jr.¹

¹ Center for the Management of Information, University of Arizona
{jburgoon,tqin,nunamaker}@cmi.arizona.edu
² Department of Criminal Justice, Michigan State University
[email protected]
Abstract. Tools to detect deceit from language use pose a promising avenue for increasing the ability to distinguish truthful transmissions, transcripts, intercepted messages, informant reports, and the like from deceptive ones. This investigation presents preliminary tests of 16 linguistic features that can be automated to return assessments of the likely truthfulness or deceptiveness of a piece of text. Results from a mock theft experiment demonstrate that deceivers do utilize language differently than truth tellers and that combinations of cues can improve the ability to predict which texts may contain deception.
ability in specified contexts [6]. The research to be reported here was guided by the objectives of identifying those indicators that are (1) the least context-sensitive and (2) the most amenable to automation. We present preliminary results from a mock theft experiment that is still in progress, our purposes being to illustrate the promise of examining text-based linguistic indicators and of examining such indicators using a particular statistical approach that examines combinations of cues.
2 Background Two experiments from our research program predate the one to be reported here. One was modeled after the familiar Desert Survival Problem, in which pairs of participants were given a scenario in which their jeep had crashed in the Kuwaiti desert, read material we developed from an Army field manual that we entitled "Imperative Information for Surviving in the Desert," and then were asked to arrive at a consensus on the rank-ordering of salvageable items in terms of their importance to survival. The task was conducted via email over the course of several days. In half of the pairs, one person was asked to deceive the partner by advocating choices opposite of what the experts recommend (e.g., discarding bulky clothing and protective materials so as to make walking more manageable). Partners discussed the rankings and their recommendations either face-to-face or using a computer-mediated form of communication such as text chat, audioconferencing, or videoconferencing. All discussions were recorded and transcribed, then subjected to linguistic analysis of such features as number of words, number of sentences, number of unique words (lexical diversity), emotiveness, and pronoun usage. Of the 27 indicators that were examined, several proved to reliably distinguish truth tellers from deceivers. Deceivers were more likely to use longer messages but with less diversity and complexity, and greater uncertainty and "distancing" in language use than truth tellers. These results revealed that systematic differences in language use could help predict which messages originated from deceivers and which from those telling the truth. The second experiment was designed as a pilot effort for the experiment to be reported below. In this experiment, participants staged a mock theft and were subsequently interviewed by untrained and trained interviewers via text chat or face-to-face (FtF) interaction [4, 5]. The FtF interactions were later transcribed, and the transcripts and chats were submitted to linguistic analysis on the same features as noted above, plus several others that are available in the Grammatik tool within WordPerfect. Due to the small sample size, none of the differences between innocents (truth tellers) and thieves (deceivers) were statistically significant, but patterns were suggestive of deceivers tending toward briefer messages (fewer syllables, words, and sentences; shorter and simpler sentences) of greater complexity (e.g., greater vocabulary and sentence complexity, lower readability scores) than truth tellers (higher Flesch-Kincaid grade level, sentence complexity, vocabulary complexity, and syllables per word). The patterns found in these first efforts suggested that we should expect to find many linguistic differences between deceivers and truth tellers with a larger, well-designed experiment. We therefore hypothesized that deceptive senders display higher (a) quantity, (b) nonimmediacy, (c) expressiveness, (d) informality, and (e) affect; and less (f) complexity, (g) diversity, and (h) specificity of language in their messages than truthful senders.
3 Method Students were recruited from a multi-sectioned communication class by offering them credit for participation and the chance to win money if they were successful at their task. Half of the students were randomly assigned to be "thieves," i.e., those who would be deceiving about a theft, and the other half became "innocents," i.e., those who would be telling the truth. Interviewees in the deceptive condition were assigned to "steal" a wallet that was left in a classroom. In the truthful condition, interviewees were told that a "theft" would occur in class on an assigned day. All of the interviewees and interviewers then appeared for interviews according to a pre-assigned schedule. We attempted to motivate serious engagement in the task by offering interviewers $10 if they could successfully detect whether their interviewee was innocent or guilty and successfully detect whether they were deceiving or telling the truth on a series of the interview questions. In turn, we offered interviewees $10 if they convinced a trained interviewer that they were innocent and that their answers to several questions were truthful. An additional incentive was a $50 prize to be awarded to the most successful interviewee. Interviewees were then interviewed by one of three trained interviewers under one of three modalities: face to face (FtF), text chat, or audioconferencing. The interviews followed a standardized Behavioral Analysis Interview format that is taught to criminal investigators [7]. Interviews were subsequently transcribed and submitted to linguistic analysis. Clusters of potential indicators, all of which could be automatically calculated with a shallow parser (Grok or Iskim) or could use a look-up dictionary, were included. The specific classes of cues and respective indicators were as follows:
1. Quantity (number of syllables, number of words, number of sentences)
2. Vocabulary complexity (number of big words, number of syllables per word)
3. Grammatical complexity (number of short sentences, number of long sentences, Flesch-Kincaid grade level, average number of words per sentence, sentence complexity, number of conjunctions)
4. Specificity and expressiveness (emotiveness index, rate of adjectives and adverbs, number of affective terms)
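A few of these cues, particularly the quantity and vocabulary-complexity indicators, can be approximated with very simple counting, as in the sketch below. This is an illustration only, not the shallow-parser pipeline (Grok or Iskim) or the look-up dictionaries used in the study; the "big word" heuristic is our own assumption.

# Sketch: rough counts for a handful of the linguistic cues listed above
# (words, sentences, big words, average words per sentence, lexical diversity).
import re

def cue_profile(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    big_words = [w for w in words if len(w) >= 7]   # crude "big word" proxy
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_big_words": len(big_words),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "lexical_diversity": len({w.lower() for w in words}) / max(len(words), 1),
    }

print(cue_profile("I never touched the wallet. I was in the library all afternoon."))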
4 Results
In the following two subsections, we examine the data from two perspectives: analysis of individual cues and cluster analysis. To analyze how well individual cues distinguish messages of deceivers from those of truth tellers, we conducted multivariate analyses of related groups of cues, followed by directional t-tests on individual cues to identify which ones contribute most to differentiating deceivers from truth tellers. The cluster analysis addresses whether combinations of cues, arranged in a hierarchical structure, can improve the overall ability to differentiate deceivers from truth tellers. Furthermore, unlike traditional statistical cluster analysis, we used a data-mining algorithm, C4.5 [8], to cluster the cues and obtain a hierarchical tree structure. In this way, we fulfilled the requirement that deception detection be automated.
4.1 Individual Cue Analysis
Results were based on data from the 49 subjects whose modality was text chat (txt) or audio. (Data for the face-to-face (FtF) condition will be added once those sessions have been completed and all video files transcribed.) Among these subjects, 29 interacted via text and 20 via audio; 26 were "thieves" (i.e., deceivers) and 23 were "innocents" (i.e., truth tellers). Table 1 presents descriptive statistics for the 16 cues that were analyzed.
Table 1. Means (standard deviations) for the 16 cues
Results of the multivariate tests and t-tests are shown in Table 2. (For plots of means by deception condition and modality, see the figures in the appendix.) The multivariate analysis of the indicators of quantity of language produced a significant multivariate effect for deception (p = .033) and no modality-by-deception interaction. Deceivers said or wrote less than truth tellers.
The multivariate analyses of the complexity indicators at the sentence level (simple sentences, long sentences, short sentences, sentence complexity, Flesch-Kincaid grade level, number of conjunctions, average words per sentence (AWS)) and at the vocabulary level (vocabulary complexity, number of big words, average syllables per word (ASW)) did not produce overall multivariate effects, but several individual variables did show effects of the deception condition. Deceivers produced significantly fewer long sentences, lower AWS, lower sentence complexity, and a lower Flesch-Kincaid grade level than truth tellers. This means their language was less complex and easier to comprehend. The t-tests also provided weak support for deceivers having lower ASW (p = .102) and fewer conjunctions (p = .149) than truth tellers. Thus, deceivers used less complex language at both the lexical (vocabulary) and grammatical (sentence and phrase) levels. A modality effect also showed that subjects in text chat used fewer conjunctions than those in audio, indicating that text chat was less likely to exhibit compound and complex sentences.
For the analyses of message specificity and expressiveness (adjectives and adverbs, emotiveness, and affect), the multivariate test showed a trend toward a main effect for the deception condition (p = .101). There was a significant univariate difference
on affect, such that deceivers used less language referring to emotions and feelings than did truth tellers.

Table 2. Univariate F-tests (p-values) for the between-subject effects and independent-samples t-tests for individual cue analysis

Cues                          Modality        Condition       Modality*Condition   t-Test
Syllables                     2.054 (.159)    1.842 (.182)    .156 (.695)          1.502 (.140)
Words                         2.363 (.131)    2.407 (.128)    .162 (.689)          1.702 (.096)*
Sentences                     .810 (.373)     .001 (.972)     .018 (.894)          .111 (.912)
Short sentences               .122 (.016)*    .588 (.447)     .225 (.637)          -.725 (.472)
Long sentences                .547 (.464)     6.566 (.014)*   .005 (.947)          2.781 (.008)*
Simple sentences              .029 (.886)     .002 (.969)     1.874 (.178)         .061 (.951)
Big words                     .462 (.500)     .288 (.594)     .146 (.704)          .616 (.541)
Average syllables per word    1.949 (.17)     1.703 (.199)    .413 (.524)          -1.668 (.102)
Average words per sentence    .374 (.544)     4.368 (.042)*   .006 (.936)          2.414 (.021)*
Flesch-Kincaid grade level    .001 (.979)     2.690 (.108)    .005 (.943)          1.958 (.056)*
Sentence complexity           .006 (.940)     2.055 (.159)    .181 (.673)          1.779 (.082)*
Vocabulary complexity         .657 (.422)     .512 (.478)     .657 (.422)          -.997 (.324)
# of Conjunctions             3.393 (.072)*   2.569 (.116)    1.496 (.228)         1.426 (.163)
Rate Adjectives and Adverbs   0.150 (.700)    .329 (.569)     .301 (.586)          -.596 (.554)
Emotiveness                   0.020 (.889)    .054 (.818)     .060 (.808)          -.233 (.817)
Affect                        1.591 (.214)    3.291 (.214)    .004 (.948)          1.630 (.110)

* p < .05, one-tailed.
4.2 Cluster Analysis by C4.5
Although many linguistic cues were not significant individually, as shown in Section 4.1, together they can form a hierarchical tree that performs relatively well in discriminating deceptive communicators from truthful ones. Among the many available data-mining algorithms, we chose C4.5 because it provides a clear cluster structure (compared with a neural network) as well as satisfactory precision [9]. C4.5 uses a pruned tree to cluster the cues; the algorithm cuts off redundant branches while constraining error rates. We used the Weka software (University of Waikato, New Zealand; Witten and Frank, 2000) to run C4.5. Figure 1 shows the output of a pruned tree, where "1" stands for the truthful condition and "2" stands for the deceptive condition. The correct prediction rate using 15-fold cross-validation is 60.72%, which is reasonably satisfactory given the small size of the data set.
As shown in Figure 1, a combination of linguistic cues can categorize deceptive behavior well. For example, sentence-level complexity combined with vocabulary or affect acted as a good classifier. The linguistic cues that were significant in Section 4.1 also played important roles in the cluster classification: number of conjunctions, Flesch-Kincaid grade level, AWS, and affect. On the other hand, the cluster structure was also consistent with the multivariate tests in that not all linguistic cues contribute to identifying deception. There were "unhelpful" cues, such as emotiveness, which showed no significance in either the single-level analysis or the cluster (hierarchical) analysis. However, it is premature to conclude that any linguistic cue is ineffective at this point; further investigation with larger data sets will give us deeper insight into the interrelations of cues.
The confusion matrix shows the number of misclassifications: 10 out of 37 truthful cases were misclassified as deceptive, and 19 out of 35 deceptive cases were misclassified as truthful. The tree thus produced fewer misclassifications for the truthful condition than for the deceptive condition.
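As an illustration only, the following sketch shows how a cue-based decision tree with cross-validation might be reproduced in Python. It substitutes scikit-learn's CART-style DecisionTreeClassifier for the C4.5/J48 learner used with Weka in the study, and the feature matrix, labels, and feature names are placeholders rather than the study's data.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data: rows are transcripts, columns are cue values such as number of
# conjunctions, Flesch-Kincaid grade level, average words per sentence, and affect terms.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 4))
y = rng.integers(1, 3, size=49)   # 1 = truthful condition, 2 = deceptive condition
feature_names = ["conjunctions", "fk_grade", "avg_words_per_sentence", "affect"]

# CART with a depth limit stands in for the pruned C4.5/J48 tree used in the study.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
print(export_text(clf, feature_names=feature_names))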
5 Discussion
This investigation was undertaken largely to demonstrate the efficacy of using linguistic cues, especially ones that can be automated, to flag potentially deceptive discourse, and to use statistical clustering techniques to select the best set of cues for reliably distinguishing truthful from deceptive communication. It demonstrates the potential of both the general focus on language indicators and the use of hierarchical clustering techniques to improve the ability to predict which texts might be deceptive.
As for the specific indicators that might prove promising, these results provide some evidence for the hypothesis that deceivers behave differently than truth tellers in communication via text chat and/or audio chat. Although many tests were not significant due to the small sample size, the profile plots showed a trend: deceivers' messages were briefer (i.e., lower in quantity of language), were less complex in their choice of vocabulary and sentence structure, and lacked specificity and expressiveness in their text-based chats. This is consistent with
profiles found in nonverbal deception research showing that deceivers tend to adopt, at least initially, a fairly inexpressive, rigid communication style with "flat" affect. It appears that their linguistic behavior follows suit and likewise reflects their inability to create messages rich with the details and complexities that characterize truthful discourse. Over time, deceivers may alter these patterns, more closely approximating normal speech in many respects. But language choice and complexity may fail to show such changes because deceivers are not accessing real memories and real details, and thus do not have the same resources in memory upon which to draw.
Unlike asynchronous experiments such as the Desert Survival Problem (DSP) experiment, subjects here did not have sufficient time to construct detailed lies with greater quantity and complexity [10]. The difference in synchronicity between these two tasks points to time for planning, rehearsal, and editing as a major factor that may alter the linguistic patterns of deceivers and truth tellers. As a consequence, no single profile of deceptive language across tasks is likely to emerge. Rather, different cue models will likely be required for different tasks. Consistent with interpersonal deception theory [11], deceivers may adapt their language style deliberately according to the task at hand and their interpersonal goals. If the situation does not afford adequate time for more elaborate deceits, one should expect deceivers to say less. But if time permits elaboration, and/or the situation is one in which persuasive efforts may prove beneficial, deceivers may actually produce longer messages. What may not change, however, is their ability to draw upon more complex representations of reality, because they are not accessing reality. In this respect, complexity measures may prove less variant across tasks and other contextual features. The issue of context invariance thus becomes an extremely important one to investigate as this line of work proceeds.
Modality also plays a role in communication. Subjects talked more than they wrote, but message complexity did not seem to differ much between the text and audio modalities. Future research will explore the effect of different communication modalities on the characteristics of truthful and deceptive messages. Although the clustering analysis did not consider modality effects, it provided a hierarchical tree structure that captures the combined characteristics of the cues. It also provided exploratory threshold values for separating deceptive and truthful messages.
It should also be noted that the analysis in this study used the absolute values of linguistic characteristics to classify statements as truthful or deceptive. Because people vary greatly in their language use (e.g., some people naturally use more or less complex language than others), cue values that are relative to the sender of the message may yield greater classification accuracy. This would require building a model of an individual's baseline speech patterns and then comparing an individual message to this model (a sketch of this idea appears below). Future research will consider the interconnections among linguistic cues, tasks, and modalities. More data will also enhance the reliability of the current results, but it is clear from these results alone that linguistic cues that are amenable to automation may prove valuable in the arsenal of tools to detect deceit.
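A minimal sketch of the sender-baseline idea mentioned above, assuming a table of per-message cue values with a sender identifier (the column names and values here are hypothetical): each cue is re-expressed as a z-score relative to that sender's own messages before classification.

import pandas as pd

# Hypothetical per-message cue table; 'sender' identifies the message author and the
# cue columns would come from the linguistic analysis step.
df = pd.DataFrame({
    "sender":   ["a", "a", "a", "b", "b", "b"],
    "words":    [120, 98, 143, 45, 60, 52],
    "fk_grade": [8.1, 7.5, 9.0, 5.2, 6.1, 5.8],
})

# Express each cue relative to the sender's own baseline (a within-sender z-score), so
# that naturally verbose or complex writers are not flagged simply for being so.
cues = ["words", "fk_grade"]
adjusted = df.groupby("sender")[cues].transform(lambda c: (c - c.mean()) / c.std(ddof=0))
print(df[["sender"]].join(adjusted.add_suffix("_z")))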
Acknowledgement. Portions of this research were supported by funding from the U.S. Air Force Office of Scientific Research under the U.S. Department of Defense University Research Initiative (Grant #F49620-01-1-0394). The views, opinions, and/or findings in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
References
1. Burgoon, J. K., Buller, D. B., Ebesu, A., Rockwell, P.: Interpersonal Deception: V. Accuracy in Deception Detection. Communication Monographs 61 (1994) 303–325
2. Levine, T., McCornack, S.: Linking Love and Lies: A Formal Test of the McCornack and Parks Model of Deception Detection. J. of Social and Personal Relationships 9 (1992) 143–154
3. Zuckerman, M., DePaulo, B., Rosenthal, R.: Verbal and Nonverbal Communication of Deception. In: Berkowitz, L. (ed.): Advances in Experimental Social Psychology, Vol. 14. Academic Press, New York (1981) 1–59
4. Burgoon, J., Blair, J. P., Moyer, E.: Effects of Communication Modality on Arousal, Cognitive Complexity, Behavioral Control and Deception Detection during Deceptive Episodes. Paper submitted to the Annual Meeting of the National Communication Association, Miami (2003, November)
5. Burgoon, J., Marett, K., Blair, J. P.: Detecting Deception in Computer-Mediated Communication. In: George, J. F. (ed.): Computers in Society: Privacy, Ethics & the Internet. Prentice-Hall, Upper Saddle River, NJ (in press)
6. Vrij, A.: Detecting Lies and Deceit. John Wiley and Sons, New York (2000)
7. Inbau, F. E., Reid, J. E., Buckley, J. P., Jayne, B. C.: Criminal Interrogations and Confessions. 4th edn. Aspen, Gaithersburg, MD (2001)
8. Quinlan, J. R.: C4.5. Morgan Kaufmann Publishers, San Mateo, CA (1993)
9. Spangler, W., May, J., Vargas, L.: Choosing Data-Mining Methods for Multiple Classification: Representational and Performance Measurement Implications for Decision Support. J. Management Information Systems 16 (1999) 37–62
10. Zhou, L., Twitchell, D., Qin, T., Burgoon, J. K., Nunamaker, J. F., Jr.: An Exploratory Study into Deception Detection in Text-based Computer-Mediated Communication. In: Proceedings of the 36th Annual Hawaii International Conference on System Sciences. Big Island, Los Alamitos, CA (2003)
11. Buller, D. B., Burgoon, J. K.: Interpersonal Deception Theory. Communication Theory 6 (1996) 203–242
Appendix: Individual Cue Comparisons by Modality and Deception Condition*
[Profile plots of estimated marginal means by modality and deception condition (guilty vs. innocent) for each cue: total number of syllables, total number of words, number of sentences, short sentences, long sentences, simple sentences, big words, average syllables per word, average words per sentence, Flesch-Kincaid grade level, sentence complexity, vocabulary complexity, rate of adjectives and adverbs, number of conjunctions, affect, and emotiveness.]
* Modality 1 = Text, Modality 2 = Audio
A Longitudinal Analysis of Language Behavior of Deception in E-mail

Lina Zhou 1, Judee K. Burgoon 2, and Douglas P. Twitchell 2

1 Department of Information Systems, University of Maryland, Baltimore County
[email protected]
2 Center for the Management of Information, University of Arizona
{jburgoon, dtwitchell}@cmi.arizona.edu
Abstract. The detection of deception is a promising but challenging task. Previous exploratory research on deception in computer-mediated communication found that language cues were effective in differentiating deceivers from truth-tellers. However, whether and how these language cues change over time remains an open issue. In this paper, we investigate the effect of time on cues to deception in an empirical study. The preliminary results showed that some cues to deception change over time, while others do not. An explanation for the lack of change in the latter cases is provided. In addition, we show that the number and type of cues to deception vary from time to time. We also suggest what could be the best time to investigate cues to deception in a continuous email communication.
in one of the most popular types of CMC (www.info.isoc.org), but it also allows us to focus our attention on language behavior in deception.
Most past deception research focuses on nonverbal cues or a mix of verbal and other types of cues to deception in face-to-face settings [10,11,15,18]. What remains theoretically challenging is how effective verbal indicators of deception can be in email and other types of CMC. Although some text-based cues from prior studies are potentially applicable to email, deception research in email is still rare, mainly because it must address the following challenges: 1) the high dynamics of messages, especially in message length, language style, and message structure (email is expressed through the medium of writing, though it displays several of the core properties of speech, such as the expectation of responses, transience, and time-governed interactions [8]); 2) low media richness, since email lacks a true ability to signal meaning through kinetic and proxemic features; and 3) the lack of other linguistic features typical of conversational speech, which makes it difficult for language to be used in a truly conversational way [8]. Among existing text-based cues [13,16,17], we selected language cues, which are less dependent upon domain experts and can potentially be automated as a result of progress in natural language processing technologies.
In this paper, we aim to study the effect of time on language cues to deception in email. We examine what kinds of language cues vary over time and which cues remain consistent. A secondary objective is to explore during what period of a continuous email communication deceivers display language cues to deception most evidently. These results are expected to shed light on how deceivers adjust their deception strategies over time and on what part of a conversation is best for detecting deception.
2 Theoretical Foundation and Hypotheses
2.1 Media Richness Theory
As one of the least rich media, email does not have the same ability to transmit information, meaning, and emotion as do richer media, such as face-to-face interaction [9]. Because e-mail is a less rich medium, deception is claimed to be more difficult to detect over e-mail than over richer media [12]. However, if users of text-based systems perceive the channel being used as able to convey richer information than it really does, they may use the system in a way that begins to mimic the use of richer systems. Although the theory does not address only one modality, many of its findings are applicable to other mediated channels that are low in richness.
2.2 Interpersonal Deception Theory (IDT)
IDT attempts to explain deception from an interpersonal, conversational perspective rather than from a strictly physiological one [2]. IDT posits that, within the context and relationship of the sender and receiver of deception, the deceiver will both engage in strategic modifications of behavior in response to the receiver's suspicions and display non-strategic leakage cues or indicators of deception. Tests of this theory have
confirmed the existence of brevity and nonimmediacy along with other identifiable cues, which may be useful in detecting deception within any modality [4, 6]. The theory not only applies to physiological or nonverbal indicators but also pertains to verbal indicators. Information management, one of the strategic behaviors of deceivers posited in IDT, is closely related to the modification or manipulation of central message content and its language style [5]. Most of all, the idea that interaction influences participants' subsequent behaviors indicates that language behavior may change during different phases of communication.
2.3 Interpersonal Adaptation Theory (IAT)
IAT clarified and described the interaction patterns of reciprocity and compensation in dyadic interaction [7]. Among other propositions, it implies a focus on longitudinal analyses of interaction. Deception is likely to be a continuous event that occurs over time [20]. Even with deception goals in mind, deceivers may manage to embed their intention in other messages that seem truthful to their partners. One underlying motivation for deceivers is to prevent their partners from suspecting them, which may lead to cognitive arousal. In addition, the adaptation within dyadic interaction may occasionally lead deceivers to display behavior similar to that of truth-tellers. Therefore, we expected to see dynamics over time in language cues to deception.
2.4 Hypotheses
We first extracted a set of effective linguistic cues to deception based on the findings of a previous study [21]. That study found that deceivers and truth-tellers differ significantly in the quantity and diversity of their language. In addition, it revealed that deceivers display different informality and affect in their language than truth-tellers do, and it partially supported the effect of non-immediacy on deception. However, it did not examine whether the above differences hold consistently in continuous communication. In this study, therefore, we focus on the change of cues to deception over time.
To balance the conflicting goals of achieving their communication objectives and managing the potential arousal caused by deceiving, deceivers may not exhibit the same language behaviors all the time. They may intentionally manage themselves to be less deceptive at some times than at others. The proposition that cues change along the time dimension is captured in Hypothesis 1.
HYPOTHESIS 1. Deceivers change (a) quantity, (b) diversity, (c) informality, (d) affect, and (e) non-immediacy of language over time.
To remove the potential effect of the task, we compare deceivers with truth-tellers who perform the same task, examining how the significance of cues changes over time. Thus, we are also interested in Hypothesis 2.
HYPOTHESIS 2. Differences between deceivers’ language and truth tellers’ language on (a) quantity, (b) diversity, (c) informality, (d) affect, and (e) nonimmediacy vary across time.
3 Method
The research experiment was a 2×2×3 doubly repeated measures design varying experimental condition (0: truthful, 1: deceptive), dyad role (0: sender, 1: receiver), and time (1: time 1, 2: time 2, 3: time 3), with the last two factors serving as within-dyad repeated factors. Subjects were randomly assigned to one of the two roles in one of the two experimental conditions and performed a task for 3 consecutive days under the same condition. Truthful senders served as the control condition. A series of repeated measures analyses and analyses of variance were conducted to test Hypotheses 1 and 2. In all analyses, day was treated as a within-dyads factor, and repeated contrasts were performed for day to test for potential trends.
3.1 Experiment Design
Subjects. Subjects (N = 60; 57% female) were pairs of freshman, sophomore, junior, and senior students recruited from an MIS course with extra credit for experimental participation.
Tasks. The task involved decision making in the Desert Survival Problem. The subjects were given a scenario in which they were stranded in the desert, and their primary goal was to agree on a ranking of a given list of items in order of their usefulness to survival.
Procedures. Senders (both truthful and deceptive) first ranked the given list of items and emailed their partner their rankings and explanations. Then, each naïve partner responded to the ranking with his or her own re-ranking and explanations. This procedure was repeated for three days, the only difference being that on day 2 and day 3, additional items on the list were rendered unsalvageable (e.g., the flashlight broken), thus forcing a reconsideration of the remaining items (see details in [21]).
3.2 Independent Variables
Deceptive Condition. There were two conditions: deception and truth. In the deception condition, a sender in a dyad was instructed to mislead the partner toward a ranking that differed from the sender's actual opinion; in the truth condition, a sender offered his or her true opinions. Of the 30 pairs, 14 were in the truth condition and 16 in the deception condition.
Time. Each of the three days in the experiment was treated as a time point. Thus, there were three times, labeled time 1, time 2, and time 3.
3.3 Dependent Variables
Nineteen dependent variables are grouped into five constructs: quantity, diversity, informality, affect, and non-immediacy, as shown in Table 1. Quantity represents the amount of message content produced, diversity indicates the diversity of wording, informality expresses the degree of informality of the messages produced, affect indicates the display of emotional affect in messages, and non-immediacy captures the indirectness of messages, which may prevent recipients from obtaining definite or affirmative information. The dependent variables were measured with the aid of a natural language processing tool [19].

Table 1. Summary of linguistic constructs and their component dependent variables

Quantity:        word, verb, modifier, noun phrase, sentence
Diversity:       lexical diversity, content diversity, redundancy
Informality:     typo ratio
Affect:          positive affect, negative affect
Non-immediacy:   passive voice, modal verb, objectification, uncertainty, generalizing term, self reference, group reference, other reference
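As a rough illustration of what such measures involve, the sketch below approximates a few of them (quantity, lexical diversity, and the three reference measures) with simple word lists and counts; the pronoun lists and regular expressions are assumptions, not the rules of the tool actually used in the study.

import re

SELF = {"i", "me", "my", "mine", "myself"}
GROUP = {"we", "us", "our", "ours", "ourselves"}
OTHER = {"he", "she", "they", "him", "her", "them", "his", "hers", "their", "theirs"}

def email_cues(text):
    # Approximate quantity, diversity, and the reference measures of non-immediacy.
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(1, len(words))
    return {
        "words": len(words),                                  # quantity
        "sentences": len(sentences),                          # quantity
        "lexical_diversity": len(set(words)) / n,             # diversity
        "self_reference": sum(w in SELF for w in words) / n,  # non-immediacy
        "group_reference": sum(w in GROUP for w in words) / n,
        "other_reference": sum(w in OTHER for w in words) / n,
    }

print(email_cues("We should keep the mirror. I think it is more useful than the flashlight."))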
4 Results 4.1 Repeated Measures Analyses
A series of repeated measures analyses were conducted on deceivers' email messages for each of the five linguistic constructs to test Hypothesis 1, that deceivers' language changes over time. The results showed that deceivers changed quantity (Wilks' Λ = 0.0838; F(10, 6) = 6.645; partial η² = 91.7%) and diversity (Wilks' Λ = 0.166; F(6, 10) = 8.36; partial η² = 83.4%) over time; non-immediacy approached significance (partial η² = 39.2%) in showing change over time. The follow-up univariate analyses and post-hoc contrast analyses revealed that all individual measures of quantity decreased significantly (p < 0.005) and two measures in the diversity construct increased significantly (p < 0.001) over time. In addition, other reference in the non-immediacy construct showed a decreasing pattern (p < 0.1), declining continuously from time 1 to time 3. Thus, deceivers' language became briefer and more complex over time, with fewer pronouns referencing others as time passed. Therefore, Hypotheses 1(a) and 1(b) were strongly supported, Hypothesis 1(e) was weakly supported, and Hypotheses 1(c) and 1(d) were not supported.
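For readers who wish to reproduce this style of analysis, the following sketch runs a one-way repeated-measures ANOVA on a single cue across the three times using statsmodels' AnovaRM. The long-format table, column names, and values are hypothetical, and the study's actual analysis was multivariate with repeated contrasts.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one row per deceptive sender per day, with 'words'
# standing in for one of the quantity measures.
df = pd.DataFrame({
    "sender": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":   [1, 2, 3] * 4,
    "words":  [210, 150, 120, 180, 160, 110, 240, 170, 130, 200, 140, 115],
})

# One-way repeated-measures ANOVA: does word count change across the three days?
print(AnovaRM(data=df, depvar="words", subject="sender", within=["time"]).fit())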
4.2 Analyses of Variance
We conducted multiple ANOVAs of the effect of deceptive condition on each of the five linguistic constructs at each of the three times separately, in order to analyze whether the same cues distinguish deceivers from truth tellers over time.

Table 2. Univariate analysis results (p-values and direction of relationship) at the three times for the five constructs

Construct        Component variables
Quantity         word, verb, modifier, noun phrase, sentence
Diversity        lexical diversity, content diversity, redundancy
Informality      typo ratio
Affect           positive affect, negative affect
Non-immediacy    passive voice, modal verb, objectification, uncertainty, generalizing terms, self reference, group reference, other reference

+ = higher for deceivers; - = lower for deceivers
The results revealed that none of the constructs was effective in differentiating deceivers from truth tellers at all three times. At best, diversity was significant at time 1 (partial η² = 39%) and time 2 (Wilks' Λ = 0.712; F(3, 26) = 3.506; partial η² = 28.8%); quantity was significant at time 2 (partial η² = 35.5%) but only approached significance at time 1 (p = 0.093, partial η² = 31%); informality was significant only at time 2 (F(1, 28) = 6.824, p = 0.014, partial η² = 19.6%); affect approached significance at time 1 (F = 2.661, partial η² = 16.5%) and time 3 (partial η² = 16.1%); and non-immediacy was significant only at time 2 (F(1, 28) = 6.824, partial η² = 19.6%). Therefore, Hypothesis 2 was well supported.
As shown in Table 2, the follow-up univariate analyses for time 1 revealed that deceivers were higher than truth tellers on all quantity measures (p < 0.05), with verbs the most significant (p < 0.01) and sentences the least (p < 0.1); they were lower than truth tellers on lexical diversity (p < 0.01) and content diversity (p < 0.05); and they were higher than truth tellers on negative affect (p = 0.05). At time 2, the same cues
remained significant except for negative affect. In addition, more significant cues (all with p < 0.05) emerged: for example, deceivers used more modal verbs, used more group references, had a greater typo ratio, and tended to be lower on self-references. The only cue significant at time 3 was positive affect (p < 0.05), with deceivers being higher.
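The per-time univariate comparisons amount to testing each cue between conditions at a fixed time. A minimal sketch with scipy is shown below; the cue values are illustrative, and the study's analysis ran multivariate tests per construct before such univariate follow-ups.

import numpy as np
from scipy import stats

# Illustrative cue values (e.g., lexical diversity) at a single time point.
deceivers = np.array([0.52, 0.48, 0.55, 0.50, 0.47, 0.53])
truth_tellers = np.array([0.61, 0.58, 0.64, 0.60, 0.57, 0.63])

# Independent-samples t-test for one cue at one time.
t, p = stats.ttest_ind(deceivers, truth_tellers)
print("t = %.3f, p = %.3f" % (t, p))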
5 Discussion
The results in Section 4 reveal a pattern: the number of constructs and individual cues to deception started at a relatively high value at time 1, peaked at time 2 (the middle phase of the communication), and dropped abruptly at time 3 (the final phase of the communication). This indicates that: 1) the time of communication and/or task matters in differentiating deceivers from truth-tellers; and 2) deceivers tend to expose a fair number of deception cues at the beginning of communication, expose more over time, and then finish the communication with few cues exposed. Therefore, in order to identify deception in a continuous communication, we should either merge all the exchanged messages from the same person to look for overall deception patterns, or select certain times in the middle of the communication for further investigation.
It is also evident in Table 2 that deceivers showed more negative affect at time 1 but more positive affect at time 3, which suggests that deceivers may initially be overtaken by the negative arousal and cognitive load caused by deception. However, they gained better control of their affective display as communication continued. Finally, they were able to assume positive affect and leave a pleasant impression with their partners.
Deceivers changed the quantity, diversity, and non-immediacy of their language significantly over time; however, they maintained informality and affect at about the same level. A direct speculation on the latter is that either these two measures are too difficult for deceivers to manage strategically, or deceivers are very cautious to keep them consistent over time. It is reasonable to expect that deceivers do not intentionally produce typos, so the stability of informality is probably explained by the first possibility. In contrast, we would expect affect to change over time, since it is not natural for people to display the same level of affect all the time; the lack of change in affect may therefore result from deceivers' intentional control.
Taking together the significant effect of time on most of the dependent constructs found in this study and the effect of deception condition found in the prior study [21], we can clearly see that detecting deception is an extremely complex task involving many dynamics and contextual factors. However, automatic deception detection with accuracy beyond the level of chance is still a reachable goal as more effective cues become available.
Questions remain as to the external validity of these results for other tasks or contexts. This study suffers from the weaknesses associated with laboratory research and student samples. However, given the email setting, the laboratory condition is a close approximation of a real-life situation, for real deceivers would also have sufficient time to compose messages asynchronously and would not show their partners anything other than the text of the messages themselves.
The lack of common standards and structures in email language was fully reflected in this study. Some subjects did not give a full stop to their sentences until reaching the end of their messages, while others simply used phrases and fragments rather than complete sentences in their messages. This lack of structure is an important fact of email that CMC researchers have to face.
6 Conclusion
Our results confirmed that some cues to deception change over time. Others remained unchanged, which may be due to either lack of control or intentional control. None of the cues was effective in differentiating truth from deception across all time periods. In other words, the number and type of cues that can reliably distinguish deceivers from truth-tellers vary from time to time. Differentiation was relatively high at the beginning, peaked in the middle, and plummeted at the end of communication. These results are consistent with Interpersonal Deception Theory and Interaction Adaptation Theory, which postulate that deceivers intentionally adapt their communication over time as they gain greater control of their internal arousal and external behavior patterns and as they adapt to receiver feedback.
This study indicates that affect needs to be considered in identifying deception in email, even though it is implicitly embedded in messages rather than explicitly displayed as in face-to-face communication. With communicators physically distributed, deceivers may find it much easier to adjust their affective display over time. The study also suggests that cues to deception do matter, but that they interact with the time or phase of communication. Based on this study and the prior research, we conclude that matching cues to deception to the phase of communication is important for improving the performance of deception detection. Email is a popular communication medium with features that distinguish it from other communication types, so the importance of understanding deception patterns in email will be widely recognized. We believe that this study provides empirical evidence to support intelligent deception detection in cyberspace.
Acknowledgement & Disclaimer. Portions of this research were supported by funding from the U.S. Air Force Office of Scientific Research under the U.S. Department of Defense University Research Initiative (Grant #F49620-01-1-0394). The views, opinions, and/or findings in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
References 1. D. B. Buller and J. K. Burgoon, "Deception: Strategic and nonstrategic communication," in Strategic interpersonal communication, J. A. Daly and J. M. Wiemann, Eds. Hillsdale, NJ: Erlbaum, 1994, pp. 191–223.
2. D. B. Buller and J. K. Burgoon, "Interpersonal Deception Theory," Communication Theory, vol. 6, pp. 203–242, 1996.
3. J. K. Burgoon, D. B. Buller, A. S. Ebesu, and P. Rockwell, "Interpersonal deception V: Accuracy in deception detection," Communication Monographs, vol. 61, pp. 303–325, 1994.
4. J. Burgoon and D. B. Buller, "Interpersonal deception: XI. Effects of deceit on perceived communication and non-verbal behavior dynamics," Journal of Nonverbal Behavior, vol. 18, pp. 155–184, 1994.
5. J. K. Burgoon, D. E. Buller, L. K. Guerrero, W. A. Afifi, and C. M. Feldman, "Interpersonal Deception: XII. Information management dimensions underlying deceptive and truthful messages," Communication Monographs, vol. 63, pp. 52–69, 1996.
6. J. K. Burgoon, D. B. Buller, C. H. White, W. Afifi, and A. L. S. Buslig, "The role of conversational involvement in deceptive interpersonal interactions," Personality & Social Psychology Bulletin, vol. 25, pp. 669–685, 1999.
7. J. K. Burgoon, N. Miczo, and L. A. Miczo, "Adaptation during deceptive interactions: testing the effects of time and partner communication style," presented at the National Communication Association Convention, Atlanta, 2001.
8. D. Crystal, Language and the Internet. Cambridge: Cambridge University Press, 2001.
9. R. Daft and R. Lengel, "Organizational information, message richness and structural design," Management Science, vol. 32, pp. 554–571, 1986.
10. B. M. DePaulo, J. T. Stone, and G. D. Lassiter, "Deceiving and detecting deceit," in The Self and Social Life, B. R. Schlenker, Ed. New York: McGraw-Hill, 1985.
11. P. Ekman and M. O'Sullivan, "Who Can Catch a Liar?," American Psychologist, vol. 46, pp. 913–920, 1991.
12. J. F. George and J. R. Carlson, "Group support systems and deceptive communication," presented at HICSS-32, Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences, 1999.
13. E. Höfer, L. Akehurst, and G. Metzger, "Reality monitoring: a chance for further development of CBCA?," presented at the Annual Meeting of the European Association on Psychology and Law, Sienna, Italy, 1996.
14. P. E. Johnson, S. Grazioli, K. Jamal, and R. G. Berryman, "Detecting deception: adversarial problem solving in a low base-rate world," Cognitive Science, vol. 25, pp. 355–392, 2001.
15. R. E. Kraut, "Verbal and nonverbal cues in the perception of lying," Journal of Personality and Social Psychology, pp. 380–391, 1978.
16. S. Porter and J. C. Yuille, "The language of deceit: An investigation of the verbal clues to deception in the interrogation context," Law and Human Behavior, vol. 20, pp. 443–458, 1996.
17. M. Steller and G. Köhnken, "Criteria-Based Content Analysis," in Psychological Methods in Criminal Investigation and Evidence, D. C. Raskin, Ed. New York: Springer Verlag, 1989, pp. 217–245.
18. A. Vrij, K. Edward, K. P. Robert, and R. Bull, "Detecting deceit via analysis of verbal and nonverbal behavior," Journal of Nonverbal Behavior, pp. 239–264, 2000.
19. A. Voutilainen, "Helsinki taggers and parsers for English," in Corpora Galore: Analysis and Techniques in Describing English, J. M. Kirk, Ed. Amsterdam & Atlanta: Rodopi, 2000.
20. C. H. White and J. K. Burgoon, "Adaptation and communicative design: Patterns of interaction in truthful and deceptive conversation," Human Communication Research, vol. 27, pp. 9–37, 2001.
21. L. Zhou, D. Twitchell, T. Qin, J. Burgoon, and J. Nunamaker, "An Exploratory Study into Deception Detection in Text-based Computer-Mediated Communication," presented at the 36th Hawaii International Conference on System Sciences, Big Island, Hawaii, 2003.
Evacuation Planning: A Capacity Constrained Routing Approach

Qingsong Lu, Yan Huang, and Shashi Shekhar

Department of Computer Science and Engineering, University of Minnesota,
200 Union St SE, Minneapolis, MN 55455, USA
{lqingson,huangyan,shekhar}@cs.umn.edu
http://www.cs.umn.edu/research/shashi-group
Abstract. Evacuation planning is critical for applications such as disaster management and homeland defense preparation. Efficient tools are needed to produce evacuation plans that move populations to safety in the event of catastrophes, natural disasters, and terrorist attacks. Current optimal methods suffer from computational complexity and may not scale up to large transportation networks. Current naive heuristic methods do not consider the capacity constraints of the evacuation network and may not produce feasible evacuation plans. In this paper, we model capacity as a time series and use a capacity constrained heuristic routing approach to solve the evacuation planning problem. We propose two heuristic algorithms, the Single-Route Capacity Constrained Planner and the Multiple-Route Capacity Constrained Planner, that incorporate the capacity constraints of the routes. Experiments on a real building dataset show that our proposed algorithms can produce close-to-optimal solutions, with total evacuation time within 10 percent of the optimal solution, while reducing the computational cost to only half that of the optimal algorithm. The experiments also show that our algorithms are scalable with respect to the number of evacuees.
1 Introduction
Evacuation planning is critical for numerous important applications, e.g., emergency building evacuation, disaster management and recovery, and homeland defense preparation. Efficient tools are needed to produce evacuation plans that identify routes and schedules to evacuate populations to safety in the event of catastrophes, natural disasters, and terrorist attacks [8,3,4]. The current methods of evacuation planning can be divided into three categories, namely warning systems, linear programming approaches, and heuristic approaches. Warning
This work is supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory under contract number DAAD19-01-2-0014. The content does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. AHPCRC and the Minnesota Supercomputer Institute provided access to computing facilities.
systems simply convey threat descriptions and the need for evacuation to the affected people via mass-media communication. Such systems can have unanticipated effects on the evacuation process. For example, when Hurricane Andrew was approaching Florida and Louisiana in 1992, the affected population was simply asked to leave the area as soon as possible. This caused tremendous traffic congestion on highways and led to great confusion and chaos [1].
The second type of evacuation planning uses network flow and linear programming approaches. EVACNET [9,12,13] produces an optimal solution using linear programming methods. It has exponential running time and cannot be applied to large transportation networks. Hoppe and Tardos [10,11] gave the first, and so far the only, polynomial-time algorithm to compute an optimal solution for the evacuation problem. However, their algorithm uses the ellipsoid method, which suffers from high computational complexity and is therefore not practical to implement.
The third type of evacuation planning uses heuristic approaches to find evacuation plans. However, current naive heuristic approaches only compute the shortest-distance path from a source to the nearest exit, without considering route capacity constraints or traffic from other sources. They cannot produce efficient plans when the number of people to be evacuated is large and the route network is complex. New heuristic approaches are needed to account for the capacity constraints of the evacuation network.
A capacity constrained routing approach reserves route capacities, subject to capacity constraints, in an order specified by heuristics. We propose two new heuristic algorithms for capacity constrained routing, namely a single-route approach and a multiple-route approach. The first algorithm evacuates all the people from the same source via a single route by reserving route capacity in an order determined by pre-computed shortest path lengths. The second algorithm can assign multiple routes to groups of people from the same source, based on an order prioritized by shortest-travel-time path lengths re-calculated in each iteration. The multiple-route approach produces close-to-optimal solutions with significantly reduced computational time compared to optimal solution algorithms. It outperforms the single-route approach in solution quality because of its flexibility in choosing multiple routes, although it is computationally more expensive; the single-route approach can produce a solution for a large network in seconds. Experimental results on a large building dataset show that our proposed algorithms can produce close-to-optimal solutions, with total evacuation time within 10% of the optimal solution, while reducing the computational cost to only half that of the optimal algorithm. Our algorithms are also scalable with respect to the total number of people to be evacuated. To the best of our knowledge, this is the first paper exploring heuristic algorithms using capacity constrained routing for evacuation planning.
Outline: The rest of the paper is organized as follows. In Section 2, the problem formulation is provided and related concepts are illustrated with an example. Section 3 proposes two capacity constrained heuristic algorithms. The algorithm comparison and cost models are given in Section 4. In Section 5, we present
the experimental design and results. We summarize our work and discuss future directions in Section 6. Scope: The proposed algorithms cannot be applied directly to vehicle routing models in transportation networks that have intersection queuing delays and turn penalties.
2 Problem Formulation
The capacity constrained routing problem can be formulated as follows. Given a transportation network with capacity constraints, the initial number of people to be evacuated, their initial locations, and evacuation destinations, we need to produce evacuation route plans consisting of a set of origin-destination routes and a scheduling of people to be evacuated via the routes. The objective is to minimize the total time needed for evacuation. The scheduling of people onto the routes should observe the route capacity constraints. A secondary objective is to minimize the computational overhead of producing the evacuation plan. We illustrate the problem formulation and a solution with the following example. Suppose we have a simple two-story building, as shown in Figure 1 (floor map from [13]). In this building, there are two rooms on the second floor, two staircases, and one room and two exits on the first floor.
Fig. 1. Building Floor Map with Node and Edge Definition
This building is modelled as a node-edge graph, as shown in Figure 2. In this model, each room, corridor, staircase, and exit of the building is represented as a node, shown as an ellipse. Each node has two attributes: maximum node capacity and initial node occupancy. For example, node N1, which represents Room 201 in the building, has a maximum capacity of 50, which means Room 201 can hold at most 50 people, and an initial occupancy of 10, which means there are initially 10 people in this room to be evacuated. Each pathway from one node to another is represented as an edge, shown by arrows between two nodes in Figure 2. Each edge also has two attributes: maximum edge capacity and travel time. For example, edge N1-N3, which represents the path linking Room 201 and the corridor, has a maximum capacity of 7, which means at most 7 people can travel from Room 201 to the corridor simultaneously, and a travel time of 1, which means it takes 1 time unit to travel from the room to the corridor. This approach of modelling a building floor map with capacities as a node-edge graph is similar to those presented in [13,5].
Fig. 2. Node-Edge Graph Model of Example Building
As shown in Figure 2, suppose we initially have 10 people at node N1, 5 at node N2, and 15 at node N8. The task is to compute an evacuation plan that evacuates the 30 people to the exits (N13 and N14) using the least amount of time.
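Before describing the algorithms, it may help to see how such a network and its time-series capacities can be represented. The sketch below uses the attribute values given in the text for N1, edge N1-N3, and the N8-N10-N13 route of the worked example; the remaining capacities are placeholders, and treating exits as having unlimited node capacity is an assumption.

from collections import defaultdict

HORIZON = 30  # number of discrete time units tracked

# Node attributes: (maximum capacity, initial occupancy). N1 is from the text; the
# other maximum capacities are placeholders, and exits are treated as unlimited.
nodes = {
    "N1": (50, 10),     # Room 201: capacity 50, 10 evacuees initially
    "N2": (50, 5),      # capacity is a placeholder; 5 evacuees initially
    "N8": (50, 15),     # capacity is a placeholder; 15 evacuees initially
    "N13": (None, 0),   # EXIT1
    "N14": (None, 0),   # EXIT2
}

# Edge attributes: (maximum capacity, travel time). N1-N3 is from the text; the
# N8-N10 capacity and the travel times along N8-N10-N13 follow the worked example,
# and the N10-N13 capacity is a placeholder.
edges = {
    ("N1", "N3"): (7, 1),
    ("N8", "N10"): (6, 3),
    ("N10", "N13"): (6, 1),
}

# Available capacity modelled as a time series: one entry per time unit.
avail_edge = {e: [cap] * HORIZON for e, (cap, _) in edges.items()}
avail_node = defaultdict(lambda: [float("inf")] * HORIZON)
for name, (cap, _) in nodes.items():
    if cap is not None:
        avail_node[name] = [cap] * HORIZON

travel_time = {e: tt for e, (_, tt) in edges.items()}
print(avail_edge[("N8", "N10")][:5])   # available capacity of edge N8-N10 at times 0..4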
Example 1 (An Evacuation Plan). Table 1 shows an evacuation plan. In the table, each row shows one group of people moving together during the evacuation, with a group ID, the number of people in the group, the origin node, the start time, the evacuation route, and the exit time. Take node N8 for example: initially there are 15 people at N8. They are divided into 3 groups: Group A with 6 people, Group B with 6 people, and Group C with 3 people. Group A starts at time 0, follows route N8-N10-N13, and reaches EXIT1 (N13) at time 4. Group B starts at time 1, also follows route N8-N10-N13, and reaches EXIT1 (N13) at time 5. Group C starts at time 0, follows route N8-N11-N14, and reaches EXIT2 (N14) at time 4. The procedure is similar for the people from N1 and N2. The whole evacuation takes 16 time units, since the last groups of people (Groups F and J) reach an exit at time 16.
Table 1. Evacuation Plan Example

ID   Origin   No. of People   Start Time
A    N8       6               0
B    N8       6               1
C    N8       3               0
D    N1       3               0
E    N1       3               1
F    N1       3               2
G    N1       1               0
H    N2       3               0
I    N2       2               1
We use a capacity constrained routing approach to conduct the evacuation planning. We model available edge capacity and available node capacity as time series instead of fixed numbers. A time series represents the available capacity at each time instant for a given edge or node. We propose an approach based on extensions of shortest path algorithms [7,6] to account for route scheduling with capacity constraints, and we propose two heuristic algorithms to compute the evacuation plan.
3.1 Single-Route Capacity Constrained Planner (SRCCP)
In the Single-Route Capacity Constrained Planner (SRCCP) algorithm, first, the shortest routes from each source to any destination are pre-computed. Next, capacities are reserved along the pre-computed routes by reducing available node
and edge capacities at certain time points along the route. The detailed pseudo-code and algorithm description are as follows.

Algorithm 1 Single-Route Capacity Constrained Planner (SRCCP)
Input:
1) G(N, E): a graph G with a set of nodes N and a set of edges E;
   Each node n ∈ N has two properties:
      Maximum_Node_Capacity(n): non-negative integer
      Initial_Node_Occupancy(n): non-negative integer
   Each edge e ∈ E has two properties:
      Maximum_Edge_Capacity(e): non-negative integer
      Travel_time(e): non-negative integer
2) S: set of source nodes, S ⊆ N;
3) D: set of destination nodes, D ⊆ N;
Output: Evacuation plan
Method:
for each source node s ∈ S do                                                   (1)
   find the shortest-time route R_s = <n_0, n_1, ..., n_k> among routes from s
      to all destinations d ∈ D (where n_0 = s and n_k = d);                    (2)
Sort routes R_s by total travel time, in increasing order;                      (3)
for each route R_s in sorted order do {                                         (4)
   Initialize next start node on route R_s to move: st = 0;                     (5)
   while not all evacuees from n_0 have reached n_k do {                        (6)
      t = next available time to start moving from node n_st;                   (7)
      n_end = furthest node that can be reached from n_st without stopping;     (8)
      flow = min( number of evacuees at node n_st,
                  Available_Edge_Capacity(all edges between n_st and n_end on R_s),
                  Available_Node_Capacity(all nodes from n_st+1 to n_end on R_s) );   (9)
      for i = st to end - 1 do {                                                (10)
         t' = t + Travel_time(e(n_i, n_i+1));                                   (11)
         Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;             (12)
         Available_Node_Capacity(n_i+1, t') reduced by flow;                    (13)
         t = t';                                                                (14)
      }                                                                         (15)
      st = closest node to the destination on route R_s that still has evacuees;   (16)
   }                                                                            (17)
}                                                                               (18)
Postprocess results and output evacuation plan;                                 (19)

In the first step (lines 1-2), for each source node s, we find the route R_s with the shortest total travel time among the routes between s and all the destination nodes. The total travel time of route R_s is the sum of the travel times of all edges on R_s. For example, in Figure 2, R_N1 is N1-N3-N4-N6-N10-N13 with a total travel time of 14 time units, R_N2 is N2-N3-N4-N6-N10-N13 with a total travel time of 14 time units, and R_N8 is N8-N10-N13 with a total travel time of 4 time units. This step is done with a variation of Dijkstra's algorithm [7] in which edge travel time
is treated as edge weight and the algorithm terminates when the shortest route from s to one destination node is determined.
The second step (line 3) is to sort the routes obtained in step 1 in increasing order of total travel time. Thus, in our example, the order of the routes will be R_N8, R_N1, R_N2.
The third step (lines 4-18) is to reserve capacities for each route in the sorted order. The reservation for route R_s is done by sending all the people initially at node s to the exit along the route in the least amount of time. The people may need to be divided into groups and sent in waves due to the capacity constraints of the nodes and edges on R_s. For example, for R_N8, the first group of people that starts from N8 at time 0 contains at most 6 people because the available edge capacity of N8-N10 at time 0 is 6. The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they occupy that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, because the 6 people travel through this edge starting at time 0; for node N10 at time 3, because they arrive at N10 at time 3; and for edge N10-N13 at time 3, because they travel through this edge starting at time 3. They finally arrive at N13 (EXIT1) at time 4. The second group of people leaving N8 has to wait until time 1, since the first group has reserved all the capacity of edge N8-N10 at time 0. Therefore, the second group leaves N8 at time 1 and reaches N13 at time 5. Similarly, the last group of 3 people leaves N8 at time 2 and reaches N13 at time 6. Thus all people from N8 are sent to exit N13. The next two routes, R_N1 and R_N2, make their reservations based on the available capacities left by the previous routes. The final step of the algorithm is to output the entire evacuation plan, as shown in Table 2, which takes 18 time units.

Table 2. Result Evacuation Plan of the Single-Route Capacity Constrained Planner

ID   Origin   No. of People   Start Time
A    N8       6               0
B    N8       6               1
C    N8       3               2
D    N1       3               0
E    N1       3               0
F    N1       1               0
G    N1       2               1
H    N1       1               1
I    N2       2               0
J    N2       3               0
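The reservation bookkeeping used by both planners can be sketched as follows, using the same dictionary-based time-series structures as the earlier sketch. This is a simplified illustration of lines 9-14 of Algorithm 1, not the authors' implementation.

# Minimal setup (same shape as the earlier sketch): capacities indexed by time unit.
HORIZON = 30
travel_time = {("N8", "N10"): 3, ("N10", "N13"): 1}
avail_edge = {e: [6] * HORIZON for e in travel_time}
avail_node = {n: [float("inf")] * HORIZON for n in ("N8", "N10", "N13")}

def max_flow_on_route(route, start_time, at_source):
    # Largest group that can leave at start_time, limited by the people waiting at the
    # source and by the edge/node capacities encountered along the route.
    flow, t = at_source, start_time
    for u, v in zip(route, route[1:]):
        flow = min(flow, avail_edge[(u, v)][t])
        t += travel_time[(u, v)]
        flow = min(flow, avail_node[v][t])
    return max(0, flow)

def reserve_route(route, start_time, flow):
    # Edge capacity is consumed at the time the group enters each edge, node capacity
    # at the time the group arrives at the next node (pseudocode lines 11-14).
    t = start_time
    for u, v in zip(route, route[1:]):
        avail_edge[(u, v)][t] -= flow
        t_arrive = t + travel_time[(u, v)]
        avail_node[v][t_arrive] -= flow
        t = t_arrive
    return t   # time at which the group reaches the end of the route

# Example: send the first wave of evacuees from N8 along N8-N10-N13 starting at time 0.
route = ("N8", "N10", "N13")
group = max_flow_on_route(route, 0, 15)
print(group, reserve_route(route, 0, group))   # expected: group of 6, arriving at time 4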
3.2 Multiple-Route Capacity Constrained Planner (MRCCP)
The Multiple-Route Capacity Constrained Planner (MRCCP) is an iterative approach. In each iteration, the algorithm re-computes the earliest-arrival-time route from any source to any destination, taking the previous reservations and possible on-route waiting time into consideration. It then reserves the capacities for the route chosen in the current iteration. The detailed pseudo-code and algorithm description are as follows.

Algorithm 2 Multiple-Route Capacity Constrained Planner (MRCCP)
Input:
1) G(N, E): a graph G with a set of nodes N and a set of edges E;
   Each node n ∈ N has two properties:
      Maximum_Node_Capacity(n): non-negative integer
      Initial_Node_Occupancy(n): non-negative integer
   Each edge e ∈ E has two properties:
      Maximum_Edge_Capacity(e): non-negative integer
      Travel_time(e): non-negative integer
2) S: set of source nodes, S ⊆ N;
3) D: set of destination nodes, D ⊆ N;
Output: Evacuation plan
Method:
while any source node s ∈ S has evacuees do {                                   (1)
   find the route R = <n_0, n_1, ..., n_k> with the earliest destination arrival time
      among routes between all s,d pairs, where s ∈ S, d ∈ D, n_0 = s, n_k = d;  (2)
   flow = min( number of evacuees still at source node s,
               Available_Edge_Capacity(all edges on route R),
               Available_Node_Capacity(all nodes from n_1 to n_k on route R) );  (3)
   for i = 0 to k - 1 do {                                                      (4)
      t' = t + Travel_time(e(n_i, n_i+1));                                      (5)
      Available_Edge_Capacity(e(n_i, n_i+1), t) reduced by flow;                (6)
      Available_Node_Capacity(n_i+1, t') reduced by flow;                       (7)
      t = t';                                                                   (8)
   }                                                                            (9)
}                                                                               (10)
Postprocess results and output evacuation plan;                                 (11)
The MRCCP algorithm keeps iterating as long as there are still evacuees at any source node (line 1). Each iteration starts by finding the route R with the earliest destination arrival time from any source node to any exit node based on the current available capacities (line 2). This is done by generalizing Dijkstra's shortest path algorithm [7] to work with the time-series capacities and edge travel times. Route R is the route that reaches an exit in the least
Table 3. Result Evacuation Plan of the Multiple-Route Capacity Constrained Planner

Group ID   Origin   No. of People   Start Time
A          N8       6               0
B          N8       6               1
C          N8       3               0
D          N1       3               0
E          N1       3               1
F          N1       3               0
G          N1       1               2
H          N1       3               1
I          N2       2               2
amount of time and through which at least one person can be sent to the exit. For example, at the very first iteration, R will be N8-N10-N13, which reaches N13 at time 4. The actual number of people that will travel through R is the smallest number among the number of evacuees at the source node and the available capacities of each of the nodes and edges on route R (line 3). Thus, in the example, this amount will be 6, which is the available edge capacity of N8-N10 at time 0. The next step is to reserve capacities for these people on each node and edge of route R (lines 4-9). The algorithm makes reservations for the 6 people by reducing the available capacity of each node and edge at the time point at which they occupy that node or edge. This means that available capacities are reduced by 6 for edge N8-N10 at time 0, for node N10 at time 3, and for edge N10-N13 at time 3. They finally arrive at N13 (EXIT1) at time 4. Then, the algorithm goes back to line 2 for the next iteration. The iterations terminate when the occupancy of all source nodes is reduced to zero, which means all evacuees have been sent to exits. Line 11 outputs the evacuation plan, as shown in Table 3.
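A compact sketch of how such an iteration could be implemented is given below. This is not the authors' code: it uses a Dijkstra-style search over (node, time) states that allows waiting, it ignores the node capacity consumed while waiting (which the full algorithm accounts for), and it represents available capacities as time-indexed dictionaries; all names are hypothetical.

```python
import heapq
from collections import defaultdict

def earliest_arrival_route(graph, travel_time, edge_cap, node_cap,
                           avail_edge, avail_node, sources, exits, horizon=500):
    """Generalized Dijkstra over (node, time) states: a group may traverse an
    edge if it still has capacity at the departure time and the next node has
    room at the arrival time, or wait one time unit. Returns the earliest
    arrival path as [(node, time), ...] ending at an exit, or None."""
    heap = [(0, s, [(s, 0)]) for s in sources]
    heapq.heapify(heap)
    seen = set()
    while heap:
        t, u, path = heapq.heappop(heap)
        if u in exits:
            return path
        if (u, t) in seen or t > horizon:
            continue
        seen.add((u, t))
        heapq.heappush(heap, (t + 1, u, path[:-1] + [(u, t + 1)]))  # wait at u
        for v in graph[u]:
            if avail_edge[(u, v)].get(t, edge_cap[(u, v)]) <= 0:
                continue                       # edge fully reserved at time t
            ta = t + travel_time[(u, v)]
            if avail_node[v].get(ta, node_cap[v]) <= 0:
                continue                       # no room at v on arrival
            heapq.heappush(heap, (ta, v, path + [(v, ta)]))
    return None

def mrccp(graph, travel_time, edge_cap, node_cap, occupancy, exits):
    """Iterate: pick the earliest-arrival route under current reservations,
    send as many evacuees as its bottleneck allows, book the capacities, and
    repeat until all sources are empty (cf. lines 1-10 of Algorithm 2)."""
    avail_edge, avail_node = defaultdict(dict), defaultdict(dict)
    plan = []
    while any(occupancy.values()):
        sources = [s for s, k in occupancy.items() if k > 0]
        path = earliest_arrival_route(graph, travel_time, edge_cap, node_cap,
                                      avail_edge, avail_node, sources, exits)
        if path is None:
            break
        s = path[0][0]
        flow = occupancy[s]
        for (u, tu), (v, tv) in zip(path, path[1:]):
            flow = min(flow, avail_edge[(u, v)].get(tu, edge_cap[(u, v)]),
                       avail_node[v].get(tv, node_cap[v]))
        for (u, tu), (v, tv) in zip(path, path[1:]):
            avail_edge[(u, v)][tu] = avail_edge[(u, v)].get(tu, edge_cap[(u, v)]) - flow
            avail_node[v][tv] = avail_node[v].get(tv, node_cap[v]) - flow
        occupancy[s] -= flow
        plan.append((s, flow, path))
    return plan
```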
4 Comparison and Cost Models of the Two Algorithms
The key difference between the two algorithms is that the SRCCP algorithm produces only a single route for each source node, while MRCCP can produce multiple routes for the groups of people at each source node. MRCCP can produce evacuation plans with shorter evacuation time than SRCCP because of its flexibility in adapting to the capacities still available after previous reservations. However, MRCCP needs to re-compute the earliest-arrival route in each iteration, which incurs more computational cost than SRCCP. We now provide simple algebraic cost models for the computational cost of the two proposed heuristic algorithms. We assume the total number of nodes in the graph is n, the number of source nodes is n_s, and the number of groups generated in the resulting evacuation plan is n_g.
The cost of the SRCCP algorithm consists of three parts: the cost of computing the shortest-time route from each source node to any exit node, denoted by C_sp; the cost of sorting all the pre-computed routes by their total travel time, denoted by C_ss; and the cost of reserving capacities along each route for each group of people, denoted by C_sr. The cost model of the SRCCP algorithm is given as follows:

Cost_SRCCP = C_sp + C_ss + C_sr = O(n_s · n log n) + O(n_s log n_s) + O(n · n_g)    (1)

The MRCCP algorithm is an iterative approach. In each iteration, the route for one group of people is chosen and the capacities along the route are reserved. The total number of iterations is determined by the number of groups generated. In each iteration, the route with the earliest destination arrival time from each source node to any exit node is re-computed with cost O(n_s · n log n), and reservations are made for the node and edge capacities along the chosen route with cost O(n). The cost model of the MRCCP algorithm is given as follows:

Cost_MRCCP = O((n_s · n log n + n) · n_g)    (2)
In both cost models, the number of groups generated for the evacuation plan depends on the network configuration, which includes the maximum capacities of nodes and edges and the number of people to be evacuated at each source node.
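To make the comparison concrete, the two cost expressions can be evaluated for illustrative problem sizes. The constants dropped by the O-notation and the particular numbers below are hypothetical, chosen only to show that the MRCCP estimate exceeds the SRCCP estimate by roughly a factor of n_g (up to lower-order terms).

```python
import math

def cost_srccp(n, ns, ng):
    # O(ns * n log n) + O(ns log ns) + O(n * ng), with all constants set to 1
    return ns * n * math.log2(n) + ns * math.log2(ns) + n * ng

def cost_mrccp(n, ns, ng):
    # O((ns * n log n + n) * ng), with all constants set to 1
    return (ns * n * math.log2(n) + n) * ng

# Hypothetical sizes loosely inspired by the building dataset in Section 5.
n, ns, ng = 444, 130, 300
print(cost_srccp(n, ns, ng), cost_mrccp(n, ns, ng))
```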
5 Solution Quality and Performance Evaluation
In this section, we present the experiment design, our experiment setup, and the results of our experiments on a building dataset.
5.1 Experiment Design
Figure 3 describes the experimental design used to evaluate the impact of parameters on the algorithms. The purpose is to compare the solution quality and the computational cost of the two proposed algorithms with those of EVACNET, which produces an optimal solution. First, a test dataset representing a building layout or road network is chosen or generated. The dataset is an evacuation network characterized by its route capacities and its size (number of nodes and edges). Next, a generator is used to produce the initial state of the evacuation by populating the network with a distribution model that assigns people to source nodes. The initial state is converted to the EVACNET input format to produce an optimal solution via EVACNET, and to the node-edge graph format to evaluate the two proposed heuristic algorithms. The solution qualities and algorithm running times are then compared in the analysis module.
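The generator step of this design can be sketched as follows. This is a hypothetical illustration rather than the authors' tool; it uses a uniform random distribution model, and the source node ratio and occupancy ratio it takes as input are defined in Section 5.2.

```python
import random

def generate_initial_state(node_capacity, source_node_ratio, occupancy_ratio, seed=0):
    """Randomly choose source nodes and distribute people among them.
    node_capacity: dict mapping node id -> maximum node capacity."""
    rng = random.Random(seed)
    nodes = list(node_capacity)
    n_sources = max(1, round(source_node_ratio * len(nodes)))
    sources = rng.sample(nodes, n_sources)
    total_people = round(occupancy_ratio * sum(node_capacity.values()))
    occupancy = {v: 0 for v in nodes}
    for _ in range(total_people):
        # place each person at a random source node that still has room
        candidates = [s for s in sources if occupancy[s] < node_capacity[s]]
        if not candidates:
            break
        occupancy[rng.choice(candidates)] += 1
    return sources, occupancy
```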
Fig. 3. Experiment Design. (Block diagram: a test dataset representing a building layout or road network, characterized by route capacity and the number of nodes and edges, is fed, together with the number of people and an initial people location distribution model, into a generator that produces the initial state of the building or road network; the initial state is converted to the node-edge model for Algorithm 1 (Solution 1, Running Time 1) and Algorithm 2 (Solution 2, Running Time 2), and to the EVACNET model for the optimal solution (Running Time 3); all solutions and running times feed into the analysis module.)
5.2 Experiment Setup and Results
The test dataset used in the following experiments is the floor-map of Elliott Hall, a 6-story building on the University of Minnesota campus. The dataset network consists of 444 nodes (including 5 exit nodes), 475 edges, and a total node capacity of 3783 people. The generator produces initial states by varying the source node ratio and the occupancy ratio from 10% to 100%. The experiments were conducted on a workstation with an Intel Pentium III 1.2GHz CPU, 256MB RAM, and the Windows 2000 Professional operating system. The initial state generator distributes P_n people to S_n randomly chosen source nodes. The source node ratio is defined as S_n divided by the total number of nodes, and the occupancy ratio is defined as P_n divided by the total capacity of all nodes. We want to answer two questions: (1) How does the distribution of people affect the performance and solution quality of the algorithms? (2) Are the algorithms scalable with respect to the number of people to be evacuated?

Experiment 1: Effect of People Distribution. The purpose of the first experiment is to evaluate how the distribution of people affects the quality of the solution and the performance of the algorithms. We fixed the occupancy ratio and varied the source node ratio to observe the solution quality and the running time of the two proposed algorithms and EVACNET. The experiment was repeated with the occupancy ratio fixed at values from 10% to 100% of total capacity. Here we present the results with the occupancy ratio fixed at 30% and the source node ratio varying from 30% to 100%, which are typical of all test cases. Figure 4 shows the total evacuation time given by the three algorithms and Figure 5 shows their running times. As seen in Figure 4, at each source node ratio, MRCCP produces a solution whose total evacuation time is at most 10% longer than the optimal solution produced by EVACNET. The solution quality of MRCCP is not affected by the distribution of people when the total number of people is fixed. For SRCCP, the solution is 59% longer than the EVACNET optimal solution when the source node ratio is 30%, and drops to 29% longer when the source node ratio increases to 100%. This shows that the solution quality of SRCCP improves as the source node ratio increases. In Figure 5, we can see that the running time of EVACNET grows
Fig. 4. Quality of Solution With Respect to Source Node Ratio. (Total evacuation time vs. source node ratio, 30% to 100%, for SRCCP, MRCCP, and EVACNET.)
Fig. 5. Running Time With Respect to Source Node Ratio. (Running time in seconds vs. source node ratio, 30% to 100%, for SRCCP, MRCCP, and EVACNET.)
much faster than the running time of SRCCP and MRCCP when the source node ratio increases. This experiment shows: (1) SRCCP produces solutions closer to the optimal solution when the source node ratio is higher; (2) MRCCP produces close-to-optimal solutions (less than 10% longer than optimal) in less than half the running time of EVACNET; (3) the distribution of people does not affect the performance of the two proposed algorithms when the total number of people is fixed.

Experiment 2: Scalability with Respect to Occupancy Ratio. In this experiment, we evaluated the performance of the algorithms when the source node ratio is fixed and the occupancy ratio increases. Figure 6 and Figure 7 show the total evacuation time and the running time of the three algorithms when the source node ratio is fixed at 70% and the occupancy ratio varies from 10% to 70%, which is a typical case among all test cases. As seen in Figure 6, compared with the optimal solution by EVACNET, the solution quality of SRCCP decreases when the occupancy ratio increases, while the solution quality of MRCCP remains within 10% of the optimal solution. In Figure 7, the running time of EVACNET grows significantly when the occupancy
Fig. 6. Quality of Solution With Respect to Occupancy Ratio. (Total evacuation time vs. occupancy ratio, 10% to 70%, for SRCCP, MRCCP, and EVACNET.)
Fig. 7. Running Time With Respect to Occupancy Ratio. (Running time in seconds vs. occupancy ratio, 10% to 70%, for SRCCP, MRCCP, and EVACNET.)
ratio grows, while the running time of MRCCP remains less than half that of EVACNET and grows only linearly. This experiment shows: (1) the solution quality of SRCCP goes down when the total number of people increases; (2) MRCCP is scalable with respect to the number of people.
6 Conclusion and Future Work
In this paper, we proposed and evaluated two heuristic algorithms for the capacity constrained routing approach. Cost models and experimental evaluations using a real building dataset are presented. The proposed SRCCP algorithm produces a plan instantly, but the quality of its solution suffers as the number of evacuees grows. The MRCCP algorithm produces solutions within 10% of the optimal solution, while its running time is scalable with respect to the number of evacuees and is less than half that of the optimal algorithm. Both algorithms are scalable with respect to the number of evacuees. Currently, we choose the shortest travel time route without considering the available capacity of the route. In many cases, a longer route with larger available capacity may be a better choice. In our future work, we
would like to explore heuristics that rank routes based on weighted available capacity and travel time when choosing the best routes. We also want to extend and apply our approach to vehicle evacuation in transportation road networks. Modelling vehicle traffic during evacuation is more complicated than modelling pedestrian movements in building evacuation, because modelling vehicle traffic at intersections and the cost of taking turns are challenging tasks. Current vehicle traffic simulation tools, such as DYNASMART [14] and DYNAMIT [2], use an assignment-simulation method to simulate the traffic based on origin-destination routes. We plan to extend our approach to work with such traffic simulation tools to address vehicle evacuation problems.

Acknowledgment. We are particularly grateful to the Spatial Database Group members for their helpful comments and valuable discussions. We would also like to express our thanks to Kim Koffolt for improving the readability of this paper. This work is supported by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory, under contract number DAAD19-01-2-0014. The content does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. AHPCRC and the Minnesota Supercomputer Institute provided access to computing facilities.
References
1. Hurricane Evacuation web page. http://i49south.com/hurricane.htm, 2002.
2. M. Ben-Akiva et al. Development of Dynamic Traffic Assignment System for Planning Purposes: DynaMIT User's Guide. ITS Program, MIT, 2002.
3. S. Brown. Building America's Anti-Terror Machine: How Infotech Can Combat Homeland Insecurity. Fortune, pages 99–104, July 2002.
4. The Volpe National Transportation Systems Center. Improving Regional Transportation Planning for Catastrophic Events (FHWA). Volpe Center Highlights, pages 1–3, July/August 2002.
5. L. Chalmet, R. Francis, and P. Saunders. Network Model for Building Evacuation. Management Science, 28:86–105, 1982.
6. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
7. E.W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
8. ESRI. GIS for Homeland Security, An ESRI white paper. http://www.esri.com/library/whitepapers/pdfs/homeland security wp.pdf, November 2001.
9. R. Francis and L. Chalmet. A Negative Exponential Solution To An Evacuation Problem. Research Report No. 84-86, National Bureau of Standards, Center for Fire Research, October 1984.
10. B. Hoppe and E. Tardos. Polynomial Time Algorithms For Some Evacuation Problems. Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 433–441, 1994.
11. B. Hoppe and E. Tardos. The Quickest Transshipment Problem. Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 512–521, January 1995.
12. T. Kisko and R. Francis. EVACNET+: A Computer Program to Determine Optimal Building Evacuation Plans. Fire Safety Journal, 9:211–222, 1985.
13. T. Kisko, R. Francis, and C. Nobel. EVACNET4 User's Guide. University of Florida, http://www.ise.ufl.edu/kisko/files/evacnet/, 1998.
14. H.S. Mahmassani et al. Development and Testing of Dynamic Traffic Assignment and Simulation Procedures for ATIS/ATMS Applications. Technical Report DTFH61-90-R-00074-FG, CTR, University of Texas at Austin, 1994.
Locating Hidden Groups in Communication Networks Using Hidden Markov Models

Malik Magdon-Ismail(1), Mark Goldberg(1), William Wallace(2), and David Siebecker(1)

(1) CS Department, RPI, Rm 207 Lally, 110 8th Street, Troy, NY 12180, USA. {magdon,goldberg,siebed}@cs.rpi.edu
(2) DSES Department, RPI, 110 8th Street, Troy, NY 12180, USA. [email protected]
Abstract. A communication network is a collection of social groups that communicate via an underlying communication medium (for example, newsgroups over the Internet). In such a network, a hidden group may try to camouflage its communications amongst the typical communications of the network. We study the task of detecting such hidden groups given only the history of the communications for the entire communication network. We develop a probabilistic approach using a Hidden Markov model of the communication network. Our approach does not require the use of any semantic information regarding the communications. We present the general probabilistic model, and show the results of applying this framework to a simplified society. For 50 time steps of communication data, we can obtain greater than 90% accuracy in detecting both whether or not there is a hidden group and who the hidden group members are.
1 Introduction
The tragic events of September 11, 2001 underline the need for a tool which is capable of detecting groups that hide their existence and functionality within a large and complicated communication network such as the Internet. In this paper, we present an approach to identifying such groups. Our approach does not require the use of any semantic information pertaining to the communications. This is preferable because communication within a hidden group is usually encrypted in some way, hence the semantic information will be misleading, or unavailable. Social science literature has developed a number of theories regarding how social groups evolve and communicate, [1,2,3]. For example, individuals have a higher tendency to communicate if they are members of the same group, in accordance with homophily theory. Given some of the basic laws of how social groups evolve and communicate, one can construct a model of how the communications within the society should evolve, given the (assumed) group structure. If the group structure does not adequately explain the observed communications, but the addition of an extra, hidden, group does explain them, then we
have grounds to believe that there is a hidden group attempting to camouflage its communications within the existing communication network. The task is to determine whether such a group exists, and identify its members. We use a maximum likelihood approach to solving this task. Our approach is to model the evolution of a communication network using a Hidden Markov Model. A Hidden Markov model is appropriate when an observed process (in our case the macroscopic communication structure) is naturally driven by an unobserved, or hidden, Markov process (in our case the microscopic group evolution). Hidden Markov models have been used extensively in such diverse areas as: speech recognition, [4,5]; inferring the language of simple grammars [6]; computer vision, [7]; time series analysis, [8]; biological sequence analysis and protein structure prediction, [9,10,11,12,13]. Our interpretation of the group evolution giving rise to the observed macroscopic communications evolution makes it natural to model the evolution of communication networks using a Hidden Markov model as well. Details about the general theory of Hidden Markov models can be found in [4,14,15]. In social network analysis there are many static models of, and static metrics for, the measurement and evaluation of social networks [16]. These models range from graph structures to large simulations of agent behavior. The models have been used to discover a wide array of important communication and sociological phenomena, from the small world principle [17] to communication theories such as homophily and contagion [1]. These models, as good as they are, are not sufficient to study the evolution of social groups and the communication networks that they use; most focus on the study of the evolution of the network itself. Few attempt to explain how the use of the network shapes its evolution [18]. Few can be used to predict the future of the network and communication behavior over that network. Though there is an abundance of simulation work in the field of computational analysis of social and organizational systems [2,19,3] that attempts to develop dynamic models for social networks, none have employed the proposed approach and few incorporate sound probability theory or statistics [20] as the underlying model. The outline of the paper is as follows. First we consider a simplified example, followed by a description of the general framework. We also present some results to illustrate proof of concept on an example, and we end with some concluding remarks.
1.1 Example
A simple, concrete example will help to convey the details of our method. A more detailed formulation will follow. Consider newsgroups, for example alt.revisionism or alt.movies. A posting to a newsgroup in reply to a previous posting is a communication between two parties. Now imagine the existence of a hidden group that attempts to hide its communications, as illustrated in the figure below. Figure 1(a) shows the group structure. There are 4 observed groups. A fifth, hidden group also exists, whose members are unshaded. We do not observe the actual group composition, but rather the communications (who is posting and
Fig. 1. Illustration of a society. (Panels (a)-(c): (a) the group structure, with four observed groups labelled 1-4 and a fifth hidden group whose members are unshaded, X being the member who initiates a broadcast; (b) communications between members of the same observed group; (c) the hidden group's broadcast communications.)

Fig. 2. Communication time series of two societies. (Communication graphs at t = 1, . . . , 5 for a society with one hidden group and for a society with no hidden group.)
replying to posts in a given newsgroup). This is illustrated in Figure 1(b), where all the communications are between members of the same group. Figure 1(c) illustrates the situation when the hidden group members need to broadcast some information among themselves. The hidden group member who initiates the broadcast (say X) communicates with all the other hidden group members who are in the same visible groups as X. The message is then passed on in a similar manner until all the hidden members have received the broadcast. Notice that no communication needs to occur between members who are not in the same group, yet a message can be broadcast across the whole group. In order to maintain the appearance of being a bona-fide member of a particular newsgroup, a hidden node will participate in the "normal" communications of that group as well. Only occasionally will a message need to be broadcast through the hidden group, resulting in a communication graph as in Figure 1(c). The matter is complicated by the fact that the communications in Figure 1(c) will be overlayed onto the normal group communications, Figure 1(b). What we observe is a time
series of node-to-node communications, as illustrated in Figure 2, which shows the evolving communications of two hypothetical communities. The individuals are represented by nodes in the graph. An edge between two nodes represents communication during that time period. The thickness of the edge indicates the intensity of the communications. The dotted lines indicate communications between the hidden group members. The task is to take the communication history of the community (for example the one above) and to determine whether or not there exists a hidden group functioning within this community, and to identify its members. It would also be useful to identify which members belong to which groups. The hidden community may or may not be functioning as an aberrant group trying to camouflage its communications. In the above example, the hidden community is trying to camouflage its broadcasts. However, the hidden group could just as well be a new group that has suddenly arisen, and we would like to discover its existence. We assume that we know the number of observed groups (for example, the newsgroups are known), and that we have a model of how the society evolves. We do not know who belongs to which newsgroup, and all communications are aggregated into the communication graph for a given time period. We will develop a framework to determine the presence of a hidden group that does not rely on any semantic information regarding the communications. The motivation for this approach is that even if the semantics are available (which is not likely), the hidden communications will usually be encrypted and designed so as to mimic the regular communications anyway.
2 Probabilistic Setup
We will illustrate our general methodology by first developing the solution of the simplified example discussed above. The general case is similar, with only minor technical differences. The first step is to build a model for how individuals move from group to group. More specifically, let Ng be the number of observed groups in the society, and denote the groups by F_1, . . . , F_Ng. Let n be the number of individuals in the society, and denote the individuals by x_1, . . . , x_n. We denote by F(t) the micro-state of the society at time t; the micro-state represents the state of the society. In our case, F(t) is the membership matrix at time t, which is a binary n × Ng matrix that specifies who is in which group,

F_ij(t) = 1 if node x_i is in group F_j, and 0 otherwise.    (1)

The group membership may change with time. We assume that F(t) is a Markov chain; in other words, the members decide which groups to belong to at time t + 1 based solely on the group structure at time t. In determining which groups to join in the next period, the individuals may have their own preferences, thus there is some transition probability distribution

P[F(t + 1) | F(t), θ],    (2)
where θ is a set of (fixed) parameters that determine, for example, the individual preferences. This transition matrix represents what we define as the micro-laws of the society, which determine how its group structure evolves. A particular setting of the parameters θ is a particular realization of the micro-laws. We will assume that the group membership is static, which is a trivial special case of a Markov chain where the transition matrix is the identity matrix. In the general case, this need not be so, and we pick this simplified case to illustrate the mechanics of determining the hidden group without complicating it with the group dynamics. Thus, the group structure F(t) is fixed, so we will drop the t dependence. We do not observe the group structure, but rather the communications that are a result of this structure. We thus need a model for how the communications arise out of the groups. Let C(t) denote the communication graph at time t. C_ij(t) is the intensity of the communication between node x_i and node x_j at time t. C(t) is the "expression" of the micro-state F. Thus, there is some probability distribution

P[C(t) | F(t), λ],    (3)
where λ is a set of parameters governing how the group structure gets expressed in the communications. Since F(t) is a Markov chain, C(t) follows a Hidden Markov process governed by the two probability distributions P[F(t + 1)|F(t), θ] and P[C(t)|F(t), λ]. In particular, we will assume that there is some parameter 0 < λ < 1 that governs how nodes in the same group communicate. We assume that the communication intensity C_ij(t) has a Poisson distribution with parameter Kλ, where K is the number of groups that both nodes are members of. If K = 0, we set the Poisson parameter to λ² (which is smaller than λ since 0 < λ < 1); otherwise it is Kλ. Thus, nodes that are not in any groups together will tend not to communicate. The Poisson distribution is often used to model such "arrival" processes. Thus,

P[C_ij = k] = P(k; Kλ)  if x_i and x_j are in K > 0 groups together,
            = P(k; λ²)  if x_i and x_j are in no groups together,    (4)

where P(k; λ) is the Poisson probability distribution function,

P(k; λ) = e^{−λ} λ^k / k!.    (5)
We will assume that the communications between different pairs of nodes are independent of each other, as are communications at different time steps. Suppose we have a broadcast hidden group in the society as well, as illustrated in Figure 1(c). We assume a particular model for the communications within the hidden group, namely that every pair of nodes that are in the same visible group communicate. The intensity of the communications, B, is assumed to follow a Poisson distribution with parameter β, thus

P[B = k] = P(k; β).    (6)
We have thus fully specified the model for the society, and how the communications will evolve. The task is to use this model to determine, from communication history (as in Figure 2), whether or not there exists a hidden group, and if so, who the hidden group members are.
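Before turning to inference, note that the generative model just specified can be written down directly. The sketch below is a hypothetical simulator consistent with equations (4)-(6): pairwise communications are Poisson with parameter Kλ (or λ² for pairs sharing no group), and occasionally a Poisson(β) broadcast is overlaid on the edges between hidden members who share a visible group. How often broadcasts occur (p_broadcast) is not specified in the text and is an assumption.

```python
import numpy as np

def simulate_communications(F, hidden, lam, beta, T, p_broadcast=0.2, seed=0):
    """F: binary n x Ng membership matrix; hidden: boolean length-n array.
    Returns a length-T list of symmetric integer communication matrices C(t)."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    shared = F @ F.T                          # K_ij = number of common groups
    rate = np.where(shared > 0, lam * shared, lam ** 2)
    series = []
    for _ in range(T):
        C = np.triu(rng.poisson(rate), k=1)   # independent Poisson per pair
        if rng.random() < p_broadcast:        # hidden group broadcasts this step
            B = rng.poisson(beta)
            for i in range(n):
                for j in range(i + 1, n):
                    # the broadcast passes between hidden members who share a group
                    if hidden[i] and hidden[j] and shared[i, j] > 0:
                        C[i, j] += B
        series.append(C + C.T)
    return series
```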
2.1 The Maximum Likelihood Approach
For simplicity we will assume that the only unknown is F, the group structure. Thus, F is static and unknown, and λ and β are known. Let H be a binary indicator variable that is 1 if a hidden group is present, and 0 if not. Our approach is to determine how likely the observed communications would be if there is a hidden group, l1, and compare this with how likely the observed communications would be if there was no hidden group, l0. To do this, we use the model describing the communications evolution with a hidden group (resp. without a hidden group) to find what the best group structure F would be if this model were true, and compute the likelihood of the communications given this group structure and the model. Thus, we have two optimization problems,

l1 = max_{F,v} P[Data | F, v, λ, β, H = 1],    (7)
l0 = max_F P[Data | F, λ, H = 0],    (8)
where Data represents the communication history of the society, namely {C(t)} for t = 1, . . . , T, and v is a binary indicator variable that indicates who the hidden and visible members of the society are. If l1 > l0, then the communications are more likely if there is a hidden group, and we declare that there is a hidden group. As a by-product of the optimization, we obtain F and v; hence we identify not only who the hidden group members are, but also the remaining group structure of the society. In what follows, we derive the likelihood function that needs to be optimized for our example society. What remains is then to solve the two optimization problems to obtain l1, l0. The simpler case is when there is no hidden group, which we analyze first. Suppose that F is given. Let f_ij be the number of groups that nodes x_i and x_j are both members of,

f_ij = Σ_k F_ik F_jk.    (9)
Let λ_ij be the Poisson parameter for the intensity of the communication between nodes x_i and x_j,

λ_ij = λ²  if f_ij = 0;  λ_ij = λ f_ij  if f_ij > 0.    (10)

Let P(t) be the probability of obtaining the observed communications C(t) at time t. Since the communications between nodes are assumed independent, and
each is distributed according to a Poisson process with parameter λ_ij, we have that

P(t) = P[C(t) | F, λ, H = 0]    (11)
     = ∏_{i<j}^n P(C_ij(t); λ_ij).    (12)

Since the communications at different times are independent (given the group structure at that time), we have that

P[Data | F, λ, H = 0] = ∏_{t=1}^T ∏_{i<j}^n P(C_ij(t); λ_ij).    (13)
Since l0 is given by the maximum value of this function, we can equivalently maximize the logarithm. Further, the value of F that attains this maximum is the estimate of the group structure, assuming that there is no hidden group,

log l0 = max_F Σ_{t=1}^T Σ_{i<j}^n log P(C_ij(t); λ_ij),    (14)
F_0 = argmax_F Σ_{t=1}^T Σ_{i<j}^n log P(C_ij(t); λ_ij).    (15)
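Given a candidate group structure F, the objective in (14) is simply a sum of Poisson log-probabilities over node pairs and time steps. A direct, hypothetical sketch (using SciPy's Poisson log-pmf):

```python
import numpy as np
from scipy.stats import poisson

def log_l0_objective(C_series, F, lam):
    """Log-likelihood of the observed communications {C(t)} for a candidate
    binary membership matrix F with no hidden group (the objective in (14),
    before maximizing over F)."""
    shared = F @ F.T
    rate = np.where(shared > 0, lam * shared, lam ** 2)
    iu = np.triu_indices(F.shape[0], k=1)          # all pairs i < j
    return sum(poisson.logpmf(C[iu], rate[iu]).sum() for C in C_series)
```

A search over binary membership matrices F, for example the simulated annealing used later in this section, would then maximize this quantity; the hidden-group objective in (21) differs only in the additional sum over the broadcast intensity B.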
When a hidden group is present, we use an indicator variable v to identify who the hidden group members are. For illustration, we assume that there are as many hidden members as there are groups, as in Figure 1. At time t, the observed communications can be broken down into the regular communications plus a hidden group broadcast. The intensity B of the broadcast can be B = 0, 1, . . . , with B distributed according to a Poisson distribution, P[B | β] = P(B; β). This is our model of the hidden group communications. We can thus decompose the communication graph at time t into two components, a residual graph R(t) and the hidden graph H(t),

C(t) = R(t) + H(t).    (16)
The hidden graph will be of the form in Figure 1(c) and is completely specified by the broadcast intensity B. This then specifies R(t) by R(t) = C(t) − H(t). Suppose that we know F, v. In this case, P(t) is given by

P(t) = P[C(t) | F, v, λ, β, H = 1]    (17)
     = Σ_{B=0}^∞ P[R(t) | B] P[B],    (18)

where P[R(t) | B] is given by an expression exactly analogous to (12),

P[R(t) | B] = ∏_{i<j}^n P(R_ij(t; B); λ_ij),    (19)
where R(t; B) is the residual graph depending on B, and λ_ij is defined exactly analogously to (10) with f_ij = Σ_k F_ik F_jk. v places a constraint on what F can be, and serves to determine what the hidden group broadcast graph can be. Note that the sum in (18) gets truncated when B gets large enough that the residual graph has negative edges, which is impossible, since it must be a communication graph. We will denote this maximum possible value of B by B_max^t. Then, using the fact that P[B] = P(B; β), we get that

P(t) = Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij).    (20)
Taking the logarithm and summing over t, we get that

log l1 = max_{F,v} Σ_{t=1}^T log Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij),    (21)

{F_1, v_1} = argmax_{F,v} Σ_{t=1}^T log Σ_{B=0}^{B_max^t} P(B; β) ∏_{i<j}^n P(R_ij(t; B); λ_ij).    (22)
Thus, in order to obtain l0, l1, F_0, F_1, v_1, we need to solve two combinatorial optimization problems. Notice that the size of the search space is huge. When there is no hidden group, the size of the search space is 2^{n Ng}, and the evaluation of the objective function is O(T n²). When there is a hidden group, the size of the search space is 2^{(n−Ng) Ng} · n!/(n − Ng)!, and the evaluation of the objective function is O(C_max T n²), where C_max is the maximum communication intensity between any two nodes. If, in addition, the parameters of the model, namely λ and β, are also not known, then we have to optimize with respect to these parameters as well, in which case we have a mixed continuous/discrete optimization problem. Some algorithms for discrete/combinatorial optimization problems are reactive search [21,22] and randomized approaches, see for example [23]. Continuous problems are often approached using derivative-based methods such as gradient descent, conjugate gradients, Levenberg-Marquardt, etc. [24]. Mixed discrete/continuous problems have not been studied as intensely, and most methods are based upon simulated annealing [25] or genetic algorithms [26]. For illustration, we assume that the parameters are known; the purpose here is to set the framework for the problem. To illustrate, we have implemented a simulated annealing approach to the combinatorial optimization. We used 10,000 Monte Carlo steps, where at each step the current group structure F was randomly perturbed. The probability of perturbation decreased as a function of the step number.

Results. We show results on a small society (9 nodes) with 3 groups. We picked this society so that it would be computationally efficient to run many simulations. We ran simulations to test both the false positive (declaring a hidden group when there isn't one) and false negative (declaring no hidden group when there
is one) errors. For each, we generated a society group structure randomly, and then generated the communication time series. These communication time series were fed into the optimization algorithm to obtain l0, l1, F_0, F_1, v_1. If l1 > l0 we declare a hidden group to be present and identify its members in v_1 and the group structure in F_1. If not, we declare no hidden group and identify the group structure in F_0. The results are summarized in Table 1.

Table 1. Error matrices for different time periods. % correct is the percentage of nodes identified correctly (hidden or not) when a hidden group is present and is predicted correctly.

10 time steps (% correct = 84%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.73              0.19
True H = 0    0.27              0.81

20 time steps (% correct = 89%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.78              0.04
True H = 0    0.28              0.96

50 time steps (% correct = 94%):
              Predicted H = 1   Predicted H = 0
True H = 1    0.88              0.03
True H = 0    0.12              0.97
As can be seen, with just 50 time steps of data, the error rate in predicting the presence of a hidden group is lower than 0.1.

2.2 General Maximum Likelihood Formulation
In general, the group structure evolves according to the micro-law transition matrix for the Markov chain, P[F(t + 1)|F(t), θ], and the group structure gets expressed as a communication graph according to P[C(t)|F(t), λ]. In our example, P[F(t + 1)|F(t), θ] was the identity matrix, and P[C(t)|F(t), λ] was based on modeling the communications using Poisson processes. A detailed description of a general model that describes an evolving society over a communication network is given in [27]. Let N = {x_1, . . . , x_n} be the set of nodes and let H ⊂ N be the subset of nodes that forms the hidden group. We assume that H does not change with time. The hidden group may have a communication pattern governed by a different probability distribution, P[H(t)|H, β], where β is a set of parameters that governs this distribution. The group structure of the society from t = 1, . . . , T is given by the time series of matrices {F(t)}. In our example, this time series was specified by the constant matrix F. If there is no hidden group, we can compute the likelihood of observing the communication data {C(t)} as follows. The probability of obtaining the evolution F(1), F(2), . . . , F(T) is given by

P[{F(t)} | θ] = P[F(1)] ∏_{t=2}^T P[F(t) | F(t − 1), θ].    (23)
The likelihood of obtaining the observed communications given this evolution is then given by

P[{C(t)} | {F(t)}, θ, λ] = ∏_{t=1}^T P[C(t) | F(t), λ].    (24)
Ideally, we would like to compute

l0 = P[{C(t)} | θ, λ] = Σ_{{F(t)}} P[{C(t)}, {F(t)} | θ, λ]    (25)
   = Σ_{{F(t)}} P[{F(t)} | θ] P[{C(t)} | {F(t)}, θ, λ]    (26)
   = Σ_{{F(t)}} P[F(1)] P[C(1) | F(1), λ] ∏_{t=2}^T P[F(t) | F(t − 1), θ] P[C(t) | F(t), λ].    (27)
If θ, λ are known, then this summation can be computed using a Monte Carlo simulation. If not, then we find the values of θ, λ that maximize l0. In this case, the optimization is computationally costly, and an alternative is to simultaneously optimize with respect to {F(t)}, θ, λ, which is itself a non-trivial mixed discrete/continuous optimization problem. When a hidden group H is present, we decompose the communications at time t into the hidden communications H(t) and the residual communications R(t), with C(t) = R(t) + H(t). Then,

P[C(t) | {F(t)}, H, θ, λ, β] = Σ_{H(t)} P[R(t) | {F(t)}, θ, λ] P[H(t) | H, β],    (28)
where this summation is finite because both R(t) and H(t) must have nonnegative edges. Taking the product over t gives us

P[{C(t)} | {F(t)}, H, θ, λ, β] = ∏_{t=1}^T Σ_{H(t)} P[R(t) | {F(t)}, θ, λ] P[H(t) | H, β],    (29)
and finally multiplying by P[{F(t)} | θ] and summing over {F(t)}, we get that

l1 = max_H Σ_{{F(t)}} P[F(1)] ∏_{t=1}^T P[F(t + 1) | F(t), θ] P[{C(t)} | {F(t)}, H, θ, λ, β],    (30)

where P[{C(t)} | {F(t)}, H, θ, λ, β] is given in (29). The hidden group H at which the maximum is attained identifies who the hidden group members are. We assume that the Hidden Markov model and its parameters (θ, λ, β) are known. If the parameters are not known, then they have to be optimized as well. For a relatively simple hidden group communication structure, for example the broadcast hidden group as in our example, the computation of the likelihood is tractable. For more complicated examples, one may need to use heuristic approaches to these combinatorial optimization problems.
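As a rough illustration of the Monte Carlo computation of (25)-(27) mentioned above (with θ and λ known), one can sample group-structure trajectories from the Markov chain and average the conditional likelihood of the observed communications. The sampler and emission functions below are placeholders for whatever concrete micro-law and communication model are used; this is a sketch, not the authors' implementation.

```python
import math

def mc_log_l0(C_series, sample_F1, sample_transition, log_emission, n_samples=1000):
    """Monte Carlo estimate of log l0 = log P[{C(t)} | theta, lambda] in (25)-(27).
    sample_F1(): draws F(1); sample_transition(F): draws F(t+1) given F(t);
    log_emission(C, F): returns log P[C(t) | F(t), lambda]. All are placeholders."""
    log_terms = []
    for _ in range(n_samples):
        F = sample_F1()
        log_w = log_emission(C_series[0], F)
        for C in C_series[1:]:
            F = sample_transition(F)
            log_w += log_emission(C, F)
        log_terms.append(log_w)
    m = max(log_terms)                     # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in log_terms) / n_samples)
```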
3 Concluding Remarks
We have presented a framework for determining the members of a hidden group that attempts to camouflage its broadcasts within a functioning communication network. The basic idea is to first have a model for the society's evolution. Then, by examining the discrepancy between the observed and expected communications, one can draw conclusions regarding the presence or absence of a hidden group. We focussed on a specific example, where we made a number of assumptions: a static group structure; a Poisson communication model; independence between communications at different times; hidden group communications consisting only of broadcasts; and a maximum likelihood formulation. These restrictions were made primarily for expository and computational reasons, and are dropped in the general framework (resulting in more computationally intensive and complex optimization problems). Ongoing research involves developing efficient heuristic algorithms that solve the combinatorial optimization problems faced in the more general framework, as well as applying our methodology toward finding hidden groups in real societies.
References
1. Monge, P., Contractor, N.: Theories of Communication Networks. Oxford University Press (2002)
2. Carley, K., Prietula, M., eds.: Computational Organization Theory. Lawrence Erlbaum Associates, Hillsdale, NJ (2001)
3. Sanil, A., Banks, D., Carley, K.: Models for evolving fixed node networks: Model fitting and model testing. Journal of Mathematical Sociology 21 (1996) 173–196
4. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (1989) 257–286
5. Rabiner, L.R., Juang, B.H.: An introduction to hidden Markov models. IEEE ASSP Magazine (1986) 4–15
6. Georgeff, M.P., Wallace, C.S.: A general selection criterion for inductive inference. European Conference on Artificial Intelligence (ECAI84) (1984) 473–482
7. Bunke, H., Caelli, T., eds.: Hidden Markov Models. Series in Machine Perception and Artificial Intelligence, Vol. 45. World Scientific (2001)
8. Edgoose, T., Allison, L.: MML Markov classification of sequential data. Stats. and Comp. 9 (1999) 269–278
9. Allison, L., Wallace, C.S., Yee, C.N.: Finite-state models in the alignment of macro-molecules. J. Molec. Evol. 35 (1992) 77–89
10. Allison, L., Wallace, C.S., Yee, C.N.: Normalization of affine gap costs used in optimal sequence alignment. J. Theor. Biol. 161 (1993) 263–269
11. Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins. Journal of Molecular Biology 301 (2000) 173–90
12. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281 (1998) 565–77
13. Bystroff, C., Shao, Y.: Fully automated ab initio protein structure prediction using I-sites, HMMSTR and ROSETTA. Bioinformatics 18 (2002) S54–S61
14. Baldi, P., Brunak, S.: Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA (1998)
15. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge, New York (2001)
16. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press (1994)
17. Watts, D.J.: Small Worlds: The dynamics of networks between order and randomness. Princeton University Press, Princeton, NJ (1999)
18. Butler, B.: The dynamics of cyberspace: Examining and modelling online social structure. Technical report, Carnegie Mellon University, Pittsburgh, PA (1999)
19. Carley, K., Wallace, A.: Computational organization theory: A new perspective. In Gass, S., Harris, C., eds.: Encyclopedia of Operations Research and Management Science. Kluwer Academic Publishers, Norwell, MA (2001)
20. Snijders, T.: The statistical evaluation of social network dynamics. In Sobel, M., Becker, M., eds.: Sociological Methodology. Basil Blackwell, Boston & London (2001) 361–395
21. Battiti, R.: Reactive search: Toward self-tuning heuristics. Modern Heuristic Search Methods, Chapter 4 (1996) 61–83
22. Battiti, R., Protasi, M.: Reactive local search for the maximum clique problem. Technical Report TR-95-052, Berkeley, ICSI, 1947 Center St. Suite 600 (1995)
23. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge, UK (2000)
24. Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
25. Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley & Sons Ltd., New York (1989)
26. Stelmack, M., N., N., Batill, S.: Genetic algorithms for mixed discrete/continuous optimization in multidisciplinary design. In: AIAA Paper 98-4771, AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, Missouri (1998)
27. Siebecker, D., Goldberg, M., Magdon-Ismail, M., Wallace, W.: A Hidden Markov Model for describing the statistical evolution of social groups over communication networks. Technical report, Rensselaer Polytechnic Institute (2003) Forthcoming.
Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department

Kar Wing Li and Christopher C. Yang

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
{kwli, yang}@se.cuhk.edu.hk
Abstract. The tragic event of September 11 has prompted rapid growth in attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information, written in different languages and stored in different locations, may be seemingly unconnected. Therefore, cross-lingual semantic interoperability is a major challenge in generating an overview of this disparate data and information so that it can be analysed and searched. Traditional information retrieval (IR) approaches normally require a document to share some keywords with the query. In reality, the users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents, even though they contain different terms. Apart from this, terrorists and criminals may communicate through letters, e-mails and faxes in languages other than English. Translation ambiguity significantly exacerbates the retrieval problem. To facilitate cross-lingual information retrieval, a corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model to cross the language boundary. However, collecting parallel corpora between European and Oriental languages is not an easy task, due to the unique linguistic and grammatical structures of Oriental languages. In this paper, a text-based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. This article then reports an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consists of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval.
and criminal analysis. However, Osama bin Laden's al Qaeda terrorists are not the only threat. We also need to effectively predict and prevent other criminal activities. These include religious, racist and fascist terrorists, opportunistic crime, organized crime (narcocriminal, Mafia, Russian mob, Triads, etc.), political espionage and sabotage, anarchists and vandals. An intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data to known patterns to detect important anomalies. For example, historical cases of tax fraud can disclose patterns of taxpayers' behaviors and provide indicators for potential fraud. Customers' credit card data can reveal patterns of transactions and help to detect credit card theft. The system should also allow the user to retrieve what persons, organizations, projects, and topics are relevant to a particular event of interest, e.g. the car bombing in Bali. However, information stored in the repositories is often fragmented and unstructured, especially in on-line catalogs. Also, the man-made fog of deliberate deception militates against normal pattern learning from databases and causes much crucial information, and the knowledge underlying it, to be buried. Therefore this information has become inaccessible. Developing systems that can retrieve relevant information has long been the goal of many researchers, since important domain knowledge or information resides in the databases. Many information retrieval systems have been created in the past for medical diagnosis and business applications. The major difficulties in retrieving relevant information are the lack of explicit semantic clustering of relevant information and the limits of conventional keyword-driven search techniques (either full-text or index-based) [2]. The traditional approaches normally require a document to share some keywords with the query. In reality, it is known that users may use keywords that are different from those used in the documents. There are then two different term spaces, one for the users and another for the documents. How to create relationships for the related terms between the two spaces is an important issue. The problem can be viewed as the creation of a thesaurus. The creation of such relationships would allow the system to match queries with relevant documents, even though they contain different terms. Language boundaries are another problem for criminal analysis. In criminal analysis, we need to find out how to frame questions, or create search patterns, that would help an analyst. If the right questions are not posed, the analyst may head down a path with no conclusions. In addition, terrorists and criminals may communicate openly and less openly through letters, e-mails, faxes, bulletin boards, etc. in languages other than English. The translation ambiguity significantly exacerbates the retrieval problem. Use of every possible translation for a single term can greatly expand the set of possible meanings, because some of those translations are likely to introduce additional homonymous or polysemous word senses in the second language. Also, users can have different abilities in different languages, affecting their ability to form queries and refine results. The human expertise needed to decompose an information need into queries may take a person several years to acquire.
However, knowledge-based systems aim to capture human expertise or knowledge by means of computational models. Knowledge acquisition was defined by Buchanan [10] as “the transfer and transformation of potential problem-solving expertise from some knowledge source to a program”. The approach to knowledge elicitation is referred to as “knowledge mining” or
"knowledge discovery in databases" [2]. The "knowledge discovery" approach is believed by many Artificial Intelligence experts and database researchers to be useful for resolving the information overload and knowledge acquisition bottleneck problems. In this research, our aim is to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the English/Chinese daily press release documents issued by the Hong Kong Police Department. The research output consists of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval. Before generating the thesaurus-like, semantic network knowledge base, we first propose a text-based approach to collect the parallel press release documents from the Web.
2 Automatic Construction of Parallel Corpus

Cross-lingual semantic interoperability has drawn significant attention in recent criminal analysis, as the amount of information on criminal activities written in languages other than English has grown exponentially. Since it is impractical to construct bilingual dictionaries or sophisticated multilingual thesauri manually for large applications, the corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model for cross-lingual information retrieval. Many corpora are domain-specific. To deal with criminal analysis, we use the English/Chinese daily press release articles issued by the Hong Kong SAR Police Department. Bates [1] stressed the importance of building domain-specific lexicons for retrieval purposes, since a domain-specific, controlled list of keywords can help identify legitimate search vocabularies and help searchers "dock" onto the retrieval system. For most domain-specific databases, there appear to be lists of subject descriptors (e.g., the subject indexes at the back of a textbook), people's names (e.g., author indexes), and other domain-specific objects (e.g., organizational names, procedures, location names, etc.). These domain-specific keywords can be used to identify important concepts in documents. In the criminal analysis world, this information can help the analyst identify which people belong to which groups or organizations, what methods they use to conduct criminal activities, and where. In addition, the online bilingual newswire articles used in this experiment are dynamic. They provide a continuous, large amount of information for relieving the lag between new information and the information incorporated into a reference work. To continuously collect English/Chinese daily Police press release articles from the data stream, we investigate a text-based approach to align English/Chinese parallel documents from the Web. A parallel corpus can be generated using overt translation or covert translation. Overt translation [20] possesses a directional relationship between the pair of texts in two languages, which means a text in language A (source text) is translated into a text in language B (translated text) [25]. Covert translation [13] is non-directional, e.g. press releases from the government, or commentaries on a sports event broadcast live in several languages by a broadcasting organization. There are two major approaches to document alignment, namely length-based and text-based alignment. The length-based approach makes use of the total number of characters or
words in a sentence, while text-based approaches use linguistic information in the sentence alignment [9]. Many parallel text alignment techniques have been developed in the past. These techniques attempt to map various textual units to their translations and have been proven useful for a wide range of applications and tools, e.g. cross-lingual information retrieval [18], bilingual lexicography, automatic translation verification, and the automatic acquisition of knowledge about translation [22]. Translation alignment techniques have been used in automatic corpus construction to align two documents [16]. There are three major structures of parallel documents on the World Wide Web: the parent page structure, the sibling page structure, and the monolingual sub-tree structure [24]. Resnik [19] noticed that the parent page of a Web page may contain links to different versions of the Web page. The sibling page structure refers to the cases where the page in one language contains a link directly to the translated page in the other language. The third structure contains a completely separate monolingual sub-tree for each language, with only the single top-level Web page pointing off to the root page of the single-language version of the site. Parallel corpora generated by overt translation usually use the parent page structure and the sibling page structure. However, parallel corpora generated by covert translation use the monolingual sub-tree structure. Each sub-tree is generated independently [24]. The press release issued by the HKSAR Police Department is an example.
Fig. 1. Organization of Hong Kong SAR Police Department's press release articles in the Hong Kong SAR Police Department Web site. (Two independent monolingual sub-trees: the Chinese and English department home pages each link to their own press news archives, dated pages such as 1/1/1999, and individual articles, e.g. Article 0001 through Article 0019; corresponding Chinese and English articles form the parallel articles.)
2.1 Title Alignment

The titles of two texts can be treated as representations of those texts. According to He [11], titles present "micro-summaries of texts" that contain "the most important focal information in the whole representation" and are "the most concise statement of the content of a document". In other words, titles function as condensed summaries of the information and content of the articles. In our proposed text-based approach, the longest common subsequence is utilized to optimize the alignment of English and Chinese titles [24]. Our alignment algorithm has three major steps: 1) alignment at the word and character level, 2) redundancy reduction, and 3) a score function.

An English title, E, is formed by a sequence of English simple words, i.e., E = e_1 e_2 e_3 ... e_i ..., where e_i is the ith English word in E. A Chinese title, C, is formed by a sequence of Chinese characters, i.e., C = char_1 char_2 char_3 ... char_q ..., where char_q is a Chinese character in C. An English word e_i in E can be translated into a set of possible Chinese translations, Translated(e_i), by dictionary lookup: Translated(e_i) = {T_{e_i}^1, T_{e_i}^2, T_{e_i}^3, ..., T_{e_i}^j, ...}, where T_{e_i}^j is the jth Chinese translation of e_i. Each Chinese translation is formed by a sequence of Chinese characters. The set of longest common subsequences (LCS) of a Chinese translation T_{e_i}^j and C is LCS(T_{e_i}^j, C). MatchList(e_i) is a set that holds all the unique longest common subsequences of T_{e_i}^j and C over all Chinese translations of e_i. We adopt the hypothesis that if the characters of a Chinese translation of an English word appear adjacently in the Chinese sentence, that translation is more reliable than translations whose characters do not appear adjacently. Contiguous(e_i) determines the most reliable translations based on adjacency. The second criterion for the most reliable Chinese translation is its length: Reliable(e_i) identifies the longest sequence in Contiguous(e_i).

Due to redundancy, the translations of an English word may be repeated completely or partially in Chinese. To deal with redundancy, Dele(x, y) is an edit operation that removes LCS(x, y) from x. WaitList is a list that saves all the sequences obtained by removing the overlap between the elements of MatchList(e_i) and Reliable(e_i). MatchList(e_i) is initialized to ∅ and Reliable(e_i) is initialized to ε. Remain is a sequence initialized as C; the sequences Reliable(e_i) are removed from Remain starting from e_1 up to the last English word, and WaitList is updated for each e_i. When all Reliable(e_i) have been removed from Remain, the elements in WaitList are also removed from Remain in order to eliminate the redundancy. Given E and C, the matching ratio is determined by the portion of C that matches the reliable translations of the English words in E. Given an English title, the Chinese title with the highest Matching_Ratio among all Chinese titles is considered its counterpart. However, it is possible that more than one Chinese title has the highest Matching_Ratio; in such cases, we also consider the ratio determined by the portion of the English title for which a reliable translation can be identified in the Chinese title.
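To make the matching procedure concrete, the following sketch implements the core of the LCS-based comparison under stated assumptions: the bilingual dictionary `en_to_zh` is a hypothetical placeholder, the redundancy handling via WaitList is omitted, and the ratio is computed directly on the remaining characters. It illustrates the idea rather than reproducing the authors' implementation.

```python
from itertools import product

def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two character strings (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i, j in product(range(m), range(n)):
        if a[i] == b[j]:
            dp[i + 1][j + 1] = dp[i][j] + a[i]
        else:
            dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def reliable_translation(word: str, zh_title: str, en_to_zh: dict) -> str:
    """Pick the longest candidate translation whose characters appear contiguously in zh_title."""
    candidates = [lcs(t, zh_title) for t in en_to_zh.get(word, [])]
    contiguous = [c for c in candidates if c and c in zh_title]   # adjacency test
    return max(contiguous, key=len) if contiguous else ""

def matching_ratio(en_title: str, zh_title: str, en_to_zh: dict) -> float:
    """Fraction of the Chinese title covered by reliable translations of the English words."""
    remain = zh_title
    for word in en_title.lower().split():
        r = reliable_translation(word, remain, en_to_zh)
        if r:
            remain = remain.replace(r, "", 1)   # crude stand-in for the Dele(x, y) operation
    return 1.0 - len(remain) / len(zh_title) if zh_title else 0.0
```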
2.2 Experiment

An experiment was conducted to measure the precision and recall of the aligned parallel Chinese/English documents from the HKSAR Police press releases using the text-based approach described in Section 2.1; the results are shown in Table 1. The Hong Kong SAR Police press releases are produced by covert translation. From 1 January 2001 to 31 October 2002, there were 2,698 press articles in Chinese and 2,695 press articles in English, but only 2,664 pairs of Chinese/English parallel articles. The experimental results show that the proposed text-based title alignment approach can effectively align the Chinese and English titles.

Table 1. Experimental results

                                Precision   Recall
Proposed text-based approach    1.00        1.00
3 A Corpus-Based Approach: Automatic Cross-Lingual Concept Space Generation

The semantic network knowledge base approach to automatic thesaurus generation is also referred to as a concept space approach [4], because a meaningful and understandable concept space (a network of terms and weighted associations) can represent the concepts (terms) and their associations for the underlying information space (i.e., the documents in the database). In terms of criminal analysis, recent terrorist events have demonstrated that terrorist and other criminal activities are connected, in particular terrorism, money laundering, drug smuggling, illegal arms trading, and illegal biological and chemical weapons smuggling. In addition, hacker activities may be connected to these other criminal activities. Information in the concept space can be split into concepts and links. Concepts include real people, aliases, groups, organizations, companies (including banks and shell companies), countries, towns, regions, religious groups, families, attacks (hacker, terrorist), etc. The associated concepts in the concept space can provide links about persons who generally remain hidden, unknown, and use aliases, who in turn belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate openly and less openly through bulletin boards, e-mail, phone calls, letters, word of mouth, etc. – encrypted or not. This helps the analyst detect important anomalies.

The cross-lingual concept space clustering model was originally suggested by Lin and Chen [15] and is based on the Hopfield network. The cross-lingual concept space includes the concepts themselves, their translations, and their associated concepts. The automatic Chinese-English concept space generation system consists of four components: 1) English phrase extraction; 2) Chinese phrase extraction; 3) the Hopfield network; and 4) the parallel Chinese/English Police press release corpus. The Chinese and English phrase extraction identifies important conceptual phrases in the corpora. The Hopfield network generates the cross-lingual concept space with the important Chinese and English conceptual phrases as input. A press release parallel corpus was dynamically collected from the Hong Kong Police website in order to capture the relationships between Chinese terms and English terms.
3.1 Automatic English Phrase Extraction

Automatic phrase extraction is a fundamental and important phase in concept space clustering. The clustering result will be degraded significantly if the quality of term extraction is low. Salton [21] presents a blueprint for automatic indexing, which typically includes stop-wording and term-phrase formation. A stop-word list is used to remove non-semantic-bearing words such as the, a, on, in, etc. After removing the stop words, term-phrase formation, which builds phrases by combining only adjacent words, is performed [4].

3.2 Chinese Phrase Extraction

Unlike English, Chinese has no natural delimiters to mark word boundaries. In our previous work, we developed boundary detection [23] and heuristic techniques to segment Chinese sentences based on mutual information and significance estimation [5]. The accuracy is over 90%.

3.2.1 Automatic Phrase Selection

To generate the concept space, the relevance weights between the English and Chinese term phrases are first computed in order to select significant concepts from the collection.

    d_ij = tf_ij × log((N / df_j) × w_j)                                         (1)
Equation 1 shows how the combined weight of term j in document i is calculated. tf_ij is the occurrence frequency of term j in document i, N is the total number of documents in the collection, df_j is the number of documents containing term j, and w_j is the length of term j. For an English term, the length is the number of words; for a Chinese term, it is the number of characters. The weight is directly proportional to the occurrence frequency of the term, because a term that appears many times in a document tends to carry an important idea. On the other hand, it is inversely proportional to the number of documents containing the term, because a term that appears in many documents may be too general. For example, "Hong Kong" frequently appears in the collection of documents from the HKSAR Police; it is a common term in the collection and does not carry specific meaning in any particular document. The length of a term also plays an important role in the weight, since a longer term usually carries more specific meaning; for example, names of places and organizations often consist of multiple words (for English) or characters (for Chinese). Terms that significantly represent a document are selected for clustering. Based on the combined weights calculated using Equation 1, a number of terms with the largest combined weights in each document are selected for clustering. This number depends on the average length of documents in the collection: the longer the average length, the more terms are selected. Terms with common meanings that are not representative are filtered out.
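As an illustration of Equation 1, the sketch below computes the combined weights for one document and keeps the top-weighted terms. The corpus representation (term lists and a document-frequency table) and the top-k rule are our assumptions, not part of the original system.

```python
import math
from collections import Counter

def combined_weights(doc_terms, collection_df, n_docs):
    """Equation 1: d_ij = tf_ij * log((N / df_j) * w_j) for one document.

    doc_terms     : list of term phrases extracted from document i
    collection_df : dict mapping term -> document frequency in the collection
    n_docs        : N, total number of documents in the collection
    """
    tf = Counter(doc_terms)
    weights = {}
    for term, freq in tf.items():
        # English phrase length = word count; Chinese phrase length = character count
        w = len(term.split()) if term.isascii() else len(term)
        df = collection_df.get(term, 1)
        weights[term] = freq * math.log((n_docs / df) * w)
    return weights

def select_terms(weights, k):
    """Keep the k terms with the largest combined weights (k tied to average document length)."""
    return [t for t, _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]]
```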
3.2.2 Co-occurrence Weight

After calculating d_ij, an asymmetric co-occurrence function [2] is used to evaluate the relevance weights among concepts. For a pair of relevant terms A and B, the weight of the link from term A to term B and that of the link from term B to term A are different. This asymmetry reflects how humans naturally associate terms. For example, "Ford" and "car" are relevant: when a person thinks of "Ford", he can think of "car"; however, when a person thinks of "car", he may not think of "Ford". This example shows that the association between two terms is not symmetric. Therefore, we adopt the co-occurrence weight to calculate the relevance weights.

    d_ijk = tf_ijk × log((N / df_jk) × w_j)                                      (2)
The co-occurrence weight d_ijk in Equation 2 is the weight between term j and term k when both appear in document i. tf_ijk is the minimum of the occurrence frequency of term j and that of term k in document i. The weight is zero if either term j or term k does not appear in the document. The calculation is similar to that in Equation 1; the co-occurrence weight is therefore a measure of the combined weight between term j and term k.

    Weight(T_j, T_k) = ( Σ_{i=1}^{n} d_ijk / Σ_{i=1}^{n} d_ij ) × WeightingFactor(T_k)     (3)

    Weight(T_k, T_j) = ( Σ_{i=1}^{n} d_ikj / Σ_{i=1}^{n} d_ik ) × WeightingFactor(T_j)     (4)
Equation 3 shows the relevance weight from term j to term k, and Equation 4 shows the relevance weight from term k to term j. The relevance weight measures the association between two terms in the collection. The combined weights and co-occurrence weights of terms in all documents are summed to derive the global association between terms in the collection.

    WeightingFactor(T_j) = log(N / df_j) / log N                                 (5)

    WeightingFactor(T_k) = log(N / df_k) / log N                                 (6)
Equation 5 shows the weighting factor of term j, and Equation 6 shows the weighting factor of term k. The weighting factor is used to penalize general terms, which always affect the result of clustering: many terms are associated with a general term, so if a general term is activated during clustering, the other terms associated with it will also be activated. The resulting concept space becomes large and its precision will unavoidably be low. The weighting factor is a value between 0 and 1 and carries the idea of inverse document frequency: the more documents contain a concept, the smaller its weighting factor.
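The following sketch shows how Equations 2-6 fit together for a single ordered pair of terms. The document representation (per-document term-frequency dictionaries and precomputed document frequencies) is assumed for illustration.

```python
import math

def cooccurrence_weight(tf_i, term_j, term_k, df_jk, n_docs, w_j):
    """Equation 2: d_ijk = tf_ijk * log((N / df_jk) * w_j), with tf_ijk = min(tf_ij, tf_ik)."""
    if term_j not in tf_i or term_k not in tf_i:
        return 0.0
    tf_ijk = min(tf_i[term_j], tf_i[term_k])
    return tf_ijk * math.log((n_docs / df_jk) * w_j)

def weighting_factor(df, n_docs):
    """Equations 5-6: log(N / df) / log N, penalizing terms that occur in many documents."""
    return math.log(n_docs / df) / math.log(n_docs)

def relevance_weight(docs_tf, term_j, term_k, df, df_jk, term_len, n_docs):
    """Equation 3: Weight(T_j, T_k) = (sum_i d_ijk / sum_i d_ij) * WeightingFactor(T_k)."""
    num = sum(cooccurrence_weight(tf_i, term_j, term_k, df_jk, n_docs, term_len[term_j])
              for tf_i in docs_tf)
    den = sum(tf_i.get(term_j, 0) * math.log((n_docs / df[term_j]) * term_len[term_j])
              for tf_i in docs_tf)
    return (num / den) * weighting_factor(df[term_k], n_docs) if den else 0.0
```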
3.2.3 The Hopfield Network Algorithm

Given the relevance weights between the extracted Chinese and English term phrases in the parallel corpus, we employ the Hopfield network to generate the concept space. The Hopfield network models the associative network and transforms a noisy pattern into a stable state representation. When a searcher starts with an English term phrase, the Hopfield network spreading activation process identifies other relevant English term phrases and gradually converges towards heavily linked Chinese term phrases through association (or vice versa). Each term is represented by a node in the network. The algorithm is shown below:

    u_j(t+1) = f_s[ Σ_{i=0}^{n-1} t_ij u_i(t) ],   0 ≤ j ≤ n − 1                  (7)
where u_j(t+1) denotes the value of node j in iteration t+1, n is the total number of nodes in the network, and t_ij denotes the relevance weight from node i to node j.

    f_s(x) = 1 / (1 + exp(−(x − θ_j) / θ_0))                                      (8)

Equation 8 is the continuous sigmoid transformation function, which normalizes any given value to a value between 0 and 1 [4].
    Σ_{j=0}^{n-1} [ u_j(t+1) − u_j(t) ]² ≤ ε                                      (9)
where ε is the maximal allowable difference between two iterations; the left-hand side of Equation 9 measures the total change in node values from iteration t to t+1. After several iterations, more nodes become activated, and the nodes with strong connections to the target node are those with high values. The total change in node values is evaluated at the end of each iteration; when the change is smaller than the threshold ε, the Hopfield network has converged and the iteration process stops. Once the network has converged, the final output represents the set of terms relevant to the starting term. In our system the following values were used: θ_j = 0.1, θ_0 = 0.01, ε = 1.
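A minimal sketch of the spreading activation procedure of Equations 7-9 is given below, assuming the relevance weights are already available as a matrix. Clamping the query terms to 1 during the iterations is our assumption, as the paper does not spell out this detail; the default thresholds are the values quoted above.

```python
import numpy as np

def hopfield_activate(t, seeds, theta_j=0.1, theta_0=0.01, eps=1.0, max_iter=100):
    """Spreading activation over the concept network (Equations 7-9).

    t     : n x n matrix of relevance weights, t[i, j] = weight from node i to node j
    seeds : indices of the starting term(s), activated with value 1.0
    Returns the final activation value of every node.
    """
    n = t.shape[0]
    u = np.zeros(n)
    u[list(seeds)] = 1.0
    for _ in range(max_iter):
        net = t.T @ u                                                # sum_i t_ij * u_i(t) per node j
        u_next = 1.0 / (1.0 + np.exp(-(net - theta_j) / theta_0))    # sigmoid transform, Eq. 8
        u_next[list(seeds)] = 1.0                                    # keep query terms clamped (assumption)
        if np.sum((u_next - u) ** 2) <= eps:                         # convergence test, Eq. 9
            return u_next
        u = u_next
    return u
```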
4 Concept Space Evaluation

Ten students of the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, were invited to examine the performance of the concept space. The concept space is a robust, domain-specific Hong Kong Police press release thesaurus that contains 9222 Chinese/English concepts. The thesaurus includes many social, political, and legislative terms, abbreviations, and names of government departments and agencies. Each concept in the thesaurus may be associated with up to 46 concepts. It is generated from 2548 parallel Hong Kong Police press release article pairs. The goal of this experiment is to determine whether the concept space captures meaningful conceptual associations between concepts; such associations form the basis for the decisions and inferences users make when searching criminal information about Hong Kong.
4.1 Experimental Design

Among these 10 graduate students, 5 subjects are from Hong Kong and the other 5 are from Mainland China. All of them have been living in Hong Kong for more than one year. They used their knowledge and experience of both the Hong Kong SAR Police system and the living environment in Hong Kong to evaluate the concept space. Fifty of the 9222 concepts were randomly selected as test descriptors: 25 of these test descriptors are English concepts and the other 25 are Chinese concepts. Each test descriptor, together with its associated concepts, was presented to the 10 subjects. A small portion of noise terms (about 10% of the total number of associated concepts for each test descriptor) was added to reduce the subjects' bias toward the concept space.

The experiment is divided into two phases: a recall phase and a recognition phase. In the recall phase, each subject (Hong Kong graduate students and graduate students from Mainland China) was asked to generate as many related terms as possible in response to each test descriptor presented. In the recognition phase, the subjects had to judge each associated concept as either "irrelevant" or "relevant" to the test descriptor; terms considered too general were to be ranked as "irrelevant". This phase tested the subjects' ability to recognize relevant terms. If the subjects felt that the definition of a concept needed clarification, or wished to add comments on a concept, they were asked to write them on a piece of paper. After the experiment, we found that the subjects spent more time on the recognition phase than on the recall phase. This confirms the statement made by Chen et al. [3] that human beings are more likely to recognize than to recall.

Apart from the 10 students, the 50 concepts in the concept space were also carefully evaluated by two experimenters, with no noise terms added in this case. One of them is a graduate student of the Department of Systems Engineering and Engineering Management; the other is a graduate student of the Department of Translation. Both have been living in Hong Kong for more than 10 years and have done research on Chinese-to-English and English-to-Chinese translation for more than two years. Since there is no tailored bilingual thesaurus for Hong Kong government press release articles, the experimental result provided by these two senior subjects is treated as a benchmark, or human-verified thesaurus, for comparison with the result provided by the 10 subjects. The additional associated concepts provided by the 10 subjects in the recall phase were examined by the two senior judges before being treated as relevant terms.

4.2 Experimental Result

We adopted concept recall and concept precision for evaluation, based on the following equations:

    Concept Recall = Number of Retrieved Relevant Concepts / Number of Total Relevant Concepts       (10)

    Concept Precision = Number of Retrieved Relevant Concepts / Number of Total Retrieved Concepts   (11)

The number of retrieved relevant concepts is the number of concepts in the concept space judged as "relevant". The number of total relevant concepts
includes the concepts in the concept space judged as "relevant" plus the additional relevant concepts provided by the subjects. The number of total retrieved concepts is the number of concepts suggested by the concept space and the human-verified thesaurus.

4.3 Evaluation Provided by the 10 Graduate Subjects

The 10 graduate students provided 12 to 73 new associated concepts during the experiment; the analysis is listed in Table 2. It is interesting to note that all the Hong Kong graduate subjects have been living in Hong Kong for at least six years, whereas the graduate subjects from Mainland China have been living in Hong Kong for around one year. The Hong Kong graduate subjects are therefore more familiar with the Hong Kong Police system, and they added more new concepts to the concept space. In addition, the Hong Kong graduate students added more English concepts to the concept space than the graduate students from Mainland China did. This confirms that, even though the first language of all these graduate students is Chinese, the working language of the Hong Kong graduate students is English.

Table 2. The statistics of new associated concepts added by the 10 graduate students
Table 3. Precision and recall

                        Precision   Recall
10 graduate students    0.835       0.795
2 experimenters         0.86        0.83

Table 4. The new concepts added by the 10 graduate students

                        Chinese concepts added   English concepts added
10 graduate students    222                      220
Hong Kong is a bilingual community. Even though the Police concept space contains many technical, political, and geographical English vocabulary items, the Hong Kong graduate students frequently encounter these terms in their daily lives. As a result, they naturally added more English terms to the concept space. A similar observation has been made in the Welsh and English community [7]. Also, even though Chinese technical terms do exist, they may not be in common use. The Hong Kong graduates may therefore have a limited Chinese technical vocabulary
even though Chinese is their first language, and they use English terms when necessary. As a result, the Hong Kong graduate subjects judged more English concepts to be relevant and added more English terms to the concept space. On the other hand, the graduate students from Mainland China have a higher degree of Chinese fluency than the Hong Kong graduate students, and they know more of the Chinese translations of English technical vocabulary used in Mainland China. This led them to add more Chinese concepts.

We also observed that some associated concepts were judged as irrelevant because they do not show a clear association with their test descriptor. For example, one of the associated concepts for the Chinese test descriptor for "smuggling" is "Mr Mark Steeple", because the Chief Inspector of the Anti-smuggling Task Force in Hong Kong is Mr Mark Steeple. Another associated concept is "Mirs Bay", because of the recent trend of smuggling by small craft in the Mirs Bay area. However, the graduate students had no prior knowledge of these facts and judged the concepts as irrelevant. Since the corpus is a dynamic resource, it is not surprising that the students lacked such prior knowledge; for a criminal analyst, however, this information is important for identifying the recent trend of smuggling by small craft in the Mirs Bay area. In addition, one of the associated concepts for "Golden Bauhinia Square" is the Chinese term for "Police". We know that the flag-raising ceremony begins promptly at 8 a.m. with the Flag Raising Parade at the twin flagpoles at Golden Bauhinia Square, and that the flag party, provided by the Hong Kong Police Force, comprises a Senior Inspector of Police and four flag raisers. Without knowing this, the subjects only read the concept space and judged that there is no clear association between "Police" and "Golden Bauhinia Square". This phenomenon shows that the clustering process using the Hopfield network induces relevant concepts based on the contents of the documents.

Apart from this, a lexical item (word) in a sentence may be a concept in one language [12], where a concept is a recognizable unit of meaning in any given language [11]. A concept represented by a word in one language may be translated into a word, two words, a phrase, or even a sentence in another language [11]. A concept in one language can be a broader concept encompassing several narrower concepts, and the translation of such a concept may result in an altered concept in another language; conversely, a narrower concept in one language may be translated as a broader concept in another language. Such a relationship is known as a generic-specific relationship [12]. For example, the word "China" may be narrowed to the specific word "Beijing", a city in China. Omission, addition, and deviation are also common phenomena; "Closure", for example, is translated by the dictionary into one Chinese term, but in some cases it corresponds to a different Chinese expression meaning "stop service" (deviation). Conceptual alternation may therefore occur in translation, and this also caused the judges to rate some associated concepts as irrelevant. Nida [11] explains that conceptual alteration has three major causes: 1) no two languages are completely isomorphic; 2) different languages may have different domain vocabularies; and 3) some languages are more rhetorical than others.
Courtial and Pomian [6] argued that searches performed in the realms of science and technology frequently involve associations of concepts that lie outside the traditional associations represented in thesauri. Associative networks gleaned through textual analysis, they argued, facilitate innovation by making apparent associations that would otherwise be impossible for humans to find on their own. In early research,
Lesk [14] found little overlap between term relationships generated through term associations and those presented in existing thesauri. This kind of term relationship is especially important for criminal analysis: the associated concepts in the concept space can provide links about persons who generally remain hidden, unknown, and use aliases, who in turn belong to various groups and organizations, use banks, vehicles, and phones, meet in various locations, conduct both criminal and non-criminal activities, and communicate through bulletin boards, e-mail, phone calls, letters, word of mouth, etc. – encrypted or not. Ekmekcioglu, Robertson, and Willett [8] tested retrieval performance for 110 queries on a database of 26,280 bibliographic records using four approaches. Their results suggested that performance may be greatly improved if a searcher can select and use the terms suggested by a co-occurrence thesaurus in addition to the terms he has generated [4].

4.4 Translation Ability of the Concept Space

The 46683 associated concepts were also examined. For those test descriptors associated with two relevant associated concepts, 47.64% of the associated concepts are Chinese concepts and 52.36% are English concepts. Among the 9222 test descriptors, 87.7% obtain their translations from the associated concepts. This shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
5 Conclusion

The tragic events of September 11 have prompted rapidly growing attention to national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information is written in different languages, stored in different locations, and may be seemingly unconnected. Cross-lingual semantic interoperability is therefore a major challenge in generating an overview of this disparate data and information so that it can be analyzed, shared, and searched. To effectively predict and prevent criminal activities, an intelligent system is required to retrieve relevant information from criminal records and suspect communications. The system should continuously collect information from relevant data streams and compare incoming data with known patterns to detect important anomalies. However, information retrieval (IR) systems present two main interface challenges: first, how to permit a user to input a query in a natural and intuitive way, and second, how to enable the user to interpret the returned results. A component of the latter encompasses ways to permit a user to comment and provide feedback on results and to iteratively improve and refine them. The vocabulary difference problem has been widely recognized: users tend to use different terms for the same information sought. Moreover, in criminal analysis the man-made fog of deliberate deception works against normal pattern learning from databases and causes much crucial information and underlying knowledge to be buried. As a result, an exact match between the user's terms and those of the indexer is unlikely, and an advanced tool is required to understand the user's needs. Cross-lingual information retrieval brings an added complexity to the standard
IR task: users can have different abilities in different languages, affecting their ability to form queries and interpret results. This highlights the importance of automated assistance for refining a query in cross-lingual information retrieval. This article has presented a bilingual concept space approach using the Hopfield network to relieve the vocabulary problem in national security information sharing, using the Hong Kong Police press release bilingual pairs as an example. The concept space allows the user to interactively refine a search by selecting concepts that have been automatically generated and presented, and to descend to the level of actual objects in a collection at any time. Some information may seem unconnected but can actually help the analyst identify important anomalies, such as traffic accidents that frequently happen at a particular location. Since the press release collection is dynamically generated, the subjects may not have full prior knowledge of it. Nevertheless, the experimental results show that the precision and recall of the bilingual concept space are over 78% in all cases. Among the 9222 test descriptors, 87.7% obtain their translations from the associated concepts, which shows that the concept space generated through the Hopfield network can effectively recognize the translations of a concept in a parallel corpus.
References

1. Bates, M. J., "Subject access in online catalogs: A design model", Journal of the American Society for Information Science, 37, 357–376 (1986)
2. Chen, H., Lynch, K. J., "Automatic construction of networks of concepts characterizing document databases", IEEE Transactions on Systems, Man and Cybernetics, 22(5), 885–902 (1992)
3. Chen, H., Schatz, B., Ng, T., Martinez, J., Kirchhoff, A., Lin, C., "A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 771–782 (1996)
4. Chen, H., Ng, T., Martinez, J., Schatz, B., "A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System", Journal of the American Society for Information Science, 48(1), 17–31 (1997)
5. Chien, L. F., "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", Proceedings of ACM SIGIR, pp. 50–58, Philadelphia, PA (1997)
6. Courtial, J. P., Pomian, J., "A system based on associational logic for the interrogation of databases", Journal of Information Science, 13, 91–97 (1987)
7. Cunliffe, D., Jones, H., Jarvis, M., Egan, K., Huws, R., Munro, S., "Information Architecture for Bilingual Web Sites", Journal of the American Society for Information Science, 53(10), 866–873 (2002)
8. Ekmekcioglu, F. C., Robertson, A. M., Willett, P., "Effectiveness of query expansion in ranked-output document retrieval systems", Journal of Information Science, 18, 139–147 (1992)
9. Fung, P., McKeown, K., "A technical word- and term-translation aid using noisy parallel corpora across language groups", Machine Translation, 12, 53–87 (1997)
10. Hayes-Roth, F., Waterman, D. A., Lenat, D., Building Expert Systems. Addison-Wesley, Reading, MA (1983)
11. He, S., "Translingual Alteration of Conceptual Information in Medical Translation: A Cross-Language Analysis between English and Chinese", Journal of the American Society for Information Science, 51(11), 1047–1060 (2000)
12. Larson, M. L., Meaning-Based Translation: A Guide to Cross-Language Equivalence. University Press of America, Lanham, MD
13. Leonardi, V., "Equivalence in Translation: Between Myth and Reality", Translation Journal, 4(4) (2000)
14. Lesk, M. E., "Word-word associations in document retrieval systems", American Documentation, 20(1), 27–38 (1969)
15. Lin, C. H., Chen, H., "An Automatic Indexing and Neural Network Approach to Concept Retrieval and Classification of Multilingual (Chinese-English) Documents", IEEE Transactions on Systems, Man and Cybernetics, 26(1), 75–88 (1996)
16. Ma, X., Liberman, M., "BITS: A Method for Bilingual Text Search over the Web", Machine Translation Summit VII, Kent Ridge Digital Labs, National University of Singapore (1999)
17. Oard, D. W., Dorr, B. J., "A Survey of Multilingual Text Retrieval", UMIACS-TR-96-19, CS-TR-3815 (1996)
18. Oard, D. W., "Alternative approaches for cross-language text retrieval", in Hull, D., Oard, D. (eds.), 1997 AAAI Symposium on Cross-Language Text and Speech Retrieval, American Association for Artificial Intelligence (1997)
19. Resnik, P., "Mining the Web for Bilingual Text", 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland (1999)
20. Rose, M. G., "Translation Types and Conventions", in Rose, M. G. (ed.), Translation Spectrum: Essays in Theory and Practice, State University of New York Press, pp. 31–33 (1981)
21. Salton, G., Automatic Text Processing. Addison-Wesley, Reading, MA (1989)
22. Simard, M., "Text-translation Alignment: Three Languages Are Better Than Two", Proceedings of EMNLP/VLC-99, College Park, MD (1999)
23. Yang, C. C., Luk, J., Yung, S., Yen, J., "Combination and Boundary Detection Approach for Chinese Indexing", Journal of the American Society for Information Science, Special Topic Issue on Digital Libraries, 51(4), 340–351 (2000)
24. Yang, C. C., Li, K. W., "Automatic Construction of English/Chinese Parallel Corpora", Journal of the American Society for Information Science and Technology, 54(7) (2003)
25. Zanettin, F., "Bilingual comparable corpora and the training of translators", in Laviosa, S. (ed.), META, 43(4), Special Issue: The corpus-based approach: a new paradigm in translation studies, 616–630 (1998)
Decision Based Spatial Analysis of Crime

Yifei Xue and Donald E. Brown

Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904, USA
{yx8d,brown}@virginia.edu
Abstract. Spatial analysis of criminal incidents is an old and important technique used by crime analysts. However, most of this analysis considers the aggregate behavior of criminals rather than individual spatial behavior. Recent advances in the modeling of spatial choice and data mining now enable us to better understand and predict individual criminal behavior in the context of the criminals' environment. In this paper, we provide a methodology to analyze and predict the spatial behavior of criminals by combining data mining techniques and the theory of discrete choice. The models based on this approach are shown to improve the prediction of future crime locations when compared to traditional hot spot analysis.

Keywords. Spatial choice, feature selection, preference specification, model-based clustering
incidents, like many other human-initiated events, involve a decision-making and choice process. Much criminological work has treated criminals or offenders as decision makers who want to benefit from their criminal behavior and avoid exposure to the risk of law enforcement [8]. We take advantage of the fact that the selection of crime targets indicates the criminals' preferences for specific sites in terms of spatial attributes. While interest in these criminal preferences is unique to law enforcement, we can exploit work in economics on spatial choice among consumers to better understand criminal preferences and then use this understanding to predict criminal behavior. This paper develops a spatial choice methodology based on these ideas to analyze location-based crime data.

Spatial choice theory describes human behavior in space as rational decisions among the available spatial alternatives. The choices indicate certain spatial patterns and represent the decision makers' preferences. At the heart of recent work in this area is the pioneering development by McFadden, which led to the formal modeling of discrete spatial choice [1], [18]. Discrete choice models are used for the analysis and prediction of spatial decision making under uncertainty with multiple alternatives. They have been extended to a number of areas, such as consumer destination selection [12], [24], travel mode analysis [3], [19], and recreational demand models [25]. These analyses examine the spatial decisions of a large number of individuals. In general, these decisions have been studied through surveys that address a rather limited set of spatial alternatives for each decision maker. Clearly, spatial choice analysis for crime data breaks new ground. The alternatives are the commercial properties, buildings, and houses in the study area, and while the number of spatial alternatives is finite, it is nonetheless very large compared to other spatial choice problems. Also, the preferences of the criminal decision makers cannot be directly or accurately assessed through interviews, surveys, or questionnaires.

In the rest of this paper, we first formally define the criminal spatial choice problem in Section 2. Section 3 presents the spatial choice models derived from these formal definitions. In Section 4, the models are applied to actual crime data and the locations of future criminal incidents are predicted; comparison results with these new models are reported and summarized. Section 5 contains the conclusions.
2 Problem Statement

Data items for spatial or crime analysis have two components: a location component and an attribute component. They can be represented by a vector {Q, S, k}. Q is the universe of the location component, which is discrete and indexes all spatial alternatives by an ordered pair of coordinates {x, y}. S is the attribute component associated with the spatial alternatives, indicating S different attributes, S = {s_1, s_2, ..., s_S}. k : Q → S is a mapping function specifying the observed attributes of the alternatives.
The spatial decision process can be represented by a vector {Q, S, k, A, D, u, P}. The set A is a subset of Q indicating the finite choices available to all individuals D; A = {a_1, a_2, ..., a_N} represents the N available alternatives for decision makers to choose from. For the spatial analysis of crimes, N is a very large number. D is the universe of individuals who make choices over the available alternative set A; each individual makes choices based on a decision process. u is the utility function mapping the preferences of individuals D over the alternative set A to a utility value U. For an individual d, if the choice sets A_d = {a_1, a_2, ..., a_N} and A_d' = {a_1', a_2', ..., a_N'} have the same attribute values, then they have the same utility, U = u(A_d) = u(A_d'). According to the rational decision making assumption, individuals make choices that maximize their utility. The probability that an individual d from D will choose alternative a_i from an available choice set A_d is specified as P{a_i | A_d, d}, which is produced by the choice process {Q, S, k, A, D, u, P}. The probability P{a_i | A_d, d} is a mapping based on the preferences of individual d and the attributes of all alternatives in the set A_d. The mapping can be stated as P : A × S × D → (0, 1), or indicated by the utility-based function

    P{a_i | A_d, d} = P{u(a_i) ≥ u(a_j) | d, a_j ∈ A_d}.

The utility of alternative a_i to individual d can be divided into two parts:

    U_id = V(d, s_i) + ε(d, s_i).

V(d, s_i) = Σ_l β_l x_l^i is the deterministic part of the utility value, expressed as a linear additive function of all attributes, where x_l^i ∈ X = (S, D) represents the lth component of the combination of the attribute values s_i and the characteristics of individual d. ε(d, s_i) is the error term of the utility function, indicating its unobservable components.
3 Model Development

3.1 Spatial Choice Patterns

Spatial choice theory describes how individuals choose a specific site in space as their target. Their choices show certain patterns in space. The geographical sites form a spatial alternative set A, and individuals make selections from this choice set. Since the number of alternatives in a spatial choice process is very large, individuals are unable to evaluate all spatial alternatives before making their selections. They can only compare part of the choice set and pick the spatial alternative with the highest utility value; this can be stated as a sub-optimal or locally optimal problem. According to Fotheringham's framework of individuals' hierarchical information processing [13], individuals make spatial choices from the alternatives they have evaluated.
For individual d, the choice set is A_d ⊆ A, which contains all spatial alternatives that individual d really considered. The choice that individual d makes will probably have the highest utility among all alternatives in the choice set A_d. Unlike previous work in discrete choice theory, the real choice set A_d in crime analysis is not known to the analysts. Some methods have been proposed to identify or estimate the probability P(a_i ∈ A_d) that an alternative a_i is considered by individual d. After the identification of the individuals' choice set, two factors are considered in people's spatial choice process: i) the utility of alternative a_i to individual d, and ii) the probability that alternative a_i is available to or considered by individual d. Since the number of spatial alternatives is very large, it is possible that some alternatives would give higher utility values but are never considered. In order to reveal the individuals' preferences, we make an assumption here.

Assumption 1: The two factors (i and ii) mentioned above are equally important to the individuals' choice decisions.

The combination of P(a_i ∈ A_d) and the utility U_id of alternative a_i to individual d gives a better estimation of the probability of choices. With Assumption 1, the probability that individual d chooses alternative a_i from A_d can be stated as P(U_id > U_jd + ln P(a_j ∈ A_d), for all a_j ∈ A_d) · P(a_i ∈ A_d) [13]. In order to obtain the spatial choice model, we make another assumption.

Assumption 2: The error term ε(d, s_i) of the individuals' utility function is independently and identically distributed with a Weibull distribution [18].

The spatial choice model is then derived with the same method as McFadden used [13], [18]:

    P(a_i | A_d, d) = exp(V(d, s_i)) · P(a_i ∈ A_d) / Σ_{j∈A} exp(V(d, s_j)) · P(a_j ∈ A_d)        (1)
This model is a multinomial logit model in which each alternative's observable utility is weighted by the probability that the alternative is evaluated.

3.2 Specification of Prior Probability

We assume that the hierarchical information process takes place before the individuals' spatial choices: individuals first evaluate sets of alternatives, and only alternatives within those sets can be selected. We can either define the choice set A_d or give the probability P(a_i ∈ A_d) that an individual will evaluate a certain alternative.
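A small sketch of the probability-weighted multinomial logit of Equations 1-2 is shown below; the coefficient vector beta and the attribute matrix are hypothetical inputs that would come from model estimation and the data set.

```python
import numpy as np

def choice_probabilities(x, beta, p_eval):
    """Probability-weighted multinomial logit (Equations 1-2).

    x      : N x L matrix of attribute values for the N spatial alternatives
    beta   : length-L vector of preference coefficients (estimated elsewhere)
    p_eval : length-N vector of P(a_i in M), the chance each alternative is even considered
    Returns P(a_i | A_d, d) for every alternative.
    """
    v = x @ beta                        # deterministic utility V(d, s_i)
    v = v - v.max()                     # numerical stabilization before exponentiation
    weighted = np.exp(v) * p_eval       # utility weighted by the evaluation probability
    return weighted / weighted.sum()
```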
For the spatial analysis of crime data, it is not easy to know individuals' preferences, so we have to make an assumption to simplify the model derivation.

Assumption 3: During the process of individuals' spatial choices, the preferences of all individuals d ∈ D are the same. The pre-evaluated spatial alternative set A_d is also the same for different individuals.

We use M to represent the set of pre-evaluated spatial alternatives for all individuals. Under Assumption 3, the spatial site selection model becomes

    P(a_i | A_d, d) = exp(V(d, s_i)) · P(a_i ∈ M) / Σ_{j∈A} exp(V(d, s_j)) · P(a_j ∈ M)            (2)

The definition of P(a_i ∈ M) is important here. We use a kernel density estimation method to obtain the probability that spatial alternative a_i is evaluated by criminals. From the study of Brown et al. [7], we know that the location components of the spatial alternatives alone do not provide enough information about the criminals' preferences. There are many feature values attached to the spatial alternatives, and a part of these values is believed to be relevant to the occurrence of criminal incidents; unfortunately, we do not know which part. However, we can mine the criminals' preferences from all feature values of past crime incidents. We use a feature selection process to find the smallest feature subset of the universal feature space, called the key feature set or key feature space, in which the past criminal incidents indicate clear patterns. These patterns represent possible preferences in the criminals' pre-evaluation. Using the selected key features, we obtain the prior evaluation probability P(a_i ∈ M) as follows.
    P(a_i ∈ M) = (1/K) Σ_{k=1}^{K} L( (s_i^1 − s_k^1)/h_1, (s_i^2 − s_k^2)/h_2, (s_i^3 − s_k^3)/h_3, ... )        (3)
where s_i^1, s_i^2, s_i^3, ... are the key features of spatial alternative a_i, K is the total number of observations, and L is a function specifying the kernel estimator; we use a Gaussian function here. The h's are the bandwidths used in the kernel estimation. Changing the bandwidths influences the quality of the density estimation, so the choice of bandwidths is important, and the literature in this area offers a great deal of discussion. We use a recommended bandwidth selection method from Bowman and Azzalini [5],
    h_i = ( 4 / ((p + 2) K) )^{1/(p+2)} × σ_i

for the ith dimension, where p is the number of dimensions of the density estimation. The model adjusted with the estimated prior probability P(a_i ∈ M) is called the key feature adjusted spatial choice model.

3.3 Spatial Misspecification

Both the spatial choice model and other discrete choice models try to include all related predictor variables to estimate decision makers' preferences and predict their future
choices. However, it is practically impossible to include all relevant variables that affect people's decisions in a spatial choice model. First, some variables may be very difficult to measure. Second, some variables that affect choices may not have been conceptualized or identified by analysts. Third, even if it were possible to identify and include all relevant variables, some of them would be redundant and correlated with one another, and too many predictor variables make the estimated parameters unstable and reduce the model's predictive accuracy. It is therefore necessary and inevitable to omit many predictor variables, which leads to the misspecification of choice models.

During the development of our spatial choice model, Assumption 3 states that the preferences of all individuals d ∈ D are the same and that the pre-evaluated choice set A_d is the same for all individuals. This makes it easy to estimate the pre-evaluated choice set A_d, but it also biases the estimated individual preferences due to the lack of related information about the decision makers. For crime analysis, it is impossible to include all preference information in the spatial choice model, but the preferences of decision makers can be specified from their past choices. To avoid this bias and increase the predictive accuracy of the spatial choice model, it is necessary to account for the bias introduced by the absence of important factors and to discover the preferences of individuals. In our spatial choice model, we specify the pre-evaluated choice set for individuals with different preferences. One solution is to classify all decision makers by their preferences inferred from past choice incidents. With well-selected key features, the past choices indicate certain patterns in the key feature space. We use clustering methods to identify the different classes of decision makers and to identify their preferences by defining the pre-evaluated choice sets. The adjusted spatial choice model is called the Preference Specified Spatial Choice Model (PSSCM).

3.4 Clustering Methods

Clustering is one of the most useful tasks in data mining for discovering groups and identifying interesting distributions and patterns in an underlying data set. Clustering involves partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters. Researchers have studied clustering extensively, since it occurs in many applications in engineering and science, and clustering may result in different partitionings of a data set depending on the specific criterion used. The basic steps in developing a clustering can be summarized as feature selection, the clustering algorithm, validation of the results, and interpretation of the results. Feature selection chooses the features on which clustering is to be performed so as to encode as much information as possible. We have used the feature selection step to find the key features. By removing all features that are irrelevant to classification, the small feature space subset provides enough
information for pattern recognition, thereby reducing the cost and improving the quality of classification [23]. The clustering algorithm is the most important part of the clustering process, and it includes similarity measures, partitioning methods, and stopping criteria, each of which is described by a variety of sources [11], [16]. No matter what clustering algorithm is used, it is important to define a stopping criterion or to determine how many clusters are in the data set. Various strategies for the simultaneous determination of the number of clusters and the cluster memberships have been proposed, for example by Engelman and Hartigan [10], Bock [4], Bozdogan [6], and Fraley and Raftery [14]. Fraley and Raftery use a model-based strategy and the Bayesian Information Criterion (BIC) to perform clustering and determine the number of clusters. In this approach, the data are viewed as coming from a mixture of probability distributions, each representing a different cluster. Methods of this type have been applied in a number of practical applications.

In model-based clustering, it is assumed that the data are generated by a mixture of underlying probability distributions in which each component represents a different group or cluster. Let f_k(a_i | θ_k) be the density of an observation a_i from the kth component, where θ_k are the corresponding parameters. The density function f_k(a_i | θ_k) is generally assumed to be a multivariate normal distribution of the form

    f_k(a_i | μ_k, Σ_k) = exp{ −(1/2) (a_i − μ_k)^T Σ_k^{-1} (a_i − μ_k) } / ( (2π)^{p/2} |Σ_k|^{1/2} )        (4)

where μ_k is the mean vector and Σ_k is the covariance matrix of the observations; these are the parameters of the density distribution. The parameterization of the covariance matrix Σ_k decides the characteristics (orientation, volume, and shape) of the cluster distributions; these characteristics can be allowed to vary between clusters or constrained to be the same for all clusters. Expectation maximization (EM) is then used to find the clusters, and the Bayesian Information Criterion (BIC) is used as a criterion to compare different models.
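The original analysis used Fraley and Raftery's model-based clustering software; the sketch below is a rough equivalent built on scikit-learn's GaussianMixture, with BIC used to choose the number of components (scikit-learn defines BIC so that lower is better, the opposite sign convention from mclust).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_model_based_clusters(x, max_k=9, covariance_type="full", seed=0):
    """Gaussian mixture clustering with the number of clusters chosen by BIC.

    x : K x p array of key-feature vectors for the observed incidents.
    """
    fits = [GaussianMixture(n_components=k, covariance_type=covariance_type,
                            random_state=seed).fit(x)
            for k in range(1, max_k + 1)]
    bics = [m.bic(x) for m in fits]          # lower BIC is better in scikit-learn
    best = fits[int(np.argmin(bics))]
    labels = best.predict(x)                 # cluster membership for each incident
    return best, labels, bics
```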
4 Application of the Spatial Choice Model to Real Crime Analysis

4.1 Crime Data Set

The data for model estimation came from the ReCAP (Regional Crime Analysis Program) system. The ReCAP system is an interactive shared information and decision support system that uses databases, a geographic information system (GIS), and statistical tools to analyze, predict, and display future crime patterns.
Our crime analysis was based on crime incidents between July 1, 1997 and September 30, 1997 in the city of Richmond, Virginia. We used residential "Breaking and Entering" (B & E) crime incidents for model estimation and validation. Using the crime incidents in the training data set, we placed all incidents on a geographic map. The sub-regions shown in Fig. 1 are block groups, which are the smallest areas for which census counts are recorded.
Fig. 1. Breaking and Entering criminal incidents between July 01, 1997 and September 30, 1997 in Richmond, Virginia.
The analysis of B & E is related to the locations of households in a city. However, it is difficult to represent all locations of individual houses in even a modest-sized city such as Richmond. Therefore, we aggregated the alternatives using 2517 regular grid cells, which were assumed to be fine enough to represent all spatial alternatives within this area. The features of each spatial alternative came from the combination of census data (from the "CensusCD + Maps" compact disk held at the University of Virginia's Geospatial and Statistical Data Center) and calculated distance values. All features were possibly related to the decision process of criminals.

4.2 Feature Selection by Similarities

Since the attributes of the spatial alternatives came from census data and calculated distance values, it is possible that some of these attribute values are correlated. Using the calculated correlation values as similarities, we performed hierarchical clustering on all features of the observed spatial incidents. The resulting clustering of features is shown in Fig. 2.
Fig. 2. Clusters of features of observed spatial alternatives
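A sketch of this similarity-based grouping of features is shown below; it assumes the features sit in a pandas DataFrame and uses 1 − |correlation| as the distance, which is our choice rather than a detail stated in the paper.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(features: pd.DataFrame, n_clusters: int = 5):
    """Group correlated features using their pairwise correlations as similarities."""
    corr = features.corr().abs()
    dist = 1.0 - corr                            # turn similarity into a distance (assumption)
    condensed = squareform(dist.values, checks=False)
    tree = linkage(condensed, method="average")  # hierarchical clustering of the features
    groups = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(features.columns, groups))
```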
From the clustering tree, we divided the features into five clusters, each containing correlated features. After checking the distributions of the feature values, we found that COND1.DST is almost uniformly distributed and is therefore not a good feature for our analysis. For the remaining features there are two options: randomly picking one feature from each cluster, or combining the features within each cluster. We picked the features D.HIGHWAY (distance to highway), FAM.DENSITY (family density per unit area), P.CARE.PH (personal care expenditure per household), and D.HOSPITAL (distance to hospital); the first three were used by Brown et al. [7]. These are the key features and are assumed to be good enough to represent the other features in their clusters.

Based on the selected features, we applied Fraley and Raftery's clustering method [14] to the crime data. The number of clusters was decided by the calculated BIC values, whose trends are shown in Fig. 3. According to Fig. 3, we decided that there are 6 clusters in the crime data set. Each cluster corresponds to a group of criminals that have similar preferences in their choices of spatial alternatives. The distribution of crime incidents within the different clusters is listed in Table 1.

4.3 Model Estimation and Prediction

The number of spatial alternatives for crime spatial analysis is very large, which makes data preparation and computation prohibitively expensive. To handle this problem, we adopted an importance sampling technique suggested by Ben-Akiva
[1]. Sampling alternatives is a commonly applied technique for reducing the computational burden involved in the model estimation process.
[Figure 3: BIC values (y-axis) plotted against the number of clusters (x-axis) for the four parameterizations labeled 1-4.]
Fig. 3. The trends of BIC values for the different parameterized model-based clustering algorithms. 1: equal volume, equal shape, no orientation; 2: variable volume, equal shape, no orientation; 3: equal volume, equal shape, equal orientation; 4: variable volume, variable shape, equal orientation
Table 1. Distribution of crime incidents in clusters
                  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
Crime incidents   109        180        200        202        133        55
Next we consider the model estimation and prediction step. The prior probability P(a_i ∈ M) of the adjusted spatial choice model was calculated as in Section 3.2 for each cluster, using the key features produced by the feature selection process. Using the training data set of B & E incidents of each cluster, we obtained the estimate of the preference specified spatial choice model for each cluster, P(a_i | A_d, d ∈ M_l), where M_l indicates the presence of criminals with preferences in the lth cluster. The final prediction of the future crimes' spatial distribution is
the combination of the predicted probabilities of all clusters. The combination method is also very important. Given the conditional probability P(a_i | A_d, d ∈ M_l) that spatial alternative a_i will be picked by criminals within cluster M_l, and the chance P(M_l) that criminals d ∈ M_l will commit the next crime within the study region, the probability that spatial alternative a_i is picked by any criminal is

    P(a_i | A_d, d ∈ M) = Σ_{l=1}^{L} P(a_i | A_d, d ∈ M_l) P(M_l)

where L is the total number of clusters within the crime data set. The probability P(M_l) can be defined by many methods; here we used the ratio

    P(M_l) = P(a_i ∈ M_l) / Σ_{j=1}^{L} P(a_i ∈ M_j)                                               (5)
where P(a_i ∈ M_l) is the probability that an individual d ∈ M_l pre-evaluates spatial alternative a_i. With the preference specified spatial choice model described above, we made our predictions. We also use the hot spot model as a comparison model to test the two models provided in this paper, the key feature adjusted spatial choice model and the preference specified spatial choice model. The residential B & E incidents between October 1, 1997 and October 31, 1997 were used as the testing data set. The predictions of the future crimes' spatial distribution and the testing incidents are shown in Figs. 4-6.
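The sketch below illustrates how the per-cluster predictions can be combined according to Equation 5. It uses scipy's gaussian_kde, whose default bandwidth rule stands in for the Bowman-Azzalini choice used in the paper, and the input array shapes are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

def combined_prediction(per_cluster_probs, cluster_incident_features, alt_features):
    """Combine per-cluster choice probabilities into one surface (Section 4.3, Eq. 5).

    per_cluster_probs         : list of length-N arrays, P(a_i | A_d, d in M_l) for each cluster l
    cluster_incident_features : list of (K_l x p) arrays of key features of past incidents per cluster
    alt_features              : N x p array of key features of the N spatial alternatives
    """
    # P(a_i in M_l): kernel density of cluster l's past incidents, evaluated at each alternative
    pre_eval = np.array([gaussian_kde(inc.T)(alt_features.T)
                         for inc in cluster_incident_features])      # shape L x N
    p_ml = pre_eval / pre_eval.sum(axis=0, keepdims=True)            # Equation 5, per alternative
    combined = sum(p * w for p, w in zip(per_cluster_probs, p_ml))   # mixture over clusters
    return combined
```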
Fig. 4. Prediction of hot spot model with crime incidents from 10/01/97 to 10/31/97
Fig. 5. Prediction of key feature adjusted spatial choice model with crime incidents from 10/01/97 to 10/31/97
Fig. 6. Prediction of preference specified spatial choice model with crime incidents from 10/01/97 to 10/31/97
4.4 Model Comparisons

To compare the different models, we standardized all predictions of the adjusted models and the comparison model. The hypothesis is that, for the population of all future crime incidents, the proposed model will outperform the comparison model.
We assume that the testing data set contains m incidents that occurred at the locations a′_1, a′_2, ..., a′_m, respectively. For incident a′_i, let the predicted probability given by the proposed model be p_sp,i and that given by the comparison model be p_sc,i. The hypothesis test was built around µ, which denotes the mean of the difference between the predicted probability given by the proposed model and that given by the comparison model. Assume that the proposed model gives a better prediction of future crimes than the comparison model. Then the null hypothesis is that the predicted probability difference µ between the two models for future crime incident locations is less than or equal to 0, and the alternative hypothesis is that the predicted probability for the proposed model is significantly higher than that for the comparison model. We performed the hypothesis test as

H_0: µ ≤ 0,  H_a: µ > 0.    (6)

Using the testing data set with m crime incidents, we obtained the estimated probability difference µ̂:

µ̂ = (1/m) Σ_{i=1}^{m} (p_sp,i − p_sc,i).    (7)

The standard deviation of the difference qs_i = p_sp,i − p_sc,i was estimated by

σ̂ = sqrt( (1/(m−1)) Σ_{i=1}^{m} (qs_i − µ̂)² ).    (8)
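The z-statistics and p-values reported in the following table follow from a one-sided z-test on this mean paired difference. A minimal Python sketch of the computation (an illustration built from Eqs. (6)-(8), not the authors' code; the prediction arrays are assumed):

```python
import math

def paired_one_sided_z_test(p_proposed, p_comparison):
    """One-sided test of H0: mu <= 0 vs Ha: mu > 0 on paired prediction differences."""
    m = len(p_proposed)
    diffs = [p - c for p, c in zip(p_proposed, p_comparison)]
    mu_hat = sum(diffs) / m                                                  # Eq. (7)
    sigma_hat = math.sqrt(sum((d - mu_hat) ** 2 for d in diffs) / (m - 1))   # Eq. (8)
    z = mu_hat / (sigma_hat / math.sqrt(m))      # standardized mean difference
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return mu_hat, sigma_hat, z, p_value
```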
The results of these tests are shown in Table 2. In the testing results, "Mean" and "Std. Dev." stand for µ̂ and σ̂, respectively, and the p-value gives the probability of observing a difference at least as large as µ̂ under the null hypothesis.

Table 2. The comparison results for the testing data set (10/01/97 - 10/31/97)

Comparison                                        Mean         Std. Dev.    z-Statistic  p-Value
Preference Specified vs. Hot Spot                 7.757×10^-4  5.051×10^-3  2.624        0.004
Key feature adjusted vs. Hot Spot                 2.861×10^-5  2.603×10^-4  1.878        0.030
Preference Specified vs. Key feature adjusted     7.471×10^-4  4.922×10^-3  2.593        0.005
The comparison results indicate that the two spatial choice models significantly outperform the comparison hot spot model. The preference specified spatial choice model also significantly outperforms the key feature adjusted spatial choice model. These results show that analyzing the feature values attached to the spatial alternatives and the specified preferences of the decision makers improves the prediction of the locations of future crimes. Based on the estimation of criminals' preferences over the feature space, we provide a more efficient and accurate prediction method for the analysis of the spatial information of crimes.
5 Conclusion

Spatial analysis is of critical importance to law enforcement. It enables better planning and use of scarce resources and is particularly useful when addressing the variety of threats facing modern communities. Past work in this area has concentrated on aggregated approaches to understanding criminal behavior and has displayed the results of this analysis as hot spots. In this paper, a new preference specified spatial choice model is provided that shows how the preferences of criminals can be modeled to better understand the spatial patterns of crime. When used with actual breaking and entering data, this method increased the accuracy of predicting future crime locations by a statistically significant amount. In addition, the method also provides a way to interpret the relationship between criminal decision making and spatial attributes.
CrimeLink Explorer: Using Domain Knowledge to Facilitate Automated Crime Association Analysis

Jennifer Schroeder¹, Jennifer Xu², and Hsinchun Chen²

¹ Tucson Police Department, 270 S. Stone Avenue, Tucson, AZ 85701
[email protected]
² Department of MIS, University of Arizona, Tucson, Arizona 85721
{jxu, hchen}@eller.arizona.edu
Abstract. Link (association) analysis has been used in law enforcement and intelligence domains to extract and search associations between people from large datasets. Nonetheless, link analysis still faces many challenging problems, such as information overload, high search complexity, and heavy reliance on domain knowledge. To address these challenges and enable crime investigators to conduct automated, effective, and efficient link analysis, we proposed three techniques: the concept space approach, a shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining the importance of associations. We implemented a system called CrimeLink Explorer based on the proposed techniques. Results from our user study involving ten crime investigators from the Tucson Police Department showed that our system could help subjects conduct link analysis more efficiently. Additionally, subjects concluded that association paths found based on the heuristic approach were more accurate than those found based on the concept space approach.
1 Introduction

Link analysis faces several challenging problems. First, information about associations between crime entities (person, location, organization, property, etc.) is often buried in large volumes of raw data collected from multiple sources (e.g., crime incident reports, surveillance logs, telephone records, financial transaction records, etc.), creating an information overload problem. Usually, link analysis entails an investigator manually expanding known entities by reading each document in which the entities in question appear. If two entities appear in the same document, they may have some association with each other. If no association is found, the investigator has to iteratively expand more documents until a significant path of associations between the entities is found. This process can be tremendously time-consuming.

Second, high branching factors (the number of direct links an entity has) increase the search complexity of link analysis dramatically. A high branching factor can lead to a large number of associations that need to be evaluated when two crime entities are not directly associated. In a breadth-first search of depth 4, for instance, an average branching factor of 7 can result in 2,401 associations that need to be evaluated. In reality, criminals who have repeated police contacts and arrests tend to commit many crimes with many people, causing high branching factors. The branching factor of an association search can be further inflated if associations with many other entity types (e.g., addresses, organizations, property, or vehicles) are considered.

Third, determining the importance of associations for uncovering investigative leads relies heavily on domain knowledge. Crime investigators often focus only on strong and important associations and paths because different types of crimes usually have different characteristics. Associations between crime entities may carry different weights in the investigation of different types of crimes. For example, the relationship between a suspect and a victim may not be as important for uncovering investigative leads in a burglary case as in a homicide case. Link analysis may distract or mislead an investigation if not guided by domain knowledge.

Some link analysis software packages are available for use in crime investigation. However, most of these packages do not help extract, search, and analyze associations beyond mere visualization of analysis results. Some tools facilitate only single-level association searches, finding only directly related entities. Automated, effective, and efficient link analysis techniques are needed to assist law enforcement and intelligence investigators in carrying out crime investigation [21, 24].

To address the challenges of link analysis, we proposed and implemented several techniques for automated link analysis. These techniques include the concept space approach [4] to extracting associations from crime data, a heuristic-based approach to incorporating domain knowledge, and a shortest-path algorithm [7] to search association paths and reduce the search complexity imposed by high branching factors.

The rest of the paper is organized as follows. We review prior literature in Section 2 and discuss system design in Section 3. In Section 4 we present results of a system evaluation study conducted at the Tucson Police Department (TPD). Section 5 concludes the paper and suggests future directions.
2 Literature Review

In this section, we review related work in link analysis, domain knowledge incorporation approaches, and shortest-path algorithms.

2.1 Link Analysis

The earliest approach to link analysis is the Anacapa charting system [16]. In this approach, an investigator first constructs an association matrix by examining documents to identify associations between crime entities. Based on this association matrix, a link chart can be drawn for visualization purposes. In a link chart, different symbols represent different types of entities, such as individuals, organizations, vehicles, or locations. Based on this chart, an investigator may discover new investigative directions or confirm initial suspicions about specific suspects [24]. However, this approach is primarily manual and depends on human investigators to extract, search, and analyze association data. It offers little help with the information overload and high search complexity problems.

Some automated approaches have been proposed for link analysis. Lee [20] developed a technique to extract association information from free text. Relying heavily on Natural Language Processing (NLP) techniques, this approach can extract entities and events from textual documents by applying large collections of predefined patterns. Associations among extracted entities and events are formed using relation-specifying words and phrases such as "member of" and "owned by". The heavy dependence of this approach on hand-crafted language rules and patterns limits its application to crime data in diverse formats.

There have been some link analysis tools that allow for "single-level" or direct association searches. Watson [2] can identify possible links and associations between entities by querying databases. Given a specific entity such as a person's name, Watson can automatically form a database query to search for other related records. The related records found are linked to the given entity and the result is presented in a link chart. The COPLINK Detect system [5] applied a concept space approach developed by Chen and Lynch [4] for exploring associations. This approach was originally designed to generate thesauri from textual documents automatically by measuring the co-occurrence weight, the frequency with which two phrases appear in the same document. When applied to crime incident reports, this approach can automatically extract association information between crime entities and has been found to be efficient and useful for crime investigation [17].

However, both Watson and COPLINK Detect allow users to search for only direct ("single-level") associations and do not facilitate the search for association paths consisting of multiple intermediate links. Moreover, association strengths obtained using the concept space approach are based merely on co-occurrence weights. No domain knowledge is utilized to determine the importance of associations or to consider other information that can potentially suggest associations between entities.
In the next section, we review prior research on domain knowledge incorporation approaches.

2.2 Domain Knowledge Incorporation Approaches

Domain knowledge is often important for solving domain-specific problems. In the broader fields of artificial intelligence and data mining research, expert systems and Bayesian networks are typical techniques for incorporating domain knowledge. During the knowledge acquisition phase of expert system construction, domain experts' knowledge and experience, in addition to some common sense rules, are collected and recorded. The knowledge generated is usually represented as a set of rules and stored in a knowledge base [26]. Expert systems have been employed in domains such as factory scheduling [11], telephone switch maintenance [14], and disease diagnosis [23]. Because of the high expense of building knowledge bases and other issues such as low scalability and accuracy, expert systems have not been widely used.

The Bayesian network is another approach to incorporating the knowledge of domain experts [18]. It encodes existing knowledge in a probability network with each node representing a variable and each link representing a dependency relationship between two variables. Some variables in a Bayesian network representing auditors' knowledge of bank performance, for instance, can be the financial ratios indicating banks' financial health. Other variables can be indicators of bank failure or other risks. Links between these variables specify the dependency relationships [25]. In addition to incorporating existing knowledge, Bayesian networks can learn new knowledge from data [18] and have been shown to be effective in domains such as gene regulation function prediction [6, 10].

In the domain of law enforcement and intelligence, the approaches for incorporating expert knowledge have been primarily ad hoc. Goldberg and Senator [12] used a heuristic-based approach in the FinCEN system to form associations between individuals who had a shared address, a shared bank account, or related transactions. Money laundering and other illegal financial activities could be detected based on the associations discovered. However, these heuristics were used by investigators to manually uncover associations and have not really been incorporated into a system for automated link analysis. With large datasets, investigators still face the problems of information overload and high search complexity.

The next section reviews shortest-path algorithms, which can help reduce search complexity for human investigators. Although they have been studied and employed widely in other domains, shortest-path algorithms have not yet been widely adopted in the law enforcement domain.

2.3 Shortest-Path Algorithms

Shortest-path algorithms can find optimal paths between given nodes by evaluating link weights in a graph. One can focus on only the optimal path without being distracted by a large number of other possible paths. The Dijkstra algorithm [7] is the
classical method for computing the shortest paths from a single source node to every other node in a weighted graph. Most other algorithms for solving shortest-path problems are based on the Dijkstra algorithm but use improved data structures in their implementation [8]. Some researchers have proposed neural network approaches to solving the shortest-path problem [1]. The shortest-path algorithm has been used to find the strongest association paths between two or more crime entities [27]. Another tool that employs the shortest-path algorithm is the Link Discovery Tool [19]. It is able to search for association paths between two individuals who on the surface appear to be unrelated.

In summary, prior work related to link analysis has proposed some approaches to addressing the challenges. However, link analysis remains a difficult problem for crime investigators facing large volumes of data. In the next section we present the system design of our CrimeLink Explorer, which addresses the three challenges of link analysis.
3 System Design

We designed and implemented CrimeLink Explorer for automated link analysis. The system contained a set of crime incident data originating from the Tucson Police Department (TPD) Records Management System. The concept space approach was used to identify and extract associations between all criminals in the dataset based on co-occurrence weights. Alternatively, a number of heuristics captured expert knowledge for identifying criminal associations and determining the importance of associations for investigation. To facilitate the search for the strongest association paths between individuals of interest, we implemented Dijkstra's shortest-path algorithm with logarithmic transformations on association weights (co-occurrence weights or heuristic weights). A graphical user interface was provided to allow users to input names of interest and visualize association paths found based either on the concept space approach or on the heuristic approach.

3.1 Crime Incident Reports

Law enforcement databases usually store crime incident reports, which are a rich source of data about both criminal and non-criminal incidents over extended time periods. Incident reports may document serious crimes such as homicides or trivial incidents such as suspicious activity calls or neighbor disputes. A trivial incident may provide important information about associations that can later be used to solve serious crimes. Individuals involved in criminal activities may have repeated contacts with police, resulting in their presence in multiple incident reports.
Fig. 1. CrimeLink Explorer system architecture (components: crime incident reports; concept space; heuristics based on crime types, shared address, and shared phone; co-occurrence weights; heuristic weights; association path search using the shortest-path algorithm; graphical user interface)
All crime incidents are classified into different types (e.g., Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, etc.), usually based on the Uniform Crime Reporting (UCR) standard, which has been the national standard for case classification and crime reporting since 1930 [22]. The successor to UCR, the National Incident-Based Reporting System [9], has not yet been universally adopted by U.S. law enforcement agencies. Thus, the crime incident reports in this research are UCR based. These incident report records formed the source for automating link analysis in this research.

3.2 Concept Space Approach

We used the concept space approach to automatically identify and extract associations from crime incident reports. We treated each incident report as a document and each crime entity as a phrase. To reduce complexity, we focused only on associations between persons and did not consider possible associations between other types of entities such as locations and property. We then calculated the co-occurrence weights based on the frequency with which two persons appeared together in the same crime incidents. Ideally, the value of a co-occurrence weight not only implies the presence of an association between two persons but also indicates the importance of the association for uncovering investigative leads [17].

However, this approach has its limitations when used in link analysis. An example is a burglary investigation where the victim and the suspect appear together in the incident report but have never met and are not even casual acquaintances. Moreover, co-occurrence weights obtained by the concept space algorithm had been found to be of only minor assistance when subjected to user evaluation. In previous user studies, investigators tended to make judgments about the associations independent of the co-occurrence weights provided by the system. Crime investigators still faced the information overload problem because they had to make the final determination as to the importance of associations.
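For illustration, the person-to-person co-occurrence counting that underlies this step can be sketched in a few lines of Python. This is a simplification: the full concept space algorithm of Chen and Lynch also normalizes these raw counts, and the incident structure shown here is hypothetical.

```python
from itertools import combinations
from collections import defaultdict

def cooccurrence_weights(incident_persons):
    """Count how often each pair of persons appears in the same incident.

    incident_persons: dict mapping incident_id -> iterable of person names.
    Returns a dict mapping a sorted (person_a, person_b) tuple -> co-occurrence count.
    """
    weights = defaultdict(int)
    for persons in incident_persons.values():
        for a, b in combinations(sorted(set(persons)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical incidents, each listing the persons named in the report.
incidents = {
    "I-001": ["SMITH J", "DOE R"],
    "I-002": ["SMITH J", "DOE R", "LEE K"],
    "I-003": ["DOE R", "LEE K"],
}
print(cooccurrence_weights(incidents))
# {('DOE R', 'SMITH J'): 2, ('DOE R', 'LEE K'): 2, ('LEE K', 'SMITH J'): 1}
```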
In the next section we discuss the heuristic approach as an alternative to the concept space approach.

3.3 Heuristic Approach

We collected heuristics that domain experts often use when analyzing crime data to make judgments about the strength of associations between people. We interviewed several crime analysts and detective sergeants at the TPD. Three criteria were identified as the most important heuristics: (a) the relationship between crime type and person roles, (b) shared addresses or telephone numbers, and (c) repeated co-occurrence in incident reports. Rather than employing expert systems or Bayesian network approaches to incorporating expert knowledge, we represented the heuristics collected on a 1-100 percentage scale indicating the strength of associations, ranging from weak to strong. A weak association, such as the relationship between a victim and a suspect in a burglary incident, was assigned a value of 1, and a strong association, such as a person and his close friend and criminal associate who have been arrested together repeatedly, was assigned a value near 100.

Crime-Type and Person-Role Relationships. The crime investigators we interviewed specialized in the investigation of one or more types of crime: Homicide, Aggravated Assault, Robbery, Fraud, Auto Theft, Sexual Assault, Child Sexual Abuse, Domestic Violence, and many others. Person roles used in the TPD dataset included: Victim, Witness, Suspect, Arrestee, and Other. We constructed a matrix and assigned scores to role combinations in each of the crime types. All of the crime investigators agreed that most co-arrestees or suspects in an incident had a strong association. Other role combinations, however, varied considerably depending on the type of crime. The score for a specific role combination was based on an estimate of how often a true association occurs for that role combination and crime type out of every 100 incidents. For instance, the homicide detective sergeant estimated that at least 98 out of 100 homicide incidents included a victim and a suspect who were acquaintances. Thus, the corresponding score for the victim-suspect combination for homicide crimes in the heuristic matrix was set to 98.

This method of assigning heuristic scores was somewhat arbitrary and could be enhanced by including a statistical analysis of the crime-type/person-role relationship. However, capturing such statistics by manually reading a large number of incident reports from each crime type would be time prohibitive. We therefore relied on domain experts' estimation based on their past experience rather than statistical analysis.

Although informative, heuristics based on the relationship between crime type and person role could not necessarily provide complete information about criminal associations. For instance, the association score between two arrestees in narcotics sale incidents was assigned 95. This accurately reflected the high likelihood that the two arrestees knew each other, but did not capture the fine gradient from acquaintances to
close friends. Shared telephone and address associations and repeated appearances together in incidents could provide additional information to distinguish links from weak to strong. To allow a point spread for this additional information, the heuristic scores based on crime type and person role were reduced to account for 85% of the final heuristic weight.

Shared Address/Phone. Our domain experts stated that shared phone numbers and addresses were often important indicators of associations. We therefore assigned an additional score to an association when two persons shared a common phone number or address. Since phone number data were often subject to various errors in the TPD databases, they added only 5% of the final heuristic weight. Shared addresses added an additional 10% to the final heuristic weight since they were often more significant and less erroneous than phone number data.

Co-occurrence. In the absence of other information suggesting an association, the fact that two persons appeared together in multiple incidents might imply a strong relationship. This was the same rationale behind the concept space approach. However, rather than using the co-occurrence weight, we estimated the strength of an association resulting from multiple co-occurrences in incidents based on an empirically derived probability distribution. We obtained the empirical distribution by analyzing a random sample of 40 incident reports of various crime types and counting the number of times each pair of persons co-occurred. We read the supporting narrative reports for each incident to determine whether an association was important. We found that the more times two persons appeared together, the more likely they were involved in family-related crimes. That is, a large number of co-occurrences between two persons implied a high likelihood of a close relationship. For example, of the 21 incidents (out of the 40 sampled) that contained persons who appeared together four times, 15 were domestic violence incidents, custodial interferences, or family fights, and six were court order enforcements or civil matters often related to domestic situations. Court orders and civil matters that were not family related overwhelmingly concerned persons who had some prior association. Based on our analysis, we constructed the probability distribution by assigning 1 to a single co-occurrence, indicating that it could be completely random with no other facts to support a stronger association. From two to three co-occurrences the probability increased rapidly. At four or more co-occurrences the probability exceeded 99%, so all pairs of subjects who co-occurred four or more times were given a probability of 100.

Table 1. Empirically derived probability distribution

Co-occurrence count    Association probability (%)
1                      1
2                      45
3                      98
≥4                     100
The final heuristic weight for a specific association was calculated as the maximum of (i) the weighted sum of the crime-type/person-role score, the shared phone score, and the shared address score and (ii) the association probability based on co-occurrence counts:

final weight = MAX(0.85 × (crime-type/person-role score) + 0.05 × (shared phone score) + 0.10 × (shared address score), 1.00 × (association probability based on co-occurrence counts)).

3.4 Association Path Search

For this system, we used Dijkstra's shortest-path algorithm [7] to address the search complexity problem. A logarithmic transformation was applied to the association weights because conventional shortest-path algorithms cannot be used directly to identify the strongest association between a pair of persons [27]. With this transformation, a user can find the strongest association paths among two or more persons of interest.

3.5 User Interface

A graphical user interface was implemented to allow a user to interact with the system. Figure 2 shows the user interface after the user has conducted a search for a path between three persons. Names are scrubbed for data confidentiality. The user entered the names of interest in the text field and then pressed the "Show Associations" button. The system conducted the shortest-path search based on either the co-occurrence weights or the heuristic weights, depending on the user's choice. The user could then double-click on any node to see additional information (sex, date of birth, and Social Security number) about the person represented. The user could also double-click on a link to see information about the origin of the link, shared phone numbers or addresses, the weights from the concept space approach or from the heuristics, and the descriptions of incidents in which the two persons were involved.
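Before turning to the evaluation, the following Python sketch ties together the heuristic weighting of Section 3.3 and the log-transformed shortest-path search of Section 3.4. It is an illustration under assumed data structures, not the CrimeLink Explorer implementation; the names, scores, and example graph are hypothetical.

```python
import heapq
import math

COOCCUR_PROB = {1: 1, 2: 45, 3: 98}  # Table 1; four or more co-occurrences -> 100

def heuristic_weight(role_score, shared_phone, shared_address, cooccurrences):
    """Final association strength on a 1-100 scale (Section 3.3)."""
    weighted_sum = 0.85 * role_score + 0.05 * shared_phone + 0.10 * shared_address
    cooccur = 100 if cooccurrences >= 4 else COOCCUR_PROB.get(cooccurrences, 0)
    return max(weighted_sum, cooccur)

def strongest_path(graph, source, target):
    """Dijkstra over -log(strength/100): the shortest transformed path is the
    strongest association path (Section 3.4)."""
    # Build an undirected adjacency list with transformed edge costs.
    adj = {}
    for (a, b), strength in graph.items():
        cost = -math.log(strength / 100.0)
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, cost in adj.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbor, math.inf):
                dist[neighbor], prev[neighbor] = nd, node
                heapq.heappush(heap, (nd, neighbor))
    # Reconstruct the path from the target back to the source.
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# Hypothetical association graph; edge values are heuristic weights (1-100).
graph = {
    ("SMITH J", "DOE R"): heuristic_weight(98, 0, 100, 2),
    ("DOE R", "LEE K"): heuristic_weight(10, 100, 0, 1),
    ("SMITH J", "LEE K"): heuristic_weight(1, 0, 0, 1),
}
print(strongest_path(graph, "SMITH J", "LEE K"))  # ['SMITH J', 'DOE R', 'LEE K']
```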
4 System Evaluation

We conducted a user study at the TPD to evaluate our system's performance. We wanted to find out whether the automated link analysis approaches we proposed (the concept space approach, the heuristic approach, and the shortest-path algorithm) help address the information overload and search complexity problems, and whether domain knowledge helps identify associations between crime entities more accurately than the concept space approach. We extracted approximately 20 months of incident reports from the TPD database. The resulting dataset contained 239,780 incident reports in which 229,938 persons were involved. Information about those persons, such as age, gender, race, address, and phone number, was also extracted.
Fig. 2. CrimeLink Explorer user interface
Ten crime analysts and criminal intelligence officers at the TPD participated in the study. Several subjects were very experienced in link analysis. Each subject was asked to perform three tasks using CrimeLink Explorer and COPLINK Detect (a "single-level" link analysis tool that finds only crime entities directly associated with a given entity): (a) use COPLINK Detect to find the strongest association paths among three given person names, (b) use the concept space approach provided by CrimeLink Explorer to find the strongest association paths among three given persons, and (c) use the heuristic approach provided by CrimeLink Explorer to find the strongest association paths among three given persons. The name sets used in the tasks were different but equally difficult. We summarize the results as follows:

Subjects could conduct link analysis more efficiently using CrimeLink Explorer than using COPLINK Detect. Because COPLINK Detect did not facilitate the search for association paths between crime entities that were indirectly connected, subjects had to expand links manually to find possible criminal associations. CrimeLink Explorer, in contrast, provided the functionality of searching for the strongest association paths between crime entities across multiple levels. Most subjects were able to find direct associations of the three given names using COPLINK Detect, but could not keep track of all the associations that could be generated as they traversed to the second and third levels of the search. They said it would take them hours or possibly more than a day to find the paths between the names. However, all subjects could quickly find association paths for tasks (b) and (c) using CrimeLink Explorer. This
result showed that the automated path search functionality, based on the shortest-path algorithm, significantly increased the efficiency of link analysis.

Subjects believed that association paths found using the heuristic approach were more accurate than those found using the concept space approach. This was because the heuristics captured the domain knowledge crime investigators rely on to determine the importance of associations between crime entities. The heuristic weights included not only co-occurrence information but also person roles in different types of crimes, shared phones, and shared addresses. As some subjects commented, "That makes more sense, since it takes into account the kind of case."

Subjects were also asked to indicate how useful the system was as an investigative tool. All subjects gave positive feedback and expressed enthusiasm about the tool. Several subjects asked when they would be able to use the system for their daily work.

The results of the user study were quite encouraging. The automated link analysis approaches we proposed in this research can greatly reduce crime investigators' time and effort when conducting link analysis. Moreover, the domain knowledge incorporated in the system reflects human judgment about the strength of associations between criminals more accurately.
5 Conclusions and Future Work

Link analysis has faced challenges such as information overload, search complexity, and the reliance on domain knowledge. Several techniques were proposed in this paper for automated link analysis, including the concept space approach, the shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining the importance of associations. We implemented the proposed techniques in a system called CrimeLink Explorer. The system evaluation focused on the approaches' efficiency and accuracy, both of which are desirable features of a sophisticated link analysis system. The user study results demonstrated the potential of our approaches to achieve these features using domain-specific heuristics.

Rather than using estimated heuristic weights, we plan in the future to apply statistical analysis to NIBRS (National Incident-Based Reporting System) data [9], which captures specific information about the nature of associations between individuals involved in an incident, to validate the weights in the heuristic table. The heuristics can also be extended to include common vehicle and common organization associations. We also plan to encode expert knowledge in Bayesian networks and incrementally learn new knowledge from crime data. Variables in such a Bayesian network may specify whether two persons were family members, were good friends, or went to the same school. Other variables may represent the likelihood of these pieces of information being important to uncovering investigative leads. Links between these variables can indicate the dependency relationships.
Acknowledgement. This project has primarily been funded by the National Science Foundation (NSF), Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July, 2000-June, 2003 and the NSF Knowledge Discovery and Dissemination (KDD) Initiative. We appreciate the critical and important comments, suggestions, and assistance from Detective Tim Petersen and other personnel from the Tucson Police Department.
References

1. Ali, M., Kamoun, F.: Neural networks for shortest path computation and routing in computer networks. IEEE Transactions on Neural Networks, Vol. 4, No. 5. (1993) 941–953.
2. Anderson, T., Arbetter, L., Benawides, A., Longmore-Etheridge, A.: Security works. Security Management, Vol. 38, No. 17. (1994) 17–20.
3. Blair, D. C., Maron, M. E.: An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, Vol. 28, No. 3. (1985) 289–299.
4. Chen, H., Lynch, K. J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5. (1992) 885–902.
5. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., Schroeder, J.: COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, Vol. 46, No. 1. (2003) 28–34.
6. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning, Vol. 9. (1992) 309–347.
7. Dijkstra, E.: A note on two problems in connection with graphs. Numerische Mathematik, Vol. 1. (1959) 269–271.
8. Evans, J., Minieka, E.: Optimization Algorithms for Networks and Graphs, 2nd edn. Marcel Dekker, New York (1992).
9. Federal Bureau of Investigation: Uniform Crime Reporting Handbook: National Incident-Based Reporting System (NIBRS). Edition NCJ 152368. U.S. Department of Justice, Federal Bureau of Investigation (1992).
10. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. In: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB00) (2000).
11. Fox, M. S., Smith, S. F.: ISIS: A knowledge-based system for factory scheduling. Expert Systems, Vol. 1, No. 1. (1984).
12. Goldberg, H. G., Senator, T. E.: Restructuring databases for knowledge discovery by consolidation and link formation. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
13. Goldberg, H. G., Wong, R. W. H.: Restructuring transactional data for link analysis in the FinCEN AI system. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
14. Goyal, S. K., et al.: COMPASS: An expert system for telephone switch maintenance. Expert Systems, July 1985.
15. Grady, N. W., Tufano, D. R., Flanery, R. E., Jr.: Immersive visualization for link analysis. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
16. Harper, W. R., Harris, D. H.: The application of link analysis to police intelligence. Human Factors, Vol. 17, No. 2. (1975) 157–164.
17. Hauck, R., Atabakhsh, H., Onguasith, P., Gupta, H., Chen, H.: Using Coplink to analyze criminal-justice data. IEEE Computer, Vol. 35. (2002) 30–37.
18. Heckerman, D.: A tutorial on learning with Bayesian networks. Microsoft Research Report, MSR-TR-95-06 (1995).
19. Horn, R. D., Birdwell, J. D., Leedy, L. W.: Link discovery tool. In: Proceedings of the Counterdrug Technology Assessment Center's ONDCP/CTAC International Symposium, Chicago, IL (1997).
20. Lee, R.: Automatic information extraction from documents: A tool for intelligence and law enforcement analysts. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
21. McAndrew, D.: The structural analysis of criminal networks. In: Canter, D., Alison, L. (eds.), The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III, Aldershot, Dartmouth (1999).
22. National Archive of Criminal Justice Data: Uniform Crime Reporting Program Data [United States] Series. http://www.icpsr.umich.edu:8080/NACJD-SERIES/00057.xml
23. Shortliffe, E. H.: Computer-Based Medical Consultations: MYCIN. Elsevier, North-Holland (1976).
24. Sparrow, M. K.: The application of network analysis to criminal intelligence: an assessment of the prospects. Social Networks, Vol. 13. (1991) 251–274.
25. Sarkar, S., Sriram, R. S.: Bayesian models for early warning of bank failures. Management Science, Vol. 47, No. 11. (2001) 1457–1475.
26. Turban, E.: Review of expert systems technology. IEEE Transactions on Engineering Management, Vol. 35, No. 2. (1988) 71–81.
27. Xu, J., Chen, H.: Using shortest-path algorithms to identify criminal associations. In: Proceedings of the National Conference for Digital Government Research (dg.o 2002), Los Angeles, CA (2002).
A Spatio Temporal Visualizer for Law Enforcement

Ty Buetow¹, Luis Chaboya¹, Christopher O'Toole¹, Tom Cushna¹, Damien Daspit¹, Tim Petersen², Homa Atabakhsh¹, and Hsinchun Chen¹

¹ University of Arizona, MIS Department, AI Lab
{tbuetow, chaboyal, otoolec}@cs.arizona.edu
{tcushna, damien, homa, hchen}@bpa.arizona.edu
² Tucson Police Department, Tucson, Arizona 85701
[email protected]
Abstract. Analysis of crime data has long been a labor-intensive effort. Crime analysts are required to query numerous databases and sort through results manually. To alleviate this, we have integrated three different visualization techniques into one application called the Spatio Temporal Visualizer (STV). STV includes three views: a timeline; a periodic display; and a Geographic Information System (GIS). This allows for the dynamic exploration of criminal data and provides a visualization tool for our ongoing COPLINK project. This paper describes STV, its various components, and some of the lessons learned through interviews with target users at the Tucson Police Department.
1 Introduction

Information visualization techniques have proven useful for presenting large amounts of data. In the law enforcement domain specifically, visualization techniques can be very helpful for tasks such as crime investigation, as well as for presenting findings to supervisors and even in court. Law enforcement agencies currently use a combination of technological and manual techniques for crime analysis. However, these methods are very time consuming. We have developed the Spatio Temporal Visualizer (STV) to assist crime analysts in their search for information and in presenting their results.

To visualize the data needed by crime analysts, we use three types of visualization techniques: a periodic view, a timeline view, and a GIS view. Each technique has its own strength: periodic visualization displays patterns with respect to time; timeline visualization displays characteristics of temporal data in a linear manner; and GIS visualization displays information on a map and allows for spatial analysis of the data. We combine these techniques into one tool so that the same data can be examined from three different views simultaneously.

In this paper we present the motivation behind STV, followed by a literature review of relevant visualization techniques. Next, we demonstrate how STV provides dynamic access to data and presents three different views. We illustrate STV's functionality with an example of how it would be used by a crime analyst or police officer. Finally, we discuss some of the lessons learned after interviews and discussions with
potential users from the Tucson Police Department (TPD), and conclude with some future directions.
2 Background and Motivation

Historically, law enforcement agencies have attempted to maintain records of criminal events to solve crimes, aid prosecution, document responses, detect serial crimes, and identify trends. Solving a crime often depends on identifying characteristics of the incident and then matching those characteristics to a known criminal or to suspects whose past actions, motives, or opportunities most closely correlate to the incident at hand. This matching process can occur in an individual officer's memory or in a multimillion-record database. The efficiency of the individual officer is adversely affected when the amount of information exceeds his memory capacity or his ability to process that information. In addition, the usefulness of a large database is dependent on the ability to display appropriate and adequate information in a manner that can be efficiently utilized by an investigator.

In the past, crime analysts have dealt with some of these issues through the use of pin maps, graphs, timelines, and summarizations. All of these tend to be somewhat subjective and dependent upon the ability and understanding of the analyst doing the preparation and the quality of the data being analyzed.

For a better understanding of the problem, imagine a situation in which an analyst is tasked with enlightening a group of police managers on the state of burglaries in a city. He would first need to decide how to approach this task, whether by comparing the number of incidents over several years, from year to year, or from month to month. He would also need to decide whether to analyze the occurrence of these crimes between areas of the city, or by time of day, day of week, type of victim, or any other factors or combination of factors. He would extract data for the period (or periods) he considers appropriate and then, through the use of various tools, construct graphs, charts, and maps to depict the information in the manner he chooses. An undertaking of this nature often takes several days for an experienced analyst to complete.

The problem is that the information the analyst chooses to survey is quite dependent upon the training and experience of the analyst, or perhaps the input of his immediate supervisor. Considering the time and effort needed to compile the project, if the group of police managers had concerns or questions different from those which the analyst chose to address, a second or third separate project would be required. The current state of the art in crime analysis is hindered by limited objectivity and a lack of tools that allow for dynamic review. STV aims to remedy this analysis deficiency by providing an easy, dynamic workspace.
3 Literature Review

Research in the areas of the three views implemented in STV has been done extensively and has been applied to various application domains. In the area of crime analysis, GIS software allowing users to view crimes on a map is quite common. There are few tools that allow users to view law enforcement related data in a
temporal context or in a periodic pattern. Analysis of these techniques shows that they are largely segregated and miss the synergy that is created when multiple views of the same data can be seen simultaneously. To the best of our knowledge, there are currently few tools that harness the power to examine a single data set from multiple perspectives. As will be described in Section 4, STV incorporates three different views of the same data set into one tool.

3.1 Periodic Data Visualization Tools

Common methods for viewing periodic data include sequence charts, point charts, bar charts, line graphs, and spiral graphs, which can all be displayed in 2D or 3D [7, 16]. We use the spiral graph method in STV due to its ability to visualize periodic patterns better than the other methods. The Spiral Graph [17] developed at the Technical University of Darmstadt is an excellent example in which the spiral method of visualization was used. Using the Spiral Graph, different kinds of periodic information can be visualized [17]. The main method of mapping data to the Spiral Graph relies upon the thickness of lines to represent the amount of data and different colors to represent different types of data. The University of Minnesota has also developed different implementations of the spiral method of visualization using the spiral of Archimedes [1]. These provide good examples of how data can be mapped in different ways: for instance, both 2D and 3D spiral graphs in which the thickness of dots along the spokes of a spiral represents the amount of data [3]. The advantage of a 3D representation is that several data sets can be shown simultaneously. However, a 3D representation can become confusing and make it difficult to see a developing pattern.

The spiral method that STV most closely resembles is the ReCAP implementation known as the Time Chart [2]. The disadvantage of the Time Chart is that it only plots data in monthly, 24-hour, or 7-day time periods. Therefore, the user does not have the ability to see yearly patterns. In addition, using the Time Chart the user is unable to see how many incidents took place in a certain time period. As will be discussed in Section 4.2, STV's periodic pattern view overcomes these shortcomings.

3.2 Timeline Tools

A timeline is a linear or graphical representation of a sequence of events. In general, timelines are a temporal ordering of a subject of interest. Events, entities, or topics of interest are displayed along an axis. Many projects have explored visualization through timeline techniques, and the desire to visualize time relationships and patterns in data has been an ongoing area of research. One issue addressed in visualization is the desire to see the big picture and to be able to drill down to examine events in detail. In this regard, Snap [6] attempts to increase the total amount of data that can be displayed by placing a large number of entities into a single "aggregate". This new collection can then be displayed for summary information or drilled down into for closer inspection.

Lifelines [14] displays legal or medical data to professionals in those fields. Here, the goal is the visualization of a patient or case history, allowing users access to data
from one screen. In addition, this project aims to enhance anomaly and trend spotting and to streamline access to data. In an attempt to give timelines more relational querying power, Hibino [8] developed Multimedia Visual Information Seeking, which allows users to interactively select two subsets of events and dynamically query for temporal relationships. In short, this allows a user to ask, "How often is event type A followed by event type B?" Others, such as Kullberg [10], attempted to reinvent the 2D timeline in three dimensions. Holly [9] has proposed timelines to view program hotspots during execution. In a more general approach, Kumar [11] developed the ITER model as the basis for developing timeline applications. All of these applications offer different temporal views of their respective data sets. In addition, with the proliferation of the Internet, many forms of informal timelines have appeared, many of which communicate personal histories and the like. Several private companies also offer timeline tools for analysis in various professional fields. Although there are many existing timeline tools, to the best of our knowledge very few incorporate a timeline view simultaneously with other views of the same data set, as we have implemented in STV.

3.3 Crime Mapping Tools

The use of Geographic Information Systems (GIS) in law enforcement applications is becoming increasingly important in supporting crime analysts' capabilities. This field is split between two main areas: finding better ways to display the data available and finding better ways to mine the data to help crime analysts save time. One tool might be used to mine data and another tool to display the information gleaned from the data mining. The crime analyst would still have to manually run the data mining program and then manually move the data into GIS software for display. This process can be painfully slow.

One tool that combines these two areas is the Regional Crime Analysis Program (ReCAP) developed by Dr. Brown at the University of Virginia [2]. Brown observed that current systems have three main shortcomings: they did not allow the user to run a spatial query to obtain the data set in which the user is interested, they did not automate the process of analysis through data mining, and they required users to be proficient in GIS and mapping technologies. ReCAP was developed to address these three shortcomings.

A tool that deals mainly with data mining for GIS is the CrimeStat Spatial Statistics Program developed by Ned Levine & Associates [12]. This tool has an impressive number of data mining options available, including spatial distribution analysis, distance analysis, hot spot analysis, interpolation (kernel density estimation), and space-time analysis (Knox and Mantel) tools. The user must manually import the data into these tools, and the analyzed data can then be saved for later use. There are many examples of how other organizations have created tools to display the data they have mined using CrimeStat [12].

Two commercial tools that are popular in law enforcement for viewing crime data on a map are ArcView, developed by ESRI, and MapInfo, developed by the MapInfo Corporation [5, 13]. These tools allow the user to import data from various file types and even perform sophisticated database operations on the imported data.
Their popularity has the advantage that many people in the industry are already familiar with them.
4 Features of STV

STV is a data visualization tool built on top of our ongoing COPLINK project [4]. COPLINK provides one-stop data access and search capabilities through an easy-to-use interface for local law enforcement agencies such as the Tucson Police Department (TPD). STV is intended to take COPLINK one step further by providing an interactive environment where analysts can load, save, and print police data in a dynamic fashion for exploration and dissemination. For instance, an analyst can search all robberies that have taken place over the past two years and visualize them. In addition, the analyst may wish to visualize all drug arrests simultaneously with the robberies and see if there is any correlation between the two.

4.1 Technologies Used

STV is built as a Java applet in a modular fashion. This was done with the intent that other types of views could be added in the future with relatively little work by taking advantage of object-oriented inheritance. One key advantage of an applet is that no software needs to be installed or maintained on analysts' machines. Queries are performed using applet-to-servlet communication to connect to an Oracle database. Results are stored by a controller class and accessed by each STV view. On the back end, JDBC is used to connect to the COPLINK database. One addition specifically required by the STV project was an area to save user preferences and past queries specific to each of the views. Although this information is saved in the same database, it is independent of the COPLINK schema. This addition saves police officers valuable time by storing the search information they gather in the application's database.

4.2 Components

STV overcomes some of the disadvantages of existing crime visualization tools by providing three perspectives on the same data. The details of each view are described in the following sections. In addition, two screenshots of STV, in Figures 1 and 2, illustrate its functionality by displaying an example of bank robbery data from 1996-2002, described in Section 5.

Control Panel. The control panel (figure 1.c) maintains central control over temporal aspects of the data.
• The time-slider controls the range of time viewed. Thus, the data may span six years, but the time-slider may be narrowed to focus on one year or one month. This time window into the data may then be moved like a typical slider to incorporate
new data points and exclude others. This slider was inspired by Lifelines [14] and by Richter [15].
• Granularity, referring to the unit of time, is controlled through a drop-down menu. Currently, years, months, weeks, and days are implemented. Changing this option has the effect of re-labeling the timeline and altering the periodic patterns being examined.
• The overall time bounds are controlled through a series of drop-down menus. Thus, while all data points may lie in a particular time span, a user can narrow the focus to a subset of the data based on time bounds.

Periodic View. The main purpose of the periodic view (figure 1.d) is to give the crime analyst a quick and easy way to search for crime patterns.
• The circle represents time in the granularity the user chooses. For instance, it may represent a year, month, week, or day.
• Within the circle there are sectors which divide it into different time periods within the granularity selected. The analyst also has the ability to change the granularity of the sectors. For example, the circle could be set to year granularity and the sectors could be set to represent months, weeks, or even days. The advantage of this is that the analyst may see different patterns developing over the different time periods.
• Sectors are labeled to indicate their specific time interval.
• Data is represented by spikes within each time period.
• Rings with labels inside the circle represent the quantity of data.
• Using the box plot method, a crime analyst can easily determine whether any spikes are outliers.

Timeline View. The timeline view (figure 1.a) is a 2D timeline with a hierarchical display of the data in the form of a tree.
• A specific time instant may be highlighted. When combined with the current granularity, all points in that time period are highlighted. For example, if the granularity is month and a point in June 1999 is selected, all data in June 1999 are highlighted.
• The tree view and timeline view of the data are coordinated such that expanding a node in the tree expands the data points viewed on the timeline. At the same time, data under a particular node in the tree is summarized in the timeline at that node's corresponding y-coordinate.
• The time-slider controls the current timeframe viewed, allowing the user to slide across the timeline at various levels of detail.
• The tree view allows the user to see the data in a traditional, organized way.
GIS View. The GIS view (figure 1.b) displays a map of the city of Tucson on which incidents can be represented as points of a specific color.
• The user can zoom in and out of the map. Zooming in allows for more streets to be displayed.
• Incidents may be selected by dragging a box around points on the map. This narrows the information displayed by all views, focusing on the selected incidents.
• The user can move backward and forward in the zoom history, similar to an Internet browser.
• The GIS view emphasizes data points within the time period specified by the time-slider. Data points outside this period are faded.
• Data points highlighted in the timeline view are highlighted in the GIS view.
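The coordination just described, in which a controller class holds the query results and a time window shared by the timeline, periodic, and GIS views, can be pictured with a minimal Java sketch. The class and method names (StvController, StvView, setTimeWindow) are illustrative assumptions, not the actual STV source code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical incident record returned by a query (field names are assumptions).
class Incident {
    final long timestamp;   // time of the incident, in milliseconds
    final double x, y;      // geographic coordinates for the GIS view
    final String type;      // e.g., "ROBBERY"
    Incident(long timestamp, double x, double y, String type) {
        this.timestamp = timestamp; this.x = x; this.y = y; this.type = type;
    }
}

// Each STV view (timeline, periodic, GIS) would implement this interface.
interface StvView {
    void refresh(List<Incident> visible);
}

// Controller that stores query results and pushes the current time window
// to every registered view, mimicking the time-slider behavior.
class StvController {
    private final List<Incident> results = new ArrayList<>();
    private final List<StvView> views = new ArrayList<>();
    private long windowStart, windowEnd;

    void register(StvView view) { views.add(view); }

    void setResults(List<Incident> queryResults) {
        results.clear();
        results.addAll(queryResults);
        notifyViews();
    }

    // Called when the user drags the time-slider in the control panel.
    void setTimeWindow(long start, long end) {
        this.windowStart = start;
        this.windowEnd = end;
        notifyViews();
    }

    private void notifyViews() {
        List<Incident> visible = new ArrayList<>();
        for (Incident i : results) {
            if (i.timestamp >= windowStart && i.timestamp <= windowEnd) {
                visible.add(i);
            }
        }
        for (StvView v : views) {
            v.refresh(visible);
        }
    }
}
```

In this sketch, dragging the time-slider simply re-filters the stored results and asks each registered view to redraw, which is one straightforward way to keep the three perspectives synchronized.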
Fig. 1. STV. In this case, bank robberies for the last six years are displayed in the timeline, GIS and periodic views. From here, users may narrow focus through granularities and time bounds as well as geographic parameters
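Before turning to the example, the applet-to-servlet query path described in Section 4.1 can be sketched roughly as follows. The servlet name, table and column names, and connection details below are assumptions made for illustration; they do not reproduce the COPLINK schema or the deployed code.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that the STV applet calls to run an incident query.
public class StvQueryServlet extends HttpServlet {

    private static final String DB_URL = "jdbc:oracle:thin:@dbhost:1521:coplink"; // assumed
    private static final String DB_USER = "stv_user";      // assumed
    private static final String DB_PASS = "stv_password";  // assumed

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String crimeType = req.getParameter("crimeType"); // e.g., "ROBBERY"
        try (Connection conn = DriverManager.getConnection(DB_URL, DB_USER, DB_PASS);
             PreparedStatement ps = conn.prepareStatement(
                 // "incident" is a placeholder for whatever table is actually queried.
                 "SELECT incident_id, incident_time, latitude, longitude "
                 + "FROM incident WHERE crime_type = ?")) {
            ps.setString(1, crimeType);
            resp.setContentType("text/plain");
            PrintWriter out = resp.getWriter();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // One line per incident; the applet parses these into objects.
                    out.println(rs.getLong(1) + "," + rs.getTimestamp(2)
                            + "," + rs.getDouble(3) + "," + rs.getDouble(4));
                }
            }
        } catch (SQLException e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.getMessage());
        }
    }
}
```

The applet would open a URL connection to such a servlet and parse the returned rows; saved searches and user preferences would be written to separate tables that sit alongside, but do not modify, the COPLINK schema.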
5 A Crime Analysis Example
To illustrate STV functionality, we explore a hypothetical scenario in which a police officer has been assigned the task of examining bank robbery data. The officer begins by logging into COPLINK as described in figures 1 and 2. He performs a search for bank robberies in Tucson and selects the results he is interested in. STV starts by visualizing the 280 bank robberies selected. The officer looks for trends using the three views. Upon expanding the periodic view, he notices that October through December are the peak months for bank robberies in Tucson. Deciding to compare this trend with the previous year, he narrows the data being viewed by entering September 1, 2001 as a start date and December 31, 2001 as an end date (figure 3).
Fig. 2. Functionality. Views may be moved to provide better focus or because of user preference. Here, the GIS view is centered and a geographic query is performed. The data set is narrowed to those incidents selected by the user, with corresponding updates in the other tools. In the timeline view, points within the geo-search are emphasized, while other points are faded. The periodic view displays summary data on the selected points, indicating that June, April, November, and December have a higher incidence of bank robberies. The control panel allows focus on a specific period of time within the global time frame selected. Granularity (viewing in terms of days, weeks, months, years) and global time bounds may also be altered
At this point, the data has been narrowed to 31 bank robberies. By looking at the timeline view, the officer sees three gaps in bank robbery occurrences (figure 4). He notices that at the beginning of September and October, no bank robberies occurred. More striking is the fact that after approximately Thanksgiving, only two robberies occurred. The officer decides to examine geographic aspects of the data to see if further trends are apparent (figure 5). He notices a cluster of robberies on the Northwest side of town. Zooming in, he sees that the area north of Broadway Avenue is where the vast majority of bank robberies occurred during the selected time interval, with some locations being robbed multiple times in four months. Additionally, an area around the intersection of Euclid Avenue and Grant Road appears to be the center of a concentration of activity. The officer selects points on the Northwest side of town by dragging a box around them to see if other trends become apparent. He then moves the periodic view to the center, bringing several trends to light. None of the 17 robberies occurring in this geographic region during the four-month period took place within the first week of a month, while the third week of the month was the most frequently hit. In addition, the periodic tool reveals that more robberies occur on Fridays than on other days of the week (figure 6).
Returning to the timeline view, he notices that several robberies have occurred on the same day. The officer highlights November 15. This automatically highlights the robberies on the geographic view as well (figure 7). In addition, this helps the officer realize that two days earlier, two other banks were robbed in this same area. For a police officer or crime analyst, many questions arise. Why the sudden disappearance of robberies after Thanksgiving? Why was the first week of each month devoid of robberies? Why were so many banks hit in the same area at the same time? A crime analyst could use STV for further queries, for example concerning arrests that occurred immediately after these robberies. Although further queries and exploration may be necessary, points of interest were discovered. It may now be advisable to increase patrols in the areas where increased incidents of bank robbery occurred, particularly during the time periods that became apparent. By cutting and slicing the data and zooming in and out, several trends were revealed in less than 20 minutes of data manipulation.
Fig. 3. The periodic view displaying bank robberies for each month from 1996-2002. The period from October to December has more events than other months
6 Lessons Learned
Although the STV tool has not yet been deployed at TPD, we have received feedback on the tool from ten TPD crime analysts and a seasoned detective. Assessment by these sources is important because detectives and crime analysts will be the primary users of STV. Comments made by the detective and analysts throughout the initial development are summarized below.
Fig. 4. Robberies from September 1, 2001 to December 31, 2001
Fig. 5. Selecting points in the GIS view narrows focus
Fig. 6. The periodic view reveals week-per-month and day-per-week trends
Fig. 7. Highlights in the timeline view appear automatically in the GIS view
6.1 Current Strengths of STV
From our first meeting with analysts, the options to load, save, and print projects were expressed as high priorities. Once implemented, projects no longer needed to be recreated each time a user logged onto COPLINK. Similarly, the ability to produce a hard copy of information is often very desirable. These functions enable users to incorporate STV into their analysis more easily.
Potential users of STV at the TPD have indicated that the ability to expand and constrict the data being displayed is important in searching for different crime patterns. For instance, an analyst may begin with a large number of incidents being displayed and then narrow them down to relevant incidents, or vice versa. They feel that the STV tool does this quickly and efficiently by means of the control panel and the GIS view.
The STV tool will also allow police managers, with the help of analysts, to discuss ongoing problems and trends. For example, TPD has a meeting known as the Targeted Operational Planning Meeting (TOP) in which Police Chiefs and other managers analyze problems and address them. Having STV available during this brainstorming session would allow these TPD officials to view additional crime trends that may not have been considered. The analysts indicated this as an important strength because the Police Chiefs and managers often want to see different aspects of crime trends "on the fly".
A final strength that cannot be overestimated is STV's ability to abstract away the tedious details of database searches and displays. Computers are excellent at these types of processes. By shifting an analyst's focus from a low level of computer interaction to the much higher level of patterns, causes, and effects of crime, STV increases the efficiency of analysis.
6.2 Areas of Improvement for STV
While most of the feedback we received from TPD was favorable, users have indicated certain areas of potential improvement for STV. The biggest concern is the limited customization that the tool currently supports. For instance, crime analysts may wish to attach a note to an incident that is being visualized. They may also wish to add events to the data set that are not present in the databases.
A second area of concern is the customization of colors and shapes. For example, officers may want all robberies displayed as a green triangle and all homicides displayed as a red circle. The size of data points was also expressed as a concern.
A problem common to virtually all visualization techniques is that of labeling. Analysts recommended a variety of labels for data points, from standard text labels to balloon labels that appear on mouse hover. The size and content of labels were also of interest.
Crime analysts have also expressed interest in having STV communicate with COPLINK Connect/Detect [4], which has already been deployed at TPD. For instance, if a group of incidents such as robberies is visualized, an analyst may wish to select a particular incident and see the corresponding information from COPLINK Connect/Detect displayed.
Finally, STV lacks automatic analysis functionality. This means that users cannot click a button and have an algorithm applied to their data set to solve a problem. Features such as hot spot algorithms, which determine clusters of activity, or algorithms that detect anomalies in data sets are currently not present but would be desirable.
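As an illustration of the kind of hot spot feature requested, the sketch below counts incidents in a coarse geographic grid and flags cells whose counts exceed a threshold. This is one simple heuristic assumed for illustration only; it is not a feature of STV or COPLINK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simple grid-density hot-spot detector over (x, y) incident coordinates.
class HotSpotFinder {
    // Returns the grid cells containing more than `threshold` incidents.
    static List<String> findHotSpots(List<double[]> points, double cellSize, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            long cx = (long) Math.floor(p[0] / cellSize);
            long cy = (long) Math.floor(p[1] / cellSize);
            counts.merge(cx + "," + cy, 1, Integer::sum);
        }
        List<String> hotCells = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                hotCells.add(e.getKey());
            }
        }
        return hotCells;
    }
}
```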
7 Conclusions and Future Directions
The STV tool is scheduled to begin user studies at the TPD in March 2003. The plan is to have crime analysts use the STV tool in their daily activities in order to discover other strengths and areas for improvement. The experiences of crime analysts will provide valuable insights into future directions for the STV project. The ability provided in STV to synchronize three different views for visualizing crime-related data gives law enforcement an advantage in crime analysis. This, combined with dynamic access to data and STV's user-friendly interface, presents advantages over traditional methods. As the veteran detective said, "This application has the potential to revolutionize the manner in which we examine crime trends and pursue criminals."
Acknowledgements. This project has primarily been funded by the following grants:
• NSF, Digital Government Program, "COPLINK Center: Information and Knowledge Management for Law Enforcement," #9983304, July 2000-June 2003.
• National Institute of Justice, "COPLINK: Database Integration and Access for A Law Enforcement Intranet," #97-LB-VX-K023, July 1997-Jan. 2000.
• National Institute of Justice, "Distributed COPLINK Database and Concept Space Development," #0308-01, Jan. 2001-Dec. 2001.
• NSF, Information Technology Research, "Developing A Collaborative Information and Knowledge Management Infrastructure," NSF/IIS #0114011, Sept. 2001-Aug. 2004.
We would like to thank the following people for their support and assistance during the entire project development and evaluation process:
• All members of the University of Arizona Artificial Intelligence Lab and COPLINK staff,
• Lt. Jenny Schroeder, Dan Casey, and other contributing personnel from the Tucson Police Department,
• The Phoenix Police Department.
References
1. Archimedean spiral, http://www.2dcurves.com/spiral/spirala.html
2. Brown, D.E. (1998). "The Regional Crime Analysis Program (RECAP): A Framework for Mining Data to Catch Criminals," Proceedings of the 1998 IEEE International Conference on Systems, Man, and Cybernetics (San Diego, CA, USA, Oct. 11-14), IEEE, Piscataway, NJ, pp 2848-2853.
3. Carlis, J. (1998). "Interactive Visualization of Serial Periodic Data," Proceedings of User Interface Software and Technology.
4. Chen, H., D. Zeng, H. Atabakhsh, W. Wyzga & J. Schroeder (2003). "COPLINK: Managing Law Enforcement Data and Knowledge," Communications of the ACM, pp 28-34.
5. Environmental Systems Research Institute (ESRI), http://www.esri.com
6. Fredrikson, A., C. North, C. Plaisant & B. Shneiderman (1999). "Temporal, Geographical and Categorical Aggregations Viewed Through Coordinated Displays: A Case Study with Highway Incident Data," Human-Computer Interaction Laboratory Technical Report No. 99-31, December 1999, NPIVM, pp 26-34.
7. Harris, R. (1996). "Information Graphics – A Comprehensive Illustrated Reference," Management Graphics.
8. Hibino, S. & E.A. Rundensteiner (1998). "Comparing MMVIS to a Timeline for Temporal Trend Analysis of Video Data," Proceedings of Advanced Visual Interfaces.
9. Holly, M. (2001). "Temporal and Spatial Program Hot Spot Visualization," Technical Report SOCS-01.6.
10. Kullberg, R.L. (1996). "Dynamic Timelines: Visualizing Historical Information in Three Dimensions," Proceedings of CHI '96, pp 386-387.
11. Kumar, V. & R. Furuta (1998). "Metadata Visualization for Digital Libraries: Interactive Timeline Editing and Review," Proceedings of the Third ACM Conference on Digital Libraries, pp 126-133.
12. Levine, N. (2000). "CrimeStat: A Spatial Statistics Program for the Analysis of Crime Incident Locations (v 1.1)," http://www.icpsr.umich.edu/NACJD/crimestat.html
13. MapInfo, http://www.mapinfo.com
14. Plaisant, C., B. Milash, A. Rose, S. Widoff & B. Shneiderman (1996). "Lifelines: Visualizing Personal Histories," ACM CHI '96 Conference Proceedings, pp 221-227.
15. Richter, H., J. Brotherton, G.D. Abowd & K. Truong (1999). "A Multi-Scale Timeline Slider for Stream Visualization and Control," GVU Technical Report GIT-GVU-99-30.
16. Tufte, E. (1983). "The Visual Display of Quantitative Information," Graphics Press.
17. Weber, M., M. Alexa & W. Müller (2000). "Visualizing Time-Series on Spirals," Technical University of Darmstadt.
Tracking Hidden Groups Using Communications
Sudarshan S. Chawathe
Computer Science Department, University of Maryland, College Park, Maryland 20742, USA
[email protected]
Abstract. We address the problem of tracking a group of agents based on their communications over a network when the network devices used for communication (e.g., phones for telephony, IP addresses for the Internet) change continually. We present a system design and describe our work on its key modules. Our methods are based on detecting frequent patterns in graphs and on visual exploration of large amounts of raw and processed data using a zooming interface.
1 Introduction
Suppose a group of suspicious agents (henceforth, suspects) has been identified based on some a priori knowledge. Instead of taking immediate action to stop the suspicious activities, it is often prudent to carefully monitor the suspects and their communications in order to maximize the detection of suspects (expand the group) and uncover the nexus of activity (locate the key or controlling agents). Unfortunately, the suspects typically do not communicate using easily identifiable sources. For example, a ring of car thieves may continually change phone numbers (using prepaid cellular phones, short-term pager numbers, etc.). Similarly, globally dispersed agents planning a distributed denial-of-service attack on the cyber-infrastructure typically do not use the same IP address for very long. Such behavior makes it very difficult to accurately and efficiently track groups of suspects over extended periods of time. In this paper, we describe a strategy to solve this problem by using a combination of automated and human-directed techniques. We begin by describing the problem more precisely.
Problem Development. We will use the term agents to denote real-world entities (typically, humans) that we are interested in monitoring. However, these agents are not directly observable and their real-world identities are, in general, unknown. That is, we do not have any method to directly track the actions of the agents. Instead, all we can observe is the communications between such agents. The medium used for such communication may be a phone network, the Internet, physical mail, etc. We refer to it as the network in general. We will use the term nodes to denote the devices used to communicate using this network (e.g., phone numbers in a telephone network, IP addresses on the Internet). A key feature of nodes is that they are, by virtue of their connections to the network,
Fig. 1. The tracking problem
easily identifiable and observable. Agents use nodes to communicate on the network. (For example, people use phone numbers to communicate using the phone network, and IP addresses to communicate using the Internet.) A group of communicating suspects is called an s-group. Note that since suspects are, in general, not directly observable, neither are s-groups. At a given point in time, there is a group of nodes (in the communication network) corresponding to the agents in an s-group; we refer to this group of nodes as an n-group. In contrast with s-groups, n-groups are easily observable. For example, the group of phone numbers used by a ring of car thieves in the past few days forms an n-group. Over time, the n-group corresponding to a given s-group changes. For example, the ring of thieves is likely to be using a completely different set of phone numbers two months from now. The problem at hand is then the problem of tracking s-groups by observing only the n-groups. By observing an n-group, we mean tracking the communications between the nodes in the group. In this paper, we assume that the only information we can obtain from the communication network is a timestamped list of inter-node messages. We use the term messages in a general sense. In a phone network, a message is a phone call; on the Internet, a message may be a TCP connection. More precisely, monitoring the network yields a list of tuples of the form (n1, n2, t, A) indicating a message from n1 to n2 at time t. We use A to denote a list of additional attributes, which depend on the particulars of the communication network and the monitoring methods. In a phone network, A includes attributes such as the length of the call. On the Internet, A includes the source and destination ports associated with a TCP connection and other connection parameters. It is convenient to regard this stream of tuples as the edges of a connection multigraph whose nodes represent communication network nodes (e.g., phone numbers) and whose edges represent messages annotated with additional attributes (e.g., phone calls with durations). In most networks, such a list is never-ending and therefore better modeled as a stream of tuples. Another characteristic of the data from network monitoring is that it is typically produced at a very high rate. For example, call records
Fig. 2. System architecture
on a phone network and TCP connection build-ups and break-downs occur at a very high rate. It is important to analyze such stream data using online methods that detect important patterns as early as possible. (For example, detecting that a ring of thieves is about to move to another state or country may prompt immediate action if the detection is timely.) Further, indiscriminately storing such stream data can exhaust even the large amounts of inexpensive storage currently available. Storing the data indiscriminately also makes it more difficult to operate on the data, as less interesting data is likely to slow access to the interesting data. On the other hand, many of the kinds of operations required by this application are not likely to yield to purely online methods. For example, many data mining algorithms require random access to data on disk and cannot be easily modified for the restrictions of stream data. Thus, a practical solution is likely to require both online and offline analysis methods that operate cooperatively. So far, we have not indicated how the results of the automated or semi-automated methods suggested above are presented to the analyst responsible for decisions, nor have we indicated how such analysts may use their knowledge to direct and guide the tracking process. A simple solution here is to process data in batches, and provide input in batches. For example, a detective may analyze the output of the tracking method from yesterday and adjust the input parameters for guiding the method when it is run on today's data. This solution has problems analogous to those encountered by batch-based solutions to the tracking problem. Again, it is desirable to provide methods that permit online viewing of the results of tracking and immediate fine-tuning of the tracking process. Assuming we have at hand streaming methods for tracking s-groups, we need methods for visualizing, searching, and manipulating the streaming and dynamic data generated by these methods.
System Architecture. Figure 2 depicts the high-level architecture of our system for tracking s-groups. The monitoring devices on the network (e.g., instrumented routers on the Internet) produce a stream of tuples, each of which describes a
message between nodes. This stream of tuples is sent to both the online analysis module and the storage module. The storage module is responsible for recording the stream and merging it with the archived data at suitable intervals (say, every 24 hours). The online analysis module uses the stream to trigger detection features based on the archived data and input from the analyst. The offline analysis module is where methods that are not suited to stream processing are implemented. These methods can be classified as data mining or pattern detection methods that require random access to data. The exploration module includes a graphical user interface and, more important, implementations of methods for quickly assimilating vast amounts of data at varying levels of detail. The data includes the stream data processed to varying degrees, the results of the online and offline analysis modules, and an integration with external data sources that are relevant to an analyst's decision making process (e.g., newswire articles, police reports, memos). In Section 2, we describe methods for detecting frequent patterns in the connection graph. These methods form the building blocks of the offline analysis module. Section 3 describes methods for exploring large volumes of graph data using a zooming interface that form the basis of the exploration module. We discuss related work briefly in Section 4 and conclude in Section 5. Due to space constraints, we do not discuss the online analysis module here, and refer the interested reader to [5] for details.
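The tuple stream at the heart of this architecture can be sketched in a few lines of Java. The class names and the consumer interface below are assumptions made for illustration; they are not part of the actual system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// One monitored message: a tuple (n1, n2, t, A) as described in the text.
class MessageTuple {
    final String source;              // n1, e.g., a phone number or IP address
    final String destination;         // n2
    final long timestamp;             // t
    final Map<String, String> attrs;  // A: call length, ports, etc.
    MessageTuple(String source, String destination, long timestamp,
                 Map<String, String> attrs) {
        this.source = source; this.destination = destination;
        this.timestamp = timestamp; this.attrs = attrs;
    }
}

// Consumers corresponding to the online-analysis and storage modules of Fig. 2.
interface TupleConsumer {
    void consume(MessageTuple tuple);
}

// Dispatcher that forwards each incoming tuple to every registered module,
// mirroring the fan-out from the monitoring devices to analysis and storage.
class EdgeStreamDispatcher {
    private final List<TupleConsumer> consumers = new ArrayList<>();

    void register(TupleConsumer consumer) { consumers.add(consumer); }

    void onTuple(MessageTuple tuple) {
        for (TupleConsumer c : consumers) {
            c.consume(tuple);
        }
    }
}
```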
2 Detecting Frequent Patterns
In this section, we describe our method for detecting hidden groups by analyzing large volumes of historical connection data obtained by network monitoring. This method is part of the offline analysis module of Figure 2. Recall that in this module, we are given a database consisting of a communication graph that forms a historical record of messages between nodes and we wish to detect potential s-groups for further investigation (and to serve as inputs for the online analysis module). The goal is to help an analyst detect s-groups by highlighting patterns in the data. The kinds of patterns of interest to analysts are likely to be varied and complex, and we do not attempt to completely automate the task of detecting them. Instead, our approach is to provide efficient implementations of a few key operations that the analyst may use to investigate the data based on real-world knowledge. In particular, we focus on the efficient implementation of an operation that is not only useful on its own, but also forms the building block for more sophisticated analysis methods (both automated and human-directed). This operation is the detection and enumeration of frequently occurring patterns, which are, informally, patterns of communicating nodes that occur frequently enough to be of potential interest for a detailed data analysis. (Such frequently occurring patterns are to our problem what frequent itemsets are to the problem of mining market basket data [1].) The main idea behind our method, which is called SEuS (Structure Extraction using Summaries), is the following three-phase process: In the first phase
(summarization), we preprocess the given dataset to produce a concise summary. This summary is an abstraction of the underlying graph data. Our summary is similar to data guides and other (approximate) typing mechanisms for semistructured data [12,15,4]. In the second phase (candidate generation), our method interacts with a human analyst to iteratively search for frequent structures and refine the support threshold parameter. Since the search uses only the summary, which typically fits in main memory, it can be performed very rapidly (interactive response times) without any additional disk accesses. Although the results in this phase are approximate (a superset of final results), they are accurate enough to permit uninteresting structures to be conservatively filtered out. When the analyst has filtered potential structures using the approximate results of the search phase, an accurate count of the number of occurrences of each potential structure is produced by the third phase (counting).
Fig. 3. Example input graph
Users are often willing to sacrifice quality for a faster response. For example, during the preliminary exploration of a dataset, one might prefer to get a quick and approximate insight into the data and base further exploration decisions on this insight. In order to address this need, we introduce an approximate version of our method, called L-SEuS. This method returns only the top-n frequent structures rather than all frequent structures. We present only a brief discussion of SEuS below, and refer the reader to [11] for a detailed discussion of both SEuS and L-SEuS.
Summarization. We use a data summary to estimate the support of a structure (i.e., the number of subgraphs in the database that are isomorphic to the structure). The summary is a graph with the following characteristics. For each
Fig. 4. A structure and its three instances
distinct vertex label l in the original graph G, the summary graph X has an l-labeled vertex. For each m-labeled edge (v1, v2) in the original graph there is an m-labeled edge (l1, l2) in X, where l1 and l2 are the labels of v1 and v2, respectively. The summary X also associates a counter with each vertex (and edge) indicating the number of vertices (respectively, edges) in the original graph that it represents. For example, Figure 5 depicts the summary generated for the input graph of Figure 3.
Fig. 5. Summary graph
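A minimal sketch of the summarization phase, and of the support estimate discussed in the next paragraph, is shown below. It assumes the data graph is supplied as labeled vertices and labeled edges; the class and method names are illustrative, not the SEuS implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Summary of a labeled graph: one counter per vertex label and one counter
// per (sourceLabel, edgeLabel, destLabel) triple, as described in the text.
class GraphSummary {
    final Map<String, Integer> vertexCounts = new HashMap<>();
    final Map<String, Integer> edgeCounts = new HashMap<>();

    void addVertex(String label) {
        vertexCounts.merge(label, 1, Integer::sum);
    }

    void addEdge(String sourceLabel, String edgeLabel, String destLabel) {
        String key = sourceLabel + "->" + edgeLabel + "->" + destLabel;
        edgeCounts.merge(key, 1, Integer::sum);
    }

    // Upper-bound estimate of a structure's support: the minimum counter over
    // the structure's vertex labels and edge-label triples (0 if any is absent).
    int estimateSupport(Iterable<String> vertexLabels, Iterable<String> edgeKeys) {
        int min = Integer.MAX_VALUE;
        for (String v : vertexLabels) {
            min = Math.min(min, vertexCounts.getOrDefault(v, 0));
        }
        for (String e : edgeKeys) {
            min = Math.min(min, edgeCounts.getOrDefault(e, 0));
        }
        return min == Integer.MAX_VALUE ? 0 : min;
    }
}
```

Here estimateSupport returns the minimum counter over the labels appearing in the structure, which corresponds to the upper-bound estimate described next.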
We use the summary X to estimate the support of a structure S as follows: By construction, there is at most one subgraph of X (say, S′) that is isomorphic to S. If no such subgraph exists, then the estimated (and actual) support of S is 0. Otherwise, let C be the set of counters on S′ (i.e., C consists of counters
on the nodes and edges of S′). The support of S is estimated by the minimum value in C. Given our construction of the summary, this estimate is an upper bound on the true support of S.
Candidate Generation. The candidate generation phase is a simple search in the space of structures isomorphic to at least one subgraph of the database. We maintain two lists of structures: open and candidate. In the open list we store structures that have not been processed yet (and that will be checked later). The algorithm begins by adding all structures that consist of only one vertex and pass the support threshold test to the open list. The rest of the algorithm is a loop that repeats until there are no more structures to consider (i.e., the open list is empty). In each iteration, we select a structure S from the open list and use it to generate larger structures (called S's children) by calling the expand subroutine, described below. New child structures that have an estimated support greater than the threshold are added to the open list. The qualifying structures are accumulated in the candidate list, which is returned as the output when the algorithm terminates.
Given a structure S, the expand subroutine produces the set of structures generated by adding a single edge to S (termed the children of S). In the following description of the expand(S) subroutine, we use S(v) to denote the set of vertices in S that have the same label as vertex v in the data graph and V(s) to denote the set of data vertices that have the same label as a vertex s in S. For each vertex s in S, we create the set addable(S, s) of edges leaving some vertex in V(s). This set is easily determined from the data summary: it is the set of out-edges for the summary vertex representing s. Each edge e = (s, v, l) in addable(S, s) that is not already in S is a candidate for expanding S. If S(v) (the set of vertices with the same label as e's destination vertex) is empty, we add a new vertex x with the same label as v and a new edge (s, x, l) to S. Otherwise, for each x ∈ S(v), if (s, x, l) is not in S, a new structure is created from S and e by adding the edge (s, x, l) (an edge between vertices already in S). If s does not have an l-labeled edge to any of the vertices in S(v), we also add a new structure obtained from S by adding a vertex x′ with the same label as v and an edge (s, x′, l).
Support Counting. Once the analyst is satisfied with the structures discovered in the candidate generation phase, she may be interested in finalizing the frequent structure list and getting the exact support of the structures. This task is performed in the support counting phase. Let us define the size of a structure to be the number of nodes and edges it contains; we refer to a structure of size k as a k-structure. From the method used for generating candidates (Section 2), it follows that for every k-structure S in the candidate list there exists a structure Sp of size k−1 or k−2 in the candidate list such that Sp is a subgraph of S. We refer to Sp as the parent of S in this context. Clearly, every instance I of S has a subgraph I′ that is an instance of Sp. Further, I′ differs from I only in having one fewer edge and, optionally, one fewer vertex. We use these properties in the support counting process.
Determining the support of a 1-structure (single vertex) consists of simply counting the number of instances of a like-labeled vertex in the database. During the counting phase, we store not only the support of each structure (as it is determined), but also a set of pointers to that structure's instances on disk. To determine the support of a k-structure S for k > 1, we revisit the instances of its parent Sp using the saved pointers. For each such instance I′, we check whether there is a neighboring edge and, optionally, a node that, when added to I′, generates an instance I of S. If so, I is recorded as an instance of S.
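To make the search loop concrete, here is a highly simplified sketch of the candidate generation phase described above. The Structure and SupportEstimator interfaces are placeholders for the corresponding pieces of SEuS; duplicate elimination and the details of edge expansion are omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Placeholder for a candidate structure; expand() would add a single edge in
// all legal ways permitted by the data summary (the "children" of the structure).
interface Structure {
    List<Structure> expand();
}

// Placeholder for the summary-based support estimate (an upper bound).
interface SupportEstimator {
    int estimate(Structure s);
}

class CandidateGenerator {
    // Returns all structures whose estimated support passes the threshold,
    // following the open/candidate list search described in the text.
    static List<Structure> generate(List<Structure> singleVertexStructures,
                                    SupportEstimator estimator, int threshold) {
        Deque<Structure> open = new ArrayDeque<>();
        List<Structure> candidates = new ArrayList<>();

        // Seed with all 1-vertex structures that pass the threshold test.
        for (Structure s : singleVertexStructures) {
            if (estimator.estimate(s) >= threshold) {
                open.push(s);
            }
        }
        // Repeatedly expand structures until no unprocessed ones remain.
        while (!open.isEmpty()) {
            Structure s = open.pop();
            candidates.add(s);
            for (Structure child : s.expand()) {
                if (estimator.estimate(child) >= threshold) {
                    open.push(child);
                }
            }
        }
        return candidates;
    }
}
```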
Fig. 6. A screenshot of the SEuS system
3 Visual Exploration
In this section, we describe methods for implementing the exploration module of Figure 2. Recall that the task of this module is to help the analyst assimilate the output of the automated analysis modules (offline and online) as well as the external data feed (newswire articles, intelligence reports, etc.). The interconnections between data items from different sources are of particular interest. In this module, we model data as a multiscale graph in which nodes represent data items and edges represent the relationships among them. At a high level, this graph aggregates many data items into one node; at the lowest level, each node
Fig. 7. Two kinds of logical zooming
represents a single data item or concept (e.g., a phone number). This representation allows the analyst to work at a level of abstraction best suited to the task at hand. We have implemented methods for exploring such graphical data at varying levels of detail as part of our VQBD system [6], and we describe the key ideas below. Although VQBD is extensible and incorporates many features for the power user, it is designed to be accessible to a casual user. To this end, the basic modes of interacting with the system are very simple. At all times, the VQBD display consists of a single window with a graphical representation of the XML data. Although, as we shall see below, this representation may be the result of some complex operations, the user interface is always the same: there are nodes (boxes) representing data elements (often summarized) and arcs (lines) representing relationships among them. There are no tool-bars, scroll-bars, sliders, or other widgets. We believe this simplicity is key to usability by a casual user. The basic modes of controlling VQBD, described below, are also simple and unchanging. The first three are meant for the casual user, while the next two are for users who have gained more experience with the system.
Panning. The displayed objects can be moved in any direction relative to the canvas by a dragging motion with the left button of the mouse.
Zooming. The display may be zoomed in (or out) by a right- (respectively, left-) dragging motion with the right mouse button. VQBD uses the position of the
pointer to determine the type of zooming. If the pointer is outside all graphical objects, then the result is simple graphical zooming (e.g., larger objects, bigger fonts). If the pointer is inside a graphical object, then the data resolution of that object, and any others of a similar type, is increased. For example, consider the screenshot in Figure 8(b). The lower part represents speech and line objects and includes sample values from the input document. Zooming in with the pointer inside the larger box (representing the collection of line objects) results in the display of a larger number of sample speech objects. Zooming in with the pointer inside one of the smaller boxes representing an individual line object displays that object in more detail (more text). Figure 7 illustrates these two modes of zooming. In the case of other visualization modules (e.g., histograms), zooming results in actions appropriate to that module (e.g., histogram refinement).
Link Navigation. Clicking on a link causes the display to recenter itself around the target of the link at an appropriate zoom level. Following the design method of the Jazz toolkit, such link navigation is not instantaneous; instead it occurs at a speed that allows the viewer to discern the relative positions of the referencing and referenced objects. In addition to selecting an appropriate graphical zoom level, VQBD automatically picks a suitable logical zoom level. For example, a collection of numbers that is too large to display in its entirety is often presented as a histogram.
View Change. While VQBD automatically selects an appropriate method for visualizing data at the available resolution, the user may override this selection using a pop-up menu bound to the middle mouse button. For example, a user interested in the highest values in a collection of numbers may force VQBD to change the view from histogram to sorted list.
Querying. The XML document may be queried using a query-by-example interface. This interface permits users to specify selection conditions as annotations on displayed objects. In addition, the user may mark objects as distinguished objects for use in queries. Intuitively, these objects can be used as the starting points for query-based exploration. VQBD has built-in query modules for regular expressions and XPath. Additional query modules can be easily added using the plug-in interface. More precisely, these objects are logically inserted into a table that can be used in the from clause of OQL-like queries.
Since we do not have access to realistic monitoring data, we illustrate the key features of VQBD using a sample user session based on Jon Bosak's XML rendition of Shakespeare's A Midsummer Night's Dream, available at http://www.ibiblio.org/xml/examples/shakespeare/. The system parses the data and graphically presents a summary of its implicit structure with objects representing the play, acts, scenes, and lines. This structural summary is the default view presented by VQBD. A screenshot appears as Figure 8(a). Note that the screenshots in Figure 8 are based on a rather small VQBD display (approximately 350x350 pixels). While we picked this size primarily to fit the space
Fig. 8. Two screenshots of VQBD in action: (a) zoomed out, structural summary; (b) zoomed in, instances
constraints of this report, it also illustrates how VQBD's zooming interface allows it to function effectively at this size. In this example, the summary is small enough to be displayed in its entirety. However, when the summary is larger (or the screen smaller), the panning and graphical zooming features of VQBD are used to view the summary. Now suppose the analyst zooms in on the speech object using a dragging motion with the right mouse button. Initially, the zooming produces standard graphical effects (larger objects, higher resolution text, etc.). However, as soon as the object becomes large enough to display graphical elements within it, the graphical zooming is accompanied by a logical zooming: a few sample elements are displayed. VQBD displays randomly sampled elements, with the number of displayed elements increasing as the available space increases as a result of the zooming-in operation. Figure 8(b) is a screenshot at this stage of exploration. In addition to details of the speech and line elements, details of scene elements (appearing above the speech elements in Figure 8(b)) are partially visible, providing a useful context. These figures do not convey the colors used by VQBD for indicating many relationships, including grouping elements based on parents (enclosing elements). When a sample element is displayed in this manner, VQBD reads its attributes and sub-elements to pick a short string that distinguishes the element from others with the same tag. This string is displayed within the object representing the element on screen. In our example,
VQBD uses the scene titles to identify scene elements on screen. At this stage, the analyst also has the option of single-clicking on any of the displayed objects, causing VQBD to display all details of the selected object. For example, clicking on the scene object labeled A hall in the castle results in displaying the scene in greater detail (as much as will fit in the VQBD window). Note that this clicking action is simply an accelerated form of zooming; the same result could be achieved by zooming in to the scene object. Subelements of the scene element are displayed as active links that can be activated in order to smoothly transport the display to the referenced object. This link-based navigation can be freely interleaved with zooming. Zooming out at this point results in VQBD retracing its steps, displaying data in progressively less detail until we are back at the original structural summary view. In addition to browsing data in this manner, an analyst may also query data using the VQBD interface. For example, if a scene object is selected as the origin of a search for the string Lysander, VQBD executes the query and highlights objects in the query result. In our sample data, the query string matches elements of different types (two persona elements, one stagedir element, and several speaker and line elements). If the current resolution is insufficient to display individual objects, only the structural summary objects corresponding to the individual objects are highlighted. To view the query results in detail, one may zoom in as before. Unlike the earlier zooming action, which displayed a random sample of all elements corresponding to the summary object, VQBD now displays a sample chosen only from the elements in the query result. When all elements in the query result have been displayed, further zooming results in a random selection from the remaining elements (as before). (Colors are used to distinguish the elements in the query result from the rest of the elements.) This exploration of query results may be interleaved with zooming, panning, query refinement, and other VQBD operations.
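The split between graphical and logical zooming that drives this interaction can be summarized in a short sketch. The types and method names are assumptions made for illustration; VQBD's actual implementation, built on a zooming toolkit, is not reproduced here.

```java
import java.awt.Point;
import java.awt.Rectangle;
import java.util.List;

// One displayed object (a summary box or a sample element) with its screen bounds.
class DisplayObject {
    final Rectangle bounds;
    int sampleCount;          // how many underlying elements are currently shown
    double scale = 1.0;       // graphical scale factor

    DisplayObject(Rectangle bounds) { this.bounds = bounds; }

    boolean contains(Point p) { return bounds.contains(p); }
}

// Decides between graphical and logical zooming based on pointer position,
// mirroring the behavior described for VQBD's right-button drag.
class ZoomController {
    private final List<DisplayObject> objects;

    ZoomController(List<DisplayObject> objects) { this.objects = objects; }

    void zoomIn(Point pointer) {
        DisplayObject target = null;
        for (DisplayObject o : objects) {
            if (o.contains(pointer)) { target = o; break; }
        }
        if (target == null) {
            // Pointer outside all objects: plain graphical zoom.
            for (DisplayObject o : objects) { o.scale *= 1.25; }
        } else {
            // Pointer inside an object: increase that object's data resolution,
            // e.g., show more randomly sampled elements of its type.
            target.sampleCount += 5;
            target.scale *= 1.25;
        }
    }
}
```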
4 Related Work
There is a long history of work on network and graph analysis. However, many of the methods do not scale to the amount of data generated by the network monitoring situations that interest us. For high-volume data, work on Communities of Interest [10,9] is perhaps the closest to our work. A method for managing high-volume call-graph data from a phone network based on daily merging of records is described in [10]. There is work on structure discovery in specific domains; a detailed comparison of several such methods appears in [7]. We are more interested in domain independent methods such as CLIP and Subdue [16,8]. The method of Section 2 differs from these in its use of a summary structure to yield an interactive system with high throughput. A detailed discussion and performance study appears in [11]. AGM [13] is an algorithm for finding frequent structures that uses an algorithm similar to the apriori algorithm for market basket data [2]. The FSG [14] is similar to AGM but uses a sparse graph representation that minimizes storage
and computation costs. The FREQT algorithm is based on the idea of discovering tree structures by attaching nodes only to the rightmost branches of trees [3]. The general idea of using a succinct summary of a graph for various purposes has a large body of work associated with it. For example, this idea is developed in semistructured databases as graph schemas, representative objects, and data guides, which are used for constraint enforcement, query optimization, and query-by-example interfaces [4,15,12].
5 Conclusion
We described and formalized the problem of tracking hidden groups of entities using only their communications, without a priori knowledge of the communication device identifiers (e.g., phone numbers) used by the entities. We discussed the practical constraints on the environment in which this problem must be solved and presented a system architecture that combines offline analysis, online analysis, and interactive exploration of both raw and processed data. We described our work on methods that form the basis of some of the system modules. We have conducted detailed evaluations of these methods by themselves and are now working on assembling and evaluating the system as a whole.
Acknowledgments. Shayan Ghazizadeh helped design and implement the SEuS system. Jihwang Yeo and Thomas Baby implemented parts of the VQBD system. This work was supported by National Science Foundation grants in the CAREER (IIS-9984296) and ITR (IIS-0081860) programs.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. SIGMOD Record, 22(2):207–216, June 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499. Morgan Kaufmann, 1994.
3. Tatsuya Asai, Kenji Abe, Shinji Kawasoe, et al. Efficient substructure discovery from large semi-structured data. In Proc. of the Second SIAM International Conference on Data Mining, 2002.
4. P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of the International Conference on Database Theory, 1997.
5. Sudarshan S. Chawathe. Tracking moving clutches in streaming graphs. Technical Report CS-TR-4376 (UMIACS-TR-2002-56), Computer Science Department, University of Maryland, College Park, Maryland 20742, May 2002.
6. Sudarshan S. Chawathe, Thomas Baby, and Jihwang Yeo. VQBD: Exploring semistructured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001. Demonstration Description.
7. D. Conklin. Structured concept discovery: Theory and methods. Technical Report 94-366, Queen's University, 1994.
8. D. J. Cook and L. B. Holder. Graph-based data mining. ISTA: Intelligent Systems & their Applications, 15, 2000.
9. Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5:167–182, 2001.
10. Corinna Cortes, Daryl Pregibon, and Chris Volinsky. Communities of interest. In Fourth International Symposium on Intelligent Data Analysis (IDA 2001), Lisbon, Portugal, 2001.
11. Shayan Ghazizadeh and Sudarshan S. Chawathe. SEuS: Structure extraction using summaries. In Steffen Lange, Ken Satoh, and Carl H. Smith, editors, Proceedings of the 5th International Conference on Discovery Science, volume 2534 of Lecture Notes in Computer Science (LNCS), pages 71–85, Lubeck, Germany, November 2002. Springer-Verlag.
12. R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proceedings of the Twenty-third International Conference on Very Large Data Bases, Athens, Greece, 1997.
13. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 13–23, 2000.
14. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of the 1st IEEE Conference on Data Mining, 2001.
15. S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proceedings of the International Conference on Data Engineering, pages 79–90, 1997.
16. K. Yoshida, H. Motoda, and N. Indurkhya. Unifying learning methods by colored digraphs. In Proc. of the International Workshop on Algorithmic Learning Theory, volume 744, pages 342–355, 1993.
Examining Technology Acceptance by Individual Law Enforcement Officers: An Exploratory Study
Paul Jen-Hwa Hu¹, Chienting Lin², and Hsinchun Chen²
¹ Accounting and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, Utah 84112
[email protected]
² Management Information Systems, Eller College of Management, University of Arizona, Tucson, Arizona 85721
{linc,hchen}@eller.arizona.edu
Abstract. Management of technology implementation has been a critical challenge to organizations, public or private. In particular, user acceptance is paramount to the ultimate success of a newly implemented technology in adopting organizations. This study examined acceptance of COPLINK, a suite of IT applications designed to support law enforcement officers’ analyses of criminal activities. We developed a factor model that explains or predicts individual officers’ acceptance decision-making and empirically tested this model using a survey study that involved more than 280 police officers. Overall, our model shows a reasonably good fit to officers’ acceptance assessments and exhibits satisfactory explanatory power. Our analysis suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Subjective norm also appears to have a significant effect on user acceptance through the mediation of perceived usefulness. Several managerial implications derived from our study findings are also discussed.
highly specialized tasks and often have considerable autonomy. As Chau and Hu commented [5], the fast-growing investment in and deployment of innovative technologies that support individual professionals demand additional investigations of their technology acceptance decision-making. Law enforcement is a fundamental and critical aspect of government services, as measured by its profound impacts on homeland security. By and large, law enforcement agencies are in the intelligence business and their crime fighting/prevention capability depends on individual officers' timely access to relevant and accurate information presented in an effective and easily assimilated manner. When investigating a criminal case or monitoring an organized gang ring, a police detective usually has to access, scrutinize, and integrate relevant information from various sources, internal and external. Because of its stringent information/knowledge support requirements, law enforcement indeed represents a service sector in which applications of information systems (IS) research and practice are inherently appealing and increasingly important. Our observations also suggest that individual officers usually have considerable autonomy in their case analysis and investigative tasks, thus manifesting or resembling a professional work arrangement. Together, the specialized and critical services in law enforcement, the extensive information/knowledge management support requirements, and individual autonomy demand further examinations of user technology acceptance in law enforcement settings. Investigations of technology acceptance by individual law enforcement officers, nonetheless, have received limited attention from IS researchers. In response, this study aims at examining user acceptance of COPLINK [6-7], a suite of applications designed to provide enhanced information sharing and knowledge management support to officers within and across law enforcement agencies. Specifically, we developed a factor model that explains or predicts individual officers' acceptance decision-making and then empirically tested the model using a survey study that involved more than 280 police officers. The current research purports to identify key technology acceptance drivers in law enforcement settings and investigate how these drivers and their effects might differ from those commonly observed in business contexts. The following section reviews relevant prior research and highlights our motivation.
2 Literature Review and Motivation
In this study, technology acceptance broadly refers to an individual's psychological state with regard to his or her voluntary and intentional use of a technology [13]. User technology acceptance has been examined extensively in IS research. A review of relevant previous studies suggests the dominance of a cognitive/behavioral anchor in conceptualizing and analyzing individual technology acceptance. According to this approach, an individual is conscious about his or her acceptance of a technology, which can be sufficiently explained or mediated by the underlying behavioral intention. Substantial empirical support for the explanatory/mediating power of behavioral intention for actual technology use has been established. As Mathieson [17] concluded, "given the strong causal link between intention and actual behavior, the fact that behavior was not directly assessed is not a serious limitation." Several theories that anchor at behavioral intention have prevailed, including the Theory of Reasoned Action
[12], the Theory of Planned Behavior [1]-[2], the Diffusion of Innovations Theory [20], and the Technology Acceptance Model [11]. Rooted in social psychology, the Theory of Reasoned Action (TRA) suggests that an individual's acceptance of a technology can be explained by his or her intention, which is jointly determined by attitudinal beliefs and (perceived) subjective norm. The Theory of Planned Behavior (TPB) extends TRA by incorporating an additional construct (i.e., perceived behavioral control) to account for situations where an individual lacks the capability or resources necessary for performing the behavior under discussion. The Diffusion of Innovations (DOI) theory also has premises established in social psychology, positing that the diffusion of an innovation in a social system is jointly affected by the communication of key innovation attributes that include relative advantage, complexity, compatibility, demonstrability, and trialability. Overall, these theories are generic and have been applied to explain a wide array of individual behaviors, including technology acceptance. Previous individual technology acceptance studies that used TRA, TPB, or DOI as a theoretical foundation have garnered considerable empirical support for the respective theories. The Technology Acceptance Model (TAM) is adapted from TRA and is developed specifically for explaining individual technology acceptance across different technologies, user groups, and contexts. According to TAM, an individual's decision on whether or not to accept a technology can be sufficiently explained by behavioral intention which, in turn, is determined by his or her perception of the technology's usefulness and ease of use. Judged by its frequent use in prior studies, TAM has emerged as a predominant model for individual technology acceptance. This model, however, has been criticized for its parsimonious structure, which limits its use for designing effective organizational interventions that foster technology acceptance. As Mathieson commented [17], "TAM is predictive, but its generality does not offer sufficient understanding to provide system designers with information needed for creating and promoting user acceptance of new systems." Nevertheless, TAM offers a valid and generic framework upon which extended or detailed models can be developed for specific user acceptance scenarios. Collectively, findings from previous research suggest that analysis of user technology acceptance in an organizational setting should consider key characteristics pertaining to multiple fundamental contexts. For instance, Tornatzky and Klein [24] suggested that an individual's acceptance decision in an organizational setting is jointly affected by factors pertaining to the technological context, the organizational context, and the external environment. Similarly, Chau and Hu [5] examined individual technology acceptance in a professional setting and singled out the importance of the technological, individual, and (organizational) implementation contexts. Igbaria et al. [15] highlighted the importance of the management context. Goodhue and Thompson [14] discussed the importance of the technology and task contexts, advocating a contingency fit between them. A review of the literature suggests that conceptualization of user technology acceptance needs to include multiple fundamental contexts, and that model development should proceed from identifying important characteristics of these contexts, based on the user acceptance phenomenon examined.
In addition, our literature review suggests the value of developing and empirically evaluating specific models that extend generic theories or models; e.g., [4], [23], [26], and [27]. According to this approach, a generic theory or model is used as a grounded framework upon which a detailed model is developed for a targeted user acceptance scenario, e.g., via inclusion of additional constructs or antecedents of key
acceptance drivers. The current research used both TAM and TPB as a theoretical framework for anchoring our analysis of key determinants of individual officers’ acceptance of COPLINK. Our model contained major TAM constructs (e.g., perceived usefulness and perceived ease of use), as well as their key antecedents and other constructs from TPB. During our model development, we also took into consideration important characteristics pertinent to our targeted technology, user group, and organizational (implementation) context.
3 Overview of COPLINK Technology

The COPLINK project was initiated and undertaken by the Artificial Intelligence Lab at the University of Arizona, in collaboration with the Tucson Police Department (TPD). An important project objective was to design, develop, and deploy innovative technology solutions to support and enhance information sharing and collaborative investigation within and across regional law enforcement agencies. Funded by the National Institute of Justice (NIJ) and the Digital Government Initiative of the National Science Foundation (NSF), the project has delivered COPLINK [6]-[7], which currently consists of two distinct but complementary applications: COPLINK Connect and COPLINK Detect. COPLINK Connect allows detectives and field officers to access data in other jurisdictions or government agencies, beyond the constraints of system or platform heterogeneity. COPLINK Detect extends the capabilities of Connect by supporting individual officers’ analysis of sophisticated criminal links and networks, using integrated and shared data. At the time of the study, a large-scale deployment of COPLINK had just been completed at TPD and implementation planning was underway in other jurisdictions in the states of Arizona and Texas. In parallel, technology development in COPLINK also continued, aiming at further enhanced information/knowledge management support and extended functionality through the use of agent and wireless technologies.
4 Research Model and Hypotheses

As shown in Figure 1, our research model suggests that an individual officer’s decision to accept or not to accept a technology can be explained by important characteristics pertaining to the technological, individual, and organizational contexts. Specifically, perceived usefulness, perceived ease of use, and efficiency gain are fundamental determinants of the technological context. Consistent with the propositions of TAM, our model states that perceived usefulness and perceived ease of use jointly determine attitude, and that perceived ease of use has a direct positive effect on perceived usefulness. All other factors being equal, an officer is more likely to consider COPLINK to be useful when it is easy to use. Efficiency gain refers to the degree to which an officer perceives his or her task performance efficiency would be improved through the use of COPLINK. Agility is critical in law enforcement, where individual officers are in a constant competition against time. In most cases, officers must respond to crime fighting/prevention challenges in a timely manner. Results from our preliminary evaluation of COPLINK showed that individual officers had
placed great importance on task performance efficiency resulting from their use of the technology. Accordingly, we tested the following hypotheses.
H1: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H2: The usefulness of COPLINK as perceived by an officer has a positive effect on his or her intention to accept the technology.
H3: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her attitude towards the technology.
H4: The ease of use of COPLINK as perceived by an officer has a positive effect on his or her perception of the technology’s usefulness.
H5: An officer’s perceived efficiency gain through the use of COPLINK has a positive effect on his or her perception of the technology’s usefulness.
Within a law enforcement setting, attitude is critical to the individual context and refers to an individual officer’s positive or negative attitudinal beliefs about the use of COPLINK. Through previous technology demonstrations and recently completed user training, officers at TPD were expected or likely to have developed personal assessments of and attitudinal beliefs about COPLINK. According to TAM and TPB, an individual who has a positive attitude towards a technology is likely to exhibit a strong intention to accept the technology. Venkatesh and Davis [25] and others (e.g., [11]) have questioned the effectiveness of attitude in mediating the impact of perceived usefulness and perceived ease of use on behavioral intention, thus suggesting its removal
from TAM and its extensions. In this study, we retained attitude in our model as a key intention determinant, partially because of the described autonomy of individual law enforcement officers, including their technology choice and use. Thus, we tested the following hypothesis.
H6: An officer is likely to have a strong intention to accept COPLINK when he or she has a positive attitude towards the technology.
Subjective norm and availability are key characteristics of the organizational (implementation) context. Consistent with TPB, subjective norm refers to an officer’s assessment or perception of significant referents’ desire or opinion on whether or not he or she should accept COPLINK [1]-[2]. In this study, the organizational context includes the communication of COPLINK assessments by administrators and individual officers in an adopting agency and therefore encompasses the management context discussed by Igbaria et al. [15]. Specifically, we posit that subjective norm has a direct positive effect on both perceived usefulness and behavioral intention. Within the social system common to law enforcement agencies, an officer’s behavior might be somewhat affected by significant referents’ opinions or suggestions. Consequently, an officer is likely to consider COPLINK to be useful, and thus to develop a strong intention for its acceptance, when his or her significant referents are in favor of the technology. By and large, officers appear to have a relatively strong psychological attachment to their agency and the social system within it; therefore, they are likely to develop and exhibit a close bond with colleagues and administrative commanders. Such psychological attachment and personal bond might be partially attributed to several factors that include an agency’s non-profit nature, less direct peer competition for resources or promotion (as compared with business organizations), personal commitment to public services, relatively long-term career pursuit, and the closed community common to most agencies. Therefore, we tested the following hypotheses.
H7: An officer is likely to perceive COPLINK to be useful when his or her significant referents are in favor of the technology.
H8: An officer is likely to have a strong intention to accept COPLINK when his or her significant referents are in favor of the technology.
Availability is also essential to the organizational context. In this study, availability refers to an officer’s perception of the availability of the computing equipment necessary for using COPLINK. Availability is a fundamental aspect of perceived behavioral control (from TPB). As noted by Ajzen [1]-[2], perceived behavioral control embraces internal conditions (e.g., self-efficacy [3], [8]) and external conditions (e.g., facilitating conditions [23]). In their comparative examination of competing models, Taylor and Todd [23] explicitly separated the internal and external aspects of control beliefs. Similarly, Venkatesh [27] also argued that the availability of resources and opportunities required to perform a target behavior is an important aspect of perceived ease of use. Availability of the computing equipment necessary for using COPLINK has been singled out as a potential concern to many officers, particularly those routinely working on criminal case analysis or away from the department offices. Results from multiple focus group discussions and interviews with individual officers consistently suggested the importance of making available the necessary computing equipment. All other factors being equal, the greater the availability as perceived by an officer,
the stronger his or her intention to accept the COPLINK technology. Hence, we tested the following hypothesis.
H9: Availability of the computing equipment necessary for using COPLINK has a positive effect on an officer’s intention to accept the technology.
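To make the hypothesized structure concrete, the following is a minimal, hypothetical sketch of how the structural paths implied by H1–H9 could be specified in a structural equation modeling tool. It assumes the Python package semopy and illustrative item and construct names (PU, PEOU, EG, SN, AV, ATT, BI); it is not the LISREL specification actually used in the study.

# Hypothetical sketch only: the study itself used LISREL, not this code.
import pandas as pd
from semopy import Model

MODEL_DESC = """
PU =~ PU1 + PU2 + PU3 + PU4
PEOU =~ PEOU1 + PEOU2 + PEOU3 + PEOU4
EG =~ EG1 + EG2 + EG3
SN =~ SN1 + SN2
AV =~ AV1 + AV2 + AV3 + AV4
ATT =~ ATT1 + ATT2 + ATT3
BI =~ BI1 + BI2 + BI3
PU ~ PEOU + EG + SN
ATT ~ PU + PEOU
BI ~ PU + ATT + SN + AV
"""
# PU ~ PEOU + EG + SN      covers H4, H5, and H7
# ATT ~ PU + PEOU          covers H1 and H3
# BI ~ PU + ATT + SN + AV  covers H2, H6, H8, and H9

def fit_acceptance_model(responses: pd.DataFrame) -> pd.DataFrame:
    """Fit the hypothesized model to item-level survey responses (illustrative
    column names PU1..BI3) and return the estimated measurement and structural paths."""
    model = Model(MODEL_DESC)
    model.fit(responses)
    return model.inspect()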
5 Instrument Development and Validation

We empirically tested our model using a self-administered survey that involved more than 280 police officers who volunteered their technology acceptance assessments. Our research method choice was made primarily because of its broad coverage (e.g., number of respondents) and support of different quantitative analyses. All participating officers were from the Tucson Police Department. Our investigation proceeded immediately after the department had completed technology implementation (including testing) and mandatory user training.
Multiple methods were used in our survey instrument development. Candidate question items were first identified from relevant previous empirical studies. In parallel, we also conducted focus group discussions, as well as unstructured and semistructured interviews with individual officers from the participating police department and other similar agencies. Preliminary measurements for each included construct were obtained by combining our interview/discussion findings and the candidate items extracted from previously validated inventories. Three police officers then assessed the face validity of the resultant question items. Based on their comments and suggestions, several minor wording changes were made to tailor the items to the law enforcement context. All questionnaire items used a seven-point Likert scale, with anchors from “strongly agree” to “strongly disagree.” To ensure the desired balance and randomness of the questionnaire, half of the question items were worded with proper negation and all items were randomly sequenced.
A pretest was then conducted to validate the instrument in terms of reliability and construct validity. Although the question items were mostly drawn from previously validated measurements, we re-examined them to ensure the necessary validity in the law enforcement setting [21]. Our pretest included a total of 42 police officers who varied in rank and division. Using their responses, we examined the instrument’s reliability by evaluating the Cronbach’s alpha value for the respective constructs. As summarized in Table 1, all the constructs showed an alpha value greater than 0.70, a commonly suggested threshold for exploratory research [19]. In addition, we also used pretest responses to assess the instrument’s construct validity in terms of convergent and discriminant validity [21]. Specifically, we performed a principal component factor analysis, which yielded a total of seven components, matching the exact number of constructs specified in our model. As shown in Table 2, items intended to measure a particular construct exhibited a distinctly higher factor loading on a single component than on other components, suggesting the measurements were of adequate convergent and discriminant validity. The validated measurements were subsequently used in the survey study, from which individuals who had participated in the instrument development or pretest were excluded. The question items used in the study are listed in the Appendix.
Table 1. Reliability analysis – Cronbach’s alpha by construct: Perceived Usefulness (PU), Perceived Ease of Use (PEOU), Subjective Norm (SN), Attitude (ATT), Behavioral Intention (BI); as noted above, every construct showed an alpha value greater than 0.70.
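A minimal, hypothetical sketch of the two pretest checks described above – Cronbach’s alpha per construct and a principal component analysis of the item pool – using pandas and scikit-learn. The item column names are illustrative, and the sketch omits the factor rotation that a fuller replication of the reported analysis might apply; it is not the software or procedure the authors used.

# Hypothetical sketch of the pretest reliability and validity checks described above.
import pandas as pd
from sklearn.decomposition import PCA

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for the items measuring one construct."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def pretest_checks(responses: pd.DataFrame, constructs: dict) -> pd.DataFrame:
    """Print per-construct alpha (0.70 exploratory threshold) and return item
    loadings on as many principal components as there are constructs."""
    for name, cols in constructs.items():
        print(f"{name}: alpha = {cronbach_alpha(responses[cols]):.2f}")
    all_items = [c for cols in constructs.values() for c in cols]
    pca = PCA(n_components=len(constructs))
    pca.fit(responses[all_items])
    return pd.DataFrame(pca.components_.T, index=all_items).round(2)

Items that load distinctly on a single component, as reported in Table 2, would support convergent and discriminant validity.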
6 Data Analysis Results

A self-administered survey study was conducted to test our research model and hypotheses. With the assistance of multiple assistant chiefs and captains, questionnaires were distributed through the line of command using an email attachment. Our subjects were individual officers who had been identified as target users of COPLINK and had completed the mandatory user training. The participating officers were from investigative and field operations divisions, and each of them was given two weeks to complete and return the questionnaire. Officers who had failed to complete and return the survey within the initial time window were reminded and given another two weeks to do so. A final one-week time window was then offered to those who still failed to respond.
Of the 411 questionnaires distributed, a total of 283 complete and effective responses were received, a 68.9% response rate. Analysis of the respondents’ gender distribution showed an approximate 4:1 ratio in favor of males. Most respondents were from the field operations divisions (60%), followed by the Criminal Investigative Division and Special Investigative Division (35%). Most of the respondents had a two-year college or associate’s degree (41%), followed by those having a high school diploma (30%) and those holding a four-year college degree (29%). On average, the responding officers were 38.4 years of age and had had 12.1 years of experience in law enforcement services. Comparative analysis of the
officers who completed and returned the survey within the initial response period versus those who needed the extended response time window(s) showed no significant differences in gender or home division distribution, educational background, age, or experience in law enforcement. Table 3 summarizes the demographic profile of the 283 respondents in our survey.
Table 2. Examination of convergent and discriminant validity – factor analysis results (factor loadings of items PU-1 to PU-4, PEOU-1 to PEOU-4, BI-1 to BI-3, ATT-1 to ATT-3, SN-1 and SN-2, AV-1 to AV-4, and EG-1 to EG-3 on seven components, with eigenvalues and percentage of variance explained).
Model Testing Results. We tested our research model using LISREL. Analysis results showed our model exhibiting a reasonable fit to the data; e.g., the Comparative Fit Index (CFI) was 0.91, the Non-Normed Fit Index (NNFI) was 0.89, and the Standardized Root Mean Square Residual (SRMSR) was 0.06. We also assessed the model’s explanatory power. As shown in Figure 1, our model exhibited satisfactory explanatory utility, accounting for 58% of the variance in intention, 66% of the variance in attitude, and 60% of the variance in perceived usefulness.
Individual Causal Paths. Six of the nine hypothesized causal paths were statistically significant; i.e., p-values of 0.05 or lower. As suggested by our analysis results, efficiency gain and subjective norm appeared to be significant determinants of perceived
usefulness, which, in turn, showed a significant effect on both attitude and behavioral intention. Perceived ease of use significantly affected attitude, which, however, was not a significant intention determinant. In addition, subjective norm appeared to have a significant effect on intention, but in direct opposition to our hypothesis. The remaining hypotheses were not supported by our data; i.e., perceived ease of use on perceived usefulness, availability on intention, and attitude on intention (which might have been somewhat significant).

Table 3. Summary of respondents’ demographic profile
Demographic Dimension: Descriptive Statistics
Average Age: 38.4 years
Average Experience in Law Enforcement: 12.1 years
Gender: Male 81%; Female 19%
Home Division: Criminal/Special Investigative 35%; Field Operations 60%; Other 5%
Education Background: 4-Year College or University 29%; 2-Year College 41%; High School 30%
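The early-versus-late respondent comparison mentioned above is a common check for nonresponse bias. The following is a minimal, hypothetical sketch of such a comparison using scipy; the column names (“age”, “experience”, “gender”, “division”, “education”, “wave”) are illustrative and are not taken from the authors’ data set.

# Hypothetical sketch of an early- vs. late-respondent (nonresponse bias) check.
import pandas as pd
from scipy import stats

def nonresponse_bias_check(df: pd.DataFrame) -> None:
    early = df[df["wave"] == "initial"]
    late = df[df["wave"] == "extended"]
    # Continuous demographics: two-sample t-tests (unequal variances).
    for col in ("age", "experience"):
        t, p = stats.ttest_ind(early[col], late[col], equal_var=False)
        print(f"{col}: t = {t:.2f}, p = {p:.3f}")
    # Categorical demographics: chi-square tests on contingency tables.
    for col in ("gender", "division", "education"):
        table = pd.crosstab(df["wave"], df[col])
        chi2, p, dof, _ = stats.chi2_contingency(table)
        print(f"{col}: chi2 = {chi2:.2f}, p = {p:.3f}")

Non-significant test results across these dimensions would be consistent with the "no significant differences" finding reported above.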
7 Discussion

Overall, our model showed a reasonably good fit to the responding officers’ technology acceptance assessments and exhibited an explanatory power comparable to, if not higher than, that of representative previous studies; e.g., [17], [23]. Several research and management implications can be derived from our findings.
First, our study suggests a prominent core influence path from efficiency gain to perceived usefulness and then to intention to accept. Perceived usefulness may be the single most important driver in individual officers’ technology acceptance decision-making. Based on our model testing results, perceived usefulness appears to be the only construct that has a significant direct positive effect on intention. The observed significance may suggest a tendency for an officer to anchor his or her technology acceptance decision in a utility perspective. This utility-centric view of technology is further supported by the insignificant influence of perceived ease of use on perceived usefulness. Together, our findings suggest that a law enforcement officer is not likely to consider a technology to be useful simply because it is easy to use. Efficiency gain is a critical aspect or source of utility. According to our analysis, many officers felt that the use of COPLINK would improve their task performance, and that COPLINK is useful for their work.
Second, subjective norm appears to be an important technology acceptance determinant, judged by its total effect on behavioral intention. According to our analysis,
subjective norm has a significant positive effect on individual acceptance decision-making, but this effect may be mediated by other factors; e.g., perceived usefulness. Individual officers are likely to take significant referents’ opinions into consideration when assessing a technology’s usefulness. However, such normative beliefs alone may not foster positive acceptance decisions directly. In effect, our analysis shows a negative effect of subjective norm on behavioral intention, significant at the 0.05 level. One possible interpretation is that an officer exhibiting a strong intention to use COPLINK may have developed a negative response to others’ desire that he or she should accept the technology, and vice versa. The observed negative effect might be partially attributed to individual autonomy in law enforcement, which resembles a professional setting to some degree.
Third, the influence of attitude on intention may be somewhat significant, as suggested by a p-value between 0.05 and 0.10. Perceived usefulness and perceived ease of use appear to be important determinants of an individual officer’s attitude toward COPLINK and together explain a significant portion of the variance in attitude; i.e., 66%. Our finding suggests that the importance of individual attitudes should not be underestimated. In this connection, administrators and technology providers need to proactively facilitate the cultivation and development of favorable attitudes by individual officers, particularly by means of convincing demonstrations and unambiguous communication of a technology’s utility and ease of operation. Management of individual attitude is essential in situations where law enforcement officers are relatively autonomous in task performance and technology use.
With increased understanding of key acceptance drivers and their probable causal relationships, administrators and technology providers can identify specific areas where user acceptance is likely to be hindered and tackle these barriers accordingly. In light of the prominent influence path from efficiency gain to perceived usefulness and then to intention to accept, initial demonstrations and user training should concentrate on communicating a technology’s utility for improving officers’ performance and emphasize the technology’s relevance to their routine tasks. Cultivating and promoting a favorable community assessment or view of the technology under discussion is also important and can create normative or even conformance pressure for individual acceptance decision-making. Such normative or compliant forces may not contribute directly to positive acceptance decisions, but can be so prevalent as to practically reinforce individual officers’ technology assessments. In addition, management of individual attitude towards a newly implemented technology is also relevant and deserves administrative or managerial attention in situations where individual officers have considerable autonomy in their task performance and technology choice/use.
Acknowledgement. We would like to thank the following TPD officers for their input and support: Chief Richard Miranda, Asst. Chief Kathleen Robinson, Asst. Chief Kermit Miller, Cap. David Neri, Lt. Jenny Schroeder, Det. Tim Petersen, and Daniel Casey. We also would like to thank Andy Moosmann for his invaluable assistance in data collection. The work reported in this paper was substantially supported by the Digital Government Program, National Science Foundation (NSF Grant # 9983304: “COPLINK Center: Information and Knowledge Management for Law Enforcement”).
References
1. Ajzen, I., “From Intention to Actions: A Theory of Planned Behavior,” in: Kuhl, J. and Beckmann, J. (eds): Action Control: From Cognition to Behavior, Springer Verlag, New York, 1985, pp. 11–39.
2. Ajzen, I., “The Theory of Planned Behavior,” Organizational Behavior and Human Decision Processes, Vol. 50, 1991, pp. 179–211.
3. Bandura, A., “Self-efficacy: Toward a Unifying Theory of Behavioral Change,” Psychological Review, Vol. 84, 1977, pp. 191–215.
4. Chau, P.Y.K., “An Empirical Assessment of a Modified Technology Acceptance Model,” Journal of Management Information Systems, Vol. 13, No. 2, 1996, pp. 185–204.
5. Chau, P.Y.K. and Hu, P.J., “Examining a Model for Information Technology Acceptance by Individual Professionals: An Exploratory Study,” Journal of Management Information Systems, Vol. 18, No. 4, 2002, pp. 191–229.
6. Chen, H., Schroeder, J., Hauck, R.V., Ridgeway, L., Atabakhsh, H., Gupta, H., Boarman, C., Rasmussen, K., and Clements, A.W., “COPLINK Connect: Information and Knowledge Management for Law Enforcement,” Decision Support Systems, Vol. 34, No. 3, 2003, pp. 271–285.
7. Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and Schroeder, J., “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, Vol. 46, No. 1, 2003, pp. 28–34.
8. Compeau, D.R. and Higgins, C.A., “Computer Self-Efficacy: Development of a Measure and Initial Test,” MIS Quarterly, Vol. 19, 1995, pp. 189–211.
9. Cooper, R.B. and Zmud, R.W., “Information Technology Implementation Research: A Technology Diffusion Approach,” Management Science, Vol. 34, No. 2, 1990, pp. 123–139.
10. Davis, F.D., “Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology,” MIS Quarterly, Vol. 13, No. 3, September 1989, pp. 319–339.
11. Davis, F.D., Bagozzi, R.P., and Warshaw, P.R., “User Acceptance of Computer Technology: A Comparison of Two Theoretical Models,” Management Science, Vol. 35, No. 8, 1989, pp. 982–1003.
12. Fishbein, M. and Ajzen, I., Belief, Attitude, Intention and Behavior: An Introduction to Theory and Research, Addison-Wesley, Reading, MA, 1975.
13. Gattiker, U.E., “Managing Computer-based Office Information Technology: A Process Model for Management,” in H. Hendrick and O. Brown (eds), Human Factors in Organizational Design, Elsevier Science, Amsterdam, The Netherlands, 1984, pp. 395–403.
14. Goodhue, D.L. and Thompson, R.L., “Task-Technology Fit and Individual Performance,” MIS Quarterly, Vol. 19, No. 2, June 1995, pp. 213–236.
15. Igbaria, M., Guimaraes, T., and Davis, G.B., “Testing the Determinants of Microcomputer Usage via a Structural Equation Model,” Journal of Management Information Systems, Vol. 11, No. 4, 1995, pp. 87–114.
16. Keen, P., Shaping the Future: Business Design through Information Technology, Harvard Business School Press, Boston, MA, 1991.
17. Mathieson, K., “Predicting User Intention: Comparing the Technology Acceptance Model with the Theory of Planned Behavior,” Information Systems Research, Vol. 2, No. 3, 1991, pp. 173–191.
18. Moore, G.C. and Benbasat, I., “Development of an Instrument to Measure the Perceptions of Adopting an Information Technology Innovation,” Information Systems Research, Vol. 2, No. 3, 1991, pp. 192–223.
19. Nunnally, J.C., Psychometric Theory, 2nd edn, McGraw-Hill, New York, 1978.
20. Rogers, E.M., Diffusion of Innovations, 4th edn, Free Press, New York, NY, 1995.
21. Straub, D.W., “Validating Instruments in MIS Research,” MIS Quarterly, Vol. 13, No. 2, 1989, pp. 147–169.
22. Szajna, B., “Empirical Evaluation of the Revised TAM,” Management Science, Vol. 42, No. 1, 1996, pp. 85–92.
23. Taylor, S. and Todd, P.A., “Understanding Information Technology Usage: A Test of Competing Models,” Information Systems Research, Vol. 6, No. 1, 1995, pp. 144–176.
24. Tornatzky, L.G. and Klein, K.J., “Innovation Characteristics and Innovation Adoption-Implementation: A Meta-Analysis of Findings,” IEEE Transactions on Engineering Management, Vol. 29, No. 1, 1982, pp. 28–45.
25. Venkatesh, V. and Davis, F.D., “A Model of the Antecedents of Perceived Ease of Use: Development and Test,” Decision Sciences, Vol. 27, No. 3, 1996, pp. 451–482.
26. Venkatesh, V. and Davis, F.D., “A Theoretical Extension of the Technology Acceptance Model: Four Longitudinal Studies,” Management Science, Vol. 46, No. 2, 2000, pp. 186–204.
27. Venkatesh, V., “Determinants of Perceived Ease of Use: Integrating Control, Intrinsic Motivation, and Emotion into the Technology Acceptance Model,” Information Systems Research, Vol. 11, No. 4, 2000, pp. 342–365.
Appendix: Listing of Question Items

Perceived Usefulness (PU) – source: Venkatesh & Davis (1996)
PU-1: Using COPLINK would improve my job performance.
PU-2: Using COPLINK in my job would increase my productivity.
PU-3: Using COPLINK would enhance my effectiveness at work.
PU-4: Overall, I find COPLINK to be useful in my job.

Perceived Ease of Use (PEOU) – source: Venkatesh & Davis (1996)
PEOU-1: My interaction with COPLINK is clear and understandable.
PEOU-2: Interacting with COPLINK does not require a lot of mental effort.
PEOU-3: Overall, I find COPLINK easy to use.
PEOU-4: I find it easy to get COPLINK to do what I want it to do.

Attitude (ATT) – source: Taylor & Todd (1995)
ATT-1: Overall, it is a good idea to use COPLINK in my job.
ATT-2: Using COPLINK would be pleasant.
ATT-3: Using COPLINK would be beneficial to my work.

Subjective Norm (SN) – source: Taylor & Todd (1995)
SN-1: My colleagues in the department think that I should use COPLINK.
SN-2: I would use COPLINK more if I knew my boss wanted me to.

Efficiency Gain (EG) – source: Davis (1989)
EG-1: Using COPLINK reduces the time I spend completing my job-related tasks.
EG-2: COPLINK allows me to accomplish tasks more quickly.
EG-3: Using COPLINK saves me time.

Availability (AV) – source: Taylor & Todd (1995)
AV-1: There are enough computers for everyone to use COPLINK.
AV-2: I have no difficulty finding a computer to use COPLINK when I need it.
AV-3: Availability of computers for accessing COPLINK is not going to be a problem.
AV-4: There are enough computers for me to use COPLINK in the department.

Behavioral Intention (BI) – source: Venkatesh & Davis (1996)
BI-1: When I have access to COPLINK, I would use it as often as needed.
BI-2: To the extent possible, I intend to use COPLINK in my job.
BI-3: Whenever possible, I would use COPLINK for my tasks.
“Atrium” – A Knowledge Model for Modern Security Forces in the Information and Terrorism Age

Chris C. Demchak

Cyberspace Policy Research Group, School of Public Administration and Policy, University of Arizona, Tucson, Arizona 85721
[email protected]
Abstract. Eighty percent of business process reengineering efforts have failed. This piece argues that the missing element is an ability to see the newer technical systems as conceptually integrated into an organization as well as functionally embedded. Similarly, a model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative socio-technical organizational design labeled the “Atrium” model, based on the corporate hypertext model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force.
the right amount of time to apply the correct electronic or other response. In short, IO, like the effective application of all other advanced technologies, depends as much on the organization of the people around the artifacts as on the quality of the artifacts themselves [3][8]. A model of a modern military or security institution facing asymmetries in active security threats and dealing with extremely limited strategic depth needs to focus less on precision strikes and more on knowing what can be known in advance.1 To achieve that future focus, supportive organizational designs need to be engaged in the transition process.
In this work, I present my conclusions after several years of research taking a knowledge-centric approach in developing an alternative model to the dominant organizational models of modern security forces––in these cases, militaries––seen in several nations, including the US.2 In the research, I focused on how the current and loosely planned future organizational designs could or could not assure that explicit and implicit knowledge in a complex system could be discovered, winnowed, connected, weighted, and applied using advancing technologies when the threats were multi-layered and present in peace as well as war. Finding that most published designs and existing relations were based on rather static notions of accessing only explicitly collected knowledge, I turned to the development of an alternative socio-technical organizational design labeled the “Atrium” model (see Figure 1 below), based on the corporate hypertext model of Nonaka and Takeuchi. The rest of this work presents the basics of this model as applied to a military organization, though it could conceivably apply to any large-scale security force.
Before introducing the model itself, it is important to note that, like their civilian counterparts in a rapidly globalizing environment, modern military technologies across both machine and human systems need information sharing, not hoarding, both to act quickly and to counter surprises. Designed by engineers, not social scientists, however, the newer systems tend to assume knowledge will come with the automatic and comprehensive provision of data. However, knowledge is not an automatic byproduct of networks and grids unless the surrounding social system deliberately seeks to capture that knowledge. Ultimately, in military or commercial endeavors, it is the organization, not the computer network, that is the knowledge-producing entity. And it costs a great deal to develop everything one needs on one’s own. The more distinct the organization is from a supporting and surrounding knowledge base, the more expensive the internal development of knowledge for that group of people [5]. Hence, it is preferable for an organization to share and to benefit from the sharing of other organizations.
Furthermore, complex systems, including organizations, are also path-dependent on initial conditions. The more the initial organizational design facilitates absorbing and accumulating knowledge from the beginning, including more slack, redundancy, and trial and error, the more likely the design will be robust and successful in the face of surprise. Since surprise is the endemic characteristic of the systems and requirements faced by militaries, especially smaller forces, any modernizing design needs to consider these complex system realities from the outset.
1 The description of this model is drawn heavily from [2].
2 Much of the model discussion was originally presented in an earlier work developing the model for a small state and using the case of Israel [2].
Fig. 1. The Atrium
The uncertainties of the new global circumstances require a different kind of modernization of the military organization – one less tied to legacy forces and more designed to support a new social construction of the role of knowledge as a player in organizational operations. To meet these aims, I propose a military or security adaptation of the commercial “hypertext” organization described by Nonaka and Takeuchi [6:99-133]. This refinement, which I labeled the “Atrium” form of information-based organization, is a design that treats knowledge as a third and equal partner in the military organization’s peacetime and wartime operations. In the original model and in my refinement, the knowledge base is not merely an overlain tool or connecting pipelines. Rather, the knowledge base of the organization is actively nurtured both in the humans and in the digitized, integrated institutional structure. Writing for the commercial world, Nonaka and Takeuchi attempted to reconcile the competing demands and benefits of both matrix and hierarchical organizational forms. Their “hypertext” organization interweaves three structures: a matrix structure in smaller task forces specifically focused on innovative problems at hand and answering to senior managers, a hierarchical structure that supports the general operational systems and also contributes and then reabsorbs the members of task forces, and finally a large knowledge base that is intricately interwoven through the activities of both matrix and hierarchical units.3
3 As Nonaka and Takeuchi [6:106-107] aptly phrased it, “The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive… Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts…”
In both their and my models, the knowledge base is more than a library or a database on a server; it is a structure in and of itself, integrating applications and data. It reaches into the task forces, whose members use it for data mining, while also sustaining the general operations, sharing information broadly. But it is also socially constructed as a key player in the organization, such that task force members are required to download their experiences in a task force into the knowledge base before they are permitted to return to their positions in the hierarchical portion of the organization. Similarly, operations in the general hierarchy are required to interact through the knowledge base systems so that patterns in operations and actions are automatically captured for analysis [6:99-133]. The major contribution here is that the knowledge base is not a separate addition to the organization, irrelevant to the architecture of the human-machine processes, as it is in the emergent US and other western models of modernizing militaries or security forces.4 Rather, it is integral to the success of processes and the survival of the institution. Several Japanese corporations seem to operate along these lines productively, and one is struck by an interesting distinction––implicit knowledge developed by human interactions related to the job is viewed by the corporation not only as a source of value but also as key to long-term survival.5 It is this view of knowledge that distinguishes these corporations and makes them more prepared for surprise in the marketplace.
In adapting this design and social construction to a military or security setting, I have given this concept of a knowledge base a name, the “Atrium.”6 The term captures the sense of being a place to which a member of the organization can go, virtually or otherwise, to contribute and acquire essential knowledge, and that it is also a place of refuge to think out solutions. The mental image is that it is overarching, not beneath the human actors, but something that protects as well as demands inputs. Entering into and interacting with the Atrium is essentially acting with a major player in the institution. Such a conception rationalizes the efforts to ensure implicit knowledge is integrated into the long-term analyses of the organization, such as the time spent in downloads of experiences and information from the task force members before they return to the more hierarchical stem of the organization. The “Atrium” form requires an explicit embrace of what has been called the “new knowledge management”7. In particular, the new knowledge management means using network/web technologies to move from controlling information inventories as human relationship-based “controlled hoards” to web-based “trusted source” struc-
5
6
7
A close reading of JV2010 and related US transformational documents shows a broad assumption that, as fast as the new equipment becomes, the knowledge needed to make that speed, lethality, and deployability successful will automatically be there as long as raw information is moved in real time. It is a rather naïve understanding of knowledge and complex systems but not unexpected if the decision-makers have focused on target acquisition and firing weapons at single points all their professional lives. “The goal is an organizational structure that views bureaucracy and the task force as complementary rather than mutually exclusive…..Like an actual hypertext document, hypertext organization is made up of interconnected layers or contexts…”, see [6:106-107]. In a manuscript under construction now “The Atrium–Refining the HyperText Organizational Form,” I more fully explain the mechanisms of integrating an Atrium into an organization. For a more modern use of this term, see [4] and [7].
tures.8 With networks, everything is dual use and sufficient technical familiarity can be found in foreign ministries as well as in basements inhabited by teenage geeks with a sociopathic attitude. Knowledge development will inevitably come through surprises that are encountered all along the spectrum of formal declaration of operations, from peace-building, through peace-making, peace-keeping, posturing, and prevailing in actual hostilities. The design of a modern knowledge-centric military must, in effect, accept 24/7 operations with all the ethical, legal, budgetary, socio-economic, and geostrategic constraints implied.
2 The “Atrium” as Colleague and Institutional Memory

Key to this model is stabilizing the locus of institutional memory and creativity in the human-Atrium processes. In principle, according to their rank, each member of the organization will have the chance to cycle in and out of task forces, core operations, or Atrium maintenance and refinement. As they cycle into a new position, gear up, operate, and then cycle out, each player does a data dump, including frustrations about process, data, and ideas, into the Atrium. Organizational members elsewhere can then apply data mining or other applications to this expanding pool of knowledge elements to guide their future processes. Explicit and implicit institutional knowledge thus becomes instinctively valued and actively retained and maintained for use in ongoing or future operations.
3 The Core – Main Operational Knowledge Creation and Application Hierarchies

With this new social construction of what one does with information in the military or security force (one creates, stores, refines, connects, weights, shares, and nurtures it), the Core then embraces the new knowledge potential of conscripts and reservists by reinforcing the trends in national digital education. That is, service involving computers is not only promoted as a benefit of conscription, but training in computers is pursued irrespective of the actual military or security function. For example, the maintainer will expect to find knowledge about diagnostic workarounds in other maintenance units in a foray into the Atrium, as well as being expected to give back one’s own personal experiences to the system. That maintainer – who could easily be a youngster of 19 years – will have been taught not just how to do that diagnostic task but also how to manipulate digital applications in general. This education in the military or security force will enhance the surprise-reducing potential of operations and also improve the soldiers’ future marketability to the economy and their long-term contribution to Atrium nurturing as reservists. As a side benefit, the growing unwillingness to serve in the military or in security services may be mitigated when all full
8
The evolution of the internet or the web is in essence a social history of information sharing among individuals embedded in organizations. There are a number of versions of the history of the internet. For one discussion, see [1]. See also the Internet Society web site.
time members (and associated part-timers or reservists) receive what is considered a valuable education in networked technology.
Furthermore, the Core also embraces the potential of part-timers or reservists for security forces by assigning tasks that further the knowledge development of the Atrium. In the United States, the role for reservists in future conflicts involving terrorism is under debate. Under this model, Core tasks can be accomplished on weekends without requiring the reservist to show up during working hours in uniform. The implicit knowledge of these experienced individuals is not lost, as they are able to spend reserve years solving puzzles or refining data, keeping their skills at usable levels while holding other employment. Reservists can then still serve physically in uniform in the Core when called up, but that period can be limited and infrequent since the reservist is not expected to do many basic security tasks in the field. Naturally, this approach sustains all the advantages of a close connection between the wider society and its part-time or reservist security forces without the disruption of a civilian job.
As described, the Core will have plenty of tasks associated with the Atrium, both in the initial creation of applications, elements, processes, and uses and in the coordination and integration of these evolutions. Its use of part-time or reservist security forces provides an essential, constant intellectual recharge from the wider community, permitting the Atrium to avoid iterating into a brittle bureaucratic equilibrium. By having the problem solving of the task forces as well as the intense attention of actively serving security force members, the members of the forces serving in the Core will come to understand the Atrium as an intelligent agent rather than a mindless amalgamation of individual databases. In short, the vibrancy of the Atrium in providing knowledge to accommodate surprise is due not to the professionalism of the small permanent Core party but to the newness of perspective and rising familiarity of both the active and part-time participants.
However, this organization will be surrounded by complex systems as well as being a complex system itself. Problems beyond the normal Core operations and Atrium knowledge analysis will emerge constantly. Some of these will be physically dangerous and immediate. Some will be prospective, such as determining why certain neighboring political leaders have allocated budget amounts to shadowy organizations. Some will be long term, such as rechanneling the design goals of key data chunk allocations within the Atrium or retargeting some of its uses in the light of wider global trends. For these kinds of problems, a matrix organization is eminently preferred, and hence we come to the final element, the task forces.
4 The Task Forces – Responses in Knowledge Creation and Security Applications

Security forces, in and out of militaries, tend to fragment into many small existing units with specialized missions. Each of them develops a broad and deep array of implicit knowledge that this model would be able to capture and put to good use. Many of the existing units can be altered to function as task force structures answering to the senior military or security force officers in a knowledge-centric organization. First, to capture the implicit information currently lost or buried, members of all field units
will rotate in from their operations to download implicit knowledge, update their understanding of the Atrium’s holdings and possible insights, and contribute to the Core. Second, some of the more elite units will be retargeted along different modalities of knowledge acquisition and use, applying such data in knowledge mining combined with other information presented in the Atrium. Some units will be left with the more physically challenging missions, such as border incursion controls and basic training, but their members will also be rotated in and out on longer cycles, perhaps a year, to accommodate exceptional physical requirements. Other units will be gradually altered into problem analysis units – moving from simply gathering data on all suspicious activity to meta-analyses of such activities over time and locations, with an eye to proactively disrupting the initiating efforts of the infiltrating threat rather than sending squads after the cell is well established. For this, the members will have to be as digitally creative as they are physically hardy. The deployed or physically demanding units will be smaller and directly answerable to senior members of the headquarters staff.
However, since rotating organization members among the three – task forces, Core, and the Atrium – is a basic tenet, even senior leaders must rotate. For example, senior leaders could spend most of their time leading each of the field divisions or commands, but they must rotate in for Atrium service, as well as heading task forces occasionally. While on rotation to the Atrium, the senior leader must be completely free from leadership duties; thus attention must be paid to a functioning deputy leader culture. Finally, the explicit assumption is that each task force is solving a problem or exploring an opportunity while also developing important nonobvious information that must be entered into the Atrium’s processes. Senior leaders, just like lowly field members, have implicit knowledge to contribute to, and skills to refine in extracting and manipulating data from, the Atrium resources.
Not all of the existing military or security force units will change in their mission; rather, the more likely change is to scale back the size of the units and attach them higher up in the hierarchy. The ones that retain the more physically dangerous missions will alter only in that their members will rotate out of Core positions for a position in the elite force and then back through an Atrium tour before returning to the Core. Fortunately, the value placed on computer skills, and possibly a civilian career to follow military or security force experience, offers a way to socially construct this change for easier acceptance, as well as continuing service on a part-time basis. Placing these units directly beneath the senior leaders also mollifies grievances over a loss of prestige. Personnel rotating in and out of these units are assured not only of interesting current problems for six months to a year but also of greater visibility at senior levels. The units will benefit from the strong advantages of a matrix structure in creativity and are likely to produce more innovative problem solutions than can be produced today.
5 Advantages – Surprise-Oriented, Scalable Knowledge-Enabled Institutions

This design has advantages in using advance knowledge to extend the limited strategic depth of a nation or community under the unknowable unknowns of the emerging information and terrorism age. Deleterious surprises by actively hostile opponents can
be countered by integrating different kinds of forces across early warning and response forces, and by the innovative combination of information accumulations. The existing, widely held model of a modern security force tends towards centralization of control by reducing slack in the organization’s time and/or redundancy in its resources. It has become an act of faith that this centralization explicitly promotes synchronicity of operations, and in due course centralization across networks is also encouraged. But a fixation on central decision-making and synchronized actions can encourage devastating ripple effects in an increasingly tightly coupled organization.
In contrast, the Atrium model is based on an understanding of complexity across large-scale systems – the environment faced by security forces today under active threats. If only trends – not specifics – can be seen in advance, then the best preparation is to have the knowledge base, and the skills in creative combinations, ready and waiting for the elements of the trend to take concrete shape. The model encourages a dampening of rippling rogue outcomes by the rotation of members and the inclusion of skilled part-timers. Its design presumes that surprise during operations is normal in complex systems and that only slack built through knowledge mechanisms can really accommodate, mitigate, or dampen the effects on a large-scale organization. Hence, the Atrium concept encourages independent thinking while permitting widespread coordination and integration across the organization, time, and operations. And this response can be mounted at any scale. Having socialized into unit members some key central themes in operations is as close as the Atrium comes to endorsing expensive centralization such as the Total Information Awareness program currently being pursued by the US Department of Defense.
Furthermore, this proposal does not assume wisdom comes automatically with 100 percent visibility of any conflict arena, or that this kind of visibility of an operation is the goal of modernization. On the contrary, the Atrium organizational model presumes that the 24/7 accumulation of information, much of it implicit and never before digitized, will use data mining techniques and a constant inflow of new pairs of eyes (in rotations through the Atrium) to construct new visions of operations. Innovative operations at any scale are enhanced when the integration of a wide variety of information is more possible. While a nation or a security service under threat still needs physically demanding forces and standoff weapons, other electronic options emerge, such as targeted disruption efforts that may overtly or covertly derail threatening postures by hostile opponents, or even a long-term, slow-roll deception goal that diverts potential hostile actors from other more dangerous choices.
Furthermore, when work is digitized, internal security can increase nonobviously. It is easier and less intrusive to scan across employee actions when work is digitized. Also, when part-timers are rotating in and out of all functions and their implicit knowledge is being accumulated in the Atrium, individual elements of knowledge are potentially spread all over the society. With so many knowing in general the overall structure and uses of the Atrium and the military or security force’s capabilities, the competition is less for secret information than for positive social assessments by chief acquisition officers.
This kind of institutional knowledge helps both in curbing corruption through database transparency and in permitting those secrets that absolutely must be kept to be buried in the data noise.
References
1. Benedikt, Michael. (1991). Cyberspace: First Steps. Boston, MA: The MIT Press.
2. Demchak, Chris C. (2001). “Knowledge Burden Management and a Networked Israeli Defense Force: Partial RMA in ‘Hyper Text Organization’?” Journal of Strategic Studies, 24:2 (June).
3. Drucker, Peter F. (1959). Technology Management and Society. San Francisco, CA: Harper and Row.
4. Gleick, James. (1987). Chaos: Making a New Science. New York: Viking.
5. Landau, Martin. (1973). “On the Concept of a Self-Correcting Organization.” Public Administration Review (November–December 1973).
6. Nonaka, Ikujiro and Takeuchi, Hirotaka. (1997). “A New Organizational Structure (HyperText Organization).” In Prusak, Laurence, ed. Knowledge in Organizations. Boston: Butterworth-Heinemann, 99–133.
7. Wheatley, Margaret J. (1992). Leadership and the New Science. San Francisco: Berrett-Koehler Publishers.
8. Wilson, James Q. (1989). Bureaucracy: What Government Agencies Do and Why They Do It. New York: Basic Books.
Untangling Criminal Networks: A Case Study

Jennifer Xu and Hsinchun Chen

Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, U.S.A.
{jxu, hchen}@eller.arizona.edu
Abstract. Knowledge about criminal networks has important implications for crime investigation and the anti-terrorism campaign. However, lack of advanced, automated techniques has limited law enforcement and intelligence agencies’ ability to combat crime by discovering structural patterns in criminal networks. In this research we used the concept space approach, clustering technology, social network analysis measures and approaches, and multidimensional scaling methods for automatic extraction, analysis, and visualization of criminal networks and their structural patterns. We conducted a case study with crime investigators from the Tucson Police Department. They validated the structural patterns discovered from gang and narcotics criminal enterprises. The results showed that the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure for a network that might be useful in the development of effective disruptive strategies for criminal networks.
ple, removal of central members in a network may effectively upset the operational network and put a criminal enterprise out of action [3, 17, 21]. Subgroups and interaction patterns between groups are helpful for finding a network’s overall structure, which often reveals points of vulnerability [9, 19]. For a centralized structure such as a star or a wheel, the point of vulnerability lies in its central members. A decentralized network such as a chain or clique, however, does not have a single point of vulnerability and thus may be more difficult to disrupt. To analyze structural patterns of criminal networks, investigators must process large volumes of crime data gathered from multiple sources. This is a nontrivial process that consumes much human time and effort. Current practice of criminal network analysis is primarily a manual process because of the lack of advanced, automated techniques. When there is a pressing need to untangle criminal networks, manual approaches may fail to generate valuable knowledge in a timely manner. To help law enforcement and intelligence agencies analyze criminal networks, we propose applying the concept space and social network analysis approaches to extract structural patterns automatically from large volumes of data. We have implemented these techniques in a prototype system, which is able to generate network representations from crime data, detect subgroups in a network, extract between-group interaction patterns, and identify central members. Multi-dimensional scaling has also been employed to visualize criminal networks and structural patterns found in them. The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 describes the system architecture; Section 4 presents the case study in detail; Section 5 concludes the paper and suggests future research directions.
2 Related Work

The process of extracting structural network patterns from crime data usually includes three phases: network creation, structural analysis, and network visualization. We review related work for each phase.

2.1 Network Creation

To create network representations of criminal enterprises, investigators have to wade through floods of database records to search for clues of relationships between offenders. Such a task can be time-consuming and labor-intensive. A technique called link analysis has been used to detect relationships between crime entities and create network representations. Traditional link analysis is based on the Anacapa charting approach [12], in which data have to be examined manually to identify possible relationships. For visualization purposes, an association matrix is then constructed and a link chart based upon it is drawn. An investigator can study the structure of the link chart (a network representation) to discover patterns of interest. Krebs [15], for example, mapped a terrorist network comprising the 19 hijackers in the September 11 attacks on the World Trade Center, using such an approach. However,
ever, the manual link analysis approach will become extremely ineffective and inefficient for large datasets. Some automated approaches to creating representations of criminal networks based on crime data have been proposed. Goldberg and Senator [11] used a heuristic-based approach to forming links and associations between individuals who had shared addresses, bank accounts, or related transactions. The networks created were analyzed to detect money laundering and other illegal financial activities. Dombroski and Carley [8] combined multi-agent technology, a hierarchical Bayesian inference model, and biased network models to create representations of a criminal network based on prior network data and informant perceptions of the network. A different network creation method used in the COPLINK system [13] is based on the concept space approach developed by Chen and Lynch [5]. Such an approach can generate a thesaurus from documents based on co-occurrence weights that measure the frequency with which two words or phrases appear in the same document. Applying this approach to crime incident data results in a network representation in which a link between a pair of entities exists if they ever appear together in the same criminal incident report. The more frequently they appear together, the stronger the association. After a network representation has been created, the next phase is to extract structural patterns from the networks. 2.2 Structural Analysis Social Network Analysis (SNA) provides a set of measures and approaches for structural network analysis. These techniques were originally designed to discover social structures in social networks [23] and are especially appropriate for studying criminal networks [17, 18, 21]. Specifically, SNA is capable of detecting subgroups, identifying central individuals, discovering between-group interaction patterns, and uncovering a network’s organization and structure [23]. Studies involving evidence mapping in fraud and conspiracy cases have recently employed SNA measures to identify central members in criminal networks [3, 20]. Subgroup Detection. With networks represented in a matrix format, the matrix permutation approach and cluster analysis have been employed to detect underlying groupings that are not otherwise apparent in data [23]. Burt [4] proposed to apply hierarchical clustering methods based on a structural equivalence measure [16] to partition a social network into positions in which members have similar structural roles. Centrality. Centrality deals with the roles of network members. Several measures, such as degree, betweenness, and closeness, are related to centrality [10]. The degree of a particular node is its number of direct links; its betweenness is the number of geodesics (shortest paths between any two nodes) passing through it; and its closeness is the sum of all the geodesics between the particular node and every other node in the
network. Although these three measures are all intended to illustrate the importance or centrality of a node, they interpret the roles of network members differently. An individual with a high degree, for instance, may be inferred to have a leadership function, whereas an individual with a high level of betweenness may be seen as a gatekeeper in the network. Baker and Faulkner [3] employed these three measures, especially degree, to find the key individuals in a price-fixing conspiracy network in the electrical equipment industry. Krebs [15] found that, in the network consisting of the 19 hijackers, Mohamed Atta scored the highest on degree. Discovery of Patterns of Interaction. Patterns of interaction between subgroups can be discovered using an SNA approach called blockmodel analysis [2]. Given a partitioned network, blockmodel analysis determines the presence or absence of an association between a pair of subgroups by comparing the density of the links between them against a predefined threshold value. In this way, blockmodeling summarizes individual interaction details into interactions between groups so that the overall structure of the network becomes more apparent. 2.3 Network Visualization SNA includes visualization methods that present networks graphically. The Smallest Space Analysis (SSA) approach, a branch of Multi-Dimensional Scaling (MDS), is used extensively in SNA to produce two-dimensional representations of social networks. In a graphical portrayal of a network produced by SSA, the stronger the association between two nodes or two groups, the closer they appear on the graph; the weaker the association, the farther apart [17]. Several network analysis tools, such as Analyst's Notebook [14], Netmap [11], and Watson [1], can automatically draw a graphical representation of a criminal network. However, they do not provide much structural analysis functionality and continue to rely on investigators' manual examinations to extract structural patterns. Based on our review of related work, we proposed to employ the concept space approach, SNA measures and approaches, and MDS for extracting and visualizing structural patterns of criminal networks. We have developed a prototype system in which the proposed techniques have been implemented. The architecture of the system and its individual components are presented in the next section.
3 System Architecture The prototype system contains three major components: network creation, structural analysis, and network visualization. Figure 1 illustrates the system architecture.
Fig. 1. System architecture
3.1 Network Creation Component We employed the concept space approach to create networks automatically, based on crime data. We assumed that criminals who committed crimes together might be related and that the more often they appeared together the more likely it would be that they were related. We treated each incident summary (database records specifying the date, location, persons involved, and other information about a specific crime) as a document and each person’s name as a phrase. We then calculated co-occurrence weights based on the frequency with which two individuals appeared together in the same crime incident. As a result, the value of a co-occurrence weight not only implied a relationship between two criminals but also indicated the strength of the relationship.
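The paper does not spell out the exact weighting formula, but the co-occurrence idea can be made concrete with a short sketch. The Python snippet below, using made-up incident records, counts how often two names appear in the same incident and normalizes the count into a weight between 0 and 1; the Jaccard-style normalization is an illustrative assumption, not necessarily the formula used in the concept space approach of Chen and Lynch [5].

from collections import defaultdict
from itertools import combinations

# Hypothetical incident data: each incident lists the persons named in its report.
incidents = [
    {"id": 1, "persons": {"A. Smith", "B. Jones"}},
    {"id": 2, "persons": {"A. Smith", "B. Jones", "C. Lee"}},
    {"id": 3, "persons": {"B. Jones", "C. Lee"}},
]

appearances = defaultdict(int)   # how many incidents each person appears in
co_counts = defaultdict(int)     # how many incidents each pair shares

for incident in incidents:
    for person in incident["persons"]:
        appearances[person] += 1
    for a, b in combinations(sorted(incident["persons"]), 2):
        co_counts[(a, b)] += 1

# Co-occurrence weight: shared incidents divided by incidents involving either
# person (a Jaccard-style normalization, assumed here purely for illustration).
weights = {}
for (a, b), shared in co_counts.items():
    weights[(a, b)] = shared / (appearances[a] + appearances[b] - shared)

for pair, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, round(w, 2))

Pairs with higher weights would receive stronger links in the resulting network representation.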
3.2 Structural Analysis Component The structural analysis component includes three functions: network partition for detecting subgroups, centrality measures for identifying central members, and blockmodeling for extracting interaction patterns between subgroups. Network Partition. We employed hierarchical clustering, namely the complete-link algorithm [6], to partition a network into subgroups based on relational strength. Clusters obtained represent subgroups. To employ the algorithm, we first transformed co-occurrence weights generated in the previous phase into distances/dissimilarities. The
distance between two clusters was defined as the distance between the pair of nodes drawn from each cluster that were farthest apart. The algorithm worked by merging the two nearest clusters into one cluster at each step and eventually formed a cluster hierarchy. The resulting cluster hierarchy specified groupings of network members at different granularity levels. At lower levels of the hierarchy, clusters (subgroups) tended to be smaller and group members were more closely related. At higher levels of the hierarchy, subgroups are large and group members might be loosely related. Centrality Measures. We used all three centrality measures to identify central members in a given subgroup. The degree of a node could be obtained by counting the total number of links it had to all the other group members. A node’s score of betweenness and closeness required the computation of shortest paths (geodesics) using Dijkstra’s algorithm [7]. Blockmodeling. At a given level of a cluster hierarchy, we compared between-group link densities with the network’s overall link density to determine the presence or absence of between-group relationships. SNA was the key technique in our prototype system for extraction of criminal network knowledge.
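As a rough illustration of how these three functions fit together, the sketch below reproduces them on a toy weighted network using networkx and scipy, which are assumed here purely for convenience; the distance transform (1 minus weight), the clustering cut-off, and the use of unweighted centrality variants are illustrative choices rather than the prototype's actual parameters (the prototype computes geodesics with Dijkstra's algorithm over the transformed distances).

import networkx as nx
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy network: edges carry co-occurrence weights in (0, 1].
G = nx.Graph()
G.add_weighted_edges_from([
    ("p1", "p2", 0.9), ("p1", "p3", 0.6), ("p2", "p3", 0.7),
    ("p3", "p4", 0.2), ("p4", "p5", 0.8), ("p4", "p6", 0.5), ("p5", "p6", 0.6),
])
nodes = list(G.nodes())

# 1. Network partition: complete-link hierarchical clustering on distances
#    derived from co-occurrence weights (distance = 1 - weight; unlinked pairs
#    get the maximum distance of 1.0).
n = len(nodes)
dist = np.ones((n, n)) - np.eye(n)
for u, v, d in G.edges(data=True):
    i, j = nodes.index(u), nodes.index(v)
    dist[i, j] = dist[j, i] = 1.0 - d["weight"]
Z = linkage(squareform(dist), method="complete")
labels = fcluster(Z, t=0.6, criterion="distance")   # cut the hierarchy at 0.6
groups = {c: [nodes[i] for i in range(n) if labels[i] == c] for c in set(labels)}
print("subgroups:", groups)

# 2. Centrality measures: degree, betweenness, and closeness for each member
#    (unweighted variants shown for brevity).
print("degree:", nx.degree_centrality(G))
print("betweenness:", nx.betweenness_centrality(G))
print("closeness:", nx.closeness_centrality(G))

# 3. Blockmodeling: compare the link density between each pair of subgroups
#    with the network's overall link density.
overall_density = nx.density(G)
group_ids = list(groups)
for a in range(len(group_ids)):
    for b in range(a + 1, len(group_ids)):
        ga, gb = groups[group_ids[a]], groups[group_ids[b]]
        links = sum(1 for u in ga for v in gb if G.has_edge(u, v))
        density = links / (len(ga) * len(gb))
        print(ga, "-", gb, ": density", round(density, 2),
              "(overall", round(overall_density, 2), ") ->",
              "link" if density >= overall_density else "no link")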
3.3 Network Visualization Component To map a criminal network onto a two-dimensional display, we employed MDS to generate x-y coordinates for each member in a network. We chose Torgerson’s classical metric MDS algorithm [22] since distances transformed from co-occurrence weights were quantitative data. A graphical user interface was provided to visualize criminal networks. Figure 2 shows the screenshot of our prototype system. In this example, each node was labeled with the name of the criminal it represented. Criminal names were scrubbed for data confidentiality. A straight line connecting two nodes indicated that two corresponding criminals committed crimes together and thus were related. To find subgroups and interaction patterns between groups, a user could adjust the “level of abstraction” slider at the bottom of the panel. A high level of abstraction corresponded with a high distance level in the cluster hierarchy. Group members’ rankings in centrality are listed in a table.
Fig. 2. A prototype system for criminal network analysis and visualization. (a) A 57-member criminal network. Each node is labeled using the name of the criminal it represents; lines represent the relationships between criminals. (b) The reduced structure of the network. Each circle represents one subgroup, labeled by its leader's name. The size of the circle is proportional to the number of criminals in the group. A line represents a relationship between two groups; its thickness represents the strength of the relationship. Centrality rankings of members in the biggest group are listed in a table at the right-hand side. (c) The inner structure of the biggest group (the relationships between group members).
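Torgerson's classical metric MDS [22] amounts to double-centering the squared distance matrix and taking the top two eigenvectors of the result. A compact numpy sketch of that computation is shown below on an assumed small distance matrix; the prototype's actual layout code may of course differ.

import numpy as np

def classical_mds(dist, k=2):
    """Torgerson's classical metric MDS: embed a distance matrix in k dimensions."""
    n = dist.shape[0]
    d2 = dist ** 2
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ d2 @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # indices of the largest k eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * L                 # n x k coordinate matrix

# Example: distances derived from co-occurrence weights (distance = 1 - weight).
dist = np.array([
    [0.0, 0.1, 0.4, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
])
print(classical_mds(dist))

The two columns returned would serve as the x-y coordinates used to place each member on the display.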
4 Case Study In order to examine our system's ability to reveal structural patterns from criminal networks, we conducted a case study at the Tucson Police Department (TPD). The study was intended to answer the following research questions: Can structural analysis approaches correctly detect subgroups in criminal networks? Can structural analysis approaches correctly identify central members of criminal networks? Can structural analysis approaches correctly identify interaction patterns between subgroups in criminal networks? Can structural analysis approaches help extract the overall structure of a criminal network? In this study, we focused on two types of networks: gang and narcotics, both of which involved organized crime. For each network, the Gang Unit at the TPD provided a list of names of active criminals. We extracted from the TPD database all crime incidents in which these criminals had been involved and we created two networks.
4.1 Data Preparation The gang network. The list of gang members consisted of 16 offenders who had been under investigation in the first quarter of 2002. These gang members had been involved in 72 crime incidents of various types (e.g., theft, burglary, aggravated assault, drug offense, etc.) since 1985. We used the concept space approach and generated links between criminals who had committed crimes together, resulting in a network of 164 members (Figure 3a). The narcotics network (The “Meth World”). The list for narcotics network consisted of 71 criminal names. A sergeant from the Gang Unit had been studying the activities of these criminals since 1995. Because most of them had committed crimes related to methamphetamines, the Sergeant called this network “Meth World.” These offenders had been involved in 1,206 incidents since 1983. A network of 744 members was generated (Figure 3b).
(a) The 164-member gang network
(b) The 744-member narcotics network
Fig. 3. The gang and narcotics networks
These two networks were analyzed using our prototype system. Several crime investigators including the sergeant and one detective from the Gang Unit and two detectives from the Information Section validated our results. 4.2 Result Validation The study was divided into two sessions. During each session, the crime investigators examined one network and evaluated the structural patterns discovered from it. Both sessions were tape-recorded and the results were summarized as follows.
Detection of Subgroups. Since our system could partition a network into subgroups at different levels of granularity, we selected the partition that the crime investigators considered to be closest to their knowledge of the network organizations. The result showed that our system could detect subgroups from a network correctly: Subgroups could be detected correctly using cluster analysis. Two major subgroups together with several small subgroups were found in the 164-member gang network based on the clustering results (Figure 4a). The bigger subgroup (solid circle) consisted of 99 members and the smaller subgroup (dashed circle) consisted of 24 members. In the narcotics network, no obvious subgroups except for four cliques originally could be seen because of the large network size (Figure 3b). After clustering, however, two subgroups became very obvious with the bigger one (solid circle) consisting of 397 members and the smaller one (dashed circle) consisting of 331 members (Figure 4b). Moreover, the crime investigators verified that partitions within each of the subgroups were also correct.
(a) Subgroups in the gang network
(b) Subgroups in the narcotics network
Fig. 4. Subgroups detected from the networks
Subgroups detected had different characteristics. It turned out that the subgroups found were consistent with their members' characteristics, specializations, or responsibilities in the networks. In the gang network (Figure 4a), the subgroup represented by a solid circle was identified as a set of white gang members who often were involved in murders, shootings, and aggravated assaults. "These are people who always create a lot of trouble," the sergeant said. The subgroup represented by a dashed circle, on the other hand, consisted of many white gang members who specialized in the sale of crack cocaine. The subgroup represented by a small dotted circle was a set of black gang members who were quite separate from the whole network. The two subgroups
(solid and dashed) in Figure 4b, similarly, corresponded with two criminal enterprises led by different leaders. Moreover, each subgroup could be further broken down into smaller subgroups that might be responsible for different tasks. For example, Figure 5a presents the subgroups within one of the criminal enterprises in the narcotics network. The group in the solid circle was responsible for stealing, counterfeiting, and cashing checks, and for providing money to other groups to carry out drug transactions. The group in the dashed circle, on the other hand, consisted of many drug dealers.
(a) Subgroups with different responsibilities
(b) Relationships between group members
Fig. 5. Subgroup characteristics and relationships
Incident-based relationships reflected other types of associations between group members. Two group members might have been related because they came from the same family, went to the same school, spent time together in prison, etc. Figure 5b, for example, presents connections among 24 members of the crack cocaine group in the gang network. Member 87 was member 173’s girlfriend (connected by a solid line) who often brought female dancers to purchase crack cocaine. In the narcotics network in Figure 4b, members of the dashed circle were former schoolmates. As the sergeant commented, “They knew each other in high school and at that time they were juvenile gang members. Then they got involved in methamphetamines.” Long-time relationships between group members showed a high frequency of committing crimes together, and high relational strength was captured by high co-occurrence weight. Identification of Central Members. We interpreted the highest degree score as an indicator for a leader, the highest betweenness score as an indicator for a gatekeeper, and the one with the lowest closeness (the least likely to be a central member) as an outlier. The crime investigators evaluated central members identified from six subgroups at different granularity levels in both gang network and narcotics network. The
results showed that although the system could identify important members in a subgroup, it could not necessarily identify a true leader. A member who scored the highest in degree might not necessarily be a leader. On one hand, offenders with high degree often were those who had had frequent police contacts. Such offenders may play active roles in leading a group. Three out of six leaders were identified as true leaders in their subgroups. For example, in the crack cocaine subgroup shown in Figure 5b, member 173 had the largest number of connections with other group members. This person had a lot of money, was able to buy and sell drugs frequently, and provided his house for drug transactions. As mentioned in the previous section, his girlfriend also helped bring in more people to purchase drugs. Similarly, the member with the highest degree in the murderers group (solid circle in Figure 4a) was also identified as the leader in the group. On the other hand, a high degree could not always be interpreted as an indicator of leadership for two reasons. First, in a criminal enterprise, the leader may hide behind other offenders and keep frequency of activities low by using other people to do tasks. “Especially, when they got out of prison they tended to be smarter and more educated and thus were more careful to avoid police contacts,” the sergeant commented. In Figure 6, for example, member 501 (labeled with a star) was the true leader of one subgroup from the narcotics network. However, he did not score the highest in degree in this group because he actually used other group members (along the dashed path) to sell methamphetamines for him. Second, current police databases did not capture leadership data about criminal enterprises. A crime investigator had no way to tell which group member was the leader unless he/she obtained such information from interrogation or other sources. Three out of six leaders evaluated were not the true leaders of their groups. Therefore, the degree measure should be interpreted carefully. A member who scored highest in betweenness was a gatekeeper. Our crime investigators verified that all of the six gatekeepers were correctly identified from their subgroups. These gatekeepers played important roles in maintaining the flow of money, drugs, or other illicit goods in their networks. Although not identified as a leader based on degree measure, member 501 (labeled with a star) in Figure 6a was correctly identified as a gatekeeper because he controlled and managed the flow of money and drugs in his group. The star in Figure 6b represented a gatekeeper in that group because she was responsible for cashing stolen or counterfeit checks and redistributing money to other group members. The other four gatekeepers evaluated were offenders who often rode bicycles to sell drugs on the street. “Such gatekeepers were quite important to the operation of their criminal enterprises,” a detective from the Gang Unit said. An outlier who scored the lowest in closeness might play an important role in a network. No detailed evaluation was conducted on outliers because of the long time spent on the discussion of leader and gatekeeper roles in both validation sessions. Our crime
(a) A group leader without the highest degree
(b) A gatekeeper
Fig. 6. Central members in subgroups
investigators only mentioned that it was possible that an outlier might be a true leader who stayed away from the rest of his group but actually controlled the whole group. No specific example was given, however. Identification of Interaction Patterns between Subgroups. Our crime investigators evaluated a set of between-group interaction patterns including interactions among three groups (solid, dashed, and dotted) in the gang network (Figure 4a), interactions between two major groups (solid circle and dashed circle) in the narcotics network (Figure 4b), and those between the solid and dashed groups in Figure 5a. The results showed that patterns identified using blockmodel analysis reflected the truth about interactions between criminal groups correctly. Frequency of interaction (represented by thickness of lines) between subgroups was a correct indicator of the strength of between-group relationship. In Figure 4a, for example, the blockmodeling result revealed a strong link between the murderers’ group (solid circle) and the crack cocaine group (dashed circle). When asked whether this interaction pattern was accurate, the sergeant answered: “Sure. These guys often hang together. The leaders of these two groups are best friends.” Moreover, interaction patterns might also represent flows of money and goods between groups. In Figure 5a, money and drugs flowed frequently between the dashed group (for drug sales) and the solid group (for check washing and cashing). Interaction patterns between groups might also represent problems or hatred. Frequent interactions between the two major groups in the narcotics network (Figure 4b) resulted not only from their group members’ switching back and forth but also from
problems between the two groups, whose leaders had been at odds for a long time. Their subordinates often ran into shootings and fights. Interaction patterns identified could help reveal relationships that previously had been overlooked. During the evaluation of the gang network (Figure 4a), the sergeant noticed that there was a line (dotted) connecting the murderers’ group (solid circle) and the black gang group (dotted circle): “I have never seen these black gang members having any connection with those white gang members”. When referring back to the original network in Figure 3a, we found a link (dotted line) between one member from the black group and a member from the murderers’ group. According to the sergeant, identifying such a connection would be very helpful for developing investigative leads. Extraction of Overall Network Structures. According to our crime investigators, gang and narcotics enterprises usually differed in structure: gang enterprises tended to be more centralized and narcotics organizations tended to be more decentralized. In order to assess our system’s abilities to reveal such structural differences, we extracted two datasets from the TPD database: (a) incident summaries of narcotics crimes from January 2000 to May 2002, and (b) incident summaries of gang-related crimes from January 1995 to May 2002. We selected four gang networks and nine narcotics networks from our datasets. Sizes of these networks ranged from 21 to 100. Other networks generated from our datasets were either too small or too large and were not analyzed. We found that the blockmodeling function in our system did reveal distinguishing structural patterns of the two types of criminal enterprises: Two out of four gang networks under study had a star structure similar to that presented in Figure 2. The third network was a chain of stars and the fourth had a star structure with some of its branches being a smaller star or a clique (Figure 7a-b). All nine narcotics networks had a chain structure (Figure 7c-d). Three of these networks were chains of stars. One network had a circle in the middle of the chain. 4.3 Usefulness of System All our crime investigators provided very positive comments on our system. They believed that the system could be very useful for extracting structural network patterns and discovering knowledge about criminal enterprises. In particular, our system could help them in the following ways: Saving investigation time. The sergeant and his assistants had obtained knowledge about the gang and narcotics organizations during several years of work. Using information gathered from a large number of arrests and interviews, he had built the networks incrementally by linking new criminals to known gangs in the network and then studied the organization of these networks. Because there was no structural analysis tool available, he did all this work by hand. With the help of our system, he expected substantial time could be saved in network creation and structural analysis.
(a) A 51-member gang network
(b) The star structure found in the gang network
(c) A 60-member narcotics network
(d) The chain structure in the narcotics network
Fig. 7. Overall structures of criminal networks
Saving training time for new investigators. New investigators who did not have sufficient knowledge of criminal organizations could use the system to grasp the essence of a network and its crime history quickly. They would not have to spend a significant amount of time studying hundreds of incident reports. Suggesting investigative leads that might otherwise be overlooked. For example, the link between the black gang group and the white murderers' group in the gang network had previously been overlooked and could have suggested useful investigative leads. Helping prove guilt of criminals in court. The relationships discovered between individual criminals and criminal groups would be helpful for proving guilt when presented in court for prosecution.
In summary, the structural analysis approaches we proposed showed promise for extracting important patterns in criminal networks. Specifically, subgroups, central members, and interaction patterns among subgroups usually could be identified correctly by the use of cluster analysis, centrality measures, and blockmodeling functionality.
5 Conclusions and Future Work Criminal network knowledge has important implications for crime investigation and national security. In this paper we have proposed a set of approaches that helped extract structural network patterns automatically from large volumes of data. These techniques included the concept space approach for network creation, hierarchical clustering methods for network partition, and social network analysis for structural analysis. MDS was used to visualize a criminal network and its structural patterns. We conducted a case study with crime investigators from TPD to validate the structural patterns of gang and narcotics criminal enterprises. The results were quite encouraging—the approaches we proposed could detect subgroups, central members, and between-group interaction patterns correctly most of the time. Moreover, our system could extract the overall structure for a network that might help in the development of effective disruptive strategies for criminal networks. We plan to continue our criminal network analysis research in the following directions: Allowing investigators to edit a network by adding, deleting, and modifying nodes and links. Networks created using our system were based entirely on incident data. Other important information collected from multiple sources about network members, relationships between members, and member roles would help provide a more complete picture of a criminal enterprise. Especially, knowledge about group leaders that could not be obtained using incident data from typical police databases should be added to a network representation to avoid misleading interpretation of the degree measure. Including other entity types than person. Criminal networks in our current studies were limited to only person type. Criminals’ connections with other types of entities such as location, weapon, and property could also be useful. In the “Meth World”, for example, drug offenders often used a specific hotel to carry out transactions. Examining frequencies of hotel addresses associated with a set of narcotics crimes could help in understanding the operation of a narcotics organization and predicting future crimes. Studying temporal and cross-regional patterns of criminal networks. Over time criminal networks could change in size, organization, structures, member roles and many other characteristics. The “Meth World” in Tucson had expanded from a network consisting of no more than 150 members in 1995 to the one with more than 700 members in 2002. Members and their roles in the network had also changed a lot in the past eight years: some old members left the network because of arrest or death; new members had been attracted into the network in search of profit; more powerful
new leaders might have replaced old leaders, etc. It would be interesting to study how a criminal network evolves over time. Should a certain temporal pattern be discovered, it would be helpful in predicting the trend and operation of a criminal enterprise. On the other hand, a criminal enterprise can expand across several regions or nations. The "Meth World" was initially only in Tucson and was later connected with criminals from Phoenix, California, and Mexico. Cross-regional analysis could be used to study criminal enterprises on a large scale and could have significant value for combating terrorism. At the same time, we will continue to develop more techniques to further advance the research on criminal networks.
Acknowledgement. This project has primarily been funded by the National Science Foundation (NSF), Digital Government Program, “COPLINK Center: Information and Knowledge Management for Law Enforcement,” #9983304, July, 2000-June, 2003 and the NSF Knowledge Discovery and Dissemination (KDD) Initiative. Special thanks go to Dr. Ronald Breiger from the Department of Sociology at the University of Arizona for his kind help with the initial design of the research framework. We would like also to thank the following people for their support and assistance during the entire project development and evaluation processes: Dr. Daniel Zeng, Michael Chau, and other members at the University of Arizona Artificial Intelligence Lab. We also appreciate important analytical comments and suggestions from personnel from the Tucson Police Department: Lieutenant Jennifer Schroeder, Sergeant Mark Nizbet of the Gang Unit, Detective Tim Petersen, and others.
References
1. Anderson, T., Arbetter, L., Benawides, A., Longmore-Etheridge, A.: Security works. Security Management, Vol. 38, No. 17. (1994) 17–20.
2. Arabie, P., Boorman, S. A., Levitt, P. R.: Constructing blockmodels: How and why. Journal of Mathematical Psychology, Vol. 17. (1978) 21–63.
3. Baker, W. E., Faulkner, R. R.: The social organization of conspiracy: illegal networks in the heavy electrical equipment industry. American Sociological Review, Vol. 58, No. 12. (1993) 837–860.
4. Burt, R. S.: Positions in networks. Social Forces, Vol. 55, No. 1. (1976) 93–122.
5. Chen, H., Lynch, K. J.: Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5. (1992) 885–902.
6. Defays, D.: An efficient algorithm for a complete link method. Computer Journal, Vol. 20, No. 4. (1977) 364–366.
7. Dijkstra, E.: A note on two problems in connection with graphs. Numerische Mathematik, Vol. 1. (1959) 269–271.
8. Dombroski, M. J., Carley, K. M.: NETEST: Estimating a terrorist network's structure. Computational & Mathematical Organization Theory, Vol. 8. (2002) 235–241.
9. Evan, W. M.: An organization-set model of interorganizational relations. In: Tuite, M., Chisholm, R., Radnor, M. (eds.): Interorganizational Decision-making. Aldine, Chicago (1972) 181–200.
10. Freeman, L.: Centrality in social networks: Conceptual clarification. Social Networks, Vol. 1. (1979) 215–239.
11. Goldberg, H. G., Senator, T. E.: Restructuring databases for knowledge discovery by consolidation and link formation. In: Proceedings of the 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis (1998).
12. Harper, W. R., Harris, D. H.: The application of link analysis to police intelligence. Human Factors, Vol. 17, No. 2. (1975) 157–164.
13. Hauck, R. V., Atabakhsh, H., Ongvasith, P., Gupta, H., Chen, H.: Using Coplink to analyze criminal-justice data. IEEE Computer, Vol. 35, No. 3. (2002) 30–37.
14. Klerks, P.: The network paradigm applied to criminal organizations: Theoretical nitpicking or a relevant doctrine for investigators? Recent developments in the Netherlands. Connections, Vol. 24, No. 3. (2001) 53–65.
15. Krebs, V. E.: Mapping networks of terrorist cells. Connections, Vol. 24, No. 3. (2001) 43–52.
16. Lorrain, F. P., White, H. C.: Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, Vol. 1. (1971) 49–80.
17. McAndrew, D.: The structural analysis of criminal networks. In: Canter, D., Alison, L. (eds.): The Social Psychology of Crime: Groups, Teams, and Networks, Offender Profiling Series, III. Aldershot, Dartmouth (1999) 53–94.
18. McIllwain, J. S.: Organized crime: A social network approach. Crime, Law & Social Change, Vol. 32. (1999) 301–323.
19. Ronfeldt, D., Arquilla, J.: What next for networks and netwars? In: Arquilla, J., Ronfeldt, D. (eds.): Networks and Netwars: The Future of Terror, Crime, and Militancy. Rand Press (2001).
20. Saether, M., Canter, D. V.: A structural analysis of fraud and armed robbery networks in Norway. In: Proceedings of the 6th International Investigative Psychology Conference, Liverpool (2001).
21. Sparrow, M. K.: The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks, Vol. 13. (1991) 251–274.
22. Torgerson, W. S.: Multidimensional scaling: Theory and method. Psychometrika, Vol. 17. (1952) 401–419.
23. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge (1994).
Addressing the Homeland Security Problem: A Collaborative Decision-Making Framework

T.S. Raghu 1, R. Ramesh 2, and Andrew B. Whinston 3

1 W. P. Carey School of Business, Arizona State University, Tempe, AZ 85287
[email protected]
2 Department of Management Science & Systems, School of Management, State University of New York at Buffalo, Buffalo, NY 14260
[email protected]
3 Department of Management Science & Information Systems, University of Texas at Austin, Austin, TX 78712
[email protected]
Abstract. A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. This problem is further exacerbated by the multitude of agencies involved in the decision-making process. Thus the decision-making processes faced by the intelligence agencies are characterized by group deliberations that are highly ill-structured and yield limited analytical tractability. In this context, a collaborative approach to providing cognitive support to decision makers using a connectionist modeling approach is proposed. The connectionist modeling of such decision scenarios offers several unique and significant advantages in developing systems to support collaborative discussions. Several inference rules for augmenting the argument network and capturing implicit notions in arguments are proposed. We further explore the effects of incorporating notions of information-source reliability within arguments.
1 Introduction A key underlying problem intelligence agencies face in effectively combating threats to homeland security is the diversity and volume of information that needs to be disseminated, analyzed and acted upon. The Office of Management and Budget (OMB) lists about 100 different federal government categories that are funded to specifically carry out anti-terrorism tasks (Office of Management and Budget, Annual Report to Congress on Combating Terrorism, available at www.whitehouse.gov/omb/legislative/nsd_annual_report2001.pdf, Oct. 2001, pages 89–100). This obviously excludes state and local government agencies that are often involved in anti-terrorism operations. Given the diversity of agencies and the diversity of information sources it is quite clear that decision-making tasks related to homeland security are highly decentralized. Effective sharing, dissemination and assimilation of information is key to successful homeland security
strategy. In this paper, a collaborative decision-making framework is proposed as a key enabler of a distributed information and decision-making backbone for homeland security. When acting upon and integrating intelligence information from several sources, decision makers have to consider and debate various possible decision and security alternatives. In general, most such decision problems are highly ill-structured and yield limited mathematical tractability. Consequently, such decision issues have to be resolved through discussions, where argumentative logic and persuasive presentation are critical. Conventional decision modeling tools may not be able to solve the decision issues as a whole, although they may be used to generate argumentative logic while discussing some of them. In a collaborative decision making process, the group members assume positions, which could be claims or endorsements or oppositions to other claims. These positions could be assumed with or without supporting arguments or evidential data gathered from various intelligence sources (summarized at various levels). The sequence of challenges and responses typically follows an evolutionary path until the decision issues are resolved. In this process, the argument logic and the supporting/contradicting evidential data could grow significantly in both size and complexity, causing a substantial cognitive load on the decision makers. The primary objective of this research is therefore to develop pragmatic and efficient support tools to ease the cognitive burden, focus the group on critical security related issues and guide creative positional and argument strategy development throughout the discussion. The Collaborative Decision-Making (CDM) framework in Figure 1 presents a broad architectural view of a CDM system for homeland security. The CDM system comprises of four broad components: Knowledge repositories, Group facilitation and coordination, Discussion strategy support and Dialectic decision support. Each of these perspectives share some requirements on basic systems components as backbone services, we have identified many of these components in our framework. A brief description of these perspectives is given below. Knowledge Repositories. Given the diversity of federal, state and local agencies involved in intelligence gathering and decision-making processes, a unifying, semantically developed structure to represent intelligence knowledge and information is the first key requirement. The volumes of data, diversity of culture, language and vocabularies exacerbate the complexity of knowledge storage and retrieval. To facilitate communication among geographically, culturally, and/or technically diverse populations of people and systems it is imperative to develop unified knowledge and data repositories. In this context, it is important to build domain ontology and taxonomies that will play a key role in shaping collaborative decision-support systems for homeland security. Group Facilitation and Coordination. Providing system support for enabling distributed teams to coordinate has been studied extensively in the literature. Under this perspective, the recent trends in the areas of group support systems, collaborative filtering and Computer Supported Cooperative Work (CSCW) are the key technological components of a CDM system.
Fig. 1. A Collaborative Decision Support System Framework for Homeland Security
Discussion Strategy Support and Dialectic Support. These two aspects of CDM are perhaps the least understood. Considerable research in the areas of argumentation analysis, natural language processing, and structured knowledge interchange has taken place over the past few years. However, application of these fundamental areas in collaborative decision-making has been scarce if not non-existent. A substantial portion of the paper will delve into how semi-structured information from several sources can be meaningfully analyzed. Arguments and positions enunciated by decision-makers are enhanced through simple inference procedures and argument coherence and dialectical assessments are carried out through connectionist procedures.
The organization of the paper is as follows. Section 2 discusses the foundations of this research. Section 3 presents the connection network architecture, and Section 4 summarizes the model elements and presents an integrated global view of dialectical support through connectionism. Section 5 presents our concluding remarks.
2 Research Foundations Intelligence communities involved in homeland security tasks represent a very complex global virtual organization. The underlying context of this domain is the geographical distribution of the strategic, tactical and operational communities and their activities over the globe. The key to achieving success and breakthroughs in homeland security lies in effective team communication, creative conflict management, sustained coordination of team efforts and continuity in collaboration, all ensured within a structured collaborative decision environment. Although the road to achieving the full potential of such teamwork is filled with challenges, both organizational and technical, advanced information technology can be used in novel ways to facilitate effective collaboration that have not even been conceived till recently. The current research is envisioned as an important milestone in this direction. We identify Information filtering as the first key challenge that would need to be overcome for effective decision-making. The objective here should be to filter the vast information base so that relevant and important intelligence information are accessible quickly to key decision makers. Most of the current filtering systems provide minimal means to classify documents and data. A common criticism of these systems is their extreme focus on information storage, and failure to capture the underlying meta-information. As a consequence, the concept of knowledge ontology has emerged, with a view to create domain level context that enable users to attach rich domain-specific semantic information and additional annotations to intelligence information and documents and employ the meta-information for information retrieval. Once information storage is augmented with knowledge ontology, it becomes easier to provide structured mechanisms for communication wherein decision makers are enabled to communicate over distributed systems. Structured communication enables one to capture the knowledge of intelligence community in easily accessible discussion archives. The underlying structure in the discussion archives would enable the provision of additional collaborative decision support to intelligence personnel. Thus, we draw upon the literature from knowledge ontology and theory of argumentation as the theoretical bases for this research. Formal ontology characterizes knowledge providing a framework binding contextual elements with the relationships that link them within the ontology, as well as the relationships with other units of knowledge [5]. The knowledge ontology consists of a conceptual model, a thesaurus, and a set of expanded attributes and axioms. Its concern is for the appropriate representation of content, which may later be augmented with a mechanistic formalism, such as UML (Unified Modeling Language), RDF (Resource Description Framework), BNF (Backus Naur Form), or formal logic [19]. The main challenge that agencies involved in homeland security face is the volume and number of different information sources that would potentially feed useful and useable information to the CDM system. For instance, the key targets that need protection include large buildings, sports arenas, nuclear facilities, airports, trains and sub-
ways, and national symbols in over 200 cities [1]. Clearly operational and intelligence information pertaining to these key targets will be varied in format, content and context. It is therefore imperative to impose uniform semantic structures where possible and define contextual meta-data on other sources of information to enable dissemination of information across federal, state and local agencies. The main contribution of this research is to demonstrate that further decision support functionalities can be embedded in a CDM system that leverage the metainformation framework of domain knowledge ontology. This would help decision makers better utilize the volumes of information collected through various sources. The basis for collaborative decision support in our system comes from argumentation theory. The logic of argumentation can be studied in terms of its two, rather classical, elements: structure and content. The two components have a symbiotic relationship in the sense that the informational content of an argument needs a logical structure for its coherence and significance. Connectionist modeling provides a way to capture both the elements in a single framework[3,4]. Several works deal primarily with representation formalisms and heuristics for argument analysis, interpretation and outcome prediction [7,9,12,13,15,18]. Given the diffuse nature of intelligence information and the uncertainties associated with the information sources, it would be difficult for any system to provide discrete decisions on security issues. Our approach is to move towards a system of argument analysis in which one is not necessarily constrained to resolving argumentation to discrete categories [16,17]. Using binary categories as a basis for rejecting or accepting arguments prevents one from assessing the relative strengths of the arguments. While connectionist models do not have the strong theoretical underpinnings of logic based defeasible graphs[6,8], using Connectionist models for this purpose has many advantages over methods that utilize simple binary categories of acceptance and rejection [11]. Connectionist modeling achieves better sensitivity in argument assessment by indicating the degree of acceptance or rejection of arguments[14,16]. In addition, one can assign different weights on the arcs connecting the different units in the model. This enables one to capture not only the relations between units but also the strength of the relation. The basic computational details of the connectionist architecture are described in [14]. Briefly, arguments in a discussion are structured into basic, atomic-level information units along with their logical and other human-intended relationships. The basic informational units are represented as the units (which is a term used to represent network nodes in the connectionist literature) and their relationships as the arcs in a network formalism for argument logic. The dialectical power of an argument is an indicator of the strength or validity of an argument, and is measured by the activation level of the unit representing the final thesis of the argument at asymptotic convergence. For example, the final thesis of the argument can be that there is an imminent threat to a key national monument in the near future. An argument derives its dialectical power by the logical coherence inherent in its structure and by the support it derives from its evidence. 
The evidence could be either observed facts, intelligence information, and previous incidents or derived conclusions from other claims and arguments. The structure and content of the supporting as well as opposing logic behind an argument together determine its dialectical power. The dialectical power of various positions in a collaborative discussion is a very useful evaluative feedback to the decision makers. This measure identifies the relative strengths and weaknesses of the positions, and points to whether a discussion is
moving towards a resolution or not. Consequently, it can be used to focus a group on critical security flaws, reexamine security measures if necessary and develop strategies to address future threats. Further, the connectionist paradigm can also be used to derive assessments on subsets of a large argument network selectively, or on higher-level meta-networks derived by aggregating argument sets from a basic network into meta-units and meta-arcs. Thus the proposed model can provide selectively local views of a comprehensive discussion as well as condensed global perspectives on an entire discussion. The dialectical support functionality can provide comprehensive and dynamic monitoring/guidance systems for collaborative discussions on the Intranets.
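The computational details of the connectionist procedure are given in [14] and are not reproduced here. Purely to illustrate the general idea, the sketch below runs a simple interactive-activation style update over a tiny argument network (positive arcs for support, negative arcs for opposition) until the activations stop changing, and reads off the converged activation of the final thesis as a rough dialectical-power score; the update rule, weights, and clamping of evidence units are assumptions made for this example, not the model of [14].

import numpy as np

# Units 0-3: 0 = final thesis, 1 = supporting claim, 2 = opposing claim,
# 3 = evidence (an observed fact).  W[i, j] is the weight of the arc j -> i:
# positive for support, negative for opposition.
W = np.array([
    [0.0,  0.8, -0.6, 0.0],   # thesis: supported by unit 1, opposed by unit 2
    [0.0,  0.0,  0.0, 0.9],   # claim 1 is supported by the evidence unit
    [0.0,  0.0,  0.0, 0.0],
    [0.0,  0.0,  0.0, 0.0],
])
a = np.zeros(4)
clamped = {3: 1.0}            # evidence units are clamped to their observed value

for step in range(200):
    net = W @ a               # net input to every unit
    new_a = np.tanh(net)      # squash into (-1, 1): degree of acceptance/rejection
    for unit, value in clamped.items():
        new_a[unit] = value
    if np.max(np.abs(new_a - a)) < 1e-6:   # asymptotic convergence
        a = new_a
        break
    a = new_a

print("activation levels:", np.round(a, 3))
print("dialectical power of the final thesis:", round(float(a[0]), 3))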
3 Argument Structure and Connectionism 3.1 Argument Structure The basic formalism for our connectionist approach is available in [14]. We briefly describe the argumentation formalism here; for a detailed discussion please refer to [14]. The discussion of inference rules and the incorporation of information source reliability are additional contributions in this paper. Let Γ denote the group of individuals in a collaborative discussion. Let Δ denote the argument structure representing the various positions, facts, and their interrelationships generated in the discussion. Clearly, Δ is a temporal entity, evolving and changing over time as the discussion proceeds. The structure Δ is basically a collection of assertions made by the individuals in the group. This is indicated as follows: Δ = {A | A is an Assertion}. An assertion A is of two types: positions and inferences. A statement of position is a claim, and is assumed to be a well-formed sentence. A statement of inference is a structural relationship among a set of positions and facts. We formalize the structure of these assertion types as follows. Let Λ denote a language from which the structure Δ is constructed. The language Λ is a triple ⟨Σ, Ρ, Θ⟩, where Σ constitutes the sentences, Ρ is a set of assertions built using sentences, and Θ is a set of assertion qualifications. Σ provides the basis for the construction of positions and statements of fact and is composed of defeasible sentences (Σd) and factual sentences (ΣF). A factual statement is any evidential data that is commonly accepted by the group, while the positions are the subject of discussion. Ρ provides the basis for the construction of positional and inferential assertions. This enables the construction of positional assertions from sentences obtained from Σ as well as inferential structures from other assertions. Θ provides the basis for the qualification of an argument on whether it is strict or defeasible. While a defeasible argument is subject to debate and possibly defeat, a strict argument is a logical inference that will not be questioned by anyone in the group. Ρ provides two constructs, <support> and <opposition>, to build inferential structures among positions and facts. Θ provides two constructs, <strict> and <defeasible>, to qualify assertions. As a result, a combination of these constructs yields the following qualified inferences: <strict support>, <defeasible support>, <strict opposition> and <defeasible opposition>.
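As a companion to the formalism, the fragment below sketches one possible in-memory representation of the assertion types and qualified inferences just described; the class and field names are invented for illustration and are not part of the authors' system, and the symbols in the comments refer to the notation above.

from dataclasses import dataclass
from enum import Enum
from typing import List, Union

class Qualification(Enum):      # the qualifications in Θ
    STRICT = "strict"
    DEFEASIBLE = "defeasible"

class Direction(Enum):          # the inferential constructs in Ρ
    SUPPORT = "support"
    OPPOSITION = "opposition"

@dataclass
class Sentence:                 # an element of Σ
    text: str
    factual: bool = False       # True for ΣF, False for Σd

@dataclass
class Position:                 # a positional assertion (a claim)
    claim: Sentence
    author: str

@dataclass
class Inference:                # an inferential assertion among assertions
    premises: List["Assertion"]
    conclusion: "Assertion"
    direction: Direction
    qualification: Qualification

Assertion = Union[Position, Inference]

# Example: a defeasible support inference from an intelligence report (a fact)
# to an analyst's claim.
report = Position(Sentence("Source reports unusual activity near the facility.",
                           factual=True), author="field office")
claim = Position(Sentence("The facility is a likely target."), author="analyst")
link = Inference([report], claim, Direction.SUPPORT, Qualification.DEFEASIBLE)
print(link.qualification.value, link.direction.value)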
Active Database Systems for Monitoring and Surveillance

A. Badia

Abstract. In many intelligence and security tasks it is necessary to monitor data in a database in order to detect certain events or changes. Currently, database systems offer triggers to provide active capabilities. Most triggers, however, are based on the Event-Condition-Action paradigm, which can express only very primitive events. In this paper we propose an extension of traditional triggers in which the Event is a complex situation expressed by a Select-Project-Join-GroupBy SQL query, and the trigger can be programmed to look for changes in the situation defined. Moreover, the trigger can be directed to check for changes on a periodic basis. After proposing a language to define changes, we sketch an implementation, based on the idea of incremental view maintenance, to support our extended triggers efficiently.
1 Introduction
In the past, databases were passive, low-level repositories of data on top of which smarter, domain-focused applications were built. Lately, databases have taken a more active role, offering more advanced services and higher functionality to other applications. In this framework, the database assumes responsibility for execution of some tasks previously left to the application, which offers several advantages: the possibility of better performance (since the database has direct access to the data and knows how the data is stored and distributed), better data quality (since the database is already in charge of basic data consistency) and better overall control. However, this trend has resulted in the database taking on more ambitious roles, and having to provide more advanced functionality than in the past. One of the areas where this trend is clear is the area of active databases. In the past, the database could monitor data and respond to certain changes via triggers (also called rules; in this paper, we use the terms rule and trigger as equivalent). However, commercial systems offer very limited capabilities in this sense. In addition to problems of performance (triggers add quite a bit of overhead) and control (because of the problems of non-terminating, non-confluent trigger sets), trigger systems are very low-level: while the events that may activate a trigger are basic database actions (insertions, deletions and updates), users are interested in complex conditions that
may depend on several database objects and their interactions. It is difficult to express these high-level, application-dependent events in triggers. In this paper, we describe an ongoing project whose goal is to add advanced monitoring and control functionality to database systems through the design and development of extended rule systems. In a nutshell, we develop triggers where more complex events can be stated, thus letting system users specify, in a high-level language, the patterns they need to monitor. Since performance is still an issue, we also develop efficient algorithms based on the idea of incremental recomputation already used in the evaluation of materialized views in data warehouses ([9]). As a result of the added functionality, a database system will be able to monitor the appearance of complex patterns and to detect changes in said patterns. Other research in active databases has not dealt with this issue. Our approach is focused on concepts that may have a practical impact; in particular, we aim at expressing more complex events, making it easier for database users to specify the conditions they are interested in monitoring, but we also propose an efficient implementation, something which is absent from most research in the area.
2 Background and Related Research
In most database systems (certainly in all commercial systems), active capabilities are incorporated through the ability to define triggers. A trigger has the form Event-Condition-Action (ECA). The typical events considered by active rules are primitives for database state changes, like insertions, deletions and updates from/to database tables. The condition is either a database predicate or a query (the query is implicitly considered true if the query returns a non-empty answer, and false otherwise). The action may include transactional commands, rollback or rule manipulation commands, or sometimes may activate externally defined procedures, including arbitrary data manipulation programs. Rules are fired when a particular event occurs; the condition is then evaluated, and if found true then the action is executed. This simple schema is found lacking for several reasons ([4, 26]). Mainly, the events used in triggers are considered too low-level to be useful for many applications; a great deal of research in active databases has focused on defining more complex events ([16, 17, 15, 12, 10, 14]). In basically all the previous research, complex events are obtained by combining primitive events in some event language, which usually includes conjunction, disjunction, negation and sequencing of events ([27]). Some approaches include time primitives ([23, 21, 18]), sometimes based on some temporal logic ([22, 6]). Although none of these projects addresses the issue we are dealing with here (active monitoring of complex conditions) we note that [24] also proposes using incremental recomputation to compute complex events (described as queries), as we do; and [2] proposes incremental computation of temporal queries. However,
these works have no concept of active monitoring. Finally, [20] also proposes a system for monitoring. It is also worth noting that such research, while containing many worthwhile ideas, has seen little practical use, possibly due to two concerns. First, even though some of it has been implemented in systems ([14, 21, 25, 17]), efficiency is not addressed in most approaches ([12, 24] are some exceptions); second, sophisticated logic-based languages, as proposed in the research literature ([8, 6, 23, 27]), are highly expressive, but probably outside the comfort zone of most programmers, and certainly of most users.

We take an approach different from previous research, based on the observation that analysts are usually interested in much higher-level events, which are application and goal oriented: in particular, they screen for conditions which deviate from normal or standard behavior, or for complex conditions which may involve several objects and their relationships. As a simplified example, assume a database with two relations, PEOPLE(name, country) and CALLS(called, caller, date), where we keep a list of suspicious people and their country of residence, as well as a list of telephone calls among them as intercepted by signal intelligence. Both called and caller are foreign keys referencing name. At some point, an analyst is following a suspected terrorist (let us call him 'X') and wants to know from which country he receives the most calls. The information can be easily obtained from the database (see below), but once it is obtained the analyst would like to follow up on this query by monitoring changes: in particular, the analyst may be interested in being alerted when the country from which 'X' receives the highest number of calls changes. Since sending an alert is an action that must be taken only under certain circumstances, a trigger is the obvious way to implement this functionality. However, the event of interest to the analyst (the moment when the country from which 'X' receives the most calls changes from the current one) cannot be expressed with trigger events, which are limited to checking for insertions, deletions and updates in relations. Note that insertions into CALLS are the only way in which the current top-calling country could change. Thus, one could simulate the desired trigger by using insertions into CALLS as events, and then computing the desired information. A simple SQL query can provide a list of the countries from which 'X' is called, ordered by the number of calls, so that the top-calling country is in the first row of the answer:

SELECT country, count(*) AS numcalls
FROM PEOPLE, CALLS
WHERE caller = name AND called = 'X'
GROUP BY country
ORDER BY numcalls DESC

There are, however, two problems with this approach: it is both conceptually hard and computationally inefficient. It is hard because the above still does not give us the answer: one should keep the name of the top-calling country in some table or variable and compare it with the name in the first row of the above query every time it is recomputed. Thus, quite a bit of programming is needed
to implement a relatively simple request. It is inefficient because the trigger is still fired for every table insertion, and therefore its complex condition (the query above) must be evaluated every time. A possible approach would be to use the above SQL query to define a view or table T, and declare the trigger over T. In some systems, views cannot have triggers and hence T needs to be a table. This is clearly undesirable, since T is, conceptually, a view (i.e., it needs to be updated whenever the tables it is based upon are updated). Even if the system allows triggers on views, there are several things that the analyst may be interested in, only some of which are expressible with regular triggers:

– Continuous monitoring, immediate reaction: this is what a trigger does. Every single change in T (insertion, deletion, update) fires the trigger; as soon as a change is detected, an action takes place. This gives us real-time monitoring and is certainly useful in certain situations. Note that some programming would still be necessary: because we are looking for changes to a situation, we need to store the current situation (which country is the top producer of calls) and compare it after every event with the new result.
– Continuous monitoring, delayed action: recheck the situation after every single change in T, but if the condition is found to be true, take action only at certain specified points in time. Delayed action is adequate for periodical reporting. This could be simulated with a trigger (storing changes in a temporary relation, for instance) at the cost of more programming.
– Periodical monitoring, immediate action: recheck the situation at certain specified periods (for instance, every month), and execute an action whenever a check detects a change. Note that this does not give us real time, since by the time the change is detected, the change itself may have taken place some time ago. It is adequate, though, when we need regular and constant monitoring of a situation but do not need to be immediately aware of every single change. Again, this could be simulated in some trigger systems, depending on what is allowed in the condition part, at the cost of quite a bit of programming.
– Periodical monitoring, delayed action: recheck the situation periodically as in the previous case and, if changes are detected, execute the action at certain specified periods. This could also be simulated in some trigger systems, depending on what exactly is allowed in the condition and action parts.

Note that all cases can be simulated with some trigger systems (a sketch of such a simulation appears below). Most systems allow arbitrary programs in the condition and action parts of a trigger; therefore, this is equivalent to writing a little program for each condition we want to monitor. As stated above, this is clearly inefficient because of both the human effort (programming) and the machine effort (trigger execution) involved. Clearly, a more flexible approach is needed.
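As an indication of the hand coding involved, the following is a minimal sketch of the first option (continuous monitoring, immediate reaction), written in PostgreSQL-flavored SQL as one concrete possibility; the tables CURRENT_TOP and ALERTS, the trigger and function names are hypothetical, and the trigger syntax will differ from system to system.

-- State table holding the country currently believed to be the top source
-- of calls to 'X'; it must hold exactly one row, seeded here with NULL.
CREATE TABLE CURRENT_TOP (country TEXT);
INSERT INTO CURRENT_TOP VALUES (NULL);

-- Hypothetical destination for alerts.
CREATE TABLE ALERTS (message TEXT, created TIMESTAMP);

-- Recompute the ranking query on every relevant insertion, compare its
-- first row with the stored state and record an alert when it changes.
-- The full recomputation on every event is exactly the inefficiency
-- discussed in the text.
CREATE FUNCTION check_top_country() RETURNS trigger AS $$
DECLARE
  new_top TEXT;
BEGIN
  SELECT p.country INTO new_top
  FROM PEOPLE p, CALLS c
  WHERE c.caller = p.name AND c.called = 'X'
  GROUP BY p.country
  ORDER BY count(*) DESC
  LIMIT 1;

  IF new_top IS DISTINCT FROM (SELECT country FROM CURRENT_TOP) THEN
    UPDATE CURRENT_TOP SET country = new_top;
    INSERT INTO ALERTS (message, created)
    VALUES ('top-calling country for X is now ' || new_top, now());
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER watch_top_country
AFTER INSERT ON CALLS
FOR EACH ROW
WHEN (NEW.called = 'X')
EXECUTE FUNCTION check_top_country();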
3 The Proposal
The aim of our approach is to overcome the limitations described in the previous section. We would like to develop an approach that provides analysts with the tools needed to monitor real-life, complex events in a conceptually simple and efficient manner. Consequently, the project has two parts: the development of languages and interpreters for extended triggers, and the design of algorithms to support efficient computation of the extension. Each part is discussed next.

3.1 Extended Triggers
In our proposal, we develop extended triggers, or triggers with extended events, which correspond to high-level, semantic properties. The extended events make it possible to monitor evolution and change in the data by providing a language in which to represent changes and complex conditions. We call this active monitoring. By using extended triggers, an analyst is able to state naturally and simply, in a declarative language, which activities, changes or states are noteworthy from the analyst's point of view.

Our extensions are based on several intuitions. First, the mismatch between currently allowed events in triggers (called database events) and the events we want to monitor (called semantic events) is due to a difference in levels: semantic events are high level, related to the application; database events are low level, related to the database (in our example, top-calling country vs. insertions into CALLS). Therefore, a mechanism is needed to bridge the gap, one that will express the semantic event in terms of database events. However, expressing the semantic event is not enough, since we are interested in monitoring changes in that event (in our example, changes in the top-calling country). Hence, a language in which to express changes is also needed.

Second, even if the previous mismatch did not exist, triggers are not adequate for the task of active monitoring described above, since this task requires knowing when to start, when to stop and how often to check. This information cannot be expressed in current triggers, which are essentially one-time actions: although the trigger is fired repeatedly as the event repeats, each firing is an isolated event, unrelated to others, unless a link or history is established by adequate programming of the trigger.

Finally, and as a result of the mismatch, many database events must happen before effecting a significant change in a semantic event (in our example, many calls may have to be inserted into CALLS before the top-calling country changes). This accumulation naturally happens over time and over the size of the database. Thus, it is inefficient to check for a condition after every database event; it is more efficient to do it periodically.

We propose a language which will establish: a) a certain environment or baseline in which to express semantic events; b) the changes to the baseline that the system can monitor; and c) an interval that determines how long and how often to monitor those changes. The baseline will be established by an SQL query that will specify the context in which changes must be examined. To establish the interval, a starting point, an end point and a frequency must be defined.
Our language supports interval definitions in two dimensions, time and size (of the database), as discussed above. The following specification is proposed (keywords are all in uppercase):

BASELINE <modification>
<modification> := IN
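Since the specification is cut off at this point, the following purely hypothetical example is offered only to illustrate how the ingredients named above (an SQL baseline query, a monitored change, and an interval with a start, an end and a frequency) might fit together; apart from BASELINE, every keyword as well as the procedure notify_analyst is invented and does not come from the paper.

-- Hypothetical extended trigger: the baseline is the ranking query from
-- Section 2, the monitored change is a change in its first row, and checks
-- run monthly for one year.
BASELINE
  SELECT country, count(*) AS numcalls
  FROM PEOPLE, CALLS
  WHERE caller = name AND called = 'X'
  GROUP BY country
  ORDER BY numcalls DESC
MONITOR CHANGE IN FIRST ROW
START NOW
END AFTER 12 MONTHS
FREQUENCY EVERY 1 MONTH
ACTION notify_analyst('top-calling country for X has changed')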