iterative, two-step process: first, it uses the hypotheses learned in each view to probabilistically label all the unlabeled examples; then it learns a new hypothesis in each view by training on the probabilistically labeled examples provided by the other view. By interleaving active and semi-supervised learning, Co-EMT creates a powerful synergy. On one hand, Co-Testing boosts Co-EM's performance by providing it with highly informative labeled examples (instead of random ones). On the other hand, Co-EM provides Co-Testing with more accurate classifiers (learned from both labeled and unlabeled data), thus allowing Co-Testing to make more informative queries. Co-EMT has not yet been applied to wrapper induction, because the existing algorithms are not probabilistic learners; however, an algorithm similar to Co-EMT was applied to information extraction from free text (Jones et al., 2003).

To illustrate how Co-EMT works, we now describe the generic algorithm Co-EMTWI, which combines Co-Testing with the semi-supervised wrapper induction algorithm described next. In order to perform semi-supervised wrapper induction, one can exploit a third view, which is used to evaluate the confidence of each extraction. This new content-based view (Muslea et al., 2003) describes the actual item to be extracted. For example, in the phone number extraction task, one can use the labeled examples to learn a simple grammar that describes the field content: (Number) Number - Number. Similarly, when extracting URLs, one can learn that a typical URL starts with the string "http://www.", ends with the string ".html", and contains no HTML tags.

Based on the forward, backward, and content-based views, one can implement the following semi-supervised wrapper induction algorithm. First, the small set of labeled examples is used to learn a hypothesis in each view. Then, the forward and backward views feed each other with unlabeled examples on which they make high-confidence extractions (i.e., strings that are extracted by either the forward or the backward rule and are also compliant with the grammar learned in the third, content-based view).

Given the previous Co-Testing and the semi-supervised learner, Co-EMTWI combines them as follows. First, the sets of labeled and unlabeled examples are used for semi-supervised learning. Second, the extraction rules that are learned in the previous step are used for Co-Testing. After making a query, the newly labeled example is added to the training set, and the whole process is repeated for a number of iterations. The empirical study in Muslea et al. (2002a) shows that, for a large variety of text classification tasks, Co-EMT outperforms both Co-Testing and the three state-of-the-art semi-supervised learners considered in that comparison.
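As a concrete, purely illustrative rendering of this interleaving, the following Python sketch alternates a Co-EM-style label exchange between two views with a Co-Testing-style query on an example where the views disagree. It uses scikit-learn Naive Bayes classifiers as stand-in view learners; the function and variable names and the specific control flow are assumptions for illustration, not the published Co-EMTWI procedure.

```python
# Illustrative sketch of a Co-EMT-style loop: Co-EM label exchange between
# two views, followed by a Co-Testing query on an example the views dispute.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_emt(X1, X2, y, labeled_idx, oracle, n_queries=10, em_rounds=5):
    labeled = set(labeled_idx)
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_queries):
        lab = sorted(labeled)
        h1.fit(X1[lab], y[lab])
        h2.fit(X2[lab], y[lab])
        unlab = [i for i in range(len(y)) if i not in labeled]
        if not unlab:
            break
        for _ in range(em_rounds):
            # Co-EM step: each view retrains on the labels proposed by the other view.
            y1 = h1.predict(X1[unlab])
            y2 = h2.predict(X2[unlab])
            h2.fit(np.vstack([X2[lab], X2[unlab]]), np.concatenate([y[lab], y1]))
            h1.fit(np.vstack([X1[lab], X1[unlab]]), np.concatenate([y[lab], y2]))
        # Co-Testing step: query an unlabeled example on which the two views disagree.
        disagree = [i for i in unlab
                    if h1.predict(X1[[i]])[0] != h2.predict(X2[[i]])[0]]
        if not disagree:
            break
        q = disagree[0]
        y[q] = oracle(q)      # ask the user for the true label of the queried example
        labeled.add(q)
    return h1, h2
```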
View Validation: Are the Views Adequate for Multi-View Learning?

The problem of view validation is defined as follows: given a new, unseen multi-view learning task, how does a user choose between solving it with a multi-view or a single-view algorithm? In other words, how does one know whether multi-view learning will outperform pooling all features together and applying a single-view learner? Note that this question must be answered while having access to just a few labeled and many unlabeled examples: applying both the single- and multi-view active learners and comparing their relative performances is a self-defeating strategy, because it doubles the amount of required labeled data (one must label the queries made by both algorithms).

The need for view validation is motivated by the following observation: while applying Co-Testing to dozens of extraction tasks, Muslea et al. (2002b) noticed that the forward and backward views are appropriate for most, but not all, of these learning tasks. This view adequacy issue is tightly related to the best extraction accuracy reachable in each view. Consider, for example, an extraction task in which the forward and backward views lead to a high- and a low-accuracy rule, respectively. Note that Co-Testing is not appropriate for solving such tasks; by definition, multi-view learning applies only to tasks in which each view is sufficient for learning the target concept (obviously, the low-accuracy view is insufficient for accurate extraction).

To cope with this problem, one can use Adaptive View Validation (Muslea et al., 2002b), which is a meta-learner that uses the experience acquired while solving past learning tasks to predict whether the views of a new, unseen task are adequate for multi-view learning. The view validation algorithm takes as input several solved extraction tasks that are labeled by the user as having views that are adequate or inadequate for multi-view learning. Then, it uses these solved extraction tasks to learn a classifier that, for new, unseen tasks, predicts whether the views are adequate for multi-view learning. The (meta-) features used for view validation are properties of the hypotheses that, for each solved task, are learned in each view (i.e., the percentage of unlabeled examples on which the rules extract the same string, the difference in the complexity of the forward and backward rules, the difference in the errors made on the training set, etc.). For both wrapper induction and text classification, Adaptive View Validation makes accurate predictions based on a modest amount of training data (Muslea et al., 2002b).
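To make the meta-learning step concrete, the sketch below builds one meta-feature vector per solved task, using the three hypothesis properties mentioned above, and trains a classifier that predicts view adequacy for a new task. This is a hypothetical Python fragment: the dictionary keys and the choice of a decision tree are illustrative assumptions, not the published configuration of Adaptive View Validation.

```python
# Hypothetical sketch: meta-features of solved tasks -> "are the views adequate?" classifier.
from sklearn.tree import DecisionTreeClassifier

def meta_features(task):
    """Each solved task is assumed to expose statistics about its per-view hypotheses."""
    return [
        task["agreement_rate"],   # % of unlabeled examples on which both rules extract the same string
        task["fwd_rule_complexity"] - task["bwd_rule_complexity"],
        task["fwd_train_error"] - task["bwd_train_error"],
    ]

def train_view_validator(solved_tasks, adequacy_labels):
    X = [meta_features(t) for t in solved_tasks]
    clf = DecisionTreeClassifier(max_depth=3)
    return clf.fit(X, adequacy_labels)

# For a new, unseen task:
#   train_view_validator(solved, labels).predict([meta_features(new_task)])
```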
FUTURE TRENDS

There are several major areas of future work in the field of multi-view learning. First, there is a need for a view detection algorithm that automatically partitions a domain's features into views that are adequate for multi-view learning. Such an algorithm would remove the last stumbling block against the wide applicability of multi-view learning (i.e., the requirement that the user provide the views to be used). Second, in order to reduce the computational costs of active learning (re-training after each query is CPU-intensive), one must consider look-ahead strategies that detect and propose (near) optimal sets of queries. Finally, Adaptive View Validation has the limitation that it must be trained separately for each application domain (e.g., once for wrapper induction, once for text classification, etc.). A major improvement would be a domain-independent view validation algorithm that, once trained on a mixture of tasks from various domains, can be applied to any new learning task, independently of its application domain.
CONCLUSION

In this article, we focus on three recent developments that, in the context of multi-view learning, reduce the need for labeled training data:

• Co-Testing: A general-purpose, multi-view active learner that outperforms existing approaches on a variety of real-world domains.
• Co-EMT: A multi-view learner that obtains a robust behavior over a wide spectrum of learning tasks by interleaving active and semi-supervised multi-view learning.
• Adaptive View Validation: A meta-learner that uses past experiences to predict whether multi-view learning is appropriate for a new, unseen learning task.
REFERENCES

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Conference on Computational Learning Theory (COLT-1998).

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. Empirical Methods in Natural Language Processing & Very Large Corpora (pp. 100-110).

Jones, R., Ghani, R., Mitchell, T., & Riloff, E. (2003). Active learning for information extraction with multiple view feature sets. Proceedings of the ECML-2003 Workshop on Adaptive Text Extraction and Mining.

Knoblock, C. et al. (2001). The Ariadne approach to Web-based information integration. International Journal of Cooperative Information Sources, 10, 145-169.

Muslea, I. (2002). Active learning with multiple views [doctoral thesis]. Los Angeles: Department of Computer Science, University of Southern California.

Muslea, I., Minton, S., & Knoblock, C. (2000). Selective sampling with redundant views. Proceedings of the National Conference on Artificial Intelligence (AAAI-2000).

Muslea, I., Minton, S., & Knoblock, C. (2001). Hierarchical wrapper induction for semi-structured sources. Journal of Autonomous Agents & Multi-Agent Systems, 4, 93-114.

Muslea, I., Minton, S., & Knoblock, C. (2002a). Active + semi-supervised learning = robust multi-view learning. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2002b). Adaptive view validation: A first step towards automatic view detection. Proceedings of the International Conference on Machine Learning (ICML-2002).

Muslea, I., Minton, S., & Knoblock, C. (2003). Active learning with strong and weak views: A case study on wrapper induction. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2003).

Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of the Conference on Information and Knowledge Management (CIKM-2000).

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 103-134.

Pierce, D., & Cardie, C. (2001). Limitations of co-training for natural language learning from large datasets. Empirical Methods in Natural Language Processing, 1-10.

Raskutti, B., Ferra, H., & Kowalczyk, A. (2002). Using unlabeled data for text classification through addition of cluster parameters. Proceedings of the International Conference on Machine Learning (ICML-2002).

Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45-66.
KEY TERMS

Active Learning: Detecting and asking the user to label only the most informative examples in the domain (rather than randomly chosen examples).

Inductive Learning: Acquiring concept descriptions from labeled examples.

Meta-Learning: Learning to predict the most appropriate algorithm for a particular task.

Multi-View Learning: Explicitly exploiting several disjoint sets of features, each of which is sufficient to learn the target concept.

Semi-Supervised Learning: Learning from both labeled and unlabeled data.

View Validation: Deciding whether a set of views is appropriate for multi-view learning.

Wrapper Induction: Learning (highly accurate) rules that extract data from a collection of documents that share a similar underlying structure.
Administering and Managing a Data Warehouse

James E. Yao, Montclair State University, USA
Chang Liu, Northern Illinois University, USA
Qiyang Chen, Montclair State University, USA
June Lu, University of Houston-Victoria, USA
INTRODUCTION

As internal and external demands for information from managers increase rapidly, especially for information that is processed to serve managers' specific needs, regular databases and decision support systems (DSS) cannot provide the information needed. Data warehouses came into existence to meet these needs, consolidating and integrating information from many internal and external sources and arranging it in a meaningful format for making accurate business decisions (Martin, 1997). In the past five years, there has been significant growth in data warehousing (Hoffer, Prescott, & McFadden, 2005). This growth, in turn, has raised the issue of data warehouse administration and management. Data warehousing has been increasingly recognized as an effective tool for organizations to transform data into useful information for strategic decision-making. To achieve competitive advantages via data warehousing, data warehouse management is crucial (Ma, Chou, & Yen, 2000).
BACKGROUND

Since the advent of computer storage technology and higher-level programming languages (Inmon, 2002), organizations, especially larger ones, have put an enormous amount of investment into their information system infrastructures. In a 2003 IT spending survey, 45% of the participating American companies indicated that their 2003 IT purchasing budgets had increased compared with their budgets in 2002. Among the respondents, database applications ranked top among the areas of technology being implemented or already implemented, with 42% indicating a recent implementation (Information, 2004). The fast growth of databases enables companies to capture and store a great deal of business operation data and other business-related data. The data stored in these databases, whether historical or operational, have come to be considered corporate resources and an asset that must be managed and used effectively to serve the corporate business for competitive advantages.

A database is a computer structure that houses a self-describing collection of related data (Kroenke, 2004; Rob & Coronel, 2004). This type of data is primitive, detailed, and used for day-to-day operation. The data in a warehouse is derived, meaning it is integrated, subject-oriented, time-variant, and nonvolatile (Inmon, 2002). A data warehouse is defined as an integrated decision support database whose content is derived from various operational databases (Hoffer, Prescott, & McFadden, 2005; Sen & Jacob, 1998). Often a data warehouse is referred to as a multidimensional database, because each occurrence of the subject is referenced by an occurrence of each of several dimensions or characteristics of the subject (Gillenson, 2005). Some multidimensional databases operate on a technological foundation optimal for "slicing and dicing" the data, where data can be thought of as existing in multidimensional cubes (Inmon, 2002). Regular databases load data into two-dimensional tables. A data warehouse can use OLAP (online analytical processing) to provide users with multidimensional views of their data, which can be visually represented as a cube for three dimensions (Senn, 2004).

Given the host of differences between a database for day-to-day operation and a data warehouse for supporting the management decision-making process, their administration and management also differ considerably. For instance, a data warehouse team requires someone who does routine data extraction, transformation, and loading (ETL) from operational
databases into data warehouse databases; the team therefore requires a technical role called ETL Specialist. On the other hand, a data warehouse is intended to support the business decision-making process, so someone like a business analyst is also needed to ensure that business information requirements are carried through into the data warehouse development. Data in the data warehouse can be very sensitive and cross functional areas, such as personal medical records and salary information. Therefore, a higher level of security on the data is needed, and encrypting the sensitive data in the data warehouse is a potential solution. Issues such as these in data warehouse administration and management need to be defined and discussed.
MAIN THRUST

Data warehouse administration and management covers a wide range of fields. This article focuses only on data warehouse and business strategy, data warehouse development life cycle, data warehouse team, process management, and security management to present the current concerns and issues in data warehouse administration and management.
Data Warehouse and Business Strategy

"Data is the blood of an organization. Without data, the corporation has no idea where it stands and where it will go" (Ferdinandi, 1999, p. xi). With data warehousing, today's corporations can collect and house large volumes of data. Does the size of the data volume alone guarantee success in business? Does having more data than your competitors give you more strategic advantages? Not necessarily. There is no predetermined formula that can turn your information into competitive advantages (Inmon, Terdeman, & Imhoff, 2000). Thus, top management and the data administration team are confronted with the question of how to convert corporate information into competitive advantages.

A well-managed data warehouse can assist a corporation in its strategy to gain competitive advantages. This can be achieved by using an exploration warehouse, which is a direct product of the data warehouse, to identify environmental factors, formulate strategic plans, and determine business-specific objectives:

• Identifying Environmental Factors: Quantified analysis can be used to identify a corporation's products and services, the market share of specific products and services, and financial management.
• Formulating Strategic Plans: Environmental factors can be matched against the strategic plan by identifying current market positioning, financial goals, and opportunities.
• Determining Specific Objectives: The exploration warehouse can be used to find patterns; if found, these patterns are then compared with patterns discovered previously to optimize corporate objectives (Inmon, Terdeman, & Imhoff, 2000).
While managing a data warehouse for business strategy, the differences between companies need to be taken into consideration. No one formula fits every organization, so avoid using so-called "templates" from other companies. The data warehouse is used for your company's competitive advantages, and you need to follow your own company's user information requirements for strategic advantages.
Data Warehouse Development Cycle

Data warehouse system development phases are similar to the phases in the systems development life cycle (SDLC) (Adelman & Rehm, 2003). However, Barker (1998) argues that there are some differences between the two due to the unique functional and operational features of a data warehouse. As business and information requirements change, new corporate information models evolve and are synthesized into the data warehouse in the Synthesis of Model phase. These models are then used to exploit the data warehouse in the Exploit phase. The data warehouse is updated with new data using appropriate updating strategies and linked to various data sources. Inmon (2002) sees system development for the data warehouse environment as almost exactly the opposite of the traditional SDLC. He argues that the traditional SDLC is concerned with and supports primarily the operational environment, whereas the data warehouse operates under a very different life cycle called the "CLDS" (the reverse of the SDLC). The CLDS is a classic data-driven development life cycle, whereas the SDLC is a classic requirements-driven development life cycle.
The Data Warehouse Team

Building a data warehouse is a large system development process. Participants in data warehouse development can range from a data warehouse administrator (DWA) (Hoffer, Prescott, & McFadden, 2005) to a business analyst (Ferdinandi, 1999). The data warehouse team is supposed to lead the organization into assuming their roles and thereby bring about a partnership with the business (McKnight, 2000). A data warehouse team may have the following roles (Barker, 1998; Ferdinandi, 1999; Inmon, 2000, 2003; McKnight, 2000):

• Data Warehouse Administrator (DWA): responsible for integrating and coordinating metadata and data across many different data sources, as well as data source management, physical database design, operation, backup and recovery, security, and performance and tuning.
• Manager/Director: responsible for the overall management of the entire team to ensure that the team follows the guiding principles, business requirements, and corporate strategic plans.
• Project Manager: responsible for data warehouse project development, including matching each team member's skills and aspirations to tasks on the project plan.
• Executive Sponsor: responsible for garnering and retaining adequate resources for the construction and maintenance of the data warehouse.
• Business Analyst: responsible for determining what information is required from a data warehouse to manage the business competitively.
• System Architect: responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware and software to the client desktop configurations.
• ETL Specialist: responsible for routine work on data extraction, transformation, and loading for the warehouse databases.
• Front End Developer: responsible for developing the front end, whether it is client-server or over the Web.
• OLAP Specialist: responsible for the development of data cubes, a multidimensional view of data in OLAP.
• Data Modeler: responsible for modeling the existing data in an organization into a schema that is appropriate for OLAP analysis.
• Trainer: responsible for training the end users to use the system so that they can benefit from the data warehouse system.
• End User: responsible for providing feedback to the data warehouse team.
In terms of the size of the data warehouse administrator team, Inmon (2003) offers several recommendations (a rough arithmetic reading of these rules is sketched after this list):

• a large warehouse requires more analysts;
• every 100 GB of data in a data warehouse requires another data warehouse administrator;
• a new data warehouse administrator is required for each year a data warehouse is up and running and being used successfully;
• if the ETL process is written manually, many data warehouse administrators are needed; if an automation tool is used, far fewer staff are required;
• an automated data warehouse database management system (DBMS) requires fewer data warehouse administrators; otherwise, more administrators are needed;
• fewer supporting staff are required if the corporate information factory (CIF) architecture is followed closely; conversely, more staff are needed.
McKnight (2000) suggests that all the technical roles be performed full-time by dedicated personnel and that each responsible person receive specific data warehouse training. Data warehousing is growing rapidly, and as the scope and data storage size of the data warehouse change, the roles and size of the data warehouse team should be adjusted accordingly. In general, extremes should be avoided: without sufficient professionals, the job may not be done satisfactorily; too many people, on the other hand, will leave the team overstaffed.
Process Management

Developing a data warehouse has become a popular but exceedingly demanding and costly activity in information systems development and management. Data warehouse vendors are competing intensively for their customers because so much of their money and prestige is at stake, and consulting vendors have redirected their attention toward this rapidly expanding market segment. User companies therefore face a serious question about which product they should buy. Sen & Jacob's (1998) advice is to first understand the process of data warehouse development before selecting the tools for its implementation. A data warehouse development process refers to the activities required to build a data warehouse (Barquin, 1997). Sen & Jacob (1998) and Ma, Chou, & Yen (2000) have identified some of these activities, which need to be managed during the data warehouse development cycle: initializing the project, establishing the technical environment, tool integration, determining scalability, developing an enterprise information architecture, designing the data warehouse database, data extraction/transformation, managing metadata, developing the end-user interface, managing the production environment, managing decision support tools and applications, and developing the warehouse roll-out.
As mentioned before, data warehouse development is a large system development process, and process management is not required in every step of the development process. Devlin (1997) states that process management is required in the following areas: the process schedule, which consists of a network of tasks and decision points; process map definition, which defines and maintains the network of tasks and decision points that make up a process; task initiation, which supports initiating tasks on all of the hardware/software platforms in the entire data warehouse environment; and status information enquiry, which enquires about the status of components that are running on all platforms.
Security Management

In recent years, information technology (IT) security has become one of the hottest and most important topics facing both users and providers (Senn, 2005). The goal of database security is the protection of data from accidental or intentional threats to its integrity and access (Hoffer, Prescott, & McFadden, 2005). The same is true for a data warehouse. However, higher security methods, in addition to common practices such as view-based control, integrity control, processing rights, and DBMS security, need to be used for the data warehouse because of the differences between a database and a data warehouse. One of the differences that demand a higher level of security for a data warehouse is the scope and detail level of the data in the data warehouse, such as financial transactions, personal medical records, and salary information.

A method that can be used to protect data requiring a high level of security in a data warehouse is encryption and decryption. Confidential and sensitive data can be stored in a separate set of tables to which only authorized users have access. These data can be encrypted while they are being written into the data warehouse. In this way, the data captured and stored in the data warehouse are secure and can only be accessed on an authorized basis.

Three levels of security can be offered by using encryption and decryption. The first level is that only authorized users can have access to the data in the data warehouse. Each group of users, internal or external, ranging from executives to information consumers, should be granted different rights for security reasons, and unauthorized users are totally prevented from seeing the data in the data warehouse. The second level is protection from unauthorized dumping and interpretation of data. Without the right key, an unauthorized user cannot write anything into the tables; nor can the existing data in the tables be decrypted. The third level is protection from unauthorized access during the transmission process. Even if unauthorized access occurs during transmission, there is no harm to the encrypted data unless the user has the decryption code (Ma, Chou, & Yen, 2000).
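To make this idea concrete, the fragment below is a minimal sketch of column-level encryption during loading and authorized-only decryption on reading. It uses Python with the third-party cryptography package; the column name, key handling, and the surrounding load logic are simplified assumptions for illustration, not a prescription for a particular warehouse product.

```python
# Minimal sketch: encrypt a sensitive column as rows are loaded into the warehouse.
# Key management, access control, and the actual load path are simplified assumptions.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, kept in a secure key store, never in code
cipher = Fernet(key)

def load_salary_row(row, warehouse_insert):
    """Encrypt the sensitive column before it reaches the restricted warehouse table."""
    row = dict(row)
    row["salary"] = cipher.encrypt(str(row["salary"]).encode())
    warehouse_insert(row)          # e.g., an INSERT into the separate, restricted table

def read_salary(row, user_is_authorized):
    """Only authorized users obtain the decrypted value."""
    if not user_is_authorized:
        raise PermissionError("not authorized to view salary data")
    return float(cipher.decrypt(row["salary"]).decode())
```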
FUTURE TRENDS

Data warehousing administration and management is facing several challenges as data warehousing becomes a mature part of the infrastructure of organizations. More legislative work is necessary to protect individual privacy from abuse by government or commercial entities that hold large volumes of data concerning those individuals. The protection also calls for tightened security through technology, as well as user efforts toward workable rules and regulations, while at the same time still granting a data warehouse the ability to process large datasets for meaningful analyses (Marakas, 2003).

Today's data warehouse is limited to the storage of structured data in the form of records, fields, and databases. Unstructured data, such as multimedia, maps, graphs, pictures, sound, and video files, are increasingly in demand in organizations. How to manage the storage and retrieval of unstructured data, and how to search for specific data items, poses a real challenge for data warehouse administration and management. Alternative storage, especially near-line storage (one of the two forms of alternative storage), is considered one of the best future solutions for managing the storage and retrieval of unstructured data in data warehouses (Marakas, 2003).

The past decade has seen a fast rise of the Internet and World Wide Web, and Web-enabled versions of all leading vendors' warehouse tools are becoming available (Moeller, 2001). This recent growth in Web use and advances in e-business applications have pushed the data warehouse from the back office, where it is accessed by only a few business analysts, to the front lines of the organization, where all employees and every customer can use it. To accommodate this move to the front line of the organization, the data warehouse demands massive scalability for data volume as well as for performance. As the number and types of users increase rapidly, enterprise data volume is doubling in size every 9 to 12 months, and around-the-clock access to the data warehouse is becoming the norm. The data warehouse will require fast implementation, continuous scalability, and ease of management (Marakas, 2003).

Additionally, building distributed warehouses, which are normally called data marts, will be on the rise. Other technical advances in data warehousing will include an increasing ability to exploit parallel processing, automated information delivery, greater support of object extensions, very large database support, and user-friendly Web-enabled analysis applications. These capabilities should make data warehouses of the future more powerful and easier to use, which will further increase the importance of data warehouse technology for business strategic decision making and competitive advantages (Ma, Chou, & Yen, 2000; Marakas, 2003; Pace University, 2004).
CONCLUSION
The data that organizations have captured and stored are considered organizational assets. Yet the data themselves cannot do anything until they are put to intelligent use. One way to accomplish this goal is to use data warehouse and data mining technology to transform corporate information into business competitive advantages. What impacts data warehouses the most are the Internet and Web technology: the Web browser will become the universal interface for corporations, allowing employees to browse their data warehouse worldwide on public and private networks and eliminating the need to replicate data across diverse geographic locations. Thus, strong data warehouse management sponsorship and an effective administration team may become crucial factors in providing an organization with the information services it needs.
REFERENCES

Adelman, S., & Rehm, C. (2003, November 5). What are the various phases in implementing a data warehouse solution? DMReview. Retrieved from http://www.dmreview.com/article_sub.cfm?articleId=7660

Barker, R. (1998, February). Managing a data warehouse. Chertsey, UK: Veritas Software Corporation.

Barquin, F. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall.

Devlin, B. (1997). Data warehouse: From architecture to implementation. Reading, MA: Addison-Wesley.

Ferdinandi, P.L. (1999). Data warehouse advice for managers. New York: AMACOM American Management Association.

Gillenson, M.L. (2005). Fundamentals of database management systems. New York: John Wiley & Sons Inc.

Hoffer, J.A., Prescott, M.B., & McFadden, F.R. (2005). Modern database management (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Information Technology Toolbox. (2004). 2003 IToolbox spending survey. Retrieved from http://datawarehouse.ittoolbox.com/research/survey.asp

Inmon, W.H. (2000). Building the data warehouse: Getting started. Retrieved from http://www.billinmon.com/library/whiteprs/earlywp/ttbuild.pdf

Inmon, W.H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons Inc.

Inmon, W.H. (2003). Data warehouse administration. Retrieved from http://www.billinmon.com/library/other/dwadmin.asp

Inmon, W.H., Terdeman, R.H., & Imhoff, C. (2000). Exploration warehousing. New York: John Wiley & Sons Inc.

Kroenke, D.M. (2004). Database processing: Fundamentals, design, and implementation (9th ed.). Upper Saddle River, NJ: Prentice Hall.

Ma, C., Chou, D.V., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100(3), 125-137.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice Hall.

Martin, J. (1997, September). New tools for decision making. DM Review, 7, 80.

McKnight Associates, Inc. (2000). Effective data warehouse organizational roles and responsibilities. Sunnyvale, CA.

Moeller, R.A. (2001). Distributed data warehousing using web technology: How to build a more cost-effective and flexible warehouse. New York: AMACOM American Management Association.

Pace University. (2004). Emerging technology. Retrieved from http://webcomposer.pace.edu/ea10931w/Tappert/Assignment2.htm

Post, G.V. (2005). Database management systems: Designing & building business applications (3rd ed.). New York: McGraw-Hill/Irwin.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Sen, A., & Jacob, V.S. (1998). Industrial strength data warehousing: Why process is so important and so often ignored. Communications of the ACM, 41(9), 29-31.

Senn, J.A. (2004). Information technology: Principles, practices, opportunities (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
KEY TERMS

Alternative Storage: An array of storage media that consists of two forms of storage: near-line storage and/or secondary storage.

"CLDS": The facetiously named system development life cycle (SDLC) for analytical, DSS systems. CLDS is so named because it is in fact the reverse of the classical SDLC.

Corporate Information Factory (CIF): A logical architecture whose purpose is to deliver business intelligence and business management capabilities driven by data provided from business operations.

Data Mart: A data warehouse that is limited in scope and facility, but for a restricted domain.

Database Management System (DBMS): A set of programs used to define, administer, and process the database and its applications.

Metadata: Data about data; data concerning the structure of data in a database, stored in the data dictionary.

Near-line Storage: Siloed tape storage in which siloed cartridges of tape are archived, accessed, and managed robotically.

Online Analytical Processing (OLAP): Decision support system (DSS) tools that use multidimensional data analysis techniques to provide users with multidimensional views of their data.

System Development Life Cycle (SDLC): The methodology used by most organizations for developing large information systems.
Agent-Based Mining of User Profiles for E-Services

Pasquale De Meo, Università "Mediterranea" di Reggio Calabria, Italy
Giovanni Quattrone, Università "Mediterranea" di Reggio Calabria, Italy
Giorgio Terracina, Università della Calabria, Italy
Domenico Ursino, Università "Mediterranea" di Reggio Calabria, Italy
INTRODUCTION

An electronic service (e-service) can be defined as a collection of network-resident software programs that collaborate for supporting users in both accessing and selecting data and services of their interest present in a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications. E-services are undoubtedly one of the engines presently supporting the Internet revolution (Hull, Benedikt, Christophides & Su, 2003). Indeed, nowadays, a large number and a great variety of providers offer their services also or exclusively via the Internet.
BACKGROUND

In spite of their spectacular development and present relevance, e-services are yet to be considered a stable technology, and various improvements could be made to them. Many of the present suggestions for improving them are based on the concept of adaptivity (i.e., the capability to make them more flexible in such a way as to adapt their offers and behavior to the environment in which they are operating). In this context, systems capable of constructing, maintaining, and exploiting profiles of users accessing e-services appear to be capable of playing a key role in the future. Both in the past and in the present, various e-service providers exploit (usually rough) user profiles for proposing personalized offers. However, in most cases, the profile construction methodology they adopt presents some problems. Indeed, it often requires a user to spend a certain amount of time constructing and updating the user's profile; in addition, it stores only information about the proposals that the user claims to be interested
in, without considering other ones somehow related to those just provided, possibly interesting the user in the future and what the user did not take into account in the past. In spite of present user profile managers, generally when accessing an e-service, a user must personally search the proposals of the user’s interest through it. As an example, consider the bookstore section of Amazon; whenever a customer looks for a book of interest, the customer must carry out an autonomous personal search of it throughout the pages of the site. We argue that, for improving the effectiveness of e-services, it is necessary to increase the interaction between the provider and the user on the one hand and to construct a rich profile of the user, taking into account the user’s desires, interests, and behavior, on the other hand. In addition, it is necessary to take into account a further important factor. Nowadays, electronic and telecommunications technology is rapidly evolving in such a way to allow cell phones, palmtops, and wireless PDAs to navigate on the Web. These mobile devices do not have the same display or bandwidth capabilities as their desktop counterparts; nonetheless, present e-service providers deliver the same content to all device typologies (Communications of the ACM, 2002). In the past, various approaches have been proposed for handling e-service activities; many of them are agent-based. For example: •
In Terziyan and Vitko (2002), an agent-based framework for managing commercial transactions between a buyer and a seller is proposed. It exploits a user profile that is handled by means of a content-based policy. In Garcia, Paternò, and Gil (2002), a multi-agent system called e-CoUSAL, capable of supporting Web-shop activities, is presented. Its activity is
based on the maintenance and the exploitation of user profiles. In Lau, Hofstede, and Bruza (2000), WEBS, an agent-based approach for supporting e-commerce activities, is proposed. It exploits probabilistic logic rules for allowing the customer preferences for other products to be deduced. Ardissono, et al. (2001) describe SETA, a multiagent system conceived for developing adaptive Web stores. SETA uses knowledge representation techniques to construct, maintain, and exploit user profiles. In Bradley and Smyth (2003), the system CASPER, for handling recruitment services, is proposed. Given a user, CASPER first ranks job advertisements according to an applicant’s desires and then recommends job proposals to the applicant on the basis of the applicant’s past behavior. In Razek, Frasson, and Kaltenbach (2002), a multiagent prototype for e-learning called CITS (Confidence Intelligent Tutoring Agent) is proposed. The approach of CITS aims at being adaptive and dynamic. In Shang, Shi, and Chen (2001), IDEAL (Intelligent Distributed Environment for Active Learning), a multi-agent system for active distance learning, is proposed. In IDEAL, course materials are decomposed into small components called lecturelets. These are XML documents containing JAVA code; they are dynamically assembled to cover course topics according to learner progress. In Zaiane (2002), an approach for exploiting Webmining techniques to build a software agent supporting e-learning activities is presented.
All these systems construct, maintain, and exploit a user profile; therefore, we can consider them adaptive w.r.t. the user; however, to the best of our knowledge, none of them is adaptive w.r.t. the device. On the other side, in various areas of computer science research, a large variety of approaches adapting their behavior to the device the user is exploiting has been proposed. As an example: •
In Anderson, Domingos, and Weld (2001), a framework called MINPATH, capable of simplifying the browsing activity of a mobile user and taking into account the device the user is exploiting, is presented. In Macskassy, Dayanik, and Hirsh (2000), a framework named i-Valets is proposed for allowing a user to visit an information source by using different devices.
•
Samaras and Panayiotou (2002) present a flexible agent-based system for providing wireless users with a personalized access to the Internet services. In Araniti, De Meo, Iera, and Ursino (2003), a novel XML-based multi-agent system for QoS management in wireless networks is presented.
These approaches are particularly general and interesting; however, to the best of our knowledge, none of them has been conceived for handling e-services.
MAIN THRUST Challenges to Face In order to overcome the problems outlined previously, some challenges must be tackled. First, a user can access many e-services, operating in the same or in different application contexts; a faithful and complete profile of the user can be constructed only by taking into account the user’s behavior while accessing all the sites. In other words, it should be possible to construct a unique structure on the user side, storing the user’s profile and, therefore, representing the user’s behavior while accessing all the sites. Second, for a given user and e-service provider, it should be possible to compare the profile of the user with the offers of the provider for extracting those proposals that probably will interest the user. Existing techniques for satisfying such a requirement are based mainly on the exploitation of either log files or cookies. Techniques based on log files can register only some information about the actions carried out by the user upon accessing an e-service; however, they cannot match user preferences and e-service proposals. Vice versa, techniques based on cookies are able to carry out a certain, even if primitive, match; however, they need to know and exploit some personal information that a user might consider private. Third, it should be necessary to overcome the typical one-size-fits-all philosophy of present e-service providers by developing systems capable of adapting their behavior to both the profile of the user and to the characteristics of the device the user is exploiting for accessing them (Communications of the ACM, 2002).
System Description The system we present in this article (called e-service adaptive manager [ESA-Manager]) aims at solving all
Agent-Based Mining of User Profiles for E-Services
three problems mentioned previously. It is an XMLbased multi-agent system for handling user accesses to e-services, capable of adapting its behavior to both user and device profiles. In ESA-Manager, a service provider agent is present for each e-service provider, handling the proposals stored therein as well as the interaction with the user. In addition, an agent is associated with each user, adapting its behavior to the profiles of both the user and the device the user is exploiting for visiting the sites. Actually, since a user can access e-service providers by means of different devices, the user’s profile cannot be stored in only one of them; as a matter of fact, it is necessary to have a unique copy of the user profile that registers the user’s behavior in visiting the e-service providers during the various sessions, possibly carried out by means of different devices. For this reason, the profile of a user must be handled and stored in a support different from the devices generally exploited by the user for accessing e-service providers. As a consequence, on the user side, the exploitation of a profile agent appears compulsory, storing the profiles of both involved users and devices, and a user-device agent, associated with a specific user operating by means of a specific device, supporting the user in his or her activities. As previously pointed out, for each user, a unique profile is mined and maintained, storing information about the user’s behavior in accessing all e-service providers1—the techniques for mining, maintaining, and exploiting user profiles are quite complex and slightly differ in the various applications domains; the interested reader can find examples of them, along with the corresponding validation issues, in De Meo, Rosaci, Sarnè, Terracina, and Ursino (2003) for e-commerce and in De Meo, Garro, Terracina, and Ursino (2003) for e-learning. In this way, ESA-Manager solves the first problem mentioned previously. Whenever a user accesses an e-service by means of a certain device, the corresponding service provider agent sends information about its proposals to the user device agent associated with the service provider agent and the device he or she is exploiting. The user device agent determines similarities between the proposals presented by the provider and the interests of the user. For each of these similarities, both the service provider agent and the user device agent cooperate for presenting to the user a group of Web pages adapted to the exploited device, illustrating the proposal. We argue that this behavior provides ESA-Manager with the capability of supporting the user in the search of proposals of the user’s interest offered by the provider. In addition, the algorithms underlying ESA-Manager allow it to identify not only the proposals probably interesting for the user in the present, but also other ones
possibly interesting for the user in the future and that the user disregarded to take into account in the past (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a specialization of these algorithms to e-commerce). In our opinion, this is a particularly interesting feature for a novel approach devoted to deal with e-services. Last, but not the least, it is worth observing that since the user profile management is carried out at the user side, no information about the user profile is sent to the e-service providers. In this way, ESA-Manager solves privacy problems left open by cookies. All the reasonings presented show that ESA-Manager is capable of solving also the second problem mentioned previously. In ESA-Manager, the device profile plays a central role. Indeed, the proposals of a provider shown to a user, as well as their presentation formats, depend on the characteristics of the device the user is presently exploiting. However, the ESA-Manager capability of adapting its behavior to the device the user is exploiting is not restricted to the presentation format of the proposals; indeed, the exploited device can influence also the computation of the interest degree shown by a user for the proposals presented by each provider. More specifically, one of the parameters that the interest degree associated with a proposal is based on, is the time the user spends visiting the corresponding Web pages. This time is not to be considered as an absolute measure, but it must be normalized w.r.t. both the characteristics of the exploited device and the navigation costs (Chan, 2000). The following example allows this intuition to be clarified. Assume that a user visits a Web page for two times and that each visit takes n seconds. Suppose, also, that during the first access, the user exploits a mobile phone having a low processor clock and supporting a connection characterized by a low bandwidth and a high cost. During the second visit, the user uses a personal computer having a high processor clock and supporting a connection characterized by a high bandwidth and a low cost. It is possible to argue that the interest the user exhibited for the page in the former access is greater than what the user exhibited in the latter one. Also, other device parameters influence the behavior of ESA-Manager (see De Meo, Rosaci, Sarnè, Terracina & Ursino [2003] for a detailed specification of the role of these parameters). This reasoning allows us to argue that ESA-Manager solves also the third problem mentioned previously. As already pointed out, many agents are simultaneously active in ESA-Manager; they strongly interact with each other and continuously exchange information. In this scenario, an efficient management of information exchange appears crucial. One of the most promising solutions to this problem has been the adop25
tion of XML. XML capabilities make it particularly suited to be exploited in the agent research. In ESA-Manager, the role of XML is central; indeed, (1) the agent ontologies are stored as XML documents; (2) the agent communication language is ACML; (3) the extraction of information from the various data structures is carried out by means of XQuery; and (4) the manipulation of agent ontologies is performed by means of the Document Object Model (DOM).
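The device-dependent normalization of visiting times described above can be illustrated with a small hypothetical sketch (Python; the parameter names, weights, and normalization formula are illustrative assumptions, not the actual ESA-Manager algorithm):

```python
# Hypothetical sketch of normalizing the time spent on a page by the
# characteristics of the device used, before using it as an interest signal.
# Weights and the formula are illustrative assumptions.
def normalized_interest(visit_seconds, bandwidth_kbps, cpu_mhz, cost_per_minute):
    # The same number of seconds suggests more interest on a slow, expensive
    # device than on a fast, cheap one, so scale the raw time accordingly.
    slowness = 1.0 / max(bandwidth_kbps, 1)   # slower connection -> higher weight
    weakness = 1.0 / max(cpu_mhz, 1)          # weaker processor  -> higher weight
    expense = cost_per_minute                 # costlier connection -> higher weight
    device_factor = 1.0 + 1000 * slowness + 100 * weakness + expense
    return visit_seconds * device_factor

# n seconds on a low-end mobile phone weigh more than n seconds on a desktop PC.
print(normalized_interest(60, bandwidth_kbps=40, cpu_mhz=100, cost_per_minute=0.5))
print(normalized_interest(60, bandwidth_kbps=10000, cpu_mhz=3000, cost_per_minute=0.0))
```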
FUTURE TRENDS The spectacular growth of the Internet during the last decade has strongly conditioned the e-service landscape. Such a growth is particularly surprising in some application domains, such as financial services or egovernment. As an example, the Internet technology has enabled the expansion of financial services by integrating the already existing, quite variegate financial data and services and by providing new channels for information delivery. For instance, in 2004, “the number of households in the U.S. that will use online banking is expected to exceed approximately 24 million, nearly double the number of households at the end of 2000.” Moreover, e-services are not a leading paradigm only in business contexts, but they are an emerging standard in several application domains. As an example, they are applied vigorously by governmental units at national, regional, and local levels around the world. Moreover, e-service technology is currently successfully exploited in some metropolitan networks for providing mediation tools in a democratic system in order to make citizen participation in rule- and decisionmaking processes more feasible and direct. These are only two examples of the role e-services can play in the e-government context. Handling and managing this technology in all these environments is one of the most challenging issues for present and future researchers.
CONCLUSION In this article, we have proposed ESA-Manager, an XMLbased and adaptive multi-agent system for supporting a user accessing an e-service provider in the search of proposals present therein and appearing to be appealing according to the user’s past interests and behavior. We have shown that ESA-Manager is adaptive w.r.t. the profile of both the user and the device the user is exploiting for accessing the e-service provider. Finally, we have seen that it is XML-based, since XML is ex-
ploited for both storing the agent ontologies and for handling the agent communication. As for future work, we argue that various improvements could be performed on ESA-Manager for bettering its effectiveness and completeness. As an example, it might be interesting to categorize involved users on the basis of their profiles, as well as involved providers on the basis of their proposals. As a further example of profitable features with which our system could be enriched, we consider extremely promising the derivation of association rules representing and predicting the user behavior on accessing one or more providers. Finally, ESA-Manager could be made even more adaptive by considering the possibility to adapt its behavior on the basis not only of the device a user is exploiting during a certain access, but also of the context (e.g., job, holidays) in which the user is currently operating.
REFERENCES

Adaptive Web. (2002). Communications of the ACM, 45(5).

Anderson, C.R., Domingos, P., & Weld, D.S. (2001). Adaptive Web navigation for wireless devices. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, Washington.

Araniti, G., De Meo, P., Iera, A., & Ursino, D. (2003). Adaptively controlling the QoS of multimedia wireless applications through "user-profiling" techniques. Journal of Selected Areas in Communications, 21(10), 1546-1556.

Ardissono, L. et al. (2001). Agent technologies for the development of adaptive Web stores. Agent Mediated Electronic Commerce, The European AgentLink Perspective (pp. 194-213). Lecture Notes in Computer Science, Springer.

Bradley, K., & Smyth, B. (2003). Personalized information ordering: A case study in online recruitment. Knowledge-Based Systems, 16(5-6), 269-275.

Chan, P.K. (2000). Constructing Web user profiles: A non-invasive learning approach. Web Usage Analysis and User Profiling, 39-55. Springer.

De Meo, P., Garro, A., Terracina, G., & Ursino, D. (2003). X-Learn: An XML-based, multi-agent system for supporting "user-device" adaptive e-learning. Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Taormina, Italy.

De Meo, P., Rosaci, D., Sarnè, G.M.L., Terracina, G., & Ursino, D. (2003). An XML-based adaptive multi-agent system for handling e-commerce activities. Proceedings of the International Conference on Web Services—Europe (ICWS-Europe '03), Erfurt, Germany.

Garcia, F.J., Paternò, F., & Gil, A.B. (2002). An adaptive e-commerce system definition. Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH'02), Malaga, Spain.

Hull, R., Benedikt, M., Christophides, V., & Su, J. (2003). E-services: A look behind the curtain. Proceedings of the Symposium on Principles of Database Systems (PODS 2003), San Diego, California.

Lau, R., Hofstede, A., & Bruza, P. (2000). Adaptive profiling agents for electronic commerce. Proceedings of the CollECTeR Conference on Electronic Commerce (CollECTeR 2000), Breckenridge, Colorado.

Macskassy, S.A., Dayanik, A.A., & Hirsh, H. (2000). Information valets for intelligent information access. Proceedings of the AAAI Spring Symposia Series on Adaptive User Interfaces (AUI-2000), Stanford, California.

Razek, M.A., Frasson, C., & Kaltenbach, M. (2002). Toward more effective intelligent distance learning environments. Proceedings of the International Conference on Machine Learning and Applications (ICMLA'02), Las Vegas, Nevada.

Samaras, G., & Panayiotou, C. (2002). Personalized portals for the wireless user based on mobile agents. Proceedings of the International Workshop on Mobile Commerce, Atlanta, Georgia.

Shang, Y., Shi, H., & Chen, S. (2001). An intelligent distributed environment for active learning. Proceedings of the ACM International Conference on World Wide Web (WWW 2001), Hong Kong.

Terziyan, V., & Vitko, O. (2002). Intelligent information management in mobile electronic commerce. Artificial Intelligence News, Journal of Russian Association of Artificial Intelligence, 5.

Zaiane, O.R. (2002). Building a recommender agent for e-learning systems. Proceedings of the International Conference on Computers in Education (ICCE 2002), Auckland, New Zealand.

KEY TERMS

ACML: The XML encoding of the Agent Communication Language defined by the Foundation for Intelligent Physical Agents (FIPA).

Adaptive System: A system adapting its behavior on the basis of the environment it is operating in.

Agent: A computational entity capable of both perceiving dynamic changes in the environment it is operating in and autonomously performing user-delegated tasks, possibly by communicating and cooperating with other similar entities.

Agent Ontology: A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.

Device Profile: A model of a device storing information about both its costs and capabilities.

E-Service: A collection of network-resident software programs that collaborate for supporting users in both accessing and selecting data and services of their interest handled by a provider site. Examples of e-services are e-commerce, e-learning, and e-government applications.

eXtensible Markup Language (XML): The novel language, standardized by the World Wide Web Consortium, for representing, handling, and exchanging information on the Web.

Multi-Agent System (MAS): A loosely coupled network of software agents that interact to solve problems that are beyond the individual capacities or knowledge of each of them. An MAS distributes computational resources and capabilities across a network of interconnected agents. The agent cooperation is handled by means of an Agent Communication Language.

User Modeling: The process of gathering information specific to each user, either explicitly or implicitly. This information is exploited in order to customize the content and the structure of a service to the user's specific and individual needs.

User Profile: A model of a user representing both the user's preferences and behavior.
ENDNOTE

1. It is worth pointing out that providers could be either homogeneous (i.e., all of them operate in the same application context, such as e-commerce) or heterogeneous (i.e., they operate in different application contexts).
Aggregate Query Rewriting in Multidimensional Databases

Leonardo Tininini
CNR - Istituto di Analisi dei Sistemi e Informatica "Antonio Ruberti," Italy
INTRODUCTION

An efficient query engine is certainly one of the most important components in data warehouses (also known as OLAP systems or multidimensional databases), and its efficiency is influenced by many other aspects, both logical (data model, policy of view materialization, etc.) and physical (multidimensional or relational storage, indexes, etc.). OLAP queries are typically based on the usual metaphor of the data cube and the concepts of facts, measures and dimensions and, in contrast to conventional transactional environments, they require the classification and aggregation of enormous quantities of data. Despite this, one of the fundamental requirements for these systems is the ability to perform multidimensional analyses in online response times. Since the evaluation from scratch of a typical OLAP aggregate query may require several hours of computation, this can only be achieved by pre-computing several queries, storing the answers permanently in the database and then reusing them in the query evaluation process. These precomputed queries are commonly referred to as materialized views, and the problem of evaluating a query by using (possibly only) these precomputed results is known as the problem of answering/rewriting queries using views. In this paper we briefly analyze the difference between the query answering and query rewriting approaches and why query rewriting is preferable in a data warehouse context. We also discuss the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.
BACKGROUND

Multidimensional data are obtained by applying aggregations and statistical functions to elementary data, or more precisely to data groups, each containing a subset of the data and homogeneous with respect to a given set of attributes. For example, the data "Average duration of calls in 2003 by region and call plan" is obtained from the so-called fact table, which is usually the product of complex source integration activities (Lenzerini, 2002) on the raw data corresponding to each phone call in that year. Several groups are defined, each consisting of the calls made in the same region and with the same call plan, and the average aggregation function is finally applied to the duration attribute of the data in each group. The pair of values (region, call plan) is used to identify each group and is associated with the corresponding average duration value. In multidimensional databases, the attributes used to group data define the dimensions, whereas the aggregate values define the measures.

The term multidimensional data comes from the well-known metaphor of the data cube (Gray, Bosworth, Layman, & Pirahesh, 1996). For each of the n attributes used to identify a single measure, a dimension of an n-dimensional space is considered. The possible values of the identifying attributes are mapped to points on the dimension's axis, and each point of this n-dimensional space is thus mapped to a single combination of the identifying attribute values and hence to a single aggregate value. The collection of all these points, along with all possible projections in lower dimensional spaces, constitutes the so-called data cube. In most cases, dimensions are structured in hierarchies, representing several granularity levels of the corresponding measures (Jagadish, Lakshmanan, & Srivastava, 1999). Hence a time dimension can be organized into days, months and years; a territorial dimension into towns, regions and countries; a product dimension into brands, families and types.

When querying multidimensional data, the user specifies the measures of interest and the level of detail required by indicating the desired hierarchy level for each dimension. In a multidimensional environment querying is often an exploratory process, where the user "moves" along the dimension hierarchies by increasing or reducing the granularity of the displayed data. The drill-down operation corresponds to an increase in detail, for example, by requesting the number of calls by region and month, starting from data on the number of calls by region or by region and year. Conversely, roll-up allows the user to view data at a coarser level of granularity (Agrawal, Gupta, & Sarawagi, 1997; Cabibbo & Torlone, 1997).

Multidimensional querying systems are commonly known as OLAP (Online Analytical Processing) systems, in contrast to conventional OLTP (Online Transactional Processing) systems. The two types have several contrasting
features, although they share the same requirement of fast "online" response times. In particular, one of the key differences between OLTP and OLAP queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary key or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data (Gupta, Harinarayan, & Quass, 1995), and fast response times are made possible by the extensive use of pre-computed queries, called materialized views (whose answers are stored permanently in the database), and by sophisticated techniques enabling the query engine to exploit these pre-computed results.
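To make the running example concrete, the aggregate data "Average duration of calls in 2003 by region and call plan" could be computed by a group-by query over the fact table. The sketch below assumes a hypothetical star schema with a fact table Calls and dimension tables Region_dim and Plan_dim; all table and column names are illustrative only and are not taken from this article.

   SELECT R.Region_name, P.Plan_name, AVG(F.Duration) AS Avg_duration
   FROM Calls F
        JOIN Region_dim R ON F.Region_ID = R.Region_ID
        JOIN Plan_dim   P ON F.Plan_ID   = P.Plan_ID
   WHERE F.Call_date BETWEEN DATE '2003-01-01' AND DATE '2003-12-31'
   GROUP BY R.Region_name, P.Plan_name

Evaluating such a query from scratch scans and aggregates every call record, which is why OLAP engines rely on materialized views of the kind discussed below.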
MAIN THRUST

The problem of evaluating the answer to a query by using pre-computed (materialized) views has been extensively studied in the literature and is generically denoted as answering queries using views (Levy, Mendelzon, Sagiv, & Srivastava, 1995; Halevy, 2001). The problem can be informally stated as follows: given a query Q and a collection of views V over the same schema s, is it possible to evaluate the answer to Q by using (only) the information provided by V? A more rigorous distinction has also been made between view-based query rewriting and query answering, corresponding to two distinct approaches to the general problem (Calvanese, De Giacomo, Lenzerini, & Vardi, 2000; Halevy, 2001). This is strictly related to the distinction between view definition and view extension, which is analogous to the standard distinction between schema and instance in the database literature. Broadly speaking, the view definition corresponds to the way the query is syntactically defined, for example the corresponding SQL expression, while its extension corresponds to the set of returned tuples, that is, the result obtained by evaluating the view on a specific database instance.

Query Answering vs. Query Rewriting

Query rewriting is based on the use of view definitions to produce a new rewritten query, expressed in terms of available view names and equivalent to the original. The answer can then be obtained by using the rewritten query and the view extensions (instances). Query answering, in contrast, is based on the exploitation of both view definitions and extensions and attempts to determine the best possible answer, possibly a subset of the exact answer, which can be extracted from the view extensions (Abiteboul & Duschka, 1998; Grahne & Mendelzon, 1999).

In general, query answering techniques are preferable in contexts where exact answers are unlikely to be obtained (e.g., integration of heterogeneous data sources, like Web sites) and response time requirements are not very stringent. However, as noted in Grahne & Mendelzon (1999), query answering methods can be extremely inefficient, as it is difficult or even impossible to process only the "useful" views and apply optimization techniques such as pushing selections and joins. As a consequence, the rewriting approach is more appropriate in contexts such as OLAP systems, where there is a very large amount of data and fast response times are required (Goldstein & Larson, 2001), and for query optimization, where different query plans need to be maintained in main memory and efficiently compared (Afrati, Li, & Ullman, 2001).

Rewriting and Answering: An Example

Consider a fact table Cens of elementary census data on the simplified schema (Census_tract_ID, Sex, Empl_status, Educ_status, Marital_status) and a collection of aggregate data representing the resident population by sex and marital status, stored in a materialized view on the schema V: (Sex, Marital_status, Pop_res). For simplicity, it is assumed that the dimensional tables are "collapsed" in the fact table Cens. A typical multidimensional query will be shown in the next section. The view V is computed by a simple count(*)-group-by query on the table Cens:

   CREATE VIEW V AS
   SELECT Sex, Marital_status, COUNT(*) AS Pop_res
   FROM Cens
   GROUP BY Sex, Marital_status

The query Q, expressed by

   SELECT Marital_status, COUNT(*)
   FROM Cens
   GROUP BY Marital_status

and corresponding to the resident population by marital status, can be computed without accessing the data in Cens, and be rewritten as follows:

   SELECT Marital_status, SUM(Pop_res)
   FROM V
   GROUP BY Marital_status

Note that the rewritten query can be obtained very efficiently by simple syntactic manipulations on Q and V, and its applicability does not depend on the records in V. Suppose now some subsets of (views on) Cens are available, corresponding to the employment statuses
students, employed and retired, called V_ST, V_EMP and V_RET respectively. For example, V_RET may be defined by:

   CREATE VIEW V_RET AS
   SELECT *
   FROM Cens
   WHERE Empl_status = 'retired'

It is evident that no rewriting can be obtained by using only the specified views, both because some individuals are not present in any of the views (e.g., young children, unemployed, housewives, etc.) and because some may be present in two views (a student may also be employed). However, a query answering technique tries to collect each useful accessible record and build the "best possible" answer, possibly by introducing approximations. By using the information on the census tract and a matching algorithm, most overlapping records may be determined and an estimate (lower bound) of the result obtained by summing the non-replicated contributions from the views. Obviously, this would require a considerable computation time, but it might be able to produce an approximated answer in a situation where rewriting techniques would produce no answer at all.
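As a rough sketch of the kind of best-effort answer a query answering technique might compute here, assume the three views are union-compatible selections of Cens (V_ST and V_EMP defined analogously to V_RET) and approximate the record-matching step simply by eliminating identical tuples; the resulting count is then a crude lower-bound estimate, not the exact answer.

   SELECT Marital_status, COUNT(*) AS Pop_res_estimate
   FROM ( SELECT * FROM V_ST
          UNION                -- UNION (not UNION ALL) discards duplicate tuples,
          SELECT * FROM V_EMP  -- a crude stand-in for the matching algorithm
          UNION
          SELECT * FROM V_RET ) AS Known_individuals
   GROUP BY Marital_status

A real query answering method would use more sophisticated matching (e.g., exploiting the census tract information) and possibly statistical corrections, at a correspondingly higher computational cost.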
Rewriting Aggregate Queries

A typical elementary multidimensional query is described by the join of the fact table with two or more dimension tables, to which an aggregate group-by query is applied (see the example query Q1 below). As a consequence, the rewriting of this form of query and view has been studied by many researchers.

   SELECT D1.dim1, D2.dim2, AGG(F.measure)
   FROM fact_table F, dim_table1 D1, dim_table2 D2
   WHERE F.dimKey1 = D1.dimKey1
     AND F.dimKey2 = D2.dimKey2
   GROUP BY D1.dim1, D2.dim2                          (Q1)

In Gupta, Harinarayan, & Quass (1995), an algorithm is proposed to rewrite conjunctive queries with aggregations using views of the same form. The technique is based on the concept of generalized projection (GP) and some transformation rules utilizable by an optimizer, which enable the query and views to be put in a particular normal form, based on GPSJ (Generalized Projection/Selection/Join) expressions. The query and views are analyzed in terms of their query trees, that is, the trees representing how to calculate them by applying selections, joins and generalized projections on the base relations. By using the transformation rules, the algorithm tries to produce a match between one or more view trees and subtrees (and
consequently to replace the calculations with access to the corresponding materialized views). The results are extended to NGPSJ (Nested GPSJ) expressions in Golfarelli & Rizzi (2000). In Srivastava, Dar, Jagadish, & Levy (1996) an algorithm is proposed to rewrite a single block (conjunctive) SQL query with GROUP BY and aggregations using various views of the same form. The aggregate functions considered are MIN, MAX, COUNT and SUM. The algorithm is based on the detection of homomorphisms from view to query, as in the non-aggregate context (Levy, Mendelzon, Sagiv, & Srivastava, 1995). However, it is shown that more restrictive conditions must be considered when dealing with aggregates, as the view has to produce not only the right tuples, but also their correct multiplicities. In Cohen, Nutt, & Serebrenik (1999, 2000) a somewhat different approach is proposed: the original query, usable views and rewritten query are all expressed by an extension of Datalog with aggregate functions (again COUNT, SUM, MIN and MAX) as query language. Queries and views are assumed to be conjunctive. Several candidates for rewriting of particular forms are considered and for each candidate, the views in its body are unfolded (i.e., replaced by their body in the view definition). Finally, the unfolded candidate is compared with the original query to verify equivalence by using known equivalence criteria for aggregate queries, particularly those proposed in Nutt, Sagiv, & Shurin (1998) for COUNT, SUM, MIN and MAX queries. The technique can be extended by using the equivalence criteria for AVG queries presented in Grumbach, Rafanelli, & Tininini (1999), based on the syntactic notion of isomorphism modulo a product. In query rewriting it is important to identify the views that may be actually useful in the rewriting process: this is often referred to as the view usability problem. In the non-aggregate context, it is shown (Levy, Mendelzon, Sagiv, & Srivastava, 1995) that a conjunctive view can be used to produce a conjunctive rewritten query if a homomorphism exists from the body of the view to that of the query. Grumbach, Rafanelli, & Tininini (1999) demonstrate that more restrictive (necessary and sufficient) conditions are needed for the usability of conjunctive count views for rewriting of conjunctive count queries, based on the concept of sound homomorphisms. It is also shown that in the presence of aggregations, it is not sufficient only to consider rewritten queries of conjunctive form: more complex forms may be required, particularly those based on the concept of isomorphism modulo a product. All rewriting algorithms proposed in the literature are based on trying to obtain a rewritten query with a particular form by using (possibly only) the available views. An
interesting question is: “Can I rewrite more by considering rewritten queries of more complex form?,” and the even more ambitious one, “Given a collection of views, is the information they provide sufficient to rewrite a query?” In Grumbach & Tininini (2003) the problem is investigated in a general framework based on the concept of query subsumption. Basically, the information content of a query is characterized by its distinguishing power, that is, by its ability to determine that two database instances are different. Hence a collection of views subsumes a query if it is able to distinguish any pair of instances also distinguishable by the query, and it is shown that a query rewriting using various views exists if the views subsume the query. In the particular case of count and sum queries defined over the same fact table, an algorithm is proposed which is demonstrated to be complete. In other words, even if the algorithm (as with any algorithm of practical use) considers rewritten queries of particular forms, it is shown that no improvement could be obtained by considering rewritten queries of more complex forms. Finally, in Grumbach & Tininini (2000) a completely different approach to the problem of aggregate rewriting is proposed. The technique is based on the idea of formally expressing the relationships (metadata) between raw and aggregate data and also among aggregate data of different types and/or levels of detail. Data is stored in standard relations, while the metadata are represented by numerical dependencies, namely Horn clauses formally expressing the semantics of the aggregate attributes. The mechanism is tested by transforming the numerical dependencies into Prolog rules and then exploiting the Prolog inference engine to produce the rewriting.
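To give the flavor of the rewritings such algorithms produce, consider again the call-data setting and a hypothetical materialized view storing pre-aggregated SUMs and COUNTs at a finer granularity (the names below are made up for illustration, and a Month attribute is assumed). An average at a coarser level cannot be obtained by averaging the stored averages, but it can be rewritten by re-deriving it from the summed SUMs and COUNTs, illustrating why AVG views alone are harder to reuse than SUM and COUNT views.

   CREATE VIEW V_Calls_by_month AS
   SELECT Region_ID, Plan_ID, Month,
          SUM(Duration) AS Sum_dur,
          COUNT(*)      AS Num_calls
   FROM Calls
   GROUP BY Region_ID, Plan_ID, Month

   -- "Average duration by region" rewritten using only the view
   SELECT Region_ID, SUM(Sum_dur) * 1.0 / SUM(Num_calls) AS Avg_duration
   FROM V_Calls_by_month
   GROUP BY Region_ID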
FUTURE TRENDS

Although query rewriting techniques are currently considered preferable to query answering in OLAP systems, the ever-increasing processing capabilities of modern computers may change the relevance of query answering techniques in the near future. Meanwhile, the limitations in the applicability of several rewriting algorithms show that a substantial effort is still needed, and important contributions may stem from results in other research areas such as logic programming and automated reasoning. In particular, aggregate query rewriting is strictly related to the problem of query equivalence for aggregate queries, and current equivalence criteria only apply to rather simple forms of query and do not consider, for example, the combination of conjunctive formulas with nested aggregations. Also, the results on view usability and query subsumption can be considered only preliminary, and it
would be interesting to study the property of completeness of known rewriting algorithms and to provide necessary and sufficient conditions for the usability of a view to rewrite a query, even when both the query and the view are aggregate and of non-trivial form (e.g., allowing disjunction and some limited form of negation).
CONCLUSION

This paper has discussed a fundamental issue related to multidimensional query evaluation, that is, how a multidimensional query expressed in a given language can be translated, using some available materialized views, into an (efficient) evaluation plan which retrieves the necessary information and calculates the required results. We have analyzed the difference between the query answering and query rewriting approaches and discussed the main techniques proposed in the literature to rewrite aggregate multidimensional queries using materialized views.
REFERENCES

Abiteboul, S., & Duschka, O.M. (1998). Complexity of answering queries using materialized views. In ACM Symposium on Principles of Database Systems (PODS'98) (pp. 254-263).

Afrati, F.N., Li, C., & Ullman, J.D. (2001). Generating efficient plans for queries using views. In ACM International Conference on Management of Data (SIGMOD'01) (pp. 319-330).

Agrawal, R., Gupta, A., & Sarawagi, S. (1997). Modeling multidimensional databases. In International Conference on Data Engineering (ICDE'97) (pp. 232-243).

Cabibbo, L., & Torlone, R. (1997). Querying multidimensional databases. In International Workshop on Database Programming Languages (DBPL'97) (pp. 319-335).

Calvanese, D., De Giacomo, G., Lenzerini, M., & Vardi, M.Y. (2000). What is view-based query rewriting? In International Workshop on Knowledge Representation meets Databases (KRDB'00) (pp. 17-27).

Cohen, S., Nutt, W., & Serebrenik, A. (1999). Rewriting aggregate queries using views. In ACM Symposium on Principles of Database Systems (PODS'99) (pp. 155-166).

Cohen, S., Nutt, W., & Serebrenik, A. (2000). Algorithms for rewriting aggregate queries using views. In ADBIS-DASFAA Conference 2000 (pp. 65-78).

Goldstein, J., & Larson, P. (2001). Optimizing queries using materialized views: A practical, scalable solution. In
ACM International Conference on Management of Data (SIGMOD’01) (pp. 331-342).
Golfarelli, M., & Rizzi, S. (2000). Comparing nested GPSJ queries in multidimensional databases. In Workshop on Data Warehousing and OLAP (DOLAP 2000) (pp. 65-71).

Grahne, G., & Mendelzon, A.O. (1999). Tableau techniques for querying information sources through global schemas. In International Conference on Database Theory (ICDT'99) (pp. 332-347).

Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE'96) (pp. 152-159).

Grumbach, S., Rafanelli, M., & Tininini, L. (1999). Querying aggregate data. In ACM Symposium on Principles of Database Systems (PODS'99) (pp. 174-184).

Grumbach, S., & Tininini, L. (2000). Automatic aggregation using explicit metadata. In International Conference on Scientific and Statistical Database Management (SSDBM'00) (pp. 85-94).

Grumbach, S., & Tininini, L. (2003). On the content of materialized aggregate views. Journal of Computer and System Sciences, 66(1), 133-168.

Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB'95) (pp. 358-369).

Halevy, A.Y. (2001). Answering queries using views. VLDB Journal, 10(4), 270-294.

Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB'99) (pp. 530-541).

Lenzerini, M. (2002). Data integration: A theoretical perspective. In ACM Symposium on Principles of Database Systems (PODS'02) (pp. 233-246).

Levy, A.Y., Mendelzon, A.O., Sagiv, Y., & Srivastava, D. (1995). Answering queries using views. In ACM Symposium on Principles of Database Systems (PODS'95) (pp. 95-104).

Nutt, W., Sagiv, Y., & Shurin, S. (1998). Deciding equivalences among aggregate queries. In ACM Symposium on Principles of Database Systems (PODS'98) (pp. 214-223).

Srivastava, D., Dar, S., Jagadish, H.V., & Levy, A.Y. (1996). Answering queries with aggregation using views. In International Conference on Very Large Data Bases (VLDB'96) (pp. 318-329).

KEY TERMS

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): Typical OLAP operation, by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact: A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.

Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries.

Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.

Query Answering: Process by which the (possibly approximate) answer to a given query is obtained by exploiting the stored answers and definitions of a collection of materialized views.

Query Rewriting: Process by which a source query is transformed into an equivalent one referring (almost exclusively) to a collection of materialized views. In multidimensional databases, query rewriting is fundamental in achieving acceptable (online) response times.
Aggregation for Predictive Modeling with Relational Data

Claudia Perlich
IBM Research, USA

Foster Provost
New York University, USA
INTRODUCTION

Most data mining and modeling techniques have been developed for data represented as a single table, where every row is a feature vector that captures the characteristics of an observation. However, data in most domains are not of this form and consist of multiple tables with several types of entities. Such relational data are ubiquitous, both because of the large number of multi-table relational databases kept by businesses and government organizations, and because of the natural, linked nature of people, organizations, computers, and so on. Relational data pose new challenges for modeling and data mining, including the exploration of related entities and the aggregation of information from multi-sets ("bags") of related entities.
BACKGROUND

Relational learning differs from traditional feature-vector learning both in the complexity of the data representation and in the complexity of the models. The relational nature of a domain manifests itself in two ways: (1) entities are not limited to a single type, and (2) entities are related to other entities. Relational learning allows the incorporation of knowledge from entities in multiple tables, including relationships between objects of varying cardinality. Thus, in order to succeed, relational learners have to be able to identify related objects and to aggregate information from bags of related objects into a final prediction. Traditionally, the analysis of relational data has involved the manual construction by a human expert of attributes (e.g., the number of purchases of a customer during the last three months) that together will form a feature vector. Automated analysis of relational data is becoming increasingly important as the number and complexity of databases increase. Early research on automated relational learning was dominated by Inductive Logic Programming (Muggleton, 1992), where the classification model is a set of first-order-logic clauses
and the information aggregation is based on existential unification. More recent relational learning approaches include distance-based methods (Kirsten et al., 2001), propositionalization (Kramer et al., 2001; Knobbe et al., 2001; Krogel et al., 2003), and upgrades of propositional learners such as Naïve Bayes (Neville et al., 2003), Logistic Regression (Popescul et al., 2002), Decision Trees (Jensen & Neville, 2002) and Bayesian Networks (Koller & Pfeffer, 1998). Similar to manual feature construction, both upgrades and propositionalization use Boolean conditions and common aggregates like min, max, or sum to transform either explicitly (propositionalization) or implicitly (upgrades) the original relational domain into a traditional feature-vector representation. Recent work by Knobbe et al. (2001) and Krogel & Wrobel (2001) recognizes the essential role of aggregation in all relational modeling and focuses specifically on the effect of aggregation choices and parameters. Krogel & Wrobel (2003) present one of the few empirical comparisons of aggregation in propositionalization approaches (however with inconclusive results). Perlich & Provost (2003) show that the choice of aggregation operator can have a much stronger impact on the resultant model's generalization performance than the choice of the model induction method (decision trees or logistic regression, in their study).
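As a simple illustration of the kind of transformation these propositionalization approaches perform, the bag of transactions related to each customer can be collapsed into a fixed-length feature vector with an aggregate query. The sketch below uses hypothetical tables Customers(Customer_ID, ...) and Transactions(Customer_ID, Product, Type, Price); the last feature also combines a Boolean condition with an aggregate, in the spirit of the systems discussed next.

   SELECT c.Customer_ID,
          COUNT(t.Product)                       AS Num_purchases,
          SUM(t.Price)                           AS Total_spent,
          MAX(t.Price)                           AS Max_price,
          SUM(CASE WHEN t.Type = 'BOOK' AND t.Price > 20
                   THEN 1 ELSE 0 END)            AS Num_expensive_books
   FROM Customers c
        LEFT JOIN Transactions t ON t.Customer_ID = c.Customer_ID
   GROUP BY c.Customer_ID

Each row of the result is a conventional feature vector that a propositional learner can consume directly.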
MAIN THRUST

For illustration, imagine a direct marketing task where the objective is to identify customers who would respond to a special offer. Available are demographic information and all previous purchase transactions, which include PRODUCT, TYPE and PRICE. In order to take advantage of these transactions, information has to be aggregated. The choice of the aggregation operator is crucial, since aggregation invariably involves loss of (potentially discriminative) information. Typical aggregation operators like min, max and sum can only be applied to sets of numeric values, not to
objects (an exception being count). It is therefore necessary to assume class-conditional independence and aggregate the attributes independently, which limits the expressive power of the model. Perlich & Provost (2003) discuss in detail the implications of various assumptions and aggregation choices on the expressive power of resulting classification models. For example, customers who buy mostly expensive books cannot be identified if price and type are aggregated separately. In contrast, ILP methods do not assume independence and can express an expensive book (TYPE="BOOK" and PRICE>20); however, aggregation through existential unification can only capture whether a customer bought at least one expensive book, not whether he has bought primarily expensive books. Only two systems, POLKA (Knobbe et al., 2001) and RELAGGS (Krogel & Wrobel, 2001), combine Boolean conditions and numeric aggregates to increase the expressive power of the model. Another challenge is posed by categorical attributes with many possible values, such as ISBN numbers of books. Categorical attributes are commonly aggregated using mode (the most common value) or the count for all values if the number of different values is small. These approaches would be ineffective for ISBN: it has many possible values and the mode is not meaningful since customers usually buy only one copy of each book. Many relational domains include categorical attributes of this type. One common class of such domains involves networked data, where most of the information is captured by the relationships between objects, possibly without any further attributes. The identity of an entity (e.g., Bill Gates) in social, scientific, and economic networks may play a much more important role than any of its attributes (e.g., age or gender). Identifiers such as name, ISBN, or SSN are categorical attributes with excessively many possible values that cannot be accounted for by either mode or count. Perlich and Provost (2003) present a new multi-step aggregation methodology based on class-conditional distributions that shows promising performance on networked
data with identifier attributes. As Knobbe et al. (1999) point out, traditional aggregation operators like min, max, and count are based on histograms. A histogram itself is a crude approximation of the underlying distribution. Rather than estimating one distribution for every bag of attributes, as done by traditional aggregation operators, this new aggregation approach estimates in a first step only one distribution for each class, by combining all bags of objects for the same class. The combination of bags of related objects results in much better estimates of the distribution, since it uses many more observations. The number of parameters differs across distributions: for a normal distribution only two parameters are required, mean and variance, whereas distributions of categorical attributes have as many parameters as possible attribute values. In a second step, the bags of attributes of related objects are aggregated through vector distances (e.g., Euclidean, Cosine, Likelihood) between a normalized vector representation of the bag and the two class-conditional distributions.

Imagine the following example of a document classification domain with two tables (Document and Author) shown in Figure 1.

Figure 1. Example domain with two tables that are linked through Paper ID

   Document Table               Author Table
   Paper ID   Class             Paper ID   Author Name
   P1         0                 P1         A
   P2         1                 P2         B
   P3         1                 P2         A
   P4         0                 P3         B
                                P3         A
                                P3         C
                                P4         C

The first aggregation step estimates the class-conditional distributions DClass n of authors from the Author table. Under the alphabetical ordering of position:value pairs, 1:A, 2:B, and 3:C, the value for DClass n at position k is defined as:

   DClass n[k] = (number of occurrences of author k in the set of authors related to documents of class n)
                 / (number of authors related to documents of class n)

The resulting estimates of the class-conditional distributions for our example are given by:

   DClass 0 = [0.5  0  0.5]   and   DClass 1 = [0.4  0.4  0.2]

The second aggregation step is the representation of every document as a vector:

   DPn[k] = (number of occurrences of author k related to document Pn)
            / (number of authors related to document Pn)

The vector representations for the above examples are DP1 = [1 0 0], DP2 = [0.5 0.5 0], DP3 = [0.33 0.33 0.33], and DP4 = [0 0 1]. The third aggregation step calculates vector distances (e.g., cosine) between the class-conditional distributions and the documents DP1, ..., DP4. The new Document table with the additional cosine features is shown in Figure 2. In this simple example, the distance from DClass 0 separates the examples perfectly; the distance from DClass 1 does not.
Figure 2. Extended Document table with new cosine features added

   Paper ID   Class   Cosine(Pn, DClass 1)   Cosine(Pn, DClass 0)
   P1         0       0.667                  0.707
   P2         1       0.707                  0.5
   P3         1       0.962                  0.816
   P4         0       0.333                  0.707
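The three aggregation steps of this example can be sketched directly in SQL over the two tables of Figure 1, assumed here to be stored as Document(Paper_ID, Class) and Author(Paper_ID, Author_Name). All view names are invented for illustration, and the queries follow the definitions given above rather than any particular implementation.

   -- Step 1: class-conditional author distributions, one weight per (Class, Author_Name)
   CREATE VIEW Class_Dist AS
   SELECT d.Class, a.Author_Name, COUNT(*) * 1.0 / t.N_Authors AS Weight
   FROM Document d
        JOIN Author a ON a.Paper_ID = d.Paper_ID
        JOIN (SELECT d2.Class, COUNT(*) AS N_Authors
              FROM Document d2 JOIN Author a2 ON a2.Paper_ID = d2.Paper_ID
              GROUP BY d2.Class) t ON t.Class = d.Class
   GROUP BY d.Class, a.Author_Name, t.N_Authors

   -- Step 2: normalized vector representation of each document
   CREATE VIEW Doc_Dist AS
   SELECT a.Paper_ID, a.Author_Name, COUNT(*) * 1.0 / t.N_Authors AS Weight
   FROM Author a
        JOIN (SELECT Paper_ID, COUNT(*) AS N_Authors
              FROM Author GROUP BY Paper_ID) t ON t.Paper_ID = a.Paper_ID
   GROUP BY a.Paper_ID, a.Author_Name, t.N_Authors

   -- Step 3: cosine features between each document vector and each class distribution;
   -- pairs with no common author simply produce no row (cosine 0)
   SELECT p.Paper_ID, c.Class,
          SUM(p.Weight * c.Weight) / (pn.Norm * cn.Norm) AS Cosine
   FROM Doc_Dist p
        JOIN Class_Dist c ON c.Author_Name = p.Author_Name
        JOIN (SELECT Paper_ID, SQRT(SUM(Weight * Weight)) AS Norm
              FROM Doc_Dist GROUP BY Paper_ID) pn ON pn.Paper_ID = p.Paper_ID
        JOIN (SELECT Class, SQRT(SUM(Weight * Weight)) AS Norm
              FROM Class_Dist GROUP BY Class) cn ON cn.Class = c.Class
   GROUP BY p.Paper_ID, c.Class, pn.Norm, cn.Norm

On the data of Figure 1, Class_Dist reproduces DClass 0 = [0.5 0 0.5] and DClass 1 = [0.4 0.4 0.2], and the final query produces one cosine feature per (document, class) pair, as in Figure 2.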
By taking advantage of DClass 1 and DClass 0, another new aggregation approach becomes possible. Rather than constructing counts for all distinct values (impossible for high-dimensional categorical attributes), one can select a small subset of values where the absolute difference between the entries in DClass 0 and DClass 1 is maximal. This method would identify author B as the most discriminative. These new features, constructed from class-conditional distributions, show superior classification performance on a variety of relational domains (Perlich & Provost, 2003, 2004). Table 1 summarizes the relative out-of-sample performances (averaged over 10 experiments, with standard deviations in parentheses), as presented in Perlich (2003), on the CORA document classification task (McCallum et al., 2000) for 400 training examples. The data set includes information about the authorship, citations, and the full text. This example also demonstrates the opportunities arising from the ability of relational models to take advantage of additional background information, such as citations and authorship, over simple text classification. The comparison includes, in addition to the two distribution-based feature construction approaches (1 and 2), which use logistic regression for model induction: 3) a Naïve Bayes classifier using the full text, learned by the Rainbow (McCallum, 1996) system; 4) a Probabilistic Relational Model (Koller & Pfeffer, 1998) using traditional aggregates on both text and citation/authorship, with the results reported by Taskar et al. (2001); and 5) a Simple Relational Classifier (Macskassy & Provost, 2003) that uses only the known class labels of related (e.g., cited) documents. It is important to observe that traditional aggregation operators such as mode for
high-dimensional categorical fields (author names and document identifiers) are not applicable.

Table 1. Comparative classification performance (used information in parentheses; accuracy with standard deviation)

   1) Class-Conditional Distributions (Authorship & Citations): 0.78 (0.01)
   2) Class-Conditional Distributions and Most Discriminative Counts (Authorship & Citations): 0.81 (0.01)
   3) Naïve Bayes Classifier using Rainbow (Text): 0.74 (0.03)
   4) Probabilistic Relational Model (Text, Authorship & Citations): 0.74 (0.01)
   5) Simple Relational Model (Related Class Labels): 0.68 (0.01)

The generalization performance of the new aggregation approach is related to a number of properties that are of particular relevance and advantage for predictive modeling:

• Dimensionality Reduction: The use of distances compresses the high-dimensional space of possible categorical values into a small set of dimensions, one for each class and distance metric. In particular, this allows the aggregation of object identifiers.

• Preservation of Discriminative Information: Changing the class labels of the target objects will change the values of the aggregates. The loss of discriminative information is lower since the class-conditional distributions capture significant differences.

• Domain Independence: The density estimation does not require any prior knowledge about the application domain and therefore is suitable for a variety of domains.

• Applicability to Numeric Attributes: The approach is not limited to categorical values but can also be applied to numeric attributes after discretization. Note that using traditional aggregation through mean and variance implicitly assumes a normal distribution, whereas this aggregation makes no prior distributional assumptions and can capture arbitrary numeric distributions.

• Monotonic Relationship: The use of distances to class-conditional densities constructs numerical features that are monotonic in the probability of class membership. This makes logistic regression a natural choice for the model induction step.

• Aggregation of Identifiers: By using object identifiers such as names, it can overcome some of the limitations of the independence assumptions and even allow learning from unobserved object properties (Perlich & Provost, 2004). The identifier represents the full information of the object and in
particular the joint distribution of all other attributes and even further unknown properties.

• Task-Specific Feature Construction: The advantages outlined above are possible through the use of the target value during feature construction. This practice requires splitting the training set into two separate portions for 1) the class-conditional density estimation and feature construction and 2) the estimation of the classification model.

To summarize, most relational modeling has limited itself to a small set of existing aggregation operators. The recognition of this limited expressive power motivated the combination of Boolean conditioning and aggregation, and the development of new aggregation methodologies that are specifically designed for predictive relational modeling.

FUTURE TRENDS

Computer-based analysis of relational data is becoming increasingly necessary as the size and complexity of databases grow. Many important tasks, including counter-terrorism (Tang et al., 2003), social and economic network analysis (Jensen & Neville, 2002), document classification (Perlich, 2003), customer relationship management, personalization, fraud detection (Fawcett & Provost, 1997), and genetics [e.g., see the overview by Džeroski (2001)], used to be approached with special-purpose algorithms, but now are recognized as inherently relational. These application domains both profit from and contribute to research in relational modeling in general and aggregation for feature construction in particular. In order to accommodate such a variety of domains, new aggregators must be developed. In particular, it is necessary to account for domain-specific dependencies between attributes and entities that are currently ignored. One common type of such dependency is the temporal order of events, which is important for the discovery of causal relationships.

Aggregation as a research topic poses the opportunity for significant theoretical contributions. There is little theoretical work on relational model estimation outside of first-order logic. In contrast to a large body of work in mathematics on the estimation of functional dependencies that map well-defined input spaces to output spaces, aggregation operators have not been investigated nearly as thoroughly. Model estimation tasks are usually framed as search over a structured (either in terms of parameters or increasing complexity) space of possible solutions. But the structuring of a search space of aggregation operators remains an open question.

The potential complexity of relational models and the resulting computational complexity of relational modeling remain an obstacle to real-time applications. This limitation has spawned work on efficiency improvements (Yin et al., 2003; Tang et al., 2003) and will remain an important task.

CONCLUSION

Relational modeling is a burgeoning topic within machine learning research, and is commonly applicable in real-world domains. Many domains collect large amounts of transaction and interaction data, but so far lack a reliable and automated mechanism for model estimation to support decision-making. Relational modeling with appropriate aggregation methods has the potential to fill this gap and allow the seamless integration of model estimation on top of existing relational databases, relieving the analyst from the manual, time-consuming, and omission-prone task of feature construction.
REFERENCES

Džeroski, S. (2001). Relational data mining applications: An overview. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 339-364). Berlin: Springer Verlag.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, (1).

Jensen, D., & Neville, J. (2002). Data mining in social networks. In R. Breiger, K. Carley, & P. Pattison (Eds.), Dynamic social networks modeling and analysis (pp. 287-302). The National Academies Press.

Kirsten, M., Wrobel, S., & Horvath, T. (2001). Distance based approaches to relational learning and clustering. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 213-234). Berlin: Springer Verlag.

Knobbe, A.J., de Haas, M., & Siebes, A. (2001). Propositionalisation and aggregates. In L. De Raedt & A. Siebes (Eds.), Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (LNAI 2168) (pp. 277-288). Berlin: Springer Verlag.

Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. In Proceedings of Fifteenth/Tenth Conference on Artificial Intelligence/Innovative Application of Artificial Intelligence (pp. 580-587). American Association for Artificial Intelligence.
Kramer, S., Lavrač, N., & Flach, P. (2001). Propositionalization approaches to relational data mining. In S. Džeroski & N. Lavrač (Eds.), Relational data mining (pp. 262-291). Berlin: Springer Verlag.

Krogel, M.A., Rawles, S., Zelezny, F., Flach, P.A., Lavrac, N., & Wrobel, S. (2003). Comparative evaluation of approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the 13th International Conference on Inductive Logic Programming (LNAI 2835) (pp. 197-214). Berlin: Springer-Verlag.

Krogel, M.A., & Wrobel, S. (2001). Transformation-based learning using multirelational aggregation. In C. Rouveirol & M. Sebag (Eds.), Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP) (LNAI 2157) (pp. 142-155). Berlin: Springer Verlag.

Krogel, M.A., & Wrobel, S. (2003). Facets of aggregation approaches to propositionalization. In T. Horváth & A. Yamamoto (Eds.), Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming (pp. 30-39).

Macskassy, S.A., & Provost, F. (2003). A simple relational classifier. In Proceedings of the Workshop on Multi-Relational Data Mining at SIGKDD-2003.

McCallum, A.K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Retrieved from http://www.cs.cmu.edu/~mccallum/bow

McCallum, A.K., Nigam, K., Rennie, J., & Seymore, K. (2000). Automating the construction of Internet portals with machine learning. Information Retrieval, 3(2), 127-163.

Muggleton, S. (Ed.). (1992). Inductive logic programming. London: Academic Press.

Neville, J., Jensen, D., & Gallagher, B. (2003). Simple estimators for relational Bayesian classifiers. In Proceedings of the Third IEEE International Conference on Data Mining (pp. 609-612).

Perlich, C. (2003). Citation-based document classification. In Proceedings of the Workshop on Information Technology and Systems (WITS).

Perlich, C., & Provost, F. (2003). Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Perlich, C., & Provost, F. (2004). ACORA: Distribution-based aggregation for relational learning from identifier attributes. Working Paper CeDER-04-04, Stern School of Business.

Popescul, L., Ungar, H., Lawrence, S., & Pennock, D.M. (2002). Structural logistic regression: Combining relational and statistical learning. In Proceedings of the Workshop on Multi-Relational Data Mining.

Tang, L.R., Mooney, R.J., & Melville, P. (2003). Scaling up ILP to large examples: Results on link discovery for counter-terrorism. In Proceedings of the Workshop on Multi-Relational Data Mining (pp. 107-121).

Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 870-878).

Yin, X., Han, J., & Yang, J. (2003). Efficient multirelational classification by tuple ID propagation. In Proceedings of the Workshop on Multi-Relational Data Mining.
KEY TERMS

Aggregation: Also commonly called a summary, an aggregation is the calculation of a value from a bag or (multi)set of entities. Typical aggregations are sum, count, and average.

Class-Conditional Independence: Property of a multivariate distribution with a categorical class variable c and a set of other variables (e.g., x and y). The probability of observing a combination of variable values given the class label is equal to the product of the probabilities of each variable value given the class: P(x,y|c) = P(x|c)*P(y|c).

Discretization: Conversion of a numeric variable into a categorical variable, usually through binning. The entire range of the numeric values is split into a number of bins. The numeric value of the attribute is replaced by the identifier of the bin into which it falls.

Inductive Logic Programming: A field of research at the intersection of logic programming and inductive machine learning, drawing ideas and methods from both disciplines. The objective of ILP methods is the inductive construction of first-order Horn clauses from a set of examples and background knowledge in relational form.

Propositionalization: The process of transforming a multi-relational dataset, containing structured examples, into a propositional data set (one table) with derived attribute-value features, describing the structural properties of the example.
Relational Data: Data where the original information cannot be represented in a single table but requires two or more tables in a relational database. Every table can either capture the characteristics of entities of a particular type (e.g., person or product) or relationships between entities (e.g., person bought product).
Relational Learning: Learning in relational domains that include information from multiple tables, not based on manual feature construction.

Target Objects: Objects in a particular target table for which a prediction is to be made. Other objects reside in additional "background" tables, but are not the focus of the prediction task.
API Standardization Efforts for Data Mining

Jaroslav Zendulka
Brno University of Technology, Czech Republic
INTRODUCTION

Data mining technology has only recently become truly usable in real-world scenarios. At present, the data mining models generated by commercial data mining and statistical applications are often used as components in other systems in such fields as customer relationship management, risk management, or processing scientific data. Therefore, it seems natural that most data mining products concentrate on data mining technology rather than on ease of use, scalability, or portability. It is evident that employing common standards greatly simplifies the integration, updating, and maintenance of applications and systems containing components provided by other producers (Grossman, Hornick, & Meyer, 2002). Data mining models generated by data mining algorithms are good examples of such components. Currently, established and emerging standards address especially the following aspects of data mining:

• Metadata: for representing data mining metadata that specify a data mining model and the results of model operations (CWM, 2001).
• Application Programming Interfaces (APIs): for employing data mining components in applications.
• Process: for capturing the whole knowledge discovery process (CRISP-DM, 2000).
In this paper, we focus on standard APIs. The objective of these standards is to facilitate the integration of data mining technology with application software. Probably the best-known initiatives in this field are OLE DB for Data Mining (OLE DB for DM), SQL/MM Data Mining (SQL/MM DM), and Java Data Mining (JDM). Another standard, which is not an API but is important for integration and interoperability of data mining products and applications, is the Predictive Model Markup Language (PMML). It is a standard format for data mining model exchange developed by the Data Mining Group (DMG) (PMML, 2003). It is supported by all the standard APIs presented in this paper.
BACKGROUND

The goal of data mining API standards is to make it possible for different data mining algorithms from various software vendors to be easily plugged into applications. A software package that provides data mining services is called a data mining provider, and an application that employs these services is called a data mining consumer. The data mining provider itself includes three basic architectural components (Hornick et al., 2002):

• API: the end-user-visible component. An application developer using a data mining provider has to know only its API.
• Data Mining Engine (or Server): the core component of a data mining provider. It provides an infrastructure that offers a set of data mining services to data mining consumers.
• Metadata Repository: a repository that serves to store data mining metadata.
The standard APIs presented in this paper are not designed to support the entire knowledge discovery process but only the data mining step (Han & Kamber, 2001). They do not provide all the facilities necessary for data cleaning, transformations, aggregations, and other data preparation operations. It is assumed that data preparation is done before an appropriate data mining algorithm offered by the API is applied. There are four key concepts that are supported by the APIs: a data mining model, a data mining task, a data mining technique, and a data mining algorithm. The data mining model is a representation of a given set of data. It is the result of one of the data mining tasks, during which a data mining algorithm for a given data mining technique builds the model. For example, a decision tree, as one of the classification models, is the result of a run of a decision tree-based algorithm. The basic data mining tasks that the standard APIs support enable users to:

1. Build a data mining model. This task consists of two steps. First, the data model is defined; that is, the source data to be mined is specified, the source data structure (referred to as the physical schema) is mapped onto the inputs of a data mining algorithm (referred to as the logical schema), and the algorithm used to build the data mining model is specified. Then, the data mining model is built from training data.
2. Test the quality of a mining model by applying testing data.
3. Apply a data mining model to new data.
4. Browse a data mining model for reporting and visualization applications.
The APIs support several commonly accepted and widely used techniques, both for predictive and descriptive data mining (see Table 1). Not all techniques need all the tasks listed above. For example, association rule mining does not require testing and application to new data, whereas classification does. The goals of the APIs are very similar, but the approach of each of them is different. OLE DB for DM is a language-based interface, SQL/MM DM is based on user-defined data types in SQL:1999, and JDM contains packages of data mining oriented Java interfaces and classes. In the next section, each of the APIs is briefly characterized. An example showing their application in prediction is presented in another article in this encyclopedia.

Table 1. Supported data mining techniques. Each of the three APIs (OLE DB for DM, SQL/MM DM, and JDM) supports a subset of the following techniques: association rules, clustering (segmentation), classification, sequence and deviation analysis, density estimation, regression, approximation, and attribute importance.

MAIN THRUST

OLE DB for Data Mining

OLE DB for DM (OLE DB, 2000) is Microsoft's API that aims to become the industry standard. It provides a set of extensions to OLE DB, which is a Microsoft object-oriented specification for a set of data access interfaces designed for record-oriented data stores. It employs SQL commands as arguments of interface operations. The approach in defining OLE DB for DM was not to extend OLE DB interfaces but to expose data mining interfaces in a language-based API. OLE DB for DM treats a data mining model as if it were a special type of "table": (a) input data in the form of a set of cases is associated with a data mining model and additional meta-information while defining the data mining model; (b) when input data is inserted into the data mining model (it is "populated"), a mining algorithm builds an abstraction of the data and stores it into this special table. For example, if the data model represents a decision tree, the table contains a row for each leaf node of the tree (Netz et al., 2001). Once the data mining model is populated, it can be used for prediction, or it can be browsed for visualization. OLE DB for DM extends the syntax of several SQL statements for defining, populating, and using a data mining model (see Figure 1).
INSERT
source data columns
model columns
mining algorithm
algorithm settings
SELECT
CREATE 2
40
Populating the data mining model
Defining a data 1 mining model
3
Testing, applying, browsing the data mining model
API Standardization Efforts for Data Mining
SQL/MM Data Mining

SQL/MM DM is an international ISO/IEC standard (SQL, 2002), which is part of the SQL Multimedia and Application Packages (SQL/MM) (Melton & Eisenberg, 2001). It is based on SQL:1999 and its structured user-defined data types (UDTs). The structured UDT is the fundamental facility in SQL:1999 that supports object orientation (Melton & Simon, 2001). The idea of SQL/MM DM is to provide UDTs and associated methods for defining input data, a data mining model, a data mining task and its settings, and for the results of testing or applying the data mining model. Training, testing, and application data must be stored in a table. Relations of the UDTs are shown in Figure 2. Some of the UDTs are related to mining techniques. Their names contain "XX" in the figure, which should be "Clas," "Rule," "Clus," and "Reg" for classification, association rules, clustering and regression, respectively.

Figure 2. Relations of the data mining UDTs introduced in SQL/MM DM. The UDTs shown are DM_MiningData, DM_LogicalDataSpec, DM_XXSettings, DM_XXBldTask, DM_XXModel, DM_XXTestResult, DM_XXResult, and DM_ApplicationData, connected to the training, testing, and application data.
Java Data Mining

Java Data Mining (JDM), known as Java Specification Request 73 (JSR-73) (Hornick et al., 2004), is a Java standard being developed under Sun's Java Community Process. The standard is based on a generalized, object-oriented, data mining conceptual model. JDM supports common data mining operations, as well as the creation, persistence, access, and maintenance of metadata supporting mining activities. Compared with OLE DB for DM and SQL/MM DM, JDM is more complex because it does not rely on any other built-in support, such as OLE DB or SQL. It is a pure Java API that specifies a set of Java interfaces and classes, which must be implemented in a data mining provider. Some of the JDM concepts are close to those in SQL/MM DM, but the number of Java interfaces and classes in JDM is higher than the number of UDTs in SQL/MM DM. JDM specifies interfaces for objects which provide an abstraction of the metadata needed to execute data mining tasks. Once a task is executed, another object that represents the result of the task is created.
FUTURE TRENDS

OLE DB for DM is a Microsoft standard that aims to become an industry standard. A reference implementation of a data mining provider based on this standard is available in Microsoft SQL Server 2000 (Netz et al., 2001). SQL/MM DM was adopted as an international ISO/IEC standard. As it is based on the user-defined data type feature of SQL:1999, support for UDTs in database management systems is essential for implementations of data mining providers based on this standard. JDM must still go through several steps before being accepted as an official Java standard. At the time of writing this paper, it was in the stage of final draft public review. The Oracle9i Data Mining (Oracle9i, 2002) API provides an early look at concepts and approaches proposed for JDM. It is assumed to comply with the JDM standard when the standard is published.

All the standards support PMML as a format for data mining model exchange. They enable a data mining model to be imported and exported in this format. In OLE DB for DM, a new model can be created from a PMML document. SQL/MM DM provides methods of the DM_XXModel UDT to import and export a PMML document. Similarly, JDM specifies interfaces for import and export tasks.
CONCLUSION

Three standard APIs for data mining were presented in this paper. However, their implementations are either not available yet or exist only as reference implementations. Schwenkreis (2001) commented on this situation: "the fact that implementations come after standards is a general trend in today's standardization efforts. It seems that in case of data
Figure 2. Relations of the data mining UDTs introduced in SQL/MM DM: DM_MiningData (used for training, testing, and application data), DM_LogicalDataSpec, DM_XXSettings, DM_XXBldTask, DM_XXModel, DM_XXTestResult, DM_XXResult, and DM_ApplicationData.
mining, standards are not only intended to unify existing products with well-known functionality but to (partially) design the functionality such that future products match real world requirements.” A simple example of using the APIs in prediction is presented in another article of this book (Zendulka, 2005).
REFERENCES

Common Warehouse Metamodel Specification: Data Mining. Version 1.0. (2001). Retrieved from http://www.omg.org/docs/ad/01-02-01.pdf

Cross Industry Standard Process for Data Mining (CRISP-DM). Version 1.0. (2000). Retrieved from http://www.crisp-dm.org/

Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data mining standards initiatives. Communications of the ACM, 45(8), 59-61.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hornick, M. et al. (2004). Java™ Specification Request 73: Java™ Data Mining (JDM). Version 0.96. Retrieved from http://jcp.org/aboutJava/communityprocess/first/jsr073/

Melton, J., & Eisenberg, A. (2001). SQL Multimedia and Application Packages (SQL/MM). SIGMOD Record, 30(4), 97-102.

Melton, J., & Simon, A. (2001). SQL:1999. Understanding relational language components. Morgan Kaufmann Publishers.

Microsoft Corporation. (2000). OLE DB for Data Mining Specification Version 1.0.

Netz, A. et al. (2001, April). Integrating data mining with SQL databases: OLE DB for data mining. In Proceedings of the 17th International Conference on Data Engineering (ICDE '01) (pp. 379-387). Heidelberg, Germany.

Oracle9i Data Mining. Concepts. Release 9.2.0.2. (2002). Viewable CD Release 2 (9.2.0.2.0).

PMML Version 2.1. (2003). Retrieved from http://www.dmg.org/pmml-v2-1.html

Saarenvirta, G. (2001, Summer). Operation data mining. DB2 Magazine, 6(2). International Business Machines Corporation. Retrieved from http://www.db2mag.com/db_area/archives/2001/q2/saarenvirta.shtml

SAS Enterprise Miner™ to support PMML. (September 17, 2002). Retrieved from http://www.sas.com/news/preleases/091702/news1.html

Schwenkreis, F. (2001). Data mining – Technology driven by standards? Retrieved from http://www.research.microsoft.com/~jamesrh/hpts2001/submissions/FriedemannSchwenkreis.htm

SQL Multimedia and Application Packages. Part 6: Data Mining. ISO/IEC 13249-6. (2002).

Zendulka, J. (2005). Using standard APIs for data mining in prediction. In J. Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference.
KEY TERMS

API: Application programming interface (API) is a description of the way one piece of software asks another program to perform a service. A standard API for data mining enables different data mining algorithms from various vendors to be easily plugged into application programs.

Data Mining Model: A high-level global description of a given set of data that is the result of applying a data mining technique to the set of data. It can be descriptive or predictive.

DMG: Data Mining Group (DMG) is a consortium of data mining vendors for developing data mining standards. They have developed the Predictive Model Markup Language (PMML).

JDM: Java Data Mining (JDM) is an emerging standard API for the programming language Java. It is an object-oriented interface that specifies a set of Java classes and interfaces supporting data mining operations for building, testing, and applying a data mining model.

OLE DB for DM: OLE DB for Data Mining (OLE DB for DM) is Microsoft's language-based standard API that introduces several SQL-like statements supporting data mining operations for building, testing, and applying a data mining model.

PMML: Predictive Model Markup Language (PMML) is an XML-based language which provides a quick and easy way for applications to produce data mining models in a vendor-independent format and to share them between compliant applications.
SQL:1999: Structured Query Language (SQL) 1999. The version of the standard database language SQL adopted in 1999, which introduced object-oriented features.
SQL/MM DM: SQL Multimedia and Application Packages – Part 6: Data Mining (SQL/MM DM) is an international standard the purpose of which is to define data mining user-defined types and associated routines for building, testing, and applying data mining models. It is based on structured user-defined types of SQL:1999.
The Application of Data Mining to Recommender Systems

J. Ben Schafer
University of Northern Iowa, USA
INTRODUCTION In a world where the number of choices can be overwhelming, recommender systems help users find and evaluate items of interest. They connect users with items to “consume” (purchase, view, listen to, etc.) by associating the content of recommended items or the opinions of other individuals with the consuming user’s actions or opinions. Such systems have become powerful tools in domains from electronic commerce to digital libraries and knowledge management. For example, a consumer of just about any major online retailer who expresses an interest in an item – either through viewing a product description or by placing the item in his “shopping cart” – will likely receive recommendations for additional products. These products can be recommended based on the top overall sellers on a site, on the demographics of the consumer, or on an analysis of the past buying behavior of the consumer as a prediction for future buying behavior. This paper will address the technology used to generate recommendations, focusing on the application of data mining techniques.
BACKGROUND Many different algorithmic approaches have been applied to the basic problem of making accurate and efficient recommender systems. The earliest “recommender systems” were content filtering systems designed to fight information overload in textual domains. These were often based on traditional information filtering and information retrieval systems. Recommender systems that incorporate information retrieval methods are frequently used to satisfy ephemeral needs (short-lived, often one-time needs) from relatively static databases. For example, requesting a recommendation for a book preparing a sibling for a new child in the family. Conversely, recommender systems that incorporate information-filtering methods are frequently used to satisfy persistent information (longlived, often frequent, and specific) needs from relatively stable databases in domains with a rapid turnover or frequent additions. For example, recommending AP sto-
ries to a user concerning the latest news regarding a senator’s re-election campaign. Without computers, a person often receives recommendations by listening to what people around him have to say. If many people in the office state that they enjoyed a particular movie, or if someone he tends to agree with suggests a given book, then he may treat these as recommendations. Collaborative filtering (CF) is an attempt to facilitate this process of “word of mouth.” The simplest of CF systems provide generalized recommendations by aggregating the evaluations of the community at large. More personalized systems (Resnick & Varian, 1997) employ techniques such as user-to-user correlations or a nearest-neighbor algorithm. The application of user-to-user correlations derives from statistics, where correlations between variables are used to measure the usefulness of a model. In recommender systems correlations are used to measure the extent of agreement between two users (Breese, Heckerman, & Kadie, 1998) and used to identify users whose ratings will contain high predictive value for a given user. Care must be taken, however, to identify correlations that are actually helpful. Users who have only one or two rated items in common should not be treated as strongly correlated. Herlocker et al. (1999) improved system accuracy by applying a significance weight to the correlation based on the number of co-rated items. Nearest-neighbor algorithms compute the distance between users based on their preference history. Distances vary greatly based on domain, number of users, number of recommended items, and degree of co-rating between users. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of neighbors for that item. As applied in recommender systems, neighbors are often generated online on a query-by-query basis rather than through the off-line construction of a more thorough model. As such, they have the advantage of being able to rapidly incorporate the most up-to-date information, but the search for neighbors is slow in large databases. Practical algorithms use heuristics to search for good neighbors and may use opportunistic sampling when faced with large populations.
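A minimal sketch of this neighborhood scheme, using Pearson correlation over co-rated items, a Herlocker-style significance weight, and a weighted average of neighbors' ratings, might look as follows; the co-rating threshold and the data layout are assumptions of the sketch.

```python
from math import sqrt

def pearson(r_a, r_b):
    """Correlation of two users over the items they both rated (dicts item -> rating)."""
    common = set(r_a) & set(r_b)
    if len(common) < 2:
        return 0.0, len(common)
    mean_a = sum(r_a[i] for i in common) / len(common)
    mean_b = sum(r_b[i] for i in common) / len(common)
    num = sum((r_a[i] - mean_a) * (r_b[i] - mean_b) for i in common)
    den = sqrt(sum((r_a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((r_b[i] - mean_b) ** 2 for i in common))
    return (num / den if den else 0.0), len(common)

def predict(target, item, ratings, min_corated=50):
    """Weighted average of neighbors' ratings for `item`; correlations computed on
    few co-rated items are devalued (significance weighting)."""
    num = den = 0.0
    for user, r in ratings.items():
        if user == target or item not in r:
            continue
        w, n = pearson(ratings[target], r)
        w *= min(n, min_corated) / min_corated  # devalue weakly supported correlations
        num += w * r[item]
        den += abs(w)
    return num / den if den else None
```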
Both nearest-neighbor and correlation-based recommenders provide a high level of personalization in their recommendations, and most early systems using these techniques showed promising accuracy rates. As such, CF-based systems have continued to be popular in recommender applications and have provided the benchmarks upon which more recent applications have been compared.
DATA MINING IN RECOMMENDER APPLICATIONS

The term data mining refers to a broad spectrum of mathematical modeling techniques and software tools that are used to find patterns in data and use these to build models. In the context of recommender applications, the term data mining is used to describe the collection of analysis techniques used to infer recommendation rules or build recommendation models from large data sets. Recommender systems that incorporate data mining techniques make their recommendations using knowledge learned from the actions and attributes of users. These systems are often based on the development of user profiles that can be persistent (based on demographic or item "consumption" history data), ephemeral (based on the actions during the current session), or both. These algorithms include clustering, classification techniques, the generation of association rules, and the production of similarity graphs through techniques such as Horting. Clustering techniques work by identifying groups of consumers who appear to have similar preferences. Once the clusters are created, averaging the opinions of the other consumers in her cluster can be used to make predictions for an individual. Some clustering techniques represent each user with partial participation in several clusters. The prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less-personal recommendations than other methods, and in some cases, the clusters have worse accuracy than CF-based algorithms (Breese, Heckerman, & Kadie, 1998). Once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller. Clustering techniques can also be applied as a "first step" for shrinking the candidate set in a CF-based algorithm or for distributing neighbor computations across several recommender engines. While dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, preclustering may be a worthwhile trade-off between accuracy and throughput.
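For the partial-participation variant described above, a prediction can be sketched as a membership-weighted average of the cluster opinions; the data layout and names below are illustrative.

```python
def cluster_predict(memberships, cluster_means, item):
    """Predict a user's rating for `item` as the average opinion of the clusters the
    user belongs to, weighted by the user's degree of participation in each cluster."""
    num = den = 0.0
    for cluster, weight in memberships.items():          # cluster id -> participation degree
        mean = cluster_means.get(cluster, {}).get(item)  # cluster id -> {item: mean rating}
        if mean is None:
            continue
        num += weight * mean
        den += weight
    return num / den if den else None
```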
of features for the items being classified or data about relationships among the items. The category is a domainspecific classification such as malignant/benign for tumor classification, approve/reject for credit requests, or intruder/authorized for security checks. One way to build a recommender system using a classifier is to use information about a product and a customer as the input, and to have the output category represent how strongly to recommend the product to the customer. Classifiers may be implemented using many different machine-learning strategies including rule induction, neural networks, and Bayesian networks. In each case, the classifier is trained using a training set in which ground truth classifications are available. It can then be applied to classify new items for which the ground truths are not available. If subsequent ground truths become available, the classifier may be retrained over time. For example, Bayesian networks create a model based on a training set with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as CF methods (Breese, Heckerman, & Kadie, 1998). Bayesian networks may prove practical for environments in which knowledge of consumer preferences changes slowly with respect to the time needed to build the model but are not suitable for environments in which consumer preference models must be updated rapidly or frequently. Classifiers have been quite successful in a variety of domains ranging from the identification of fraud and credit risks in financial transactions to medical diagnosis to intrusion detection. Good et al. (1999) implemented induction-learned feature-vector classification of movies and compared the classification with CF recommendations; this study found that the classifiers did not perform as well as CF, but that combining the two added value over CF alone. One of the best-known examples of data mining in recommender systems is the discovery of association rules, or item-to-item correlations (Sarwar et. al., 2001). These techniques identify items frequently found in “association” with items in which a user has expressed interest. Association may be based on co-purchase data, preference by common users, or other measures. In its simplest implementation, item-to-item correlation can be used to identify “matching items” for a single item, such as other clothing items that are commonly purchased with a pair of pants. More powerful systems match an entire set of items, such as those in a customer’s shopping cart, to identify appropriate items to recommend. These rules can also help a merchandiser arrange products so that, for example, a consumer purchasing a child’s handheld video game sees batteries nearby. More sophisticated temporal data mining may suggest that a consumer who buys the 45
video game today is likely to buy a pair of earplugs in the next month. Item-to-item correlation recommender applications usually use current interest rather than long-term customer history, which makes them particularly well suited for ephemeral needs such as recommending gifts or locating documents on a topic of short lived interest. A user merely needs to identify one or more “starter” items to elicit recommendations tailored to the present rather than the past. Association rules have been used for many years in merchandising, both to analyze patterns of preference across products, and to recommend products to consumers based on other products they have selected. An association rule expresses the relationship that one product is often purchased along with other products. The number of possible association rules grows exponentially with the number of products in a rule, but constraints on confidence and support, combined with algorithms that build association rules with itemsets of n items from rules with n-1 item itemsets, reduce the effective search space. Association rules can form a very compact representation of preference data that may improve efficiency of storage as well as performance. They are more commonly used for larger populations rather than for individual consumers, and they, like other learning methods that first build and then apply models, are less suitable for applications where knowledge of preferences changes rapidly. Association rules have been particularly successfully in broad applications such as shelf layout in retail stores. By contrast, recommender systems based on CF techniques are easier to implement for personal recommendation in a domain where consumer opinions are frequently added, such as online retail. In addition to use in commerce, association rules have become powerful tools in recommendation applications in the domain of knowledge management. Such systems attempt to predict which Web page or document can be most useful to a user. As Géry (2003) writes, “The problem of finding Web pages visited together is similar to finding associations among itemsets in transaction databases. Once transactions have been identified, each of them could represent a basket, and each web resource an item.” Systems built on this approach have been demonstrated to produce both high accuracy and precision in the coverage of documents recommended (Geyer-Schultz et al., 2002). Horting is a graph-based technique in which nodes are users, and edges between nodes indicate degree of similarity between two users (Wolf et al., 1999). Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from collaborative filtering as the graph may be walked through other consumers who have not rated the product in question, thus exploring transitive relationships that 46
traditional CF algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a CF-based algorithm (Wolf et al., 1999).
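Returning to the item-to-item correlation approach, the support and confidence constraints mentioned above can be sketched for two-item rules as follows; the thresholds are illustrative, and real systems grow larger rules Apriori-style from smaller itemsets rather than by explicit pair enumeration.

```python
from itertools import combinations
from collections import Counter

def item_pair_rules(baskets, min_support=0.01, min_confidence=0.2):
    """Enumerate two-item rules X -> Y with their support and confidence from a
    list of baskets (each basket a set of item ids)."""
    n = len(baskets)
    item_count, pair_count = Counter(), Counter()
    for basket in baskets:
        item_count.update(basket)
        pair_count.update(combinations(sorted(basket), 2))
    rules = []
    for (x, y), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for antecedent, consequent in ((x, y), (y, x)):
            confidence = c / item_count[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return rules
```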
FUTURE TRENDS As data mining algorithms have been tested and validated in their application to recommender systems, a variety of promising applications have evolved. In this section we will consider three of these applications – meta-recommenders, social data mining systems, and temporal systems that recommend when rather than what. Meta-recommenders are systems that allow users to personalize the merging of recommendations from a variety of recommendation sources employing any number of recommendation techniques. In doing so, these systems let users take advantage of the strengths of each different recommendation method. The SmartPad supermarket product recommender system (Lawrence et al., 2001) suggests new or previously unpurchased products to shoppers creating shopping lists on a personal digital assistant (PDA). The SmartPad system considers a consumer’s purchases across a store’s product taxonomy. Recommendations of product subclasses are based upon a combination of class and subclass associations drawn from information filtering and co-purchase rules drawn from data mining. Product rankings within a product subclass are based upon the products’ sales rankings within the user’s consumer cluster, a less personalized variation of collaborative filtering. MetaLens (Schafer et al., 2002) allows users to blend content requirements with personality profiles to allow users to determine which movie they should see. It does so by merging more persistent and personalized recommendations, with ephemeral content needs such as the lack of offensive content or the need to be home by a certain time. More importantly, it allows the user to customize the process by weighting the importance of each individual recommendation. While a traditional CF-based recommender typically requires users to provide explicit feedback, a social data mining system attempts to mine the social activity records of a community of users to implicitly extract the importance of individuals and documents. Such activity may include Usenet messages, system usage history, citations, or hyperlinks. TopicShop (Amento et al., 2003) is an information workspace which allows groups of common Web sites to be explored, organized into user defined collections, manipulated to extract and order common features, and annotated by one or more users. These actions on their own may not be of large interest, but the collection of these actions can be mined by TopicShop and redistributed to other users to suggest sites of
general and personal interest. Agrawal et al. (2003) explored the threads of newsgroups to identify the relationships between community members. Interestingly, they concluded that due to the nature of newsgroup postings – users are more likely to respond to those with whom they disagree – “links” between users are more likely to suggest that users should be placed in differing partitions rather than the same partition. Although this technique has not been directly applied to the construction of recommendations, such an application seems a logical field of future study. Although traditional recommenders suggest what item a user should consume they have tended to ignore changes over time. Temporal recommenders apply data mining techniques to suggest when a recommendation should be made or when a user should consume an item. Adomavicius and Tuzhilin (2001) suggest the construction of a recommendation warehouse, which stores ratings in a hypercube. This multidimensional structure can store data on not only the traditional user and item axes, but also for additional profile dimensions such as time. Through this approach, queries can be expanded from the traditional “what items should we suggest to user X” to “at what times would user X be most receptive to recommendations for product Y.” Hamlet (Etzioni et al., 2003) is designed to minimize the purchase price of airplane tickets. Hamlet combines the results from time series analysis, Q-learning, and the Ripper algorithm to create a multi-strategy data-mining algorithm. By watching for trends in airline pricing and suggesting when a ticket should be purchased, Hamlet was able to save the average user 23.8% when savings was possible.
CONCLUSION Recommender systems have emerged as powerful tools for helping users find and evaluate items of interest. These systems use a variety of techniques to help users identify the items that best fit their tastes or needs. While popular CF-based algorithms continue to produce meaningful, personalized results in a variety of domains, data mining techniques are increasingly being used in both hybrid systems, to improve recommendations in previously successful applications, and in stand-alone recommenders, to produce accurate recommendations in previously challenging domains. The use of data mining algorithms has also changed the types of recommendations as applications move from recommending what to consume to also recommending when to consume. While recommender systems may have started as largely a passing novelty, they clearly appear to have moved into a real and powerful tool in a variety of applications, and
that data mining algorithms can be and will continue to be an important part of the recommendation process.
REFERENCES Adomavicius, G., & Tuzhilin, A. (2001). Extending recommender systems: A multidimensional approach. IJCAI-01 Workshop on Intelligent Techniques for Web Personalization (ITWP’2001), Seattle, Washington. Agrawal, R., Rajagopalan, S., Srikant, R., & Xu, Y. (2003). Mining newsgroups using networks arising from social behavior. In Proceedings of the Twelfth World Wide Web Conference (WWW12) (pp. 529-535), Budapest, Hungary. Amento, B., Terveen, L., Hill, W., Hix, D., & Schulman, R. (2003). Experiments in social data mining: The TopicShop System. ACM Transactions on Computer-Human Interaction, 10 (1), 54-85. Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98) (pp. 43-52), Madison, Wisconsin. Etzioni, O., Knoblock, C.A., Tuchinda, R., & Yates, A. (2003). To buy or not to buy: Mining airfare data to minimize ticket purchase price. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 119-128), Washington. D.C. Géry, M., & Haddad, H. (2003). Evaluation of Web usage mining approaches for user’s next request prediction. In Fifth International Workshop on Web Information and Data Management (pp. 74-81), Madison, Wisconsin. Geyer-Schulz, A., & Hahsler, M. (2002). Evaluation of recommender algorithms for an Internet information broker based on simple association rules and on the repeatbuying theory. In Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles (pp. 100-114), Edmonton, Alberta, Canada. Good, N. et al. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of Sixteenth National Conference on Artificial Intelligence (AAAI-99) (pp. 439-446), Orlando, Florida. Herlocker, J., Konstan, J.A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval, (pp. 230-237), Berkeley, California. 47
Lawrence, R.D. et al. (2001). Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11-32. Lin, W., Alvarez, S.A., & Ruiz, C. (2002). Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6(1) 83-105. Resnick, P., & Varian, H.R. (1997). Communications of the Association of Computing Machinery Special issue on Recommender Systems, 40(3), 56-89. Sarwar, B., Karypis, G., Konstan, J.A., & Reidl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the Tenth International Conference on World Wide Web (pp. 285-295), Hong Kong. Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153. Schafer, J.B., Konstan, J.A., & Riedl, J. (2002). Metarecommendation systems: User-controlled integration of diverse recommendations. In Proceedings of the Eleventh Conference on Information and Knowledge (CIKM02) (pp. 196-203), McLean, Virginia. Shoemaker, C., & Ruiz, C. (2003). Association rule mining algorithms for set-valued data. Lecture Notes in Computer Science, 2690, 669-676. Wolf, J., Aggarwal, C., Wu, K-L., & Yu, P. (1999). Horting hatches an egg: A new graph-theoretic approach to collaborative filtering. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 201-212), San Diego, CA.
KEY TERMS

Association Rules: Used to associate items in a database sharing some relationship (e.g., co-purchase information). Often takes the form "if this, then that," such as, "If the customer buys a handheld videogame then the customer is likely to purchase batteries."

Collaborative Filtering: Selecting content based on the preferences of people with similar interests.

Meta-Recommenders: Provide users with personalized control over the generation of a single recommendation list formed from the combination of rich recommendation data from multiple information sources and recommendation techniques.

Nearest-Neighbor Algorithm: A recommendation algorithm that calculates the distance between users based on the degree of correlation between scores in the users' preference histories. Predictions of how much a user will like an item are computed by taking the weighted average of the opinions of a set of nearest neighbors for that item.

Recommender Systems: Any system that provides a recommendation, prediction, opinion, or user-configured list of items that assists the user in evaluating items.

Social Data Mining: Analysis and redistribution of information from records of social activity such as newsgroup postings, hyperlinks, or system usage history.

Temporal Recommenders: Recommenders that incorporate time into the recommendation process. Time can be either an input to the recommendation function or the output of the function.
Approximate Range Queries by Histograms in OLAP

Francesco Buccafurri
University “Mediterranea” of Reggio Calabria, Italy

Gianluca Lax
University “Mediterranea” of Reggio Calabria, Italy
INTRODUCTION Online analytical processing applications typically analyze a large amount of data by means of repetitive queries involving aggregate measures on such data. In fast OLAP applications, it is often advantageous to provide approximate answers to queries in order to achieve very high performances. A way to obtain this goal is by submitting queries on compressed data in place of the original ones. Histograms, initially introduced in the field of query optimization, represent one of the most important techniques used in the context of OLAP for producing approximate query answers.
BACKGROUND Computing aggregate information is a widely exploited task in many OLAP applications. Every time it is necessary to produce fast query answers and a certain estimation error can be accepted, it is possible to inquire summary data rather than the original ones and to perform suitable interpolations. The typical OLAP query is the range query. The range query estimation problem in the onedimensional case can be stated as follows: given an attribute X of a relation R, and a range I belonging to the domain of X, estimate the number of records of R with value of X lying in I. The challenge is finding methods for achieving a small estimation error by consuming a fixed amount of storage space. A possible solution to this problem is using sampling methods; only a small number of suitably selected records of R, representing R well, are stored. The range query is then evaluated by exploiting this sample instead of the full relation R. Recently, Wu, Agrawal, and Abbadi (2002) have shown that in terms of accuracy, sampling techniques based on the cumulative distribution function are definitely better than the methods based on tuple sampling (Chaudhuri, Das & Narasayya, 2001; Ganti, Lee &
Ramakrishnan, 2000). The main advantage of sampling techniques is that they are very easy to implement. Besides sampling, regression techniques try to model data as a function in such a way that only a small set of coefficients representing such a function is stored, rather than the original data. The simplest regression technique is the linear one, which models a data distribution as a linear function. Despite its simplicity, not allowing the capture of complex relationships among data, this technique often produces acceptable results. There are also non linear regressions, significantly more complex than the linear one from the computational point of view, but applicable to a much larger set of cases. Another possibility for facing the range query estimation problem consists of using wavelets-based techniques (Chakrabarti, Garofalakis, Rastogi & Shim, 2001; Garofalakis & Gibbons, 2002; Garofalakis & Kumar, 2004). Wavelets are mathematical transformations storing data in a compact and hierarchical fashion used in many application contexts, like image and signal processing (Kacha, Grenez, De Doncker & Benmahammed, 2003; Khalifa, 2003). There are several types of transformations, each belonging to a family of wavelets. The result of each transformation is a set of values, called wavelet coefficients. The advantage of this technique is that, typically, the value of a (possibly large) number of wavelet coefficients results to be below a fixed threshold, so that such coefficients can be approximated by 0. Clearly, the overall approximation of the technique as well as the compression ratio depends on the value of such a threshold. In the last years, wavelets have been exploited in data mining and knowledge discovery in databases, thanks to time and space efficiency and data hierarchical decomposition characterizing them. For a deeper treatment about wavelets, see Li, Li, Zhu, and Ogihara (2002). Besides sampling and wavelets, histograms are used widely for estimating range queries. Although sometimes wavelets are viewed as a particular class of histograms, we prefer to describe histograms separately.
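As a minimal illustration of the tuple-sampling idea mentioned earlier in this section (the cumulative-distribution-based techniques favored by Wu, Agrawal, and Abbadi are more sophisticated than this), a range count can be estimated by scaling the count observed in a uniform random sample of the attribute values:

```python
def sampled_range_count(sample, n_total, low, high):
    """Estimate how many of the n_total values of attribute X fall in [low, high]
    by scaling the number of hits observed in a uniform random sample of X."""
    hits = sum(1 for x in sample if low <= x <= high)
    return hits * n_total / len(sample)
```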
MAIN THRUST Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, and OLAP applications. In OLAP, compression allows us to obtain fast approximate answers by evaluating queries on reduced data in place of the original ones. Histograms are well suited to this purpose, especially in the case of range queries. A histogram is a compact representation of a relation R. It is obtained by partitioning an attribute X of the relation R into k subranges, called buckets, and by maintaining for each of them a few pieces of information, typically corresponding to the bucket boundaries, the number of tuples with value of X belonging to the subrange associated to the bucket (often called sum of the bucket), and the number of distinct values of X of such a subrange occurring in some tuple of R (i.e., the number of non-null frequencies of the subrange). Recall that a range query, defined on an interval I of X, evaluates the number of occurrences in R with value of X in I. Thus, buckets embed a set of precomputed disjoint range queries capable of covering the whole active domain of X in R (here, active means attribute values actually appearing in R). As a consequence, the histogram, in general, does not give the possibility of evaluating exactly a range query not corresponding to one of the precomputed embedded queries. In other words, while the contribution to the answer coming from the subranges coinciding with entire buckets can be returned exactly, the contribution coming from the subranges that partially overlap buckets can only be estimated, since the actual data distribution is not available. Constructing the best histogram thus may mean defining the boundaries of buckets in such a way that the estimation of the non-precomputed range queries becomes more effective (e.g., by avoiding that large frequency differences arise inside a bucket). This approach corresponds to finding, among all possible sets of precomputed range queries, the set that guarantees the best estimation of the other (non-precomputed) queries, once a technique for estimating such queries is defined. Besides this problem, which we call the partition problem, there is another relevant issue to investigate: how to improve the estimation inside the buckets. We discuss both of these issues in the following two sections.
The Partition Problem

This issue has been analyzed widely in the past, and a number of techniques have been proposed. Among these,
we first consider the Max-Diff histogram and the VOptimal histogram. Even though they are not the most recent techniques, we cite them, since they are still considered points of reference. We start by describing the Max-Diff histogram. Let V={v1, ... , v n}be the set of values of the attribute X actually appearing in the relation R and f(v i) be the number of tuples of R having value v i in X. A MaxDiff histogram with h buckets is obtained by putting a boundary between two adjacent attribute values v i and vi+1 of V if the difference between f(vi+1) · si+1 and f(vi) · s i is one of the h-1 largest such differences (where s i denotes the spread of vi, that is the distance from vi to the next nonnull value). A V-Optimal histogram, which is the other classical histogram we describe, produces more precise results than the Max-Diff histogram. It is obtained by selecting the boundaries for each bucket i so that Σ i SSEi is minimal, where SSE i is the standard squared error of the bucket i-th. V-Optimal histogram uses a dynamic programming technique in order to find the optimal partitioning w.r.t. a given error metrics. Even though the V-Optimal histogram results more accurate than Max-Diff, its high space and time complexities make it rarely used in practice. In order to overcome such a drawback, an approximate version of the V-Optimal histogram has been proposed. The basic idea is quite simple. First, data are partitioned into l disjoint chunks, and then the V-Optimal algorithm is used in order to compute a histogram within each chunk. The consequent problem is how to allocate buckets to the chunks such that exactly B buckets are used. This is solved by implementing a dynamic programming scheme. It is shown that an approximate V-Optimal histogram with B + l buckets has the same accuracy as the non-approximate V-Optimal with B buckets. Moreover, the time required for executing the approximate algorithm is reduced by multiplicative factor equal to1/l. We call the histograms so far described classical histograms. Besides accuracy, new histograms tend to satisfy other properties in order to allow their application to new environments (e.g., knowledge discovery). In particular, (1) the histogram should maintain in a certain measure the semantic nature of original data, in such a way that meaningful queries for mining activities can be submitted to reduced data in place of original ones. Then, (2) for a given kind of query, the accuracy of the reduced structure should be guaranteed. In addition, (3) the histogram should efficiently support hierarchical range queries in order not to limit too much the capability of drilling down and rolling up over data.
Classical histograms lack the last point, since they are flat structures. Many proposals have been presented in order to guarantee the three properties previously described, and we report some of them in the following. Requirement (3) was introduced by Koudas, Muthukrishnan, and Srivastava (2000), where the authors have shown the insufficient accuracy of classical histograms in evaluating hierarchical range queries. Therein, a polynomial-time algorithm for constructing optimal histograms with respect to hierarchical queries is proposed. The selectivity estimation problem for non-hierarchical range queries was studied by Gilbert, Kotidis, Muthukrishnan, and Strauss (2001), and, according to property (2), optimal and approximate polynomial (in the database size) algorithms with a provable approximation guarantee for constructing histograms are also presented. Guha, Koudas, and Srivastava (2002) have proposed efficient algorithms for the problem of approximating the distribution of measure attributes organized into hierarchies. Such algorithms are based on dynamic programming and on a notion of sparse intervals. Algorithms returning both optimal and suboptimal solutions for approximating range queries by histograms and their dynamic maintenance by additive changes are provided by Muthukrishnan and Strauss (2003). The best algorithm, with respect to the construction time, returning an optimal solution takes polynomial time. Buccafurri and Lax (2003) have presented a histogram based on a hierarchical decomposition of the data distribution kept in a full binary tree. Such a tree, containing a set of precomputed hierarchical queries, is encoded by using bit saving for obtaining a smaller structure and, thus, for efficiently supporting hierarchical range queries. Besides bucket-based histograms, there are other kinds of histograms whose construction is not driven by the search of a suitable partition of the attribute domain, and, further, their structure is more complex than simply a set of buckets. This class of histograms is called nonbucket based histograms. Wavelets are an example of such kind of histograms. In the next section, we deal with the second problem introduced earlier concerning the estimation of range queries partially involving buckets.
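As a concrete reference point before moving on, the Max-Diff partitioning rule described at the beginning of this section can be sketched as follows; the use of absolute differences and the spread assigned to the last value are assumptions of this sketch.

```python
def maxdiff_boundaries(values, freqs, h):
    """Split the sorted non-null values into h buckets, cutting after the h-1
    largest adjacent differences of f(v_i) * spread(v_i)."""
    spreads = [values[i + 1] - values[i] for i in range(len(values) - 1)] + [1]
    areas = [f * s for f, s in zip(freqs, spreads)]
    diffs = sorted(((abs(areas[i + 1] - areas[i]), i) for i in range(len(areas) - 1)),
                   reverse=True)
    cuts = sorted(i for _, i in diffs[:h - 1])
    buckets, start = [], 0
    for i in cuts:
        buckets.append((values[start], values[i]))  # bucket covers values[start..i]
        start = i + 1
    buckets.append((values[start], values[-1]))
    return buckets
```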
Estimation Inside a Bucket While finding the optimal bucket partition has been widely investigated in past years, the problem of estimating queries partially involving a bucket has received a little attention. Histograms are well suited to range query evaluation, since buckets basically correspond to a set of precomputed range queries. A range query that involves entirely
one or more buckets can be computed exactly, while if it partially overlaps a bucket, then the result can only be estimated. The simplest adopted estimation technique is the Continuous Value Assumption (CVA). Given a bucket of size s and sum c, a range query overlapping the bucket in i points is estimated as (i/s) · c. This corresponds to estimating the partial contribution of the bucket to the range query result by linear interpolation. Another possibility is to use the Uniform Spread Assumption (USA). It assumes that values are distributed at equal distance from each other and that the overall frequency sum is equally distributed among them. In this case, it is necessary to know the number of non-null frequencies belonging to the bucket. Denoting by t such a value, the range query is estimated by

$$\frac{(s-1) + (i-1)\cdot(t-1)}{s-1} \cdot \frac{c}{t}.$$
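A direct transcription of the two estimators, under the stated assumptions about what a bucket stores (size s, sum c, and number t of non-null frequencies), might look as follows:

```python
def cva_estimate(i, s, c):
    """Continuous Value Assumption: linear interpolation of the bucket sum c."""
    return (i / s) * c

def usa_estimate(i, s, c, t):
    """Uniform Spread Assumption: t non-null values spread evenly across the bucket,
    each carrying an equal share c / t of the bucket sum."""
    if s <= 1 or t <= 0:
        return cva_estimate(i, s, c) if s > 0 else 0.0  # degenerate buckets
    return ((s - 1) + (i - 1) * (t - 1)) / (s - 1) * (c / t)
```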
An interesting problem is understanding whether, by exploiting information typically contained in histogram buckets and possibly by adding some concise summary information, the frequency estimation inside buckets and, then, the histogram accuracy can be improved. To this aim, starting from a theoretical analysis about limits of CVA and USA, Buccafurri, Pontieri, Rosaci, and Saccà (2002) have proposed to use an additional storage space of 32 bits, called 4LT, in each bucket in order to store the approximate representation of the data distribution inside the bucket. In particular, 4LT is used to save approximate cumulative frequencies at seven equidistant intervals internal to the bucket. Clearly, approaches similar to that followed in Buccafurri, Pontieri, Rosaci, and Saccà (2002) have to deal with the trade-off between the extra storage space required for each bucket and the number of total buckets the allowed total storage space consents.
FUTURE TRENDS Data streams is an emergent issue that in the last two years has captured the interest of many scientific communities. The crucial problem arising in several application contexts like network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and so on is dealing with continuous data flows (i.e., data streams) having the following characteristics: (1) they are time dependent; (2) their size is very large, so that they cannot be stored totally due to the actual memory
limitation; and (3) data arrival is very fast and unpredictable, so that each data management operation should be very efficient. Since a data stream consists of a large amount of data, it is usually managed on the basis of a sliding window, including only the most recent data (Babcock, Babu, Datar, Motwani & Widom, 2002). Thus, any technique capable of compressing sliding windows by maintaining a good approximate representation of data distribution is certainly relevant in this field. Typical queries performed on sliding windows are similarity queries and other analyses, like change mining queries (Dong, Han, Lakshmanan, Pei, Wang & Yu, 2003) useful for trend analysis and, in general, for understanding the dynamics of data. Also in this field, histograms may become an important analysis tool. The challenge is finding new histograms that (1) are fast to construct and to maintain; that is, the required updating operations (performed at each data arrival) are very efficient; (2) maintain a good accuracy in approximating data distribution; and (3) support continuous querying on data. An example of the above emerging approaches is reported in Buccafurri and Lax (2004), where a tree-like histogram with cyclic updating is proposed. By using such a compact structure, many mining techniques, which would take computational cost very high if used on real data streams, can be implemented effectively.
CONCLUSION Data reduction represents an important task both in data mining task and in OLAP, since it allows us to represent very large amounts of data in a compact structure, which efficiently perform on mining techniques or OLAP queries. Time and memory cost advantages arisen from data compression, provided that a sufficient degree of accuracy is guaranteed, may improve considerably the capabilities of mining and OLAP tools. This opportunity (added to the necessity, coming from emergent research fields such as data streams) of producing more and more compact representations of data explains the attention that the research community is giving toward techniques like histograms and wavelets, which provide a concrete answer to the previous requirements.
REFERENCES

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
Buccafurri, F., & Lax, G. (2003). Pre-computing approximate hierarchical range queries in a tree-like histogram. Proceedings. of the International Conference on Data Warehousing and Knowledge Discovery. Buccafurri, F., & Lax, G. (2004). Reducing data stream sliding windows by cyclic tree-like histograms. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. Buccafurri, F., Pontieri, L., Rosaci, D., & Saccà, D. (2002). Improving range query estimation on histograms. Proceedings of the International Conference on Data Engineering. Chakrabarti, K., Garofalakis, M., Rastogi, R., & Shim, K. (2001). Approximate query processing using wavelets. VLDB Journal, The International Journal on Very Large Data Bases, 10(2-3), 199-223. Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Dong, G. et al. (2003). Online mining of changes from data streams: Research problems and preliminary results. Proceedings of the ACM SIGMOD Workshop on Management and Processing of Data Streams. Ganti, V., Lee, M. L., & Ramakrishnan, R. (2000). Icicles: Self-tuning samples for approximate query answering. Proceedings of 26th International Conference on Very Large Data Bases. Garofalakis, M., & Gibbons, P.B. (2002). Wavelet synopses with error guarantees. Proceedings of the ACM SIGMOD International Conference on Management of Data. Garofalakis, M., & Kumar, A. (2004). Deterministic wavelet thresholding for maximum error metrics. Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., & Strauss, M.J. (2001). Optimal and approximate computation of summary statistics for range aggregates. Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Guha, S., Koudas, N., & Srivastava, D. (2002). Fast algorithms for hierarchical range histogram construction. Proceedings of the Twenty-Ffirst ACM SIGMODSIGACT-SIGART Symposium on Principles of Database Systems.
Kacha, A., Grenez, F., De Doncker, P., & Benmahammed, K. (2003). A wavelet-based approach for frequency estimation of interference signals in printed circuit boards. Proceedings of the 1st International Symposium on Information and Communication Technologies. Khalifa, O. (2003). Image data compression in wavelet transform domain using modified LBG algorithm. Proceedings of the 1st International Symposium on Information and Communication Technologies. Koudas, N., Muthukrishnan, S., & Srivastava, D. (2000). Optimal histograms for hierarchical range queries (extended abstract). Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Li, T., Li, Q., Zhu, S., & Ogihara, M. (2002). Survey on wavelet applications in data mining. ACM SIGKDD Explorations, 4(2), 49-68. Muthukrishnan, S., & Strauss, M. (2003). Rangesum histograms. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Wu, Y., Agrawal, D., & Abbadi, A.E. (2002). Query estimation by adaptive sampling. Proceedings of the International Conference on Data Engineering.
KEY TERMS

Bucket: An element obtained by partitioning the domain of an attribute X of a relation into non-overlapping intervals. Each bucket consists of a tuple that typically stores the boundaries of the interval, the sum of the frequencies of the values lying in it, and the number of its non-null frequencies.
Bucket-Based Histogram: A type of histogram whose construction is driven by the search of a suitable partition of the attribute domain into buckets.

Continuous Value Assumption (CVA): A technique allowing us to estimate values inside a bucket by linear interpolation.

Data Preprocessing: The application of several methods preceding the mining phase, done for improving the overall data mining results. Usually, it consists of (1) data cleaning, a method for fixing missing values, outliers, and possible inconsistent data; (2) data integration, the union of (possibly heterogeneous) data coming from different sources into a unique data store; and (3) data reduction, the application of any technique working on data representation capable of saving storage space without compromising the possibility of inquiring them.

Histogram: A set of buckets implementing a partition of the overall domain of a relation attribute.

Range Query: A query returning aggregate information (e.g., sum, average) about data belonging to a given interval of the domain.

Uniform Spread Assumption (USA): A technique for estimating values inside a bucket by assuming that values are distributed at an equal distance from each other and that the overall frequency sum is distributed equally among them.

Wavelets: Mathematical transformations implementing hierarchical decomposition of functions leading to the representation of functions through sets of wavelet coefficients.
Artificial Neural Networks for Prediction

Rafael Martí
Universitat de València, Spain
INTRODUCTION
The design and implementation of intelligent systems with human capabilities is the starting point for the design of Artificial Neural Networks (ANNs). The original idea takes after neuroscience theory on how neurons in the human brain cooperate to learn from a set of input signals to produce an answer. Because the power of the brain comes from the number of neurons and the multiple connections between them, the basic idea is that connecting a large number of simple elements in a specific way can form an intelligent system. Generally speaking, an ANN is a network of many simple processors called units, linked to certain neighbors with varying coefficients of connectivity (called weights) that represent the strength of these connections. The basic unit of ANNs, called an artificial neuron, simulates the basic functions of natural neurons: it receives inputs, processes them by simple combination and threshold operations, and outputs a final result. ANNs often employ supervised learning, in which training data (including both the input and the desired output) is provided. Learning basically refers to the process of adjusting the weights to optimize the network performance. ANNs belong to the family of machine-learning algorithms because changing a network's connection weights causes it to gain knowledge in order to solve the problem at hand. Neural networks have been widely used for both classification and prediction. In this article, I focus on the prediction or estimation problem (although, with a few changes, my comments and descriptions also apply to classification). Estimating and forecasting future conditions are involved in different business activities. Some examples include cost estimation, prediction of product demand, and financial planning. Moreover, the field of prediction also covers other activities, such as medical diagnosis or industrial process modeling. In this short article I focus on multilayer neural networks because they are the most common. I describe their architecture and some of the most popular training methods, and I finish with some associated conclusions and the appropriate list of references to provide some pointers for further study.

BACKGROUND
From a technical point of view, ANNs offer a general framework for representing nonlinear mappings from several input variables to several output variables. They are built by tuning a set of parameters known as weights and can be considered as an extension of the many conventional mapping techniques. In classification or recognition problems, the net’s outputs are categories, while in prediction or approximation problems, they are continuous variables. Although this article focuses on the prediction problem, most of the key issues in the net functionality are common to both. In the process of training the net (supervised learning), the problem is to find the values of the weights w that minimize the error across a set of input/output pairs (patterns) called the training set E. For a single output and input vector x, the error measure is typically the root mean squared difference between the predicted output p(x,w) and the actual output value f(x) for all the elements x in E (RMSE); therefore, the training is an unconstrained nonlinear optimization problem, where the decision variables are the weights, and the objective is to reduce the training error. Ideally, the set E is a representative sample of points in the domain of the function f that you are approximating; however, in practice it is usually a set of points for which you know the f-value.
$$\min_{w}\; error(E, w) = \sqrt{\frac{\sum_{x \in E} \left( f(x) - p(x, w) \right)^{2}}{|E|}} \qquad (1)$$
The main goal in the design of an ANN is to obtain a model that makes good predictions for new inputs (i.e., to provide good generalization). Therefore, the net must represent the systematic aspects of the training data rather than their specific details. The standard way to measure the generalization provided by the net consists of introducing a second set of points in the domain of f called the testing set, T. Assume that no point in T belongs to E and f(x) is known for all x in T. After the optimization has been performed and the weights have been set to minimize the error in E (w=w*), the error across the testing set T is computed (error(T,w*)). The
net must exhibit a good fit between the target f-values and the output (prediction) in the training set and also in the testing set. If the RMSE in T is significantly higher than that one in E, you say that the net has memorized the data instead of learning them (i.e., the net has overfitted the training data). The optimization of the function given in (1) is a hard problem by itself. Moreover, keep in mind that the final objective is to obtain a set of weights that provides low values of error(T,w*) for any set T. In the following sections I summarize some of the most popular and other not so popular but more efficient methods to train the net (i.e., to compute appropriate weight values).
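A small helper in this spirit, assuming a generic predict(x, w) function for the net, computes the RMSE of equation (1) over any labeled set, so that the training error on E and the testing error on T can be compared to detect memorization:

```python
import math

def rmse(pairs, weights, predict):
    """Root mean squared error of predict(x, weights) over (x, f(x)) pairs, as in (1)."""
    return math.sqrt(sum((fx - predict(x, weights)) ** 2 for x, fx in pairs) / len(pairs))

# A much larger rmse on the testing set T than on the training set E signals that the
# net has memorized (overfitted) the training data.
```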
MAIN THRUST

Several models inspired by biological neural networks have been proposed throughout the years, beginning with the perceptron introduced by Rosenblatt (1962). He studied a simple architecture where the output of the net is a transformation of a linear combination of the input variables and the weights. Minsky and Papert (1969) showed that the perceptron can only solve linearly separable classification problems and is therefore of limited interest. A natural extension to overcome its limitations is given by the so-called multilayer perceptron or, simply, multilayer neural networks. I have considered this architecture with a single hidden layer. A schematic representation of the network appears in Figure 1.
Neural Network Architecture

Let NN = (N, A) be an ANN, where N is the set of nodes and A is the set of arcs. N is partitioned into three subsets: N_I, the input nodes, N_H, the hidden nodes, and N_O, the output nodes. I assume that n variables exist in the function that I want to predict or approximate; therefore |N_I| = n. The neural network has m hidden neurons (|N_H| = m), with a bias term in each hidden neuron, and a single output neuron (we restrict our attention to real functions f: ℜ^n → ℜ). Figure 1 shows a net where N_I = {1, 2, ..., n}, N_H = {n+1, n+2, ..., n+m}, and N_O = {s}. Given an input pattern x = (x_1, ..., x_n), the neural network provides the user with an associated output NN(x,w), which is a function of the weights w. Each node i in the input layer receives a signal of amount x_i that it sends through all its incident arcs to the nodes in the hidden layer. Each node n+j in the hidden layer receives a signal input(n+j) according to the expression:
Input(n+j)=wn+j +
x1
1
w 1, n+ 1
n+1
xn
2
n
n+2
NN(x,w) = ws +
n+m
∑ output (n + j) w
n+ j ,s
j =1
In the process of training the net (supervised learning), the problem is to find the values of the weights (including the bias factors) that minimize the error (RMSE) across the training set E. After the optimization has been performed and the weights have been set (w=w*),the net is ready to produce the output for any input value. The testing error Error(T,w*) computes the Root Mean Squared Error across the elements in the testing set T={y1, y2,..,ys}, where no one belongs to the training set E:
Error(T,w*) =
∑ error ( y , w ) i =1
w n+ 1 ,s s
i ,n + j
m
s
ou tp ut x2
i
i =1
where wn+j is the bias value for node n+j, and wi,n+j is the weight value on the arc from node i in the input layer to node n+j in the hidden layer. Each hidden node transforms its input by means of a nonlinear activation function: output(j)=sig(input(j)). The most popular choice for the activation function is the sigmoid function sig(x)= 1/(1+e-x). Laguna and Martí (2002) test two activation functions for the hidden neurons and conclude that the sigmoid presents superior performance. Each hidden node n+j sends the amount of signal output(n+j) through the arc (n+j,s). The node s in the output layer receives the weighted sum of the values coming from the hidden nodes. This sum, NN(x,w), is the net’s output according to the expression:
Figure 1. Neural network diagram inp u ts
∑x w
i
s
∗
.
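The forward pass just described can be sketched in a few lines; the weight and bias values below are arbitrary illustrations, and the function names mirror the notation above rather than any published implementation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def nn_output(x, hidden_bias, hidden_weights, out_bias, out_weights):
    """Forward pass of a single-hidden-layer net: NN(x, w) = w_s + sum_j output(n+j) * w_{n+j,s}."""
    outputs = []
    for j in range(len(hidden_bias)):
        # input(n+j) = w_{n+j} + sum_i x_i * w_{i,n+j}
        signal = hidden_bias[j] + sum(x[i] * hidden_weights[i][j] for i in range(len(x)))
        outputs.append(sigmoid(signal))
    return out_bias + sum(outputs[j] * out_weights[j] for j in range(len(outputs)))

# Illustrative net with n = 2 inputs and m = 3 hidden neurons.
x = [0.5, -1.0]
hidden_bias = [0.1, -0.2, 0.05]                 # w_{n+j}
hidden_weights = [[0.3, -0.7, 0.2],             # w_{1,n+j}
                  [0.6, 0.1, -0.4]]             # w_{2,n+j}
out_bias, out_weights = 0.0, [1.2, -0.8, 0.5]   # w_s and w_{n+j,s}
print(nn_output(x, hidden_bias, hidden_weights, out_bias, out_weights))
```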
Training Methods

Considering the supervised learning described in the previous section, many different training methods have been proposed. Here, I summarize some of the most relevant, starting with the well-known backpropagation
method. For a deeper understanding of them, see the excellent book by Bishop (1995). Backpropagation (BP) was the first method for neural network training and is still the most widely used algorithm in practical applications. It is a gradient descent method that searches for an optimal set of network weights. Each iteration consists of two steps. First, partial derivatives ∂Error/∂w are computed for each weight in the net. Then weights are modified to reduce the RMSE according to the direction given by the gradient. There have been different modifications to this basic procedure; the most significant is the addition of a momentum term to prevent zigzagging in the search. Because the neural network training problem can be expressed as a nonlinear unconstrained optimization problem, I might use more elaborate nonlinear methods than gradient descent to solve it. A selection of the best established algorithms in unconstrained nonlinear optimization has also been used in this context. These include the nonlinear simplex method, the direction set method, the conjugate gradient method, the Levenberg-Marquardt algorithm (Moré, 1978), and GRG2 (Smith and Lasdon, 1992). Recently, metaheuristic methods have also been adapted to this problem. Specifically, on one hand you can find those methods based on local search procedures, and on the other, those methods based on populations of solutions known as evolutionary methods. In the first category, two methods have been applied, simulated annealing and tabu search, while in the second you can find the so-called genetic algorithms, the scatter search, and, more recently, a path relinking implementation. Several studies (Sexton, 1998) have shown that tabu search outperforms the simulated annealing implementation; therefore, I first focus on the different tabu search implementations for ANN training.
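As a rough illustration of the gradient-descent update with a momentum term that underlies BP — not the layer-by-layer derivative bookkeeping of backpropagation itself — the sketch below approximates the partial derivatives by finite differences; the learning rate, momentum value, and test error function are arbitrary assumptions.

```python
def train_gradient_descent(error, w, lr=0.05, momentum=0.9, iterations=1000, h=1e-5):
    """Minimize error(w) by gradient descent with a momentum term.
    The gradient is approximated by finite differences, standing in for the
    analytical derivatives that backpropagation computes layer by layer."""
    velocity = [0.0] * len(w)
    for _ in range(iterations):
        base = error(w)
        grad = []
        for k in range(len(w)):
            w_h = list(w)
            w_h[k] += h
            grad.append((error(w_h) - base) / h)      # estimate of dError/dw_k
        for k in range(len(w)):
            velocity[k] = momentum * velocity[k] - lr * grad[k]
            w[k] += velocity[k]
    return w

# Hypothetical use: error(w) would be the training RMSE of the net for the weights w.
print(train_gradient_descent(lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2, [0.0, 0.0]))
```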
Tabu Search Tabu search (TS) is based on the premise that in order to qualify as intelligent, problem solving must incorporate adaptive memory and responsive exploration. The adaptive memory feature of TS allows the implementation of procedures that are capable of searching the solution space economically and effectively. Because local choices are guided by information collected during the search, TS contrasts with memoryless designs that heavily rely on semirandom processes that implement a form of sampling. The emphasis on responsive exploration in tabu search, whether in a deterministic or probabilistic implementation, derives from the supposition that a bad strategic choice can yield more information than a good random choice. In a system that uses memory, a bad choice
based on strategy can provide useful clues about how the strategy may profitably be changed. As far as I know, the first tabu search approach for neural network training is due to Sexton et al. (1998). A short description follows. An initial solution x0 is randomly drawn from a uniform distribution in the range [-10,10]. Solutions are randomly generated in this range for a given number of iterations. When generating a new point xnew, aspiration level and tabu conditions are checked. If f(xnew)
Evolutionary Methods The idea of applying the biological principle of natural evolution to artificial systems, introduced more than three decades ago, has seen impressive growth in the past few years. Evolutionary algorithms have been successfully applied to numerous problems from different domains, including optimization, automatic programming, machine learning, economics, ecology, population genetics, studies of evolution and learning, and social systems. A genetic algorithm is an iterative procedure that consists of a constant-size population of individuals,
each represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space. This space, referred to as the search space, comprises all possible solutions to the problem at hand. Solutions to a problem were originally encoded as binary strings due to certain computational advantages associated with such encoding. Also, the theory about the behavior of algorithms was based on binary strings. Because in many instances it is impractical to represent solutions by using binary strings, the solution representation has been extended in recent years to include character-based encoding, real-valued encoding, and tree representations. The standard genetic algorithm proceeds as follows. An initial population of individuals is generated at random or heuristically. At every evolutionary step, known as a generation, the individuals in the current population are decoded and evaluated according to some predefined quality criterion, referred to as the fitness, or fitness function. To form a new population (the next generation), individuals are selected according to their fitness. Many selection procedures are currently in use, one of the simplest being Holland's original fitness-proportionate selection, where individuals are selected with a probability proportional to their relative fitness. This ensures that the expected number of times an individual is chosen is approximately proportional to its relative performance in the population. Thus, high-fitness ("good") individuals stand a better chance of reproducing, while low-fitness ones are more likely to disappear. In terms of ANN training, a solution (or individual) consists of an array with the net's weights, and its associated fitness is usually the RMSE obtained with this solution on the training set. You can find a lot of research on GA implementations for ANNs. Consider, for instance, the recent work by Alba and Chicano (2004), in which a hybrid GA is proposed. Here, the hybridization refers to the inclusion of problem-dependent knowledge in a general search template. The hybrid algorithms used in this work are combinations of two algorithms (weak hybridization), where one of them acts as an operator in the other. This kind of combination has produced the most successful training methods in the last few years. The authors propose the combination of the GA with the BP algorithm, as well as of the GA with the Levenberg-Marquardt method, for training ANNs. Scatter search (SS) was first introduced in Glover (1977) as a heuristic for integer programming. The following template is a standard for implementing scatter search and consists of five methods: a diversification generation method to generate a collection of diverse trial solutions, an improvement method to transform a trial solution into one or more enhanced trial
solutions, a reference set update method to build and maintain a reference set consisting of the b “best” solutions found, a subset generation method to operate on the reference set in order to produce a subset of its solutions as a basis for creating combined solutions, and a solution combination method to transform a given subset of solutions produced by the subset generation method into one or more combined solution vectors. An exhaustive description of these methods and how they operate can be found in Laguna and Martí (2003). Laguna and Martí (2002) proposed a three-step Scatter Search algorithm for ANNs. El-Fallahi, Martí, and Lasdon (in press) propose a new training method based on the path relinking methodology. Path relinking starts from a given set of elite solutions obtained during a previous search process. Path relinking and its cousin, Scatter Search, are mainly based on two elements: combinations and local search. Path relinking generalizes the concept of combination beyond its usual application to consider paths between solutions. Local search, performed now with the GRG2 optimizer, intensifies the search by seeking local optima. The paper shows an empirical comparison of the proposed method with the best previous evolutionary approaches, and the associated experiments show the superiority of the new method in terms of solution quality (prediction accuracy). On the other hand, these experiments confirm again that a few functions cannot be approximated with any of the current training methods.
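A minimal genetic-algorithm sketch for evolving weight vectors is shown below; the population size, truncation selection, one-point crossover, and Gaussian mutation are illustrative choices and do not reproduce the specific hybrid algorithms cited above.

```python
import random

def ga_train(error, dim, pop_size=30, generations=200, mut_scale=0.5):
    """Evolve real-valued weight vectors; lower error means higher fitness."""
    pop = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=error)                               # best (lowest error) first
        survivors = pop[: pop_size // 2]                  # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, dim) if dim > 1 else 1
            child = p1[:cut] + p2[cut:]                   # one-point crossover
            k = random.randrange(dim)
            child[k] += random.gauss(0, mut_scale)        # mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=error)

# Hypothetical use: error(w) would be the RMSE of the net on the training set.
print(ga_train(lambda w: sum((wi - 1.0) ** 2 for wi in w), dim=4))
```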
FUTURE TRENDS An open problem in the context of prediction is to compare ANNs with some modern approximation techniques developed in statistics. Specifically, the nonparametric additive models and local regression can also offer good solutions to the general approximation or prediction problem. The development of hybrid systems from both technologies could give the starting point for a new generation of prediction systems.
CONCLUSION

In this work I review the most representative methods for neural network training. Several computational studies with some of these methods reveal that the best results are achieved by combining a metaheuristic procedure with a nonlinear optimizer. These experiments also show that, from a practical point of view, some functions cannot be approximated.
REFERENCES

Alba, E., & Chicano, J. F. (2004). Training neural networks with GA hybrid algorithms. In K. Deb (Ed.), Proceedings of the Genetic and Evolutionary Computation Conference, USA.

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

El-Fallahi, A., Martí, R., & Lasdon, L. (in press). Path relinking and GRG for artificial neural networks. European Journal of Operational Research.

Glover, F. (1977). Heuristics for integer programming using surrogate constraints. Decision Sciences, 8, 156-166.

Glover, F., & Laguna, M. (1993). Tabu search. In C. Reeves (Ed.), Heuristic techniques for combinatorial problems (pp. 70-150).

Laguna, M., & Martí, R. (2002). Neural network prediction in a system for optimizing simulations. IIE Transactions, 34(3), 273-282.

Laguna, M., & Martí, R. (2003). Scatter search: Methodology and implementations in C. Kluwer Academic.

Martí, R., & El-Fallahi, A. (2004). Multilayer neural networks: An experimental evaluation of on-line training methods. Computers and Operations Research, 31, 1491-1513.

Martí, R., Laguna, M., & Glover, F. (in press). Principles of scatter search. European Journal of Operational Research.

Minsky, M. L., & Papert, S. A. (1969). Perceptrons (Expanded ed.). Cambridge, MA: MIT Press.

Moré, J. J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. In G. Watson (Ed.), Lecture Notes in Mathematics: Vol. 630.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes: The art of scientific computing. Cambridge, MA: Cambridge University Press.

Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and theory of brain mechanisms. Washington, DC: Spartan.

Sexton. (1998). Global optimization for artificial neural networks: A tabu search application. European Journal of Operational Research, 106, 570-584.

Sexton, R. S., Dorsey, R. E., & Johnson, J. D. (1999). Optimization of neural networks: A comparative analysis of the genetic algorithm and simulated annealing. European Journal of Operational Research, 114, 589-601.

Smith, S., & Lasdon, L. (1992). Solving large nonlinear programs using GRG. ORSA Journal on Computing, 4(1), 2-15.

KEY TERMS

Classification: Also known as a recognition problem; the identification of the class to which a given object belongs.

Genetic Algorithm: An iterative procedure that consists of a constant-size population of individuals, each represented by a finite string of symbols, known as the genome, encoding a possible solution in a given problem space.

Metaheuristic: A master strategy that guides and modifies other heuristics to produce solutions beyond those that are normally generated in a quest for local optimality.

Network Training: The process of finding the values of the network weights that minimize the error across a set of input/output pairs (patterns) called the training set.

Optimization: The quantitative study of optima and the methods for finding them.

Prediction: Consists of approximating unknown functions. The net's input is the values of the function variables, and the output is the estimation of the function image.

Scatter Search: A metaheuristic that belongs to the evolutionary methods.

Tabu Search: A metaheuristic procedure based on principles of intelligent search. Its premise is that problem solving, in order to qualify as intelligent, must incorporate adaptive memory and responsive exploration.
Association Rule Mining Yew-Kwong Woon Nanyang Technological University, Singapore Wee-Keong Ng Nanyang Technological University, Singapore Ee-Peng Lim Nanyang Technological University, Singapore
INTRODUCTION
BACKGROUND
Association Rule Mining (ARM) is concerned with how items in a transactional database are grouped together. It is commonly known as market basket analysis, because it can be likened to the analysis of items that are frequently put together in a basket by shoppers in a market. From a statistical point of view, it is a semiautomatic technique to discover correlations among a set of variables. ARM is widely used in myriad applications, including recommender systems (Lawrence, Almasi, Kotlyar, Viveros, & Duri, 2001), promotional bundling (Wang, Zhou, & Han, 2002), Customer Relationship Management (CRM) (Elliott, Scionti, & Page, 2003), and crossselling (Brijs, Swinnen, Vanhoof, & Wets, 1999). In addition, its concepts have also been integrated into other mining tasks, such as Web usage mining (Woon, Ng, & Lim, 2002), clustering (Yiu & Mamoulis, 2003), outlier detection (Woon, Li, Ng, & Lu, 2003), and classification (Dong & Li, 1999), for improved efficiency and effectiveness. CRM benefits greatly from ARM as it helps in the understanding of customer behavior (Elliott et al., 2003). Marketing managers can use association rules of products to develop joint marketing campaigns to acquire new customers. The application of ARM for the crossselling of supermarket products has been successfully attempted in many cases (Brijs et al., 1999). In one particular study involving the personalization of supermarket product recommendations, ARM has been applied with much success (Lawrence et al., 2001). Together with customer segmentation, ARM helped to increase revenue by 1.8%. In the biology domain, ARM is used to extract novel knowledge on protein-protein interactions (Oyama, Kitano, Satou, & Ito, 2002). It is also successfully applied in gene expression analysis to discover biologically relevant associations between different genes or between different environment conditions (Creighton & Hanash, 2003).
Recently, a new class of problems emerged to challenge ARM researchers: Incoming data is streaming in too fast and changing too rapidly in an unordered and unbounded manner. This new phenomenon is termed data stream (Babcock, Babu, Datar, Motwani, & Widom, 2002). One major area where the data stream phenomenon is prevalent is the World Wide Web (Web). A good example is an online bookstore, where customers can purchase books from all over the world at any time. As a result, its transactional database grows at a fast rate and presents a scalability problem for ARM. Traditional ARM algorithms, such as Apriori, were not designed to handle large databases that change frequently (Agrawal & Srikant, 1994). Each time a new transaction arrives, Apriori needs to be restarted from scratch to perform ARM. Hence, it is clear that in order to conduct ARM on the latest state of the database in a timely manner, an incremental mechanism to take into consideration the latest transaction must be in place. In fact, a host of incremental algorithms have already been introduced to mine association rules incrementally (Sarda & Srinivas, 1998). However, they are only incremental to a certain extent; the moment the universal itemset (the number of unique items in a database) (Woon, Ng, & Das, 2001) is changed, they have to be restarted from scratch. The universal itemset of any online store would certainly be changed frequently, because the store needs to introduce new products and retire old ones for competitiveness. Moreover, such incremental ARM algorithms are efficient only when the database has not changed much since the last mining. The use of data structures in ARM, particularly the trie, is one viable way to address the data stream phenomenon. Data structures first appeared when programming became increasingly complex during the 1960s. In his classic book, The Art of Computer Programming Knuth (1968) reviewed and analyzed algorithms and data structures that are necessary for program efficiency.
Since then, the traditional data structures have been extended, and new algorithms have been introduced for them. Though computing power has increased tremendously over the years, efficient algorithms with customized data structures are still necessary to obtain timely and accurate results. This fact is especially true for ARM, which is a computationally intensive process. The trie is a multiway tree structure that allows fast searches over string data. In addition, as strings with common prefixes share the same nodes, storage space is better utilized. This makes the trie very useful for storing large dictionaries of English words. Figure 1 shows a trie storing four English words (ape, apple, base, and ball).

Figure 1. An example of a trie for storing English words

Several novel trie-like data structures have been introduced to improve the efficiency of ARM, and we discuss them in this section. Amir, Feldman, & Kashi (1999) presented a new way of mining association rules by using a trie to preprocess the database. In this approach, all transactions are mapped onto a trie structure. This mapping involves the extraction of the powerset of the transaction items and the updating of the trie structure. Once built, there is no longer a need to scan the database to obtain support counts of itemsets, because the trie structure contains all their support counts. To find frequent itemsets, the structure is traversed by using depth-first search, and itemsets with support counts satisfying the minimum support threshold are added to the set of frequent itemsets. Drawing upon that work, Yang, Johar, Grama, & Szpankowski (2000) introduced a binary Patricia trie to reduce the heavy memory requirements of the preprocessing trie. To support faster support queries, the authors added a set of horizontal pointers to index nodes. They also advocated the use of some form of primary threshold to further prune the structure. However, the compression achieved by the compact Patricia trie comes at a hefty price: It greatly complicates the horizontal pointer index, which is a severe overhead. In addition, after compression, it will be difficult for the Patricia trie to be updated whenever the database is altered.
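A dictionary-based sketch of the trie in Figure 1 is given below; the nested-dict representation and the end-of-word marker are implementation conveniences for illustration, not the structures used by the algorithms discussed in this section.

```python
END = "$"  # marker for the end of a word

def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # shared prefixes reuse the same nodes
    node[END] = True

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ["ape", "apple", "base", "ball"]:
    insert(trie, w)
print(contains(trie, "apple"), contains(trie, "app"))   # True False
```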
The Frequent Pattern-growth (FP-growth) algorithm is a recent association rule mining algorithm that achieves impressive results (Han, Pei, Yin, & Mao, 2004). It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent 1-itemsets. This compact structure removes the need for multiple database scans and is constructed with only 2 scans. In the first database scan, frequent 1-itemsets are obtained and sorted in support descending order. In the second scan, items in the transactions are first sorted according to the order of the frequent 1-itemsets. These sorted items are used to construct the FP-tree. Figure 2 shows an FP-tree constructed from the database in Table 1. FP-growth then proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets without candidate generation and database scans. It does so by examining all the conditional pattern bases of the FP-tree, which consist of the sets of frequent itemsets occurring with the suffix pattern. Conditional FP-trees are constructed from these conditional pattern bases, and mining is carried out recursively with such trees to discover frequent itemsets of various sizes. However, because both the construction and the use of the FP-trees are complex, the performance of FP-growth is reduced to be on par with Apriori at support thresholds of 3% and above. It only achieves significant speed-ups at support thresholds of 1.5% and below. Moreover, it is only incremental to a certain extent, depending on the FP-tree watermark (validity support threshold). As new transactions arrive, the support counts of items increase, but their relative support frequency may decrease, too.

Table 1. A sample transactional database

TID   Items
100   A, C
200   B, C
300   A, B, C
400   A, B, C, D

Figure 2. An FP-tree constructed from the database in Table 1 at a support threshold of 50%

Suppose, however, that the new transactions cause too many previously infrequent itemsets to become
frequent — that is, the watermark is raised too high (in order to make such itemsets infrequent) according to a user-defined level — then the FP-tree must be reconstructed. The use of lattice theory in ARM was pioneered by Zaki (2000). Lattice theory allows the vast search space to be decomposed into smaller segments that can be tackled independently in memory or even in other machines, thus promoting parallelism. However, they require additional storage space as well as different traversal and construction techniques. To complement the use of lattices, Zaki uses a vertical database format, where each itemset is associated with a list of transactions known as a tid-list (transaction identifier–list). This format is useful for fast frequency counting of itemsets but generates additional overheads because most databases have a horizontal format and would need to be converted first. The Continuous Association Rule Mining Algorithm (CARMA), together with the support lattice, allows the user to change the support threshold and continuously displays the resulting association rules with support and confidence bounds during its first scan/phase (Hidber, 1999). During the second phase, it determines the precise support of each itemset and extracts all the frequent itemsets. CARMA can readily compute frequent itemsets for varying support thresholds. However, experiments reveal that CARMA only performs faster than Apriori at support thresholds of 0.25% and below, because of the tremendous overheads involved in constructing the support lattice. The adjacency lattice, introduced by Aggarwal & Yu (2001), is similar to Zaki’s boolean powerset lattice, except the authors introduced the notion of adjacency among itemsets, and it does not rely on a vertical database format. Two itemsets are said to be adjacent to each other if one of them can be transformed to the other with the addition of a single item. To address the problem of heavy memory requirements, a primary threshold is defined. This term signifies the minimum support threshold possible to fit all the qualified itemsets into the adjacency lattice in main memory. However, this approach disallows the mining of frequent itemsets at support thresholds lower than the primary threshold.
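Zaki's vertical format can be illustrated with a short sketch: each item is mapped to the set of transaction identifiers that contain it, and the support count of an itemset is the size of the intersection of the tid-lists of its members. The toy database mirrors Table 1 above.

```python
from functools import reduce

# Horizontal database (as in Table 1): TID -> items
db = {100: {"A", "C"}, 200: {"B", "C"}, 300: {"A", "B", "C"}, 400: {"A", "B", "C", "D"}}

# Convert to the vertical format: item -> tid-list (here a set of TIDs)
tidlists = {}
for tid, items in db.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def support(itemset):
    """Support count of an itemset = size of the intersection of its members' tid-lists."""
    return len(reduce(set.intersection, (tidlists[i] for i in itemset)))

print(support({"A", "C"}))   # 3 transactions contain both A and C
print(support({"B", "D"}))   # 1
```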
MAIN THRUST

As shown in our previous discussion, none of the existing data structures can effectively address the issues induced by the data stream phenomenon. Here are the desirable characteristics of an ideal data structure that can help ARM cope with data streams:

• It is highly scalable with respect to the size of both the database and the universal itemset.
• It is incrementally updated as transactions are added or deleted.
• It is constructed independent of the support threshold and thus can be used for various support thresholds.
• It helps to speed up ARM algorithms to an extent that allows results to be obtained in real time.

We shall now discuss our novel trie data structure that not only satisfies the above requirements but also outperforms the discussed existing structures in terms of efficiency, effectiveness, and practicality. Our structure is termed Support-Ordered Trie Itemset (SOTrieIT — pronounced "so-try-it"). It is a dual-level support-ordered trie data structure used to store pertinent itemset information to speed up the discovery of frequent itemsets. As its construction is carried out before actual mining, it can be viewed as a preprocessing step. For every transaction that arrives, 1-itemsets and 2-itemsets are first extracted from it. For each itemset, the SOTrieIT will be traversed in order to locate the node that stores its support count. Support counts of 1-itemsets and 2-itemsets are stored in first-level and second-level nodes, respectively. The traversal of the SOTrieIT thus requires at most two redirections, which makes it very fast. At any point in time, the SOTrieIT contains the support counts of all 1-itemsets and 2-itemsets that appear in all the transactions. It will then be sorted level-wise from left to right according to the support counts of the nodes in descending order.

Figure 3. A SOTrieIT structure built from the database in Table 1 (first-level nodes C(4), A(3), B(3), and D(1), each with its second-level children; for instance, B(2) appears under A for the itemset {AB})

Figure 3 shows a SOTrieIT constructed from the database in Table 1. The bracketed number beside an item is its support count. Hence, the support count of itemset {AB} is 2. Notice that the nodes are ordered by support counts in a level-wise descending order. In algorithms such as FP-growth that use a similar data structure to store itemset information, the structure must be rebuilt to accommodate updates to the universal
itemset. The SOTrieIT can be easily updated to accommodate the new changes. If a node for a new item in the universal itemset does not exist, it will be created and inserted into the SOTrieIT accordingly. If an item is removed from the universal itemset, all nodes containing that item need only be removed, and the rest of the nodes would still be valid. Unlike the trie structure of Amir et al. (1999), the SOTrieIT is ordered by support count (which speeds up mining) and does not require the powersets of transactions (which reduces construction time). The main weakness of the SOTrieIT is that it can only discover frequent 1-itemsets and 2-itemsets; its main strength is its speed in discovering them. They can be found promptly because there is no need to scan the database. In addition, the search (depth first) can be stopped at a particular level the moment a node representing a nonfrequent itemset is found, because the nodes are all support ordered. Another advantage of the SOTrieIT, compared with all previously discussed structures, is that it can be constructed online, meaning that each time a new transaction arrives, the SOTrieIT can be incrementally updated. This feature is possible because the SOTrieIT is constructed without the need to know the support threshold; it is support independent. All 1-itemsets and 2itemsets in the database are used to update the SOTrieIT regardless of their support counts. To conserve storage space, existing trie structures such as the FP-tree have to use thresholds to keep their sizes manageable; thus, when new transactions arrive, they have to be reconstructed, because the support counts of itemsets will have changed. Finally, the SOTrieIT requires far less storage space than a trie or Patricia trie because it is only two levels deep and can be easily stored in both memory and files. Although this causes some input/output (I/O) overheads, it is insignificant as shown in our extensive experiments. We have designed several algorithms to work synergistically with the SOTrieIT and, through experiments with existing prominent algorithms and a variety of databases, we have proven the practicality and superiority of our approach (Das, Ng, & Woon, 2001; Woon et al., 2001). In fact, our latest algorithm, FOLD-growth, is shown to outperform FP-growth by more than 100 times (Woon, Ng, & Lim, 2004).
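One minimal reading of the SOTrieIT description is sketched below with nested dictionaries: the first level holds 1-itemset counts, the second level holds 2-itemset counts, and both levels can be reported in support-descending order. It is an illustration of the idea, not the authors' implementation, and the class and method names are invented for this sketch.

```python
from itertools import combinations

class SOTrieITSketch:
    """Two-level support-ordered structure: counts of all 1- and 2-itemsets."""

    def __init__(self):
        self.level1 = {}          # item -> support count
        self.level2 = {}          # item -> {second item -> support count}

    def add_transaction(self, items):
        items = sorted(set(items))
        for a in items:
            self.level1[a] = self.level1.get(a, 0) + 1
        for a, b in combinations(items, 2):
            children = self.level2.setdefault(a, {})
            children[b] = children.get(b, 0) + 1

    def frequent(self, min_count):
        """Frequent 1- and 2-itemsets, each level ordered by descending support."""
        ones = sorted(((c, (a,)) for a, c in self.level1.items() if c >= min_count), reverse=True)
        twos = sorted(((c, (a, b)) for a, kids in self.level2.items()
                       for b, c in kids.items() if c >= min_count), reverse=True)
        return ones, twos

# The database in Table 1 at a 50% support threshold (2 of 4 transactions).
trie = SOTrieITSketch()
for t in ["AC", "BC", "ABC", "ABCD"]:
    trie.add_transaction(t)
print(trie.frequent(2))
```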
FUTURE TRENDS

The data stream phenomenon will eventually become ubiquitous as Internet access and bandwidth become increasingly affordable. With keen competition, products will become more complex with customization and
more varied to cater to a broad customer base; transaction databases will grow in both size and complexity. Hence, association rule mining research will certainly continue to receive much attention in the quest for faster, more scalable and more configurable algorithms.
CONCLUSION Association rule mining is an important data mining task with several applications. However, to cope with the current explosion of raw data, data structures must be utilized to enhance its efficiency. We have analyzed several existing trie data structures used in association rule mining and presented our novel trie structure, which has been proven to be most useful and practical. What lies ahead is the parallelization of our structure to further accommodate the ever-increasing demands of today’s need for speed and scalability to obtain association rules in a timely manner. Another challenge is to design new data structures that facilitate the discovery of trends as association rules evolve over time. Different association rules may be mined at different time points and, by understanding the patterns of changing rules, additional interesting knowledge may be discovered.
REFERENCES Aggarwal, C. C., & Yu, P. S. (2001). A new approach to online generation of association rules. IEEE Transactions on Knowledge and Data Engineering, 13(4), 527540. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Databases (pp. 487-499), Chile. Amir, A., Feldman, R., & Kashi, R. (1999). A new and versatile method for association generation. Information Systems, 22(6), 333-347. Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. Proceedings of the ACM SIGMOD/PODS Conference (pp. 1-16), USA. Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999). Using association rules for product assortment decisions: A case study. Proceedings of the Fifth ACM SIGKDD Conference (pp. 254-260), USA. Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86.
Das, A., Ng, W. K., & Woon, Y. K. (2001). Rapid association rule mining. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 474-481), USA. Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 43-52), USA. Elliott, K., Scionti, R., & Page, M. (2003). The confluence of data mining and market research for smarter CRM. Retrieved from http://www.spss.com/home_page/ wp133.htm Han, J., Pei, J., Yin Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-97.
Woon, Y. K., Ng, W. K., & Lim, E. P. (2002). Online and incremental mining of separately grouped web access logs. Proceedings of the Third International Conference on Web Information Systems Engineering (pp. 53-62), Singapore. Woon, Y. K., Ng, W. K., & Lim, E. P. (2004). A supportordered trie for fast frequent itemset discovery. IEEE Transactions on Knowledge and Data Engineering, 16(5). Yang, D. Y., Johar, A., Grama, A., & Szpankowski, W. (2000). Summary structures for frequency queries on large transaction sets. Proceedings of the Data Compression Conference (pp. 420-429). Yiu, M. L., & Mamoulis, N. (2003). Frequent-pattern based iterative projected clustering. Proceedings of the Third International Conference on Data Mining, USA.
Hidber, C. (1999). Online association rule mining. Proceedings of the ACM SIGMOD Conference (pp. 145-154), USA.
Zaki, M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372-390.
Knuth, D.E. (1968). The art of computer programming, Vol. 1. Fundamental Algorithms. Addison-Wesley Publishing Company.
Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S., & Duri, S. (2001). Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery, 5(1/2), 11-32. Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(5), 705-714. Sarda, N. L., & Srinivas, N. V. (1998). An adaptive algorithm for incremental mining of association rules. Proceedings of the Ninth International Conference on Database and Expert Systems (pp. 240-245), Austria. Wang, K., Zhou, S., & Han, J. (2002). Profit mining: From patterns to actions. Proceedings of the Eighth International Conference on Extending Database Technology (pp. 70-87), Prague. Woon, Y. K., Li, X., Ng, W. K., & Lu, W. F. (2003). Parameterless data compression and noise filtering using association rule mining. Proceedings of the Fifth International Conference on Data Warehousing and Knowledge Discovery (pp. 278-287), Prague. Woon, Y. K., Ng, W. K., & Das, A. (2001). Fast online dynamic association rule mining. Proceedings of the Second International Conference on Web Information Systems Engineering (pp. 278-287), Japan.

KEY TERMS
Apriori: A classic algorithm that popularized association rule mining. It pioneered a method to generate candidate itemsets by using only frequent itemsets in the previous pass. The idea rests on the fact that any subset of a frequent itemset must be frequent as well. This idea is also known as the downward closure property.

Itemset: An unordered set of unique items, which may be products or features. For computational efficiency, the items are often represented by integers. A frequent itemset is one with a support count that exceeds the support threshold, and a candidate itemset is a potential frequent itemset. A k-itemset is an itemset with exactly k items.

Key: A unique sequence of values that defines the location of a node in a tree data structure.

Patricia Trie: A compressed binary trie. The Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric) trie is compressed by avoiding one-way branches. This is accomplished by including in each node the number of bits to skip over before making the next branching decision.

SOTrieIT: A dual-level trie whose nodes represent itemsets. The position of a node is ordered by the support count of the itemset it represents; the most frequent
itemsets are found on the leftmost branches of the SOTrieIT.
Support Count of an Itemset: The number of transactions that contain a particular itemset.

Support Threshold: A threshold value that is used to decide if an itemset is interesting/frequent. It is defined by the user, and generally, an association rule mining algorithm has to be executed many times before this value can be well adjusted to yield the desired results.

Trie: An n-ary tree whose organization is based on key space decomposition. In key space decomposition, the key range is equally subdivided, and the splitting position within the key range for each node is predefined.
Association Rule Mining and Application to MPIS Raymond Chi-Wing Wong The Chinese University of Hong Kong, Hong Kong Ada Wai-Chee Fu The Chinese University of Hong Kong, Hong Kong
INTRODUCTION Association rule mining (Agrawal, Imilienski, & Swami, 1993) has been proposed for understanding the relationships among items in transactions or market baskets. For instance, if a customer buys butter, what is the chance that he/she buys bread at the same time? Such information may be useful for decision makers to determine strategies in a store.
BACKGROUND

Consider a set I = {I1, I2, …, In} of items (e.g., carrot, orange, and knife) in a supermarket. The database contains a number of transactions. Each transaction t is a binary vector with t[k]=1 if t bought item Ik and t[k]=0 otherwise (e.g., {1, 0, 0, 1, 0}). An association rule is of the form X ⇒ Ij, where X is a set of some items in I, and Ij is a single item not in X (e.g., {Orange, Knife} ⇒ Plate). A transaction t satisfies X if for all items Ik in X, t[k] = 1. The support for a rule X ⇒ Ij is the fraction of transactions that satisfy the union of X and Ij. A rule X ⇒ Ij has confidence c% if and only if c% of transactions that satisfy X also satisfy Ij. The mining process of association rules can be divided into two steps:

1. Frequent Itemset Generation: Generate all sets of items that have support greater than or equal to a certain threshold, called minsupport.
2. Association Rule Generation: From the frequent itemsets, generate all association rules that have confidence greater than or equal to a certain threshold, called minconfidence.
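The definitions of support and confidence above can be made concrete with a short sketch; the four market-basket transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Ij):
    """Fraction of transactions satisfying X that also satisfy Ij."""
    return support(transactions, X | {Ij}) / support(transactions, X)

# Hypothetical market baskets
transactions = [{"Orange", "Knife", "Plate"},
                {"Orange", "Knife"},
                {"Carrot", "Plate"},
                {"Orange", "Knife", "Plate", "Carrot"}]
print(support(transactions, {"Orange", "Knife"}))              # 0.75
print(confidence(transactions, {"Orange", "Knife"}, "Plate"))  # about 0.67
```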
Step 1 is much more difficult compared with Step 2. Thus, researchers have focused on the studies of frequent itemset generation. The Apriori Algorithm is a well-known approach, which was proposed by Agrawal & Srikant (1994), to find
frequent itemsets. It is an iterative approach and there are two steps in each iteration. The first step generates a set of candidate itemsets. Then, the second step prunes all disqualified candidates (i.e., all infrequent itemsets). The iterations begin with size-2 itemsets and the size is incremented at each iteration. The algorithm is based on the closure property of frequent itemsets: if a set of items is frequent, then all its proper subsets are also frequent. The weaknesses of this algorithm are the generation of a large number of candidate itemsets and the requirement to scan the database once in each iteration. A data structure called FP-tree and an efficient algorithm called FP-growth are proposed by Han, Pei, & Yin (2000) to overcome the above weaknesses. The idea of the FP-tree is to fetch all transactions from the database and insert them into a compressed tree structure. Then, the FP-growth algorithm reads from the FP-tree structure to mine frequent itemsets.
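A compact sketch of the Apriori iteration — candidate generation from the frequent itemsets of the previous pass followed by support-based pruning — is shown below; it is a simplified illustration rather than the optimized algorithm of Agrawal & Srikant (1994).

```python
from itertools import combinations

def apriori(transactions, minsupport):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= minsupport}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets and keep only candidates whose (k-1)-subsets
        # are all frequent (the closure property).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsupport}
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset("AC"), frozenset("BC"), frozenset("ABC"), frozenset("ABCD")]
print(apriori(transactions, minsupport=0.5))
```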
MAIN THRUST

Variations in Association Rules

Many variations on the above problem formulation have been suggested. The association rules can be classified based on the following (Han & Kamber, 2000):

1. Association Rules Based on the Type of Values of Attribute: Based on the type of values of attributes, there are two kinds – the Boolean association rule, which is presented above, and the quantitative association rule. A quantitative association rule describes the relationships among some quantitative attributes (e.g., income and age). An example is income(40K..50K) ⇒ age(40..45). One proposed method is grid-based — dividing each attribute into a fixed number of partitions [Association Rule Clustering System (ARCS) in Lent, Swami & Widom (1997)]. Srikant & Agrawal (1996) proposed to partition quantitative attributes dynamically and to merge the partitions based on a measure of partial completeness. Another non-grid based approach is found in Zhang, Padmanabhan, & Tuzhilin (2004).

2. Association Rules Based on the Dimensionality of Data: Association rules can be divided into single-dimensional association rules and multi-dimensional association rules. One example of a single-dimensional rule is buys({Orange, Knife}) ⇒ buys(Plate), which contains only the dimension buys. A multi-dimensional association rule is one containing attributes for more than one dimension. For example, income(40K..50K) ⇒ buys(Plate). One mining approach is to borrow the concept of the data cube from the field of data warehousing. Figure 1 shows a lattice for the data cube for the dimensions age, income and buys. Researchers (Kamber, Han, & Chiang, 1997) applied the data cube model and used the aggregate techniques for mining.

3. Association Rules Based on the Level of Abstractions of Attribute: The rules discussed in previous sections can be viewed as single-level association rules. A rule that references different levels of abstraction of attributes is called a multilevel association rule. Suppose there are two rules – income(10K..20K) ⇒ buys(fruit) and income(10K..20K) ⇒ buys(orange). There are two different levels of abstraction in these two rules because "fruit" is a higher-level abstraction of "orange." Han & Fu (1995) apply a top-down strategy to the concept hierarchy in the mining of frequent itemsets.
Figure 1. A lattice showing the data cube for the dimensions age, income, and buys (nodes: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys))

Figure 2. A concept hierarchy of the fruit (fruit subsumes apple, orange, banana, …)

Other Extensions to Association Rule Mining

There are other extensions to association rule mining. Some of them (Bayardo, 1998) find maxpatterns (i.e., maximal frequent patterns) while others (Zaki & Hsiao, 2002) find frequent closed itemsets. A maxpattern is a frequent itemset that does not have a frequent superset. A frequent itemset is a frequent closed itemset if there
exists no itemset X’ such that (1) X ⊂ X’ and (2) ∀ transactions t, X is in t implies X’ is in t. These considerations can reduce the resulting number of frequent itemsets significantly. Another variation of the frequent itemset problem is mining top-K frequent itemsets (Cheung & Fu, 2004). The problem is to find K frequent itemsets with the greatest supports. It is often more reasonable to assume the parameter K, instead of the data-distribution dependent parameter of minsupport because the user typically would not have the knowledge of the data distribution before data mining. The other variations of the problem are the incremental update of mining association rules (Hidber, 1999), constraint-based rule mining (Grahne & Lakshmanan, 2000), distributed and parallel association rule mining (Gilburd, Schuster, & Wolff, 2004), association rule mining with multiple minimum supports/without minimum support (Chiu, Wu, & Chen, 2004), association rule mining with weighted item and weight support (Tao, Murtagh, & Farid, 2003), and fuzzy association rule mining (Kuok, Fu, & Wong, 1998). Association rule mining has been integrated with other data mining problems. There have been the integration of classification and association rule mining (Wang, Zhou, & He, 2000) and the integration of association rule mining with relational database systems (Sarawagi, Thomas, & Agrawal, 1998).
Application of the Concept of Association Rules to MPIS Other than market basket analysis (Blischok, 1995), association rules can also help in applications such as intrusion detection (Lee, Stolfo, & Mok, 1999), heterogeneous genome data (Satou et al., 1997), mining remotely sensed images/data (Dong, Perrizo, Ding, & Zhou, 2000) and product assortment decisions (Wong, Fu, & Wang, 2003; Wong & Fu, 2004). Here we focus on the application on product assortment decisions, as it is one of very few examples where the association rules are not the end mining results.
Transaction databases in some applications can be very large. For example, Hedberg (1995) quoted that Wal-Mart kept about 20 million sales transactions per day. Such data requires sophisticated analysis. As pointed out by Blischok (1995), a major task of talented merchants is to pick the profit-generating items and discard the losing items. It may be simple enough to sort items by their profit and do the selection. However, this ignores a very important aspect in market analysis — the cross-selling effect. There can be items that do not generate much profit by themselves but are the catalysts for the sales of other profitable items. Recently, some researchers (Kleinberg, Papadimitriou, & Raghavan, 1998) suggested that concepts of association rules can be used in the item selection problem with the consideration of relationships among items. One example of the product assortment decisions is Maximal-Profit Item Selection (MPIS) with cross-selling considerations (Wong, Fu, & Wang, 2003). Consider the major task of merchants to pick profit-generating items and discard the losing items. Assume we have a history record of the sales (transactions) of all items. The problem is to select a subset from the given set of items so that the estimated profit of the resulting selection is maximal among all choices. Suppose a shop carries office equipment composed of monitors, keyboards and telephones, with profits of $1000K, $100K and $300K, respectively. If now the shop decides to remove one of the three items from its stock, the question is which two we should choose to keep. If we simply examine the profits, we may choose to keep monitors and telephones, and so the total profit is $1300K. However, we know that there is a strong cross-selling effect between monitor and keyboard (see Table 1). If the shop stops carrying keyboards, the customers of monitors may choose to shop elsewhere to get both items. The profit from monitors may drop greatly, and we may be left with a profit of $300K from telephones only. If we choose to keep both monitors and keyboards, then the profit can be expected to be $1100K, which is higher. MPIS will give us the desired solution. MPIS utilizes the concept of the relationship between selected items and unselected items. Such a relationship is modeled by the cross-selling factor. Suppose d is the set of unselected items and I is the selected item. A loss rule is proposed in the form I ⇒ ◊d, where ◊d means the purchase of any item in d. The rule indicates that from the history, whenever a customer buys the item I, he/she also buys at least one of the items in d. Interpreting this as a pattern of customer behavior, and assuming that the pattern will not change even when some items were removed from the stock, if none of the items in d are available then the customer also will not purchase I. This is because if the customer still purchases I, without purchasing any items in d, then the pattern would be changed. Therefore, the higher the confidence of I ⇒ ◊d, the more likely the profit of I should not be counted. This is the reasoning behind the above definition. In the above example, suppose we choose monitor and telephone. Then, d = {keyboard}. All profits of monitor will be lost if, in the history, we find conf(I ⇒ ◊d) = 1, where I = monitor. This example illustrates the importance of the consideration of the cross-selling factor in the profit estimation. Wong, Fu, & Wang (2003) propose two algorithms to deal with this problem. In the first algorithm, they approximate the total profit of the item selection in quadratic form and solve a quadratic optimization problem. The second one is a greedy approach called MPIS_Alg, which prunes items iteratively according to an estimated function based on the formula of the total profit of the item selection until J items remain. Another product assortment decision problem is studied by Wong & Fu (2004), which addresses the problem of selecting a set of marketing items in order to boost the sales of the store.
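One plausible numerical reading of the loss-rule idea is sketched below: each selected item's profit is discounted by the confidence of its loss rule with respect to the unselected items. This discounting formula is an assumption made for illustration and is not necessarily the exact quadratic objective of Wong, Fu, & Wang (2003); the history used is the one in Table 1.

```python
def loss_rule_confidence(transactions, item, d):
    """conf(I => any item of d): among transactions buying I, the fraction that also buy some item in d."""
    with_item = [t for t in transactions if item in t]
    if not with_item or not d:
        return 0.0
    return sum(bool(t & d) for t in with_item) / len(with_item)

def estimated_profit(transactions, profits, selection):
    """Assumed reading: discount each selected item's profit by the confidence of its loss rule."""
    d = set(profits) - set(selection)      # the unselected items
    return sum(profits[i] * (1 - loss_rule_confidence(transactions, i, d)) for i in selection)

# The office equipment example: profits in $K and the purchase history of Table 1.
profits = {"Monitor": 1000, "Keyboard": 100, "Telephone": 300}
history = [{"Monitor", "Keyboard"}, {"Monitor", "Keyboard"}, {"Telephone"},
           {"Telephone"}, {"Telephone"}, {"Monitor", "Keyboard", "Telephone"}]
for keep in (["Monitor", "Telephone"], ["Monitor", "Keyboard"]):
    print(keep, estimated_profit(history, profits, keep))
```

With this reading, keeping monitors and keyboards scores higher than keeping monitors and telephones, which agrees with the conclusion of the example above, even though the raw values differ from the simplified narrative because every selected item is discounted by its own loss rule.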
FUTURE TRENDS

A new area for investigation of the problem of mining frequent itemsets is mining data streams for frequent itemsets (Manku & Motwani, 2002; Yu, Chong, Lu, & Zhou, 2004). In such a problem, the data is so massive that it cannot all be stored in the memory of a computer and cannot be processed by traditional algorithms. The objective of all proposed algorithms is to store as little data as possible and to minimize the error introduced by the estimation in the model. Privacy preservation in association rule mining has also been studied rigorously in recent years (Vaidya & Clifton, 2002; Agrawal, Evfimievski, & Srikant, 2003). The problem is to mine from two or more different sources without exposing individual transaction data to each other.
Table 1.

Monitor   Keyboard   Telephone
1         1          0
1         1          0
0         0          1
0         0          1
0         0          1
1         1          1

CONCLUSION

Association rule mining plays an important role in the literature of data mining. It poses many challenging
issues for the development of efficient and effective methods. After taking a closer look, we find that the application of association rules requires much more investigation in order to aid more specific targets. We may see a trend towards the study of applications of association rules.
REFERENCES Agrawal, R., Evfimievski, A., & Srikant, R. (2003). Information sharing across private database. SIGMOD, 86-97. Agrawal, R., Imilienski, T., & Swami. (1993). Mining association rules between sets of items in large databases. SIGMOD, 129-140. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference (pp. 487-499). Bayardo, R.J. (1998). Efficiently mining long patterns from databases. SIGMOD, 85-93. Blischok, T. (1995). Every transaction tells a story. Chain Store Age Executive with Shopping Center Age, 71(3), 50-57. Cheung, Y.L., & Fu, A.W.-C. (2004). Mining association rules without support threshold: With and without Item Constraints. TKDE, 16(9), 1052-1069. Chiu, D.-Y., Wu, Y.-H., & Chen, A.L.P. (2004). An efficient algorithm for mining frequent sequences by a new strategy without support counting. ICDE, 375-386. Dong, J., Perrizo, W., Ding, Q., & Zhou, J. (2000). The application of association rule mining to remotely sensed data. In Proceedings of the 2000 ACM symposium on Applied computing (pp. 340-345). Gilburd, B., Schuster, A., & Wolff, R. (2004). A new privacy model and association-rule mining algorithm for large-scale distributed environments. SIGKDD. Grahne, G., Lakshmanan, L., & Wang, X. (2000). Efficient mining of constrained correlated sets. ICDE, 512-521 Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on VLDB (pp. 420-431). Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. San Mateo, CA: Morgan Kaufmann Publishers. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. SIGMOD, 1-12.
Hedberg, S. (1995, October). The data gold rush. BYTE, 83-99. Hidber, C. (1999), Online association rule mining. SIGMOD, 145-156. Kamber, M., Han, J., & Chiang, J.Y. (1997). Metaruleguided mining of multi-dimensional association rules using data cubes. In Proceeding of the 3rd International Conference on Knowledge Discovery and Data Mining (pp. 207-210). Kleinberg, J., Papadimitriou, C., & Raghavan, P. (1998). A microeconomic view of data mining. Knowledge Discovery Journal, 2(4), 311-324. Kuok, C.M., Fu, A.W.C., & Wong, M.H., (1998). Mining fuzzy association rules in databases. ACM SIGMOD Record, 27(1), 41-46. Lee, W., Stolfo, S.J., & Mok, K.W. (1999). A data mining framework for building intrusion detection models. In IEEE Symposium on Security and Privacy (pp. 120-132). Lent, B., Swami, A.N., & Widom, J. (1997). Clustering Association Rules. In ICDE (pp. 220-231). Manku, G.S., & Motwani, R. (2002). Approximate frequency counts over data streams. In Proceedings of the 20 th International Conference on VLDB (pp. 346-357). Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD, 343-354. Satou, K., Shibayama, G., Ono, T., Yamamura, Y., Furuichi, E., Kuhara, S., & Takagi, T. (1997). Finding association rules on heterogeneous genome data. In Pacific Symposium on Biocomputing (PSB) (pp. 397-408). Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. SIGMOD, 1-12. Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661-666). Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644). Wang, K., Zhou, S., & He, Y. (2000). Growing decision trees on support-less association rules. In Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 265-269).
Wong, R.C.-W., & Fu, A.W.-C. (2004). ISM: Item selection for marketing with cross-selling considerations. In Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference (PAKDD) (pp. 431-440), Lecture Notes in Computer Science 3056. Berlin: Springer. Wong, R.C.-W., Fu, A.W.-C., & Wang, K. (2003). MPIS: Maximal-profit item selection with cross-selling considerations. In IEEE International Conference on Data Mining (ICDM) (pp. 371-378). Yu, J.X., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases. Zaki, M.J., & Hsiao, C.J. (2002). CHARM: An efficient algorithm for closed itemset mining. In SIAM International Conference on Data Mining (SDM). Zhang, H., Padmanabhan, B., & Tuzhilin, A. (2004). On the discovery of significant statistical quantitative rules. In Proceedings of the 10th ACM SIGKDD Knowledge Discovery and Data Mining Conference.
KEY TERMS
Association Rule: A rule of the form X → Ij, where X is a set of items and Ij is a single item not in X.
Confidence: The confidence of a rule X → Ij, where X is a set of items and Ij is a single item not in X, is the fraction of the transactions containing all items in X that also contain item Ij.
Frequent Itemset/Pattern: An itemset with support greater than or equal to a certain threshold, called minsupport.
Infrequent Itemset: An itemset with support smaller than a certain threshold, called minsupport.
Itemset: A set of items.
K-Itemset: An itemset with k items.
Maximal-Profit Item Selection (MPIS): The problem of selecting a set of items so as to maximize the total profit while taking the cross-selling effect into account.
Support (Itemset) or Frequency: The support of an itemset X is the fraction of transactions containing all items in X.
Support (Rule): The support of a rule X → Ij is the fraction of all transactions that contain every item in X together with Ij.
Transaction: A record containing the items bought by a customer.
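As a rough illustration of the support and confidence measures defined above (the transaction set and the rule {milk} → bread are invented for the example, not taken from the article), a few lines of Python make the definitions concrete:

```python
# Toy transaction database; each transaction is a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk"},
    {"bread", "beer"},
]

X = {"milk"}      # antecedent itemset of the rule X -> Ij
Ij = "bread"      # consequent item

n = len(transactions)
with_X = [t for t in transactions if X <= t]        # transactions containing all of X
with_rule = [t for t in with_X if Ij in t]          # ...that also contain Ij

support_X = len(with_X) / n                  # support of the itemset X: 3/4
support_rule = len(with_rule) / n            # support of the rule X -> Ij: 2/4
confidence = len(with_rule) / len(with_X)    # confidence of X -> Ij: 2/3

print(support_X, support_rule, round(confidence, 3))
```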
Association Rule Mining of Relational Data
Anne Denton, North Dakota State University, USA
Christopher Besemann, North Dakota State University, USA
INTRODUCTION Most data of practical relevance are structured in more complex ways than is assumed in traditional data mining algorithms, which are based on a single table. The concept of relations allows for discussing many data structures such as trees and graphs. Relational data have much generality and are of significant importance, as demonstrated by the ubiquity of relational database management systems. It is, therefore, not surprising that popular data mining techniques, such as association rule mining, have been generalized to relational data. An important aspect of the generalization process is the identification of problems that are new to the generalized setting.
BACKGROUND
Several areas of databases and data mining contribute to advances in association rule mining of relational data:
•	Relational Data Model: Underlies most commercial database technology and also provides a strong mathematical framework for the manipulation of complex data. Relational algebra provides a natural starting point for generalizations of data mining techniques to complex data types.
•	Inductive Logic Programming, ILP (Džeroski & Lavrač, 2001): A form of logic programming in which individual instances are generalized to make hypotheses about unseen data. Background knowledge is incorporated directly.
•	Association Rule Mining, ARM (Agrawal, Imielinski, & Swami, 1993): Identifies associations and correlations in large databases. Association rules are defined based on items, such as objects in a shopping cart. Efficient algorithms are designed by limiting output to sets of items that occur more frequently than a given threshold.
•	Graph Theory: Addresses networks that consist of nodes connected by edges. Traditional graph-theoretic problems typically assume no more than one property per node or edge. Data associated with nodes and edges can be modeled within the relational algebra.
Association rule mining of relational data incorporates important aspects of these areas to form an innovative data mining technique of important practical relevance.
MAIN THRUST
The general concept of association rule mining of relational data will be explored, as well as the special case of mining a relationship that corresponds to a graph.
General Concept Two main challenges have to be addressed when applying association rule mining to relational data. Combined mining of multiple tables leads to a search space that is typically large even for moderately sized tables. Performance is, thereby, commonly an important issue in relational data mining algorithms. A less obvious problem lies in the skewing of results (Jensen & Neville, 2002). The relational join operation combines each record from one table with each occurrence of the corresponding record in a second table. That means that the information in one record is represented multiple times in the joined table. Data mining algorithms that operate either explicitly or implicitly on joined tables, thereby, use the same information multiple times. Note that this problem also applies to algorithms in which tables are joined on-the-fly by identifying corresponding records as they are needed. Further specific issues may have to be addressed when reflexive relationships are present. These issues will be discussed in the section on relations that represent a graph. A variety of techniques have been developed for data mining of relational data (D eroski & Lavraè , 2001). A typical approach is called inductive logic programming, ILP. In this approach relational structure is represented in the form of Prolog queries, leaving maximum flexibility to the user. While the notation of ILP differs from the
relational notation it can be noted that all relational operators can also be represented in ILP. The approach does thereby not limit the types of problems that can be addressed. It should, however, also be noted that while relational database management system are developed with performance in mind there may be a trade-off between the generality of Prolog-based environments and their limitations in speed. Application of ARM within the ILP setting corresponds to a search for frequent Prolog queries as a generalization of traditional association rules (Dehaspe & De Raedt, 1997). Examples of association rule mining of relational data using ILP (Dehaspe & Toivonen, 2001) could be shopping behavior of customers where relationships between customers are included in the reasoning. While ILP does not use a relational joining step as such, it does also associate individual objects with multiple occurrences of corresponding objects. Problems with skewing are, thereby, also encountered in this approach. An alternative to the ILP approach is to apply the standard definition of association rule mining to relations that are joined using the relational join operation. While such an approach is less general it is often more efficient since the join operation is highly optimized in standard database systems. It is important to note that a join operation typically changes the support of an item set, and any support calculation should therefore be based on the relation that uses the smallest number of join operations (Cristofor & Simovici, 2001). Equivalent changes in item set weighting occur in ILP. Interestingness of rules is an important issue in any type of association rule mining. In traditional association rule mining the problem of rule interest has been addressed in a variety of work on redundant rules, including closed set generation (Zaki, 2000). Additional rule metrics such as lift and conviction have been defined (Brin, Motwani, Ullman, & Tsur, 1997). In relational association rule mining the problem has been approached by the definition of a deviation measure (Dehaspe & Toivonen, 2001). In general it can be noted that relational data mining poses many additional problems related to skewing of data compared with traditional mining on a single table (Jensen & Neville, 2002).
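The warning that a join typically changes the support of an item set can be made concrete with a small sketch; the customer and order tables below are invented for illustration, and the code only mimics a relational join in plain Python:

```python
# One row per customer, with the item of interest ("premium" flag).
customers = [
    {"cust": "c1", "premium": True},
    {"cust": "c2", "premium": False},
    {"cust": "c3", "premium": False},
]
# Several orders per customer (a one-to-many relationship).
orders = [{"cust": "c1", "order": o} for o in range(8)] + [
    {"cust": "c2", "order": 8},
    {"cust": "c3", "order": 9},
]

# Relational join: every customer row is repeated once per matching order.
joined = [{**c, **o} for c in customers for o in orders if c["cust"] == o["cust"]]

sup_base = sum(c["premium"] for c in customers) / len(customers)   # 1/3
sup_joined = sum(r["premium"] for r in joined) / len(joined)       # 8/10

# The customer with many orders dominates the joined table, which is why the
# support calculation should be based on the relation with the fewest joins.
print(len(joined), sup_base, sup_joined)
```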
Relations that Represent a Graph One type of relational data set has traditionally received particular attention, albeit under a different name. A relation representing a relationship between entity instances of the same type, also called a reflexive relationship, can be viewed as the definition of a graph. Graphs have been used to represent social networks, biological networks, communication networks, and citation graphs, just to name a few.
A typical example of an association rule mining problem is mining of annotation data of proteins in the presence of a protein-protein interaction graph (Oyama, Kitano, Satou, & Ito, 2002). Associations are extracted that relate functions and localizations of one protein with those of interacting proteins. Oyama et al. use association rule mining, as applied to joined relations, for this work. Another example could be association rule mining of attributes associated with scientific publications on the graph of their mutual citations. A problem of the straight-forward approach of mining joined tables directly becomes obvious upon further study of the rules: In most cases the output is dominated by rules that involve the same item as it occurs in different entity instances that participate in a relationship. In the example of protein annotations within the protein interaction graph a protein in the “nucleus” is found to frequently interact with another protein that is also located in the “nucleus”. Similarities among relational neighbors have been observed more generally for relational databases (Macskassy & Provost, 2003). It can be shown that filtering of output is not a consistent solution to this problem, and items that are repeated for multiple nodes should be eliminated in a preprocessing step (Besemann & Denton, 2004). This is an example of a problem that does not occur in association rule mining of a single table and requires special attention when moving to multiple relations. The example also highlights the need to discuss differences between sets of items of related objects are (Besemann, Denton, Yekkirala, Hutchison, & Anderson, 2004).
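One way to picture the preprocessing step advocated above (eliminating items that are repeated across the two ends of an edge before mining) is the sketch below; the annotation sets, the edge list, and the self:/partner: prefixes are invented for illustration and only approximate the cited approach:

```python
# Hypothetical protein annotations and interaction edges.
annotations = {
    "p1": {"nucleus", "transcription"},
    "p2": {"nucleus", "repair"},
    "p3": {"membrane", "transport"},
}
edges = [("p1", "p2"), ("p1", "p3")]

transactions = []
for a, b in edges:
    shared = annotations[a] & annotations[b]              # items repeated on both nodes
    left = {f"self:{x}" for x in annotations[a] - shared}
    right = {f"partner:{x}" for x in annotations[b] - shared}
    transactions.append(left | right)

for t in transactions:
    print(sorted(t))
# The trivial "protein in the nucleus interacts with a protein in the nucleus"
# pattern no longer dominates the transactions that are handed to the miner.
```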
Related Research Areas
A related research area is graph-based ARM (Inokuchi, Washio, & Motoda, 2000; Yan & Han, 2002). Graph-based ARM does not typically consider more than one label on each node or edge. The goal of graph-based ARM is to find frequent substructures based on that one label, focusing on algorithms that scale to large subgraphs. In relational ARM multiple items are associated with each node and the main problem is to achieve scaling with respect to the number of items per node. Scaling to large subgraphs is usually irrelevant due to the “small world” property of many types of graphs. For most networks of practical interest any node can be reached from almost any other by means of no more than some small number of edges (Barabasi & Bonabeau, 2003). Association rules that involve longer distances are therefore unlikely to produce meaningful results. There are other areas of research on ARM in which related transactions are mined in some combined fashion. Sequential pattern or episode mining (Agrawal & Srikant, 1995; Yan, Han, & Afshar, 2003) and inter-transaction mining (Tung, Lu, Han, & Feng, 1999) are two main
categories. Generally the interest in association rule mining is moving beyond the single-table setting to incorporate the complex requirements of real-world data.
Besemann, C., & Denton, A. (2004, June). UNIC: UNique item counts for association rule mining in relational data. Technical Report, North Dakota State University, Fargo, North Dakota.
FUTURE TRENDS
Besemann, C., Denton, A., Yekkirala, A., Hutchison, R., & Anderson, M. (2004, Aug.). Differential association rule mining for the study of protein-protein interaction networks. In Proceedings ACM SIGKDD Workshop on Data Mining in Bioinformatics, Seattle, WA.
The consensus in the data mining community of the importance of relational data mining was recently paraphrased by Dietterich (2003) as “I.i.d. learning is dead. Long live relational learning”. The statistics, machine learning, and ultimately data mining communities have invested decades into sound theories based on a single table. It is now time to afford as much rigor to relational data. When taking this step it is important to not only specify generalizations of existing algorithms but to also identify novel questions that may be asked that are specific to the relational setting. It is, furthermore, important to identify challenges that only occur in the relational setting, including skewing due to the application of the relational join operator, and correlations that are frequent in relational neighbors.
CONCLUSION
Association rule mining of relational data is a powerful frequent pattern mining technique that is useful for several data structures including graphs. Two main approaches are distinguished. Inductive logic programming provides a high degree of flexibility, while mining of joined relations is a fast technique that allows the study of problems related to skewed or uninteresting results. The potential computational complexity of relational algorithms and specific properties of relational data make its mining an important current research topic. Association rule mining takes a special role in this process, being one of the most important frequent pattern algorithms.
REFERENCES
Agrawal, R., Imielinski, T., & Swami, A.N. (1993, May). Mining association rules between sets of items in large databases. In Proceedings of the ACM International Conference on Management of Data (pp. 207-216), Washington, D.C.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (pp. 3-14), IEEE Computer Society Press, Taipei, Taiwan.
Barabasi, A.L., & Bonabeau, E. (2003). Scale-free networks. Scientific American, 288(5), 60-69.
Brin, S., Motwani, R., Ullman, J.D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ. Cristofor, L., & Simovici, D. (2001). Mining association rules in entity-relationship modeled databases. Technical Report, University of Massachusetts Boston. Dehaspe, L., & De Raedt, L. (1997, Dec.). Mining association rules in multiple relations. In Proceedings of the 7th International Workshop on Inductive Logic Programming (pp. 125-132), Prague, Czech Republic. Dehaspe, L., & Toivonen, H. (2001). Discovery of relational association rules. In S. D eroski, & N. Lavra è (Eds.), Relational data mining. Berlin: Springer. Dietterich, T. (2003, Nov.). Sequential supervised learning: Methods for sequence labeling and segmentation. Invited Talk, 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA. D eroski, S., & Lavraè , N. (2001). Relational data mining. Berlin: Springer. Inokuchi, A., Washio, T., & Motoda, H. (2000) An aprioribased algorithm for mining frequent substructures from graph data. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (pp. 13-23), Lyon, France. Jensen, D., & Neville, J. (2002). Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the 19th International Conference on Machine Learning (pp. 259-266), Sydney, Australia. Macskassy, S., & Provost, F. (2003). A simple relational classifier. In Proceedings of the 2nd Workshop on MultiRelational Data Mining at KDD’03, Washington, D.C. Oyama, T., Kitano, K., Satou, K., & Ito, T. (2002). Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics, 18(8), 705-714.
Tung, A.K.H., Lu, H., Han, J., & Feng, L. (1999). Breaking the barrier of transactions: Mining inter-transaction association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, CA. Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining, Maebashi City, Japan. Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large datasets. In Proceedings of the 2003 SIAM International Conference on Data Mining, San Francisco, CA. Zaki, M.J. (2000). Generating non-redundant association rules. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (pp. 34-43), Boston, MA.
KEY TERMS Antecedent: The set of items A in the association rule A → B. Apriori: Association rule mining algorithm that uses the fact that the support of a non-empty subset of an item set cannot be smaller than the support of the item set itself. Association Rule: A rule of the form A → B meaning “if the set of items A is present in a transaction, then the set of items B is likely to be present too”. A typical example constitutes associations between items purchased at a supermarket.
Confidence: The confidence of a rule is the support of the item set consisting of all items in the rule (A ∪ Β) divided by the support of the antecedent. Entity-Relationship Model (E-R-Model): A model to represent real-world requirements through entities, their attributes, and a variety of relationships between them. ER-Models can be mapped automatically to the relational model. Inductive Logic Programming (ILP): Research area at the interface of machine learning and logic programming. Predicate descriptions are derived from examples and background knowledge. All examples, background knowledge and final descriptions are represented as logic programs. Redundant Association Rule: An association rule is redundant if it can be explained based entirely on one or more other rules. Relation: A mathematical structure similar to a table in which every row is unique, and neither rows nor columns have a meaningful order. Relational Database: A database that has relations and relational algebra operations as underlying mathematical concepts. All relational algebra operations result in relations as output. A join operation is used to combine relations. The concept of a relational database was introduced by E. F. Codd at IBM in 1970. Support: The support of an item set is the fraction of transactions that have all items in that item set.
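As a small aside on the Apriori entry above, the following sketch (with an invented transaction list and threshold) shows the pruning step that the Apriori property makes possible: candidate itemsets are generated only from frequent subsets, so supersets of infrequent items are never counted.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "d"}]
minsupport = 0.4   # frequent = appears in at least 40% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({x for t in transactions for x in t})
frequent_1 = {i for i in items if support({i}) >= minsupport}   # 'd' is dropped here

# Candidate 2-itemsets are built only from frequent 1-itemsets (Apriori pruning):
# no pair containing 'd' is ever generated, because a superset of an infrequent
# itemset cannot be frequent.
candidates = [set(c) for c in combinations(sorted(frequent_1), 2)]
frequent_2 = [c for c in candidates if support(c) >= minsupport]

print(sorted(map(sorted, frequent_2)))   # [['a', 'b'], ['a', 'c'], ['b', 'c']]
```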
Association Rules and Statistics
Martine Cadot, University of Henri Poincaré/LORIA, Nancy, France
Jean-Baptiste Maj, LORIA/INRIA, France
Tarek Ziadé, NUXEO, France
INTRODUCTION A manager would like to have a dashboard of his company without manipulating data. Usually, statistics have solved this challenge, but nowadays, data have changed (Jensen, 1992); their size has increased, and they are badly structured (Han & Kamber, 2001). A recent method—data mining—has been developed to analyze this type of data (Piatetski-Shapiro, 2000). A specific method of data mining, which fits the goal of the manager, is the extraction of association rules (Hand, Mannila & Smyth, 2001). This extraction is a part of attribute-oriented induction (Guyon & Elisseeff, 2003). The aim of this paper is to compare both types of extracted knowledge: association rules and results of statistics.
BACKGROUND
Statistics have been used for over a century by people who want to extract knowledge from data (Freedman, 1997). Statistics can describe, summarize and represent the data. In this article data are structured in tables, where lines are called objects, subjects or transactions and columns are called variables, properties or attributes. For a specific variable, the value of an object can have different types: quantitative, ordinal, qualitative or binary. Furthermore, statistics tell if an effect is significant or not; these are called inferential statistics. Data mining (Srikant, 2001) has been developed to process the huge amounts of data that result from progress in digital data acquisition, storage technology, and computational power. The association rules produced by data-mining methods express links between database attributes. The knowledge brought by association rules is divided into two parts: the first describes general links, and the second finds specific links (knowledge nuggets) (Fabris & Freitas, 1999; Padmanabhan & Tuzhilin, 2000). In this article, only the
first part is discussed and compared to statistics. Furthermore, in this article, only data structured in tables are used for association rules.
MAIN THRUST The problem differs with the number of variables. In the sequel, problems with two, three, or more variables are discussed.
Two Variables
The link between two variables (A and B) depends on the coding. The outcome of statistics is better when data are quantitative. A common model is linear regression. For instance, the salary (S) of a worker can be expressed by the following equation:

S = 100 Y + 20000 + ε   (1)
where Y is the number of years in the company, and ε is a random error term. This model means that the salary of a newcomer in the company is $20,000 and increases by $100 per year. The association rule for this model is Y→S, which means that only a few senior workers have a small paycheck. For this, the variables are translated into binary variables: Y is no longer the number of years but the property “has seniority,” which is not quantitative but of type Yes/No. The same transformation is applied to the salary S, which becomes the property “has a big salary.” Therefore, these two methods both capture the link between the two variables and have their own instruments for measuring the quality of the link: for statistics, there are the tests of the regression model (Baillargeon, 1996), and for association rules, there are measures like support, confidence, and so forth (Kodratoff, 2001). But, depending on the type of data, one model is more appropriate than the other (Figure 1).
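A brief sketch of the two codings being compared (the worker data are simulated from equation (1); the seniority and salary cut-offs used for binarization are arbitrary choices for the illustration):

```python
import random

random.seed(0)
# Simulate workers according to equation (1): S = 100*Y + 20000 + noise.
years = [random.randint(0, 30) for _ in range(200)]
salary = [100 * y + 20000 + random.gauss(0, 300) for y in years]

# Statistics view: a least-squares fit of S on Y recovers roughly (100, 20000).
n = len(years)
my, ms = sum(years) / n, sum(salary) / n
slope = sum((y - my) * (s - ms) for y, s in zip(years, salary)) / \
        sum((y - my) ** 2 for y in years)
intercept = ms - slope * my
print(round(slope, 1), round(intercept))

# Association-rule view: binarize into "has seniority" and "has a big salary"
# and read off the confidence of the rule Y -> S.
has_seniority = [y >= 15 for y in years]
big_salary = [s >= 21500 for s in salary]
both = sum(a and b for a, b in zip(has_seniority, big_salary))
print(round(both / sum(has_seniority), 2))
```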
Figure 1. Coding and analysis methods (data coded as quantitative, ordinal, qualitative, or Yes/No; statistics are better suited to the quantitative end of this scale, association rules to the Yes/No end)

Three Variables
If a third variable E, the experience of the worker, is integrated, the equation (1) becomes:

S = 100 Y + 2000 E + 19000 + ε   (2)

E is the property “has experience.” If E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,000. The increase of the salary, as a function of seniority (Y), is the same in both cases of experience. If, instead, experience changes how the salary grows with seniority, the model becomes:

S = 50 Y + 1500 E + 50 E × Y + 19500 + ε   (3)

Now, if E=1, a new experienced worker gets a salary of $21,000, and if E=0, a new non-experienced worker gets a salary of $19,500. The increase of the salary, as a function of seniority (Y), is $50 higher for experienced workers. These regression models belong to a linear model of statistics (Prum, 1996), where, in equation (3), the third variable has a particular effect on the link between Y and S, called interaction (Winer, Brown & Michels, 1991). The association rules for this model are:
•	Y→S, E→S for equation (2)
•	Y→S, E→S, YE→S for equation (3)
The statistical test of the regression model allows to choose with or without interaction (2) or (3). For the association rules, it is necessary to prune the set of three rules, because their measures do not give the choice between a model of two rules and a model of three rules (Zaki, 2000; Zhu, 1998).
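A hedged sketch of this model choice (data simulated from equation (3); numpy and its least-squares routine are assumed): fitting the models with and without the E × Y term and comparing their residual error points to the interaction model, which is what a formal test of the regression model would confirm.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
Y = rng.integers(0, 30, n)          # seniority in years
E = rng.integers(0, 2, n)           # experience (0/1)
# Data generated from equation (3): S = 50*Y + 1500*E + 50*E*Y + 19500 + noise
S = 50 * Y + 1500 * E + 50 * E * Y + 19500 + rng.normal(0, 200, n)

ones = np.ones(n)
X_no_int = np.column_stack([ones, Y, E])          # model (2): no interaction
X_int = np.column_stack([ones, Y, E, E * Y])      # model (3): with interaction

for name, X in [("without interaction", X_no_int), ("with interaction", X_int)]:
    coef, res, *_ = np.linalg.lstsq(X, S, rcond=None)
    rss = float(res[0]) if res.size else float(((S - X @ coef) ** 2).sum())
    print(name, np.round(coef, 1), round(rss))
# The interaction model recovers coefficients close to (19500, 50, 1500, 50)
# and has a clearly smaller residual sum of squares.
```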
More Variables With more variables, it is difficult to use statistical models to test the link between variables (Megiddo & Srikant, 1998). However, there are still some ways to group variables: clustering, factor analysis, and taxonomy (Govaert, 2003). But the complex links between variables, like interactions, are not given by these models and decrease the quality of the results.
Comparison Table 1 briefly compares statistics with the association rules. Two types of statistics are described: by tests and by taxonomy. Statistical tests are applied to a small amount of variables and the taxonomy to a great amount of
Table 1. Comparison between statistics and association rules

                                   Decision                Level of Knowledge     No. of Variables     Complex Link
Statistics: Tests                  Tests (+)               Low (-)                Small (-)            Yes (-)
Statistics: Taxonomy               Threshold defined (-)   High and simple (+)    High (+)             No (+)
Data Mining: Association Rules     Threshold defined (-)   High and complex (+)   Small and high (+)   No (-)
Figure 2. (a) Regression equations; (b) taxonomy; (c) association rules
variables. In statistics, the decision is easy to make from test results, unlike association rules, where a difficult choice among several index thresholds has to be made. Regarding the level of knowledge, statistical results need more interpretation than the taxonomy and the association rules. Finally, graphs of the regression equations (Hayduk, 1987), taxonomy (Foucart, 1997), and association rules (Gras & Bailleul, 2001) are depicted in Figure 2.
FUTURE TRENDS With association rules, some researchers try to find the right indices and thresholds with stochastic methods. More development needs to be done in this area. Another sensitive problem is the set of association rules that is not made for deductive reasoning. One of the most common solutions is the pruning to suppress redundancies, contradictions and loss of transitivity. Pruning is a new method and needs to be developed.
CONCLUSION With association rules, the manager can have a fully detailed dashboard of his or her company without manipulating data. The advantage of the set of association rules relative to statistics is a high level of knowledge. This means that the manager does not have the inconvenience of reading tables of numbers and making interpretations. Furthermore, the manager can find knowledge nuggets that are not present in statistics. The association rules have some inconvenience; however, it is a new method that still needs to be developed.
REFERENCES
Govaert, G. (2003). Analyse de données. Lavoisier, France: Hermes-Science. Gras, R., & Bailleul, M. (2001). La fouille dans les données par la méthode d’analyse statistique implicative. Colloque de Caen. Ecole polytechnique de l’Université de Nantes, Nantes, France. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection: Special issue on variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann. Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press. Hayduk, L.A. (1987). Structural equation modelling with LISREL. Maryland: John Hopkins Press. Jensen, D. (1992). Induction with randomization testing: Decision-oriented analysis of large data sets [doctoral thesis]. Washington University, Saint Louis, MO. Kodratoff, Y. (2001). Rating the interest of rules induced from data and within texts. Proceedings of the 12th IEEE International Conference on Database and Expert Systems Aplications-Dexa, Munich, Germany. Megiddo, N., & Srikant, R. (1998). Discovering predictive association rules. Proceedings of the Conference on Knowledge Discovery in Data, New York. Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Conference on Knowledge Discovery in Data. Boston, Massachusetts. Piatetski-Shapiro, G. (2000). Knowledge discovery in databases: 10 years after. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.
Baillargeon, G. (1996). Méthodes statistiques de l’ingénieur: Vol. 2. Trois-Riveres, Quebec: Editions SMG.
Prum, B. (1996). Modèle linéaire: Comparaison de groupes et régression. Paris, France: INSERM.
Fabris,C., & Freitas, A. (1999). Discovery surprising patterns by detecting occurrences of Simpson’s paradox: Research and development in intelligent systems XVI. Proceedings of the 19th Conference of Knowledge-Based Systems and Applied Artificial Intelligence, Cambridge, UK.
Srikant, R. (2001). Association rules: Past, present, future. Proceedings of the Workshop on Concept LatticeBased Theory, Methods and Tools for Knowledge Discovery in Databases, California.
Foucart, T. (1997). L'analyse des données, mode d'emploi. Rennes, France: Presses Universitaires de Rennes.
Freedman, D. (1997). Statistics. New York: W.W. Norton & Company.
Winer, B.J., Brown, D.R., & Michels, K.M.(1991). Statistical principles in experimental design. New York: McGraw-Hill. Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the Conference on Knowledge Discovery in Data, Boston, Massachusetts.
Zhu, H. (1998). On-line analytical mining of association rules [doctoral thesis]. Simon Fraser University, Burnaby, Canada.
KEY TERMS
Linear Model: A variable is fitted by a linear combination of other variables and of interactions between them.
Pruning: The extraction algorithms for association rules are optimized for computational cost but not for other constraints. This is why results that do not satisfy special constraints have to be suppressed afterwards.
Attribute-Oriented Induction: Association rules, classification rules, and characterization rules are written with attributes (i.e., variables). These rules are obtained from data by induction and not from theory by deduction.
Structural Equations: A system of several regression equations with numerous possibilities. For instance, the same variable can appear in different equations, and latent variables (not defined in the data) can be accepted.
Badly Structured Data: Data such as text corpora or log sessions often do not contain explicit variables. To extract association rules, it is necessary to create variables (e.g., keywords) after defining their values (frequency of appearance in the corpus texts, or simply appearance/non-appearance).
Taxonomy: A clustering method, usually represented by a tree. Often used for categorizing living things.
Interaction: Two variables, A and B, are in interaction if their actions are not separate.
Tests of Regression Model: Regression models and analysis-of-variance models rely on numerous hypotheses, e.g., normal distribution of errors. These constraints make it possible to determine whether a coefficient of the regression equation can be considered null at a fixed level of significance.
Automated Anomaly Detection
Brad Morantz, Georgia State University, USA
INTRODUCTION Preparing a dataset is a very important step in data mining. If the input to the process contains problems, noise, or errors, then the results will reflect this, as well. Not all possible combinations of the data should exist, as the data represent real-world observations. Correlation is expected among the variables. If all possible combinations were represented, then there would be no knowledge to be gained from the mining process. The goal of anomaly detection is to identify and/or remove questionable or incorrect observations. These occur because of keyboard error, measurement or recording error, human mistakes, or other causes. Using knowledge about the data, some standard statistical techniques, and a little programming, a simple data-scrubbing program can be written that identifies or removes faulty records. Duplicates can be eliminated, because they contribute no new knowledge. Real valued variables could be within measurement error or tolerance of each other, yet each could represent a unique rule. Statistically categorizing the data would eliminate or, at least, greatly reduce this. In application of this process with actual datasets, accuracy has been increased significantly, in some cases double or more.
BACKGROUND Data mining is an exploratory process looking for as yet unknown patterns (Westphal & Blaxton, 1998). The data represent real-world occurrences, and there is correlation among the variables. Some are principled in their construction, one event triggering another. Sometimes events occur in a certain order (Westphal & Blaxton, 1998). Not all possible combinations of the data are to be expected. If this were not the case, then we would learn nothing from this data. These methods allow us to see patterns and regularities in large datasets (Mitchell, 1999). Credit reporting agencies have been examining large datasets of credit histories for quite some time, trying to determine rules that will help discern between problematic and responsible consumers (Mitchell, 1999). Datasets have been mined looking for indications for boiler explosion probabilities to high-risk pregnancies to consumer
purchasing patterns. This is the semiotics of data, as we transform data to information and finally to knowledge. Dirty data, or data containing errors, are a major problem in this process. The old saying is, “garbage in, garbage out” (Statsoft, 2004). Heuristic estimates are that 60-80% of the effort should go into preparing the data for mining, and only the small remaining portion actually is required for the data-mining effort itself. These data records that are deviations from the common rule are called anomalies. Data are always dirty and have been called the curse of data mining (Berry & Linoff, 2000). Several factors can be responsible for attenuating the quality of the data, among them errors, missing values, and outliers (Webb, 2002). Missing data have many causes, varying from recording error to illegible writing to just not supplied. This is closely related to incorrect values that also can be caused by poor penmanship as well as measurement error, keypunch mistakes, different or incorrect metrics, misplaced decimal, and other similar causes. Fuzzy definitions, where the meaning of a value is either unclear or inconsistent, are another problem (Berry & Linoff, 2000). Often, when something is being measured and recorded, mistakes happen. Even automated processes can produce dirty data (Bloom, 1998). Micro-array data has errors due to base pairs on the probe not matching correctly to genes in the test material (Shavlik et al., 2004). The sources of error are large, and it is necessary to have a process that finds these anomalies and identifies them. In real valued datasets, the possible combinations are (almost) unlimited. A dataset with eight variables, each with four significant digits, could yield as many as 1032 combinations. Mining such a dataset would not only be tedious and time-consuming, but possibly could yield an overly large number of patterns. Using (six-range) categorical data, the same problem would only have 1.67 x 106 combinations. Gauss normally distributed data can be separated into plus or minus 1, 2, or 3 sigma. Other distributions can use Chebyshev or other distributions with similar dividing points. There is no real loss of data, yet the process is greatly simplified. Finding the potentially bad observations or records is the first problem. The second problem is what to do once they are found. In many cases it is possible to go back and verify the value, correcting it, if necessary. If this is
possible, the program should flag values that are to be verified. This may not always be possible, or it may be too expensive. Not all situations repeat within a reasonable time, if at all (i.e., observation of Halley’s comet). There are two schools of thought, the first being to substitute the mean value for the missing or wrong value. The problem with this is that it might not be a reasonable value, and it can create a new rule, one that could be false (i.e., shoe size for a giant is not average). It might introduce sample bias, as well (Berry & Linoff, 2000). Deleting the observation is the other common solution. Quite often, in large datasets, a duplicate exists, so deleting causes no loss. The cost of improper commission is greater than that of omission. Sometimes an outlier tells a story. So, one has to be careful about deletions.
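For concreteness, the combination counts mentioned in the background section work out as follows; the exponents are easy to lose in print, and the intended figures are 10 to the 32nd power for the raw coding and roughly 1.68 × 10^6 for the six-category coding, in line with the value quoted above.

```python
# Eight real-valued variables with four significant digits each:
raw_patterns = (10 ** 4) ** 8      # 10**32 possible value combinations
# The same eight variables coded into six categories each:
coded_patterns = 6 ** 8            # 1,679,616, i.e. about 1.68 * 10**6
print(raw_patterns, coded_patterns)
```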
THE AUTOMATED ANOMALY DETECTION PROCESS Methodology To illustrate the process, a public dataset is used. This particular one is available from the University of California at Irvine Machine Learning Repository (University of California, 2003). Known as the Abalone dataset, it consists of 4,400 observations of abalones that were captured in the wild with several measurements of each one. Natural variation exists, as well as human error, both in making the measurements and in the recording. Also listed on the Web site were some studies that used the data and their results. Accuracy in the form of hit rate varied between 0-35%. While it may seem overly simple and obvious, plotting the data is the first step. These graphical views can provide much insight into the data (Webb, 2002). The data for each variable can be plotted vs. frequency of occurrence to visually determine distribution. Combining this with knowledge of the research will help to determine the correct distribution to use for each included variable. A sum of independent terms would tend to support a Gauss normal distribution, while the product of a number of independent terms might suggest using log normal. This plotting also might suggest necessary transformations. It is necessary to understand the acceptable range for each field. Some values obtained might not be reasonable. If there is a zero in a field, is it indicative of a missing value, or is it an acceptable value? No value is not the same as zero. Some values, while within bounds, might not be possible. It is also necessary to check for obvious mistakes, inconsistencies, or out of bounds. Knowledge about the subject of study is necessary. From this, rules can be made. In the example of the abalone, the animal in the shell must weigh more than when it is
shucked (removed from the shell) for obvious reasons. Other such rules from domain knowledge can be created (abalone.net, 2004; University of Capetown, 2004; World Aquaculture, 2004). Sometimes, they may seem too obvious, but they are effective. The rules can be programmed into a subroutine specific to the dataset. Regression can be used to check for variables that are not statistically significant. Step-wise regression is a handy tool for identifying significant variables. Other ratio variables can be created and then checked for significance using regression. Again, domain knowledge can help create these variables, as well as insight and some luck. Insignificant variables can be deleted from the dataset, and new ones can be added. If the dataset is real valued, it is possible that records exist that are within tolerance or measurement error of each other. There are two ways to reduce the number of unique observations. (1) Attenuate the accuracy by rounding to reduce the number of significant digits. Each variable rounding to one less significant digit reduces the number of possible patterns by an order of magnitude. (2) Calculate a mean and standard deviation for the cleaned dataset. Using an appropriate distribution, sort the values by standard deviations from the mean. Testing to see if the chosen distribution is correct is accomplished by using a Chi square test, a Kolmogorof Smirnoff test, or the empirical test. The number of standard deviations replaces the real valued data, and a simple categorical dataset will exist. This allows for simple comparisons between observations. Otherwise, records with values as little as .0001% differences would be considered unique and different. While some of the precision of the original data is lost, this process is exploratory and finds the general patterns that are in the data. This allows one to gain insight into the database using a combination of statistics and artificial intelligence (Pazzani, 2000), using human knowledge and skill as the catalyst to improve the results. The final step before mining the data is to remove duplicates, as they add no additional information. As the collection of observations gets increasingly larger, it gets harder to introduce new experiences. This process can be incorporated into the computer program by a simple process that is similar to bubblesort. Instead of comparing to see which row is greater, it just looks for differences. If none are found, then the row is deleted.
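A minimal sketch of the scrubbing routine described in this section; the field names, the single domain rule, and the toy records are assumptions made for the illustration rather than the author's actual program.

```python
from statistics import mean, stdev

# Hypothetical records with abalone-like fields.
records = [
    {"whole_weight": 0.95, "shucked_weight": 0.40, "rings": 9},
    {"whole_weight": 0.30, "shucked_weight": 0.55, "rings": 7},   # violates a domain rule
    {"whole_weight": 0.95, "shucked_weight": 0.40, "rings": 9},   # duplicate after coding
]

# 1. Domain rules written as simple predicates (flag violations rather than silently fix).
rules = [lambda r: r["whole_weight"] > r["shucked_weight"]]
clean = [r for r in records if all(rule(r) for rule in rules)]
flagged = [r for r in records if r not in clean]

# 2. Categorize real values by standard deviations from the mean.
#    (A Chi-square or Kolmogorov-Smirnov test would be used first to check
#    that the assumed distribution is reasonable.)
def categorize(rows, field):
    values = [r[field] for r in rows]
    m, s = mean(values), stdev(values)
    for r in rows:
        r[field] = round((r[field] - m) / s) if s else 0

for field in ("whole_weight", "shucked_weight"):
    categorize(clean, field)

# 3. Remove duplicates, which add no new information.
unique = []
for r in clean:
    if r not in unique:
        unique.append(r)

print(flagged, unique)
```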
Example Results
A few variables were plotted, producing some very unusual graphs. These were definitely not the graphs that were expected. This was the first indication that the dataset was noisy. Abalones are born in very large numbers, but with an extremely high infant mortality rate (over
99%) (Bamfield Marine Science Centre, 2004). This graph did not reflect that. An initial scan of the data showed some inconsistent points, like a five-year-old infant, a shucked animal weighing more than a complete one, and other similar abnormalities. Another problem with most analyses of these datasets is that gender is not ratio or ordinal data and, therefore, had to be converted to a dummy variable. Step-wise regression removed all but five variables. The remaining variables were: diameter, height, whole weight, shucked weight, and viscera weight. Two new variables were created: shell ratio (whole weight divided by shell weight) and weight to diameter ratio. Since the diameter is directly proportional to volume, this variable is proportional to density. The proof of its significance was a ‘t’ value of 39 and an F value of 1561. These are both statistically significant. A plot of shell ratio vs. frequency yielded a fairly Gauss normal looking curve. As these are real valued data with four digits given, it is possible to have observations that vary by as little as 0.01%. This value is even less than the accuracy of the measuring instruments. In other words, there are really a relatively small number of possibilities, described by a large number of almost identical examples, some within measurement tolerance of each other. The mean and standard deviation were calculated for each of the remaining and new variables of the dataset. The empirical test was done to verify approximate meeting of Gauss normal distribution. Each value then was replaced by the integer number of standard deviations it is from the mean, creating a categorical dataset. Simple visual inspection showed two things: (1) there was, indeed, correlation among the observations; and (2) it became increasingly more difficult to introduce a new pattern. Duplicate removal process was the next step. As expected, the first 50 observations only had 22% duplicates, but by the time the entire dataset was processed, 65% of the records were removed, because it presented no new information. To better understand the quality of the data, least squares regression was performed. The model produced an ANOVA F value of 22.4, showing good confidence in it. But the Pearsonian correlation coefficient R2 of only 0.25 indicated that there was some problem. Visual observation
of the dataset and its plots led to some suspicion of the group with one ring (age = 2.5 years). OLS regression was performed on this group, yielding an F of 27, but an R2 of only 0.03. This tells us that this portion of the data is only muddying the water and attenuating the performance of our model. Upon removal of this group of observations, OLS regression was performed on the remaining data, giving an improved F of 639 (showing that, indeed, it is a good model) and an R2 of 0.53, an acceptable level and one that can adequately describe the variation in the criterion. The results listed at the Web site where the dataset was obtained are as follows: Sam Waugh in the Computer Science Department at the University of Tasmania used this dataset in 1995 for his doctoral dissertation (University of California, 2003). His results, while the first recorded attempt, did not have good accuracy at predicting the age. The problem was encoded as a classification task.

Cascade Correlation (no hidden nodes): 24.86%
Cascade Correlation (five hidden nodes): 26.25%
C4.5: 21.5%
Linear Discriminant Analysis: 0.0%
k=5 Nearest Neighbor: 3.57%
Clark, et al. (1996) did further work on this dataset. They split the ring classification into three groups: 1 to 8, 9 to 10, and 11 and up. This reduced the number of targets and made each one bigger, in effect making each easier to hit. Their results were much better, as shown in the following:

Back propagation: 64%
Dystal: 55%
The results obtained from the answer tree using the new cleaned dataset are shown in Table 1. All of the one-ring observations were filtered out in a previous step, and the extraction was 100% accurate in not predicting any as being one-ring. The hit rates are as follows:
Table 1. Hit rate (rows: predicted category; columns: actual category)

                    Actual category
                  1      2      3      4    Total
Predicted  1      0     11    231     22      264
           2      0    269    140      3      412
           3      0     79    953     27     1059
           4      0      2    280     81      363
        Total     0    361   1604    133     2098

1 ring: 100.0% correct
2 ring: 74.5% correct
3 ring: 59.4% correct
4 ring: 60.9% correct
Overall accuracy: 62.1% correct
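The hit rates follow directly from Table 1; a quick arithmetic check, using the reconstructed orientation (rows as predicted, columns as actual category):

```python
# Rows: predicted 1-4; columns: actual category 1-4 (from Table 1).
matrix = [
    [0, 11, 231, 22],
    [0, 269, 140, 3],
    [0, 79, 953, 27],
    [0, 2, 280, 81],
]
for k in range(4):
    column_total = sum(row[k] for row in matrix)        # observations in actual class k
    correct = matrix[k][k]
    rate = correct / column_total if column_total else 1.0
    print(f"{k + 1} ring: {100 * rate:.1f}% correct")
overall = sum(matrix[k][k] for k in range(4)) / sum(map(sum, matrix))
print(f"overall: {100 * overall:.1f}% correct")
```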
FUTURE TRENDS A program could be written that would input simple things like the number of variables, the number of observations, and some classification results. A rule input mechanism would accept the domain-specific rules and make them part of the analysis. Further improvements would be the inclusion of fuzzy logic. Type I would allow the use of lingual variables (i.e., big, small, hot, cold) in the records, and type II would allow for some fuzzy overlap and fit.
CONCLUSION
Data mining is an exploratory process to see what is in the data and what patterns can be found. Noise and errors in the dataset are reflected in the results from the mining process. Cleaning the data and identifying anomalies should be performed. Marked observations should be verified and corrected, if possible. If this cannot be done, they should be deleted. In real valued datasets, the values can be categorized with accepted statistical techniques. Anomaly detection, after some manual viewing and analysis, can be automated. Part of the process is specific to the knowledge domain of the dataset, and part could be standardized. In our example problem, this cleaning process improved results, and the mining produced a more accurate rule set.

REFERENCES
Abalone.net. (2003). All about abalone: An online guide. Retrieved from http://www.abalone.net
Bamfield Marine Sciences Centre Public Education Programme. (2004). Oceanlink. Retrieved from http://oceanlink.island.net/oinfo/Abalone/abalone.html
Berry, M.J.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. New York, NY: Wiley & Sons, Inc.
Bloom, D. (1998). Technology, experimentation, and the quality of survey data. Science, 280(5365), 847-848.
Clark, D., Schreter, Z., & Adams, A. (1996). A quantitative comparison of dystal and backpropagation. Proceedings of the Australian Conference on Neural Networks (ACNN ’96), Canberra, Australia.
Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.
Pazzani, M.J. (2000). Knowledge discovery from data? IEEE Intelligent Systems, 15(2), 10-13.
Shavlik, J., Molla, M., Waddell, M., & Page, D. (2004). Using machine learning to design and interpret gene-expression microarrays. American Association for Artificial Intelligence, 25(1), 23-44.
Statsoft, Inc. (2004). Electronic statistics textbook. Retrieved from http://www.statsoft.com
Tasmanian Abalone Council Ltd. (2004). http://tasabalone.com.au
University of California at Irvine. (2003). Machine learning repository, abalone database. Retrieved from http://ics.uci.edu/~mlearn/MLRepository
University of Capetown Zoology Department. (2004). http://web.uct.ac.za/depts/zoology/abnet
Webb, A. (2002). Statistical pattern recognition. West Sussex, England: Wiley & Sons.
Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York: Wiley & Sons.
World Aquaculture. (2004). http://www7.taosnet.com/platinum/data/light/species/abalone.html

KEY TERMS
Anomaly: A value or observation that deviates from the rule or analogy. A potentially incorrect value.
ANOVA or Analysis of Variance: A powerful statistical method for studying the relationship between a response or criterion variable and a set of one or more predictor or independent variable(s).
Correlation: Amount of relationship between two variables, how they change relative to each other; range: -1 to +1.
F Value: Fisher value, a statistical distribution, used here to indicate the probability that an ANOVA model is good. In the ANOVA calculations, it is the ratio of squared variances. A large number translates to confidence in the model.
Ordinal Data: Data that is in order but has no relationship between the values or to an external value.
Pearsonian Correlation Coefficient: Defines how much of the variation in the criterion variable(s) is explained by the model; range: 0 to 1.
Ratio Data: Data that is in order and has fixed spacing; a relationship between the points that is relative to a fixed external point.
Step-Wise Regression: An automated procedure in statistical programs that adds one predictor variable at a time and removes it from the model if it is not statistically significant. Some implementations work in both directions, either adding or removing one variable at a time.
“T” Value, also called “Student’s t”: A statistical distribution for smaller sample sizes. In regression routines in statistical programs, it indicates whether a predictor variable is statistically significant, that is, whether it truly is contributing to the model. A value of more than about 3 is required for this indication.
Automatic Musical Instrument Sound Classification
Alicja A. Wieczorkowska, Polish-Japanese Institute of Information Technology, Poland
INTRODUCTION The aim of musical instrument sound classification is to process information from audio files by a classificatory system and accurately identify musical instruments playing the processed sounds. This operation and its results are called automatic classification of musical instrument sounds.
BACKGROUND
Musical instruments are grouped into the following categories (Hornbostel & Sachs, 1914):
•	Idiophones: Made of solid, non-stretchable, sonorous material.
•	Membranophones: Skin drums; membranophones and idiophones are called percussion.
•	Chordophones: Stringed instruments.
•	Aerophones: Wind instruments: woodwinds (single-reed, double-reed, flutes) and brass (lip-vibrated).
Idiophones are classified according to the material, number of idiophones and resonators in a single instrument, and whether pitch or tuning is important. Subcategories include idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew’s harp). Membranophones are classified according to their shape, material, number of heads, if they have snares, etc., whether and how the drum is tuned, and how the skin is fixed and played. Subcategories include drums (cylindrical, conical, barrel, hourglass, goblet, footed, long, kettle, frame, friction drum, and mirliton/kazoo). Chordophones are classified with respect to the relationship of the strings to the body of the instrument, if they have frets (low bridges on the neck or body, where strings are stopped) or movable bridges, number of strings, and how they are played and tuned. Subcategories include zither, lute plucked, and bowed (e.g., guitars, violin), harp, lyre, and bow.
Aerophones are classified according to how the air is set in motion, mainly depending on the mouthpiece: blow hole, whistle, reed, and lip-vibrated. Subcategories include flutes (end-blown, side-blown, nose, globular, multiple), panpipes, whistle mouthpiece (recorder), singleand double-reed (clarinet, oboe), air chamber (pipe organs), lip-vibrated (trumpet or horn), and free aerophone (bullroarers) (SIL, 1999). The description of properties of musical instrument sounds is usually given in vague subjective terms, like sharp, nasal, bright, and so forth, and only some of them (i.e., brightness) have numerical counterparts. Therefore, one of the main problems in this research is to prepare the appropriate numerical sound description for instrument recognition purposes. Automatic classification of musical instrument sounds aims at classifying audio data accurately into appropriate groups representing instruments. This classification can be performed at instrument level, instrument family level (e.g., brass), or articulation (i.e., how sound is struck, sustained, and released, e.g. vibrato varying the pitch of a note up and down) (Smith, 2000). As a preprocessing, the audio data are usually parameterized (i.e., numerical or other parameters or attributes are assigned, and then data mining techniques are applied to the parameterized data). Accuracy of classification varies, depending on the audio data used in the experiments, number of instruments, parameterization, classification, and validation procedure applied. Automatic classification compares favorably with human performance. Listeners identify musical instruments with accuracy far from perfect, with results depending on the sounds chosen and experience of listeners. Classification systems allow instrument identification without participation of human experts. Therefore, such systems can be valuable assistance for users of audio data searching for specific timbre, especially if they are not experienced musicians and when the amount of available audio data is huge, thus making manual searching impractical, if possible at all. When combined with a melody-searching system, automatic instrument classification may provide a handy tool for finding favorite tunes performed by favorite instruments in audio databases.
MAIN THRUST Research on automatic classification of musical instruments so far has been performed mainly on isolated, singular sounds; works on polyphonic sounds usually aim at source separation and operations like pitch tracking of these sounds (Viste & Evangelista, 2003). Most commonly used data include MUMS compact discs (Opolko & Wapnick, 1987), the University of Iowa samples (Fritts, 1997), and IRCAM’s Studio on Line (IRCAM, 2003). A broad range of data mining techniques was applied in this research, aiming at extraction of information hidden in audio data (i.e., sound features that are common for a given instrument and differentiate it from the others). Descriptions of musical instrument sounds are usually subjective, and finding appropriate numeric descriptors (parameters) is a challenging task. Sound parameterization is arbitrarily chosen by the researchers, and the parameters may reflect features that are known to be important for the human in the instrument recognition task, like descriptors of sound evolution in time (i.e., onset features, depth of sound vibration, etc.), subjective timbre features (i.e., brightness of the sound), and so forth. Basically, parameters characterize coefficients of sound analysis, since they are relatively easy to calculate. On the basis of parameterization, further research can be performed. Clustering applied to parameter vectors reveals similarity among sounds and adds a new glance on instrument classification, usually based on instrument construction or sound articulation. Decision rules and trees allow identification of the most descriptive sound features. Transformation of sound parameters may produce new descriptors better suited for automatic instrument classification. The classification can be performed hierarchically, taking instrument families and articulation into account. Classifiers represent a broad range of methods, from simple statistic tools to new advanced algorithms rooted in artificial intelligence.
Parameterization Methods

Sound is a physical disturbance in the medium (e.g., air) through which it is propagated (Whitaker & Benson, 2002). Periodic fluctuations are perceived as sound having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). Parameterization aims at capturing the most distinctive sound features regarding sound amplitude evolution in time, static spectral features (frequency content) of the most stable part of the sound, and the evolution of frequency content in time. These features are based on the Fourier spectrum and on time-frequency sound representations such as the wavelet transform. Some analyses are adjusted to the properties of human hearing, which perceives changes of sound amplitude and frequency in a logarithmic-like manner (e.g., frequency content analysis in the mel scale); results based on such analyses are easier to interpret in subjective terms. Statistical and mathematical operations applied directly to the sound representation also yield good results. Some descriptors require calculating the pitch of the sound, and any inaccuracies in pitch calculation (e.g., octave errors) may lead to erroneous results. Parameter sets investigated in the research are usually a mixture of various types, since such combinations capture a more representative sound description for instrument classification purposes. The following analysis and parameterization methods are used to describe musical instrument sounds:

• Autocorrelation and cross-correlation functions investigating periodicity of the signal, and statistical parameters of the spectrum obtained via the Fourier transform: average amplitude and frequency variations (wide in vibrated sounds) and their standard deviations (Ando & Yamaguchi, 1993).
• Contents of selected groups of partials in the spectrum (Pollard & Jansson, 1982; Wieczorkowska, 1999a), including the amount of even and odd harmonics (Martin & Kim, 1998), allowing identification of clarinet sounds.
• Vibrato strength and other changes of sound features in time (Martin & Kim, 1998; Wieczorkowska et al., 2003), and the temporal envelope of the sound.
• Statistical moments of the time wave, spectral centroid (gravity center of the spectrum), coefficients of the cepstrum (i.e., the Fourier transform applied to the logarithm of the amplitude spectrum), and constant-Q coefficients (i.e., for logarithmically spaced spectral bins) (Brown, 1999; Brown et al., 2001).
• Wavelet analysis, providing a time-frequency plot based on decomposition of the sound signal into functions called wavelets (Kostek & Czyzewski, 2001; Wieczorkowska, 1999b).
• Mel-frequency cepstral coefficients (i.e., in the mel scale, adjusted to the properties of human hearing) and linear prediction cepstral coefficients, where future values are estimated as a linear function of previous values (Eronen, 2001).
• Multidimensional Scaling Analysis (MSA) trajectories obtained through Principal Component Analysis (PCA) applied to constant-Q spectral snapshots to determine the most significant attributes of each sound (Kaminskyj, 2002). PCA transforms a set of variables into a smaller set of uncorrelated variables that keep as much of the variability in the data as possible.
• MPEG-7 audio descriptors, including log-attack time (i.e., the logarithm of onset duration), fundamental frequency (pitch), spectral envelope and spread, and so forth (ISO, 2003; Peeters et al., 2000).
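To make a few of these descriptors concrete, the following sketch (an illustration written for this overview, not code from any of the cited systems) computes a spectral centroid, an odd-to-even harmonic ratio, and a simple log-attack time from a mono signal with NumPy; the function name is ours, and the fundamental frequency f0 is assumed to be known or estimated elsewhere.

```python
import numpy as np

def basic_descriptors(signal, sample_rate, f0):
    """Illustrative spectral/temporal descriptors for one sound sample.

    signal: 1-D numpy array (mono), sample_rate in Hz,
    f0: fundamental frequency in Hz (assumed estimated elsewhere).
    """
    # Magnitude spectrum of the Hann-windowed signal
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / sample_rate)

    # Spectral centroid: amplitude-weighted mean frequency ("gravity center")
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)

    # Even/odd harmonic content: sum magnitudes at bins nearest each harmonic of f0
    def harmonic_energy(harmonics):
        return sum(spectrum[np.argmin(np.abs(freqs - k * f0))] for k in harmonics)

    n_harm = int((sample_rate / 2) // f0)
    odd = harmonic_energy(range(1, n_harm + 1, 2))
    even = harmonic_energy(range(2, n_harm + 1, 2))
    odd_even_ratio = odd / (even + 1e-12)

    # Simple log-attack time: log10 of time from 2% to 80% of the peak envelope
    envelope = np.abs(signal)
    peak = envelope.max()
    t_start = np.argmax(envelope > 0.02 * peak) / sample_rate
    t_peak = np.argmax(envelope > 0.8 * peak) / sample_rate
    log_attack = np.log10(max(t_peak - t_start, 1e-4))

    return np.array([centroid, odd_even_ratio, log_attack])
```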
Feature vectors obtained via parameterization of musical instrument sounds are used as inputs for classifiers, both for training and recognition purposes.
Classification Techniques

Automatic classification is the process by which a classificatory system processes information in order to classify data accurately, or the result of such a process. A class may represent an instrument, an articulation, an instrument family, and so forth. Classifiers applied to this task range from probabilistic and statistical algorithms, through methods based on learning by example, where classification is based on the distance between the observed sample and the nearest known neighbors, to methods originating from artificial intelligence, such as neural networks, which mimic neural connections in the brain. Each classifier yields a new sound description (representation). Some classifiers produce an explicit set of classification rules (e.g., decision trees or rough-set-based algorithms), giving insight into relationships between specific sound timbres and the calculated features. Since human recognition of musical instruments is based on subjective criteria and is difficult to formalize, learning algorithms that allow extraction of precise rules of sound classification broaden our knowledge and give a formal representation of subjective sound features. The following algorithms can be applied to musical instrument sound classification:

• Bayes decision rule (i.e., probabilistic assignment of unknown samples to classes). In Brown (1999), training data were grouped into clusters obtained through the k-means algorithm, and Gaussian probability density functions were formed from the mean and variance of each cluster.
• K-Nearest Neighbor (k-NN) algorithm, where the class (instrument) for a tested sound sample is assigned by a majority vote of the k known samples whose parameter vectors are nearest to the vector for this sample (Kaminskyj, 2002; Martin & Kim, 1998). To improve performance, genetic algorithms are additionally applied to find the optimal set of weights for the parameters (Fujinaga & McMillan, 2000).
• A statistical pattern-recognition technique: a maximum a posteriori classifier based on Gaussian models (introducing prior probabilities), obtained via Fisher multiple discriminant analysis, which projects the high-dimensional feature space into a space of one dimension fewer than the number of classes, in which the classes are maximally separated (Martin & Kim, 1998).
• Neural networks, designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data; multi-layer nets and self-organizing feature maps have been used (Cosi et al., 1994; Kostek & Czyzewski, 2001).
• Decision trees, where nodes are labeled with sound parameters, edges are labeled with parameter values, and leaves represent classes (Wieczorkowska, 1999b).
• Rough-set-based algorithms; a rough set is defined by a lower approximation, containing elements that belong to the set for sure, and an upper approximation, containing elements that may belong to the set (Wieczorkowska, 1999a).
• Support vector machines, which aim at finding the hyperplane that best separates observations belonging to different classes (Agostini et al., 2003).
• Hidden Markov Models (HMM), used for representing sequences of states; in this case, they can represent long sequences of feature vectors that define an instrument sound (Herrera et al., 2000).
Classifiers are first trained and then tested with respect to their generalization ability (i.e., whether they work properly on unknown samples).
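As an illustration of this train-and-test workflow (a sketch on invented data, not a reconstruction of any cited experiment), the following code trains a k-NN classifier on synthetic feature vectors with scikit-learn; the instrument labels, cluster centers, and 70/30 split are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic "feature vectors": one cluster per hypothetical instrument class
rng = np.random.default_rng(0)
centers = {"violin": (0, 0, 0), "clarinet": (3, 0, 1), "trumpet": (0, 3, 2)}
X = np.vstack([rng.normal(c, 1.0, size=(100, 3)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 100)

# 70/30 split, as used in several of the studies surveyed here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Feature scaling matters for distance-based classifiers such as k-NN
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)
print("k-NN accuracy on held-out sounds:", knn.score(scaler.transform(X_test), y_test))
```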
Validation and Results

Parameterization and classification methods yield various results, depending on the sound data and on the validation procedure used when classifiers are tested on unseen samples. Usually, the available data are divided into training and test sets. For instance, 70% of the data is used for training and the remaining 30% for testing; this procedure is usually repeated a number of times, and the final result is the average over all runs. Other popular splits are 80/20 and 90/10. The leave-one-out procedure is also used, where a single sample is held out for testing and the rest are used for training, repeated for every sample. Generally, the higher the proportion of training data to test data and the smaller the number of classes, the higher the accuracy obtained. Some instruments are easily identified with high accuracy, whereas others frequently are misclassified, especially with those from
the same family. Classification of instruments is sometimes performed hierarchically: articulation or family is recognized first, and then the instrument is identified. Following is an overview of results obtained so far in the research on musical instrument sound classification:
• Brown (1999) reported an average 84.1% recognition accuracy for two classes—oboe and saxophone—using cepstral coefficients as features and Bayes decision rules for clusters obtained via the k-means algorithm.
• Brown, Houix, and McAdams (2001), in experiments with four classes, obtained 79-84% accuracy for bin-to-bin differences of constant-Q coefficients, and for cepstral and autocorrelation coefficients, using a Bayesian method.
• K-NN classification applied to mel-frequency and linear prediction cepstral coefficients (Eronen, 2001), with training on 29 orchestral instruments and testing on 16 instruments from various recordings, yielded 35% accuracy for instruments and 77% for families.
• K-NN, combined with genetic algorithms (Fujinaga & McMillan, 2000), yielded 50% correctness in leave-one-out tests on spectral features representing 23 orchestral instruments played with various articulations.
• Kaminskyj (2002) applied k-NN to constant-Q and cepstral coefficients, MSA trajectories, amplitude envelope, and spectral centroid. He obtained 89-92% accuracy for instruments, 96% for families, and 100% in identifying impulsive vs. sustained sounds in leave-one-out tests for MUMS data. Tests on other recordings initially yielded 33-61% accuracy, and 87-90% after improvements.
• Multilayer neural networks applied to wavelet- and Fourier-based parameterization yielded 72-99% accuracy for various groups of four instruments (Kostek & Czyzewski, 2001).
• The statistical pattern-recognition technique and the k-NN algorithm, applied to sounds representing 14 orchestral instruments played with various articulation, yielded 71.6% accuracy for instruments, 86.9% for families, and 98.8% in discriminating continuant sounds vs. pizzicato (Martin & Kim, 1998) in 70/30 tests. The features included pitch, spectral centroid, ratio of odd-to-even harmonic energy, onset asynchrony, and the strength of vibrato and tremolo (quick changes of sound amplitude or note repetitions).
• Discriminant analysis and support vector machines yielded about 70% accuracy in leave-one-out tests with spectral features for 27 instruments (Agostini et al., 2003).
• Rough-set-based classifiers and decision trees applied to data representing 18 classes (11 orchestral instruments, various articulation), parameterized using Fourier- and wavelet-based attributes, yielded 68-77% accuracy in 90/10 tests and 64-68% in 70/30 tests (Wieczorkowska, 1999b).
• K-NN and rough-set-based classifiers, applied to spectral and temporal sound parameterization, yielded 68% accuracy in 80/20 tests for 18 classes, representing 11 orchestral instruments and various articulations (Wieczorkowska et al., 2003).
Generally, instrument families or sustained/impulsive sounds are identified with accuracy exceeding 90%, whereas instruments, if there are more than 10, are identified with accuracy reaching about 70%. These results compare favorably with human performance and exceed results obtained for inexperienced listeners.
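The evaluation protocols used in these studies, repeated percentage splits and leave-one-out, can be expressed compactly with scikit-learn's model-selection utilities. The sketch below is illustrative only; the data are synthetic placeholders rather than real sound descriptors.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))                                   # placeholder feature vectors
y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)      # synthetic two-class labels

clf = KNeighborsClassifier(n_neighbors=3)

# Repeated 70/30 splits, averaged over runs
splits = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
print("mean 70/30 accuracy:", cross_val_score(clf, X, y, cv=splits).mean())

# Leave-one-out: each sample is held out once for testing
print("leave-one-out accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```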
FUTURE TRENDS

Automatic indexing and searching of audio files is gaining increasing interest. The MPEG-7 standard addresses the issue of content description in multimedia data, and the audio descriptors provided in this standard form a basis for further research. Constant growth of audio resources available on the Internet causes an increasing need for content-based search of audio data. Therefore, we can expect intensification of research in this domain and progress of studies on automatic classification of musical instrument sounds.
CONCLUSION

Results obtained so far in automatic musical instrument sound classification vary, depending on the size of the data, the sound parameterization, the classifier, and the testing method. Also, some instruments are identified easily with high accuracy, whereas others are misclassified frequently, in the case of both human and machine performance. Increasing interest in content-based searching of audiovisual data and the growing amount of multimedia data available via the Internet raise the need and the prospects for further progress in automatic classification of audio data.
REFERENCES

Agostini, G., Longari, M., & Pollastri, E. (2003). Musical instrument timbres classification with spectral features. EURASIP Journal on Applied Signal Processing, 1, 1-11.

Ando, S., & Yamaguchi, K. (1993). Statistical study of spectral parameters in musical instrument tones. Journal of the Acoustical Society of America, 94(1), 37-45.

Brown, J.C. (1999). Computer identification of musical instruments using pattern recognition with cepstral coefficients as features. Journal of the Acoustical Society of America, 105, 1933-1941.

Brown, J.C., Houix, O., & McAdams, S. (2001). Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109, 1064-1072.

Cosi, P., De Poli, G., & Lauzzana, G. (1994). Auditory modelling and self-organizing neural networks for timbre classification. Journal of New Music Research, 23, 71-98.
Kostek, B., & Czyzewski, A. (2001). Representing musical instrument sounds for their automatic classification. Journal of the Audio Engineering Society, 49(9), 768-785.

Martin, K.D., & Kim, Y.E. (1998). Musical instrument identification: A pattern-recognition approach. Proceedings of the 136th Meeting of the Acoustical Society of America, Norfolk, Virginia.

Opolko, F., & Wapnick, J. (1987). MUMS—McGill University master samples [CD-ROM]. McGill University, Montreal, Quebec, Canada.

Peeters, G., McAdams, S., & Herrera, P. (2000). Instrument sound description in the context of MPEG-7. Proceedings of the International Computer Music Conference ICMC'2000, Berlin, Germany.

Pollard, H.F., & Jansson, E.V. (1982). A tristimulus method for the specification of musical timbre. Acustica, 51, 162-171.
Eronen, A. (2001). Comparison of features for musical instrument recognition. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA 2001, New York, NY, USA.
SIL. (1999). LinguaLinks library. Retrieved 2004 from http://www.sil.org/LinguaLinks/Anthropology/ExpnddEthnmsclgyCtgrCltrlMtrls/MusicalInstrumentsSubcategorie.htm
Fritts, L. (1997). The University of Iowa musical instrument samples. Retrieved 2004 from http://theremin.music.uiowa.edu/MIS.html
Smith, R. (2000). Rod's encyclopedic dictionary of traditional music. Retrieved 2004 from http://www.sussexfolk.freeserve.co.uk/ency/a.htm
Fujinaga, I., & McMillan, K. (2000). Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference, Berlin, Germany.
Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multichannel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA-03, New Paltz, New York.
Herrera, P., Amatriain, X., Batlle, E., & Serra, X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.

Hornbostel, E.M.V., & Sachs, C. (1914). Systematik der Musikinstrumente. Ein Versuch. Zeitschrift für Ethnologie, 46(4-5), 553-590.

IRCAM, Institut de Recherche et Coordination Acoustique/Musique. (2003). Studio on line. Retrieved 2004 from http://forumnet.ircam.fr/rubrique.php3?id_rubrique=107

ISO, International Organisation for Standardisation. (2003). MPEG-7 Overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg7.htm

Kaminskyj, I. (2002). Multi-feature musical instrument sound classifier w/user determined generalisation performance. Proceedings of the Australasian Computer Music Association Conference ACMC 2002, Melbourne, Australia.
Whitaker, J.C., & Benson, K.B. (Eds.). (2002). Standard handbook of audio and radio engineering. New York: McGraw-Hill.

Wieczorkowska, A. (1999a). Rough sets as a tool for audio signal classification. Foundations of Intelligent Systems, LNCS/LNAI 1609, 11th Symposium on Methodologies for Intelligent Systems, Proceedings/ISMIS'99, Warsaw, Poland.

Wieczorkowska, A. (1999b). Skuteczność rozpoznawania dźwięków instrumentów muzycznych w zależności od sposobu parametryzacji i rodzaju klasyfikatora (Efficiency of musical instrument sounds recognition depending on parameterization and classifier) [doctoral thesis] [in Polish]. Gdansk: Technical University of Gdansk.

Wieczorkowska, A., Wróblewski, J., Synak, P., & Ślęzak, D. (2003). Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems, 21(1), 71-93.
KEY TERMS

Aerophones: The category of musical instruments, called wind instruments, producing sound by the vibration of air. Woodwinds and brass (lip-vibrated) instruments belong to this category, including single-reed woodwinds (e.g., clarinet), double reeds (oboe), flutes, and brass (trumpet).

Articulation: The process by which sounds are formed; the manner in which notes are struck, sustained, and released. Examples: staccato (shortening and detaching of notes), legato (smooth), pizzicato (plucking strings), vibrato (varying the pitch of a note up and down), and muted (for stringed instruments, by sliding a block of rubber or similar material onto the bridge; for brass, by inserting a conical device into the bell).

Automatic Classification: The process by which a classificatory system processes information in order to classify data accurately; also, the result of such a process.

Chordophones: The category of musical instruments producing sound by means of a vibrating string; stringed instruments. Examples: guitar, violin, piano, harp, lyre, musical bow.
Idiophones: The category of musical instruments made of solid, non-stretchable sonorous material. Subcategories: idiophones struck together by concussion (e.g., castanets), struck (gong), rubbed (musical glasses), scraped (washboards), stamped (hard floors stamped with tap shoes), shaken (rattles), and plucked (jew's harp).

Membranophones: The category of musical instruments comprising skin drums. Examples: daraboukka, tambourine.

Parameterization: The assignment of parameters to represent processes that usually are not easily described by equations.

Sound: A physical disturbance in the medium through which it is propagated. This fluctuation may repeat periodically, and such periodic sound is perceived as having pitch. The audible frequency range is about 20-20,000 Hz (hertz, or cycles per second). A harmonic sound wave consists of frequencies that are integer multiples of the first component (the fundamental frequency), which corresponds to pitch.

Spectrum: The distribution of the component frequencies of a sound, each being a sine wave, with their amplitudes and phases (time locations of these components). These frequencies can be determined through Fourier analysis.
Bayesian Networks
Ahmad Bashir, University of Texas at Dallas, USA
Latifur Khan, University of Texas at Dallas, USA
Mamoun Awad, University of Texas at Dallas, USA
INTRODUCTION

A Bayesian network is a graphical model that finds probabilistic relationships among variables of a system. The basic components of a Bayesian network include a set of nodes, each representing a unique variable in the system, their inter-relations, as indicated graphically by edges, and associated probability values. By using these probabilities, termed conditional probabilities, and their interrelations, we can reason and calculate unknown probabilities. Furthermore, Bayesian networks have distinct advantages compared to other methods, such as neural networks, decision trees, and rule bases, which we shall discuss in this paper.
BACKGROUND

Bayesian classification is based on Naïve Bayesian classifiers, which we discuss in this section. Naive Bayesian classification is the popular name for a probabilistic classification. The term Naive Bayes refers to the fact that the probability model can be derived by using Bayes' theorem and that it incorporates strong independence assumptions that often have no bearing in reality; hence, they are deliberately naïve. Depending on the model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Naïve Bayes models uses the method of maximum likelihood. Abstractly, the desired probability model for a classifier is a conditional model
P(C | F1, …, Fn)

over a dependent class variable C with a small number of outcomes, or classes, conditional on several feature variables F1 through Fn. The naïve conditional independence assumptions play a role at this stage, when the probabilities are being computed. In this model, the assumption
is that each feature Fi is conditionally independent of every other feature Fj. This situation is mathematically represented as:
P(Fi | C, Fj) = P(Fi | C)

Such naïve models are easier to compute, because they factor into P(C) and a series of independent probability distributions. The Naïve Bayes classifier combines this model with a decision rule. The common rule is to choose the label that is most probable, known as the maximum a posteriori or MAP decision rule. In a supervised learning setting, one wants to estimate the parameters of the probability model. Because of the independent feature assumption, it suffices to estimate the class prior and the conditional feature models independently by using the method of maximum likelihood. The Naïve Bayes classifier has several properties that make it simple and practical, although the independence assumptions are often violated. The overall classifier is robust to serious deficiencies of its underlying naïve probability model, and in general, the Naïve Bayes approach is more powerful than might be expected from the extreme simplicity of its model; however, in the presence of non-independent attributes, the Naïve Bayesian classifier must be upgraded to the Bayesian classifier, which will more appropriately model the situation.
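A minimal from-scratch sketch of this model, assuming Gaussian class-conditional feature distributions (a common but not the only choice), may make the factorization and the MAP decision rule concrete; the class name and the synthetic data below are ours.

```python
import numpy as np

class TinyGaussianNB:
    """Illustrative Naive Bayes: P(C | F1..Fn) is proportional to
    P(C) * prod_i P(Fi | C), with each P(Fi | C) a Gaussian fit by maximum likelihood."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)      # class prior P(C)
            self.means[c] = Xc.mean(axis=0)        # ML estimate of the mean
            self.vars[c] = Xc.var(axis=0) + 1e-9   # ML estimate of the variance
        return self

    def predict(self, X):
        labels = []
        for x in X:
            # log P(C) + sum_i log P(Fi | C), then pick the MAP class
            scores = {
                c: np.log(self.priors[c])
                   - 0.5 * np.sum(np.log(2 * np.pi * self.vars[c])
                                  + (x - self.means[c]) ** 2 / self.vars[c])
                for c in self.classes
            }
            labels.append(max(scores, key=scores.get))
        return np.array(labels)

# Tiny synthetic example: two features, two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(TinyGaussianNB().fit(X, y).predict(X[:5]))
```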
MAIN THRUST

The basic concept in the Bayesian treatment of certainties in causal networks is conditional probability. Whenever the probability P(A) of an event A is stated, it is conditioned by other known factors. A conditional probability statement has the following form: given that event B has occurred, the probability of event A occurring is x.
Graphical Models

A graphical model visually illustrates conditional independencies among variables in a given problem. Two variables that are conditionally independent have no direct impact on each other's values. Furthermore, the graphical model shows any intermediary variables that separate two conditionally independent variables. Through these intermediary variables, two conditionally independent variables affect one another. A graph is composed of a set of nodes, which represent variables, and a set of edges. Each edge connects two nodes, and an edge can have an optional direction assigned to it. For X1 and X2, if a causal relationship between the variables exists, the edge will be directional, leading from the cause variable to the effect variable; if just a correlation between the variables exists, the edge will be undirected. We use an example with three variables to illustrate these concepts. In this example, two conditionally independent variables, A and C, are directly related to another variable, B. To represent this situation, an edge must exist between the nodes of the variables that are directly related, that is, between A and B and between B and C. Furthermore, the relationships between A and B and between B and C are correlations as opposed to causal relations; hence, the respective edges will be undirected. Figure 1 illustrates this example. Due to conditional independence, nodes A and C still have an indirect influence on one another; however, variable B encodes the information from A that impacts C, and vice versa. A Bayesian network is a specific type of graphical model, with directed edges and no cycles (Stephenson, 2000). The edges in Bayesian networks are viewed as causal connections, where each parent node causes an effect on its children. In addition, nodes in a Bayesian network contain a conditional probability table, or CPT, which stores all probabilities that may be used to reason or make inferences within the system.

Figure 1. Graphical model of two independent variables A and C that are directly related to a third variable B

Bayesian Probabilities

Probability calculus does not require that the probabilities be based on theoretical results or frequencies of repeated experiments, commonly known as relative frequencies. Probabilities may also be completely subjective estimates of the certainty of an event. Consider an example of a basketball game. If one were to bet on an upcoming game between Team A and Team B, it is important to know the probability of Team A winning the game. This probability is definitely not a ratio, a relative frequency, or even an estimate of a relative frequency; the game cannot be repeated many times under exactly the same conditions. Rather, the probability represents only one's belief concerning Team A's chances of winning. Such a probability is termed a Bayesian or subjective probability and makes use of Bayes' theorem to calculate unknown probabilities. A Bayesian probability may also be referred to as a personal probability. The Bayesian probability of an event x is a person's degree of belief in that event. A Bayesian probability is a property of the person who assigns the probability, whereas a classical probability is a physical property of the world, meaning it is the physical probability of an event. An important difference between physical probability and Bayesian probability is that repeated trials are not necessary to measure the Bayesian probability. The Bayesian method can assign a probability for events that may be difficult to experimentally determine. An oft-voiced criticism of the Bayesian approach is that probabilities seem arbitrary, but this is a probability assessment issue that does not take away from the many possibilities that Bayesian probabilities provide.
Causal Influence

Bayesian networks require an operational method for identifying causal relationships in order to model a domain accurately. Hence, causal influence is defined in the following manner: if the action of making variable X take some value sometimes changes the value taken by variable Y, then X is assumed to be responsible for sometimes changing Y's value, and one may conclude that X is a cause of Y. More formally, X is manipulated when we force X to take some value, and we say X causes Y if some manipulation of X leads to a change in the probability distribution of Y. Furthermore, if manipulating X leads to a change in the probability distribution of Y, then X obtaining a value by any means whatsoever also leads to a change in the probability distribution of Y. Hence, one can make the natural conclusion that causes and their effects are statistically correlated. However, note that variables can be correlated in less direct ways; that is, one variable may not necessarily cause the other. Rather, some intermediaries may be involved.
Bayesian Networks: A First Look

This section provides a detailed definition of Bayesian networks in an attempt to merge the probabilistic, logic, and graphical modeling ideas that we have presented thus far. Causal relations, apart from the reasoning that they model, also have a quantitative side, namely their strength. This is expressed by associating numbers to the links. Let the variable A be a parent of the variable B in a causal network. Using probability calculus, it is natural to let the conditional probability P(B|A) be the strength of the link between these variables. On the other hand, if the variable C is also a parent of the variable B, then the conditional probabilities P(B|A) and P(B|C) do not provide any information on how the impacts from variable A and variable C interact. They may cooperate or counteract in various ways. Therefore, the specification of P(B|A,C) is required. If loops exist within the reasoning, the domain model may contain feedback cycles; these cycles are difficult to model quantitatively. For such networks, no calculus coping with feedback cycles has been developed. Therefore, the network must be loop free. In fact, from a practical perspective, it should be modeled as closely to a tree as possible.
Evidential Reasoning

As stated previously, Bayesian networks accomplish such an economy of representation by pointing out, for each variable Xi, the conditional probabilities P(Xi | pai), where pai is the set of parents of Xi that render Xi independent of all its other predecessors. After giving this specification, the joint probability distribution can be calculated by the product

P(x1, ..., xn) = ∏i P(xi | pai)
Using this product, all probabilistic queries can be answered coherently with probability calculus. Passing evidence up and down a Bayesian network is known as belief propagation, and exact probability calculation is NP-hard, meaning that the time to compute exact probabilities can grow exponentially with the size of the Bayesian network. A number of algorithms are available for probabilistic calculations in Bayesian networks (Huang, King, & Lyu, 2002), beginning with Pearl's message passing algorithm.
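To make the factorization and a simple probabilistic query concrete, here is an illustrative sketch (not drawn from the article) that specifies CPTs for a three-node network and answers a query by brute-force enumeration; the variable names and probability values are invented, and the exhaustive sum over all states is exactly the kind of computation that scales exponentially with network size.

```python
from itertools import product

# A toy Bayesian network: Cloudy -> Rain -> WetGrass (all Boolean).
# CPT entries are invented for illustration.
p_cloudy = {True: 0.5, False: 0.5}
p_rain_given_cloudy = {True: 0.8, False: 0.1}   # P(Rain=T | Cloudy)
p_wet_given_rain = {True: 0.9, False: 0.2}      # P(Wet=T | Rain)

def joint(cloudy, rain, wet):
    """P(cloudy, rain, wet) = P(cloudy) * P(rain | cloudy) * P(wet | rain)."""
    p = p_cloudy[cloudy]
    p *= p_rain_given_cloudy[cloudy] if rain else 1 - p_rain_given_cloudy[cloudy]
    p *= p_wet_given_rain[rain] if wet else 1 - p_wet_given_rain[rain]
    return p

# Inference by enumeration: P(Rain=T | WetGrass=T)
numerator = sum(joint(c, True, True) for c in (True, False))
denominator = sum(joint(c, r, True) for c, r in product((True, False), repeat=2))
print("P(Rain | WetGrass) =", numerator / denominator)
```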
Advantages

A Bayesian network is a graphical model that finds probabilistic relationships among variables of the system, but a number of models are available for data analysis, including rule bases, decision trees, and artificial neural networks. Several techniques for data analysis also exist, including classification, density estimation, regression, and clustering. However, Bayesian networks have distinct advantages that we discuss here. One of the biggest advantages of Bayesian networks is that they have a bidirectional message passing architecture. Learning from the evidence can be interpreted as unsupervised learning. Similarly, expectation of an action can be interpreted as unsupervised learning. Because Bayesian networks pass data between nodes and see the expectations from the world model, they can be considered as bidirectional learning systems (Helsper & Gaag, 2002). In addition to bidirectional message passing, Bayesian networks have several important features, such as the allowance of subjective a priori judgments, direct representation of causal dependence, and the ability to imitate the human thinking process. Bayesian networks handle incomplete data sets without difficulty because they discover dependencies among all the variables. When one of the inputs is not observed, most other models will end up with an inaccurate prediction because they do not calculate the correlation between the input variables. Bayesian networks suggest a natural way to encode these dependencies. For this reason, they are a natural choice for image classification or annotation, because the lack of a classification can also be viewed as missing data. Considering the Bayesian statistical techniques, Bayesian networks facilitate the combination of domain knowledge and data (Neapolitan, 2004). Prior or domain knowledge is crucially important if one performs a real-world analysis, particularly when data are inadequate or expensive. The encoding of causal prior knowledge is straightforward because Bayesian networks have causal semantics. Additionally, Bayesian networks encode the strength of causal relationships with probabilities. Therefore, prior knowledge and data can be put together with well-studied techniques from Bayesian statistics for both forward and backward reasoning with the Bayesian network. Bayesian networks also ease many of the theoretical and computational difficulties of rule-based systems by utilizing graphical structures for representing and managing probabilistic knowledge. Independencies can be dealt with explicitly; they can be articulated, encoded graphically, read off the network, and reasoned about, and yet they forever remain robust to numerical imprecision. Moreover, graphical representations uncover several opportunities for efficient computation and serve as understandable logic diagrams. Bayesian networks can simulate humanlike reasoning; this fact is not, however, due to any structural similarities with the human brain. Rather, it is because of the resemblance between the ways Bayesian networks and humans reason. The resemblance is more psychological than biological but nevertheless a true benefit.
Bayesian Inference

Inference is the task of computing the probability of each value of a node in a Bayesian network when other variables' values are known (Jensen, 1999). This concept is what makes Bayesian networks so powerful, as it allows the user to apply knowledge toward forward or backward reasoning. Suppose that a specific value for one or more of the variables in the network has been observed. Once a variable has a definite value, or evidence, the probabilities, or belief values, for the other variables need to be revised, now that this variable has a defined value. This calculation of updated probabilities for system variables based on new evidence is precisely the definition of inference.
FUTURE TRENDS

The future of Bayesian networks lies in determining new ways to tackle the issues of Bayesian inferencing and of building a Bayesian structure that accurately represents a particular system. As we discuss in this paper, conditional dependencies can be mapped into a graph in several ways, each with subtle semantic and statistical differences. Future research will give way to Bayesian networks that can understand system semantics and adapt accordingly, not only with respect to the conditional probabilities within each node but also with respect to the graph itself.
CONCLUSION

A Bayesian network consists of the following elements:
• A set of variables and a set of directed edges between variables
• A finite set of mutually exclusive states for each variable
• A directed acyclic graph (DAG), constructed from the variables coupled with the directed edges
• A conditional probability table (CPT) P(A | B1, B2, …, Bn) that is associated with each variable A with parents B1, B2, …, Bn

Bayesian networks continue to play a vital role in prediction and classification within data mining (Niedermayer, 1998). They are a marriage between probability theory and graph theory, providing a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity. Also, Bayesian networks play an increasingly important role in the design and analysis of machine learning algorithms, serving as a promising way to approach present and future problems related to artificial intelligence and data mining (Choudhury, Rehg, Pavlovic, & Pentland, 2002; Doshi, Greenwald, & Clarke, 2002; Fenton, Cates, Forey, Marsh, Neil, & Tailor, 2003).
REFERENCES Choudhury, T., Rehg, J. M., Pavlovic, V., & Pentland, A. (2002). Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. Proceedings of the International Conference on Pattern Recognition (ICPR), Canada, III (pp. 789-794). Doshi, P., Greenwald, L., & Clarke, J. (2002). Towards effective structure learning for large Bayesian networks. Proceedings of the AAAI Workshop on Probabilistic Approaches in Search, Canada (pp. 16-22). Fenton, N., Cates, P., Forey, S., Marsh, W., Neil, M., & Tailor, M. (2003). Modelling risk in complex software projects using Bayesian networks (Tech. Rep. No. RADAR Tech Repo). London: Queen Mary University. Helsper, E.M. & Gaag, L.C. van der. (2002). Building Bayesian networks through ontologies. Proceedings of the 15th Eureopean Conference on Artificial Intelligence, Lyon, France (pp. 680-684). Huang, K., King, I., & R. Lyu, M. (2002). Learning maximum likelihood semi-naive Bayesian network classifier. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 3, 6, Hamammet, Tunisia. Jensen, F. (1999). Gradient descent training of Bayesian networks. Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (pp. 190-200). Neapolitan, R. E. (2004). Learning Bayesian networks. Upper Saddle River, NJ: Prentice-Hall. Niedermayer, D. (1998). An introduction to Bayesian networks and their contemporary applications. Re-
trieved October 2004, from http://www.niedermayer.ca/papers/bayesian/

Stephenson, T. (2000). An introduction to Bayesian network theory and usage. Retrieved October 2004, from http://www.idiap.ch/publications/todd00a.bib.abs.html
KEY TERMS

Bayes' Theorem: Result in probability theory that states the conditional probability of a variable A, given B, in terms of the conditional probability of variable B, given A, and the marginal probability of A alone.

Conditional Probability: Probability of some event A, assuming event B, written mathematically as P(A|B).

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.
Independent: Two random variables are independent when knowing something about the value of one of them does not yield any information about the value of the other.

Joint Probability: The probability of two events occurring in conjunction.

Maximum Likelihood: Method of point estimation using as an estimate of an unobservable population parameter the member of the parameter space that maximizes the likelihood function.

Neural Networks: Learning systems that are designed by analogy with a simplified model of the neural connections in the brain and can be trained to find nonlinear relationships in data.

Supervised Learning: A machine learning technique for creating a function from training data; the task of the supervised learner is to predict the value of the function for any valid input object after having seen only a small number of training data.
Best Practices in Data Warehousing from the Federal Perspective

Les Pang
National Defense University, USA
INTRODUCTION

Data warehousing has been a successful approach for supporting the important concept of knowledge management — one of the keys to organizational success at the enterprise level. Based on successful implementations of warehousing projects, a number of lessons learned and best practices were derived from these project experiences. The scope was limited to projects funded and implemented by federal agencies, military institutions, and organizations directly supporting them. Projects and organizations reviewed include the following:

• Census 2000 Cost and Progress System
• Defense Dental Standard System
• Defense Medical Logistics Support System Data Warehouse Program
• Department of Agriculture Rural Development Data Warehouse
• DOD Computerized Executive Information System
• Department of Transportation (DOT) Executive Reporting Framework System
• Environmental Protection Agency (EPA) Envirofacts Warehouse
• Federal Credit Union
• Housing and Urban Development (HUD) Enterprise Data Warehouse
• Internal Revenue Service (IRS) Compliance Data Warehouse
• U.S. Army Operational Testing and Evaluation Command
• U.S. Coast Guard Executive Information System
• U.S. Navy Type Commander's Readiness Management System
BACKGROUND

Data warehousing involves the consolidation of data from various transactional data sources in order to support the strategic needs of an organization. This approach links the various silos of data that are distributed throughout an organization. By applying this approach, an organization can gain significant competitive advantages through the new level of corporate knowledge. Various agencies in the Federal Government have attempted to implement a data warehousing strategy in order to achieve data interoperability. Many of these agencies have achieved significant success in improving internal decision processes as well as enhancing the delivery of products and services to the citizen. This chapter aims to identify the best practices that were implemented as part of successful data warehousing projects within the federal sector.
MAIN THRUST

Each best practice (indicated in boldface) and its rationale are listed below. Following each practice is a description of an illustrative project or projects (indicated in italics) that support the practice.
Ensure the Accuracy of the Source Data to Maintain the User's Trust of the Information in a Warehouse

The user of a data warehouse needs to be confident that the data in a data warehouse is timely, precise, and complete. Otherwise, a user who discovers suspect data in a warehouse will likely cease using it, thereby reducing the return on investment involved in building the warehouse. Within government circles, the appearance of suspect data takes on a new perspective. HUD Enterprise Data Warehouse - Gloria Parker, HUD Chief Information Officer, spearheaded data warehousing projects at the Department of Education and at HUD. The HUD warehouse effort was used to profile performance, detect fraud, profile customers, and do "what if" analysis. Business areas served include Federal Housing Administration loans, subsidized properties, and grants. She emphasizes that the public trust of the information is critical. Government agencies do not want to jeopardize the public trust by putting out bad data. Bad data will result in major ramifications not only from citizens but also from the government auditing arm, the General Accounting Office, and from Congress (Parker, 1999).
EPA Envirofacts Warehouse - The Envirofacts data warehouse comprises information from 12 different environmental databases for facility information, including toxic chemical releases, water discharge permit compliance, hazardous waste handling processes, Superfund status, and air emission estimates. Each program office provides its own data and is responsible for maintaining this data. Initially, the Envirofacts warehouse architects noted some data integrity problems, namely, issues with accurate data, understandable data, properly linked data, and standardized data. The architects had to work hard to address these key data issues so that the public can trust the quality of data in the warehouse (Garvey, 2003). U.S. Navy Type Commander Readiness Management System – The Navy uses a data warehouse to support the decisions of its commanding officers. Data at the lower unit levels is aggregated to the higher levels and then interfaced with other military systems for a joint military assessment of readiness as required by the Joint Chiefs of Staff. The Navy found that it was spending too much time to determine its readiness, and some of its reports contained incorrect data. The Navy developed a user-friendly, Web-based system that provides quick and accurate assessment of readiness data at all levels within the Navy. "The system collects, stores, reports and analyzes mission readiness data from air, sub and surface forces" for the Atlantic and Pacific Fleets. Although this effort was successful, the Navy learned that data originating from the lower levels still needs to be accurate. The reason is that a number of legacy systems, which serve as the source data for the warehouse, lacked validation functions (Microsoft, 2000).
Standardize the Organization's Data Definitions

A key attribute of a data warehouse is that it serves as "a single version of the truth." This is a significant improvement over the different and often conflicting versions of the truth that come from an environment of disparate silos of data. To achieve this singular version of the truth, there need to be consistent definitions of data elements to afford the consolidation of common information across different data sources. These consistent data definitions are captured in a data warehouse's metadata repository. DoD Computerized Executive Information System (CEIS) is a 4-terabyte data warehouse that holds the medical records of the 8.5 million active members of the U.S. military health care system who are treated at 115 hospitals and 461 clinics around the world. The Defense Department wanted to convert its fixed-cost health care system to a managed-care model to lower costs and increase patient care for the active military, retirees and
their dependents. Over 12,000 doctors, nurses and administrators use it. Frank Gillett, an analyst at Forrester Research, Inc., stated that, “What kills these huge data warehouse projects is that the human beings don’t agree on the definition of data. Without that . . . all that $450 million [cost of the warehouse project] could be thrown out the window” (Hamblen, 1998).
Be Selective on What Data Elements to Include in the Warehouse

Users are unsure of what they want, so they place an excessive number of data elements in the warehouse. This results in an immense, unwieldy warehouse in which query performance is impaired. Federal Credit Union - The data warehouse architect for this organization suggests that users know which data they use most, although they will not always admit to what they use least (Deitch, 2000).
Select the Extraction-Transformation-Loading (ETL) Strategy Carefully

Having an effective ETL strategy that extracts data from the various transactional systems, transforms the data to a common format, and loads the data into a relational or multidimensional database is the key to a successful data warehouse project. If the ETL strategy is not effective, it will mean delays in refreshing the data warehouse, contamination of the data warehouse with dirty data, and increased costs in maintaining the warehouse. IRS Compliance Warehouse supports research and decision support, allowing the IRS to analyze, develop, and implement business strategies for increasing voluntary compliance, improving productivity, and managing the organization. It also provides projections, forecasts, quantitative analysis, and modeling. Users are able to query this data for decision support. A major hurdle was to transform the large and diverse legacy online transactional data sets for effective use in an analytical architecture. The IRS needed a way to process custom hierarchical data files and convert them to ASCII for local processing and mapping to relational databases. They ended up developing a script program to do all of this. ETL is a major challenge and may be a "showstopper" for a warehouse implementation (Kmonk, 1999).
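A minimal sketch of the three ETL steps (extract, transform, load) is shown below; it is illustrative only, and the file name, table name, and column names are invented rather than taken from any of the systems described here.

```python
import csv
import sqlite3

# Extract: read rows from a hypothetical transactional export
def extract(path="transactions.csv"):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: normalize formats and drop obviously dirty rows
def transform(rows):
    clean = []
    for row in rows:
        try:
            clean.append({
                "agency": row["agency"].strip().upper(),
                "fiscal_year": int(row["fiscal_year"]),
                "amount": round(float(row["amount"]), 2),
            })
        except (KeyError, ValueError):
            continue  # in a real ETL job, dirty rows would be logged for review
    return clean

# Load: write the conformed records into the warehouse table
def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS spending (agency TEXT, fiscal_year INT, amount REAL)")
    con.executemany("INSERT INTO spending VALUES (:agency, :fiscal_year, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract()))
```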
Leverage the Data Warehouse to Provide Auditing Capability

An overlooked benefit of data warehouses is their capability of serving as an archive of historic knowledge that can be used as an audit trail for later investigations.
U.S. Army Operational Testing and Evaluation Command (OPTEC) is charged with developing test criteria and evaluating the performance of extremely complex weapons equipment in every conceivable environment and condition. Moreover, as national defense policy undergoes a transformation, so do the weapon systems, and thus the testing requirements. The objective of their warehouse was to consolidate a myriad of test data sets to provide analysts and auditors with access to the specific information needed to make proper decisions. OPTEC was having "fits" when audit agencies, such as the General Accounting Office (GAO), would show up to investigate a weapon system. For instance, if problems with a weapon show up five years after it is introduced into the field, people are going to want to know what tests were performed and the results of those tests. A warehouse with its metadata capability made data retrieval much more efficient (Microsoft, 2000).
Leverage the Web and Web Portals for Warehouse Data to Reach Dispersed Users

In many organizations, users are geographically distributed, and the World Wide Web has been very effective as a gateway for these dispersed users to access the key resources of their organization, which include data warehouses and data marts. U.S. Army OPTEC developed a Web-based front end for its warehouse so that information can be entered and accessed regardless of the hardware available to users. It supports the geographically dispersed nature of OPTEC's mission. Users performing tests in the field can be anywhere from Albany, New York, to Fort Hood, Texas. That is why the browser client the Army developed is so important to the success of the warehouse (Microsoft, 2000). DoD Defense Dental Standard System supports more than 10,000 users at 600 military installations worldwide. The solution consists of three main modules: Dental Charting, Dental Laboratory Management, and Workload and Dental Readiness Reporting. The charting module helps dentists graphically record patient information. The lab module automates the workflow between dentists and lab technicians. The reporting module allows users to see key information through Web-based online reports, which is a key to the success of the defense dental operations. IRS Compliance Data Warehouse includes a Web-based query and reporting solution that provides high-value, easy-to-use data access and analysis capabilities, can be quickly and easily installed and managed, and scales to support hundreds of thousands of users. With this portal, the IRS
found that portals provide an effective way to access diverse data sources via a single screen (Kmonk, 1999).
Make Warehouse Data Available to All Knowledge Workers (Not Only to Managers)

The early data warehouses were designed to support upper management decision making. However, over time, organizations have realized the importance of knowledge sharing and collaboration and its relevance to the success of the organizational mission. As a result, upper management has become aware of the need to disseminate the functionality of the data warehouse throughout the organization. IRS Compliance Data Warehouse supports a diversity of user types — economists, research analysts, and statisticians – all of whom are searching for ways to improve customer service, increase compliance with federal tax laws and increase productivity. It is not just for upper management decision making anymore (Kmonk, 1999).
Supply Data in a Format Readable by Spreadsheets

Although online analytical tools such as those supported by Cognos and Business Objects are useful for data analysis, the spreadsheet is still the basic tool used by most analysts. U.S. Army OPTEC wanted users to transfer data and work with information in applications that they are familiar with. In OPTEC, they transfer the data into a format readable by spreadsheets so that analysts can really crunch the data. Specifically, pivot tables found in spreadsheets allow the analysts to manipulate the information to put meaning behind the data (Microsoft, 2000).
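As a small illustration of supplying warehouse output in spreadsheet-friendly form, the sketch below builds a pivot table with pandas and exports it; the column names and file names are invented, and writing the .xlsx file assumes an optional Excel engine such as openpyxl is installed.

```python
import pandas as pd

# Hypothetical warehouse extract: one row per test event (column names are illustrative)
df = pd.DataFrame({
    "site":   ["Fort Hood", "Fort Hood", "Albany", "Albany"],
    "system": ["radar", "radio", "radar", "radio"],
    "score":  [88, 92, 79, 95],
})

# A pivot table like the one an analyst would build in a spreadsheet
summary = pd.pivot_table(df, values="score", index="site",
                         columns="system", aggfunc="mean")

# Export in spreadsheet-readable formats for further "crunching"
summary.to_csv("test_scores_by_site.csv")
df.to_excel("test_scores_raw.xlsx", index=False)  # requires an Excel writer such as openpyxl
print(summary)
```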
Restrict or Encrypt Classified/Sensitive Data

Depending on requirements, a data warehouse can contain confidential information that should not be revealed to unauthorized users. If privacy is breached, the organization may become legally liable for damages and suffer a negative reputation with the ensuing loss of customers' trust and confidence. Financial consequences can result. DoD Computerized Executive Information System uses an online analytical processing tool from a popular vendor that could be used to restrict access to certain data, such as HIV test results, so that any confidential data would not be disclosed (Hamblen, 1998).
Considerations must be made in the architecting of a data warehouse. One alternative is to use a roles-based architecture that allows access to sensitive data by only authorized users and the encryption of data in the event of data interception.
Perform Analysis During the Data Collection Process

Most data analyses involve completing data collection before analysis can begin. With data warehouses, a new approach can be undertaken. Census 2000 Cost and Progress System was built to consolidate information from several computer systems. The data warehouse allowed users to perform analyses during the data collection process, something that was previously not possible. The system allowed executives to take a more proactive management role. With this system, Census directors, regional offices, managers, and congressional oversight committees have the ability to track the 2000 census, which had never been done before (SAS, 2000).
Leverage User Familiarity with Browsers to Reduce Training Requirements

The interface of a Web browser is very familiar to most employees. Navigating through a learning management system using a browser may be more user friendly than using the navigation system of proprietary training software. U.S. Department of Agriculture (USDA) Rural Development, Office of Community Development, administers funding programs for the Rural Empowerment Zone Initiative. There was a need to tap into legacy databases to provide accurate and timely rural funding information to top policy makers. Through Web accessibility using an intranet system, there were dramatic improvements in financial reporting accuracy and timely access to data. Prior to the intranet, questions such as "What were the Rural Development investments in 1997 for the Mississippi Delta region?" required weeks of laborious data gathering and analysis, yet yielded obsolete answers with only an 80 percent accuracy factor. Now, similar analysis takes only a few minutes to perform, and the accuracy of the data is as high as 98 percent. More than 7,000 Rural Development employees nationwide can retrieve the information at their desktops, using a standard Web browser. Because employees are familiar with the browser, they did not need training to use the new data mining system (Ferris, 2003).
Use Information Visualization Techniques Such as Geographic Information Systems (GIS)
A GIS combines layers of data about a physical location to give users a better understanding of that location. GIS allows users to view, understand, question, interpret, and visualize data in ways simply not possible in paragraphs of text or in the rows and columns of a spreadsheet or table. EPA Envirofacts Warehouse includes the capability of displaying its output via the EnviroMapper GIS system. It maps several types of environmental information, including drinking water, toxic and air releases, hazardous waste, water discharge permits, and Superfund sites at the national, state, and county levels (Garvey, 2003). Individuals familiar with Mapquest and other online mapping tools can easily navigate the system and quickly get the information they need.
FUTURE TRENDS

Data warehousing will continue to grow as long as there are disparate silos of data sources throughout an organization. However, the irony is that there will be a proliferation of data warehouses as well as data marts, which will not interoperate within an organization. Some experts predict the evolution toward a federated architecture for the data warehousing environment. For example, there will be a common staging area for data integration and, from this source, data will flow among several data warehouses. This will ensure that the "single truth" requirement is maintained throughout the organization (Hackney, 2000). Another important trend in warehousing is one away from the historic nature of data in warehouses and toward real-time distribution of data so that information visibility will be instantaneous (Carter, 2004). This is a key factor for business decision-making in a constantly changing environment. Emerging technologies, namely service-oriented architectures and Web services, are expected to be the catalyst for this to occur.
CONCLUSION

An organization needs to understand how it can leverage data from a warehouse or mart to improve its level of service and the quality of its products and services. Also, the organization needs to recognize that its most valuable resource, the workforce, needs to be adequately trained in accessing and utilizing a data warehouse. The
workforce should recognize the value of the knowledge that can be gained from data warehousing and how to apply it to achieve organizational success. A data warehouse should be part of an enterprise architecture, which is a framework for visualizing the information technology assets of an enterprise and how these assets interrelate. It should reflect the vision and business processes of an organization. It should also include standards for the assets and interoperability requirements among these assets.
REFERENCES

AMS. (1999). Military marches toward next-generation health care service: The Defense Dental Standard System.

Carter, M. (2004). The death of data warehousing. Loosely Coupled.

Deitch, J. (2000). Technicians are from Mars, users are from Venus: Myths and facts about data warehouse administration (Presentation).

Ferris, N. (1999). 9 hot trends for '99. Government Executive.

Ferris, N. (2003). Information is power. Government Executive.

Garvey, P. (2003). Envirofacts warehouse public access to environmental data over the Web (Presentation).

Gerber, C. (1996). Feds turn to OLAP as reporting tool. Federal Computer Week.

Hackney, D. (2000). Data warehouse delivery: The federated future. DM Review.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Kirwin, B. (2003). Management update: Total cost of ownership analysis provides many benefits. Gartner Research, IGG-08272003-01.

Kmonk, J. (1999). Viador information portal provides Web data access and reporting for the IRS. DM Review.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Microsoft Corporation. (2000). OPTEC adopts data warehousing strategy to test critical weapons systems.

Microsoft Corporation. (2000). U.S. Navy ensures readiness using SQL Server.

Parker, G. (1999). Data warehousing at the federal government: A CIO perspective. In Proceedings from Data Warehouse Conference '99.

PriceWaterhouseCoopers. (2001). Technology forecast.

SAS. (2000). The U.S. Bureau of the Census counts on a better system.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.

KEY TERMS

ASCII: American Standard Code for Information Interchange. Serves as a code for representing English characters as numbers, with each letter assigned a number from 0 to 127.

Data Warehousing: A compilation of data designed for decision support by executives, managers, analysts, and other key stakeholders in an organization. A data warehouse contains a consistent picture of business conditions at a single point in time.

Database: A collection of facts, figures, and objects that is structured so that it can easily be accessed, organized, managed, and updated.

Enterprise Architecture: A business and performance-based framework to support cross-agency collaboration, transformation, and organization-wide improvement.

Extraction-Transformation-Loading (ETL): A key transitional set of steps in migrating data from the source systems to the database housing the data warehouse. Extraction refers to drawing out the data from the source system, transformation concerns converting the data to the format of the warehouse, and loading involves storing the data into the warehouse.

Geographic Information Systems: Map-based tools used to gather, transform, manipulate, analyze, and produce information related to the surface of the Earth.

Hierarchical Data Files: Database systems that are organized in the shape of a pyramid, with each row of objects linked to objects directly beneath it. This approach has generally been superseded by relational database systems.

Knowledge Management: A concept where an organization deliberately and comprehensively gathers, organizes, and analyzes its knowledge, then shares it internally and sometimes externally.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and that resides on a mainframe or minicomputer.

Outsourcing: Acquiring services or products from an outside supplier or manufacturer in order to cut costs and/or procure outside expertise.

Performance Metrics: Key measurements of system attributes that are used to determine the success of the process.

Pivot Tables: An interactive table found in most spreadsheet programs that quickly combines and compares typically large amounts of data. One can rotate its rows and columns to see different arrangements of the source data, and also display the details for areas of interest.

Terabyte: A unit of memory or data storage capacity equal to roughly 1,000 gigabytes.

Total Cost of Ownership: Developed by Gartner Group, an accounting method used by organizations seeking to identify both their direct and indirect systems costs.

NOTE

The views expressed in this article are those of the author and do not reflect the official policy or position of the National Defense University, the Department of Defense, or the U.S. Government.
Bibliomining for Library Decision-Making
Scott Nicholson, Syracuse University School of Information Studies, USA
Jeffrey Stanton, Syracuse University School of Information Studies, USA
INTRODUCTION
Most people think of a library as the little brick building in the heart of their community or the big brick building in the center of a college campus. However, these notions greatly oversimplify the world of libraries. Most large commercial organizations have dedicated in-house library operations, as do schools; nongovernmental organizations; and local, state, and federal governments. With the increasing use of the World Wide Web, digital libraries have burgeoned, serving a huge variety of different user audiences. With this expanded view of libraries, two key insights arise. First, libraries are typically embedded within larger institutions. Corporate libraries serve their corporations, academic libraries serve their universities, and public libraries serve taxpaying communities who elect overseeing representatives. Second, libraries play a pivotal role within their institutions as repositories and providers of information resources. In the provider role, libraries represent in microcosm the intellectual and learning activities of the people who comprise the institution. This fact provides the basis for the strategic importance of library data mining: By ascertaining what users are seeking, bibliomining can reveal insights that have meaning in the context of the library’s host institution. Use of data mining to examine library data might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g., the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data-mining techniques, which are advanced statistical and visualization methods to locate nontrivial patterns in large datasets. Bibliomining refers to the use of these bibliometric and data-mining techniques to explore the enormous quantities of data generated by the typical automated library.
BACKGROUND
Forward-thinking authors in the field of library science began to explore sophisticated uses of library data some years before the concept of data mining became popularized. Nutter (1987) explored library data sources to support decision making but lamented that “the ability to collect, organize, and manipulate data far outstrips the ability to interpret and to apply them” (p. 143). Johnston and Weckert (1990) developed a data-driven expert system to help select library materials, and Vizine-Goetz, Weibel, and Oskins (1990) developed a system for automated cataloging based on book titles (see also Morris, 1992, and Aluri & Riggs, 1990). A special section of Library Administration and Management, “Mining your automated system,” included articles on extracting data to support system management decisions (Mancini, 1996), extracting frequencies to assist in collection decision making (Atkins, 1996), and examining transaction logs to support collection management (Peters, 1996). More recently, Banerjeree (1998) focused on describing how data mining works and how to use it to provide better access to the collection. Guenther (2000) discussed data sources and bibliomining applications but focused on the problems with heterogeneous data formats. Doszkocs (2000) discussed the potential for applying neural networks to library data to uncover possible associations between documents, indexing terms, classification codes, and queries. Liddy (2000) combined natural language processing with text mining to discover information in digital library collections. Lawrence, Giles, and Bollacker (1999) created a system to retrieve and index citations from works in digital libraries. Gutwin, Paynter, Witten, Nevill-Manning, and Frank (1999) used text mining to support resource discovery. These projects all shared a common focus on improving and automating two of the core functions of a library: acquisitions and collection management. A few authors have recently begun to address the need to support management by focusing on understanding library users:
Schulman (1998) discussed using data mining to examine changing trends in library user behavior; Sallis, Hill, Jancee, Lovette, and Masi (1999) created a neural network that clusters digital library users; and Chau (2000) discussed the application of Web mining to personalize services in electronic reference. The December 2003 issue of Information Technology and Libraries was a special issue dedicated to the bibliomining process. Nicholson presented an overview of the process, including the importance of creating a data warehouse that protects the privacy of users. Zucca discussed the implementation of a data warehouse in an academic library. Wormell; Suárez-Balseiro, IribarrenMaestro, & Casado; and Geyer-Schultz, Neumann, & Thede used bibliomining in different ways to understand the use of academic library sources and to create appropriate library services. We extend these efforts by taking a more global view of the data generated in libraries and the variety of decisions that those data can inform. Thus, the focus of this work is on describing ways in which library and information managers can use data mining to understand patterns of behavior among library users and staff and patterns of information resource use throughout the institution.
MAIN THRUST Integrated Library Systems and Data Warehouses Most managers who wish to explore bibliomining will need to work with the technical staff of their Integrated Library System (ILS) vendors to gain access to the databases that underlie the system and create a data warehouse. The cleaning, preprocessing, and anonymizing of the data can absorb a significant amount of time and effort. Only by combining and linking different data sources, however, can managers uncover the hidden patterns that can help them understand library operations and users.
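As a minimal illustration of the kind of linking involved (not any particular ILS vendor's tooling), the following Python sketch joins a hypothetical circulation extract to an already-anonymized patron extract and counts subject use by patron category; all record layouts and field names are assumptions.

# Minimal sketch: linking two ILS extracts on an anonymized patron key.
# The record layouts (patron_hash, zip_prefix, item_id, subject) are illustrative only.
from collections import defaultdict

patrons = [  # from the user database, already stripped of names and IDs
    {"patron_hash": "a1f3", "zip_prefix": "132", "category": "undergraduate"},
    {"patron_hash": "b2e9", "zip_prefix": "132", "category": "faculty"},
]
circulation = [  # from the circulation module
    {"patron_hash": "a1f3", "item_id": "QA76.9", "subject": "data mining"},
    {"patron_hash": "a1f3", "item_id": "Z678.9", "subject": "library science"},
    {"patron_hash": "b2e9", "item_id": "QA76.9", "subject": "data mining"},
]

# Build a small "warehouse" table by joining the two sources.
by_patron = {p["patron_hash"]: p for p in patrons}
warehouse = [dict(c, **by_patron[c["patron_hash"]]) for c in circulation]

# A pattern only visible after linking: subject use by patron category.
use_by_category = defaultdict(int)
for row in warehouse:
    use_by_category[(row["category"], row["subject"])] += 1
print(dict(use_by_category))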
Exploration of Data Sources Available library data sources are divided into three groups for this discussion: data from the creation of the library, data from the use of the collection, and data from external sources not normally included in the ILS.
ILS Data Sources from the Creation of the Library System Bibliographic Information One source of data is the collection of bibliographic records and searching interfaces that represents materials in the library, commonly known as the Online Public Access Catalog (OPAC). In a digital library environment, the same type of information collected in a bibliographic library record can be collected as metadata. The concepts parallel those in a traditional library: Take an agreed-upon standard for describing an object, apply it to every object, and make the resulting data searchable. Therefore, digital libraries use conceptually similar bibliographic data sources to traditional libraries.
Acquisitions Information Another source of data for bibliomining comes from acquisitions, where items are ordered from suppliers and tracked until they are received and processed. Because digital libraries do not order physical goods, somewhat different acquisition methods and vendor relationships exist. Nonetheless, in both traditional and digital library environments, acquisition data have untapped potential for understanding, controlling, and forecasting information resource costs.
ILS Data Sources from Usage of the Library System User Information In order to verify the identity of users who wish to use library services, libraries maintain user databases. In libraries associated with institutions, the user database is closely aligned with the organizational database. Sophisticated public libraries link user records through zip codes with demographic information in order to learn more about their user population. Digital libraries may or may not have any information about their users, based upon the login procedure required. No matter what data are captured about the patron, it is important to ensure that the identification information about the patron is separated from the demographic information before this information is stored in a data warehouse; doing so protects the privacy of the individual.
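A minimal sketch of that separation step, assuming illustrative field names and a locally held salt; it shows one common way to keep a linkable but non-identifying key in the warehouse copy while the real identifiers stay outside it.

# Minimal sketch: splitting a patron record so identifying fields never reach
# the data warehouse; the field names and the salt are illustrative only.
import hashlib

def split_patron_record(record, salt="local-secret"):
    """Return (identity_row, warehouse_row) from one raw patron record."""
    # A salted one-way hash replaces the patron ID in the warehouse copy.
    key = hashlib.sha256((salt + record["patron_id"]).encode()).hexdigest()[:12]
    identity_row = {"patron_id": record["patron_id"], "patron_hash": key}
    warehouse_row = {
        "patron_hash": key,               # linkable, but not identifying
        "zip_prefix": record["zip"][:3],  # coarsened demographics
        "category": record["category"],
    }
    return identity_row, warehouse_row

raw = {"patron_id": "0042-7781", "name": "J. Smith", "zip": "13210", "category": "graduate"}
identity, warehouse = split_patron_record(raw)
print(warehouse)  # only this row is loaded into the warehouse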
Circulation and Usage Information The richest sources of information about library user behavior are circulation and usage records. Legal and ethical issues limit the use of circulation data, however. A data warehouse can be useful in this situation, because basic demographic information and details about the circulation could be recorded without infringing upon the privacy of the individual. Digital library services have a greater difficulty in defining circulation, as viewing a page does not carry the same meaning as checking a book out of the library, although requests to print or save a full text information resource might be similar in meaning. Some electronic fulltext services already implement the server-side capture of such requests from their user interfaces.
Searching and Navigation Information The OPAC serves as the primary means of searching for works owned by the library. Additionally, because most OPACs use a Web browser interface, users may also access bibliographic databases, the World Wide Web, and other online resources during the same session; all this information can be useful in library decision making. Digital libraries typically capture logs from users who are searching their databases and can track, through clickstream analysis, the elements of Web-based services visited by users. In addition, the combination of a login procedure and cookies allows the connection of user demographics to the services and searches they used in a session.
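The following sketch shows one simple way such logs can be grouped into sessions for later analysis; the log format, the patron hashes, and the 30-minute gap are assumptions, not a description of any specific OPAC.

# Minimal sketch: grouping Web-log lines from an OPAC-style interface into
# sessions by patron hash and a 30-minute inactivity gap.
from datetime import datetime, timedelta

log = [
    ("a1f3", "2004-03-01 10:02", "/opac/search?q=data+mining"),
    ("a1f3", "2004-03-01 10:05", "/opac/record/QA76.9"),
    ("a1f3", "2004-03-01 14:40", "/db/proquest"),
    ("b2e9", "2004-03-01 10:07", "/opac/search?q=metadata"),
]

def sessionize(entries, gap=timedelta(minutes=30)):
    sessions = {}
    for user, ts, url in sorted(entries):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and t - user_sessions[-1][-1][0] <= gap:
            user_sessions[-1].append((t, url))   # same session
        else:
            user_sessions.append([(t, url)])     # new session
    return sessions

for user, user_sessions in sessionize(log).items():
    print(user, [len(s) for s in user_sessions])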
External Data Sources Reference Desk Interactions In the typical face-to-face or telephone interaction with a library user, the reference librarian records very little information about the interaction. Digital reference transactions, however, occur through an electronic format, and the transaction text can be captured for later analysis, which provides a much richer record than is available in traditional reference work. The utility of these data can be increased if identifying information about the user can be captured as well, but again, anonymization of these transactions is a significant challenge.
Item Use Information
Fussler and Simon (as cited in Nutter, 1987) estimated that 75 to 80% of the use of materials in academic libraries is in house. Some types of materials never circulate, and therefore, tracking in-house use is also vital in discovering patterns of use. This task becomes much easier in a digital library, as Web logs can be analyzed to discover what sources the users examined.
Interlibrary Loan and Other Outsourcing Services Many libraries use interlibrary loan and/or other outsourcing methods to get items on a need-by-need basis for users. The data produced by this class of transactions will vary by service but can provide a window to areas of need in a library collection.
Applications of Bibliomining through a Data Warehouse Bibliomining can provide an understanding of the individual sources listed previously in this article; however, much more information can be discovered when sources are combined through common fields in a data warehouse.
Bibliomining to Improve Library Services Most libraries exist to serve the information needs of users, and therefore, understanding the needs of individuals or groups is crucial to a library’s success. For many decades, librarians have suggested works; market basket analysis can provide the same function through usage data in order to aid users in locating useful works. Bibliomining can also be used to determine areas of deficiency and to predict future user needs. Common areas of item requests and unsuccessful searches may point to areas of collection weakness. By looking for patterns in high-use items, librarians can better predict the demand for new items. Virtual reference desk services can build a database of questions and expert-created answers, which can be used in a number of ways. Data mining could be used to discover patterns for tools that will automatically assign questions to experts based upon past assignments. In addition, by mining the question/answer pairs for patterns, an expert system could be created that can provide users an immediate answer and a pointer to an expert for more information.
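As a toy illustration of the market-basket idea applied to circulation data (not a production recommender), the sketch below counts how often pairs of items were borrowed by the same anonymized patron and suggests co-borrowed items above a support threshold; the loan records and the threshold are assumptions.

# Toy sketch of market-basket-style suggestion from circulation data:
# items are "associated" when the same anonymized patron borrowed both.
from collections import defaultdict
from itertools import combinations

loans = {  # patron_hash -> set of items borrowed
    "a1f3": {"QA76.9", "Z678.9", "HF5548"},
    "b2e9": {"QA76.9", "HF5548"},
    "c7d0": {"QA76.9", "Z678.9"},
}

pair_counts = defaultdict(int)
for items in loans.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1

def suggest(item, min_support=2):
    hits = []
    for (a, b), count in pair_counts.items():
        if count >= min_support and item in (a, b):
            hits.append((b if a == item else a, count))
    return sorted(hits, key=lambda x: -x[1])

print(suggest("QA76.9"))  # e.g. items most often borrowed alongside QA76.9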
Bibliomining for Organizational Decision Making Within the Library
Just as the user behavior is captured within the ILS, the behavior of library staff can also be discovered by connecting various databases to supplement existing performance review methods. Although monitoring staff through their performance may be an uncomfortable concept, tighter budgets and demands for justification require thoughtful and careful performance tracking. In addition, research has shown that incorporating clear, objective measures into performance evaluations can actually improve the fairness and effectiveness of those evaluations (Stanton, 2000). Low-use statistics for a work may indicate a problem in the selection or cataloging process. Looking at the associations between assigned subject headings, call numbers, and keywords, along with the responsible party for the catalog record, may lead to a discovery of system inefficiencies. Vendor selection and price can be examined in a similar fashion to discover if a staff member consistently uses a more expensive vendor when cheaper alternatives are available. Most libraries acquire works both by individual orders and through automated ordering plans that are configured to fit the size and type of that library. Although these automated plans do simplify the selection process, if some or many of the works they recommend go unused, then the plan might not be cost effective. Therefore, merging the acquisitions and circulation databases and seeking patterns that predict low use can aid in appropriate selection of vendors and plans.
Bibliomining for External Reporting and Justification The library may often be able to offer insights to their parent organization or community about their user base through patterns detected with bibliomining. In addition, library managers are often called upon to justify the funding for their library when budgets are tight. Likewise, managers must sometimes defend their policies, particularly when faced with user complaints. Bibliomining can provide the data-based justification to back up the anecdotal evidence usually used for such arguments. Bibliomining of circulation data can provide a number of insights about the groups who use the library. By clustering the users by materials circulated and tying demographic information into each cluster, the library can develop conceptual user groups that provide a model of the important constituencies of the institution’s user base; this grouping, in turn, can fulfill some common organizational needs for understanding where common interests and expertise reside in the user community. This capability may be particularly valuable within large organizations where research and development efforts are dispersed over multiple locations.
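A minimal sketch of that clustering step, using a small hand-rolled k-means over an assumed patron-by-subject loan-count matrix and then profiling each cluster by an assumed demographic field; real analyses would use far larger matrices and a vetted clustering library.

# Minimal sketch: clustering patrons by the subjects they borrow, then
# profiling each cluster with a demographic field; all data are illustrative.
import numpy as np

# Rows: patrons; columns: loan counts per subject area
# (assumed order: computing, business, history).
X = np.array([
    [9, 1, 0],
    [8, 2, 0],
    [0, 7, 1],
    [1, 6, 0],
], dtype=float)
departments = ["engineering", "engineering", "management", "management"]

def kmeans(X, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

labels = kmeans(X)
for j in set(labels.tolist()):
    members = [departments[i] for i in range(len(X)) if labels[i] == j]
    print(f"cluster {j}: {members}")   # a "conceptual user group" profile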
FUTURE TRENDS
Consortial Data Warehouses One future path of bibliomining is to combine the data from multiple libraries through shared data warehouses. This merger will require standards if the libraries use different systems. One such standard is the COUNTER project (2004), which is a standard for reporting the use of digital library resources. Libraries working together to pool their data will be able to gain a competitive advantage over publishers and have the data needed to make better decisions. This type of data warehouse can power evidence-based librarianship, another growing area of research (Eldredge, 2000). Combining these data sources will allow library science research to move from making statements about a particular library to making generalizations about librarianship. These generalizations can then be tested on other consortial data warehouses and in different settings and may be the inspiration for theories. Bibliomining and other forms of evidence-based librarianship can therefore encourage the expansion of the conceptual and theoretical frameworks supporting the science of librarianship.
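A toy sketch of the pooling idea, using a simplified per-library usage report as a stand-in (it is not the actual COUNTER report format):

# Toy sketch: pooling usage counts from several libraries into one
# consortial table; the report layout is a simplified assumption.
from collections import defaultdict

reports = {
    "library_a": {"Journal of X": 120, "Journal of Y": 15},
    "library_b": {"Journal of X": 80, "Journal of Z": 40},
    "library_c": {"Journal of Y": 5, "Journal of Z": 60},
}

pooled = defaultdict(int)
for library, counts in reports.items():
    for title, uses in counts.items():
        pooled[title] += uses

# Evidence no single library could produce on its own:
for title, uses in sorted(pooled.items(), key=lambda x: -x[1]):
    print(f"{title}: {uses} consortium-wide uses")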
Bibliomining, Web Mining, and Text Mining Web mining is the exploration of patterns in the use of Web pages. Bibliomining uses Web mining as its base but adds some knowledge about the user. This aids in one of the shortcomings of Web mining — many times, nothing is known about the user. This lack still holds true in some digital library applications; however, when users access password-protected areas, the library has the ability to map some information about the patron onto the usage information. Therefore, bibliomining uses tools from Web usage mining but has more data available for pattern discovery. Text mining is the exploration of the context of text in order to extract information and understand patterns. It helps to add information to the usage patterns discovered through bibliomining. To use terms from information science, bibliomining focuses on patterns in the data that label and point to the information container, while text mining focuses on the information within that container. In the future, organizations that fund digital libraries can look to text mining to greatly improve access to materials beyond the current cataloging/metadata solutions. The quality and speed of text mining continues to improve. Liddy (2000) has researched the extraction of
information from digital texts; implementing these technologies can allow a digital library to move from suggesting texts that might contain the answer to just providing the answer by extracting it from the appropriate text or texts. The use of such tools risks taking textual material out of context and also provides few hints about the quality of the material, but if these extractions were links directly into the texts, then context could emerge along with an answer. This situation could provide a substantial asset to organizations that maintain large bodies of technical texts, because it would promote rapid, universal access to previously scattered and/or uncataloged materials.
Example of Hybrid Approach
Hwang and Chuang (in press) have recently combined bibliomining, Web mining, and text mining in a recommender system for an academic library. They started by using data mining on Web usage data for articles in a digital library and combining that information with information about the users. They then built a system that looked at patterns between works based on their content by using text mining. By combining these two systems into a hybrid system, they found that the hybrid system provides more accurate recommendations for users than either system taken separately. This example is a perfect representation of the future of bibliomining and how it can be used to enhance the text-mining research projects already in progress.
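The sketch below is not Hwang and Chuang's system; it is only a toy illustration of the general idea of blending a usage-based signal with a content-based one, with all data, keyword sets, and the weighting factor chosen arbitrarily.

# Toy illustration of a hybrid score: a weighted blend of co-borrowing counts
# (usage) and keyword-overlap similarity (content).
def content_similarity(keywords_a, keywords_b):
    a, b = set(keywords_a), set(keywords_b)
    return len(a & b) / len(a | b) if a | b else 0.0

keywords = {
    "D1": {"digital", "libraries", "recommender"},
    "D2": {"digital", "libraries", "metadata"},
    "D3": {"soil", "chemistry"},
}
co_usage = {("D1", "D2"): 14, ("D1", "D3"): 1, ("D2", "D3"): 0}

def hybrid_score(x, y, alpha=0.6):
    usage = co_usage.get((x, y)) or co_usage.get((y, x)) or 0
    usage_norm = usage / max(co_usage.values())
    return alpha * usage_norm + (1 - alpha) * content_similarity(keywords[x], keywords[y])

for other in ("D2", "D3"):
    print(f"D1 vs {other}: {hybrid_score('D1', other):.2f}")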
CONCLUSION Libraries have gathered data about their collections and users for years but have not always used those data for better decision making. By taking a more active approach based on applications of data mining, data visualization, and statistics, these information organizations can get a clearer picture of their information delivery and management needs. At the same time, libraries must continue to protect their users and employees from the misuse of personally identifiable data records. Information discovered through the application of bibliomining techniques gives the library the potential to save money, provide more appropriate programs, meet more of the users’ information needs, become aware of the gaps and strengths of their collection, and serve as a more effective information source for its users. Bibliomining can provide the databased justifications for the difficult decisions and funding requests library managers must make.
REFERENCES
Atkins, S. (1996). Mining automated systems for collection management. Library Administration & Management, 10(1), 16-19. Banerjee, K. (1998). Is data mining right for your library? Computers in Libraries, 18(10), 28-31. Chau, M. Y. (2000). Mediating off-site electronic reference services: Human-computer interactions between libraries and Web mining technology. IEEE Fourth International Conference on Knowledge-Based Intelligent Engineering Systems & Allied Technologies, USA, 2 (pp. 695-699). Chaudhry, A. S. (1993). Automation systems as tools of use studies and management information. IFLA Journal, 19(4), 397-409.
COUNTER (2004). COUNTER: Counting online usage of networked electronic resources. Retrieved from http:// www.projectcounter.org/about.html Doszkocs, T. E. (2000). Neural networks in libraries: The potential of a new information technology. Retrieved from http://web.simmons.edu/~chen/nit/NIT%2791/ 027~dos.htm Eldredge, J. (2000). Evidence-based librarianship: An overview. Bulletin of the Medical Library Association, 88(4), 289-302. Geyer-Schulz, A., Neumann, A., & Thede, A. (2003). An architecture for behavior-based library recommender systems. Information Technology and Libraries, 22(4), 165174. Guenther, K. (2000). Applying data mining principles to library data collection. Computers in Libraries, 20(4), 6063. Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., & Frank, E. (1999). Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems, 2I, 81-104. Hwang, S., & Chuang, S. (in press). Combining article content and Web usage for literature recommendation in digital libraries. Online Information Review. Johnston, M., & Weckert, J. (1990). Selection advisor: An expert system for collection development. Information Technology and Libraries, 9(3), 219-225. Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67-71.
Liddy, L. (2000, November/December). Text mining. Bulletin of the American Society for Information Science, 1314. Mancini, D. D. (1996). Mining your automated system for systemwide decision making. Library Administration & Management, 10(1), 11-15. Morris, A. (Ed.). (1992). Application of expert systems in library and information centers. London: Bowker-Saur. Nicholson, S. (2003). The bibliomining process: Data warehousing and data mining for library decision-making. Information Technology and Libraries, 22(4), 146-151. Nutter, S. K. (1987). Online systems and the management of collections: Use and implications. Advances in Library Automation Networking, 1, 125-149. Peters, T. (1996). Using transaction log analysis for library management information. Library Administration & Management, 10(1), 20-25. Sallis, P., Hill, L., Janee, G., Lovette, K., & Masi, C. (1999). A methodology for profiling users of large interactive systems incorporating neural network data mining techniques. Proceedings of the 1999 Information Resources Management Association International Conference (pp. 994-998). Schulman, S. (1998). Data mining: Life after report generators. Information Today, 15(3), 52. Stanton, J. M. (2000). Reactions to employee performance monitoring: Framework, review, and research directions. Human Performance, 13, 85-113. Suárez-Balseiro, C. A., Iribarren-Maestro, I., Casado, E. S. (2003). A study of the use of the Carlos III University of Madrid Library’s online database service in Scientific Endeavor. Information Technology and Libraries, 22(4), 179-182. Vizine-Goetz, D., Weibel, S., & Oskins, M. (1990). Automating descriptive cataloging. In R. Aluri, & D. Riggs (Eds.), Expert systems in libraries (pp. 123-127). Norwood, NJ: Ablex Publishing Corporation. Wormell, I. (2003). Matching subject portals with the research environment. Information Technology and Libraries, 22(4), 158-166. Zucca, J. (2003). Traces in the clickstream: Early work on a management information repository at the University of Pennsylvania. Information Technology and Libraries, 22(4), 175-178.
KEY TERMS Bibliometrics: The study of regularities in citations, authorship, subjects, and other extractable facets from scientific communication by using quantitative and visualization techniques. This study allows researchers to understand patterns in the creation and documented use of scholarly publishing. Bibliomining: The application of statistical and pattern-recognition tools to large amounts of data associated with library systems in order to aid decision making or to justify services. The term bibliomining comes from the combination of bibliometrics and data mining, which are the two main toolsets used for analysis. Data Warehousing: The gathering and cleaning of data from disparate sources into a single database, which is optimized for exploration and reporting. The data warehouse holds a cleaned version of the data from operational systems, and data mining requires the type of cleaned data that live in a data warehouse. Evidence-Based Librarianship: The use of the best available evidence, combined with the experiences of working librarians and the knowledge of the local user base, to make the best decisions possible (Eldredge, 2000). Integrated Library System: The automation system for libraries that combines modules for cataloging, acquisition, circulation, end-user searching, database access, and other library functions through a common set of interfaces and databases. Online Public Access Catalog (OPAC): The module of the Integrated Library System designed for use by the public to allow discovery of the library’s holdings through the searching of bibliographic surrogates. As libraries acquire more digital materials, they are linking those materials to the OPAC entries.
NOTE This work is based on Nicholson, S., & Stanton, J. (2003). Gaining strategic advantage through bibliomining: Data mining for management decisions in corporate, special, digital, and traditional libraries. In H. Nemati & C. Barko (Eds.). Organizational data mining: Leveraging enterprise data resources for optimal performance (pp. 247– 262). Hershey, PA: Idea Group.
Biomedical Data Mining Using RBF Neural Networks
Feng Chu, Nanyang Technological University, Singapore
Lipo Wang, Nanyang Technological University, Singapore
INTRODUCTION Accurate diagnosis of cancers is of great importance for doctors to choose a proper treatment. Furthermore, it also plays a key role in the searching for the pathology of cancers and drug discovery. Recently, this problem attracts great attention in the context of microarray technology. Here, we apply radial basis function (RBF) neural networks to this pattern recognition problem. Our experimental results in some well-known microarray data sets indicate that our method can obtain very high accuracy with a small number of genes.
BACKGROUND Microarray is also called gene chip or DNA chip. It is a newly appeared biotechnology that allows biomedical researchers monitor thousands of genes simultaneously (Schena, Shalon, Davis, & Brown, 1995). Before the appearance of microarrays, a traditional molecular biology experiment usually works on only one gene or several genes, which makes it difficult to have a “whole picture” of an entire genome. With the help of microarrays, researchers are able to monitor, analyze and compare expression profiles of thousands of genes in one experiment. On account of their features, microarrays have been used in various tasks such as gene discovery, disease diagnosis, and drug discovery. Since the end of the last century, cancer classification based on gene expression profiles has attracted great attention in both the biological and the engineering fields. Compared with traditional cancer diagnostic methods based mainly on the morphological appearances of tumors, the method using gene expression profiles is more objective, accurate, and reliable. More importantly, some types of cancers have subtypes with very similar appearances that are very hard to be classified by traditional methods. It has been proven that gene expression has a good capability to clarify this previously muddy problem.
Thus, to develop accurate and efficient classifiers based on gene expression becomes a problem of both theoretical and practical importance. Recent approaches on this problem include artificial neural networks (Khan et al., 2001), support vector machines (Guyon, Weston, Barnhill, & Vapnik, 2002), k-nearest neighbor (Olshen & Jain, 2002), nearest shrunken centroids (Tibshirani, Hastie, Narashiman, & Chu, 2002), and so on. A solution to this problem is to find out a group of important genes that contribute most to differentiate cancer subtypes. In the meantime, we should also provide proper algorithms that are able to make correct prediction based on the expression profiles of those genes. Such work will benefit early diagnosis of cancers. In addition, it will help doctors choose proper treatment. Furthermore, it also throws light on the relationship between the cancers and those important genes. From the point of view of machine learning and statistical learning, cancer classification using gene expression profiles is a challenging problem. The reason lies in the following two points. First, typical gene expression data sets usually contain very few samples (from several to several tens for each type of cancers). In other words, the training data are scarce. Second, such data sets usually contain a large number of genes, for example, several thousands. That is, the data are high dimensional. Therefore, this is a special pattern recognition problem with relatively small number of patterns and very high dimensionality. To provide such a problem with a good solution, appropriate algorithms should be designed. In fact, a number of different approaches such as knearest neighbor (Olshen and Jain, 2002), support vector machines (Guyon et al.,2002), artificial neural networks (Khan et al., 2001) and some statistical methods have been applied to this problem since 1995. Among these approaches, some obtained very good results. For example, Khan et al. (2001) classified small round blue cell tumors (SRBCTs) with 100% accuracy by using 96 genes. Tibshirani et al. (2002) successfully classified SRBCTs with 100% accuracy by using only 43 genes. They also
classified three different subtypes of lymphoma with 100% accuracy by using 48 genes (Tibshirani, Hastie, Narashiman, & Chu, 2003). However, there are still many things that can be done to improve present algorithms. In this work, we use and compare two gene selection schemes, i.e., principal components analysis (PCA) (Simon, 1999) and a t-test-based method (Tusher, Tibshirani, & Chu, 2001). After that, we introduce an RBF neural network (Fu & Wang, 2003) as the classification algorithm.
MAIN THRUST After a comparative study of gene selection methods, a detailed description of the RBF neural network and some experimental results are presented in this section.
Microarray Data Sets We analyze three well-known gene expression data sets, i.e., the SRBCT data set (Khan et al., 2001), the lymphoma data set (Alizadeh et al., 2000), and the leukemia data set (Golub et al., 1999). The lymphoma data set (http://llmpp.nih.gov/lymphoma) (Alizadeh et al., 2000) contains 4026 “well measured” clones belonging to 62 samples. These samples belong to following types of lymphoid malignancies: diffuse large B-cell lymphoma (DLBCL, 42 samples), follicular lymphoma (FL, nine samples) and chronicle lymphocytic leukemia (CLL, 11 samples). In this data set, a small part of data is missing. A k-nearest neighbor algorithm was used to fill those missing values (Troyanskaya et al., 2001). The SRBCT data set (http://research.nhgri.nih.gov/ microarray/Supplement/) (Khan et al., 2001) contains the expression data of 2308 genes. There are totally 63 training samples and 25 testing samples. Five of the testing samples are not SRBCTs. The 63 training samples contain 23 Ewing family of tumors (EWS), 20 rhabdomyosarcoma (RMS), 12 neuroblastoma (NB), and eight Burkitt lymphomas (BL). And the 20 testing samples contain six EWS, five RMS, six NB, and three BL. The leukemia data set (http://www-genome.wi.mit.edu/ cgi-\\bin /cancer/publications) (Golub et al., 1999) has two types of leukemia, i.e., acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Among these samples, 38 of them are for training; the other 34 blind samples are for testing. The entire leukemia data set contains the expression data of 7,129 genes. Different with the cDNA microarray data, the leukemia data are oligonucleotide microarray data. Because such expression data are raw data, we need to normalize them to reduce
the systemic bias induced during experiments. We follow the normalization procedure used by Dudoit, Fridlyand, and Speed (2002). Three preprocessing steps were applied: (a) thresholding with floor of 100 and ceiling of 16000; (b) filtering, exclusion of genes with max/min<5 or (max-min)<500. max and min refer to the maximum and the minimum of the gene expression values, respectively; and (c) base 10 logarithmic transformation. There are 3571 genes survived after these three steps. After that, the data were standardized across experiments, i.e., minus the mean and divided by the standard deviation of each experiment.
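A minimal sketch of these preprocessing steps, applied to a random stand-in for the genes-by-samples expression matrix (the real data must be obtained from the source cited above):

# Sketch of the preprocessing described above: thresholding, filtering,
# log-transformation, then per-experiment standardization.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.integers(-50, 30000, size=(7129, 38)).astype(float)  # genes x samples

# (a) thresholding with floor 100 and ceiling 16000
X = np.clip(expression, 100, 16000)

# (b) filtering: drop genes with max/min < 5 or (max - min) < 500
gmax, gmin = X.max(axis=1), X.min(axis=1)
X = X[(gmax / gmin >= 5) & (gmax - gmin >= 500)]

# (c) base-10 logarithmic transformation
X = np.log10(X)

# standardize across experiments: zero mean, unit variance per sample (column)
X = (X - X.mean(axis=0)) / X.std(axis=0)
print(X.shape)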
Methods for Gene Selection As mentioned in the former part, the gene expression data are very high-dimensional. The dimension of input patterns is determined by the number of genes used. In a typical microarray experiment, usually several thousands of genes take part in. Therefore, the dimension of patterns is several thousands. However, only a small number of the genes contribute to correct classification; some others even act as “noise”. Gene selection can eliminate the influence of such “noise”. Furthermore, the fewer the genes used, the lower the computational burden to the classifier. Finally, once a smaller subset of genes is identified as relevant to a particular cancer, it helps biomedical researchers focus on these genes that contribute to the development of the cancer. The process of gene selection is ranking genes’ discriminative ability first and then retaining the genes with high ranks. As a critical step for classification, gene selection has been studied intensively in recent years. There are two main approaches, one is principal component analysis (PCA) (Simon, 1999), perhaps the most widely used method; the other is a t-test-based approach which has been more and more widely accepted. In the important papers (Alizadeh et al., 2000; Khan et al., 2001), PCA was used. The basic idea of PCA is to find the most “informative” genes that contain most of the information in the data set. Another approach is based on t-test that is able to measure the difference between two groups. Thomas, Olsen, Tapscott, and Zhao. (2001) recommended this method. Tusher et al. (2001) and Pan (2002) also proposed their method based on t-test, respectively. Besides these two main methods, there are also some other methods. For example, a method called Markov blanket was proposed by Xing, Jordan, and Karp (2001). Li, Weinberg, Darden, and Pedersen (2001) applied another method which combined genetic algorithm and K-nearest neighbor. PCA (Simon, 1999) aims at reducing the input dimension by transforming the input space into a new space described by principal components (PCs). All the PCs are
orthogonal and they are ordered according to the absolute value of their eigenvalues. The k-th PC is the vector with the k-th largest eigenvalue. By leaving out the vectors with small eigenvalues, the input space’s dimension is reduced. In fact, the PCs indicate the directions with largest variations of input vectors. Because PCA chooses vectors with largest eigenvalues, it covers directions with largest variations of vectors. In the directions determined by the vectors with small eigenvalues, the variations of vectors are very small. In a word, PCA intends to capture the most informative directions (Simon, 1999). We tested PCA in the lymphoma data set (Alizadeh et al., 2000). We obtained 62 PCs from the 4026 genes in the data set by using PCA. Then, we ranked those PCs according to their eigenvalues (absolute values). Finally, we used our RBF neural network, which will be introduced later, to classify the lymphoma data set. At first, we randomly divided the 62 samples into two parts, 31 samples for training and the other 31 samples for testing. We then input the 62 PCs one by one to the RBF network according to their eigenvalue ranks, starting with the PC ranked one. That is, we first used only a single PC that is ranked 1 as the input to the RBF network. We trained the network with the training data and subsequently tested the network with the testing data. We repeated this process with the top two PCs, then the top three PCs, and so on. Figure 1 shows the testing error. From this result, we found that the RBF network cannot reach 100% accuracy. The best testing accuracy was 93.55%, which happened when 36 or 61 PCs were input to the classifier. The classification result using the t-test-based gene selection method, which is much better than the PCA approach, will be shown in the next section. The t-test-based gene selection measures the difference of genes’ distribution using a t-test-based scoring scheme, i.e., the t-score (TS). After that, only the genes with the highest TSs are put into our classifier. The TS of gene i is defined as follows (Tusher et al., 2001):
$$TS_i = \max\left\{\frac{\bar{x}_{ik} - \bar{x}_i}{d_k\, s_i},\; k = 1, 2, \ldots, K\right\}$$
where:
$$\bar{x}_{ik} = \frac{1}{n_k}\sum_{j \in C_k} x_{ij}, \qquad \bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \qquad s_i^2 = \frac{1}{n-K}\sum_{k}\sum_{j \in C_k}\left(x_{ij} - \bar{x}_{ik}\right)^2, \qquad d_k = \sqrt{\frac{1}{n_k} + \frac{1}{n}}$$
There are K classes. max{yk, k = 1, 2, ..., K} is the maximum of all yk, k = 1, 2, ..., K. Ck refers to class k, which includes nk samples. xij is the expression value of gene i in sample j. xik is the mean expression value in class k for gene i. n is the total number of samples. xi is the general mean expression value for gene i. si is the pooled within-class standard deviation for gene i. Actually, the t-score used here is a t-statistic between a specific class and the overall centroid of all the classes. To compare the t-test-based method with PCA, we also applied it to the lymphoma data set with the same procedure as we used for PCA. This method obtained 100% accuracy with only the top six genes. The results are shown in Figure 1. This comparison indicated that the t-test-based method was much better than PCA on this problem.
Figure 1. Classification results of using PCA and the t-test-based method as gene selection methods (plot omitted: testing error versus the number of PCs or genes used)
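The following sketch computes the TS ranking exactly as defined above, on a random stand-in for the expression matrix; the matrix, the class labels, and the choice to keep seven genes are illustrative only.

# Sketch of the t-score (TS) ranking using the class means, overall mean,
# pooled standard deviation and d_k defined above; data are random stand-ins.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2308, 63))     # genes x samples
y = rng.integers(0, 4, size=63)     # class label of each sample
K, n = 4, X.shape[1]

overall_mean = X.mean(axis=1)                                                   # x_i
class_means = np.stack([X[:, y == k].mean(axis=1) for k in range(K)], axis=1)   # x_ik
nk = np.array([(y == k).sum() for k in range(K)])
ss = sum(((X[:, y == k] - class_means[:, [k]]) ** 2).sum(axis=1) for k in range(K))
s = np.sqrt(ss / (n - K))                                                       # s_i
d = np.sqrt(1.0 / nk + 1.0 / n)                                                 # d_k

TS = np.max((class_means - overall_mean[:, None]) / (d * s[:, None]), axis=1)
top_genes = np.argsort(TS)[::-1][:7]   # e.g. keep the seven highest-ranked genes
print(top_genes)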
An RBF Neural Network An RBF neural network (Haykin, 1999) has three layers. The first layer is an input layer; the second layer is a hidden layer that includes some radial basis functions, also known as hidden kernels; the third layer is an output layer. An RBF neural network can be regarded as a mapping of the input domain X onto the output domain Y. Mathematically, an RBF neural network can be described as follows:
$$y_m(\mathbf{x}) = \sum_{i=1}^{N} w_{mi}\, G\!\left(\|\mathbf{x} - \mathbf{t}_i\|\right) + b_m, \qquad m = 1, 2, \ldots, M$$
Here ‖·‖ stands for the Euclidean norm. M is the number of outputs. N is the number of hidden kernels. ym(x) is the output m corresponding to the input x. ti is the position of kernel i. wmi is the weight between kernel i and output m. bm is the bias on output m. G(‖x − ti‖) is the kernel function. Usually, an RBF neural network uses Gaussian kernel functions as follows:
$$G\!\left(\|\mathbf{x} - \mathbf{t}_i\|\right) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{t}_i\|^2}{2\sigma_i^2}\right)$$
where σ i is the radius of the kernel i. The main steps to construct an RBF neural network include: (a) determining the positions of all the kernels (ti);
(b) determining the radius of each kernel (σi); and (c) calculating the weights between each kernel and each output node. In this paper, we use a novel RBF neural network proposed by Fu and Wang (2003), which allows for large overlaps of hidden kernels belonging to the same class.
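A generic sketch of the three construction steps just listed for a Gaussian-kernel RBF network (it is not the Fu and Wang algorithm), with kernel positions taken from training samples, a simple shared-radius heuristic, and output weights obtained by linear least squares on random stand-in data.

# Generic RBF construction sketch: (a) kernel positions, (b) kernel radii,
# (c) output weights and biases by least squares.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(63, 7))              # training samples x selected genes
Y = np.eye(4)[rng.integers(0, 4, 63)]     # one-hot class outputs, M = 4

# (a) kernel positions t_i: here, a random subset of training samples
N = 10
centers = X[rng.choice(len(X), size=N, replace=False)]

# (b) kernel radii sigma_i: a simple heuristic based on average center spacing
dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
sigma = np.full(N, dists[dists > 0].mean() / np.sqrt(2 * N))

# (c) output weights w_mi and biases b_m by linear least squares
G = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
           / (2 * sigma ** 2))
G1 = np.hstack([G, np.ones((len(X), 1))])   # append a bias column
W, *_ = np.linalg.lstsq(G1, Y, rcond=None)

predictions = np.argmax(G1 @ W, axis=1)
print("training accuracy:", (predictions == Y.argmax(axis=1)).mean())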
Results
In the SRBCT data set, we first ranked the entire 2308 genes according to their TSs (Tusher et al., 2001). Then we picked out 96 genes with the highest TSs. We applied our RBF neural network to classify the SRBCT data set. The SRBCT data set contains 63 samples for training and 20 blind samples for testing. We input the selected 96 genes one by one to the RBF network according to their TS ranks, starting with the gene ranked one. We repeated this process with the top two genes, then the top three genes, and so on. Figure 2 shows the testing errors with respect to the number of genes. The testing error decreased to 0 when the top seven genes were input into the RBF network. In the leukemia data set, we chose 56 genes with the highest TSs (Tusher et al., 2001). We followed the same procedure as in the SRBCT data set. We did classification with one gene, then two genes, then three genes, and so on. Our RBF neural network got an accuracy of 97.06%, i.e., one error in all 34 samples, when 12, 20, 22, or 32 genes were input, respectively.
Figure 2. The testing result in the SRBCT data set (plot omitted: testing error versus the number of genes used)
Figure 3. The testing result in the leukemia data set (plot omitted: testing error versus the number of genes used)
FUTURE TRENDS Until now, the focus of this work has been on the statistically important information in microarray data sets. In the near future, we will try to incorporate more biological knowledge into our algorithm, especially the correlations of genes. In addition, with more and more microarray data sets produced in laboratories around the world, we will try to mine multiple data sets with our RBF neural network, i.e., we will try to process the combined data sets. Such an attempt will hopefully bring us a much broader and deeper insight into those data sets.
CONCLUSION
Through our experiments, we conclude that the t-test-based gene selection method is an appropriate feature selection/dimension reduction approach, which can find more important genes than PCA can. The results in the SRBCT data set and the leukemia data set proved the effectiveness of our RBF neural network. In the SRBCT data set, it obtained 100% accuracy with only seven genes. In the leukemia data set, it made only one error with 12, 20, 22, and 32 genes, respectively. In view of this, we also conclude that our RBF neural network outperforms almost all the previously published methods in terms of accuracy and the number of genes required.
REFERENCES Alizadeh, A.A. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.
Dudoit, S., Fridlyand, J., & Speed, J. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of American Statistics Association, 97, 77-87. Fu, X., & Wang, L. (2003). Data dimensionality reduction with application to simplifying RBF neural network structure and improving classification performance. IEEE Trans. Syst., Man, Cybernetics. Part B: Cybernetics, 33, 399-409. Golub, T.R. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-537. Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389-422. Haykin, S. (1999). Neural network, a comprehensive foundation (2nd ed.). New Jersey, U.S.A: Prentice-Hall, Inc. Khan, J.M. et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679. Li, L., Weinberg, C.R., Darden, T.A., & Pedersen, L.G. (2001). Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17, 1131-1142. Olshen, A.B., & Jain, A.N. (2002). Deriving quantitative conclusions from microarray expression data. Bioinformatics, 18, 961-970. Pan, W. (2002). A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18, 546-554. Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470. Thomas, J.G., Olsen, J.M., Tapscott, S.J. & Zhao, L.P. (2001). An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research, 11, 1227-1236. Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA, 99, 6567-6572. Tibshirani, R., Hastie, T., Narashiman, B., & Chu, G. (2003). Class predication by nearest shrunken centroids with applications to DNA microarrays. Statistical Science, 18, 104-117. Troyanskaya, O., Cantor, M, & Sherlock, G., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520-525. 110
Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA, 98, 5116-5121. Xing, E.P., Jordan, M.I., & Karp, R.M. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 601-608). Morgan Kaufmann Publishers, Inc.
KEY TERMS Feature Extraction: Feature extraction is the process to obtain a group of features with the characters we need from the original data set. It usually uses a transform (e.g. principal component analysis) to obtain a group of features at one time of computation. Feature Selection: Feature selection is the process to select some features we need from all the original features. It usually measures the character (e.g. t-test score) of each feature first, then, chooses some features we need. Gene Expression Profile: Through microarray chips, an image that describes to what extent genes are expressed can be obtained. It usually uses red to indicate the high expression level and uses green to indicate the low expression level. This image is also called a gene expression profile. Microarray: A Microarray is also called a gene chip or a DNA chip. It is a newly appeared biotechnology that allows biomedical researchers monitor thousands of genes simultaneously. Principal Components Analysis: Principal components analysis transforms one vector space into a new space described by principal components (PCs). All the PCs are orthogonal to each other and they are ordered according to the absolute value of their eigenvalues. By leaving out the vectors with small eigenvalues, the dimension of the original vector space is reduced. Radial Basis Function (RBF) Neural Network: An RBF neural network is a kind of artificial neural network. It usually has three layers, i.e., an input layer, a hidden layer, and an output layer. The hidden layer of an RBF neural network contains some radial basis functions, such as Gaussian functions or polynomial functions, to transform input vector space into a new non-linear space. An RBF neural network has the universal approximation ability, i.e., it can approximate any function to any accuracy, as long as there are enough hidden neurons.
T-Test: T-test is a kind of statistical method that measures how large the difference is between two groups of samples. Testing a Neural Network: To know whether a trained neural network is the mapping or the regression we need, we test this network with some data that have not been used in the training process. This procedure is called testing a neural network.
Training a Neural Network: Training a neural network means using some known data to build the structure and tune the parameters of this network. The goal of training is to make the network represent a mapping or a regression we need.
Building Empirical-Based Knowledge for Design Recovery
Hee Beng Kuan Tan, Nanyang Technological University, Singapore
Yuan Zhao, Nanyang Technological University, Singapore
INTRODUCTION Although the use of statistically probable properties is very common in the area of medicine, it is not so in software engineering. The use of such properties may open a new avenue for the automated recovery of designs from source codes. In fact, the recovery of designs can also be called program mining, which in turn can be viewed as an extension of data mining to the mining in program source codes.
BACKGROUND Today, most of the tasks in software verification, testing, and re-engineering remain manually intensive (Beizer, 1990), time-consuming, and error prone. As many of these tasks require the recognition of designs from program source codes, automation of the recognition is an important means to improve these tasks. However, many designs are difficult (if not impossible) to recognize automatically from program source codes through theoretical knowledge alone (Biggerstaff, Mitbander, & Webster, 1994; Kozaczynski, Ning, & Engberts, 1992). Most of the approaches proposed for the recognition of designs from program source codes are based on plausible inference (Biggerstaff et al., 1994; Ferrante, Ottenstein, & Warren, 1987; Kozaczynski et al., 1992). That is, they are actually based on empirical-based knowledge (Kitchenham, Pfleeger, Pickard, Jones, Hoaglin, Emam, & Rosenberg, 2002). However, to the best of our knowledge, the building of empirical-based knowledge to supplement theoretical knowledge for the recognition of designs from program source code has not been formally discussed in the literature. This paper introduces an approach for the building and applying of empirical-based knowledge to supplement theoretical knowledge for the recognition of designs from program source codes. The first section introduces the proposed approach. The second section
discusses the application of the proposed approach for the recovery of functional dependencies enforced in database transactions. The final section shows our conclusion.
MAIN THRUST The Approach Many types of designs are usually implemented through a few methods. The use of a method has a direct influence on the programs that implement the designs. As a result, these programs may have some certain characteristics. And we may be able to recognize the designs or their properties through recognizing these characteristics, from either a theoretical or empirical basis or the combination of the two. An overview of the approach for building empiricalbased knowledge for design recovery through program analysis is shown in Figure 1. In the figure, arcs show interactions between tasks. In the approach, we first research the designs or their properties, which can be recognized from some characteristics in the programs that implement them through automated program analysis. This task requires domain knowledge or experience. The reason for using design properties is that some designs could be too complex to recognize directly. In the latter case, we first recognize the properties, then use them to infer the designs. We aim for characteristics that are not only sufficient but also necessary for recognizing designs or their properties. If the characteristics are sufficient but not necessary, even if we cannot infer the target designs, they do not imply the nonexistence of the designs. If we cannot find characteristics from which a design can be formally proved, then we will look for characteristics that have significant statistical evidence. These empirical-based characteristics are taken as hypotheses. With the use of hypotheses and theoretical knowledge, a theory is built for the inference of designs.
Secondly, we design experiments to validate the hypotheses. Some software tools should be developed to automate or semiautomate the characterization of the properties stated in the hypotheses. We may merge multiple hypotheses together as a single hypothesis for the convenience of hypothesis testing. An experiment is designed to conduct a binomial test (Gravetter & Wallnau, 2000) for each resulting hypothesis. If altogether we have k hypotheses denoted by H1, ..., Hk and we would like the probability of validity of the proposed design recovery to be more than or equal to q, then we must choose p1, ..., pk such that p1 × p2 × ... × pk ≥ q. For each hypothesis Hj (1 ≤ j ≤ k), the null and alternate hypotheses of the binomial test state that Hj holds in less than pj × 100% and in at least pj × 100% of the cases, respectively. That is:
Hj0 (null hypothesis): probability that Hj holds < pj
Hj1 (alternate hypothesis): probability that Hj holds ≥ pj
For the use of the normal approximation for the binomial test, both npj and n(1 − pj) must be greater than or equal to 10. As such, the sample size n must be greater than or equal to max(10/pj, 10/(1 − pj)). The experiment is designed to draw a random sample of size n to test the hypothesis. For each case in the sample, the validity of the hypothesis is examined. The total number of cases, X, in which the hypothesis holds is recorded and substituted into the following binomial test statistic:
$$z = \frac{X/n - p_j}{\sqrt{p_j\,(1 - p_j)/n}}$$
Let α be the Type I error probability. If z is greater than zα, we reject the null hypothesis; otherwise, we accept the null hypothesis, where zα is the value such that the probability that a standard normal variable is greater than or equal to zα is α.
Thirdly, we develop the required software tools and conduct the experiments to test each hypothesis according to the design drawn up in the previous step.
Fourthly, if all the hypotheses are accepted, algorithms will be developed for the use of the theory to automatically recognize the designs from program source codes. A software tool will also be developed to implement the algorithms. Some experiments should also be conducted to validate the effectiveness of the method.
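To make the test procedure concrete, the following is a minimal sketch (not part of the original approach; the numbers in the example call are purely illustrative) of how the normal-approximation binomial test described above might be computed in Python:

```python
import math
from statistics import NormalDist

def binomial_z_test(n, successes, p, alpha=0.001):
    """Normal-approximation binomial test for
    H0: P(hypothesis holds) < p  versus  H1: P(hypothesis holds) >= p."""
    # The normal approximation requires n*p >= 10 and n*(1-p) >= 10,
    # i.e. n >= max(10/p, 10/(1-p)), as stated in the text.
    if n < max(10 / p, 10 / (1 - p)):
        raise ValueError("sample too small for the normal approximation")
    z = (successes / n - p) / math.sqrt(p * (1 - p) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # about 3.09 for alpha = 0.001
    return z, z > z_alpha                       # (test statistic, reject H0?)

# Illustrative call with made-up numbers: 300 sampled cases, 298 satisfy the
# hypothesis, and we test whether it holds in at least 95% of all cases.
z, reject = binomial_z_test(n=300, successes=298, p=0.95)
print(round(z, 2), reject)
```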
Figure 1. An overview of the proposed design recovery. [The figure shows the tasks Identify Designs or their Properties; Identify Program Characteristics and Formulate Hypotheses; Develop Theory for the Inference of Designs; Develop Software Tools to Aid Experiments for Hypotheses Testing; Conduct Experiments for Hypotheses Testing; and, on the "accept hypothesis" outcome (as opposed to "reject hypothesis"), Develop Algorithms for Automated Design Recovery, Develop Software Tool to Implement the Algorithms, and Conduct Experiment to Validate the Effectiveness, with arcs showing interactions between tasks.]
Applying the Proposed Approach for the Recovery of Functional Dependencies
Let R be a record type and X be a sequence of attributes of R. For any record r in R, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type, and X and Y be sequences of attributes of R. We say that the functional dependency (FD), X → Y of R, holds at time t if, at time t, for any two R records r and s whose X-values are identical, the Y-values of r and s are also identical. We say that the functional dependency holds in a database if, except in the midst of a transaction execution (which updates some record types involved in the dependency), the dependency always holds (Ullman, 1982).
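As an aside, this definition can be checked mechanically on a snapshot of records. The following is a small illustrative sketch (not from the paper; the records shown are abbreviated, hypothetical rows of the plant-equipment table discussed later) that tests whether X → Y holds for a set of R records represented as dictionaries:

```python
def fd_holds(records, x_attrs, y_attrs):
    """Return True if X -> Y holds in `records`, i.e. any two records with
    identical X-values also have identical Y-values."""
    seen = {}  # maps an X-value to the first Y-value observed for it
    for r in records:
        x_val = tuple(r[a] for a in x_attrs)
        y_val = tuple(r[a] for a in y_attrs)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen.setdefault(x_val, y_val)
    return True

# Hypothetical records: equipment-manufacturer -> manufacturer-address.
plant_equipment = [
    {"equipment-manufacturer": "Acme", "manufacturer-address": "1 Main St"},
    {"equipment-manufacturer": "Acme", "manufacturer-address": "1 Main St"},
    {"equipment-manufacturer": "Zeta", "manufacturer-address": "9 Hill Rd"},
]
print(fd_holds(plant_equipment, ["equipment-manufacturer"], ["manufacturer-address"]))  # True
```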
Many of the world's database applications have been built on old-generation DBMSs (database management systems). Due to the nature of system development, many functional dependencies are not discovered in the initial system development. They are only identified during the system maintenance stage. Although keys can be used to implement functional dependencies in old-generation DBMSs, due to the effort in restructuring databases during the system maintenance stage, most functional dependencies identified during this stage are not defined explicitly as keys in the databases. They are enforced in transactions. Furthermore, most of the conventional files and relational databases allow only the definition of one key. As such, most of the candidate keys are enforced in transactions. As a result, many functional dependencies in a legacy database are enforced in the transactions that update the database.
Our previous work (Tan, 2004) has proven that if all the transactions that update a database satisfy a set of properties with reference to a functional dependency, X → Y of R, then the functional dependency holds in the database. Before proceeding further, we shall discuss these properties. For generality, we shall express a program path in the form (a1, …, an), where each ai is either a node or is in the form {ni1, …, nik}, in which ni1, …, nik are nodes. If ai is in the form {ni1, …, nik}, then in (a1, …, an), after the predecessor node of ai, the subsequence of nodes ni1, …, nik may occur any number of times (possibly zero), before proceeding to the successor of ai.
We shall also introduce some notations to represent certain types of nodes in control flow graphs of transactions that will be used throughout this paper:
• rd(R, W == b): A node that reads or selects an R record in which the W-value is b, if it exists.
• mdf(R, W == b, Z1 := c1, …, Zn := cn): A node that modifies the Z1-value, …, Zn-value of an R record, in which the original W-value is b, to c1, …, cn, respectively.
• ins(R, Z1 := c1, …, Zn := cn): A node that inserts an R record, in which its Z1-value, …, Zn-value are set to c1, …, cn, respectively.
Here, R is a record type. W, Z1, …, Zn are sequences of R attributes. The values of those attributes that are not mentioned in the mdf and ins nodes can be modified and set to any value. We shall also use xclu_fd(X → Y of R) to denote a node in the control flow graph of a transaction that does not perform any updating that may lead to the violation of the functional dependency X → Y of R.
• A program path in the form (rd(R, X == x’), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, ins(R, X := x’, Y := y’), {xclu_fd(X → Y of R)}}), such that, if the rd node successfully reads an R record as specified, y’ is identical to the Y-value of the record read, is called an insertion pattern for enforcing the FD, X → Y of R.
• A program path in the form ({{xclu_fd(X → Y of R)}, mdf(R, X == x0, Y := y’), {xclu_fd(X → Y of R)}}), such that all the R records in which the X-values are equal to x0 are modified by the mdf node, is called a Y-modification pattern for enforcing the FD, X → Y of R.
• A program path in the form (rd(R, X == x’), {xclu_fd(X → Y of R)}, {{xclu_fd(X → Y of R)}, mdf(R, X == x0, X := x’, Y := unchange), {xclu_fd(X → Y of R)}}), such that the mdf node is only executed if the rd node does not successfully read an R record as specified, is called an X-modification pattern for enforcing the FD, X → Y of R.
We have proven the following rules (Tan, 2004):
• Nonviolation of FD: In a transaction, if all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency, X → Y of R, then the transaction does not violate the functional dependency.
• FD in Database: Each transaction that updates any record type involved in a functional dependency ϕ does not violate ϕ if and only if ϕ is a functional dependency designed in the database.
Theoretically, the property stated in the first rule is not a necessary property in order for the functional dependency to hold. As such, we may not be able to recover all functional dependencies enforced by recognizing these properties. Fortunately, other than in very exceptional cases, the enforcement of a functional dependency does result in the previously mentioned property. As such, empirically, the property is usually also necessary for the functional dependency, X → Y of R, to hold in the database. Thus, we take the following hypothesis.
• Hypothesis 1: If a transaction does not violate the functional dependency, X → Y of R, then all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency.
With the hypothesis, the result discussed by Tan and Thein (in press) can be extended to the following theorem.
• Theorem 1: A functional dependency, X → Y of R, holds in a database if and only if, in each transaction that updates any record type involved in the functional dependency, all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency.
The proof of sufficiency can be found in previous work (see Tan & Thein, 2004). The necessity is implied directly from the hypothesis. This theorem can be used to recover all the functional dependencies in a database. We have developed an algorithm for the recovery. The algorithm is similar to the algorithm presented by Tan and Thein (2004).
We have conducted an experiment to validate the hypothesis. In this experiment, we asked each of our 71 postgraduate students taking the software requirements analysis and design course to design the following three transactions in pseudocode to update the table plant-equipment (plant-name, equipment-name, equipment-manufacturer, manufacturer-address), such that each transaction ensures that the functional dependency, equipment-manufacturer → manufacturer-address of plant-equipment, holds:
• Insert a new tuple in the table to store a user input each time. The input has four fields: inp-plant-name, inp-equipment-name, inp-equipment-manufacturer and inp-manufacturer-address. In the tuple, the value of each attribute is set according to the corresponding input field.
• For each user input that has a single field, inp-equipment-manufacturer, delete the tuple in the table in which the value of the attribute, equipment-manufacturer, is identical to the value of inp-equipment-manufacturer.
• For each user input that has two fields, old-equipment-manufacturer and new-equipment-manufacturer, update the equipment-manufacturer in each tuple in which the value of equipment-manufacturer is identical to the value of old-equipment-manufacturer to the value of new-equipment-manufacturer.
We also analyzed 50 transactions from three existing database applications.

Table 1. The statistics of an experiment

Transaction    Enforcement of the Functional Dependency
               Correct    Wrong
1              55         16
2              65         6
3              36         35

Table 1 summarizes the result of the transactions designed by the students. We examined each transaction that enforces the functional dependency. We found that in each of these transactions, all the nodes that insert R records or modify the attribute X or Y in any program path from the start node to the end node are always contained in a sequence of subpaths in the patterns for enforcing the functional dependency. As such, Hypothesis 1 holds in each of these transactions. Each transaction that does not enforce the functional dependency correctly violates the functional dependency, so Hypothesis 1 (a conditional whose antecedent is then false) also holds in such a transaction. Therefore, Hypothesis 1 holds in all the 213 transactions. For the 50 transactions that we drew from three existing database applications, each database application enforces only one functional dependency in its transactions (other functional dependencies are implemented as keys in the databases). We checked each transaction on Hypothesis 1 with respect to the functional dependency that is enforced in the transactions of the database application to which the transaction belongs, and Hypothesis 1 held in all the transactions. Therefore, Hypothesis 1 holds for all the 263 transactions in our sample.
Taking α = 0.001 as the Type I error probability, if the binomial test statistic z computed from the formula discussed previously in this paper is greater than 3.09, we reject the null hypothesis; otherwise, we accept the null hypothesis. In our experiment, we have n = 263, X = 263, and p = 0.96, so our z score is 3.31. Note that our sample size is greater than max(10/0.96, 10/(1 − 0.96)). Thus, we reject the null hypothesis, and the test gives evidence that Hypothesis 1 holds for 96% or more of the cases at the 0.1% level of significance. Because only one hypothesis is involved in the approach for the inference of functional dependencies, the test gives evidence that the approach holds for 96% or more of the cases at the 0.1% level of significance.
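As a quick arithmetic check (a sketch of ours, not part of the original study), plugging the reported values n = 263, X = 263 and p = 0.96 into the test statistic defined earlier reproduces the z score of about 3.31:

```python
import math

n, X, p = 263, 263, 0.96
assert n >= max(10 / p, 10 / (1 - p))          # 263 >= max(10.4, 250)
z = (X / n - p) / math.sqrt(p * (1 - p) / n)   # observed proportion is 1.0
print(round(z, 2))                             # about 3.31, above the 3.09 cutoff
```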
FUTURE TRENDS
In general, many designs are very difficult (if not impossible) to formally prove from program characteristics (Chandra, Godefroid, & Palm, 2002; Clarke, Grumberg, & Peled, 1999; Deng & Kothari, 2002). Therefore, the use of empirical-based knowledge is important for the automated recognition of designs from program source codes (Embury & Shao, 2001; Tan, Ling, & Goh, 2002; Wong, 2001). We believe that the integration of
empirical-based properties into existing program analysis and model-checking techniques will be a fruitful direction in the future.
CONCLUSION
Empirical-based knowledge has been used in the recognition of designs from source codes through automated program analysis, and it is a promising research direction. This chapter introduces an approach for building empirical-based knowledge, a vital part of such research exploration. We have also applied it to the recognition of functional dependencies enforced in database transactions. We believe that our approach will encourage more exploration of the discovery and use of empirical-based knowledge in this area. Recently, we have completed our work on the recovery of post-transaction user-input error (PTUIE) handling for database transactions. This approach appeared in IEEE Transactions on Knowledge and Data Engineering (Tan & Thein, 2004).
REFERENCES
Basili, V. R. (1996). The role of experimentation in software engineering: Past, current, and future. The 18th International Conference on Software Engineering (pp. 442-449), Germany.
Beizer, B. (1990). Software testing techniques. New York: Van Nostrand Reinhold.
Chandra, S., Godefroid, P., & Palm, C. (2002). Software model checking in practice: An industrial case study. Proceedings of the International Conference on Software Engineering (pp. 431-441), USA.
Clarke, E. M., Grumberg, O., & Peled, D. A. (1999). Model checking. MIT Press.
Deng, Y. B., & Kothari, S. (2002). Recovering conceptual roles of data in a program. Proceedings of the International Conference on Software Maintenance (pp. 342-350), Canada.
Embury, S. M., & Shao, J. (2001). Assisting the comprehension of legacy transactions. Proceedings of the Working Conference on Reverse Engineering (pp. 345-354), Germany.
Ferrante, J., Ottenstein, K. J., & Warren, J. O. (1987). The program dependence graph and its use in optimisation. ACM Transactions on Programming Languages and Systems, 9(3), 319-349.
Gravetter, F. J., & Wallnau, L. B. (2000). Statistics for the behavioral sciences. Belmont, CA: Wadsworth.
Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., & Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721-734.
Kozaczynski, W., Ning, J., & Engberts, A. (1992). Program concept recognition and transformation. IEEE Transactions on Software Engineering, 18(12), 1065-1075.
Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering, 14(4), 825-835.
Tan, H. B. K., & Thein, N. L. (in press). Recovery of PTUIE handling from source codes through recognizing its probable properties. IEEE Transactions on Knowledge and Data Engineering, 16(10), 1217-1231.
Ullman, J. D. (1982). Principles of database systems (2nd ed.). Rockville, MD: Computer Science Press.
Wong, K. (2001). Research challenges in reverse engineering community. Proceedings of the International Workshop on Program Comprehension (pp. 323-332), Canada.
KEY TERMS
Control Flow Graph: An abstract data structure used in compilers. It is an abstract representation of a procedure or program, maintained internally by a compiler. Each node in the graph represents a basic block. Directed edges are used to represent jumps in the control flow.
Design Recovery: Recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domains.
Functional Dependency: Let R be a record type, and X and Y be sequences of attributes of R. For any record r in R, its sequence of values of the attributes in X is referred to as the X-value of r. We say that the functional dependency, X → Y of R, holds at time t if, at time t, for any two R records r and s whose X-values are identical, the Y-values of r and s are also identical.
Hypothesis Testing: Hypothesis testing refers to the process of using statistical analysis to determine if the observed differences between two or more samples are due to random chance (as stated in the null hypothesis) or to true differences in the samples (as stated in the alternate hypothesis).
Model Checking: A method for formally verifying finite-state concurrent systems. Specifications about the system are expressed as temporal logic formulas, and efficient symbolic algorithms are used to traverse the model defined by the system and check whether the specification holds.
Program Analysis: Offers static compile-time techniques for predicting safe and computable approximations to the set of values or behaviors arising dynamically at runtime when executing a program on a computer.
PTUIE: Posttransaction user-input error. An error made by users in an input to a transaction execution and discovered only after completion of the execution.
Transaction: An atomic set of processing steps in a database application such that all the steps are performed either fully or not at all.
Business Processes David Sundaram The University of Auckland, New Zealand Victor Portougal The University of Auckland, New Zealand
INTRODUCTION The concept of a business process is central to many areas of business systems, specifically to business systems based on modern information technology. In the new era of computer-based business management the design of business process has eclipsed the previous functional design. Lindsay, Downs, and Lunn (2003) suggest that business process may be divided into material, information and business parts. Further search for efficiency and cost reduction will be predominantly through office automation. The information part, including data warehousing, data mining and increasing informatisation of the office processes will play a key role. Data warehousing and data mining, in and of themselves, are business processes that are aimed at increasing the intelligent density (Dhar & Stein, 1997) of the data. But more important is the significant roles they play within the context of larger business and decision processes. Apart from these the creation and maintenance of a data warehouse itself comprises a set of business processes (see Scheer, 2000). Hence a proper understanding of business processes is essential to better understand data warehousing as well as data mining.
BACKGROUND Companies that make products or provide services have several functional areas of operations. Each functional area comprises a variety of business functions or business activities. For example, functional area of financial accounting includes the business functions of financials, controlling, asset management, and so on. Human resources functional area includes the business functions of payroll, benefit administration, workforce planning, and application data administration. Historically, organizational structures have separated functional areas, and business education has been similarly organized, so each functional area was taught as a separate course. In materials requirement planning (MRP) systems, predecessors of enterprise resource planning (ERP) systems,
all functional areas were presented as subsystems supported by a separate functional area’s information system. However, in a real business environment functional areas are interdependent, each requiring data from the others. This fostered the development of the concept of a business process as a multi-functional set of activities designed to produce a specified output. This concept has a customer focus. Suppose, a defective product was delivered to a customer, it is the business function of customer service to accept the defective item. The actual repair and redelivery of the item, however, is a business process that involves several functional areas and functions within those areas. The customer is not concerned about how the product was made, or how its components were purchased, or how it was repaired, or the route the delivery truck took to get to her house. The customer wants the satisfaction of having a working product at a reasonable price. Thus, the customer is looking across the company’s functional areas in her process. Business managers are now trying to view their business operations from the perspective of a satisfied customer. Thinking in terms of business processes helps managers to look at their organization from the customer’s perspective. ERP programs help to manage company wide business processes, using a common database and shared management reporting tools. ERP software supports the efficient operation of business processes by integrating business activities, including sales, marketing, manufacturing, accounting, and staffing.
MAIN THRUST In the following sections we first look at some definitions of business processes and follow this by some representative classifications of business process. After laying this foundation we look at business processes that are specific to data warehousing and data mining. The section culminates by looking at a business process modelling methodology (namely ARIS) and briefly discussing the dark side of business process (namely malprocesses).
Definitions of Business Processes
Many definitions have been put forward to define business processes: some broad and some narrow. The broad ones help us understand the range and scope of business processes, but the narrow ones are also valuable in that they are actionable/pragmatic definitions that help us in defining, modelling, and reengineering business processes. Ould (1995) lists a few key features of business processes: a process contains purposeful activity, it is carried out collaboratively by a group, it often crosses functional boundaries, and it is invariably driven by outside agents or customers. Jacobson (1995), on the other hand, succinctly describes a business process as “the set of internal activities performed to serve a customer”. Bider (2000) suggests that the business process re-engineering (BPR) community feels there is no great mystery about what a process is; they follow the most general definition of business processes proposed by Hammer and Champy (1993), that a process is a “set of partially ordered activities intended to reach a goal”. Davenport (1993) defines process broadly as “a structured, measured set of activities designed to produce a specified output for a particular customer or market” and more specifically as “a specific order of work activities across time and place, with a beginning, an end, and clearly identified inputs and outputs: a structure for action.” While these definitions are useful, they are not adequate. Sharp and McDermott (2001) provide an excellent working definition of a business process:

A business process is a collection of interrelated work tasks, initiated in response to an event that achieves a specific result for the customer of the process.

It is worth exploring each of the phrases within this definition.

“achieves a particular result”
• The result might be Goods and/or Services.
• It should be possible to identify and count the result, e.g., Fulfilment of Orders, Resolution of Complaints, Raising of Purchase Orders, etc.

“for the customer of the process”
• Every process has a customer. The customer may be internal (an employee) or external (an organisation).
• A key requirement is that the customer should be able to give feedback on the process.

“initiated in response to a specific event”
• Every process is initiated by an event. The event is a request for the result produced by the process.

“work tasks”
• The business process is a collection of clearly identifiable tasks executed by one or more actors (a person, organisation, machine, or department).
• A task could potentially be divided up into more and finer steps.

“a collection of interrelated”
• Such steps and tasks are not necessarily sequential but could have parallel flows connected with complex logic.
• The steps are interconnected through their dealing with or processing one (or more) common work item(s) or business object(s).
Due to the importance of the “business process” concept to the development of computerized enterprise management, work on refining the definition is ongoing. Lindsay, Downs, and Lunn (2003) argue that the definitions of business process given in much of the literature on Business Process Management (BPM) are limited in depth and their related models of business processes are too constrained. Because they are too limited to express the true nature of business processes, they need to be further developed and adapted to today's challenging environment.
Business Process Classification
Over the years many classifications of processes have been suggested. The American Productivity & Quality Center (Process Classification Framework, 1996) distinguishes two types of processes: (1) operating processes and (2) management and support processes. Operating processes include processes such as:
• Understanding Markets and Customers
• Development of Vision and Strategy
• Design of Product and Services
• Marketing and Selling of Products and Services
• Production and Delivery of Products and Services
• Invoicing and Servicing of Customers
In contrast, management and support processes include processes such as:
• Development and Management of Human Resources
• Management of Information
• Management of Financial and Physical Resources
• Management of External Relationships
• Management of Improvement and Change
• Execution of Environmental Management Program
Most organisations would be able to fit their processes within this broad classification. However, the Gartner Group (Genovese, Bond, Zrimsek, & Frey, 2001) has come up with a slightly different set of processes that has a lifecycle orientation:
• Prospect to Cash and Care
• Requisition to Payment
• Planning and Execution
• Plan to Performance
• Design to Retirement
• Human Capital Management
Each and every one of these processes could be potentially supported by data warehousing and data mining processes and technologies.
Data Warehousing Business Processes
The movement and use of information come under the category of a business process when they involve human-computer interaction. Data warehousing includes a set of business processes, classified above as management of information. The set includes:
• Data input
• Data export
• Special query and reporting (regular reporting is a fully computer process)
• Security and safety procedures
The bulk of the processes are, of course, in the data input category. It is here that many errors originate that can disrupt the smooth operations of the management system. Strict formalisation of the activities and their sequence brings the major benefits. However, the other categories are important as well, and their formal description and enforcement help to avoid malfunctioning of the system.
Data Mining Business Processes
Data mining is a key business process that enables organisations to identify valid, novel, useful, and understandable patterns in data. Some of the key steps of the data mining business process involve:
• Understanding the business
• Preparation of the data – this would usually involve data selection, pre-processing, and potentially transformation of the data into a form that is more amenable to modelling and understanding
• Modelling using data mining techniques
• Interpretation and evaluation of the model and results, which will hopefully help us to better understand the business
• Deploying the solution
Obviously the above steps are iterative at all stages of the process and could feed backward. Tools such as Clementine from SPSS support all the above steps.
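As an illustration only (this sketch is ours, not from the text, and uses the open-source scikit-learn library with a built-in toy data set rather than Clementine), the preparation, modelling, evaluation and deployment steps might look as follows for a small classification task:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Preparation of the data: select the raw data and hold back a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Modelling using a data mining technique (here, a decision tree).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Interpretation and evaluation of the model and results.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Deploying the solution: apply the model to a new, unseen record.
print("prediction for a new record:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```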
Business Processes Modelling
The business processes frequently have to be recorded. According to Scheer (2000), the components of a process description are events, statuses, users, organizational units, and information technology resources. This multitude of elements would severely complicate the business process model. In order to reduce this complexity, many researchers suggest that the model is divided into individual views that can be handled (largely) independently from one another. This is done, for example, in ARIS (see Scheer, 2000). The relationships between the components within a view are tightly coupled, while the individual views are rather loosely linked. ARIS includes four basic views, represented by:
• Data view: statuses and events represented as data, and the other data structures of the process
• Function view: the description of functions used in business processes
• Organization view: the users and the organizational units as well as their relationships and the relevant structures
• Control view: presenting the relationships between the other three views
The integration of these three views through a separate view is the main difference between ARIS and other descriptive methodologies, like Object Oriented Design (OOD) (Jacobson, 1995). Breaking down the initial business process into individual views reduces the complexity of its description. Figure 1 illustrates these four views.
Figure 1. ARIS views/house of business engineering (Scheer, 2000). [Views shown: Organisation View, Data View, Control View, Function View]

The control view can be depicted using event-driven process chains (EPC). The four key components of the EPC are events (statuses), functions (transformations), organization, and information (data). Events (initial) trigger functions that then result in events (final). The functions are carried out by one or more organizational units using appropriate data. A simplified depiction of the interconnection between the four basic components of the EPC is illustrated in Figure 2. This was modeled using the ARIS software (IDS Scheer, 2000). One or more events can be connected to two or more functions (and vice versa) using the three logical operators OR, AND, and XOR (exclusive OR). These basic elements and operators enable us to model even the most complex process. The SAP Reference Model (Curran, Keller, & Ladd, 1998) is also based on the four ARIS views; it employs several other views, which are not so essential, but nevertheless helpful in depicting, storing and retrieving business processes. For example, information technology resources can constitute a separate descriptive object, the resource view. This view, however, is significant for the subject-related view of business processes only when it gives an opportunity for describing in full the other components that are more directly linked toward the business.

Figure 2. Basic components of the EPC. [Components shown: Initial Event, Data, Transforming Function, Organizational Unit, Final Event]
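To make the EPC building blocks more tangible, here is a minimal sketch (our own assumption, not ARIS or SAP notation) of how a tiny EPC fragment could be represented as data, with an initial event triggering a function that is carried out by an organizational unit and produces final events:

```python
from dataclasses import dataclass, field

@dataclass
class Function:
    name: str
    org_unit: str                                      # organizational unit carrying out the function
    data: list = field(default_factory=list)           # information objects used
    results_in: list = field(default_factory=list)     # final events produced

@dataclass
class Event:
    name: str
    triggers: list = field(default_factory=list)       # functions triggered (could be joined by AND/OR/XOR)

# Hypothetical fragment: "order received" triggers "check order", carried out by Sales.
check_order = Function("check order", org_unit="Sales",
                       data=["customer order"],
                       results_in=["order accepted", "order rejected"])
order_received = Event("order received", triggers=[check_order])

for fn in order_received.triggers:
    print(f"{order_received.name} -> {fn.name} ({fn.org_unit}) -> {', '.join(fn.results_in)}")
```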
Mal-Processes
Most definitions of business processes suggest that they have a goal of accomplishing a given task for the customer of the process. This positive orientation has been reflected in business process modelling methodologies, tools, techniques and even in the reference models and implementation approaches. They do not explicitly consider the negative business process scenarios that could result in the accomplishment of undesirable outcomes. Consider a production or service organisation that implements an ERP system. Let us assume the design stage is typical and business process oriented. We use requirements elicitation techniques to define the “as-is” business processes and engineer “to-be” processes (reference) using, as much as possible, fragments of the reference model of a corresponding ERP system. As a result of the design, a set of business processes will be created, defining a certain workflow for managers (employees) of the company. These processes will consist of functions that should be executed in a certain order defined by events. The employee responsible for execution of a business process or a part of it will be called its user. Every business process starts from a pre-defined set of events, and performs a pre-defined set of functions which involve either data input or data modification (when a definite data structure is taken from the database, modified and put back, or transferred to another destination). When executing a process, a user may intentionally put wrong data into the database, or modify the data in the wrong way: e.g., execute an incomplete business process, execute a wrong branch of the business process, or even create an undesirable branch of a business process. Such processes are called mal-processes. Hence mal-processes are behaviour to be avoided by the system. A mal-process is a sequence of actions that a system can perform, interacting with a legal user of the system, resulting in harm for the organization or stakeholder if the sequence is allowed to complete. The focus is on the actions, which may be done intentionally or through neglect. Similar effects produced accidentally are a subject of interest for data safety. Just as we have best business practices, mal-processes illustrate the wrong business practices. So, it would be natural, by analogy with the reference model (that represents a repository of business models reflecting best business practices), to create a repository of business models reflecting
wrong business practices. Thus a designer could avoid wrong design solutions. A repository of such mal-processes would enable organisations to avoid typical mistakes in enterprise system design. Such a repository can be a valuable asset in education of ERP designers and it can be useful in general management education as well. Another application of this repository might be troubleshooting. For example, sources of some nasty errors in the sales and distribution system might be found in the mal-processes of data entry in finished goods handling.
FUTURE TRENDS There are three key trends that characterise business processes: digitisation (automation), integration (intra and inter organisational), and lifecycle management (Kalakota & Robinson, 2003). Digitisation involves the attempts by many organisations to completely automate as many of their processes as possible. Another equally important initiative is the seamless integration and coordination of processes within and without the organisation: backward to the supplier, forward to the customers, and vertically of operational, tactical, and strategic business processes. The management of both these initiatives/trends depends to a large extent on the proper management of processes throughout their lifecycle: from process identification, process modelling, process analysis, process improvement, process implementation, process execution, to process monitoring/controlling (Rosemann, 2001). Implementing such a lifecycle orientation enables organizations to move in benign cycles of improvement and sense, respond, and adapt to the changing environment (internal and external). All these trends require not only the use of Enterprise Systems as a foundation but also data warehousing and data mining solutions. These trends will continue to be major drivers of the enterprise of the future.
CONCLUSION
In this modern landscape, business processes and the techniques and tools for data warehousing and data mining are intricately linked together. The impacts are not just one way. Concepts from business processes could be and are used to make the data mining and data warehousing processes an integral part of organizational processes. Data warehousing and data mining processes are a regular part of organizational business processes, enabling the conversion of operational information into tactical and strategic level information. An example of this is the Cross Industry Standard Process (CRISP) for
Data Mining (CRISP-DM, 2004) that provides a lifecycle process oriented approach to the mining of data in organizations. Apart from this data warehousing and data mining techniques are a key element in various steps of the process lifecycle alluded to in the trends. Data warehousing and data mining techniques help us not only in process identification and analysis (identifying and analyzing candidates for improvement) but also in execution and monitoring and control of processes. Data warehousing technologies enable us in collecting, aggregating, slicing and dicing of process information while data mining technologies enable us in looking for patterns in process information enabling us to monitor and control the organizational process more efficiently. Thus there is a symbiotic mutually enhancing relationship both at the conceptual level as well as at the technology level between business processes and data warehousing and data mining.
REFERENCES
APQC (1996). Process classification framework (pp. 1-6). APQC's International Benchmark Clearinghouse & Arthur Andersen & Co.
Bider, I., Johannesson, P., & Perjons, E. (2002). Goal-oriented patterns for business processes. Workshop on Goal-Oriented Business Process Modelling, London.
CRISP-DM (2004). Retrieved from http://www.crisp-dm.org
Curran, T., Keller, G., & Ladd, A. (1998). SAP R/3 business blueprint: Understanding business process reference model. Upper Saddle River, NJ: Prentice Hall.
Davenport, T. H. (1993). Process innovation. Boston, MA: Harvard Business School Press.
Davis, R. (2001). Business process modelling with ARIS: A practical guide. UK: Springer-Verlag.
Genovese, Y., Bond, B., Zrimsek, B., & Frey, N. (2001). The transition to ERP II: Meeting the challenges. Gartner Group.
Hammer, M., & Champy, J. (1993). Re-engineering the corporation: A manifesto for business revolution. New York: Harper Business.
Jacobson, I. (1995). The object advantage. Addison-Wesley.
Kalakota, R., & Robinson, M. (2003). Services blueprint: Roadmap for execution. Boston: Addison-Wesley.
Lindsay, D., Downs, K., & Lunn. (2003). Business processes—attempts to find a definition. Information and Software Technology, 45, 1015-1019.
Ould, A. M. (1995). Business processes: Modelling and analysis for reengineering. Wiley.
Rosemann, M. (2001, March). Business process lifecycle management. Queensland University of Technology.
Scheer, A.-W. (2000). ARIS methods. IDS.
Scheer, A.-W., & Habermann, F. (2000). Making ERP a success. Communications of the Association of Computing Machinery, 43(4), 57-61.
Sharp, A., & McDermott, P. (2001). Just what are processes anyway? Workflow modeling: Tools for process improvement and application development (pp. 53-69).

KEY TERMS
ARIS: Architecture of Integrated Information Systems, a modeling and design tool for business processes.
“as-is” Business Process: Current business process.
Business Process: A business process is a collection of interrelated work tasks, initiated in response to an event that achieves a specific result for the customer of the process.
Digitisation: Measures that automate processes.
ERP: Enterprise resource planning system, a software system for enterprise management. It is also referred to as Enterprise Systems (ES).
Functional Areas: Companies that make products to sell have several functional areas of operations. Each functional area comprises a variety of business functions or business activities.
Integration of Processes: The coordination and integration of processes seamlessly within and without the organization.
Mal-Processes: A sequence of actions that a system can perform, interacting with a legal user of the system, resulting in harm for the organization or stakeholder.
Process Lifecycle Management: Activities undertaken for the proper management of processes such as identification, analysis, improvement, implementation, execution, and monitoring.
“to-be” Business Process: Re-engineered business process.
Case-Based Recommender Systems Fabiana Lorenzi Universidade Luterana do Brasil, Brazil Francesco Ricci eCommerce and Tourism Research Laboratory, ITC-irst, Italy
INTRODUCTION Recommender systems are being used in e-commerce web sites to help the customers in selecting products more suitable to their needs. The growth of Internet and the business to consumer e-Commerce has brought the need for such a new technology (Schafer, Konstan, & Riedl., 2001).
BACKGROUND In the past years, a number of research projects have focused on recommender systems. These systems implement various learning strategies to collect and induce user preferences over time and automatically suggest products that fit the learned user model. The most popular recommendation methodology is collaborative filtering (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994) that aggregates data about customer’s preferences (ratings) to recommend new products to the customers. Content-based filtering (Burke, 2000) is another approach that builds a model of user interests, one for each user, by analyzing the specific customer behavior. In collaborative filtering the recommendation depends on the previous customers’ information, and a large number of previous user/system interactions are required to build reliable recommendations. In content-based systems only the data of the current user are exploited and it requires either explicit information about user interest, or a record of implicit feedback to build a model of user interests. Content-based systems are usually implemented as classifier systems based on machine learning research (Witten & Frank, 2000). In general, both approaches do not exploit specific knowledge of the domain. For instance, if the domain is computer recommendation, the two above approaches, in building the recommendation for a specific customer, will not exploit knowledge about how a computer works and what is the function of a computer component. Conversely, in a third approach called knowledgebased, specific domain knowledge is used to reason
about what products fit the customer’s preferences (Burke, 2000). The most important advantage is that knowledge can be expressed as a detailed user model, a model of the selection process, or a description of the items that will be suggested. Knowledge-based recommenders can exploit the knowledge contained in cases or encoded in a similarity metric. Case-Based Reasoning (CBR) is one of the methodologies used in the knowledge-based approach. CBR is a problem-solving methodology that faces a new problem by first retrieving a past, already solved similar case, and then reusing that case for solving the current problem (Aamodt & Plaza, 1994). In a CBR recommender system (CBR-RS), a set of suggested products is retrieved from the case base by searching for cases similar to a case described by the user (Burke, 2000). In the simplest application of CBR to recommendation problem solving, the user is supposed to look for some product to purchase. He/she inputs some requirements about the product, and the system searches in the case base for similar products (by means of a similarity metric) that match the user requirements. A set of cases is retrieved from the case base, and these cases can be recommended to the user. If the user is not satisfied with the recommendation, he/she can modify the requirements, i.e., build another query, and a new cycle of the recommendation process is started. In a CBR-RS the effectiveness of the recommendation is based on: the ability to match user preferences with product descriptions; the tools used to explain the match and to enforce the validity of the suggestion; and the function provided for navigating the information space. CBR can support the recommendation process in a number of ways. In the simplest approach, the CBR retrieval is invoked with, as input, a partial case defined by a set of user preferences (attribute-value pairs), and a set of products matching these preferences is returned to the user.
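For illustration only (a minimal sketch under our own assumptions, not the implementation of any of the systems discussed below; the catalogue and attribute names are hypothetical), this simplest retrieval step can be seen as a nearest-neighbour search over attribute-value product descriptions:

```python
def similarity(query, product):
    """Fraction of the query's attribute-value preferences matched by the product."""
    matches = sum(1 for attr, value in query.items() if product.get(attr) == value)
    return matches / len(query) if query else 0.0

def retrieve(query, case_base, k=3):
    """Return the k products most similar to the partial case given by the user."""
    ranked = sorted(case_base, key=lambda p: similarity(query, p), reverse=True)
    return ranked[:k]

# Hypothetical catalogue of personal computers.
case_base = [
    {"type": "laptop", "price": "low", "processor": "celeron"},
    {"type": "laptop", "price": "medium", "processor": "pentium"},
    {"type": "desktop", "price": "low", "processor": "pentium"},
]
print(retrieve({"type": "laptop", "price": "low"}, case_base, k=2))
```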
MAIN THRUST CBR systems implement a problem solving cycle very similar to the recommendation process. It starts with a new problem, retrieves similar cases from the case base and shows to the user an old solution or adapts it to better
solve the new problem, and finishes by retaining the new case in the case base. Considering the classic CBR cycle (see Aamodt & Plaza, 1994), we specialized this general framework to the specific tasks of product recommendation. In Figure 1 the boxes, corresponding to the classical CBR steps (retrieve, reuse, revise, review, and retain), contain references to systems or functionalities (acronyms) that will be described in the next sections. We now provide a general description of the framework by making some references to systems that will be better described in the rest of the paper.

Figure 1. CBR-RS’ framework (Lorenzi & Ricci, 2004)

The first user-system interaction in the recommendation cycle occurs in the input stage. According to Bergmann, Richter, Schmitt, Stahl, and Vollrath (2001), there are different strategies to interact with the user, depending on the level of customer assistance offered during the input. The most popular strategy is the dialog-based one, where the system offers guidance to the user by asking questions and presenting product alternatives, to help the user to decide. Several CBR recommender systems ask the user for input requirements to have an idea of what the user is looking for. In the First Case system (McSherry, 2003a), for instance, the user provides the features of a personal computer that he/she is looking for, such as type, price, processor or speed. ExpertClerk (Shimazu, 2002) asks the user to answer some questions instead of providing requirements, and with the set of answered questions the system creates the query.
In CBR-RSs, the knowledge is stored in the case base. A case is a piece of knowledge related to a particular context and representing an experience that teaches an essential lesson for reaching the goal of the problem-solving activity. Case modeling deals with the problem of determining which information should be represented and which formalism of representation would be suitable. In CBR-RSs a case should represent a real experience of solving a user recommendation problem. In a CBR-RS, our analysis has identified a general internal structure of the case base: CB = [X × U × S × E]. This means that a case c = (x, u, s, e) ∈ CB generally consists of four (optional) sub-elements x, u, s, e, which are elements of the spaces X, U, S, E, respectively. Each CBR-RS adopts a particular model for the spaces X, U, S, E. These spaces could be empty, vectors, sets of documents (textual), labeled graphs, etc.
• Content model (X): the content model describes the attributes of the product.
• User profile (U): the user profile models personal user information, such as name, address, and age, or also past information about the user, such as her preferred products.
• Session model (S): the session model is introduced to collect information about the recommendation session (problem-solving loop). In DieToRecs, for instance, a case includes a tree-based model of the user interaction with the system, and it is built incrementally during the recommendation session.
• Evaluation model (E): the evaluation model describes the outcome of the recommendation, i.e., whether the suggestion was appropriate or not. This could be a user a-posteriori evaluation, or, as in (Montaner, Lopez, & la Rosa, 2002), the outcome of an evaluation algorithm that guesses the goodness of the recommendation (exploiting the case base of previous recommendations).
Actually, in CBR-RSs there is a large variability in what a case really models and therefore what components are really implemented. There are systems that use only the content model, i.e., they consider a case as a product, and other systems that focus on the perspective of cases as recommendation sessions.
The first step of the recommendation cycle is the retrieval phase. This is typically the main phase of the CBR cycle, and the majority of CBR-RSs can be described as sophisticated retrieval engines. For example, in Compromise-Driven Retrieval (McSherry, 2003b) the system retrieves similar cases from the case base but also groups the cases, putting together those offering to the user the same compromise, and presents to the user just a representative case for each group.
After the retrieval, the reuse stage decides if the case solution can be reused in the current problem. In the simplest CBR-RSs, the system reuses the retrieved cases by showing them to the user. In more advanced solutions, such as (Montaner, Lopez, & la Rosa, 2002) or (Ricci et al., 2003), the retrieved cases are not recommended but used to rank candidate products identified with other approaches (e.g., Ricci et al., 2003) with an interactive query management component.
In the next stage the reused case component is adapted to better fit in the new problem. Mostly, the adaptation in CBR-RSs is implemented by allowing the user to customize the retrieved set of products. This can also be implemented as a query refinement task. For example, in Comparisonbased Retrieval (McGinty & Smyth, 2003) the system asks user feedbacks (positive or negative) about the retrieved product and with this information it updates the user query. The last step of the CBR recommendation cycle is the retain phase (or learning), where the new case is retained in the case base. In DieToRecs (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella, 2003), for instance, all the user/system recommendation sessions are stored as cases in the case base. The next subsections describe very shortly some representative CBR-RSs, focusing on their peculiar characteristic (see Lorenzi & Ricci, 2004) for the complete report.
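Before turning to the individual systems, the following is a minimal sketch (our own illustration, not the data model of any of the systems below; the field values are hypothetical) of the four-component case structure CB = [X × U × S × E] introduced above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Case:
    content: dict = field(default_factory=dict)        # X: attributes of the product(s)
    user_profile: dict = field(default_factory=dict)   # U: personal data and past preferences
    session: list = field(default_factory=list)        # S: queries/actions of the recommendation session
    evaluation: Optional[bool] = None                   # E: was the recommendation appropriate?

# A hypothetical travel-recommendation case built during one session.
case = Case(
    content={"destination": "Trentino", "accommodation": "hotel"},
    user_profile={"user_id": "anonymous-42", "prefers": "mountains"},
    session=["query: budget < 800", "refined: add spa"],
    evaluation=True,
)
print(case.evaluation, case.content["destination"])
```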
Interest Confidence Value – ICV Montaner, Lopez, and la Rosa (2002) assume that the user’s interest in a new product is similar to the user’s interest in similar past products. This means that when a new product comes up, the recommender system predicts the user’s interest in it based on interest attributes of similar experiences. A case is modeled by objective attributes describing the product (content model) and subjective attributes describing implicit or explicit interests of the user in this product (evaluation model), i.e., c∈ X × E. In the evaluation model, the authors introduced the drift attribute, which models a decaying importance of the case as time goes and the case is not used. The system can recommend in two different ways: prompted or proactive. In prompted mode, the user provides some preferences (weights in the similarity metric) and the system retrieves similar cases. In the proactive recommendation, the system does not have the user preferences, so it estimates the weights using past interactions. In the reuse phase the system extracts the interest values of retrieved cases and in the revise phase it calculates the interest confidence value of a restaurant to decide if this should be recommender to the user or not. The adaptation is done asking to the user the correct evaluation of the product and after that a new case (the product and the evaluation) is retained in the case base. It is worth noting that in this approach the recommended product is not retrieved from the case base, but the retrieved cases are used to estimate the user interest in this new product. This approach is similar to that used in DieToRecs in the single item recommendation function.
DieToRecs – DTR DieToRecs helps the user to plan a leisure travel (Fesenmaier, Ricci, Schaumlechner, Wober, & Zanella 2003). We present here two different approaches (decision styles) implemented in DieToRecs: the single item recommendation (SIR), and the travel completion (TC). A case represents a user interaction with the system, and it is built incrementally during the recommendation session. A case comprises all the quoted models: content, user profile, session and evaluation model. SIR starts with the user providing some preferences. The system searches in the catalog for products that (logically) match with these preferences and returns a result set. This is not to be confused with the retrieval set that contains a set of similar past recommendation session. The products in the result set are then ranked with a double similarity function (Ricci et al., 2003) in the revise stage, after a set of relevant recommendation sessions are retrieved. In the TC function the cycle starts with the user’s preferences too but the system retrieves from the case base cases matching user’s preferences. Before recommending the retrieved cases to the user, the system in the revise stage updates, or replaces, the travel products contained in the case exploiting up-to-date information taken from the catalogues. In the review phase the system allows the user to reconfigure the recommended travel plan. The system allows the user to replace, add or remove items in the recommended travel. When the user accepts the outcome (the final version of the recommendation shown to the user), the system retains this new case in the case base.
Compromise-Driven Retrieval – CDR CDR models a case only by the content component (McSherry, 2003a). In CDR, if a given case c1 is more similar to the target query than another case c2, and differs from the target query in a subset of the attributes in which c2 differs from the target query, then c 1 is more acceptable than c2. In the CDR retrieval algorithm the system sorts all the cases in the case-base according to the similarity to a given query. In a second stage, it groups together the cases making the same compromise (do not match a user preferred attribute value) and builds a reference set with just one case for each compromise group. The cases in the reference set are recommended to the user. The user can also refine (review) the original query, accepting one compromise, and adding some preference on a different attribute (not that already specified). The system will further decompose the set of cases corre-
sponding to the selected compromise. The revise and retain phases do not appear in this approach.
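A rough sketch of the grouping idea (our own simplification, not McSherry's algorithm): retrieved cases are grouped by the set of query attributes on which they fail to match, and one representative per group forms the reference set shown to the user. The data below are hypothetical.

```python
def compromise_groups(query, cases):
    """Group retrieved cases by the set of query attributes they do not satisfy."""
    groups = {}
    for case in cases:
        compromise = frozenset(a for a, v in query.items() if case.get(a) != v)
        groups.setdefault(compromise, []).append(case)
    return groups

def reference_set(query, cases):
    """One representative case per compromise group (here simply the first of each group)."""
    return [group[0] for group in compromise_groups(query, cases).values()]

query = {"type": "laptop", "price": "low"}
cases = [
    {"type": "laptop", "price": "medium"},   # compromises on price
    {"type": "laptop", "price": "high"},     # also compromises on price -> same group
    {"type": "desktop", "price": "low"},     # compromises on type -> different group
]
print(reference_set(query, cases))
```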
ExpertClerk – EC Expertclerk is a tool for developing a virtual salesclerk system as a front-end of e-commerce websites (Shimazu, 2002). The system implements a question selection method (decision tree with information gain). Using navigationby-asking, the system starts the recommendation session asking questions to user. The questions are nodes in a decision tree. A question node subdivides the set of answer nodes and each one of these represents a different answer to the question posed by the question node. The system concatenates all the answer nodes chosen by the user and then constitutes the SQL retrieval condition expression. This query is applied to the case base to retrieve the set of cases that best match the user query. Then, the system shows three samples products to the user and explains their characteristics (positive and negative). In the review phase, the system switches to the navigation-by-proposing conversation mode and allows the user to refine the query. After refinement, the system applies the new query to the case base and retrieves new cases. These cases are ranked and shown to the user. The cycle continues until the user finds a preferred product. In this approach the revise and the retain phases are not implemented.
FUTURE TRENDS This paper presented a review of the literature on CBR recommender systems. We have found that it is often unclear how and why the proposed recommendation methodology can be defined as case-based and therefore we have introduced a general framework that can illustrate similarities and differences of various approaches. Moreover, we have found that the classical CBR problem-solving loop is implemented only partially and sometime is not clear whether a CBR stage (retrieve, reuse, revise, review, retain) is implemented or not. For this reason, the proposed unifying framework makes possible a coherent description of different CBR-RSs. In addition an extensive usage of this framework can help describing in which sense a recommender system exploits the classical CBR cycle, and can point out new interesting issues to be investigated in this area. For instance, the possible ways to adapt retrieved cases to improve the recommendation and how to learn these adapted cases. We believe, that with such a common view it will be easier to understand what the research projects in the
Table 1. Comparison of the CBR-RSs

Approach   Retrieval                Reuse       Revise           Review      Retain
ICV        Similarity               IC value    IC computation   Feedback    Default
SIR        Similarity               Selective   Rank             User edit   Default
TC         Similarity               Default     Logical query    User edit   Default
OBR        Similarity + Ordering    Default     None             Tweak       None
CDR        Similarity + Grouping    Default     None             Tweak       None
EC         Similarity               Default     None             Feedback    None
We believe that with such a common view it will be easier to understand what the research projects in the area have already delivered, how the existing CBR-RSs behave, and which are the topics that could be better exploited in future systems.
CONCLUSION

In the previous sections we have briefly analyzed eight different CBR recommendation approaches. Table 1 shows the main features of these approaches. Some observations are in order. The majority of the CBR-RSs stress the importance of the retrieval phase. Some systems perform retrieval in two steps: first, cases are retrieved by similarity, and then the cases are grouped or filtered. The use of pure similarity does not seem to be enough to retrieve a set of cases that satisfies the user. This seems to be true especially in those application domains that require a complex case structure (e.g., travel plans) and therefore require the development of hybrid solutions for case retrieval. The default reuse phase is used in the majority of the CBR-RSs, i.e., all the retrieved cases are recommended to the user. ICV and SIR have implemented the reuse phase in a different way. In SIR, for instance, the system can retrieve just part of the case. The same systems that implemented non-trivial reuse approaches have also implemented both the revise phase, where the cases are adapted, and the retain phase, where the new (adapted) case is stored. All the CBR-RSs analyzed implement the review phase, allowing the user to refine the query. Normally the system expects some feedback from the user (positive or negative), new requirements, or a product selection.
REFERENCES

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.

Bergmann, R., Richter, M., Schmitt, S., Stahl, A., & Vollrath, I. (2001). Utility-oriented matching: New research direction for case-based reasoning. 9th German Workshop on Case-Based Reasoning, GWCBR'01 (pp. 14-16), Baden-Baden, Germany.

Burke, R. (2000). Knowledge-based recommender systems. Encyclopedia of Library and Information Science, Vol. 69.

Fesenmaier, D., Ricci, F., Schaumlechner, E., Wober, K., & Zanella, C. (2003). DIETORECS: Travel advisory for multiple decision styles. Information and Communication Technologies in Tourism, 232-241.

Lorenzi, F., & Ricci, F. (2004). A unifying framework for case-base reasoning recommender systems. Technical Report, IRST.

McGinty, L., & Smyth, B. (2002). Comparison-based recommendation. Advances in Case-Based Reasoning, 6th European Conference on Case Based Reasoning, ECCBR 2002 (pp. 575-589), Aberdeen, Scotland.

McGinty, L., & Smyth, B. (2003). The power of suggestion. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 276-290), Acapulco, Mexico.

McSherry, D. (2003a). Increasing dialogue efficiency in case-based reasoning without loss of solution quality. 18th International Joint Conference on Artificial Intelligence, IJCAI-03 (pp. 121-126), Acapulco, Mexico.

McSherry, D. (2003b). Similarity and compromise. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 291-305), Trondheim, Norway.

Montaner, M., Lopez, B., & la Rosa, J.D. (2002). Improving case representation and case base maintenance in recommender systems. Advances in Case-Based Reasoning, 6th European Conference on Case Based Reasoning, ECCBR 2002 (pp. 234-248), Aberdeen, Scotland.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). Grouplens: An open architecture for collaborative filtering of Netnews. ACM Conference on Computer-Supported Cooperative Work (pp. 175-186).

Ricci, F., Venturini, A., Cavada, D., Mirzadeh, N., Blaas, D., & Nones, M. (2003). Product recommendation with interactive query management and twofold similarity. 5th International Conference on Case-Based Reasoning, ICCBR 2003 (pp. 479-493), Trondheim, Norway.

Schafer, J.B., Konstan, J.A., & Riedl, J. (2001). E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Shimazu, H. (2002). Expertclerk: A conversational case-based reasoning tool for developing salesclerk agents in e-commerce webshops. Artificial Intelligence Review, 18, 223-244.

Witten, I.H., & Frank, E. (2000). Data mining. Morgan Kaufmann Publisher.

KEY TERMS
Case-Based Reasoning: It is an Artificial Intelligence approach that solves new problems using the solutions of past cases.

Collaborative Filtering: Approach that collects user ratings on currently proposed products to infer the similarity between users.

Content-Based Filtering: Approach where the user expresses needs and preferences on a set of attributes and the system retrieves the items that match the description.

Conversational Systems: Systems that can communicate with users through a conversational paradigm.

Machine Learning: The study of computer algorithms that improve automatically through experience.

Recommender Systems: Systems that help the user to choose products, taking into account his/her preferences.

Web Site Personalization: Web sites that are personalized to each user, knowing the user interests and needs.
Categorization Process and Data Mining Maria Suzana Marc Amoretti Federal University of Rio Grande do Sul (UFRGS), Brazil
INTRODUCTION

For some time, the fields of computer science and cognition have diverged. Researchers in these two areas know ever less about each other's work, and their important discoveries have had diminishing influence on each other. In many universities, researchers in these two areas are in different laboratories and programs, and sometimes in different buildings. One might conclude from this lack of contact that computer science and semiotic functions, such as perception, language, memory, representation, and categorization, are independent systems. But over the last several decades, the divergence between cognition and computer science has tended to disappear. These areas need to be studied together, and the cognitive science approach can afford this interdisciplinary view. This article refers to the possibility of circulation between the self-organization of concepts and the relevance of each conceptual property in the data-mining process, and it especially discusses categorization in terms of prototypical theory, based on the notion of prototype and basic-level categories. Categorization is a basic means of organizing the world around us and offers a simple way to process the mass of stimuli that one perceives every day. The ability to categorize appears early in infancy and has an important role in the acquisition of concepts in a prototypical approach. Prototype structures have cognitive representations as representations of real-world categories. The senses of the English words cat or table are involved in a conceptual inclusion in which the extension of the superordinate (animal/furniture) concept includes the extension of the subordinate (Persian cat/dining room table) concept, while the intension of the more general concept is included in the intension of the more specific concept. This study is included in the categorization process. Categorization is a fundamental process of mental representation used daily by any person and by any science, and it is also a central problem in semiotics, linguistics, and data mining. Data mining also has been defined as a cognitive strategy for automatically searching for new information in large datasets or for selecting a document, which is possible with computer science and semiotics tools. Data mining is an analytic process to explore data in order to find interesting pattern motifs and/or variables in great quantities of data; it depends mostly on the categorization process. The computational techniques from statistics and pattern recognition are used to do this data-mining practice.
BACKGROUND Semiotics is a theory of the signification (representations, symbols, categories) and meaning extraction. It is a strongly multi-disciplinary field of study, and mathematical tools of semiotics include those used in pattern recognition. Semiotics is also an inclusive methodology that incorporates all aspects of dealing with symbolic systems of signs. Signification join all the concepts in an elementary structure of signification. This structure is a related net that allows the construction of a stock of formal definitions such as semantic category. Hjelmeslev considers the category as a paradigm, where elements can be introduced only in some positions. Text categorization process, also known as text classification or topic spotting, is the task of automatically classifying a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, and selective dissemination of information to information users. There are many new categorization methods to realize the categorization task, including, among others, (1) the language model based classification; the maximum entropy classification, which is a probability distribution estimation technique used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation (the theory underlying maximum entropy is that without external knowledge, one should prefer distributions that are uniform); (2) the Naïve Bayes classification; (3) the Nearest Neighbor (the approach clusters words into groups based on the distribution of class labels associated with each word); (4) distributional clustering of words to document classification; (5) the Latent Semantic Indexing (LSI), in which we are able to compress the feature space more aggressively, while still maintaining high document classification accuracy (this information retrieval method improves the user’s ability to find relevant information, the text categorization method based on a combination of distributional features with a Support
Vector Machine (SVM) classifier, and the feature selection approach uses distributional clustering of words via the recently introduced information bottleneck method, which generates a more efficient representation of the documents); (6) the taxonomy method, based on hierarchical text categorization that documents are assigned to leaf-level categories of a category tree (the taxonomy is a recently emerged subfield of the semantic networks and conceptual maps). After the previous work in hierarchical classification focused on documents category, the tree classify method internal categories with a top down level-based classification that can classify concepts in the document. The networks of semantic values thus created and stabilized constitute the cultural-metaphorical ‘worlds’ which are discursively real for the speakers of particular languages. The elements of these networks, though ultimately rooted in the physical-biological realm can and do operate independently of the latter, and form the stuff of our everyday discourses (Manjali, 1997, p. 1). The prototype has given way to a true revolution (the Roschian revolution) regarding classic lexical semantics. If we observe the conceptual map for chair, for instance, we will realize that the choice of most representative chair types; that is, our prototype of chair, supposes a double adequacy: referential because the sign (concept of chair) must integrate the features retained from the real or imaginary world, and structural, because the sign must be pertinent (ideological criterion) and distinctive concerning the other neighbor concepts of chair. When I say that this object is a chair, it is supposed that I have an idea of the chair sign, forming the use of a lexical or visual image competence coming from my referential experience, and that my prototypical concept of chair is more adequate than its neighbors bench or couch, because I perceive that there is a back part and there are no arms. Then, it is useless to try to explain the creation of a prototype inside a language, because it is formed from context interactions. The double origin of a prototype is bound, then, to shared knowledge relation between the subjects and their communities (Amoretti, 2003).
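As one concrete baseline from the family of text categorization methods listed above, the sketch below trains a Naïve Bayes categorizer over bag-of-words counts with scikit-learn; the toy documents and category labels are invented for illustration and do not come from any of the cited studies.

```python
# Minimal text categorization baseline: bag-of-words counts + multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["the cat sat on the mat",
              "cats are small domestic felines",
              "the dining room table is made of wood",
              "a chair and a table for the kitchen"]
train_labels = ["animal", "animal", "furniture", "furniture"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["a persian cat sleeping on a chair"]))   # predicted category for a new document
```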
MAIN THRUST

Hypertext poses new challenges for a data-mining process, especially for text categorization research, because metadata extracted from Web sites provide rich information for classifying hypertext documents, and it is a new kind of problem to solve, to know how to appropriately represent that information and automatically learn statistical patterns for hypertext categorization. The use of
technologies in the categorization process through the making of conceptual maps, especially the possibility of creating a collaborative map made by different users, points out the cultural aspects of the concept representation in terms of existing coincidences as to the choice of the prototypical element by the same cultural group. Thus, the technologies of information, focused on the study of individual maps, demand revisited discussions on the popular perceptions concerning concepts used daily (folk psychology). It aims to identify ideological similarity and cognitive deviation, both based on the prototypes and on the levels of categorization developed in the maps, with an emphasis on the cultural and semiotic aspects of the investigated groups. It attempted to show how the semiotic and linguistic analysis of the categorization process can help in the identification of the ideological similarity and cognitive deviations, favoring the involvement of subjects in the map production, exploring and valuing the relation between the categorization process and the cultural experience of the subject in the world, both parts of the cognitive process of conceptual map construction. The concept maps, or the semantic nets, are space graphic representations of the concepts and their relationships. The concept maps represent, simultaneously, the organization process of the knowledge, by the relationships (links) and the final product, through the concepts (nodes). This way, besides the relationship between linguistic and visual factors is the interaction among their objects and their codes (Amoretti, 2001, p. 49). The building of a map involves collaboration when the subjects/students/users share information, still without modifying the data, and involves cooperation, when users not only share their knowledge but also may interfere and modify the information received from the other users, acting in a asynchronized way to build a collective map. Both cooperation and collaboration attest the autonomy of the ongoing cognitive process, the direction given by the users themselves when trying to adequate their knowledge. When people do a conceptual map, they usually privilege the level where the prototype is. The basic concept map starts with a general concept at the top of the map and then works its way down through a hierarchical structure to more specific concepts. The empirical concept (Kant) of cat and chair has been studied by users with map software. They make an initial map at the beginning of the semester and another about the same subject at the end of the semester. I first discussed how cats and chairs appear, what could be called the structure of cat and chair appearance. Second, I discussed how cat and chair are perceived and which attributes make a cat a cat and a chair
a chair. Finally, I will consider cat and chair as an experiential category, so the point of departure is our experience in the world about cat and chair. The acquisition of the concept cat and chair is mediated by concrete experiences. Thus, the learner must possess relevant prior knowledge and a mental scheme to acquire a prototypical concept. The expertise changes the conceptual level organization competences. In the first maps, the novice chair map privileged the basic level, the most important exemplar of a class, the chair prototype. This level has a high coherence and distinctiveness. After thinking about this concept, students (now chair experts) repeated the experiment and carried out again the expert chair map with much more details in the superordinate level, showing eight different kinds of chairs: dining room chair, kitchen chair, garden chair, and so forth. This level has high coherence and low distinctiveness (Rosh, 2000). So, users learn by doing the categorization process. Language system arbitrarily cuts up the concepts into discrete categories (Hjelmeslev, 1968), and all categories have equal status. Human language is both natural and cultural. According to the prototype theory, the role played by non-linguistic factors like perception and environment is demonstrated throughout the concept as the prototype from the subjects of each community. A concept is a sort of scheme. An effective way of representing a concept is to retain only its most important properties. This group of most important properties of a concept is called prototype. The idea of prototype makes it possible for the subject to have a mental construction, identifying the typical features of several categories, and, when the subject finds a new object, he or she may compare it to the prototype in his or her memory. Thus, the prototype of chair, for instance, allows new objects to be identified and labeled as chairs. In individual conceptual maps creation, one may confirm the presence of variables for the same concept. The notion of prototype originated in the 1970s, greatly due to Eleanor Rosch’s (2000) psychological research on the organization of conceptual categories. Its revolutionary character marked a new era for the discussions on categorization and brought existing theories, such as the classical view of the prototype question. On the basis of Rosch’s results, it is argued that members of the so-called Aristotelian (or classical) categories share all the same properties and showed that categories are structured in an entirely different way; members that constitute them are assigned in terms of gradual participation, and the categorical attribution is made by human beings according to the more or less centrality/marginality of collocation within the categorical structure. Elements recognized as central members of the category represent the prototype. For instance, a chair is a very good example of the category furniture, while a television is a less typical example of the
same category. A chair is a more central member than a television, which, in turn, is a rather marginal member. Rosch (2000) claims that prototypes can only constrain, but do not determine, models of representations. The main thrust of my argument is that it is very important to data mining to know, besides the cognitive categorization process, what is the prototypical concept in a dataset. This basic level of concept organization reflects the social representation in a better way than the other levels (i.e., superordinate and subordinate levels). The prototype knowledge affords a variety of culture representations. The conceptual mapping system contains prototype data on the hierarchical way of concept levels. Using different software (three measuring levels—superordinate, basic, and subordinate), I suggest the construction of different maps for each concept to analyze the categorization cognitive process with maps and to show how the categorization performance of individual and collective or organizational team overtime is important in a data-mining work. Categorization is a part of Jakobson’s (2000) communication model (also appropriated from information theory) with cultural aspects (context). This principle allows to locate on a definite gradient objects and relations that are observed, based on similarity and contiguity associations (frame/script semantic) (Schank, 1999) and based on hierarchically relations (prototypic semantic) (Kleiber, 1990; Rosch, 2000), in terms of perceived family resemblance among category members.
FUTURE TRENDS Data-mining language technology systems typically have focused on the factual aspect of content analysis. However, there are other categorization aspects, including pragmatics, point of view, and style, which must receive more attention like types and models of subjective classification information and categorization characteristics such as centrality, polarity, intensity, and different levels of granularity (i.e., expression, clause, sentence, discourse segment, document, hypertext). It is also important to define properties heritage among different category levels, viewed throughout hierarchical relations as one that allowed to virtually add certain pairs of value (attributes from a unit to another). We should also think of concepts managing that, in a given category, are considered as an exception. It would be necessary to allow the heritage blockage of certain attributes. I will be opening new perspectives to the datamining research of categorization and prototype study, which shows the ideological similarity perception mediated by collaborative conceptual maps.
Much is still unknowable about the future of data mining in higher education and in the business intelligence process. The categorization process is a factor that will affect this future and can be identified with the crucial role played by the prototypes. Linguistics have not yet paid this principle due attention. However, some consequences should already necessarily follow from its prototype recognition. The extremely powerful explanation of prototype categorization constitutes the most salient feature in data mining. So, a very important application in the data-mining methodology is the results of the prototype categorization research like a form of retrieval of unexpected information.
CONCLUSION

Categorization explains aspects of people's cultural logic. In this article, statistical pattern recognition approaches are used to classify concepts that are present in a given dataset and expressed by conceptual maps. Prototypes rest on a representation of concepts with inheritance by default (héritage par défaut), which allows a great economy in the acquisition and management of information. The objective of this study was to investigate the potential of finding the categories in a text with conceptual maps and to use these maps as data-mining tools. Based on the user's cognitive characteristics of knowledge organization and on the prevalence of the basic level (the prototypical level), a case is made for the categorization cognitive process as a step of data mining.

REFERENCES

Amoretti, M.S.M. (2001). Protótipos e estereótipos: aprendizagem de conceitos: Mapas conceituais: Uma experiência em Educação à Distância. Proceedings of the Revista Informática na Educação. Teoria e Prática, Porto Alegre, Brazil.

Amoretti, M.S.M. (2003). Conceptual maps: A metacognitive strategy to learn concepts. Proceedings of the 25th Annual Meeting of the Cognitive Science Society, Boston, Massachusetts.

Amoretti, M.S.M. (2004a). Categorization process and conceptual maps. Proceedings of the First International Conference on Concept Mapping, Pamplona, Spain.

Amoretti, M.S.M. (2004b). Collaborative learning concepts in distance learning: Conceptual map: Analysis of prototypes and categorization levels. Proceedings of the CCM Digital Government Symposium, Tuscaloosa, Alabama.

Andler, D. (1987). Introduction aux sciences cognitives. Paris: Gallimard.

Cordier, F. (1989). Les notions de typicalité et niveau d'abstraction: Analyse des propriétés des représentations [thèse de doctorat d'état]. Paris: Sud University.

Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(2), 57-70.

Greimas, A.J. (1966). Sémantique structurale. Recherche de méthode. Paris: PUF.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Boston: MIT Press.

Hjelmeslev, L. (1968). Prolégomènes à une théorie du langage. Paris: Éditions Minuit.

Jakobson, R. (2000). Linguistics and poetics. London: Lodge and Wood.

Kleiber, G. (1990). La sémantique du prototype: Catégories et sens lexical. Paris: PUF.

Manjali, F.D. (1997). Dynamical models in semiotics/semantics. Retrieved from http://www.chass.utoronto.ca/epc/srb/cyber/manout.html

Minsky, M.L. (1977). A framework for representing knowledge. In P.H. Winston (Ed.), The psychology of computer vision (pp. 211-277). New York: McGraw-Hill.

Rosch, E. et al. (2000). The embodied mind. London: MIT.

Schank, R. (1999). Dynamic memory revisited. Cambridge, MA: Cambridge University Press.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR).

Yiming, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval.
KEY TERMS

Categorization: A cognitive process, based on the similarity of mental schemes and concepts, by which subjects establish conditions that are both necessary and sufficient (properties) to capture meaning and/or hierarchical inclusion (as part of a set) through family resemblances shared by category members. Every category has a prototypical internal structure, depending on the context.
Concept: A sort of scheme produced by repeated experiences. Concepts are essentially each little idea that we have in our heads about anything. This includes not only everything, but every attribute of everything.

Conceptual Maps: Semiotic representation (linguistic and visual) of the concepts (nodes) and their relationships (links); they represent the organization process of the knowledge. When people do a conceptual map, they usually privilege the level where the prototype is. They prefer to categorize at an intermediate level; this basic level is the first level learned, the most common level named, and the most general level where visual shape and attributes are maintained.
Prototype: An effective way of representing a concept is to retain only its most important properties or the most typical element of a category, which serves as a cognitive reference point with respect to a cultural community. This group of most important properties or most typical elements of a concept is called prototype. The idea of prototype makes possible that the subject has a mental construction, identifying the typical features of several categories. Prototype is defined as the object that is a category’s best model.
Center-Based Clustering and Regression Clustering Bin Zhang Hewlett-Packard Research Laboratories, USA
INTRODUCTION

Center-based clustering algorithms are generalized to more complex model-based, especially regression-model-based, clustering algorithms. This article briefly reviews three center-based clustering algorithms—K-Means, EM, and K-Harmonic Means—and their generalizations to regression clustering algorithms. More details can be found in the referenced publications.
BACKGROUND Center-based clustering is a family of techniques with applications in data mining, statistical data analysis (Kaufman et al., 1990), data compression (vector quantization) (Gersho & Gray, 1992), and many others. Kmeans (KM) (MacQueen, 1967; Selim & Ismail, 1984), and the Expectation Maximization (EM) (Dempster et al., 1977; McLachlan & Krishnan, 1997; Rendner & Walker, 1984) with linear mixing of Gaussian density functions are two of the most popular clustering algorithms. K-Means is the simplest among the three. It starts with initializing a set of centers M = {mk | k = 1,..., K } and iteratively refines the location of these centers to find the clusters in a dataset. Here are the steps:
K-Means Algorithm
• Step 1: Initialize all centers (randomly or based on any heuristic).
• Step 2: Associate each data point with the nearest center. This step partitions the data set into K disjoint subsets (Voronoi Partition).
• Step 3: Calculate the best center locations (i.e., the centroids of the partitions) to minimize the performance function (2), which is the total squared distance from each data point to the nearest center.
• Step 4: Repeat Steps 2 and 3 until there are no more changes on the membership of the data points (proven to converge).
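A minimal NumPy sketch of these four steps (standard Lloyd-style K-Means) follows; the random initialization and the toy two-cluster data are assumptions made for the example.

```python
# Minimal K-Means: initialize, assign to nearest center, recompute centroids, repeat.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=K, replace=False)].copy()   # Step 1: initialize centers
    labels = np.full(len(X), -1)
    for _ in range(iters):
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)                         # Step 2: nearest-center partition
        if np.array_equal(new_labels, labels):
            break                                              # Step 4: memberships stopped changing
        labels = new_labels
        for k in range(K):                                     # Step 3: centroid of each partition
            if np.any(labels == k):
                M[k] = X[labels == k].mean(axis=0)
    return M, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, K=2)
print(centers)
```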
With guarantee of convergence to only a local optimum, the quality of the converged results, measured by the performance function of the algorithm, could be far from its global optimum. Several researchers explored alternative initializations to achieve the convergence to a better local optimum (Bradley & Fayyad, 1998; Meila & Heckerman, 1998; Pena et al., 1999). K-Harmonic Means (KHM) (Zhang, 2001; Zhang et al., 2000) is a recent addition to the family of centerbased clustering algorithms. KHM takes a very different approach from improving the initializations. It tries to address directly the source of the problem—a single cluster is capable of trapping far more centers than its fair share. This is the main reason for the existence of a very large number of local optima under K-Means and EM when K>10. With the introduction of a dynamic weighting function of data, KHM is much less sensitive to initialization, demonstrated through a large number of experiments in Zhang (2003). The dynamic weighting function reduces the ability of a single data cluster, trapping many centers. Replacing the point-centers by more complex data model centers, especially regression models, in the second part of this article, a family of model-based clustering algorithms is created. Regression clustering has been studied under a number of different names: Clusterwise Linear Regression by Spath (1979, 1981, 1983, 1985), DeSarbo and Cron (1988), Hennig (1999, 2000) and others; Trajectory clustering using mixtures of regression models by Gaffney and Smith (1999); Fitting Regression Model to Finite Mixtures by Williams (2000); Clustering Using Regression by Gawrysiak, et. al. (2000); Clustered Partial Linear Regression by Torgo, et. al. (2000). Regression clustering is a better name for the family, because it is not limited to linear or piecewise regressions. Spath (1979, 1981, 1982) used linear regression and partition of the dataset, similar to K-means, in his algorithm that locally minimizes the total mean square error over all K-regressions. He also developed an incremental version of his algorithm. He visualized his piecewise linear regression concept in his book (Spath,
1985) exactly as he named his algorithm. DeSarbo (1988) used a maximum likelihood method for performing clusterwise linear regression. Hennig (1999) studied clustered linear regression, as he named it, using the same linear mixing of Gaussian density functions.
MAIN THRUST

For K-Means, EM, and K-Harmonic Means, both their performance functions and their iterative algorithms are treated uniformly in this section for comparison. This uniform treatment is carried over to the three regression clustering algorithms, RC-KM, RC-EM and RC-KHM, in the second part.

Performance Functions of the Center-Based Clustering

Among many clustering algorithms, center-based clustering algorithms stand out in two important aspects—a clearly defined objective function that the algorithm minimizes, compared with agglomerative clustering algorithms that do not have a predefined objective; and a low runtime cost, compared with many other types of clustering algorithms. The time complexity per iteration for all three algorithms is linear in the size of the dataset N, the number of clusters K, and the dimensionality of data D. The number of iterations it takes to converge is very insensitive to N. Let X = {x_i | i = 1, ..., N} be a dataset with K clusters, iid sampled from a hidden distribution, and M = {m_k | k = 1, ..., K} be a set of K centers. K-Means, EM, and K-Harmonic Means find the clusters—the (local) optimal locations of the centers—by minimizing a function of the following form over the K centers,

    Perf(X, M) = Σ_{x∈X} d(x, M),                                                  (1)

where d(x, M) measures the distance from a data point to the set of centers. Each algorithm uses a different distance function:

(a) K-Means: d(x, M) = MIN_{1≤k≤K} ||x − m_k||², which makes (1) the same as the more popular form

    Perf_KM(X, M) = Σ_{k=1}^{K} Σ_{x∈S_k} ||x − m_k||²,                            (2)

where S_k ⊂ X is the subset of x that are closer to m_k than to all other centers (the Voronoi partition).

(b) EM: d(x, M) = −log Σ_{k=1}^{K} p_k · (1/(√π)^D) · EXP(−||x − m_k||²), where {p_k}_{k=1}^{K} is a set of mixing probabilities. A linear mixture of K identical spherical (Gaussian density) functions, which is still a probability density function, is used here.

(c) K-Harmonic Means: d(x, M) = HA_{1≤k≤K}(||x − m_k||^p), the harmonic average of the K distances, which gives

    Perf_KHM(X, M) = Σ_{x∈X} K / Σ_{m∈M} (1/||x − m||^p), where p > 2.             (3)
K-Means and K-Harmonic Means performance functions also can be written similarly to the EM, except that only a positive function takes the place where this probability function is (Zhang, 2001).
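As a concrete reading of the three distance functions in (1)-(3), here is a small sketch; it assumes unit-variance spherical Gaussians with uniform mixing weights p_k = 1/K for the EM case, and p = 3 is an arbitrary choice of the KHM exponent.

```python
# The three point-to-center-set distances d(x, M) used by KM, EM, and KHM.
import numpy as np

def d_kmeans(x, M):
    return np.min(((M - x) ** 2).sum(axis=1))                  # MIN_k ||x - m_k||^2

def d_em(x, M, p=None):
    K, D = M.shape
    p = np.full(K, 1.0 / K) if p is None else p                # mixing probabilities
    dens = p * np.exp(-((M - x) ** 2).sum(axis=1)) / np.pi ** (D / 2.0)
    return -np.log(dens.sum())                                 # negative log of the mixture density

def d_khm(x, M, power=3):
    dist = np.sqrt(((M - x) ** 2).sum(axis=1))
    return len(M) / np.sum(1.0 / dist ** power)                # harmonic average of ||x - m_k||^p

M = np.array([[0.0, 0.0], [4.0, 4.0]])
x = np.array([1.0, 0.0])
print(d_kmeans(x, M), d_em(x, M), d_khm(x, M))
```

Summing any one of these distances over all data points, as in (1), gives the corresponding performance function to be minimized.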
Center-Based Clustering Algorithms

K-Means' algorithm is shown in the Introduction. We list EM and K-Harmonic Means' algorithms here to show their similarity.

EM (with Linear Mixing of Spherical Gaussian Densities) Algorithm
• Step 1: Initialize the centers and the mixing probabilities {p_k}_{k=1}^{K}.
• Step 2: Calculate the expected membership probabilities (see item B below).
• Step 3: Maximize the likelihood to the current membership by finding the best centers.
• Step 4: Repeat Steps 2 and 3 until a chosen convergence criterion is satisfied.

K-Harmonic Means Algorithm
• Step 1: Initialize the centers.
• Step 2: Calculate the membership probabilities and the dynamic weighting (see item C below).
• Step 3: Find the best centers to minimize its performance function.
• Step 4: Repeat Steps 2 and 3 until a chosen convergence criterion is satisfied.

All three iterative algorithms also can be written uniformly as (Zhang, 2001)

    m_k^(u) = Σ_{x∈X} p(m_k^(u−1) | x) a^(u−1)(x) x / Σ_{x∈X} p(m_k^(u−1) | x) a^(u−1)(x),
    a^(u−1)(x) > 0,  p(m_k^(u−1) | x) ≥ 0  and  Σ_{l=1}^{K} p(m_l^(u−1) | x) = 1.            (4)

(We dropped the iteration index u−1 on p() for shorter notations.) Function a^(u−1)(x) is a weight on the data point x in the current iteration. It is called a dynamic weighting function, because it changes in each iteration. Functions p(m_k^(u−1) | x) are soft-membership functions, or the probability of x being associated to the center m_k^(u−1). For each algorithm, the details on a() and p(,) are:

A. K-Means: a^(u−1)(x) = 1 for all x in all iterations, and p(m_k^(u−1) | x) = 1 if m_k^(u−1) is the closest center to x, otherwise p(m_k^(u−1) | x) = 0. Intuitively, each x is 100% associated with the closest center, and there is no weighting on the data points.

B. EM: a^(u−1)(x) = 1 for all x in all iterations, and

    p(m_k^(u−1) | x) = p(x | m_k^(u−1)) · p(m_k^(u−1)) / Σ_{l=1}^{K} p(x | m_l^(u−1)) · p(m_l^(u−1)),      (5)

    p(m_k^(u−1)) = (1/|X|) Σ_{x∈X} p(m_k^(u−2) | x),                                                       (6)

and p(x | m_k^(u−1)) is the spherical Gaussian density function centered at m_k^(u−1).

C. K-Harmonic Means: a(x) and p(m_k^(u−1) | x) are extracted from the KHM algorithm:

    a(x_i) = [Σ_{k=1}^{K} 1/d_{i,k}^{p+2}] / [Σ_{l=1}^{K} 1/d_{i,l}^{p}]²,  d_{i,k} = ||x_i − m_k^(u−1)||,   (7)

and

    p(m_k^(u−1) | x_i) = (1/d_{i,k}^{p+2}) / Σ_{l=1}^{K} (1/d_{i,l}^{p+2}),  i = 1, ..., N.                   (8)

The dynamic weighting function a^(u−1)(x) ≥ 0 approaches zero when x approaches one of the centers. Intuitively, the closer a data point is to a center, the smaller weight it gets in the next iteration. This weighting reduces the ability of a cluster to trap more than one center. The effect is clearly observed in the visualization of hundreds of experiments conducted (see next section). Compared to the KHM, both KM and EM have all data points fully participate in all iterations (weighting function is a constant 1). They do not have a dynamic weighting function as K-Harmonic Means does. EM and KHM both have soft-membership functions, but K-Means has a 0/1 membership function.

Empirical Comparisons of Center-Based Clustering Algorithms

Empirical comparisons of K-Means, EM, and K-Harmonic Means on 1,200 randomly generated data sets can be found in the paper by Zhang (2003). Each data set has 50 clusters ranging from well-separated to significantly overlapping. The dimensionality of data ranges from 2 to 8. All three algorithms are run on each dataset, starting from the same initialization of the centers, and the converged results are measured by a common quality measure—the K-Means—for comparison. Sensitivity to initialization is studied by a rerun of all the experiments on different types of initializations. Major conclusions from the empirical study are as follows:

1. For low dimensional datasets, the performance ranking of three algorithms is KHM > KM > EM (">" means better). For low dimensional datasets (up to 8), the difference is significant.
2. KHM's performance has the smallest variation under different datasets and different initializations. EM's performance has the biggest variation. Its results are most sensitive to initializations.
3. Reproducible results become even more important when we use these algorithms on different datasets that are sampled from the same hidden distribution. The results from KHM better represent the properties of the distribution and are less dependent on a particular sample set. EM's results are more dependent on the sample set.
The details on the setup of the experiments, quantitative comparisons of the results, and the Matlab source code of K-Harmonic Means can be found in the paper.
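Before moving to regression clustering, the sketch below ties equations (4), (7), and (8) together as one K-Harmonic Means center update. It is a sketch rather than the paper's reference implementation; the exponent p = 3.5, the small epsilon guard, and the synthetic data are arbitrary choices made here.

```python
# One KHM center update: memberships (8), dynamic weights (7), uniform form (4).
import numpy as np

def khm_update(X, M, p=3.5, eps=1e-8):
    d = np.sqrt(((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)) + eps  # d[i, k] = ||x_i - m_k||
    member = 1.0 / d ** (p + 2)
    member = member / member.sum(axis=1, keepdims=True)                    # p(m_k | x_i), eq. (8)
    weight = (1.0 / d ** (p + 2)).sum(axis=1) / (1.0 / d ** p).sum(axis=1) ** 2  # a(x_i), eq. (7)
    w = member * weight[:, None]                                           # combined weight in (4)
    return (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
M = rng.normal(0, 1, (3, 2))                 # deliberately poor initialization
for _ in range(30):
    M = khm_update(X, M)
print(M)
```

Running the loop shows the behavior described above: even when several centers start inside one cluster, the dynamic weight a(x_i) shrinks the influence of points already close to a center, so extra centers are pushed toward the other cluster.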
Generalization to Complex ModelBased Clustering—Regression Clustering Clustering applies to datasets without response information (unsupervised); regression applies to datasets with response variables chosen. Given a dataset with responses, Z = ( X , Y ) = {( xi , yi ) | i = 1,..., N } , a family of functions Φ = { f } (a function class making the optimization problem well defined, such as polynomials of up to a certain degree) and a loss function e() ≥ 0 , regression solves the following minimization problem (Montgomery et al., 2001):
    f_opt = arg min_{f∈Φ} Σ_{i=1}^{N} e(f(x_i), y_i).                                   (9)

Commonly, Φ = {Σ_{l=1}^{m} β_l h(x, a_l) | β_l ∈ R, a_l ∈ R^n}, a linear expansion of simple parametric functions, such as polynomials of degree up to m, Fourier series of bounded frequency, neural networks. Usually, e(f(x), y) = ||f(x) − y||^p, with p = 1, 2 most widely used (Friedman, 1999).

Regression in (9) is not effective when the dataset contains a mixture of very different response characteristics, as shown in Figure 1a; it is much better to find the partitions in the data and to learn a separate function on each partition, as shown in Figure 1b. This is the idea of Regression-Clustering (RC). Regression provides a model for the clusters; clustering partitions the data to best fit the models. The linkage between the two algorithms is a common objective function shared between the regressions and the clustering. RC algorithms can be viewed as replacing the K geometric-point centers in center-based clustering algorithms by a set of model-based centers, particularly a set of regression functions M = {f_1, ..., f_K} ⊂ Φ. With the same performance function as defined in (1), but the distance from a data point to the set of centers replaced by the following (with e(f(x), y) = ||f(x) − y||²):

(a) d((x, y), M) = MIN_{f∈M} e(f(x), y) for RC with K-Means (RC-KM);

(b) d((x, y), M) = −log Σ_{k=1}^{K} p_k · (1/(√π)^D) · EXP(−e(f_k(x), y)) for RC-EM; and

(c) d((x, y), M) = HA_{f∈M}(e(f(x), y)) for RC K-Harmonic Means (RC-KHM).
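As an illustration of RC-KM under definition (a), the sketch below alternates least-squares fits of K straight lines with reassignment of each point to the model with the smallest residue. The linear model class, the random initial partition, and the synthetic piecewise data are assumptions made for this example, not details taken from the cited papers.

```python
# RC-KM with K linear models as the "centers", fit by ordinary least squares.
import numpy as np

def rc_kmeans(x, y, K=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    A = np.column_stack([x, np.ones_like(x)])                  # design matrix for f(x) = a*x + b
    labels = rng.integers(0, K, size=len(x))                   # random initial partition
    coefs = np.zeros((K, 2))
    for _ in range(iters):
        for k in range(K):                                     # regression on each cluster
            idx = labels == k
            if idx.sum() >= 2:
                coefs[k], *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        residues = (A @ coefs.T - y[:, None]) ** 2             # e(f_k(x_i), y_i) for every k
        labels = residues.argmin(axis=1)                       # reassign to the best-fitting model
    return coefs, labels

x = np.linspace(0, 1, 200)
y = np.where(x < 0.5, 2 * x + 0.1, -3 * x + 3) + 0.05 * np.random.default_rng(2).normal(size=200)
print(rc_kmeans(x, y)[0])                                      # roughly recovers the two line segments
```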
The three iterative algorithms—RC-KM, RC-EM, and RC-KHM—minimizing their corresponding performance function, take the following common form (10). Regression with weighting takes the place of the weighted averaging in (4). The regression function centers in the uth iteration are the solution of the minimization
    f_k^(u) = arg min_{f∈Φ} Σ_{i=1}^{N} a(z_i) p(Z_k | z_i) ||f(x_i) − y_i||²,           (10)
where the weighting a ( zi ) and the probability p ( Z k | zi ) Figure 1. (a) Left: a single function is regressed on all training data, which is a mixture of three different distributions; (b) Right: three regression functions, each regressed on a subset found by RC. The residue errors are much smaller.
of data point zi in cluster Z k are both calculated from the (u-1)-iteration’s centers { f k( u −1) } as follows: (a)
For RC-K-Means, a(z_i) = 1, and p(Z_k | z_i) = 1 if e(f_k^(u−1)(x_i), y_i) < e(f_k'^(u−1)(x_i), y_i) ∀k' ≠ k; otherwise p(Z_k | z_i) = 0. Intuitively, RC-K-Means has the following steps:
• Step 1: Initialize the regression functions.
• Step 2: Associate each data point (x, y) with the regression function that provides the best approximation (arg min_k {e(f_k(x), y) | k = 1, ..., K}).
• Step 3: Recalculate the regression function on each partition to minimize the performance function.
• Step 4: Repeat Steps 2 and 3 until no more data points change their membership.

Comparing these steps with the steps of K-Means, the only differences are that point-centers are replaced by regression functions, and the distance from a point to a center is replaced by the residue error of a pair (x, y) approximated by a regression function. (b)
For RC-EM, a( zi ) = 1 and
    p^(u)(Z_k | z_i) = p_k^(u−1) EXP(−e(f_k^(u−1)(x_i), y_i)/2) / Σ_{k=1}^{K} p_k^(u−1) EXP(−e(f_k^(u−1)(x_i), y_i)/2),

and

    p_k^(u−1) = (1/N) Σ_{i=1}^{N} p(Z_k^(u−1) | z_i).
For
RC-K-Harmonic
Means,
with
e( f ( x), y ) =|| f ( xi ) − yi || , p'
K
a p ( zi ) = ∑ dip,l' + 2 l =1
K
∑d l =1
p' i ,l
p '+ 2 and p ( Z k | zi ) = di ,k
K
∑d l =1
p '+ 2 i ,l
where d i ,l =|| f (u −1) ( xi ) − yi || . ( p’ > 2 is used.) The same parallel structure can be observed between the center-based KHM clustering algorithm and the RCKHM algorithm. Sensitivity to initialization in center-based clustering carries over to regression clustering. In addition, a new form of local optimum is illustrated in Figure 2. It happens to all three RC algorithms, RC-KM, RCKHM, and RC-EM. Figure 2. A new kind of local optimum occurs in regression clustering.
138
Empirical Comparison of Center-Based Clustering Algorithms Comparison of the three RC-algorithms on randomly generated datasets can be found in the paper by Zhang (2003a). RC-KHM is shown to be less sensitive to initialization than RC-KM and RC-EM. Details on implementing the RC algorithms with extended linear regression models are also available in the same paper.
FUTURE TRENDS Improving the understanding of dynamic weighting to the convergence behavior of clustering algorithms and finding systemic design methods to develop better performing clustering algorithms require more research. Some of the work in this direction is appearing. Nock and Nielsen (2004) took the dynamic weighting idea and developed a general framework similar to boosting theory in supervised learning. Regression clustering will find many applications in analyzing real-word data. Single-function regression has been used very widely for data analysis and forecasting. Data collected in an uncontrolled environment, like in stocks, marketing, economy, government census, and many other real-world situations, are very likely to contain a mixture of different response characters. Regression clustering is a natural extension to the classical single-function regression.
CONCLUSION Replacing the simple geometric-point centers in center-based clustering algorithms by more complex data models provides a general scheme for deriving other model-based clustering algorithms. Regression models are used in this presentation to demonstrate the process. The key step in the generalization is defining the distance function from a data point to the set of models— the regression functions in this special case. Among the three algorithms, EM has a strong foundation in probability theory. It is the convergence to only a local optimum and the existence of a very large number of optima when the number of clusters is more than a few (>5, for example) that keeps practitioners from the benefits of its theory. K-Means is the simplest and its objective function the most intuitive. But it has the similar problem as the EM’s sensitivity to initialization of the centers. K-Harmonic Means was developed with close attention to the dynamics of its convergence; it is much more robust than the other two on low dimen-
sional data. Improving the convergence of center-based clustering algorithms on higher dimensional data (dim > 10) still needs more research.
REFERENCES Bradley, P., & Fayyad, U.M. (1998). Refining initial points for KM clustering. MS Technical Report MSR-TR-98-36. Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 138. DeSarbo, W.S., & Corn, L.W. (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification, 5, 249-282. Duda, R., & Hart, P. (1972). Pattern classification and scene analysis. John Wiley & Sons. Friedman, J., Hastie, T., & Tibshirani. R. (1998). Additive logistic regression: A statistical view of boosting [technical report]. Department of Statistics, Stanford University. Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Kluwer Academic Publishers. Hamerly, G., & Elkan, C. (2002). Alternatives to the kmeans algorithm that find better clusterings. Proceedings of the ACM conference on information and knowledge management (CIKM). Hamerly, G., & Elkan, C. (2003). Learning the k in k-means. Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems. Hennig, C. (1997). Datenanalyse mit modellen fur cluster linear regression [Dissertation]. Hamburg, Germany: Institut Fur Mathmatsche Stochastik, Universitat Hamburg. Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California. McLachlan, G. J., & Krishnan, T. (1997). EM algorithm and extensions. John Wiley & Sons. Meila, M., & Heckerman, D. (1998). An experimental comparison of several clustering and initialization methods. In Proceedings of the Fourteenth Conference on Intelli-
gence – Artificial in Uncertainty (pp. 386-395). Morgan Kaufman. Montgomery, D.C., Peck, E.A., & Vining, G.G. (2001). Introduction to linear regression analysis. John Wiley & Sons. Nock, R., & Nielsen, F. (2004). An abstract weighting framework for clustering algorithms. Proceedings of the Fourth International SIAM Conference on Data Mining. Orlando, Florida. Pena, J., Lozano, J., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the Kmeans algorithm. Pattern Recognition Letters, 20, 1027-1040. Rendner, R.A., & Walker, H.F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2). Schapire, R.E. (1999). Theoretical views of boosting and applications. Proceedings of the Tenth International Conference on Algorithmic Learning Theory. Selim, S.Z., & Ismail, M.A (1984). K-means type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on PAMI-6, 1. Silverman, B.W. (1998). Density estimation for statistics and data analysis. Chapman & Hall/CRC. Spath, H. (1981). Correction to algorithm 39: Clusterwise linear regression. Computing, 26, 275. Spath, H. (1982). Algorithm 48: A fast algorithm for clusterwise linear regression. Computing, 29, 175-181. Spath, H. (1985). Cluster dissection and analysis. New York: Wiley. Tibshirani, R., Walther, G., & Hastie, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Retrieved from http://www-stat.stanford.edu/~tibs / research.html Zhang, B. (2001). Generalized K-harmonic means—Dynamic weighting of data in unsupervised learning. Proceedings of the First SIAM International Conference on Data Mining (SDM’2001), Chicago, Illinois. Zhang, B. (2003). Comparison of the performance of center-based clustering algorithms. Proceedings of PAKDD03, Seoul, South Korea. Zhang, B. (2003a). Regression clustering. Proceedings of the IEEE International Conference on Data Mining, Melbourne, Florida.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, Lyon, France.
KEY TERMS Boosting: Assigning and updating weights on data points according to a particular formula in the process of refining classification models. Center-Based Clustering: Similarity among the data points is defined through a set of centers. The distance from each data point to a center determined the data points association with that center. The clusters are represented by the centers. Clustering: Grouping data according to similarity among them. Each clustering algorithm has its own definition of similarity. Such grouping can be hierarchical. Dynamic Weighting: Reassigning weights on the data points in each iteration of an iterative algorithm.
Model-Based Clustering: A mixture of simpler distributions is used to fit the data, which defines the clusters of the data. EM with linear mixing of Gaussian density functions is the best example, but K-Means and K-Harmonic Means are the same type. Regression clustering algorithms are also model-based clustering algorithms with mixing of more complex distributions as its model. Regression: A statistical method of learning the relationship between two sets of variables from data. One set is the independent variables or the predictors, and the other set is the response variables. Regression Clustering: Combining the regression methods with center-based clustering methods. The simple geometric-point centers in the center-based clustering algorithms are replaced by regression models. Sensitivity to Initialization: Center-based clustering algorithms are iterative algorithms that minimizing the value of its performance function. Such algorithms converge to only a local optimum of its performance function. The converged positions of the centers depend on the initial positions of the centers where the algorithm start with.
Classification and Regression Trees
C
Johannes Gehrke Cornell University, USA
INTRODUCTION It is the goal of classification and regression to build a data-mining model that can be used for prediction. To construct such a model, we are given a set of training records, each having several attributes. These attributes either can be numerical (e.g., age or salary) or categorical (e.g., profession or gender). There is one distinguished attribute—the dependent attribute; the other attributes are called predictor attributes. If the dependent attribute is categorical, the problem is a classification problem. If the dependent attribute is numerical, the problem is a regression problem. It is the goal of classification and regression to construct a data-mining model that predicts the (unknown) value for a record, where the value of the dependent attribute is unknown. (We call such a record an unlabeled record.) Classification and regression have a wide range of applications, including scientific experiments, medical diagnosis, fraud detection, credit approval, and target marketing (Hand, 1997). Many classification and regression models have been proposed in the literature; among the more popular models are neural networks, genetic algorithms, Bayesian methods, linear and log-linear models and other statistical methods, decision tables, and tree-structured models, which is the focus of this article (Breiman, Friedman, Olshen & Stone, 1984). Tree-structured models, so-called decision trees, are easy to understand; they are nonparametric and, thus, do not rely on assumptions about the data distribution; and they have fast construction methods even for large training datasets (Lim, Loh & Shih, 2000). Most data-mining suites include tools for classification and regression tree construction (Goebel & Gruenwald, 1999).
BACKGROUND

Let us start by introducing decision trees. For the ease of explanation, we are going to focus on binary decision trees. In binary decision trees, each internal node has two child nodes. Each internal node is associated with a predicate, called the splitting predicate, which involves only the predictor attributes. Each leaf node is associated with a unique value for the dependent attribute. A decision tree encodes a data-mining model as follows. For an
unlabeled record, we start at the root node. If the record satisfies the predicate associated with the root node, we follow the tree to the left child of the root, and we go to the right child otherwise. We continue this pattern through a unique path from the root of the tree to a leaf node, where we predict the value of the dependent attribute associated with this leaf node. An example decision tree for a classification problem, a classification tree, is shown in Figure 1. Note that a decision tree automatically captures interactions between variables, but it only includes interactions that help in the prediction of the dependent attribute. For example, the rightmost leaf node in the example shown in Figure 1 is associated with the classification rule: "If (Age >= 40) and (Gender=male), then YES", a classification rule that involves an interaction between the two predictor attributes age and gender. Decision trees can be mined automatically from a training database of records, where the value of the dependent attribute is known: A decision tree construction algorithm selects which attribute(s) to involve in the splitting predicates, and the algorithm decides also on the shape and depth of the tree (Murthy, 1998).
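A small sketch of this root-to-leaf traversal is given below. The Node class and the example tree are illustrative only; the tree implements the rule quoted above, with the convention (as in the text) that a satisfied predicate leads to the left child.

```python
# How a binary classification tree encodes its model: follow the splitting
# predicates from the root to a leaf and return the leaf's label.

class Node:
    def __init__(self, predicate=None, left=None, right=None, label=None):
        self.predicate, self.left, self.right, self.label = predicate, left, right, label

def predict(node, record):
    while node.label is None:                       # internal node: test the splitting predicate
        node = node.left if node.predicate(record) else node.right
    return node.label                               # leaf node: predicted dependent value

tree = Node(predicate=lambda r: r["age"] <= 40,
            left=Node(label="No"),                  # Age <= 40
            right=Node(predicate=lambda r: r["gender"] == "M",
                       left=Node(label="Yes"),      # Age > 40 and Gender = male
                       right=Node(label="No")))

print(predict(tree, {"age": 52, "gender": "M"}))    # -> "Yes"
print(predict(tree, {"age": 30, "gender": "M"}))    # -> "No"
```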
MAIN THRUST Let us discuss how decision trees are mined from a training database. A decision tree usually is constructed in two phases. In the first phase, the growth phase, an overly large and deep tree is constructed from the training data. In the second phase, the pruning phase, the final size of the tree is determined with the goal to minimize the expected misprediction error (Quinlan, 1993).
Figure 1. An example classification tree (root split: Age ≤ 40, with leaf label No; the other child splits on Gender = M, with leaf labels No and Yes)
There are two problems that make decision tree construction a hard problem. First, construction of the optimal tree for several measures of optimality is an NP-hard problem. Thus, all decision tree construction algorithms grow the tree top-down according to the following greedy heuristic: At the root node, the training database is examined, and a splitting predicate is selected. Then the training database is partitioned according to the splitting predicate, and the same method is applied recursively at each child node. The second problem is that the training database is only a sample from a much larger population of records. The decision tree has to perform well on records drawn from the population, not on the training database. (For the records in the training database, we already know the value of the dependent attribute.) Three different algorithmic issues need to be addressed during the tree construction phase. The first issue is to devise a split selection algorithm, such that the resulting tree models the underlying dependency relationship between the predictor attributes and the dependent attribute well. During split selection, we have to make two decisions. First, we need to decide which attribute we will select as the splitting attribute. Second, given the splitting attribute, we have to decide on the actual splitting predicate. For a numerical attribute X, splitting predicates are usually of the form X ≤ c, where c is a constant. For example, in the tree shown in Figure 1, the splitting predicate of the root node is of this form. For a categorical attribute X, splits are usually of the form X in C, where C is a set of values in the domain of X. For example, in the tree shown in Figure 1, the splitting predicate of the right child node of the root is of this form. There exist decision trees that have a larger class of possible splitting predicates; for example, there exist decision trees with linear combinations of numerical attribute values as splitting predicates (Σ_i a_i X_i + c ≥ 0, where i ranges over all attributes) (Loh & Shih, 1997). Such splits, also called oblique splits, result in shorter trees; however, the resulting trees are no longer easy to interpret. The second issue is to devise a pruning algorithm that selects the tree of the right size. If the tree is too large, then the tree models the training database too closely instead of modeling the underlying population. One possible choice of pruning a tree is to hold out part of the training set as a test set and to use the test set to estimate the misprediction error of trees of different size. We then simply select the tree that minimizes the misprediction error. The third issue is to devise an algorithm for intelligent management of the training database in case the training database is very large (Ramakrishnan & Gehrke, 2002). This issue has only received attention in the last decade, but there exist now many algorithms that can construct decision trees over extremely large, disk-resident training
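The sketch below illustrates the first issue (split selection) for a single numerical attribute: candidate predicates X ≤ c are evaluated at the midpoints between consecutive sorted values, scored here with weighted Gini impurity. Gini is only one common criterion assumed for the example, not the specific measure of any of the cited algorithms.

```python
# Greedy split selection for one numerical attribute using Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    return (1.0 - sum((c / n) ** 2 for c in Counter(labels).values())) if n else 0.0

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no threshold between equal values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2.0     # candidate predicate: X <= c
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (c, score)
    return best

ages = [23, 31, 36, 42, 47, 55]
buys = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_numeric_split(ages, buys))                 # threshold 39.0 separates the classes perfectly
```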
databases (Gehrke, Ramakrishnan & Ganti, 2000; Shafer, Agrawal & Mehta, 1996). In most classification and regression scenarios, we also have costs associated with misclassifying a record, or with being far off in our prediction of a numerical dependent value. Existing decision tree algorithms can take costs into account, and they will bias the model toward minimizing the expected misprediction cost instead of the expected misclassification rate, or the expected difference between the predicted and true value of the dependent attribute.
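To make the split-selection step concrete, the following minimal sketch evaluates candidate splitting predicates of the form X ≤ c for one numerical attribute by choosing the threshold that minimizes the weighted Gini impurity of the two resulting partitions. It is an illustration only, not the procedure of any particular system; all function and variable names (gini, best_numeric_split, ages) are hypothetical.

from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_numeric_split(values, labels):
    # Return the threshold c minimizing the weighted impurity of X <= c versus X > c.
    pairs = sorted(zip(values, labels))
    best_c, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical attribute values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs[:i]]
        right = [y for x, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_c, best_impurity = c, impurity
    return best_c, best_impurity

# Tiny example: choose a splitting predicate on Age for a six-record sample.
ages = [25, 32, 38, 45, 52, 61]
classes = ["Yes", "Yes", "Yes", "No", "No", "No"]
print(best_numeric_split(ages, classes))   # (41.5, 0.0): Age <= 41.5 separates the classes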
FUTURE TRENDS Recent developments have expanded the types of models that a decision tree can have in its leaf nodes. So far, we assumed that each leaf node just predicts a constant value for the dependent attribute. Recent work, however, has shown how to construct decision trees with linear models in the leaf nodes (Dobra & Gehrke, 2002). Another recent development in the general area of data mining is the use of ensembles of models, and decision trees are a popular model for use as a base model in ensemble learning (Caruana, Niculescu-Mizil, Crew & Ksikes, 2004). Another recent trend is the construction of data-mining models of high-speed data streams, and there have been adaptations of decision tree construction algorithms to such environments (Domingos & Hulten, 2002). A last recent trend is to take adversarial behavior into account (e.g., in classifying spam). In this case, an adversary who produces the records to be classified actively changes his or her behavior over time to outsmart a static classifier (Dalvi, Domingos, Mausam, Sanghai & Verma, 2004).
CONCLUSION Decision trees are one of the most popular data-mining models. Decision trees are important, since they can result in powerful predictive models, while, at the same time, they allow users to get insight into the phenomenon that is being modeled.
REFERENCES Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Kluwer Academic Publishers. Caruana, R., Niculescu-Mizil, A., Crew, R., & Ksikes, A. (2004). Ensemble selection from libraries of models. Pro-
ceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, Canada.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Dalvi, N., Domingos, P., Mausam, Sanghai, S., & Verma, D. (2004). Adversarial classification. Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.
Ramakrishnan, R., & Gehrke, J. (2002). Database management systems (3rd ed.). McGraw-Hill.
Dobra, A., & Gehrke, J. (2002). SECRET: A scalable linear regression tree algorithm. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.
Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22nd International Conference on Very Large Databases, Bombay, India.
Domingos, P., & Hulten, G. (2002). Learning from infinite data in finite time. Advances in Neural Information Processing Systems, 14, 673-680.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (2000). Rainforest—A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery, 4(2/3), 127-162.
Goebel, M., & Gruenwald, L. (1999). A survey of data mining software tools. SIGKDD Explorations, 1(1), 20-33.
Hand, D. (1997). Construction and assessment of classification rules. Chichester, England: John Wiley & Sons.
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 48, 203-228.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.
Murthy, S.K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389.
KEY TERMS
Attribute: Column of a dataset.
Categorical Attribute: Attribute that takes values from a discrete domain.
Classification Tree: A decision tree where the dependent attribute is categorical.
Decision Tree: Tree-structured data mining model used for prediction, where internal nodes are labeled with predicates (decisions), and leaf nodes are labeled with data-mining models.
Numerical Attribute: Attribute that takes values from a continuous domain.
Regression Tree: A decision tree where the dependent attribute is numerical.
Splitting Predicate: Predicate at an internal node of the tree; it decides which branch a record traverses on its way from the root to a leaf node.
Classification Methods

Aijun An, York University, Canada
INTRODUCTION Generally speaking, classification is the action of assigning an object to a category according to the characteristics of the object. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, distinguished from unsupervised learning. In supervised learning, the training data consists of pairs of input data (typically vectors), and desired outputs, while in unsupervised learning there is no a priori output. Classification has various applications, such as learning from a patient database to diagnose a disease based on the symptoms of a patient, analyzing credit card transactions to identify fraudulent transactions, automatic recognition of letters or digits based on handwriting samples, and distinguishing highly active compounds from inactive ones based on the structures of compounds for drug discovery.
BACKGROUND Classification has been studied in statistics and machine learning. In statistics, classification is also referred to as discrimination. Early work on classification focused on discriminant analysis, which constructs a set of discriminant functions, such as linear functions of the predictor variables, based on a set of training examples to discriminate among the groups defined by the class variable. Modern studies explore more flexible classes of models, such as providing an estimate of the joint distribution of the features within each class (e.g. Baye-
sian classification), classifying an example based on distances in the feature space (e.g. the k-nearest neighbor method), and constructing a classification tree that classifies examples based on tests on one or more predictor variables (i.e., classification tree analysis). In the field of machine learning, attention has more focused on generating classification expressions that are easily understood by humans. The most popular machine learning technique is decision tree learning, which learns the same tree structure as classification trees but uses different criteria during the learning process. The technique was developed in parallel with the classification tree analysis in statistics. Other machine learning techniques include classification rule learning, neural networks, Bayesian classification, instance-based learning, genetic algorithms, the rough set approach and support vector machines. These techniques mimic human reasoning in different aspects to provide insight into the learning process. The data mining community inherits the classification techniques developed in statistics and machine learning, and applies them to various real world problems. Most statistical and machine learning algorithms are memory-based, in which the whole training data set is loaded into the main memory before learning starts. In data mining, much effort has been spent on scaling up the classification algorithms to deal with large data sets. There is also a new classification technique, called association-based classification, which is based on association rule learning.
MAIN THRUST Major classification techniques are described below. The techniques differ in the learning mechanism and in the representation of the learned model.
Decision Tree Learning Decision tree learning is one of the most popular classification algorithms. It induces a decision tree from data. A decision tree is a tree structured prediction model where each internal node denotes a test on an attribute, each outgoing branch represents an outcome of the test, and each leaf node is labeled with a class or
Figure 1. A decision tree with tests on attributes X and Y (the root tests X ≥ 1 versus X < 1; the X < 1 branch tests Y, with outcomes Y = A, Y = B, and Y = C; the leaves are labeled Class 1 or Class 2 — for example, X < 1 and Y = B leads to Class 2)
class distribution. A simple decision tree is shown in Figure 1. With a decision tree, an object is classified by following a path from the root to a leaf, taking the edges corresponding to the values of the attributes in the object. A typical decision tree learning algorithm adopts a top-down recursive divide-and-conquer strategy to construct a decision tree. Starting from a root node representing the whole training data, the data is split into two or more subsets based on the values of an attribute chosen according to a splitting criterion. For each subset a child node is created and the subset is associated with the child. The process is then separately repeated on the data in each of the child nodes, and so on, until a termination criterion is satisfied. Many decision tree learning algorithms exist. They differ mainly in attribute-selection criteria, such as information gain, gain ratio (Quinlan, 1993), gini index (Breiman, Friedman, Olshen, & Stone, 1984), etc., termination criteria and post-pruning strategies. Post-pruning is a technique that removes some branches of the tree after the tree is constructed to prevent the tree from over-fitting the training data. Representative decision tree algorithms include CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). There are also studies on fast and scalable construction of decision trees. Representative algorithms of such kind include RainForest (Gehrke, Ramakrishnan, & Ganti, 1998) and SPRINT (Shafer, Agrawal, & Mehta., 1996).
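The top-down, divide-and-conquer strategy can be sketched in a few lines. The following illustration uses information gain as the attribute-selection criterion and handles only categorical attributes; it is a simplified sketch rather than C4.5 or CART, and the names (entropy, build_tree) are hypothetical.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # rows: list of dicts mapping attribute name -> categorical value
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    def gain(a):                                         # information gain of splitting on a
        parts = Counter(r[a] for r in rows)
        rem = sum((cnt / len(rows)) *
                  entropy([l for r, l in zip(rows, labels) if r[a] == v])
                  for v, cnt in parts.items())
        return entropy(labels) - rem
    best = max(attributes, key=gain)
    tree = {}
    for v in set(r[best] for r in rows):                 # one branch per attribute value
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        tree[(best, v)] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                     [a for a in attributes if a != best])
    return tree

# e.g., build_tree(list_of_record_dicts, class_labels, ["X", "Y"]) returns a nested
# dict whose keys are (attribute, value) tests and whose leaves are class labels.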
Decision Rule Learning Decision rules are a set of if-then rules. They are the most expressive and human-readable representation of classification models (Mitchell, 1997). An example of a decision rule is “if X<1 and Y=B, then the example belongs to Class 2”. Rules of this type are referred to as propositional rules. Rules can be generated by translating a decision tree into a set of rules – one rule for each leaf node in the tree. A second way to generate rules is to learn rules directly from the training data. There is a variety of rule induction algorithms. The algorithms induce rules by searching in a hypothesis space for a hypothesis that best matches the training data. The algorithms differ in the search method (e.g., general-to-specific, specific-to-general, or two-way search), the
search heuristics that control the search, and the pruning method used. The most widespread approach to rule induction is sequential covering, in which a greedy general-to-specific search is conducted to learn a disjunctive set of conjunctive rules. It is called sequential covering because it sequentially learns a set of rules that together cover the set of positive examples for a class. Algorithms belonging to this category include CN2 (Clark & Boswell, 1991), RIPPER (Cohen, 1995) and ELEM2 (An & Cercone, 1998).
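The sequential-covering idea can be illustrated with a greedy general-to-specific search that grows one conjunctive rule at a time and then removes the examples it covers. This is only a sketch of the strategy, not CN2, RIPPER, or ELEM2, and all names are hypothetical.

def learn_one_rule(rows, labels, target):
    # Grow one conjunctive rule (a list of attribute-value tests), general to specific.
    rule, covered = [], list(range(len(rows)))
    while covered and not all(labels[i] == target for i in covered):
        candidates = [(a, rows[i][a]) for i in covered for a in rows[i]
                      if (a, rows[i][a]) not in rule]
        if not candidates:
            break
        def precision(test):
            idx = [i for i in covered if rows[i][test[0]] == test[1]]
            return sum(labels[i] == target for i in idx) / len(idx)
        best = max(set(candidates), key=precision)       # test with the best precision
        rule.append(best)
        covered = [i for i in covered if rows[i][best[0]] == best[1]]
    return rule, covered

def sequential_covering(rows, labels, target):
    # Learn a disjunctive set of conjunctive rules that together cover the target class.
    rules, remaining = [], list(range(len(rows)))
    while any(labels[i] == target for i in remaining):
        rule, covered = learn_one_rule([rows[i] for i in remaining],
                                       [labels[i] for i in remaining], target)
        if not covered:
            break
        rules.append(rule)
        remaining = [idx for pos, idx in enumerate(remaining) if pos not in set(covered)]
    return rules

rows = [{"X": "<1", "Y": "B"}, {"X": "<1", "Y": "A"}, {"X": ">=1", "Y": "B"}]
labels = ["Class 2", "Class 1", "Class 1"]
print(sequential_covering(rows, labels, target="Class 2"))   # one conjunctive rule for Class 2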
Naive Bayesian Classifier The naive Bayesian classifier is based on Bayes’ theorem. Suppose that there are m classes, C1, C2, …, Cm. The classifier predicts an unseen example X as belonging to the class having the highest posterior probability conditioned on X. In other words, X is assigned to class Ci if and only if P(Ci|X) > P(Cj|X) for 1≤j ≤m, j ≠i. By Bayes’ theorem, we have
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. Given a set of training data, P(Ci) can be estimated by counting how often each class occurs in the training data. To reduce the computational expense in estimating P(X|Ci) for all possible Xs, the classifier makes a naïve assumption that the attributes used in describing X are conditionally independent of each other given the class of X. Thus, given the attribute values (x1, x2, …, xn) that describe X, we have
P(X|Ci) = ∏_{j=1}^{n} P(xj|Ci).
The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training data. The naïve Bayesian classifier is simple to use and efficient to learn. It requires only one scan of the training data. Despite the fact that the independence assumption is often violated in practice, naïve Bayes often competes well with more sophisticated classifiers. Recent theoretical analysis has shown why the naive Bayesian classifier is so robust (Domingos & Pazzani, 1997; Rish, 2001).
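A minimal count-based sketch of the classifier just described is shown below. Laplace smoothing is added to avoid zero probabilities, which is an assumption of this illustration rather than part of the basic formulation, and all names and the toy data are hypothetical.

from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    # examples: list of attribute-value tuples; returns class counts and conditional counts.
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)           # (class, attribute index) -> value counts
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            cond_counts[(c, j)][v] += 1
    return class_counts, cond_counts

def predict(x, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, cc in class_counts.items():
        score = cc / n                            # estimate of P(Ci)
        for j, v in enumerate(x):                 # product of smoothed estimates of P(xj | Ci)
            counts = cond_counts[(c, j)]
            score *= (counts[v] + 1) / (sum(counts.values()) + len(counts) + 1)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

train_x = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
train_y = ["no", "no", "yes", "no"]
model = train_naive_bayes(train_x, train_y)
print(predict(("rainy", "mild"), *model))         # "yes" for this toy data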
Bayesian Belief Networks A Bayesian belief network, also known as a Bayesian network or belief network, is a directed acyclic graph whose nodes represent variables and whose arcs represent dependence relations among the variables. If there is an arc from node A to another node B, then we say that A is a parent of B and B is a descendent of A. Each variable is conditionally independent of its nondescendents in the graph, given its parents. The variables may correspond to actual attributes given in the data or to “hidden variables” believed to form a relationship. A variable in the network can be selected as the class attribute. The classification process can return a probability distribution for the class attribute based on the network structure and some conditional probabilities estimated from the training data, which predicts the probability of each class. The Bayesian network provides an intermediate approach between naïve Bayesian classification and Bayesian classification without any independence assumptions. It describes dependencies among attributes, but allows conditional independence among subsets of attributes. The training of a belief network depends on the scenario. If the network structure is known and the variables are observable, training the network only consists of estimating some conditional probabilities from the training data, which is straightforward. If the network structure is given and some of the variables are hidden, a method of gradient descent can be used to train the network (Russell, Binder, Koller, & Kanazawa, 1995). Algorithms also exist for learning the network structure from training data given observable variables (Buntine, 1994; Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995).
The k-Nearest Neighbour Classifier The k-nearest neighbour classifier classifies an unknown example into the most common class among its k nearest neighbors in the training data. It assumes all the examples correspond to points in an n-dimensional space. A neighbour is deemed nearest if it has the smallest distance, in the Euclidean sense, in the n-dimensional feature space. When k = 1, the unknown example is classified into the class of its closest neighbour in the training set. The k-nearest neighbour method stores all the training examples and postpones learning until a new example needs to be classified. This type of learning is called instance-based or lazy learning.
The k-nearest neighbour classifier is intuitive, easy to implement and effective in practice. It can construct a different approximation to the target function for each new example to be classified, which is advantageous when the target function is very complex but can be described by a collection of less complex local approximations (Mitchell, 1997). However, its cost of classifying new examples can be high due to the fact that almost all the computation is done at classification time. Some refinements to the k-nearest neighbor method include weighting the attributes in the distance computation and weighting the contribution of each of the k neighbors during classification according to their distance to the example to be classified.
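A minimal sketch of the basic (unweighted) method, using Euclidean distance and a majority vote among the k closest training points, is given below; the names and the toy data are illustrative only.

import math
from collections import Counter

def knn_classify(query, train_points, train_labels, k=3):
    # Majority vote among the k closest training examples (Euclidean distance).
    dists = sorted(
        (math.dist(query, p), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
labels = ["a", "a", "b", "b"]
print(knn_classify((1.1, 1.0), points, labels, k=3))   # "a"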
Neural Networks Neural networks, also referred to as artificial neural networks, are studied to simulate the human brain, although brains are much more complex than any artificial neural network developed so far. A neural network is composed of a few layers of interconnected computing units (neurons or nodes). Each unit computes a simple function. The inputs of the units in one layer are the outputs of the units in the previous layer. Each connection between units is associated with a weight. Parallel computing can be performed among the units in each layer. The units in the first layer take input and are called the input units. The units in the last layer produce the output of the network and are called the output units. When the network is in operation, a value is applied to each input unit, which then passes its given value to the connections leading out from it, and on each connection the value is multiplied by the weight associated with that connection. Each unit in the next layer then receives a value which is the sum of the values produced by the connections leading into it, and in each unit a simple computation is performed on the value - a sigmoid function is typical. This process is then repeated, with the results being passed through subsequent layers of nodes until the output nodes are reached. Neural networks can be used for both regression and classification. To model a classification function, we can use one output unit per class. An example can be classified into the class corresponding to the output unit with the largest output value. Neural networks differ in the way in which the neurons are connected, in the way the neurons process their input, and in the propagation and learning methods used (Nurnberger, Pedrycz, & Kruse, 2002). Learning a neural network is usually restricted to modifying the weights based on the training data; the structure of the initial network is usually left unchanged during the learning process. A typical network structure is the
multilayer feed-forward neural network, in which none of the connections cycles back to a unit of a previous layer. The most widely used method for training a feedforward neural network is backpropagation (Rumelhart, Hinton, & Williams, 1986).
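To make the feed-forward computation concrete, the following sketch performs one forward pass through a small network with one hidden layer and sigmoid units. The weights are arbitrary illustrative values, not trained ones; in practice they would be learned, for example by backpropagation.

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(x, layers):
    # layers: list of (weight_matrix, bias_vector); each unit sums its weighted
    # inputs, adds a bias, and applies the sigmoid function.
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

# One hidden layer with two units and a single output unit.
network = [
    ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),   # hidden layer
    ([[1.0, -1.0]], [0.0]),                      # output layer
]
print(forward([0.9, 0.4], network))              # e.g., [0.48...]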
Support Vector Machines The support vector machine (SVM) is a recently developed technique for multidimensional function approximation. The objective of support vector machines is to determine a classifier or regression function which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) (Vapnik, 1998). Given a set of N linearly separable training examples S = {x i ∈ R n | i = 1,2,..., N } , where each example belongs to one of the two classes, represented by yi ∈ {+1,−1} , the SVM learning method seeks the optimal hyperplane w ⋅ x + b = 0 , as the decision surface, which separates the positive and negative examples with the largest margin. The decision function for classifying linearly separable data is:
f (x) = sign(w ⋅ x + b) , where w and b are found from the training set by solving a constrained quadratic optimization problem. The final decision function is
f(x) = sign( ∑_{i=1}^{N} αi yi (xi ⋅ x) + b ).
The function depends on the training examples for which αi is non-zero. These examples are called support vectors. Often the number of support vectors is only a small fraction of the original dataset. The basic SVM formulation can be extended to the nonlinear case by using nonlinear kernels that map the input space to a high-dimensional feature space. In this high-dimensional feature space, linear classification can be performed. The SVM classifier has become very popular due to its high performance in practical applications such as text classification and pattern recognition.
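The kernel decision function above can be evaluated directly once the support vectors, the multipliers αi, the labels yi, and the offset b are known. The following sketch uses made-up values for illustration; in practice these quantities come from solving the constrained quadratic optimization problem.

def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_decision(x, support_vectors, alphas, ys, b, kernel=linear_kernel):
    # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, ys, support_vectors)) + b
    return 1 if score >= 0 else -1

support_vectors = [(1.0, 1.0), (3.0, 3.0)]       # hypothetical values for illustration
alphas = [0.25, 0.25]
ys = [-1, +1]
b = -1.0
print(svm_decision((3.5, 2.5), support_vectors, alphas, ys, b))   # +1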
FUTURE TRENDS Classification is a major data mining task. As data mining becomes more popular, classification techniques are in-
creasingly applied to provide decision support in business, biomedicine, financial analysis, telecommunications and so on. For example, there are recent applications of classification techniques to identify fraudulent usage of credit cards based on credit card transaction databases; and various classification techniques have been explored to identify highly active compounds for drug discovery. To better solve application-specific problems, there has been a trend toward the development of more applicationspecific data mining systems (Han & Kamber, 2001). Traditional classification algorithms assume that the whole training data can fit into the main memory. As automatic data collection becomes a daily practice in many businesses, large volumes of data that exceed the memory capacity become available to the learning systems. Scalable classification algorithms become essential. Although some scalable algorithms for decision tree learning have been proposed, there is still a need to develop scalable and efficient algorithms for other types of classification techniques, such as decision rule learning. Previously, the study of classification techniques focused on exploring various learning mechanisms to improve the classification accuracy on unseen examples. However, recent study on imbalanced data sets has shown that classification accuracy is not an appropriate measure to evaluate the classification performance when the data set is extremely unbalanced, in which almost all the examples belong to one or more, larger classes and far fewer examples belong to a smaller, usually more interesting class. Since many real world data sets are unbalanced, there has been a trend toward adjusting existing classification algorithms to better identify examples in the rare class. Another issue that has become more and more important in data mining is privacy protection. As data mining tools are applied to large databases of personal records, privacy concerns are rising. Privacy-preserving data mining is currently one of the hottest research topics in data mining and will remain so in the near future.
CONCLUSION Classification is a form of data analysis that extracts a model from data to classify future data. It has been studied in parallel in statistics and machine learning, and is currently a major technique in data mining with a broad application spectrum. Since many application problems can be formulated as a classification problem and the volume of the available data has become overwhelming, developing scalable, efficient, domain-specific, and privacy-preserving classification algorithms is essential.
147
C
Classification Methods
REFERENCES An, A., & Cercone, N. (1998). ELEM2: A learning system for more accurate classifications. Proceedings of the 12th Canadian Conference on Artificial Intelligence (pp. 426-441). Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group. Buntine, W.L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225. Castillo, E., Gutiérrez, J.M., & Hadi, A.S. (1997). Expert systems and probabilistic network models. New York: Springer-Verlag. Clark P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. Proceedings of the 5th European Working Session on Learning (pp. 151-163). Cohen, W.W. (1995). Fast effective rule induction. Proceedings of the 11th International Conference on Machine Learning (pp. 115-123), Morgan Kaufmann. Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347. Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130. Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest - A framework for fast decision tree construction of large datasets. Proceedings of the 24th International Conference on Very Large Data Bases (pp. 416-427). Han, J., & Kamber, M. (2001). Data mining — Concepts and techniques. Morgan Kaufmann. Heckerman, D., Geiger, D., & Chickering, D.M. (1995) Learning bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243. Mitchell, T.M. (1997). Machine learning. McGraw-Hill. Nurnberger, A., Pedrycz, W., & Kruse, R. (2002). Neural network approaches. In Klosgen & Zytkow (Eds.), Handbook of data mining and knowledge discovery (pp. 304317). Oxford University Press. Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241-288. Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. Proceedings of the 14th Joint International Conference on Artificial Intelligence, 2 (pp. 1146-1152). Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Proceedings of the 22 th International Conference on Very Large Data Bases (pp. 544-555). Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.
KEY TERMS
Backpropagation: A neural network training algorithm for feedforward networks where the errors at the output layer are propagated back to the previous layer to update connection weights in learning. If the previous layer is not the input layer, then the errors at this hidden layer are propagated back to the layer before.
Disjunctive Set of Conjunctive Rules: A conjunctive rule is a propositional rule whose antecedent consists of a conjunction of attribute-value pairs. A disjunctive set of conjunctive rules consists of a set of conjunctive rules with the same consequent. It is called disjunctive because the rules in the set can be combined into a single disjunctive rule whose antecedent consists of a disjunction of conjunctions.
Genetic Algorithm: An algorithm for optimizing a binary string based on an evolutionary mechanism that uses replication, deletion, and mutation operators carried out over many generations.
Information Gain: Given a set E of classified examples and a partition P = {E1, ..., En} of E, the information gain is defined as entropy(E) − ∑_{i=1}^{n} (|Ei| / |E|) entropy(Ei), where |X| is the number of examples in X and entropy(X) = −∑_{j=1}^{m} pj log2(pj) (assuming there are m classes in X and pj denotes the probability of the jth class in X). Intuitively, the information gain measures the decrease of the weighted average impurity of the partitions E1, ..., En, compared with the impurity of the complete set of examples E.
Machine Learning: The study of computer algorithms that develop new knowledge and improve their performance automatically through past experience.
Rough Set Data Analysis: A method for modeling uncertain information in data by forming lower and upper approximations of a class. It can be used to reduce the feature set and to generate decision rules.
Sigmoid Function: A mathematical function defined by the formula P(t) = 1 / (1 + e^{−t}). Its name is due to the sigmoid shape of its graph. This function is also called the standard logistic function.
Closed-Itemset Incremental-Mining Problem

Luminita Dumitriu, “Dunarea de Jos” University, Romania
INTRODUCTION Association rules, introduced by Agrawal, Imielinski and Swami (1993), provide useful means to discover associations in data. The problem of mining association rules in a database is defined as finding all the association rules that hold with more than a user-given minimum support threshold and a user-given minimum confidence threshold. According to Agrawal, Imielinski and Swami, this problem is solved in two steps:
1. Find all frequent itemsets in the database.
2. For each frequent itemset I, generate all the association rules I'⇒I\I', where I'⊂I.
The second problem can be solved in a straightforward manner after the first step is completed. Hence, the problem of mining association rules is reduced to the problem of finding all frequent itemsets. This is not a trivial problem, because the number of possible frequent itemsets is equal to the size of the power set of I, 2 |I| . Many algorithms are proposed in the literature, most of them based on the Apriori mining method (Agrawal & Srikant, 1994), which relies on a basic property of frequent itemsets: All subsets of a frequent itemset are frequent. This property also says that all supersets of an infrequent itemset are infrequent. This approach works well on weakly correlated data, such as market-basket data. For overcorrelated data, such as census data, there are other approaches, including Close (Pasquier, Bastide, Taouil, & Lakhal, 1999), CHARM (Zaki & Hsiao, 1999) and Closet (Pei, Han, & Mao, 2000), which are more appropriate. An interesting study of specific approaches is performed in Zheng, Kohavi and Mason (2001), qualifying CHARM as the most adjusted algorithm to real-world data. Some later improvements of Closet are mentioned in Wang, Han, and Pei (2003) concerning speed and memory usage, while a different support-counting method is proposed in Zaki and Gouda (2001). These approaches search for closed itemsets structured in lattices that are closely related with the concept lattice in formal concept analysis (Ganter & Wille, 1999). The main advantage of a closed itemset approach is the smaller size of the resulting concept lattice versus the
number of frequent itemsets, that is, search space reduction. In this article, I describe the closed-itemset approaches, considering the fact that an association rule mining process usually leads to a large number of results that are difficult for the user to understand. I also take into account the interactivity of the data-mining process proposed by Ankerst (2001).
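The level-wise Apriori strategy mentioned above can be sketched compactly: candidates of size k are generated from frequent itemsets of size k − 1 and pruned using the property that every subset of a frequent itemset must be frequent. This is an illustration, not a tuned implementation, and the names are hypothetical.

from itertools import combinations

def apriori(transactions, minsup):
    # transactions: list of sets of items; minsup: minimum number of supporting transactions.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items if sum(c <= t for t in transactions) >= minsup}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = sum(itemset <= t for t in transactions)
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must already be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= minsup}
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori(db, minsup=2))    # supports of all frequent itemsets of the toy database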
BACKGROUND In this section I first describe the use of closed-itemset lattices as the theoretical framework for the closeditemset approach. The application of Formal Concept Analysis to the association rule problem was first mentioned in Zaki and Ogihara (1998). For more details on lattice theory, see Ganter and Wille (1999). The closed-itemset approach is described below. I define a context (T, I, D), the Galois connection of a context ((T, I, D), s, t), a concept of the context (X, Y), and the set of concepts in the context, denoted β (T, I, D). The main result in the Formal Concept Analysis theory is as follows:
• Fundamental theorem of Formal Concept Analysis (FCA): Let (T, I, D) be a context. Then β (T, I, D) is a complete lattice with join and meet operators given by closed set intersection and union operators.
How does FCA apply to the association rule problem? First, T is the set of transaction ids, I is the set of items in the database, and D is the database itself. Second, the mapping s associates to a transaction set X the maximal itemset Y present in all transactions in X. The mapping t associates to any itemset Y the maximal transaction set X, where each transaction comprises all the items in the itemset Y. The resulting frequent concepts are considered in mining application only for their itemset side, the transaction set side being ignored. What is the advantage of FCA application? Among the results in Apriori, there are itemsets — I will call them Y and Y’, where Y’ is included in Y, and they have the same support. These two itemsets are two distinct results of Apriori, even if they characterize differently the
same transaction set. In fact, the longest itemset is the most precise characterization of that transaction set, all the others being partial and redundant definitions. Due to the observation that sº t and tºs are closure operators, the concepts in the lattice of concepts eliminate the presence of any unclosed itemsets. Under these circumstances, the FCA-based approaches have only subunitary confidence association rules as results, due to the fact that all unitary confidence association rules are considered redundant behaviors. All unitary confidence association rules can be expressed through a base, the pseudo-intent set. If I consider the data-mining process in the vision of Ankerst (2001) the resulting data model in Apriori is a long list of frequent itemsets, while FCA is a conceptual structure, namely a lattice of concepts, free of any redundancy.
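The closure operator s∘t can be computed directly from a small transaction database, which makes the notion of a closed itemset concrete. The following sketch is illustrative only; the function names (t_map, s_map, closure) are hypothetical.

def t_map(itemset, transactions):
    # t(Y): all transactions containing every item of Y.
    return {tid for tid, items in transactions.items() if itemset <= items}

def s_map(tids, transactions):
    # s(X): the maximal itemset common to all transactions in X.
    items = None
    for tid in tids:
        items = transactions[tid] if items is None else items & transactions[tid]
    return items or set()

def closure(itemset, transactions):
    # s(t(Y)) is the smallest closed itemset containing Y.
    return s_map(t_map(itemset, transactions), transactions)

db = {1: {"a", "b", "e"}, 2: {"a", "b", "c"}, 3: {"a", "b", "c", "d"}}
print(closure({"c"}, db))     # {'a', 'b', 'c'}: c never occurs without a and b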
MAIN THRUST Many interesting algorithms are issued from the FCA approach to the association rule problem. Although many differences exist between the support counting, memory usage and performance of all these algorithms, the general lines are almost the same. In the following sections, I take into account some of the main characteristics of these algorithms.
Resulting Data Model Most of the FCA approaches do not generate a lattice of concepts but a spanning tree on that lattice. The argument is that some of the pairs of adjacent concepts in the lattice can be inferred later. The main algorithms that follow this principle are CHARM and Closet. One approach builds the entire lattice (Dumitriu, 2002), called ERA. The argument for building the entire lattice resides in the fact that missing pairs of adjacent concepts in the spanning tree-based approaches can have an identical support count, thus transforming one of the concepts in the pair into a nonconcept. In fact, CHARM has a later stage of checking its results from this point of view. Another strong point of the ERA algorithm is that it offers the pseudo-intents as results as well, thus completely characterizing the data.
Itemset-Building Strategy While Apriori is a breadth-first result-building algorithm, most of the FCA-based algorithms are depth-first. Only ERA has a different strategy: Each item in the database is used to enlarge an already existing data
model built upon the previously selected items, thus generating at all times a new and extended data model. This strategy generates results layer by layer, just like an onion. The main difference between the depth-first strategy and the layer-based strategy is that interactivity is offered to the user. Just like peeling an onion, one can take a previously found data model, reduce it or enlarge it with some items, and reach the data view that is the most revealing to the individual. In breadth-first as well as depth-first strategies, it is impossible to provide interactivity to the user due to the fact that all items of interest for the mining process have to be available from the start.
FUTURE TRENDS The most challenging trends are likely to appear in quantitative attribute mapping and in online mining. The first case involves a well-known problem concerning the quality of numerical-to-Boolean attribute mapping. Solutions to this problem are already considered in Aumann and Lindell (1999), Hong, Kuo, Chi, and Wang (2000), Imberman and Domanski (2001), and Webb (2001), but the problem is far from resolution. An interactive association rule approach may help, in a generate-and-test paradigm, to find the most suitable mappings in a particular database context. Online mining closely resembles cognitive processes: A data model is flooded with facts; they are either consistent with the data model, contradict it, or have a neutral character. When contradicting facts become important (in number, frequency, or any other way), the model has to change. The real problem of the data-mining process is that the model is too large in number of concepts, or the concepts are too rigidly related, to support change.
CONCLUSION I have introduced the idea of a data model, expressed as a frequent closed-itemset lattice, with a base for global implications, if needed. In the closed-itemset incremental-mining solution, two new and important operations are applicable to data models: extension and reduction with several items. The main advantages of this approach are:
• The construction of small models of data, which makes them more understandable for the user; also, the response time is small.
• The extension of data models with a set of new items returns to the user only the supplementary results, hence a smaller amount of results; the response time is considerably smaller than when building the model from scratch.
• Whenever data models are incomprehensible, some of the items can be removed, thus obtaining an easy-to-understand data model.
• The extension or reduction of a model spares the time spent building it, thus reusing knowledge.
After many successful attempts to make the mining process faster, this approach also makes it more interactive and flexible, owing to the increased opportunity for human intervention.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data (pp. 207-216), USA. Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), Chile. Ankerst, M. (2001, May). Human involvement and interactivity of the next generation’s data mining tools. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 178-188), USA. Aumann, Y., & Lindell, Y. (1999, August). A statistical theory for quantitative association rules. Proceedings of the International Conference on Knowledge Discovery in Databases (pp. 261-270), USA. Dumitriu, L. (2002). Interactive mining and knowledge reuse for the closed-itemset incremental-mining problem. Newsletter of the ACM SIG on Knowledge Discovery and Data Mining, 3(2), 28-36. Retrieved from http:// www.acm.org/sigs/sigkdd/explorations/issue3-2/ contents.htm Ganter, B., & Wille, R. (1999). Formal concept analysis — Mathematical foundations. Berlin, Germany: Springer-Verlag. Hong, T.-P., Kuo, C.-S., Chi, S.-C., & Wang, S.-L. (2000, March). Mining fuzzy rules from quantitative data based
on the AprioriTid algorithm. Proceedings of the ACM Symposium on Applied Computing (pp. 534-536), Italy. Imberman, S., & Domanski, B. (2001, August). Finding association rules from quantitative data using data Booleanization. Proceedings of the Seventh Americas Conference on Information Systems, USA. Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999, January). Discovering frequent closed itemsets for association rules. Proceedings of the International Conference on Database Theory (pp. 398-416), Israel. Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. Proceedings of the Conference on Data Mining and Knowledge Discovery (pp. 11-20), USA. Valtchev, P., Missaoui, R., Godin, R., & Meridji, M. (2002). A framework for incremental generation of frequent closed itemsets using Galois (concept) lattice theory. Journal of Experimental and Theoretical Artificial Intelligence, 14(2/3), 115-142. Wang, J., Han, J., & Pei, J. (2003, August). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 236-245), USA. Webb, G. I. (2001, August). Discovering associations with numeric variables. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 383-388), USA. Zaki, M. J., & Gouda, K. (2001). Fast vertical mining using diffsets (Tech. Rep. No. 01-1): Rensselaer Polytechnic Institute, Department of Computer Science. Zaki, M. J., & Hsiao, C. J. (1999). CHARM: An efficient algorithm for closed association rule mining (Tech. Rep. No. 99-10). Rensselaer Polytechnic Institute, Department of Computer Science. Zaki, M. J., & Ogihara, M. (1998, June). Theoretical foundations of association rules. Proceedings of the Third ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 7:1-7:8), USA. Zheng, Z., Kohavi, R., & Mason, L. (2001, August). Real world performance of association rule algorithms. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 401-406), USA.
Closed-Itemset Incremental-Mining Problem
KEY TERMS
Association Rule: A pair of frequent itemsets (A, B), where the ratio between the support of A∪B and the support of A is greater than a predefined threshold, denoted minconf.
Closure Operator: Let S be a set and c: ℘(S) → ℘(S); c is a closure operator on S if, for all X, Y ⊆ S, c satisfies the following properties: (1) extension, X ⊆ c(X); (2) monotonicity, if X ⊆ Y, then c(X) ⊆ c(Y); (3) idempotency, c(c(X)) = c(X). Note that s∘t and t∘s are closure operators when s and t are the mappings in a Galois connection.
Concept: In the Galois connection of the (T, I, D) context, a concept is a pair (X, Y), X ⊆ T, Y ⊆ I, that satisfies s(X) = Y and t(Y) = X. X is called the extent and Y the intent of the concept (X, Y).
Context: A triple (T, I, D) where T and I are sets and D ⊆ T×I. The elements of T are called objects, and the elements of I are called attributes. For any t ∈ T and i ∈ I, we write tDi when t is related to i, that is, (t, i) ∈ D.
Frequent Itemset: An itemset with support higher than a predefined threshold, denoted minsup.
Galois Connection: Let (T, I, D) be a context. Then the mappings s: ℘(T) → ℘(I), s(X) = { i ∈ I | (∀ t ∈ X) tDi }, and t: ℘(I) → ℘(T), t(Y) = { t ∈ T | (∀ i ∈ Y) tDi }, define a Galois connection between ℘(T) and ℘(I), the power sets of T and I, respectively.
Itemset: A set of items in a Boolean database D, I = {i1, i2, …, in}.
Itemset Support: The ratio between the number of transactions in D comprising all the items in I and the total number of transactions in D (support(I) = |{Ti ∈ D | (∀ ij ∈ I) ij ∈ Ti}| / |D|).
Pseudo-Intent: The set X is a pseudo-intent if X ≠ c(X), where c is a closure operator, and for all pseudo-intents Q ⊂ X, c(Q) ⊆ X.
Cluster Analysis in Fitting Mixtures of Curves

Tom Burr, Los Alamos National Laboratory, USA
INTRODUCTION One data mining activity is cluster analysis, of which there are several types. One type deserving special attention is clustering that arises due to a mixture of curves. A mixture distribution is a combination of two or more distributions. For example, a bimodal distribution could be a mix with 30% of the values generated from one unimodal distribution and 70% of the values generated from a second unimodal distribution. The special type of mixture we consider here is a mixture of curves in a two-dimensional scatter plot. Imagine a collection of hundreds or thousands of scatter plots, each containing a few hundred points, including background noise, but also containing from zero to four or five bands of points, each having a curved shape. In a recent application (Burr et al., 2001), each curved band of points was a potential thunderstorm event (see Figure 1), as observed from a distant satellite, and the goal was to cluster the points into groups associated with thunderstorm events. Each curve has its own shape, length, and location, with varying degrees of curve overlap, point density, and noise magnitude. The scatter plots of points from curves having small noise resemble a smooth curve with very little vertical variation from the curve, but there can be a wide range in noise magnitude so that some events have large vertical variation from the center of the band. In this context, each curve is a cluster and the challenge is to use only the observations to estimate how many curves comprise the mixture, plus their shapes and locations. To achieve that goal, the human eye could train a classifier by providing cluster labels to all points in example scatter plots. Each point either would belong to a curved region or to a catch-all noise category, and a specialized cluster analysis would be used to develop an approach for labeling (clustering) the points generated according to the same mechanism in future scatter plots.
BACKGROUND Two key features that distinguish various types of clustering approaches are the assumed mechanism for how the data is generated and the dimension of the data. The
data-generation mechanism includes deterministic and stochastic components and often involves deterministic mean shifts between clusters in high dimensions. But there are other settings for cluster analysis. The particular one discussed here (see Figure 1) is a mixture of curves, where any notion of a cluster mean would be quite different from that in more typical clustering applications. Furthermore, although finding clusters in a two-dimensional scatter plot seems less challenging than in higher-dimensions (the trained human eye is likely to perform as well as any machine-automated method, although the eye would be slower), complications include overlapping clusters; varying noise magnitude; varying feature and noise and density; varying feature shape, locations, and length; and varying types of noise (scene-wide and event-specific). Any one of these complications would justify treating the fitting of curve mixtures as an important special case of cluster analysis. Although as in pattern recognition, the following methods discussed require training scatter plots with points labeled according to their cluster memberships, we regard this as cluster analysis rather than pattern recognition, because all scatter plots have from zero to four or five clusters whose shape, length, location, and extent of overlap with other clusters vary among scatter plots. The training data can be used both to train clustering methods and then to judge their quality. Fitting mixtures of curves is an important special case that has received relatively little attention to date. Fitting mixtures of probability distributions dates to Titterington et al. (1985), and several model-based clustering schemes have been developed (Banfield & Raftery, 1993; Bensmail et al., 1997; Dasgupta & Raftery, 1998), along with associated theory (Leroux, 1992). However, these models assume that the mixture is a mixture of probability distributions (often Gaussian, which can be long and thin, ellipsoidal, or more circular) rather than curves. More recently, methods for mixtures of curves have been introduced, including a mixture of principal curves model (Stanford & Raftery, 2000), a mixture of regression models (Gaffney & Smyth 2003; Hurn, Justel, & Robert, 2003; Turner, 2000), and mixtures of local regression models (i.e., smooth curves obtained using splines or nonparametric kernel smoothers for example).
Figure 1. Four mixture examples containing (a) one, (b) one, (c) two, and (d) zero thunderstorm events plus background noise. The label “1” is for the first thunderstorm in the scene, “2” for the second, and so forth, and the highest integer label is reserved for the catch-all noise class. Therefore, in (d), because the highest integer is 1, there is no thunderstorm present (the mixture is all noise).
MAIN THRUST Several methods have been proposed for fitting mixtures of curves. In method 1 (Burr et al., 2001), first use density estimation to reject the background noise points such as those labeled as 2 in Figure 1a. For example, each point in the scatter plot has a distance to its kth nearest neighbor, which can be used as a local density estimate (Silverman, 1986) to reject noise points. Next, use a distance measure that favors long, thin clusters (e.g., let the distance between clusters be the minimum distance between any a point in the first cluster and a point in the second cluster), together with standard hierarchical clustering to identify at least the central portion of each cluster. Alternatively, model-based clustering favoring long, thin Gaussian shapes (Banfield & Raftery, 1993) or the fitting straight lines method in Murtagh and Raftery (1984) or Campbell et al. (1997) are effective for finding the central portion of each cluster. A curve fitted to this central portion can be extrapolated and then used to accept other points as members of the cluster. Because hierarchical clustering cannot accommodate overlapping clusters, this method assumes that the central portions of each cluster are non-overlapping. Points away from the central portion from one cluster that lie close to the curve fitted to the central portion of the cluster can overlap with points
from another cluster. The noise points are identified initially as those having low local density (away from the central portion of any cluster) but, during the extrapolation, can be judged to be a cluster member, if they lie near the extrapolated curve. To increase robustness, method 1 can be applied twice, each time using slightly different inputs (such as the decision threshold for the initial noise rejection and the criteria for accepting points into a cluster that are close to the extrapolated region of the cluster’s curve). Then, only clusters that are identified both times are accepted. Method 2 uses the minimized, integrated squared error (ISE, or L 2 distance) (Scott, 2002; Scott & Szewczyk, 2002) and appears to be a good approach for fitting mixture models, including mixtures of regression models, as is our focus here. Qualitatively, the minimum L2 distance method tries to find the largest portion of the data that matches the model. In our context, at each stage, the model is all the points belonging to a single curve plus everything else. Therefore, we first seek cluster 1 having the most points, regard the remaining points as noise, remove the cluster, then repeat the procedure in search of feature 2, and so on, until a stop criterion is reached. It also should be possible to estimate the number of components in the mixture in the first evaluation of the data, but that approach has not yet been attempted. Scott (2002) has shown that in the parametric setting with model f(x|qθ), we ^
estimate θ using θ̂ = arg min_θ ∫ [f(x|θ) − f(x|θ0)]² dx, where the true parameter θ0 is unknown. It follows that a reasonable estimator minimizing the parametric ISE criterion is
θ̂_{L2E} = arg min_θ [ ∫ f(x|θ)² dx − (2/n) ∑_{i=1}^{n} f(xi|θ) ].
This assumes that
the correct parametric family is used; the concept can be extended to include the case in which the assumed parametric form is incorrect in order to achieve robustness.

Method 3 (principal curve clustering with noise) was developed by Stanford and Raftery (2000) to locate principal curves in noisy spatial point process data. Principal curves were introduced by Hastie and Stuetzle (1989). A principal curve is a smooth curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local averaging method. Stanford and Raftery (2000) developed an algorithm that first uses hierarchical principal curve clustering (HPCC, a hierarchical, agglomerative clustering method) and next uses iterative relocation (reassigning points to new clusters) based on the classification estimation-maximization (CEM) algorithm. The probability model includes a principal curve probability model for the feature clusters and a homogeneous Poisson process model for the noise cluster. More specifically, let X denote the set of observations, $x_1, x_2, \ldots, x_n$, and let C be a partition consisting of clusters $C_0, C_1, \ldots, C_K$, where cluster $C_j$ contains $n_j$ points. The noise cluster is $C_0$. Feature points are assumed to be distributed uniformly along the true underlying feature, so that their projections onto the feature's principal curve are randomly drawn from a uniform $U(0, \nu_j)$ distribution, where $\nu_j$ is the length of the jth curve. An approximation to the probability for 0, 1, …, 5 clusters is available from the Bayesian Information Criterion (BIC), defined as

    BIC = 2 \log L(X \mid \theta) - M \log(n),

where L is the likelihood of the data X and M is the number of fitted parameters, so $M = K(DF + 2) + K + 1$: for each of the K features, we fit two parameters ($\nu_j$ and $\sigma_j$, defined below) and a curve having DF degrees of freedom; there are K mixing proportions ($\pi_j$, defined below); and the estimate of the scene area is used to estimate the noise density. The likelihood L satisfies

    L(X \mid \theta) = \prod_{i=1}^{n} L(x_i \mid \theta), \qquad L(x_i \mid \theta) = \sum_{j=0}^{K} \pi_j L(x_i \mid \theta, x_i \in C_j),

which is the mixture likelihood ($\pi_j$ is the probability that point i belongs to feature j), with

    L(x_i \mid \theta, x_i \in C_j) = \frac{1}{\nu_j} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( \frac{-\| x_i - f(\lambda_{ij}) \|^2}{2\sigma_j^2} \right)

for the feature clusters ($\| x_i - f(\lambda_{ij}) \|$ is the Euclidean distance from $x_i$ to its projection point $f(\lambda_{ij})$ on curve j) and $L(x_i \mid \theta, x_i \in C_j) = 1/\mathrm{Area}$ for the noise cluster. Space will not permit a complete description of the HPCC-CEM method, but briefly, the HPCC steps are as follows: (1) make an initial estimate of noise points and remove them; (2) form an initial clustering with at least seven points in each cluster; (3) fit a principal curve to each cluster; and (4) calculate a clustering criterion

    V = V_{About} + a V_{Along},

where $V_{About}$ measures the orthogonal distances to the curve (residual error sum of squares) and $V_{Along}$ measures the variance in arc-length distances between projection points on the curve. Minimizing V (the sum is over all clusters) will lead to clusters with regularly spaced points along the curve that are tightly grouped around it. Large values of a will cause the method to avoid clusters with gaps, and small values of a favor thinner clusters. Clustering (merging clusters) continues until V stops decreasing. The flexibility provided by such a clustering criterion (avoid or allow gaps and avoid or favor thin clusters) is potentially extremely useful, and method 3 is currently the only published method to include it.

Method 4 (Turner, 2000) uses the well-known EM (estimation-maximization) algorithm (Dempster, Laird & Rubin, 1977) to handle a mixture of regression models. Therefore, it is similar to method 3 in that a mixture model is specified, but it differs in that the curve is fit using parametric regression rather than principal curves, so

    L(x_i \mid \theta, x_i \in C_j) = f_{ij} = \frac{1}{\sigma_j} \phi\!\left( \frac{y_i - x_i \beta_j}{\sigma_j} \right),

where $\phi$ is the standard Gaussian density. Also, Turner's implementation did not introduce a clustering criterion, but it did attempt to estimate the number of clusters as follows. Introduce an indicator variable $z_i$ of which component of the mixture generated observation $y_i$ and iteratively maximize

    Q = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \ln(f_{ik})

with respect to $\theta$, where $\theta$ is the complete parameter set ($\beta_j$, $\sigma_j$, and $\pi_j$ for each class). The $\gamma_{ik}$ satisfy $\gamma_{ik} = \pi_k f_{ik} / \sum_{k=1}^{K} \pi_k f_{ik}$. Then Q is maximized with respect to $\beta_k$ by weighted regression of $y_i$ on $x_1, \ldots, x_n$ with weights $\gamma_{ik}$, and each $\sigma_k^2$ is given by $\sigma_k^2 = \sum_i \gamma_{ik} (y_i - x_i \beta_k)^2 / \sum_i \gamma_{ik}$. In practice, the components of $\theta$ come from the value of $\theta$ in the previous iteration, and $\pi_k = (1/n) \sum_{i=1}^{n} \gamma_{ik}$. A difficulty in choosing the number of components in the mixture, each with unknown mixing probability, is that the likelihood ratio statistic has an unknown distribution. Therefore, Turner (2000) implemented a bootstrap strategy to choose between 1 and 2 components, between 2 and 3, and so forth. The strategy to choose between K and K + 1 components is to (a) calculate the log-likelihood ratio statistic Q for a model having K and for a model having K + 1 components; (b) simulate data from the fitted K-component model; (c) fit the K and K + 1 component models to the simulated data and compute the corresponding log-likelihood ratio statistic Q*; and (d) compute the p-value for Q as $p = (1/n) \sum_{i=1}^{n} I(Q \ge Q^*_i)$, where the indicator $I(Q \ge Q^*_i) = 1$ if $Q \ge Q^*_i$ and 0 otherwise.

To summarize, method 1 avoids the mathematics of mixture fitting and seeks high-density regions to use as a basis for extrapolation. It also checks the robustness of its solution by repeating the fitting using different input parameters and comparing results. Method 2 uses a likelihood for one cluster and everything else, and then removes the highest-density cluster and repeats the procedure. Methods 3 and 4 both include formal mixture fitting (known to be difficult), with method 3 assuming the curve is well modeled as a principal curve and method 4 assuming the curve is well fit using parametric regression. Only method 3 allows the user to favor or penalize clusters with gaps. All methods can be tuned to accept only thin (i.e., small residual variance) clusters.
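To make the E- and M-step updates quoted above concrete, the following sketch implements a basic EM loop for a mixture of simple linear regressions in Python with NumPy and SciPy. It is a minimal illustration of the update equations attributed to Turner (2000), not the author's implementation: the function name, the synthetic two-line data, and the restriction to straight-line components are assumptions made here, and the bootstrap model-selection step is omitted.

    import numpy as np
    from scipy.stats import norm

    def em_mixture_of_regressions(x, y, K, n_iter=100, seed=0):
        """EM for a K-component mixture of simple linear regressions."""
        rng = np.random.default_rng(seed)
        n = len(y)
        X = np.column_stack([np.ones(n), x])     # design matrix with intercept
        beta = rng.normal(size=(K, 2))           # initial regression coefficients
        sigma = np.full(K, y.std() + 1e-6)       # initial residual scales
        pi = np.full(K, 1.0 / K)                 # initial mixing proportions
        for _ in range(n_iter):
            # E-step: responsibilities gamma_ik = pi_k f_ik / sum_k pi_k f_ik
            f = np.stack([norm.pdf(y, loc=X @ beta[k], scale=sigma[k])
                          for k in range(K)], axis=1)
            gamma = f * pi
            gamma /= gamma.sum(axis=1, keepdims=True) + 1e-300
            # M-step: weighted regression for each beta_k, then sigma_k^2 and pi_k
            for k in range(K):
                w = gamma[:, k]
                Xw = X * w[:, None]
                beta[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
                resid = y - X @ beta[k]
                sigma[k] = np.sqrt((w * resid**2).sum() / w.sum()) + 1e-9
            pi = gamma.mean(axis=0)              # pi_k = (1/n) sum_i gamma_ik
        return beta, sigma, pi

    # Toy usage (assumed data): two noisy lines crossing in the plane.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)
    y = np.where(rng.random(200) < 0.5, 1.0 + 2.0 * x, 12.0 - 1.0 * x)
    y = y + rng.normal(0, 0.5, 200)
    beta, sigma, pi = em_mixture_of_regressions(x, y, K=2)
    print(beta, sigma, pi)

The weighted least-squares solve plays the role of the "weighted regression of y on x with weights gamma_ik" described in the text; responsibilities, residual variances, and mixing proportions follow the quoted update formulas.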
FUTURE TRENDS

Although we focus here on curves in two dimensions, clearly there are analogous features in higher dimensions, such as principal curves in higher dimensions. Fitting mixtures in two dimensions is fairly straightforward, but performance is rarely as good as desired. More choices for probability distributions are needed; for example, the noise in Burr et al. (2001) was a non-homogeneous Poisson process, which might be better handled with adaptively chosen regions of the scene. Also, very little has been published regarding the measure of difficulty of a curve-mixture problem, although Hurn, Justel, and Robert (2003) provided a numerical Bayesian approach for mixtures involving switching regression. The term switching regression describes the situation in which the likelihood is nearly unchanged when some of the cluster labels are switched. For example, imagine that clusters 1 and 2 were slightly closer in Figure 1c, and then consider changes to the likelihood if we switched some 2 and 1 labels. Clearly, if curves are not well separated, we should not expect high clustering and classification success. In the case where groups are separated by mean shifts, it is clear how to define group separation. In the case where groups are curves that overlap to varying degrees, one could envision a few options for defining group separation. For example, the minimum distance between a point in one cluster and a point in another cluster could define the cluster separation and, therefore, also the measure of difficulty. Depending on how we define the optimum, it is possible that performance can approach the theoretical optimum. We define performance first by determining whether a cluster was found, and then by reporting the false positive and false negative rates for each found cluster. Another area of valuable research would be to accommodate mixtures of local regression models (i.e., smooth curves obtained using splines or nonparametric kernel smoothers) (Green & Silverman, 1994). Because local curve smoothers such as splines are extremely successful in the case where the mixture is known to contain only one curve, it is likely that the flexibility provided by local regression (parametric or nonparametric) would be desirable in the broader context of fitting a mixture of curves. Existing software is fairly accessible for the methods described, assuming users have access to a statistical programming language such as S-PLUS (2003). Executable code that accommodates a user-specified input format would, of course, also be a welcome future contribution, but it is now reasonable to empirically compare each method in each application of fitting mixtures of curves.

CONCLUSION

Fitting mixtures of curves with noise in two dimensions is a specialized type of cluster analysis. It is a manageable but fairly challenging and unique cluster analysis task. Four options have been described briefly. Example performance results for these methods, all applied to the same data for one application, are available in Burr et al. (2001). Results for methods 2, 3, and 4 are also available in their respective references (on different data sets), and results for a numerical Bayesian approach using a mixture of regressions, with attention to the case having nearly overlapping clusters, have also been published recently (Hurn, Justel & Robert, 2003).
REFERENCES Banfield, J., & Raftery, A. (1993). Model-based gaussian and non-gaussian clustering. Biometrics, 49, 803-821. Bensmail, H., Celeux, G., Raftery, A., & Robert, C. (1997). Inference in model-based cluster analysis, Statistics and Computing, 7, 1-10. Burr, T., Jacobson, A., & Mielke, A. (2001). Identifying storms in noisy radio frequency data via satellite: An application of density estimation and cluster analysis. Proceedings of US Army Conference on Applied Statistics, Santa Fe, New Mexico. Campbell, J., Fraley, C., Murtagh, F., & Raftery, A. (1997). Linear flaw detection in woven textiles using model-based clustering. Pattern Recognition Letters, 18, 1539-1548. Dasgupta, A., & Raftery, A. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of American Statistical Association, 93(441), 294-302.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38. Gaffney, S., & Smyth, P. (2003). Curve clustering with random effects regression mixtures. Proceedings of the 9th International Conference on AI and Statistics, Florida. Green, P., & Silverman, B. (1994). Nonparametric regression and generalized linear models. London: Chapman and Hall. Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of American Statistical Association, 84, 502-516.
Hurn, M., Justel, A., & Robert, C. (2003). Estimating mixtures of regressions. Journal of Computational and Graphical Statistics, 12, 55-74. Leroux, B. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics, 20, 1350-1360. Murtagh, F., & Raftery, A. (1984). Fitting straight lines to point patterns. Pattern Recognition, 17, 479-483. Scott, D. (2002). Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3), 274-285. Scott, D., & Szewczyk, W. (2002). From kernels to mixtures. Technometrics, 43(3), 323-335. Silverman, B. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall. S-PLUS Statistical Programming Language. (2003). Seattle, WA: Insightful Corp. Stanford, D., & Raftery, A. (2000). Finding curvilinear features in spatial point patterns: Principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6), 601-609. Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. New York: Wiley. Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Applied Statistics, 49(3), 371-384.
KEY TERMS Bayesian Information Criterion: An approximation to the Bayes Factor, which can be used to estimate the Bayesian posterior probability of a specified model. Bootstrap: A resampling scheme in which surrogate data is generated by resampling the original data or sampling from a model that was fit to the original data.
Cluster Analysis: Dividing objects into groups using varying assumptions regarding the number of groups, and the deterministic and stochastic mechanisms that generate the observed values. Estimation-Maximization Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data. Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability pi and Σpi =1. Poisson Process: A stochastic process for generating observations in which the number of observations in a region (a region in space or time, for example) is distributed as a Poisson random variable. Principal Curve: A smooth, curvilinear summary of p-dimensional data. It is a nonlinear generalization of the first principal component line that uses a local average of p-dimensional data. Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set. Probability Density Function Estimate: An estimate of the probability density function. One example is the histogram for densities that depend on one variable (or multivariate histograms for multivariate densities). However, the histogram has known deficiencies involving the arbitrary choice of bin width and locations. Therefore, the preferred density function estimator, which is a smoother estimator that uses local weighted sums with weights determined by a smoothing parameter, is free from bin width and location artifacts. Scatter Plot: One of the most common types of plots, also known as an x-y plot, in which the first component of a two-dimensional observation is displayed in the horizontal dimension and the second component is displayed in the vertical dimension.
Clustering Analysis and Algorithms Xiangji Huang York University, Canada
INTRODUCTION Clustering is the process of grouping a collection of objects (usually represented as points in a multidimensional space) into classes of similar objects. Cluster analysis is a very important tool in data analysis. It is a set of methodologies for automatic classification of a collection of patterns into clusters based on similarity. Intuitively, patterns within the same cluster are more similar to each other than patterns belonging to a different cluster. It is important to understand the difference between clustering (unsupervised classification) and supervised classification. Cluster analysis has wide applications in data mining, information retrieval, biology, medicine, marketing, and image segmentation. With the help of clustering algorithms, a user is able to understand natural clusters or structures underlying a data set. For example, clustering can help marketers discover distinct groups and characterize customer groups based on purchasing patterns in business. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Typical pattern clustering activity involves the following steps: (1) pattern representation (including feature extraction and/or selection), (2) definition of a pattern proximity measure appropriate to the data domain, (3) clustering, (4) data abstraction, and (5) assessment of output.
BACKGROUND General references regarding clustering include Hartigan (1975), Jain and Dubes (1988), Kaufman and Rousseeuw (1990), Mirkin (1996), Jain, Murty, and Flynn (1999), and Ghosh (2002). A good introduction to contemporary data-mining clustering techniques can be found in Han and Kamber (2001). Early clustering methods before the ’90s, such as k-means (Hartigan, 1975) and PAM and CLARA (Kaufman & Rousseeuw, 1990), are generally suitable for small data sets. CLARANS (Ng & Han, 1994) made improvements to CLARA in quality and scalability based on randomized search. After CLARANS, many algorithms were proposed to deal with
large data sets, such as BIRCH (Zhang, Ramakrishnan, & Livny, 1996), CURE (Guha, Rastogi, & Shim, 1998), Squashing (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999) and Data Bubbles (Breuning, Kriegel, Kröger, & Sander, 2001).
MAIN THRUST There exist a large number of clustering algorithms in the literature. In general, major clustering algorithms can be classified into the following categories.
Hierarchical Clustering Hierarchical clustering builds a cluster hierarchy or a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows the exploration of data on different levels of granularity. Hierarchical clustering can be further classified into agglomerative (bottomup) and divisive (top-down) hierarchical clustering. An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion (for example, the requested number of k clusters) is achieved. Advantages of hierarchical clustering include (a) embedded flexibility regarding the level of granularity, (b) ease of handling of any forms of similarity or distance, and (c) applicability to any attribute types. Disadvantages of hierarchical clustering are (a) vagueness of termination criteria, and (b) the fact that most hierarchical algorithms do not revisit once-constructed clusters with the purpose of their improvement. One of the most striking developments in hierarchical clustering is the algorithm BIRCH. BIRCH creates a height-balanced tree of nodes that summarize its zero, first, and second moments. Guha et al. (1998) introduced the hierarchical agglomerative clustering algorithm called CURE (Clustering Using Representatives). This algorithm has a number of novel features of general significance. It takes special care with outliers and with
label assignment. Although CURE works with numerical attributes (particularly low-dimensional spatial data), the algorithm ROCK, developed by the same researchers (Guha, Rastogi, & Shim, 1999) targets hierarchical agglomerative clustering for categorical attributes.
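The agglomerative strategy described above (start from singleton clusters and repeatedly merge the closest ones) can be tried directly with standard tooling. The short sketch below uses SciPy's hierarchical clustering as an illustration; it is not the BIRCH, CURE, or ROCK algorithms themselves, and the synthetic data, the Ward linkage, and the choice of four clusters are assumptions made for the demonstration.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Small synthetic data set: four Gaussian blobs in the plane.
    rng = np.random.default_rng(0)
    centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
    X = np.vstack([c + rng.normal(0, 0.4, size=(25, 2)) for c in centers])

    # Agglomerative (bottom-up) clustering: merge the two closest clusters repeatedly.
    Z = linkage(X, method="ward")                     # the full merge tree (dendrogram)
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters
    print(np.bincount(labels))

Cutting the merge tree at different levels yields clusterings at different levels of granularity, which is the flexibility attributed to hierarchical methods above.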
Partitioning Clustering

Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize a partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are similar, whereas the objects of different clusters are dissimilar in terms of the database attributes. Partitioning clustering algorithms have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitioning algorithm is the choice of the number of desired output clusters. A seminal paper (Dubes, 1987) provides guidance on this key design decision. The partitioning techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all the runs is used as the output clustering. The most well-known and commonly used partitioning algorithms are k-means, k-medoids, and their variations.

K-Means Method

The k-means algorithm (Hartigan, 1975) is by far the most popular clustering tool used in scientific and industrial applications. It proceeds as follows. First, it randomly selects k objects, each of which initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as

    E = \sum_{i=1}^{k} \sum_{p \in C_i} | p - m_i |^2,

where E is the sum of square-error for all objects in the database, p is a point in space representing a given object, and $m_i$ is the mean of cluster $C_i$ (both p and $m_i$ are multidimensional).

K-Medoids Method

In the k-medoids algorithm, a cluster is represented by one of its points. Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. The basic strategy of the k-medoids clustering algorithms is to find k clusters in n objects by first arbitrarily finding a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids by one of the nonmedoids as long as the quality of the resulting clustering is improved. This quality is estimated by using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. It is important to understand that k-means is a greedy algorithm, but k-medoids is not.
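A compact NumPy sketch of the k-means procedure just described is given below; it reports the squared-error criterion E for the final partition. This is a minimal illustration (random initial centres, Euclidean distance, fixed iteration cap) rather than a production implementation, and the sample data and parameter values are assumed for the example.

    import numpy as np

    def kmeans(X, k, n_iter=50, seed=0):
        """Plain k-means: assign to nearest mean, recompute means, repeat."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: nearest centre by Euclidean distance.
            d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update step: new mean of each cluster (keep old centre if empty).
            new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centres[j] for j in range(k)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        # Final assignment and squared-error criterion E = sum |p - m_i|^2.
        labels = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)
        E = ((X - centres[labels]) ** 2).sum()
        return labels, centres, E

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.5, size=(40, 2)) for m in ([0, 0], [4, 4], [0, 4])])
    labels, centres, E = kmeans(X, k=3)
    print(centres, E)

As noted in the text, results depend on the random starting state, so in practice the routine is run several times and the partition with the smallest E is kept.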
Density-Based Clustering

Heuristic clustering algorithms (such as partitioning methods) work well for finding spherical-shaped clusters in databases that are not very large. To find clusters with complex shapes and to cluster very large data sets, partitioning-based algorithms need to be extended. Most partitioning-based algorithms cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. To discover clusters with arbitrary shape, density-based clustering algorithms have been developed. These algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold. That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is a typical density-based algorithm that grows clusters according to a density threshold. OPTICS (Ankerst, Breuning, Kriegel, & Sander, 1999) is a density-based algorithm that computes an augmented cluster ordering for automatic and interactive cluster analysis. DENCLUE (Hinneburg & Keim, 1998) is another clustering algorithm based on a set of density distribution functions. It differs from partition-based algorithms not only by accepting arbitrary shape clusters but also by how it handles noise.
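The neighborhood-density idea (a radius and a minimum number of points within it) maps directly onto scikit-learn's DBSCAN implementation. The sketch below is only a usage illustration under assumed parameter values (eps and min_samples must normally be tuned to the data); it is not taken from the cited papers.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # Two dense blobs plus sparse background noise.
    blob1 = rng.normal([0, 0], 0.3, size=(80, 2))
    blob2 = rng.normal([3, 3], 0.3, size=(80, 2))
    noise = rng.uniform(-2, 5, size=(20, 2))
    X = np.vstack([blob1, blob2, noise])

    # eps = neighborhood radius, min_samples = minimum number of points in it.
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
    print("points labeled as noise:", int(np.sum(labels == -1)))

Points whose neighborhoods never reach the density threshold receive the label -1, which is how the noise-filtering behavior described above shows up in practice.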
Grid-Based Clustering Grid-based algorithms quantize the object space into a finite number of cells that form a grid structure on which all the operations for clustering are performed. To some extent, the grid-based methodology reflects a technical point of view. The category is eclectic: It contains both partitioning and hierarchical algorithms. The main advantage of this method is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space. Some typical examples of the grid-based algorithms include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid and density-based approach for clustering in a high-dimensional data space. The algorithm STING (Wang, W., Wang, J., & Munta, 1997) works with numerical attributes (spatial data) and is designed to facilitate region-oriented queries. In doing so, STING constructs data summaries in a way similar to BIRCH. However, it assembles statistics in a hierarchical tree of nodes that are grid-cells. The algorithm WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 1998) works with numerical attributes and has an advanced multiresolution. It is also known for some outstanding properties such as (a) a high quality of clusters, (b) the ability to work well in relatively high-dimensional spatial data, (c) the successful handling of outliers, and (d) O(N) complexity. WaveCluster, which applies wavelet transforms to filter the data, is based on ideas of signal processing. The algorithm CLIQUE (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998) for numerical attributes is fundamental in subspace clustering. It combines the ideas of density-based clustering, grid-based clustering, and the induction through dimensions similar to the Apriori algorithm in association rule learning.
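As a toy illustration of the shared first step of grid-based methods (quantize the object space into cells and then work with per-cell statistics rather than individual points), the snippet below bins 2-D points into a fixed grid with NumPy and reports the dense cells. It only mimics that quantization step, not STING, WaveCluster, or CLIQUE themselves; the grid resolution and the density threshold are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([1, 1], 0.2, (200, 2)), rng.uniform(0, 4, (100, 2))])

    # Quantize the object space into a 20 x 20 grid and count points per cell.
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1],
                                            bins=20, range=[[0, 4], [0, 4]])

    # Cells whose count exceeds a threshold act as the "dense" units that
    # grid- and density-based approaches would then merge into clusters.
    dense_cells = np.argwhere(counts >= 10)
    print("number of dense cells:", len(dense_cells))

Because all further work is done on the fixed number of cells, the cost of later steps depends on the grid size rather than on the number of data objects, which is the speed advantage noted above.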
Model-Based Clustering Model-based clustering algorithms attempt to optimize the fit between the given data and some mathematical model. For example, clustering algorithms based on probability models offer a principled alternative to heuristic-based algorithms. In particular, the model-based
approach assumes that the data are generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (Banfield & Raftery, 1993). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering algorithm become probabilistic model choice problems (Dasgupta & Raftery, 1998). This provides a great advantage over heuristic clustering algorithms, for which no established method to determine the number of clusters or the best clustering algorithm exists. Model-based clustering follows two major approaches: a probabilistic approach or a neural network approach.
Probabilistic Approach In the probabilistic approach, data are considered to be samples independently drawn from a mixture model of several probability distributions (McLachlan & Basford, 1988). The main assumption is that data points are generated by (1) randomly picking a model j with probability τ j and (2) drawing a point x from a corresponding distribution. The area around the mean of each distribution constitutes a natural cluster. We associate the cluster with the corresponding distribution’s parameters such as mean, variance, and so forth. Each data point carries not only its observable attributes, but also a hidden cluster ID. Each point x is assumed to belong to one and only one cluster. Probabilistic clustering has some important features. For example, it (a) can be modified to handle records of complex structure, (b) can be stopped and resumed with consecutive batches of data, and (c) results in easily interpretable cluster system. Because the mixture model has a clear probabilistic foundation, the determination of the most suitable number of clusters k becomes a more tractable task. From a datamining perspective, an excessive parameter set causes overfitting, but from a probabilistic perspective, the number of parameters can be addressed within the Bayesian framework. An important property of probabilistic clustering is that the mixture model can be naturally generalized to clustering heterogeneous data. However, statistical mixture models often require a quadratic space, and the EM algorithm converges relatively slowly, making scalability an issue.
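Because the probabilistic approach turns "how many clusters?" into a model-choice problem, a common recipe is to fit Gaussian mixtures for several candidate values of k and compare an information criterion such as BIC. The sketch below does this with scikit-learn; it is an illustrative recipe rather than the specific procedure of the cited works, and the synthetic data and candidate range are assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(m, 0.4, size=(60, 2)) for m in ([0, 0], [3, 0], [0, 3])])

    # Fit mixtures with 1..6 components and keep the one with the lowest BIC.
    models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 7)]
    bics = [m.bic(X) for m in models]
    best = models[int(np.argmin(bics))]
    print("chosen number of components:", best.n_components)
    labels = best.predict(X)   # hard cluster IDs from the fitted mixture

The mixture parameters (means, covariances, mixing weights) describe the clusters, and the hidden component ID plays the role of the cluster label discussed in the text.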
Artificial Neural Networks Approach

Artificial neural networks (ANNs) (Hertz, Krogh, & Palmer, 1991) are inspired by biological neural networks. ANNs have been used extensively over the past four decades for both classification and clustering (Jain & Mao, 1994). The ANN approach to clustering has two prominent methods: competitive learning and self-organizing feature maps. Both involve competing neural units. Some of the features of ANNs that are important in pattern clustering are that they (a) process numerical vectors and so require patterns to be represented with quantitative features only, (b) are inherently parallel and distributed processing architectures, and (c) may learn their interconnection weights adaptively. The neural network approach to clustering tries to emulate actual brain processing. Further research is needed to make it readily applicable to very large databases, due to long processing times and the intricacies of complex data.
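A bare-bones version of the competitive-learning idea (units compete for each input and only the winner adapts) can be written in a few lines. The sketch below is a simplified winner-take-all update, not a full self-organizing map; the learning rate, the number of units, and the data are assumptions made for illustration.

    import numpy as np

    def competitive_learning(X, n_units=3, lr=0.1, n_epochs=20, seed=0):
        """Winner-take-all clustering: move the closest unit toward each input."""
        rng = np.random.default_rng(seed)
        units = X[rng.choice(len(X), size=n_units, replace=False)].copy()
        for _ in range(n_epochs):
            for x in X[rng.permutation(len(X))]:
                winner = np.argmin(np.linalg.norm(units - x, axis=1))
                units[winner] += lr * (x - units[winner])   # only the winner learns
        labels = np.argmin(np.linalg.norm(X[:, None] - units[None], axis=2), axis=1)
        return units, labels

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [2, 2], [0, 2])])
    units, labels = competitive_learning(X)
    print(units)

Each unit's weight vector ends up near the centre of one group of inputs, so the units play the role of cluster prototypes; self-organizing maps extend this by also updating the winner's neighbors on a low-dimensional grid.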
Other Clustering Techniques

Traditionally, each pattern belongs to one and only one cluster. Hence, the clusters resulting from this kind of clustering are disjoint. Fuzzy clustering extends this notion to associate each object with every cluster using a membership function (Zadeh, 1965). Another approach is constraint-based clustering, introduced by Tung, Ng, Lakshmanan, and Han (2001). This approach has important applications in clustering two-dimensional spatial data in the presence of obstacles. Another approach used in clustering analysis is the Genetic Algorithm (Goldbery, 1989). An example is the GGA (Genetically Guided Algorithm) for fuzzy and hard k-means (Hall, Ozyurt, & Bezdek, 1999).
FUTURE TRENDS

Choosing a clustering algorithm for a particular problem can be a daunting task. One major challenge in using a clustering algorithm on a specific problem lies not in performing the clustering itself, but rather in choosing the algorithm and the values of the associated parameters. Clustering algorithms also face problems of scalability, both in terms of computing time and memory requirements. Despite the ongoing exponential increases in the power of computers, scalability remains a major issue in many clustering applications. In commercial data-mining applications, the quantity of the data to be clustered can far exceed the main memory capacity of the computer, making both time and space efficiency critical; this issue is addressed by clustering systems in the database community such as BIRCH. This leads to the following set of continuing research topics in clustering, and in particular in data mining: (1) extend clustering algorithms to handle very large databases, such as real-life terabyte data sets; (2) gracefully eliminate the need for a priori assumptions about the data; (3) use good sampling and data compression methods to improve efficiency and speed up clustering algorithms; and (4) cluster extremely large and high-dimensional data.
CONCLUSION

This article describes five major approaches to clustering, in addition to some other clustering algorithms. Each has both positive and negative aspects, and each is suitable for different types of data and different assumptions about the cluster structure of the input. Clustering is a process of grouping data items based on a measure of similarity. Clustering is also a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult, because a single algorithm or approach is not adequate to solve every clustering problem. However, clustering is an interesting, useful, and challenging problem. It has great potential in applications such as object representation, image segmentation, information filtering and retrieval, and analyzing gene expression data.
REFERENCES Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of the ACM SIGMOD Conference (pp. 94-105), USA. Ankerst, M., Breuning, M., Kriegel, H., & Sander, J. (1999). OPTICS: Ordering points to identify clustering structure. Proceedings of the ACM SIGMOD Conference (pp. 4960), USA. Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821. Breuning, M., Kriegel, H., Kröger, H., & Sander, J. (2001). Data bubbles: Quality preserving performance boosting for hierarchical clustering. Proceedings of the ACM SIGMOD Conference, USA. Dasgupta, A., & Raftery, A. E. (1998). Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93, 294-302. Dubes, R. C. (1987). How many clusters are best? An experiment. Pattern Recognition, 20, 645-663.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. Proceedings of the ACM SIGKDD Conference (pp. 6-15), USA.
Lu, S. Y., & Fu, K. S. (1978). A sentence-to-sentence clustering procedure for pattern analysis. IEEE Transactions on System Man Cybern., 8, 381-389.
Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the ACM SIGKDD Conference (pp. 226-231), USA.
McLachlan, G. J., & Basford, K. D. (1988). Mixture models: Inference and application to clustering. New York: Dekker.
Ghosh, J. (2002). Scalable clustering methods for data mining. In N. Ye (Ed.), Handbook of data mining Lawrence Erlbaum. Goldbery, D. (1989). Genetic algorithm in search, optimization and machine learning. Addison-Wesley. Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD Conference (pp. 73-84), USA. Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. Proceedings of the 15th International Conference on Data Engineering (pp. 512-521), Australia. Hall, L. O., Ozyurt, B., & Bezdek, J. C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann. Hartigan, J. (1975). Clustering algorithms. New York: Wiley. Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison Wesley Longman. Hinneburg, A., & Keim, D. (1998). An efficient approach to clustering large multimedia databases with noise. Proceedings of the ACM SIGMOD Conference (pp. 58-65), USA. Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall. Jain, A., & Mao, J. (1994). Neural networks and pattern recognition. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence: Imitating life (pp.194-212). Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. Kaufman, L., & Rousseeuw, P. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.
Mirkin, B. (1996). Mathematic classification and clustering. Kluwer Academic. Ng, R., & Han, J. (1994). Efficient and effective clustering method for spatial data mining. Proceedings of the 20th Conference on Very Large Data Bases (pp. 144-155), Chile. Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of the 24th Conference on Very Large Data Bases (pp. 428-439), USA. Tung, A. K. H., Hou, J., & Han, J. (2001). Spatial clustering in the presence of obstacles. Proceedings of the 17th International Conference on Data Engineering (pp. 359367), Germany. Tung, A. K. H., Ng, R. T., Lakshmanan, L. V. S., & Han, J. (2001). Constraint-based clustering in large databases. Proceedings of the Eighth ICDT, London. Wang, W., Wang, J., & Munta, R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the 23rd Conference on Very Large Data Bases (pp. 186-195), Greece. Zadeh, L. H. (1965). Fuzzy sets. Information Control, 8, 338-353. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference (pp. 103-114), Canada.
KEY TERMS Apriori Algorithm: An efficient association rule mining algorithm developed by Agrawal, in 1993. Apriori employs a breadth-first search and uses a hash tree structure to count candidate item sets efficiently. The algorithm generates candidate item sets of length k from k–1 length item sets. Then, the patterns that have an infrequent subpattern are pruned. Following that, the whole transaction database is scanned to determine frequent item sets among the candidates. For determining
frequent items in a fast manner, the algorithm uses a hash tree to store candidate item sets. Association Rule: A rule in the form of “if this, then that.” It states a statistical correlation between the occurrence of certain attributes in a database. Customer Relationship Management: The process by which companies manage their interactions with customers. Data Mining: The process of efficient discovery of actionable and valuable patterns from large databases.
Feature Selection: The process of identifying the most effective subset of the original features to use in data analysis, such as clustering. Overfitting: The effect on data analysis, data mining, and biological learning of training too closely on limited available data and building models that do not generalize well to new unseen data. Supervised Classification: Given a collection of labeled patterns, the problem in supervised classification is to label a newly encountered but unlabeled pattern. Typically, the given labeled patterns are used to learn the descriptions of classes that in turn are used to label a new pattern.
Clustering in the Identification of Space Models Maribel Yasmina Santos University of Minho, Portugal Adriano Moreira University of Minho, Portugal Sofia Carneiro University of Minho, Portugal
INTRODUCTION

Clustering is the process of grouping a set of objects into clusters so that objects within a cluster have high similarity with each other, but are as dissimilar as possible to objects in other clusters. Dissimilarities are measured based on the attribute values describing the objects (Han & Kamber, 2001). Clustering, as a data mining technique (Cios, Pedrycz, & Swiniarski, 1998; Groth, 2000), has been widely used to find groups of customers with similar behavior or groups of items that are bought together, allowing the identification of the clients' profile (Berry & Linoff, 2000). This article presents another use of clustering, namely in the creation of Space Models. Space Models represent divisions of the geographic space in which the several geographic regions are grouped according to their similarities with respect to a specific indicator (values of an attribute). Space Models represent natural divisions of the geographic space based on some geo-referenced data. This article addresses the development of a clustering algorithm for the creation of Space Models – STICH (Space Models Identification Through Hierarchical Clustering). The Space Models identified, integrating several clusters, point out particularities of the analyzed data, namely the exhibition of clusters with outliers, regions whose behavior is strongly different from the other analyzed regions. In the work described in this article, some assumptions were adopted for the creation of Space Models through a clustering process, namely:
• Space Models must be created by looking at the data values available, and no constraints must be imposed for their identification.
• Space Models can include clusters of different shapes and sizes.
• Space Models are independent of specific domain knowledge, like the specification of the final number of clusters.
The following sections, in outline, include: (i) an overview of clustering, its methods and techniques; (ii) the STICH algorithm, its assumptions, its implementation and the results for a sample dataset; (iii) future trends; and (iv) a conclusion with some comments about the proposed algorithm.
BACKGROUND

Clustering (Grabmeier, 2002; Jain, Murty, & Flynn, 1999; Zaït & Messatfa, 1997) is a discovering process (Fayyad & Stolorz, 1997) that identifies homogeneous groups of segments in a dataset. Han & Kamber (2001) state that clustering is a challenging field of research integrating a set of special requirements. Some of the typical requirements of clustering in data mining are:
• Scalability, in order to allow the analysis of large datasets, since clustering on a sample of a database may lead to biased results.
• Discovery of clusters with arbitrary shapes, since clusters can be of any shape. Some existing clustering algorithms identify clusters that tend to be spherical, with similar size and density.
• Minimal domain knowledge, since the clustering results can be quite sensitive to the input parameters, like the number of clusters required by many clustering algorithms.
• Ability to deal with noisy data, avoiding the identification of clusters that were negatively influenced by outliers or erroneous data.
• Insensitivity to the order of input records, since there are clustering algorithms that are influenced by the order in which the available records are analyzed.
Two of the well-known types of clustering algorithms are based on partitioning and hierarchical methods. These methods and some of the corresponding algorithms are presented in the following subsections.
Clustering Methods

Partitioning Methods

A partitioning method constructs a partition of n objects into k clusters where k ≤ n. Given k, the partitioning method creates an initial partitioning and then, using an iterative relocation technique, it attempts to improve the partitioning by moving objects from one cluster to another. The clusters are formed to optimize an objective partitioning criterion, such as the distance between the objects. A good partitioning must aggregate objects such that objects in the same cluster are similar to each other, whereas objects in different clusters are very different (Han & Kamber, 2001). Two of the well-known partitioning clustering algorithms are the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the k-medoid algorithm, where each cluster is represented by one of the objects located near the centre of the cluster.
Hierarchical Methods

Hierarchical methods perform a hierarchical decomposition of a given set of data objects. It can be done bottom-up (Agglomerative Hierarchical methods) or top-down (Divisive Hierarchical methods). Agglomerative methods start with each object forming a separate cluster. Then they perform repeated merges of clusters or groups close to one another until all the groups are merged into one cluster or until some pre-defined threshold is reached (Han & Kamber, 2001). These algorithms are based on the inter-object distances and on finding the nearest neighbor objects. Divisive methods start with all the objects in a single cluster and, in each successive iteration, the clusters are divided into smaller clusters until each cluster contains only one object or a termination condition is verified. The next subsection presents two examples of clustering algorithms associated with partitioning and hierarchical methods respectively.
Clustering Techniques

The K-Means Algorithm

Partitioning-based clustering algorithms such as k-means attempt to break data into a set of k clusters (Karypis, Han, & Kumar, 1999). The k-means algorithm takes as input a parameter k that represents the number of clusters in which the n objects of a dataset will be partitioned. The obtained division tries to maximize the Intracluster similarity (a measurement of the similarity between the objects inside a cluster) and minimize the Intercluster similarity (a measurement of the similarity between different clusters), that is to say, a high similarity between the objects inside a cluster and a low similarity between objects in different clusters. This similarity is measured by looking at the centers of gravity (centroids) of the clusters, which are calculated as the mean value of the objects inside them. Given the input parameter k, the k-means algorithm works as follows (MacQueen, 1967):
1. Randomly select k objects, each of which initially represents the cluster centre or the cluster mean.
2. Assign each of the remaining objects to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean (centroid) of each cluster.
After the first iteration, each cluster is represented by the mean calculated in step 3. This process is repeated until the criterion function converges. The squared-error criterion is often used, which is defined as:

    E = \sum_{i=1}^{k} \sum_{j=1}^{l} \| o_j - m_i \|^2, \quad o_j \in C_i    (1)

where E is the sum of the square-error for the objects in the data set, l is the number of objects in a given cluster, $o_j$ represents an object, and $m_i$ is the mean value of the cluster $C_i$. This criterion intends to make the resulting k clusters as compact and as separate as possible (Han & Kamber, 2001). The k-means algorithm is applied when the mean of a cluster can be obtained, which makes it not suitable for the analysis of categorical attributes1. One of the disadvantages that can be pointed out for this method is the necessity for users to specify k in advance. This method is also not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes (Han & Kamber, 2001). To overcome some of the drawbacks of the k-means algorithm, several improvements and new developments have been made; see, for example, (Likas, Vlassis, & Verbeek, 2003; Estivill-Castro & Yang, 2004).
The K-Nearest Neighbor Algorithm

Hierarchical clustering algorithms generate a set of clusters with a single cluster at the top (integrating all data objects) and single-point clusters at the bottom, in which each cluster is formed by one data object (Karypis, Han, & Kumar, 1999). Agglomerative hierarchical algorithms, like k-nearest neighbor, start with each data object in a separate cluster. Each step of the clustering algorithm merges the k records that are most similar into a single cluster. The k-nearest neighbor algorithm2 uses a graph to represent the links between the several data objects. This graph allows the identification of the k objects most similar to a given data item. Each node of the k-nearest neighbor graph represents a data item. An edge exists between two nodes p and q if q is among the k most similar objects of p, or if p is among the k most similar objects of q (Figure 1).
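The edge rule just stated (connect p and q when either is among the other's k most similar objects) is easy to express directly. The following sketch builds such a symmetric k-nearest-neighbor edge list over a small one-dimensional dataset in the spirit of Figure 1; the concrete values and the choice k = 2 are assumptions made for illustration.

    import numpy as np

    def knn_graph_edges(values, k):
        """Return the set of edges (i, j) of the k-nearest neighbor graph."""
        v = np.asarray(values, dtype=float)
        dist = np.abs(v[:, None] - v[None, :])     # pairwise distances
        np.fill_diagonal(dist, np.inf)             # an object is not its own neighbor
        nn = np.argsort(dist, axis=1)[:, :k]       # indices of the k nearest neighbors
        edges = set()
        for p in range(len(v)):
            for q in nn[p]:                        # edge if q is a k-NN of p (or vice versa)
                edges.add((min(p, int(q)), max(p, int(q))))
        return edges

    data = [25, 38, 45, 50, 65, 72, 87, 99, 124]
    print(sorted(knn_graph_edges(data, k=2)))

Because an edge is added whenever either endpoint names the other as a neighbor, the resulting graph is symmetric even though "being a k-nearest neighbor" is not.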
MAIN THRUST

The Space Models Identification Through Hierarchical Clustering (STICH) algorithm presented in this article is based on the k-nearest neighbor algorithm. However, the principles defined for STICH try to overcome some of the limitations of the k-nearest neighbor approach, namely the necessity to define a value for the input parameter k. This value imposes restrictions on the maximum number of members that a given cluster can have: the number of members of the obtained clusters is k+1 (except in the case of a tie in the neighbors' value). In some applications there are no objective reasons for the choice of a value for k. The k-means algorithm was also analyzed, mainly because of its simplicity, and to allow the comparison of results with STICH.

The Algorithm

STICH uses an iterative process, in which no input parameters are required. Another characteristic of STICH is that it produces several usable Space Models, one at the end of each iteration of the clustering process. As a hierarchical clustering algorithm with an agglomerative approach, STICH starts by assigning each object in the dataset to a different cluster. The clustering process begins with as many clusters as objects in the dataset, and ends with all the objects grouped into the same cluster. The approach undertaken in STICH is as follows (a sketch of a single iteration is given after the list):
1. For the dataset, calculate the Similarity Matrix of the objects, which is the matrix of the Euclidean distances between every pair of objects in the dataset.
2. For each object, identify its minimum distance to the dataset (the value that represents the distance between the object and its 1-nearest neighbor).
3. Calculate the median3 of all the minimum distances identified in the previous step.
4. For each object, identify its k-nearest neighbors, selecting the objects that have a distance value less than or equal to the median calculated in step 3. The number of objects selected, k, may vary from one object to another.
5. For each object, calculate the average c of its k-nearest neighbors.
6. For each object, verify in which clusters it appears as one of the k-nearest neighbors and then assign this object to the cluster in which it appears with the minimum k-nearest neighbor average c. In case of a tie (an object that appears in two clusters with equal average c), the object is assigned to the first cluster in which it appears in the Similarity Matrix.
7. For each new cluster, calculate its centroid.
Figure 1. K-nearest neighbor algorithm: a) Dataset; b) 2-nearest neighbor graph

This process is iteratively repeated until all the objects are grouped into the same cluster. Each iteration of STICH produces a new Space Model, with a decreasing number of clusters, which can be used for different purposes. For each model, a quality metric is calculated after each iteration of the STICH algorithm. This metric, the ModelQuality, is defined as:

    ModelQuality = Intracluster - Intercluster    (2)

and is based on the difference between the Intracluster and Intercluster similarities. The Intracluster indicator is calculated as the sum of all distances between the several objects in a given cluster (l represents the total number of objects in a cluster) and the mean value ($m_i$) of the cluster ($C_i$) in which the object ($o_j$) resides. The total number of clusters identified in each iteration is represented by t. The Intracluster indicator is calculated as follows:

    Intracluster = \sum_{i=1}^{t} \sum_{j=1}^{l} \| o_j - m_i \|, \quad o_j \in C_i    (3)

The Intercluster indicator is calculated as the sum of all distances existing between the centers of all the clusters identified in a given iteration. The Intercluster indicator is calculated as follows:

    Intercluster = \sum_{i=1}^{t} \sum_{j=1,\, j \ne i}^{t} \| m_i - m_j \|    (4)

The ModelQuality metric can be used by the user in the selection of the appropriate Space Model for a given task. However, it is at the minimum value of this metric that we find the STICH outliers model, in case outliers exist in the dataset (Figure 2). In the beginning of the clustering process the Intracluster similarity value is low, because of the high similarity between the objects inside the clusters, and the Intercluster similarity value is high, because of the low similarity between the several clusters. As the process proceeds, the Intracluster value increases and the Intercluster value decreases. The minimum difference between these two values means that the objects are as separate as possible, considering a low number of clusters. After this point, the clustering process forces the aggregation of all the objects into the same cluster (the stop criterion of agglomerative hierarchical clustering methods), within a few more iterations. At this minimum point (Figure 2), reached at some iteration of the clustering process, the resulting Space Model is the one where the outliers are pointed out, that is, where the model isolates in different clusters the regions that are very different from all other regions.

The Implementation

STICH was implemented in Visual Basic for Applications (VBA) and integrated in the Geographic Information System (GIS) ArcView 8.2 using ArcObjects4. For the creation of Space Models, STICH considers two processes: the clustering of geographic regions based on a selected indicator, and the creation of a new space geometry where a new polygon is created for each resulting cluster by dissolving the borders of its members (the regions inside it). By using ArcGIS, both processes can be implemented in the same platform.
8´ 108
Diference
6´ 108
Min
4´ 108
2´ 108
Iteration 2
168
4
6
8
10
Clustering in the Identification of Space Models
cators on a European-wide NUTS III Level) is a research project founded by the European Union through the Information Society Technologies program. STICH is a deliverable of this research project, which contributes to the better understanding of the European Environmental Figure 3. Space model: Analysis of the heavy metals – lead attribute
Quality and Quality of Life, by delivering a tool aimed to generate environmental sustainability indices at NUTS5III level. In the following example, an indicator collected in the EPSILON project is analyzed, namely the concentration in the air of Heavy Metals - Lead (the data is available in Table 1 for the 15 European countries that integrate the EPSILON database). Using the k-means and STICH algorithms, adopting the value k=3 in order to allow the comparison between the clusters achieved by them, the results obtained are systematized in Table 2. Analyzing these results it is possible to see that the clusters generated with the kmeans algorithm, using the implementation available in the Clementine Data Mining System v8.0, are not as homogeneous as the clusters obtained with STICH. The k-means approach integrates the 0.0231413 value, an outlier6 present in the dataset, with values like 0.00197921, leaving out of this cluster (Cluster 1) values that are more proximal to this last one. STICH obtains clusters that are more homogeneous and that separates values that are very different from the others. Figure 3 shows the Space Model created by STICH as result of the clustering process. This model, with three clusters, shows the spatial distribution of the concentration of Lead in the air, for the 15 analyzed countries.
Table 1. Data available in the Epsilon database for the Heavy Metals - Lead attribute Country
Sweden
Finland
Ireland
Denmark
France
Indicator
0.00197921
0.00207107
0.0032313
0.0035931
0.00482136
Country
England
Austria
Luxemburg
Spain
Netherlands
Indicator
0.00512893
0.00560728
0.00638133
0.00797498
0.00879874
Country
Germany
Belgium
Greece
Portugal
Italy
Indicator
0.0091856
0.00931672
0.0114614
0.0124667
0.0231413
Table 2. Results obtained for the Heavy Metals - Lead attributes
k-means
STICH
Cluster 1 =
{0.00197921, 0.0035931, 0.00482136, 0.00512893, 0.00560728, 0.00638133, 0.00797498, 0.00879874, 0.0091856, 0.00931672, 0.0114614, 0.0124667, 0.0231413}
Cluster 2 =
{0.00207107}
Cluster 3 =
{0.0032313}
Cluster 1 =
{0.00197921, 0.00207107, 0.0032313, 0.0035931, 0.00482136, 0.00512893, 0.00560728, 0.00638133}
Cluster 2 =
{0.00797498, 0.00879874, 0.0091856, 0.00931672, 0.0114614, 0.0124667}
Cluster 3 =
{0.0231413}
Advantages of the creation of Space Models include the definition of new space geometries, which can be reused whenever needed, and the certainty that the new divisions of the space were obtained from characteristics present in the data, and not imposed by human constraints as in administrative subdivisions.
FUTURE TRENDS

As already mentioned, clustering has been widely used in several domain applications to find segments in data, like groups of clients with similar behavior. In the location-based services domain, clustering can also be used to find groups of mobile users with the same interests or with similar mobility patterns. In this context, the STICH algorithm can also be used to create Space Models as divisions of the geographic space where each region represents an area with certain attributes useful to characterize the situation of a mobile user. Examples of such regions are urban or rural areas, commercial or leisure zones, or unsafe zones in a town. When a mobile user is detected to be inside one of these regions, the attributes of that region can be used to identify the appropriate services and information to provide to him. These attributes can also be combined with other contextual data, such as the time of the day or the user preferences, to enhance the adaptability of a context-aware application (Salber, Dey, & Abowd, 1999).
CONCLUSION

This article presented STICH, a hierarchical clustering algorithm that allows the identification of Space Models. These models are characterized by the integration of groups of regions that are similar to each other. Clusters of outliers are also formed by STICH, enabling the identification of regions that are very different from the other regions present in the dataset. As advantages of STICH, besides the creation of new divisions of the space, we point out:
• The discovery of clusters with arbitrary shapes and sizes.
• The avoidance of any previous domain knowledge, as for example the definition of a value for k.
• The ability to deal with outliers' values, since when they are present in the dataset, the approach undertaken allows their identification.
REFERENCES Berry, M., & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley and Sons, Inc. Böhm, C., & Krebs, F. (2004). The k-nearest neighbor join: Turbo charging the KDD process. Knowledge and Information Systems, 6(6). Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery. Boston: Kluwer Academic Publishers. Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8(2), 127-150. Fayyad, U., & Stolorz, P. (1997). Data mining and KDD: Promise and challenges. Future Generation Computer Systems, 13(2), 99-115. Grabmeier, J. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6(4), 303-360. Groth, R. (2000). Data mining: Building competitive advantage. Prentice Hall PTR. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. CA: Morgan Kaufmann Publishers. Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. Karypis, G., Han, E.-H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8), 68-75. Laurikkala, J., Juhola, M., & Kentala, E. (2000). Informal identification of outliers in medical data. In Proceedings of Intelligent Data Analysis in Medicine and Pharmacology, Workshop at the 14th European Conference on Artificial Intelligence (pp. 20-24). Berlin. Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451-461. MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). Berkeley, CA: University of California Press. Modha, D.S., & Spangler, W.S. (2003). Feature weighting in k-Means clustering. Machine Learning, 52(3), 217-237.
Salber, D., Dey, A.K., & Abowd, G.D. (1999). The context toolkit: Aiding the development of context-enabled applications. In Proceedings of the 1999 Conference on Human Factors in Computing Systems (CHI ’99) (pp. 434441). Pittsburgh. Zaït, M., & Messatfa, H. (1997). A comparative study of clustering methods. Future Generation Computer Systems, 13(2), 149-159.
KEY TERMS
Cluster: Group of objects of a dataset that are similar with respect to specific metrics (values of the attributes used to identify the similarity between the objects).
Clustering: Process of identifying homogeneous groups of objects (clusters) in a dataset.
Data Mining: A discovery process that aims at the identification of patterns hidden in the analyzed dataset.
Hierarchical Clustering: A clustering method characterized by the successive aggregation (agglomerative hierarchical methods) or disaggregation (divisive hierarchical methods) of objects in order to find clusters in a dataset.
Intercluster Similarity: A measurement of the similarity between the clusters identified in a particular clustering process. These clusters must be as dissimilar as possible.
Intracluster Similarity: A measurement of the similarity between the objects inside a cluster. Objects in a cluster must be as similar as possible.
Partitioning Clustering: A clustering method characterized by the division of the initial dataset in order to find clusters that maximize the similarity between the objects inside the clusters.
Space Model: A geometry of the geographic space obtained by the identification of geographic regions with similar behavior with respect to a specific metric.

ENDNOTES
1. For datasets with multiple, heterogeneous feature spaces see Modha & Spangler (2003).
2. The k-nearest neighbor principles are also used for classification, a data mining task (Böhm & Krebs, 2004).
3. The median is used instead of the average, since the average can be negatively influenced by outliers.
4. http://arcobjectsonline.esri.com
5. Nomenclature of Territorial Units for Statistics.
6. Following the Interquartile Range definition (Laurikkala, Juhola, & Kentala, 2000) for the identification of outliers, it can be confirmed that the value 0.0231413 is the unique value that represents a possible outlier in the analyzed dataset (in this definition, outliers are values lying more than 1.5 times the Interquartile Range below the 25th percentile or above the 75th percentile).
Clustering of Time Series Data
Anne Denton, North Dakota State University, USA
INTRODUCTION
Time series data is of interest to most science and engineering disciplines, and analysis techniques for it have been developed for hundreds of years. There have, however, in recent years been new developments in data mining techniques, such as frequent pattern mining, that take a different perspective on data. Traditional techniques were not designed for such pattern-oriented approaches. There is, as a result, a significant need for research that extends traditional time-series analysis, in particular clustering, to the requirements of the new data mining algorithms.
BACKGROUND
Time series clustering is an important component in the application of data mining techniques to time series data (Roddick & Spiliopoulou, 2002) and is founded on the following research areas:

• Data Mining: Besides the traditional topics of classification and clustering, data mining addresses new goals, such as frequent pattern mining, association rule mining, outlier analysis, and data exploration.
• Time Series Data: Traditional goals include forecasting, trend analysis, pattern recognition, filter design, compression, Fourier analysis, and chaotic time series analysis. More recently, frequent pattern techniques, indexing, clustering, classification, and outlier analysis have gained in importance.
• Clustering: Data partitioning techniques such as k-means have the goal of identifying objects that are representative of the entire data set. Density-based clustering techniques rather focus on a description of clusters, and some algorithms identify the most common object. Hierarchical techniques define clusters at arbitrary levels of granularity.
• Data Streams: Many applications, such as communication networks, produce a stream of data. For real-valued attributes such a stream is amenable to time series data mining techniques.
Time series clustering draws from all of these areas. It builds on a wide range of clustering techniques that have been developed for other data, and adapts them while critically assessing their limitations in the time series setting.
MAIN THRUST
Many specialized tasks have been defined on time series data. This chapter addresses one of the most universal data mining tasks, clustering, and highlights the special aspects of applying clustering to time series data. Clustering techniques overlap with frequent pattern mining techniques, since both try to identify typical representatives.
Clustering Time Series
Clustering of any kind of data requires the definition of a similarity or distance measure. A time series of length n can be viewed as a vector in an n-dimensional vector space. One of the best-known distance measures, Euclidean distance, is frequently used in time series clustering. The Euclidean distance measure is a special case of an Lp norm. Lp norms may fail to capture similarity well when applied to raw time series data, because differences in the average value and average derivative affect the total distance. The problem is typically addressed by subtracting the mean and dividing the resulting vector by its L2 norm, or by working with normalized derivatives of the data (Gavrilov et al., 2000). Several specialized distance measures have been used for time series clustering, such as dynamic time warping, DTW (Berndt & Clifford, 1996), and longest common subsequence similarity, LCSS (Vlachos, Gunopulos, & Kollios, 2002). Time series clustering can be performed on whole sequences or on subsequences. For clustering of whole sequences, high dimensionality is often a problem. Dimensionality reduction may be achieved through Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), and Principal Component Analysis (PCA), as some of the most commonly used techniques. DFT (Agrawal, Faloutsos, & Swami, 1993) and DWT have the goal of eliminating high-frequency components that are typically due to noise. Specialized models have been
introduced that ignore some information in a targeted way (Jin, Lu, & Shi 2002). Others are based on models for specific data such as socioeconomic data (Kalpakis, Gada, & Puttagunta, 2001). A large number of clustering techniques have been developed, and for a variety of purposes (Halkidi, Batistakis, & Vazirgiannis, 2001). Partition-based techniques are among the most commonly used ones for time series data. The k-means algorithm, which is based on a greedy search, has recently been generalized to a wide range of distance measures (Banerjee et al., 2004).
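To make the normalization step just described concrete, here is a small sketch (ours, not code from the cited works): it removes the mean, scales by the L2 norm, and compares Euclidean distances on raw versus normalized series. The function names and toy series are illustrative only.

```python
import numpy as np

def normalize(series):
    """Subtract the mean and divide by the L2 norm, so that offset and
    scale differences do not dominate the distance."""
    centered = series - series.mean()
    norm = np.linalg.norm(centered)
    return centered / norm if norm > 0 else centered

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length series."""
    return np.linalg.norm(a - b)

# Two series with the same shape but different offset and amplitude.
t = np.linspace(0, 2 * np.pi, 100)
x = 10 + 3 * np.sin(t)
y = np.sin(t)

print(euclidean_distance(x, y))                        # large raw distance
print(euclidean_distance(normalize(x), normalize(y)))  # (near) zero after normalization
```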
Clustering Subsequences of a Time Series
A variety of data mining tasks require clustering of subsequences of one or more time series as a preprocessing step, such as association rule mining (Das et al., 1998), outlier analysis, and classification. Partition-based clustering techniques have been used for this purpose in analogy to vector quantization (Gersho & Gray, 1992), which was developed for signal compression. It has, however, been shown that when a large number of subsequences are clustered, the resulting cluster centers are very similar for different time series (Keogh, Lin, & Truppel, 2003). The problem can be resolved by using a clustering algorithm that is robust to noise (Denton, 2004), which adapts kernel-density-based clustering (Hinneburg & Keim, 2003) to time series data. Partition-based techniques aim at finding representatives for all objects. Cluster centers in kernel-density-based clustering, in contrast, are sequences that are in the vicinity of the largest number of similar objects. The goal of kernel-density-based techniques is thereby similar to frequent-pattern mining techniques such as motif-finding algorithms (Patel et al., 2002).
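As a small illustration of the subsequence setting (our own sketch, not the algorithm of Denton, 2004), the following extracts all sliding-window subsequences of a series and z-normalizes each window before it would be handed to a clustering algorithm; names and parameters are ours.

```python
import numpy as np

def sliding_windows(series, w):
    """Return the (n - w + 1) overlapping subsequences of length w."""
    n = len(series)
    return np.array([series[i:i + w] for i in range(n - w + 1)])

def znormalize_rows(windows):
    """Normalize each window to zero mean and unit standard deviation."""
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    std[std == 0] = 1.0            # guard against constant windows
    return (windows - mean) / std

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(500))   # a random-walk series
subsequences = znormalize_rows(sliding_windows(series, w=32))
print(subsequences.shape)                      # (469, 32)
```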
Related Problems
One time series clustering problem of particular practical relevance is clustering of gene expression data (Eisen, Spellman, Brown, & Botstein, 1998). Gene expression is typically measured at several points in time (time course experiment) that may not be equally spaced. Hierarchical clustering techniques are commonly used. Density-based clustering has recently been applied to this problem (Jiang, Pei, & Zhang, 2003). Time series with categorical data constitute a further related topic. Examples are log files and sequences of operating system commands. Some clustering algorithms in this setting borrow from frequent sequence mining algorithms (Vaarandi, 2003).
FUTURE TRENDS
Storage grows exponentially at rates faster than Moore's law for microprocessors. A natural consequence is that old data will be kept when new data arrives, leading to a massive increase in the availability of time-dependent data. Data mining techniques will increasingly have to consider the time dimension as an integral part of other techniques. Much of the current effort in the data mining community is directed at data that has a more complex structure than the simple tabular format initially covered in machine learning (Džeroski & Lavrač, 2001). Examples, besides time series data, include data in relational form, such as graph- and tree-structured data, and sequences. When addressing new settings, it will be of major importance not only to generalize existing techniques and make them more broadly applicable but also to critically assess problems that may appear in the generalization process.
CONCLUSION
Despite the maturity of both clustering and time series analysis, time series clustering is an active and fascinating research topic. New data mining applications are constantly being developed and require new types of clustering results. Clustering techniques from different areas of data mining have to be adapted to the time series context. Noise is a particularly serious problem for time series data, thereby adding challenges to the clustering process. Considering the general importance of time series data, it can be expected that time series clustering will remain an active topic for years to come.
REFERENCES
Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2004, April). Clustering with Bregman divergences. In Proceedings SIAM International Conference on Data Mining, Lake Buena Vista, FL.
Berndt, D.J., & Clifford, J. (1996). Finding patterns in time series: A dynamic programming approach. In Advances in knowledge discovery and data mining (pp. 229-248). Menlo Park, CA: AAAI Press.
Das, G., & Gunopulos, D. (2003). Time series similarity and indexing. In N. Ye (Ed.), The handbook of data mining (pp. 279-304). Mahwah, NJ: Lawrence Erlbaum Associates.
Das, G., Lin, K.-I., Mannila, H., Renganathan, G., & Smyth, P. (1998, Sept). Rule discovery from time series. In Proceedings IEEE Int. Conf. on Data Mining, Rio de Janeiro, Brazil.
Denton, A. (2004, August). Density-based clustering of time series subsequences. In Proceedings The Third Workshop on Mining Temporal and Sequential Data (TDM 04) In Conjunction with The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA.
Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin: Springer.
Jiang, D., Pei, J., & Zhang, A. (2003, March). DHC: A density-based hierarchical clustering method for time series gene expression data. In Proceedings 3rd IEEE Symposium on Bioinformatics and Bioengineering (BIBE'03), Washington, D.C.
Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998, December). Cluster analysis and display of genome-wide expression patterns. In Proceedings of the National Academy of Science USA, 95(25) (pp. 14863-8).
Gavrilov, M., Anguelov, D., Indyk, P., & Motwani, R. (2000). Mining the stock market (extended abstract): Which measure is best? In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 487-496), Boston, MA.
Gersho, A., & Gray, R.M. (1992). Vector quantization and signal compression. Boston, MA: Kluwer Academic Publishers.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Intelligent Information Systems Journal, 17(2-3), 107-145.
Hinneburg, A., & Keim, D.A. (2003, November). A general approach to clustering in large databases with noise. Knowledge Information Systems, 5(4), 387-415.
Kalpakis, K., Gada, D., & Puttagunta, V. (2001). Distance measures for effective clustering of ARIMA time-series. In Proceedings IEEE International Conference on Data Mining (pp. 273-280), San Jose, CA.
Keogh, E.J., Lin, J., & Truppel, W. (2003, December). Clustering of time series subsequences is meaningless: Implications for previous and future research. In Proceedings IEEE International Conference on Data Mining (pp. 115-122), Melbourne, FL.
Jin, X., Lu, Y., & Shi, C. (2002). Similarity measure based on partial information of time series. In Proceedings Eighth
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 544-549). Edmonton, AB, Canada. Patel, P., Keogh, E., Lin, J., & Lonardi, S. (2002). Mining motifs in massive time series databases. In Proceedings 2002 IEEE International Conference on Data Mining, Maebashi City, Japan. Reif, F. (1965). Fundamentals of statistical and thermal physics. New York: McGraw-Hill. Roddick, J.F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 750-767. Vlachos, M., Gunopoulos, D., & Kollios, G., (2002, February). Discovering similar multidimensional trajectories. In Proceedings 18th International Conference on Data Engineering (ICDE’02), San Jose, CA. Vaarandi, R. (2003). A data clustering algorithm for mining patterns from event logs. In Proceedings 2003 IEEE Workshop on IP Operations and Management, Kansas City, MO.
KEY TERMS
Dynamic Time Warping (DTW): Sequences are allowed to be extended by repeating individual time series elements, such as replacing the sequence X={x1,x2,x3} by X'={x1,x2,x2,x3}. The distance between two sequences under dynamic time warping is the minimum distance that can be achieved by extending both sequences independently (a minimal implementation is sketched after these key terms).
Kernel-Density Estimation (KDE): Consider the vector space in which the data points are embedded. The influence of each data point is modeled through a kernel function. The total density is calculated as the sum of the kernel functions for each data point.
Longest Common Subsequence Similarity (LCSS): Sequences are compared based on the assumption that elements may be dropped. For example, a sequence X={x1,x2,x3} may be replaced by X"={x1,x3}. Similarity between two time series is calculated as the maximum number of matching time series elements that can be achieved if elements are dropped independently from both sequences. Matches in real-valued data are defined as lying within some predefined tolerance.
Partition-Based Clustering: The data set is partitioned into k clusters, and cluster centers are defined
based on the elements of each cluster. An objective function is defined that measures the quality of the clustering based on the distance of all data points to the center of the cluster to which they belong. The objective function is minimized.
Principal Component Analysis (PCA): The projection of the data set onto a hyperplane that preserves the maximum amount of variation. Mathematically, PCA is equivalent to singular value decomposition of the covariance matrix of the data.
Random Walk: A sequence of random steps in an n-dimensional space, where each step is of fixed or randomly chosen length. In a random walk time series, time is advanced for each step and the time series element is derived using the prescription of a 1-dimensional random walk of randomly chosen step length.
Sliding Window: A time series of length n has (n − w + 1) subsequences of length w. An algorithm that operates on all subsequences sequentially is referred to as a sliding window algorithm.
Time Series: Sequence of real numbers, collected at equally spaced points in time. Each number corresponds to the value of an observed quantity.
Vector Quantization: A signal compression technique in which an n-dimensional space is mapped to a finite set of vectors. Each vector is called a codeword and the collection of all codewords a codebook. The codebook is typically designed using Linde-Buzo-Gray (LBG) quantization, which is very similar to k-means clustering.
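A minimal dynamic-programming implementation of the DTW distance defined above, added here as an illustration (ours); the absolute-value point cost and the example series are arbitrary choices.

```python
import numpy as np

def dtw_distance(x, y):
    """D[i, j] is the cheapest cost of aligning x[:i] with y[:j]; each step
    may re-use the current element of either series, which corresponds to
    the 'extension by repetition' in the DTW definition above."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # y[j-1] matched again
                                 D[i, j - 1],      # x[i-1] matched again
                                 D[i - 1, j - 1])  # advance both series
    return D[n, m]

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
print(dtw_distance(x, y))   # 0.0: y is x with one element repeated
```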
Clustering Techniques
Sheng Ma, IBM T.J. Watson Research Center, USA
Tao Li, Florida International University, USA
INTRODUCTION
Clustering data into sensible groupings, as a fundamental and effective tool for efficient data organization, summarization, understanding, and learning, has been the subject of active research in several fields, such as statistics (Hartigan, 1975; Jain & Dubes, 1988), machine learning (Dempster, Laird & Rubin, 1977), information theory (Linde, Buzo & Gray, 1980), databases (Guha, Rastogi & Shim, 1998; Zhang, Ramakrishnan & Livny, 1996), and bioinformatics (Cheng & Church, 2000), from various perspectives and with various approaches and focuses. From an application perspective, clustering techniques have been employed in a wide variety of applications, such as customer segmentation, hierarchical document organization, image segmentation, microarray data analysis, and psychology experiments. Intuitively, the clustering problem can be described as follows: given a set W of n entities, find a partition of W into groups such that the entities within each group are similar to each other, while entities belonging to different groups are dissimilar. The entities usually are described by a set of measurements (attributes). Clustering does not use category information that labels the objects with prior identifiers. The absence of label information distinguishes cluster analysis from classification and indicates that the goal of clustering is to find a hidden structure or compact representation of data instead of discriminating future data into categories.
BACKGROUND
Generally, clustering problems are determined by five basic components:

• Data Representation: What is the (physical) representation of the given dataset? What kind of attributes (e.g., numerical, categorical or ordinal) are there?
• Data Generation: The formal model for describing the generation of the dataset. For example, the Gaussian mixture model is a model for data generation.
• Criterion/Objective Function: What are the objective functions or criteria that the clustering solutions should aim to optimize? Typical examples include entropy, maximum likelihood, and within-class or between-class distance (Li, Ma & Ogihara, 2004a).
• Optimization Procedure: What is the optimization procedure for finding the solutions? A clustering problem is known to be NP-complete (Brucker, 1977), and many approximation procedures have been developed. For instance, Expectation-Maximization (EM) type algorithms have been used widely to find local minima of the optimization.
• Cluster Validation and Interpretation: Cluster validation evaluates the clustering results and judges the cluster structures. Interpretation often is necessary for applications. Since there is no label information, clusters are sometimes justified by ad hoc methods (such as exploratory analysis), based on specific application areas.
For a given clustering problem, the five components are tightly coupled. The formal model is induced from the physical representation of the data; the formal model, along with the objective function, determines the clustering capability, and the optimization procedure decides how efficiently and how effectively the clustering results can be obtained. The choice of the optimization procedure depends on the first three components. Validation of cluster structures is a way of verifying assumptions on data generation and of evaluating the optimization procedure.
MAIN THRUST
We review some of the current clustering techniques in this section; Figure 1 gives a summary.

Figure 1. Summary of clustering techniques

The following further discusses traditional clustering techniques, spectral-based analysis, model-based clustering, and co-clustering. Traditional clustering techniques focus on one-sided clustering, and they can be classified as partitional, hierarchical, density-based, and grid-based (Han & Kamber, 2000). Partitional clustering attempts to directly decompose the dataset into disjoint classes, such that the data points in a class are nearer to one another than the data points in other classes. Hierarchical clustering proceeds successively by building a tree of clusters. Density-based clustering groups the neighboring points of a dataset into classes based on density conditions. Grid-based clustering quantizes the data space into a finite number of cells that form a grid structure and then performs clustering on the grid structure. Most of these algorithms use distance functions as objective criteria and are not effective in high-dimensional spaces.

As an example, we take a closer look at K-means algorithms. The typical K-means type algorithm is a widely used partition-based clustering approach. Basically, it first chooses a set of K data points as initial cluster representatives (e.g., centers) and then performs an iterative process that alternates between assigning the data points to clusters, based on their distances to the cluster representatives, and updating the cluster representatives, based on the new cluster assignments. The iterative optimization procedure of the K-means algorithm is a special form of EM-type procedure. The K-means type algorithm treats each attribute equally and computes the distances between data points and cluster representatives to determine cluster memberships.

A lot of algorithms have been developed recently to address the efficiency and performance issues present in traditional clustering algorithms. Spectral analysis has been shown to be tightly related to the clustering task. Spectral clustering (Ng, Jordan & Weiss, 2001; Weiss, 1999), closely related to latent semantic indexing (LSI), uses selected eigenvectors of the data affinity matrix to obtain a data representation that can easily be clustered or
embedded in a low-dimensional space. Model-based clustering attempts to learn generative models, by which the cluster structure is determined, from the data. Tishby, Pereira, and Bialek (1999) and Slonim and Tishby (2000) developed the information bottleneck formulation, in which, given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other is preserved as much as possible. Other recent developments of clustering techniques include ensemble clustering, support vector clustering, matrix factorization, high-dimensional data clustering, distributed clustering, and so forth.

Another interesting development is co-clustering, which conducts simultaneous, iterative clustering of both data points and their attributes (features) by utilizing the canonical duality contained in the point-by-attribute data representation. The idea of co-clustering of data points and attributes dates back to Anderberg (1973) and Nishisato (1980). Govaert (1985) researches simultaneous block clustering of the rows and columns of the contingency table. The idea of co-clustering also has been applied to cluster gene expression data and experiments (Cheng & Church, 2000). Dhillon (2001) presents a co-clustering algorithm for documents and words using a bipartite graph formulation and a spectral heuristic. Recently, Dhillon et al. (2003) proposed an information-theoretic co-clustering method for a two-dimensional contingency table. By viewing the non-negative contingency table as a joint probability distribution between two discrete random variables, the optimal co-clustering then maximizes the mutual information between the clustered random variables. Li and Ma (2004) recently developed Iterative Feature and Data (IFD) clustering by representing the data generation with data and feature coefficients. IFD enables an iterative co-clustering procedure for both data and feature assignments. However, unlike previous co-clustering approaches, IFD performs clustering using a mutually reinforcing optimization procedure that has a proven convergence property. IFD only handles data with binary features. Li, Ma, and Ogihara (2004b) further extended the idea to general data.
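The assign-and-update iteration of the K-means type algorithm described above can be sketched in a few lines; this is our own minimal illustration, not the generalized algorithm of Banerjee et al. (2004), and all names are ours.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: pick k points as initial centers, then alternate
    between assigning points to the nearest center and recomputing the
    centers as the mean of their assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (3, 3), (0, 3))])
centers, labels = kmeans(X, k=3)
print(centers)
```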
FUTURE TRENDS
Although clustering has been studied for many years, many issues, such as cluster validation, still need more investigation. In addition, new challenges, such as scalability, high dimensionality, and complex data types, have been brought about by the ever-increasing growth of information exposure and data collection.

• Scalability and Efficiency: With the collection of huge amounts of data, clustering faces problems of scalability in terms of both computation time and memory requirements. To resolve the scalability issues, methods such as incremental and streaming approaches, sufficient statistics for data summary, and sampling techniques have been developed.
• Curse of Dimensionality: Another challenge is the high dimensionality of data. It has been shown that in a high-dimensional space, the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions (Beyer et al., 1999); see the short simulation following this list. Hence, most algorithms do not work efficiently in high-dimensional spaces, due to the curse of dimensionality. Many feature selection techniques have been applied to reduce the dimensionality of the space. However, as demonstrated in Aggarwal et al. (1999), in many cases the correlations among the dimensions are often specific to data locality; in other words, some data points are correlated with a given set of features, and others are correlated with respect to different features. As pointed out in Hastie, Tibshirani, and Friedman (2001) and Domeniconi, Gunopulos, and Ma (2004), all methods that overcome the dimensionality problems use a metric for measuring neighborhoods, which is often implicit and/or adaptive.
• Complex Data Types: The problem of clustering becomes more challenging when the data contains complex types (e.g., when the attributes contain both categorical and numerical values). There are no inherent distance measures between data values. This is often the case in many applications where data are described by a set of descriptive or presence/absence attributes, many of which are not numerical. The presence of complex types also makes cluster validation and interpretation difficult.

More challenges also include clustering with multiple criteria (where clustering problems often require optimization over more than one criterion), clustering relational data (where data is represented with multiple relation tables), and distributed clustering (where data sets are geographically distributed across multiple sites).
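The distance-concentration effect cited from Beyer et al. (1999) can be observed with a short simulation (our own sketch, assuming uniformly distributed data): as the dimension grows, the relative gap between the nearest and the farthest neighbor of a query point shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))          # uniformly distributed points
    q = rng.random(d)                  # a query point
    dists = np.linalg.norm(X - q, axis=1)
    relative_contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={d:5d}  relative contrast={relative_contrast:.3f}")
```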
CONCLUSION
Clustering is a classical technique for segmenting and grouping similar data objects. Many algorithms have been developed in the past. Looking ahead, many challenges drawn from real-world applications will drive the search for efficient algorithms that are able to handle heterogeneous data, process large volumes, and scale to a large number of dimensions.
REFERENCES
Aggarwal, C. et al. (1999). Fast algorithms for projected clustering. Proceedings of ACM SIGMOD Conference.
Anderberg, M.-R. (1973). Cluster analysis for applications. Academic Press Inc.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory.
Brucker, P. (1977). On the complexity of clustering problems. In R. Henn, B. Korte, & W. Oletti (Eds.), Optimization and operations research (pp. 45-54). New York: Springer-Verlag.
Cheng, Y., & Church, G.M. (2000). Bi-clustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB).
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.
Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report 2001-05. Austin CS Dept.
Dhillon, I.S., Mallela, S., & Modha, S.-S. (2003). Information-theoretic co-clustering. Proceedings of ACM SIGKDD.
Domeniconi, C., Gunopulos, D., & Ma, S. (2004). Within-cluster adaptive metric for clustering. Proceedings of the SIAM International Conference on Data Mining.
Govaert, G. (1985). Simultaneous clustering of rows and columns. Control and Cybernetics, 437-458.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD Conference.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.
Hartigan, J. (1975). Clustering algorithms. Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, prediction. Springer.
Jain, A.K., & Dubes, R.C. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice Hall.
Li, T., & Ma, S. (2004). IFD: Iterative feature and data clustering. Proceedings of the SIAM International Conference on Data Mining.
Li, T., Ma, S., & Ogihara, M. (2004a). Entropy-based criterion in categorical clustering. Proceedings of the International Conference on Machine Learning (ICML 2004).
Li, T., Ma, S., & Ogihara, M. (2004b). Document clustering via adaptive subspace iteration. Proceedings of the ACM SIGIR.
Linde, Y., Buzo, A., & Gray, R.M. (1980). An algorithm for vector quantization design. IEEE Transactions on Communications, 28(1), 84-95.
Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto: University of Toronto Press.
Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of the ACM SIGIR.
Tishby, N., Pereira, F.-C., & Bialek, W. (1999). The information bottleneck method. Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing.
Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. Proceedings of ICCV (2).
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD Conference.

KEY TERMS
Cluster: A set of entities that are similar between themselves and dissimilar to entities from other clusters.
Clustering: The process of dividing the data into clusters.
Cluster Validation: Evaluates the clustering results and judges the cluster structures.
Co-Clustering: Performs simultaneous clustering of both points and their attributes by way of utilizing the canonical duality contained in the point-by-attribute data representation.
Curse of Dimensionality: This expression is due to Bellman; in statistics, it relates to the fact that the convergence of any estimator to the true value of a smooth function defined on a space of high dimension is very slow. It has been used in various scenarios to refer to the fact that the complexity of learning grows significantly with the dimensions.
Spectral Clustering: The collection of techniques that performs clustering tasks using eigenvectors of matrices derived from the data.
Subspace Clustering: An extension of traditional clustering techniques that seeks to find clusters in different subspaces within a given dataset.
Clustering Techniques for Outlier Detection
Frank Klawonn, University of Applied Sciences Braunschweig/Wolfenbuettel, Germany
Frank Rehm, German Aerospace Center, Germany
INTRODUCTION
For many applications in knowledge discovery in databases, finding outliers, which are rare events, is of importance. Outliers are observations that deviate significantly from the rest of the data, so that they seem to have been generated by another process (Hawkins, 1980). Such outlier objects often contain information about an untypical behaviour of the system. However, outliers bias the results of many data-mining methods, such as the mean value, the standard deviation, or the positions of the prototypes in k-means clustering (Estivill-Castro & Yang, 2004; Keller, 2000). Therefore, before further analysis or processing of data is carried out with more sophisticated data-mining techniques, identifying outliers is a crucial step. Usually, data objects are considered outliers when they occur in a region of extremely low data density. Many clustering techniques that deal with noisy data and can identify outliers, such as possibilistic clustering (PCM) (Krishnapuram & Keller, 1993, 1996) and noise clustering (NC) (Dave, 1991; Dave & Krishnapuram, 1997), need good initializations or suffer from a lack of adaptability to different cluster sizes. Distance-based approaches (Knorr & Ng, 1998; Knorr, Ng, & Tucakov, 2000) take a global view of the data set. These algorithms can hardly treat data sets that contain regions with different data densities (Breunig, Kriegel, Ng, & Sander, 2000). In this work, we present an approach that combines a fuzzy clustering algorithm (Höppner, Klawonn, Kruse, & Runkler, 1999) or any other prototype-based clustering algorithm with statistical distribution-based outlier detection.
BACKGROUND
Prototype-based clustering algorithms approximate a feature space by means of an appropriate number of prototype vectors, where each vector is located in the centre of the group of data (the cluster) that belongs to the respective prototype. Clustering usually aims at
partitioning a data set into groups or clusters of data, where data assigned to the same cluster are similar and data from different clusters are dissimilar. With this partitioning concept in mind, an important aspect of typical applications of cluster analysis is the identification of the number of clusters in a data set. However, when we are interested in identifying outliers, the exact number of clusters is irrelevant (Georgieva & Klawonn). The question of whether one prototype covers two or more data clusters, or whether two or more prototypes compete for the same data cluster, is not important as long as the actual outliers are identified and assigned to a proper cluster. The number of prototypes used for clustering depends, of course, on the number of expected clusters, but also on the distance measure and thus the shape of the expected clusters. Because this information is usually not available, the Euclidean distance measure is often recommended together with a rather generous number of prototypes. One of the most frequently cited statistical tests for outlier detection is Grubbs' test (Grubbs, 1969). This test is used to detect outliers in a univariate data set. Grubbs' test detects one outlier at a time. This outlier is removed from the data set, and the test is iterated until no outliers are detected.
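The iterative, univariate Grubbs procedure described above might be sketched as follows (our own illustration; the critical value uses the commonly cited two-sided approximation based on the t-distribution, and SciPy is assumed to be available).

```python
import numpy as np
from scipy import stats

def grubbs_critical(n, alpha=0.05):
    """Commonly used two-sided critical value for Grubbs' test."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t * t / (n - 2 + t * t))

def iterative_grubbs(values, alpha=0.05):
    """Repeatedly remove the most extreme value while it fails the test."""
    data = list(values)
    outliers = []
    while len(data) > 2:
        arr = np.asarray(data)
        mean, std = arr.mean(), arr.std(ddof=1)
        idx = int(np.argmax(np.abs(arr - mean)))
        g = abs(arr[idx] - mean) / std
        if g <= grubbs_critical(len(arr), alpha):
            break
        outliers.append(data.pop(idx))
    return outliers, data

sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 14.5]
print(iterative_grubbs(sample))   # 14.5 should be flagged as an outlier
```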
MAIN THRUST
The detection of outliers that we propose in this work is a modified version of the one proposed in Santos-Pereira and Pires (2002) and is composed of two different techniques. In the first step, we partition the data set with the fuzzy c-means clustering algorithm so that the feature space is approximated with an adequate number of prototypes. The prototypes will be placed in the centre of regions with a high density of feature vectors. Because outliers are far away from the typical data, they influence the placing of the prototypes. After partitioning the data, only the feature vectors belonging to each single cluster are considered for the detection of outliers. For each attribute of the feature vectors of the considered cluster, the mean value and the standard deviation have to be calculated. For the vector
with the largest distance (see Endnote 1) to the mean vector, which is assumed to be an outlier, the value of the z-transformation for each of its components is compared to a critical value. If one of these values is higher than the respective critical value, then this vector is declared an outlier. One can use the Mahalanobis distance as in Santos-Pereira and Pires (2002), but because simple clustering techniques such as the fuzzy c-means algorithm tend to produce spherical clusters, we apply a modified version of Grubbs' test, not assuming correlated attributes within a cluster. The critical value is a parameter that must be set for each attribute depending on the specific definition of an outlier. One typical criterion can be the maximum number of outliers with respect to the amount of data (Klawonn, 2004). In effect, large critical values lead to smaller numbers of outliers, and small critical values lead to very compact clusters. Note that the critical value is set for each attribute separately. This leads to an axes-parallel view of the data, which in the case of axes-parallel clusters leads to better outlier detection than the (hyper)spherical view of the data. If an outlier is found, the feature vector has to be removed from the data set. With the new data set, the mean value and the standard deviation have to be calculated again for each attribute. With the vector that has the largest distance to the new centre vector, the outlier test will be repeated by checking the critical values. This procedure will be repeated until no outlier is found. The other clusters are treated in the same way. Figure 1 shows the results of the proposed algorithm. The crosses in this figure are feature vectors that are recognized as outliers. As expected, only a few points are declared as outliers when approximating the feature space with only one prototype. The prototype will be placed in the centre of all feature vectors. Hence, only points on the edges are defined as outliers. Comparing the solutions with 3 and 10 prototypes, one can see that both solutions are almost identical. Even in the border regions, where two prototypes compete for some data points, the algorithm rarely identifies these points as outliers, which they intuitively are not.

Figure 1. Outlier detection with different numbers of prototypes: (a) 1 prototype, (b) 3 prototypes, (c) 10 prototypes
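A compact sketch of the overall idea—partition first, then run a componentwise z-score test per cluster—follows. This is our own simplification: cluster labels are assigned by hand instead of by FCM, the critical value of 2.5 is an arbitrary choice, and all names are ours.

```python
import numpy as np

def cluster_outliers(X, labels, critical=2.5):
    """For each cluster, repeatedly test the point farthest from the cluster
    mean: if any componentwise z-score exceeds the critical value, declare it
    an outlier, remove it, and recompute mean and standard deviation."""
    outlier_idx = []
    for c in np.unique(labels):
        members = list(np.where(labels == c)[0])
        while len(members) > 2:
            pts = X[members]
            mean, std = pts.mean(axis=0), pts.std(axis=0, ddof=1)
            std[std == 0] = 1.0
            far = int(np.argmax(np.linalg.norm(pts - mean, axis=1)))
            z = np.abs(pts[far] - mean) / std
            if np.any(z > critical):
                outlier_idx.append(members.pop(far))
            else:
                break
    return outlier_idx

rng = np.random.default_rng(0)
cluster_a = rng.normal((0, 0), 0.2, size=(60, 2))
cluster_b = rng.normal((4, 4), 0.2, size=(60, 2))
X = np.vstack([cluster_a, cluster_b, [[2.0, 2.0]]])   # one point far from both clusters
labels = (np.linalg.norm(X - [0, 0], axis=1) > np.linalg.norm(X - [4, 4], axis=1)).astype(int)
print(cluster_outliers(X, labels))                    # the last index should be reported
```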
FUTURE TRENDS
Figure 1 shows that the algorithm can identify outliers in multivariate data in a stable way. With only a few parameters, the solution can be adapted to different requirements concerning the specific definition of an outlier. By choosing the number of prototypes, it is possible to influence the result so that, with many prototypes, even smaller data groups can be found. To avoid overfitting the data, it makes sense in certain cases to eliminate very small clusters. However, finding the proper number of prototypes remains of interest for further investigation. When using a fuzzy clustering algorithm such as FCM (Bezdek, 1981) to partition the data, it is possible to assign a feature vector to different prototype vectors. In that way, one can corroborate whether a certain feature vector is an outlier if the algorithm decides for each single cluster that the corresponding feature vector is an outlier. FCM provides membership degrees for each feature vector to every cluster. One approach could be to assign a feature vector to the corresponding clusters with the two highest membership degrees. The feature vector is considered an outlier if the algorithm makes the same decision in both clusters. In cases where the algorithm gives no definite answer, the feature vector can be labeled and processed by further analysis.
CONCLUSION
In this article, we describe a method to detect outliers in multivariate data. Because information about the number and shape of clusters is often not known in advance, it is necessary to have a method that is relatively robust with respect to these parameters. To obtain a stable algorithm, we combined established clustering techniques, such as FCM or k-means, with a statistical method to detect outliers. Because the complexity of the presented algorithm is linear in the number of points, it can be applied to large data sets.
REFERENCES
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum Press.
Breunig, M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 93-104).
Dave, R. N. (1991). Characterization and detection of noise in clustering. Pattern Recognition Letters, 12, 657-664.
Dave, R. N., & Krishnapuram, R. (1997). Robust clustering methods: A unified view. IEEE Transactions on Fuzzy Systems, 5(2), 270-293.
Estivill-Castro, V., & Yang, J. (2004). Fast and robust general purpose clustering algorithms. Data Mining and Knowledge Discovery, 8, 127-150.
Georgieva, O., & Klawonn, F. A cluster identification algorithm based on noise clustering. Manuscript submitted for publication.
Grubbs, F. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1), 1-21.
Hawkins, D. (1980). Identification of outliers. London: Chapman and Hall.
Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis. Chichester, UK: Wiley.
Keller, A. (2000). Fuzzy clustering with outliers. Proceedings of the 19th International Conference of the North American Fuzzy Information Processing Society, USA.
Klawonn, F. (2004). Noise clustering with a fixed fraction of noise. In A. Lotfi & J. M. Garibaldi (Eds.), Applications and science in soft computing. Berlin, Germany: Springer.
Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. Proceedings of the 24th International Conference on Very Large Data Bases (pp. 392-403).
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3-4), 237-253.
Krishnapuram, R., & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1(2), 98-110.
Krishnapuram, R., & Keller, J. M. (1996). The possibilistic C-means algorithm: Insights and recommendations. IEEE Transactions on Fuzzy Systems, 4(3), 385-393.
Santos-Pereira, C. M., & Pires, A. M. (2002). Detection of outliers in multivariate data: A method based on clustering and robust estimators. Proceedings of the 15th Symposium on Computational Statistics (pp. 291-296), Germany.

KEY TERMS
Cluster Analysis: To partition a given data set into clusters where data assigned to the same cluster are similar and data from different clusters are dissimilar.
Cluster Prototypes (Centres): Clusters in objective function-based clustering are represented by prototypes that define how the distance of a data object to the corresponding cluster is computed. In the simplest case, a single vector represents the cluster, and the distance to the cluster is the Euclidean distance between cluster centre and data object.
Fuzzy Clustering: Cluster analysis where a data object can have membership degrees to different clusters. Usually it is assumed that the membership degrees of a data object to all clusters add up to one, so a membership degree can also be interpreted as the probability that the data object belongs to the corresponding cluster.
Noise Clustering: An additional noise cluster is induced in objective function-based clustering to collect the noise data or outliers. All data objects are assumed to have a fixed (large) distance to the noise cluster, so only data that is far away from all the other clusters will be assigned to the cluster.
Outliers: Observations in a sample, so far separated in value from the remainder as to suggest that they are generated by another process or are the result of an error in measurement.
Overfitting: The phenomenon that a learning algorithm adapts so well to a training set that the random disturbances in the training set are included in the model as being meaningful. Consequently, as these disturbances do not reflect the underlying distribution, the performance on the test set, with its own but definitively other disturbances, will suffer from techniques that tend to fit well to the training set.
Possibilistic Clustering: The Possibilistic c-means (PCM) family of clustering algorithms is designed to alleviate the noise problem by relaxing the constraint on memberships used in probabilistic fuzzy clustering.
Z-Transformation: With the z-transformation, one can transform the values of any variable into values of the standard normal distribution.

ENDNOTES
1. Different distance measures can be used to determine the vector with the largest distance to the centre vector. For elliptical non-axes-parallel clusters, the Mahalanobis distance leads to good results. If no information about the shape of the clusters is available, the Euclidean distance is commonly used.
Combining Induction Methods with the Multimethod Approach
Mitja Lenič, University of Maribor, FERI, Slovenia
Peter Kokol, University of Maribor, FERI, Slovenia
Petra Povalej, University of Maribor, FERI, Slovenia
Milan Zorman, University of Maribor, FERI, Slovenia
INTRODUCTION
The aggressive rate of growth of disk storage, and thus the ability to store enormous quantities of data, has far outpaced our ability to process and utilize it. This challenge has produced a phenomenon called data tombs—data is deposited to merely rest in peace, never to be accessed again. But the growing appreciation that data tombs represent missed opportunities for supporting scientific discovery, business exploitation, or complex decision making has awakened commercial interest in knowledge discovery and data-mining techniques. That, in turn, has stimulated new interest in the automatic induction of knowledge from cases stored in large databases—a very important class of techniques in the data-mining field. With the variety of environments, it is almost impossible to develop a single induction method that would fit all possible requirements. Therefore, we constructed a new, so-called multi-method approach, trying out some original solutions.
BACKGROUND
Through time, different approaches have evolved, such as symbolic approaches, computational learning theory, neural networks, and so forth. In our case, we focus on an induction process to find a way to extract generalized knowledge from observed cases (instances). That is accomplished by using inductive inference, the process of moving from concrete instances to general model(s), where the goal is to learn how to extract knowledge from objects by analyzing a set of instances
(already solved cases) whose classes are known. Instances are typically represented as attribute-value vectors. Learning input consists of a set of vectors/instances, each belonging to a known class, and the output consists of a mapping from attribute values to classes. This mapping hypothesis should accurately classify both the learning instances and the new unseen instances. The hypothesis hopefully represents generalized knowledge that is interesting for domain experts. The fundamental theory of learning is presented by Valiant (1984) and Auer (1995).
Single-Method Approaches
When comparing single approaches with different knowledge representations and different learning algorithms, there is no clear winner. Each method has its own advantages and some inherent limitations. Decision trees (Quinlan, 1993), for example, are easily understandable by a human and can be used even without a computer, but they have difficulties expressing complex nonlinear problems. On the other hand, connectivistic approaches that simulate the cognitive abilities of the brain can extract complex relations, but the solutions are not easily understandable to humans (they are only sets of numeric weights) and are, therefore, not directly usable for data mining. Evolutionary approaches to knowledge extraction are also a good alternative, because they are not inherently limited to a local solution (Goldberg, 1989), but they are computationally expensive. There are many other approaches, such as representation of the knowledge with rules, rough sets, case-based reasoning, support vector machines, different fuzzy methodologies, and ensemble methods (Dietterich, 2000).
Hybrid Approaches
Hybrid approaches rest on the assumption that only the synergetic combination of single models can unleash their full power. Each of the single methods has its advantages but also inherent limitations and disadvantages, which must be taken into account when using a particular method. For example, symbolic methods usually represent the knowledge in human-readable form, whereas connectivistic methods perform better in the classification of unseen objects and are less affected by noise in the data than symbolic methods are. Therefore, the logical step is to combine both worlds to overcome the disadvantages and limitations of a single one. In general, the hybrids can be divided according to the flow of knowledge into four categories (Iglesias, 1996):
• Sequential Hybrid (Chain Processing): The output of one method is an input to another method. For example, the neural net is trained with the training set to reduce noise.
• Parallel Hybrid (Co-Processing): Different methods are used to extract knowledge. In the next phase, some arbitration mechanism should be used to generate appropriate results.
• External Hybrid (Meta Processing): One method uses another, external one. For example, meta decision trees (Todorovski & Dzeroski, 2000) use neural nets in decision nodes to improve the classification results.
• Embedded Hybrid (Sub-Processing): One method is embedded in another. That is the most powerful hybrid, but the least modular one, because usually the methods are coupled tightly.
The hybrid systems are commonly static in structure and cannot change the order in which the single methods are applied. To be able to use embedded hybrids with different internal knowledge representations, it is commonly required to transform one method's representation into another. Some transformations are trivial, especially when converting from symbolic approaches. The problem arises when the knowledge is not so clearly presented, as in the case of a neural network (McGarry, Wermter & MacIntyre, 2001; Zorman, Kokol & Podgorelec, 2000). The knowledge representation issue is very important in the multi-method approach, and we solved it in an original manner.

MULTI-METHOD APPROACH
The multi-method approach was introduced in Lenič and Kokol (2002). While studying other approaches, we were inspired by the idea of hybrid approaches and evolutionary algorithms. Both approaches are very promising in achieving the goal of improving the quality of knowledge extraction and are not inherently limited to sub-optimal solutions. We also noticed that almost all attempts to combine different methods use the loose-coupling approach. Of course, loose coupling is easier to implement, but the methods work almost independently of each other, and, therefore, a lot of luck is needed to make them work as a team. As opposed to the conventional hybrids described in the previous section, our idea is to dynamically combine and apply different methods in no predefined order to the same problem or to decompositions of the problem. The main concern of the multi-method approach is to find a way to enable a dynamic combination of methods on a somehow quasi-unified knowledge representation. This results in multiple, equally qualitative solutions, as in evolutionary algorithms (EA), where each solution is obtained by applying different methods with different parameters. Therefore, we introduce a population composed of individuals/solutions that have the common goal of improving their classification abilities in a given environment/problem. We also enable the coexistence of different types of knowledge representation in the same population. The most common knowledge representation models have to be standardized and strictly typed to support the applicability of different methods to individuals. Each induction method implementation uses its own internal knowledge representation that is not compatible with other methods that use the same type of knowledge. A typical example is WEKA, which uses at least four different knowledge representations for decision trees. Standardization, in general, brings greater modularity and interchangeability, but it has the following disadvantage: already existing methods cannot be directly integrated and have to be adjusted to the standardized representation. The initial population of extracted knowledge is generated using different methods. In each generation, different operations appropriate for the individual knowledge representations are applied to improve existing and create new intelligent systems. That enables incremental refinement of the extracted knowledge, with different views on a given problem. The main problem is how to combine methods that
use different knowledge representations (e.g., neural networks and decision trees). In such cases, we have two alternatives: (1) to convert one knowledge representation to another using different, already known methods, or (2) to combine both knowledge representations in a single intelligent system. In both cases, knowledge transmutation is executed (Cox & Ram, 1999). In the first case, conversion between different knowledge representations must be implemented, which is usually not perfect, and some parts of the knowledge can be lost. On the other hand, it can provide a different view and a good starting point in the hypothesis search space. The second approach, which is based on combining knowledge, requires some cut-points where knowledge representations can be merged. For example, in a decision tree, such cut-points are internal nodes, where the condition in an internal node can be replaced by another intelligent system (e.g., a support vector machine [SVM]). The same idea can also be applied in decision leaves (Figure 1).

Figure 1. An example of a decision tree induced using the multi-method approach. Each node is induced with an appropriate method (GA—genetic algorithm, ID3, Gini, Chi-square, J-measure, SVM, neural network, etc.)
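The cut-point idea can be pictured with a tiny sketch of ours (not the authors' implementation): an internal decision-tree node either tests a single attribute or delegates the decision to an embedded classifier; the hand-made linear function below merely stands in for a trained SVM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Node:
    """A decision node whose test is either a simple attribute comparison
    or an arbitrary embedded classifier (e.g., an SVM decision function)."""
    test: Optional[Callable[[Sequence[float]], bool]]  # True -> left branch
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None                        # set for leaves

    def classify(self, x):
        if self.label is not None:
            return self.label
        branch = self.left if self.test(x) else self.right
        return branch.classify(x)

# Leaf nodes carrying class labels.
pos, neg = Node(test=None, label="positive"), Node(test=None, label="negative")

# The root uses an ordinary attribute threshold; its right child embeds a
# hand-made linear decision function standing in for an SVM.
embedded = Node(test=lambda x: 0.7 * x[0] - 1.2 * x[1] + 0.3 > 0, left=pos, right=neg)
root = Node(test=lambda x: x[0] < 2.0, left=neg, right=embedded)

print(root.classify([1.0, 5.0]))   # negative via the threshold test
print(root.classify([3.0, 1.0]))   # positive via the embedded classifier
```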
Operators
Using the idea of the multi-method approach, we designed a framework that operates on a population of extracted knowledge representations—individuals. Since methods usually are composed of operations that can be
Figure 2. Multi-method framework (layers of individual operators, population operators, and strategies, each split into framework-support and method-specific parts, operating on a common population)
An operation on an individual is a function that transforms one or more individuals into a single individual. An operation can be part of one or more methods, like a pruning operator, a boosting operator, and so forth. The transformation to another knowledge representation is also introduced at the individual-operator level; the transition from one knowledge representation to another is thus presented as a method. An operator-based view provides the ability to simply add new operations to the framework (Figure 2). Representation with individual operations is an effective and modular way to represent the result as a single individual but, in general, the result of an operation can also be a population of individuals (e.g., the mutation operation in EA is defined on the individual level and on the population level). A single method is composed of population operations that use individual operations and is introduced as a strategy in the framework that improves the individuals in a population (Figure 2). Population operators can be generalized with higher-order functions and thereby reused in different methods.
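The operator/strategy layering of the framework might be expressed with interfaces along these lines (our own sketch; the class names are illustrative and not the framework's actual API).

```python
from abc import ABC, abstractmethod
from typing import List

class Individual(ABC):
    """A single piece of extracted knowledge (decision tree, rule set, ...)."""

class IndividualOperator(ABC):
    """Transforms one or more individuals into a single individual,
    e.g., pruning, boosting, or conversion to another representation."""
    @abstractmethod
    def apply(self, individuals: List[Individual]) -> Individual: ...

class PopulationOperator(ABC):
    """Uses individual operators to transform a whole population; it may
    reduce, maintain, or increase the population size."""
    @abstractmethod
    def apply(self, population: List[Individual]) -> List[Individual]: ...

class Strategy(ABC):
    """A complete induction method, expressed as a sequence of population
    operators that iteratively improve the population."""
    @abstractmethod
    def step(self, population: List[Individual]) -> List[Individual]: ...

class Prune(IndividualOperator):
    def apply(self, individuals):
        tree = individuals[0]
        # ... simplify the tree and return it (details omitted) ...
        return tree
```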
Meta-Level Control

An important concern of the multi-method framework is how to provide the meta-level services that manage the available resources and the application of the methods. We extended the quest for knowledge into another dimension; that is, the quest for the best application order of methods. The problem that arises is how to control the quality of the resulting individuals and how to intervene in the case of bad results. Because of the different knowledge representations, solutions cannot be compared trivially to each other, and it is hard to assess which method is better. Individuals in a population cannot be evaluated explicitly, and an explicit fitness function cannot be calculated easily. Therefore, the comparison of individuals and the assessment of the quality of the whole population cannot be given directly. Even if some criteria to calculate a fitness
function could be found, it would probably be very time-consuming and computationally intensive. Therefore, the idea of classical evolutionary algorithms, in which the quality of the population is controlled and guided toward the objective, cannot be applied. To achieve self-adaptive behavior of an evolutionary algorithm, the strategy parameters have to be coded directly into the chromosome (Thrun & Pratt, 1998). But in our approach, the meta-level strategy does not know about the structure of the chromosome, and not all of the methods use the EA approach to produce a solution. Therefore, for meta-level chromosomes, the parameters of the method and its individuals are taken. When dealing with a self-adapting population with no explicit evaluation/fitness function, there is also the issue of identifying the best or most promising individual (Šprogar et al., 2000). The question of how to control the population size or increase selection pressure must, of course, still be answered effectively. Our solution was to classify population operators into three categories: operators for reducing, operators for maintaining, and operators for increasing the population size.
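A small illustration of this classification is given below (names are ours; the score function is only a stand-in, since the text stresses that an explicit fitness function is generally unavailable).

```python
import random

# Population operators classified by their effect on population size.
def score(individual):
    # Stand-in only: real operators would rely on method-specific parameters
    # rather than an explicit fitness function.
    return sum(individual)

def reduce_population(population, k):
    """Reducing operator: keep only the k most promising individuals."""
    return sorted(population, key=score, reverse=True)[:k]

def maintain_population(population, mutate):
    """Maintaining operator: replace individuals one for one (size unchanged)."""
    return [mutate(ind) for ind in population]

def increase_population(population, recombine, extra):
    """Increasing operator: add newly generated individuals (size grows)."""
    offspring = [recombine(random.choice(population), random.choice(population))
                 for _ in range(extra)]
    return population + offspring
```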
CONCRETE COMBINATION TECHNIQUES

The multi-method approach searches for solutions in a huge (potentially infinite) search space and exploits the acquisition technique of each integrated method. Each individual solution in the population represents a different aspect of the extracted knowledge, and transmutation from one knowledge representation to another introduces new aspects. We can draw parallels between the multi-method approach and scientific discovery. In real life, various hypotheses are constructed based on the observed phenomena, and different scientific communities draw different conclusions (hypotheses) consistent with the collected data. For example, there are many theories about the creation of the universe, but the current, widely accepted theory is the big-bang theory. In the following phase, scientists discuss their theories, knowledge is exchanged, new aspects are encountered, and the collected data are reevaluated. In that manner, existing hypotheses are improved, and new, better hypotheses are constructed.
Decision Trees

Figure 3. Hypothesis separation using GA (hypotheses h1 and h2 with separating hypothesis h3)

Decision trees are very appropriate to use as glue between different methods. In general, condition nodes contain a classifier (usually a simple attribute comparison), which enables quick and easy integration of different
methods. On the other hand, there are many different methods for decision-tree induction that all generate knowledge in the form of a tree. The most popular approach is greedy heuristic induction, which produces a single decision tree with respect to a purity measure (heuristic) evaluated at each split. Altering the purity measure may produce totally different results with different aspects (hypotheses) of a given problem. Hypothesis induction is done with the use of evolutionary algorithms (EA); when designing the EA, the operators for mutation, crossover, and selection have to be chosen carefully (Podgorelec et al., 2001). Combining the EA approach with heuristic approaches dramatically reduces the hypothesis search space. Another issue arises when two methods are combined using a further classifier for separation. For example, in Figure 3 there are two hypotheses, h1 and h2, each of which could be induced well separately using the existing set of methods. Suppose that no method is able to capture both hypotheses in a single hypothesis. We then need to separate the problem using another hypothesis, h3, which has no special meaning of its own but makes it possible to induce a successful composite hypothesis.
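The composite hypothesis can be sketched directly (hypothetical names; h1, h2, and h3 below are arbitrary callables standing in for induced hypotheses): h3 carries no domain meaning of its own and only routes each instance to the specialized hypothesis that covers it.

```python
# Sketch: h3 only decides which specialized hypothesis (h1 or h2) classifies a
# given instance; together they form one composite hypothesis.
class CompositeHypothesis:
    def __init__(self, h1, h2, h3):
        self.h1, self.h2, self.h3 = h1, h2, h3

    def classify(self, instance):
        chosen = self.h1 if self.h3(instance) == 0 else self.h2
        return chosen(instance)

# Toy example: h1 handles small values of x, h2 handles large ones,
# and h3 merely separates the two regions.
h1 = lambda inst: "positive" if inst["x"] < 2 else "negative"
h2 = lambda inst: "positive" if inst["x"] > 8 else "negative"
h3 = lambda inst: 0 if inst["x"] < 5 else 1

composite = CompositeHypothesis(h1, h2, h3)
print(composite.classify({"x": 1}), composite.classify({"x": 9}))  # positive positive
```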
Problem Adaptation

In many domains, we encounter data with a very unbalanced class distribution. That is especially true for applications in the medical domain, where most of the instances are regular and only a small percentage are assigned to an irregular class. Therefore, for most classifiers that aim at high accuracy and low complexity, it is most rational to classify all new instances into the majority class. This behavior is not desirable, however, because we want to extract knowledge (especially when we want to explain a decision-making process) and determine the reasons for the separation of classes. To cope with this problem, we introduced an instance-reweighting method that works in a manner similar to boosting, but on a different level. Instances that are rarely classified correctly gain importance. The fitness criteria of individuals take this importance into account and force competition among the individual induction methods. Of course, there is a danger of over-adapting to the
noise, but in that case the overall classification ability would decrease, and other induced classifiers could perform the classification better (self-adaptation). An effect similar to boosting is thus achieved by concentrating on hard-to-learn instances without dismissing already extracted knowledge.
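A minimal sketch of such instance reweighting is shown below (the update rule and names are ours and only illustrative, not the exact scheme used in the system): misclassified instances gain weight, and a weighted score can then enter the fitness of competing individuals.

```python
def reweight(labels, predictions, weights, step=0.5):
    """Boosting-like update: instances that are still misclassified gain importance."""
    updated = [w * (1.0 + step) if y != y_hat else w
               for y, y_hat, w in zip(labels, predictions, weights)]
    total = sum(updated)
    return [w / total for w in updated]          # renormalize to sum to 1

def weighted_accuracy(labels, predictions, weights):
    """Fitness ingredient: correctly classifying important instances counts more."""
    return sum(w for y, y_hat, w in zip(labels, predictions, weights) if y == y_hat)

# A very unbalanced toy sample: a classifier that always predicts the majority
# class looks accurate at first, but the minority instances keep gaining weight,
# so its weighted fitness drops with every reweighting round.
labels      = [0] * 9 + [1]
predictions = [0] * 10
weights     = [0.1] * 10
for _ in range(3):
    weights = reweight(labels, predictions, weights)
print(round(weighted_accuracy(labels, predictions, weights), 3))
```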
EXPERIMENTAL RESULTS

Our current data-mining research is performed mainly in the medical domain; therefore, a knowledge representation in a human-understandable form is very important, so we focus on decision-tree induction. To make an objective assessment of our method, a comparison of the extracted knowledge used for classification was made with the reference methods C4.5, C5/See5 without boosting, C5/See5 with boosting, and a genetic algorithm for decision-tree construction (Podgorelec et al., 2001). The following quantitative measures were used:

accuracy = (num. of correctly classified objects) / (num. of all objects)

accuracy_c = (num. of correctly classified objects in class c) / (num. of all objects in class c)

average class accuracy = (Σ_i accuracy_i) / (num. of classes)

We decided to use average class accuracy instead of the sensitivity and specificity that are usually used when dealing with medical databases. Experiments have been made with seven real-world databases from the field of medicine. For that reason, we have selected only symbolic knowledge from the whole population of resulting solutions. A detailed description of the databases can be found in Lenič and Kokol (2002). Other databases have been downloaded from the online repository of machine-learning datasets maintained at UCI. We compared two variations of our multi-method approach (MultiVeDec) with four conventional approaches; namely, C4.5 (Quinlan, 1993), C5/See5, boosted C5, and a genetic algorithm (Podgorelec & Kokol, 2001). The results are presented in Table 1. Gray-marked fields represent the best method on a specific database.
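For reference, these measures are straightforward to compute; the following small sketch (our own helper functions) also shows why average class accuracy is the more informative measure on unbalanced data.

```python
from collections import defaultdict

def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)

def average_class_accuracy(labels, predictions):
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)          # accuracy_c averaged over classes

# A majority-class predictor on a 9:1 sample: high accuracy, poor class average.
labels, predictions = [0] * 9 + [1], [0] * 10
print(accuracy(labels, predictions))                 # 0.9
print(average_class_accuracy(labels, predictions))   # 0.5
```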
Table 1. A comparison of the multi-method approach to conventional approaches (each cell gives accuracy / average class accuracy on the test set, in %)

Database     | C4.5        | C5            | Boosted C5    | Genetically induced DT | SVM (RBF)   | Multimethod   | Multimethod w/o problem adaptation
av           | 76.7 / 44.5 | 81.4 / 48.6   | 83.7 / 50.0   | 90.7 / 71.4            | 83.7 / 55.7 | 93.0 / 78.7   | 90.0 / 71.4
breastcancer | 96.4 / 94.4 | 95.3 / 95.0   | 96.6 / 96.7   | 96.6 / 95.5            | 96.5 / 97.1 | 96.6 / 95.1   | 96.9 / 97.8
heartdisease | 74.3 / 74.4 | 79.2 / 80.7   | 82.2 / 81.1   | 78.2 / 80.2            | 39.6 / 42.7 | 78.2 / 78.3   | 83.1 / 82.8
hepatitis    | 80.8 / 58.7 | 82.7 / 59.8   | 82.7 / 59.8   | 86.5 / 62.1            | 76.9 / 62.5 | 88.5 / 75.2   | 88.5 / 75.2
mracid       | 94.1 / 75.0 | 100.0 / 100.0 | 100.0 / 100.0 | 94.1 / 75.0            | 94.1 / 75.0 | 100.0 / 100.0 | 100.0 / 100.0
lipids       | 60.0 / 58.6 | 66.2 / 67.6   | 71.2 / 64.9   | 76.3 / 67.6            | 25.0 / 58.9 | 75.0 / 73.3   | 75.0 / 73.3
mvp          | 91.5 / 65.5 | 91.5 / 65.5   | 92.3 / 70.0   | 91.5 / 65.5            | 36.1 / 38.2 | 93.8 / 84.3   | 93.8 / 84.3

FUTURE TRENDS

As confirmed in many application domains, methods that use only a single approach often can lead to local optima and do not necessarily provide the big picture of the problem. By applying different methods, the induction power of the combined methods can supersede that of the single methods. Of course, there is no single way to weave methods together. Our approach emphasizes modularity based on knowledge sharing: the implicit knowledge of each induction method is shared with the others via the produced hypotheses. The synergy of methods can also be improved by weaving together the implicit knowledge of the induction methods' learning algorithms, which requires very tight coupling of two or more induction methods.

CONCLUSION

The success of the multi-method approach can be explained by the fact that some methods converge to local optima. With the combination of multiple methods (operators) in a different order, a better local (and hopefully global) solution can be found. Static hybrid
systems usually work sequentially or in parallel on a fixed structure and in a fixed order, performing whole tasks. The multi-method approach, on the other hand, works simultaneously with several methods on a single task (i.e., some parts are induced with different classical heuristics, some parts with hybrid methods, and still other parts with evolutionary programming). The presented multi-method approach enables a quick and modular way to integrate different methods into an existing system and allows the simultaneous application of several methods. It also enables the partial application of method operations to improve and recombine aspects, and it places no limitation on the order and number of applied methods.
REFERENCES

Auer, P., Holte, R.C., & Cohen, W.W. (1995). Theory and applications of agnostic PAC-learning with small decision trees. Proceedings of the 12th International Conference on Machine Learning.

Cox, M.T., & Ram, A. (1999). Introspective multistrategy learning: On the construction of learning strategies. Artificial Intelligence, 112, 1-55.

Dietterich, T.G. (2000). Ensemble methods in machine learning. Proceedings of the First International Workshop on Multiple Classifier Systems.

Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison Wesley.

Iglesias, C.J. (1996). The role of hybrid systems in intelligent data management: The case of fuzzy/neural hybrids. Control Engineering Practice, 4(6), 839-845.

Lenič, M., & Kokol, P. (2002). Combining classifiers with multimethod approach. Proceedings of the Second International Conference on Hybrid Intelligent Systems, Soft Computing Systems: Design, Management and Applications, Frontiers in Artificial Intelligence and Applications, Amsterdam.

McGarry, K., Wermter, S., & MacIntyre, J. (2001). The extraction and comparison of knowledge from local function networks. International Journal of Computational Intelligence and Applications, 1(4), 369-382.

Podgorelec, V., & Kokol, P. (2001). Evolutionary decision forests—Decision making with multiple evolutionary constructed decision trees: Problems in applied mathematics and computational intelligence. World Scientific and Engineering Society Press, 97-103.

Podgorelec, V., Kokol, P., Yamamoto, R., Masuda, G., & Sakamoto, N. (2001). Knowledge discovery with genetically induced decision trees. Proceedings of the International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA’2001).

Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Šprogar, M., et al. (2000). Vector decision trees. Intelligent Data Analysis, 4(3,4), 305-321.

Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

Todorovski, L., & Dzeroski, S. (2000). Combining multiple models with meta decision trees. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery.

Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.

Vapnik, V.N. (1995). The nature of statistical learning theory. New York: Springer Verlag.

Wolpert, D.H., & Macready, W.G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe, NM.

Zorman, M., Kokol, P., & Podgorelec, V. (2000). Medical decision making supported by hybrid decision trees. Proceedings of ISA’2000.

KEY TERMS

Boosting: Creation of an ensemble of hypotheses to convert a weak learner into a strong one by modifying the expected instance distribution.

Induction Method: The process of learning, from cases or instances, that results in a general hypothesis of a hidden concept in the data.

Method Level Operator: A partial operation at the induction/knowledge-transformation level of a specific induction method.

Multi-Method Approach: Investigation of a research question using a variety of research methods, each of which may contain inherent limitations, with the expectation that combining multiple methods may produce convergent evidence.

Population Level Operator: An operation, usually parameterized, that applies method-level operators (the parameter) to part of or the whole evolving population.

Transmutator: A knowledge transformation operator that modifies learner knowledge by exploring learner experience.
Comprehensibility of Data Mining Algorithms
Zhi-Hua Zhou, Nanjing University, China
INTRODUCTION

Data mining attempts to identify valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data. The mined patterns must be ultimately understandable because the purpose of data mining is to aid decision-making. If the decision-makers cannot understand what a mined pattern means, then the pattern cannot be used well. Since most decision-makers are not data mining experts, the patterns should ideally be in a style comprehensible to common people. So, the comprehensibility of data mining algorithms, that is, the ability of a data mining algorithm to produce patterns understandable to human beings, is an important factor.
BACKGROUND

A data mining algorithm is usually inherently associated with some representations for the patterns it mines. Therefore, an important aspect of a data mining algorithm is the comprehensibility of the representations it forms, that is, whether or not the algorithm encodes the patterns it mines in such a way that they can be inspected and understood by human beings. The importance of this was argued by a machine learning pioneer many years ago (Michalski, 1983): The results of computer induction should be symbolic descriptions of given entities, semantically and structurally similar to those a human expert might produce observing the same entities. Components of these descriptions should be comprehensible as single ‘chunks’ of information, directly interpretable in natural language, and should relate quantitative and qualitative concepts in an integrated fashion. Craven and Shavlik (1995) have indicated a number of concrete reasons why the comprehensibility of machine learning algorithms is very important. With slight modification, these reasons are also applicable to data mining algorithms.

• Validation: If the designers and end-users of a data mining algorithm are to be confident in the performance of the algorithm, they must understand how it arrives at its decisions.
• Discovery: Data mining algorithms may play an important role in the process of scientific discovery. An algorithm may discover salient features and relationships in the input data whose importance was not previously recognized. If the patterns mined by the algorithm are comprehensible, then these discoveries can be made accessible to human review.
• Explanation: In some domains, it is desirable to be able to explain the actions a data mining algorithm suggests be taken for individual input patterns. If the mined patterns are understandable in such a domain, then explanations of the suggested actions on a particular case can be garnered.
• Improving performance: The feature representation used for a data mining task can have a significant impact on how well an algorithm is able to mine. Mined patterns that can be understood and analyzed may provide insight into devising better feature representations.
• Refinement: Data mining algorithms can be used to refine approximately correct domain theories. In order to complete the theory-refinement process, it is important to be able to express, in a comprehensible manner, the changes that have been imparted to the theory during mining.
MAIN THRUST It is evident that data mining algorithms with good comprehensibility are very desirable. Unfortunately, most data mining algorithms are not very comprehensible and therefore their comprehensibility has to be enhanced by extra mechanisms. Since there are many different data mining tasks and corresponding data mining algorithms, it is difficult for such a short article to cover all of them. So, the following discussions are restricted to the comprehensibility of classification algorithms, but some essence is also applicable to other kinds of data mining algorithms. Some classification algorithms are deemed as comprehensible because the patterns they mine are expressed in an explicit way. Representatives are decision tree algorithms that encode the mined patterns in the form of a
decision tree which can be easily inspected. Some other classification algorithms are deemed as incomprehensible because the patterns they mine are expressed in an implicit way. Representatives are artificial neural networks that encode the mined patterns in real-valued connection weights. Actually, many methods have been developed to improve the comprehensibility of incomprehensible classification algorithms, especially for artificial neural networks. The main scheme for improving the comprehensibility of artificial neural networks is rule extraction, that is, extracting symbolic rules from trained artificial neural networks. It originates from Gallant’s work on connectionist expert system (Gallant, 1983). Good reviews can be found in (Andrews, Diederich, & Tickle, 1995; Tickle, Andrews, Golea, & Diederich, 1998). Roughly speaking, current rule extraction algorithms can be categorized into four categories, namely the decompositional, pedagogical, eclectic, or compositional algorithms. Each category is illustrated with an example below. The decompositional algorithms extract rules from each unit in an artificial neural network and then aggregate. A representative is the RX algorithm (Setiono, 1997), which prunes the network and discretizes outputs of hidden units for reducing computational complexity in examining the network. If a hidden unit has many connections then it is split into several output units and some new hidden units are introduced to construct a subnetwork, so that the rule extraction process is iteratively executed. The RX algorithm is summarized in Table 1. The pedagogical algorithms regard the trained artificial neural network as an opaque and aim to extract rules that map inputs directly into outputs. A representative is the TREPAN algorithm (Craven & Shavlik, 1996), which regards the rule extraction process as an inductive learning problem and uses oracle queries to induce an ID2-of3 decision tree that approximates the concept represented by a given network. The pseudo-code of this algorithm is shown in Table 2.
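As a concrete, simplified illustration of the pedagogical idea (this is not TREPAN itself, and the scikit-learn components are only stand-ins): treat the trained network as an opaque oracle, label data with its predictions, and induce a decision tree that mimics the input-output mapping.

```python
# Simplified pedagogical-style extraction: a decision tree is induced to mimic a
# trained neural network, which is treated purely as an oracle.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

network = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                        random_state=0).fit(X, y)

oracle_labels = network.predict(X)             # query the opaque model
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, oracle_labels)

fidelity = (surrogate.predict(X) == oracle_labels).mean()   # how well the tree mimics the net
print(f"fidelity to the network: {fidelity:.2f}")
print(export_text(surrogate))                  # the extracted, comprehensible description
```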
The eclectic algorithms incorporate elements of both the decompositional and pedagogical ones. A representative is the DEDEC algorithm (Tickle, Orlowski, & Diederich, 1996), which extracts a set of rules to reflect the functional dependencies between the inputs and the outputs of the artificial neural networks. Fig. 1 shows its working routine. The compositional algorithms are not strictly decompositional because they do not extract rules from individual units with subsequent aggregation to form a global relationship, nor do them fit into the eclectic category because there is no aspect that fits the pedagogical profile. Algorithms belonging to this category are mainly designed for extracting deterministic finite-state automata (DFA) from recurrent artificial neural networks. A representative is the algorithm proposed by Omlin and Giles (1996), which exploits the phenomenon that the outputs of the recurrent state units tend to cluster, and if each cluster is regarded as a state of a DFA then the relationship between different outputs can be used to set up the transitions between different states. For example, assuming there are two recurrent state units s0 and s1, and their outputs appear as nine clusters, then the working style of the algorithm is shown in Fig. 2. During the past years, powerful classification algorithms have been developed in the ensemble learning area. An ensemble of classifiers works through training multiple classifiers and then combining their predictions, which is usually much more accurate than a single classifier (Dietterich, 2002). However, since the classification is made by a collection of classifiers, the comprehensibility of an ensemble is poor even when its component classifiers are comprehensible. A pedagogical algorithm has been proposed by Zhou, Jiang, and Chen (2003) to improve the comprehensibility of ensembles of artificial neural networks, which utilizes the trained ensemble to generate instances and then extracts symbolic rules from them. The success of this
Table 1. The RX algorithm
1. Train and prune the artificial neural network.
2. Discretize the activation values of the hidden units by clustering.
3. Generate rules that describe the network outputs using the discretized activation values.
4. For each hidden unit:
   1) If the number of input connections is less than an upper bound, then extract rules to describe the activation values in terms of the inputs.
   2) Else form a subnetwork: (a) Set the number of output units equal to the number of discrete activation values. Treat each discrete activation value as a target output. (b) Set the number of input units equal to the inputs connected to the hidden units. (c) Introduce a new hidden layer. (d) Apply RX to this subnetwork.
5. Generate rules that relate the inputs and the outputs by merging rules generated in Steps 3 and 4.
Table 2. The TREPAN algorithm (pseudo-code fragment)
TREPAN(training_examples, features)
  Queue ← Ø
  for each example E ∈ training_examples
    E.label ← ORACLE(E)
  initialize the root of the tree, T, as a leaf node
  put …
Figure 1. Working routine of the DEDEC algorithm: ANN training; weight vector analysis (analyze the weights to obtain a rank of the inputs according to their relative importance in predicting the output); and iterative FD/rule extraction (search for functional dependencies between the important inputs and the output, then extract the corresponding rules)
Figure 2. The working style of Omlin and Giles’s algorithm on the clustered outputs of two recurrent state units, s0 and s1: (a) all the possible transitions from state 1; (b) all the possible transitions from state 2; (c) all the possible transitions from states 3 and 4
algorithm suggests that research on improving comprehensibility of artificial neural networks can give illumination to the improvement of comprehensibility of other complicated classification algorithms. Recently, Zhou & Jiang (2003) proposed to combine ensemble learning and rule induction algorithms to obtain accurate and comprehensible classifiers. Their algorithm uses an ensemble of artificial neural networks as a data preprocessing mechanism for the induction of symbolic rules. Later, they (Zhou & Jiang, 2004) presented a new decision tree algorithm and shown that when the ensemble is significantly more accurate than the decision tree directly grown from the original training set and the original training set has not fully captured the target distribution, using an ensemble as the preprocessing mechanism is beneficial. These works suggest the twicelearning paradigm to develop accurate and comprehensible classifiers, that is, using coupled classifiers where a classifier devotes to the accuracy while the other devotes to the comprehensibility.
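A rough sketch of the twice-learning idea follows (scikit-learn stand-ins rather than the original neural network ensemble and C4.5; the virtual-data generation step is only illustrative): the accurate ensemble labels additional virtual instances, and a single decision tree trained on the enlarged, ensemble-labeled data serves as the comprehensible classifier.

```python
# Twice-learning sketch: one learner (the ensemble) is devoted to accuracy and is
# used as a preprocessor; the other (a decision tree) is devoted to comprehensibility.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

ensemble = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Generate virtual instances around the training data and label them with the ensemble.
rng = np.random.default_rng(1)
X_virtual = X[rng.integers(0, len(X), size=600)] + rng.normal(0, 0.1, size=(600, X.shape[1]))
X_twice = np.vstack([X, X_virtual])
y_twice = ensemble.predict(X_twice)

comprehensible = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_twice, y_twice)
print("agreement with ensemble:", (comprehensible.predict(X) == ensemble.predict(X)).mean())
```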
FUTURE TRENDS It was supposed that an algorithm which could produce explicitly expressed patterns is comprehensible. However, such a supposition might not be so valid as it appears to be. For example, as for a decision tree containing hundreds of leaves, whether or not it is comprehensible? A quantitative answer might be more feasible than a qualitative one. Thus, quantitative measure of comprehensibility is needed. Such a measure can also help solve a long-standing problem, that is, how to compare the comprehensibility of different algorithms. Since rule extraction is an important scheme for improving the comprehensibility of complicated data mining algorithms, frameworks for evaluating the quality of extracted rules are important. Actually, the FACC (Fidelity, Accuracy, Comprehensibility, Consistency) framework proposed by Andrews, Diederich, and Tickle (1995) has been used for almost a decade, which contains two important criteria, i.e. fidelity and accuracy. Recently, Zhou (2004) identified the fidelity-accuracy dilemma which indicates that in some cases pursuing high fidelity and high accuracy simultaneously is impossible. Therefore, new evaluation frameworks have to be developed and employed, while the ACC (eliminating Fidelity from FACC) framework suggested by Zhou (2004) might be a good candidate. Most current rule extraction algorithms suffer from high computational complexity. For example, in decompositional algorithms, if all the possible relationships between the connection weights and units in a trained artificial neural network are considered, then com-
binatorial explosion is inevitable for even moderate-sized networks. Although many mechanisms such as pruning have been employed to reduce the computational complexity, the efficiency of most current algorithms is not good enough. In order to work well in real-world applications, effective algorithms with better efficiency are needed. Until now almost all works on improving comprehensibility of complicated algorithms rely on rule extraction. Although symbolic rule is relatively easy to be understood by human beings, it is not the only comprehensible style that could be exploited. For example, visualization may provide good insight into a pattern. However, although there are a few works (Frank & Hall, 2004; Melnik, 2002) utilizing visualization techniques to improve the comprehensibility of data mining algorithms, few work attempts to exploit together rule extraction and visualization, which is evidently very worth exploring. Previous research on comprehensibility has mainly focused on classification algorithms. Recently, some works on improving the comprehensibility of complicated regression algorithms have been presented (Saito & Nakano, 2002; Setiono, Leow, & Zurada, 2002). Since complicated algorithms exist extensively in data mining, more scenarios besides classification should be considered.
CONCLUSION

This short article briefly discusses comprehensibility issues in data mining. Although there is still a long way to go before patterns that can be understood by common people are produced in any data mining task, the endeavors on improving the comprehensibility of complicated algorithms have paved a promising way. It can be anticipated that the experiences and lessons learned from this research might shed light on how to design data mining algorithms whose comprehensibility is good enough that it does not need to be improved further. Only when comprehensibility is no longer a problem can the fruits of data mining be fully enjoyed.
REFERENCES

Andrews, R., Diederich, J., & Tickle, A.B. (1995). Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6), 373-389.

Craven, M.W., & Shavlik, J.W. (1995). Extracting comprehensible concept representations from trained neural networks. In Working Notes of the IJCAI’95 Workshop on Comprehensibility in Machine Learning (pp. 61-75), Montreal, Canada.
Craven, M.W., & Shavlik, J.W. (1996). Extracting treestructured representations of trained networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems, 8 (pp. 24-30). Cambridge, MA: MIT Press. Dietterich, T.G. (2002). Ensemble learning. In M.A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press. Frank, E., & Hall, M. (2003). Visualizing class probability estimation. In N. Lavraè , D. Gamberger, H. Blockeel, & L. Todorovski (Eds.), Lecture Notes in Artificial Intelligence, 2838. Berlin: Springer, 168-179. Gallant, S.I. (1983). Connectionist expert systems. Communications of the ACM, 31(2), 152-169. Melnik, O. (2002). Decision region connectivity analysis: a method for analyzing high-dimensional classifiers. Machine Learning, 48(1-3), 321-351. Michalski, R. (1983). A theory and methodology of inductive learning. Artificial Intelligence, 20(2), 111-161. Omlin, C.W., & Giles, C.L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41-52. Saito, K., & Nakano, R. (2002). Extracting regression rules from neural networks. Neural Networks, 15(10), 12791288. Setiono, R. (1997). Extracting rules from neural networks by pruning and hidden-unit splitting. Neural Computation, 9(1), 205-225. Setiono, R., Leow, W.K., & Zurada, J.M. (2002). Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, 13(3), 564-577. Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057-1067.
Zhou, Z.-H., & Jiang, Y. (2003). Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Transactions on Information Technology in Biomedicine, 7(1), 37-42. Zhou, Z.-H., & Jiang, Y. (2004). NeC4.5: neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6), 770-773. Zhou, Z.-H., Jiang, Y., & Chen, S.-F. (2003). Extracting symbolic rules from trained neural network ensembles. AI Communications, 16(1), 3-15.
KEY TERMS Accuracy: The measure of how well a pattern can generalize. In classification it is usually defined as the percentage of examples that are correctly classified. Artificial Neural Networks: A system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or units. Comprehensibility: The understandability of a pattern to human beings; the ability of a data mining algorithm to produce patterns understandable to human beings. Decision Tree: A flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class or class distribution. Ensemble Learning: A machine learning paradigm using multiple learners to solve a problem. Fidelity: The measure of how well the rules extracted from a complicated model mimic the behavior of that model. MOFN Expression: A boolean expression consisted of an integer threshold m and n boolean antecedents, which is fired when at least m antecedents are fired. For example, the MOFN expression 2-of-{ a, ¬b, c } is logically equivalent to (a ∧ ¬b) Ú (a ∧ c) ∨ (¬ b ∧ c).
Tickle, A.B., Orlowski, M., & Diederich, J. (1996). DEDEC: A methodology for extracting rule from trained artificial neural networks. In Proceedings of the AISB’96 Workshop on Rule Extraction from Trained Neural Networks (pp. 90-102), Brighton, UK.
Rule Extraction: Given a complicated model such as an artificial neural network and the data used to train it, produce a symbolic description of the model.
Zhou, Z.-H. (2004). Rule extraction: Using neural networks or for neural networks? Journal of Computer Science and Technology, 19(2), 249-253.
Symbolic Rule: A pattern explicitly comprising an antecedent and a consequent, usually in the form of “IF … THEN …”.
Twice-Learning: A machine learning paradigm using coupled learners to achieve two aspects of advantages. In its original form, two classifiers are coupled together where one classifier is devoted to the accuracy while the other devoted to the comprehensibility.
Computation of OLAP Cubes
Amin A. Abdulghani, Quantiva, USA
INTRODUCTION The focus of Online Analytical Processing (OLAP) is to provide a platform for analyzing data (e.g., sales data) with multiple dimensions (e.g., product, location, time) and multiple measures (e.g., total sales or total cost). OLAP operations then allow viewing of this data from a number of perspectives. For analysis, the object or data structure of primary interest in OLAP is a cube.
BACKGROUND

An n-dimensional cube is defined as a group of k-dimensional (k<=n) cuboids arranged by the dimensions of the data. A cell represents an association of a measure m (e.g., total sales) with a member of every dimension (e.g., product=“toys”, location=“NJ”, year=“2003”). By definition, a cell in an n-dimensional cube can have fewer dimensions than n. The dimensions not present in the cell are aggregated over all possible members. For example, you can have a two-dimensional (2-D) cell, C1(product=“toys”, year=“2003”). Here, the implicit value for the dimension location is ‘*’, and the measure m (e.g.,
total sales) is aggregated over all locations. A cuboid is a group-by of a subset of dimensions, obtained by aggregating all tuples on these dimensions. In an n-dimensional cube, a cuboid is a base cuboid if it has exactly n dimensions. If the number of dimensions is fewer than n, then it is an aggregate cuboid. Any of the standard aggregate functions such as count, total, average, minimum, or maximum can be used for aggregating. Figure 1 shows an example of a three-dimensional (3-D) cube, and Figure 2 shows an aggregate 2-D cuboid. In theory, no special operators or SQL extensions are required to take a set of records in the database and generate all the cells for the cube. Rather, the SQL groupby and union operators can be used in conjunction with d sorts of the dataset to produce all cuboids. However, such an approach would be very inefficient, given the obvious interrelationships between the various groupbys produced.
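As a concrete illustration of why such a naive approach is inefficient, the following sketch (a toy example with invented data) computes every cuboid of a tiny fact table independently, once per subset of dimensions, without sharing any work between group-bys.

```python
# Naive full-cube computation: aggregate the fact table once per group-by (2^d of
# them), exactly the redundant work that specialized cube algorithms avoid.
from itertools import combinations
from collections import defaultdict

dimensions = ("product", "location", "year")
facts = [  # (product, location, year, total_sales)
    ("toys", "NJ", 2003, 10.0),
    ("toys", "NY", 2003, 7.5),
    ("clothes", "NJ", 2002, 3.0),
]

cube = {}
for k in range(len(dimensions) + 1):
    for cuboid in combinations(dimensions, k):        # one group-by per subset of dimensions
        cells = defaultdict(float)
        for row in facts:
            record = dict(zip(dimensions, row[:3]))
            key = tuple(record[d] for d in cuboid)    # dimensions not in the cuboid are '*'
            cells[key] += row[3]                      # SUM(total_sales)
        cube[cuboid] = dict(cells)

print(cube[("product",)])   # {('toys',): 17.5, ('clothes',): 3.0}
print(cube[()])             # apex cuboid: {(): 20.5}
```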
MAIN THRUST

I now describe the essence of the major methods for the computation of OLAP cubes.
Figure 1. A 3-D cube that consists of 1-D, 2-D, and 3-D cuboids (dimensions: product, location, year; measure: total sales)

Figure 2. An example 2-D cuboid on (product, year) for the 3-D cube in Figure 1 (location='*'); total sales needs to be aggregated (e.g., SUM)
Top-Down Computation In a seminal paper, Gray, Bosworth, Layman, and Pirahesh (1996) proposed the data cube operator as a means of simplifying the process of data cube construction. The algorithm presented forms the basis of the top-down approach. In the approach, the aggregation functions are categorized into three classes: •
•
•
Distributive: An aggregate function F is called distributive if there exists a function g such that the value of F for an n-dimensional cell can be computed by applying g to the value of F in an (n + 1)– dimensional cell. Examples of such functions include SUM and COUNT. For example, the COUNT() of an n-dimensional cell can be computed by applying SUM to the value of COUNT() in an (n+1)– dimensional cell. Algebraic: An aggregate function F is algebraic if F of an n-dimensional cell can be computed by using a constant number of aggregates of the (n + 1)– dimensional cell. An example is the AVERAGE() function. The AVERAGE() of an n-dimensional cell can be computed by taking the sum and count of the (n+1)–dimensional cell and then dividing the SUM by the COUNT to produce the global average. Holistic: An aggregate function F is called holistic if the value of F for an n-dimensional cell cannot be computed from a constant number of aggregates of the (n+1)–dimensional cell. Median and mode are examples of holistic functions.
compute the child (and no intermediate group-bys exist between the parent and child). Figure 3 depicts a sample lattice where A, B, C, and D are dimensions, nodes represent group-bys, and the edges show the parent-child relationship. The basic idea for top-down cube construction is to start by computing the base cuboid (group-by for which no cube dimensions are aggregated). A single pass is made over the data, a record is examined, and the appropriate base cell is incremented. The remaining group-bys are computed by aggregating over already computed finer grade group-by. If a group-by can be computed from one or more possible parent group-bys, then the algorithm uses the parent smallest in size. For example, for computing the cube ABCD, the algorithm starts out by computing the cuboids for ABCD. Then, using ABCD, it computes the cuboids for ABC, ABD, and BCD. The algorithm then repeats itself by computing the 2-D cuboids, AB, BC, AD, and BD. Note that the 2-D–cuboids can be computed from multiple parents. For example, AB can be computed from ABC or ABD. The algorithm selects the smaller group-by (the group-by with the fewest number of cells). An example top-down cube computation is shown in Figure 4. Variants of this approach optimize on additional costs. The best-known methods are the PipeSort and PipeHash (Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan, et al., 1996). The basic idea of both algorithms is that a minimum spanning tree should be generated from the original lattice such that the cost of traversing edges will be minimized. The optimizations for the costs that these algorithms include are as follows: •
Cache-results: This optimization aims at ensuring that the result of a group-by is cached (in memory) so other group-bys can use it in the future. Amortize-scans: This optimization amortizes the cost of a disk read by computing the maximum possible number of group-bys together in memory. Share-sorts: For a sort-based algorithm, this aims at sharing sorting cost across multiple group-bys.
The top-down cube computation works with distributive or algebraic functions. These functions have the property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates. This property induces a partial-ordering (i.e., a lattice) on all the group-bys of the cube. A group-by is called a child of some parent group-by if the parent can be used to
•
Figure 3. Cube lattice
Figure 4. Top-down cube computation
•
ABCD ABC
ABD
AB AC A
ACD
AD B
BCD
BC C
all
ABCD
BD D
ABC
CD
ABD
AB AC A
ACD
AD B
BCD
BC C
BD
CD
D
all 197
C
Computation of OLAP Cubes
•
Share-partitions: For a hash-based algorithm, when the hash-table is too large to fit in memory, data are partitioned and aggregated to fit in memory. This can be achieved by sharing this cost across multiple group- bys.
Both PipeSort and PipeHash are examples of ROLAP (Relational-OLAP) algorithms as they operate on multidimensional tables. An alternative is MOLAP (Multidimensional OLAP), where the approach is to store the cube directly as a multidimensional array. If the data are in relational form, then they are loaded into arrays, and the cube is computed from the arrays. The advantage is that no tuples are needed for comparison purposes; all operations are performed by using array indices. A good example for a MOLAP-based cubing algorithm is the MultiWay algorithm (Zhao, Deshpande, & Naughton, 1997). Here, to make efficient use of memory, a single large d-dimensional array is divided into smaller d-dimensional arrays, called chunks. It is not necessary to keep all chunks in memory at the same time; only part of the group-by arrays are needed for a given instance. The algorithm starts at the base cuboid with a single chunk. The results are passed to immediate lower-level cuboids of the top-down cube computation tree. After the chunk processing is complete, the next chunk is loaded. The process then repeats itself. An advantage of the MultiWay algorithm is that it also allows for shared computation, where intermediate aggregate values are reused for the computation of successive descendant cuboids. The disadvantage of MOLAP-based algorithms is that they are not scalable for a large number of dimensions and sparse data.
Bottom-Up Computation Computing and storing full cubes may be overkill. Consider, for example, a dataset with 10 dimensions, each with a domain size of 9 (i.e., can have nine distinct values). Computing the full cube for this dataset would require computing and storing 1010 cuboids. Not all cuboids in the full cube may be of interest. For example, if the measure is total sales, and a user is interested in finding the cuboids that have high sales (greater than a given threshold), then computing cells with low total sales would be redundant. It would be more efficient here to only compute the cells that have total sales greater than the threshold. The group of cells that satisfies the given query is referred to as an iceberg cube. The query returning the group of cells is referred to as an iceberg query. Iceberg queries are typically computed using a bottomup approach (Beyer & Ramakrishnan, 1999). These methods work from the bottom of the cube lattice and then work their way up towards the cells with a larger number of dimensions. Figure 5 illustrates this approach. The ap198
Figure 5. Bottom-up cube computation ABCD ABC AB AC A
ABD
ACD
AD B
BCD
BC C
BD
CD
D
all
proach is typically more efficient for high-dimensional, sparse datasets. The key observation to its success is in its ability to prune cells in the lattice that do not lead to useful answers. Suppose, for example, that the iceberg query is COUNT()>100. Also, suppose there is a cell X(A=a1, B=b1) with COUNT=75. Then, clearly, C does not satisfy the query. However, in addition, you can see that any cell X’ in the cube lattice that includes the dimensions (e.g. A=a1, B=b1) will also not satisfy the query. Thus, you need not look at any such cell after having looked at cell X. This forms the basis of support pruning. The observation was first made in the context of frequent sets for the generation of association rules (Agrawal, Imielinski, & Swami, 1993). For more arbitrary queries, detecting whether a cell is prunable is more difficult. The fundamental property that allows pruning is called monotonicity of the query: Let D be a database and X⊆D be a cuboid. A query Q⋅) is monotonic at X if the condition Q(X) is FALSE implies that Q(X’) is FALSE for any X⊆X. With this definition, a cell X in the cube would be pruned if the query Q is monotonic at X. However, determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple class of queries (Imielinski, Khachiyan, & Abdulghani, 2002). To work around this problem, another notion of monotonicity referred to as view monotonicity of a query is introduced. Suppose you have cuboids defined on a set S of dimension and measures. A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S. So, for example, in a cube of rural buyers, if the average sale of bread is 15 with 20 people, then the view on the set {areaType, COUNT(), AVG(salesBread)} for the cube is {areaType=‘rural’, COUNT()=20, AVG(salesBread)=15}. Extending the definition, a view on a query Q for X is an assignment of values for the set of dimension and measure attributes of the query expression.
Computation of OLAP Cubes
A query Q(⋅) view monotonic on view V(Q,X) if for any cell X in any database D such that V is the view for X, the condition Q is FALSE, for X implies Q is FALSE for all X’ ⊆X. An important property of view monotonicity is that the time and space required for checking it for a query depends on the number of terms in the query and not on the size of the database or the number of its attributes. Because most queries typically have few terms, it would be useful in many practical situations. Imielinski et al. (2002) presents a method for checking view monotonicity for a query that includes constraints of type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, aggregates that are higher order moments about the origin, or aggregates that are an integral of a function on a single attribute.
Hybrid Approach
exceeds the benefit of any other nonmaterialized groupby. Let S be the set of materialized group-bys. This benefit of including a group-by v in the set S is the total savings achieved for computing the group-bys not included in S by using v versus the cost of computing them through some group-by already in S. Gupta, Harinarayan, Rajaraman, and Ullman (1997) further extend this work to include indices in the cost. The subset of the cuboids selected for materialization is referred to as a partial cube. There have been efficient approaches suggested in the literature for computing the partial cube. In one such approach suggested by Dehne, Eavis, and Rau-Chaplin (2004), the cuboids are computed in a top-down fashion. The process starts with the original lattice or the PipeSort spanning tree of the original lattice, organizes the selected cuboids into a tree of minimal cost, and then further tries to reduce the cost by possibly adding intermediate nodes. After a set of cuboids has been materialized, queries are evaluated by using the materialized results. Park, Kim, and Lee (2001) describe a typical approach. Here, the OLAP queries are answered in a three-step process. In the first step, it selects the materialized results that will be used for rewriting and identifies the part of the query (region) that the materialized result can answer. Next, query blocks are generated for these query regions. Finally, query blocks are integrated into a rewritten query.
Typically, for low-dimension, low-cardinality, dense datasets, the top-down approach is more applicable than the bottom-up one. However, combining the two approaches leads to an even more efficient algorithm (Xin, Han, Li, & Wah, 2003). On the global computation order, the work presented uses the top-down approach. At a sublayer underneath, it exploits the potential of the bottom-up model. Consider the top-down computation tree in Figure 4. Notice that the dimension ABC is included for all the cuboids in the leftmost subtree. Similarly, all the cuboids in the second subtree include the dimensions AB. These common dimensions are termed the shared dimensions of the particular subtrees and enable bottomup computation. The observation is that if a query is FALSE and (view)-monotonic on the cell defined by the shared dimensions, then the rest of the cells generated from this shared dimension are unneeded. The critical requirement is that for every cell X, the cell for the shared dimensions must be computed first. The advantage of such an approach is that it allows for shared computation as in the top-down approach as well as for pruning, when possible.
FUTURE TRENDS
Other Approaches
•
Until now, I have considered computing the cuboids from the base data. Another commonly used approach is to materialize the results of a selected set of group-bys and evaluate all queries by using the materialized results. Harinarayan, Rajaraman, and Ullman (1996) describe an approach to materialize a limit of k group-bys. The first group-by to materialize always includes the top group-by, as none of the group-bys can be used to answer queries for this group-by. The next group-by to materialize is included such that the benefit of including it in the set
In this paper, I focus on the basic aspects of cube computation. The field is pretty recent, going back to no more than 10 years. However, as the field is beginning to mature, issues are becoming better understood. Some of the issues that will get more attention in future work include: •
•
Advanced data structures for organizing input tuples of the input cuboids (Han, Pei, Dong, & Wang, 2001; Xin et al., 2003). Making use of inherent property of the dataset to reduce computation of the data cubes. An example is the range cubing algorithm, which utilizes the correlation in the datasets to reduce the computation cost (Feng, Agrawal, Abbadi, & Metwally, 2004). Compressing the size of the data cube and storing it efficiently. A good example is a quotient cube, which partitions the cube into classes such that each cell in a class has the same aggregate value, and the lattice generated preserves the original cube’s semantics (Lakshmanan, Pei, & Han, 2002).
199
C
Computation of OLAP Cubes
Additionally, I believe that future work in this direction would attack the problem from multiple issues rather than a single focus. Sismanis, Deligiannakis, Roussopoulus, and Kotidis (2002) describe one such work, “Dwarf.” Its architecture integrates multiple features including the compressing of data cubes, a tunable parameter for controlling the amount of materialization, and indexing and support for incremental updates (which is important for the case when the underlying data are periodically updated).
CONCLUSION
ceedings of the Hawaii International Conference on System Sciences, USA. Feng, Y., Agrawal, D., Abbadi, A. E., & Metwally, A. (2004). Range CUBE: Efficient cube computation by exploiting data correlation. Proceedings of the International Conference on Data Engineering (pp. 658-670), USA. Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering (pp. 152159), USA.
This paper focuses on the methods for OLAP cube computation. All methods share the similarity that they make use of the ordering defined by the cube lattice to drive the computation. For the top-down approach, traversal occurs from the top of the lattice. This has the advantage that for the computation of successive descendant cuboids, intermediate node results are used. The bottomup approach traverses the lattice in the reverse direction. The method can no longer rely on making use of the intermediate node results. Its advantage lies in the ability to prune cuboids in the lattice that do not lead to useful answers. The hybrid approach uses the combination of both methods, taking advantages of both. The other major option for computing the cubes is to materialize only a subset of the cuboids and to evaluate queries by using this set. The advantage lies in storage costs, but additional issues, such as identification of the cuboids to materialize, algorithms for materializing these, and query rewrite algorithms, are raised.
Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1997). Index selection for OLAP. Proceedings of the International Conference on Data Engineering (pp. 208219), UK.
REFERENCES
Park, C.-S., Kim, M. H., & Lee, Y.-J. (2001). Rewriting OLAP queries using materialized views and dimension hierarchies in data warehouses. Proceedings of the International Conference on Data Engineering, Germany.
Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., et al. (1996). On the computation of multidimensional aggregates. Proceedings of the International Conference on Very Large Data Bases (pp. 506-521), India. Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA. Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg cubes. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA. Dehne, F. K. H., Eavis, T., & Rau-Chaplin, A. (2004). Topdown computation of partial ROLAP data cubes. Pro-
200
Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference (pp. 1-12), USA. Harinarayan, V., Rajaraman, A., & Ullman, J. D. (1996). Implementing data cubes efficiently. Proceedings of the ACM SIGMOD Conference (pp. 205-216), Canada. Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Journal of Data Mining and Knowledge Discovery, 6(3), 219-257. Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cube: How to summarize the semantics of a data cube. Proceedings of the International Conference on Very Large Data Bases (pp. 778-789), China.
Sismanis, Y., Deligiannakis, A., Roussopoulus, N., & Kotidis, Y. (2002). Dwarf: Shrinking the petacube. Proceedings of the ACM SIGMOD Conference (pp. 464-475), USA. Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases (pp. 476-487), Germany. Zhao, Y., Deshpande, P. M., & Naughton, J. F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. Proceedings of the ACM SIGMOD Conference (pp. 159-170), USA.
Computation of OLAP Cubes
KEY TERMS
by applying g to the value of F in (n + 1)–dimensional cuboid.
Algebraic Function: An aggregate function F is algebraic if F of an n-dimensional cuboid can be computed by using a constant number of aggregates of the (n + 1)– dimensional cuboid.
Holistic Function: An aggregate function F is called holistic if the value of F for an n-dimensional cell cannot be computed from a constant number of aggregates of the (n+1)–dimensional cell.
Bottom-Up Cube Computation: Cube construction that starts by computing from the bottom of the cube lattice and then working up toward the cells with a greater number of dimensions.
N-Dimensional Cube: A group of k-dimensional (k<=n) cuboids arranged by the dimension of the data.
Cube Cell: Represents an association of a measure m with a member of every dimension. Cuboid: A group-by of a subset of dimensions, obtained by aggregating all tuples on these dimensions. Distributive Function: An aggregate function F is called distributive if there exists a function g such that the value of F for an n-dimensional cuboid can be computed
Partial Cube: The subset of the cuboids selected for materialization. Sparse Cube: A cube is sparse if a high ratio of the cube’s possible cells does not contain any measure value. Top-Down Cube Computation: Cube construction that starts by computing the base cuboid and then iteratively computing the remaining cells by aggregating over already computed finer-grade cells in the lattice.
201
C
Concept Drift
Marcus A. Maloof, Georgetown University, USA
INTRODUCTION Traditional approaches to data mining are based on an assumption that the process that generated or is generating a data stream is static. Although this assumption holds for many applications, it does not hold for many others. Consider systems that build models for identifying important e-mail. Through interaction with and feedback from a user, such a system might determine that particular e-mail addresses and certain words of the subject are useful for predicting the importance of email. However, when the user or the persons sending email start other projects or take on additional responsibilities, what constitutes important e-mail will change. That is, the concept of important e-mail will change or drift. Such a system must be able to adapt its model or concept description in response to this change. Coping with or tracking concept drift is important for other applications, such as market-basket analysis, intrusion detection, and intelligent user interfaces, to name a few.
BACKGROUND Concept drift may occur suddenly, what some call revolutionary drift. Or it may occur gradually, what some call evolutionary drift (Klenner & Hahn, 1994). Drift may occur at different time scales and at varying rates. Concepts may change and then reoccur, perhaps with contextual cues (Widmer, 1997). For example, the concept of warm is different in the summer than in the winter. A contextual cue or variable is perhaps the season or the daily mean temperature. It helps identify the appropriate model to use for determining if, for example, it is a warm day. Coping with concept drift requires an online approach, meaning that the system must mine a stream of data. If drift occurs slowly enough, distinguishing between real and virtual concept drift may be difficult (Klinkenberg & Joachims, 2000). The former occurs when concepts indeed change over time; the latter occurs when performance drops during the normal online process of building and refining a model. Such drops could be due to differences in the data in different parts of the stream. The richness of this problem has led to an equally rich collection of approaches.
Research on the problem of concept drift has been empirical and theoretical (see Kuh, Petsche, & Rivest, 1991; Mesterharm, 2003); however, the focus of this article is on empirical approaches. In this regard, researchers have evaluated such approaches by using synthetic data sets (for examples, see Maloof & Michalski, 2004; Schlimmer & Grainger, 1986; Street & Kim, 2001; Widmer & Kubat, 1996) and real data sets (for examples, see Black & Hickey, 2002; Blum, 1997; Lane & Brodley, 1998), both small (Maloof & Michalski, 2004; Schlimmer & Grainger, 1986; Widmer & Kubat, 1996) and large (Hulten, Spencer, & Domingos, 2001; Kolter & Maloof, 2003; Street & Kim, 2001; Wang, Fan, Yu, & Han, 2003). Systems for coping with concept drift fall into three broad categories: incremental, partial memory, and ensemble. Incremental approaches use new instances to modify existing models. Partial-memory approaches maintain a subset of previously encountered instances, previously built and refined models, or both. When new instances arrive, such systems use the new instances and their store of past instances and models to build new models or refine existing ones. Finally, ensemble methods build and maintain multiple models to cope with concept drift. Naturally, systems designed for concept drift do not always fall neatly into only one of these categories. For example, some partial-memory approaches are also incremental (Maloof & Michalski, 2004; Widmer & Kubat, 1996). A model or concept description is a generalized representation of instances or training data. Such representations are important for prediction and for better understanding the data set from which they were built. Researchers have used a variety of representations for drifting concepts, including the instances themselves, probabilities, linear equations, decision trees, and decision rules. Methods for inducing and building such representations include instance-based learning, naïve Bayes, support vector machines, C4.5, and AQ15, respectively. See Hand, Mannila, and Smyth (2001) for additional information.
MAIN THRUST Researchers have proposed a variety of approaches for learning concepts that drift. However, evaluating such
approaches is a critical issue. In the next two sections, I survey approaches for learning concepts that change over time and discuss issues of evaluating such approaches.
Survey of Approaches for Concept Drift STAGGER (Schlimmer & Grainger, 1986) was the first system for coping with concept drift. Its model consists of nodes, corresponding to features and class labels, linked together with probabilistic arcs, representing the strength of association between features and class labels. As STAGGER processes new instances, it increases or decreases probabilities, and it may add nodes and arcs. To classify an unknown instance, STAGGER predicts the most probable class. Partial-memory approaches maintain a store of partially built models, a portion of the previously encountered instances, or both. Such approaches vary in how they use such information for adjusting current models. The FLORA Systems (Widmer & Kubat, 1996) maintain a sequence of examples over a dynamically adjusted window of time. The Window Adjustment Heuristic (WAH) adjusts the size of this window in response to performance changes. Generally, if performance is decreasing or poor, then the heuristic reduces the window’s size; if it is increasing or acceptable, then it increases the size. These systems also maintain a store of rules, including ones that are overly general, although these are not used for prediction. As the systems process instances, they create new rules or refine existing ones. To classify an instance, the FLORA systems select the rule that best matches and return its class label. The AQ-PM Systems maintain a set of examples over a window of time, but the systems select examples from the boundaries of rules, so they can retain examples that do not reoccur in the data stream. AQ-PM (Maloof & Michalski, 2000) builds new rules when new examples arrive, whereas AQ11-PM (Maloof & Michalski, 2004) refines existing rules. These systems maintain examples over a static window of time, but AQ11-PM+WAH (Maloof, 2003) incorporates Widmer and Kubat’s (1996) WAH for dynamically sizing this window. Because these systems use rules, when classifying an instance, they return as their prediction the class label of the best matching rule. The Concept-adapting Very Fast Decision Tree system (Hulten et al., 2001), or CVFDT, progressively grows a decision tree downward from the leaf nodes. It maintains frequency counts for attribute values by class and extends the tree when a statistical test indicates that a change has occurred. CVFDT also maintains at each node a list of alternate subtrees, which it swaps with the current subtree when it detects drift. To classify an
unknown instance, the method uses the instance's values to traverse the current tree from the root to a leaf node, returning as the prediction the associated label.

The Concept Drift 3 system (Black & Hickey, 1999), or CD3, uses batches of instances annotated with a time stamp of either current or new to build a decision tree. When drift occurs, time becomes more relevant for prediction, so the time-stamp attribute will appear higher in the decision tree. After pruning, CD3 converts the tree to rules by enumerating all paths containing a new time stamp and then removing conditions involving the time stamp. CD3 predicts the class of the best matching rule.

Ensemble methods maintain a set of models and use a voting procedure to yield a global prediction. Blum's (1997) implementation of Weighted-Majority (Littlestone & Warmuth, 1994) uses as models histories of labels associated with pairs of features. If a model's features are present in an instance, then it predicts the most frequent label present in its history. The method initializes each model with a weight of 1 and reduces a model's weight if it predicts incorrectly. It predicts based on a weighted vote of the predictions of the models.

The Streaming Ensemble Algorithm (Street & Kim, 2001) maintains a fixed-size collection of models, each built from a fixed number of instances. When a new batch of instances arrives, SEA builds a new model. If space exists in the collection, then it adds the new model. Otherwise, it replaces the worst performing model with the new model, if one exists. SEA predicts the majority vote of the predictions of the models in the collection.

The Accuracy-Weighted Ensemble (Wang et al., 2003) also maintains a fixed-size collection of models, each built from a batch of instances. However, this method weights each classifier in the collection based on its performance on the most recent batch. When adding a new weighted model, if there is no space in the collection, then the method stores only the top weighted models. The method predicts based on a weighted-majority vote of the predictions of the models in the collection.

Dynamic Weighted Majority (Kolter & Maloof, 2003), or DWM, maintains a collection of weighted models but dynamically adds and removes models with changes in performance. Instead of building a single model with each batch, DWM uses new instances to refine all the models in the collection. Each time a model predicts incorrectly, DWM reduces its weight, and DWM removes a model from the collection if its weight falls below a threshold. Like the previous method, DWM predicts based on a weighted-majority vote of the predictions of the models, but if the global prediction
(i.e., the weighted-majority vote) is incorrect for an instance, then DWM adds a new model to the collection.
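To make the scheme concrete, here is a minimal Python sketch of a dynamic weighted-majority style ensemble along the lines described above. The expert factory, the parameter values, and the per-instance update schedule are illustrative assumptions; the published algorithm additionally normalizes the weights and applies these updates only every p instances.

```python
class DWM:
    """Minimal sketch of a dynamic weighted-majority style ensemble (simplified)."""

    def __init__(self, make_expert, beta=0.5, threshold=0.01):
        self.make_expert = make_expert          # factory returning a new incremental learner
        self.beta = beta                        # weight penalty for an expert that errs
        self.threshold = threshold              # experts below this weight are removed
        self.experts = [[make_expert(), 1.0]]   # list of [model, weight] pairs

    def predict(self, x):
        votes = {}
        for model, weight in self.experts:
            label = model.predict(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)        # weighted-majority vote

    def update(self, x, y):
        global_prediction = self.predict(x)
        for pair in self.experts:
            if pair[0].predict(x) != y:
                pair[1] *= self.beta            # demote experts that predicted incorrectly
        self.experts = [p for p in self.experts if p[1] >= self.threshold]
        if global_prediction != y or not self.experts:
            self.experts.append([self.make_expert(), 1.0])   # add a model on a global mistake
        for model, _ in self.experts:
            model.update(x, y)                  # every expert is refined with the new instance
```

Any base learner exposing update(x, y) and predict(x) can be plugged in as the expert; that interface is assumed here, not prescribed by the article.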
Evaluation

Evaluating systems for concept drift is a critical issue. Ideally, one wants to use a real-world data set, but finding such a data set in which concept drift is easy to identify is itself a challenge. After all, if it were easy to detect concept drift in large data sets, then the task of writing systems to cope with it would be trivial. Even if one establishes that drift is occurring in a data set, to conduct a proper evaluation, the phenomenon must produce a measurable effect in a method's performance. Moreover, the effect of concept drift on the method's performance must be greater than all other effects, such as that due to the variability from processing new examples of a target concept. As a result of these issues, the majority of evaluations have involved synthetic or artificial data sets.

A hallmark of synthetic data sets for concept drift is that they cycle through a series of target concepts. The first target concept persists for a period of time, then the second persists, and so on. For each time step of a period, one randomly generates a set of training examples, which the method uses to build or refine its models. The method evaluates the resulting model on the examples in a test set and computes a measure of performance, such as predictive accuracy (i.e., the percentage of the test examples the method predicted correctly). One generates test cases every time step or every time period. As the system processes examples, ideally, its performance on the test set improves at a certain rate and to a certain level of accuracy. Indeed, the slope and asymptote are critical characteristics of a method's performance. Crucially, each time the target concept changes, one generates a new set of testing examples. Naturally, when the method applies the model built for the previous target concept to the examples of the new concept, the method's performance will be poor. However, as the method processes training examples of the new target concept, as before, performance will improve with some slope and to some asymptote.

By far, the most widely used synthetic data set is the so-called STAGGER Concepts. Originally used by Schlimmer and Grainger (1986), it has been the centerpiece for many evaluations (e.g., Kolter & Maloof, 2003; Maloof & Michalski, 2000, 2004; Widmer, 1997; Widmer & Kubat, 1996). There are three attributes: size, color, and shape. Size can be small, medium, or large; color can be red, blue, or green; shape can be triangle, circle, or rectangle. There are three target concepts, and the presentation of examples lasts for 120 time steps.
The target concept for the first 40 time steps is [size = small] & [color = red]. For the next 40, it is [color = green] ∨ [shape = circle]. For the final 40, it is [size = medium ∨ large]. At each time step, one generates a single training example and 100 test cases of the target concept. Naturally, the method updates its model by using the training example and evaluates it by using the testing examples, calculating accuracy. One presentation of the STAGGER Concepts is not sufficient for a proper evaluation, so researchers conduct multiple runs, averaging accuracy at each time step.

Because the STAGGER Concepts is clearly a small problem with only 27 possible examples, researchers have recently proposed larger synthetic data sets involving concept drift (e.g., Hulten et al., 2001; Street & Kim, 2001; Wang et al., 2003). I do not have the space here to survey the details, similarities, and differences of these synthetic problems, but they are all based on the same ideas present in the STAGGER Concepts. For instance, researchers have used rotating (Hulten et al., 2001) and shifting (Street & Kim, 2001) hyperplanes as changing target concepts.

As noted previously, there have also been evaluations involving concept drift in real data sets. Blum (1997) used a calendar-scheduling task in which a user's preference for meetings changed over time. Lane and Brodley (1998) examined an intrusion-detection application, mining sequences of UNIX commands. Finally, Black and Hickey (2002) studied concept drift in the phone records of customers of British Telecom.
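As a concrete illustration of this protocol, the following minimal Python sketch generates the STAGGER stream just described and scores an arbitrary online learner on a fresh test set at every time step. The learner interface (update/predict) and the seeding are assumptions for illustration; a full evaluation averages the per-step accuracies over many runs.

```python
import random

SIZES = ["small", "medium", "large"]
COLORS = ["red", "blue", "green"]
SHAPES = ["triangle", "circle", "rectangle"]

# The three STAGGER target concepts, each active for 40 time steps.
TARGETS = [
    lambda size, color, shape: size == "small" and color == "red",
    lambda size, color, shape: color == "green" or shape == "circle",
    lambda size, color, shape: size in ("medium", "large"),
]

def random_example():
    return random.choice(SIZES), random.choice(COLORS), random.choice(SHAPES)

def stagger_stream(steps=120, n_test=100):
    """Yield one labelled training example and a fresh 100-example test set per step."""
    for t in range(steps):
        concept = TARGETS[t // 40]
        x = random_example()
        test = [(e, concept(*e)) for e in (random_example() for _ in range(n_test))]
        yield (x, concept(*x)), test

def evaluate(learner, seed=0):
    """Accuracy per time step for any learner exposing update(x, y) and predict(x)."""
    random.seed(seed)
    accuracies = []
    for (x, y), test in stagger_stream():
        learner.update(x, y)
        accuracies.append(sum(learner.predict(e) == label for e, label in test) / len(test))
    return accuracies
```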
FUTURE TRENDS Tracking concept drift is a rich problem that has led to a diverse set of approaches and evaluation methodologies, and there are many opportunities for further investigation. One is the development of better methods for evaluating systems that cope with evolutionary drift. I have already discussed the difficult nature of this problem: To evaluate how systems cope with drift, there must be a measurable effect. Slow or evolutionary drift may not produce an effect different enough from that of mining the sequence of instances. If noise is present, then it is even more difficult to distinguish among the variability due to instances, drift, and noise. Existing systems are probably capable of tracking slow concept drift, but we need evaluation methodologies for measuring how such drift affects performance. I have made the case for the importance of strong evaluations, and in this regard, researchers need to better place their work in context with past efforts. Presently, researchers often choose not to use data sets from past studies, and create new ones for their investigation. Because new studies typically consist of
new methods evaluated on new data sets, it is difficult to place new methods in context with previous work. As a consequence, it is presently impossible to truly understand the strengths and weaknesses of existing methods, both old and new. This is not to say that researchers should not create or introduce new data sets. Indeed, I have already mentioned that the STAGGER Concepts is a small problem, so creating a new, larger one is required if, say, researchers want to establish how methods scale to larger data streams. Nonetheless, by first evaluating new methods on existing problems, researchers will be able to make stronger conclusions about the performance of their method, and the community will be able to better understand the contribution of the method. Finally, we need to develop a systems theory of concept drift. Presently, we have little to guide the development of future methods or to guide the selection of a particular method for a new application. However, before developing such a theory, we will need to develop better methods for evolutionary concept drift, and we will need stronger evaluations that place new work in context with that of the past.
CONCLUSION

Tracking concept drift is important for many applications, from e-mail sorting to market-basket analysis. Concepts may change quickly or gradually; they may occur once or at regular intervals. The diversity and complexity of coping with concept drift has led researchers to propose an equally varied set of approaches. Some modify their models, some do so with partial memory of the past, and some rely on groups of models. Although there have been evaluations of these approaches involving real data sets, the majority have involved synthetic data sets, which give researchers great control in testing hypotheses. With stronger evaluations that place new systems in context with past work, we will be able to propose theories for systems that cope with concept drift, theories that we can test and that will lead to new systems for this critical problem for many applications.

REFERENCES

Black, M., & Hickey, R. (1999). Maintaining the performance of a learned classifier under concept drift. Intelligent Data Analysis, 3, 453-474.

Black, M., & Hickey, R. (2002). Classification of customer call data in the presence of concept drift and noise. In Lecture Notes in Computer Science: Vol. 2311. Software 2002: Computing in an imperfect world (pp. 74-87). New York: Springer.

Blum, A. (1997). Empirical support for Winnow and Weighted-Majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26, 5-23.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 97-106).

Klenner, M., & Hahn, U. (1994). Concept versioning: A methodology for tracking evolutionary concept drift in dynamic concept systems. Proceedings of the 11th European Conference on Artificial Intelligence (pp. 473-477), England.

Klinkenberg, R., & Joachims, T. (2000). Detecting concept drift with support vector machines. Proceedings of the 17th International Conference on Machine Learning (pp. 487-494), USA.

Kolter, J., & Maloof, M. (2003). Dynamic weighted majority: A new ensemble method for tracking concept drift. Proceedings of the Third IEEE International Conference on Data Mining (pp. 123-130), USA.

Kuh, A., Petsche, T., & Rivest, R. L. (1991). Learning time-varying concepts. In Advances in neural information processing systems 3 (pp. 183-189). San Francisco: Morgan Kaufmann.

Lane, T., & Brodley, C. (1998). Approaches to online learning and concept drift for user identification in computer security. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 259-263), USA.

Littlestone, N., & Warmuth, M. K. (1994). The Weighted Majority algorithm. Information and Computation, 108, 212-261.

Maloof, M. (2003). Incremental rule learning with partial instance memory for changing concepts. Proceedings of the International Joint Conference on Neural Networks (pp. 2764-2769), USA.

Maloof, M., & Michalski, R. (2000). Selecting examples for partial memory learning. Machine Learning, 41, 27-52.

Maloof, M., & Michalski, R. (2004). Incremental learning with partial instance memory. Artificial Intelligence, 154, 95-126.
Mesterharm, C. (2003). Tracking linear-threshold concepts with Winnow. Journal of Machine Learning Research, 4, 819-838.

Schlimmer, J., & Granger, R. (1986). Beyond incremental processing: Tracking concept drift. Proceedings of the Fifth National Conference on Artificial Intelligence (pp. 502-507), USA.

Street, W., & Kim, Y. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 377-382), USA.

Wang, H., Fan, W., Yu, P., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 226-235), USA.

Widmer, G. (1997). Tracking context changes through meta-learning. Machine Learning, 27, 259-286.

Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23, 69-101.

KEY TERMS

Class Label: A label identifying the concept or class of an instance.

Concept Drift: A phenomenon in which the class labels of instances change over time.

Data Set: A set of instances of target concepts.

Data Stream: A data set distributed over time.

Instance: A set of attributes, their values, and a class label. Also called example, record, or case.

Model: A representation of a data set or stream that one can use for prediction or for better understanding the data from which it was derived. Also called hypothesis or concept description. Models include probabilities, linear equations, decision trees, decision rules, and the like.

Synthetic Data Set: A set of artificial target concepts.

Target Concept: The true, typically unknown model underlying a data set or data stream.
Condensed Representations for Data Mining
Jean-Francois Boulicaut
INSA de Lyon, France
INTRODUCTION Condensed representations have been proposed in Mannila and Toivonen (1996) as a useful concept for the optimization of typical data-mining tasks. It appears as a key concept within the inductive database framework (Boulicaut et al., 1999; de Raedt, 2002; Imielinski & Mannila, 1996), and this article introduces this research domain, its achievements in the context of frequent itemset mining (FIM) from transactional data, and its future trends. Within the inductive database framework, knowledge discovery processes are considered as querying processes. Inductive databases (IDBs) contain not only data, but also patterns. In an IDB, ordinary queries can be used to access and manipulate data, while inductive queries can be used to generate (mine), manipulate, and apply patterns. To motivate the need for condensed representations, let us start from the simple model proposed in Mannila and Toivonen (1997). Many data-mining tasks can be abstracted into the computation of a theory. Given a language L of patterns (e.g., itemsets), a database instance r (e.g., a transactional database) and a selection predicate q, which specifies whether a given pattern is interesting or not (e.g., the itemset is frequent in r), a datamining task can be formalized as the computation of Th(L,q,r) = {φ ∈ L | q(φ,r) is true}. This also can be considered as the evaluation for the inductive query q. Notice that it specifies that every pattern that satisfies q has to be computed. This completeness assumption is quite common for local pattern discovery tasks but is generally not acceptable for more complex tasks (e.g., accuracy optimization for predictive model mining). The selection predicate q can be defined in terms of a Boolean expression over some primitive constraints (e.g., a minimal frequency constraint used in conjunction with a syntactic constraint, which enforces the presence or the absence of some subpatterns). Some of the primitive constraints generally refer to the behavior of a pattern in the data by using the so-called evaluation functions (e.g., frequency). To support the whole knowledge discovery process, it is important to support the computation of many different but correlated theories. It is well known that a generate-and-test approach that would enumerate the sentences of L and then test the selection predicate q is generally impossible. A huge
effort has been made by data-mining researchers to make an active use of the primitive constraints occurring in q to achieve a tractable evaluation of useful mining queries. It is the domain of constraint-based mining (e.g., the seminal paper) (Ng et al., 1998). In real applications, the computation of Th(L,q,r) can remain extremely expensive or even impossible, and the framework of condensed representations has been designed to cope with such a situation. The idea of ε-adequate representations was introduced in Mannila and Toivonen (1996) and Boulicaut and Bykowski (2000). Intuitively, they are alternative representations of the data that enable answering to a class of query (e.g., frequency queries for itemsets in transactional data) with a bounded precision. At a given precision ε, one can be interested in the smaller representations, which are then called concise or condensed representations. It means that a condensed representation for Th(L,q,r) is a collection C ⊂ Th(L,q,r) such that every pattern from Th(L,q,r) can be derived efficiently from C. In the database-mining context, where r might contain a huge volume of records, we assume that efficiently means without further access to the data. The following figure illustrates that we can compute Th(L,q,r) either directly (Arrow 1) or by means of a condensed representation (Arrow 2) followed by a regeneration phase (Arrow 3). We know several examples of condensed representations for which Phases 2 and 3 are much less expensive than Phase 1. We now introduce the background for understanding condensed representations in the well studied context of FIM.
Figure 1. Condensed representation C for Th(L,q,r): the data r, the condensed representation C, and the theory Th(L,q,r), linked by Arrows 1, 2, and 3 (direct computation, extraction of C, and regeneration, respectively)
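As a purely illustrative grounding of the theory computation formalized above, the following toy Python sketch evaluates Th(L,q,r) for a conjunction of a minimal-frequency constraint and a syntactic constraint by naive generate-and-test; the data, the threshold, and the constraint choice are assumptions, and, as noted above, such an enumeration is infeasible for realistic pattern languages.

```python
from itertools import combinations

# Toy transactional database r; each transaction is a set of items.
r = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
items = sorted(set().union(*r))

def frequency(itemset, db):
    return sum(itemset <= t for t in db) / len(db)

# Language L: all non-empty itemsets over the items (enumerable only for toy data).
L = [frozenset(c) for n in range(1, len(items) + 1) for c in combinations(items, n)]

# Selection predicate q: a minimal-frequency constraint in conjunction with a
# syntactic constraint (the pattern must contain item "a").
def q(phi, db, min_freq=0.6):
    return frequency(phi, db) >= min_freq and "a" in phi

# Th(L, q, r) = {phi in L | q(phi, r) is true}, evaluated by generate-and-test.
Th = [phi for phi in L if q(phi, r)]
```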
BACKGROUND

In many cases (e.g., itemsets, inclusion dependencies, sequential patterns) and for a given selection predicate or constraint, the search space L is structured by an anti-monotonic specialization relation, which provides a lattice structure. For instance, in transactional data, when L is the power set of items, and the selection predicate enforces a minimal frequency, set inclusion is such an anti-monotonic specialization relation. Anti-monotonicity means that when a sentence does not satisfy q (e.g., an itemset is not frequent), then none of its specializations can satisfy q (e.g., none of its supersets are frequent). It becomes possible to prune huge parts of the search space, which cannot contain interesting sentences. This has been studied a lot within the "learning as search" framework (Mitchell, 1982), and the generic level-wise algorithm from Mannila and Toivonen (1997) has inspired many algorithmic developments. It computes Th(L,q,r) level-wise in the lattice by considering first the most general sentences (e.g., the singletons in the FIM problem). Then, it alternates candidate evaluation (e.g., frequency counting) and candidate generation (e.g., building larger itemsets from discovered frequent itemsets) phases. The algorithm stops when it cannot generate new candidates or, in other terms, when the most specific sentences have been found (e.g., all the maximal frequent itemsets). This collection of the most specific sentences is called a positive border in Mannila and Toivonen (1997), and it corresponds to the S set of a Version Space in Mitchell's terminology. The Apriori algorithm (Agrawal et al., 1996) is clearly the most famous instance of this level-wise algorithm.

The dual property of monotonicity is interesting, as well. A selection predicate is monotonic when its negation is anti-monotonic (i.e., when a sentence satisfies it, all its specializations satisfy it, as well). In the itemset pattern domain, the maximal frequency constraint or a syntactic constraint that enforces that a given item belongs to the itemsets are two monotonic constraints. Thanks to the duality of these definitions, a monotonic constraint gives rise to a border G, which contains the minimally general sentences with respect to the monotonic constraint (see Figure 2).

Figure 2. Borders S and G in the search space ordered by the specialization relation

MAIN RESULTS

We emphasize the main results concerning condensed representations for frequent itemsets, since this is the context in which they have been studied the most.

Condensed Representations by Borders
When the selection predicate is a conjunction of an anti-monotonic part and a monotonic part, the two borders define the solution set: solutions are between S and G, and (S,G) is a version space. For this conjunction case, several algorithms can be used (Bucila et al., 2002; Bonchi, 2003; de Raedt & Kramer, 2001; Jeudy & Boulicaut, 2002). When arbitrary Boolean combinations of anti-monotonic and monotonic constraints are used (e.g., disjunctions), the solution space is defined as a union of several version spaces (i.e., unions of pairs of borders) (de Raedt et al., 2002).

Borders appear as a typical case of condensed representation. Assume that the collection of the maximal frequent itemsets in r is available (i.e., the S border for the minimal frequency constraint); this collection is generally several orders of magnitude smaller than the complete collection of the frequent itemsets in r, while all of them can be generated from S without any access to the data. However, in most applications of pattern discovery tasks, the user not only wants to get the interesting patterns, but also wants the results of some evaluation functions about these patterns. This is obvious for the FIM problem; these patterns are generally exploited in a postprocessing step to derive more useful statements about the data (e.g., the popular frequent association rules that have a high enough confidence) (Agrawal et al., 1996). This can be done efficiently if we compute not only the collection of frequent itemsets but also their frequencies. In fact, the semantics of an inductive query are better captured by extended theories, that is, collections like {(φ,e) ∈ L × E | q(φ,r) is true and e = ζ(φ,r)}, where e is the result of an evaluation function ζ in r with values in E. In our FIM problem, ζ denotes the frequency (e is a number in [0,1]) of an itemset in a transactional database r. The challenge of designing condensed representations for an extended theory ThE is then to identify subsets of ThE from which it is possible to generate ThE either exactly or with an approximation on the evaluation functions.
It makes sense to use borders as condensed representations. For FIM, specific algorithms have been designed for computing the S border directly (Bayardo, 1998). Also, the algorithm in de Raedt and Kramer (2001) computes
borders S and G and has been applied successfully to feature extraction in the domain of molecular fragment finding. In this case, a conjunction of a minimal frequency in one set of molecules (e.g., the active ones) and a maximal frequency in another set of molecules (e.g., the inactive ones) is used. This kind of research is related to the so-called emerging pattern discovery (Dong & Li, 1999).

Considering the extended theory for frequent itemsets, it is clear that, given the maximal frequent sets and their frequencies, we have an approximate condensed representation of the frequent itemsets. Without looking at the data, we can regenerate the whole collection of the frequent itemsets (subsets of the maximal ones), and we have a bounded error on their frequencies: when considering a subset of a maximal σ-frequent itemset, we know that its frequency is in [σ,1]. Even though more precise bounds can be computed, this approximation is useless in practice. Indeed, when using borders, users have other applications in mind (e.g., feature construction). The maximal frequent itemsets can be computed in cases where very large frequent itemsets hold such that the regeneration process becomes impossible. Typically, when a maximal frequent itemset has size 30, it should lead to the regeneration of around 2^30 (roughly 10^9) frequent sets.
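To ground the border-based representation, here is a minimal Python sketch that mines frequent itemsets level-wise and then keeps only the S border of maximal frequent itemsets. The toy database and threshold are illustrative assumptions; this is not an implementation of the algorithms cited above.

```python
from itertools import combinations

# Hypothetical transactional database r and frequency threshold.
r = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
MIN_FREQ = 0.6

def freq(itemset, db):
    return sum(itemset <= t for t in db) / len(db)

def levelwise_frequent(db, min_freq):
    """Level-wise search: evaluate candidates of size k, then generate candidates of
    size k+1 from the frequent k-itemsets (anti-monotonicity prunes the rest)."""
    items = sorted(set().union(*db))
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        level = [c for c in level if freq(c, db) >= min_freq]       # candidate evaluation
        frequent.update({c: freq(c, db) for c in level})
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates                               # candidate generation
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))]
    return frequent

frequent = levelwise_frequent(r, MIN_FREQ)

# Border S: the maximal frequent itemsets; every frequent itemset is a subset of some
# element of S, so S alone is a much smaller, frequency-free condensed representation.
border_S = [x for x in frequent if not any(x < y for y in frequent)]
```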
Exact Condensed Representations of Frequent Sets and Their Frequencies

Since Boulicaut and Bykowski (2000), a lot of attention has been put on exact condensed representations based on frequent closed sets. According to the classical framework of Galois connection, the closure of an itemset X in r, closure(X,r), is the maximal superset of X that has the same frequency as X in r. Furthermore, a set X is closed if X = closure(X,r). Interestingly, the sets of itemsets that have the same closures constitute equivalence classes of itemsets that have the same frequencies, the maximal one in each equivalence class being a closed set (Bastide et al., 2000). Frequent closed sets are itemsets that are both closed and frequent. In dense and/or highly correlated data, we can have orders of magnitude fewer frequent closed sets than frequent itemsets. In other terms, it makes sense to materialize and, when possible, to look for the fastest computations of the frequent closed sets only. It is then easy to derive the frequencies of every frequent itemset without any access to the data. Many algorithms have been designed for computing frequent closed itemsets (Pasquier et al., 1999; Zaki, 2002). Empirical evaluations of many algorithms for the FIM problem are reported in Goethals and Zaki (2004). Frequent closed set mining is a real breakthrough with respect to the computational complexity of FIM in difficult contexts. A specificity of frequent closed set mining algorithms is the need for a characterization of (frequent) closed set generators. Interestingly, these generators constitute condensed representations, as well.

An important characterization is the one of free sets (Boulicaut et al., 2000), which has been proposed independently under the name key patterns (Bastide et al., 2000). By definition, the closures of (frequent) free sets are (frequent) closed sets. Given the equivalence classes we quoted earlier, free sets are their minimal elements, and the freeness property can lead to efficient pruning, thanks to its anti-monotonicity. By computing the frequent free sets plus an extra collection of some non-free itemsets (part of the so-called negative border of the frequent free sets), it is possible to regenerate the whole collection of the frequent itemsets and their frequencies (Boulicaut et al., 2000; Boulicaut et al., 2003). On one hand, we often have many more frequent free sets than frequent closed sets but, on the other hand, they are smaller. The concept of freeness has been generalized for other exact condensed representations like the disjunct-free itemsets (Bykowski & Rigotti, 2001), the non-derivable itemsets (Calders & Goethals, 2002), and the minimal k-free representations of frequent sets (Calders & Goethals, 2003). Regeneration algorithms and translations between condensed representations have been studied, as well (Kryszkiewicz et al., 2004).
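The following minimal sketch illustrates closures, frequent closed sets, free sets, and regeneration of frequencies without access to the data. The toy database, the threshold, and the brute-force enumeration are assumptions for clarity; they are not the specialized algorithms cited above.

```python
from itertools import combinations

# Hypothetical transactional database and threshold, for illustration only.
r = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
MIN_FREQ = 0.4

def freq(itemset, db):
    return sum(itemset <= t for t in db) / len(db)

def closure(itemset, db):
    """closure(X, r): the maximal superset of X with the same frequency, obtained as
    the intersection of all transactions that contain X."""
    supporting = [t for t in db if itemset <= t]
    out = set(supporting[0])
    for t in supporting[1:]:
        out &= t
    return frozenset(out)

items = sorted(set().union(*r))
all_itemsets = [frozenset(c) for n in range(1, len(items) + 1)
                for c in combinations(items, n)]
frequent = {x: freq(x, r) for x in all_itemsets if freq(x, r) >= MIN_FREQ}

# Frequent closed sets: X is closed iff X = closure(X, r).
closed = {x: f for x, f in frequent.items() if x == closure(x, r)}

# Frequent free sets (key patterns): no proper subset has the same frequency.
free = {x: f for x, f in frequent.items()
        if all(freq(x - {i}, r) != f for i in x)}

# Regeneration without data access: freq(X) equals the largest frequency among the
# closed sets that contain X, namely the frequency of closure(X, r).
def regenerated_freq(x):
    return max(f for c, f in closed.items() if x <= c)
```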
Approximate Condensed Representations for Frequent Sets and Their Frequencies
Looking for approximate condensed representations can be useful for very hard datasets. We mentioned that borders can be considered as approximate condensed representations but with a poor approximation of the needed frequencies. The idea is to be able to compute the frequent itemsets with less counting at the price of an acceptable approximation on their frequencies. One idea is to compute the frequent itemsets and their frequencies on a sample of the original data (Mannila & Toivonen, 1996), but bounding the error on the frequencies is hard. In Boulicaut et al. (2000), the concept of δ-free set was introduced. In fact, the free itemsets are a special case when δ = 0. When δ > 0, we have less δ-free sets than free sets, and, thus, the representation is more condensed. Interestingly, experimentations on real datasets have shown that the error in practice was much smaller than the theoretical bound (Boulicaut et al., 2000; Boulicaut et al., 2003). Another family of approximate condensed representations has been designed in Pei, et al. (2002), where the idea is to consider the maximal frequent itemsets for different frequency thresholds and then approximate the frequency of any frequent set by means of the computed frequencies.
Multiple Uses of Condensed Representations

Another interesting achievement is that condensed representations of frequent itemsets are not only useful for FIM in difficult cases, but they can also be used to derive more meaningful patterns. In other terms, instead of a regeneration phase, which can be impossible due to the size of the collection of frequent itemsets, it is possible to use the condensed representations directly. Indeed, closed sets can be used to derive informative or non-redundant association rules. Also, frequent and valid association rules with a minimal left-hand side can be derived from δ-free sets, and these rules can be used, among others, for association-based classification. It is also possible to derive formal concepts in Wille's terminology and, thus, to apply the so-called formal concept analysis.
FUTURE TRENDS

Two promising directions of research are being considered and should provide new breakthroughs. First, it is useful to consider new mining tasks and design condensed representations for them. For instance, Casali et al. (2003) consider the computation of aggregates from data cubes, and Yan et al. (2003) address sequential pattern mining. An interesting open problem is to use condensed representations during model mining (e.g., classifiers). Then, merging condensed representations with constraint-based mining (with various constraints, including constraints that are neither anti-monotonic nor monotonic) seems to be a major issue. One challenging problem is to decide which kind of condensed representation has to be materialized for optimizing sequences of inductive queries (e.g., for interactive association rule mining), that is, the real context of knowledge discovery processes.
CONCLUSION

We introduced the key concept of condensed representations for the optimization of data-mining queries and, thus, for the development of the inductive database framework. We have summarized an up-to-date view on borders. Then, referencing the work on the various condensed representations for frequent itemsets, we have pointed out the breakthrough in the computational complexity of the FIM tasks. This is important, because the applications of frequent itemsets go much further than the classical association rule-mining task. Finally, the concepts of condensed representations and ε-adequate representations are quite general and might be considered with success for many other pattern domains.
REFERENCES

Agrawal, R. et al. (1996). Fast discovery of association rules. In Advances in knowledge discovery and data mining (pp. 307-328). MIT Press.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2), 66-75.

Bayardo, R. (1998). Efficiently mining long patterns from databases. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Bonchi, F. (2003). Frequent pattern queries: Language and optimisations [doctoral thesis]. Pisa, Italy: University of Pisa.

Boulicaut, J.-F., & Bykowski, A. (2000). Frequent closures as a concise representation for binary data mining. Proceedings of the Knowledge Discovery and Data Mining, Current Issues and New Applications, Kyoto, Japan.

Boulicaut, J.-F., Bykowski, A., & Rigotti, C. (2000). Approximation of frequency queries by means of free-sets. Proceedings of Principles of Data Mining and Knowledge Discovery, Lyon, France.

Boulicaut, J.-F., Bykowski, A., & Rigotti, C. (2003). Free-sets: A condensed representation of Boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1), 5-22.

Boulicaut, J.-F., Klemettinen, M., & Mannila, H. (1999). Modeling KDD processes within the inductive database framework. Proceedings of the Data Warehousing and Knowledge Discovery, Florence, Italy.

Bucila, C., Gehrke, J., Kifer, D., & White, W.M. (2003). DualMiner: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(3), 241-272.

Bykowski, A., & Rigotti, C. (2001). A condensed representation to find frequent patterns. Proceedings of the ACM Principles of Database Systems, Santa Barbara, California.

Calders, T., & Goethals, B. (2002). Mining all non-derivable frequent itemsets. Proceedings of the Principles of Data Mining and Knowledge Discovery, Helsinki, Finland.

Calders, T., & Goethals, B. (2003). Minimal k-free representations of frequent sets. Proceedings of the Principles of Data Mining and Knowledge Discovery, Dubrovnik, Croatia.
Casali, A., Cicchetti, R., & Lakhal, L. (2003). Cube lattices: A framework for multidimensional data mining. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

de Raedt, L. (2002). A perspective on inductive databases. SIGKDD Explorations, 4(2), 66-77.

de Raedt, L., Jäger, M., Lee, S.D., & Mannila, H. (2002). A theory of inductive query answering. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

de Raedt, L., & Kramer, S. (2001). The levelwise version space algorithm and its application to molecular fragment finding. Proceedings of the International Joint Conference on Artificial Intelligence, Seattle, Washington.

Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, California.

Goethals, B., & Zaki, M.J. (2004). Advances in frequent itemset mining implementations. SIGKDD Explorations, 6(1), 109-117.

Imielinski, T., & Mannila, H. (1996). A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.

Jeudy, B., & Boulicaut, J.-F. (2002). Optimization of association rule mining queries. Intelligent Data Analysis, 6(4), 341-357.

Kryszkiewicz, M., Rybinski, H., & Gajek, M. (2004). Dataless transitions between concise representations of frequent patterns. Intelligent Information Systems, 22(1), 41-70.

Mannila, H., & Toivonen, H. (1996). Multiple uses of frequent sets and condensed representations. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, Oregon.

Mannila, H., & Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3), 241-258.

Mitchell, T.M. (1982). Generalization as search. Artificial Intelligence, 18, 203-226.

Ng, R., Lakshmanan, L.V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained association rules. Proceedings of the International Conference on Management of Data, Seattle, Washington.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46.

Pei, J., Dong, G., Zou, W., & Han, J. (2002). On computing condensed frequent pattern bases. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California.

Zaki, M.J., & Hsiao, C.J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the SIAM International Conference on Data Mining, Arlington, Texas.
KEY TERMS

Condensed Representations: Alternative representations of the data that preserve crucial information for being able to answer some kind of queries. The most studied example concerns frequent sets and their frequencies. Their condensed representations can be several orders of magnitude smaller than the collection of the frequent itemsets.

Constraint-Based Data Mining: Concerns the active use of constraints that specify the interestingness of patterns. Technically, it needs strategies to push the constraints, or at least part of them, deeply into the data-mining algorithms.

Inductive Databases: An emerging research domain, where knowledge discovery processes are considered as querying processes. Inductive databases contain both data and patterns, or models, which hold in the data. They are queried by means of more or less ad-hoc query languages.

Pattern Domains: A pattern domain is the definition of a language of patterns, a collection of evaluation functions that provide properties of patterns in database instances, and the kinds of constraints that can be used to specify pattern interestingness.
Content-Based Image Retrieval
Timo R. Bretschneider
Nanyang Technological University, Singapore
Odej Kao
University of Paderborn, Germany
INTRODUCTION
Sensing and processing multimedia information is one of the basic traits of human beings: The audiovisual system registers and transports surrounding images and sounds. This complex recording system, complemented by the senses of touch, taste, and smell, enables perception and provides humans with data for analysing and interpreting the environment. Imitating this perception and simulating the processing was and still is one of the major leitmotifs of multimedia technology developments. The goal is to find a representation for every type of knowledge, which makes the reception and processing of information as easy as possible. The need to process given information, deliver it, and explain it to a certain audience exists in nearly all areas of day-to-day life: commerce, science, education, and entertainment (Smeulders, Worring, Santini, Gupta, & Jain, 2000). The development of digital technologies and applications allowed the production of huge amounts of multimedia data. This information has to be systematically collected, registered, organised, and classified. Furthermore, search procedures, methods to formulate queries, and ways to visualise the results have to be provided. In early years, this task was tended to by existing database management systems (DBMS) with multimedia extensions. The basis for representing and modelling multimedia data is so-called binary large objects, which store images, video, and audio sequences without any formatting and analysis done by the system. Often, however, only a reference to the object is handled within the DBMS. For the utilisation of the stored multimedia data, user-defined functions (e.g., content analysis) access the actual data and integrate their results in the existing database. Hence, content-based retrieval becomes possible. A survey of existing retrieval systems was presented, for example, by Naphade and Huang (2002).

This article provides an overview of the complex relations and interactions among the different aspects of a content-based retrieval system, whereby the scope is purposely limited to images. The main issues of data description, similarity expression, and access are addressed and illustrated for an actual system.

BACKGROUND
The concept of content-based retrieval is datacentric per se; that is, the design of a system has to reflect the characteristics of the data. Hence, neither an optimal solution that can span all kinds of multimedia data exists, nor is addressing the variety of data characteristics within one type even possible. However, there are parallels that lay the foundation, which then require tailor-made adaptation and specialisation. This section provides the general groundwork by pointing out the different types of the so-called metainformation, which describes the raw data:

• Technical information refers to the details of the recording, conversion, and saving process (i.e., format and name of the stored media).

• Extracted attributes are those that have been deduced by analysing the media content. They are usually called features and emphasise a certain aspect of the media. Simple features describe, for instance, statistical values of the contained information, while complex features and their weighted combinations attempt to describe the entire media content.

• Knowledge-based information links the objects, people, scenarios, and so forth, detected in the media to entities in the real world.

• World-oriented information encompasses information on the producer of the media, the date, location, and so forth. Manually added keywords belong especially to this group, which makes a primitive description and characterisation of the content possible.
As can be seen by this classification, technical and world-oriented information can be modelled straightforwardly in traditional database structures. Organising and searching can be done by using existing database functions. The utilisation of the extracted attributes and knowledge-based information is more complex in nature. Although most of the currently available DBMSs can be extended with multimedia add-ins, in many cases these are not sufficient, because they cannot describe the stored data to the required degree of retrieval accuracy.
However, only these two latter types of metainformation lift the system to an abstract level that allows the full exploitation of the content. For an in-depth overview of content-based image retrieval techniques and systems, refer to Deb and Zhang (2004), Kalipsiz (2000), Smeulders et al. (2000), Vasconcelos and Kunt (2001), and Xiang and Huang (2000).
MAIN THRUST

The goal of multimedia retrieval is the selection of one or more images whose metainformation meets certain requirements or is similar to a given sample media instance. Searching the metainformation is usually based on a full-text search among the assigned keywords. Furthermore, content references, such as colour distributions in an image, or more complex information, such as wavelet coefficients, can be used. To solve the issue of having the desired search characteristic in the first place, most systems prefer to use a query with an example media item. The system uses this media item as a starting point for the search and processes it in the same manner as the other media objects were processed when they were inserted in the database. The content is then analysed with the selected procedures, and the media is mapped to a vector consisting of (semi-)automatically extracted features. Hereafter, the raw data is only needed for display purposes, and all further processing focuses on analysing and comparing the representative vectors. The result of this comparison is a similarity ranking. The following interfaces can be used to specify a query in a multimedia database:

• Browsing: Beginning with a predefined data set, the user can navigate in any desired direction by using a browser until a suitable media sample is found. This approach is often used when no suitable starting media is available.

• Search with keywords: Technical and world-oriented data are represented by alphanumerical fields. These can be searched for a given keyword. Choosing these keywords is extraordinarily difficult for abstract structures such as textures, partially due to the subjectivity of the human perception.

• Similarity search: The similarity search is based on comparing features extracted from the raw data. Most of these features do not exhibit immediate references to the image, making them highly abstract for users without special knowledge (Assfalg, Del Bimbo, & Pala, 2002; Brunelli & Mich, 2000). Depending on the availability and characteristic of the query medium, one differentiates between query by pictorial example, query by painting, selection from standards, and image montage.
All approaches have their individual advantages as well as disadvantages, and a suitable selection depends on the domain. For example, a fingerprint database is best realised by using the query-by-pictorial-example technique, but selection from standards is a suitable candidate for a comic strip database with a limited number of characters. However, the similarity search, in particular the query-by-pictorial-example approach, is one of the most powerful methods because it provides the greatest degree of flexibility. Thus, it determines the focus hereafter.

Many different methods for feature extraction were developed and can be classified by various criteria. Based on the point in time at which the features are extracted, a priori and dynamically extracted features are distinguished. Although the first group is extracted during insertion of the corresponding media object in the database, the latter kind is generated at query time. The advantage of dynamic feature extraction is that the user can define relevant elements in the sample image, so that the remaining parts of the query image do not distract from the actual search objective. Note that both approaches can be combined. Regardless of the chosen approach, the actual features have to be extracted from the considered data. Examples for this step are histogram-based methods, calculation of statistical colour information (Mojsilovic, Hu, & Soljanin, 2002), contour descriptors (Berretti, Del Bimbo, & Pala, 2000), texture analysis (Gevers, 2002), and wavelet coefficient selection (Albuz, Kocalar, & Khokhar, 2001). The gained information, possibly from different algorithms, is combined in a so-called feature vector that is, by orders of magnitude, smaller than the raw data. This reduction in volume enables not only a suitable handling within the DBMS but also a higher level of abstraction. Therefore, it can often be utilised directly by semantic-based approaches (Djeraba, 2003; Fan, Luo, & Elmagarmid, 2004; Lu, Zhang, Liu, & Hu, 2003) and data-mining techniques (Datcu, Daschiel, & Pelizzari, 2003; Li & Narayanan, 2004).

The similarity of two multimedia objects in the content-based retrieval process is determined by comparing the representing feature vectors. Over the years, a large variety of metrics and similarity functions was developed for this purpose, whereby the best-known methods compute a multidimensional distance between the vectors: The smaller the distance, the higher the similarity of the corresponding media objects. Through the introduction of weights for the individual positions within the feature vectors, it is possible to emphasise and/or suppress desirable and undesirable query characteristics, respectively. In particular, the approach can help to particularise the query in iterative retrieval systems; that is, if the users select suitable and unsuitable retrievals, which are used by the system for adaptation in the next iteration (Jing, Li,
Zhang, H.-J., & Zhang, B., 2004). However, the application of distance-based similarity measures is not undisputed, because the meaning of distance in a multidimensional vector space spanned by arbitrary features can be ambiguous. Proposed alternatives to the general problem include angular measures or the incorporation of the data statistics in the actual distance measure. Nevertheless, the application of even the simplest measures can be successful for all intents and purposes if they are tailor-made with respect to the actual database. Still, if this requirement is not met, the retrieval can nevertheless be sufficient as long as a sufficient number of matches for all the kinds of possible queries exists.

Hitherto, the discussion on the evaluation of similarity among different feature vectors was limited to fairly simple structures; that is, the order and length of the vectors were clearly defined. However, these assumptions are insufficient when dealing with more complex feature extraction algorithms. An example is given by algorithms that result in an arbitrary number of features solely depending on the content of the multimedia object. Thus, feature vectors of different lengths have to be compared. To circumvent this problem, most of the current systems limit the dimensionality to a fixed extent, either by the choice of extraction algorithms or through a reducing postprocessing step. Instances of the latter approach are the principal component analysis (PCA) and the vector quantisation (VQ). The retrieval results are acceptable, but in systems with a high insertion rate the overall system performance suffers tremendously, either by constantly repeating the PCA or by dynamically optimising the codebook for the VQ, because a permanent adaptation is required. Meanwhile, a variety of alternative measures that can handle more complex feature vectors was developed, and one of the most prominent cases is the Earth Mover's Distance (EMD) (Rubner, Tomasi, & Guibas, 2000).

Selected features of an object, file, or other data structures are stored in indices, offering accelerated access. This fact implies that the construction and maintenance method of an index is of utmost importance for database efficiency. Data structures employed to support such queries are called multidimensional index structures. Well-known examples are the k-d-trees, grid files, R/R*-trees, SS/SR-trees, VP-trees, and VA files. A general overview of the context of multimedia retrieval can be found in Lu (2002).
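As a minimal illustration of the query-by-pictorial-example pipeline described above, the following Python sketch extracts a simple per-channel colour histogram as the feature vector and ranks stored images by a weighted Euclidean distance. The feature choice, the distance, the array conventions, and all names are illustrative assumptions, not the design of any cited system.

```python
import numpy as np

def colour_histogram(image, bins=16):
    """Concatenated, normalised per-channel histograms of an RGB array in [0, 255]."""
    feats = [np.histogram(image[..., ch], bins=bins, range=(0, 256))[0]
             for ch in range(3)]
    v = np.concatenate(feats).astype(float)
    return v / v.sum()

def weighted_distance(a, b, weights=None):
    """Euclidean distance with optional per-position weights to emphasise or
    suppress individual feature-vector components."""
    w = np.ones_like(a) if weights is None else weights
    return np.sqrt(np.sum(w * (a - b) ** 2))

def retrieve(query_image, database_features, weights=None, k=5):
    """Rank the stored feature vectors by similarity to the query image's features."""
    q = colour_histogram(query_image)
    ranked = sorted(database_features.items(),
                    key=lambda kv: weighted_distance(q, kv[1], weights))
    return ranked[:k]

# At insertion time, features are extracted once per image and stored, e.g.:
# database_features = {image_id: colour_histogram(load(image_id)) for image_id in ids}
```

In an iterative system, the weights passed to retrieve would be adapted from the user's relevance feedback between query rounds.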
FUTURE TRENDS

Future trends in image retrieval are manifold, covering areas such as detailed search for image objects via consideration of semantic information from identified entities in the image, as well as the extension of the retrieval methods
to special applications such as remote sensing. A sample application of the latter is given as an example in order to clarify the necessary adaptations. The developments and applications of space-borne remote sensing platforms result in the production of huge amounts of image data. In particular, the step towards higher spatial and spectral resolution increases the obtained amount of data by orders of magnitude. To enable maintenance as well as retrieval, a large number of operational DBMS for remotely sensed imagery is available (Bretschneider, Cavet, & Kao, 2002; Datcu et al., 2003). However, due to the approach to base the queries on world-oriented descriptions (e.g., location, scanner, and date), the systems are not always suitable for users with little expertise in remote sensing. Furthermore, content-related inquiries are not feasible in situations like the following scenario: A certain region reveals symptoms of saltification and a corresponding satellite image of the area was purchased. It is of interest to find other regions which suffered from the same phenomenon and which were successfully recovered or preserved, respectively. Thus the applied strategies in these regions can help to develop effective counteractions for the specific case under investigation. Generic content-based retrieval systems as introduced earlier in this paper are not suitable, because their extracted features do not consider the special characteristic of the satellite imagery; that is, for these systems, all remotely sensed images exhibit a high degree of similarity that prohibits an appropriate differentiation. The Grid Retrieval System for Remotely Sensed Imagery, G(RS)2I, is a tailor-made retrieval system that consists of highly specialised feature extraction modules, corresponding similarity evaluation, and an adapted indexing technique under the umbrella of a web-accessible DBMS. Most of the image content in terms of remote sensing is contained in the spectral characteristic of an obtained scene, whereby in contrast to generic images, the number of “colour bands” can vary between a few and several hundred. For the description of such data, a ground cover classification approach is most suitable, because it is not only an abstract description of the content but also is easily linkable with the human understanding of observed features, for example, water surfaces, forests, and urban areas. For data that consists of multiple bands, this approach leads to a highly precise retrieval. Secondly, the spatial arrangement of the detected regions, as well as their textural composition, are retrieved as features. Last but not least, highly specialised extraction techniques ana-lyse the data for specific features such as airports, rivers, and road networks. Due to the large
spatial coverage of a satellite scene (covering hundreds of kilometres and therefore containing highly varying landscapes), the extraction of several feature vectors at different positions within a scene is required, because a global descriptor provides only insufficient accuracy. To solve this issue, the G(RS)2I uses feature functions that describe the underlying data; that is, the multidimensional feature vectors are approximated by a hypersurface, which is modelled by radial basis functions (Bretschneider & Li, 2003). These analytically described surfaces enable powerful access approaches to the underlying data. Hence, the search for the best match of an extracted feature vector from a query image becomes an optimisation problem that is easily solvable, due to the explicit existence of the first derivative of the feature function. For the measurement of similarity, the G(RS)2I uses a modified version of the EMD (Li & Bretschneider, 2003) throughout the entire system, because the amount of content in a satellite scene, and therefore the length of the feature vector, is generally not predictable. This concept extends to the indexing of the data, which is based on a VP-tree, and to the realisation of the iterative search engine. With respect to the latter aspect, the problem in content-based retrieval is that one feature vector often cannot describe the desired search characteristic precisely enough. Instead of merely adapting weights obtained through the user's feedback regarding the relevance of the previously retrieved data, the G(RS)2I is not limited to moving a single query point in the search space. The approach is to actually fuse the information content from the positively rated feature vectors by analysing the corresponding EMD flow matrix (Li & Bretschneider, 2003).
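The following Python sketch illustrates the general form of the Earth Mover's Distance used above for signatures of different lengths, solved as a small transportation problem with SciPy's linear-programming routine. This is the generic EMD in the sense of Rubner et al. (2000), not the modified version of Li and Bretschneider (2003), and the random signatures are placeholders for real feature data.

```python
import numpy as np
from scipy.optimize import linprog

def earth_movers_distance(sig1, w1, sig2, w2):
    """Generic EMD between two signatures of possibly different lengths.
    sig1: (m, d) cluster centres with weights w1 (m,); sig2: (n, d) with w2 (n,)."""
    m, n = len(w1), len(w2)
    # ground distances between every pair of cluster centres
    cost = np.linalg.norm(sig1[:, None, :] - sig2[None, :, :], axis=2).ravel()
    # flow constraints: row i ships at most w1[i], column j receives at most w2[j]
    A_ub, b_ub = [], []
    for i in range(m):
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(w1[i])
    for j in range(n):
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(w2[j])
    # total flow equals the smaller total weight
    A_eq = np.ones((1, m * n)); b_eq = [min(w1.sum(), w2.sum())]
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun / b_eq[0]            # normalise by the total flow

# two feature signatures of different lengths
rng = np.random.default_rng(1)
s1, s2 = rng.random((4, 3)), rng.random((6, 3))
w1, w2 = np.full(4, 0.25), np.full(6, 1 / 6)
print(earth_movers_distance(s1, w1, s2, w2))
```

The linear-programming formulation is chosen here for transparency; production systems typically use specialised EMD solvers for speed.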
CONCLUSION Content-based retrieval is beneficial for most types of data because it resembles the human approach of accessing the respective medium. In particular, this idea holds true for data that mankind can directly process. The actual conceptual and technical realisation of this natural process is a major challenge, because our knowledge of how this is accomplished is fairly limited.
REFERENCES Albuz, E., Kocalar, E., & Khokhar, A. A. (2001). Scalable color image indexing and retrieval using vector wavelets. IEEE Transactions on Knowledge and Data Engineering, 13(5), 851-861.
Assfalg, J., Del Bimbo, A., & Pala, P. (2002). Threedimensional interfaces for querying by example in content-based image retrieval. IEEE Transactions on Visualization and Computer Graphics, 8(4), 305-318. Berretti, S., Del Bimbo, A., & Pala, P. (2000). Retrieval by shape similarity with perceptual distance and effective indexing. IEEE Transactions on Multimedia, 2(4), 225239. Bretschneider, T., Cavet, R., & Kao, O. (2002). Retrieval of remotely sensed imagery using spectral information content. Proceedings of the Geoscience and Remote Sensing Symposium, 4 (pp. 2253-2256). Bretschneider, T., & Li, Y. (2003). On the problems of locally defined content vectors in image databases for large images. Proceedings of the Pacific-Rim Conference on Multimedia, 3 (pp. 1604-1608). Brunelli, R., & Mich, O. (2000). Image retrieval by examples. IEEE Transactions on Multimedia, 2(3), 164-171. Datcu, M., Daschiel, H., & Pelizzari, A. (2003). Information mining in remote sensing image archives: System concepts. IEEE Transactions on Geoscience and Remote Sensing, 41(12), 2923-2936. Deb, S., & Zhang, Y. (2004). An overview of contentbased image retrieval techniques. Proceedings of the International Conference on Advanced Information Networking and Applications, 1 (pp. 59-64). Djeraba, C. (2003). Association and content-based retrieval. IEEE Transactions on Knowledge and Data Engineering, 15(1), 118-135. Fan, J., Luo, H., & Elmagarmid, A. K. (2004). Conceptoriented indexing of video databases: Toward semantic sensitive retrieval and browsing. IEEE Transactions on Image Processing, 13(7), 974-992. Gevers, T. (2002). Image segmentation and similarity of color-texture objects. IEEE Transactions on Multimedia, 4(4), 509-516. Jing, F., Li, M., Zhang, H.-J., & Zhang, B. (2004). Relevance feedback in region-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 14(5), 672-681. Kalipsiz, O. (2000). Multimedia databases. Proceedings of the IEEE International Conference on Information Visualization (pp. 111-115). Li, J., & Narayanan, R. M. (2004). Integrated spectral and spatial information mining in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing, 42(3), 673-685. 215
Li, Y., & Bretschneider, T. (2003). Supervised contentbased satellite image retrieval using piecewise defined signature similarities. Proceedings of the Geoscience and Remote Sensing Sympo-sium, 2 (pp. 734-736).
Xiang, S. Z., & Huang, T. S. (2000). Image retrieval: Feature primitives, feature representation, and relevance feedback. Proceedings of the IEEE Workshop on Contentbased Access of Image and Video Libraries (pp. 10-14).
Lu, G.-J. (2002). Techniques and data structures for efficient multimedia retrieval based on similarity. IEEE Transactions on Multimedia, 4(3), 372-384.
KEY TERMS
Lu, Y., Zhang, H., Liu, W., & Hu, C. (2003). Joint semantics and feature based image retrieval using relevance feedback. IEEE Transactions on Multimedia, 5(3), 339-347.
Content-Based Retrieval: The search for suitable objects in a database based on the content; often used to retrieve multimedia data.
Mojsilovic, A., Hu, H., & Soljanin, E. (2002). Extraction of perceptually important colors and similarity measurement for image matching, retrieval and analysis. IEEE Transactions on Image Processing, 11(11), 1238-1248.
Dynamic Feature Extraction: Analysis and description of the media content at the time of querying the database. The information is computed on demand and discarded after the query has been processed.
Mojsilovic, A., Kovacevic, J., Hu, J.-Y., Safranek, R. J., & Ganapathy, S. K. (2000). Matching and retrieval based on the vocabulary and grammar of color patterns. IEEE Transactions on Image Processing, 9(1), 38-54.
Feature Vector: Data that describes the content of the corresponding multimedia object. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.
Naphade, M. R., & Huang, T. S. (2002). Extracting semantics from audio-visual content: The final frontier in multimedia retrieval. IEEE Transactions on Neural Networks, 13(4), 793-810.
Index Structures: Adapted data structures to accelerate the retrieval. The a-priori extracted features are organised in such a way that the comparisons can be focused to a certain area around the query.
Rubner, Y., Tomasi, C., & Guibas, I. J. (2000). The earth mover’s distance as a metric for image retrieval. Journal of Computer Vision, 40(2), 99-121.
Multimedia Database: A multimedia database system consists of a high-performance database management system and a database with a large storage capacity and supports and manages, in addition to alphanumerical data types, multimedia objects regarding storage, querying, and searching.
Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380. Vasconcelos, N., & Kunt, M. (2001). Content-based retrieval from image databases: Current solutions and future directions. Proceedings of the IEEE International Conference on Image Processing, 3 (pp. 6-9).
216
Query by Pictorial Example: The query is formulated by using a user-provided example for the desired retrieval. Both the query and stored media objects are analysed in the same way. Similarity: Correspondence of two data objects of the same medium. The similarity is determined by comparing the corresponding feature vectors, for example, by a metric or distance function.
Continuous Auditing and Data Mining Edward J. Garrity Canisius College, USA Joseph B. O’Donnell Canisius College, USA G. Lawrence Sanders State University of New York at Buffalo, USA
INTRODUCTION Investor confidence in the financial markets has been rocked by recent corporate frauds and many in the investment community are searching for solutions. Meanwhile, due to changes in technology, organizations are increasingly able to produce financial reports on a real-time basis. Access to this timely information can help investors, shareholders, and other third parties, but only if this information is accurate and verifiable. Real-time financial reporting requires real-time or continuous auditing (CA) to ensure integrity of the reported information. Continuous auditing “ is a type of auditing which produces audit results simultaneously, or a short period of time after, the occurrence of relevant events” (Kogan, Sudit, & Vasarhelyi, 2003, p. 1). CA is facilitated by eXtensible Business Reporting Language (XBRL), which enables seamless transmission of company financial information to auditor data warehouses. Data mining of these warehouses provides opportunities for the auditor to determine financial trends and identify erroneous transactions.
BACKGROUND Auditing is a “systematic process of objectively obtaining and evaluating evidence of assertions about economic actions and events to ascertain the correspondence between those assertions and established criteria and communicating the results to interested parties” (Konrath, 2002, p. 5). In CA, the collection of evidence is constant, and evaluation of the evidence occurs promptly after collection (Kogan et al., 2003). Computerized Assisted Auditing Techniques (CAATs) are computer programs or software applications that are used to improve audit efficiency. CAATs offer great promise to improve audits but have not met expectations “due to a lack of a common interface with IT systems” (Liang et al., 2001, p. 131). Also, “concurrent
CAATS…. often require that special audit software modules be embedded at the EDP system design stage” (Liang et al., 2001, p. 131). Many entities are reluctant to allow the implementation of embedded audit modules, which perform CA, due to concerns these CAATs could adversely affect systems processing in areas such as reducing response times. Considering these difficulties, it is not surprising that continuous transaction monitoring tools are the second least used software by auditors (Daigle & Lampe, 2003). These hurdles to CAAT usage are being minimized by the emergence of eXtensible Markup Language (XML) and eXtensible Business Reporting Language that minimize system interface issues. XML is a mark-up language that allows tagging of data to give the data meaning. XBRL is a variant of XML that is designed specifically for financial reporting and provides the capability of real-time online performance reporting. Both XML and XBRL enable the receiver of the data to seamlessly download information to the receiver’s data warehouse According to David & Steinbart (2000), data warehouses improve audit quality and efficiency by reducing the time needed to access data and perform data analysis. Improved audit quality should lead to early detection, and possible prevention, of fraudulent financial reporting. Auditor data warehouses may also be used in financial fraud litigations in providing evidence to evaluate the legitimacy of transactions and appropriateness of auditor actions in assessing transactions. Data mining techniques are well suited to evaluate CA generated warehouse data but advances in audit tools are needed. Data mining and analysis software is the most commonly used audit software (Bierstaker, Burnaby, & Hass, 2003). Auditor data mining and analysis software typically includes low level statistical tools and auditor specific models like Benford’s Law. Benford’s Law holds that there is a naturally occurring pattern of values in the digits of a number (Nigrini, 2002). Significant variation from the expected number pattern may be due to erroneous or fraudulent transactions.
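As a rough illustration of the kind of digit analysis mentioned above, the sketch below compares the first-digit profile of a set of transaction amounts against the proportions expected under Benford's Law. The ledger values and the simple absolute-deviation score are invented for the example; a real audit tool would apply a proper statistical test to far larger transaction populations.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit proportions under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_profile(amounts):
    """Observed first-digit proportions for a list of non-zero transaction amounts."""
    digits = [int(str(abs(a)).lstrip("0.").replace(".", "")[0]) for a in amounts if a]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def benford_deviation(amounts):
    """Sum of absolute deviations from the Benford profile; large values may
    flag transaction populations worth closer audit attention."""
    expected, observed = benford_expected(), first_digit_profile(amounts)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10))

# hypothetical ledger amounts
ledger = [1250.00, 983.10, 47.25, 1890.00, 312.40, 99.99, 1022.75, 460.00]
print(benford_deviation(ledger))
```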
More robust auditing tools, using more sophisticated data mining methods, are needed for mining large databases and to help auditors meet auditing requirements. According to auditing standards (Statement on Audit Standards 99), auditors should incorporate unpredictability in procedures performed (Ramos, 2003). Otherwise, perpetrators of frauds may become familiar with common audit procedures and conceal fraud by placing it in areas that auditors are least likely to look.
MAIN THRUST It is critical for the modern auditor to understand the nature of CA, and the capabilities of different data mining methods in designing an effective audit approach. Toward this end, this paper addresses these issues through discussion of CA, comparison of data mining methods, and we also provide a potential CA and data mining architecture. Although it is beyond the scope of this paper to provide an in-depth technical discussion of the details of the proposed architecture, we hope this stimulates technical research in this area and provides a starting point for CA system designers.
Continuous Auditing (CA) Audits involve three major components: audit planning, conducting the audit, and reporting on audit findings (Konrath, 2002). The CA approach can be used for the audit planning and conducting the audit phases. According to Pushkin (2003), CA is useful for the strategic audit planning component that “addresses the strategic risk of reaching an inappropriate conclusion by not integrating essential activities into the audit plan” (p. 27). “Strategic information may be captured from the entity’s Intranets and from the global internet using intelligent agents” (Pushkin, 2003, p. 28). CA is also useful for performing the audit or what Pushkin (2003) refers to as the tactical component of the audit. “Tactical activities are most often directed at obtaining transactional evidence as a basis on which to assess the validity of assertions embodied in account balances” (p.27). For example, CA is useful in testing that entities comply with financial performance measures of debt covenants in loan agreements (Woodroof & Searcy, 2001). CA requires prompt responses to high-risk transactions and the ability to identify financial trends from large volumes of data. Intelligent agents can be used to promptly identify and respond to erroneous transactions. Understanding the capabilities of data mining methods in identifying financial trends is useful in selecting an appropriate data mining approach for CA. 218
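A hedged sketch of the debt-covenant idea mentioned above: a periodically received financial report is checked against covenant thresholds, and any exceptions are returned for auditor follow-up. The covenant names, thresholds, and report fields are hypothetical rather than taken from any cited system.

```python
# Hypothetical covenant thresholds -- actual terms come from the loan agreement.
COVENANTS = {
    "current_ratio_min": 1.5,        # current assets / current liabilities
    "debt_to_equity_max": 2.0,       # total debt / total equity
}

def check_covenants(report):
    """Evaluate a periodically received financial report (a plain dict here)
    against covenant thresholds and return any exceptions for the auditor."""
    exceptions = []
    current_ratio = report["current_assets"] / report["current_liabilities"]
    if current_ratio < COVENANTS["current_ratio_min"]:
        exceptions.append(f"current ratio {current_ratio:.2f} below "
                          f"{COVENANTS['current_ratio_min']}")
    debt_to_equity = report["total_debt"] / report["total_equity"]
    if debt_to_equity > COVENANTS["debt_to_equity_max"]:
        exceptions.append(f"debt-to-equity {debt_to_equity:.2f} above "
                          f"{COVENANTS['debt_to_equity_max']}")
    return exceptions

# one reporting period pushed from the entity's systems
period_report = {"current_assets": 420_000, "current_liabilities": 300_000,
                 "total_debt": 650_000, "total_equity": 280_000}
print(check_covenants(period_report))
```

In a continuous-auditing setting this check would run whenever a new report arrives, rather than once per audit cycle.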
Comparing Methods of Data Mining Data mining is a process by which one discovers previously unknown information from large sets of data. Data mining algorithms can be divided into three major groups: (1) mathematical-based methods, (2) logic-based methods, and (3) distance-based methods (Weiss & Indurkhya, 1998). Common examples from each major category are described below.
Mathematical-Based Methods Neural Networks An Artificial Neural Network (ANN) is a network of nodes modeled after a neuron or neural circuit. The neural network mimics the processing of the human brain. In a neural network, neurons are grouped into layers or slabs (Lam, 2004). An input layer consists of neurons that receive input from the external environment. The output layer communicates results of the ANN to the user or external environment. The ANN may also consist of a number of intermediate (hidden) layers. The processing of an ANN starts with inputs being received by the input layer and, upon being excited; the neurons are fired and produce outputs to the other layers of the system. The nodes or neurons are interconnected and they will send signals or “fire” only if the signals it receives exceed a certain threshold value. The value of a node is a nonlinear, (usually logistic), function of the weighted sum of the values sent to it by nodes that are connected to it (Spangler, May, & Vargas, 1999). Programming a neural network to process a set of inputs and produce the desired output is a matter of designing the interactions among the neurons. This process consists of the following: (1) arranging neurons in various layers, (2) deciding both the connections among neurons of different layers, as well as the neurons within a layer, (3) determining the way a neuron receives input and produces output (e.g., the type of function used), and (4) determining the strength of connections within the network by selecting and using a training data set so that the ANN can determine the appropriate values of connection weights (Lam, 2004). Prior neural network research has addressed audit areas of risk assessment, errors and fraud, going concern audit opinion, financial distress, and bankruptcy prediction (Lin, Hwang, & Becker, 2003). Research has identified successful uses of ANN, however, there are still many other issues to address. For instance, ANN is effective for analytical review procedures although there is no clear guideline for the performance measures to use for this analysis (Koskivaara, 2004). Neural network research found differences between the patterns of quantitative
and qualitative (textual) information from financial statements (Back, Toivenen, Vanharanta, & Visa, 2001). Auditors would benefit from advancement in identifying appropriate financial inputs for ANN and developing models that integrate quantitative and textual information.
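The following minimal NumPy sketch shows the layered structure described in this subsection: an input layer feeding a hidden layer and an output layer, with each neuron computing a logistic function of the weighted sum of the values sent to it. The weights here are random stand-ins; in practice they would be determined from a labelled training set, and the input features (e.g., financial ratios) are purely illustrative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, w_hidden, w_output):
    """One forward pass through a network with a single hidden layer."""
    hidden = logistic(inputs @ w_hidden)     # input layer -> hidden layer
    output = logistic(hidden @ w_output)     # hidden layer -> output layer
    return output

rng = np.random.default_rng(0)
x = np.array([0.4, 1.2, 0.7])                # e.g., three financial ratios
w_hidden = rng.normal(size=(3, 5))           # weights to five hidden neurons
w_output = rng.normal(size=(5, 1))           # weights to one output neuron (risk score)
print(forward(x, w_hidden, w_output))
```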
Discriminant Analysis

Linear discriminant analysis is similar to multiple regression analysis. The major difference is that discriminant analysis uses a nominal or ordinal dependent variable whereas regression uses a continuous dependent variable. Discriminant analysis constructs classification functions of the form

Y = c1a1 + c2a2 + … + cmam + c0

where ci is the coefficient for the case attribute ai and c0 is a constant, for each of the n categories (Spangler et al., 1999). One field in the set of data is assigned as the target (response or class, Y) variable, and the algorithm produces models for target variables as a function of the other fields in the data set, that are identified as explanatory variables (features, cases) (Apte et al., 2003).

Logic-Based Methods

Tree and Rule Induction

The Tree and Rule Induction approach to data mining uses an algorithm (e.g., the ID3 algorithm) to induce a decision tree from a file of individual cases, where the case is described by a set of attributes and the class to which the case belongs (Spangler et al., 1999). Once the decision tree is generated, it is a simple matter to convert from the decision tree format to an equivalent rule-based view. A big advantage of Tree and Rule Induction is the ability to communicate and understand the information derived from the data mining. Tree and Rule induction research has addressed audit areas of bankruptcy, bank failure, and credit risk (Spangler et al., 1999). A comparison of data mining approaches based on representational scheme and learning approach is presented in Table 1. Learning approaches are classified as either supervised, where induction rules are used in assigning observations to predefined classifications, or unsupervised, where both categories and classification rules are determined by the data mining method (Spangler et al., 1999). Representational scheme relates to how the data mining methods model relationships.
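As a small illustration of tree induction, the sketch below fits a shallow decision tree to a handful of made-up audit cases and prints the equivalent rule view. It uses scikit-learn's CART-style learner rather than ID3, and the two features and class labels are purely hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# toy cases: [transaction amount, days outstanding]; class 0 = routine, 1 = review
X = [[120, 5], [90, 12], [4500, 40], [75, 3], [5200, 65], [3900, 55], [60, 8], [4100, 70]]
y = [0, 0, 1, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# the induced tree converts directly into a human-readable rule view
print(export_text(tree, feature_names=["amount", "days_outstanding"]))
print(tree.predict([[4300, 50]]))   # classify a new case
```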
Distance-Based Methods Clustering Clustering is a data mining approach that partitions large sets of data objects into homogeneous groups. Clustering uses unsupervised classification where little manual prescreening of data is necessary – thus it is useful in situations where no predefined knowledge of categories is available (Maltseva et al., 2000). Generally, an object is described by a set of m attributes (features), and can be represented by an m-dimensional vector, called a pattern. Therefore, if a dataset contains n objects, it can be viewed as an n by m pattern matrix. Rows of the matrix correspond to patterns and columns correspond to features. Mathematical techniques are then employed to identify clus-
ters, which can be described as regions of this space containing points that are close to each other (Maltseva et al., 2000).
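The sketch below runs a plain k-means pass over an n-by-m pattern matrix, assigning each pattern to its nearest cluster centre, which is one common way of forming the homogeneous groups just described. The data are synthetic, and k-means is only one of many clustering methods that fit the general description above.

```python
import numpy as np

def kmeans(patterns, k, iterations=20, seed=0):
    """Plain Lloyd's k-means: rows of `patterns` are patterns, columns are
    features; returns a cluster label for every pattern plus the centres."""
    rng = np.random.default_rng(seed)
    centres = patterns[rng.choice(len(patterns), size=k, replace=False)]
    for _ in range(iterations):
        # assign each pattern to its closest centre
        distances = np.linalg.norm(patterns[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each centre to the mean of its assigned patterns
        for j in range(k):
            if np.any(labels == j):
                centres[j] = patterns[labels == j].mean(axis=0)
    return labels, centres

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (30, 4)), rng.normal(3, 0.3, (30, 4))])
labels, centres = kmeans(data, k=2)
print(labels[:5], labels[-5:])
```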
Criteria and Issues in Selecting a Data Mining Method Discriminant analysis and statistical techniques have enjoyed a long run of popularity in data mining. However, because of increasing availability of large volumes of data, a trend toward increased use of search-based, nonparametric modeling techniques such as neural networks and rule-based or decision trees has been taking place (Apte et al., 2003). Ultimately however, when evaluating and deciding on a data mining method, users must weigh
Table 1. Data mining approaches compared (see Endnote 1)

Decision Tree / Rule Induction: Logic-based; supervised learning; representational scheme: decision nodes and rules.
Neural Networks: Math-based; supervised learning; representational scheme: functional relationship between attributes and classes.
Discriminant Analysis: Math-based; supervised learning; representational scheme: functional relationship between attributes and classes.
Clustering: Distance-based; unsupervised learning; representational scheme: functional relationship between attributes and classes.
and often compromise between predictive accuracy, level of understandability, and computational demand (Apte et al., 2003). Indeed, when one examines issues such as scalability (i.e., how well the data mining method works regardless of the size of the data set), accuracy (i.e., how well the information extracted remains stable and constant beyond the boundaries of the data from which it was extracted, or trained), robustness (i.e., how well the data mining method works in a wide variety of domains), and interpretability (i.e., how well the data mining method provides understandable information and insight of value to the end user), it becomes clear that no data mining methods currently excel in all areas (Apte et al., 2003). The specific demands of CA raise a number of issues that relate to the selection of data mining methods and their appropriate application in this domain. The next section explores these issues in greater detail.
The CA Challenge: Difficulties and Issues

The demand for more timely communication of auditing information to business stakeholders requires auditors to find new ways to monitor, assemble and analyze audit information. A number of continuous audit tools and techniques will need to be developed to enable auditors to assess risk, evaluate internal controls, extract data, download data for analytical review, and identify exceptions and unusual transactions. One of the challenges in developing a CA capability is developing a technology infrastructure for extracting data with differing file formats and record structures from heterogeneous platforms (Rezaee et al., 2002). Another consideration in developing a technology infrastructure to gather transactions and other activity for auditing is the degree of automation employed. The degree of automation can vary depending on the audit system design but at least three possibilities are:

1. Embedded audit modules where audit programs are tightly integrated with application source code to constantly monitor and report on exceptional conditions (limited use of this approach due to potential adverse effects on entity processing, see Background section for further discussion);
2. The automatic capture and transformation of data and storage in data warehouses but still requiring auditor involvement in running queries to isolate exceptions and detect unusual patterns (automatic capture but with auditor intervention);
3. The automatic capture and transformation of data and storage in data warehouses and the integration of intelligent agents to modify data capture routines and exception reporting and trend spotting via multiple data mining methods (automatic, modified capture and modified data mining for trend analysis).
Figure 1. Continuous auditing and data mining architecture2 Entity Systems
Auditor Systems
Databases Environmental Monitoring Transactions Transactions
Capture and Transmit
XBRL XML
IntelligentAudit Evaluator
Transactions
Audit Data Warehouse
Data Transformation
Data Mart
Data Mart
Intelligent Agent: Data Mining Applications Manager ______________
220
Continuous Auditing and Data Mining
the interpretability of the data mining output information. Regardless of the particular context, in the area of CA a prime consideration is the frequent and (automatic) alteration or change of the data mining technique and the appropriate selection and changing of the selected auditing information to be reviewed. These are important considerations because the audited parties should not be able to anticipate the audit tests and thus create erroneous transactions that might go undetected (Ramos, 2003).
reliability of performance reporting. Auditors need to understand the capabilities of different data mining approaches to ensure effective continuous auditing. Looking ahead, researchers will need to further develop data mining approaches and tools for the expanding needs of the audit profession.
A Potential Architecture
Apte, C.V., Hong, S.J., Natarajan, R., Pednault, E.P.D., Tipu, F.A., & Weiss, S.M. (2003). Data-intensive analytics for predictive modeling. IBM Journal of Research and Development, 47(1), 17-23.
In order to handle the issues identified above, a proposed architecture for the CA process is presented in Figure 1. Two important features of the architecture are: (1) the use of data warehousing and data marts, and (2) the use of intelligent agents to periodically modify the data extraction and data mining methods. The architecture involves the transfer of data from the entity’s systems to the auditor’s systems through the use of XBRL and XML. The auditor’s systems include environmental monitoring of political, economic, and technological factors to address strategic audit issues and data mining of transactions to address tactical audit issues.
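A rough sketch of the data-transfer step in the proposed architecture: tagged transaction data are parsed and turned into rows that could be loaded into an audit data warehouse. The element names and structure are simplified placeholders; real XBRL instance documents rely on taxonomy-defined tags, namespaces, and contexts.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical tagging -- not a real XBRL taxonomy.
document = """
<report entity="ExampleCo" period="2005-Q4">
  <transaction id="t1"><account>Sales</account><amount>1250.00</amount></transaction>
  <transaction id="t2"><account>Refunds</account><amount>-90.00</amount></transaction>
</report>
"""

def to_rows(xml_text):
    """Turn tagged transactions into rows ready for an audit data warehouse."""
    root = ET.fromstring(xml_text)
    for tx in root.findall("transaction"):
        yield {
            "entity": root.get("entity"),
            "period": root.get("period"),
            "id": tx.get("id"),
            "account": tx.findtext("account"),
            "amount": float(tx.findtext("amount")),
        }

warehouse_staging = list(to_rows(document))
print(warehouse_staging)
```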
FUTURE TRENDS In the future, auditing and information systems researchers will need to identify additional tools and data mining methods that are most appropriate for this new application area. Emerging innovations in Web technology and the general public’s demand for reliable performance reporting are expected to spur demand for the use of data mining for CA. Expected growth in data mining for CA will provide opportunities and issues for auditors and researchers in several areas. Advancement in the areas of data mining of textual information will be useful to CA. Recognizing patterns in qualitative information provides a broader perspective and richer understanding of the entity’s internal and external situation (Back et al., 2001). Also, for risk based auditing ANN model builders will need “qualitative and quantitative factors that capture political, economic, and technological factors, as well as balanced scorecard metrics that signal the extent to which an entity is achieving its strategic objectives” (Lin et al., 2003, p. 230).
CONCLUSION The use of data mining for continuous auditing is poised to improve the effectiveness of audits, and ultimately the
REFERENCES
Back, B., Toivenen, J., Vanharanta, H., & Visa, A. (2001). Comparing numerical data and text information from annual reports using self-organizing maps. International Journal of Accounting Information Systems, 2(2001), 249-269. Bierstaker, J.L., Burnaby, P., & Hass, S. (2003). Recent changes in internal auditors’ use of technology. Internal Auditing, 18(4), 39-45. David, J.S., & Steinbart, P.J. (2000). Data warehousing and data mining: Opportunities for internal auditors. Altamonte Springs, FL: The Institute of Internal Auditors Research Foundation. Kogan, A., Sudit, E.F., & Vasarhelyi, M.A. (2003). Continuous online auditing: An evolution (pp. 1-25). Unpublished Workpaper. Konrath, L.F. (2002). Auditing: A risk analysis approach. Cincinnati, OH: South-Western. Koskivaara, E. (2004). Artificial neural networks in analytical procedures. Managerial Auditing Journal, 19(2), 191-223. Lam, M. (2004). Neural network techniques for financial performance prediction: Integrating fundamental and technical analysis. Decision Support Systems, 37(4), 567-581. Liang, D., Fengyi, L., & Wu, S. (2001). Electronically auditing EDP systems with the support of emerging information technologies. International Journal of Accounting Information Systems, 2, 130-147. Lin, J.W., Hwang, M.I., & Becker, J.D. (2003). A fuzzy neural network for assessing the risk of fraudulent financial reporting. Managerial Auditing Journal, 18(8), 657-665.
Maltseva, E., Pizzuti, C., & Talia, D. (2000). Indirect knowledge discovery by using singular value decomposition. In Data Mining II. Southhampton, UK: WIT Press. Nigrini, M.J. (2002). Analysis of digits and number patterns. In J.C. Robertson (Ed.), Fraud examination for managers and auditors (pp. 495-518). Austin, TX: Atex Austin, Inc.
Auditing: Systematic process of objectively obtaining and evaluating evidence of assertions about economic actions and events to ascertain the correspondence between those assertions and established criteria and communicating the results to interested parties. Clustering: Data mining approach that partitions large sets of data objects into homogeneous groups.
Pushkin, A.B. (2003). Comprehensive continuous auditing: The strategic component. Internal Auditing, 18(1), 26-33.
Computerized Assisted Auditing Techniques (CAATs): Software applications that are used to improve the efficiency of an audit.
Ramos, M. (2003). Auditors’ responsibility for fraud detection. Journal of Accountancy, 195(1), 28-36.
Continuous Auditing: Type of auditing, which produces audit results simultaneously, or a short period of time after, the occurrence of relevant events.
Rezaee, Z., Sharbatoghlie, A., Elam, R., & McMickle, P.L. (2002). Continuous auditing: Building automated auditing capability. Auditing: A Journal of Practice & Theory, 21(1), 147-163. Spangler, W.E., May, J.H., & Vargas, L.G. (1999). Choosing data mining methods for multiple classification: Representational and performance measurement implications for decision support. Journal of Management Information Systems, 16(1), 37-62. Weiss, S.M., & Indurkhya, N. (1998). Predictive data mining. San Francisco, CA: Morgan Kaufmann. Woodroof, J., & Searcy, D. (2001). Continuous audit model development and implementation within a debt covenant compliance domain. International Journal of Accounting Information Systems, 2, 169-191.
Discriminant Analysis: Statistical methodology used for classification that is based on the general regression model, and uses a nominal or ordinal dependent variable. eXtensible Business Reporting Language (XBRL): Mark up language that allows for the tagging of data and is designed for performance reporting. It is a variant of XML. eXtensible Markup Language (XML): Mark-up language that allows for the tagging of data in order to add meaning to the data. Tree and Rule Induction: Data mining approach that uses an algorithm (e.g., the ID3 algorithm) to induce a decision tree from a file of individual cases, where the case is described by a set of attributes and the class to which the case belongs.
KEY TERMS Artificial Neural Network (ANN): A network of nodes modeled after a neuron or neural circuit. The neural network mimics the processing of the human brain.
ENDNOTES
1. Adapted from Spangler et al. (1999).
2. Adapted from Rezaee et al. (2002).
Data Driven vs. Metric Driven Data Warehouse Design John M. Artz The George Washington University, USA
INTRODUCTION Although data warehousing theory and technology have been around for well over a decade, they may well be the next hot technologies. How can it be that a technology sleeps for so long and then begins to move rapidly to the foreground? This question can have several answers. Perhaps the technology had not yet caught up to the theory, or computer technology 10 years ago did not have the capacity to deliver what the theory promised. Perhaps the ideas and the products were just ahead of their time. All these answers are true to some extent. But the real answer, I believe, is that data warehousing is in the process of undergoing a radical theoretical and paradigmatic shift, and that shift will reposition data warehousing to meet future demands.
BACKGROUND Just recently I started teaching a new course in data warehousing. I have only taught it a few times so far, but I have already noticed that there are two distinct and largely incompatible views of the nature of a data warehouse. A prospective student, who had several years of industry experience in data warehousing but little theoretical insight, came by my office one day to find out more about the course. “Are you an Inmonite or a Kimballite?” she inquired, reducing the possibilities to the core issues. “Well, I suppose if you put it that way,” I replied, “I would have to classify myself as a Kimballite.” William Inmon (2000, 2002) and Ralph Kimball (1996, 1998, 2000) are the two most widely recognized authors in data warehousing and represent two competing positions on the nature of a data warehouse. The issue that this student was trying to get at was whether or not I viewed the dimensional data model as the core concept in data warehousing. I do, of course, but there is, I believe, a lot more to the emerging competition between these alternative views of data warehouse design. One of these views, which I call the data-driven view of data warehouse design, begins with existing organizational data. These data have more than likely been produced by existing transaction processing systems. They are cleansed and summarized and are
used to gain greater insight into the functioning of the organization. The analysis that can be done is a function of the data that were collected in the transaction processing systems. This was, perhaps, the original view of data warehousing and, as will be shown, much of the current research in data warehousing assumes this view. The competing view, which I call the metric-driven view of data warehouse design, begins by identifying key business processes that need to be measured and tracked over time in order for the organization to function more efficiently. A dimensional model is designed to facilitate that measurement over time, and data are collected to populate that dimensional model. If existing organizational data can be used to populate that dimensional model, so much the better. But if not, the data need to be acquired somehow. The metric-driven view of data warehouse design, as will be shown, is superior both theoretically and philosophically. In addition, it dramatically changes the research program in data warehousing. The metric-driven and data-driven approaches to data warehouse design have also been referred to, respectively, as metric pull versus data push (Artz, 2003).
MAIN THRUST Data-Driven Design The classic view of data warehousing sees the data warehouse as an extension of decision support systems. Again, in a classic view, decision support systems sit atop management information systems and use data extracted from management information and transaction processing systems to support decisions within the organization. This view can be thought of as a datadriven view of data warehousing, because the exploitations that can be done in the data warehouse are driven by the data made available in the underlying operational information systems. This data-driven model has several advantages. First, it is much more concrete. The data in the data warehouse are defined as an extension of existing data. Second, it is evolutionary. The data warehouse can be populated and exploited as new uses are found for existing data. Finally, there is no question that summary data can be
derived, because the summaries are based upon existing data. However, it is not without flaws. First, the integration of multiple data sources may be difficult. These operational data sources may have been developed independently, and the semantics may not agree. It is difficult to resolve these conflicting semantics without a known end state to aim for. But the more damaging problem is epistemological. The summary data derived from the operational systems represent something, but the exact nature of that something may not be clear. Consequently, the meaning of the information that describes that something may also be unclear. This is related to the semantic disintegrity problem in relational databases. A user asks a question of the database and gets an answer, but it is not the answer to the question that the user asked. When the somethings that are represented in the database are not fully understood, then answers derived from the data warehouse are likely to be applied incorrectly to known somethings. Unfortunately, this also undermines data mining. Data mining helps people find hidden relationships in the data. But if the data do not represent something of interest in the world, then those relationships do not represent anything interesting, either. Research problems in data warehousing currently reflect this data-driven view. Current research in data warehousing focuses on a) data extraction and integration, b) data aggregation and production of summary sets, c) query optimization, and d) update propagation (Jarke, Lenzerini, Vassiliou, & Vassiliadis, 2000). All these issues address the production of summary data based on operational data stores.
A Poverty of Epistemology The primary flaw in data-driven data warehouse design is that it is based on an impoverish epistemology. Epistemology is that branch of philosophy concerned with theories of knowledge and the criteria for valid knowledge (Fetzer & Almeder, 1993; Palmer, 2001). That is to say, when you derive information from a data warehouse based on the data-driven approach, what does that information mean? How does it relate to the work of the organization? To see this issue, consider the following example. If I asked each student in a class of 30 for their ages, then summed those ages and divided by 30, I should have the average age of the class, assuming that everyone reported their age accurately. If I were to generate a list of 30 random numbers between 20 and 40 and took the average, that average would be the average of the numbers in that data set and would have nothing to do with the average age of the class. In between those two extremes are any number of options. I could guess the ages of students based on their looks. I could ask mem224
bers of the class to guess the ages of other members. I could rank the students by age and then use the ranking number instead of age. The point is that each of these attempts is somewhere between the two extremes, and the validity of my data improves as I move closer to the first extreme. That is, I have measurements of a specific phenomenon, and those measurements are likely to represent that phenomenon faithfully. The epistemological problem in data-driven data warehouse design is that data is collected for one purpose and then used for another purpose. The strongest validity claim that can be made is that any information derived from this data is true about the data set, but its connection to the organization is tenuous. This not only creates problems with the data warehouse, but all subsequent data-mining discoveries are suspect also.
METRIC-DRIVEN DESIGN The metric-driven approach to data warehouse design begins by defining key business processes that need to be measured and tracked in order to maintain or improve the efficiency and productivity of the organization. After these key business processes are defined, they are modeled in a dimensional data model and then further analysis is done to determine how the dimensional model will be populated. Hopefully, much of the data can be derived from operational data stores, but the metrics are the driver, not the availability of data from operational data stores. A relational database models the entities or objects of interest to an organization (Teorey, 1999). These objects of interest may include customers, products, employees, and the like. The entity model represents these things and the relationships between them. As occurrences of these entities enter or leave the organization, that addition or deletion is reflected in the database. As these entities change in state, somehow, those state changes are also reflected in the database. So, theoretically, at any point in time, the database faithfully represents the state of the organization. Queries can be submitted to the database, and the answers to those queries should, indeed, be the answers to those questions if they were asked and answered with respect to the organization. A data warehouse, on the other hand, models the business processes in an organization to measure and track those processes over time. Processes may include sales, productivity, the effectiveness of promotions, and the like. The dimensional model contains facts that represent measurements over time of a key business process. It also contains dimensions that are attributes of these facts. The fact table can be thought of as the
dependent variable in a statistical model, and the dimensions can be thought of as the independent variables. So the data warehouse becomes a longitudinal dataset tracking key business processes.
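A small pandas sketch of this idea: a miniature fact table records a measurement of a hypothetical sales process, and grouping by the time and region dimensions yields the longitudinal view described above. The table, column names, and numbers are invented for illustration; they are not drawn from any system discussed in this article.

```python
import pandas as pd

# Miniature fact table for a hypothetical "sales" business process:
# the measurement (units_sold) plays the role of the dependent variable,
# the dimensions (date, region, product) act as independent variables.
fact_sales = pd.DataFrame({
    "date":    ["2005-01", "2005-01", "2005-02", "2005-02", "2005-03", "2005-03"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["A", "A", "A", "B", "B", "B"],
    "units_sold": [120, 95, 130, 80, 140, 90],
})

# Track the process over time, sliced by a dimension of interest.
trend_by_region = fact_sales.groupby(["date", "region"])["units_sold"].sum().unstack()
print(trend_by_region)
```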
A Parallel with Pre-Relational Days You can see certain parallels between the state of data warehousing and the state of database prior to the relational model. The relational model was introduced in 1970 by Codd but was not realized in a commercial product until the early 1980s (Date, 2004). At that time, a large number of nonrelational database management systems existed. All these products handled data in different ways, because they were software products developed to handle the problem of storing and retrieving data. They were not developed as implementations of a theoretical model of data. When the first relational product came out, the world of databases changed almost overnight. Every nonrelational product attempted, unsuccessfully, to claim that it was really a relational product (Codd, 1985). But no one believed the claims, and the nonrelational products lost their market share almost immediately. Similarly, a wide variety of data warehousing products are on the market today. Some are based on the dimensional model, and some are not. The dimensional model provides a basis for an underlying theory of data that tracks processes over time rather than the current state of entities. Admittedly, this model of data needs quite a bit of work, but the relational model did not come into dominance until it was coupled with entity theory, so the parallel still holds. We may never have an announcement in data warehousing as dramatic as Codd’s paper in relational theory. It is more likely that a theory of temporal dimensional data will accumulate over time. However, in order for data warehousing to become a major force in the world of databases, an underlying theory of data is needed and will eventually be developed.
The Implications for Research The implications for research in data warehousing are rather profound. Current research focuses on issues such as data extraction and integration, data aggregation and summary sets, and query optimization and update propagation. All these problems are applied problems in software development and do not advance our understanding of the theory of data. But a metric-driven approach to data warehouse design introduces some problems whose resolution can make a lasting contribution to data theory. Research problems in a metric-driven data warehouse include a) How do we identify key business processes? b) How do
we construct appropriate measures for these processes? c) How do we know those measures are valid? d) How do we know that a dimensional model has accurately captured the independent variables? e) Can we develop an abstract theory of aggregation so that the data aggregation problem can be understood and advanced theoretically? and, finally, f) can we develop an abstract data language so that aggregations can be expressed mathematically by the user and realized by the machine? Initially, both data-driven and metric-driven designs appear to be legitimate competing paradigms for data warehousing. The epistemological flaw in the datadriven approach is a little difficult to grasp, and the distinction — that information derived from a datadriven model is information about the data set, but information derived from a metric-driven model is information about the organization — may also be a bit elusive. However, the implications are enormous. The data-driven model has little future in that it is founded on a model of data exploitation rather than a model of data. The metric-driven model, on the other hand, is likely to have some major impacts and implications.
FUTURE TRENDS The Impact on White-Collar Work The data-driven view of data warehousing limits the future of data warehousing to the possibilities inherent in summarizing large collections of old data without a specific purpose in mind. The metric-driven view of data warehousing opens up vast new possibilities for improving the efficiency and productivity of an organization by tracking the performance of key business processes. The introduction of quality management procedures in manufacturing a few decades ago dramatically improved the efficiency and productivity of manufacturing processes, but such improvements have not occurred in white-collar work. The reason that we have not seen such an improvement in white-collar work is that we have not had metrics to track the productivity of white-collar workers. And even if we did have the metrics, we did not have a reasonable way to collect them and track them over time. The identification of measurable key business processes and the modeling of those processes in a data warehouse provides the opportunity to perform quality management and process improvement on white-collar work. Subjecting white-collar work to the same rigorous definition as blue-collar work may seem daunting, and indeed that level of definition and specification will not come easily. So what would motivate a business to 225
do this? The answer is simple: Businesses will have to do this when the competitors in their industry do it. Whoever does this first will achieve such productivity gains that competitors will have to follow suit in order to compete. In the early 1970s, corporations were not revamping their internal procedures because computerized accounting systems were fun. They were revamping their internal procedures because they could not protect themselves from their competitors without the information for decision making and organizational control provided by their accounting information systems. A similar phenomenon is likely to drive data warehousing.
Dimensional Algebras The relational model introduced Structured Query Language (SQL), an entirely new data language that allowed nontechnical people to access data in a database. SQL also provided a means of thinking about record selection and limited aggregation. Dimensional models can be exploited by a dimensional query language such as MDX (Spofford, 2001), but much greater advances are possible. Research in data warehousing will likely yield some sort of a dimensional algebra that will provide, at the same time, a mathematical means of describing data aggregation and correlation and a set of concepts for thinking about aggregation and correlation. To see how this could happen, think about how the relational model led us to think about the organization as a collection of entity types or how statistical software made the concepts of correlation and regression much more concrete.
A Unified Theory Of Data In the organization today, the database administrator and the statistician seem worlds apart. Of course, the statistician may have to extract some data from a relational database in order to do his or her analysis. And the statistician may engage in limited data modeling in designing a data set for analysis by using a statistical tool. The database administrator, on the other hand, will spend most of his or her time in designing, populating, and maintaining a database. A limited amount of time may be devoted to statistical thinking when counts, sums, or averages are derived from the database. But these two individuals will largely view themselves as participating in greatly differing disciplines. With dimensional modeling, the gap between database theory and statistics begins to close. In dimensional modeling we have to begin thinking in terms of construct validity and temporal data. We need to think about correlations between dependent and independent 226
variables. We begin to realize that the choice of data types (e.g., interval or ratio) will affect the types of analysis we can do on the data and will hence potentially limit the queries. So the database designer has to address concerns that have traditionally been the domain of the statistician. Similarly, the statistician cannot afford the luxury of constructing a data set for a single purpose or a single type of analysis. The data set must be rich enough to allow the statistician to find relationships that may not have been considered when the data set was being constructed. Variables must be included that may potentially have impact, may have impact at some times but not others, or may have impact in conjunction with other variables. So the statistician has to address concerns that have traditionally been the domain of the database designer. What this points to is the fact that database design and statistical exploitation are just different ends of the same problem. After these two ends have been connected by data warehouse technology, a single theory of data must be developed to address the entire problem. This unified theory of data would include entity theory and measurement theory at one end and statistical exploitation at the other. The middle ground of this theory will show how decisions made in database design will affect the potential exploitations, so intelligent design decisions can be made that will allow full exploitation of the data to serve the organization’s needs to model itself in data.
CONCLUSION Data warehousing is undergoing a theoretical shift from a data-driven model to a metric-driven model. The metric-driven model rests on a much firmer epistemological foundation and promises a much richer and more productive future for data warehousing. It is easy to haze over the differences or significance between these two approaches today. The purpose of this article was to show the potentially dramatic, if somewhat speculative, implications of the metric-driven approach.
REFERENCES Artz, J. (2003). Data push versus metric pull: Competing paradigms for data warehouse design and their implications. In M. Khosrow-Pour (Ed.), Information technology and organizations: Trends, issues, challenges and solutions. Hershey, PA: Idea Group Publishing. Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.
Codd, E. F. (1985, October 14). Is your DBMS really relational? Computerworld.
KEY TERMS
Date, C. J. (2004). An introduction to database systems (8th ed.). Addison-Wesley.
Data-Driven Design: A data warehouse design that begins with existing historical data and attempts to derive useful information regarding trends in the organization.
Fetzer, J., & Almeder, F. (1993). Glossary of epistemology/ philosophy of science. Paragon House. Inmon, W. (2002). Building the data warehouse. Wiley. Inmon, W. (2000). Exploration warehousing: Turning business information into business opportunity. Wiley. Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2000). Fundamentals of data warehouses. SpringerVerlag. Kimball, R. (1996). The data warehouse toolkit. Wiley. Kimball, R. (1998). The data warehouse lifecycle toolkit. Wiley. Kimball, R., & Merz, R. (2000). The data webhouse toolkit. Wiley. Palmer, D. (2001). Does the center hold? An introduction to Western philosophy. McGraw-Hill. Spofford, G. (2001). MDX Solutions. Wiley.
Data Warehouse: A repository of time-oriented organizational data used to track the performance of key business processes.
Dimensional Data Model: A data model that represents measurements of a process and the independent variables that may affect that process.
Epistemology: A branch of philosophy that attempts to determine the criteria for valid knowledge.
Key Business Process: A business process that can be clearly defined, is measurable, and is worthy of improvement.
Metric-Driven Design: A data warehouse design that begins with the definition of metrics that can be used to track key business processes.
Relational Data Model: A data model that represents entities in the form of data tables.
Data Management in Three-Dimensional Structures

Xiong Wang
California State University at Fullerton, USA
INTRODUCTION
Data management, in its general sense, refers to activities that involve the acquisition, storage, and retrieval of data. Traditionally, information retrieval is facilitated through queries, such as exact search, nearest neighbor search, range search, and so on. In the last decade, data mining has emerged as one of the most dynamic fields at the frontier of data management. Data mining refers to the process of extracting useful knowledge from the data. Popular data mining techniques include association rule discovery, frequent pattern discovery, classification, and clustering. In this chapter, we discuss data management for a specific type of data, namely three-dimensional structures. While research on text and multimedia data management has attracted considerable attention and substantial progress has been made, data management in three-dimensional structures is still in its infancy (Castelli & Bergman, 2001; Paquet & Rioux, 1999). Data management in 3D structures raises several interesting problems:

1. Similarity search
2. Pattern discovery
3. Classification
4. Clustering

Given a database of 3D structures and a query 3D structure, similarity search looks for those structures in the database that match the query structure within a range of tolerable errors. The similarity can be defined with two different measurements. The first measurement compares the data structure with the query structure in their entirety, i.e., a point-to-point match. We will call this aggregate similarity search. The second measurement compares only the contours or shapes of the data structure with those of the query structure. This is generally referred to as shape-based similarity search. The range of tolerable errors specifies how close the match should be when the data structure is aligned with the query structure. Pattern discovery is concerned with similar substructures that occur in multiple structures. Classification and clustering, when applied to these domains, attempt to group together 3D structures with similar shapes or containing similar patterns.

BACKGROUND

Three-dimensional structures can be used to describe data in different domains. In biology and chemistry, for example, a molecule is represented as a 3D structure with connections: each point is the center of an atom, and the connections are bonds between atoms. In computer-aided design, an object is specified as a set of 3D vectors that describe the shape of the object (Veltkamp, 2001; Suzuki & Sugimoto, 2003). In computer vision, the shape of a 3D object can be captured by X-ray or ultrasonic scanning devices; the result is a set of 3D points (Hilaga, Shinagawa, Kohmura, & Kunii, 2001). In medical imaging, 3D images of tissues or tumors can be collected using magnetic resonance imaging or computed tomography (Akutsu, Arakawa, & Murase, 2002). With advances in the Internet, scanning devices, and storage, the World Wide Web is becoming a huge reservoir of all kinds of data, and the number of 3D models available over the Internet has increased dramatically in the last two decades. Similarity search is a highly desirable technique in all these domains. Classification and clustering of biological data or chemical compounds have special significance. For example, proteins have traditionally been classified into families according to their specific functions; recently, however, many approaches have been proposed to classify proteins according to their structures, and some of these approaches achieve very high accuracy when compared with their biological counterparts. Classification and clustering can also help build index structures in 3D model retrieval to speed up similarity search. Currently, there is no universal model or framework for the representation, storage, and retrieval of three-dimensional structures. Most of these data are stored in plain text files in some specific format, and the format differs from application to application. Likewise, the existing techniques for information retrieval and data mining in three-dimensional structures are rooted in their areas of application. Two main areas of application are computer vision and scientific data mining, where computation-intensive techniques have been developed and are still in demand. We focus on data management in these two areas.
ADVANCES IN COMPUTER VISION

Shape-based recognition of 3D objects is a core problem in computer vision and has been studied for decades. There are roughly three categories of approaches: volume-based, feature-based, and interactive. For example, Keim (1999) proposed a geometry-based similarity search tree to deal with 3D volume-based search. He suggested using voxels to approximate the 3D objects. For each 3D object, a Minimum Surrounding Volume (MSV) and a Maximum Including Volume (MIV) are constructed using voxels. Similar objects are clustered in data pages, and the MSV and MIV approximations of the objects are stored in the directory pages. Another interesting scheme uses superellipsoids to approximate the shape of the volume. Superellipsoids are similar to ellipsoids except that the terms in the definition are raised to parameterized exponents, which are not necessarily integers. Indexing the superellipsoids is very difficult due to their different shapes and sizes. Feature-based approaches have been developed extensively in the literature. Belongie and co-authors (Belongie, Malik, & Puzicha, 2001, 2002) designed a descriptor, called the shape context, to characterize the distribution of the structure. For each point pi in the structure, the set of 3D vectors originating from pi to every other point in the structure is collected as the shape context. A histogram is constructed based on a comparison between the shape context of the query shape and that of the data shape. Osada, Funkhouser, Chazelle and Dobkin (2001) suggested using a similar descriptor, called the shape distribution. The shape distribution is a set of values calculated by a shape function, such as the Euclidean distance between two randomly selected points on the surface of a 3D object. A histogram is calculated based on the shape distributions of the two shapes under consideration. Kriegel et al. introduced a 3D shape similarity model that decomposes the 3D model into concentric shells and sectors around the center point (Ankerst, Kastenmüller, Kriegel, & Seidl, 1999). The histograms are determined by counting the number of points within each cell, and the similarity is calculated using a quadratic form distance function. Korn, Sidiropoulos, Faloutsos, Siegel and Protopapas (1998) proposed another descriptor, called the size distribution, which is similar to the pattern spectrum. They introduced a multi-scale distance function based on mathematical morphology. The distance function is integrated into the GEMINI framework to prune the search space; GEMINI is an index structure for high-dimensional feature spaces. Bespalov, Shokoufandeh, Regli, & Sun (2003) used Singular Value Decomposition to find suitable feature vectors. A data-level shape descriptor, called the spin image, was used in (Johnson & Hebert, 1999) to match surfaces represented as surface meshes. The system is capable of recognizing multiple objects in 3D scenes that contain clutter and occlusion. Saupe and Vranic (2001) suggested using rays cast from the center of mass of the object as feature vectors; the representation is compared using spherical harmonics and moments. All of these descriptors are very high-dimensional, and indexing them is a well-known difficult problem, due to "the curse of dimensionality". Furthermore, the approaches are not suitable for pattern discovery. An interactive search scheme was introduced in (Elad, Tal, & Ar, 2001). The algorithm sets the center of a structure to the origin and calculates the normalized moment value of each point. The normalized moment values are used to approximate the structure. Two structures are compared according to a weighted Euclidean distance, and the weights are adapted based on feedback from the user, using Singular Value Decomposition. Chui and Rangarajan (2000) developed an iterative optimization algorithm to estimate point correspondences and image transformations, such as affine or thin plate splines. The cost function is the sum of Euclidean distances between the transformed query shape and the transformed data shape. The distance function needs an initial correspondence between the two structures. Interactive refinement was also used in a medical image database system developed by Lehmann et al. (2004). A survey of shape matching in computer vision can be found in (Veltkamp & Hagedoorn, 1999). Shape-based similarity search in a large database of 3D objects is much more challenging and is still a young research area. The most recent achievements are a search engine developed at Princeton University (Funkhouser, et al., 2003) and a search system, 3DESS, for 3D engineering shapes built at Purdue University (Lou, Prabhakar, & Ramani, 2004). These techniques concentrate on computer vision and do not compare 3D models point-to-point; many of them only search for 3D objects that look similar. The effectiveness of such techniques often depends on subjective perception.
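To make the feature-based idea more concrete, the following is a minimal sketch in the spirit of the shape distribution of Osada et al. (2001): distances between randomly sampled point pairs are histogrammed, and two histograms are compared with an L1 distance. It operates on raw point sets rather than sampled mesh surfaces, and the sample size, bin count, and normalization are arbitrary choices, so it should be read as an illustration of the idea rather than the published method.

```python
import numpy as np

def d2_shape_distribution(points, n_pairs=10000, n_bins=32, rng=None):
    """Histogram of Euclidean distances between random point pairs (a D2-style descriptor)."""
    rng = np.random.default_rng(rng)
    points = np.asarray(points, dtype=float)
    i = rng.integers(0, len(points), size=n_pairs)
    j = rng.integers(0, len(points), size=n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    d /= d.max() + 1e-12                      # scale-normalize so the descriptor ignores size
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()                  # normalize to a probability distribution

def d2_dissimilarity(points_a, points_b):
    """L1 distance between two D2 histograms; 0 means identical distance distributions."""
    return np.abs(d2_shape_distribution(points_a, rng=0) -
                  d2_shape_distribution(points_b, rng=0)).sum()

# Toy usage: a sphere-like point cloud versus a flattened (ellipsoidal) copy of itself.
rng = np.random.default_rng(1)
sphere = rng.normal(size=(2000, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
flattened = sphere * np.array([1.0, 1.0, 0.2])
print(d2_dissimilarity(sphere, sphere))      # near 0
print(d2_dissimilarity(sphere, flattened))   # noticeably larger
```

Because the descriptor is a fixed-length histogram, the curse of dimensionality mentioned above shows up as soon as such vectors are indexed for nearest-neighbor retrieval at scale.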
EMERGING TECHNIQUES IN SCIENTIFIC DATA MINING

In structural biology, detecting similarity in protein structures has been a useful tool for discovering evolutionary relationships, analyzing functional similarity, and predicting protein structures. The differences in proteins or chemical compounds are very subtle. To discover functionally related proteins or chemical compounds, in many cases we have to match them atom-to-atom. Aggregate similarity search is known to be an NP-complete problem in combinatorial pattern matching. Even though the significance of the problem surfaced with applications to bioinformatics, few practical approaches have been developed. The latest results are a performance study based on 2D structures conducted by Gavrilov, Indyk, Motwani and Venkatasubramanian at Stanford (1999) and a novel algorithm developed by Wang and collaborators (Wang & Wang, 2000). The algorithm discovers patterns in 3D data with no assumptions about edges. The research group headed by Karypis at the University of Minnesota studies pattern discovery in graphs, with vertices and edges (Kuramochi & Karypis, 2002). Pattern discovery is another highly desirable technique in these domains. A motif is a substructure in proteins that has a specific geometric arrangement and, in many cases, is associated with a particular function, such as DNA binding. Active sites are another type of pattern in protein structures; they play an important role in protein-protein and protein-ligand interaction, i.e., the binding process. In drug design, scientists try to determine the binding sites of the target molecules and seek inhibitors that can be bound to them. For example, in the search for effective inhibitors of HIV protease, over 120 structure determinations have been carried out, and at least two inhibitors of HIV protease are now regularly used to treat AIDS. In the past two decades, the number of 3D protein structures has increased dramatically. The Protein Data Bank maintained 26,800 entries as of August 2004, and new structures are deposited every day. Performing similarity search and pattern discovery in such an enormous data set urgently demands highly efficient computational tools. Wang and coauthors (2002) developed a framework for discovering frequently occurring patterns in 3D structures and applied the approach to scientific data mining. The algorithm is a variant of the geometric hashing technique invented for model-based recognition in computer vision in 1988 (Lamdan & Wolfson, 1988). Similar approaches were also introduced in (Verbitsky, Nussinov, & Wolfson, 1999). Since hash functions map floating point numbers to integers when calculating the hash bin addresses, they do not preserve precise information about the data. As a consequence, these approaches are not suitable for similarity search that allows variable ranges of tolerable errors. Furthermore, false matches must be filtered out via a verification process. It is well known in the literature that geometric hashing is too sensitive to noise in the data. Due to the regularity of biological and chemical structures, dissimilarity is often very subtle, and inaccuracy introduced by the scanning devices adds noise to the data (Ankerst, Kastenmüller, Kriegel, & Seidl, 1999). It is extremely difficult to choose a fixed range of tolerable errors, especially when the data are collected by different domain experts using different equipment, as in the case of the Protein Data Bank (Berman, et al., 2000; Westbrook, et al., 2002). It is critical that the range of tolerable errors be set to a tunable parameter, so that the domain expert can choose an optimal value according to the context. A new index structure, the ∆B+ tree, was introduced by Wang in 2002 to overcome these weaknesses (Wang, 2002). ∆B+ trees preserve precise information about the data, allow pattern discovery with variable ranges of tolerable errors, and remove false matches entirely. In an attempt to compare the shapes of 3D structures, Wang also invented a new notion called the α-surface to capture the surface of a 3D structure at variable levels of detail (Wang, 2001).
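The superimposition step that underlies aggregate, atom-to-atom comparison can be sketched as follows. This is not the ∆B+ tree or geometric hashing algorithm; it only shows the least-squares rigid alignment (the Kabsch procedure) and a tunable error tolerance of the kind argued for above, under the simplifying assumption that the point correspondence between the two structures is already known.

```python
import numpy as np

def superimpose(P, Q):
    """Least-squares rotation/translation mapping point set P onto Q (known correspondence)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against an improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return Pc @ R + Q.mean(axis=0)              # P expressed in Q's frame

def matches_within_tolerance(P, Q, epsilon):
    """Aggregate similarity: every aligned point must fall within epsilon of its counterpart."""
    aligned = superimpose(P, Q)
    deviations = np.linalg.norm(aligned - np.asarray(Q, float), axis=1)
    rmsd = float(np.sqrt((deviations ** 2).mean()))
    return bool((deviations <= epsilon).all()), rmsd

# Toy usage: a rotated, translated, slightly perturbed copy of a structure.
rng = np.random.default_rng(7)
P = rng.uniform(-5, 5, size=(20, 3))
theta = 0.8
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
Q = P @ rot.T + np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.05, size=P.shape)
print(matches_within_tolerance(P, Q, epsilon=0.3))    # passes the loose tolerance
print(matches_within_tolerance(P, Q, epsilon=0.01))   # fails the tight tolerance
```

The epsilon parameter plays the role of the tunable tolerance discussed above; what counts as an acceptable value is left to the domain expert.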
FUTURE TRENDS

A universal model or framework for the representation, storage, and retrieval of three-dimensional structures is highly desirable. However, due to the different nature of the data and applications in the two areas, a model or framework that fits both is not feasible. In computer vision and 3D model retrieval, the number of points in each object is huge, and a single point does not impact perception substantially; in fact, a point that does not get along with other points is likely to be considered an outlier or noise. The number of objects available from different sources over the Internet increases tremendously every day. We are likely to see a 3D model search engine much like the search engines for text documents. A centralized database for 3D models is possible, but it will hold only a small portion of the available search space. The search engine will have to be capable of accessing data in different formats to answer user queries. Shape-based statistical approaches will remain the mainstream. In scientific data, on the other hand, each individual point is very important; it may represent the center of an atom, for example, and each point may also be associated with different properties. Thus, blindly searching through different sources for data in different formats is not practical. Instead, we are likely to see large centralized data reservoirs like the Protein Data Bank. For most of these applications, pattern discovery, classification, and clustering are the most desirable techniques.
CONCLUSION

Human interest in three-dimensional structures began even before Euclid. After centuries of investigation and learning, it has become clear that, although we have a better perception of 3D objects than of text and multimedia data, our techniques for the representation, storage, search, and discovery of 3D structures lag far behind. In computer vision, 3D object recognition is a well-known difficult problem. With advances in computational power and storage devices, maintaining an enormous amount of 3D structures is not only feasible but also inexpensive. However, few effective approaches have been developed to facilitate similarity search and pattern discovery in a very large database of 3D structures. Applications of 3D structures to bioinformatics pose even more intricate challenges, because these structures often need to be compared point-to-point. Fortunately, research on these topics has attracted more and more attention from computer scientists in different areas, such as computer vision, database systems, and artificial intelligence. It can be foreseen that substantial progress will be achieved and novel techniques will emerge in the near future.
REFERENCES Akutsu, T., Arakawa, K., & Murase H. (2002). Shape from contour using adaptive image selection. Systems and Computers in Japan, 33(11), 50-60. Ankerst, M., Kastenmüller, G., Kriegel, H-P., & Seidl, T. (1999). Nearest neighbor classification in 3D protein databases. Proc. of the 7th International Conference on Intelligent Systems for Molecular Biology (pp. 34-43), Heidelberg, Germany. Belongie, S. Malik, J., & Puzicha, J. (2001). Matching shapes. Proc. of the Eighth International Conference on Computer Vision (pp. 454-463), Los Alamitos, California. Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509-522. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., et al. (2000). The protein data bank. Nucleic Acids Research, 28(1), 235-242. Bespalov, D., Shokoufandeh, A., Regli, W.C., & Sun, W. (2003). Scale-space representation of 3D models and topological matching. ACM Symposium on Solid Modeling and Applications (pp. 208-215). Castelli, V., & Bergman, L. (2001). Image databases: Search and retrieval of digital imagery. John Wiley & Sons. Chui, H., & Rangarajan A. (2000). A new algorithm for nonrigid point matching. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 44-51), Hilton Head Island, South Carolina. Elad, M., Tal, A., & Ar, S. (2001), Content based retrieval of VRML objects – An iterative and interactive approach. Proc. of the Eurographics Workshop on Multimedia (pp. 97-108), Manchester, UK.
Funkhouser, T.A., Min, P. Kazhdan, M.M., Chen, J., Halderman, A., Dobkin, D.P., et al. (2003). A search engine for 3D models. ACM Transactions on Graphics, 22(1), 83-105. Gavrilov, M., Indyk, P., Motwani, R., & Venkatasubramanian, S. (1999). Geometric pattern matching: A performance study. Proc. of the Fifteenth Annual Symposium on Computational Geometry (pp. 79-85), Miami Beach, Florida. Hilaga, M., Y. Shinagawa, Y., T. Kohmura, T., & T. L. Kunii T.L. (2001). Topology matching for fully automatic similarity estimation of 3D shapes. SIGGRAPH, 203-212. Johnson, A.E., & Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 433-449. Keim, D.A. (1999). Efficient geometry-based similarity search of 3D spatial databases. Proc. of ACM SIGMOD International Conference on Management of Data (pp. 419-430), Philadephia, Pennsylvania. Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., & Protopapas, Z. (1998). Fast and effective retrieval of medical tumor shapes. IEEE Transactions on Knowledge and Data Engineering, 10(6), 889-904. Kuramochi, M., & Karypis, G. (2002). Discovering geometric frequent subgraphs. Proc. of the 2002 IEEE International Conference on Data Mining (pp. 258-265), Maebashi, Japan. Lamdan, Y., & Wolfson, H. (1988). Geometric hashing: A general and efficient model-based recognition scheme. Proc. of International Conference on Computer Vision (pp. 237-249). Lehmann, T.M., Plodowski, B., Spitzer, K., Wein, B.B., Ney, H., & Seidl, T. (2004). Extended query refinement for content-based access to large medical image databases. Proc. of SPIE Medical Imaging (pp. 90-98). Lou, K., Prabhakar, S., & Ramani, K. (2004). Contentbased three-dimensional engineering shape search. Proc. of the 2004 IEEE International Conference on Data Engineering (pp. 754-765), Boston. Osada, R., Funkhouser, T., Chazelle, B., & Dobkin, D. (2001). Matching 3D models with shape distributions. Proc. of the International Conference on Shape Modeling and Applications, Genova, Italy. Paquet, E., & Rioux, M. (1999). Crawling, indexing and retrieval of three-dimensional data on the web in the framework of MPEG-7. VISUAL, 179-18.
Saupe, D., & Vranic, D. V. (2001). 3D model retrieval with spherical harmonics and moments. DAGM-Symposium 2001 (pp. 392-397). Suzuki, M.T., & Sugimoto, Y.Y. (2003). A search method to find partially similar triangular faces from 3D polygonal models. Modeling and Simulation, 323-328. Veltkamp, R. C. (2001). Shape matching: Similarity measures and algorithms. IEEE Shape Modeling International, 188-197. Veltkamp, R. C., & Hagedoorn, M. (1999). State-of-the-art in shape matching. Technical Report UU-CS-1999-27, Utrecht. Verbitsky, G., Nussinov, R., & Wolfson, H. J. (1999). Flexible structural comparison allowing hinge bending and swiveling motions. Proteins, 34, 232-254. Wang, X. (2001). α -Surface and its application to mining protein data. Proc. of the IEEE International Conference on Data Mining (pp. 659-662), San Jose, California. Wang, X. (2002). ∆B+ tree: Indexing 3D point sets for pattern discovery. Proc. of the IEEE International Conference on Data Mining (pp. 701-704), Maebashi, Japan. Wang X., & Wang, J.T.L. (2000). Fast similarity search in three-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 40(2), 442-451. Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I., & Zhang, K. (2002). Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transaction on Knowledge and Data Engineering, 14(4), 731-749. Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., Ravichandran, V. et al. (2002). The protein data bank: Unifying the archive. Nucleic Acids Research, 30(1), 245248.
KEY TERMS

∆B+ Tree: An index structure that decomposes a 3D structure into point-triplets and indexes the triplets in a three-dimensional B+ tree.

α-Surface: The surface of a 3D structure that is constructed by rolling a solid ball with radius α along the contour and extracting every point the solid ball touches.

Aggregate Similarity Search: The search operation in 3D structures that matches the structures point-to-point in their entirety.

Feature Vector: A vector in which every dimension represents a property of a 3D structure. A good feature vector captures similarity and dissimilarity of 3D structures.

Histogram: The original term refers to a bar graph that represents a distribution. In information retrieval, it also refers to a weighted vector that describes the properties of an object or the comparison of properties between two objects. Like a feature vector, a good histogram captures similarity and dissimilarity of the objects of interest.

Shape-Based Similarity Search: The search operation in 3D structures that matches only the surfaces of the structures without referring to the points inside the surfaces.

Superimposition: The process of matching two 3D structures by alignment through rigid translations and rotations.

The Curse of Dimensionality: The original term refers to the exponential growth of hyper-volume as a function of dimensionality. In information retrieval, it refers to the phenomenon that the performance of an index structure for nearest neighbor search and ε-range search deteriorates rapidly due to the growth of hyper-volume.
Data Mining and Decision Support for Business and Science

Auroop R. Ganguly
Oak Ridge National Laboratory, USA

Amar Gupta
University of Arizona, USA

Shiraj Khan
University of South Florida, USA
INTRODUCTION

Analytical Information Technologies

Information by itself is no longer perceived as an asset. Billions of business transactions are recorded in enterprise-scale data warehouses every day. Acquisition, storage, and management of business information are commonplace and often automated. Recent advances in remote or other sensor technologies have led to the development of scientific data repositories. Database technologies, ranging from relational systems to extensions like spatial, temporal, time series, text, or media, as well as specialized tools like geographical information systems (GIS) or online analytical processing (OLAP), have transformed the design of enterprise-scale business or large scientific applications. The question increasingly faced by the scientific or business decision-maker is not how one can get more information or design better information systems but what to make of the information and systems already in place. The challenge is to be able to utilize the available information, to gain a better understanding of the past, and to predict or influence the future through better decision making. Researchers in data mining technologies (DMT) and decision support systems (DSS) are responding to this challenge. Broadly defined, data mining (DM) relies on scalable statistics, artificial intelligence, machine learning, or knowledge discovery in databases (KDD). DSS utilize available information and DMT to provide a decision-making tool usually relying on human-computer interaction. Together, DMT and DSS represent the spectrum of analytical information technologies (AIT) and provide a unifying platform for an optimal combination of data-dictated and human-driven analytics.

Table 1. Analytical information technologies

Data-Mining Technologies
• Association, correlation, clustering, classification, regression, database knowledge discovery
• Signal and image processing, nonlinear systems analysis, time series and spatial statistics, time and frequency domain analysis
• Expert systems, case-based reasoning, system dynamics
• Econometrics, management science

Decision Support Systems
• Automated analysis and modeling
  o Operations research
  o Data assimilation, estimation, and tracking
• Human-computer interaction
  o Multidimensional OLAP and spreadsheets
  o Allocation and consolidation engine, alerts
  o Business workflows and data sharing
Table 2. Application examples

Science and Engineering
• Bio-Informatics
• Genomics
• Hydrology, Hydrometeorology
• Weather Prediction
• Climate Change Science
• Remote Sensing
• Smart Infrastructures
• Sensor Technologies
• Land Use, Urban Planning
• Materials Science

Business and Economics
• Financial Planning
• Risk Analysis
• Supply Chain Planning
• Marketing Plans
• Text and Video Mining
• Handwriting/Speech Recognition
• Image and Pattern Recognition
• Long-Range Economic Planning
• Homeland Security
BACKGROUND

Tables 1 and 2 describe the state of the art in DMT and DSS for science and business, and provide examples of their applications. Researchers and practitioners have reviewed the state of the art in analytic technologies for business (Apte et al., 2002; Kohavi et al., 2002; Linden & Fenn, 2003) or science (Han et al., 2002), as well as data mining methods, software, and standards (Fayyad & Uthurusamy, 2002; Ganguly, 2002a; Grossman et al., 2002; Hand et al., 2001; Smyth et al., 2002) and decision support systems (Carlsson & Turban, 2002; Shim et al., 2002).
MAIN THRUST

Scientific and Business Applications

Rapid advances in information and sensor technologies (IT and ST), along with the availability of large-scale scientific and business data repositories or database management technologies, combined with breakthroughs in computing technologies, computational methods, and processing speeds, have opened the floodgates to data-dictated models and pattern matching (Fayyad & Uthurusamy, 2002; Hand et al., 2001). The use of sophisticated and computationally intensive analytical methods is expected to become even more commonplace with recent research breakthroughs in computational methods and their commercialization by leading vendors (Bradley et al., 2002; Grossman et al., 2002; Smyth et al., 2002). Scientists and engineers have developed innovative methodologies for extracting correlations and associations, dimensionality reduction, clustering or classification, regression, and predictive modeling tools based on expert systems and case-based reasoning, as well as decision support systems for batch or real-time analysis. They have utilized tools from areas like traditional statistics, signal processing, and artificial intelligence, as well as emerging fields like data mining, machine learning, operations research, systems analysis, and nonlinear dynamics. Innovative models and newly discovered patterns in complex, nonlinear, and stochastic systems encompassing the natural and human environments have demonstrated the effectiveness of these approaches. However, applications that can utilize these tools in the context of scientific databases in a scalable fashion have only begun to emerge (Curtarolo et al., 2003; Ganguly, 2002b; Grossman & Mazzucco, 2002; Grossman et al., 2001; Han et al., 2002; Kamath et al., 2002; Thompson et al., 2002).
Business solution providers and IT vendors, on the other hand, have focused primarily on scalability, process automation and workflows, and the ability to combine results from relatively simple analytics with judgments from human experts. For example, e-business applications in the areas of supply-chain planning, financial analysis, and business forecasting traditionally rely on decision-support systems with embedded data mining, operations research and OLAP technologies, business intelligence (BI), and reporting tools, as well as an easy-to-use GUI (graphical user interface) and extensible business workflows (Geoffrion & Krishnan, 2003). These applications can be custom built by utilizing software tools or are available as prepackaged ebusiness application suites from large vendors like SAP®, PeopleSoft®, and Oracle®, as well as best of breed and specialized applications from smaller vendors like Seibel® and i2 ®. A recent report by the market research firm Gartner (Linden & Fenn, 2003) summarizes the relative maturity and current industry perception of advanced analytics. For reasons ranging from excessive (IT) vendor hype to misperceptions among end users caused by inadequate quantitative background, the business community is barely beginning to realize the value of data-dictated predictive, analytical, or simulation models. However, there are notable exceptions to this trend (Agosta et al., 2003; Apte et al., 2002; Geoffrion & Krishnan, 2003; Kohavi et al., 2002; Wang & Jain, 2003; Yurkiewicz, 2003).
Solutions Utilizing DMT and DSS

For a scientist or an engineer, as well as for a business manager or management scientist, DMT and DSS are tools used for developing domain-specific applications. These applications might combine knowledge about the specific scientific or business domain (e.g., through the use of physically-based or conceptual-scientific models, business best practices and known constraints, etc.) with data-dictated or decision-making tools like DSS and DMT. Within the context of these applications, DMT and DSS can aid in the discovery of novel patterns, development of predictive or descriptive models, mitigation of natural or manmade hazards, preservation of civil societies and infrastructures, improvement in the quality and span of life as well as in economic prosperity and well being, and development of natural and built environments in a sustainable fashion. Disparate applications utilizing DMT and DSS tools tend to have interesting similarities. Examples of current best practices in the context of business and scientific applications are provided next. Business forecasting, planning, and decision support applications (Carlsson & Turban, 2002; Shim et al.,
2002; Wang & Jain, 2003; Yurkiewicz, 2003) usually need to read data from a variety of sources like online transactional processing (OLTP) systems, historical data warehouses and data marts, syndicated data vendors, legacy systems, or public domain sources like the Internet, as well as in the form of real-time or incremental data entry from external or internal collaborators, expert consultants, planners, decision makers, and/or executives. Data from disparate sources usually are mapped to a predefined common data model and incorporated through extraction, transformation, and loading (ETL) tools. End users are provided GUI-based access to define application contexts and settings, structure business workflows and planning cycles and format data models for visualization, judgmental updates, or analytical and predictive modeling. The parameters of the embedded data mining models might be preset, calculated dynamically based on data or user inputs, or specified by a power user. The results of the data mining models can be utilized automatically for optimization and recommendation systems and/or can be used to serve as baselines for planners and decision makers. Tools like BI, Reports, and OLAP (Hammer, 2003) are utilized to help planners and decision makers visualize key metrics and predictive modeling results, as well as to utilize alert mechanisms and selection tools to manage by exception or by objectives. Judgmental updates at various levels of aggregation and their reconciliation, collaboration among internal experts and external trading partners, as well as managerial review processes and adherence to corporate directives, are aided by allocation and consolidation engines, tools for simulation, ad hoc and predefined reports, user-defined business workflows, audit trails with comments and reason codes, and flexible information transfer and data-handling capabilities. Emerging technologies include the use of automated DMT for aiding traditional DSS tasks; for example, the use of data mining to zero down on the cause of aggregate exceptions in multidimensional OLAP cubes. The end results of the planning process are usually published in a pre-defined placeholder (e.g., a relational database table), which, in turn, can be accessed by execution systems or other planning applications. The use of elaborate mechanisms for user-driven analysis and judgmental or collaborative decisions, as opposed to reliance on automated DMT, remains a guiding principle for the current genre of business-planning applications. The value of collaborative decision making and global visibility of information is near axiomatic for business applications. However, researchers need to design better DMT applications that can utilize available information from disparate sources through advanced analytics and account for specific domain knowledge, constraints, or bottlenecks. Valuable and/or scarce human resources can be conserved by automating routine tasks and by reserving expert resources
for high value-added jobs (e.g., after a Pareto classification) or for exceptional situations (e.g., large prediction variance or significant risks). In addition, certain research studies have indicated that judgmental overrides may not improve upon the results of automated predictive models on the (longer-term) average. Scientists and engineers traditionally have utilized advanced quantitative approaches for making sense of observations and experimental results, formulating theories and hypotheses, and designing experiments. For users of statistical and numerical approaches in these domains, DMT often seems like the proverbial old wine in new bottles. However, innovative use of DMT includes the development of algorithms, systems, and practices that can not only apply novel methodologies, but also can scale to large scientific data repositories (Connover et al., 2003; Graves, 2003; Han et al., 2002; He et al., 2003; Ramachandran et al., 2003). While scientific and business data mining have a lot in common, the incorporation of domain knowledge is probably more critical in scientific applications. When appropriately combined with domain-specific knowledge about the physics or the data sources/uncertainties, DMT approaches have the potential to revolutionize the processes of scientific discovery, verification, and prediction (Han et al., 2002; Karypis, 2002). This potential has been demonstrated by recent applications in diverse areas like remote sensing (Hinke et al., 2000), material sciences (Curtarolo et al., 2003), bioinformatics (Graves, 2003), and the earth sciences (Ganguly, 2002b; Kamath et al., 2002; Potter et al., 2003; Thompson et al., 2002; see http://datamining.itsc.uah.edu/adam/). Besides physical and data-dictated methods, human-computer interaction retains a significant role in real-world scientific decision making. This necessitates the use of DSS, where the results of DMT can be combined with expert judgment and techniques from simulation, operations research (OR), and other DSS tools. The Reviews of Geophysics (American Geophysical Union, 1995) provides a slightly dated discussion on the use of data assimilation, estimation, and OR, as well as DSS (http://www.agu.org/journals/rg/rg9504S/contents.html#hydrology). Examples of decision support systems and tools in scientific and engineering applications also can be found in dedicated journals like Decision Support Systems or Journal of Decision Systems (see Vol. 8, Number 2, 1998, and the latest issues), as well as in journals or Web sites dealing with scientific and engineering topics (McCuistion & Birk, 2002) (see the NASA air traffic control Web site at http://www.asc.nasa.gov/aatt/dst.html; NASA research Web sites for a global carbon DSS at http://geo.arc.nasa.gov/website/cquestwebsite/index.html; and institutes like
MIT Lincoln Laboratories at http://www.ll.mit.edu/AviationWeather/index2.html).
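The manage-by-exception pattern described above for business planning, in which routine items are automated and scarce expert attention is reserved for high-value or high-uncertainty items, can be sketched in a few lines. The item names, values, and thresholds below are hypothetical, and the simple Pareto cut and coefficient-of-variation test stand in for whatever classification and prediction-interval logic a real planning application would use.

```python
def plan_by_exception(items, value_share=0.8, max_cv=0.25):
    """Split forecast items into an 'expert review' bucket and an 'auto-accept' bucket.

    items: list of dicts with 'name', 'value' (e.g., annual revenue), and a
    'forecast_mean' / 'forecast_sd' pair coming from some predictive model.
    High-value items (the top `value_share` of total value, a simple Pareto cut)
    or items with a large coefficient of variation go to expert review.
    """
    ranked = sorted(items, key=lambda x: x["value"], reverse=True)
    total = sum(x["value"] for x in items)
    review, auto, running = [], [], 0.0
    for item in ranked:
        running += item["value"]
        cv = item["forecast_sd"] / max(item["forecast_mean"], 1e-9)
        if running <= value_share * total or cv > max_cv:
            review.append(item["name"])      # scarce expert attention goes here
        else:
            auto.append(item["name"])        # accept the model's baseline forecast
    return review, auto

items = [
    {"name": "product A", "value": 5_000_000, "forecast_mean": 420_000, "forecast_sd": 30_000},
    {"name": "product B", "value": 1_200_000, "forecast_mean": 100_000, "forecast_sd": 45_000},
    {"name": "product C", "value": 150_000, "forecast_mean": 12_000, "forecast_sd": 1_000},
]
print(plan_by_exception(items))
```

In this toy run, product A is reviewed because of its value, product B because of its forecast uncertainty, and product C is accepted automatically.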
FUTURE TREND

Business applications have focused on DSS with embedded and scalable implementations of relatively straightforward DMT. Scientific applications have focused traditionally on advanced DMT in prototype applications with sample data. Researchers and practitioners of the future need to bring advanced DMT to business applications, and scalable DMT and DSS to scientists and engineers. This provides a perfect opportunity for innovative and multi-disciplinary collaborations.

CONCLUSION

The power of information technologies has been utilized to acquire, manage, store, retrieve, and represent data in information repositories, and to share, report, process, collaborate on, and move data in scientific and business applications. Database management and data warehousing technologies have matured significantly over the years. Tools for building custom and packaged applications, including, but not limited to, workflow technologies, Web servers, and GUI-based data entry and viewing forms, are steadily maturing. There is a clear and present need to exploit the available data and technologies to develop the next generation of scientific and business applications, which can combine data-dictated methods with domain-specific knowledge. Analytical information technologies, which include DMT and DSS, are particularly suited for these tasks. These technologies can facilitate both automated (data-dictated) and human expert-driven knowledge discovery and predictive analytics, and can also be made to utilize the results of models and simulations that are based on process physics or business insights. If DMT and DSS were to be defined broadly, a broad statement can perhaps be made that, while business applications have scalable but straightforward DMT embedded within DSS, scientific applications have utilized advanced DMT but focused less on scalability and DSS. Multidisciplinary research and development efforts are needed in the future for maximal utilization of analytical information technologies in the context of these applications.

ACKNOWLEDGMENTS

Most of the work was completed while the first author was on a visiting faculty appointment at the University of South Florida and the second author was a member of the faculty at the MIT Sloan School of Management.

REFERENCES

Agosta, L., Orlov, L.M., & Hudson, R. (2003). The future of data mining: Predictive analytics. Forrester Brief.
Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002, August). Business applications of data mining. Communications of the ACM, 45(8), 49-53. Bradley, P. et al. (2002). Scaling mining algorithms to large databases. Communications of the ACM, 45(8), 38-43. Carlsson, C., & Turban, E. (2002). DSS: Directions for the next decade. Decision Support Systems, 33(2), 105-110. Conover, H. et al. (2003). Data mining on the TeraGrid. Proceedings of the Supercomputing Conference, Phoenix, Arizona. Curtarolo, S. et al. (2003). Predicting crystal structures with data mining of quantum calculations. Physics Review Letters, 91(13). Fayyad, U., & Uthurusamy, R. (2002, August). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-31. Ganguly, A.R. (2002a). Software review—Data mining components. ORMS Today, 29(5), 56-59. Ganguly, A.R. (2002b). A hybrid approach to improving rainfall forecasts. Computers in Science and Engineering, 4(4), 14-21. Geoffrion, A.M., & Krishnan, R. (Eds.). (2003). Ebusiness and management science—Mutual impacts (Parts 1 and 2). Management Science, 49(10-11). Graves, S.J. (2003). Data mining on a bioinformatics grid. Proceedings of the SURA BioGrid Workshop, Raleigh, North Carolina. Grossman, R. et al. (2001). Data mining for scientific and engineering applications. Kluwer. Grossman, R.L., Hornick, M.F., & Meyer, G. (2002). Data mining standards initiative. Communications of the ACM, 45(8), 59-61. Grossman, R.L., & Mazzucco, M. (2002). DataSpace: A data Web for the exploratory analysis and mining of data. Computers in Science and Engineering, 4(4), 44-51. Hammer, J. (Ed.). (2003). Advances in online analytical processing. Data & Knowledge Engineering, 45(2), 127256.
Han, J. et al. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58. Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press. He, Y. et al. (2003). Framework for mining and analysis of space science data. Proceedings of the SIAM International Conference on Data Mining, San Francisco, California. Hinke, T., Rushing, J., Ranganath, H.S., & Graves, S.J. (2000). Techniques and experience in mining remotely sensed satellite data. Artificial Intelligence Review, 14(6), 503-531. Kamath, C. et al. (2002). Classifying of bent-double galaxies. Computers in Science and Engineering, 4(4), 52-60. Karypis, G. (2002). Guest editor’s introduction: Data mining. Computers in Science and Engineering, 4(4), 12-13. Kohavi, R., Rothleder, N.J., & Simoudis, E. (2002). Emerging trends in business analytics. Communications of the ACM, 45(8), 45-48. Linden, A., & Fenn, J. (2003). Hype cycle for advanced analytics, 2003. Gartner Strategic Analysis Report. McCuistion, J.D., & Birk, R. (2002). From observations to decision support: The new paradigm for satellite data. NASA Technical Report. Retrieved from http:// www.iaanet.org/symp/berlin/IAA-B4-0102.pdf Potter, C. et al. (2003). Global teleconnections of ocean climate to terrestrial carbon flux. Journal of Geophysical Research, American Geophysical Union, 108(D17), 4556. Ramachandran, R. et al. (2003). Flexible framework for mining meteorological data. Proceedings of the American Meteorological Society’s (AMS) 19th International Conference on Interactive Information Processing Systems (IIPS) Meteorology, Oceanography, and Hydrology, Long Beach, California. Shim, J. et al. (2002). Past, present, and future of decision support technology. Decision Support Systems, 33(2), 111-126. Smyth, P., Pregibon, D., & Faloutsos, C. (2002). Datadriven evolution of data mining algorithms. Communications of the ACM, 45(8), 33-37. Thompson, D.S. et al. (2002). Physics-based feature mining for large data exploration. Computers in Science and Engineering, 4(4), 22-30.
Wang, G.C.S., & Jain, C.L. (2003). Regression analysis: Modeling and forecasting. Institute of Business Forecasting. Yurkiewicz, J. (2003). Forecasting software survey: Predicting which product is right for you. ORMS Today.
KEY TERMS

Analytical Information Technologies (AIT): Information technologies that facilitate tasks like predictive modeling, data assimilation, planning, or decision making through automated data-driven methods, numerical solutions of physical or dynamical systems, human-computer interaction, or a combination. AIT includes DMT, DSS, BI, OLAP, GIS, and other supporting tools and technologies.

Business and Scientific Applications: End-user modules that are capable of utilizing AIT along with domain-specific knowledge (e.g., business insights or constraints, process physics, engineering know-how). Applications can be custom-built or pre-packaged and are often distinguished from other information technologies by their cognizance of the specific domains for which they are designed. This can entail the incorporation of domain-specific insights or models, as well as pre-defined information and process flows.

Business Intelligence (BI): Broad set of tools and technologies that facilitate management of business knowledge, performance, and strategy through automated analytics or human-computer interaction.

Data Assimilation: Statistical and other automated methods for parameter estimation, followed by prediction and tracking.

Data Mining Technologies (DMT): Broadly defined, these include all types of data-dictated analytical tools and technologies that can detect generic and interesting patterns, scale (or can be made to scale) to large data volumes, and help in automated knowledge discovery or prediction tasks. These include determining associations and correlations, clustering, classifying, and regressing, as well as developing predictive or forecasting models. The specific tools used can range from traditional or emerging statistics and signal or image processing, to machine learning, artificial intelligence, and knowledge discovery from large databases, as well as econometrics, management science, and tools for modeling and predicting the evolutions of nonlinear dynamical and stochastic systems.
Decision Support Systems (DSS): Broadly defined, these include technologies that facilitate decision making. These can embed DMT and utilize these through automated batch processes and/or user-driven simulations or what-if scenario planning. The tools for decision support include analytical or automated approaches like data assimilation and operations research, as well as tools that help the human experts or decision makers manage by objectives or by exception, like OLAP or GIS.

Geographical Information Systems (GIS): Tools that rely on data management technologies to manage, process, and present geospatial data, which, in turn, can vary with time.

Online Analytical Processing (OLAP): Broad set of technologies that facilitate drill-down or aggregate analyses, as well as presentation, allocation, and consolidation of information along multiple dimensions (e.g., product, location, and time). These technologies are well suited for management by exceptions or objectives, as well as automated or judgmental decision making.

Operations Research (OR): Mathematical and constraint programming and other techniques for mathematically or computationally determining optimal solutions for objective functions in the presence of constraints.

Predictive Modeling: The process through which mathematical or numerical technologies are utilized to understand or reconstruct past behavior and predict expected behavior in the future. Commonly utilized tools include statistics, data mining, and operations research, as well as numerical or analytical methodologies that rely on domain knowledge.
Data Mining and Warehousing in Pharma Industry

Andrew Kusiak
The University of Iowa, USA

Shital C. Shah
The University of Iowa, USA
INTRODUCTION

Most processes in the pharmaceutical industry are data driven. A company's ability to capture data and make use of it will grow in significance and may become the main factor differentiating companies in the industry. Basic concepts of data mining, data warehousing, and data modeling are introduced. These new data-driven concepts lead to a paradigm shift in the pharmaceutical industry.
BACKGROUND

The volume of data in the pharmaceutical industry has been growing at an unprecedented rate. For example, a microarray (the equivalent of one test) may provide thousands of data points. These huge datasets offer challenges and opportunities. The primary challenges are data storage, management, and knowledge discovery. The pharmaceutical industry is one of the most data-driven industries, and getting value out of the data is key to its success. Thus, processing the data with novel tools and methods for extracting useful knowledge, for speedy drug discovery, and for optimal matching of drugs with patients may well become the main factor differentiating pharmaceutical companies. The main sources of pharmaceutical data are patients, genetic tests, regulatory sources, the pharmaceutical literature, and so on. Data and information collected from different sources, agencies, and clinics have traditionally been used for narrow reporting (Berndt et al., 2001). Massive clinical trials may lead to errors such as protocol violations, data integrity problems, data format problems, transfer errors, and so on. Traditionally, analysis in the pharmaceutical industry, from prediction of drug stability (King et al., 1984) to drug discovery techniques (Tye, 2004), has been performed using population-based statistical techniques. To minimize possible errors and increase confidence in predictions, there is a need for a comprehensive methodology for data collection, storage, and analysis. Automation of pharmaceutical data over its lifecycle seems to be an obvious alternative.
The vision for the pharmaceutical industry is captured in Figure 1. It has become apparent that data will play a central role in drug discovery and delivery to the individual patient. We are about to see pharmaceutical companies delivering drugs, developing test kits (including genetic tests), and providing computer programs to deliver the best drug to the patient. Current data mining (DM) tools are capable of supporting customized prescriptions and identifying the most effective drugs and dosages with minimal adverse effects. It has become practical to think of designing, producing, and delivering drugs intended for an individual patient (or a small population of patients). Here, the analogy of today's one-of-a-kind production versus mass production is worth attention. Many industries are attempting to produce customized products. The pharmaceutical industry may soon become a new addition to this collection of industries.
MAIN THRUST

Paradigm Shift

Predicting implies knowing in advance, which in a business environment translates into competitive advantage.

Figure 1. The future of the pharmaceutical industry (a pharma company turns data into drugs, test kits, and computer programs; today: customized prescription, tomorrow: customized medication)
A future event can be predicted in two major ways: population-based and individual-based. The population-based prediction says, for example, that drug A has been effective in treating 80% of patients in population P, as illustrated symbolically in Figure 2. Of course, any patient would like to belong to the 80% rather than the 20% category before drug A is administered. Statistics and other tools have been widely used in support of the population paradigm, among others in medicine and the pharmaceutical industry. The individual-based approach, supported by numerous DM algorithms, emphasizes the individual patient rather than the population (Kusiak et al., 2005). One of many decision-making scenarios is illustrated in Figure 3, where the original population P of patients has been partitioned into two segments, 1 and 2. The decisions for each patient in Segment 1 are made with high confidence, say 99%, while the decisions for Segment 2 are predicted with lower confidence. It is quite possible that Segment 2 patients would seek an alternative drug or treatment.

Figure 2. Population-based paradigm (drug A applied to population P: 80% of P cured, 20% of P not cured; confidence = 95%)

Figure 3. Individual-based paradigm using data mining tools (drug A applied to population P; Segment 1: 70% of P cured, 15% not cured, confidence = 99%; Segment 2: 10% of P cured, 5% not cured, confidence = 85%)

There are different ways of using DM algorithms, and they cover the range between the population- and individual-based paradigms. The existing DM algorithms can be grouped into the following ten basic classes (Kusiak, 2001):

A. Classical statistical methods (e.g., linear, quadratic, and logistic discriminant analyses)
B. Modern statistical techniques (e.g., projection pursuit classification, density estimation, k-nearest neighbor, Bayes algorithm)
C. Neural networks (Mitchell, 1997)
D. Support vector machines
E. Decision tree algorithms [e.g., C4.5 (Quinlan, 1992)]
F. Decision rule algorithms [e.g., rough set algorithms (Pawlak, 1991)]
G. Association rule algorithms
H. Learning classifier systems
I. Inductive learning algorithms
J. Text learning algorithms
Each class contains numerous algorithms; for example, there are more than 100 implementations of the decision tree algorithm (class E).
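As an illustration of the individual-based paradigm with a class E algorithm, the sketch below fits a decision tree to a small synthetic patient set and reports, for one new patient, the outcome predicted by the leaf that patient falls into together with the leaf's class frequency as a rough confidence. The features, the response rule, and the patient values are invented, and leaf purity is only a stand-in for the confidence measures a real study would define.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical patients: [age, dose_mg, marker_level]; outcome 1 = improved.
X = rng.uniform([20, 10, 0.0], [80, 100, 5.0], size=(200, 3))
y = ((X[:, 1] > 40) & (X[:, 2] > 1.5)).astype(int)        # invented response rule
y ^= (rng.random(200) < 0.05).astype(int)                 # a little label noise

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

new_patient = np.array([[55, 70, 2.4]])
proba = tree.predict_proba(new_patient)[0]                # class frequencies in the patient's leaf
prediction = int(tree.predict(new_patient)[0])
print(f"predicted outcome: {prediction}, leaf-level confidence: {proba[prediction]:.2f}")
```

Patients landing in high-purity leaves correspond to Segment 1 in Figure 3, while patients in mixed leaves would fall into the lower-confidence Segment 2.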
Data Warehouse Design

A warehouse has to be designed to meet users' requirements. DM, online analytical processing (OLAP), and reporting are the top items on the list of requirements. Systems design methodologies and tools can be used to facilitate the requirements capture. Examples of methodologies for analysis of data warehouse (DW) requirements include AND/OR graphs and the house of quality (Kusiak, 2000). The architecture of a typical DW embedded in a pharmaceutical environment is shown in Figure 4. The pharmaceutical data is extracted from numerous sources and preprocessed to minimize inconsistencies. Also, data transformation will capture intricate solution spaces to improve knowledge discovery. The cleaned and transformed data is loaded and refreshed directly into a DW or data marts. A data mart might be a precursor to the full-fledged DW or function as a specialized DW. A special-purpose (exploratory) data mart or DW might be created for exploratory data analysis and research. The warehouse and data marts serve various applications that justify the development and maintenance cost of this data storage technology. The range of services developed off a DW could be expanded beyond OLAP and DM into almost all pharmaceutical business areas, including interactions with federal agencies and other businesses.
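A drastically simplified version of the extract-transform-load path into such a warehouse can be sketched with the standard-library sqlite3 module. The source rows, the cleaning rule, and the two-table layout (one dimension table and one fact table) are hypothetical and stand in for the real extraction, transformation, and loading tooling a production DW would use.

```python
import sqlite3

# Extract: raw trial records as they might arrive from one source system.
raw_rows = [
    ("P001", "Drug",    "2004-03", 70.6),
    ("P002", "Placebo", "2004-03", 30.0),
    ("P003", "Drug",    "2004-03", None),   # missing score, rejected in the transform step
]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_patient (patient_id TEXT PRIMARY KEY, treatment TEXT);
    CREATE TABLE fact_score  (patient_id TEXT, period TEXT, score REAL,
                              FOREIGN KEY (patient_id) REFERENCES dim_patient(patient_id));
""")

# Transform: drop incomplete records and normalize treatment labels.
clean = [(pid, trt.lower(), period, score)
         for pid, trt, period, score in raw_rows if score is not None]

# Load: populate the dimension table and the fact table.
con.executemany("INSERT OR IGNORE INTO dim_patient VALUES (?, ?)",
                [(pid, trt) for pid, trt, _, _ in clean])
con.executemany("INSERT INTO fact_score VALUES (?, ?, ?)",
                [(pid, period, score) for pid, _, period, score in clean])
con.commit()

print(con.execute("""SELECT d.treatment, AVG(f.score)
                     FROM fact_score f JOIN dim_patient d USING (patient_id)
                     GROUP BY d.treatment""").fetchall())
```

The closing query is the kind of aggregate view that OLAP or reporting tools would later serve from the warehouse.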
Data Flow Analysis
A task that may parallel the capture of requirements for a DW involves analysis of data flow. A warehouse has to integrate various streams of data, and these streams have to be identified. The information analysts and users need to feel comfortable with the data flow methodology selected for capturing the data flow logic. At a minimum, this data flow modeling exercise should increase the efficiency of data handling and management. An example of a methodology that can be used to model data flow is the Integrated Definition process modeling methodology, IDEF3 (Kusiak, 2000).

Figure 4. Data warehouse architecture (components shown: data sources, raw data, data transformation and loading, data warehouse, exploratory data warehouse, data marts, OLAP, data mining, and other information and knowledge services)
Preprocessing for Data Warehousing

Before a viable and accurate DW is built, there is a need to solve specific problems associated with data management. The collection process, coding schemes, and standards may lead to errors, which must be fixed. Protocols for data collection and transfer should concentrate, among other things, on feature-related unwanted characters, illegal values, out-of-range feature values, non-uniform primary keys, and so on. Clinical trials are costly, and there are various methods for reducing the number of participating subjects. Selecting the top and bottom quartile patients (on a score or criterion basis) for both placebo and drug subjects can significantly reduce the trial magnitude. Thus, special consideration is required to handle data for a smaller number of subjects. The outcomes are often measured with subjective scores that vary from subject to subject and also over time for each subject. Handling these subjective outcomes is a challenging problem for developing practical solutions. An example of subjective outcomes measured over a nine-month period is shown in Table 1.
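The quartile-based enrichment step mentioned above can be sketched as follows: within each arm, keep only subjects whose criterion score falls in the top or bottom quartile. The subject records and the criterion are hypothetical; a real protocol would define the score and the cut-offs clinically.

```python
import numpy as np

def quartile_subset(subjects, arm_key="arm", score_key="score"):
    """Keep, per arm, only subjects in the top or bottom quartile of the criterion score."""
    selected = []
    for arm in {s[arm_key] for s in subjects}:
        group = [s for s in subjects if s[arm_key] == arm]
        scores = np.array([s[score_key] for s in group], dtype=float)
        q1, q3 = np.percentile(scores, [25, 75])
        selected += [s for s in group if s[score_key] <= q1 or s[score_key] >= q3]
    return selected

rng = np.random.default_rng(3)
subjects = [{"id": i, "arm": ("drug" if i % 2 else "placebo"),
             "score": float(rng.normal(50, 10))} for i in range(40)]
kept = quartile_subset(subjects)
print(len(subjects), "->", len(kept), "subjects retained")
```

Roughly half of each arm is retained, which is where the reduction in trial magnitude comes from.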
• Decision 1 = (Score_9_months - Score_3_months) / (Number of months)
• Decision 2 = Slope of the scores over time
• Decision 3 = Average score over time
• Decision 4 = Percentage improvement between first and last reading
• Decision 5 = Score based on group means with first and last reading over time
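A minimal Python sketch of how the first decision measures listed above can be computed from the three subjective scores of Table 1 follows. Decision 5 depends on group means over the whole cohort and its exact definition is not spelled out here, so it is omitted, and the exact normalization behind the Decision 4 values in Table 1 is likewise not reproduced; the Decision 4 line is only one possible reading.

import numpy as np

def decision_measures(scores, months=(3, 6, 9)):
    # Decision 1: (last score - first score) / number of months (9 in Table 1).
    d1 = (scores[-1] - scores[0]) / months[-1]
    # Decision 2: slope of the scores over time (least-squares fit).
    d2 = np.polyfit(months, scores, 1)[0]
    # Decision 3: average score over time.
    d3 = float(np.mean(scores))
    # Decision 4: one possible reading of "percentage improvement between first
    # and last reading"; the normalization used in Table 1 is an open assumption.
    d4 = 100.0 * (scores[-1] - scores[0]) / scores[0]
    return d1, d2, d3, d4

# Subject 1 from Table 1: Decisions 1-3 should come out near 2.97, 4.45, and 82.1.
print(decision_measures([70.6, 78.41, 97.3]))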
Clinical trials performed over an extended period of time lead to temporal data. There is a need to incorporate additional parameters into the temporal dataset; these parameters can be a trend variable, time, or a transformed parameter. Table 1 provides an example of temporal decision handling employing five different decision measures. Case 2 is a unique example, as decisions 3 and 5 imply that there is no improvement, contrary to the actual trend. Thus, a new meta-decision measure can be compiled based on the above five decisions. Handling noisy data, including genetic data, is a challenge (Table 2). There are various models, such as clustering and feature reduction methods, that can compress, identify, and reduce noisy genetic data (Shah & Kusiak, 2004; Dudoit et al., 2000).
Table 1. Example of subjective and temporal outcomes (3_months, 6_months, 9_months: subjective score; Decision 1-5: possible decision measures)

ID   3_months   6_months   9_months   Decision 1   Decision 2   Decision 3   Decision 4   Decision 5
1    70.6       78.41      97.3       2.97         4.4496       82.1         112.09       -2.98
2    30.01      62.78      77.29      5.25         7.8805       56.7         171.41       -4.88
3    61.12      86.55      26.77      -3.82        -5.7244      58.15        101.38       3.2
4    74.05      44.44      55.67      -2.04        -3.0637      58.05        81.16        1.54
5    75.28      81.95      41.2       -3.79        -5.6812      66.14        92.89        3.12
Table 2. Noisy genetic data

ID   Treatment   SNP1   SNP2   SNP3   SNP4   SNP5   Decision
1    Drug        A/T    C/T    A/C    G/T    C/T    Improved
2    Drug        A/T    T/T    A/A    G/G    T/T    Improved
3    Drug        A/T    C/T    A/C    G/T    T/T    Not_Improved
4    Drug        A/A    C/C    A/A    T/T    C/T    Not_Improved
5    Drug        A/T    C/T    A/C    G/T    C/T    Not_Improved
6    Placebo     A/T    C/T    C/C    G/T    T/T    Improved
7    Placebo     A/A    C/T    A/C    G/G    C/T    Improved
8    Placebo     A/T    C/C    A/C    G/T    C/T    Improved
9    Placebo     A/T    C/T    A/C    G/T    C/T    Not_Improved
10   Placebo     T/T    T/T    C/C    T/T    C/C    Not_Improved
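To make the use of such genotype tables concrete, the following sketch one-hot encodes the SNP and treatment columns of Table 2 and fits a small decision tree that relates them to the Improved/Not_Improved decision. This is only an illustration of the kind of classification and feature-reduction step mentioned above, not the specific method of Shah and Kusiak (2004).

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Rows 1-10 of Table 2 (treatment, five SNP genotypes, and the decision).
records = [
    ("Drug", "A/T", "C/T", "A/C", "G/T", "C/T", "Improved"),
    ("Drug", "A/T", "T/T", "A/A", "G/G", "T/T", "Improved"),
    ("Drug", "A/T", "C/T", "A/C", "G/T", "T/T", "Not_Improved"),
    ("Drug", "A/A", "C/C", "A/A", "T/T", "C/T", "Not_Improved"),
    ("Drug", "A/T", "C/T", "A/C", "G/T", "C/T", "Not_Improved"),
    ("Placebo", "A/T", "C/T", "C/C", "G/T", "T/T", "Improved"),
    ("Placebo", "A/A", "C/T", "A/C", "G/G", "C/T", "Improved"),
    ("Placebo", "A/T", "C/C", "A/C", "G/T", "C/T", "Improved"),
    ("Placebo", "A/T", "C/T", "A/C", "G/T", "C/T", "Not_Improved"),
    ("Placebo", "T/T", "T/T", "C/C", "T/T", "C/C", "Not_Improved"),
]
cols = ["Treatment", "SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "Decision"]
df = pd.DataFrame(records, columns=cols)

# One-hot encode the categorical genotype and treatment features.
X = pd.get_dummies(df.drop(columns="Decision"))
y = df["Decision"]

# A shallow tree keeps the rules interpretable; the feature importances hint at
# which SNPs carry signal, which is one simple form of feature reduction.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
ranked = sorted(zip(X.columns, tree.feature_importances_), key=lambda p: -p[1])
print(ranked[:5])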
Data Warehouse Characteristics
A DW is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of users (Elmasri & Navathe, 2000). The prominent feature of the DW is to support a large volume of data and a relatively small number of users with relatively long interactions, high performance levels, a multidimensional view, usability, manageability, and flexible reporting. It must efficiently extract and process data. Data cuboids in the DW are multidimensional constructs reflecting the relationships in the underlying data. A three-dimensional cuboid is called a data cube and is illustrated in Figure 5; this data cube represents three dimensions, namely cancer type, medication, and time factor. The data cubes can be viewed through multiple orientations using a technique called pivoting (Elmasri & Navathe, 2000). Thus the data cube can be pivoted to form a separate medication and time factor table for each cancer type. This data structure allows for roll-up and drill-down capabilities. Roll-up moves up the hierarchy, for example, grouping the cancer type dimension to present a single medication and time factor view. Similarly, drill-down provides a finer-grained view; for example, the medications can be broken down into cardiac, respiratory, and skin medications. The DW is normally designed with the star schema or the snowflake schema. The star schema is designed with a fact table (tuples arranged one per recorded fact) and a single table for each dimension (Elmasri & Navathe, 2000). The snowflake schema is a variation of the star schema in which the dimensional tables are organized into a hierarchy by normalizing the tables. Another important issue while designing the DW, especially for the pharmaceutical industry, is security, that is, avoiding unauthorized access.
Figure 5. Data cube (dimensions: time, cancer type, and medication)
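The pivoting and roll-up operations described above can be imitated on a toy version of the Figure 5 cube with pandas; the cancer types, medications, years, and patient counts below are invented purely for illustration.

import pandas as pd

# A toy fact table over the three cube dimensions of Figure 5 plus one measure.
facts = pd.DataFrame({
    "cancer_type": ["lung", "lung", "breast", "breast", "lung", "breast"],
    "medication":  ["cardiac", "respiratory", "cardiac", "skin", "skin", "respiratory"],
    "year":        [2004, 2004, 2004, 2005, 2005, 2005],
    "patients":    [12, 7, 9, 4, 6, 11],
})

# Pivoting: a separate medication-by-year table for each cancer type.
for cancer, group in facts.groupby("cancer_type"):
    print(cancer)
    print(group.pivot_table(index="medication", columns="year",
                            values="patients", aggfunc="sum", fill_value=0))

# Roll-up: aggregate the cancer type dimension away, leaving a single
# medication-by-year view; drill-down would go the other way (finer grain).
print(facts.pivot_table(index="medication", columns="year",
                        values="patients", aggfunc="sum", fill_value=0))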
Data Modeling
The data stored in a DW is relatively error free and compact. This data contains hidden information, which needs to be extracted; it forms a staging ground for knowledge discovery and decision-making (Elmasri & Navathe, 2000). The desired functionality of DW access tools is as follows:
• Tabular reporting and information mapping
• Complex queries and sophisticated criteria search
• Ranking
• Multivariable and time series analysis
• Data visualization, graphing, charting, and pivoting
• Complex textual search
• Advanced statistical analysis
• Trend discovery and analysis
• Pattern and associations discovery
There are three main streams of data analysis, namely OLAP, statistics, and DM. OLAP supports various viewpoints at different levels of abstraction, which helps in data visualization and analysis. Population-based statistical analysis of the data can be performed with various tools, ranging from simple regression to complex multivariate analysis. DM offers tools (decision trees, decision rules, support vector machines, neural networks, association rules, and so on) for discovery of new knowledge by converting hidden information into business models. It provides tools for identifying valid, novel, potentially useful, and ultimately understandable patterns from data and constructs high-confidence predictions for individuals (Fayyad et al., 1997). This may represent valuable knowledge that might lead to medical discoveries, for example, certain ranges of parameter values leading to longer survival time. Thus OLAP and DM complement each other in the analysis and provide different but much-needed functionality for data understanding, visualization, and making individualized decisions. Data modeling provides analytical reasoning for decision-making to solve patient-specific and business-specific issues.
Applications
DM, OLAP, and decision-making algorithms will lead to outcome predictions, identification of significant patterns and parameters, patient-specific decisions, and ultimately process/goal optimization.
Figure 6. Pharmaceutical and medical applications
(The figure maps prediction, identification, classification, and optimization to application areas such as parameter selection and identification, business information, drug discovery, adverse effects, outcome predictions, customized and individualized protocols, and treatment predictions.)
Prediction, identification, classification, and optimization are key to drug discovery, adverse effect analysis, outcome predictions, and individualized protocols (Figure 6). The issues highlighted in this article are illustrated with various medical informatics research projects. DM has led to identifying predictive parameters, formulating individualized treatment protocols, categorizing model patients, and developing a decision-making model for predictions of survival for dialysis patients (Kusiak et al., 2005). A gene/SNP selection approach (Shah & Kusiak, 2004) was developed using weighted decision trees and a genetic algorithm; the high-quality, significant gene/SNP subset can be targeted for drug development and validation. An intelligent DM system determined the wellness score for infants with hypoplastic left heart syndrome based on 73 physiologic, laboratory, and nurse-assessed parameters (Kusiak et al., 2003). For cancer-related analysis, a patient acceptance model can be developed to predict the treatment type, drug toxicity, and the length of disease-free status after the treatment. The concepts discussed in this article can be applied across all medical topics, including phenotypic and genotypic data. Other examples are Merck gene sequencing (Eckman et al., 1998), epidemiological and clinical toxicology (Helma et al., 2000), prediction of rodent carcinogenicity bioassays (Bahler et al., 2000), gene expression level analysis (Dudoit et al., 2000), predicting risk of coronary artery disease (Tham et al., 2003), and the VA health care system (Smith & Joseph, 2003).
FUTURE TRENDS
The future of data mining, warehousing, and modeling in the pharmaceutical industry offers numerous challenges. The first one is the scale of the data, measured by the number of features. New scalable schemes for parameter selection and knowledge discovery will be developed. There is a need to develop new tools for rapid evaluation of parameter relevancy. Dynamic clinical studies would provide an additional facet to interact and intervene, resulting in clean, error-free, and necessary data collection and analysis.
CONCLUSION
Data warehousing, data modeling, data mining, OLAP, and decision-making algorithms will ultimately lead to targeted drug discovery and individualized treatments with minimum adverse effects. The success of the pharmaceutical industry will largely depend on following the course outlined by the new data paradigm.
REFERENCES
Bahler, D., Stone, B., Wellington, C., & Bristol, D.W. (2000). Symbolic, neural, and Bayesian machine learning models for predicting carcinogenicity of chemical compounds. Journal of Chemical Information and Computer Sciences, 40(4), 906-914.
Berndt, D.J., Fisher, J.W., Hevner, R.A., & Studnicki, J. (2001). Healthcare data warehousing and quality assurance. IEEE Computer, 34(12), 56-65.
Dudoit, S., Fridlyand, J., & Speed, T.P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576. Berkeley, CA: Department of Statistics, University of California.
Eckman, B.A., Aaronson, J.S., Borkowski, J.A., Bailey, W.J., Elliston, K.O., Williamson, A.R., & Blevins, R.A. (1998). The Merck gene index browser: An extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics, 14, 2-13.
Elmasri, R., & Navathe, S.B. (2000). Fundamentals of database systems. New York: Addison-Wesley.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1997). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.
Tye, H. (2004). Application of statistical “design of experiments” methods in drug discovery. Drug Discovery Today, 9(11), 485-491.
Helma, C., Gottmann, E., & Kramer, S. (2000). Knowledge discovery and data mining in toxicology. Statistical Methods in Medical Research, 9(4), 329-358.
King, S.Y., Kung, M.S., & Fung, H.L. (1984). Statistical prediction of drug stability based on nonlinear parameter estimation. Journal of pharmaceutical sciences, 73(5), 657-662.
Kusiak, A. (2000). Computational intelligence in design and manufacturing. New York: John Wiley. Kusiak, A. (2001). Feature transformation methods in data mining. IEEE Transactions on Electronics Packaging Manufacturing, 24(3), 214-221. Kusiak, A., Caldarone, C.A., Kelleher, M.D., Lamb, F.S., Persoon, T.J., Gan, Y., & Burns, A. (2003, April). Mining temporal data sets: Hypoplastic left heart syndrome case study. In Proceedings of the SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology V 5098, SPIE (pp. 93-101). Belingham, WA. Kusiak, A., Dixon, B., & Shah, S. (2005). Predicting survival time for kidney dialysis patients: A data mining approach. Computers in Biology and Medicine, 35(4), 311-327. Kusiak, A., Kern, J.A., Kernstine, K.H., & Tseng, T.L. (2000). Autonomous decision-making: a data mining approach. IEEE Transactions on Information Technology in Biomedicine, 4(4), 274-284. Mitchell, T.M. (1997). Machine learning. New York: McGraw Hill. Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Boston, MA: Kluwer. Quinlan, R. (1992). C 4.5 programs for machine learning. San Meteo, CA: Morgan Kaufmann. Shah, S.C., & Kusiak, A. (2004). Data mining and genetic algorithm based Gene/SNP selection. Artificial Intelligence in Medicine, 31(3), 183-196. Smith, M.W., & Joseph, G.J. (2003). Pharmacy data in the VA health care system. Medical Care Research and Review: MCRR, 60 (3 Suppl), 92S-123S. Tham, C.K., Heng, C.K., & Chin, W.C. (2003). Predicting risk of coronary artery disease from DNA microarraybased genotyping using neural networks and other statistical analysis tool. Journal of Bioinformatics and Computational Biology, 1(3), 521-539. 244
KEY TERMS
Adverse Effects: Any untoward medical occurrence that may be life threatening and requires in-patient hospitalization.
Clustering: Clustering algorithms discover similarities and differences among groups of items. Clustering divides a dataset so that patients with similar content are in the same group, and groups are as different as possible from each other.
Customized Protocols: A specific set of treatment parameters and their values that are unique to an individual. The customized protocols are derived from discovered knowledge patterns.
Data Visualization: The method or end result of transforming numeric and textual information into a graphic format. Visualizations are used to explore large quantities of data holistically in order to understand trends or principles.
Decision Trees: A decision-tree algorithm creates rules based on decision trees or sets of if-then statements to maximize interpretability.
Drug Discovery: A research process that identifies molecules with desired biological effects so as to develop new therapeutic drugs.
Feature Reduction Methods: The goal of a feature reduction method is to identify the minimum set of non-redundant features (e.g., SNPs, genes) that are useful in classification.
Knowledge Discovery: Knowledge discovery in databases is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns/models in data.
Neural Networks: Neural networks are a set of simple units (neurons) that receive a number of real-valued inputs, which are processed through the network to produce a real-valued output.
OLAP: Online analytical processing is a category of software tools that provides analysis of data stored in a database.
Data Mining for Damage Detection in Engineering Structures
Ramdev Kanapady, University of Minnesota, USA
Aleksandar Lazarevic, University of Minnesota, USA
INTRODUCTION
The process of implementing and maintaining a structural health monitoring system consists of operational evaluation, data processing, damage detection, and life prediction of structures. This process involves the observation of a structure over a period of time using continuous or periodic monitoring of spaced measurements, the extraction of features from these measurements, and the analysis of these features to determine the current state of health of the system. Such health monitoring systems are common for bridge structures, and many examples are cited in Maalej et al. (2002). The phenomenon of damage in structures includes localized softening or cracks in a certain neighborhood of a structural component due to high operational loads, or the presence of flaws due to manufacturing defects. Methods that detect damage in the structure are useful for non-destructive evaluations that typically are employed in agile manufacturing and rapid prototyping systems. In addition, there are a number of structures, such as turbine blades, suspension bridges, skyscrapers, aircraft structures, and various structures deployed in space, for which structural integrity is of paramount concern (Figure 1).
Figure 1. Damage detection is part of a structural health monitoring system
With the increasing demand for safety and reliability of aerospace, mechanical, and civilian structures, damage detection techniques become critical to reliable prediction of damage in these structural systems. The most currently used damage detection methods are manual, such as tap tests, visual inspection, or specially localized measurement techniques (Doherty, 1997). These techniques require that the location of the damage be on the surface of the structure. In addition, the location of the damage has to be known a priori, and these locations have to be readily accessible. This makes the current maintenance procedure of large structural systems very time consuming and expensive due to its heavy reliance on human labor.
BACKGROUND
The damage in structures and structural systems is defined through comparison of two different states of the system: the first one is the initial undamaged state, and the second one is the damaged state. Due to the changes in the properties of the structure or quantities derived from these properties, the process of damage detection eventually reduces to a form of pattern recognition/data mining problem. In addition, emerging continuous monitoring of an instrumented structural system often results in the accumulation of a large amount of data that need to be processed, analyzed, and interpreted for feature extraction in an efficient damage detection system. However, the rate of accumulating such data sets far outstrips the ability to analyze them manually. More specifically, there is often information hidden in the data that cannot be discovered manually. As a result, there is a need to develop an intelligent data processing component that can significantly improve the current damage detection systems. An immediate alternative is to design data mining techniques that can enable real-time prediction and identification of damage for newly available test data, once a sufficiently accurate model is developed. In recent years, various data mining techniques, such as artificial neural networks (ANNs)
(Anderson, Lemoine & Ambur, 2003; Lazarevic et al., 2004; Ni, Wang & Ko, 2002; Sandhu et al., 2001; Yun & Bahng, 2000; Zhao, Ivan & DeWolf, 1998), support vector machines (SVMs) (Mita & Hagiwara, 2003), and decision trees (Sandhu et al., 2001) have been applied successfully to structural damage detection problems, thus showing that they can be potentially useful for such a class of problems. This success can be attributed to the numerous disciplines integrated with data mining, such as pattern recognition, machine learning, and statistics. In addition, it is well known that data mining techniques can effectively handle noisy, partially incomplete, and faulty data, which is particularly useful, since in damage detection applications measured data are expected to be incomplete, noisy, and corrupted. The intent of this paper is to provide a survey of emerging data mining techniques for damage detection in structures. Although the field of damage detection is very broad and includes a vast literature that is not based on data mining techniques, this survey focuses predominantly on data mining techniques for damage detection based on changes in properties of the structure. However, the large amount of literature on fault detection and diagnosis for application-specific systems, such as rotating machinery, is not within the scope of this paper.
CATEGORIZATION OF STRUCTURAL DAMAGE
The damage in structures can be classified as linear or nonlinear. Damage is considered linear if the undamaged linear-elastic structure remains linear-elastic after damage. However, if the initially linear-elastic structure behaves in a nonlinear manner after the damage initiation, then the damage is considered nonlinear. It is also possible that the damage is linear at the damage initiation phase but, after prolonged growth in time, becomes nonlinear. For example, loose connections between the structures at the joints or joints that rattle (Sekhar, 2003) are considered non-linear damages. Examples of such non-linear damage detection systems are described in Adams and Farrar (2002) and Kerschen and Golinval (2004). Most of the modal-data-based methods in the literature are proposed for the linear case. They are based on the following three levels of damage identification: (1) recognition, a qualitative indication that damage might be present in the structure; (2) localization, information about the probable location of the damage in the structure; and (3) assessment, an estimate of the extent of severity of the damage in the structure. Such linear damage detection techniques can be found in Yun and Bahng (2000), Ni et al. (2002), and Lazarevic et al. (2004).
CLASSIFICATION OF DAMAGE DETECTION TECHNIQUES
We provide several different criteria for classification of damage detection techniques based on data mining. In the first classification, damage detection techniques can be categorized into continuous (Keller & Ray, 2003) and periodic (Patsias & Staszewski, 2002) damage detection systems. Continuous techniques usually employ an integrated approach that consists of a data acquisition process, feature extraction from large amounts of data collected from real-time sensors, and a damage detection process. In periodic techniques, the feature extraction process is optional, since the amount of data that needs to be processed is not large and does not necessarily require data mining techniques for feature extraction. In the second classification, we distinguish between application-based and application-independent techniques. Application-based techniques are generally applicable to a specific structural system, and they typically assume that the monitored structure responds in some predetermined manner that can be accurately modeled by (i) numerical techniques such as finite element (Sandhu et al., 2001) or boundary element analysis (Anderson, Lemoine & Ambur, 2003) and/or (ii) the behavior of the response of the structures based on physics-based models (Keller & Ray, 2003). Most of the damage detection techniques that exist in the literature belong to the application-based approach, where the minimization of the residue between the experimental and the analytical model is built into the system. Often, this type of data is not available, which can render application-based methods impractical for certain applications, particularly for structures that are designed and commissioned without such models. On the other hand, application-independent techniques do not depend on a specific structure, and they are generally applicable to any structural system. However, the literature on these techniques is very sparse, and the research in this area is at a very nascent stage (Bernal & Gunes, 2000; Zang, Friswell & Imregun, 2004). In the third classification, damage detection techniques are split into signature-based and non-signature-based methods. Signature-based techniques extensively use signatures of known damages in the given structure that are provided by human experts. These techniques commonly fall into the category of recognition of damage detection, which only provides the qualitative indication that damage might be present in the structure (Friswell, Penny & Wilson, 1994) and, to a certain extent, the localization of the damage (Friswell, Penny & Garvey, 1997). Non-signature methods are not based on signatures of known damages, and they not only recognize but also localize and assess the extent of damage. Most of the damage detection techniques in the literature fall into this category (Lazarevic et al., 2004; Ni et al., 2002; Yun & Bahng, 2000).
In the fourth classification, damage detection techniques are classified into local (Sekhar, 2003; Wang, 2003) and global (Fritzen & Bohle, 2001) techniques. Typically, the damage is initiated in a small region of the structure, and, hence, it can be considered a local phenomenon. One could employ local or global damage detection features that are derived from the local or global response or properties of the structure. Although local features can detect the damage effectively, these features, such as higher natural frequencies and mode shapes of the structure, are very difficult to obtain from the experimental data (Ni et al., 2002). In addition, since the vicinity of damage is not known a priori, global methods that employ only global damage detection features, such as the lower natural frequencies of the structure (Lazarevic et al., 2004), are preferred. Finally, damage detection techniques can be classified as traditional and emerging data mining techniques. Traditional analytical techniques employ mathematical models to approximate the relationships between specific damage conditions and changes in the structural response or dynamic properties. Such relationships can be computed by solving a class of so-called inverse problems. The major drawbacks of the existing approaches are as follows: (i) the more sophisticated methods involve computationally cumbersome system solvers, which are typically solved by singular value decomposition techniques, non-negative least-squares techniques, bounded variable least-squares techniques, and so forth; and (ii) all computationally intensive procedures need to be repeated for any newly available measured test data for a given structure. A brief survey of these methods can be found in Doebling et al. (1996). On the other hand, data mining techniques model an explicit inverse relation between damage detection features and damage by minimizing the residue between the experimental and the analytical model at the training level. For example, the damage detection features could be natural frequencies, mode shapes, mode curvatures, and so forth. It should be noted that data mining techniques are also applied to detect features in large amounts of measurement data. In the next few sections, we provide a short description of several types of data mining algorithms used for damage detection.
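As a toy illustration of the inverse-problem view taken by the traditional techniques mentioned above, the sketch below recovers element stiffness reductions from simulated changes in natural frequencies by solving a linearized system with a least-squares solver; the sensitivity matrix, damage pattern, and noise level are fabricated for the example and do not come from any real structure.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearized model: delta_frequencies = S @ delta_stiffness, where S
# is a sensitivity matrix (random here, standing in for one derived from a
# finite element model of the structure).
n_freqs, n_elems = 8, 5
S = rng.normal(size=(n_freqs, n_elems))

# "True" damage: 30% stiffness loss in element 2, 10% in element 4.
true_damage = np.array([0.0, 0.0, 0.3, 0.0, 0.1])

# Simulated measured frequency changes, with a little measurement noise.
d_freq = S @ true_damage + 0.01 * rng.normal(size=n_freqs)

# Solve the overdetermined inverse problem in the least-squares sense; SVD-based
# solvers such as lstsq are one of the standard choices noted in the text.
estimate, *_ = np.linalg.lstsq(S, d_freq, rcond=None)
print(np.round(estimate, 3))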
Classification
Data mining techniques based on classification were successfully applied to identify the damage in the structures. For example, decision trees have been applied to detect damage in an electrical transmission tower (Sandhu et al., 2001). It has been found that in this approach decision trees can be easily understood, while many interesting rules about the structural damage were found.
In addition, a method using support vector machines (SVMs) has been proposed to detect local damages in a building structure (Mita & Hagiwara, 2003). The method is verified to have the capability to identify not only the location of damage, but also the magnitude of damage with satisfactory accuracy, employing modal frequency patterns as damage features.
Pattern Recognition
Pattern recognition techniques are also applied to damage detection by various researchers. For example, statistical pattern recognition has been applied to damage detection employing relatively few measurements of modal data collected from three scale model-reinforced concrete bridges (Haritos & Owen, 2004), but the method was only able to indicate that damage had occurred. In addition, independent component analysis, a multivariate statistical method also known as proper orthogonal decomposition, has been applied to damage detection problems on time history data to capture the essential pattern of the measured vibration data (Zang, Friswell & Imregun, 2004).
Neural Networks
Prediction-based techniques such as neural networks (Lazarevic et al., 2004; Ni et al., 2002; Zhao et al., 1998) have been successfully applied to detect the existence, location, and quantification of damage in the structure employing modal data. Neural networks have been extremely popular in recent years due to their capabilities as universal approximators. In damage detection approaches based on neural networks, the damage location and severity are simultaneously identified using a one-stage scheme, also called the direct method (Zhao et al., 1998), where the neural network is trained with different damage levels at each possible damage location. However, these studies were restricted to very small models with a small number of target variables (order of 10), and the development of a predictive model that could correctly identify the location and severity of damage in practical large-scale complex structures using this direct approach was a considerable challenge. Increased geometric complexity of the structure causes an increase in the number of target variables, thus resulting in data sets with a large number of target variables. Since the number of prediction models that needs to be built for each continuous target variable increases, the number of training data records required for effective training of neural networks also increases, thus requiring more computational time for training neural networks, but also more time for data generation, since each damage state (data record) requires an eigen solver to generate the natural frequencies and mode shapes of the structure.
The earlier direct approach, employed by numerous researchers, required the prediction of the material property, namely, the Young's modulus of elasticity, considering all the elements in the domain individually or simultaneously. However, this approach does not scale to situations in which thousands of elements are present in the complex geometry of the structure or when multiple elements in the structure have been damaged simultaneously. To reduce the size of the system under consideration, several substructure-based approaches have been proposed (Sandhu et al., 2002; Yun & Bahng, 2000). These approaches partition the structure into logical substructures and then predict the existence of the damage in each of them. However, pinpointing the location of damage and the extent of the damage is not resolved completely in these approaches. Recently, these issues have been addressed in two hierarchical approaches (Lazarevic et al., 2004; Ni et al., 2002). In the former, neural networks are hierarchically trained using one-level damage samples to first locate the position of the damage, and then the network is retrained by an incremental weight update method using additional samples corresponding to different damage degrees, but only at the location identified in the first stage. The input attributes of the neural networks are designed to depend only on damage location, and they consist of several natural frequencies and a few incomplete modal vectors. Since measuring mode shapes is difficult, global methods based only on natural frequencies are highly preferred. However, employing natural frequencies as features traditionally has many drawbacks (e.g., two symmetric damage locations cannot be distinguished using only natural frequencies). To overcome these drawbacks, Lazarevic et al. (2004) proposed hierarchical and localized clustering approaches based only on natural frequencies as features, where symmetrical damage locations as well as spatial characteristics of structural systems are integrated in building the model.
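The flavor of these frequency-based approaches can be conveyed with a small regression sketch: a neural network is trained to map a vector of lower natural frequencies to damage severities at a fixed set of candidate locations, using synthetic data in place of the eigen-solver output that the actual studies rely on. The network size, data generator, and noise level are all assumptions of this illustration, not the method of any cited study.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n_locations, n_freqs, n_samples = 4, 6, 2000

# Fixed, made-up sensitivity of each natural frequency to damage at each location.
sens = 0.05 * np.abs(rng.normal(size=(n_freqs, n_locations)))
base = np.linspace(10.0, 60.0, n_freqs)          # "healthy" natural frequencies

def synthetic_frequencies(damage):
    # Stand-in for the eigen solver: frequencies drop as damage grows, plus noise.
    return base * (1.0 - sens @ damage) + 0.05 * rng.normal(size=n_freqs)

# Each record pairs damage severities at the candidate locations (targets)
# with the resulting lower natural frequencies (inputs).
Y = rng.uniform(0.0, 0.5, size=(n_samples, n_locations))
X = np.array([synthetic_frequencies(y) for y in Y])

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, Y)

# Predict location and severity of damage for a new frequency measurement.
y_true = np.array([0.0, 0.4, 0.0, 0.1])
print(np.round(model.predict(synthetic_frequencies(y_true).reshape(1, -1)), 2))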
Other Techniques
Other data mining based approaches also have been applied to different problems in structural health monitoring. For example, outlier-based analysis techniques (Worden, Manson & Fieller, 2000) have been used to detect the existence of damage; wavelet-based approaches (Wang, 2003) have been used to detect damage features; and a combination of independent component analysis and artificial neural networks (Zang, Friswell & Imregun, 2004) has been applied successfully to detect damages in structures.
FUTURE TRENDS
Damage detection is increasingly becoming an indispensable and integral component of any comprehensive structural health monitoring program for mechanical and large-scale civilian, aerospace, and space structures. Although a variety of techniques have been developed for detecting damages, there are still a number of research issues concerning the prediction performance and efficiency of the techniques that need to be addressed (Auwerarer & Peeters, 2003; De Boe & Golinval, 2001).
CONCLUSION
In this paper, a survey of emerging data mining techniques for damage detection in structures is provided. This survey reveals that the existing data mining techniques are based predominantly on changes in properties of the structure to classify, localize, and predict the extent of damage.
REFERENCES
Adams, D., & Farrar, C. (2002). Classifying linear and nonlinear structural damage using frequency domain ARX models. Structural Health Monitoring, 1(2), 185-201. Anderson, T., Lemoine, G., & Ambur, D. (2003). An artificial neural network based damage detection scheme for electrically conductive composite structures. Proceedings of the 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, Norfolk, Virginia. Auwerarer, H., & Peeters, B. (2003). International research projects on structural health monitoring: An overview. Structural Health Monitoring, 2(4), 341-358. Bernal, D., & Gunes, B. (2000). Extraction of system matrices from state-space realizations. Proceedings of the 14th Engineering Mechanics Conference, Austin, Texas. De Boe, P., & Golinval, J.-C. (2001). Damage localization using principal component analysis of distributed sensor array. In F.K. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 860-861). Boca Raton, FL: CRC Press. Doebling, S., Farrar, C., Prime, M., & Shevitz, D. (1996). Damage identification and health monitoring of structural systems from changes in their vibration characteristics: A literature review [Report LA-12767-MS]. Los Alamos National Laboratory, Los Alamos, NM.
Doherty, J. (1987). Nondestructive evaluation. In A.S. Kobayashi (Ed.), Handbook on experimental mechanics (ch. 12). Society of Experimental Mechanics, Inc., Englewood Cliffs, NJ.
Maalej, M., Karasaridis, A., Pantazopoulou, S., & Hatzinakos, D. (2002). Structural health monitoring of smart structures. Smart Materials and Structures, 11, 581-589.
Friswell, M., Penny J., & Garvey, S. (1997). Parameter subset selection in damage location. Inverse Problems in Engineering, 5(3), 189-215.
Mita, A., & Hagiwara, H. (2003). Damage diagnosis of a building structure using support vector machine and modal frequency patterns. Proceedings of SPIE, 5057, San Diego, CA.
Friswell, M., Penny J., & Wilson, D. (1994). Using vibration data and statistical measures to locate damage in structures. Modal Analysis: The International Journal of Analytical and Experimental Modal Analysis, 9(4), 239-254. Fritzen, C., & Bohle, K. (2001). Vibration based global damage identification—A tool for rapid evaluation of structural safety. In F.D. Chang (Ed.), Structural health monitoring: The demands and challenges (pp. 849-859). Boca Raton, FL: CRC Press. Haritos, N., & Owen, J.S. (2004). The use of vibration data for damage detection in bridges: A comparison of system identification and pattern recognition approaches. Structural Health Monitoring, 3(2), 141-163. Keller, E., & Ray, A. (2003). Real-time health monitoring of mechanical structures. Structural Health Monitoring, 2(3), 191-203. Kerschen, G., & Golinval, J-C. (2004). Feature extraction using auto-associative neural networks. Smart Materials and Structures, 13, 211-219. Khoo, L., Mantena, P., & Jadhav, P. (2004). Structural damage assessment using vibration modal analysis. Structural Health Monitoring, 3(2), 177-194. Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003a). Localized prediction of continuous target variables using hierarchical clustering. Proceedings of the Third IEEE International Conference on Data Mining, Florida. Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2003b). Damage prediction in structural mechanics using hierarchical localized clustering-based approach. Proceedings of Data Mining and Knowledge Discovery: Theory, Tools, and Technology V, Orlando, Florida. Lazarevic, A., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2004). Effective localized regression for damage detection in large complex mechanical structures. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.
Ni, Y., Wang, B., & Ko, J. (2002). Constructing input vectors to neural networks for structural damage identification. Smart Materials and Structures, 11, 825-833. Patsias, S., & Staszewski, W.J. (2002). Damage detection using optical measurements and wavelets. Structural Heath Monitoring, 1(1), 5-22. Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2001). Damage prediction and estimation in structural mechanics based on data mining. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining/Fourth Workshop on Mining Scientific Datasets, San Francisco, California. Sandhu, S., Kanapady, R., Tamma, K.K., Kamath, C., & Kumar, V. (2002). A sub-structuring approach via data mining for damage prediction and estimation in complex structures. Proceedings of the SIAM International Conference on Data Mining, Arlington, Virginia. Sekhar, S. (2003). Identification of a crack in rotor system using a model-based wavelet approach. Structural Health Monitoring, 2(4), 293-308. Wang, W. (2003). An evaluation of some emerging techniques for gear fault detection. Structural Health Monitoring, 2(3), 225-242. Worden, K., Manson, G., & Fieller, N. (2000). Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3), 647-667. Yun, C., & Bahng, E.Y. (2000). Sub-structural identification using neural networks. Computers & Structures, 77, 41-52. Zang, C., Friswell, M.I., & Imregun, M. (2004). Structural damage detection using independent component analysis. Structural Health Monitoring, 3(1), 69-83. Zhao, J., Ivan, J., & De Wolf, J. (1998). Structural damage detection using artificial neural networks. Journal of Infrastructure Systems, 4(3), 93-101.
KEY TERMS
Mode Shapes: Eigen vectors associated with the natural frequencies of the structure.
Boundary Element Method: Numerical method to solve the differential equations with boundary/initial conditions over the surface of a domain.
Natural Frequency: Eigen values of the mass and stiffness matrix system of the structure.
Finite Element Method: Numerical method to solve the differential equations with boundary/initial conditions over a domain. Modal Properties: Natural frequency, mode shapes, and mode curvatures constitutes modal properties.
Smart Structure: A structure with a structurally integrated fiber optic sensing system. Structures and Structural System: The word structure has been used loosely in this paper. The structure refers to the continuum material, whereas the structural system consists of structures that are connected at joints.
Data Mining for Intrusion Detection
Aleksandar Lazarevic, University of Minnesota, USA
INTRODUCTION
Today computers control power, oil and gas delivery, communication systems, transportation networks, banking and financial services, and various other infrastructure services critical to the functioning of our society. However, as the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to a wide variety of cyber threats. According to a recent survey by CERT/CC (Computer Emergency Response Team/Coordination Center), the rate of cyber attacks has been more than doubling every year in recent times (Figure 1). In addition, the severity and sophistication of the attacks are also growing. For example, the Slammer/Sapphire worm was the fastest computer worm in history. As it began spreading throughout the Internet, it doubled in size every 8.5 seconds and infected at least 75,000 hosts, causing network outages and unforeseen consequences such as canceled airline flights, interference with elections, and ATM failures (Moore, 2003). It has become increasingly important to make our information systems, especially those used for critical functions in the military and commercial sectors, resistant to and tolerant of such attacks. The conventional approach for securing computer systems is to design security mechanisms, such as firewalls, authentication mechanisms, and Virtual Private Networks (VPN), that create a protective "shield" around them.
Figure 1. Growth rate of cyber incidents reported to the Computer Emergency Response Team/Coordination Center (CERT/CC), 1990-2003
However, such security mechanisms almost always have inevitable vulnerabilities, and they are usually not sufficient to ensure complete security of the infrastructure and to ward off attacks that are continually being adapted to exploit the system's weaknesses, often caused by careless design and implementation flaws. This has created the need for security technology that can monitor systems and identify computer attacks. This component is called intrusion detection and is complementary to conventional security mechanisms. This article provides an overview of the current status of research in intrusion detection based on data mining.
BACKGROUND
Intrusion detection includes identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. An Intrusion Detection System (IDS) can be defined as a combination of software and/or hardware components that monitors computer systems and raises an alarm when an intrusion happens. Traditional intrusion detection systems are based on extensive knowledge of signatures of known attacks. However, the signature database has to be manually revised for each new type of intrusion that is discovered. In addition, signature-based methods cannot detect emerging cyber threats, since by their very nature these threats are launched using previously unknown attacks. Finally, very often there is substantial latency in the deployment of newly created signatures. All these limitations have led to an increasing interest in intrusion detection techniques based upon data mining. The tremendous increase of novel cyber attacks has made data mining based intrusion detection techniques extremely useful in their detection. Data mining techniques for intrusion detection generally fall into one of three categories: misuse detection, anomaly detection, and summarization of monitored data.
MAIN THRUST
Before applying data mining techniques to the problem of intrusion detection, the data has to be collected.
Different types of data can be collected about information systems (e.g., tcpdump and netflow data for network intrusion detection, syslogs or system calls for host intrusion detection). However, such collected data is often available in a raw format and needs to be processed in order to be used by data mining techniques. For example, in the MADAM ID project (Lee, 2000, 2001) at Columbia University, association rules and frequent episodes were extracted from network connection records to construct three groups of features: (i) content-based features that describe intrinsic characteristics of a network connection (e.g., number of packets, acknowledgments, data bytes from source to destination); (ii) time-based traffic features that compute the number of connections in some recent time interval (e.g., the last few seconds); and (iii) connection-based features that compute the number of connections from a specific source to a specific destination in the last N connections (e.g., N = 1000). When the feature construction step is complete, the obtained features may be used in any data mining technique.
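A simplified version of the time-based and connection-based feature construction described above can be written with pandas over a table of connection records; the column names, window lengths, and sample values below are assumptions for the sake of illustration, not the actual MADAM ID feature set.

import pandas as pd

# Hypothetical connection records: timestamp, source, destination, bytes.
conns = pd.DataFrame({
    "time":  pd.to_datetime(["12:00:00", "12:00:01", "12:00:01", "12:00:02", "12:00:05"]),
    "src":   ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "dst":   ["10.0.1.9", "10.0.1.9", "10.0.1.9", "10.0.1.7", "10.0.1.9"],
    "bytes": [120, 40, 300, 80, 55],
}).sort_values("time")

# Time-based feature: connections seen in the last 2 seconds (including this one).
conns["conn_last_2s"] = conns.rolling("2s", on="time")["bytes"].count().astype(int)

# Connection-based feature: connections from the same source to the same
# destination among the previous N records (N = 3 here instead of 1000).
N = 3
same = []
for i in range(len(conns)):
    prev = conns.iloc[max(0, i - N):i]
    row = conns.iloc[i]
    same.append(int(((prev["src"] == row["src"]) & (prev["dst"] == row["dst"])).sum()))
conns["same_src_dst_lastN"] = same

print(conns)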
Misuse Detection
In misuse detection based on data mining, each instance in a data set is labeled as "normal" or "attack/intrusion," and a learning algorithm is trained over the labeled data. These techniques are able to automatically retrain intrusion detection models on different input data that include new types of attacks, as long as they have been labeled appropriately. Unlike signature-based intrusion detection systems, data mining based misuse detection models are created automatically and can be more sophisticated and precise than manually created signatures. In spite of the fact that misuse detection models have a high degree of accuracy in detecting known attacks and their variations, their obvious drawback is the inability to detect attacks whose instances have not yet been observed. In addition, labeling data instances as normal or intrusive may require an enormous amount of time from human experts. Since standard data mining techniques are not directly applicable to the problem of intrusion detection, due to the need to deal with a skewed class distribution (attacks/intrusions correspond to a class of interest that is much smaller, i.e., rarer, than the class representing normal behavior) and to learn from data streams (attacks/intrusions very often represent a sequence of events), a number of researchers have developed specially designed data mining algorithms that are suitable for intrusion detection. Research in misuse detection has focused mainly on classification of network intrusions using various standard data mining algorithms (Barbara, 2001; Ghosh, 1999; Lee, 2001; Sinclair, 1999), rare class predictive models (Joshi, 2001), and association rules (Barbara, 2001; Lee, 2000; Manganaris, 2000).
MADAM ID (Lee, 2000, 2001) was one of the first projects that applied data mining techniques to the intrusion detection problem. In addition to the standard features that were available directly from the network traffic (e.g., duration, start time, service), three groups of constructed features were also used by the RIPPER algorithm to learn intrusion detection rules from the DARPA 1998 data set (Lippmann, 1999). Other classification algorithms that have been applied to the intrusion detection problem include standard decision trees (Bloedorn, 2001; Sinclair, 1999), modified nearest neighbor algorithms (Ye, 2001b), fuzzy association rules (Bridges, 2000), neural networks (Dao, 2002; Lippmann, 2000a), naïve Bayes classifiers (Schultz, 2001), genetic algorithms (Bridges, 2000), genetic programming (Mukkamala, 2003a), and so on. Most of these approaches attempt to directly apply standard techniques to publicly available intrusion detection data sets (Lippmann, 1999, 2000b), assuming that the labels for normal and intrusive behavior are already known. Since this is not a realistic assumption, misuse detection based on data mining has not been very successful in practice.
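A minimal sketch of misuse detection as supervised learning on labeled connection features, with the skewed class distribution handled by class weighting, is given below. The synthetic data stands in for a labeled set such as DARPA 1998, and the classifier and weighting scheme are only one of many reasonable choices, not the method of any specific cited system.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic connection features; intrusions are the rare class (about 2%).
n = 5000
X = rng.normal(size=(n, 6))
y = (rng.random(n) < 0.02).astype(int)
X[y == 1] += 1.5                      # attacks differ slightly in feature space

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# class_weight="balanced" counteracts the skew toward the "normal" class.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=["normal", "attack"]))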
Anomaly Detection
Anomaly detection creates profiles of normal "legitimate" computer activity (e.g., normal behavior of users, hosts, or network connections) using different techniques and then uses a variety of measures to detect deviations from the defined normal behavior as potential anomalies. Anomaly detection models often learn from a set of "normal" (attack-free) data, but this also requires cleaning the data of attacks and labeling only normal data records. Nevertheless, other anomaly detection techniques detect anomalous behavior without using any knowledge about the training data. Such models typically assume that the data records that do not belong to the majority behavior correspond to anomalies. The major benefit of anomaly detection algorithms is their ability to potentially recognize unforeseen and emerging cyber attacks. However, their major limitation is a potentially high false alarm rate, since deviations detected by anomaly detection algorithms may not necessarily represent actual attacks, but new or unusual, yet still legitimate, network behavior. Anomaly detection algorithms can be classified into several groups: (i) statistical methods; (ii) rule-based methods; (iii) distance-based methods; (iv) profiling methods; and (v) model-based approaches (Lazarevic, 2004). Although anomaly detection algorithms are quite diverse in nature, and thus may fit into more than one proposed category, most of them employ certain data mining or artificial intelligence techniques.
• Statistical methods: Statistical methods monitor the user or system behavior by measuring certain variables over time (e.g., login and logout time of each session). The basic models keep averages of these variables and detect whether thresholds are exceeded based on the standard deviation of the variable. More advanced statistical models compute profiles of long-term and short-term user activities by employing different techniques, such as the Kolmogorov-Smirnov test (Cabrera, 2000), chi-square (χ2) statistics (Ye, 2001a), probabilistic modeling (Yamanishi, 2000), and the likelihood of data distributions (Eskin, 2000).
• Distance-based methods: Most statistical approaches have limitations when detecting outliers in higher dimensional spaces, since it becomes increasingly difficult and inaccurate to estimate the multidimensional distributions of the data points. Distance-based approaches attempt to overcome the limitations of statistical outlier detection approaches, and they detect outliers by computing distances among points. Several distance-based outlier detection algorithms that have recently been proposed for detecting anomalies in network traffic (Lazarevic, 2003) are based on computing the full dimensional distances of points from one another using all the available features, and on computing the densities of local neighborhoods (a minimal distance-based scoring sketch is given after this list of categories). Values of categorical features are converted into the frequencies of their occurrences and are further treated as continuous ones. MINDS (Minnesota Intrusion Detection System) (Ertoz, 2004) employs outlier detection algorithms to assign an anomaly score to each network connection. A human analyst then has to look at only the most anomalous connections to determine whether they are actual attacks or other interesting behavior. Experiments on live network traffic have shown that MINDS is able to routinely detect various suspicious behavior (e.g., policy violations), worms, as well as various scanning activities.
In addition, several clustering based techniques, such as fixed-width and canopy clustering (Eskin, 2002), have been used to detect network intrusions in the DARPA 1998 data sets, where intrusions appear as small clusters compared to the large clusters that correspond to normal behavior. In another interesting approach (Fan, 2001), artificial anomalies in the network intrusion detection data are generated around the edges of the sparsely populated data regions, thus forcing the learning algorithm to discover the specific boundaries that distinguish these regions from the rest of the data.
• Rule-based systems: Rule-based systems were used in earlier anomaly detection based IDSs to characterize the normal behavior of users, networks, and/or computer systems by a set of rules. Examples of such rule-based IDSs include ComputerWatch (Dowell, 1990) and Wisdom & Sense (Liepins, 1992).
• Profiling methods: In profiling methods, profiles of normal behavior are built for different types of network traffic, users, programs, and so on, and deviations from them are considered as intrusions. Profiling methods vary greatly, ranging from different data mining techniques to various heuristic-based approaches.
For example, ADAM (Audit Data and Mining) (Barbara, 2001) is a hybrid anomaly detector trained on both attack-free traffic and traffic with labeled attacks. The system uses a combination of association rule mining and classification to discover novel attacks in tcpdump data by using the pseudo-Bayes estimator. The recently reported IDDM system (Abraham, 2001) represents an off-line IDS, where the intrusions are detected only when sufficient amounts of data are collected and analyzed. The IDDM system describes profiles of network data at different times, identifies any large deviations between these data descriptions, and produces alarms in such cases. PHAD (packet header anomaly detection) (Mahoney, 2002) monitors network packet headers and builds profiles for 33 different fields from these headers by observing attack-free traffic and building contiguous clusters for the values observed for each field. ALAD (application layer anomaly detection) (Mahoney, 2002) uses the same method for calculating the anomaly scores as PHAD, but it monitors TCP data and builds TCP streams when the destination port is smaller than 1024. Finally, there have also been several recently proposed commercial products that use profiling-based anomaly detection techniques. For example, Antura from System Detection (System Detection, 2003) uses data mining based user profiling, while Mazu Profiler from Mazu Networks (Mazu Networks, 2003) and Peakflow X from Arbor Networks (Arbor Networks, 2003) use rate-based and connection profiling anomaly detection schemes.
• Model-based approaches: Many researchers have used different types of data mining models, such as replicator neural networks (Hawkins, 2002) or unsupervised support vector machines (Eskin, 2002; Lazarevic, 2003), to characterize the normal behavior of the monitored system.
In the model-based approaches, anomalies are detected as deviations from the model that represents the normal behavior. A replicator four-layer feed-forward neural network (RNN) reconstructs the input variables at the output layer during the training phase and then uses the reconstruction error of individual data points as a measure of outlyingness, while unsupervised support vector machines attempt to find a small region where most of the data lies, label these data points as normal behavior, and then detect deviations from the learned models as potential intrusions. In addition, standard neural networks (NNs) have also been used in intrusion detection problems to learn a normal profile. For example, NNs were often used to model the normal behavior of individual users (Ryan, 1997), to build profiles of software behavior (Ghosh, 1999), or to profile network packets and queue statistics (Lee, 2000).
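The distance-based scoring idea referred to in the list above can be sketched in a few lines: each connection record is scored by its distance to its k-th nearest neighbor, so records lying in sparse regions of feature space receive high anomaly scores. The feature matrix and the choice of k are placeholders, and this is not the specific scoring function used by MINDS.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder feature matrix: mostly "normal" connections plus a few odd ones.
normal = rng.normal(0, 1, size=(500, 4))
odd = rng.normal(6, 1, size=(5, 4))
X = np.vstack([normal, odd])

# Anomaly score = distance to the k-th nearest neighbor (k = 5 here).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]

# The analyst would inspect only the highest-scoring records.
top = np.argsort(scores)[::-1][:5]
print(top, np.round(scores[top], 2))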
Summarization
Summarization techniques use frequent itemsets or association rules to characterize normal and anomalous behavior in the monitored computer systems. For example, association patterns generated at different times were used to study significant changes in the network traffic characteristics at different periods of time (Lee, 2001). Association pattern analysis has also been shown to be beneficial in constructing profiles of normal network traffic behavior (Manganaris, 1999). MINDS (Ertoz, 2004) uses association patterns to provide high-level summaries of network connections that are ranked highly anomalous by the anomaly detection module. These summaries allow a human analyst to examine a large number of anomalous connections quickly and to provide templates from which signatures of novel attacks can be built for augmenting the database of signature-based intrusion detection systems.
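A bare-bones version of this kind of summarization can be obtained by counting frequent attribute combinations among the connections flagged as anomalous; the record fields and support threshold below are illustrative only, and production systems use full association-pattern mining rather than this toy counter.

from collections import Counter
from itertools import combinations

# Hypothetical anomalous connections described by a few categorical attributes.
flagged = [
    {"src": "10.0.0.7", "dst_port": "80",  "proto": "tcp"},
    {"src": "10.0.0.7", "dst_port": "80",  "proto": "tcp"},
    {"src": "10.0.0.7", "dst_port": "443", "proto": "tcp"},
    {"src": "10.0.0.9", "dst_port": "53",  "proto": "udp"},
    {"src": "10.0.0.7", "dst_port": "80",  "proto": "tcp"},
]

# Count every itemset of one or two attribute=value pairs.
counts = Counter()
for rec in flagged:
    items = sorted(f"{k}={v}" for k, v in rec.items())
    for size in (1, 2):
        for combo in combinations(items, size):
            counts[combo] += 1

# Report itemsets whose support clears an (arbitrary) threshold of 3 records.
summary = [(c, n) for c, n in counts.items() if n >= 3]
for itemset, support in sorted(summary, key=lambda p: -p[1]):
    print(support, itemset)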
FUTURE TRENDS
Intrusion detection techniques have improved dramatically over time, especially in the past few years. IDS technology is developing rapidly, and its near-term future is very promising. Data mining techniques for intrusion detection are increasingly becoming an indispensable and integral component of any comprehensive enterprise security program, since they successfully complement traditional security mechanisms.
CONCLUSION
Although a variety of techniques have been developed for detecting different types of computer attacks in different computer systems, there are still a number of research issues concerning the prediction performance, efficiency, and fault tolerance of IDSs that need to be addressed. Signature analysis, the most common strategy in the commercial domain until recently, is increasingly integrated with different anomaly detection and alert correlation techniques based on advanced data mining and artificial intelligence techniques in order to detect emerging and coordinated computer attacks.
REFERENCES
Abraham, T. (2001). IDDM: Intrusion detection using data mining techniques. Australia Technical Report DSTO-GD-0286. DSTO Electronics and Surveillance Research Laboratory, Department of Defense. Arbor Networks. (2003). Intelligent network management with peakflow traffic. Retrieved from http://www.arbornetworks.com/products_sp.php Barbara, D., Wu, N., & Jajodia, S. (2001). Detecting novel network intrusions using Bayes estimators. In Proceedings of the First SIAM Conference on Data Mining, Chicago, IL. Bloedorn, E., Christiansen, A., Hill, W., Skorupka, C., Talbot, L., & Tivel, J. (2001). Data mining for network intrusion detection: How to get started. MITRE Technical Report. Retrieved from www.mitre.org/work/tech_papers/tech_papers_01/bloedorn_datamining Bridges, S., & Vaughn, R. (2000). Fuzzy data mining and genetic algorithms applied to intrusion detection. In Proceedings of the 23rd National Information Systems Security Conference, Baltimore, MD. Cabrera, J., Ravichandran, B., & Mehra, R. (2000). Statistical traffic modeling for network intrusion detection. In The Proceedings of 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication System. Dao, V., & Vemuri, R. (2002). Computer network intrusion detection: A comparison of neural networks methods. Differential Equations and Dynamical Systems, Special Issue on Neural Networks.
Dowell, C., & Ramstedt, P. (1990). The Computerwatch data reduction tool. In Proceedings of the 13th National Computer Security Conference, Washington, DC. Ertoz, L., Eilertson, E., Lazarevic, A., Tan, P., Srivastava, J., Kumar, V., & Dokas, P. (2004). The MINDS: Minnesota intrusion detection system. In A. Joshi, H. Kargupta, K. Sivakumar, & Y. Yesha (Eds.), Next generation data mining. Boston: Kluwer Academic Publishers. Eskin, E. (2000). Anomaly detection over noisy data using learned probability distributions. In Proceedings of the International Conference on Machine Learning, Stanford University, CA. Eskin, E., Arnold, A., Prerau, M., Portnoy, L., & Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In S. Jajodia & D. Barbara (Eds.), Applications of data mining in computer security, advances in information security. Boston: Kluwer Academic Publishers. Fan, W., Lee, W., Miller, M., Stolfo, S.J., & Chan, P.K. (2001). Using artificial anomalies to detect unknown and known network intrusions. In the Proceedings of the First IEEE International Conference on Data Mining, San Jose, CA. Ghosh, A., & Schwartzbard, A. (1999). A study in using neural networks for anomaly and misuse detection. In Proceedings of the Eighth USENIX Security Symposium (pp. 141-151). Hawkins, S., He, H., Williams, G., & Baxter, R. (2002). Outlier detection using replicator neural networks. In Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (pp. 170-180). Lecture Notes in Computer Science 2454. Aix-en-Provence, France. Joshi, M., Agarwal, R., & Kumar, V. (2001). PNrule, mining needles in a haystack: Classifying rare classes via two-phase rule induction. In Proceedings of the ACM SIGMOD Conference on Management of Data, Santa Barbara, CA. Lazarevic, A., Ertoz, L., Ozgur, A., Srivastava, J., & Kumar, V. (2003). A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA. Lazarevic, A., Kumar, V., & Srivastava, J. (2004). Intrusion detection: A survey. In V. Kumar, J. Srivastava, & A. Lazarevic (Eds.), Managing cyber threats: Issues, approaches and challenges. Boston: Kluwer Academic Publishers.
Lee, W., & Stolfo, S.J. (2000). A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4), 227-261. Lee, W., Stolfo, S.J., & Mok, K. (2001). Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14, 533-567. Liepins, G., & Vaccaro, H. (1992). Intrusion detection: Its role and validation. Computers and Security, 347-355. Lippmann, R., & Cunningham, R. (2000a). Improving intrusion detection performance using keyword selection and neural networks. Computer Networks, 34(4), 597-603. Lippmann, R., Haines, J.W., Fried, D.J., Korba, J., & Das, K. (2000b). The 1999 DARPA off-line intrusion detection evaluation. Computer Networks. Lippmann, R.P., Cunningham, R.K., Fried, D.J., Graf, I., Kendall, K.R., Webster, S.E., & Zissman, M.A. (1999). Results of the DARPA 1998 offline intrusion detection evaluation. In Proceedings of Workshop on Recent Advances in Intrusion Detection. Mahoney, M., & Chan, P. (2002). Learning nonstationary models of normal network traffic for detecting novel attacks. In Proceedings of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (pp. 376-385), Edmonton, Canada. Manganaris, S., Christensen, M., Serkle, D., & Hermiz, K. (2000). A data mining analysis of RTID alarms. Computer Networks, 34(4), 571-577. Mazu Networks. (2003). Mazu Profiler: An overview. Retrieved from www.mazunetworks.com/solutions/white_papers/download/Mazu_Profiler.pdf Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). The spread of the Sapphire/Slammer Worm. Retrieved from www.cs.berkeley.edu/~nweaver/sapphire Mukkamala, S., Sung, A., & Abraham, A. (2003a). A linear genetic programming approach for modeling intrusion. In Proceedings of the IEEE Congress on Evolutionary Computation, Perth, Australia. Ryan, J., Lin, M-J., & Miikkulainen, R. (1997). Intrusion detection with neural networks. In Proceedings of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management (pp. 72-77), Providence, RI. Schultz, M., Eskin, E., Zadok, E., & Stolfo, S. (2001). Data mining methods for detection of new malicious executables. In Proceedings of the IEEE Symposium on Security and Privacy (pp. 38-49), Oakland, CA.
Sinclair, C., Pierce, L., & Matzner, S. (1999). An application of machine learning to network intrusion detection. In Proceedings of the 15th Annual Computer Security Applications Conference (pp. 371-377). System Detection. (2003). Anomaly detection: The Antura difference. Retrieved from http://www.sysd.com/library/ anomaly.pdf Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 320-324), Boston, MA. Ye, N., & Chen, Q. (2001a). An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International Journal, 17(2), 105-112. Ye, N., & Li, X. (2001b, June). A scalable clustering technique for intrusion signature recognition. In Proceedings of the IEEE Workshop on Information Assurance and Security. United States Military Academy, West Point, NY.
KEY TERMS Anomaly Detection: Analysis strategy that identifies intrusions as unusual behavior that differs from the normal behavior of the monitored system. Intrusion: Malicious, externally induced, operational fault in the computer system. Intrusion Detection: Identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. Misuse Detection: Analysis strategy that looks for events or sets of events that match a predefined pattern of a known attack. Signature-Based Intrusion Detection: Analysis strategy where monitored events are matched against a database of attack signatures to detect intrusions. Tcpdump: Computer network debugging and security tool that allows the user to intercept and display TCP/IP packets being transmitted over a network to which the computer is attached. Worms: Self-replicating programs that aggressively spread through a network by taking advantage of automatic packet sending and receiving features found on many computers.
Data Mining in Diabetes Diagnosis and Detection Indranil Bose The University of Hong Kong, Hong Kong
INTRODUCTION Diabetes is a disease afflicting hundreds of millions of people around the world. In the USA, the population of diabetic patients is about 15.7 million (Breault et al., 2002). It is reported that the direct and indirect cost of diabetes in the USA is $132 billion (Diabetes Facts, 2004). Since there is no method able to eradicate diabetes, doctors are striving for better ways to fight the disease. Researchers are trying to link the cause of diabetes with patients' lifestyles, inheritance information, age, and so forth in order to get to the root of the problem. Because of the large number of contributing factors and the availability of historical data, data mining tools have been used to generate inference rules on the cause and effect of diabetes as well as to help in knowledge discovery in this area. The goal of this chapter is to explain the different steps involved in mining diabetes data and to show, using case studies, how data mining has been carried out for detection and diagnosis of diabetes in Hong Kong, the USA, Poland, and Singapore.
BACKGROUND Diabetes is a severe metabolic disorder marked by high blood glucose levels, excessive urination, and persistent thirst, caused by a lack of insulin action. There are three main forms of diabetes—Type 1, Type 2, and gestational. It is believed that diabetes is a particularly opportune disease for data mining technology for a number of reasons (Breault, 2001):
• There are many diabetic databases with historic patient information.
• New knowledge about the treatment of diabetes can help save money.
• Diabetes can produce terrible complications like blindness, kidney failure, and so forth, so physicians need to know how to identify potential cases quickly.
The availability of historical data on diabetes naturally leads to the application of data mining techniques to discover interesting patterns (Apte et al., 2002; Hsu et al., 2000). The objective is to find rules that help to understand diabetes, facilitate early detection of diabetes, and discover how diabetes may be associated with different segments of the population. The data mining process for diagnosis of diabetes can be divided into five steps, though the underlying principles and techniques used for data mining diabetic databases may differ for different projects in different countries. Following is a brief description of the five steps.
Step 1: Data Cleaning Before carrying out data mining on the diabetic patient database, the data should be cleaned. Patient records often contain errors such as missing values, typographical mistakes, or incorrect information, and, worse still, many records are duplicates. Two approaches can be used to clean the data, namely a standardized format schema and the sorted neighborhood method (Hsu et al., 2000). To generate the standardized format schema, a user defines mappings among attributes stored in different formats, and each of the database files is modified into this standardized format. The next task is to remove duplicate records from the standardized data using the sorted neighborhood method. Under this scheme, the database is sorted on one or more fields that uniquely identify each record, and the chosen fields of the records are compared within a sliding window. When duplicates are detected, the user is asked to verify them, and the duplicates are removed from the database.
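A minimal sketch of this sliding-window comparison is shown below: records are sorted on a key and only records that fall within a small window are compared. The record fields, the sort key, and the similarity test are assumptions made for illustration, not the actual cleaning system described in the cited study.

```python
# Hypothetical sketch: sorted neighborhood method for duplicate detection.
records = [
    {"id": 1, "name": "Chan Tai Man", "dob": "1950-03-12"},
    {"id": 2, "name": "Chan Tai Man", "dob": "1950-03-12"},   # likely duplicate of id 1
    {"id": 3, "name": "Lee Siu Ming", "dob": "1962-07-30"},
]

def similar(a, b):
    # Stand-in comparison; a real system would use fuzzy matching on several fields.
    return a["name"] == b["name"] and a["dob"] == b["dob"]

window = 3  # only compare records that land close together after sorting
records.sort(key=lambda r: (r["name"], r["dob"]))

candidates = []
for i, rec in enumerate(records):
    for other in records[i + 1 : i + window]:
        if similar(rec, other):
            candidates.append((rec["id"], other["id"]))

print("possible duplicates for manual review:", candidates)
```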
Step 2: Data Preparation The importance of the data preparation step cannot be overstated, since the success of data analysis depends on it (Breault et al., 2002). A fundamental issue is whether to use a relational database comprising multiple tables or a flat file that is best suited for data mining (Breault, 2002).
Generally, either of the following two methods is adopted. In the first method, a set of flat files is used to represent all the data; data mining tools are then used separately on each file, and the results obtained from the flat files are linked together in some fashion. An alternative method is to find a data reduction technique that allows fields in a complex database to be transformed into vector scores instead of being considered separately. For example, in an Australian study, the 26 items recorded for each patient in a diabetes database were converted into a four-component vector.
Step 3: Data Analysis Data mining tools are used to convert the data stored in databases into useful knowledge. Different software programs, built on different models, can be run on the same data and produce different results. Sometimes a result may be unexpected, but many of the rules and causal relationships discovered conform to known trends. One software package commonly used for data mining is Classification and Regression Trees (CART). CART recursively partitions the input variable space to maximize purity in the terminal tree nodes (Breault et al., 2002).
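The snippet below sketches the basic CART idea using scikit-learn's decision tree implementation rather than the commercial CART package used in the cited studies; the feature names and the tiny synthetic dataset are purely illustrative assumptions.

```python
# Hypothetical sketch: CART-style tree for a binary outcome (e.g., good vs. bad glycemic control).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))               # stand-ins for age, BMI, number of visits
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "bmi", "visits"]))
print("training accuracy:", tree.score(X, y))
```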
Step 4: Knowledge Evaluation The rules obtained from data mining may not be meaningful or true. Experts have to evaluate the knowledge before using it. For example, multiple random samples should be used to evaluate the data mining tools employed, to ensure that results are not simply due to chance (Breault, 2001). For diabetes, similar data mining studies in other geographic and cultural locations are needed to confirm any results that are suspected to be specific to one region or segment of the population.
Step 5: Knowledge Usage After knowledge is extracted as a result of data mining, the developers have to address the issue of cost effectiveness of applying that knowledge for detection of diabetes. As a result of Step 4, the data miners may obtain a costly solution that can help a small group of people. However, this may not prove helpful for diagnosing diabetes for the entire population. It is the effective usage of the knowledge gathered from data mining for real-life application that will mark the success of the endeavor.
MAIN THRUST Hong Kong Case—Diabetes Registry and Statistical Analysis In Hong Kong, around 10% of the population suffers from diabetes, of which 5% have Type 1 and 95% have the Type 2 form of the disease. The prevalence of Type 2 diabetes is increasing at an alarming rate among Chinese, and its development is believed to involve the interplay between genetic and environmental factors. In view of this, the major hospitals have established their own diabetes registries for knowledge discovery. For instance, the diabetes clinic at the Prince of Wales Hospital of Hong Kong has adopted a structured diabetes care protocol and has, since 1995, maintained a diabetes registry based on a modified Europe DIAB-CARE format (Apte et al., 2002). In September 2000, a diabetes data mining study was conducted to investigate the patterns of diabetes and their relationships with clinical characteristics in Hong Kong Chinese patients with late-onset (over age 34) Type 2 diabetes (Lee et al., 2000). This study involved 2,310 patients selected from a hospital clinic-based diabetes registry. A statistical analysis tool, the Statistical Package for the Social Sciences (SPSS), was used for conducting t-tests, Mann-Whitney U tests, analysis of variance, and χ2 tests. Many useful results were generated, which can
be found on the two tables on page 1366 of the paper by Lee, et al. (2000). For example, it was found that the patients, irrespective of their sex, were more likely to have a diabetic mother than a diabetic father. Also, female patients with a diabetic mother were found to have higher levels of plasma total cholesterol compared to those having a diabetic father. In two-group comparisons, there was also evidence that the male patients with a diabetic father had higher body mass index (BMI) values than the male patients with a diabetic mother. It also was shown that both maternal and paternal factors may be responsible for the development of Type 2 diabetes in the Chinese population. All these rules greatly improved the physicians’ understanding of diabetes.
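For readers who want to reproduce this style of two-group comparison without SPSS, the sketch below runs the same families of tests (t-test, Mann-Whitney U, and chi-square) with SciPy on made-up patient groups; the variables, values, and group labels are assumptions, not the actual registry data.

```python
# Hypothetical sketch: two-group comparisons of the kind reported in the Hong Kong study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
bmi_diabetic_father = rng.normal(25.5, 3.0, 80)    # invented BMI values
bmi_diabetic_mother = rng.normal(24.8, 3.0, 120)

t_stat, t_p = stats.ttest_ind(bmi_diabetic_father, bmi_diabetic_mother, equal_var=False)
u_stat, u_p = stats.mannwhitneyu(bmi_diabetic_father, bmi_diabetic_mother)

# Chi-square test on a 2x2 table, e.g., patient sex vs. parental diabetes history.
table = np.array([[45, 35], [60, 60]])
chi2, chi_p, dof, _ = stats.chi2_contingency(table)

print(f"t-test p={t_p:.3f}, Mann-Whitney p={u_p:.3f}, chi-square p={chi_p:.3f}")
```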
United States Case—Classification and Regression Trees (CART) Diabetes is a major health problem in the United States. According to current statistics, about 5.9% of the population, or 16 million people in the USA, have diabetes. It is estimated that the percentage will rise to 8.9% by 2025. In the United States, there is a long history of diabetic
registries and databases with systematically collected patient information. A large-scale research study was carried out by applying data mining techniques to a diabetic data warehouse from an integrated health care system in the New Orleans area with 30,383 diabetic patients (Breault et al., 2002). To prepare the data tables, structured query language (SQL) statements were executed on the data warehouse to form the flat file used as input to the data mining software. For the data mining itself, CART software was used with a binary target variable and ten predictors: age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end-stage renal disease. The outcome showed that the most important variable associated with bad glycemic control is younger age, not the comorbidity index or whether patients have related diseases. The total classification error (40.5%) was substantial.
PIMA Indian Case—Machine Learning Mining Approach Another special group in the USA that deserves a separate mention because of the importance of the findings is the Pima Indian case. Because Pima Indians have a very high rate of diabetes, the Pima Indian Diabetes Database (PIDD), containing records of 768 diabetes patients, has been established by the National Institutes of Health. There have been many studies applying data mining techniques to the PIDD. Some well-known examples of data mining techniques used include the multi-stream dependency detection (MSDD) algorithm, Bayesian neural networks, and multiplier-free feed-forward neural networks. Although the cited examples use somewhat different subgroups of the PIDD, accuracy for predicting diabetes ranges from 66% to 81%. It is interesting to see that, with a wide variety of prediction tools available, efficiency and accuracy can be improved greatly in diabetes data mining. Recently, several modified data mining techniques have been used successfully on the same database. These include the use of decision trees with an augmented splitting criterion (Buja & Lee, 2001). The use of the fuzzy Naïve Bayes method yielded a best-case accuracy of 76.95% (Tang et al., 2002). The application of shunting inhibitory neural networks, where the neurons can act as adaptive non-linear filters, resulted in an accuracy of over 80% and performed better than multi-layer perceptrons, in general (Arulampalam & Bouzerdoum, 2001).
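To give a rough sense of how such accuracy figures are obtained, the sketch below cross-validates a plain Gaussian Naïve Bayes baseline (not the fuzzy variant cited above, and not any of the neural methods) on data shaped like the PIDD. Loading a local, headerless CSV copy of the dataset with these column names is an assumption made for illustration.

```python
# Hypothetical sketch: a simple baseline classifier on PIDD-style data.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumed local copy of the Pima Indian Diabetes Database, no header row, these columns.
cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
        "insulin", "bmi", "pedigree", "age", "outcome"]
df = pd.read_csv("pima_diabetes.csv", names=cols)

X, y = df[cols[:-1]], df["outcome"]
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```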
Poland Case—Rough Sets Technique In Poland, more than 1.5 million people suffer from diabetes, and more than 41,000 have Type 1 diabetes (Medtronic
Figure 1. Image of ROSETTA software on training and test data sets
Limited, 2002). Rough sets have been applied to mine data in a Polish diabetes database using the ROSETTA software (Øhrn, 1997). The rough sets approach investigates structural relationships in the data rather than probability distributions and produces decision tables rather than trees. A recent study from a Polish medical school used a dataset of 107 patients aged five to 22 who were suffering from insulin-dependent diabetes. In this study, it was found that the minimal subsets of attributes that are efficient for rule making included age at disease diagnosis, microalbuminuria (yes/no), and disease duration in years. With the use of rough set techniques, decision rules were generated to predict microalbuminuria. The best predictor was age < 7, which predicted the absence of microalbuminuria with 83.3% accuracy.
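Decision rules produced by a rough set tool such as ROSETTA can be checked directly against the data. The sketch below evaluates a rule of the form "age at diagnosis < 7 implies no microalbuminuria" on an invented patient table, reporting its coverage and accuracy; the column names and records are assumptions, not data from the cited study.

```python
# Hypothetical sketch: evaluating a rough-set-style decision rule against a patient table.
import pandas as pd

patients = pd.DataFrame({
    "age_at_diagnosis": [5, 6, 9, 12, 4, 15, 6, 18],
    "microalbuminuria": ["no", "no", "yes", "no", "no", "yes", "yes", "no"],
})

covered = patients[patients["age_at_diagnosis"] < 7]        # rule antecedent
correct = covered[covered["microalbuminuria"] == "no"]       # rule consequent

coverage = len(covered) / len(patients)
accuracy = len(correct) / len(covered) if len(covered) else 0.0
print(f"rule 'age<7 -> no microalbuminuria': coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```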
Singapore Case—Data Cleansing and Interaction With Domain Experts Around 10% of the population in Singapore is diabetic. In the diabetes data mining exercise conducted in Singapore, the objective was to find rules that could be used by physicians to understand more about diabetes and to find special patterns about a particular patient population (Hsu et al., 2000). In order to deal with noisy data, a semi-automatic data cleaning system was utilized to reconcile format differences among tables with user mapping input. A sorted neighborhood method was used to remove duplicate records. In a particular case at the National University of Singapore, the researchers mined the data with a tool that performs classification based on association rules, using thresholds of 1% support and 50% confidence. They generated 700 rules. The physicians were overwhelmed by the large number of rules and also wanted to know causal connections rather than
associations. It was realized that post-processing was needed in order to make the results usable for the physicians. The tree method was employed to generate general rules giving the underlying trends in the data that the physicians already knew and exception rules giving deviations to these trends. The physicians found the exception rules especially helpful in understanding how subpopulation trends differ from the main population. With the help of the physicians’ domain knowledge, the data mining process was optimized by reducing the size of data and hypothesis space and by removing unnecessary query operations (Owrang, 2000).
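The support and confidence thresholds quoted for the Singapore study can be illustrated directly. The sketch below computes support and confidence for one candidate rule on an invented one-hot-encoded patient table; real studies use dedicated association rule miners that enumerate all rules clearing the thresholds, and the attribute names here are assumptions.

```python
# Hypothetical sketch: support and confidence for a candidate association rule.
import pandas as pd

# One-hot-encoded patient attributes (invented for illustration).
data = pd.DataFrame({
    "age_over_50":    [1, 1, 0, 1, 1],
    "high_bmi":       [1, 0, 1, 1, 1],
    "family_history": [1, 1, 0, 1, 0],
    "diabetic":       [1, 1, 0, 1, 1],
}).astype(bool)

antecedent = data["age_over_50"] & data["high_bmi"]
rule = antecedent & data["diabetic"]

support = rule.mean()                        # fraction of records matching the whole rule
confidence = rule.sum() / antecedent.sum()   # P(consequent | antecedent)

# Keep the rule only if it clears the thresholds quoted in the text.
if support >= 0.01 and confidence >= 0.5:
    print(f"age_over_50 & high_bmi -> diabetic "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```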
FUTURE TRENDS An important step in using data mining for diagnosing diabetes is the creation of a centralized database that can store most diabetic patients' health data. The larger the data pool, the more accurate the results from data mining are likely to be. The experiences and the many challenges faced in building a data warehouse for Christiana Care, a not-for-profit organization in the state of Delaware, USA, are described by Ewen et al. (1998), who report that the creation of the data warehouse was responsible for a gain in operating revenue in 1997. Though the task is not easy, for successful data mining it is almost imperative to create data warehouses for sharing diabetes-related information among physicians across the world. The major obstacle is that only a small number of diabetic patients in the world have access to shared care, while many others remain undiagnosed, untreated, or suboptimally treated (Chan, 2000). This is a major challenge that has to be overcome in the future. The case studies clearly indicate that no single method proves to be effective for mining diabetes databases. Many tools and techniques, such as statistical methods, machine learning algorithms, neural networks, Bayesian neural networks, and rough sets, have been employed to study and figure out useful rules and patterns to help physicians combat the disease. There is further need in the future to use other techniques like case-based reasoning for assessment of the risk of complications for individual diabetes patients (Armengol et al., 2004; Montani et al., 2003) or hybrid techniques like rule induction using simulated annealing for discovering associations between observations made of patients on their first visit and early mortality (Richards et al., 2001). Moreover, besides doing data mining in diabetes databases, it is also important that the same work be carried out for studying the complications of diabetes and understanding its relationship to other diseases. By applying different data-mining techniques and tools, more
knowledge about associations between diabetes and other diseases has to be uncovered in order to improve public health in the future. Though Type 1 and Type 2 diabetes have been studied frequently, more effort is needed in the future to study the diagnosis and detection of gestational diabetes.
CONCLUSION The occurrence of diabetes is increasing at an alarming rate all over the world, and its development is believed to involve the interplay among many unknown and mysterious reasons such as genetic and environmental factors. In view of this, data-mining technology can play a very important role in analyzing existing diabetes databases and identifying useful rules that help diabetes prevention and control. It is worthwhile to recognize that successful application of data-mining technologies to diabetes prevention and control requires: (1) preparing a comprehensive diabetes database for input into data mining software to avoid garbage in and garbage out; this will require data cleansing and transformations from a relational data warehouse to a data mining data table that is useable by data mining tools; (2) selecting and skillfully applying the appropriate data-mining software and techniques; (3) intelligently sifting through the software output to prioritize the areas that will provide the most cost savings or outcomes improvement; and (4) interacting with domain experts to select the best-fit rules and patterns that optimize the efficiency and effectiveness of the data-mining process. In this paper, we have studied how the above steps are applied for mining diabetes databases in general, and, in particular, we have discussed the various methods adopted by researchers in different countries for diagnosis of diabetes.
REFERENCES Apte, C., Liu, B., Pednault, E.P.D., & Smyth, P. (2002). Business applications of data mining. Communications of the ACM, 45(8), 49-53. Armengol, E., Palaudaries, A., & Plaza, E. (2004). Individual prognosis of diabetes long-term risks: A CBR approach. Technical Report IIIA. Arulampalam, G., & Bouzerdoum, A. (2001). Application of shunting inhibitory artificial neural networks to medical diagnosis. Proceedings of the Seventh Australian and New Zealand Intelligent Information Systems Conference.
Breault, J.L. (2001). Data mining diabetic databases: Are rough sets a useful addition? [Electronic version]. Computing Science and Statistics, 33. Breault, J.L. (2002). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Retrieved from http://www.ipam.ucla.edu/publications/sdm2002/sdm2002_jbreault_poster.pdf Breault, J.L., Goodall C.R., & Fos, P.J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26, 37-54. Buja, A., & Lee, Y-S. (2001). Data mining criteria for treebased regression and classification. San Francisco, CA: KDD 2001. Chan, J.C.N. (2000). Heterogeneity of diabetes mellitus in the Hong Kong Chinese population. Hong Kong Medical Journal, 6(1), 77-84. Diabetes Facts. (2004). Retrieved from http:// diabetes.mdmercy.com/about_diabetes/facts.html Ewen, E.F. et al. (1998). Data warehousing in an integrated health system: Building the business case. Washington, D.C.: DOLAP 98. Hsu, W., Lee, M.L., Liu, B., & Ling, T.W. (2000). Exploration mining in diabetic patients databases: Findings and conclusions. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, Massachusetts. Lee, S.C. et al. (2000). Diabetes in Hong Kong Chinese: Evidence for familial clustering and parental effects. Diabetes Care, 23, 1365-1368. Medtronic Limited. (2002). Diabetes overview. Retrieved from http://www.medtronic.com/UK/health/diabetes/ diabetes_overview.html Montani, S. et al. (2003). Integrating model-based decision support in a multi-modal reasoning systems for managing type 1 diabetic patients. Artificial Intelligence in Medicine, 29, 131-151. Øhrn, A., & Komorowski, J. (1997). ROSETTA: A rough set toolkit for analysis of data. Proceedings of the Joint Conference of Information Sciences, Durham, North Carolina. Owrang, M.M. (2000). Using domain knowledge to optimize the knowledge discovery process in databases. International Journal of Intelligent Systems,15, 45-60.
Richards, G., Rayward-Smith, V.J., Sonksen, P.H., Carey, S., & Weng, C. (2001). Data mining for indicators of early mortality in a database of clinical records. Artificial Intelligence in Medicine, 22, 215-231. Tang, Y., Pan, W., Qiu, X, & Xu, Y. (2002). The identification of fuzzy weighted classification system incorporated with fuzzy Naïve Bayes from data. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics.
KEY TERMS Body Mass Index (BMI): A measure of mass of an individual that is calculated as weight divided by height squared. Data Warehouse: A database, frequently very large, that can access all of a company’s information. It contains data about how the warehouse is organized, where the information can be found, and any connections between existing data. Gestational Diabetes: This form of diabetes develops in 2% to 5% of all pregnancies but disappears when a pregnancy is over. Neural Networks: Computer processors or software based on the human brain’s mesh-like neuron structure. Neural networks can learn to recognize patterns and programs to solve related problems on their own. Rough Sets: A method of representation of uncertainty in the membership of a set. It is related to fuzzy sets and is a popular data mining technique in medicine and finance. Type 1 Diabetes: This insulin-dependent diabetes mellitus includes risk factors that are less well defined. Autoimmune, genetic, and environmental factors are involved in the development of this type of diabetes. Type 2 Diabetes: This non-insulin-dependent diabetes mellitus accounts for about 90% to 95% of all diagnosed cases of diabetes. Risk factors include older age, obesity, family history, prior history of gestational diabetes, impaired glucose tolerance, physical inactivity, and race/ethnicity.
Data Mining in Human Resources Marvin D. Troutt Kent State University, USA Lori K. Long Kent State University, USA
INTRODUCTION In this paper, we briefly review and update our earlier work (Long & Troutt, 2003) on the topic of data mining in the human resources area. To gain efficiency, many organizations have turned to technology to automate many HR processes (Hendrickson, 2003). As a result of this automation, HR professionals are able to make more informed strategic HR decisions (Bussler & Davis, 2002). While HR professionals may no longer need to manage the manual processing of data, they should not abandon their ties to data collected on and about the organization’s employees. Using HR data in decision-making provides a firm with the opportunity to make more informed strategic decisions. If a firm can extract useful or unique information on the behavior and potential of their people from HR data, they can contribute to the firm’s strategic planning process. The challenge is identifying useful information in vast human resources databases that are the result of the automation of HR related transaction processing. Data mining is essentially the extracting of knowledge based on patterns of data in very large databases and is an analytical technique that may become a valuable tool for HR professionals. Organizations that employ thousands of employees and track employment related information might find valuable information patterns contained within their databases to provide insights in such areas as employee retention and compensation planning. To develop an understanding of the potential of data mining HR information in a firm, we will identify opportunities as well as concerns in applying data mining techniques to HR Information Systems.
BACKGROUND In this section, we review existing work on the topic. The human resource information systems (HRIS) of most organizations today feature relational database systems that allow data to be stored in separate files that can be linked by common elements such as name or identification number. The relational database provides organizations with the ability to keep a virtually limitless amount of data
on employees. It also allows organizations to access the data in a variety of ways. For example, a firm can retrieve data on a particular employee or they can retrieve data on a certain group of employees through conducting a search based on a specific parameter such as job classification. The development of relational databases in organizations along with advances in storage technology has resulted in organizations collecting a large amount of data on employees. While these calculations are helpful to quantify the value of some HR practices, the bottom-line impact of HR practices is not always so clear. One can evaluate the cost per hire, but does that information provide any reference to the value of that hire? Should a greater value be assessed to an employee who stays with the organization for an extended period of time? Most data analysis retrieved from HRIS does not provide an opportunity to seek out additional relationships beyond those that the system was originally designed to identify. Traditional data analysis methods often involve manual work and interpretation of data that is slow, expensive and highly subjective (Fayyad, Piatsky-Shapiro, & Smyth, 1996a). For example, if an HR professional is interested in analyzing the cost of turnover, they might have to extract data from several different sources such as accounting records, termination reports and personnel hiring records. That data is then combined, reconciled and evaluated. This process creates many opportunities for errors. As business databases have grown in size, the traditional approach has grown more impractical. Data mining has been used successfully in many functional areas such as finance and marketing. HRIS applications in many organizations provide an as yet unexplored opportunity to apply data mining techniques (Patterson & Lindsay, 2003). While most applications provide opportunities to generate ad-hoc or standardized reports from specific sets of data, the relationships between the data sets are rarely explored. It is this type of relationship that data mining seeks to discover. The blind application of data mining techniques can easily lead to the discovery of meaningless and invalid patterns. If one searches long enough in any data set, it is likely possible to find patterns that appear to hold but
are not necessarily statistically significant or useful (Fayyad et al., 1996a). There has not been any specific exploration of applying these techniques to human resource applications; however, there are some guidelines in the process that are transferable to an HRIS. Feelders, Daniels and Holsheimer (2000) outline six important steps in the data mining process: 1) problem definition, 2) acquisition of background knowledge, 3) selection of data, 4) pre-processing of data, 5) analysis and interpretation, and 6) reporting and use. At each of these steps, we will look at important considerations as they relate to data mining human resources databases. Further, we will examine some specific legal and ethical considerations of data mining in the HR context. The formulation of the questions to be explored is an important aspect of the data mining process. As mentioned earlier, with enough searching or application of sufficiently many techniques, one might be able to find useless or ungeneralizable patterns in almost any set of data. Therefore, the effectiveness of a data mining project is improved through establishing some general outlines of inquiry prior to start the project. To this extent, data mining and the more traditional statistical studies are similar. Thus, careful attention to the scientific method and sound research methods are to be followed. A widely respected source of guidelines on research methods is the book by Kerlinger and Lee (2000). A certain level of expertise is necessary to carefully evaluate questions posed in a data mining project. Obviously, a requirement is data mining and statistical expertise, but one must also have some intimate understanding of the data that is available, along with its business context. Furthermore, some subject matter expertise is needed to determine useful questions, select relevant data and interpret results (Feelders et al., 2000). For example, a firm with interest in evaluating the success of an affirmative action program needs to understand the Equal Employment Opportunity (EEO) classification system to know what data is relevant. Another important consideration in the process of developing a question to look at is the role of causality (Feelders et al., 2000). A subject matter expert’s involvement is important in interpreting the results of the data analysis. For example, a firm might find a pattern indicating a relationship between high compensation levels and extended length of service. The question then becomes, do employees stay with the company longer because they receive high compensation? Or do employees receive higher compensation if they stay longer with the company? An expert in the area can take the relationship discovered and build upon it with additional information available in the organization to help understand the cause and effect of the specific relationship identified.
Selecting and preparing the data is the next step in the data mining process. Some organizations have independent Human Resource Information Systems that feature multiple databases that are not connected to each other. This type of system is sometimes selected to offer greater flexibility to remote organizational locations or sub-groups with unique information needs (Anthony et al., 1996). The possible inconsistency of the design of the databases could make data mining difficult when multiple databases exist. Data warehousing can prevent this problem and an organization may need to create a data warehouse before they begin a data-mining project. The advantage gained in first developing the data warehouse or mart is that most of the data editing is effectively done in advance. Another challenge in mining data is dealing with the issues of missing or noisy data. Data quality may be insufficient if data is collected without any specific analysis in mind (Feelders et al., 2000). This is especially true for human resource information. Typically when HR data is collected, the purpose is some kind of administrative need such as payroll processing. The need of data for the required transaction is the only consideration in the type of data to collect. Future analysis needs and the value in the data collected is not usually considered. Missing data may also be a problem, especially if the system administrator does not have control over data input. Many organizations have taken advantage of web-based technology to allow for employee input and updating of their own data (Hendrickson, 2003). Employees may choose not to enter certain types of data resulting in missing data. However, a data warehouse or datamart may help to prevent or systemized the handling of many of these problems. There are many types of algorithms in use in data mining. The choice of the algorithm depends on the intended use of the extracted knowledge (Brodley, Lane, & Stough, 1999). The goals of data mining can be broken down into two main categories. Some applications seek to verify the hypothesis formulated by the user. The other main goal is the discovery or uncovering new patterns systematically (Fayyad et al., 1996a). Within discovery, the data can be used to either predict future behavior or describe patterns in an understandable form. A complete discussion of data mining techniques is beyond the scope of this paper. However, the following techniques have the potential to be applicable for data mining of human resources information. Clustering and classification is an example of a set of data mining techniques borrowed from classical statistical methods that can help describe patterns in information. Clustering seeks to identify a small set of exhaustive and mutual exclusive categories to describe the data that is present (Fayyad et al., 1996a). This might be a useful application to human resource data if you were trying to
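A small example of coping with the missing-data problem described above is sketched here: profile how incomplete each employee-entered field is and then impute or flag the gaps before mining. The column names and the imputation choices are assumptions made for illustration, and the right treatment depends on how the data will later be used.

```python
# Hypothetical sketch: profiling and handling missing values in self-service HR data.
import pandas as pd

employees = pd.DataFrame({
    "employee_id":  [101, 102, 103, 104],
    "job_family":   ["IT", "HR", None, "IT"],
    "tenure_years": [3.5, None, 7.0, 1.2],
})

print(employees.isna().mean())                 # fraction missing per field

cleaned = employees.copy()
cleaned["tenure_years"] = cleaned["tenure_years"].fillna(cleaned["tenure_years"].median())
cleaned["job_family"] = cleaned["job_family"].fillna("Unknown")   # keep the row, flag the gap
```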
263
identify a certain set of employees with consistent attributes. For example, an employer may want to find out which main categories its top-performing employees fall into, with an eye toward tailoring various programs to the groups or studying such groups further. One category may be more or less appropriate for one type of training program. A difficulty with clustering techniques is that no normative techniques are known that specify the correct number of clusters to form. In addition, there exist many different logics that may be followed in forming the clusters. Therefore, the art of the analyst is critical. Similarly, classification is a data mining technique that maps a data item into one of several pre-defined classes (Fayyad et al., 1996a). Classification may be useful in human resources to classify trends of movement through the organization for certain sets of successful employees. A company is at an advantage when recruiting if it can point out some realistic career paths for new employees. Being able to support those career paths with information reflecting employee success can make this a strong resource for those charged with hiring in an organization. Decision Tree Analysis, also called tree or hierarchical partitioning, is a somewhat related technique but follows a very different logic and can be rendered somewhat more automatic. Here, a variable is chosen first in such a way as to maximize the difference or contrast formed by splitting the data into two groups. One group consists of all observations having a value higher than a certain cutoff on that variable, such as the mean. Then the complement, namely those lower than that value, becomes the other group. Each half can then be subjected to successive further splits, with possibly different variables becoming important in different halves. For example, employees might first be split into two groups – above and below average tenure with the firm. The statistics of the two groups can then be compared and contrasted to gain insights about employee turnover factors. A further split of the lower tenure group, say based on gender, may help prioritize those most likely to need special programs for retention. Thus, clusters or categories can be formed by binary cuts, a kind of divide and conquer approach. In addition, the order of variables can be chosen differently to make the technique more flexible. For each group formed, summary statistics can be presented and compared. This technique is a rather pure form of data mining and can be performed in the absence of specific questions or issues. It might be applied as a way of seeking interesting questions about a very large datamart. Regression and related models, also borrowed from classical statistics, permit estimating a linear function of independent variables that best explains or predicts a given dependent variable. Since this technique is generally well known, we will not dwell on the details here. However, data warehouses and datamarts may be so large
that direct use of all available observations is impractical for regression and similar studies. Thus, random sampling may be necessary to use regression analysis. Various nonlinear regression techniques are also available in commercial statistical packages and can be used in a similar way for data mining. Recently, a new model fitting technique was proposed in Troutt, et al. (2001). In this approach the objective is to explain the highest or lowest performers, respectively, as a function of one or more independent variables. The final step in the process emphasizes the value of the use of the information. The information extracted must be consolidated and resolved with previous information and then shared and acted upon (Fayyad, PiatskyShapiro, Smyth, & Uthurusamy, 1996b). Too often, organizations go through the effort and expense of collecting and analyzing data without any idea of how to use the information retrieved. Applying data mining techniques to an HRIS can help support the justification of the investment in the system. Therefore, the firm should have some expected use for the information retrieved in the process. One use of human resource related information is to support decision-making in the organization. The results obtained from data mining may be used for a full range of decision-making steps. It can be used to provide information to support a decision, or can be fully integrated into an end-user application (Feelders et al., 2000). For example, a firm might be able to set up decision rules regarding employees based on the results of data mining. They might be able to determine when an employee is promotion eligible or when a certain work group should be eligible for additional company benefits. Organizational leaders must be aware of legislation concerning legal and privacy issues when making decisions about using personal data collected from individuals in organizations (Hubbard, Forcht, & Thomas, 1998). By their nature, systems that collect employee information run the risk of invading the privacy of employees by allowing access to the information to others within the organization. Although there is no explicit constitutional right to privacy, certain amendments and federal laws have relevance to this issue as they provide protection for employees from invasion of privacy and defamation (Fisher, Schoenfeldt, & Shaw, 1999). Further, recent epidemics of identity theft create a need for organizations to monitor access to employee information (Carlson, 2004). Organizations can protect themselves from these employee concerns by having solid business reasons for any data collected from employees and ensuring access to this data is restricted. There are also some potential legal issues if a firm uses inappropriate information extracted from data mining to make employment related decisions. Even if a
manager has an understanding of current laws, they could still face challenges as laws and regulations constantly change (Ledvinka & Scarpello, 1992). An extreme example that may violate equal opportunity laws is a decision to hire only females in a job classification because the data mining uncovered that females were consistently more successful. One research study found that an employee’s ability to authorize disclosure of personal information affected their perceptions of fairness and invasion of privacy (Eddy & Stone-Romero, 1999). Therefore, it is recommended that firms notify employees upon hire that the information they provide may be used in data analyses. Another recommendation is to establish a committee or review board to monitor any activities relating to analysis of personal information (Osborn, 1978). This committee can review any proposed research and ensure compliance with any relevant employment or privacy laws. Often the employee’s reaction to the use of their information is based upon their perceptions. If there is a perception that the company is analyzing the data to take negative actions against employees, employees are more apt to object to the use. However, if the employer takes the time to notify employees and obtain their permission, the perception of negativity may be removed. Even if there may be no legal consequences, employee confidence is something that employers need to maintain.
UPDATES AND ISSUES As of this writing, there are still surprisingly few reported industry experiences with DM in HR. Of course, the newness of the area will require a time lag. In addition, many HR groups will not yet have the requisite experts and information systems capabilities. However, there are several forces at work that should improve the situation in the near future. First, DM for the human resources area is rapidly becoming of interest to industry associations and software vendors. Human Resources Benchmarking Association™(http://www.hrba.org/roundtable.pdf) now provides links related to DM. In fact, this association is affiliated with and linked to the Data Mining Benchmarking Association. The former offers a variety of services related to DM, such as • • • •
Consortium studies with costs divided HR benchmarking efforts Data collection and database access Benchmarking studies of important data mining processes.
Similarly, Dynamic Health Strategies (DHS, http:// www.dhsgroup.com/) is an association that concentrates on group health benefit issues. It combines the use of proprietary technology, audit discipline, analytical software, and bio-statistical evaluation techniques to assess and improve the quality and performance of existing health care providers, health plans, and health systems. DHS performs analysis services for self-insured corporations, universities, government entities and group health management. The process allows DHS to pinpoint specific measures that enable clients to reduce costs, reduce healthcare risk exposure and improve quality of care. Group health analysis allows the monitoring of cost, quality and utilization of health care prior to problems becoming major issues. Clients and consultants are then able to apply the findings and recommendations to realize benefits such as: targeting of specific health and wellness issues, monitoring of utilization and cost over time, assessment of effectiveness and efficiency of health and wellness initiatives, design health plan benefits to meet specific needs, application of risk models and cost projections for budgeting and financial strategies. Evidently, the group health insurance benefits area is spearheading interest in DM. We may postulate several reasons for this. First, this area represents one of considerable importance in terms of financial impacts. Next, the existence of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) (http:// www.hipaadvisory.com/) created a requirement for administrative simplification, compliance, and reporting within healthcare entities. Third, the National Committee for Quality Assurance (NCQA) (http://www.ncqa.org/ index.asp) has established a quality assurance system called the Health Plan Employer Data and Information Set (HEDIS). Information Systems for reporting related to HIPAA and HEDIS serve as a ready data source for DM. Software vendors have been developing new DM tools at an accelerating pace. At least one vendor, Lawson (http://www.bitpipe.com/), has developed a product specialized to provide HR reporting, monitoring, and decision support. The Defense Software Collaborators (DACS, http://www.dacs.dtic.mil/) website has collected links to a very large number of DM tools and vendors. Many of these purport to be designed for the nontechnical business user.
FUTURE TRENDS The increasing interest in DM by HR related associations will likely continue and should provide a strong impetus and guidance for individual member firms. These associations make it possible for member firms to pool their databases. Such pooled information gives members a 265
,
statistical leverage in that their own data may be insufficient for particular studies. However, when their particular data are viewed in the context of the larger database, Bayesian methods might be brought to bear to make better inferences that may not be reliable with smaller data sets (Bickell & Doksum, 1977). In addition to more firm-specific experience studies, evaluation studies and comparisons are needed for the various DM tools becoming available. Potential adopters need assessments of both the effectiveness and ease of use.
CONCLUSION The use of HR data beyond administrative purposes can provide the basis for a competitive advantage by allowing organizations to strategically analyze one of their most important assets, their employees. Organizations must be able to transform the data they have collected into useful information. Data mining provides an attractive opportunity that has not yet been adequately exploited. The application of DM techniques to HR requires organizational expertise and work to prepare the system for mining. In particular, a datamart for HR is a useful first step. With proper preparation and consideration, HR databases together with data mining create an opportunity for organizations to develop their competitive advantage by using that information for strategic decision-making. A number of current influences promise to increase the interest of firms in the application of DM to the human resources area. Trade association and software vendor activities should also facilitate increased acceptance of and willingness to adopt DM.
REFERENCES Bickell, P.J., & Doksum, K. A. (1977). Mathematical statistics: Basic ideas and selected topics. San Francisco: Holden Day, Inc.
Fayyad, U.M., Piatsky-Shapiro, G., & Smyth, P. (1996a). From data mining to knowledge discovery in databases. AI Magazine, (7), 37-54. Fayyad, U.M., Piatsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996b). Advances in knowledge discovery and data mining. California: American Association for Artificial Intelligence. Feelders, A., Daniels, H., & Holsheimer, M. (2000). Methodological and practical aspects of data mining. Information & Management, (37), 271-281. Fisher, C.D., Schoenfeldt, L.F., & Shaw, J.B. (1999). Human resource management. Boston: Houghton Mifflin Company. Hendrickson, A. (2003). Human resource information systems: Backbone technology of contemporary human resources. Journal of Labor Research, 24(3), 381-394. Hubbard, J.C., Forcht, K.A., & Thomas, D.S. (1998). Human resource information systems: An overview of current ethical and legal issues. Journal of Business Ethics, (17), 1319-1323. Kerlinger, F.N., & Lee, H.B. (2000). Foundations of behavioral research (3rd ed.). Orlando: Harcourt, Inc. Ledvinka, J., & Scarpello, V.G. (1992). Federal regulation of personnel and human resource management. Belmont: Wadsworth Publishing Company. Long, L.K., & Troutt, M.D. (2003). Data mining human resource information systems. In J. Wang (Ed.), Data mining: Opportunities and challenges (pp. 366-381). Hershey: Idea Group Publishing. Osborn, J.L. (1978). Personal information: Privacy at the workplace. New York: AMACOM.
Brodley, C.E., Lane, T., & Stough, T.M. (1999). Knowledge discovery and data mining. American Scientist, 87, 54-61.
Patterson, B., & Lindsey, S. (2003). Mining the gold: Gain competitive advantage through HR data analysis. HR Magazine, 48(9), 131-136.
Bussler, L., & Davis, E. (2002). Information systems: The quiet revolution in human resource management. Journal of Computer Information Systems, 17-20.
SAS Institute Inc. (2001). John Deer harvests HR records with SAS. Retrieved from http://www.sas.com/news/success/johndeere.html
Carlson, L. (2004). Employers offering identity theft protection. Employee Benefit News, 18(3), 50-51.
Townsend, A., & Hendrickson, A. (1996). Recasting HRIS as an information resource. HR Magazine, 41(2), 91-96.
Eddy, E.R., Stone, D.L., & Stone-Romero, E.F. (1999). The effects of information management policies on reactions to human resource information systems: An integration of privacy and procedural justice perspectives. Personnel Psychology, 52(2), 335-358.
Troutt, M.D., Hu, M., Shanker, M., & Acar, W. (2003). Frontier versus ordinary regression models for data mining. In P.C. Pendharker (Ed.), Managing data mining
technologies in organizations: Techniques and applications (pp. 21-31). Hershey: Idea Group Publishing.
KEY TERMS
HEDIS: Health Plan Employer Data and Information Set, a quality assurance system established by the National Committee for Quality Assurance (NCQA).
Benchmarking: To identify the "Best in Class" of business processes, which might then be implemented or adapted for use by other businesses. Enterprise Resource Planning System: An integrated software system processing data from a variety of functional areas such as finance, operations, sales, human resources and supply-chain management. Equal Employment Opportunity Classification: A job classification system set forth by the Equal Employment Opportunity Commission (EEOC) for demographic reporting requirements.
HIPAA: the Health Insurance Portability and Accountability Act of 1996. Human Resource Information System: An integrated system used to gather and store information regarding an organization’s employees. Subject Matter Expert: A person who is knowledgeable about the skills and abilities required for a specific domain such as Human Resources.
Data Mining in the Federal Government Les Pang National Defense University, USA
INTRODUCTION Data mining has been a successful approach for improving the level of business intelligence and knowledge management throughout an organization. This article identifies lessons learned from data mining projects within the federal government including military services. These lessons learned were derived from the following project experiences:
• Defense Medical Logistics Support System Data Warehouse Program
• Department of Defense (DoD) Defense Financial and Accounting Service (DFAS) "Operation Mongoose"
• DoD Computerized Executive Information System (CEIS)
• Department of Transportation (DOT) Executive Reporting Framework System
• Federal Aviation Administration (FAA) Aircraft Accident Data Mining Project
• General Accounting Office (GAO) Data Mining of DoD Purchase and Travel Card Programs
• U.S. Coast Guard Executive Information System
• Veterans Administration (VA) Demographics System
BACKGROUND

Data mining involves analyzing diverse data sources in order to identify relationships, trends, deviations and other relevant information that would be valuable to an organization. This approach typically examines large single databases or linked databases that are dispersed throughout an organization. Pattern recognition technologies and statistical and mathematical techniques are often used to perform data mining. By utilizing this approach, an organization can gain a new level of corporate knowledge that can be used to address its business requirements. Many agencies in the federal government have applied a data mining strategy with significant success. This article aims to identify the lessons gained as a result of these many data mining implementations within the federal sector. Based on a thorough literature review, these lessons were uncovered and selected by the author as critical factors that led toward the success of real-world data mining projects. Some of these lessons also reflect novel and imaginative practices.
MAIN THRUST

Each lesson learned (indicated in boldface) is listed below. Following each lesson is a description of the illustrative project or projects (indicated in italics) that support it.
Avoid the Privacy Trap

DoD Computerized Executive Information System

Patients as well as the system developers indicated their concern for protecting the privacy of individuals: their medical records need safeguards. "Any kind of large database like that where you talk about personal info raises red flags," said Alex Fowler, a spokesman for the Electronic Frontier Foundation. "There are all kinds of questions raised about who accesses that info or protects it and how somebody fixes mistakes" (Hamblen, 1998). Proper security safeguards need to be implemented to protect the privacy of those in the mined databases. Vigilant measures are needed to ensure that only authorized individuals have the capability of accessing, viewing and analyzing the data. Efforts should also be made to protect the data through encryption and identity management controls. Evidence of the public's high concern for privacy was the demise of the Pentagon's $54 million Terrorist Information Awareness (originally, Total Information Awareness) effort, the program in which government computers were to be used to scan an enormous array of databases for clues and patterns related to criminal or terrorist activity. To the dismay of privacy advocates, many government agencies are still mining numerous databases (General Accounting Office, 2004; Gillmor, 2004). "Data mining can be a useful tool for the government, but safeguards should be put in place to ensure that information is not abused," stated the chief privacy officer for the Department of Homeland Security (Sullivan, 2004). Congressional concerns about privacy are so high that the body is looking at introducing legislation that would require agencies to report to Congress on data mining activities that support homeland security purposes (Miller, 2004).
Steer Clear of the "Guns Drawn" Mentality if Data Mining Unearths a Discovery

DoD Defense Finance & Accounting Service's Operation Mongoose was a program aimed at discovering billing errors and fraud through data mining. About 2.5 million financial transactions were searched to locate inaccurate charges. This approach detected data patterns that might indicate improper use. Examples include purchases made on weekends and holidays, entertainment expenses, highly frequent purchases, multiple purchases from a single vendor and other transactions that do not match the agency's past purchasing patterns. It turned up a cluster of 345 cardholders (out of 400,000) who had made suspicious purchases. However, the process needs some fine-tuning. As an example, buying golf equipment appeared suspicious until it was learned that a manager of a military recreation center had the authority to buy the equipment. Also, a casino-related expense turned out to be a commonplace hotel bill. Nevertheless, the data mining results have shown sufficient potential that data mining will become a standard part of the Department's efforts to curb fraud.
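As a concrete illustration of the kind of pattern screening Operation Mongoose performed, the following sketch flags transactions that match two of the suspicious patterns mentioned above: weekend purchases and repeated purchases from a single vendor. The data, column names and thresholds are hypothetical illustrations, not the Department's actual rules.

```python
import pandas as pd

# Hypothetical card-transaction data; a real system would scan millions of rows.
tx = pd.DataFrame({
    "cardholder": ["A", "A", "B", "B", "B", "C"],
    "vendor":     ["GolfPro", "GolfPro", "Casino Hotel", "OfficeMart", "OfficeMart", "OfficeMart"],
    "amount":     [250.0, 410.0, 180.0, 35.0, 42.0, 29.0],
    "date":       pd.to_datetime(["2004-03-06", "2004-03-13", "2004-04-02",
                                  "2004-04-05", "2004-04-06", "2004-04-07"]),
})

# Rule 1: purchases made on weekends (Saturday = 5, Sunday = 6).
weekend = tx["date"].dt.dayofweek >= 5

# Rule 2: many purchases from a single vendor by the same cardholder.
per_vendor = tx.groupby(["cardholder", "vendor"])["amount"].transform("count")
repeat_vendor = per_vendor >= 2   # illustrative threshold

flagged = tx[weekend | repeat_vendor]
print(flagged)
```

As the golf-equipment and casino examples show, rows flagged this way are only candidates for review, not proof of misuse.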
Create a Business Case Based on Case Histories to Justify Costs

The FAA Aircraft Accident Data Mining Project involved the Federal Aviation Administration hiring the MITRE Corporation to identify approaches it could use to mine volumes of aircraft accident data to detect clues about the causes of accidents and how those clues could help avert future crashes (Bloedorn, 2000). One significant data mining finding was that planes with instrument displays that can be viewed without requiring a pilot to look away from the windshield sustained less damage in runway accidents than planes without this feature. On the other hand, the government is careful about committing significant funds to data mining projects. "One of the problems is how do you prove that you kept the plane from falling out of the sky," said Trish Carbone, a technology manager at MITRE. It is difficult to justify data mining costs and relate them to benefits (Matthews, 2000). One way to justify a data mining program is to look at past successes in data mining. Historically, fraud detection has been the highest payoff in data mining, but other areas, such as sales and marketing in the private sector, have also benefited from the approach. Statistics (dollars recovered) from efforts such as these can be used to support future data mining projects.
Use Data Mining for Supporting Budgetary Requests

The Veterans Administration Demographics System predicts demographic changes based on patterns among its 3.6 million patients as well as data gathered from insurance companies. Data mining enables the VA to provide Congress with much more accurate budget requests. The VA spends approximately $19 billion a year to provide medical care to veterans. All government agencies such as the VA are under increasing scrutiny to prove that they are operating effectively and efficiently. This is particularly true as a driving force behind the President's Management Agenda (Executive Office of the President, 2002). For many, data mining is becoming the tool of choice to highlight good performance or dig out waste.

The United States Coast Guard developed an executive information system designed for managers to see what resources are available to them and better understand the organization's needs. It is also used to identify relationships between Coast Guard initiatives and seizures of contraband cocaine and to establish tradeoffs between the costs of alternative strategies. The Coast Guard has numerous databases whose contents overlap, and often only one employee understands each database well enough to extract information. In addition, field offices are organized geographically, but budgets are drawn up by programs that operate nationwide, so there is a disconnect between the organizational structure and the appropriations structure. The Coast Guard successfully used its data mining program to overcome these issues (Ferris, 2000).

The DOT Executive Reporting Framework (ERF) aims to provide complete, reliable and timely information in an environment that allows for the cross-cutting identification, analysis, discussion and resolution of issues. The ERF also manages grants and operations data. Grants data shows the taxes and fees that DOT distributes to the states for highway and bridge construction, airport development and transit systems. Operations data covers payroll, administrative expenses, travel, training and other operations costs. The ERF system accesses data from various financial and programmatic systems in use by the operating administrations. Before 1993 there was no financial analysis system to compare the department's budget with congressional appropriations, and no system to track performance against the budget. The ERF system changed this by tracking the budget and providing the ability to correct any areas that had been over-planned. Using ERF, adjustments were made within the quarter so the agency did not go over budget. ERF is being extended to manage budget projections, development and formulation. It can be used as a proactive tool that allows an agency to project ahead more dynamically. The system has improved financial accountability in the agency (Ferris, 1999).
Give Users Continual Computer-Based Training

The DoD Medical Logistics Support System (DMLSS) built a data warehouse with front-end decision support and data mining tools to help manage the growing costs of health care, enhance health care delivery in peacetime, and promote wartime readiness and sustainability. DMLSS is responsible for the supply of medical equipment and medicine worldwide for DoD medical care facilities. The system received recognition for reducing the inventory in its medical depot system by 80 percent and reducing supply request response time from 71 to 15 days (Government Computer News, 2001). One major challenge faced by the agency was the difficulty of keeping up with the training of users because of the constant turnover of military personnel. It was determined that there is a need to provide quality computer-based training on a continuous basis (Olsen, 1997).
Provide the Right Blend of Technology, Human Capital Expertise and Data Security Measures

The General Accounting Office used data mining to identify numerous instances of illegal purchases of goods and services from restaurants, grocery stores, casinos, toy stores, clothing retailers, electronics stores, gentlemen's clubs, brothels, auto dealers and gasoline service stations. This was all part of its effort to audit and investigate federal government purchase and travel card and related programs (General Accounting Office, 2003). Data mining goes beyond using the most effective technology and tools. There must be well-trained individuals involved who know about the processes, procedures and culture of the system being investigated. They need to understand the capabilities and limitations of data mining concepts and tools. In addition, these individuals must recognize the data security issues associated with the use of large, complex and detailed databases.
FUTURE TRENDS

Despite the privacy concerns, data mining continues to offer much potential for identifying waste and abuse, potential terrorist and criminal activity, and clues to improve efficiency and effectiveness within organizations. This approach will become more pervasive because of its integration with online analytical tools, the improved ease of use of data mining tools and the appearance of novel visualization techniques for reporting results. Another trend is the emergence of a new branch of data mining called text mining, which helps improve the efficiency of searching on the Web. This approach transforms textual data into a useable format that facilitates classifying documents, finding explicit relationships or associations among documents, and clustering documents into categories (SAS, 2004).
CONCLUSION

These lessons learned may or may not fit all environments due to cultural, social and financial considerations. However, the careful review and selection of relevant lessons learned could help an organization address its goals by improving its level of corporate knowledge. A decision maker needs to think "outside the box" and move away from traditional approaches to successfully implement and manage data mining programs. Data mining is a challenging but highly effective approach to improving business intelligence within one's domain.
REFERENCES

Bloedorn, E. (2000). Data mining for aviation safety. MITRE Publications.

Executive Office of the President, Office of Management and Budget. (2002). President's Management Agenda, Fiscal Year 2002.

Ferris, N. (1999). 9 hot trends for '99. Government Executive.

Ferris, N. (2000). Information is power. Government Executive.

General Accounting Office. (2003). Data mining: Results and challenges for government program audits and investigations. GAO-03-591T.
General Accounting Office. (2004). Data mining: Federal efforts cover a wide range of uses. GAO-04-584.

Gillmor, D. (2004). Data mining by government rampant. eJournal.

Government Computer News. (2001). Ten agencies honored for innovative projects. Government Computer News.

Hamblen, M. (1998). Pentagon to deploy huge medical data warehouse. Computer World.

Matthews, W. (2000). Digging digital gold. Federal Computer Week.

Miller, J. (2004). Lawmakers renew push for data-mining law. Government Computer News.

Olsen, F. (1997). Health record project hits pay dirt. Government Computer News.

SAS. (2004). SAS Text Miner.

Schwartz, A. (2000). Making the Web safe. Federal Computer Week.

Sullivan, A. (2004). U.S. government still data mining. Reuters.

KEY TERMS

Computer-Based Training: A recent approach involving the use of microcomputers, optical disks such as compact disks, and/or the Internet to address an organization's training needs.

Executive Information System: An application designed for top executives that often features a dashboard interface, drill-down capabilities and trend analysis.

Federal Government: The national government of the United States, established by the Constitution, which consists of the executive, legislative, and judicial branches. The head of the executive branch is the President of the United States. The legislative branch consists of the United States Congress, and the Supreme Court of the United States is the head of the judicial branch.

Legacy System: Typically, a database management system in which an organization has invested considerable time and money and which resides on a mainframe or minicomputer.

Logistics Support System: A computer package that assists in planning and deploying the movement and maintenance of forces in the military. The package deals with the design and development, acquisition, storage, movement, distribution, maintenance, evacuation and disposition of material; the movement, evacuation, and hospitalization of personnel; the acquisition or construction, maintenance, operation and disposition of facilities; and the acquisition or furnishing of services.

Purchase Cards: Credit cards used in the federal government by authorized government officials for small purchases, usually under $2,500.

Travel Cards: Credit cards issued to federal employees to pay for costs incurred on official business travel.

NOTE

The views expressed in this article are those of the author and do not reflect the official policy or position of the National Defense University, the Department of Defense or the U.S. Government.
Data Mining in the Soft Computing Paradigm

Pradip Kumar Bala
IBAT, Deemed University, India

Shamik Sural
Indian Institute of Technology, Kharagpur, India

Rabindra Nath Banerjee
Indian Institute of Technology, Kharagpur, India
INTRODUCTION

Data mining is a set of tools, techniques and methods that can be used to find new, hidden or unexpected patterns from a large volume of data typically stored in a data warehouse. Results obtained from data mining help an organization in more effective individual and group decision-making. Regardless of the specific technique, data mining methods can be classified by the function they perform or by their class of application. Association rule mining is a type of data mining that correlates one set of items or events with another set of items or events. It employs association or linkage analysis, searching transactions from operational systems for interesting patterns with a high probability of repetition. Classification techniques include mining processes intended to discover rules that define whether an item or event belongs to a particular predefined subset or class of data. This category of techniques is probably the most broadly applicable to different types of business problems. In some cases, it is difficult to define the parameters of a class of data to be analyzed. When parameters are elusive, clustering methods can be used to create partitions so that all members of each set are similar according to a specified set of metrics. Summarization describes a set of data in compact form. Regression techniques are used to predict a continuous value. The regression can be linear or non-linear, with one predictor variable or with more than one predictor variable, in which case it is known as multiple regression. Soft computing, which includes the application of fuzzy logic, neural networks, rough sets and genetic algorithms, is an emerging area in data mining. By studying combinations of variables and how different combinations affect data sets, we can develop a neural network, a non-linear predictive model that "learns." Machine learning techniques, such as genetic algorithms and fuzzy logic, can derive meaning from complicated and imprecise data. They can extract patterns from and detect trends within the data that
are far too complex to be noticed by either humans or more conventional automated analysis techniques. Because of this ability, neural computing and machine learning technologies demonstrate broad applicability in the world of data mining and, thus, to a wide variety of complex business problems. A rough set is the approximation of an imprecise and uncertain set by a pair of precise concepts, called the lower and upper approximations. Each soft computing technique addresses problems in its domain using a distinct methodology. However, these techniques are not substitutes for each other. In fact, soft computing tools work in a cooperative manner, rather than being competitive. This has led to the development of hybrid combinations of soft computing tools for data mining applications (Mitra et al., 2002). It should, however, be kept in mind that soft computing techniques have traditionally been developed to handle small data sets. Extending the soft computing paradigm for processing large volumes of data is itself a challenging task. In the next section, we give a brief background of the various soft computing techniques.
BACKGROUND

Fuzzy rules offer an attractive trade-off between the need for accuracy and compactness on one hand, and scalability on the other, when reasoning systems within a particular knowledge domain become quite complex. Fuzzy rules generalize the concept of categorization because, by definition, the same object can belong to multiple sets with different degrees of membership. In this sense, fuzzy logic eliminates the problems associated with borderline cases, where, for example, a degree of membership of 0.9 may cause a rule to fire but a value of 0.899 may not. The net result is that fuzzy systems tend to provide greater accuracy than traditional rule-based systems when continuous variables are involved. A neural network, which draws its inspiration from neuroscience, attempts to mirror the way a human brain works
in recognizing patterns by developing mathematical structures with the ability to learn. An artificial neural network (ANN) learns through training. These are simple computer-based programs whose primary function is to construct models of a problem space based on trial and error. The process of training a neural net to associate certain input patterns with correct output responses involves the use of repetitive examples and feedback, much like the training of a human being.

Rough set theory finds application in studying imprecision, vagueness, and uncertainty in data analysis and is based on the establishment of equivalence classes within a given training data set. A rough set gives an approximation of a vague concept by two precise concepts, called the lower and upper approximations. These two approximations are a classification of the domain of interest into disjoint categories. The lower approximation is a description of the domain objects known with certainty to belong to the subset of interest, and the upper approximation is a description of the objects that may possibly belong to the subset.

Genetic algorithms (GAs) are computational models used in efficient and global search methods for optimality in problem solving. These search algorithms are based on the mechanics of natural genetics combined with Darwin's theory of "survival of the fittest" and are particularly suitable for solving complex optimization problems as well as applications that require adaptive problem-solving strategies. In data mining, GAs find application in hypothesis testing and refinement. With this background, we next present how soft computing techniques can be applied to specific data mining problems.
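To make the genetic-algorithm mechanics described above concrete (selection of the fittest, crossover and mutation over successive generations), here is a minimal sketch that maximizes a toy fitness function. The fitness function, population size and rates are arbitrary illustrative choices and are not taken from any work cited in this article.

```python
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40
MUTATION_RATE = 0.02

def fitness(genome):
    # Toy objective: count of 1-bits in the genome ("OneMax").
    return sum(genome)

def tournament(pop):
    # Survival of the fittest: keep the better of two random individuals.
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    point = random.randrange(1, GENOME_LEN)   # single-point crossover
    return p1[:point] + p2[point:]

def mutate(genome):
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(tournament(population), tournament(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best fitness:", fitness(best))
```

In a data mining setting, the bit string would typically encode a candidate rule or attribute subset and the fitness function would score its accuracy or interestingness.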
MAIN THRUST

Association Rule Mining: Following are some of the applications of soft computing tools in association rule mining.

• Fuzzy Logic: A generalized association rule may involve binary, quantitative or categorical data and hierarchical relations. In quantitative or categorical association rule mining, irrespective of the methodology used, "sharp boundaries" remain a problem that under-estimates or over-emphasizes the elements near the boundaries. This may, therefore, lead to an inaccurate representation of semantics. To deal with the problem, fuzzy sets and fuzzy items, usually in the form of labels or linguistic terms, are used and defined on the domains (Chien et al., 2001). In the fuzzy framework, conventional notions of support and confidence can be extended as well. The partial belongingness of an item in a subset is taken into account while computing the degree of support and the degree of confidence. The measures are similar in spirit to the count operator used for fuzzy cardinality. Subsequently, with these extended measures incorporated, several mining algorithms have been developed (Gyenesei, 2000; Gyenesei & Teuhola, 2001; Shu et al., 2001). Instead of dividing quantitative attributes into fixed intervals, linguistic terms can be used to represent the regularities and exceptions in the way humans perceive reality. Chen et al. (2002) have developed an algorithm for fuzzy association rules that deals with partitioning quantitative data domains. Wei & Chen (1999) extended generalized association rules with fuzzy taxonomies, by which partial belongings could be incorporated. Furthermore, a recent effort has been made to incorporate linguistic hedges on existing fuzzy taxonomies (Chen et al., 1999; Chen et al., 2002a). Several fuzzy extensions have also been made to interestingness measures. A measure called "Interestingness Degree" has been proposed which can be seen as the increase in probability of an event Y caused by the occurrence of another event X. Attempts have been made to introduce thresholds for filtering databases in dealing with very low membership degrees (Hullermeier, 2001).

• Genetic Algorithm: Min et al. (2001) have used a GA-based data mining approach in e-commerce to find association rules of IF-THEN form for adopters and non-adopters of e-purchasing. Association rules in IF-THEN form can also be mined in this way, providing a high degree of accuracy and coverage (Lopes et al., 1999).
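As a concrete illustration of the extended measures discussed under Fuzzy Logic above, the following sketch computes fuzzy support and confidence using a sigma-count (sum of membership degrees) with min as the conjunction. The transactions, membership values and the example rule are invented for illustration and do not reproduce any of the cited algorithms.

```python
# Each transaction stores the degree (0..1) to which an item applies to it,
# e.g. how strongly "income" is "high". Values are made up for illustration.
transactions = [
    {"high_income": 0.9, "high_spending": 0.8},
    {"high_income": 0.4, "high_spending": 0.3},
    {"high_income": 0.7, "high_spending": 0.9},
    {"high_income": 0.1, "high_spending": 0.2},
]

def fuzzy_support(items):
    # Sigma-count: sum over transactions of the combined membership degree,
    # using min as the t-norm for the conjunction of the items.
    total = sum(min(t[i] for i in items) for t in transactions)
    return total / len(transactions)

def fuzzy_confidence(antecedent, consequent):
    return fuzzy_support(antecedent + consequent) / fuzzy_support(antecedent)

print("support:", fuzzy_support(["high_income", "high_spending"]))
print("confidence:", fuzzy_confidence(["high_income"], ["high_spending"]))
```

Setting every membership degree to 0 or 1 recovers the ordinary crisp support and confidence, which is why these measures are described as generalizations of the conventional ones.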
Clustering: Following are some of the applications of soft computing tools in clustering.

• Fuzzy Logic: A fuzzy clustering algorithm attempts to group prospects into categories based on their identifying characteristics. For example, for the prospective customers of any business, the key attributes can include geographic data, psychographic data and others. Clusters expressed in linguistic terms can be easily handled using fuzzy sets. Using fuzzy sets, we can also find dependencies between data expressed in qualitative form. The use of fuzzy logic can help avoid searching for less important, trivial or meaningless patterns in databases. Fuzzy clustering algorithms have been developed for mining a telecommunications customer and prospect database to gain customer information for deciding a marketing strategy (Russell et al., 1999).

• Neural Network: The Self-Organizing Map (SOM) is one of the most widely used unsupervised neural network models that employ competitive learning steps. An important data mining task is organizing data points with single or multiple dimensions into their natural clusters. Kohonen et al. (2000) have demonstrated the applicability of the self-organizing map where large data sets can be partitioned in stages. A two-phase method can be used where, in the first phase, the step-wise strategy of the SOM is used, and then, in the second phase, the resulting prototypes of the SOM are clustered by an agglomerative clustering method or by k-means clustering (Vesanto et al., 2000). A dimension-independent algorithm has also been developed which allows hierarchical clustering of SOMs based on a spread factor; this can be used as a controlling measure for generating maps with different dimensionality (Alahakoon et al., 2000).

Classification and Rule Extraction: Following are some of the applications of soft computing tools in classification and rule extraction.

• Neural Network: For the purpose of classification or rule extraction, an ANN is used in the supervised learning paradigm. The most common supervised learning paradigm is error back-propagation. Here, a neural network receives an input example, generates a guess, and compares that guess with the expected result. The error between the guess and the desired result is fed back to improve the guess in an iterative manner. In this sense, the ANN is being supervised by the feedback, which shows the network where it made mistakes and how the correct result should look. The most common form of the back-propagation algorithm uses a sum-of-squared-errors approach to generate an aggregate measure of the error. A general framework for classification rule mining, called NEUCRUM (NEUral Classification RUle Mining), has been developed. It has two components: one is a specific neural classifier named FANNC and the other is a novel rule extraction approach named STARE (Zhou et al., 2000).

• Rough Set: Piasta (1999) has presented an approach called the ProbRough system to analyze business databases based on rule induction. The ProbRough system can induce decision rules from databases with a very high number of objects and attributes. Based on rough set theory, another approach for the selection of attributes for the construction of decision trees has been developed (Wei, 2003). According to Han & Kamber (2003), rough set theory can be applied for classification to discover structural relationships in imprecise or noisy data. It can be applied to discrete-valued attributes and hence, continuous-valued attributes must be discretized prior to its use. A classifier can be trained using a rough set learning algorithm for rule extraction in IF-THEN form from a decision table.
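The lower and upper approximations used by rough-set classifiers are straightforward to compute once the equivalence classes induced by the condition attributes are formed. The sketch below does this for a tiny, made-up decision table; it is not the ProbRough system or any other cited method, just the basic construction.

```python
from collections import defaultdict

# Toy decision table: (condition attributes) -> decision class. Values invented.
table = [
    ({"credit": "good", "income": "high"}, "approve"),
    ({"credit": "good", "income": "high"}, "approve"),
    ({"credit": "good", "income": "low"},  "approve"),
    ({"credit": "good", "income": "low"},  "reject"),   # inconsistent with the previous row
    ({"credit": "bad",  "income": "low"},  "reject"),
]

# Equivalence classes: objects indistinguishable on the condition attributes.
classes = defaultdict(list)
for idx, (cond, _) in enumerate(table):
    classes[tuple(sorted(cond.items()))].append(idx)

target = {i for i, (_, d) in enumerate(table) if d == "approve"}

lower = set()   # certainly "approve": the whole equivalence class lies inside the target
upper = set()   # possibly "approve": the equivalence class overlaps the target
for members in classes.values():
    m = set(members)
    if m <= target:
        lower |= m
    if m & target:
        upper |= m

print("lower approximation:", sorted(lower))   # [0, 1]
print("upper approximation:", sorted(upper))   # [0, 1, 2, 3]
```

Objects in the upper but not the lower approximation (here, the inconsistent good-credit, low-income rows) form the boundary region for which only uncertain rules can be induced.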
Neuro-Fuzzy Computing: Neuro-fuzzy computing combines the strong features of the neural network and fuzzy approaches. To deal with numerical and linguistic data and granular knowledge, a granular neural network can be designed (Zhang et al., 2000). High-level granular knowledge in the form of rules is generated by compressing the low-level granular data. An algorithm has been developed to mine mixed fuzzy rules involving both numeric and categorical attributes. Fuzzy set-driven computational techniques for data mining have been discussed by Pedrycz (1998), establishing the relationship between data mining and fuzzy modeling.

Other Hybrid Approaches: A hybrid prediction system based on a neural network, with its learning based on memory-based reasoning, can be designed for classification. It can also learn the dynamic behavior of a system over a period of time. It has been established by experimentation that such a hybrid system has high potential in solving data mining problems (Shin et al., 2000). The concepts of fuzzy logic and rough sets can be applied in a Multilayer Perceptron (MLP) neural network to extract rules from crude domain knowledge. The appropriate number of hidden nodes is determined automatically, and the dependency factors are used in the initial weight encoding.

Other Data Mining Applications: The soft computing paradigm has also been extended to some other types of data mining, as discussed below.

Genetic algorithms have been used in regression analysis. One basic assumption in traditional regression models is that there is no interaction amongst the attributes. To learn non-linear multi-regression from a set of training data, an adaptive GA can be used; a genetic algorithm can handle attribute interactions efficiently. GAs have also been used to discover interesting rules in a dependency-modeling task (Noda et al., 1999). The system developed by Shin et al. (2000) for classification, as discussed above, can be applied to regression analysis as well.

Fuzzy functional dependencies are extensions of classical functional dependencies, aimed at dealing with fuzziness in databases and reflecting the semantics that close values of a collection of certain items depend on close values of a collection of different items. Generally, fuzzy functional dependencies take different forms depending on the different aspects of integrating fuzzy logic into classical functional dependencies. Fuzzy inference generalizes both imprecise and precise inference. An attempt has been made by Yang & Singhal (2001) to develop a framework that links fuzzy functional dependencies and fuzzy association rules more closely.

Discovering relationships among time series is an interesting application, since time series patterns reflect the evolution of changes in item values with sequential factors like time. The value of each time series item is viewed as a pattern over time, and the similarity between any two patterns is measured by pattern matching. Chen et al. (2001) have presented a method based on Dynamic Time Warping (DTW) to discover pattern associations.

Summarization is one of the major components of data mining. Lee & Kim (1997) have proposed an interactive top-down summary discovery process which utilizes fuzzy ISA hierarchies as domain knowledge. They have defined a generalized tuple as a representational form of a database summary including fuzzy concepts. By virtue of fuzzy ISA hierarchies, where fuzzy ISA relationships common in actual domains are naturally expressed, the discovery process comes up with more accurate database summaries. They have also presented an "informativeness" measure, based on Shannon's information theory, for distinguishing generalized tuples that deliver more information to users.

A GA-based approach can also be used for discovering temporal trends by synthesizing Bayesian networks (Novobilski & Kamangar, 2002). Finally, Bonaventura et al. (2003) have developed a hybrid model for predicting the linguistic origin of surnames: a neural network module combines the results provided by a lexical rule module and a statistical module and is used to compute the evidence for the classes.
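Because Dynamic Time Warping is central to the time-series application cited above (Chen et al., 2001), here is a minimal sketch of the standard DTW distance computed by dynamic programming over two short, made-up sequences. It illustrates the general technique rather than the specific method in that work.

```python
def dtw_distance(a, b):
    """Classic DTW: cost[i][j] = |a[i]-b[j]| + min of the three predecessor cells."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# Two toy series with the same shape but shifted in time: DTW stays small
# even though a point-by-point (Euclidean) comparison would not.
print(dtw_distance([1, 2, 3, 4, 3, 2], [1, 1, 2, 3, 4, 3, 2]))
```

Pairs of time series whose DTW distance falls below a chosen threshold can then be treated as associated patterns in the sense described above.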
FUTURE DIRECTIONS

In this paper, we have discussed various soft computing methods used in data mining. Our focus has been on four primary soft computing techniques, namely, fuzzy logic, neural network, rough set and genetic algorithm. Although these techniques have not yet attained maturity to the extent the conventional data mining techniques have, it is expected that these techniques will mature well enough to be dealt with as independent areas of data mining very soon.
CONCLUSION

Hybridization of fuzzy, neural and genetic algorithms in solving data mining problems seems to be an upcoming area in the field of data mining. However, more attention needs to be given to improving the efficiency of these soft computing techniques when they are applied to data mining problems. Processing terabytes of data with a reasonable response time is a primary requirement of any data mining algorithm. Since soft computing techniques typically require more processing than traditional techniques, it remains to be seen how well they can be adapted to the interesting and challenging field of data mining.
REFERENCES

Alahakoon, D., Halgamuge, S.K., & Srinivasan, B. (2000). Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Transactions on Neural Networks, 11(3), 601-614.

Bonaventura, P., Marco, G., Marco, M., Franco, S., & Sheng, J. (2003). A hybrid model for the prediction of the linguistic origin of surnames. IEEE Transactions on Knowledge and Data Engineering, 15(3), 760-763.

Chen, G.Q., Wei, Q., & Kerre, E.E. (1999). Fuzzy data mining: Discovery of fuzzy generalized association rules. In Recent Research Issues on Management of Fuzziness in Databases. Berlin: Springer-Verlag.

Chen, G.Q., Yan, P., & Kerre, E.E. (2002a). Mining fuzzy implication-based association rules in quantitative databases. In International FLINS Conference on Computational Intelligent Systems for Applied Research, Belgium.

Chen, G.Q., Wei, Q., Liu, D., & Wets, G. (2002). Simple association rules (SAR) and the SAR-based rule discovery. Journal of Computer & Industrial Engineering, 43, 721-733.

Chen, G.Q., Wei, Q., & Zhang, H. (2001). Discovering similar time-series patterns with fuzzy clustering and DTW methods. In International Fuzzy Systems Association Conference, Vancouver, BC, Canada.

Chien, B.C., Lin, Z.L., & Hong, T.P. (2001). An efficient clustering algorithm for mining fuzzy quantitative association rules. In Ninth International Fuzzy Systems Association World Congress (pp. 1306-1311), Vancouver, Canada.

Gyenesei, A. (2000). A fuzzy approach for mining quantitative association rules. TUCS Technical Reports 336. Department of Computer Science, University of Turku, Finland.

Gyenesei, A., & Teuhola, J. (2001). Interestingness measures for fuzzy association rules. In Principles and Practice of Knowledge Discovery in Databases, Freiburg, Germany.
Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.

Hullermeier, E. (2001). Fuzzy association rules: Semantics issues and quality measures. In Lecture Notes in Computer Science 2206 (pp. 380-391). Berlin & Heidelberg: Springer.

Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574-585.

Lee, D.H., & Kim, M.H. (1997). Database summarization using fuzzy ISA hierarchies. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 27(1).

Lopes, C., Pacheco, M., Vellasco, M., & Passos, E. (1999). Rule-Evolver: An evolutionary approach for data mining. In Seventh International Workshop on Rough Sets, Fuzzy Sets, Data Mining and Granular-Soft Computing (pp. 458-462), Yamaguchi, Japan.

Min, H., Smolinski, T., & Boratyn, G.A. (2001). A genetic algorithm-based data mining approach to profiling the adopters and non-adopters of e-purchasing. In Third International Conference on Information Reuse and Integration, Las Vegas, USA.

Mitra, S., Pal, S.K., & Mitra, P. (2002). Data mining in soft computing framework: A survey. IEEE Transactions on Neural Networks, 13, 3-14.

Noda, E., Freitas, A.A., & Lopes, H.S. (1999). Discovering interesting prediction rules with a genetic algorithm. In IEEE Congress on Evolutionary Computing (pp. 1322-1329).

Novobilski, A., & Kamangar, F. (2002). A genetic algorithm based approach for discovering temporal trends using Bayesian networks. In Sixth World Conference on Systemics, Cybernetics, and Informatics.

Pedrycz, W. (1998). Fuzzy set technology in knowledge discovery. Fuzzy Sets and Systems, 98, 279-290.

Piasta, Z. (1999). Analyzing business databases with the ProbRough rule induction system. In Workshop on Data Mining in Economics, Marketing and Finance (pp. 22-29), Chania, Greece.

Russell, S., & Lodwick, W. (1999). Fuzzy clustering in data mining for telco database marketing campaigns. In North Atlantic Fuzzy Information Processing Symposium (pp. 720-726), New York.

Shin, C.K., Tak Yun, U., Kang Kim, H., & Chan Park, S. (2000). A hybrid approach of neural network and memory-based learning to data mining. IEEE Transactions on Neural Networks, 11(3), 637-646.
Shu, J.Y., Tsang, E.C.C., & Yeung, D.S. (2001). Query fuzzy association rules in relational database. In Ninth International Fuzzy Systems Association World Congress, Vancouver, Canada.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wei, J.M. (2003). Rough set based approach to selection of node. International Journal of Computational Cognition, 1(2).

Wei, Q., & Chen, G.Q. (1999). Mining generalized association rules with fuzzy taxonomic structures. In Eighteenth International Conference of North Atlantic Fuzzy Information Processing Systems (pp. 477-481), New York, NY, USA.

Yang, Y., & Singhal, M. (2001). Fuzzy functional dependencies and fuzzy association rules. In First International Conference on Data Warehouse and Knowledge Discovery (pp. 229-240), Florence, Italy.

Zhang, Y.Q., Fraser, M.D., Gagliano, R.A., & Kandel, A. (2000). Granular neural networks for numerical-linguistic data fusion and knowledge discovery. IEEE Transactions on Neural Networks, 11, 658-667.

Zhou, Z.H., Yuan, J., & Chen, S.F. (2000). A general neural framework for classification rule mining. International Journal of Computers, Systems, and Signals, 1(2), 154-168.
KEY TERMS

Data Mining: A set of tools, techniques and methods used to find new, hidden or unexpected patterns from a large collection of data typically stored in a data warehouse.

Fuzzy Set: A set that captures the different degrees of belongingness of different objects in the universe instead of a sharp demarcation between objects that belong to a set and those that do not.

Genetic Algorithm: A search and optimization technique that uses the concept of survival of genetic materials over various generations of populations, much like the theory of natural evolution.

Hybrid Technique: A combination of two or more soft computing techniques used for data mining. Examples are neuro-fuzzy, neuro-genetic, and so on.

Neural Network: A connectionist model that can be trained in supervised or unsupervised mode for learning patterns in data.
Rough Set: A method of modeling impreciseness and vagueness in data through two sets representing the upper bound and lower bound of the data set.
Soft Computing: Collection of methods and techniques like fuzzy set, neural network, rough set and genetic algorithm for solving complex real world problems.
Data Mining Medical Digital Libraries

Colleen Cunningham
Drexel University, USA

Xiaohua Hu
Drexel University, USA
INTRODUCTION

Given the exponential growth rate of medical data and the accompanying biomedical literature, more than 10,000 documents per week (Leroy et al., 2003), it has become increasingly necessary to apply data mining techniques to medical digital libraries in order to assess a more complete view of genes, their biological functions and diseases. Data mining techniques, as applied to digital libraries, are also known as text mining.

BACKGROUND

Text mining is the process of analyzing unstructured text in order to discover information and knowledge that are typically difficult to retrieve. In general, text mining involves three broad areas: Information Retrieval (IR), Natural Language Processing (NLP) and Information Extraction (IE). Each of these areas is defined as follows:

• Natural Language Processing: a discipline that deals with various aspects of automatically processing written and spoken language.
• Information Retrieval: a discipline that deals with finding documents that meet a set of specific requirements.
• Information Extraction: a sub-field of NLP that addresses finding specific entities and facts in unstructured text.

MAIN THRUST

The current state of text mining in digital libraries is provided in order to facilitate continued research, which subsequently can be used to develop large-scale text mining systems. Specifically, an overview of the process, recent research efforts and practical uses of mining digital libraries, future trends and conclusions are presented.

Text Mining Process

Text mining can be viewed as a modular process that involves two modules: an information retrieval module and an information extraction module. Figure 1 presents the relationship between the modules and the relationships between the phases within the information retrieval module. The former module involves using NLP techniques to pre-process the written language and using techniques for document categorization in order to find relevant documents. The latter module involves finding specific and relevant facts within text. NLP consists of three distinct phases: (1) tokenization, (2) parts of speech (PoS) tagging and (3) parsing. In the tokenization step, the text is decomposed into its subparts, which are subsequently tagged during the second phase with the part of speech that each token represents (e.g., noun, verb, adjective, etc.). It should be noted that generating the rules for PoS tagging is a very manual and labor-intensive task. Typically, the parsing phase utilizes shallow parsing in order to group syntactically related words together, because full parsing is both less efficient (i.e., very slow) and less accurate (Shatkay & Feldman, 2003). Once the documents have been pre-processed, they can be categorized. There are two approaches to document categorization: Knowledge Engineering (KE) and Machine Learning (ML). Knowledge Engineering requires the user to manually define rules, which can consequently be used to categorize documents into specific pre-defined categories. Clearly, one of the drawbacks of KE is the time that it would take a person (or group of people) to manually construct and maintain the rules. ML, on the other hand, uses a set of training documents to learn the rules for classifying documents. Specific ML techniques that have successfully been used to categorize text documents include, but are not limited to, Decision Trees, Artificial Neural Networks, Nearest Neighbor and Support Vector Machines (SVM) (Stapley et al., 2002). Once the documents have been categorized, documents that satisfy specific search criteria can be retrieved.
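A toy version of the tokenization, part-of-speech tagging and shallow-parsing phases described above is sketched below. The regular-expression tokenizer, the mini-lexicon and the noun-phrase grouping rule are stand-ins for illustration only; production systems rely on trained taggers and chunkers rather than anything this simple.

```python
import re

# Phase 1: tokenization -- split the text into word and punctuation tokens.
def tokenize(text):
    return re.findall(r"[A-Za-z0-9']+|[.,;:()]", text)

# Phase 2: PoS tagging -- a hypothetical mini-lexicon; unknown words default
# to nouns, which is obviously far cruder than a trained tagger.
LEXICON = {"inhibits": "VERB", "binds": "VERB", "the": "DET", "a": "DET",
           "strongly": "ADV", "protein": "NOUN", "gene": "NOUN"}

def tag(tokens):
    return [(t, LEXICON.get(t.lower(), "NOUN" if t[0].isalnum() else "PUNCT"))
            for t in tokens]

# Phase 3: shallow parsing -- group maximal runs of DET/NOUN tokens into
# noun-phrase chunks instead of building a full parse tree.
def chunk_noun_phrases(tagged):
    phrases, current = [], []
    for token, pos in tagged:
        if pos in ("DET", "NOUN"):
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tagged = tag(tokenize("The BRCA1 protein binds the p53 gene product."))
print(chunk_noun_phrases(tagged))   # ['The BRCA1 protein', 'the p53 gene product']
```

The extracted noun-phrase chunks are the kind of candidate entities (genes, proteins) that the information extraction module then tries to relate to one another.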
Figure 1. Overview of the text mining process (diagram not reproduced). The figure shows the information retrieval module, in which NLP techniques pre-process written language through a tokenization phase, a part-of-speech phase and a parsing phase; document categorization techniques (knowledge engineering or machine learning techniques such as decision trees, artificial neural networks, nearest neighbor and support vector machines); and document retrieval techniques (Boolean or vector techniques). The retrieved documents are then passed to the information extraction module.
There are several techniques for retrieving documents that satisfy specific search criteria. The Boolean approach returns documents that contain the terms (or phrases) contained in the search criteria, whereas the vector approach returns documents based upon the term frequency-inverse document frequency (TF x IDF) weights of the term vectors that represent the documents. Variations of clustering and clustering ensemble algorithms (Iliopoulos et al., 2001; Hu, 2004), classification algorithms (Marcotte et al., 2001) and co-occurrence vectors (Stephens et al., 2001) have been successfully used to retrieve related documents. An important point to mention is that the terms used to represent the search criteria, as well as the terms used to represent the documents, are critical to successfully and accurately returning related documents. However, terms often have multiple meanings (i.e., polysemy) and multiple terms can have the same meaning (i.e., synonyms). This represents one of the current issues in text mining, which will be discussed in the next section. The last part of the text mining process is information extraction, of which the most popular technique is co-occurrence (Blaschke & Valencia, 2002; Jenssen et al., 2001). There are two disadvantages to this approach, each of which creates opportunities for further research. First, this approach depends upon assumptions regarding sentence structure, entity names, and so forth that do not always hold true (Pearson, 2001). Second, this approach relies heavily on the completeness of the list of gene names and synonyms. Table 1 summarizes the modular process of text mining.
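The contrast between the Boolean and vector (TF x IDF) retrieval approaches can be shown in a few lines. The sketch below builds TF x IDF vectors for three made-up abstracts and ranks them against a query by cosine similarity; the documents, the smoothing of the IDF term and the query are illustrative choices, not those of any system cited here.

```python
import math
from collections import Counter

docs = ["brca1 protein binds dna repair genes",
        "p53 gene regulates cell cycle arrest",
        "dna repair pathway involves brca1 and p53"]
tokenized = [d.split() for d in docs]

def tf_idf_vector(tokens, corpus):
    tf = Counter(tokens)
    n = len(corpus)
    vec = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)       # document frequency
        vec[term] = (count / len(tokens)) * math.log((1 + n) / (1 + df))
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

doc_vecs = [tf_idf_vector(t, tokenized) for t in tokenized]
query_terms = ["brca1", "dna", "repair"]
query_vec = tf_idf_vector(query_terms, tokenized)

# Boolean retrieval: every query term must appear in the document.
boolean_hits = [i for i, t in enumerate(tokenized) if all(q in t for q in query_terms)]
# Vector retrieval: rank all documents by cosine similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
print("Boolean hits:", boolean_hits, "| vector ranking:", ranking)
```

The Boolean list is unranked and all-or-nothing, while the vector ranking degrades gracefully when only some query terms match, which is why vector-based retrieval is usually preferred for literature search.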
Research to Address Issues in Mining Digital Libraries

The issues in mining digital libraries, specifically medical digital libraries, include scalability, ambiguous English and biomedical terms, non-standard terms and structure, and inconsistencies between medical repositories (Shatkay & Feldman, 2003). Most of the current text mining research focuses on automating information extraction (Shatkay & Feldman, 2003). The scalability of text mining approaches is of concern because of the rapid rate of growth of the literature. As such, while most of the existing methods have been applied to relatively small sample sets, there has been an increase in the number of studies focused on scaling techniques to apply to large collections (Pustejovsky et al., 2002; Jenssen et al., 2002). One exception to this is the study by Jenssen et al. (2001), in which the authors used a predefined list of genes to retrieve all related abstracts from PubMed that contained the genes on the predefined list.
Table 1. General text mining process

IR. General purpose: identify and retrieve relevant documents. General issues: aliases, synonyms and homonyms. General solutions: controlled vocabularies (ontologies), e.g., gene terms and synonyms from LocusLink (Pruitt & Maglott, 2001), SwissProt (Boeckmann et al., 2003), HUGO (HUGO, 2003) and the National Library of Medicine's MeSH (NLM, 2003).

IE. General purpose: find specific and relevant facts within text, e.g., find specific gene entities (entity extraction) and relationships between specific genes (relationship extraction). General issues: aliases, synonyms and homonyms. General solutions: controlled vocabularies (ontologies).

Since mining digital libraries relies heavily on the ability to accurately identify terms, the issues of ambiguous terms, special jargon and the lack of naming conventions are not trivial. This is particularly true in the case of digital libraries where the issue is further compounded by non-standard terms. In fact, a lot of effort has been dedicated to building ontologies to be used in conjunction with text mining techniques (Boeckmann et al., 2003; HUGO, 2003; Liu et al., 2001; NLM, 2003; Oliver et al., 2002; Pruitt & Maglott, 2001; Pustejovsky et al., 2002). Manually building and maintaining ontologies, however, is a time-consuming effort. In light of that, there have been several efforts to find ways of automatically extracting terms to incorporate into and build ontologies (Nenadic et al., 2002; Ono et al., 2001). The ontologies are subsequently used to match terms. For instance, Nenadic et al. (2002) developed the Tagged Information Management System (TIMS), which is an XML-based Knowledge Acquisition system that uses ontology for information extraction over large collections.

Uses of Text Mining in Medical Digital Libraries

There are many uses for mining medical digital libraries that range from generating hypotheses (Srinivasan, 2004)
to discovering protein associations (Fu et al., 2003). For instance, Srinivasan (2004) developed MeSH-based text mining methods that generate hypotheses by identifying potentially interesting terms related to specific input. Further examples include, but are not limited to: uncovering uses for thalidomide (Weeber et al., 2003), discovering functional connections between genes (Chaussabel & Sher, 2002) and identifying viruses that could be used as biological weapons (Swanson et al., 2001). Table 2 summarizes some of the recent uses of text mining in medical digital libraries.
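Because co-occurrence underlies several of the applications summarized in Table 2 (e.g., building gene networks and discovering gene interactions), the following sketch counts how often pairs of gene names appear in the same abstract. The gene list and abstracts are invented for illustration, and the counting is far simpler than the cited systems.

```python
from itertools import combinations
from collections import Counter

genes = {"brca1", "p53", "mdm2"}          # hypothetical predefined gene list
abstracts = [
    "brca1 and p53 interact in the dna damage response",
    "mdm2 is a negative regulator of p53",
    "expression of brca1 correlates with p53 status",
]

pair_counts = Counter()
for text in abstracts:
    mentioned = sorted(genes & set(text.split()))
    for pair in combinations(mentioned, 2):
        pair_counts[pair] += 1

# Gene pairs that co-occur in at least one abstract form edges of a gene network.
print(pair_counts.most_common())   # [(('brca1', 'p53'), 2), (('mdm2', 'p53'), 1)]
```

The two disadvantages noted earlier are visible even here: the exact-match lookup misses synonyms and aliases, and co-occurrence alone says nothing about the type of relationship between the genes.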
Table 2. Some uses of text mining medical digital libraries

Building gene networks. Technique: co-occurrence. Supporting literature: Jenssen et al., 2001; Stapley & Benoit, 2000; Adamic et al., 2002.

Discovering protein associations. Technique: unsupervised cluster learning and vector classification. Supporting literature: Fu et al., 2003.

Discovering gene interactions. Technique: co-occurrence. Supporting literature: Stephens et al., 2001; Chaussabel & Sher, 2002.

Discovering uses for thalidomide. Technique: mapping phrases to UMLS concepts. Supporting literature: Weeber et al., 2001.

Extracting and combining relations. Technique: rule-based parser and co-occurrence. Supporting literature: Leroy et al., 2003.

Generating hypotheses. Technique: incorporating ontologies (e.g., mapping terms to MeSH). Supporting literature: Srinivasan, 2004; Weeber et al., 2003.

Identifying biological virus weapons. Supporting literature: Swanson et al., 2001.

FUTURE TRENDS

The large volume of genomic data and the accompanying literature resulting from the Human Genome Project are expected to continue to grow. As such, there will be a continued need for research to develop scalable and effective data mining techniques that can be used to analyze the growing wealth of biomedical data. Additionally, given the importance of gene names in the context of mining biomedical literature and the fact that there are a number of medical sources that use different naming conventions and structures, research to further develop ontologies will play an important part in mining medical digital libraries. Finally, it is worth mentioning that there has been some effort to link the unstructured text documents within medical digital libraries with their related structured data in data repositories.
CONCLUSION

Given the practical applications of mining digital libraries and the continued growth of available data, mining digital libraries will continue to be an important area that will help researchers and practitioners gain invaluable and undiscovered insights into genes, their relationships, biological functions, diseases and possible therapeutic treatments.
REFERENCES

Blaschke, C., & Valencia, A. (2002). The frame-based module of the SUISEKI information extraction system. IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology, 17(2), 14-20.

Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., & Schneider, M. (2003). The SWISS-PROT Protein Knowledgebase and its Supplement TrEMBL in 2003. Nucleic Acids Research, 31(1), 365-370.

Chaussabel, D., & Sher, A. (2002). Mining microarray expression data by literature profiling. Genome Biology, 3(10), research0055.1-0055.16.

De Bruijn, B., & Martin, J. (2002). Getting to the core of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18.

Fu, Y., Mostafa, J., & Seki, K. (2003). Protein association discovery in biomedical literature. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 113-115).

Hu, X. (2004). Integration of cluster ensemble and text summarization for gene expression. In Proceedings of the 2004 IEEE Symposium of Bioinformatics and Bioengineering.

HUGO. (2003). HUGO (The Human Genome Organization) Gene Nomenclature Committee. Retrieved from http://www.gene.ucl.ac.uk/nomenclature

Iliopoulos, I., Enright, A.J., & Ouzounis, C.A. (2001). Textquest: Document clustering of Medline abstracts for concept discovery in molecular biology. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 384-395).

Jenssen, T.K., Laegrid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1), 21-28.

Leroy, G., Chen, H., Martinez, J.D., Eggers, S., Flasey, R.R., Kislin, K.L., Huang, Z., Li, J., Xu, J., McDonald, D.M., & Ng, G. (2003). Genescene: Biomedical text and data mining. In Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 116-118).

Liu, H., Lussier, Y.A., & Friedman, C. (2001). Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method. Journal of Biomedical Informatics, 34(4), 249-261.

Marcotte, E.M., Xenarios, I., & Eisenberg, D. (2001). Mining literature for protein-protein interactions. Bioinformatics, 17(4), 359-363.

NLM. (2003). MeSH: Medical Subject Headings. Retrieved from http://www.nlm.nih.gov/mesh/

Oliver, D.E., Rubin, D.L., Stuart, J.M., Hewett, M., Klein, T.E., & Altman, R.B. (2002). Ontology development for a pharmacogenetics knowledge base. Proceedings of Pacific Symposium on Biocomputing (PSB)-2002, 7, 65-76.

Pearson, H. (2001). Biology's name game. Nature, 411(6838), 631-632.

Pruitt, K.D., & Maglott, D.R. (2001). RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Research, 29(1), 137-140.

Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: Extracting inhibit relations. Proceedings of Pacific Symposium on Biocomputing (PSB)-2002, 7, 362-373.

Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855.

Srinivasan, P. (2004). Text mining: Generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396-413.

Stapley, B.J., Kelley, L.A., & Sternberg, M.J. (2002). Predicting the sub-cellular location of proteins from text using support vector machines. Proceedings of the Pacific Symposium on Biocomputing (PSB), 7, 374-385.

Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., & Mostafa, J. (2001). Detecting gene relations from Medline abstracts. In Proceedings of the Pacific Symposium on Biocomputing (PSB) (pp. 483-496).

Swanson, D.R., Smalheiser, N.R., & Bookstein, A. (2001). Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science, 52(10), 797-812.

Weeber, M., Klein, H., Berg, L., & Vos, R. (2001). Using concepts in literature-based discovery: Simulating Swanson's Raynaud-Fish Oil and Migraine-Magnesium discoveries. Journal of the American Society for Information Science, 52(7), 548-557.

Weeber, M., Vos, R., Klein, H., de Jong-Van den Berg, L.T.W., Aronson, A., & Molema, G. (2003). Generating hypotheses by discovering implicit associations in the literature: A case report for new potential therapeutic uses for Thalidomide. Journal of the American Medical Informatics Association, 10(3), 252-259.
KEY TERMS

Bibliomining: Data mining applied to digital libraries to discover patterns in large collections.
Bioinformatics: Data mining applied to medical digital libraries.

Clustering: An algorithm that takes a dataset and groups the objects such that objects within the same cluster have a high similarity to each other, but are dissimilar to objects in other clusters.

Information Extraction: A sub-field of NLP that addresses finding specific entities and facts in unstructured text.

Information Retrieval: A discipline that deals with finding documents that meet a set of specific requirements.

Machine Learning: Artificial intelligence methods that use a dataset to allow the computer to learn models that fit the data.

Natural Language Processing: A discipline that deals with various aspects of automatically processing written and spoken language.

Supervised Learning: A machine learning technique that requires a set of training data, which consists of known inputs and a priori desired outputs (e.g., classification labels), that can subsequently be used for either prediction or classification tasks.

Unsupervised Learning: A machine learning technique used to create a model based upon a dataset; however, unlike supervised learning, the desired output is not known a priori.
Data Mining Methods for Microarray Data Analysis

Lei Yu
Arizona State University, USA

Huan Liu
Arizona State University, USA
INTRODUCTION

The advent of gene expression microarray technology enables the simultaneous measurement of expression levels for thousands or tens of thousands of genes in a single experiment (Schena et al., 1995). Analysis of gene expression microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering (Eisen et al., 1998; Tamayo et al., 1999), sample clustering and class discovery (Alon et al., 1999; Golub et al., 1999), sample class prediction (Golub et al., 1999; Wu et al., 2003), and gene selection (Xing, Jordan, & Karp, 2001; Yu & Liu, 2004). This article introduces the basic concepts of gene expression microarray data and describes relevant data-mining tasks. It briefly reviews the state-of-the-art methods for each data-mining task and identifies emerging challenges and future research directions in microarray data analysis.
BACKGROUND AND MOTIVATION

The rapid advances of gene expression microarray technology have provided scientists, for the first time, the opportunity of observing complex relationships among various genes in a genome by simultaneously measuring the expression levels of thousands of genes in massive experiments. In order to extract biologically meaningful insights from a plethora of data generated from microarray experiments, advanced data analysis techniques are in demand. Data-mining methods, which discover patterns, statistical or predictive models, and relationships among massive data, are effective tools for microarray data analysis.

Gene expression microarrays are silicon chips that simultaneously measure the expression levels of thousands of genes. The description of technologies for constructing these chips and measuring gene expression levels is beyond the scope of this article (refer to Draghici, 2003, for an introduction). Each expression level of a specific gene among thousands of genes measured in an experiment is eventually recorded as a
numerical value. Expression levels of the same set of genes under study are normally accumulated through multiple experiments on different samples (or the same sample under different conditions) and recorded in a data matrix. In data mining, data are often stored in the form of a matrix, of which each column is described by a feature or attribute and each row consists of feature values and forms an instance, also called a record or data point, in a multidimensional space defined by the features. Figure 1 illustrates two ways of representing microarray data in a matrix form. In Figure 1a, each feature is a sample (S) and each instance is a gene (G). Each gene’s expression levels are measured across all the samples (or conditions), so fij is the measurement of the expression level of the ith gene for the jth sample, where i = 1,..., n and j = 1,..., m. In Figure 1b, the data matrix is the transpose of the one in Figure 1a, in which features are genes, and instances are samples. Sometimes, data in Figure 1b may have class labels ci for each instance, represented in the last column. The class labels can be different types of diseases or phenotypes of the underlying samples. A typical microarray data set may contain thousands of genes but only a small number of samples (often less than 100). The number of samples is likely to remain small — at least for the near future — due to the expense of collecting microarray samples (Dougherty, 2001). The two different forms of data shown in Figure 1 have different data-mining tasks. When instances are genes (Figure 1a), gene clustering can be performed to find similarly expressed genes across many samples. When instances are samples (Figure 1b), three different tasks can be performed: sample clustering, which involves grouping similar samples together to discover classes or subclasses of samples; sample class prediction, which involves predicting diseases or phenotypes of novel samples based on patterns learned from training samples with known class labels; and gene selection, which involves selecting a small number of genes from thousands of genes to reduce the dimensionality of the data and improve the performance of classification and clustering methods.
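To make the two layouts concrete, the following minimal sketch (not from the article; the gene names, sample names, expression values, and class labels are invented) builds both matrix forms with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
genes = [f"G{i}" for i in range(1, 6)]       # G1..G5
samples = [f"S{j}" for j in range(1, 4)]     # S1..S3

# Layout of Figure 1a: rows (instances) are genes, columns (features) are samples,
# so the entry at (Gi, Sj) is the expression level of gene i in sample j.
expr_a = pd.DataFrame(rng.normal(size=(len(genes), len(samples))),
                      index=genes, columns=samples)

# Layout of Figure 1b: the transpose, with samples as instances and genes as
# features, plus an optional class label per sample (e.g., disease vs. normal).
expr_b = expr_a.T
expr_b["class"] = ["tumor", "normal", "tumor"]

print(expr_a)
print(expr_b)
```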
Figure 1. Microarray data matrix: (a) instances are genes and features are samples; (b) instances are samples and features are genes
MAJOR LINES OF RESEARCH AND DEVELOPMENT

In this part, we briefly review methods for each of the data-mining tasks identified earlier: gene clustering, sample clustering, sample class prediction, and gene selection. We discuss gene clustering and sample clustering together, because the two tasks rely on common clustering methods; they simply apply them to the microarray data matrix in different orientations.
Clustering

Clustering is a process of grouping similar samples, objects, or instances into clusters. Many clustering methods exist (for a review, see Jain, Murty, & Flynn, 1999; Parson, Ehtesham, & Liu, 2004). They can be applied to microarray data analysis for clustering genes or samples. In this article, we present three groups of frequently used clustering methods.

The use of hierarchical clustering for gene clustering is first described in Eisen et al. (1998). Each instance forms a cluster in the beginning, and the two most similar clusters are merged until all instances are in one single cluster. The clustering result takes the form of a tree structure, called a dendrogram, which can be broken at different levels by using domain knowledge. Tree structures are easy to understand and can reveal close relationships among resulting clusters, but they do not provide a unique partition among all the instances, because different ways to determine a basic level in the dendrogram can result in different clustering results.

Unlike hierarchical clustering methods, partition-based clustering methods divide the whole data into a fixed number of clusters. Examples are K-means (Herwig et al., 1999), self-organizing maps (Tamayo et al., 1999), and graph-based partitioning (Xu, Olman, & Xu, 2002).
The methods of K-means often require specification of the number of clusters, K, and the selection of K instances as the initial clusters. All instances are then partitioned into the K clusters, optimizing some objective function (e.g., inner-cluster similarity) by assigning each instance to the most similar cluster, which is determined by the distance between the instance and the mean of each cluster in the current iteration. Self-organizing maps (SOMs) are variations of K-means methods and require specification of the initial topology of K nodes to construct the map. In graph-based partitioning methods, a Minimum Spanning Tree (MST) is often constructed, and the clusters are generated by deleting the MST edges with the largest lengths. Graph-based partitioning methods do not heavily depend on the regularity of the geometric shape of cluster boundaries, as K-means and SOMs do.

Traditional clustering methods require that each instance belong to a single cluster, even though some instances may be only slightly relevant for the biological significance of their assigned clusters. Fuzzy C-means (Dembele & Kastner, 2003) applies a fuzzy partitioning method that assigns cluster membership values to instances; this process is called fuzzy clustering. It links each instance to all clusters via a real-value vector of indexes. The value of each index lies between 0 and 1, where a value close to 1 indicates a strong association to the corresponding cluster, while a value close to 0 indicates no association. The vector of indexes thus defines the membership of an instance with respect to the various clusters.
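As an illustration of the clustering families just described, the hedged sketch below runs hierarchical clustering and K-means on a synthetic gene-by-sample matrix using scipy and scikit-learn; the data, number of clusters, linkage method, and distance metric are arbitrary choices for illustration, not recommendations from the article.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))          # 100 genes measured over 10 samples

# Hierarchical clustering: repeatedly merge the two most similar clusters;
# cutting the resulting dendrogram at a chosen level yields gene clusters.
Z = linkage(X, method="average", metric="correlation")
hier_labels = fcluster(Z, t=5, criterion="maxclust")

# Partition-based clustering: K-means with a fixed number of clusters K.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

print(hier_labels[:10], km.labels_[:10])
```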
Sample Class Prediction

Apart from clustering methods, which do not require a priori knowledge about the classes of available instances, a classification method requires training instances with labeled classes, learns patterns that discriminate between various classes, and, ideally, correctly predicts the classes of unseen instances.
Many classification methods can be applied to predict diseases or phenotypes of novel samples from microarray data. We present four commonly used methods in this section.

For an m x n gene expression matrix X (m is the number of samples, and n is the number of genes), linear discriminant analysis (LDA) seeks linear combinations, xa, of sample vectors xi = (xi1, ..., xin) with large ratios of between-class to within-class sums of squares. In other words, it tries to maximize the ratio aTBa/aTWa, where B and W denote, respectively, the n x n matrices of between-class and within-class sums of squares (Dudoit, Fridlyand, & Speed, 2000).

Nearest neighbor (NN) usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. K-NN makes the prediction for a new sample based on the most common class label among the K training samples most similar to the new sample. Examples can be found in Pomeroy et al. (2002).

Decision trees classify samples by building a tree-like structure. Specifically, they recursively split samples into two child branches based on the values of a selected feature, starting with all the samples. Each leaf node of the tree is pure in terms of classes, and the resulting partition corresponds to a classifier. By limiting the number of consecutive branches, they can produce more generalized classifiers. Different forms of trees exist. In Wu et al. (2003), classification and regression trees are applied for sample classification.

Support vector machines (SVMs) have also been shown effective in sample classification (Brown et al., 2000). They try to separate a set of training samples of two different classes with a hyperplane in an n-dimensional space defined by n features (genes). If no separating hyperplane exists in the original space, a kernel function is used to map the samples into a higher dimensional space where a separating hyperplane exists. Complex kernel functions that provide nonlinear mappings result in nonlinear classifiers. SVMs avoid overfitting by selecting a hyperplane that is maximally distant from the training samples of the two different classes, called the maximum margin separating hyperplane, from among the many hyperplanes that can separate the two classes.
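The sketch below shows the four classifier families applied to a sample-by-gene matrix; it uses synthetic data, default or arbitrary hyperparameters, and scikit-learn implementations rather than the exact procedures used in the cited studies.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 500))            # 60 samples, 500 genes
y = rng.integers(0, 2, size=60)           # two phenotype classes

for clf in (LinearDiscriminantAnalysis(),                # LDA
            KNeighborsClassifier(n_neighbors=3),         # K-NN
            DecisionTreeClassifier(max_depth=3),         # tree with limited depth
            SVC(kernel="linear", C=1.0)):                # maximum-margin hyperplane
    scores = cross_val_score(clf, X, y, cv=5)            # cross-validated accuracy
    print(type(clf).__name__, scores.mean().round(2))
```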
Gene Selection

The nature of relatively high dimensionality but a small sample size of data in sample classification and clustering can cause the problem of curse of dimensionality and overfitting of the training data (Dougherty, 2001). Therefore, selecting a small number of discriminative genes from thousands of genes is essential for successful sample classification and clustering. Feature selection methods (for a review, see Blum & Langley, 1997; Liu & Motoda, 1998) can be applied in microarray data analysis for gene selection.

Among gene selection methods, earlier methods often evaluate genes in isolation without considering gene-to-gene correlation. They rank genes according to their individual relevance or discriminative power to the targeted classes and select top-ranked genes. Some methods based on statistical tests or information gain have been employed in Golub et al. (1999) and Model (2001). However, a number of studies (Ding & Peng, 2003; Xion, Fang, & Zhao, 2001) point out that simply combining a highly ranked gene with another highly ranked gene often does not form a good gene set, because some highly correlated genes could be redundant. Removing redundant genes among selected ones can achieve a better representation of the characteristics of the targeted classes and lead to improved classification accuracy. Methods that handle gene redundancy based on pair-wise correlation analysis among genes can be found in Ding and Peng (2003); Xing, Jordan, and Karp (2001); and Yu and Liu (2004). A gene selection method for unlabeled samples is also proposed and is shown effective for sample clustering in Xing and Karp (2001).
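A simple version of the rank-then-remove-redundancy idea can be sketched as follows; the F-score ranking, the 0.8 correlation cutoff, and the target of 20 genes are illustrative assumptions rather than values from the methods cited above.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 500))            # 60 samples, 500 genes
y = rng.integers(0, 2, size=60)

F, _ = f_classif(X, y)                    # individual relevance of each gene
ranked = np.argsort(F)[::-1]              # genes sorted by decreasing F-score

selected = []
for g in ranked[:100]:                    # consider only the top-ranked genes
    # keep gene g only if it is not strongly correlated with an already-selected gene
    if all(abs(np.corrcoef(X[:, g], X[:, s])[0, 1]) < 0.8 for s in selected):
        selected.append(g)
    if len(selected) == 20:
        break
print(selected)
```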
FUTURE TRENDS

Traditional data-mining methods are often designed for data where the number of instances is significantly larger than the number of features. In microarray data analysis, however, the number of features (genes) is huge, and the number of instances (samples) is relatively small, for tasks of sample clustering or classification. This unique characteristic of microarray data presents a challenge to the scalability of current data-mining methods to high dimensionality. In addition, the relative shortage of instances in the context of high dimensionality often causes many methods to overfit the training data. Therefore, besides improving current data-mining methods, substantial research efforts are needed to come up with new methods specifically designed for microarray data.
CONCLUSION

Gene expression microarrays are a revolutionary technology with great potential to provide accurate medical diagnostics, develop cures for diseases, and produce a
detailed genome-wide molecular portrait of cellular states (Piatetsky-Shapiro & Tamayo, 2003). Data-mining methods are effective tools to turn massive raw data from microarray experiments into biologically important insights. In this article, we provide a brief introduction to microarray data analysis and a concise review of various data-mining methods for microarray data.
REFERENCES

Alon, U. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Acad. Sci., 96 (pp. 6745-6750), USA.

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.

Brown, M. et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Acad. Sci., 97 (pp. 262-267), USA.

Dembele, D., & Kastner, P. (2003). Fuzzy C-means method for clustering microarray data. Bioinformatics, 19(8), 973-980.

Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. Proceedings of the Computational Systems Bioinformatics Conference (pp. 523-529).

Dougherty, E. R. (2001). Small sample issue for microarray-based classification. Functional Genomics, 2, 28-34.

Draghici, S. (2003). Data analysis tools for DNA microarrays. Chapman & Hall/CRC.

Dudoit, S., Fridlyand, J., & Speed, T. P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data (Tech. Rep. No. 576). Berkeley, CA: University of California at Berkeley, Department of Statistics.

Eisen, M. (1998). Clustering analysis and display of genome-wide expression patterns. Proceedings of the National Acad. Sci., 95 (pp. 14863-14868), USA.

Golub, T. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Herwig, R. (1999). Large-scale clustering of cDNA fingerprints. Genomics, 66, 249-256.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic.

Model, F. (2001). Feature selection for DNA methylation based cancer classification. Bioinformatics, 17, 154-164.

Parson, L., Ehtesham, H., & Liu, H. (2004). Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations, 6(1), 90-105.

Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5.

Pomeroy, S. L. (2002). Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature, 415, 436-442.

Schena, M. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467-470.

Tamayo, P. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Acad. Sci., 96 (pp. 2907-2912), USA.

Wu, B. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636-1643.

Xing, E., Jordan, M., & Karp, R. (2001). Feature selection for high-dimensional genomic microarray data. Proceedings of the 18th International Conference on Machine Learning (pp. 601-608).

Xing, E., & Karp, R. (2001). CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, 17, S306-S315.

Xion, M., Fang, Z., & Zhao, J. (2001). Biomarker identification by feature wrappers. Genome Research, 11, 1878-1887.

Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536-545.

Yu, L., & Liu, H. (2004). Redundancy based feature selection for microarray data. Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
KEY TERMS

Classification: The process of predicting the classes of unseen instances based on patterns learned from available instances with predefined classes.

Clustering: The process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.

Data Mining: The application of analytical methods and tools to data for the purpose of discovering patterns, statistical or predictive models, and relationships among massive data.

Feature Selection: A process of choosing an optimal subset of features from original features, according to a certain criterion.

Gene: A hereditary unit consisting of a sequence of DNA that contains all the information necessary to produce a molecule that performs some biological function.

Gene Expression Microarrays: Silicon chips that simultaneously measure the expression levels of thousands of genes.

Genome: All the genetic information or hereditary material possessed by an organism.
Data Mining with Cubegrades

Amin A. Abdulghani, Quantiva, USA
INTRODUCTION

Much interest has been expressed in database mining by using association rules (Agrawal, Imielinski, & Swami, 1993). In this article, I provide a different view of the association rules, which are referred to as cubegrades (Imielinski, Khachiyan, & Abdulghani, 2002). An example of a typical association rule states that, say, in 23% of supermarket transactions (so-called market basket data) customers who buy bread and butter also buy cereal (that percentage is called confidence) and that in 10% of all transactions, customers buy bread and butter (this percentage is called support). Bread and butter represent the body of the rule, and cereal constitutes the consequent of the rule. This statement is typically represented as a probabilistic rule. But association rules can also be viewed as statements about how the cell representing the body of the rule is affected by specializing it with the addition of an extra constraint expressed by the rule's consequent. Indeed, the confidence of an association rule can be viewed as the ratio of the support drop, when the cell corresponding to the body of a rule (in this case, the cell of transactions including bread and butter) is augmented with its consequent (in this case, cereal). This interpretation gives association rules a dynamic flavor reflected in a hypothetical change of support affected by specializing the body cell to a cell whose description is a union of body and consequent descriptors. For example, the earlier association rule can be interpreted as saying that the count of transactions including bread and butter drops to 23% of the original when restricted (rolled down) to the transactions including bread, butter, and cereal. In other words, this rule states how the count of transactions supporting buyers of bread and butter is affected by buying cereal as well.

With such interpretation in mind, a much more general view of association rules can be taken, when support (count) can be replaced by an arbitrary measure or aggregate, and the specialization operation can be substituted with a different "delta" operation. Cubegrades capture this generalization. Conceptually, this is very similar to the notion of gradients used in calculus. By definition, the gradient of a function between the domain points x1 and x2 measures the ratio of the delta change in the function value over the delta change between the points. For a
given point x and function f(), it can be interpreted as a statement of how a change in the value of x (∆x) affects a change in value in the function (∆ f(x)). From another viewpoint, cubegrades can also be considered as defining a primitive for cubes. An n-dimensional cube is a group of k-dimensional (k<=n) cuboids arranged by the dimensions of the data. A cell represents an association of a measure m (e.g., total sales) with a member of every dimension. The scope of interest in Online Analytical Processing (OLAP) is to evaluate one or more measure values of the cells in the cube. Cubegrades allow a broader, more dynamic view. In addition to evaluating the measure values in a cell, they evaluate how the measure values change or are affected in response to a change in the dimensions of a cell. Traditionally, OLAP has had operators such as drill downs, rollups defined, but the cubegrade operator differs from them as it returns a value measuring the effect of the operation. Additional operators have been proposed to evaluate/measure cell interestingness (Sarawagi, 2000; Sarawagi, Agrawal, & Megiddo, 1998). For example, Sarawagi et al. computes anticipated value for a cell by using the neighborhood values, and a cell is considered an exception if its value is significantly different from its anticipated value. The difference is that cubegrades perform a direct cell-to-cell comparison.
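Returning to the opening example, the "dynamic" reading of an association rule can be made concrete with a few lines of pandas; the transaction table below is invented for illustration, and the computation simply expresses confidence as the relative drop in COUNT when the body cell (bread and butter) is specialized by the consequent (cereal).

```python
import pandas as pd

t = pd.DataFrame({
    "bread":  [1, 1, 1, 1, 0, 1, 0, 1, 1, 1],
    "butter": [1, 1, 1, 0, 0, 1, 1, 1, 1, 1],
    "cereal": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})

body = (t["bread"] == 1) & (t["butter"] == 1)      # cell: bread AND butter
target = body & (t["cereal"] == 1)                 # specialized (rolled-down) cell

support = body.mean()                              # fraction of all transactions
confidence = target.sum() / body.sum()             # relative COUNT drop
print(f"support={support:.2f}, confidence={confidence:.2f}")
```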
BACKGROUND

An association or propositional rule can be defined in terms of cube cells. It can be defined as a quadruple (body, consequent, support, confidence) where body and consequent are cells over disjoint sets of attributes, support is the number of records satisfying the body, and confidence is the ratio of the number of records that satisfy the body and the consequent to the number of records that satisfy just the body. You can also consider an association rule as a statement about a relative change of measure, COUNT, when specializing or drilling down the cell denoted by the body to the cell denoted by the body + consequent. The confidence of the rule measures how the consequent affects the support when drilling down the body. These association rules can be generalized in two ways:
• By allowing relative changes in other measures, instead of just confidence, to be returned as part of the rule.
• By allowing cell modifications to occur in different directions, instead of just specializations (or drill-downs).
Figure 1. Cubegrade: specialization, generalization, and mutation
These generalized cell modifications are denoted as cubegrades. A cubegrade expresses how a change in the structure of a given cell affects a set of predefined measures. The original cell being modified is referred to as the source, and the modified cell as target. More formally, a cubegrade is a 5-tuple (source, target, measures, value, delta-value) where
• source and target are cube cells
• measures is the set of measures that are evaluated both in the source as well as in the target
• value is a function, value: measures → R, that evaluates measure m ∈ measures in the source
• delta-value is also a function, delta-value: measures → R, that computes the ratio of the value of m ∈ measures in the target versus the source
A cubegrade can visually be represented as a rule form:
Source → target, [measures, value, delta-value]

Define a descriptor to be an attribute-value pair of the form dimension = value if the dimension is a discrete attribute, or dimension = [lo, hi] if the dimension is a continuous attribute. The cubegrades are distinguished as three types:
• Specializations: A cubegrade is a specialization if the set of descriptors of the target is a superset of those in the source. Within the context of OLAP, the target cell is termed a drill-down of the source.
• Generalizations: A cubegrade is a generalization if the set of descriptors of the target cell is a subset of those in the source. Here, in OLAP, the target cell is termed a roll-up of the source.
• Mutations: A cubegrade is a mutation if the target and source cells have the same set of attributes but differ on the descriptor values (they are union compatible, so to speak, as the term has been used in relational algebra).
Figure 1 illustrates these cubegrade operations. In the following, I give some specific examples to explain the use of these cubegrades:
• (Specialization Cubegrade) The average age of buyers who purchase $20 to $30 worth of milk monthly drops by 10% among buyers who also buy cereal: (salesMilk=[$20,$30]) → (salesMilk=[$20,$30], salesCereal=[$1,$5]) [AVG(Age), AVG(Age) = 23, DeltaAVG(Age) = 90%]

• (Mutation Cubegrade) The average amount spent on milk drops by 30% when moving from suburban buyers to urban buyers: (areaType='suburban') → (areaType='urban') [AVG(salesMilk), AVG(salesMilk) = $12.40, DeltaAVG(salesMilk) = 70%]
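As a minimal illustration of evaluating such a cubegrade over a relational table, the pandas sketch below compares AVG(salesMilk) between a source cell and a mutated target cell; the table, column names, and numbers are assumptions chosen for illustration, not the computation framework of the cited work.

```python
import pandas as pd

buyers = pd.DataFrame({
    "areaType":  ["suburban", "suburban", "urban", "urban", "rural"],
    "salesMilk": [14.0, 10.8, 9.0, 8.4, 7.5],
})

def cell_avg(df, **descriptors):
    """AVG(salesMilk) over the cell selected by the given dimension descriptors."""
    mask = pd.Series(True, index=df.index)
    for dim, val in descriptors.items():
        mask &= df[dim] == val
    return df.loc[mask, "salesMilk"].mean()

source = cell_avg(buyers, areaType="suburban")     # source cell measure value
target = cell_avg(buyers, areaType="urban")        # mutated target cell
delta_value = target / source                      # ratio of target vs. source
print(source, target, round(delta_value, 2))
```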
MAIN THRUST

Similar to association rules (Agrawal & Srikant, 1994), the generation of cubegrades can be divided into two phases: (a) generation of significant cells (rather than frequent sets) satisfying the source cell conditions and (b) computation of cubegrades from the source (instead of computing association rules from frequent sets) satisfying the joint conditions between source and target and target conditions. The first task is similar to the computation of iceberg cube queries (Beyer & Ramakrishnan, 1999; Han, Pei, Dong, & Wang, 2001; Xin, Han, Li, & Wah, 2003). The fundamental property that allows for pruning in these computations is called monotonicity of the query: Let D
be a database and X⊆D be a cell. A query is monotonic at X if the condition Q(X) is FALSE implies that Q(X') is FALSE for any X'⊆X. However, as described by Imielinski et al. (2002), determining whether a query Q is monotonic in terms of this definition is an NP-hard problem for many simple classes of queries. To work around this problem, the authors introduced another notion of monotonicity, referred to as view monotonicity of a query.

Suppose you have cuboids defined on a set S of dimensions and measures. A view V on S is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S. So, for example, if in a cell of rural buyers, the average sales of bread for 20 buyers is 15, then the view on the set {areaType, COUNT(), AVG(salesBread)} for the cell is {areaType='rural', COUNT()=20, AVG(salesBread)=15}. Extending the definition, a view on a query is an assignment of values for the set of dimension and measure attributes of the query expression. A query Q(⋅) is view monotonic on view V if, for any cell X in any database D such that V is the view for X, the condition Q is FALSE for X implies Q is FALSE for all X'⊆X. An important property of view monotonicity is that the time and space required for checking it for a query depends on the number of terms in the query, not on the size of the database or the number of its attributes. Because most queries typically have few terms, it would be useful in many practical situations. The method presented can be used for checking for view monotonicity for queries that include constraints of type (Agg {<, >, =, !=} c), where c is a constant and Agg can be MIN, SUM, MAX, AVERAGE, COUNT, an aggregate that is a higher order moment about the origin, or an aggregate that is an integral of a function on a single attribute.

Consider a hypothetical query asking for cubes with 1000 or more buyers and with total milk sales less than $50,000. In addition, the average milk sales per customer should be between $20 and $50, with maximum sales greater than $75. This query can be expressed as follows:

COUNT(*)>=1000 and AVG(salesMilk)>=20 and AVG(salesMilk)<50 and MAX(salesMilk)>=75 and SUM(salesMilk)<50K

Suppose, while performing bottom-up cube computation, you have a cell C with the following view V: (Count=1200; AVG(salesMilk)=50; MAX(salesMilk)=80; MIN(salesMilk)=30; SUM(salesMilk)=60000). Using the method for checking view monotonicity, it can be shown that some subcell C' of C can exist (though this subcell is not guaranteed to exist in this database) with 1000 <= count < 1075 for which this query can be satisfied. Thus, this query cannot be pruned on the cell. However, if the view for C is (Count=1200; AVG(salesMilk)=57; MAX(salesMilk)=80; MIN(salesMilk)=30; SUM(salesMilk)=68400), then it can be shown that there cannot exist any subcell C' of C in any database for which the original query can be satisfied. Thus, the query can be pruned on cell C.
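The pruning reasoning in this example can be mimicked with a small bound-based feasibility check; the sketch below is a simplification written purely for illustration (it is not the authors' view-monotonicity algorithm) and hard-codes the example query's thresholds.

```python
# Given a cell's view (COUNT, MIN, MAX, SUM of salesMilk), decide whether *some*
# subcell with at least `min_count` records could still satisfy the example query,
# using simple interval bounds on COUNT, SUM, and AVG of any subcell.
def could_subcell_satisfy(view, min_count=1000):
    cnt, mn, mx, total = view["COUNT"], view["MIN"], view["MAX"], view["SUM"]
    for k in range(min_count, cnt + 1):          # candidate subcell sizes
        removed = cnt - k
        # bounds on the subcell's SUM: remove the largest / smallest values
        sum_lo = total - removed * mx
        sum_hi = total - removed * mn
        avg_lo, avg_hi = sum_lo / k, sum_hi / k
        max_feasible = mx >= 75                  # MAX can only shrink in a subcell
        if (max_feasible and sum_lo < 50_000     # SUM(salesMilk) < 50K reachable
                and avg_hi >= 20 and avg_lo < 50):  # AVG in [20, 50) reachable
            return True, k
    return False, None

v1 = {"COUNT": 1200, "MIN": 30, "MAX": 80, "SUM": 60_000}
v2 = {"COUNT": 1200, "MIN": 30, "MAX": 80, "SUM": 68_400}
print(could_subcell_satisfy(v1))   # (True, 1000) -> cannot prune this cell
print(could_subcell_satisfy(v2))   # (False, None) -> the query can be pruned here
```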
After the source cells have been computed, the next task is to compute the set of target cells. This is done by performing a set of query conversions that make it possible to reduce cubegrade query evaluation to iceberg queries. Given a specific candidate source cell C, define Q[C] as the query which results from Q by source substitution. Here, Q is transformed by substituting into its "where" clause all the values of the measures, and the descriptors of the source cell C as well, by performing the delta elimination step, which replaces all the relative delta-values (expressed as fractions) by regular less than/greater than conditions. This is possible because the values for the measures of C are known. With this, the delta-values can now be expressed as conditions on values of the target. For example, if AVG(Salary)=40K in cell C and the condition on DeltaAVG(Salary) is of the form DeltaAVG(Salary) > 1.10, this can be translated now to AVG(Salary) > 44K, where AVG(Salary) references the target cell. The final step in source substitution is join transformation, where the join conditions (specializations, generalizations, and mutations) in Q are transformed into the target conditions because the source cell is known. Notice, thus, that Q[C] is the cube query specifying the target cell.

Dong, Han, Lam, Pei, and Wang (2001) present an optimized version of target generation that is particularly useful when the number of source cells is small. The ideas in the algorithm include the following steps:

• Perform, for the set of identified source cells, the lowest common delta elimination such that the resulting target condition does not exclude any possible target cells.
• Perform a bottom-up iceberg query for the target cells based on the target condition.
• Define LiveSet(T) of a target cell T as the candidate set of source cells that can possibly match or join with the target. A target cell, T, may identify a source, S, in its LiveSet to be prunable based on its monotonicity and thus removable from its LiveSet. In such a case, all descendants of T would also not include S in their LiveSets.
• Perform a join of the target and each of its LiveSet's source cells and, for the resulting cubegrade, check whether it satisfies the join criteria and the delta-value condition for the query.
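The delta-elimination step described before this list can be sketched in a couple of lines; the function name and interface here are illustrative assumptions, not part of the cited algorithms.

```python
# Once the source cell's measure value is known, a relative condition on the
# delta-value becomes an ordinary threshold condition on the target cell's measure.
def delta_eliminate(source_value, delta_op, delta_threshold):
    """Rewrite `delta-value <op> t` as a condition on the target measure."""
    return delta_op, source_value * delta_threshold

# Example from the text: AVG(Salary)=40K in the source and DeltaAVG(Salary) > 1.10
op, bound = delta_eliminate(40_000, ">", 1.10)
print(f"target condition: AVG(Salary) {op} {bound:.0f}")   # AVG(Salary) > 44000
```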
In a typical scenario, it is not expected that users would be asking for cubegrades per se. Rather, it may be more likely that they pose a query on how a given delta change affects a set of cells, which cells are affected by a given delta change, or what delta changes affect a set of cells in a prespecified manner. Further, more complicated sets of applications can be implemented by using cubegrades (Abdulghani, 2001). For example, one may be interested in finding cells that remain stable and are not significantly affected by generalization, specialization, or mutation. An illustration for such a situation would be to find cells that remain stable on the blood pressure measure and are not affected by a different specialization on age or area demographics. Another application could be to find effective factors, a set of specialization, generalization, or mutation descriptors that are effective in changing a measure value m by a significant ratio. For example, one may want to find effective factors in decreasing a cholesterol level across a set of selected cells.
FUTURE TRENDS

A major challenge for cubegrade processing is its computational complexity. Potentially, an exponential number of source/target cells can be generated. A positive development in this direction is the work done on Quotient Cubes (Lakshmanan, Pei, & Han, 2002). This work provides a method for partitioning and compressing the cells of a data cube into equivalent classes such that the resulting classes have cells covering the same set of tuples and preserving the cube's semantic roll-up/drill-down. In this context, we can reduce the number of cells generated for the source and target. Further, the pairings for the cubegrade can be reduced by restricting the source and target to different classes.

Another related challenge for cubegrades is to identify the set of interesting cubegrades (cubegrades that are somewhat surprising). Insights into this problem can be obtained from similar work done in the context of association rules (Bayardo & Agrawal, 1999; Liu, Ma, & Yu, 2001). The main difference is that for cubegrades, measures (possibly a combination of them) other than COUNT are involved, but for association rules, the interesting functions are based on the count function. In addition, cubegrades have cell modification in multiple directions, but association rules are restricted to specializations.

As cubegrades are better understood, wider applications of the concept to various domains are expected to be seen. Association rules have been applied with success to such areas as intrusion detection (Lee, Stolfo, & Mok, 1998) and microarray data (Tuzhilin & Adomavicius, 2002). Applying the cubegrade paradigm in such domains provides an opportunity for richer mining results.
CONCLUSION

In this article, I look at a generalization of association rules referred to as cubegrades. These generalizations include allowing the evaluation of relative changes in other measures, rather than just confidence, to be returned as well as allowing cell modifications to occur in different directions. The additional directions that I consider here include generalizations, which modify cells towards the more general cell with fewer descriptors, and mutations, which modify the descriptors of a subset of the attributes in the original cell definition with the others remaining the same. The paradigm allows you to ask queries that were not possible through association rules. The downside is that it comes with the price of relatively increased computation/storage costs that need to be tackled with innovative methods.
REFERENCES

Abdulghani, A. (2001). Cubegrades-generalization of association rules to mine large datasets. Doctoral dissertation, Rutgers University, New Brunswick, NJ. Dissertation Abstracts International, DAI-B 62/10, UMI Number 3027950.

Agrawal, R., Imielinski, T., & Swami, A. N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference (pp. 207-216), USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Data Bases (pp. 487-499), Chile.

Bayardo, R., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 145-154), USA.

Beyer, K. S., & Ramakrishnan, R. (1999). Bottom-up computation of sparse and iceberg CUBEs. Proceedings of the ACM SIGMOD Conference (pp. 359-370), USA.

Dong, G., Han, J., Lam, J. M., Pei, J., & Wang, K. (2001). Mining multi-dimensional constrained gradients in data cubes. Proceedings of the International Conference on Very Large Data Bases (pp. 321-330), Italy.
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. Proceedings of the International Conference on Data Engineering (pp. 152-159), USA.

Han, J., Pei, J., Dong, G., & Wang, K. (2001). Efficient computation of iceberg cubes with complex measures. Proceedings of the ACM SIGMOD Conference (pp. 1-12), USA.

Imielinski, T., Khachiyan, L., & Abdulghani, A. (2002). Cubegrades: Generalizing association rules. Journal of Data Mining and Knowledge Discovery, 6(3), 219-257.

Lakshmanan, L. V. S., Pei, J., & Han, J. (2002). Quotient cube: How to summarize the semantics of a data cube. Proceedings of the International Conference on Very Large Data Bases (pp. 778-789), China.

Lee, W., Stolfo, S. J., & Mok, K. W. (1998). Mining audit data to build intrusion detection models. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 66-72), USA.

Liu, B., Ma, Y., & Yu, S. P. (2001). Discovering unexpected information from your competitors' Web sites. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 144-153), USA.

Sarawagi, S. (2000). User-adaptive exploration of multidimensional data. Proceedings of the International Conference on Very Large Data Bases (pp. 307-316).

Sarawagi, S., Agrawal, R., & Megiddo, N. (1998). Discovery driven exploration of OLAP data cubes. Proceedings of the International Conference on Extending Database Technology (pp. 168-182), Spain.

Tuzhilin, A., & Adomavicius, G. (2002). Handling very large numbers of association rules in the analysis of microarray data. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 396-404), Canada.

Xin, D., Han, J., Li, X., & Wah, B. W. (2003). Star-cubing: Computing iceberg cubes by top-down and bottom-up integration. Proceedings of the International Conference on Very Large Data Bases (pp. 476-487), Germany.

KEY TERMS

Cubegrade: A cubegrade is a 5-tuple (source, target, measures, value, delta-value) where
• source and target are cells
• measures is the set of measures that are evaluated both in the source as well as in the target
• value is a function, value: measures → R, that evaluates measure m ∈ measures in the source
• delta-value is a function, delta-value: measures → R, that computes the ratio of the value of m ∈ measures in the target versus the source

Drill-Down: A cube operation that allows users to navigate from summarized cells to more detailed cells.

Generalizations: A cubegrade is a generalization if the set of descriptors of the target cell is a subset of the set of attribute-value pairs of the source cell.

Iceberg Cubes: The set of cells in a cube that satisfies an iceberg query.

Iceberg Query: A query on top of a cube that asks for aggregates above a certain threshold.

Mutations: A cubegrade is a mutation if the target and source cells have the same set of attributes but differ on the values.

Query Monotonicity: A query Q(•) is monotonic at a cell X if the condition Q(X) is FALSE implies that Q(X') is FALSE for any cell X'⊆X.

Roll-Up: A cube operation that allows users to aggregate from detailed cells to summarized cells.

Specialization: A cubegrade is a specialization if the set of attribute-value pairs of the target cell is a superset of the set of attribute-value pairs of the source cell.

View: A view V on a set S of dimensions and measures is an assignment of values to the elements of the set. If the assignment holds for the dimensions and measures in a given cell X, then V is a view for X on the set S.

View Monotonicity: A query Q is view monotonic on view V if for any cell X in any database D such that V is the view for X for query Q, the condition Q is FALSE for X implies that Q is FALSE for all X'⊆X.
Data Mining with Incomplete Data

Hai Wang, Saint Mary's University, Canada
Shouhong Wang, University of Massachusetts Dartmouth, USA
INTRODUCTION

Surveys are one of the common data acquisition methods for data mining (Brin, Rastogi & Shim, 2003). In data mining one can rarely find a survey data set that contains complete entries of each observation for all of the variables. Commonly, surveys and questionnaires are only partially completed by respondents. The possible reasons for incomplete data are numerous, including negligence, deliberate avoidance for privacy, ambiguity of the survey question, and aversion. The extent of the damage caused by missing data is unknown when it is virtually impossible to return the survey or questionnaires to the data source for completion, yet it is one of the most important pieces of knowledge for data mining to discover. In fact, missing data is an important debatable issue in the knowledge engineering field (Tseng, Wang, & Lee, 2003).

In mining a survey database with incomplete data, patterns of the missing data as well as the potential impacts of these missing data on the mining results constitute valuable knowledge. For instance, a data miner often wishes to know how reliable a data mining result is, if only the complete data entries are used; when and why certain types of values are often missing; what variables are correlated in terms of having missing values at the same time; what reason for incomplete data is likely, etc. These valuable pieces of knowledge can be discovered only after the missing part of the data set is fully explored.
BACKGROUND

There have been three traditional approaches to handling missing data in statistical analysis and data mining. One of the convenient solutions to incomplete data is to eliminate from the data set those records that have missing values (Little & Rubin, 2002). This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the data mining conclusions drawn from the screened data set are more likely misleading.
Another simple approach of dealing with missing data is to use generic “unknown” for all missing data items. However, this approach does not provide much information that might be useful for interpretation of missing data. The third solution to dealing with missing data is to estimate the missing value in the data item. In the case of time series data, interpolation based on two adjacent data points that are observed is possible. In general cases, one may use some expected value in the data item based on statistical measures (Dempster, Laird, & Rubin, 1997). However, data in data mining are commonly of the types of ranking, category, multiple choices, and binary. Interpolation and use of an expected value for a particular missing data variable in these cases are generally inadequate. More importantly, a meaningful treatment of missing data shall always be independent of the problem being investigated (Batista & Monard, 2003). More recently, there have been mathematical methods for finding the salient correlation structure, or aggregate conceptual directions, of a data set with missing data (Aggarwal & Parthasarathy, 2001; Parthasarathy & Aggarwal, 2003). These methods make themselves distinct from the traditional approaches of treating missing data by focusing on the collective effects of the missing data instead of individual missing values. However, these statistical models are data-driven, instead of problem-domain-driven. In fact, a particular data mining task is often related to its specific problem domain, and a single generic conceptual construction algorithm is insufficient to handle a variety of data mining tasks.
MAIN THRUST

There have been two primary approaches to data mining with incomplete data: conceptual construction and enhanced data mining.
Conceptual Construction with Incomplete Data

Conceptual construction with incomplete data reveals the patterns of the missing data as well as the potential
impacts of these missing data on the mining results based only on the complete data. Conceptual construction on incomplete data is a knowledge development process. To construct new concepts on incomplete data, the data miner needs to identify a particular problem as a base for the construction. According to Wang and Wang (2004), conceptual construction is carried out through two phases. First, data mining techniques (e.g., cluster analysis) are applied to the data set with complete data to reveal the unsuspected patterns of the data, and the problem is then articulated by the data miner. Second, the incomplete data with missing values related to the problem are used to construct new concepts. In this phase, the data miner evaluates the impacts of missing data on the identification of the problem and develops knowledge related to the problem. For example, suppose a data miner is investigating the profile of the consumers who are interested in a particular product. Using the complete data, the data miner has found that variable i (e.g., income) is an important factor of the consumers' purchasing behavior. To further verify and improve the data mining result, the data miner must develop new knowledge through mining the incomplete data. Four typical concepts as results of knowledge discovery in data mining with incomplete data are described as follows (a short code sketch after the list illustrates the corresponding indices):

(1) Reliability: The reliability concept reveals the scope of the missing data in terms of the problem identified based only on complete data. For instance, in the above example, to develop the reliability concept, the data miner can define index VM(i)/VC(i) where VM(i) is the number of missing values in variable i, and VC(i) is the number of samples used for the problem identification in variable i. Accordingly, the higher VM(i)/VC(i) is, the lower the reliability of the factor would be.

(2) Hiding: The concept of hiding reveals how likely an observation with a certain range of values in one variable is to have a missing value in another variable. For instance, in the above example, the data miner can define index VM(i)|x(j)∈(a,b) where VM(i) is the number of missing values in variable i, x(j) is the occurrence of variable j (e.g., education years), and (a,b) is the range of x(j); and use this index to disclose the hiding relationships between variables i and j, say, more than two thousand records have missing values in variable income given the value of education years ranging from 13 to 19.

(3) Complementing: The concept of complementing reveals what variables are more likely to have missing values at the same time; that is, the correlation of missing values related to the problem being investigated. For instance, in the above example, the data miner can define index VM(i,j)/VM(i) where VM(i,j) is the number of missing values in both variables i and j, and VM(i) is the number of missing values in variable i. This concept discloses the correlation of two variables in terms of missing values. The higher the value VM(i,j)/VM(i) is, the stronger the correlation of missing values would be.

(4) Conditional Effects: The concept of conditional effects reveals the potential changes to the understanding of the problem caused by the missing values. To develop the concept of conditional effects, the data miner assumes different possible values for the missing values, and then observes the possible changes of the nature of the problem. For instance, in the above example, the data miner can define index ∆P|∀z(i)=k where ∆P is the change of the size of the target consumer group perceived by the data miner, ∀z(i) represents all missing values of variable i, and k is the possible value variable i might have for the survey. Typically, k={max, min, p} where max is the maximal value of the scale, min is the minimal value of the scale, and p is the random variable with the same distribution function of the values in the complete data. By setting different possible values of k for the missing values, the data miner is able to observe the change of the size of the consumer group and redefine the problem.
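The sketch below computes toy versions of these four indices with pandas; the survey columns, the 50K income threshold, and the education range are invented purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":    [55_000, np.nan, 42_000, np.nan, 61_000, np.nan],
    "education": [12, 16, np.nan, 18, 14, 15],
})

# Reliability: VM(i) / VC(i) for variable i = income
vm_income = df["income"].isna().sum()
vc_income = df["income"].notna().sum()
reliability_index = vm_income / vc_income

# Hiding: count of missing income values given education years in a range, e.g. [13, 19]
hiding_index = df.loc[df["education"].between(13, 19), "income"].isna().sum()

# Complementing: VM(i, j) / VM(i) for i = income, j = education
vm_both = (df["income"].isna() & df["education"].isna()).sum()
complementing_index = vm_both / vm_income

# Conditional effects: re-estimate a quantity of interest (here, the share of
# respondents with income above 50K) under different fill-in assumptions
for k in (df["income"].max(), df["income"].min()):
    share = (df["income"].fillna(k) > 50_000).mean()
    print("fill", k, "-> share above 50K:", round(share, 2))

print(reliability_index, hiding_index, complementing_index)
```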
Enhanced Data Mining with Incomplete Data

The second primary approach to data mining with incomplete data is enhanced data mining, in which incomplete data are fully utilized. Enhanced data mining is carried out through two phases. In the first phase, observations with missing data are transformed into fuzzy observations. Since missing values make the observation fuzzy, according to fuzzy set theory (Zadeh, 1978), an observation with missing values can be transformed into fuzzy patterns that are equivalent to the observation. For instance, suppose there is an observation A=X(x1, x2, ..., xc, ..., xm) where xc is the variable with the missing value, and xc∈{r1, r2, ..., rp} where rj (j=1, 2, ..., p) is a possible occurrence of xc. Let µj = Pj(xc = rj), the fuzzy membership (or possibility) that xc belongs to rj (j=1, 2, ..., p), and ∑j µj = 1. Then, µj[X|(xc=rj)] (j=1, 2, ..., p) are fuzzy patterns that are equivalent to the observation A.

In the second phase of enhanced data mining, all fuzzy patterns, along with the complete data, are used for data mining using tools such as self-organizing maps (SOM) (Deboeck & Kohonen, 1998; Kohonen, 1989; Vesanto & Alhoniemi, 2000) and other types of neural
networks (Wang, 2000, 2002). These tools used for enhanced data mining are different from the original ones in that they are capable of retaining information of fuzzy membership for each fuzzy pattern. Wang (2003) has developed a SOM-based enhanced data mining model to utilize all fuzzy patterns and the complete data for knowledge discovery. Using this model, the data miner is allowed to compare SOM based on complete data and fuzzy SOM based on all incomplete data to perceive covert patterns of the data set. It also allows the data miner to conduct what-if trials by including different portions of the incomplete data to disclose more accurate facts. Wang (2005) has developed a Hopfield neural network based model (Hopfield & Tank, 1986) for data mining with incomplete survey data. The enhanced data mining method utilizes more information provided by fuzzy patterns, and thus makes the data mining results more accurate. More importantly, it produces rich information about the uncertainty (or risk) of the data mining results.
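A minimal sketch of the fuzzy transformation in the first phase is given below; estimating the memberships µj from the value frequencies in the complete data is an assumption made here for illustration, not a prescription from the cited models.

```python
import pandas as pd

complete = pd.Series(["low", "medium", "medium", "high"])   # observed values of x_c
memberships = complete.value_counts(normalize=True)         # mu_j for each r_j, summing to 1

observation = {"x1": 3.2, "x2": 0.7, "x_c": None}            # x_c is missing

# Expand the observation into equivalent fuzzy patterns, one per possible value,
# each carrying the membership (possibility) of that completion.
fuzzy_patterns = [
    ({**observation, "x_c": r_j}, mu_j)
    for r_j, mu_j in memberships.items()
]
for pattern, weight in fuzzy_patterns:
    print(round(weight, 2), pattern)
```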
FUTURE TRENDS

Research into data mining with incomplete data is still in its infancy. The literature on this issue is still scarce. Nevertheless, the data miners' endeavor for "knowing what we do not know" will accelerate research in this area. More theories and techniques of data mining with incomplete data will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. These theories and techniques will be built on the combination of statistical models, neural networks, and computational algorithms. Data management systems on large-scale database systems for data mining with incomplete data will be available for data mining practitioners. In the long term, techniques of dealing with missing data will become a prerequisite of any data mining instrument.
CONCLUSION

Generally, knowledge discovery starts with the original problem identification. Yet the validation of the problem identified is typically beyond the database and generic algorithms themselves. During the knowledge discovery process, new concepts must be constructed through demystifying the data. Traditionally, incomplete data in data mining are often mistreated. As explained in this article, data with missing values must be taken into account in the knowledge discovery process. There have been two major non-traditional approaches that can be effectively used for data mining with incomplete data. One approach is conceptual construction on
incomplete data. It provides effective techniques for knowledge development so that the data miner is allowed to interpret the data mining results based on the particular problem domain and his/her perception of the missing data. The other approach is fuzzy transformation. According to this approach, observations with missing values are transformed into fuzzy patterns based on fuzzy set theory. These fuzzy patterns along with observations with complete data are then used for data mining through, for examples, data visualization and classification. The inclusion of incomplete data for data mining would provide more information for the decision maker in identifying problems, verifying and improving the data mining results derived from observations with complete data only.
REFERENCES

Aggarwal, C.C., & Parthasarathy, S. (2001). Mining massively incomplete data sets by conceptual reconstruction. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 227-232). New York: ACM Press.

Batista, G., & Monard, M. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5/6), 519-533.

Brin, S., Rastogi, R., & Shim, K. (2003). Mining optimized gain rules for numeric attributes. IEEE Transactions on Knowledge & Data Engineering, 15(2), 324-338.

Deboeck, G., & Kohonen, T. (1998). Visual explorations in finance with self-organizing maps. London, UK: Springer-Verlag.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1-38.

Hopfield, J.J., & Tank, D. W. (1986). Computing with neural circuits. Sciences, 233, 625-633.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Little, R.J.A., & Rubin, D.B. (2002). Statistical analysis with missing data (2nd ed.). New York: John Wiley and Sons.

Parthasarathy, S., & Aggarwal, C.C. (2003). On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Computer Society, 15(6), 1512-1521.
Tseng, S., Wang, K., & Lee, C. (2003). A pre-processing method to deal with missing values by integrating clustering and regression techniques. Applied Artificial Intelligence, 17(5/6), 535-544.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM handbook of IT in business (pp. 382-391). London, UK: International Thomson Business Press.

Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283.

Wang, S. (2003). Application of self-organizing maps for data mining with incomplete data sets. Neural Computing & Application, 12(1), 42-48.

Wang, S. (2005). Classification with incomplete survey data: A Hopfield neural network approach. Computers & Operational Research, 32(10), 2583-2594.

Wang, S., & Wang, H. (2004). Conceptual construction on incomplete survey data. Data and Knowledge Engineering, 49(3), 311-323.

Zadeh, L.A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.
KEY TERMS

Aggregate Conceptual Direction: Aggregate conceptual direction describes the trend in the data along which most of the variance occurs, taking the missing data into account.

Conceptual Construction with Incomplete Data: Conceptual construction with incomplete data is a knowledge development process that reveals the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data.

Enhanced Data Mining with Incomplete Data: Data mining that utilizes incomplete data through fuzzy transformation.

Fuzzy Transformation: The process of transforming an observation with missing values into fuzzy patterns that are equivalent to the observation based on fuzzy set theory.

Hopfield Neural Network: A neural network with a single layer of nodes that have binary inputs and outputs. The output of each node is fed back to all other nodes simultaneously, and each node forms a weighted sum of inputs and passes the output result through a nonlinearity function. It applies a supervised learning algorithm, and the learning process continues until a stable state is reached.

Incomplete Data: The data set for data mining contains some data entries with missing values. For instance, when surveys and questionnaires are partially completed by respondents, the entire response data becomes incomplete data.

Neural Network: A set of computer hardware and/or software that attempts to emulate the information processing patterns of the biological brain. A neural network consists of four main components:

(1) Processing units (or neurons), each of which has a certain output level at any point in time.
(2) Weighted interconnections between the various processing units, which determine how the output of one unit leads to input for another unit.
(3) An activation rule, which acts on the set of inputs at a processing unit to produce a new output.
(4) A learning rule that specifies how to adjust the weights for a given input/output pair.

Self-Organizing Map (SOM): A two-layer neural network that maps high-dimensional data onto a low-dimensional grid of neurons through an unsupervised or competitive learning process. It allows the data miner to view the clusters on the output maps.
Data Quality in Cooperative Information Systems
Carlo Marchetti, Università di Roma “La Sapienza”, Italy
Massimo Mecella, Università di Roma “La Sapienza”, Italy
Monica Scannapieco, Università di Roma “La Sapienza”, Italy
Antonino Virgillito, Università di Roma “La Sapienza”, Italy
INTRODUCTION A Cooperative Information System (CIS) is a large-scale information system that interconnects various systems of different and autonomous organizations, geographically distributed and sharing common objectives (De Michelis et al., 1997). Among the different resources that are shared by organizations, data are fundamental; in real-world scenarios, organization A may not request data from organization B if it does not trust B’s data (i.e., if A does not know that the quality of the data that B can provide is high). As an example, in an e-government scenario in which public administrations cooperate in order to fulfill service requests from citizens and enterprises (Batini & Mecella, 2001), administrations very often prefer asking citizens for data rather than other administrations that store the same data, because the quality of such data is not known. Therefore, lack of cooperation may occur due to lack of quality certification. Uncertified quality can also cause a deterioration of the data quality inside single organizations. If organizations exchange data without knowing their actual quality, it may happen that data of low quality will spread all over the CIS. On the other hand, CISs are characterized by high data replication (i.e., different copies of the same data are stored by different organizations). From a data quality perspective, this is a great opportunity; improvement actions can be carried out on the basis of comparisons among different copies in order to select the most appropriate one or to reconcile available copies, thus producing a new improved copy to be notified to all interested organizations. In this article, we describe possible solutions to data quality problems in CISs that have been implemented within the DaQuinCIS architecture. The description of the
architecture will allow us to understand the challenges posed by data quality management in CISs. Moreover, the presentation of the DaQuinCIS services will provide examples of techniques that can be used to face data quality challenges and will show future research directions that should be investigated.
BACKGROUND Data quality traditionally has been investigated in the context of single information systems. Only recently has attention grown toward data quality issues in the context of multiple and heterogeneous information systems (Berti-Equille, 2003; Bertolazzi & Scannapieco, 2001; Naumann et al., 1999). In cooperative scenarios, the main data quality issues concern (Bertolazzi & Scannapieco, 2001):
• assessment of the quality of the data owned by each organization; and
• methods and techniques for exchanging and improving quality information.
For the assessment issue, some of the results already achieved for traditional systems can be borrowed. In the statistical area, a great deal of work has been done since the late 1960s. Record linkage techniques have been proposed, most of them based on the Fellegi and Sunter (1969) model. Also, edit and imputation methods based on the Fellegi and Holt (1976) model have been provided in the same area. In the database area, record matching techniques (Hernandez & Stolfo, 1998) and data cleaning tools (Galhardas et al., 2000) have been proposed as a contribution to data quality assessment. In Winkler (2004), a survey of data quality assessment
techniques is provided, covering both statistical techniques and data cleaning solutions. When considering the issue of exchanging data and the associated quality, a model to export both data and quality data needs to be defined. Some conceptual models to associate quality information with data have been proposed, which include an extension of the entity-relationship model (Wang et al., 1993) and a data warehouse conceptual model with quality features described through the description logic formalism (Jarke et al., 1995). Both models are for a specific purpose: the former to introduce quality elements in relational database design; the latter to introduce quality elements in the data warehouse design. In Mihaila et al. (2000), the problem of the quality of Web-available information has been addressed in order to select data with high quality coming from distinct sources; every source has to evaluate some pre-defined data quality parameters and to make their values available by exposing metadata. Furthermore, exchanging data in CISs poses important problems that have been addressed by the data integration literature. Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data (Lenzerini, 2002). As described in Yan et al. (1999), when performing data integration, two different types of conflicts may arise: semantic conflicts, due to heterogeneous source models; and instance-level conflicts, which arise when sources record inconsistent values for the same objects. The data quality broker described in the following is a system solving instance-level conflicts. Other notable examples of data integration systems within the same category are AURORA (Yan et al., 1999) and the system described in Sattler et al. (2003). AURORA supports conflict-tolerant queries (i.e., it provides a dynamic mechanism to resolve conflicts by means of defined conflict resolution functions). The system described in Sattler et al. (2003) addresses how to solve both semantic and instance-level conflicts. The proposed solution is based on a multi-database query language, called FraQL, which is an extension of SQL with conflict resolution mechanisms. A system that also takes into account metadata for instance-level conflict resolution is described in Fan et al. (2001). Such a system adopts the ideas of the context interchange framework (Bressan et al., 1997); therefore, context-dependent and independent conflicts are distinguished, and, following this specific approach, conversion rules are discovered for pairs of systems. Finally, among the techniques explicitly proposed to perform query answering in CISs, we cite Naumann et al. (1999), in which an algorithm to perform query planning based on the evaluation of data sources’ qualities, specific queries’ qualities, and query results’ qualities is described.
MAIN THRUST In current government and business scenarios, organizations start cooperating in order to offer services to their customers and partners. Organizations that cooperate have business links (i.e., relationships, exchanged documents, resources, knowledge, etc.) connecting each other. Specifically, organizations exploit business services (e.g., they exchange data or require services to be carried out) on the basis of business links, and, therefore, the network of organizations and business links constitutes a cooperative business system. As an example, a supply chain, in which some enterprises offer basic products and some others assemble them in order to deliver final products to customers, is a cooperative business system. As another example, a set of public administrations, which need to exchange information about citizens and their health state in order to provide social aids, is a cooperative business system derived from the Italian e-government scenario (Batini & Mecella, 2001). A cooperative business system exists independently of the presence of a software infrastructure supporting electronic data exchange and service provisioning. Indeed, cooperative information systems are software systems supporting cooperative business systems; in the remainder of this article, the following definition of CIS is considered: A cooperative information system is formed by a set of organizations that cooperate through a communication infrastructure N, which provides software services to organizations as well as reliable connectivity. Each organization is connected to N through a gateway G, on which software services offered by the organization to other organizations are deployed. A user is a software or human entity residing within an organization and using the cooperative system. Several CISs are characterized by a high degree of data replicated in different organizations; for example, in an e-government scenario, the personal data of a citizen are stored by almost all administrations. But in such scenarios, the different organizations can provide the same data with different quality levels; thus, any user of data would prefer to exploit the data with the highest quality level among those provided. Therefore, only the highest quality data should be returned to the user, limiting the dissemination of low quality data. Moreover, the comparison of the gathered data values might be used to enforce a general improvement of data quality in all organizations. In the context of the DaQuinCIS project¹, we are proposing an architecture for the management of data
quality in CISs; this architecture allows the diffusion of data and related quality and exploits data replication to improve the overall quality of cooperative data. The interested reader can find a detailed description of the design and implementation of the architecture in Scannapieco et al. (2004). Each organization offers services to other organizations on its own cooperative gateway and also specific services to its internal back-end systems. Therefore, cooperative gateways interface both internally and externally through services. Moreover, the communication infrastructure offers some specific services. Services are all identical and peer (i.e., they are instances of the same software artifacts and act both as servers and clients of the other peers, depending on the specific activities to be carried out). The overall architecture is depicted in Figure 1.
Figure 1. An architecture for enhancing data quality in cooperative information systems. The figure shows the cooperative gateways of organizations Org1, Org2, …, Orgn, each hosting a Quality Factory (QF), a Data Quality Broker (DQB), and a Quality Notification Service (QNS) and connected to the internals of the organization (its back-end systems); the gateways communicate through the communication infrastructure, which also provides the Rating Service.
Organizations export data and quality data according to a common model referred to as the Data and Data Quality (D2Q) model. It includes the definitions of (i) constructs to represent data, (ii) a common set of data quality properties, (iii) constructs to represent them, and (iv) the association between data and quality data. In order to produce data and quality data according to the D2Q model, each organization deploys on its cooperative gateway a quality factory service that is responsible for evaluating the quality of its own data. The design of the quality factory has been addressed in Cappiello et al. (2003). The Data Quality Broker poses, on behalf of a requesting user, a data request over other cooperating organizations, also specifying a set of quality requirements that the desired data have to satisfy; this is referred to as the quality brokering function. Different copies of the same data received as responses to the request are reconciled, and a best-quality value is selected and proposed to organizations, which can choose to discard their data and to adopt higher quality ones; this is referred to as the quality improvement function.
If the requirements specified in the request cannot be satisfied, then the broker initiates a negotiation with the user, who can optionally weaken the constraints on the desired data. The data quality broker is, in essence, a data integration system (Lenzerini, 2002) deployed as a peer-to-peer system, which allows users to pose a quality-enhanced query over a global schema and to select data satisfying such requirements. The Quality Notification Service is a publish/subscribe engine used as a quality message bus between services and/or organizations. More specifically, it allows quality-based subscriptions for users to be notified on changes of the quality of data. For example, an organization may want to be notified if the quality of some data it uses degrades below a certain threshold, or when high quality data are available. The quality notification service is also deployed as a peer-to-peer system. The Rating Service associates trust values with each data source in the CIS. These values are used to determine the reliability of the quality evaluation performed by organizations. The rating service is a centralized service, to be provided by a third-party organization. The interested reader can find further details on the rating service design and implementation in De Santis et al. (2003).
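To make the quality brokering idea concrete, the following Python sketch shows how a broker-like component might choose among replicated copies of the same data item on the basis of attached quality metadata. It is only an illustration under assumed structures: the quality dimensions (accuracy, currency), the scoring weights, and the fallback behavior are placeholders, not the D2Q model or the actual DaQuinCIS implementation.

```python
# Illustrative sketch (not the DaQuinCIS implementation): selecting the
# best-quality copy of a replicated data item from the responses of several
# organizations, using quality metadata attached to each copy.
from dataclasses import dataclass

@dataclass
class Copy:
    org: str          # organization providing the copy
    value: str        # the data value itself
    accuracy: float   # illustrative quality dimensions in [0, 1]
    currency: float

def best_copy(copies, min_accuracy=0.0):
    """Return the copy with the highest aggregate quality score that also
    satisfies the accuracy requirement attached to the query."""
    eligible = [c for c in copies if c.accuracy >= min_accuracy]
    if not eligible:
        return None  # a real broker would instead negotiate weaker constraints
    # Simple weighted score; a real broker would use richer quality metadata.
    return max(eligible, key=lambda c: 0.6 * c.accuracy + 0.4 * c.currency)

copies = [Copy("Org1", "Via Salaria 113", 0.9, 0.7),
          Copy("Org2", "Via Salaria 11",  0.6, 0.9)]
print(best_copy(copies, min_accuracy=0.8))   # Org1's copy is selected
```

The selected value could then be proposed back to the other organizations, mirroring the quality improvement function described above.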
FUTURE TRENDS The complete development of a framework for data quality management in CISs requires the solution of further issues. An important aspect concerns the techniques to be used for quality dimension measurement. In both statistical and machine learning areas, some techniques could be usefully exploited. The general idea is to have quality values estimated with a certain probability instead of a deterministic quality evalua-
tion. In this way, the task of assigning quality values to each data value could be considerably simplified. The data quality broker covers some aspects of quality-driven query processing in CISs. Nevertheless, there is still the need to investigate instance reconciliation techniques; when quality values are not attached to exported data, how is it possible to select between two conflicting instances of the same data? Current data integration systems simply do not provide any answer in such cases, but looser semantics for query answering is needed in order to make data integration systems actually work in real scenarios where errors and conflicts are present. Some further open issues concern trusting data sources. Data ownership is an important one. From a data quality perspective, assigning responsibility for data helps to initiate improvement actions as well as to trust the sources providing the data. In some cases, laws help to assign responsibilities for certain types of data, but this is not always possible. Models and techniques that allow trusting such sources with respect to the data they provide are an open and important issue, especially when data sources interact in open and dynamic environments like peer-to-peer systems.
CONCLUSION Managing data quality in CISs requires solving problems from many research areas of computer science, such as databases, software engineering, distributed computing, security, and information systems. This implies that the proposal of integrated solutions is very challenging. In this article, an architecture to support data quality management in CISs has been described; such an architecture consists of modules that provide some solutions to principal data quality problems in CISs; namely, quality-driven query answering, data quality access, data quality maintenance, and trust management. The architecture is suitable in all contexts in which data stored by different and distributed sources are overlapping and affected by data errors. Among such contexts, the architecture has been validated in the Italian e-government scenario, in which all public administrations store data about citizens that need to be corrected and reconciled.
REFERENCES
Batini, C., & Mecella, M. (2001). Enabling Italian e-government through a cooperative architecture. IEEE Computer, 34(2), 40-45.
Berti-Equille, L. (2003). Quality-extended query processing for distributed processing. Proceedings of the ICDT’03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS’03), Siena, Italy.
Bertolazzi, P., & Scannapieco, M. (2001). Introducing data quality in a cooperative context. Proceedings of the 6th International Conference on Information Quality (ICIQ’01), Boston, Massachusetts.
Bressan, S. et al. (1997). The context interchange mediator prototype. Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona.
Cappiello, C. et al. (2003). Data quality assurance in cooperative information systems: A multi-dimension quality certificate. Proceedings of the ICDT’03 International Workshop on Data Quality in Cooperative Information Systems (DQCIS’03), Siena, Italy.
De Michelis, G. et al. (1997). Cooperative information systems: A manifesto. In M.P. Papazoglou & G. Schlageter (Eds.), Cooperative information systems: Trends & directions. London: Academic Press.
De Santis, L., Scannapieco, M., & Catarci, T. (2003). Trusting data quality in cooperative information systems. Proceedings of the 11th International Conference on Cooperative Information Systems (CoopIS’03), Catania, Italy.
Fan, W., Lu, H., Madnick, S., & Cheung, D. (2001). Discovering and reconciling value conflicts for numerical data integration. Information Systems, 26(8), 635-656.
Fellegi, I., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.
Fellegi, I., & Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Galhardas, H. et al. (2000). An extensible framework for data cleaning. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, California.
Hernandez, M., & Stolfo, S. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 9-37.
Jarke, M., Lenzerini, M., Vassiliou, V., & Vassiliadis, P. (Eds.). (1995). Fundamentals of data warehouses. Berlin, Heidelberg, Germany: Springer Verlag.
Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of the 21st ACM Symposium on Principles of Database Systems (PODS 2002), Madison, Wisconsin.
Mihaila, G., Raschid, L., & Vidal, M. (2000). Using quality of data metadata for source selection and ranking. Proceedings of the Third International Workshop on the Web and Databases (WebDB’00), Dallas, Texas.
Naumann, F., Leser, U., & Freytag, J. (1999). Quality-driven integration of heterogenous information systems. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), Edinburgh, Scotland.
Redman, T. (1996). Data quality for the information age. Norwood, MA: Artech House.
Sattler, K., Conrad, S., & Saake, G. (2003). Interactive example-driven integration and reconciliation for accessing database integration. Information Systems, 28(5), 393-413.
Scannapieco, M. et al. (2004). The DaQuinCIS architecture: A platform for exchanging and improving data quality in cooperative information systems. Information Systems, 29(7), 551-582.
Wand, Y., & Wang, R. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86-95.
Wang, R. (1998). A product perspective on total data quality management. Communications of the ACM, 41(2), 58-65.
Wang, R., Kon, H., & Madnick, S. (1993). Data quality requirements: Analysis and modeling. Proceedings of the 9th International Conference on Data Engineering (ICDE ’93), Vienna, Austria.
Winkler, W.E. (2004). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.
Yan, L., & Ozsu, T. (1999). Conflict-tolerant queries in AURORA. Proceedings of the Fourth International Conference on Cooperative Information Systems (CoopIS’99), Edinburgh, Scotland.
KEY TERMS
Cooperative Information Systems: Set of geographically distributed information systems that cooperate on the basis of shared objectives and goals.
Data Integration System: System that presents data distributed over heterogeneous data sources according to a unified view. A data integration system allows processing of queries over such a unified view by gathering results from the various data sources.
Data Quality: Data have a good quality if they are fit for use. Data quality is measured in terms of many dimensions or characteristics, including accuracy, completeness, consistency, and currency of electronic data.
Data Schema: Collection of data types and relationships described according to a particular type language.
Data Schema Instance: Collection of data values that are valid (i.e., conform) with respect to a data schema.
Metadata: Data providing information (e.g., quality, provenance, etc.) about application data. Metadata are associated with data according to specific data models.
Peer-to-Peer Systems: Distributed systems in which each node can be both client and server with respect to service and data provision.
Publish and Subscribe: Distributed communication paradigm according to which subscribers are notified about events that have been published by publishers and that are of interest to them.
ENDNOTE
1. DaQuinCIS - Methodologies and Tools for Data Quality inside Cooperative Information Systems (http://www.dis.uniroma1.it/~dq) is a joint research project carried out by DIS - Università di Roma “La Sapienza”, DISCo - Università di Milano “Bicocca” and DEI - Politecnico di Milano.
Data Quality in Data Warehouses
William E. Winkler, U.S. Bureau of the Census, USA
INTRODUCTION Fayyad and Uthurusamy (2002) have stated that the majority of the work (representing months or years) in creating a data warehouse is in cleaning up duplicates and resolving other anomalies. This article provides an overview of two methods for improving quality. The first is data cleaning for finding duplicates within files or across files. The second is edit/imputation for maintaining business rules and for filling in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).
BACKGROUND When data from several sources are successfully combined in a data warehouse, many new analyses can be done that might not be done on individual files. If duplicates are present within a file or across a set of files, then the duplicates might be identified. Data cleaning or record linkage uses name, address, and other information, such as income ranges, type of industry, and medical treatment category, to determine whether two or more records should be associated with the same entity. Related types of files might be combined. In the health area, a file of medical treatments and related information might be combined with a national death index. Sets of files from medical centers and health organizations might be combined over a period of years to evaluate the health of individuals and discover new effects of different types of treatments. Linking files is an alternative to exceptionally expensive follow-up studies. The uses of the data are affected by lack of quality due to the duplication of records and missing or erroneous values of variables. Duplication can waste money and yield error. If a hospital has a patient incorrectly represented in two different accounts, then the hospital might repeatedly bill the patient. Duplicate records may inflate the numbers and amounts in overdue-billing categories. If the quantitative amounts associated with some accounts are missing, then the totals may be biased low. If values associated with variables such as
billing amounts are erroneous because they do not satisfy edit or business rules, then totals may be biased low or high. Imputation rules can supply replacement values for erroneous or missing values that are consistent with the edit rules and preserve joint probability distributions. Files without error can be effectively data mined.
MAIN THRUST This section provides an overview of data cleaning and of statistical data editing and imputation. The cleanup and homogenization of the files are preprocessing steps prior to data mining.
Data Cleaning Data cleaning is also referred to as record linkage or object identification. Record linkage was introduced by Newcombe, Kennedy, Axford, and James (1959) and given a formal mathematical framework by Fellegi and Sunter (1969). Notation is needed. Two files, A and B, are matched. The idea is to classify pairs in a product space, A × B, from two files A and B into M, the set of true matches, and U, the set of true nonmatches. Fellegi and Sunter considered ratios of conditional probabilities of the form
R = P(γ ∈ Γ | M) / P(γ ∈ Γ | U)    (1)
where γ is an arbitrary agreement pattern in a comparison space Γ. For instance, Γ might consist of eight patterns representing simple agreement or disagreement on the largest name component, street name, and street number. Alternatively, each γ ∈ Γ might additionally account for the relative frequency with which specific values of name components such as “Smith,” “Zabrinsky,” “AAA,” and “Capitol” occur. Ratio R, or any monotonically increasing function of it, such as the natural log, is referred to as a matching weight (or score). The decision rule (2) is given by the following statements:
• If R > Tµ, then designate the pair as a match.
• If Tλ ≤ R ≤ Tµ, then designate the pair as a possible match and hold it for clerical review.
• If R < Tλ, then designate the pair as a nonmatch.
The cutoff thresholds Tµ and Tλ are determined by a priori error bounds on false matches and false nonmatches. Rule 2 agrees with intuition. If γ ∈ Γ consists primarily of agreements, then γ ∈ Γ would intuitively be more likely to occur among matches than nonmatches, and Ratio 1 would be large. On the other hand, if γ ∈ Γ consists primarily of disagreements, then Ratio 1 would be small. Rule 2 partitions the set γ ∈ Γ into three disjoint subregions. The region Tλ ≤ R ≤ Tµ corresponds to pairs held for clerical review. The conditional probabilities in Ratio 1 can be estimated by using
the EM algorithm. The parameters are known to vary significantly across files (Winkler, 1999b). They can even vary significantly across similar files representing an urban area and an adjacent suburban area. If two files each contain 1,000 or more records, then bringing together all pairs from two files is impractical, due to the small number of potential matches within the total set of pairs. Blocking is the method of considering only pairs that agree exactly (character by character) on subsets of fields. For instance, a set of blocking criteria may be to consider only pairs that agree on the U.S. Postal zip code and the first character of the last name. Additional blocking passes may be needed to obtain matching pairs that are missed by earlier blocking passes (Newcombe et al., 1959; Hernandez & Stolfo, 1995; Winkler, 2004a).
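The following Python sketch illustrates these ideas: a matching weight computed as the natural log of Ratio 1 under a field-independence assumption, the three-way decision rule, and blocking on zip code plus the first character of the last name. The m- and u-probabilities and the thresholds are illustrative values, not estimates produced by the EM procedure mentioned above.

```python
import math
from collections import defaultdict

# Illustrative m- and u-probabilities: P(field agrees | M) and P(field agrees | U).
M_PROB = {"last_name": 0.95, "zip": 0.90, "street_no": 0.85}
U_PROB = {"last_name": 0.01, "zip": 0.05, "street_no": 0.10}

def match_weight(a, b):
    """Natural log of Ratio 1, assuming the fields contribute independently."""
    w = 0.0
    for field in M_PROB:
        if a.get(field) == b.get(field):
            w += math.log(M_PROB[field] / U_PROB[field])
        else:
            w += math.log((1.0 - M_PROB[field]) / (1.0 - U_PROB[field]))
    return w

def classify(weight, t_lower=-2.0, t_upper=4.0):
    """Three-way decision rule with illustrative cutoff thresholds."""
    if weight > t_upper:
        return "match"
    if weight < t_lower:
        return "nonmatch"
    return "possible match (clerical review)"

def blocked_comparisons(file_a, file_b):
    """Blocking: compare only pairs agreeing on zip code and first letter of last name."""
    index = defaultdict(list)
    for rec in file_b:
        index[(rec["zip"], rec["last_name"][:1])].append(rec)
    for rec_a in file_a:
        for rec_b in index[(rec_a["zip"], rec_a["last_name"][:1])]:
            yield rec_a, rec_b, classify(match_weight(rec_a, rec_b))

file_a = [{"last_name": "Smith", "zip": "20233", "street_no": "4600"}]
file_b = [{"last_name": "Smith", "zip": "20233", "street_no": "4600"},
          {"last_name": "Smyth", "zip": "20233", "street_no": "12"}]
print(list(blocked_comparisons(file_a, file_b)))
```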
Statistical Data Editing and Imputation Correcting inconsistent information and filling in missing information needs to be efficient and cost effective. For single fields, edits are straightforward. A lookup table may yield correct diagnostic or zip codes. For multiple fields, an edit might require that an individual younger than 15 years of age must have a marital status of unmarried. If a record fails this edit, then a subsequent procedure would need to change either the age or the marital status. Editing has been done extensively in statistical agencies since the 1950s. Early work was clerical. Later, computer programs applied if-then-else rules with logic similar to the clerical review. The main disadvantage was that edits that did not fail initially for a record could fail as the values in fields associated with edit failures were changed. Fellegi and Holt (1976) provided a theoretical model. In providing their model, they had three goals:
1. The data in each record should be made to satisfy all edits by changing the fewest possible variables (fields).
2. Imputation rules should derive automatically from edit rules.
3. When imputation is necessary, it should maintain the joint distribution of variables.
Fellegi and Holt (1976; Theorem 1) proved that implicit edits are needed for solving the problem of Goal 1. Implicit edits are those that can be logically derived from explicitly defined edits. Implicit edits provide information about edits that do not fail initially for a record but may fail as the values in fields that are associated with failing edits are changed. The following example illustrates some of the computational issues. An edit can be considered as a set of points. Let edit E =
{married & age ≤ 15}. Let r be a data record. Then r ∈ E => r fails the edit. This formulation is equivalent to “If age ≤ 15, then not married.” If a record r fails a set of edits, then one field in each of the failing edits must be changed. An implicit edit E3 can be implied from two explicitly defined edits E1 and E2; i.e., E1 and E2 => E3.
E1 = {age ≤ 15, married, . }
E2 = { . , not married, spouse}
E3 = {age ≤ 15, . , spouse}
The edits restrict the fields age, marital status, and relationship to head of household. Implicit edit E3 is derived from E1 and E2. If E3 fails for a record r = {age ≤ 15, not married, spouse}, then necessarily either E1 or E2 fails. Assume that the implicit edit E3 is unobserved. If edit E2 fails for record r, then one possible correction is to change the marital status field in record r to “married” in order to obtain a new record r1. Record r1 does not fail for E2 but now fails for E1. The additional information from edit E3 assures that record r satisfies all the edits after changing one additional field. For larger data situations with more edits and more fields, the number of possibilities increases at a very high exponential rate. In data warehouse situations, the ease of implementing the ideas of Fellegi and Holt (1976) by using generalized edit software is dramatic. An analyst who has knowledge of the edit situations might put together the edit tables in a relatively short time. Kovar and Winkler (1996) compared two edit systems on economic data. Both were installed and run in less than one day. In many business situations, only a few simple edit rules might be needed. In their books on data quality, Redman (1996), English (1999), and Loshin (2001) have described editing files to assure that business rules are satisfied. The authors have not noted the difficulties of applying hard-coded if-then-else rules and the relative ease of applying Fellegi and Holt’s methods.
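As an illustration of Goal 1 (changing the fewest possible fields), the sketch below encodes the edits of the example and searches for a minimal set of fields whose values can be changed so that no edit fails. It uses brute-force search over hypothetical field domains; production systems instead generate implicit edits and use set-covering or integer-programming machinery, so this is only a toy illustration.

```python
from itertools import combinations

# Edits expressed as predicates; a record FAILS an edit if the predicate is true.
EDITS = [
    lambda r: r["age"] <= 15 and r["marital"] == "married",                 # E1
    lambda r: r["marital"] == "not married" and r["relation"] == "spouse",  # E2
]

# Hypothetical (tiny) domains for each field, used only for the search.
DOMAIN = {"age": [10, 20, 40], "marital": ["married", "not married"],
          "relation": ["spouse", "child", "head"]}

def failed(record):
    """Indices of the edits that the record fails."""
    return [i for i, edit in enumerate(EDITS) if edit(record)]

def satisfiable_by_changing(record, fields):
    """True if some assignment to `fields` makes the record pass all edits."""
    if not fields:
        return not failed(record)
    first, rest = fields[0], fields[1:]
    return any(satisfiable_by_changing(dict(record, **{first: v}), rest)
               for v in DOMAIN[first])

def minimal_repair(record):
    """Brute-force error localization: the fewest fields to change (Goal 1)."""
    names = list(DOMAIN)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            if satisfiable_by_changing(record, list(subset)):
                return subset
    return None

r = {"age": 14, "marital": "not married", "relation": "spouse"}
print(failed(r))          # edit E2 fails for this record
print(minimal_repair(r))  # changing the single field "relation" suffices
```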
FUTURE TRENDS There are several trends in data cleaning. First, some research considers better search strategies to compensate for typographical error. Winkler (2003b, 2004a) applies efficient blocking strategies for bringing together pairs in a situation where one file and its indexes are held in memory. This allows the matching of a moderate-size file of 100 million records (which will reside in 4 GB of memory) against large administrative files of upwards of 4 billion records. This type of record linkage requires only one pass against both files, whereas conventional record linkage requires many sorts, matching passes of both files, and large amounts of disk space.
Chaudhuri, Gamjam, Ganti, and Motwani (2003) provide a method of indexing that significantly improves over brute-force methods, in which all pairs in two files are compared. Second, some research deals with better string comparators for comparing fields having typographical errors. Cohen et al. (2003) have methods that improve over the basic Jaro-Winkler methods (Winkler, 2004b). Third, another research area investigates methods to standardize and parse general names and address fields into components that can be more easily compared. Borkar, Deshmukh, and Sarawagi (2001), Churches, Christen, Lu, and Zhu (2002), and Agichtein and Ganti (2004) have Hidden Markov methods that work as well as or better than some of the rule-based methods. Although the Hidden Markov methods require training data, Churches et al. provide methods for quickly creating additional training data for new applications. The additional training data supplement a core set of generic training data. The training data consists of free-form records and the corresponding records that have been processed into components. Fourth, because training data are rarely available and optimal matching parameters are needed, some research investigates unsupervised learning methods that do not require training data. Ravikumar and Cohen (2004) have unsupervised learning methods that improve over the basic EM methods of Winkler (1995, 1999b, 2003b). Their methods are even competitive with supervised learning methods. Fifth, other research considers methods of (nearly) automatic error rate estimation with little or no training data. Winkler (2002) considers methods that use unlabeled data and very small subsets of labeled data (training data). Sixth, some research considers methods for adjusting statistical and other types of analyses for matching error. Lahiri and Larsen (2004) have methods for improving the accuracy of statistical analyses in the presence of matching error. Their methods extend ideas introduced by Scheuren and Winkler (1993, 1997). There are three trends for edit/imputation research. The first trend comprises faster ways of determining the minimum number of variables containing values that contradict the edit rules. De Waal (2003a, 2003b, 2003c) applies Fourier-Motzkin and cardinality-constrained Chernikova algorithms that allow direct bounding of the number of computational paths. Riera-Ledesma and Salazar-Gonzalez (2004) apply clever heuristics in the setup of the problems that allow direct integer programming methods to perform much faster. The second trend is to apply machine learning methods. Nordbotten (1995) applies neural nets to the basic editing problem. Di Zio, Scanu, Coppola, Luzi, and Ponti (2004) apply Bayesian networks to the imputation problem. The advantage of these approaches is that they do not require the detailed rule elicitation of other edit
approaches; they depend on representative training data. The training data consists of unedited records and the corresponding edited records after review by subject matter specialists. The third trend comprises methods that preserve both statistical distributions and edit constraints in the preprocessed data. Winkler (2003a) connects the generalized imputation methods of Little and Rubin (2002) with generalized edit methods of Winkler (1999a, 2003b). The potential advantage is that the statistical properties needed for data mining may be preserved.
CONCLUSION
To data mine effectively, data need to be preprocessed in a variety of steps that include removing duplicates, performing statistical data editing and imputation, and doing other cleanup and regularization of the data. If moderate errors exist in the data, data mining may waste computational and analytic resources with little gain in knowledge.
REFERENCES
Agichtein, E., & Ganti, V. (2004). Mining reference tables for automatic text segmentation. ACM Special Interest Group on Knowledge Discovery and Data Mining (pp. 20-29).
Borkar, V., Deshmukh, K., & Sarawagi, S. (2001). Automatic segmentation of text into structured records. ACM SIGMOD 2001 (pp. 175-186).
Chaudhuri, S., Gamjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient match for on-line data cleaning. Proceedings of the ACM SIGMOD Conferences, 2003 (pp. 313-324).
Churches, T., Christen, P., Lu, J., & Zhu, J. X. (2002). Preparation of name and address data for record linkage using hidden Markov models. BioMed Central Medical Informatics and Decision Making, 2(9). Retrieved from http://www.biomedcentral.com/1472-6947/2/9/
Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003). A comparison of string metrics for matching names and addresses. Proceedings of the International Joint Conference on Artificial Intelligence, Mexico.
De Waal, T. (2003a). Solving the error localization problem by means of vertex generation. Survey Methodology, 29(1), 71-79.
De Waal, T. (2003b). A fast and simple algorithm for automatic editing of mixed data. Journal of Official Statistics, 19(4), 383-402.
De Waal, T. (2003c). Processing of erroneous and unsafe data. Rotterdam: ERIM Research in Management.
Di Zio, M., Scanu, M., Coppola, L., Luzi, O., & Ponti, A. (2004). Bayesian networks for imputation. Journal of the Royal Statistical Society, A, 167(2), 309-322.
English, L. P. (1999). Improving data warehouse and business information quality: Methods for reducing costs and increasing profits. New York: Wiley.
Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the Association of Computing Machinery, 45(8), 28-31.
Fellegi, I. P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.
Fellegi, I. P., & Sunter, A. B. (1969). A theory of record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Hernandez, M., & Stolfo, S. J. (1995). The merge/purge problem for large databases. ACM SIGMOD 1995 (pp. 127-138).
Kovar, J. G., & Winkler, W. E. (1996). Editing economic data. Proceedings of the Section on Survey Research Methods, American Statistical Association (pp. 81-87). Retrieved from http://www.census.gov/srd/www/byyear.html
Lahiri, P. A., & Larsen, M. D. (2004). Regression analysis with linked data. Journal of the American Statistical Association.
Little, R. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.
Loshin, D. (2001). Enterprise knowledge management: The data quality approach. San Diego: Morgan Kaufman.
Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage of vital records. Science, 130, 954-959.
Nordbotten, S. (1995). Editing statistical records by neural networks. Journal of Official Statistics, 11, 393-414.
Ravikumar, P., & Cohen, W. W. (2004). A hierarchical graphical model for record linkage. Proceedings of the Conference on Uncertainty in Artificial Intelligence, USA. Retrieved from http://www.cs.cmu.edu/~wcohen
Redman, T. C. (1996). Data quality in the information age. Boston, MA: Artech.
Riera-Ledesma, J., & Salazar-Gonzalez, J.-J. (2004). A branch-and-cut algorithm for the error localization problem in data cleaning (Tech. Rep.). Tenerife, Spain: Universidad de la Laguna.
Scheuren, F., & Winkler, W. E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39-58.
Scheuren, F., & Winkler, W. E. (1997). Regression analysis of data files that are computer matched, II. Survey Methodology, 23, 157-165.
Winkler, W. E. (1995). Matching and record linkage. In B. G. Cox (Eds.), Business survey methods (pp. 355-384). New York: Wiley.
Winkler, W. E. (1999a). The state of statistical data editing. In Statistical data editing (pp. 169-187). Rome: ISTAT.
Winkler, W. E. (1999b). The state of record linkage and current research problems. Proceedings of the Survey Methods Section, Statistical Society of Canada (pp. 73-80).
Winkler, W. E. (2002). Record linkage and Bayesian networks. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html
Winkler, W. E. (2003a). A contingency table model for imputing data satisfying analytic constraints. Proceedings of the Section on Survey Research Methods, American Statistical Association. Retrieved from http://www.census.gov/srd/www/byyear.html
Winkler, W. E. (2003b). Data cleaning methods. Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, USA. Retrieved from http://csaa.byu.edu/kdd03cleaning.html
Winkler, W. E. (2004a). Approximate string comparator search strategies for very large administrative lists. Proceedings of the Section on Survey Research Methods, American Statistical Association.
Winkler, W. E. (2004b). Methods for evaluating and creating data quality. Information Systems, 29(7), 531-550.
KEY TERMS
Data Cleaning: The methodology of identifying duplicates in a single file or across a set of files by using a name, address, and other information.
Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.
Edit Restraints: Logical restraints such as business rules that assure that an employee’s listed salary in a job category is not too high or too low or that certain contradictory conditions, such as a male hysterectomy, do not occur.
Imputation: The method of filling in missing data that sometimes preserves statistical distributions and satisfies edit restraints.
Preprocessed Data: In preparation for data mining, data that have been through preprocessing such as data cleaning or edit/imputation.
Rule Induction: The process of learning, from cases or instances, if-then rule relationships consisting of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).
Training Data: A representative subset of records for which the truth of classifications and relationships is known and that can be used for rule induction in machine learning models.
Data Reduction and Compression in Database Systems
Alexander Thomasian, New Jersey Institute of Technology, USA
INTRODUCTION Data compression is storing data such that it requires less space than usual. Data compression has been effectively used in storing data in a compressed form on magnetic tapes, disks, and even main memory. In many cases, updated data cannot be stored in place when it is not compressible to the same or smaller size. Compression also reduces the bandwidth requirements in transmitting (program) code, data, text, images, speech, audio, and video. The transmission may be from main memory to the CPU and its caches, from tape and disk into main memory, or over local, metropolitan, and wide area networks. When data compression is used, transmission time improves or, conversely, the required transmission bandwidth is reduced. Two excellent texts on this topic are Sayood (2002) and Witten, Bell, and Moffat (1999). Huffman encoding is a popular data compression method. It substitutes the symbols of an alphabet with k bits per symbol, so that frequent symbols are represented with fewer than k bits, and less common symbols with more than k bits. A significant saving in space is possible when the distribution is highly skewed, for example, the Zipf distribution, because the average number of bits to represent symbols is smaller than k bits. Arithmetic coding is a more sophisticated technique that represents a string with an appropriate fraction. Lempel-Ziv coding substitutes a character string by an index to a dictionary or a previous occurrence and the string length. Variations of these algorithms, separately or in combination, are used in many applications. Compression can be lossy or lossless. Lossy data compression is utilized in cases where some data loss can be tolerated, for example, a restored compressed image may not be discernibly different from the original. Lossless data compression, which restores compressed data to its original value, is absolutely necessary in some applications. Quantization is a lossy data compression method, which represents data more coarsely than the original signal. General purpose data compression is applicable to data warehouses and databases. For example, a variation of the lossless Lempel-Ziv method has been applied to DB2 records. This is accomplished by analyzing the records in a relational table and building dictionaries,
which are then used in compressing the data. Because relational tables are organized as database pages, each page on disk (and in main memory) will hold compressed data, so the number of records per page is doubled. Data is decompressed on demand, with or without hardware assistance. Data reduction operates at a higher level of abstraction than data compression, although data compression methods can be used in conjunction with data reduction. An example is quantizing the data in a matrix, which has been dimensionally reduced via the SVD method, as I describe in the following sections. This concludes the discussion of data compression; the remainder of this article deals with data reduction.
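As a concrete illustration of the Huffman idea sketched above (frequent symbols receive short codewords, rare symbols long ones), here is a minimal Python sketch. It builds the code greedily with a heap and is meant only to show the principle, not to serve as a production encoder.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the symbols of `text`."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, {symbol: codeword-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol alphabet
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent groups
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

text = "aaaaabbbcc d"
codes = huffman_codes(text)
encoded = "".join(codes[s] for s in text)
print(codes, len(encoded), "bits vs", 8 * len(text), "bits uncoded")
```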
MAIN THRUST Recent interest in data reduction resulted in the New Jersey Data Reduction Report (Barbara et al., 1997), which classifies data reduction methods into parametric and nonparametric. Histograms, clustering, and indexing structures are examples of nonparametric methods. Another classification is direct versus transform-based methods. Singular value decomposition (SVD) and discrete wavelet transforms are parametric transform-based methods. In the following few sections, I briefly introduce the aforementioned methods.
SVD SVD is applicable to a two-dimensional M×N matrix X, where M is the number of objects, and there are N features per object. For example, M may represent the number of customers, and the columns represent the amount they spend on each of the N products. M may be in the millions, while N is in the hundreds or even thousands. According to SVD we have the decomposition X = USVt, where U is another M×N matrix, S is a diagonal N×N matrix of singular values, sn, 1 ≤ n ≤ N, and V is an N×N matrix whose columns (the principal components) are orthonormal. The matrix C = XtX/M, which is the covariance matrix when the columns of X have zero mean, can then be written as C = VΛVt,
where Λ is a diagonal matrix of eigenvalues, λn = sn²/M. We assume, without loss of generality, that the eigenvalues are in nonincreasing order, such that the transformation of coordinates into the principal components will yield Y = XV, whose columns are in decreasing order of their energy or variance. There is a reduction in the rank of the matrix when some eigenvalues are equal to zero. If we retain the first p columns of Y, the Normalized Mean Square Error (NMSE) is equal to the sum of the eigenvalues of the discarded columns divided by the trace of the matrix (the sum of the eigenvalues or diagonal elements of C, which remains invariant). A significant reduction in the number of columns can be attained at a relatively small NMSE, as shown in numerous studies (see Korn, Jagadish, & Faloutsos, 1997). Higher dimensional data, as in the case of data warehouses, for example, (product, customer, date) (dollars), can be reduced to two dimensions by appropriate transformations, for example, with products as rows and (customer × date) as columns. The columns of dataset X may not be globally correlated —for example, high-income customers buy expensive items, and low-income customers buy economy items — so that the items bought by these two groups of customers are disjoint. Higher data compression (for a given NMSE) can be attained by first clustering the data, using an off-the-shelf clustering method, such as k-means (Dunham, 2003), and then applying SVD to the clusters (Castelli, Thomasian, & Li, 2003). More sophisticated clustering methods, which generate elliptical clusters, may yield higher dimensionality reduction. An SVD-friendly clustering method, which generates clusters amenable to dimensionality reduction, is proposed in Chakrabarti and Mehrotra (2000). K-nearest-neighbor (k-NN) queries can be carried out with respect to a dataset that has been subjected to SVD by first transforming the query point to the appropriate coordinates by using the principal components. In the case of multiple clusters, we first need to determine the cluster to which the query point belongs. In the case of the k-means clustering method, the query point belongs to the cluster with the closest centroid. After determining the k nearest neighbors in the primary cluster, I need to determine if other clusters are to be searched. A cluster is searched if the hypersphere centered on the query point, with the k nearest neighbors inside it, intersects with the hypersphere of that cluster. This step is repeated until no more intersections exist. Multidimensional scaling (MS) is another method for dimensionality reduction (Kruskal & Wish, 1978). Given the pair-wise distances or dissimilarities among a set of objects, the goal of MS is to represent them in k dimensions so that their distances are preserved. A stress function, which is the sum of squares of the
difference between the distances of points with k dimensions and the original distance, is used to represent the goodness of the fit. The value of k should be selected to be as small as possible, while stress is maintained at an appropriately low level. A fast, approximate alternative is FASTMAP, whose goal is to find a k-dimensional space that matches the distances of an N×N matrix for N points (Faloutsos & Lin, 1995).
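The following numpy sketch illustrates the SVD-based reduction and the NMSE criterion described above: it keeps the smallest number of principal components such that the NMSE (the sum of the discarded eigenvalues over the trace) stays below a target. Centering the columns, the 5% target, and the synthetic rank-2 data are assumptions made only for the illustration.

```python
import numpy as np

def reduce_dimensions(X, nmse_target=0.05):
    """Keep the fewest principal components p such that the NMSE (sum of the
    discarded eigenvalues divided by the trace) does not exceed nmse_target."""
    Xc = X - X.mean(axis=0)                      # center the columns (an assumption here)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eig = s**2 / X.shape[0]                      # eigenvalues: lambda_n = s_n^2 / M
    nmse = 1.0 - np.cumsum(eig) / eig.sum()      # NMSE after keeping the first p components
    p = int(np.argmax(nmse <= nmse_target)) + 1
    Vp = Vt[:p].T                                # N x p matrix of principal components
    return Xc @ Vp, Vp                           # Y_p = X V_p, plus the components kept

rng = np.random.default_rng(0)
t = rng.normal(size=(1000, 2))                                   # two hidden factors
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0] + t[:, 1],
                     t[:, 0] - t[:, 1], 2 * t[:, 0]])            # five correlated columns
Y, Vp = reduce_dimensions(X)
print(Y.shape)                                   # (1000, 2): two components suffice
```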
WAVELETS According to Fourier’s theorem, a continuous function can be expressed as the sum of sinusoidal functions. A discrete signal with n points can be expressed by the n coefficients of a Discrete Fourier Transform (DFT). According to Parseval’s theorem, the energy in the time and frequency domain are equal (Faloutsos, 1996). The DFT consists of the sum of sine and cosine functions. I am interested in transforms, which can capture a vector with as few coefficients as possible. The Discrete Cosine Transform (DCT) achieves better energy concentration than DFT and also solves the frequency-leak problem that plagues DFT (Agrawal, Faloutsos, & Swami, 1993). The Discrete Wavelet Transform (DWT) is also related to DFT but achieves better lossy data compression. The Haar transform is a simple wavelet transform that operates on a time sequence and computes the sum and difference of its halves, recursively. DWT can be applied to signals with multiple dimensions, one dimension at a time (Press, Teukolsky, Vetterling, & Flannery, 1996). To illustrate how a single dimensional wavelet transform works, consider an image with four pixels having the following values: [9,7,3,5] (Stollnitz, Derose, & Salesin, 1996). We obtain a lower resolution image by substituting pairs of pixel values with their average: [8,4]. Information is lost due to down sampling. The original pixels can be recovered by storing detail coefficients, given as 1=9–8 and –1=3–4, that is, [1,–1]. Another averaging and detailing step yields [6] and [2]. The wavelet transform of the original image is then [6,2,1,–1]. In fact, for normalization purposes, the last two coefficients have to be divided by the square root of 2. Wavelet compression is attained by not retaining all the coefficients. As far as data compression in data warehousing is concerned, a k-d DWT can be applied to a k-d data cube to obtain a compressed approximation by saving a fraction of the strongest coefficients. An approximate computation of multidimensional aggregates for sparse data using wavelets is reported in Vitter and Wang (1999).
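The [9, 7, 3, 5] example can be reproduced with the short (unnormalized) Haar transform below; the division by the square root of 2 mentioned above is omitted, and the input length is assumed to be a power of 2.

```python
def haar(signal):
    """One-dimensional, unnormalized Haar transform: repeatedly replace the
    current values with pairwise averages and keep the detail coefficients."""
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [current[i] - averages[i // 2] for i in range(0, len(current), 2)]
        coeffs = details + coeffs        # coarser-level details end up before finer ones
        current = averages
    return current + coeffs              # overall average followed by the details

print(haar([9, 7, 3, 5]))   # [6.0, 2.0, 1.0, -1.0], matching the example above
```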
REGRESSION, MULTIREGRESSION, AND LOG-LINEAR MODELS Linear regression in its simplest form, given two vectors x and y, uses the least squares formula to obtain the approximation Y = a + bX. In case y is a polynomial in x, for example, Y = a + bX + cX² + dX³, the transformation Xi = xⁱ (the i-th power of x) can be used to obtain a linear model so that it can be solved by using the least squares method. The case Y = bXⁿ can be handled by first taking logarithms of both sides. In effect, a set of points can be represented by a compact formula. Most cases have multiple columns of data so that multiregression can be used. Log-linear modeling arises in the context of discrete multivariate analysis and seeks approximation to discrete multidimensional probability distributions. For example, a probability distribution over several variables is factored into a product of the joint probabilities of the subsets; this is where data compression is achieved.
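A minimal illustration of the polynomial case: after the transformation to powers of x, the coefficients are obtained with ordinary least squares, so the set of points is summarized by four numbers. The data values below are synthetic and chosen only for the example.

```python
import numpy as np

# Fit Y = a + bX + cX^2 + dX^3 by least squares after forming the powers of x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 1.5 * x - 0.5 * x**2 + 0.1 * x**3        # synthetic, noise-free data

A = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coeffs, 6))    # approximately [2.0, 1.5, -0.5, 0.1]

# The six (x, y) points are now represented by four coefficients.
```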
HISTOGRAMS Data compression in the form of histograms is a well-known database technique. A histogram on a column or attribute can assist the relational query optimizer in estimating the selectivity of that attribute. For example, if the column has a nonclustered index, the histogram can determine that using the index would be more costly because the selectivity of the attribute is low (Ramakrishnan & Gehrke, 2003). In this context, histograms are categorized into equidepth, equiwidth, and compressed categories, where in the last case separate buckets are assigned to the most frequent attribute values.
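The sketch below contrasts equiwidth and equidepth bucket boundaries on a small, skewed set of values; the data, the number of buckets, and the edges are illustrative. A compressed histogram would additionally place the most frequent values in singleton buckets.

```python
import numpy as np

values = np.array([1, 1, 2, 2, 2, 3, 5, 8, 8, 9, 15, 40])

# Equiwidth: buckets of equal value range.
width_edges = np.linspace(values.min(), values.max(), num=5)    # 4 buckets
width_counts, _ = np.histogram(values, bins=width_edges)

# Equidepth: buckets holding roughly equal numbers of values (quantile boundaries).
depth_edges = np.quantile(values, [0.0, 0.25, 0.5, 0.75, 1.0])
depth_counts, _ = np.histogram(values, bins=depth_edges)

print(width_edges, width_counts)   # most values crowd into the first bucket
print(depth_edges, depth_counts)   # counts are roughly balanced across buckets
```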
CLUSTERING Clustering partitions records into multiple groups or clusters, such that (a) records in one cluster are similar to each other and (b) records in different clusters are dissimilar to each other (Ramakrishnan & Gehrke, 2003). For example, using the age and salaries of the employees in a large company, a clustering algorithm may determine that there are three major groups: young employees with low salaries, middle-aged employees with higher salaries, and older employees with the highest salaries. More generally, a cluster may be represented by the coordinates of its centroid and possibly its radius, which is the average distance from the centroid. Data reduction is attained in clustering by representing points based on their membership in a cluster so that effectively all points in the cluster are assigned the coordinates of its centroid.
A large number of clustering methods have been developed over the years. They can be roughly classified as statistical, database, and machine-learning methods (Barbara et al., 1997), although the last category is rather limited. Statistical methods are classified as hierarchical and partitioning methods. Hierarchical methods have been classified into agglomerative or bottom-up methods and divisive or top-down methods. The number of clusters to be generated should be specified for partitioning methods. The k-means and k-medoids are two such methods, where the medoid is an actual centrally located point belonging to the cluster, and a centroid is not. The k-means method applied to n points first randomly selects k points as the initial centroids of the clusters. Then these steps are followed: (a) assign the remaining n-k points to one of k clusters based on the proximity of the point to the centroids, according to Euclidean distance, (b) compute the new centroids of the k clusters, whose coordinates are the means of respective dimensions, (c) reassign all points to clusters based on their proximity, and (d) quit if there is no change from one iteration to the next; otherwise, go back to step (b). A comprehensive review of basic clustering methods from the viewpoint of database and data-mining applications is given in Dunham (2003). There is emphasis on clustering methods for very large datasets, which cannot be held in main memory. It is important to keep the number of passes over the disk-resident data for clustering to a minimum.
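Steps (a) through (d) translate directly into code; the following sketch is a plain k-means implementation applied to the age/salary example, with the number of clusters, the iteration limit, the seed, and the sample data chosen arbitrarily for illustration.

```python
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means following steps (a) through (d); points is an (n, d) array."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # (a)/(c): assign every point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (b): recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):   # (d): stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

# Age and annual salary; after clustering, each record is represented by its centroid.
data = np.array([[25, 40_000], [27, 42_000], [45, 80_000],
                 [47, 82_000], [60, 120_000], [62, 118_000]], dtype=float)
labels, centroids = kmeans(data, k=3)
print(labels, centroids, sep="\n")
```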
INDEX TREES Index trees are used in facilitating access to large datasets, for example, relational tables, which is usually done based on a single attribute, such as age, or two attributes, age and salary, by two separate one-dimensional index structures. B+ trees are the most popular indexing structures for a single dimension, with each node splitting the range of values assigned to it n + 1 ways, where n is the number of keys in the node (so the fanout is n + 1). The nodes of the B+ tree correspond to 4 or 8 KB database pages and have an occupancy of at least 50%. B+ trees are self-organizing and remain balanced as a result of insertions and deletions. The larger the fanout, the shallower the tree. Typical B+ trees have three to four levels; the top two levels fit in main memory. Determining the employees who are 31 to 35 years old and earn $90,000 to $120,000 annually is then tantamount to a set intersection of the records qualified by the two B+ trees. R-trees, developed in the mid-1980s, deal with multiple dimensions simultaneously. Over the years,
numerous multidimensional indexing methods (MDIMs) have been proposed (Gaede & Gunther, 1998), although their usage in operational database systems remains rather limited. MDIMs have been classified into data-partitioning and space-partitioning methods. In the former case the partitioning is carried out based on insertions and deletions of multidimensional points in the index. Examples are R-trees and their variations. Space-partitioning methods, such as quad-trees and k-d-b trees, recursively partition the space globally when local overflows occur. As a result, space-partitioning methods may not be balanced. I will discuss only data-partitioning methods in the remainder of this article. MDIMs tend to be viable for a limited number of dimensions, and as the number of dimensions increases, the dimensionality curse sets in (Faloutsos, 1996). In effect, the index loses its effectiveness, because a rather large fraction of the pages constituting the index are touched as part of query processing. Indexes need some extensions to provide summary data for data reduction. They can be considered as hierarchical histograms, but there is no easy way to extract this information from an index.
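Returning to the age/salary example above, the sketch below (illustrative only) mimics how two separate one-dimensional indexes can each qualify a set of record IDs whose intersection answers a two-attribute range query; a sorted Python list stands in for a B+ tree, whose keys would really live in disk pages.

```python
from bisect import bisect_left, bisect_right

def build_index(records, attr):
    """A sorted (key, record_id) list standing in for a one-dimensional B+ tree on one attribute."""
    return sorted((rec[attr], rid) for rid, rec in enumerate(records))

def range_lookup(index, low, high):
    """Record IDs whose key falls in [low, high]."""
    lo = bisect_left(index, (low, -1))
    hi = bisect_right(index, (high, float("inf")))
    return {rid for _, rid in index[lo:hi]}

employees = [{"age": 32, "salary": 95000}, {"age": 34, "salary": 150000},
             {"age": 33, "salary": 110000}, {"age": 45, "salary": 100000}]
age_idx = build_index(employees, "age")
sal_idx = build_index(employees, "salary")

# Employees aged 31-35 earning $90,000-$120,000: intersect the two qualified ID sets.
print(range_lookup(age_idx, 31, 35) & range_lookup(sal_idx, 90000, 120000))   # {0, 2}
```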
SAMPLING Sampling achieves data compression by selecting an appropriate subset of a large dataset to which a (relational) query is applied. Sampling can be categorized as follows (Han & Kamber, 2001): (a) simple random sample without replacement, (b) simple random sample with replacement, (c) cluster sample, and (d) stratified sample (examples follow). Consider a large university, which maintains student records in a relational table. Two columns of this table are of interest: year of study (i.e., freshman, sophomore, junior, senior) and the grade point average (GPA). The average GPA by year of study can be specified succinctly as an SQL GROUP BY query. Assuming that neither column is indexed, instead of scanning the table to obtain the average GPAs, the system may randomly select records from the table to carry out this task. Online query processing is possible in this context; that is, the system starts displaying to the user the average GPA per year, along with an error bound, based on the samples obtained so far. The running of the query can be stopped as soon as the user is satisfied (Hellerstein, Haas, & Wang, 1997). When the data are not ordered by one of the attributes under consideration (for example, the table is ordered alphabetically by student last name), all the records in a retrieved page can be used in sampling to reduce the number of disk accesses. This is an example
of a cluster sample. Stratified sampling could be attained if the records were indexed according to the year.
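A small sketch of these sampling flavors using Python's standard library (the student table is fabricated); the stratified case estimates the average GPA per year from a fixed-size sample of each stratum, in the spirit of online aggregation.

```python
import random
from statistics import mean

students = [{"year": random.choice(["FR", "SO", "JR", "SR"]),
             "gpa": round(random.uniform(2.0, 4.0), 2)} for _ in range(10000)]

srs_without = random.sample(students, 500)    # (a) simple random sample without replacement
srs_with    = random.choices(students, k=500) # (b) simple random sample with replacement
# (c) a cluster sample would instead use every record on each randomly chosen disk page.

def stratified_avg_gpa(table, per_stratum=100):
    """(d) stratified sample: draw a fixed number of records from each year of study."""
    strata = {}
    for row in table:
        strata.setdefault(row["year"], []).append(row)
    return {year: mean(r["gpa"] for r in random.sample(rows, min(per_stratum, len(rows))))
            for year, rows in strata.items()}

print(stratified_avg_gpa(students))   # approximate average GPA by year, computed from the sample
```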
FUTURE TRENDS With rapid increases in the volume of data being held for data mining and data warehousing, lossy yet accurate data compression methods are required. This is also important from the viewpoint of collecting data from remote sources with low bandwidth transmission capability. The new field of data streams (Golab & Ozsu, 2003) uses summarization methods, such as transmitting averages rather than detailed values.
CONCLUSION I have provided a summary of data reduction methods applicable to data warehousing and databases in general. I have also discussed data compression. Appropriate references are given for further study.
ACKNOWLEDGMENT Supported by NSF through Grant 0105485 in Computer Systems Architecture.
REFERENCES Agrawal, A., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. Proceedings of the Foundations of Data Organization and Algorithms Conference (pp. 69-84), USA. Barbara, D., et al. (1997). The New Jersey data reduction report. Data Engineering Bulletin, 20(4), 3-42. Castelli, V., Thomasian, A., & Li, C. S. (2003). CSVD: Clustering and singular value decomposition for approximate similarity search in high dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 15(3), 671-685. Chakrabarti, K., & Mehrotra, S. (2000). Local dimensionality reduction: A new approach to indexing highdimensional spaces. Proceedings of the 26th International Conference on Very Large Data Bases (pp. 89-100), Egypt. Dunham, M. H. (2003). Data mining: Introductory and advanced topics. Prentice-Hall.
Faloutsos, C. (1996). Searching multimedia databases by content. Kluwer Academic Publishers. Faloutsos, C., & Lin, K. I. (1995). Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the ACM SIGMOD International Conference (pp. 163-174), USA. Gaede, V., & Guenther, O. (1998). Multidimensional indexing methods. ACM Computing Surveys, 30(2), 170-231. Golab, L., & Ozsu, M. T. (2003). Data stream management issues — A survey. ACM SIGMOD Record, 32(2), 5-14. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann. Hellerstein, J. M., Haas, P. J., & Wang, H. J. (1997). Online aggregation. Proceedings of the ACM SIGMOD International Conference (pp. 171-182), USA. Korn, F., Jagadish, H., & Faloutsos, C. (1997). Efficiently supporting ad hoc queries in large datasets of time sequences. Proceedings of the ACM SIGMOD International Conference (pp. 289-300), USA. Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage Publications. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1996). Numerical recipes in C: The art of scientific computing. Cambridge University Press. Ramakrishnan, K., & Gehrke, J. (2003). Database management systems (3rd ed.). McGraw-Hill. Sayood, K. (2002). Introduction to data compression (2nd ed.). Elsevier. Stollnitz, E. J., Derose, T. D., & Salesin, D. H. (1996). Wavelets for computer graphics: Theory and applications. Prentice-Hall. Vetterli, M., & Kovacevic, J. (1995). Wavelets and subband coding. Prentice-Hall. Vitter, J. S., & Wang, M. (1999). An approximate computation of multidimensional aggregates of sparse data using wavelets. Proceedings of the ACM SIGMOD International Conference (pp. 193-204), USA.
Witten, I. H., Bell, T., & Moffat, A. (1999). Managing gigabytes: Compressing and indexing documents and images (2nd ed.). Morgan Kaufmann.
KEY TERMS Clustering: The process of grouping objects based on their similarity and dissimilarity. Similar objects should be in the same cluster, which is different from the cluster for dissimilar objects. Histogram: A data structure that maintains statistics on one or more attributes or columns of a relational DBMS to assist the query optimizer. Index Tree: Partitions the space in a single or multiple dimensions for efficient access to the subset of the data that is of interest. Karhunen-Loeve Transform (KLT): Utilizes principal component analysis or singular value decomposition to minimize the distance error introduced for that level of dimensionality reduction. Principal Component Analysis (PCA): Computes the eigenvectors for principal components and uses them to transform a matrix X into a matrix Y, whose columns are aligned with the principal components. Dimensionality is reduced by discarding columns in Y with the least variance or energy. Sampling: A technique for selecting units from a population so that by studying the sample, you may fairly generalize your results back to the population. Singular Value Decomposition (SVD): Attains the same goal as PCA by decomposing the X matrix into a U matrix, a diagonal matrix of singular values, and a matrix of eigenvectors that are the same as those obtained by PCA. Wavelet Transform: A method to transform data so that it can be represented compactly.
Data Warehouse Back-End Tools Alkis Simitsis National Technical University of Athens, Greece Dimitri Theodoratos New Jersey Institute of Technology, USA
INTRODUCTION
Figure 1. Abstract architecture of a data warehouse
The back-end tools of a data warehouse are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. They are known under the general term extraction, transformation and loading (ETL) tools. In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very complex task.
BACKGROUND A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker (executive, manager, analyst, etc.) to make better and faster decisions. Data warehouses typically are divided into the front-end part concerning end users who access the data warehouse with decision-support tools, and the back-stage part, where the collection, integration, cleaning and transformation of data takes place in order to populate the warehouse. The architecture of a data warehouse exhibits various layers of data in which data from one layer are derived from data of the previous layer (Figure 1). The processes that take part in the back stage of the data warehouse are data intensive, complex, and costly (Vassiliadis, 2000). Several reports mention that most of these processes are constructed through an in-house development procedure that can consume up to 70% of the resources for a data warehouse project (Gartner, 2003). In order to facilitate and manage the data warehouse operational processes, commercial tools exist in the market under the general title Extraction-TransformationLoading (ETL) tools. To give a general idea of the functionality of these tools, we mention their most prominent tasks, which include (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the
information coming from multiple sources into a common format; (d) the cleaning of the resulting data set on the basis of database and business rules; and (e) the propagation of the data to the data warehouse and/or data marts. In the sequel, we will adopt the general acronym ETL for all kinds of in-house or commercial tools and all the aforementioned categories of tasks. In Figure 2, we abstractly describe the general framework for ETL processes. In the left side, we can observe the original data providers (sources). Typically, data providers are relational databases and files. The data from these sources are extracted by extraction routines, which provide either complete snapshots or differentials of the data sources. Then, these data are propagated to the Data Staging Area (DSA), where they are transformed and cleaned before being loaded to the data warehouse. Intermediate results, again in the form of (mostly) files or relational tables, are part of the data-staging area. The data warehouse is depicted in the right part of Figure 2 and comprises the target data stores (i.e., fact tables for the storage of information and dimension tables with the description and the multidimensional, roll-up hierarchies of the stored facts). The loading of the central warehouse is performed from the loading activities depicted right before the data warehouse data store.
Figure 2. The environment of extract-transformation-load processes: data are extracted from the Sources into the DSA, where they are transformed and cleaned, and then loaded into the DW.
State of the Art In the past, there have been research efforts toward the design and optimization of ETL tasks. We mention three research prototypes: (a) the AJAX system (Galhardas et al., 2000); (b) the Potter’s Wheel system (Raman & Hellerstein, 2001); and (c) ARKTOS II (Arktos II, 2004). The first two prototypes are based on algebras, which we find mostly tailored for the case of homogenizing Web data; the latter concerns the modeling and the optimization of ETL processes in a customizable and extensible manner. An extensive review of data quality problems and related literature, along with quality management methodologies, can be found in Jarke, et al. (2000). Rundensteiner (1999) offers a discussion of various aspects of data transformations. Sarawagi (2000) offers a similar collection of papers in the field of data cleaning, including a survey (Rahm & Do, 2000) that provides an extensive overview of the field, along with research issues and a review of some commercial tools and solutions on specific problems (Monge, 2000; Borkar et al., 2000). In a related but different context, we would like to mention the IBIS tool (Calì et al., 2003). IBIS is an integration tool following the global-as-view approach to answer queries in a mediated system. Moreover, there is a variety of ETL tools in the market. Simitsis (2003) lists the ETL tools available at the time that this paper was written.
MAIN THRUST In this section, we briefly review the problems and constraints that concern the overall ETL process, as well as the individual issues that arise separately in each phase of an ETL process (extraction and exportation, transformation and cleaning, and loading). Simitsis (2004) offers a detailed study on the problems described in this paper and presents a framework toward the modeling and the optimization of ETL processes.
Scalzo (2003) mentions that 90% of the problems in data warehouses arise from the nightly batch cycles that load the data. At this stage, the administrators have to deal with problems like (a) efficient data loading and (b) concurrent job mixture and dependencies. Moreover, ETL processes have global time constraints, including the time they must be initiated and their completion deadlines. In fact, in most cases, there is a tight time window in the night that can be exploited for the refreshment of the data warehouse, since the source system is off-line or not heavily used during this period. Other general problems include the scheduling of the overall process, the finding of the right execution order for dependent jobs and job sets on the existing hardware for the permitted time schedule, and the maintenance of the information in the data warehouse.
Phase I: Extraction and Transportation During the ETL process, a first task that must be performed is the extraction of the relevant information that has to be propagated further to the warehouse (Theodoratos et al., 2001). In order to minimize the overall processing time, this involves only a fraction of the source data that has changed since the previous execution of the ETL process, mainly concerning the newly inserted and possibly updated records. Usually, change detection is performed physically by the comparison of two snapshots (one corresponding to the previous extraction and the other to the current one). Efficient algorithms exist for this task, like the snapshot differential algorithms presented in Labio and Garcia-Molina (1996). Another technique is log sniffing (i.e., the scanning of the log file in order to reconstruct the changes performed since the last scan). In rare cases, change detection can be facilitated by the use of triggers. However, this solution is technically impossible for many of the sources that are legacy systems or plain flat files. In numerous other cases, where relational systems are used at the source side, the usage of triggers also is prohibitive due to the performance degradation that their usage incurs and to the need to intervene in the structure of the database. Moreover, another crucial issue concerns the transportation of data after the extraction, where tasks like ftp, encryption-decryption, compression-decompression, and so forth can possibly take place.
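As a rough illustration of snapshot-based change detection (the basic idea only, not the optimized algorithms of Labio and Garcia-Molina), the sketch below compares the previous and current snapshots of a source table by key and separates inserts, updates, and deletes; the table contents are invented.

```python
def snapshot_diff(old_rows, new_rows, key="id"):
    """Compare two snapshots of a source table and report inserted, updated, and deleted rows."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    inserted = [r for k, r in new.items() if k not in old]
    deleted  = [r for k, r in old.items() if k not in new]
    updated  = [r for k, r in new.items() if k in old and r != old[k]]
    return inserted, updated, deleted

previous = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
current  = [{"id": 1, "price": 12}, {"id": 3, "price": 30}]
print(snapshot_diff(previous, current))
# ([{'id': 3, 'price': 30}], [{'id': 1, 'price': 12}], [{'id': 2, 'price': 20}])
```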
Phase II: Transformation and Cleaning It is possible to determine typical tasks that take place during the transformation and cleaning phase of an ETL process. Rahm and Do (2000) further detail this phase in the following tasks: (a) data analysis; (b) definition of
transformation workflow and mapping rules; (c) verification; (d) transformation; and (e) backflow of cleaned data. In terms of the transformation tasks, we distinguish two main classes of problems (Lenzerini, 2002): (a) conflicts and problems at the schema level (e.g., naming and structural conflicts) and (b) data-level transformations (i.e., at the instance level). The integration and transformation programs perform a wide variety of functions, such as reformatting data, recalculating data, modifying key structures of data, adding an element of time to data warehouse data, identifying default values of data, supplying logic to choose between multiple sources of data, summarizing data, merging data from multiple sources, and so forth. In the sequel, we present four common ETL transformation cases as examples: (a) semantic normalization and denormalization; (b) surrogate key assignment; (c) slowly changing dimensions; and (d) string problems. The research prototypes presented in the previous section and several commercial tools have already made some progress in tackling problems like these four. Still, their presentation here aims to help the reader see that the overall ETL process should be distinguished from the way integration issues have been resolved until now.
Semantic Normalization and Denormalization
It is common for source data to be long denormalized records, possibly involving more than 100 attributes. This is due to the fact that bad database design or classical COBOL tactics led to the gathering of all the necessary information for an application in a single table/file. Frequently, data warehouses also are highly denormalized in order to answer certain queries more quickly. But sometimes, it is imperative to normalize somehow the input data. Consider, for example, a table of the form R(KEY,TAX,DISCOUNT,PRICE), which we would like to transform to a table of the form R’(KEY,CODE,AMOUNT). For example, the input tuple t[key,30,60,70] is transformed into the tuples t1[key,1,30]; t2[key,2,60]; t3[key,3,70]. The transformation of the information organized in rows to information organized in columns is called rotation or denormalization, since frequently, the derived values (e.g., the total income, in our case) also are stored as columns, functionally dependent on other attributes. Occasionally, it is possible to apply the reverse transformation in order to normalize denormalized data before being loaded to the data warehouse.
Surrogate Keys
In a data warehouse project, we usually replace the keys of the production systems with a uniform key, which we call a surrogate key (Kimball et al., 1998). The basic reasons for this replacement are performance and semantic homogeneity. Performance is affected by the fact that textual attributes are not the best candidates for indexed keys and need to be replaced by integer keys. More importantly, semantic homogeneity causes reconciliation problems, since different production systems might use different keys for the same object (synonyms) or the same key for different objects (homonyms), resulting in the need for a global replacement of these values in the data warehouse. Observe row (20,green) in table Src_1 of Figure 3. This row has a synonym conflict with row (10,green) in table Src_2, since they represent the same real-world entity with different IDs, and a homonym conflict with row (20,yellow) in table Src_2 (over attribute ID). The production key ID is replaced by a surrogate key through a lookup table of the form Lookup(SourceID,Source,SurrogateKey). The Source column of this table is required, because there can be synonyms in the different sources, which are mapped to different objects in the data warehouse (e.g., value 10 in tables Src_1 and Src_2). At the end of this process, the data warehouse table DW has globally unique, reconciled keys.
Figure 3. Surrogate key assignment. Src_1: (10, red), (20, green); Src_2: (10, green), (20, yellow); Lookup (SourceID, Source, SurrogateKey): (10, Src_1, 100), (20, Src_1, 200), (10, Src_2, 200), (20, Src_2, 300); DW: (100, red), (200, green), (300, yellow).
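A minimal sketch of surrogate key assignment that reproduces the lookup table of Figure 3; the reconciliation criterion used here (matching rows on their COLOR value) is a simplification chosen only to mirror the figure, not a general matching strategy.

```python
entity_key = {}   # reconciled real-world entity (here, simply the COLOR value) -> surrogate key
lookup = {}       # (SourceID, Source) -> SurrogateKey, as in Lookup(SourceID, Source, SurrogateKey)
next_key = 100

def assign(source, rows):
    """Assign surrogate keys; rows describing the same real-world entity get the same key."""
    global next_key
    out = []
    for prod_id, color in rows:
        if color not in entity_key:
            entity_key[color] = next_key
            next_key += 100
        lookup[(prod_id, source)] = entity_key[color]
        out.append((entity_key[color], color))
    return out

dw = assign("Src_1", [(10, "red"), (20, "green")])
dw += [row for row in assign("Src_2", [(10, "green"), (20, "yellow")]) if row not in dw]
print(dw)      # [(100, 'red'), (200, 'green'), (300, 'yellow')] -- globally unique, reconciled keys
print(lookup)  # {(10, 'Src_1'): 100, (20, 'Src_1'): 200, (10, 'Src_2'): 200, (20, 'Src_2'): 300}
```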
Slowly Changing Dimensions
Factual data are not the only data that change. The dimension values change at the sources, too, and there are several policies to apply for their propagation to the data warehouse. Kimball, et al. (1998) present three policies for dealing with this problem: overwrite (Type 1); create a new dimensional record (Type 2); push down the changed value into an old attribute (Type 3). For Type 1 processing, we only need to issue appropriate update commands in order to overwrite attribute values in existing dimension table records. This policy can be used whenever we do not want to track history in dimension changes and if we want to correct erroneous values. In this case, we use an old version of the dimension data Dold as they were received from the source and their current version Dnew. We discriminate the new and updated rows through the respective operators. The new rows are assigned a new surrogate key through a function application. The updated rows are assigned a surrogate key, which is the same as the one that their previous version had already been assigned. Then, we can join the updated rows with their old versions from the target table, which subsequently will be deleted, and project only the attributes with the new values. In the Type 2 policy, we copy the previous version of the dimension record and create a new one with a new surrogate key. If there is no previous version of the dimension record, we create a new one from scratch; otherwise, we keep them both. This policy can be used whenever we want to track the history of dimension changes. Finally, Type 3 processing is also very simple, since, again, we only have to issue update commands to existing dimension records. For each attribute A of the dimension table, which is checked for updates, we need to have an extra attribute called old_A. Each time we spot a new value for A, we write the current A value to the old_A field and then write the new value to attribute A. In this way, we can have both new and old values present at the same dimension record.
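The Type 1 and Type 2 policies can be sketched as follows on an in-memory dimension table (a stand-in for the real table and key generator, not the operator-based implementation described above); Type 3 would simply add an old_A column next to each tracked attribute A.

```python
dimension = {}      # surrogate_key -> dimension record
current_of = {}     # production key -> surrogate key of the current record
next_sk = 1

def new_surrogate():
    global next_sk
    sk, next_sk = next_sk, next_sk + 1
    return sk

def apply_change(prod_id, attrs, policy):
    """Propagate a changed source dimension row using the Type 1 or Type 2 policy."""
    sk = current_of.get(prod_id)
    if sk is None or policy == 2:
        if policy == 2 and sk is not None:
            dimension[sk]["current"] = False          # keep the old version to preserve history
        new_sk = new_surrogate()
        dimension[new_sk] = {"prod_id": prod_id, "current": True, **attrs}
        current_of[prod_id] = new_sk
    elif policy == 1:
        dimension[sk].update(attrs)                   # overwrite in place, no history kept

apply_change("P10", {"name": "green"}, policy=2)
apply_change("P10", {"name": "lime"},  policy=2)      # two rows now exist, only one marked current
apply_change("P10", {"name": "lemon"}, policy=1)      # current row overwritten
print(dimension)
```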
String Problems
A major challenge in ETL processes is the cleaning and the homogenization of string data (e.g., data that stands for addresses, acronyms, names, etc.). Usually, the approaches for the solution of this problem include the application of regular expressions for the normalization of string data to a set of reference values.
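For instance, a hedged sketch of regex-based normalization of street-type abbreviations to reference values (the patterns and the reference set are made up for illustration):

```python
import re

REFERENCE = {"street": "Street", "st": "Street", "str": "Street",
             "avenue": "Avenue", "ave": "Avenue", "av": "Avenue"}

def normalize_address(raw):
    """Collapse whitespace and map known street-type abbreviations to reference values."""
    cleaned = re.sub(r"\s+", " ", raw.strip())
    def replace(match):
        token = match.group(0)
        return REFERENCE.get(token.lower().rstrip("."), token)
    return re.sub(r"\b[A-Za-z]+\.?(?=\s|$)", replace, cleaned)

print(normalize_address("12  Main   St."))   # "12 Main Street"
print(normalize_address("5 Fifth ave"))      # "5 Fifth Avenue"
```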
Phase III: Loading
The final loading of the data warehouse has its own technical challenges. A major problem is the ability to discriminate between new and existing data at loading time. This problem arises when a set of records has to be classified into (a) the new rows that need to be appended to the warehouse and (b) rows that already exist in the data warehouse, but whose value has changed and must be updated (e.g., with an UPDATE command). Modern ETL tools already provide mechanisms for this problem, mostly through language predicates. Also, simple SQL commands are not sufficient, since the open-loop-fetch technique, where records are inserted one by one, is extremely slow for the vast volume of data to be loaded in the warehouse. An extra problem is the simultaneous usage of the rollback segments and log files during the loading process. The option to turn them off contains some risk in the case of a loading failure. So far, the best technique seems to be the usage of the batch loading tools offered by most RDBMSs that avoid these problems. Other techniques that facilitate the loading task involve the creation of tables at the same time as the creation of the respective indexes, the minimization of inter-process wait states, and the maximization of concurrent CPU usage.
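The new-versus-changed classification can be sketched as below (illustrative only; a real loader would push this logic into the RDBMS batch loading tool or a MERGE-style statement rather than row-by-row commands).

```python
def split_for_loading(incoming, warehouse, key="id"):
    """Classify incoming rows into appends (new keys) and updates (existing keys whose values changed)."""
    existing = {row[key]: row for row in warehouse}
    to_append, to_update = [], []
    for row in incoming:
        if row[key] not in existing:
            to_append.append(row)
        elif row != existing[row[key]]:
            to_update.append(row)
    return to_append, to_update

warehouse = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
incoming  = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
print(split_for_loading(incoming, warehouse))
# ([{'id': 3, 'amount': 30}], [{'id': 2, 'amount': 25}])
```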
FUTURE TRENDS
In terms of financial growth, in a recent study (Giga Information Group, 2002), it is reported that the ETL market reached a size of $667 million for year 2001; still, the growth rate reached a rather low 11% (as compared to a rate of 60% growth for year 2000). More recent studies (Giga Information Group, 2002; Gartner, 2003) regard the ETL issue as a research challenge and pinpoint several topics for future work:
• Integration of ETL with XML adapters; EAI (Enterprise Application Integration) tools (e.g., MQ-Series); customized data quality tools; and the move toward parallel processing of the ETL workflows.
• Active ETL (Adzic & Fiore, 2003), meaning the need to refresh the warehouse with data that are as fresh as possible (ideally, online).
• Extension of the ETL mechanisms for non-traditional data, like XML/HTML, spatial, and biomedical data.
CONCLUSION ETL tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and loading), individual issues arise and, along with the problems and constraints that concern the overall ETL process, make its lifecycle a very troublesome task. The key factors underlying the main problems of ETL workflows are (a) vastness of the data volumes; (b) quality problems, since data are not always clean and have to be cleansed; (c) performance, since the whole process has to take place within a specific time window; and (d) evolution of the sources and the data warehouse, which can eventually lead even to daily maintenance operations. Although state of the art in the field of both research and commercial ETL tools includes some signs of progress, much work remains to be done before we can claim that this problem is resolved. In our opinion, there are several issues that are technologically
open and that present interesting topics of research for the future in the field of data integration in data warehouse environments.
REFERENCES
Adzic, J., & Fiore, V. (2003). Data warehouse population platform. Proceedings of 5th International Workshop on the Design and Management of Data Warehouses (DMDW), Berlin, Germany.
Arktos II. (2004). A framework for modeling and managing ETL processes. Retrieved from http://www.dblab.ece.ntua.gr/~asimi
Borkar, V., Deshmuk, K., & Sarawagi, S. (2000). Automatically extracting structure from free text addresses. Bulletin of the Technical Committee on Data Engineering, 23(4).
Calì, A. et al. (2003). IBIS: Semantic data integration at work. Proceedings of the 15th CAiSE.
Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). Ajax: An extensible data cleaning tool. Proceedings ACM SIGMOD International Conference on the Management of Data, Dallas, Texas.
Gartner. (2003). ETL magic quadrant update: Market pressure increases. Retrieved from http://www.gartner.com/reprints/informatica/112769.html
Giga Information Group. (2002). Market overview update: ETL. Technical Report RPA-032002-00021.
Inmon, W.-H. (1996). Building the data warehouse. New York: John Wiley & Sons, Inc.
Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (Eds.). (2000). Fundamentals of data warehouses. Springer-Verlag.
Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998). The data warehouse lifecycle toolkit: Expert methods for designing, developing, and deploying data warehouses. New York: John Wiley & Sons.
Labio, W., & Garcia-Molina, H. (1996). Efficient snapshot differential algorithms for data warehousing. Proceedings of 22nd International Conference on Very Large Data Bases (VLDB), Bombay, India.
Lenzerini, M. (2002). Data integration: A theoretical perspective. Proceedings of 21st Symposium on Principles of Database Systems (PODS), Wisconsin.
Monge, A. (2000). Matching algorithms within a duplicate detection system. Bulletin of the Technical Committee on Data Engineering, 23(4).
Rahm, E., & Do, H.H. (2000). Data cleaning: Problems and current approaches. Bulletin of the Technical Committee on Data Engineering, 23(4).
Raman, V., & Hellerstein, J. (2001). Potter’s wheel: An interactive data cleaning system. Proceedings of 27th International Conference on Very Large Data Bases (VLDB), Rome, Italy.
Rundensteiner, E. (Ed.). (1999). Special issue on data transformations. Bulletin of the Technical Committee on Data Engineering, 22(1).
Sarawagi, S. (2000). Special issue on data cleaning. Bulletin of the Technical Committee on Data Engineering, 23(4).
Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Upper Saddle River, NJ: Prentice Hall.
Simitsis, A. (2003). List of ETL tools. Retrieved from http://www.dbnet.ece.ntua.gr/~asimi/ETLTools.htm
Simitsis, A. (2004). Modeling and managing extraction-transformation-loading (ETL) processes in data warehouse environments [doctoral thesis]. National Technical University of Athens, Greece.
Theodoratos, D., Ligoudistianos, S., & Sellis, T. (2001). View selection for designing the global data warehouse. Data & Knowledge Engineering, 39(3), 219-240.
Vassiliadis, P. (2000). Gulliver in the land of data warehousing: Practical experiences and observations of a researcher. Proceedings of 2nd International Workshop on Design and Management of Data Warehouses (DMDW), Sweden.
KEY TERMS Data Mart: A logical subset of the complete data warehouse. We often view the data mart as the restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group. Data Staging Area (DSA): An auxiliary area of volatile data employed for the purpose of data transformation, reconciliation, and cleaning before the final loading of the data warehouse. Data Warehouse: A subject-oriented, integrated, timevariant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data (Inmon, 1996).
ETL: Extract, transform, and load (ETL) are data warehousing functions that involve extracting data from outside sources, transforming them to fit business needs, and ultimately loading them into the data warehouse. ETL is an important part of data warehousing, as it is the way data actually gets loaded into the warehouse. Online Analytical Processing (OLAP): The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors.
Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment. Target System: The physical machine on which the data warehouse is organized and stored for direct querying by end users, report writers, and other applications. A target system is often called a presentation server.
Data Warehouse Performance Beixin (Betsy) Lin Montclair State University, USA Yu Hong BearingPoint Inc., USA Zu-Hsu Lee Montclair State University, USA
INTRODUCTION
A data warehouse is a large electronic repository of information that is generated and updated in a structured manner by an enterprise over time to aid business intelligence and to support decision making. Data stored in a data warehouse is non-volatile and time variant and is organized by subjects in a manner to support decision making (Inmon, Rudin, Buss, & Sousa, 1998). Data warehousing has been increasingly adopted by enterprises as the backbone technology for business intelligence reporting, and query performance has become the key to the successful implementation of data warehouses. According to a survey of 358 businesses on reporting and end-user query tools, conducted by Appfluent Technology, data warehouse performance significantly affects the Return on Investment (ROI) on Business Intelligence (BI) systems and directly impacts the bottom line of the systems (Appfluent Technology, 2002). Even though in some circumstances it is very difficult to measure the benefits of BI projects in terms of ROI or dollar figures, management teams are still eager to have a “single version of the truth,” better information for strategic and tactical decision making, and more efficient business processes by using BI solutions (Eckerson, 2003). Dramatic increases in data volumes over time and the mixed quality of data can adversely affect the performance of a data warehouse. Some data may become outdated over time and can be mixed with data that are still valid for decision making. In addition, data are often collected to meet potential requirements, but may never be used. Data warehouses also contain external data (e.g. demographic, psychographic, etc.) to support a variety of predictive data mining activities. All these factors contribute to the massive growth of data volume. As a result, even a simple query may become burdensome to process and cause overflowing system indices (Inmon, Rudin, Buss & Sousa, 2001). Thus, exploring the techniques of performance tuning becomes an important subject in data warehouse management.
BACKGROUND
There are inherent differences between a traditional database system and a data warehouse system, though to a certain extent, all databases are similarly designed to serve a basic administrative purpose, e.g., to deliver a quick response to transactional data processes such as entry, update, query and retrieval. For many conventional databases, this objective has been achieved by online transactional processing (OLTP) systems (e.g. Oracle Corp, 2004; Winter & Auerbach, 2004). In contrast, data warehouses deal with a huge volume of data that are more historical in nature. Moreover, data warehouse designs are strongly organized for decision making by subject matter rather than by defined access or system privileges. As a result, a dimension model is usually adopted in a data warehouse to meet these needs, whereas an Entity-Relationship model is commonly used in an OLTP system. Due to these differences, an OLTP query usually requires much shorter processing time than a data warehouse query (Raden, 2003). Performance enhancement techniques are, therefore, especially critical in the arena of data warehousing. Despite the differences, these two types of database systems share some common characteristics. Some techniques used in a data warehouse to achieve a better performance are similar to those used in OLTP, while some are only developed in relation to data warehousing. For example, as in an OLTP system, an index is also used in a data warehouse system, though a data warehouse might have different kinds of indexing mechanisms based on its granularity. Partitioning is a technique which can be used in data warehouse systems as well (Silberstein, Eacrett, Mayer, & Lo, 2003). On the other hand, some techniques are developed specifically to improve the performance of data warehouses. For example, aggregates can be built to provide a quick response time for summary information (e.g. Eacrett, 2003; Silberstein, 2003). Query parallelism can be implemented to speed up the query when data are queried from
several tables (Silberstein, et al., 2003). Caching and query statistics are unique for data warehouses since the statistics will help to build a smart cache for better performance. Also, pre-calculated reports are useful to certain groups of users who are only interested in seeing static reports (Eacrett, 2003). Periodic data compression and archiving helps to cleanse the data warehouse environment. Keeping only the necessary data online will allow faster access (e.g. Kimball, 1996).
MAIN THRUST As discussed earlier, performance issues play a crucial role in a data warehouse environment. This chapter describes ways to design, build, and manage data warehouses for optimum performance. The techniques of tuning and refining the data warehouse discussed below have been developed in recent years to reduce operating and maintenance costs and to substantially improve the performance of new and existing data warehouses.
Performance Optimization at the Data Model Design Stage
Adopting a good data model design is a proactive way to enhance future performance. In the data warehouse design phase, the following factors should be taken into consideration.
• Granularity: Granularity is the main issue that needs to be investigated carefully before the data warehouse is built. For example, does the report need the data at the level of stock keeping units (SKUs), or just at the brand level? These are size questions that should be asked of business users before designing the model. Since a data warehouse is a decision support system, rather than a transactional system, the level of detail required is usually not as deep as the latter. For instance, a data warehouse does not need data at the document level such as sales orders, purchase orders, which are usually needed in a transactional system. In such a case, data should be summarized before they are loaded into the system. Defining the data that are needed – no more and no less – will determine the performance in the future. In some cases, the Operational Data Stores (ODS) will be a good place to store the most detailed granular level data and those data can be provided on the jump query basis.
• Cardinality: Cardinality means the number of possible entries of the table. By collecting business requirements, the cardinality of the table can be decided. Given a table’s cardinality, an appropriate indexing method can then be chosen.
• Dimensional Models: Most data warehouse designs use dimensional models, such as Star-Schema, Snow-Flake, and Star-Flake. A star-schema is a dimensional model with fully denormalized hierarchies, whereas a snowflake schema is a dimensional model with fully normalized hierarchies. A star-flake schema represents a combination of a star schema and a snow-flake schema (e.g. Moody & Kortink, 2003). Data warehouse architects should consider the pros and cons of each dimensional model before making a choice.
Aggregates Aggregates are the subsets of the fact table data (Eacrett, 2003). The data from the fact table are summarized into aggregates and stored physically in a different table than the fact table. Aggregates can significantly increase the performance of the OLAP query since the query will read fewer data from the aggregates than from the fact table. Database read time is the major factor in query execution time. Being able to reduce the database read time will help the query performance a great deal since fewer data are being read. However, the disadvantage of using aggregates is its loading performance. The data loaded into the fact table have to be rolled up to the aggregates, which means any newly updated records will have to be updated in the aggregates as well to make the data in the aggregates consistent with those in the fact table. Keeping the data as current as possible has presented a real challenge to data warehousing (e.g. Bruckner & Tjoa, 2002). The ratio of the database records transferred to the database records read is a good indicator of whether or not to use the aggregate technique. In practice, if the ratio is 1/10 or less, building aggregates will definitely help performance (Silberstein, 2003).
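A toy illustration of building and maintaining an aggregate from fact rows (the fact layout and grain are invented): queries at the month/product grain read the small aggregate instead of the detailed fact table, while every delta load must also be rolled up to keep the two consistent.

```python
from collections import defaultdict

fact_rows = [
    {"month": "2024-01", "product": "A", "store": "S1", "sales": 100},
    {"month": "2024-01", "product": "A", "store": "S2", "sales": 150},
    {"month": "2024-01", "product": "B", "store": "S1", "sales": 80},
]

aggregate = defaultdict(int)          # (month, product) -> total sales

def roll_up(rows):
    """Fold newly loaded fact rows into the aggregate so it stays consistent with the fact table."""
    for r in rows:
        aggregate[(r["month"], r["product"])] += r["sales"]

roll_up(fact_rows)
print(aggregate[("2024-01", "A")])    # 250 -- answered without scanning the fact table

roll_up([{"month": "2024-01", "product": "B", "store": "S2", "sales": 40}])   # cost paid at load time
print(aggregate[("2024-01", "B")])    # 120
```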
Database Partitioning Logical partitioning means using year, planned/actual data, and business regions as criteria to partition the database into smaller data sets. After logical partitioning, a database view is created to include all the partitioned tables. In this case, no extra storage is needed and each partitioned table will be smaller to accelerate the query (Silberstein, Eacrett, Mayer & Lo, 2003). Take a multinational company as an example. It is better to put the data from different countries into different data targets, such as cubes or data marts, than to put the data from all the countries into one data target. By logical partitioning (splitting the data into smaller cubes), query can read the
smaller cubes instead of large cubes and several parallel processes can read the small cubes at the same time. Another benefit of logical partitioning is that each partitioned cube is less complex to load and easier to perform the administration. Physical partitioning can also reach the same goal as logical partitioning. Physical partitioning means that the database table is cut into smaller chunks of data. The partitioning is transparent to the user. The partitioning will allow parallel processing of the query and each parallel process will read a smaller set of data separately.
Query Parallelism
In business intelligence reporting, the query is usually very complex and it might require the OLAP engine to read the data from different sources. In such cases, the technique of query parallelism will significantly improve the query performance. Query parallelism is an approach to split the query into smaller sub-queries and to allow parallel processing of sub-queries. By using this approach, each sub-process takes a shorter time and reduces the risk of system hogging if a single long-running process is used. In contrast, sequential processing definitely requires a longer time to process (Silberstein, Eacrett, Mayer & Lo, 2003).
Database Indexing
In a relational database, indexing is a well-known technique for reducing database read time. By the same token, in a data warehouse dimensional model, the use of indices in the fact table, dimension table, and master data table will improve the database read time.
• The Fact Table Index: By default the fact table will have the primary index on all the dimension keys. However, a secondary index can also be built to fit a different query design (e.g. McDonald, Wilmsmeier, Dixon, & Inmon, 2002). Unlike a primary index, which includes all the dimension keys, the secondary index can be built to include only some dimension keys to improve the performance. By having the right index, the query read time can be dramatically reduced.
• The Dimension Table Index: In a dimensional model, the size of the dimension table is the deciding factor affecting query performance. Thus, the index of the dimension table is important to decrease the master data read time, and thus to improve filtering and drill down. Depending on the cardinality of the dimension table, different index methods will be adopted to build the index. For a low cardinality dimension table, a Bit-Map index is usually adopted. In contrast, for a high cardinality dimension table, the B-Tree index should be used (e.g. McDonald, Wilmsmeier, Dixon & Inmon, 2002).
• The Master Data Table Index: Since the data warehouse commonly uses dimensional models, the query SQL plan always starts from reading the master data table. Using indices on the master data table will significantly enhance the query performance.
Caching Technology
Caching technology will play a critical role in query performance as the memory size and speed of CPUs increase. After a query runs once, the result is stored in the memory cache. Subsequently, similar query runs will get the data directly from the memory rather than by accessing the database again. In an enterprise server, a certain block of memory is allocated for the caching use. The query results and the navigation status can then be stored in a highly compressed format in that memory block. For example, a background user can be set up during the off-peak time to mimic the way a real user runs the query. By doing this, a cache is generated that can be used by real business users.
Pre-Calculated Report
Pre-calculation is one of the techniques where the administrator can distribute the workload to off-peak hours and have the result sets ready for faster access (Eacrett, 2003). There are several benefits of using pre-calculated reports. The user will have faster response time since no calculation needs to take place on the fly. Also, the system workload is balanced and shifted to off-peak hours. Lastly, the reports can be available offline.
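A minimal sketch of the caching idea described above: the result of a query, identified by a normalized key, is kept in memory and reused for subsequent identical requests; the query-execution function here is only a stand-in for the real OLAP engine call.

```python
cache = {}

def run_query(sql, execute):
    """Return a cached result when the same (normalized) query text has been seen before."""
    key = " ".join(sql.lower().split())     # normalize case and whitespace to improve the hit rate
    if key not in cache:
        cache[key] = execute(sql)           # the expensive database/OLAP call happens only once
    return cache[key]

def fake_engine(sql):
    print("executing:", sql)
    return [("2024-01", 250)]

run_query("SELECT month, SUM(sales) FROM cube GROUP BY month", fake_engine)
run_query("select month, sum(sales)  from cube group by month", fake_engine)   # served from the cache
```

A background user run during off-peak hours would simply call such a routine with the most common report queries so that real business users hit a warm cache.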
Use Statistics to Further Tune up the System In a real-world data warehouse system, the statistics data of the OLAP are collected by the system. The statistics provide such information as what the most used queries are and how the data are selected. The statistics can help further tune the system. For example, examining the descriptive statistics of the queries will reveal the most common used drill-down dimensions as well as the combinations of the dimensions. Also, the OLAP statistics will also indicate what the major time component is out of the total query time. It could be database read time or OLAP calculation time. Based on the data, one can build aggregates or offline reports to increase the query performance.
Data Compression Over time, the dimension table might contain some unnecessary entries or redundant data. For example, when some data has been deleted from the fact table, the corresponding dimension keys will not be used any more. These keys need to be deleted from the dimension table in order to have a faster query time since the query execution plan always starts from the dimension table. A smaller dimension table will certainly help the performance. The data in the fact table need to be compressed as well. From time to time, some entries in the fact table might contain all zeros and need to be removed. Also, entries with the same dimension key value should be compressed to reduce the size of the fact table. Kimball (1996) pointed out that data compression might not be a high priority for an OLTP system since the compression will slow down the transaction process. However, compression can improve data warehouse performance significantly. Therefore, in a dimensional model, the data in the dimension table and fact table need to be compressed periodically.
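A small sketch of the fact-table compression described above, dropping rows whose measures are all zero and merging rows that share the same dimension keys; the fact layout is invented.

```python
from collections import defaultdict

def compress_fact(rows, dims=("month", "product")):
    """Merge rows with identical dimension keys and drop rows whose measures are all zero."""
    merged = defaultdict(lambda: defaultdict(float))
    for r in rows:
        key = tuple(r[d] for d in dims)
        for measure, value in r.items():
            if measure not in dims:
                merged[key][measure] += value
    return [dict(zip(dims, key), **m) for key, m in merged.items()
            if any(v != 0 for v in m.values())]

rows = [
    {"month": "2024-01", "product": "A", "sales": 100, "returns": 0},
    {"month": "2024-01", "product": "A", "sales": 150, "returns": 5},
    {"month": "2024-01", "product": "B", "sales": 0,   "returns": 0},   # all-zero measures
]
print(compress_fact(rows))
# [{'month': '2024-01', 'product': 'A', 'sales': 250.0, 'returns': 5.0}]
```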
Data Archiving In the real data warehouse environment, data are being loaded daily or weekly. The volume of the production data will become huge over time. More production data will definitely lengthen the query run time. Sometimes, there are too much unnecessary data sitting in the system. These data may be out-dated and not important to the analysis. In such cases, the data need to be periodically archived to offline storage so that they can be removed from the production system (e.g. Uhle, 2003). For example, three years of production data are usually sufficient for a decision support system. Archived data can still be pulled back to the system if such need arises
FUTURE TRENDS
Although data warehouse applications have been evolving to a relatively mature stage, several challenges remain in developing a high performing data warehouse environment. The major challenges are:
• How to optimize the access to volumes of data based on system workload and user expectations (e.g. Inmon, Rudin, Buss & Sousa, 1998)? Corporate information systems rely more and more on data warehouses that provide one entry point for access to all information for all users. Given this, and because of varying utilization demands, the system should be tuned to reduce the system workload during peak user periods.
• How to design a better OLAP statistics recording system to measure the performance of the data warehouse? As mentioned earlier, query OLAP statistics data is important to identify the trend and the utilization of the queries by different users. Based on the statistics results, several techniques could be used to tune up the performance, such as aggregates, indices and caching.
• How to maintain data integrity and keep the good data warehouse performance at the same time? Information quality and ownership within the data warehouse play an important role in the delivery of accurate processed data results (Ma, Chou, & Yen, 2000). Data validation is thus a key step towards improving data reliability and quality. Accelerating the validation process presents a challenge to the performance of the data warehouse.
CONCLUSION Information technologies have been evolving quickly in recent decades. Engineers have made great strides in developing and improving databases and data warehousing in many aspects, such as hardware; operating systems; user interface and data extraction tools; data mining tools, security, and database structure and management systems. Many success stories in developing data warehousing applications have been published (Beitler & Leary, 1997; Grim & Thorton, 1997). An undertaking of this scale – creating an effective data warehouse, however, is costly and not risk-free. Weak sponsorship and management support, insufficient funding, inadequate user involvement, and organizational politics have been found to be common factors contributing to failure (Watson, Gerard, Gonzalez, Haywood & Frenton, 1999). Therefore, data warehouse architects need to first clearly identify the information useful for the specific business process and goals of the organization. The next step is to identify important factors that impact performance of data warehousing implementation in the organization. Finally all the business objectives and factors will be incorporated into the design of the data warehouse system and data models, in order to achieve high-performing data warehousing and business intelligence.
REFERENCES Appfluent Technology. (2002, December 2). A study on reporting and business intelligence application usage. Retrieved from http://searchsap.techtarget.com/whitepaperPage/0,293857,sid21_gci866993,00.html
Beitler, S.S., & Leary, R. (1997). Sears’ EPIC transformation: Converting from mainframe legacy systems to OnLine Analytical Processing (OLAP). Journal of Data Warehousing, 2, 5-16. Bruckner, R.M., & Tjoa, A.M. (2002). Capturing delays and valid times in data warehouses—towards timely consistent analyses. Journal of Intelligent Information Systems, 19(2), 169-190. Eacrett, M. (2003). Hitchhiker’s guide to SAP business information warehouse performance tuning. SAP White Paper. Eckerson, W. (2003). BI StatShots. Journal of Data Warehousing, 8(4), 64. Grim, R., & Thorton, P.A. (1997). A customer for life: the warehouseMCI approach. Journal of Data Warehousing, 2, 73-79. Inmon, W.H., Imhoff, C., & Sousa, R. (2001). Corporate information factory. New York: John Wiley & Sons, Inc. Inmon, W.H., Rudin, K., Buss, C.K., & Sousa, R. (1998). Data Warehouse Performance. New York: John Wiley & Sons, Inc. Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons, Inc. Ma, C., Chou, D.C., & Yen, D.C. (2000). Data warehousing, technology assessment and management. Industrial Management + Data Systems, 100, 125 McDonald, K., Wilmsmeier, A., Dixon, D.C., & Inmon, W.H. (2002, August). Mastering the SAP business information warehouse. Hoboken, NJ: John Wiley & Sons, Inc. Moody, D., & Kortink, M.A.R. (2003). From ER models to dimensional models, part II: Advanced design issues. Journal of Data Warehousing, 8, 20-29. Oracle Corp. (2004). Largest transaction processing db on Unix runs oracle database. Online Product News, 23(2). Peterson, S. (1994). Stars: A pattern language for query optimized schema. Sequent Computer Systems, Inc. White Paper. Raden, N. (2003). Real time: Get real, part II. Intelligent Enterprise, 6(11), 16. Silberstein, R. (2003). Know how network: SAP BW performance monitoring with BW statistics. SAP White Paper. Silberstein, R., Eacrett, M., Mayer, O., & Lo, A. (2003). SAP BW performance tuning. SAP White Paper.
Uhle, R. (2003). Data aging with mySAP business intelligence. SAP White Paper. Watson, H., Gerard, J., Gonzalez, L.E., Haywood, M.E., & Fenton, D. (1999). Data warehousing failures: Case studies and findings. Journal of Data Warehousing, 4, 44-55. Winter, R., & Auerbach, K. (2004). Contents under pressure: scalability challenges for large databases. Intelligent Enterprise, 7(7), 18-25.
KEY TERMS Cache: A region of a computer’s memory which stores recently or frequently accessed data so that the time of repeated access to the same data can decrease. Granularity: The level of detail or complexity at which an information resource is described. Indexing: In data storage and retrieval, the creation and use of a list that inventories and cross-references data. In database operations, a method to find data more efficiently by indexing on primary key fields of the database tables. ODS (Operational Data Stores): A system with capability of continuous background update that keeps up with individual transactional changes in operational systems versus a data warehouse that applies a large load of updates on an intermittent basis. OLAP (Online Analytical Processing): A category of software tools for collecting, presenting, delivering, processing and managing multidimensional data (i.e., data that has been aggregated into various categories or “dimensions”) in order to provide analytical insights for business management. OLTP (Online Transaction Processing): A standard, normalized database structure designed for transactions in which inserts, updates, and deletes must be fast. Service Management: The strategic discipline for identifying, establishing, and maintaining IT services to support the organization’s business goal at an appropriate cost. SQL (Structured Query Language): A standard interactive programming language used to communicate with relational databases in order to retrieve, update, and manage data.
Data Warehousing and Mining in Supply Chains Richard Mathieu Saint Louis University, USA Reuven R. Levary Saint Louis University, USA
INTRODUCTION Every finished product has gone through a series of transformations. The process begins when manufacturers purchase the raw materials that will be transformed into the components of the product. The parts are then supplied to a manufacturer, who assembles them into the finished product and ships the completed item to the consumer. The transformation process includes numerous activities (Levary, 2000). Among them are
• Designing the product
• Designing the manufacturing process
• Determining which component parts should be produced in house and which should be purchased from suppliers
• Forecasting customer demand
• Contracting with external suppliers for raw materials or component parts
• Purchasing raw materials or component parts from suppliers
• Establishing distribution channels for raw materials and component parts from suppliers to manufacturer
• Establishing distribution channels to the suppliers of raw materials and component parts
• Establishing distribution channels from the manufacturer to the wholesalers and from wholesalers to the final customers
• Manufacturing the component parts
• Transporting the component parts to the manufacturer of the final product
• Manufacturing and assembling the final product
• Transporting the final product to the wholesalers, retailers, and final customer
Each individual activity generates various data items that must be stored, analyzed, protected, and transmitted to various units along a supply chain. A supply chain can be defined as a series of activities that are involved in the transformation of raw materials into a final product, which a customer then purchases (Levary, 2000). The flow of materials, component parts,
and products is moving downstream (i.e., from the initial supply sources to the end customers). The flow of information regarding the demand for the product and orders to suppliers is moving upstream, while the flow of information regarding product availability, shipment schedules, and invoices is moving downstream. For each organization in the supply chain, its customer is the subsequent organization in the supply chain, and its subcontractor is the prior organization in the chain.
BACKGROUND Supply chain data can be characterized as either transactional or analytical (Shapiro, 2001). All new data that are acquired, processed, and compiled into reports that are transmitted to various organizations along a supply chain are deemed transactional data (Davis & Spekman, 2004). Increasingly, transactional supply chain data is processed and stored in enterprise resource planning systems, and complementary data warehouses are developed to support decision-making processes (Chen R., Chen, C., & Chang, 2003; Zeng, Chiang, & Yen, 2003). Organizations such as Home Depot, Lowe’s, and Volkswagen have developed data warehouses and integrated data-mining methods that complement their supply chain management operations (Dignan, 2003a; Dignan, 2003b; Hofmann, 2004). Data that are used in descriptive and optimization models are considered analytical data (Shapiro, 2001). Descriptive models include various forecasting models, which are used to forecast demands along supply chains, and managerial accounting models, which are used to manage activities and costs. Optimization models are used to plan resources, capacities, inventories, and product flows along supply chains. Data collected from consumers are the core data that affect all other data items along supply chains. Information collected from the consumers at the point of sale include data items regarding the sold product (e.g., type, quantity, sale price, and time of sale) as well as information about the consumer (e.g., consumer address and method of payment). These data items are analyzed often. Datamining techniques are employed to determine the types of
items that are appealing to consumers. The items are classified according to consumers’ socioeconomic backgrounds and interests, the sale price that consumers are willing to pay, and the location of the point of sale. Data regarding the return of sold products are used to identify potential problems with the products and their uses. These data include information about product quality, consumer disappointment with the product, and legal consequences. Data-mining techniques can be used to identify patterns in returns so that retailers can better determine which type of product to order in the future and from which supplier it should be purchased. Retailers are also interested in collecting data regarding competitors’ sales so that they can better promote their own product and establish a competitive advantage. Data related to political and economic conditions in supplier countries are of interest to retailers. Data-mining techniques can be used to identify political and economic patterns in countries. Information can help retailers choose suppliers who are situated in countries where the flow of products and funds is expected to be stable for a reasonably long period of time. Manufacturers collect data regarding a) particular products and their manufacturing process, b) suppliers, and c) the business environment. Data regarding the product and the manufacturing process include the characteristics of products and their component parts obtained from CAD/CAM systems, the quality of products and their components, and trends in theresearch and development (R & D) of relevant technologies. Datamining techniques can be applied to identify patterns in the defects of products, their components, or the manufacturing process. Data regarding suppliers include availability of raw materials, labor costs, labor skills, technological capability, manufacturing capacity, and lead time of suppliers. Data related to qualified teleimmigrants (e.g., engineers and computer software developers) is valuable to many manufacturers. Data-mining techniques can be used to identify those teleimmigrants having unique knowledge and experience. Data regarding the business environment of manufacturers include information about competitors, potential legal consequences regarding a product or service, and both political and economic conditions in countries where the manufacturer has either facilities or business partners. Data-mining techniques can be used to identify possible liability concerning a product or service as well as trends in political and economic conditions in countries where the manufacturer has business interests. Retailers, manufacturers, and suppliers are all interested in data regarding transportation companies. These data include transportation capacity, prices, lead time, and reliability for each mode of transportation.
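To give a rough feel for the kind of point-of-sale analysis described above, the following Python sketch aggregates hypothetical sale records by product type and consumer segment. It is only an illustration of the counting step behind such classification; all field names and values are invented and do not come from the systems discussed in this article.

from collections import Counter

# Hypothetical point-of-sale records; field names are illustrative only.
sales = [
    {"product_type": "hardware", "segment": "homeowner", "price": 24.99},
    {"product_type": "groceries", "segment": "student", "price": 3.49},
    {"product_type": "hardware", "segment": "contractor", "price": 129.00},
    {"product_type": "groceries", "segment": "homeowner", "price": 5.99},
]

# Count how often each (product type, consumer segment) pair occurs,
# a first step toward spotting which items appeal to which consumers.
popularity = Counter((s["product_type"], s["segment"]) for s in sales)

for (product_type, segment), count in popularity.most_common():
    print(f"{product_type:10s} {segment:12s} {count}")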
MAIN THRUST Data Aggregation in Supply Chains Large amounts of data are being accumulated and stored by companies belonging to supply chains. Data aggregation can improve the effectiveness of using the data for operational, tactical, and strategic planning models. The concept of data aggregation in manufacturing firms is called group technology (GT). Nonmanufacturing firms are also aggregating data regarding products, suppliers, customers, and markets.
Group Technology Group technology is a concept of grouping parts, resources, or data according to similar characteristics. By grouping parts according to similarities in geometry, design features, manufacturing features, materials used, and/or tooling requirements, manufacturing efficiency can be enhanced, and productivity increased. Manufacturing efficiency is enhanced by
• Performing similar activities at the same work center so that setup time can be reduced
• Avoiding duplication of effort both in the design and manufacture of parts
• Avoiding duplication of tools
• Automating information storage and retrieval (Levary, 1993)
Effective implementation of the GT concept necessitates the use of a classification and coding system. Such a system codes the various attributes that identify similarities among parts. Each part is assigned a number or alphanumeric code that uniquely identifies the part’s attributes or characteristics. A part’s code must include both design and manufacturing attributes. A classification and coding system must provide an effective way of grouping parts into part families. All parts in a given part family are similar in some aspect of design or manufacture. A part may belong to more than one family. A part code is typically composed of a large number of characters that allow for identification of all part attributes. The larger the number of attributes included in a part code, the more difficult the establishment of standard procedures for classifying and coding. Although numerous methods of classification and coding have been developed, none has emerged as the standard method. Because different manufacturers have different requirements regarding the type and composition of parts’ codes,
customized methods of classification and coding are generally required. Some of the better known classification and coding methods are listed by Groover (1987). After a code is established for each part, the parts are grouped according to similarities and are assigned to part families. Each part family is designed to enhance manufacturing efficiency in a particular way. The information regarding each part is arranged according to part families in a GT database. The GT database is designed in such a way that users can efficiently retrieve desired information by using the appropriate code. Consider part families that are based on similarities of design features. A GT database enables design engineers to search for existing part designs that have characteristics similar to those of a new part that is to be designed. The search begins when the design engineer describes the main characteristics of the needed part with the help of a partial code. The computer then searches the GT database for all the items with the same code. The results of the search are listed on the computer screen, and the designer can then select or modify an existing part design after reviewing its specifications. Selected designs can easily be retrieved. When design modifications are needed, the file of the selected part is transferred to a CAD system. Such a system enables the design engineer to effectively modify the part’s characteristics in a short period of time. In this way, efforts are not duplicated when designing parts. The creation of a GT database helps reduce redundancy in the purchasing of parts as well. The database enables manufacturers to identify similar parts produced by different companies. It also helps manufacturers to identify components that can serve more than a single function. In such ways, GT enables manufacturers to reduce both the number of parts and the number of suppliers. Manufacturers that can purchase large quantities of a few items rather than small quantities of many items are able to take advantage of quantity discounts.
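The part-family search described above can be made concrete with a small relational sketch. The schema, part codes, and descriptions below are hypothetical assumptions used only to illustrate retrieval of a part family by partial code; they are not taken from any real GT database.

import sqlite3

# In-memory database standing in for a GT database; the schema is illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE part (code TEXT PRIMARY KEY, description TEXT)")
con.executemany(
    "INSERT INTO part VALUES (?, ?)",
    [
        ("R12-AL-T3", "round aluminium bracket, 3 mm"),
        ("R12-ST-T5", "round steel bracket, 5 mm"),
        ("S07-AL-T3", "square aluminium plate, 3 mm"),
    ],
)

# A designer describes the needed part with a partial code (here "R12" stands
# for an invented round, 12 mm family); the search returns every existing
# design whose code falls in that family.
partial_code = "R12"
for code, description in con.execute(
    "SELECT code, description FROM part WHERE code LIKE ? ORDER BY code",
    (partial_code + "%",),
):
    print(code, "-", description)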
Aggregation of Data Regarding Retailing Products Retailers may carry thousands of products in their stores. To effectively manage the logistics of so many products, product aggregation is highly desirable. Products are aggregated into families that have some similar characteristics. Examples of product families include the following:
• Products belonging to the same supplier
• Products requiring special handling, transportation, or storage
• Products intended to be used or consumed by a specific group of customers
• Products intended to be used or consumed in a specific season of the year
• Volume and speed of product movement
• The methods of the transportation of the products from the suppliers to the retailers
• The geographical location of suppliers
• Method of transaction handling with suppliers; for example, EDI, Internet, off-line
As in the case of GT, retailing products may belong to more than one family.
Aggregation of Data Regarding Customers of Finished Products To effectively market finished products to customers, it is helpful to aggregate customers with similar characteristics into families. Examples of customers of finished product families include
• Customers residing in a specific geographical region
• Customers belonging to a specific socioeconomic group
• Customers belonging to a specific age group
• Customers having certain levels of education
• Customers having similar product preferences
• Customers of the same gender
• Customers with the same household size
Similar to both GT and retailing products, customers of finished products may belong to more than one family.
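Because a part, product, or customer may belong to several families at once, the grouping is naturally a many-to-many mapping. The short Python sketch below (with invented family names and customer identifiers) shows one simple way such memberships could be held and queried; it is an illustration, not a prescribed implementation.

# Hypothetical family memberships; a customer may appear in several families.
families = {
    "same_region": {"C001", "C002", "C003"},
    "age_30_40": {"C002", "C004"},
    "prefers_eco_products": {"C002", "C003"},
}

def families_of(customer_id):
    """Return every family a given customer belongs to."""
    return {name for name, members in families.items() if customer_id in members}

print(sorted(families_of("C002")))
# ['age_30_40', 'prefers_eco_products', 'same_region']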
FUTURE TRENDS Supply Chain Decision Databases The enterprise database systems that support supply chain management are repositories for large amounts of transaction-based data. These systems are said to be data rich but information poor. The tremendous amount of data that are collected and stored in large, distributed database systems has far exceeded the human ability for comprehension without analytic tools. Shapiro estimates that 80% of the data in a transactional database that supports supply chain management is irrelevant to decision making and that data aggregations and other analyses are needed to transform the other 20% into useful information (2001). Data warehousing and online analytical processing (OLAP) technologies combined with tools for data mining and knowledge discovery have allowed the creation of systems to support organizational decision making.
The supply chain management (SCM) data warehouse must maintain a significant amount of data for decision making. Historical and current data are required from supply chain partners and from various functional areas within the firm in order to support decision making in regard to planning, sourcing, production, and product delivery. Supply chains are dynamic in nature. In a supply chain environment, it may be desirable to learn from an archived history of temporal data that often contains some information that is less than optimal. In particular, SCM environments are typically characterized by variable changes in product demand, supply levels, product attributes, machine characteristics, and production plans. As these characteristics change over time, so does the data in the data warehouses that support SCM decision making. We should note that Kimball and Ross (2002) use a supply value chain and a demand supply chain as the framework for developing the data model for all business data warehouses. The data warehouses provide the foundation for decision support systems (DSS) for supply chain management. Analytical tools (simulation, optimization, and data mining) and presentation tools (geographic information systems and graphical user interface displays) are coupled with the input data provided by the data warehouse (Marakas, 2003). Simchi-Levi, Kaminsky, and Simchi-Levi (2000) describe three DSS examples: logistics network design, supply chain planning, and vehicle routing and scheduling. Each DSS requires different data elements, has specific goals and constraints, and utilizes special graphical user interface (GUI) tools.
The Role of Radio Frequency Identification (RFID) in Supply Chains Data Warehousing The emerging RFID technology will generate large amounts of data that need to be warehoused and mined. Radio frequency identification (RFID) is a wireless technology that identifies objects without having either contact or sight of them. RFID tags can be read despite environmentally difficult conditions such as fog, ice, snow, paint, and widely fluctuating temperatures. Optically read technologies, such as bar codes, cannot be used in such environments. RFID can also identify objects that are moving. Passive RFID tags have no external power source. Rather, they have operating power generated from a reader device. The passive RFID tags are very small and inexpensive. Further, they have a virtually unlimited operational life. The characteristics of these passive RFID tags make them ideal for tracking materials through supply chains. Wal-Mart has required manufacturers, suppli-
ers, distributors, and carriers to incorporate RFID tags into both products and operations. Other large retailers are following Wal-Mart’s lead in requesting RFID tags to be installed in goods along their supply chain. The tags follow products from the point of manufacture to the store shelf. RFID technology will significantly increase the effectiveness of tracking materials along supply chains and will also substantially reduce the loss that retailers accrue from thefts. Nonetheless, civil liberty organizations are trying to stop RFID tagging of consumer goods, because this technology has the potential of affecting consumer privacy. RFID tags can be hidden inside objects without customer knowledge. So RFID tagging would make it possible for individuals to read the tags without the consumers even having knowledge of the tags’ existence. Sun Microsystems has designed RFID technology to reduce or eliminate drug counterfeiting in pharmaceutical supply chains (Jaques, 2004). This technology will make the copying of drugs extremely difficult and unprofitable. Delta Air Lines has successfully used RFID tags to track pieces of luggage from check-in to planes (Brewin, 2003). The luggage-tracking success rate of RFID was much better than that provided by bar code scanners. Active RFID tags, unlike passive tags, have an internal battery. The tags have the ability to be rewritten and/ or modified. The read/write capability of active RFID tags is useful in interactive applications such as tracking work in process or maintenance processes. Active RFID tags are larger and more expensive than passive RFID tags. Both the passive and active tags have a large, diverse spectrum of applications and have become the standard technologies for automated identification, data collection, and tracking. A vast amount of data will be recorded by RFID tags. The storage and analysis of this data will pose new challenges to the design, management, and maintenance of databases as well as to the development of data-mining techniques.
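To suggest the shape of the data such RFID tracking generates, the sketch below records hypothetical tag reads and reconstructs the movement history of one tagged item. Reader locations, tag identifiers, and timestamps are invented for illustration; a production system would of course store such events in a database rather than an in-memory list.

from datetime import datetime

# Each read: (tag id, reader location, timestamp). All values are invented.
reads = [
    ("TAG-0001", "factory-dock", datetime(2005, 3, 1, 8, 0)),
    ("TAG-0002", "factory-dock", datetime(2005, 3, 1, 8, 5)),
    ("TAG-0001", "warehouse-A", datetime(2005, 3, 2, 14, 30)),
    ("TAG-0001", "store-17-shelf", datetime(2005, 3, 4, 9, 15)),
]

def history(tag_id):
    """Return the chronologically ordered locations seen for one tag."""
    events = [(ts, loc) for tag, loc, ts in reads if tag == tag_id]
    return [loc for ts, loc in sorted(events)]

print(history("TAG-0001"))  # ['factory-dock', 'warehouse-A', 'store-17-shelf']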
CONCLUSION A large amount of data is likely to be gathered from the many activities along supply chains. This data must be warehoused and mined to identify patterns that can lead to better management and control of supply chains. The more RFID tags installed along supply chains, the easier data collection becomes. As the tags become more popular, the data collected by them will grow significantly. The increased popularity of the tags will bring with it new possibilities for data analysis as well as new warehousing and mining challenges.
REFERENCES
Brewin, B. (2003, December). Delta says radio frequency ID devices pass first bag. Computer World, 7. Retrieved March 29, 2005, from www.computerworld.com/mobiletopics/mobile/technology/story/0,10801,88446,00/
Chen, R., Chen, C., & Chang, C. (2003). A Web-based ERP data mining system for decision making. International Journal of Computer Applications in Technology, 17(3), 156-158.
Davis, E. W., & Spekman, R. E. (2004). Extended enterprise. Upper Saddle River, NJ: Prentice Hall.
Dignan, L. (2003a, June 16). Data depot. Baseline Magazine.
Dignan, L. (2003b, June 16). Lowe's big plan. Baseline Magazine.
Groover, M. P. (1987). Automation, production systems, and computer-integrated manufacturing. Englewood Cliffs, NJ: Prentice-Hall.
Hofmann, M. (2004, March). Best practices: VW revs its B2B engine. Optimize Magazine, 22-30.
Jaques, R. (2004, February 20). Sun pushes RFID drug technology. E-business/Technology/News. Retrieved March 29, 2005, from www.vnunet.com/News/1152921
Kimball, R., & Ross, M. (2002). The data warehouse toolkit (2nd ed.). New York: Wiley.
Levary, R. R. (1993). Group technology for enhancing the efficiency of engineering activities. European Journal of Engineering Education, 18(3), 277-283.
Levary, R. R. (2000, May–June). Better supply chains through information technology. Industrial Management, 42, 24-30.
Marakas, G. M. (2003). Modern data warehousing, mining and visualization. Upper Saddle River, NJ: Prentice-Hall.
Shapiro, J. F. (2001). Modeling the supply chain. Pacific Grove, CA: Duxbury.
Simchi-Levi, D., Kaminsky, P., & Simchi-Levi, E. (2000). Designing and managing the supply chain. Boston, MA: McGraw-Hill.
Zeng, Y., Chiang, R., & Yen, D. (2003). Enterprise integration with advanced information technologies: ERP and data warehousing. Information Management and Computer Security, 11(3), 115-122.
KEY TERMS
Analytical Data: All data that are obtained from optimization, forecasting, and decision support models.
Computer Aided Design (CAD): An interactive computer graphics system used for engineering design.
Computer Aided Manufacturing (CAM): The use of computers to improve both the effectiveness and efficiency of manufacturing activities.
Decision Support System (DSS): Computer-based systems designed to assist in managing activities and information in organizations.
Graphical User Interface (GUI): A software interface that relies on icons, bars, buttons, boxes, and other images to initiate computer-based tasks for users.
Group Technology (GT): The concept of grouping parts, resources, or data according to similar characteristics.
Online Analytical Processing (OLAP): A form of data presentation in which data are summarized, aggregated, deaggregated, and viewed in the frame of a table or cube.
Supply Chain: A series of activities that are involved in the transformation of raw materials into a final product that is purchased by a customer.
Transactional Data: All data that are acquired, processed, and compiled into reports, which are transmitted to various organizations along a supply chain.
Data Warehousing Search Engine
Hadrian Peter, University of the West Indies, Barbados
Charles Greenidge, University of the West Indies, Barbados
INTRODUCTION Modern database systems have incorporated the use of DSS (Decision Support Systems) to augment their decision-making business function and to allow detailed analysis of off-line data by higher-level business managers (Agosta, 2000; Kimball, 1996). The data warehouse is an environment that is readily tuned to maximize the efficiency of performing decision support functions. However the advent of commercial uses of the Internet on a large scale has opened new possibilities for data capture and integration into the warehouse. The data necessary for decision support can be divided roughly into two categories: internal data and external data. In this article, we focus on the need to augment external data capture from Internet sources and provide a tri-partite, high-level model termed the Data Warehouse Search Engine (DWSE) model, to perform the same. We acknowledge efforts that also are being made to retrieve internal information from Web sources by use of the Web warehouse, which stores the Web user’s mouse and keyboard activities online, the so called clickstream data. A number of Web warehouse initiatives have been proposed, including WHOWEDA (Madria et al., 1999). To be clear, data warehouses have focused on large volumes of long-term historical data (a number of weeks, months, or years old), but the presence of the Internet with its data, which is short-lived and volatile, and improved automation and integration activities make the shorter time scales for the refresh cycle more attractive. To attain the maximum benefits of the DWSE, significant technical contributions that surpass existing implementations must be encouraged from the wider database, information-retrieval, and search-engine technology communities. Our model recognizes the fact that a crossdisciplinary approach to research is the only way to guarantee further advancement in the area of decision support.
BACKGROUND Data warehousing methodologies are concerned with the collection, organization, and analysis of data taken
from several heterogeneous sources, all aimed at augmenting end-user business function (Berson & Smith, 1997; Inmon, 2003; Wixom & Watson, 2001). Central to the use of heterogeneous data sources is the need to extract, clean, and load data from a variety of operational sources. Operational data not only need to be cleaned by removing bad records or invalid fields, but also typically must be put through a merge/purge process that removes redundancies and records that are inconsistent and lack integrity (Celko, 1995). External data is key to business function and decision making, and includes sources of information such as newspapers, magazines, trade publications, personal contacts, and news releases. In the case where external data is being used in addition to data taken from disparate operational sources, this external data may require a similar cleaning/merge/purge process to be applied to guarantee consistency (Higgins, 2003). The World Wide Web represents a large and growing source of external data but is notorious for the presence of bad, unauthorized, or otherwise irregular data (Brake, 1997; Pfaffenberger, 1996). Thus, the need for cleaning and integrity checking activities increases when the Web is being used to gather external data (SanderBeuermann & Schomburg, 1998). The prevalence of viruses and worms on the Internet, and the problems with unsolicited e-mail (spam) show that communications coming across the Internet must be filtered. The cleaning process may include pre-programmed rules to adjust values, the use of sophisticated AI techniques, or simply mechanisms to spot anomalies that then can be fixed manually.
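A cleaning step of the kind mentioned above can be as simple as a few pre-programmed rules. The Python sketch below, with invented field names and a single invented rule, drops exact duplicate records and flags obviously anomalous values for manual review; it is only a minimal illustration of the merge/purge idea under these assumptions.

# Hypothetical external records pulled from the Web; fields are illustrative.
raw = [
    {"company": "Acme", "revenue_musd": 120.0},
    {"company": "Acme", "revenue_musd": 120.0},      # duplicate
    {"company": "Globex", "revenue_musd": -5.0},     # impossible value
    {"company": "Initech", "revenue_musd": 87.5},
]

def clean(records):
    seen, kept, suspicious = set(), [], []
    for r in records:
        key = (r["company"], r["revenue_musd"])
        if key in seen:
            continue                      # merge/purge: drop exact duplicates
        seen.add(key)
        if r["revenue_musd"] < 0:
            suspicious.append(r)          # rule: revenue cannot be negative
        else:
            kept.append(r)
    return kept, suspicious

kept, suspicious = clean(raw)
print(len(kept), "kept;", len(suspicious), "flagged for manual review")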
MAIN THRUST Informed decision making in a data warehouse relies on suitable levels of granularity for internal data, where users can drill down from low to higher levels of aggregation. External data in the form of multimedia files, informal contacts, or news sources are often marginalized (Inmon, 2002; Kimball, Barquin & Edelstein, 1997) due to their unstructured nature.
Figure 1. Merits of internal/external data: internal data (trusted sources, known structure, known frequency, consistent) plus external data (unproven sources, unknown structure, varying frequency, unpredictable) yields decision-making data that is informative, challenging, wide in scope, and balanced.
True decision making embraces as many pertinent sources of information as possible so that a holistic perspective of facts, trends, and individual pieces of data can be obtained. Increasingly, the growth of commerce and business on the Internet has meant that in addition to traditional modes of disseminating information, the Internet has become a forum for ready posting of information of all kinds (Hall, 1999). The prevalence of so much information on the Internet means that it is potentially a superior source of external data for the data warehouse. Since such data typically originate from a variety of sources (sites), it has to undergo a merging and transformation process before it can be used in the data warehouse. In the case of internal data, which forms the core of the data warehouse, previous manual methods of applying ad-hoc tools and techniques to the data cleaning process are being replaced by more automated forms such as ETL (Extract, Transform, Load). Although current ETL standardized packages are expensive, they offer productivity gains in the long run (Earls, 2003; Songini, 2004). The existence of the so-called invisible Web and ongoing efforts to gain access to these untapped sources suggest that the future of external data retrieval will enjoy the same interest as that shown in internal data (Inmon, 2002; Sherman & Price, 2003; Smith, 2001). The need for reliable and consistent external data provides the motivation for an intermediate layer between raw data gathered from the Internet and external data storage areas lying within the domain of the data warehouse (Agosta, 2000). Until there is a maturing of widely available tools with the ability to access the invisible Web, there will be a continued reliance on information retrieval techniques, as contrasted with data retrieval techniques, to gather external data (van Rijsbergen, 1979). The need for three environments to be present to process external data from Web sources into the warehouse suggests a threetier solution to this problem. Accordingly, we propose a tri-partite model called the Data Warehouse Search
Engine Model (DWSE), which has an intermediate data extraction/cleaning layer functionally called the Meta-Data Engine, sandwiched between the data warehouse and search engine environments.
The DWSE Model The data warehouse and search engine environments serve two distinct and important roles at the current time, but there is scope to utilize the strengths of both in conjunction for maximum usefulness (Barquin & Edelstein, 1997; Sonnenreich & Macinta, 1998). Our proposed DWSE model seeks to allow cooperative links between data warehouse and search engine with an aim of satisfying external data requirements. The model consists of (1) data warehouse (DW), (2) meta-data engine (MDE) and (3) search engine (SE). The MDE is the component that provides a bridge over which information must pass from one environment to the other. The MDE enhances queries coming from the warehouse and also captures, merges, and formats information returned by the search engine (Devlin, 1998). The new model, through the MDE, seeks to augment the operations of both by allowing external data to be collected for the business analyst, while improving the search engine's searches through modification of the queries emerging from the data warehouse. The generalized process is as follows. A query originates in the warehouse environment and is modified by the MDE so that it is specific and free of nonsense words. A word that has a high occurrence in a text but conveys little specific information about the subject of the text is deemed to be a nonsense word. Typically, these words include pronouns, conjunctions, and proper names. This term is synonymous with noise word, as found in information retrieval texts (Belew, 2000). The modified query is transmitted to the search engine that performs its operations and retrieves its results documents. The documents returned are analyzed by the MDE, and information is prepared for return to the
Figure 2. The DWSE model: the data warehouse (enterprise-wide sources, meta-data storage, extract/transform) and the search engine (SE files, extraction/cleaning, the Web) cooperate through the meta-data engine.
Figure 3. Bridging the architectures: the meta-data engine links the data warehouse environment (DW DBMS, internal data, DSS tool, end user) with a hybrid search engine and the Internet.
warehouse. Finally, the information relating to the answer to the query is returned to the warehouse environment. The entire process may take days or weeks, as both warehouse and search engine operate independently according to their own schedules. The design of both search engine and data warehouse are skill-intensive and specialized tasks. Prudent judgment dictates that nothing should be done to add to the burden of building and maintaining these complex systems. History shows that many information technology (IT) projects fail when expediency overtakes deliberate design. In Figure 2, we see the cooperation among components of the DWSE. The DWSE model considers only the external data requirements of the data warehouse. The SE component performs actual searches but no analysis of documents. The indexing of the retrieved documents and other analysis is done by the MDE.
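A minimal sketch of the query-modification step described above is given below, under the assumption that the MDE simply strips noise words from the warehouse query before passing it to the search engine. The stop-word list and the example query are illustrative assumptions, not part of the proposed system's specification.

# Small illustrative noise-word list; a real MDE would maintain a richer one.
NOISE_WORDS = {"the", "a", "an", "and", "or", "of", "it", "we", "they"}

def modify_query(warehouse_query: str) -> str:
    """Drop noise words so the query sent to the search engine is specific."""
    kept = [w for w in warehouse_query.lower().split() if w not in NOISE_WORDS]
    return " ".join(kept)

print(modify_query("the quarterly sales of outdoor furniture and the suppliers"))
# -> "quarterly sales outdoor furniture suppliers"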
Meta-Data Engine Model To cope with the radical differences between the SE and DW designs, we propose a meta-data engine to coordinate all activities. Typical commercial SEs are composed of a crawler (spider) and an indexer (mite). The indexer is used to codify results into a database for easy querying. Our approach reduces the complexity of the search engine by moving the indexing tasks to the metadata engine. The meta-data engine seeks to form a bridge between the diverse SE and DW environments. The purpose of the meta-data engine is to facilitate an automatic information retrieval mode. Figure 3 illustrates the concept of the bridge between DW and SE environments, and Figure 4 outlines the main features of the MDE. The following are some of the main functions of the meta-data engine:
• Capture and transform data arising from SE (e.g., handle HTML, XML, .pdf, and other document formats).
• Index words retrieved during SE crawls.
• Manage the scheduling and control of tasks arising from the operation of both DW and SE.
• Provide a neutral platform so that SE and DW operational cycles and architectures cannot interfere with each other.
• Track meta-data that may be used to enhance the quality of queries and results in the future (e.g., monitor the use of domain-specific words/phrases such as jargon for a particular domain).
Of major interest in Figure 4 is the automatic information retrieval (AIR) component. AIR seeks to highlight the existence of documents that are related to a given query (Belew, 2000) using a process that is different from data retrieval. Data retrieval (van Rijsbergen, 1979), as used in relation to database management systems (RDBMSs) requires an exact match of terms, uses an artificial query language (e.g., SQL), and uses deduction. Information retrieval, on the other hand (as is common for Internet searches), uses partial matching and a natural query language, and inductively produces relevant results (Spertus & Stein, 1999). This idea of relevance distinguishes the IR search from the normal data retrieval where a matching of terms is done.
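The contrast between exact-match data retrieval and partial-match information retrieval can be made concrete with a toy scoring function. The documents, query, and scoring rule (fraction of query terms present) below are invented for illustration only; real IR systems use considerably richer relevance measures.

documents = {
    "doc1": "data warehouse design for retail decision support",
    "doc2": "search engine crawler architecture",
    "doc3": "decision support with external web data",
}

def ir_scores(query: str):
    """Rank documents by the fraction of query terms they contain (partial match)."""
    terms = set(query.lower().split())
    scores = {}
    for name, text in documents.items():
        words = set(text.split())
        scores[name] = len(terms & words) / len(terms)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Exact-match (data) retrieval would return only documents containing every
# term; partial matching instead ranks all documents by estimated relevance.
print(ir_scores("decision support data"))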
Figure 4. Characteristics of the meta-data engine: operate in AIR mode; process queries emerging from the data warehouse; provide a neutral platform; process meta-data.
ANALYSIS OF MODEL Strengths The major strengths of this model are:
Independence This model carries logical independence. There is a deliberate attempt to ensure that the best practices of each individual model are adhered to by insistence that there be logical independence in the design. This is very important, since many IT projects fail when good design is compromised. This model carries physical independence. There is no need to overburden the capacities of existing warehouse projects by insistence on the simultaneous storage of vast quantities of external data. The model allows for the transfer of summarized results via network connections, while the physical machines may reside continents apart. This model carries administrative independence. There is the implicit expectation that the three separate subsystems comprising the model can and will be under the control of three separate administrators. This is important because warehouse administration is a specialist area, as is search engine maintenance. By allowing specialists to focus on their respective disciplines, the chances for effective administration are enhanced.
Value-Added Approach This model complements the concept of maximizing the utility of acquired data by adding value to this data even long after its day-to-day usefulness has expired. The ability to add pertinent external data to volumes of aging internal data extends the horizons for the usefulness of the internal data indefinitely.
Adaptability The new model provides for scalability, as the logical/physical independence is not restricted to any one system or architecture. Databases in any sector of the model can be provided for independently without negatively impacting the other parts of the system. The new model also allows for flexibility by allowing development to take place in three languages. For example, when there is a query language for the warehouse, Perl for the meta-data engine, and Java for the search engine, the strengths of each can be utilized effectively. The new model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the sector that interfaces with the Internet from the sector that carries vital internal data, the prospects for good security are improved.
Relieves Information Overload By seeking to remove, repackage, and reprocess external data, we are indirectly removing time-consuming filtering activities by employees and busy executives. There is a growing need to protect users from being overwhelmed by increasing volumes of data. In the future, we will rely on smart, accurate, automated systems to guide us through the realms of extraneous data to the cores of knowledge.
Potential Weaknesses
• Relies on fast-changing search engine technologies. Improvements or modifications to the way information is formatted and accessed on the Internet will impact the search and retrieval process.
• Introduces the need for more system overhead and administrators. A warehouse administrator, a search engine administrator, and an information retrieval expert (for the meta-data engine) will have to be contracted. The systems themselves will have to be configured so that cooperation among them remains easy.
• Storage of large volumes of irrelevant information is a problem. The cost of maintaining and indexing unmanageable quantities of Internet data may discount the benefits derived from analysis of the external data.
Figure 5 highlights possible drawbacks to using the proposed model.
Figure 5. Hidden dangers in DWSE model: large volumes of redundant/unusable data may be stored; end-user analysis may become skewed by wrong use of external data; the validity and accuracy of external data are not always known, so there are risks involved in using the external data.
FUTURE TRENDS The future is notoriously unpredictable, yet the following trends can be seen readily for the next decade.
1. Maturing of the data warehouse tools and techniques (Agosta, 2000; Greenfield, 1996; Inmon, 2002; Schwartz, 2003).
2. Increases in the volumes of data being stored online and off-line.
3. Central importance of the Internet and its related technologies (Goodman, 2000).
4. Increasing sophistication and focus of Web sites.
5. Increased need to automate analysis and decision support on the growing volumes of data, both internal and external to the organization.
6. Adoption of specialized search engines (Raab, 1999).
7. Adoption of specialized sites (vortals) catering to a particular category of user (Rupley, 2000).
8. Ability to mine the invisible Web, increasing the potential benefits from a data warehouse search engine model (Sherman & Price, 2003; Smith, 2001; Soudah, 2000; Sullivan, 2000).
The DWSE model is poised to gain prominence as practitioners realize the benefits of three distinct and architecturally independent layers. We are confident that with vital and open architectures and models of clarity, data warehousing efforts can be successful in the long term.
CONCLUSION More research is needed to investigate how to enable the data warehouse to take advantage of the growing external data stores on the Internet. The results obtained using the three-tiered approach of the DWSE model must be compared with those of the traditional external data approaches. Unclear also is how the traditional Online Transaction Processing (OLTP) systems will funnel data toward this model. If operational systems start to retain significant quantities of external data, it may militate against this model, as more and more external data will be available in the warehouse through normal integration processes. On the other hand, the costs of dealing with such external data in the warehouse may be so prohibitive that an
alternate path into the meta-data engine of the new model may be required. In the case of the second outcome, a new model may be proposed with this process (OLTP external data to meta-data engine) as an integral part. Current data warehouse models already allow for the establishment of data marts and querying by external tools. If these new processes are considered desirable components of the data warehousing process, then a new data warehouse model could be proposed. Relevance remains a thorny issue. The problem of providing relevant data without anyone to manually verify its relevance is challenging. Semantic Web developments promise to address questions of relevance on the Internet.
REFERENCES
Agosta, L. (2000). The essential guide to data warehousing. New Jersey: Prentice-Hall.
Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. New Jersey: Prentice-Hall.
Belew, R.K. (2000). Finding out about: A cognitive perspective on search engine technology and the WWW. New York: Cambridge University Press.
Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and OLAP. New York: McGraw-Hill.
Brake, D. (1997). Lost in cyberspace [Electronic version]. New Scientist.
Celko, J. (1995). Don't warehouse dirty data. Datamation, 42-53.
Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9.
Earls, A.R. (2003). ETL: Preparation is the best bet. Computerworld, 37(34), 25-27.
Goodman, A. (2000). Searching for a better way [Electronic version].
Greenfield, L. (1996). Don't let data warehousing gotchas getcha. Datamation, 76-77.
Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11).
Higgins, K.J. (2003). Warehouse data earns its keep. Network Computing, 14(8), 111-115.
Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley & Sons.
Inmon, W.H. (2003). The story so far. Computerworld, 37(15), 26-27.
Kimball, R. (1996). Dangerous preconceptions [Electronic version].
Kimball, R. (1997). A dimensional modeling manifesto [Electronic version]. DBMS Magazine.
Madria, S.K., et al. (1999). Research issues in Web data mining. Proceedings of Data Warehousing and Knowledge Discovery, First International Conference.
Pfaffenberger, B. (1996). Web search strategies. New York: MIS Press.
Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report.
Rupley, S. (2000). From portals to vortals. PC Magazine.
Sander-Beuermann, W., & Schomburg, M. (1998). Internet information retrieval—The further development of meta-search engine technology. Proceedings of the Internet Summit, Internet Society, Geneva, Switzerland.
Schwartz, E. (2003). Data warehouses get active. InfoWorld, 25(48), 12-13.
Sherman, C., & Price, G. (2003). The invisible Web: Uncovering sources search engines can't see. Library Trends, 52(2), 282-299.
Smith, C.B. (2001). Getting to know the invisible Web. Library Journal, 126(11), 16-19.
Songini, M.L. (2004). ETL. Computerworld, 38(5), 23-24.
Sonnenreich, W., & Macinta, T. (1998). Web developer.com guide to search engines. New York: Wiley Computer Publishing.
Soudah, T. (2000). Search, and you shall find [Electronic version].
Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web [Electronic version].
Sullivan, D. (2000). Invisible Web gets deeper [Electronic version]. The Search Engine Report.
Van Rijsbergen, C.J. (1979). Information retrieval [Electronic version]. In Finding out about [CD-ROM]. Richard Belew.
Wixom, B.H., & Watson, H.J. (2001). An empirical investigation of the factors affecting data warehousing success. MIS Quarterly, 25(1), 17-39.
KEY TERMS
Data Retrieval: Denotes the standardized database methods of matching a set of records, given a particular query (e.g., use of the SQL SELECT command on a database).
Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.
ETL (Extract/Transform/Load): This term specifies a category of software that efficiently handles three essential components of the warehousing process. First, data must be extracted (removed from the originating system), then transformed (reformatted and cleaned), and finally loaded (copied/appended) into the data warehouse database system.
External Data: A broad term indicating data that is external to a particular company. Includes electronic and non-electronic formats.
Information Retrieval: Denotes the attempt to match a set of related documents to a given query using semantic considerations (e.g., library catalogue systems often employ information retrieval techniques).
Internal Data: Previously cleaned warehouse data that originated from the daily information processing systems of a company.
Invisible Web: Denotes those significant portions of the Internet where data is stored which are inaccessible to the major search engines. The invisible Web represents an often ignored/neglected source of potential online information.
Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.
Data Warehousing Solutions for Reporting Problems
Juha Kontio
Turku Polytechnic, Finland
INTRODUCTION Reporting is one of the basic processes in all organizations. It provides information for planning and decision making and, on the other hand, information for analyzing the correctness of the decisions made at the beginning of the process. Reporting is based on the data that the operational information systems contain. Reports can be produced directly from these operational databases, but an operational database is not organized in a way that naturally supports analysis. An alternative way is to organize the data in such a way that supports analysis easily. Typically, this method leads to the introduction of a data warehouse. In the summer of 2002, a multiple case study research was launched in six Finnish organizations (see Table 1). The researchers studied the databases of these organizations and identified the trends in database exploitation. One of the main ideas was to study the diffusion of database innovations. In practice this meant that the researchers described the present database architecture and identified the future plans and present problems. The data for this research was mainly collected with semistructured interviews, and altogether, 54 interviews were arranged. The research processed data of 44 different information systems. Most (40%) of the analyzed information systems were online transaction processing sys-
tems, such as order-entry systems. The second largest category (30%) comprised information systems relating to decision support and reporting. Only one pilot data warehouse was among these systems, but on the other hand, customized reporting systems were used, for example, in SOK, SSP, and OPTI. Reporting was commonly recognized as an area where interviewees were not satisfied and were hoping for improvements. This article focuses on describing the reporting problems that the organizations are facing and explains how they can exploit a data warehouse to overcome these problems.
BACKGROUND The term data warehouse was first introduced as a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions (Inmon, 1992). A simpler definition says that a data warehouse is a store of enterprise data designed to facilitate management decision making (Kroenke, 2004). A data warehouse differs from traditional databases in many ways. Its structure is different than that of traditional databases, and different functionalities are required (Elmasri & Navathe, 2000). The aim is to integrate all corporate information into one repository, where the information is easily accessed, queried, ana-
Table 1. The case organizations (with line of business, private/public status, 2002 turnover in millions of €, and number of employees): SOK Corporation (SOK), a private co-operative society whose main businesses are food and groceries and hardware; Salon Seudun Puhelin, Ltd. (SSP), a private telecommunication company; Statistics Finland (STAT), the public national statistics agency; the State Provincial Office of Western Finland (WEST), a public regional administrative authority; TS-Group, Ltd. (TS), a private printing services and communications corporation; and Optiroc, Ltd. (OPTI), a private building materials company.
Data Warehousing Solutions for Reporting Problems
lyzed, and used as a basis for the reports (Begg & Connolly, 2002). A data warehouse provides decision support to organizations with the help of analytical databases and Online Analytical Processing (OLAP) tools (Gorla, 2003). A data warehouse (see Figure 1) receives data from the operational databases on a regular basis, and new data is added to the existing data. The warehouse contains both detailed aggregated data and summarized data to speed up the queries. It is typically organized in smaller units called data marts, which support the specific analysis needs of a department or business unit (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2001). In the case organizations, the idea of the data warehouse has been discussed, but so far no data warehouses exist, although in one case, a data warehouse pilot is in use. The rationale for these discussions is that at the moment, the reporting and the analyzing possibilities are not serving the organizations very well. Actually, the interviewees identified many problems in reporting. In the SOK Corporation, the interviewees complained that information is distributed in numerous information systems; thus, building a comprehensive view of the information is difficult. Another problem is in financial reporting. A financial report taken from different information systems gives different results, though they should be equal. A reason for this inequality is that the data is not harmonized and processed similarly. In the restaurant business of SOK Corporation, an essential piece of information is the sales figures of the products. It should be able to analyze which, where, and how many products have been bought. In the whole SOK Corporation, analyzing different customers and their behavior in detail is, at the moment, impossible. The interviewees also mentioned that a common database containing all products of the co-operative society might help in reporting, but defining a common classification of the
Figure 1. Data warehousing Valuable information for processes
Operational databases
Data Warehouse Cleaning Reformatting
Other Data Inputs
Summarized data
Analytical tools
Detailed data
Filter
Data Data Data Mart Mart Mart
products will be a demanding task. Centralization of the data is also one topic that has been discussed in SOK Corporation, which has been justified with improvements in reporting. In Salon Seudun Puhelin, Ltd., the interviewees mentioned that the major information system is somehow used inconsistently. Therefore, the data is not consistent and influences the reporting. This company has developed its own reporting application with Microsoft Access, but the program is not capable of managing files over 1 GB, which reduces the possibilities of using the system. According to the interviewees, this limit prevents, for example, the follow-up of daily sales. Another problem concerning the reporting system is that users are incapable of defining their own reports when their needs change. Analyzing customer data is also difficult, because collecting all customer data together is a very burdensome task. Therefore, Salon Seudun Puhelin, Ltd., has also discussed a data warehouse solution for three reasons: a) to get rid of the size limits, b) to provide a system for users where they can easily define new reports, and c) to gain more versatile analysis possibilities. The State Provincial Office of Western Finland is a joint regional administrative authority of seven ministries. One of their yearly responsibilities is to evaluate the basic service in their region. In practice, this responsibility means that they gather and analyze a large amount of data. The first problem is that they have not used a special data management tool. The lack of an adequate tool for data management makes it difficult to do any time-series analysis, which many of the interviewees hoped for. Another problem is that the results should be easily distributed in forms of different reports, but at the moment, this is not the case. In TS-Group, Ltd., a data warehouse pilot has been implemented. The pilot enables versatile reporting and, as the interviewees mentioned, this opportunity should not be lost. However, some reporting problems still exist. For example, the distribution and the format of the reports should be solved. In the department of financial management, the reporting system does not support the latest operating systems and, therefore, only some computers are capable of using the system. In Optiroc, Ltd., reports are generated directly from the operational databases that are not designed for reporting purposes. One problem attendant on this design is that the reports run slowly. In principle, users should also be able to create their reports by themselves, but in reality, only a few of them are able to. Maybe this is one reason that the interviewees presented very critical comments on reporting. The interviewees mentioned as well that the implementation of a data warehouse system is strongly supported and is seen as a solution for prob335
,
Data Warehousing Solutions for Reporting Problems
lems in reporting. A data warehouse was also justified because, with it, the company could serve customers better and could produce customized reports. At the moment, customer reporting is analyzed to develop reporting and to define necessary tools.
MAIN THRUST The case organizations have plans to start exploiting a data warehouse, but before the introduction of a data warehouse, plenty of design must be accomplished in all these cases. Designing a data warehouse requires quite different techniques than the design of an operational database (Golfarelli & Rizzi, 1998). Modeling a data model for a data warehouse is seen as one of the most critical phases in the development process (Bonifati et al., 2001). This data modeling has specific features that distinguish it from normal data modeling (Busborg, Christiansen, & Tryfona, 1999). At the beginning, the content of the operational information systems, the interconnections between them, and the equivalent entities should be understood (Blackwood, 2000). In practice, this entails studying the data models of the operational databases and developing an integrated schema to enhance the data interoperability (Bonifati et al., 2001). The data modeling of a data warehouse is called dimensionality modeling (DM) (Golfarelli & Rizzi, 1998; Begg & Connolly, 2002). Dimensional models were developed to support analytical tasks (Loukas & Spencer, 1999). Dimensionality modeling concentrates on facts and the properties of the facts and dimensions connected to facts (Busborg et al., 1999). Facts are numeric, and quantitative data of the business and dimensions describe different dimensions of the business (Bonifati et al., 2001). Fact tables contain all the business events to be analyzed, and dimension tables define how to analyze fact information (Loukas & Spencer, 1999). The result of the dimensionality modeling is typically presented in a star model or in a snowflake model (Begg & Connolly, 2002). Multidimensional schema (MDS) is a more generic term that collectively refers to both schemas (Martyn, 2004). When a star model is used, the fact tables are normalized, but dimension tables are not. When dimension tables are normalized too, the star model turns into a snowflake model (Bonifati et al., 2001). Ideally, an information system such as a data warehouse should be correct, fast, and friendly (Martyn, 2004). Correctness is especially important in data warehouses to ensure that decisions are based on accurate information. Actually, an estimated 30% to 50% of information in a typical database is either missing or incorrect (Blackwood, 2000). This idea emphasizes the 336
need to pay attention to the quality of the source data in the operational databases (Finnegan & Sammon, 2000). One problem with MDS is that when the business environment changes, the evolution of multidimensional schemas is not as manageable as with normalized schemas (Martyn, 2004). It is also said that any architecture not based on third normalized form can cause the failure of a data warehouse project (Gardner, 1998). On the other hand, a dimensional model provides a better solution for a decision support application than a pure normalized relational model does (Loukas & Spencer, 1999). All of the above is actually related to efficiency, because a large amount of data is processed during analysis. Typically, this is a question about needed joins in the database level. Usually, a star schema is the most efficient design for a data warehouse, because the denormalized tables require fewer joins (Martyn, 2004). However, recent developments in storage technology, access methods (such as bitmap indexes), and query optimization indicate that the performance with the third normalized form should be tested before moving to multidimensional schemas (Martyn, 2004). From this 3NF schema, a natural step toward MDS is to use denormalization, which will support both efficiency and flexibility issues (Finnegan & Sammon, 2000). Still, it is possible to define necessary SQL views on top of the 3NF schema without denormalization (Martyn, 2004). Finally, during the design, the issues of physical and logical design should be separated; physical design is about performance, and logical design is about understandability (Kimball, 2001). The OLAP tools, which are based on MDS views, access data warehouses for complex data analysis and decision support activities (Kambayashi, Kumar, Mohania, & Samtani, 2004). These tools typically include assessing the effectiveness of a marketing campaign, forecasting product sales, and planning capacity. The architecture of the underlying database of the data warehouse categorizes the different analysis tools (Begg & Connolly, 2002). Depending on the schema type, the terms Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and Hybrid OLAP (HOLAP) are used (Kroenke, 2004). ROLAP is a preferable choice when a) the information needs change frequently, b) the information should be as current as possible, and c) the users are sophisticated computer users (Gorla, 2003). The main differences between ROLAP and MOLAP are in the currency of data and in the data storage processing capacity. MOLAP populates its own structure of the original data when it is loaded from the operational databases (Dodds, Hasan, Hyland, & Veeraraghavan, 2000). In MOLAP, the data is stored in a special-purpose MDS (Begg & Connolly, 2002). ROLAP, on the other hand, analyzes the original data, and the
Data Warehousing Solutions for Reporting Problems
users can drill down to the unit data level (Dodds et al., 2000). ROLAP uses a meta-data layer to avoid the creation of an MDS. It typically utilizes the SQL extensions, such as CUBE and ROLLUP in Oracle DBMS (Begg & Connolly, 2002).
FUTURE TRENDS In the future, the SOK Corporation needs to build a comprehensive view of the information. They can achieved this goal first by building an enterprise data model and then by modifying the existing information systems. From the reporting point of view, the operational data should be further modeled into a data warehouse solution. After these steps, the problems relating to reporting should be solved. In order to improve the consistency of the data at Salon Seudun Puhelin, Ltd., most concerns need to be placed on the correctness of the data and the ways users work with the information systems. Until then, exploiting a data warehouse is not rational. For this company, it is reasonable to evaluate the true requirements of a data warehouse for reporting, because other solutions that are easier than starting a data warehouse project might be on hand. In the State Provincial Office of Western Finland (WEST), the most important issue is acquiring a suitable tool for managing the collected data. After that step, the emphasis can shift to reporting, analyzing, and in timeseries. Basically, the environment of WEST is ideal for exploiting a data warehouse. The processes and the functions are heavily dependent on the analysis and the developments in time series. In TS-Group, Ltd., a data warehouse could solve the problems in reporting as well. Introducing a data warehouse would define the standard formats of the reports, which are currently causing a problem. At the same time, the distribution problems of the reports would be solved. In Optiroc, Ltd., a data warehouse can solve the slowness of running reports and the difficulties in analysis. One reason for the slowness is that the reports run directly from the operational databases; moving the origin of the data to a data warehouse might offer improvements. Another problem deals with the analysis and the possibilities to define necessary reports on the fly. Introducing necessary OLAP tools and training the users sufficiently will solve this problem. In Statistics Finland, the interviewees mentioned that reporting is not a problem. However, a data warehouse might offer extra value in data analysis. At the moment, though, data warehousing is not the most acute topic in information technology development in Statistics Finland.
The presented cases reflect the overall situation in different organizations quite well and might predict what will happen in the future. The cases show that organizations have clear problems in analyzing the operational data. They have developed their own applications to ease the reporting but are still living with inadequate solutions. Data warehousing has been identified as a possible solution for the reporting problems, and the future will show whether data warehouses diffuse in the organizations.
CONCLUSION

The case organizations have recognized the possibilities of a data warehouse to solve problems in reporting. Their initiatives are based on business requirements, which is a good starting point because, to be successful, a data warehouse should be a business-driven initiative in partnership with the information technology department (Gardner, 1998; Finnegan & Sammon, 2000). However, the first step should be the analysis of the real needs for a data warehouse. At the same time, the present problems in reporting should be solved if possible.

To really support reporting, the operational data should be organized like that in a data warehouse. In practice, this means studying and analyzing the existing operational databases. As a result, a data model describing the necessary elements of the data warehouse should be achieved. A suggestion is to first produce an enterprise-level data model to describe all the data and their interconnections. This enterprise-level data model will be the basis for the data warehouse design. As the theory suggested at the beginning of this article, a normalized data model with SQL views should be produced and tested. This model can further be denormalized into a snowflake or a star model when performance requirements are not met with the normalized schema.

Producing a data model for the data warehouse is only one part of the process. In addition, special attention should be paid to designing data extraction from the operational databases to the data warehouse. The enterprise-level data model helps in understanding the data in these databases and thus eases the development of data extraction solutions. When the data warehouse is in use, automated tools are necessary to speed up the loading of the operational data into the data warehouse (Finnegan & Sammon, 2000). Before loading data into the data warehouse, the organizations should also analyze the correctness and the quality of the data in the operational databases. Finally, as this article shows, a data warehouse is a relevant alternative for solving reporting problems that
the case organizations are currently facing. After the implementation of a data warehouse, organizations must ensure that the possible users of the system are educated in order to fully take advantage of the new possibilities that the data warehouse offers. Of course, organizations should remember that there is no quick jump to data warehouse exploitation.
REFERENCES
Begg, C., & Connolly, T. (2002). Database systems: A practical guide to design, implementation, and management. Addison-Wesley.
Blackwood, P. (2000). Eleven steps to success in data warehousing. Business Journal, 14(44), 26-27.
Bonifati, A., Cattaneo, F., Ceri, S., Fuggetta, A., & Paraboschi, S. (2001). Designing data marts for data warehouses. ACM Transactions on Software Engineering and Methodology, 10(4), 452-483.
Busborg, F., Christiansen, J. G. B., & Tryfona, N. (1999). StarER: A conceptual model for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.
Dodds, D., Hasan, H., Hyland, P., & Veeraraghavan, R. (2000). Approaches to the development of multidimensional databases: Lessons from four case studies. 31(3), 10-23. Retrieved from the ACM SIGMIS database.
Elmasri, R., & Navathe, S. B. (2000). Fundamentals of database systems. Reading, MA: Addison-Wesley.
Finnegan, P., & Sammon, D. (2000). The ten commandments of data warehousing. 31(4), 82-91. Retrieved from the ACM SIGMIS database.
Gardner, S. R. (1998). Building the data warehouse. Communications of the ACM, 41(9), 52-60.
Golfarelli, M., & Rizzi, S. (1998). A methodological framework for data warehouse design. Proceedings of the ACM International Workshop on Data Warehousing and OLAP, USA.
Gorla, N. (2003). Features to consider in a data warehousing system. Communications of the ACM, 46(11), 111-115.
Inmon, W. H. (1992). Building the data warehouse. New York: Wiley.
Kambayashi, Y., Kumar, V., Mohania, M., & Samtani, S. (2004). Recent advances and research problems in data warehousing. Lecture Notes in Computer Science, 1552, 81-92.
Kimball, R. (2001). A trio of interesting snowflakes. Intelligent Enterprise, 4, 30-32.
Kroenke, D. M. (2004). Database processing: Fundamentals, design and implementation. Upper Saddle River, NJ: Pearson Prentice Hall.
Loukas, T., & Spencer, T. (1999, October). From star to snowflake to ERD: Comparing data warehouse design approaches. Enterprise Systems.
Martyn, T. (2004). Reconsidering multi-dimensional schemas. ACM SIGMOD Record, 33(1), 83-88.

KEY TERMS

Data Extraction: A process in which data is transferred from operational databases to a data warehouse.

Dimensionality Modeling: A logical design technique that aims to present data in a standard, intuitive form that allows for high-performance access.

Normalization/Denormalization: Normalization is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise. Denormalization is a step backward in the normalization process, for example, to improve performance.

OLAP: The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data.

Snowflake Model: A variant of the star schema, in which dimension tables do not contain denormalized data.

Star Model: A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data.
Database Queries, Data Mining, and OLAP
Lutz Hamel University of Rhode Island, USA
INTRODUCTION Modern, commercially available relational database systems now routinely include a cadre of data retrieval and analysis tools. Here we shed some light on the interrelationships between the most common tools and components included in today’s database systems: query language engines, data mining components, and online analytical processing (OLAP) tools. We do so by pairwise juxtaposition, which will underscore their differences and highlight their complementary value.
BACKGROUND

Today's commercially available relational database systems now routinely include tools such as SQL database query engines, data mining components, and OLAP (Craig, Vivona, & Bercovitch, 1999; Oracle, 2001; Scalzo, 2003; Seidman, 2001). These tools allow developers to construct high-powered business intelligence (BI) applications which are not only able to retrieve records efficiently but also support sophisticated analyses such as customer classification and market segmentation. However, with powerful tools so tightly integrated with the database technology, understanding the differences between these tools and their comparative advantages and disadvantages becomes critical for effective application development. From the practitioner's point of view, questions like the following often arise:
• Is running database queries against large tables considered data mining?
• Can data mining and OLAP be considered synonymous?
• Is OLAP simply a way to speed up certain SQL queries?
The issue is complicated even further by the fact that data analysis tools are often implemented in terms of data retrieval functionality. Consider the data mining models in Microsoft SQL Server, which are implemented through extensions to the SQL database query language (e.g., predict join) (Seidman, 2001), or the proposed SQL extensions to enable decision tree classifiers (Sattler & Dunemann, 2001). OLAP cube
definition is routinely accomplished via the data definition language (DDL) facilities of SQL by specifying either a star or snowflake schema (Kimball, 1996).
MAIN THRUST

The following sections contain the pairwise comparisons between the tools and components considered in this chapter.
Database Queries vs. Data Mining

Virtually all modern, commercial database systems are based on the relational model formalized by Codd in the 1960s and 1970s (Codd, 1970) and the SQL language (Date, 2000) which allows the user to efficiently and effectively manipulate a database. In this model a database table is a representation of a mathematical relation, that is, a set of items that share certain characteristics or attributes. Here, each table column represents an attribute of the relation and each record in the table represents a member of this relation. In relational databases the tables are usually named after the kind of relation they represent. Figure 1 is an example of a table that represents the set or relation of all the customers of a particular store. In this case the store tracks the total amount of money spent by its customers.

Figure 1. A relational database table representing customers of a store

Id   Name      ZIP     Sex   Age   Income    Children   Car        Total Spent
5    Peter     05566   M     35    $40,000   2          Mini Van   $250.00
…    …         …       …     …     …         …          …          …
22   Maureen   04477   F     26    $55,000   0          Coupe      $50.00

Relational databases do not only allow for the creation of tables but also for the manipulation of the tables and the data within them. The most fundamental operation on a database is the query. This operation enables the user to retrieve data from database tables by asserting that the retrieved data needs to fulfill certain criteria. As an example, consider the fact that the storeowner might be interested in finding out which customers spent more than $100 at the store. The following query
returns all the customers from the above customer table that spent more than $100:

SELECT * FROM CUSTOMER_TABLE WHERE TOTAL_SPENT > $100;

This query returns a list of all instances in the table where the value of the attribute Total Spent is larger than $100. As this example highlights, queries act as filters that allow the user to select instances from a table based on certain attribute values. It does not matter how large or small the database table is; a query will simply return all the instances from a table that satisfy the attribute value constraints given in the query. This straightforward approach to retrieving data from a database also has a drawback. Assume for a moment that our example store is a large store with tens of thousands of customers (perhaps an online store). Firing the above query against the customer table in the database will most likely produce a result set containing a very large number of customers, and not much can be learned from this query except for the fact that a large number of customers spent more than $100 at the store. Our innate analytical capabilities are quickly overwhelmed by large volumes of data.

This is where the differences between querying a database and mining a database surface. In contrast to a query, which simply returns the data that fulfills certain constraints, data mining constructs models of the data in question. The models can be viewed as high-level summaries of the underlying data and are in most cases more useful than the raw data, since in a business sense they usually represent understandable and actionable items (Berry & Linoff, 2004). Depending on the questions of interest, data mining models can take on very different forms. They include decision trees and decision rules for classification tasks, association rules for market basket analysis, as well as clustering for market segmentation, among many other possible models. Good overviews of current data mining techniques and models can be found in Berry & Linoff (2004), Han & Kamber (2001), Hand, Mannila, & Smyth (2001), and Hastie, Tibshirani, & Friedman (2001). To continue our store example, in contrast to a query, a data mining algorithm that constructs decision rules might return the following set of rules for customers that spent more than $100 from the store database:

IF AGE > 35 AND CAR = MINIVAN THEN TOTAL SPENT > $100
OR
IF SEX = M AND ZIP = 05566 THEN TOTAL SPENT > $100
These rules are understandable because they summarize hundreds, possibly thousands, of records in the customer database, and it would be difficult to glean this information from the query result. The rules are also actionable. Consider that the first rule tells the storeowner that adults over the age of 35 that own a minivan are likely to spend more than $100. Having access to this information allows the storeowner to adjust the inventory to cater to this segment of the population, assuming that this represents a desirable cross-section of the customer base. The second rule is similar: male customers that reside in a certain ZIP code are likely to spend more than $100. Looking at census information for this particular ZIP code, the storeowner could again adjust the store inventory to also cater to this population segment, presumably increasing the attractiveness of the store and thereby increasing sales. As we have shown, the fundamental difference between database queries and data mining is the fact that in contrast to queries data mining does not return raw data that satisfies certain constraints, but returns models of the data in question. These models are attractive because in general they represent understandable and actionable items. Since no such modeling ever occurs in database queries we do not consider running queries against database tables as data mining, no matter how large the tables are.
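To make the contrast concrete, the sketch below first filters a toy customer table the way the SQL query does and then induces a small decision tree over the same rows. It is a minimal illustration that assumes a Python environment with scikit-learn available; the feature encoding and the toy values are invented and are not the article's data.

    # "Query": a filter that returns raw matching rows, however many there are.
    # "Data mining": a learner that returns a model (readable rules) summarizing the rows.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # toy customer rows encoded as [age, owns_minivan (0/1), zip_is_05566 (0/1)]
    rows = [[45, 1, 0], [52, 1, 0], [30, 0, 1], [28, 0, 1],
            [23, 0, 0], [38, 0, 0], [61, 1, 1], [19, 0, 0]]
    total_spent = [250, 310, 140, 120, 40, 60, 400, 25]
    spent_over_100 = [1 if t > 100 else 0 for t in total_spent]

    # query-style filtering: raw instances back, no summary
    big_spenders = [r for r, t in zip(rows, total_spent) if t > 100]

    # mining-style modeling: a compact, actionable summary of who the big spenders are
    tree = DecisionTreeClassifier(max_depth=2).fit(rows, spent_over_100)
    print(export_text(tree, feature_names=["age", "owns_minivan", "zip_is_05566"]))

The printed tree plays the role of the IF-THEN rules above: it compresses the table into a few conditions instead of echoing rows back.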
Database Queries vs. OLAP

In a typical relational database, queries are posed against a set of normalized database tables in order to retrieve instances that fulfill certain constraints on their attribute values (Date, 2000). The normalized tables are usually associated with each other via primary/foreign keys. For example, a normalized database of our store with multiple store locations or sales units might look something like the database given in Figure 2. Here, PK and FK indicate primary and foreign keys, respectively. From a user perspective it might be interesting to ask some of the following questions:
• How much did sales unit A earn in January?
• How much did sales unit B earn in February?
• What was their combined sales amount for the first quarter?
Even though it is possible to extract this information with standard SQL queries from our database, the normalized nature of the database makes the formulation of the appropriate SQL queries very difficult. Furthermore, the query process is likely to be slow, because it must perform complex joins and multiple scans of entire database tables in order to compute the desired aggregates.

Figure 2. Normalized database schema for a store (Source: Craig et al., 1999, Figure 3.2)

By rearranging the database tables in a slightly different manner and using a process called pre-aggregation or computing cubes, the above questions can be answered with much less computational power, enabling a real-time analysis of aggregate attribute values – OLAP (Craig et al., 1999; Kimball, 1996; Scalzo, 2003). In order to enable OLAP, the database tables are usually arranged into a star schema, where the innermost table is called the fact table and the outer tables are called dimension tables. Figure 3 shows a star schema representation of our store organized along the main dimensions of the store business: customers, sales units, products, and time. The dimension tables give rise to the dimensions in the pre-aggregated data cubes. The fact table relates the dimensions to each other and specifies the measures that are to be aggregated. Here the measures are "dollar_total," "sales_tax," and "shipping_charge."

Figure 3. Star schema for a store database (Source: Craig et al., 1999, Figure 3.3)

Figure 4 shows a three-dimensional data cube pre-aggregated from the star schema in Figure 3 (in this cube we ignored the customer dimension, since it is difficult to illustrate four-dimensional cubes). In the cube building process the measures are aggregated along the smallest unit in each dimension, giving rise to small pre-aggregated segments in a cube. Data cubes can be seen as a compact representation of pre-computed query results1. Essentially, each segment in a data cube represents a pre-computed query result to a particular query within a given star schema. The efficiency of cube querying allows the user to interactively move from one segment in the cube to another, enabling the inspection of query results in real time. Cube querying also allows the user to group and ungroup segments, as well as project segments onto given dimensions. This corresponds to such OLAP operations as roll-ups, drill-downs, and slice-and-dice, respectively (Gray, Bosworth, Layman, & Pirahesh, 1997). These specialized operations in turn provide answers to the kind of questions mentioned above. As we have seen, OLAP is enabled by organizing a relational database in a way that allows for the pre-aggregation of certain query results. The resulting data cubes hold the pre-aggregated results, giving the user the ability to analyze these aggregated results in real time using specialized OLAP operations. In a larger context we can view OLAP as a methodology for the organization of databases along the dimensions of a business, making the database more comprehensible to the end user.
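As a small illustration of pre-aggregation, the sketch below builds one face of such a cube with pandas; the fact table, its column names, and the amounts are invented for the example, and the pivot is only a stand-in for a real cube engine.

    import pandas as pd

    # a tiny fact table with two dimensions (month, sales_unit) and one measure
    fact = pd.DataFrame({
        "month":        ["Jan", "Jan", "Feb", "Feb", "Mar"],
        "sales_unit":   ["A",   "B",   "A",   "B",   "A"],
        "dollar_total": [100.0,  80.0, 120.0,  90.0,  70.0],
    })

    # pre-aggregate dollar_total by month x sales_unit, with roll-up margins
    cube = fact.pivot_table(index="month", columns="sales_unit",
                            values="dollar_total", aggfunc="sum",
                            margins=True, margins_name="All")

    print(cube.loc["Jan", "A"])    # how much did sales unit A earn in January?
    print(cube.loc["All", "All"])  # grand total: a roll-up over both dimensions

Each cell of the resulting table is a pre-computed query result, so answering the questions above becomes a lookup rather than a join over normalized tables.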
Figure 4. A three-dimensional data cube

Data Mining vs. OLAP

Is OLAP data mining? As we have seen, OLAP is enabled by a change to the data definition of a relational
database in such a way that it allows for the pre-computation of certain query results. OLAP itself is a way to look at these pre-aggregated query results in real time. However, it is still simply a way to evaluate queries, which is different from building models of the data as in data mining. Therefore, from a technical point of view we cannot consider OLAP to be data mining. Where data mining tools model data and return actionable rules, OLAP allows users to compare and contrast measures along business dimensions in real time. It is interesting to note that recently a tight integration of data mining and OLAP has occurred. For example, Microsoft SQL Server 2000 not only allows OLAP tools to access the data cubes but also enables its data mining tools to mine data cubes (Seidman, 2001).
FUTURE TRENDS

Perhaps the most important trend in the area of data mining and relational databases is the liberation of data mining tools from the "single table requirement." This new breed of data mining algorithms is able to take advantage of the full relational structure of a relational database, obviating the need to construct a single table that contains all the information to be used in the data mining task (Džeroski & Lavrač, 2001). This allows data mining tasks to be represented naturally in terms of the actual database structures, as for example in Yin, Han, Yang, & Yu (2004), and also allows for a natural and tight integration of data mining tools with relational databases.
CONCLUSION Modern, commercially available relational database systems now routinely include a cadre of data retrieval and analysis tools. Here, we briefly described and contrasted the most often-bundled tools: SQL database query engines, data mining components, and OLAP tools. Contrary to many assertions in the literature and business press, performing queries on large tables or manipulating query data via OLAP tools is not considered data mining due to the fact that no data modeling occurs in these tools. On the other hand, these three tools complement each other and allow developers to pick the tool that is right for their application: queries allow ad hoc access to virtually any instance in a database; data mining tools can generate high-level, actionable summaries of data residing in database tables; and OLAP allows for real-time access to pre-aggregated measures along important business dimensions. In this light it does
not seem surprising that all three tools are now routinely bundled.
REFERENCES
Berry, M.J.A., & Linoff, G.S. (2004). Data mining techniques: For marketing, sales, and customer relationship management (2nd ed.). New York: John Wiley & Sons.
Codd, E.F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6), 377-387.
Craig, R.S., Vivona, J.A., & Bercovitch, D. (1999). Microsoft data warehousing. New York: John Wiley & Sons.
Date, C.J. (2000). An introduction to database systems (7th ed.). Reading, MA: Addison-Wesley.
Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin, New York: Springer.
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1997). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), 29-53.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers.
Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J.H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Kimball, R. (1996). The data warehouse toolkit: Practical techniques for building dimensional data warehouses. New York: John Wiley & Sons.
Oracle. (2001). Oracle 9i Data Mining. Oracle White Paper.
Pendse, N. (2001). Multidimensional data structures. Retrieved from http://www.olapreport.com/MDStructures.htm
Sattler, K., & Dunemann, O. (2001, November 5-10). SQL database primitives for decision tree classifiers. Paper presented at the 10th International Conference on Information and Knowledge Management, Atlanta, Georgia.
Scalzo, B. (2003). Oracle DBA guide to data warehousing and star schemas. Upper Saddle River, NJ: Prentice Hall PTR.
Seidman, C. (2001). Data mining with Microsoft SQL Server 2000 technical reference. Microsoft Press.
Yin, X., Han, J., Yang, J., & Yu, P.S. (2004). CrossMine: Efficient classification across multiple database relations. Paper presented at the 20th International Conference on Data Engineering (ICDE 2004), Boston, MA, USA.
KEY TERMS

Business Intelligence: Business intelligence (BI) is a broad category of technologies that allows for gathering, storing, accessing and analyzing data to help business users make better decisions. (Source: http://www.oranz.co.uk/glossary_text.htm)

Data Cubes: Also known as OLAP cubes. Data stored in a format that allows users to perform fast multidimensional analysis across different points of view. The data is often sourced from a data warehouse and relates to a particular business function. (Source: http://www.oranz.co.uk/glossary_text.htm)

Normalized Database: A database design that arranges data in such a way that it is held at its lowest level, avoiding redundant attributes, keys, and relationships. (Source: http://www.oranz.co.uk/glossary_text.htm)

OLAP (Online Analytical Processing): A category of applications and technologies for collecting, managing, processing and presenting multidimensional data for analysis and management purposes. (Source: http://www.olapreport.com/glossary.htm)

Query: This term generally refers to databases. A query is used to retrieve database records that match certain criteria. (Source: http://usa.visa.com/business/merchants/online_trans_glossary.html)

SQL (Structured Query Language): SQL is a standardized programming language for defining, retrieving, and inserting data objects in relational databases.

Star Schema: A database design that is based on a central detail fact table linked to surrounding dimension tables. Star schemas allow access to data using business terms and perspectives. (Source: http://www.ds.uillinois.edu/glossary.asp)
ENDNOTE
1. Another interpretation of data cubes is as an effective representation of multidimensional data along the main business dimensions (Pendse, 2001).
Database Sampling for Data Mining
Patricia E.N. Lutu University of Pretoria, South Africa
INTRODUCTION In data mining, sampling may be used as a technique for reducing the amount of data presented to a data mining algorithm. Other strategies for data reduction include dimension reduction, data compression, and discretisation. For sampling, the aim is to draw, from a database, a random sample, which has the same characteristics as the original database. This chapter looks at the sampling methods that are traditionally available from the area of statistics, how these methods have been adapted to database sampling in general and database sampling for data mining in particular.
BACKGROUND Given the rate at which database/data warehouse sizes are growing, attempts at creating faster/more efficient algorithms that can process massive data sets may eventually become futile exercises. Modern database and data warehouse sizes are in the region of 10s or 100s of terabytes, and sizes continue to grow. A query issued on such a database/data warehouse could easily return several millions of records. While the costs of data storage continue to decrease, the analysis of data continues to be hard. This is the case for even traditionally simple problems requiring aggregation, for example, the computation of a mean value for some database attribute. In the case of data mining, the computation of very sophisticated functions, on very large numbers of database records, can take several hours, or even days. For inductive algorithms, the problem of lengthy computations is compounded by the fact that many iterations are needed in order to measure the training accuracy as well as the generalization accuracy. There is plenty of evidence to suggest that, for inductive data mining, the learning curve flattens after only a small percentage of the available data from a large data set has been processed (Catlett, 1991; Kohavi, 1996; Provost et al., 1999). The problem of overfitting (Dietterich, 1995) also dictates that the mining of massive data sets should be avoided. The sampling of databases has been studied by researchers for some time. For data mining, sampling should be used as a data reduction technique, allowing a
data set to be represented by a much smaller random sample that can be processed much more efficiently.
MAIN THRUST There are a number of key issues to be considered before obtaining a suitable random sample for a data mining task. It is essential to understand the strengths and weaknesses of each sampling method. It is also essential to understand which sampling methods are more suitable to the type of data to be processed and the data mining algorithm to be employed. For research purposes, we need to look at a variety of sampling methods used by statisticians, and attempt to adapt them to sampling for data mining.
Some Basics of Statistical Sampling Theory In statistics, the theory of sampling, also known as statistical estimation or the representative method, deals with the study of suitable methods of selecting a representative sample of a population, in order to study or estimate values of specific characteristics of the population (Neyman, 1934). Since the characteristics being studied can only be estimated from the sample, confidence intervals are calculated to give the range of values within which the actual value will fall, with a given probability. There are a number of sampling methods discussed in the literature, for example the book by Rao (Rao, 2000). Some methods appear to be more suited to database sampling than others. Simple random sampling (SRS), stratified random sampling, and cluster sampling are three such methods. Simple random sampling involves selecting at random elements of the population, P, to be studied. The method of selection may be either with replacement (SRSWR) or without replacement (SRSWOR). For very large populations, however, SRSWR and SRSWOR are equivalent. For simple random sampling, the probabilities of inclusion of the elements may or may not be uniform. If the probabilities are not uniform then a weighted random sample is obtained. The second method of sampling is stratified random sampling. Here, before the samples are drawn, the population P is divided into several strata, p1, p2,.. p k, and the
sample S is composed of k partial samples s1, s 2,.., sk, each drawn randomly, with replacement or not, from one of the strata. Rao (2000) discusses several methods of allocating the number of sampled elements for each stratum. Bryant et al. (1960) argue that, if the sample is allocated to the strata in proportion to the number of elements in the strata, it is virtually certain that the stratified sample estimate will have a smaller variance than a simple random sample of the same size. The stratification of a sample may be done according to one criterion. Most commonly though, there are several alternative criteria that may be used for stratification. When this is the case, the different criteria may all be employed to achieve multi-way stratification. Neyman (1934) argues that there are situations when it is very difficult to use an individual unit as the unit of sampling. For such situations, the sampling unit should be a group of elements, and each stratum should be composed of several groups. In comparison with stratified random sampling, where samples are selected from each stratum, in cluster sampling a sample of clusters is selected and observations/measurements are made on the clusters. Cluster sampling and stratification may be combined (Rao, 2000).
Database Sampling Methods Database sampling has been practiced for many years for purposes of estimating aggregate query results, database auditing, query optimization, and, obtaining samples for further statistical processing (Olken, 1993). Static sampling (Olken, 1993) and adaptive (dynamic) sampling (Haas & Swami, 1992) are two alternatives for obtaining samples for data mining tasks. In recent years, many studies have been conducted in applying sampling to inductive and non-inductive data mining (John & Langley, 1996; Provost et al., 1999; Toivonen, 1996).
Simple Random Sampling Simple random sampling is by far, the simplest method of sampling a database. Simple random sampling may be implemented using sequential random sampling or reservoir sampling. For sequential random sampling, the problem is to draw a random sample of size n without replacement, from a file containing N records. The simplest sequential random sampling method is due to Fan et al. (1962) and Jones (1962). An independent uniform random variate [from the uniform interval (0,1)] is generated for each record in the file to determine whether the record should be included in the sample. If m records have already been chosen from among the first t records in the file, the (t+1)st record is chosen with probability (RQsize/RMsize), where RQsize = (n-m) is the number of
records that still need to be chosen for the sample, and RMsize = (N-t) is the number of records in the file still to be processed. This sampling method is commonly referred to as method S (Vitter, 1987). The reservoir sampling method (Fan et al., 1962; Jones, 1962; Vitter, 1985, 1987) is a sequential sampling method over a finite population of database records, with an unknown population size. Olken (1993) discuss its use in sampling of database query outputs on the fly. This technique produces a sample of size S, by initially placing the first S records of the database/file/query in the reservoir. For each subsequent kth database record, that record is accepted with probability S/k. If accepted, it replaces a randomly selected record in the reservoir. Acceptance/Rejection sampling (A/R sampling) can be used to obtain weighted samples (Olken, 1993). For a weighted random sample, the probabilities of inclusion of the elements of the population are not uniform. For database sampling, the inclusion probability of a data record is proportional to some weight calculated from the record’s attributes. Suppose that one database record rj is to be drawn from a file of n records with the probability of inclusion being proportional to the weight wj. This may be done by generating a uniformly distributed random integer 1 ≤ j ≤ n and then accepting the sampled record rj with probability αj = w j / w max, where wmax is the maximum possible value for wj. The acceptance test is performed by generating another uniform random variate uj, 0 ≤ uj ≤ 1, and accepting rj iff uj < α j. If r j is rejected, the process is repeated until some rj is accepted.
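As an illustration of the reservoir technique described above, here is a minimal Python sketch of the basic algorithm; the record source is an ordinary Python iterable standing in for a file or query result of unknown length.

    import random

    def reservoir_sample(records, s, rng=None):
        """Uniform random sample of size s from a stream of unknown length (one pass)."""
        rng = rng or random.Random()
        reservoir = []
        for k, rec in enumerate(records, start=1):
            if k <= s:
                reservoir.append(rec)        # the first s records fill the reservoir
            else:
                j = rng.randint(1, k)        # the k-th record is accepted with probability s/k
                if j <= s:
                    reservoir[j - 1] = rec   # it replaces a randomly chosen reservoir slot
        return reservoir

    print(reservoir_sample(range(1, 100001), 5))

The acceptance/rejection scheme for weighted samples follows the same spirit: pick a candidate record rj at random, accept it with probability wj/wmax, and retry until some record is accepted.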
Stratified Sampling

Density biased sampling (Palmer & Faloutsos, 2000) is a method that combines clustering and stratified sampling. In density biased sampling, the aim is to sample so that within each cluster points are selected uniformly, the sample is density preserving, and the sample is biased by cluster size. Density preserving in this context means that the expected sum of weights of the sampled points for each cluster is proportional to the cluster's size. Since it would be infeasible to determine the clusters a priori, groups are used instead to represent all the regions in n-dimensional space. Sampling is then done to be density preserving for each group. The groups are formed by "placing" a d-dimensional grid over the data. In the d-dimensional grid, the d dimensions of each cell are labeled either with a bin value for numeric attributes, or by a discrete value for categorical attributes. The d-dimensional grid defines the strata for multi-way stratified sampling. A one-pass algorithm is used to perform the weighted sampling, based on the reservoir algorithm.
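The sketch below illustrates only the grid-and-strata idea in a simplified form: records are binned into grid cells and each cell contributes to the sample in proportion to its size. This is plain proportional stratified sampling over grid cells, not the full density-biased weighting of Palmer and Faloutsos (2000); the grid width and the data are invented.

    import random
    from collections import defaultdict

    def grid_stratified_sample(records, cell_of, n, rng=None):
        rng = rng or random.Random()
        strata = defaultdict(list)
        for rec in records:
            strata[cell_of(rec)].append(rec)     # assign each record to its grid cell (stratum)
        total = sum(len(members) for members in strata.values())
        sample = []
        for members in strata.values():
            quota = max(1, round(n * len(members) / total))   # proportional allocation
            sample.extend(rng.sample(members, min(quota, len(members))))
        return sample   # approximately n records; small cells always contribute at least one

    # hypothetical usage: stratify 2-D points by a grid cell of width 10
    points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000)]
    sample = grid_stratified_sample(points,
                                    cell_of=lambda p: (int(p[0] // 10), int(p[1] // 10)),
                                    n=50)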
Adaptive Sampling

Lipton et al. (1990) use adaptive sampling, also known as sequential sampling, for database sampling. In sequential sampling, a decision is made after each sampled element whether to continue sampling. Olken (1993) has observed that sequential sampling algorithms outperform conventional single-stage algorithms in terms of the number of sample elements required, since they can adjust the sample size to the population parameters. Haas & Swami (1992) have proved that sequential sampling uses the minimum sample size for the required accuracy. John & Langley (1996) have proposed a method they call dynamic sampling, which combines database sampling with the estimation of classifier accuracy. The method is most efficiently applied to classification algorithms that are incremental, for example naïve Bayes and artificial neural-network algorithms such as backpropagation. They define the concept of "probably close enough" (PCE), which they use for determining when a sample size provides an accuracy that is probably good enough. "Good enough" in this context means that there is a small probability δ that the mining algorithm could do better by using the entire database. The smallest sample size n is chosen from a database of size N, so that: Pr(acc(N) − acc(n) > ε) <= δ, where acc(n) is the accuracy after processing a sample of size n, and ε is a parameter that describes what "close enough" means. The method works by gradually increasing the sample size n until the PCE condition is satisfied. Provost, Jensen, & Oates (1999) use progressive sampling, another form of adaptive sampling, and analyse its efficiency relative to induction with all available examples. The purpose of progressive sampling is to establish nmin, the size of the smallest sufficient sample. They address the issue of convergence, where convergence means that a learning algorithm has reached its plateau of accuracy. In order to detect convergence, they define the notion of a sampling schedule S as S = {n0, n1, ..., nk}, where ni is an integer that specifies the size of the sample, and S is a sequence of sample sizes to be provided to an inductive algorithm. They show that schedules in which ni increases geometrically, as in {n0, a·n0, a^2·n0, ..., a^k·n0}, are asymptotically optimal. As one can see, progressive sampling is similar to the adaptive sampling method of John & Langley (1996), except that a non-linear increment for the sample size is used.
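A rough sketch of the progressive sampling loop is given below. The learner, its accuracy estimate, and the stopping tolerance are placeholders for whatever induction algorithm and evaluation scheme are actually used; only the geometric schedule and the plateau test come from the description above.

    def progressive_sample_size(train_and_score, N, n0=100, a=2, tol=0.005):
        """Grow the sample as n0, a*n0, a^2*n0, ... until accuracy stops improving."""
        prev_acc, n = -1.0, n0
        while n <= N:
            acc = train_and_score(n)       # accuracy of a model built from n examples
            if acc - prev_acc <= tol:      # learning curve has (approximately) plateaued
                return n, acc
            prev_acc, n = acc, n * a       # geometric schedule
        return N, prev_acc                 # fell back to the full database

    # hypothetical usage with a made-up learning curve
    n_min, acc = progressive_sample_size(lambda n: 1.0 - 1.0 / (n ** 0.5), N=1000000)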
THE SAMPLING PROCESS Several decisions need to be made when sampling a database. One needs to decide on a sampling method, a
suitable sample size, and the level of accuracy that can be tolerated. These issues are discussed below.
Deciding on a Sampling Method

The data to be sampled may be balanced, imbalanced, clustered or unclustered. These characteristics will affect the quality of the sample obtained. While simple random sampling is very easy to implement, it may produce non-representative samples for data that is imbalanced or clustered. On the other hand, stratified sampling, with a good choice of strata cells, can be used to produce representative samples, regardless of the characteristics of the data. Implementation of one-way stratification should be straightforward; however, for multi-way stratification, there are many considerations to be made. For example, in density-biased sampling, a d-dimensional grid is used. Suppose each dimension has n possible values (or bins). The multi-way stratification will result in n^d strata cells. For large d, this is a very large number of cells. When it is not easy to estimate the sample size in advance, adaptive (or dynamic) sampling may be employed, if the data mining algorithm is incremental.
Determining the Representative Sample Size For static sampling, the question must be asked: “What is the size of a representative sample?” A sample is considered statistically valid if it is sufficiently similar to the database from where it is drawn (John & Langley, 1996). Univariate sampling may be used to test that each field in the sample comes from the same distribution as the parent database. For categorical fields, the chi-squared test can be used to test the hypothesis that the sample and the database come from the same distribution. For continuous-valued fields, a “large-sample” test can be used to test the hypothesis that the sample and the database have the same mean. It must however be pointed out that obtaining fixed size representative samples from a database is not a trivial task, and consultation with a statistician is recommended. For inductive algorithms, the results from the theory of probably approximately correct (PAC) learning have been suggested in the literature (Valiant, 1984; Haussler, 1990). These have however been largely criticized for overestimating the sample size (e.g., Haussler, 1990). For incremental inductive algorithms, dynamic sampling (John & Langley, 1996; Provost et al., 1999) may be employed to determine when a sufficient sample has been processed. For association rule mining, the methods described by Toivonen (1996) may be used to determine the sample size.
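For the categorical case, the representativeness check can be scripted directly. The sketch below assumes SciPy is available and uses invented counts for a single field; it tests whether the field's distribution in the sample is consistent with its distribution in the database.

    from scipy.stats import chisquare

    db_counts     = {"M": 60000, "F": 40000}   # field distribution in the full database
    sample_counts = {"M": 615,   "F": 385}     # field distribution in the drawn sample

    n = sum(sample_counts.values())
    total = sum(db_counts.values())
    categories = sorted(db_counts)
    f_obs = [sample_counts.get(c, 0) for c in categories]
    f_exp = [n * db_counts[c] / total for c in categories]

    stat, p_value = chisquare(f_obs, f_exp)
    print(p_value)   # a small p-value suggests the sample is not representative for this field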
Determining the Accuracy and Confidence of Results Obtained from Samples

For the task of inductive data mining, suppose we have estimated the classification error for a classifier constructed from sample S, error_S(h), as the proportion of the test examples that are misclassified. Statistical theory, based on the central limit theorem, enables us to conclude that, with approximately N% probability, the true error lies in the interval:

error_S(h) ± Z_N * sqrt( error_S(h) (1 − error_S(h)) / n )
As an example, for a 95% probability, the value of Z_N is 1.96. Note that error_S(h) is the mean error and error_S(h)(1 − error_S(h))/n is the variance of the error. Mitchell (1997) gives a detailed discussion of the estimation of classifier accuracy and confidence intervals. For non-inductive data mining algorithms, there are known ways of estimating the error and confidence in the results obtained from samples. For association rule mining, for example, Toivonen (1996) states that the lower bound for the size of the sample, given an error bound ε and a maximum probability δ, is given by:

|s| ≥ (1 / (2ε²)) * ln(2/δ)
The value ε is the error in estimating the frequencies of frequent item sets for some given set of attributes.
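The two formulas can be evaluated directly; the short sketch below plugs in illustrative values (a classification error of 0.15 measured on 2,000 test examples, and ε = δ = 0.01 for the association rule bound).

    from math import sqrt, log, ceil

    # 95% confidence interval for the true classification error
    err, n, z = 0.15, 2000, 1.96
    half_width = z * sqrt(err * (1 - err) / n)
    print(err - half_width, err + half_width)

    # Toivonen's lower bound on the sample size: |s| >= (1 / (2 * eps**2)) * ln(2 / delta)
    eps, delta = 0.01, 0.01
    print(ceil(log(2 / delta) / (2 * eps ** 2)))   # roughly 26,500 records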
FUTURE TRENDS

There are two main thrusts in research on establishing the sufficient sample size for a data mining task. The theoretical approach, most notably the PAC framework and the Vapnik-Chervonenkis (VC) dimension, has produced results that are largely criticized by data mining researchers and practitioners. However, research will continue in this area, and may eventually yield practical guidelines. On the empirical front, researchers will continue to conduct simulations that will lead to generalizations on how to establish sufficient sample sizes.
CONCLUSIONS Modern databases and data warehouses are massive, and will continue to grow in size. It is essential for researchers to investigate data reduction techniques that can greatly
reduce the amount of data presented to a data mining algorithm. More than fifty years of research has produced a variety of sampling algorithms for database tables and query result sets. The selection of a sampling method should depend on the nature of the data as well as the algorithm to be employed. Estimation of sample sizes for static sampling is a tricky issue. More research in this area is needed in order to provide practical guidelines. Stratified sampling would appear to be a versatile approach to sampling any type of data. However, more research is needed to address especially the issue of how to define the strata for sampling.
REFERENCES
Bryant, E.C., Hartley, H.O., & Jessen, R.J. (1960). Design and estimation in two-way stratification. Journal of the American Statistical Association, 55, 105-124.
Catlett, J. (1991). Megainduction: A test flight. In Proceedings of the Eighth Workshop on Machine Learning (pp. 596-599). San Mateo, CA: Morgan Kaufman.
Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM Computing Surveys, 27(3), 326-327.
Fan, C., Muller, M., & Rezucha, I. (1962). Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57, 387-402.
Haas, P.J., & Swami, A.N. (1992). Sequential sampling procedures for query size estimation. IBM Technical Report RJ 8558. IBM Almaden.
Haussler, D. (1990). Probably approximately correct learning. National Conference on Artificial Intelligence.
John, G.H., & Langley, P. (1996). Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press.
Jones, T. (1962). A note on sampling from tape files. Communications of the ACM, 5, 343.
Kohavi, R. (1996). Scaling up the accuracy of naïve-Bayes classifiers: A decision tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Lipton, R., Naughton, J., & Schneider, D. (1990). Practical selectivity estimation through adaptive sampling. In ACM SIGMOD International Conference on the Management of Data (pp. 1-11).
Mitchell, T.M. (1997). Machine learning. McGraw-Hill.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-625.
Olken, F. (1993). Random sampling from databases. PhD thesis. University of California at Berkeley.
Palmer, C.R., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. In Proceedings of the ACM SIGMOD Conference (pp. 82-92).
Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 23-32), San Diego, CA.
Rao, P.S.R.S. (2000). Sampling methodologies with applications. FL: Chapman & Hall/CRC.
Toivonen, H. (1996). Sampling large databases for association rules. In Proceedings of the Twenty-Second Conference on Very Large Databases (VLDB96), Mumbai, India.
Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142.
Vitter, J.S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11, 37-57.
Vitter, J.S. (1987). An efficient method for sequential random sampling. ACM Transactions on Mathematical Software, 13(1), 58-67.

KEY TERMS

Cluster Sampling: In cluster sampling, a sample of clusters is selected and observations/measurements are made on the clusters.

Density-Biased Sampling: A database sampling method that combines clustering and stratified sampling.

Dynamic Sampling (Adaptive Sampling): A method of sampling where sampling and processing of data proceed in tandem. After processing each incremental part of the sample, a decision is made whether to continue sampling or not.

Reservoir Sampling: A database sampling method that implements uniform random sampling on a database table of unknown size, or a query result set of unknown size.

Sequential Random Sampling: A database sampling method that implements uniform random sampling on a database table whose size is known.

Simple Random Sampling: Simple random sampling involves selecting at random elements of the population to be studied. The sample S is obtained by selecting at random single elements of the population P.

Simple Random Sampling with Replacement (SRSWR): A method of simple random sampling where an element stands a chance of being selected more than once.

Simple Random Sampling without Replacement (SRSWOR): A method of simple random sampling where each element stands a chance of being selected only once.

Static Sampling: A method of sampling where the whole sample is obtained before processing begins. The user must specify the sample size.

Stratified Sampling: For this method, before the samples are drawn, the population P is divided into several strata, p1, p2, ..., pk, and the sample S is composed of k partial samples s1, s2, ..., sk, each drawn randomly, with replacement or not, from one of the strata.

Uniform Random Sampling: A method of simple random sampling where the probabilities of inclusion for each element are equal.

Weighted Random Sampling: A method of simple random sampling where the probabilities of inclusion for each element are not equal.
DEA Evaluation of Performance of E-Business Initiatives
Yao Chen University of Massachusetts Lowell, USA
Luvai Motiwalla University of Massachusetts Lowell, USA
M. Riaz Khan University of Massachusetts Lowell, USA
INTRODUCTION The Internet has experienced a phenomenal growth in attracting people and commerce activities over the last decade—from a few thousand people in 1993 to 150+ million in 1999, and about one billion by 2004 (Bingi et al., 2000). This growth has attracted a variety of organizations initially to provide marketing information about their products and services and, customer support, and later to conduct business transactions with customers or business partners on the Web. These electronic business (EB) initiatives could be the implementation of intranet and/or extranet applications like B2C, B2B, Web-CRM, Webmarketing, and others, to take advantage of the Webbased economic model, which offers opportunities for internal efficiencies and external growth. It has been recognized that the Internet economic model is more efficient at the transaction cost level and elimination of the middleman in the distribution channel, and also can have a big impact on the market efficiency. Web-enabling business processes are particularly attractive in the new economy, where product life cycles are short and efficient, while the market for products and services is global. Similarly, management of these companies expects a much better financial performance than their counterparts in the industry, which had not adopted these EB initiatives (Hoffman et. al., 1995; Wigand & Benjamin, 1995). EB allows organizations to expand their business reach. One of the key benefits of the Web is access to and from global markets. The Web eliminates several geographical barriers for a corporation that wants to conduct global commerce. While traditional commerce relied on value-added networks (VANs) or private networks, which were expensive and provided limited connectivity (Pyle, 1996), the Web makes electronic commerce cheaper with extensive global connectivity. Corporations have been able to produce goods anywhere and deliver electroni-
cally or physically via couriers. This enables organizations the flexibility to expand into different product lines and markets quickly, with low investments. Secondly, 24x7 availability, better communication with customers, and sharing of the organizational knowledge base allows organizations to provide better customer service. This can translate to better customer retention rates as well as repeat orders. Finally, the rich interactive media and database technology of the Web allows for unconstrained awareness, visibility, and opportunity for an organization to promote its products and services. This enhances organizations’ abilities to attract new customers, thereby increasing their overall markets and profitability. Despite the recent dot-com failures, EB has made tremendous inroads in traditional corporations. Forrester Research in its survey found 90% of the firms plan to conduct some ecommerce, business-to-consumer (B2C), or business-tobusiness (B2B), and predicts EB transactions to rise to about $6.9 trillion by 2004. As a result, the management has started to believe in the Internet because of its ability to attract and retain more customers, reduce sales and distribution overheads, and global access to markets with an expectation of an increase in sales revenues, higher profits, and better returns for the stockholders (Choi & Winston, 2000; Motiwalla & Khan, 2002; Steinfield & Whitten, 1999; White, 1999).
BACKGROUND It is important that we use a comprehensive performance evaluation tool to examine whether these EB initiatives have a positive impact on the financial performance. Managers are often interested in evaluating how efficiently EB initiatives are with respect to multiple inputs and outputs. Single-measure gap analysis is often used as a fundamental method in performance evaluation and best practice identification. It is extremely difficult to show
benchmarks where multiple measurements exist. It is rare that one single measure can suffice for the purpose of performance assessment. In our empirical study, there are multiple measures that characterize the performance of retail companies. This requires that the research tool used here have the flexibility to deal with changing production technology in the context of multiple performance measures. Data envelopment analysis (DEA) is originally developed to measure the relative efficiency of peer decision-making units (DMUs) in a multiple input-output setting. DEA has been proven to be an excellent methodology for performance evaluation and benchmarking (Zhu, 2003). Based on Cooper, Seiford, and Zhu (2004), the specific reasons for using DEA are given as follows. First, DEA is a data-oriented approach for evaluating the performance of a set of peer DMUs, which convert multiple inputs into multiple outputs. In our case, the DMUs can be, for example, corporations that have launched EB activities. For each corporation, each year can be regarded as a DMU. Second, DEA is a methodology directed to frontiers rather than central tendencies. Instead of trying to fit a regression plane through the center of the data, as is done in statistical regression, for example, one floats a piecewise linear surface to rest on top of the observations. Because of this approach, DEA proves particularly adept at uncovering relationships that remain hidden in other methodologies. Third, DEA does not require explicitly formulated assumptions of functional form as in linear and nonlinear regression models. This flexibility allows us to identify the multi-dimensional efficient frontier without the need for explicitly expressing the technology change and organizational knowledge. In order to discriminate the performance among the efficient DMUs, a super-efficiency DEA model in which a DMU under evaluation is excluded from the reference set is developed. However, the super-efficiency model has been restricted to the case of constant returns to scale (CRS), because the non-CRS super-efficiency DEA model can be infeasible (Seiford & Zhu, 1998; 1999; Zhu, 1996). It is difficult to precisely define infeasibility. As a result, one cannot rank the performance of a set of DMUs. In fact, an input-oriented super-efficiency DEA model measures the input super-efficiency when outputs are fixed at their current levels. Likewise, an output-oriented super-efficiency DEA model measures the output superefficiency when inputs are fixed at their current levels. From the different uses of the super-efficiency concept, we see that super-efficiency can be interpreted as the degree of efficiency stability or input saving/output surplus achieved by an efficient DMU. If super-efficiency is used as an efficiency stability measure, then infeasibility means that an efficient DMU’s efficiency classification is
stable to any input changes, if an input-oriented super-efficiency DEA model is used (or any output changes, if an output-oriented super-efficiency DEA model is used). Therefore, we can use +∞ to represent the super-efficiency score (i.e., infeasibility means the highest super-efficiency). Chen (2004) shows that (i) if an efficient DMU does not possess any input super-efficiency (input saving), it must possess output super-efficiency (output surplus), and (ii) if an efficient DMU does not possess any output super-efficiency, it must possess input super-efficiency. We thus can use both input-oriented and output-oriented super-efficiency DEA models to fully characterize the super-efficiency. Based on the above derivations, Chen et al. (2004) are able to rank the performance of a set of publicly held corporations in the retail industry over the period 1997-2000. Specifically, the objective of this study is to determine whether the financial data support the beneficial claims made in the popular literature that EB has boosted the bottom line.
MAIN THRUST

To present our DEA methodology, we assume that there are n DMUs to be evaluated. Each DMU consumes varying amounts of m different inputs to produce s different outputs. Specifically, DMUj consumes amount xij of input i and produces amount yrj of output r. We assume that xij > 0 and yrj > 0 and further assume that each DMU has at least one positive input and one positive output value. The input- and output-oriented super-efficiency models whose frontier exhibits VRS can be expressed as in Seiford and Zhu (1999); see Box 1, where xio and yro are respectively the ith input and rth output for a DMUo under evaluation. Let γo represent the score characterizing the super-efficiency in terms of input saving; we have

γo = θo^(VRS-super)*   if the input-oriented super-efficiency model is feasible
γo = 1                 if the input-oriented super-efficiency model is infeasible

Note that γo ≥ 1. If γo > 1, a specific efficient DMUo has input super-efficiency. If γo = 1, DMUo does not have input super-efficiency. Similarly, let τo represent the score characterizing the output super-efficiency; we have

τo = φo^(VRS-super)*   if the output-oriented super-efficiency model is feasible
τo = 1                 if the output-oriented super-efficiency model is infeasible
Box 1. min θ oVRS-super s.t.
n
∑λ x j =1 j ≠o
j
ij
j=1 j ≠o
j
n
∑λ j=1 j ≠o
j
i = 1,2,..., m;
s.t.
n
∑λ x j =1 j ≠o
n
∑λ y
≤ θ oVRS −super xio
rj
≥ y ro
r = 1,2,..., s;
j
n
∑λ j=1 j ≠o
j
n
=1
θ oVRS-super ≥ 0 λj ≥ 0
,
max φoVRS −super
∑λ j=1 j ≠o
j≠o
Note that τ o < 1. If τ o < 1, a specific efficient DMU o has output super-efficiency. If τ o = 1, DMU o does not have output super-efficiency. The DEA inputs and outputs can be developed from the financial ratios. For example, the inputs can include (i) number of employees, (ii) inventory cost, (iii) total current assets, and (iv) cost of sales; and the outputs can include (i) revenue and (ii) net income. In our study, 75% of the EB companies are efficient, and 57% of the non-EB companies are efficient. Thus, in general, the EB companies have performed better. In terms of the super-efficiency DEA model, the EB companies demonstrate a better performance than the companies that have not yet adopted the EB initiatives (Chen et al., 2004).
FUTURE TRENDS
It has been recognized that the link between IT investment and firm performance is indirect. Future research should focus on multi-stage performance of IT impacts on firm performance. For example, Chen and Zhu (2004) developed a preliminary DEA-based model to (i) characterize the indirect impact of IT on firm performance, (ii) identify the efficient frontier of two value-added stages related to IT investment and profit generation, and (iii) highlight firms that can be further analyzed for best practice benchmarking.
CONCLUSION
The analysis of the retail industry data indicates that there is some evidence that the EB initiatives have had some degree of favorable impact on the financial performance of companies that have moved in this direction. Results indicate that the EB companies performed better in some measures than the non-EB companies included in the sample. Further, by contrasting the performance of EB companies against the non-EB companies, the findings confirm that the EB companies have benefited from the innovation and the strategic integration of the EB technologies.

REFERENCES
Bingi, P., Mir, A., & Khamalah, J. (2000). The challenges facing global e-commerce. Information Systems Management, 6(2), 26-34.
Charnes, A., Cooper, W.W., & Rhodes, E. (1978). Measuring the efficiency of decision making units. European Journal of Operational Research, 2, 429-444.
Chen, Y. (2004). Ranking efficient units in DEA. OMEGA, 32, 213-219.
Chen, Y., Motiwalla, L., & Khan, M.R. (2004). Using super-efficiency DEA to evaluate financial performance of e-business initiative in the retail industry. International Journal of Information Technology and Decision Making, 3(2), 337-351.
Chen, Y., & Zhu, J. (2004). Measuring information technology's indirect impact on firm performance. Information Technology & Management Journal, 5(1), 9-22.
Choi, S., & Winston, A. (2000). Benefits and requirements for interoperability in electronic marketplace. Technology in Society, 22, 33-44.
Cooper, W.W., Seiford, L.M., & Zhu, J. (2004). Handbook on data envelopment analysis. Boston: Kluwer Academic Publishers.
Hoffman, D., Novak, T., & Chatterjee, P. (1995). Commercial scenarios for the Web: Opportunities and challenges. Journal of Computer-Mediated Communications, 1(3).
Motiwalla, L., & Khan, M.R. (2002). Financial impact of e-business initiatives in the retail industry. Journal of Electronic Commerce in Organizations, 1(1), 55-73.
Pyle, R. (1996). Commerce and the Internet. Communications of the ACM, 39(6), 23.
Seiford, L.M., & Zhu, J. (1998). Sensitivity analysis of DEA models for simultaneous changes in all the data. Journal of the Operational Research Society, 49, 1060-1071.
Seiford, L.M., & Zhu, J. (1999). Infeasibility of super-efficiency data envelopment analysis models. INFOR, 37, 174-187.
Steinfield, C., & Whitten, P. (1999). Community level socio-economic impacts of electronic commerce. Journal of Computer-Mediated Communications, 5(2).
White, G. (1999, December 3). How GM, Ford think Web can make a splash on the factory floor. Wall Street Journal, 1, 1.
Wigand, R., & Benjamin, R. (1995). Electronic commerce: Effects on electronic markets. Journal of Computer-Mediated Communication, 1(3).
Zhu, J. (1996). Robustness of the efficient DMUs in data envelopment analysis. European Journal of Operational Research, 90(3), 451-460.
Zhu, J. (2003). Quantitative models for performance evaluation and benchmarking: Data envelopment analysis with spreadsheets and DEA Excel Solver. Boston: Kluwer Academic Publishers.
KEY TERMS
Data Envelopment Analysis (DEA): A data-oriented mathematical programming approach that allows multiple performance measures in a single model.
Decision Making Unit (DMU): The subject under evaluation.
Efficient: Full efficiency is attained by any DMU if and only if none of its inputs or outputs can be improved without worsening some of its other inputs or outputs.
Electronic Business (EB): Business transactions involving exchange of goods and services with customers and/or business partners over the Internet.
Inputs/Outputs: Refer to the performance measures used in DEA evaluation. Inputs usually refer to the resources used, and outputs refer to the outcomes achieved by an organization or DMU.
Returns to Scale (RTS): RTS are considered to be increasing if a proportional increase in all the inputs results in a more than proportional increase in the single output. In DEA, the concept of returns to scale is extended to multiple inputs and multiple outputs situations.
Super-Efficiency: The input savings or output surpluses achieved by an efficient DMU.
Decision Tree Induction
Roberta Siciliano, University of Naples Federico II, Italy
Claudio Conversano, University of Cassino, Italy
INTRODUCTION
Decision Tree Induction (DTI) is an important step of the segmentation methodology. It can be viewed as a tool for the analysis of large datasets characterized by high dimensionality and nonstandard structure. Segmentation follows a nonparametric approach, since no hypotheses are made on the variable distribution. The resulting model has the structure of a tree graph. It is considered a supervised method, since a response criterion variable is explained by a set of predictors. In particular, segmentation consists of partitioning the objects (also called cases, individuals, observations, etc.) into a number of subgroups (on the basis of suitable partitioning of the modalities of the explanatory variables, the so-called predictors) in a recursive way, so that a tree structure is produced. Typically, partitioning is in two subgroups, yielding binary trees, although ternary trees as well as r-way trees also can be built up. Two main targets can be achieved with tree structures—classification and regression trees—on the basis of the type of response variable, which can be categorical or numerical.

Tree-based methods are characterized by two main tasks: exploratory and decision. The first is to describe, with the tree structure, the dependence between the response and the predictors. The decision task is properly that of DTI, aiming to define a decision rule for unseen objects for estimating unknown response classes/values as well as validating the accuracy of the final results.

For example, trees often are considered in credit-scoring problems in order to describe and classify good and bad clients of a bank on the basis of socioeconomic indicators (e.g., age, working conditions, family status, etc.) and financial conditions (e.g., income, savings, payment methods, etc.). Conditional interactions describing the client profile can be detected looking at the paths along the tree, when going from the top to the terminal nodes. Each internal node of the tree is assigned a partition (or a split for binary trees) of the predictor space, and each terminal node is assigned a label class/value of the response. As a result, each tree path, characterized by a sequence of predictor
interactions, can be viewed as a production rule yielding a specific label class/value. The set of production rules constitutes the predictive learning of the response class/value of new objects, where only measurements of the predictors are known. As an example, a new client of a bank is classified as a good client or a bad one by dropping it down the tree according to the set of splits (binary questions) of a tree path, until a terminal node labeled by a specific response class is reached.
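A minimal sketch of this credit-scoring illustration, using scikit-learn's decision tree implementation, is given below; the two predictors and the tiny training sample are invented for illustration only and are not the bank data discussed in the text.

```python
# A toy credit-scoring tree: each root-to-leaf path is a production rule, and a
# new client is classified by dropping it down the tree along one path.
from sklearn.tree import DecisionTreeClassifier, export_text

# predictors: [age, income]; response: 1 = good client, 0 = bad client
X = [[25, 18], [32, 35], [47, 52], [51, 12], [38, 60], [29, 15], [61, 48], [44, 22]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Printing the tree makes the sequence of binary questions (splits) explicit.
print(export_text(tree, feature_names=["age", "income"]))

# Classify a new, unseen client from its predictor measurements only.
print(tree.predict([[40, 30]]))
```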
BACKGROUND The appealing aspect for the segmentation user is that the final tree provides a comprehensive description of the phenomenon in different contexts of application, such as marketing, credit scoring, finance, medical diagnosis, and so forth. Segmentation can be considered as an exploratory tool but also as a confirmatory nonparametric model. Exploration can be obtained by performing a recursive partitioning of the objects until a stopping rule defines the final structure to interpret. Confirmation is a different problem, requiring definition of decision rules, usually obtained by performing a pruning procedure soon after a partitioning one. Important questions arise when using segmentation for predictive learning goals (Hastie et al., 2001; Zhang, 1999). The tree structure that fits the data and can be used for unseen objects cannot be the simple result of any partitioning algorithm. Two aspects should be jointly considered: the tree size (i.e., the number of terminal nodes) and the accuracy of the final decision rule evaluated by an error measure. In fact, a weak point of decision trees is the sensitivity of the classification/prediction rules measured by the size of the tree and its accuracy to the type of dataset as well as to the pruning procedure. In other words, the ability of a decision tree to detect cases and take right decisions can be evaluated by a simple measure, but it also requires a specific induction procedure. Likewise, in statistical inference, where the power of a testing procedure is judged with respect to changes of the alternative hypotheses, decision tree induction strongly
depends on both the hypotheses to verify and their alternatives. For instance, in classification trees, the number of response classes and the prior distribution of cases among the classes influence the quality of the final decision rule. In the credit-scoring example, an induction procedure using a sample of 80% of good clients and 20% of bad clients likely will provide reliable rules to identify good clients and unreliable rules to identify bad ones.
MAIN THRUST Exploratory trees can be fruitfully used to investigate the data structure, but they cannot be used straightforwardly for induction purposes. The main reason is that exploratory trees are accurate and effective with respect to the training data used for growing the tree, but they might perform poorly when applied to classifying/ predicting fresh cases that have not been used in the growing phase.
DTI Main Tasks DTI definitely has an important purpose represented by understandability: the tree structure for induction needs to be simple and not large; this is a difficult task since a predictor may reappear (even though in a restricted form) many times down a branch. At the same time, a further requirement is given by the identification issue: on one hand, terminal branches of the expanded tree reflect particular features of the training set, causing over-fitting; on the other hand, over-pruned trees necessarily do not allow identification of all the response classes/values (under-fitting).
Tree Model Building
Simplification method performance in terms of accuracy depends on the partitioning criterion used in the tree-growing procedure (Buntine & Niblett, 1992). Thus, exploratory trees become an important preliminary step for DTI. In tree model building, it is worth distinguishing between the optimality criterion for tree pruning (simplification method) and the criterion for selecting the best decision rule (decision rule selection). These criteria often use independent datasets (training set and test set). In addition, a validation set can be required to assess the quality of the final decision rule (Hand, 1997). In this respect, segmentation with pruning and assessment can be viewed as stages of any computational model-building process based on a supervised learning algorithm. Furthermore, growing the tree structure using a Fast Algorithm for Splitting Trees (FAST)
(Mola & Siciliano, 1997) becomes a fundamental step to speed up the overall DTI procedure.
Tree Simplification: Pruning Algorithms A further step is required for DTI relying on the hypothesis of uncertainty in the data due to noise and residual variation. Simplifying trees is necessary to remove the most unreliable branches and improve understandability. Thus, the goal of simplification is inferential (i.e., to define the structural part of the tree and reduce its size while retaining its accuracy). Pruning methods consist in simplifying trees in order to remove the most unreliable branches and improve the accuracy of the rule for classifying fresh cases. The pioneer approach of simplification was presented in the Automatic Interaction Detection (AID) of Morgan and Sonquist (1963). It was based on arresting the recursive partitioning procedure according to some stopping rule (pre-pruning). Alternative procedures consist in pruning algorithms working either from the bottom to the top of the tree (post-pruning) or vice versa (pre-pruning). CART (Breiman et al., 1984) introduced the idea to grow the totally expanded tree for removing retrospectively some of the branches (post-pruning). This results in a set of optimally pruned trees for the selection of the final decision rule. The main issue of pruning algorithms is the definition of a complexity measure that takes account of both the tree size and accuracy through a penalty parameter expressing the gain/cost of pruning tree branches. The training set is often used for pruning, whereas the test set for selecting the final decision rule. This is the case of both the error-complexity pruning of CART and the critical value pruning (Mingers, 1989). Nevertheless, some methods require only the training set. This is the case of the pessimistic error pruning and the error-based pruning (Quinlan, 1987, 1993) as well as the minimum error pruning (Cestnik & Bratko, 1991) and the CART cross-validation method. Instead, other methods use only the test set, such as the reduced error pruning (Quinlan, 1987). These latter pruning algorithms yield to just one best pruned tree, which represents in this way the final rule. In DTI, accuracy refers to the predictive ability of the decision tree to classify/predict an independent set of test data. In classification trees, the error rate, measured by the number of incorrect classifications of the tree on test data, does not reflect accuracy of predictions for classes that are not equally likely, and those with few cases are usually badly predicted. As an alternative to the CART pruning, Cappelli, et al. (1998) provided a pruning algorithm based on the impurity-complexity measure to take account of the distribution of the cases over the classes.
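The error-complexity (cost-complexity) pruning idea of CART is also exposed by scikit-learn, which makes it easy to illustrate the sequence of optimally pruned trees and the selection of a final rule on an independent test set. The sketch below is illustrative only; the dataset and parameter choices are assumptions, not those of the works cited above.

```python
# A minimal sketch of CART-style cost-complexity (post-)pruning: grow the full
# tree, extract the pruning sequence, and keep the pruned tree that performs
# best on an independent test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# One complexity (penalty) parameter value per optimally pruned subtree.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in alphas]
best = max(pruned, key=lambda t: t.score(X_test, y_test))
print(best.get_n_leaves(), best.score(X_test, y_test))
```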
Decision Rule Selection Given a set of optimally-pruned trees, a simple method to choose the optimal decision rule consists in selecting the pruned tree producing the minimum misclassification rate on an independent test set (0-SE rule) or the most parsimonious pruned tree whose error rate on the test set is within one standard error of the minimum (1-SE rule) (Breiman et al., 1984). Once the sequence of pruned trees is obtained, a statistical testing pruning also could be performed to achieve the most reliable decision rule (Cappelli et al., 2002).
Tree Averaging and Compromise Rules A more recent approach is to generate a set of candidate decision trees that are aggregated in a suitable way to provide a more accurate final rule. This strategy, which is an alternative to the selection of one final tree, consists in the definition of either a compromise rule or a consensus rule using a set of trees rather than the single one. One approach is known as tree averaging, which classifies a new object by averaging over a set of trees using a set of weights (Oliver & Hand, 1995). It requires the definition of the set of trees (since it is impractical to average over every possible pruned tree), the calculation of the weights, and the independent data set to classify. An alternative selection method for classification problems consists in summarizing the information of each tree by table cross-classifying terminal nodes and response classes, so that evaluating the predictability power or the degree of homogeneity through a statistical index yields finding out the best rule (Siciliano, 1998).
Ensemble Methods Ensemble methods are based on a combination, expressed by a weighted or non-weighted aggregation, of single induction estimators able to improve the overall accuracy of any single induction method. Bagging (Bootstrap Aggregating) (Breiman, 1996) is obtained by a bootstrap replication of the sample units of the training sample, each having the same probability to be included in the bootstrap sample, in order to generate single prediction/classification rules that are aggregated, providing a final decision rule consisting in either the average (for regression problems) or the modal class (for classification problems) of the single estimates. Definitions of bias and variance for a classifier as components of the test set error have been used by Breiman (1998). Unstable classifiers can have low bias on a large range of data sets. Their problem is high variance.
Adaptive Boosting (AdaBoost) (Freund & Schapire, 1996; Schapire et al., 1998) adopts an iterative bootstrap replication of the sample units of the training sample such that, at any iteration, misclassified/worse predicted cases have a higher probability of being included in the current bootstrap sample, and the final decision rule is obtained by majority voting. Random Forest (Breiman, 1999) is instead an ensemble of unpruned trees obtained by introducing two bootstrap resampling schemes, one on the objects and another one on the predictors, such that an out-of-bag sample provides the estimation of the test set error and suitable measures of predictor importance are derived for the final interpretation.
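A minimal sketch of the three ensemble schemes just described, using their scikit-learn implementations, follows; the dataset and the settings are illustrative assumptions rather than the configurations used in the cited papers.

```python
# Bagging, AdaBoost and a random forest compared by cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = DecisionTreeClassifier(random_state=0)

models = {
    # bootstrap replicates of the training sample, aggregated by majority vote
    "bagging": BaggingClassifier(base, n_estimators=100, random_state=0),
    # reweighted resampling that concentrates on misclassified cases
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
    # unpruned trees with resampling of both cases and predictors, plus
    # out-of-bag error estimation and variable-importance measures
    "random forest": RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```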
FUTURE TRENDS The use of trees for both exploratory and decision purposes will receive more and more attention over both the statistical and the machine-learning communities, with novel contributions highlighting the idea of decision rules as a preliminary or final tool for data analysis (Conversano et al., 2001).
Combining Trees With Other Statistical Tools Some examples are given by the joint use of statistical modeling and tree-structures for exploratory analysis as well as for confirmatory analysis. For the former, examples are given by the procedures called multibudget trees and two-stage discriminant trees, which have been recently overviewed by Siciliano, Aria, and Conversano (2004). For the latter, an example is given by the stepwise model tree induction method (Malerba et al., 2004), which associates multiple regression models to leaves of the tree in order to define a suitable data-driven procedure for tree induction.
Trees in Regression Smoothing It is also worth mentioning the idea of using DTI for regression smoothing purposes. Conversano (2002) introduced a novel class of semiparametric models named Generalized Additive Multi-Mixture Models (GAM-MM) for performing statistical learning for both classification and regression tasks. These models are similar to generalized additive models and work using an iterative estimation algorithm evaluating the predictive power of suitable mixtures of smoothers/classifiers associated with each predictor entering the model.
Trees are part of the set of alternative smoothers or classifiers defining the estimation functions associated with each predictor, on the basis of bagged scoring measures that take into account the trade-off between estimation accuracy and model complexity.
Trees for Data Imputation and Data Validation
A different approach is based on the use of DTI to solve data quality problems. It is well known that the extreme flexibility of DTI allows missing data to be handled easily when performing data modeling. However, DTI does not permit missing data imputation in a straightforward manner when dealing with multivariate data. Conversano and Siciliano (2004) defined an incremental approach for missing data imputation based on decision trees. The basic idea is inspired by the principle of statistical learning for information retrieval; namely, observed data can be used to impute missing observations in an incremental manner, starting from the subgroups of observations presenting the lowest proportion of missingness (with respect to one set of variables) up to the subgroups presenting the highest proportion. The imputation is made in each step of the algorithm by performing a segmentation of the complete cases (using as response the variable presenting missing values). The final decision rule is derived using either cross-validation or boosting, and it is used to impute missing values in a given subgroup.
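A rough sketch of one step of tree-based imputation in this spirit is given below; it only shows how a tree grown on the complete cases can fill gaps in one variable, and it omits the incremental ordering of subgroups by amount of missingness described above. Function and column names are hypothetical.

```python
# One imputation step: fit a tree on complete cases for the target variable and
# use its predictions to fill the missing values of that variable.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def impute_with_tree(df: pd.DataFrame, target: str, predictors: list) -> pd.DataFrame:
    df = df.copy()
    complete = df.dropna(subset=[target] + predictors)
    missing = df[target].isna() & df[predictors].notna().all(axis=1)
    tree = DecisionTreeRegressor(max_depth=4).fit(complete[predictors], complete[target])
    df.loc[missing, target] = tree.predict(df.loc[missing, predictors])
    return df
```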
CONCLUSION
In the last two decades, computational enhancements have contributed greatly to the increase in popularity of tree structures. Partitioning algorithms and decision tree induction are two faces of the same medal. While computing time has been decreasing rapidly, the statistician is making use of more computationally intensive procedures to find an unbiased and accurate decision rule for new objects. Nevertheless, DTI cannot be reduced to finding just a number (the error rate) together with its accuracy. Trees are much more than a decision rule. Software enhancements, through interactive graphical user interfaces and custom routines, should empower the final user of tree structures, making it possible to achieve further important results in the directions of interpretability, identification, and robustness.
REFERENCES Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140. Breiman, L. (1999). Random forest—Random features [technical report]. Berkeley, CA: University of California. Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Belmont, CA: Wadsworth. Buntine, W., & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75-85. Cappelli, C., Mola, F., & Siciliano, R. (1998). An alternative pruning method based on the impurity-complexity measure Proceedings in Computational Statistics: 13th Symposium of COMPSTAT. London: Springer. Cappelli, C., Mola, F., & Siciliano, R. (2002). A statistical approach to growing a reliable honest tree. Computational Statistics and Data Analysis, 38, 285-299. Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. Proceedings of the EWSL-91. Berlin: Springer. Conversano, C. (2002). Bagged mixture of classifiers using model scoring criteria. Patterns Analysis & Applications, 5(4), 351-362. Conversano, C., Mola, F., & Siciliano, R. (2001). Partitioning algorithms and combined model integration for data mining. Computational Statistics, 16, 323-339. Conversano, C., & Siciliano, R. (2004). Incremental tree-based missing data imputation with lexicographic ordering. Proceedings of Interface 2003, Fairfax, USA. Freud, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. Proceedings of Machine Learning, Thirteenth International Conference. Berlin: Springer. Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, USA: MIT Press. Hastie, T., Friedman, J.H., & Tibshirani, R. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer Verlag. Malerba, D., Esposito, F., Ceci, M., & Appice, A. (2004), Top-down induction of model trees with regression and splitting nodes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 6.
Mingers, J. (1989a). An empirical comparison of selection measures for decision tree induction. Machine Learning, 3, 319-342. Mingers, J. (1989b). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4, 227-243. Mola, F., & Siciliano, R. (1997). A fast splitting algorithm for classification trees. Statistics and Computing, 7, 209-216. Morgan, J.N., & Sonquist, J.A. (1963). Problem in the analysis of survey data and a proposal. Journal of the American Statistical Association, 58, 415-434. Oliver, J.J., & Hand, D.J. (1995). On pruning and averaging decision trees. Proceedings of Machine Learning, the 12th International Workshop. Berlin: Springer. Quinlan, J.R. (1986). Induction of decision tree. Machine Learning, 1, 86-106. Quinlan, J.R. (1987). Simplifying decision tree. Internat. J. Man-Mach. Studies, 27, 221-234. Schapire, R.E., Freund, Y., Barlett, P., & Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1998. Siciliano, R. (1998). Exploratory versus decision trees. R. Payne, & P. Green (Eds.), Proceedings in computational statistics (pp. 113-124). Physica-Verlag. London: Springer. Siciliano, R., Aria, M., & Conversano, C. (2004). Tree harvest: Methods, software and applications. In J. Antoch (Ed.), COMPSTAT 2004 Proceedings (pp. 1807-1814). Berlin: Springer. Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. New York: Springer Verlag.
KEY TERMS Adaptive Boosting (AdaBoost): An iterative bootstrap replication of the sample units of the training sample such that at any iteration, misclassified/worse predicted cases have higher probability to be included in the current bootstrap sample, and the final decision rule is obtained by majority voting. Bagging (Bootstrap Aggregating): A bootstrap replication of the sample units of the training sample, each having the same probability to be included in the
bootstrap sample to generate single prediction/classification rules that being aggregated provides a final decision rule consisting in either the average (for regression problems) or the modal class (for classification problems) among the single estimates. Classification Tree: An oriented tree structure obtained by a recursive partitioning of a sample of cases on the basis of a sequential partitioning of the predictor space such to obtain internally homogenous groups and externally heterogeneous groups of cases with respect to a categorical variable. Decision Rule: The result of an induction procedure providing the final assignment of a response class/ value to a new object so that only the predictor measurements are known. Such rule can be drawn in the form of decision tree. Ensemble: A combination, typically weighted or unweighted aggregation, of single induction estimators able to improve the overall accuracy of any single induction method. Exploratory Tree: An oriented tree graph formed by internal nodes and terminal nodes, the former allowing the description of the conditional interaction paths between the response variable and the predictors, whereas the latter are labeled by a response class/value. FAST (Fast Algorithm for Splitting Trees): A splitting procedure to grow a binary tree using a suitable mathematical property of the impurity proportional reduction measure to find out the optimal split at each node without trying out necessarily all candidate splits. Partitioning Tree Algorithm: A recursive algorithm to form disjoint and exhaustive subgroups of objects from a given group in order to build up a tree structure. Production Rule: A tree path characterized by a sequence of predictor interactions yielding to a specific label class/value of the response variable. Pruning: A top-down or bottom-up selective algorithm to reduce the dimensionality of a tree structure in terms of the number of its terminal nodes. Random Forest: An ensemble of unpruned trees obtained by introducing two bootstrap resampling schema, one on the objects and another one on the predictors, such that an out-of-bag sample provides the estimation of the test set error, and suitable measures of predictor importance are derived for the final interpretation.
Regression Tree: An oriented tree structure obtained by a recursive partitioning of a sample on the basis of a sequential partitioning of the predictor space, so as to obtain internally homogeneous and externally heterogeneous groups of cases with respect to a numerical response variable.
Diabetic Data Warehouses
Joseph L. Breault, Ochsner Clinic Foundation, USA
INTRODUCTION
The National Academy of Sciences convened in 1995 for a conference on massive data sets. The presentation on health care noted that “massive applies in several dimensions . . . the data themselves are massive, both in terms of the number of observations and also in terms of the variables . . . there are tens of thousands of indicator variables coded for each patient” (Goodall, 1995, paragraph 18). We multiply this by the number of patients in the United States, which is hundreds of millions. Diabetic registries have existed for decades. Data-mining techniques have recently been applied to them in an attempt to predict diabetes development or high-risk cases, to find new ways to improve outcomes, and to detect provider outliers in quality of care or in billing services (Breault, 2001; He, Koesmarno, Van, & Huang, 2000; Hsu, Lee, Liu, & Ling, 2000; Kakarlapudi, Sawyer, & Staecker, 2003; Stepaniuk, 1999; Tafeit, Moller, Sudi, & Reibnegger, 2000). Diabetes is a major health problem. The long history of diabetic registries makes it a realistic and valuable target for data mining.

BACKGROUND
In-depth examination of one such diabetic data warehouse developed a method of applying data-mining techniques to this type of database (Breault, Goodall, & Fos, 2002). There are unique data issues and analysis problems with medical transactional databases. The lessons learned will be applicable to any diabetic database and perhaps to broader medical databases. Methods for translating a complex relational medical database with time series and sequencing information to a flat file suitable for data mining are challenging. We used the classification tree approach with a binary target variable. While many data mining methods (neural networks, logistic regression, etc.) could be used, classification trees have been noted to be appealing to physicians because much of medical diagnosis training operates in a fashion similar to classification trees.

MAIN THRUST
Three major challenges are reviewed here: a) understanding and converting the diabetic databases into a data-mining data table, b) the data mining, and c) utilizing results to assist clinicians and managers in improving the health of the population studied.
The Diabetic Database
The diabetic data warehouse we studied included 30,383 diabetic patients during a 42-month period with hundreds of fields per patient. Understanding the data requires awareness of its limitations. These data were obtained for purposes other than research. Clinicians will be aware that billing codes are not always precise, accurate, and comprehensive. However, the codes are widely used in outcomes modeling. Epidemiologists and clinicians will be aware that important predictors of diabetic outcomes are missing from the database, such as body mass index, family history of diabetes, time since the onset of diabetes, diet, and exercise habits. These variables were not electronically stored and would require going to the paper chart and patient interviews to obtain.
Developing the Data-Mining Data Table
The major challenge is transforming the data from the relational structure of the diabetic data warehouse with its multiple tables to a form suitable for data mining (Nadeau, Sullivan, Teorey, & Feldman, 2003). Data-mining algorithms are most often based on a single table, within which is a record for each individual, and the fields contain variable values specific to the individual. We call this the data-mining data table. The most portable format for the data-mining data table is a flat file, with one line for each individual record. SQL statements on the data warehouse create the flat file output that the data-mining software then reads. The steps are as follows:
• Review each table of the relational database and select the fields to export.
• Determine the interactions between the tables in the relational database.
• Define the layout of the data-mining data table.
• Specify patient inclusion and exclusion criteria. What is the time interval? What are the minimum and maximum number of records (e.g., clinic visits or outcome measures) each patient must have to be included? What relevant fields can be missing and still include the individual in the data-mining data table?
• Extract data, including the stripping of patient identifiers to protect human subjects.
• Determine how to handle missing values (Duhamel, Nuttens, Devos, Picavet, & Beuscart, 2003).
• Perform sanity checks on the data-mining data table, for example, that the minimum and maximum of each variable make clinical sense.
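A minimal sketch of this flattening step is shown below using SQL extraction plus pandas aggregation; the database file, table names, and column names (patients, labs, visits, hgba1c, and so on) are hypothetical placeholders, not the warehouse schema used in the study.

```python
# Turn relational tables into a single data-mining data table (flat file),
# one row per patient, with simple inclusion criteria and a sanity check.
import pandas as pd
import sqlite3

con = sqlite3.connect("warehouse.db")
patients = pd.read_sql("SELECT patient_id, age, sex FROM patients", con)
labs = pd.read_sql("SELECT patient_id, hgba1c FROM labs WHERE test = 'HgbA1c'", con)
visits = pd.read_sql("SELECT patient_id, visit_type FROM visits", con)

# Aggregate the one-to-many tables down to one row per patient.
lab_summary = labs.groupby("patient_id").agg(n_hgba1c=("hgba1c", "size"),
                                             mean_hgba1c=("hgba1c", "mean"))
visit_summary = (visits.groupby("patient_id")["visit_type"]
                 .value_counts().unstack(fill_value=0))

flat = (patients.set_index("patient_id")
        .join([lab_summary, visit_summary], how="left"))

# Inclusion criterion (at least two HgbA1c results) and a clinical sanity check.
flat = flat[flat["n_hgba1c"] >= 2]
assert flat["mean_hgba1c"].between(3, 20).all()
flat.to_csv("datamining_table.csv")
```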
Handling time series medical data is challenging for data-mining software. One example in our study is the HgbA1c, the key measure of glycemic control. This is closely related to clinical outcomes and complication rates in diabetes. Health care costs increase markedly with each 1% increase in baseline HgbA1c; patients with an HgbA1c of 10% versus 6% had a 36% increase in 3-year medical costs (Blonde, 2001). How should this time series variable be transformed from the relational database to a vector (column) in the data-mining data table? A given diabetic patient may have many of these HgbA1c results. We could pick the last one, the first, a median or mean value. Because the trend over time for this variable is important, we could choose the slope of its regression line over time. However, a linear function may be a good representation for some patients, but a very bad one for others that may be better represented by an upside down U curve. This difficulty is a problem for most repeated laboratory tests. Some information will be lost in the creation of the data-mining data table. We used the average HgbA1c for a given patient and excluded patients who did not have at least two HgbA1c results in the data warehouse. We repartitioned this average HgbA1c into a binary variable based on a meaningful clinical cut-point of 9.5%. Experts agree that an HgbA1c >9.5% is a bad outcome, or a medical quality error, no matter what the circumstances (American Medical Association, Joint Commission on Accreditation of Healthcare Organizations, & National Committee for Quality Assurance, 2001). Our final data-mining data table had 15,902 patients (rows). Mean HgbA1c > 9.5% was the target variable, and the 10 predictors were age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end stage renal disease. All these patients
had at least two HgbA1c tests and at least two office visits, the criteria we used for minimal continuity in this 42-month period.
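The candidate reductions of the HgbA1c time series mentioned above (first, last, mean, trend) and the binary target at the 9.5% cut-point can be sketched as follows; the column names and the yearly-trend convention are assumptions for illustration.

```python
# Summarise each patient's repeated HgbA1c results and build the binary target.
import numpy as np
import pandas as pd

def slope(group):
    # linear trend of HgbA1c over time, expressed per year
    days = (group["date"] - group["date"].min()).dt.days.to_numpy()
    return np.polyfit(days, group["hgba1c"].to_numpy(), 1)[0] * 365 if len(group) > 1 else np.nan

def summarise(hgba1c: pd.DataFrame) -> pd.DataFrame:
    g = hgba1c.sort_values("date").groupby("patient_id")
    out = g["hgba1c"].agg(first="first", last="last", mean="mean", n="size")
    out["trend_per_year"] = g.apply(slope)
    out = out[out["n"] >= 2]                   # at least two HgbA1c results
    out["bad_control"] = out["mean"] > 9.5     # binary target at the clinical cut-point
    return out
```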
Data-Mining Technique We used the classification tree approach as standardized in the CART software by Salford Systems. As detailed in Hand, Mannila, and Smyth (2001), the principle behind all tree models is to recursively partition the input variable space to maximize purity in the terminal tree nodes. The partitioning split in any cell is done by searching each possible threshold for each variable to find the threshold split that leads to the greatest improvement in the purity score of the resultant nodes. Hence, this is a monothetic process, which may be a limitation of this method in some circumstances. In CART’s defaults, the Gini splitting criteria are used, although other methods are options. This could recursively continue to the point of perfect purity, which would sometimes mean only one patient in a terminal node. But overfitting of the data does not help in accurately classifying another data set. Therefore, we divide the data randomly into learning and test sets. The number of trees generated is halted or pruned back by how accurately the classification tree created from the learning set can predict classification in the test set. Cross-validation is another option for doing this, though in the CART software’s defaults this is limited to n = 3000. This could be changed higher to use our full data set, but some CART consultants note, “The n-fold crossvalidation technique is designed to get the most out of datasets that are too small to accommodate a hold-out or test sample. Once you have 3,000 records or more, we recommend that a separate test set be used” (TimberlakeConsultants, 2001). The original CART creators recommended dividing the data into test and learning samples whenever there were more than 1,000 cases, with crossvalidation being preferable in smaller data sets (Breiman, Friedman, Olshen, & Stone, 1984). The 10 predictor variables were used with the binary target variable of the HgbA1c average (cut-point of 9.5%) in an attempt to find interesting patterns that may have management or clinical importance and are not already known. The variables that are most important to classification in the optimal CART tree were age (100, where the most important variable is arbitrarily given a relative score of 100), number of office visits (51), comorbidity index (CMI) (44), cardiovascular disease (16), cholesterol problems (17), number of emergency room visits (7), and hypertension (0.6). CART can be used for multiple purposes. Here we want to find clusters of deviance from glycemic control.
With no analysis, the rate of bad glycemic control in the learning sample is 13.2%. We want to find nodes that have higher rates of bad glycemic control. The first split in node 1 of the tree is based on an age of 65.6 (CART sends all cases less than or equal to the cut-point to the left and greater than the cut-point to the right). In node 2 (<=65.6 years of age), we have bad glycemic control in 19.4%. If we look at the tree to find terminal nodes (TN) where the percentage of all the patients in those nodes having a bad HgbA1c is greater than the 19.4% true of those <= 65.6 years of age, we identify 4 of the 10 TNs in the learning sample. The purer the nodes we limit ourselves to, the less percentage of the overall population with bad glycemic control we get. With no analysis in the learning set, we can capture all 1052 patients with bad glycemic control but must target the entire group of 7953. When we limit ourselves to the purer node 2, we capture 74% of those with bad glycemic control by targeting only 50% of the population. If we limit ourselves to TN1, we capture 49% of those with bad glycemic control by targeting only 27% of the population. If we use more complicated criteria by combining the 4 TNs with worst glycemic control, we capture 54% of those with bad glycemic control by targeting only 30% of the population. The classification errors in the learning and test samples are substantial, as a quarter of the bad glycemic control patients are missed in the CART analysis. CART is doing a good job with the 10 predictor variables it is given, but more accurate prediction requires additional variables not in our database. Adjustment to defaults in CART can give better results defined as capturing a larger percentage of those with bad glycemic control within a smaller percentage of the population. However, the complexity of the formulas to identify the population is difficult for managers to use. Not only must managers identify the persons, but they must get enough of a feel for what the population characteristics are to know what interventions are likely to be helpful. This is more intuitive for those who are younger than 55 or younger than 65 than it is for those who satisfy 0.451*(AGE) + 0.893*(CMI) <= 32.5576.
Results From this CART analysis, the most important variable associated with a bad HgbA1c score is age less than 65. Those less than 65.6 years old are almost three times as likely to have bad glycemic control than those who are older. The odds ratio that someone is less than 65.6 years old if they have a bad HgbA1c (average reading > 9.5%) is 3.18 (95% CI: 2.87, 3.53) (Fos & Fine, 2000). Similarly, the odds ratio that someone is younger than 55.2
years rather than older than 65.6 if they have a bad glycemic control is 4.11 (95% CI:3.60, 4.69). This is surprising information to most clinicians. Similar findings were recently reported with vascular endpoints in diabetic patients (Miyaki, Takei, Watanabe, Nakashina, & Omae, 2002). The case may be that those with the worst glycemic control die young and never make it to the older group. Although this is an interesting theoretical explanation, nevertheless, these numbers represent real patients that need help with their glycemic control now. If we want to target diabetics with bad HgbA1c values, the odds of finding them are 3.2 times as high in diabetic patients younger than 65.6 years than those who are older and 4.1 times as high in those who are younger than 55 than those over 65. This information is clinically important because the younger group has so many more years of life left to develop diabetic complications from bad glycemic control. This is especially helpful because it tells us which population to target interventions at even before we have the HgbA1c values to show us. Health maintenance organizations and public health workers may want to explore what educational interventions can be successfully directed to younger diabetics (younger than 65, especially younger than 55) who are much more likely to have bad glycemic control than the geriatric patients.
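The odds ratios quoted above come from a 2x2 table of counts (age group versus glycemic control). A short sketch of that calculation, with a Wald-type confidence interval on the log-odds scale, is given below; the count layout is generic and the numbers are not the study's actual data.

```python
# Odds ratio with an approximate 95% confidence interval from a 2x2 table.
import numpy as np

def odds_ratio(a, b, c, d, z=1.96):
    """a: young & bad control, b: young & good, c: older & bad, d: older & good."""
    or_ = (a * d) / (b * c)
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of log(OR)
    lo, hi = np.exp(np.log(or_) + np.array([-z, z]) * se)
    return or_, (lo, hi)
```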
FUTURE TRENDS Areas that need further work to fully utilize data mining in health care include time-series issues, sequencing information, data-squashing technologies, and a tight integration of domain expertise and data-mining skills. We have already discussed time-series issues. This has been investigated but needs further exploration in health care data mining (Bellazzi, Larizza, Magni, Montani, & Stefanelli, 2000; Goodall, 1999; Tsien, 2000). The sequence of various events may hold meaning important to a study. For example, a patient may have better glycemic control, manifested in improved HgbA1c values, especially when the patient had an office visit with a physician within a month of a previous HgbA1c. Perhaps this is meaningful information that implies that the proper sequence of physician visits relative to HgbA1c measurements is an important predictor of good outcomes. All this information is located in the relational database, but we must ferret it out by having in advance an idea that a sequence of this sort may be important and then searching for such associations. There may be many such sequences involving interactions between hospital, clinic, pharmacy, 361
,
Diabetic Data Warehouses
and laboratory variables. Regrettably, no amount of data mining will be able to extract sequence associations where we have not thought to extract the prerequisite variables from the relational database into the data-mining data table. In the ideal data-mining scenario, software could interface directly with the relational database and extract all possibly meaningful sequences for us to review, and domain experts would then sort through the list. This issue has begun to get attention (Deroski & Lavraè , 2001) and will need to be addressed in future health care data mining. It has been shown that using a data-squashing algorithm to reduce a massive data set is more powerful and accurate than using a random sample (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999). Squashing is a form of lossy compression that attempts to preserve statistical information (DuMouchel, 2001). These newer data-squashing techniques may be a better approach than random sampling in massive data sets. These techniques also protect human subjects’ privacy. Transactional health care data mining, exemplified in the diabetic data warehouses discussed previously, involves a number of tricky data transformations that require close collaboration between domain experts and data miners (Breault & Goodall, 2002). Even with ideal collaboration or overlapping expertise, we need to develop new ways to extract variables from relational databases containing time-series and sequencing information. Part of the answer lies in collaborative groups that can have additional insights. Part of the answer lies in the further development of data-mining tools that act directly on a relational database without transformation to explicit data arrays. In some circumstances, it may be useful to produce several data-mining data tables to independently data mine and then combine the results. This may be particularly useful when different granularities of attributes are glossed over by the use of a single data-mining data table.
CONCLUSION Data mining is valuable in discovering novel associations in diabetic databases that can prove useful to clinicians and administrators. This may also be the case for many other health care problems.
REFERENCES American Medical Association, Joint Commission on Accreditation of Healthcare Organizations, & National Committee for Quality Assurance. (2001). Coordi362
nated performance measurement for the management of adult diabetes. Retreived March 24, 2005, from http:// www.ama-assn.org/ama/upload/mm/370/nr.pdf Bellazzi, R., Larizza, C., Magni, P., Montani, S., & Stefanelli, M. (2000). Intelligent analysis of clinical time series: An application in the diabetes mellitus domain. Artificial Intelligence Med, 20(1), 37-57. Blonde, L. (2001). Epidemiology, costs, consequences, and pathophysiology of type 2 diabetes: An American epidemic. Ochsner Journal, 3(3), 126-131. Breault, J. L. (2001). Data mining diabetic databases: Are rough sets a useful addition? In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics (pp. 597-606). Interface Foundation of North America, Inc, 33, Fairfax Station, VA. Breault, J. L., & Goodall, C. R. (2002, January). Mathematical challenges of variable transformations in data mining diabetic data warehouses. Paper presented at the Mathematical Challenges in Scientific Data Mining Conference, Los Angeles, CA. Breault, J. L., Goodall, C. R., & Fos, P. J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence Med, 26(1-2), 37-54. Breiman, L. (1984). Classification and regression trees. Belmont, CA: Wadsworth International. Duhamel, A., Nuttens, M.C., Devos, P., Picavet, M., & Beuscart, R. (2003). A preprocessing method for improving data mining techniques: Application to a large medical diabetes database. Stud Health Technol Inform, 95, 269274. DuMouchel, W. (2001). Data squashing: Constructing summary data sets. In E. Wegman, A. Braverman, A. Goodman, & P. Smyth (Eds.), Computing science and statistics. Interface Foundation of North America, Inc., Fairfax Station, VA. DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999, August). Squashing flat files flatter. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA. Deroski, S., & Lavraè , N. (2001). Relational data mining. New York: Springer. Fos, P. J., & Fine, D. J. (2000). Designing health care for populations: Applied epidemiology in health care administration. San Francisco: Jossey-Bass. Goodall, C. (1995). Massive data sets in healthcare. In (Chair: Jon R. Kettenring), Massive data sets. Paper pre-
sented at the meeting of the Committee on Applied and Theoretical Statistics at the National Academy of Sciences, National Research Council, Washington, DC. Retrieved March 29, 2005, from http://bob.nap.edu/html/massdata/ media/cgoodall-t.html Goodall, C. R. (1999). Data mining of massive datasets in healthcare. Journal of Computational and Graphical Statistics, 8(3), 620-634. Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press. He, H., Koesmarno, Van, & Huang. (2000). Data mining in disease management: A diabetes case study. In R. Mizoguchi & J. K. Slaney (Eds.), Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence: Topics in artificial intelligence (p. 799). New York: Springer. Hsu, W. (2000, August). Exploration mining in diabetic patients databases: Findings and conclusions. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA. Kakarlapudi, V., Sawyer, R., & Staecker, H. (2003). The effect of diabetes on sensorineural hearing loss. Otol Neurotol, 24(3), 382-386. Miyaki, K., (2002). Novel statistical classification model of type 2 diabetes mellitus patients for tailormade prevention using data mining algorithm. J Epidemiol, 12(3), 243-248. Nadeau, T. P., (2003). Applying database technology to clinical and basic research bioinformatics projects. J Integr Neurosci, 2(2), 201-217. Stepaniuk, J. (1999, June). Rough set data mining of diabetes data. In Z. Ras & A. Skowron (Eds.), Proceedings of the 11th International Symposium onVol. Foundations of intelligent systems (pp. 457-465). New York: Springer. Tafeit, E., Moller, Sudi, & Reibnegger. (2000). ROC and CART analysis of subcutaneous adipose tissue topography (SAT-Top) in type-2 diabetic women and healthy females. American Journal of Human Biology, 12, 388-394.
Timberlake-Consultants. (2001). CART frequently asked questions. Retrieved October 21, 2001, from http:// www.timberlake.co.uk/software/cart/cartfaq1.htm#q23 Tsien, C. L. (2000). Event discovery in medical time-series data. Proceedings of the AMIA Symposium, 858-862.
KEY TERMS Co-Morbidity Index: A composite variable that gives a measure of how many other medical problems someone has in addition to the one being studied. Data Mining Data Table: The flat file constructed from the relational database that is the actual table used by the data-mining software. Glycemic Control: Tells how well controlled are the sugars of a diabetic patient. Usually measured by HgbA1c. HgbA1c: Blood test that measures the percent of receptors on a red blood cell that are saturated with glucose. This is translated into a measure of how sugars have averaged over the last few months. Normal is less than 6, depending on the laboratory standards. Medical Transactional Database: The database created from the billing and required reporting transactions of a medical practice. Clinical experience is sometimes required to understand gaps and inadequacies in the collected data. Monothetic Process: In a classification tree, when data at each node are split on just one variable rather than several variables. Recursive Partitioning: The method used to divide data at each node of a classification tree. At the top node, every variable is examined at every possible value to determine which variable split will produce the maximum and minimum amounts of the target variable in the daughter nodes. This is recursively done for each additional node.
Discovering an Effective Measure in Data Mining Takao Ito Ube National College of Technology, Japan
INTRODUCTION One of the most important issues in data mining is to discover an implicit relationship between words in a large corpus and labels in a large database. The relationship between words and labels often is expressed as a function of distance measures. An effective measure would be useful not only for getting the high precision of data mining, but also for time saving of the operation in data mining. In previous research, many measures for calculating the one-to-many relationship have been proposed, such as the complementary similarity measure, the mutual information, and the phi coefficient. Some research showed that the complementary similarity measure is the most effective. The author reviewed previous research related to the measures in one-to-many relationships and proposed a new idea to get an effective one, based on the heuristic approach in this article.
sented by Hagita and Sawaki (1995). The fourth one is the dice coefficient. The fifth one is the Phi coefficient. The last two are both mentioned by Manning and Schutze (1999). The sixth one is the proposal measure (PM) suggested by Ishiduka, Yamamoto, and Umemura (2003). It is one of the several new measures developed by them in their paper. In order to evaluate these distance measures, formulas are required. Yamamoto and Umemura (2002) analyzed these measures and expressed them in four parameters of a, b, c, and d (Table 1). Suppose that there are two words or labels, x and y, and they are associated together in a large database. The meanings of these parameters in these formulas are as follows: a. b. c.
BACKGROUND Generally, the knowledge discover in databases (KDD) process consists of six stages: data selection, cleaning, enrichment, coding, data mining, and reporting (Adriaans & Zantinge, 1996). Needless to say, data mining is the most important part in the KDD. There are various techniques, such as statistical techniques, association rules, and query tools in a database, for different purposes in data mining. (Agrawal, Mannila, Srikant, Toivonen & Verkamo, 1996; Berland & Charniak, 1999; Caraballo, 1999; Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Han & Kamber, 2001). When two words or labels in a large database have some implicit relationship with each other, one of the different purposes is to find out the two relative words or labels effectively. In order to find out relationships between words or labels in a large database, the author found the existence of at least six distance measures after reviewing previously conducted research. The first one is the mutual information proposed by Church and Hanks (1990). The second one is the confidence proposed by Agrawal and Srikant (1995). The third one is the complementary similarity measure (CSM) pre-
d. n.
The number of documents/records that have x and y both. The number of documents/records that have x but not y. The number of documents/records that do not have x but do have y. The number of documents/records that do not have either x or y. The total number of parameters a, b, c, and d.
Umemura (2002) pointed out the following in his paper: “Occurrence patterns of words in documents can be expressed as binary. When two vectors are similar, the two words corresponding to the vectors may have some implicit relationship with each other.” Yamamoto and Umemura (2002) completed their experiment to test the validity of these indexes under Umemura’s concept. The result of the experiment of distance measures without noisy pattern from their experiment can be seen in Figure 1 (Yamamoto & Umemura, 2002). The experiment by Yamamoto and Umemura (2002) showed that the most effective measure is the CSM. They indicated in their paper as follows: “All graphs showed that the most effective measure is the complementary similarity measure, and the next is the confidence and the third is asymmetrical average mutual information. And the least is the average mutual information” (Yamamoto and Umemura, 2002). They also completed their experiments with noisy pattern and found the same result (Yamamoto & Umemura, 2002).
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Discovering an Effective Measure in Data Mining
Table 1. Kinds of distance measures and their formulas

No | Kind of distance measure                | Formula
1  | the mutual information                  | I(x1; y1) = log( a n / ((a + b)(a + c)) )
2  | the confidence                          | conf(Y | X) = a / (a + c)
3  | the complementary similarity measure    | Sc(F, T) = (ad - bc) / sqrt((a + c)(b + d))
4  | the dice coefficient                    | Sd(F, T) = 2a / ((a + b) + (a + c))
5  | the Phi coefficient                     | phi = (ad - bc) / sqrt((a + b)(a + c)(b + d)(c + d))
6  | the proposal measure                    | S(F, T) = a^2 b / (1 + c)
Figure 1. Result of the experiment of distance measures without noisy pattern
MAIN THRUST How to select a distance measure is a very important issue, because it has a great influence on the result of data mining (Fayyad, Piatetsky-Shapiro & Smyth, 1996; Glymour, Madigan, Pregibon & Smyth, 1997) The author completed the following three kinds of experiments, based upon the heuristic approach, in order to discover an
effective measure in this article (Aho, Kernighan & Weinberger, 1995).
RESULT OF THE THREE KINDS OF EXPERIMENTS All of these three kinds of experiments are executed under the following conditions. In order to discover an effective 365
Discovering an Effective Measure in Data Mining
measure, the author selected actual data of a place’s name, such as the name of prefecture and the name of a city in Japan from the articles of a nationally circulated newspaper, the Yomiuri. The reasons for choosing a place’s name are as follows: first, there are one-to-many relationships between the name of a prefecture and the name of a city; second, the one-to-many relationship can be checked easily from the maps and telephone directory. Generally speaking, the first name, such as the name of a prefecture, consists of another name, such as the name of a city. For instance, Fukuoka City is geographically located in Fukuoka Prefecture, and Kitakyushu City also is included in Fukuoka Prefecture, so there are one-to-many relationships between the name of the prefecture and the name of the city. The distance measure would be calculated with a large database in the experiments. The experiments were executed as follows: Step 1. Establish the database of the newspaper. Step 2. Choose the prefecture name and city name from the database mentioned in step 1. Step 3. Count the number of parameters a, b, c, and d from the newspaper articles. Step 4. Then, calculate the distance measure adopted. Step 5. Sort the result calculated in step 4 in descent order upon the distance measure. Step 6. List the top 2,000 from the result of step 5. Step 7. Judge the one-to-many relationship whether it is correct or not. Step 8. List the top 1,000 as output data and count its number of correct relationships. Step 9. Finally, an effective measure will be found from the result of the correct number. To uncover an effective one, two methods should be considered. The first one is to test various combinations of each variable in distance measure and to find the best combination, depending upon the result. The second one is to assume that the function is a stable one and that only
To uncover an effective measure, two methods should be considered. The first is to test various combinations of each variable in the distance measure and to find the best combination from the results. The second is to assume that the functional form is stable and that only part of the function can be varied.

The first method is the experiment with the PM. The total number of combinations of the five variables is 3,125. The author calculated all combinations, except for the cases in which the denominator of the PM becomes zero. The result of the top 20 in the PM experiment, using a year's worth of articles from the Yomiuri in 1991, is shown in Table 2. In Table 2, the No. 1 function, a11c1, has the highest correct number; it means S(F, T) = (a × 1 × 1) / (c + 1) = a / (1 + c). Similarly, the No. 11 function, 1a1c0, means S(F, T) = (1 × a × 1) / (c + 0) = a / c. The rest of the functions in Table 2 can be read in the same way. This result appears to be satisfactory, but to prove whether it really is, another experiment should be done. The author adopted the Phi coefficient and calculated its correct number. To compare with the result of the PM experiment, the author iterated the exponential index of the denominator in the Phi coefficient from 0 to 2 in steps of 0.01, based upon the idea of fractal dimension in complexity theory, instead of fixing the exponential index at 0.5. The result of the top 20 in this experiment, again using a year's worth of articles from the Yomiuri in 1991, can be seen in Table 3. Compared with the PM experiment, it is obvious that the number of correct relationships is smaller; therefore, it is necessary to uncover a new, more effective measure. From the previous research by Yamamoto and Umemura (2002), the author knew that an effective measure is the CSM, and so completed the third experiment using the CSM. This experiment iterated the exponential index of the denominator in the CSM from 0 to 1, in steps of 0.01, just like the second experiment. Table 4 shows the results of the top 20, using a year's worth of Yomiuri articles for each year from 1991 to 1997.
Table 2. Result of the top 20 in the PM experiment

No.  Function  Correct Number      No.  Function  Correct Number
1    a11c1     789                 11   1a1c0     778
2    a111c     789                 12   1a10c     778
3    1a1c1     789                 13   11acc     778
4    1a11c     789                 14   11ac0     778
5    11ac1     789                 15   11a0c     778
6    11a1c     789                 16   aa1cc     715
7    a11cc     778                 17   aa1c0     715
8    a11c0     778                 18   aa10c     715
9    a110c     778                 19   a1acc     715
10   1a1cc     778                 20   a1ac0     715
Table 3. Result of the top 20 in the Phi coefficient experiment

No.  Exponential Index  Correct Number      No.  Exponential Index  Correct Number
1    0.26               499                 11   0.20               465
2    0.25               497                 12   0.31               464
3    0.27               495                 13   0.19               457
4    0.24               493                 14   0.32               445
5    0.28               491                 15   0.18               442
6    0.23               488                 16   0.17               431
7    0.29               480                 17   0.16               422
8    0.22               478                 18   0.15               414
9    0.21               472                 19   0.14               410
10   0.30               470                 20   0.13               403
Table 4. Result of the top 20 in the CSM experiment (E.I. means the exponential index; C.N. means the correct number)

        1991        1992        1993        1994        1995        1996        1997
Rank    E.I.  C.N.  E.I.  C.N.  E.I.  C.N.  E.I.  C.N.  E.I.  C.N.  E.I.  C.N.  E.I.  C.N.
1       0.73  850   0.75  923   0.79  883   0.81  820   0.85  854   0.77  832   0.77  843
2       0.72  848   0.76  922   0.80  879   0.82  816   0.84  843   0.78  828   0.78  842
3       0.71  846   0.74  920   0.77  879   0.80  814   0.83  836   0.76  826   0.76  837
4       0.70  845   0.73  920   0.81  878   0.76  804   0.86  834   0.75  819   0.79  829
5       0.74  842   0.72  917   0.78  878   0.75  800   0.82  820   0.74  818   0.75  824
6       0.63  841   0.77  909   0.82  875   0.79  798   0.88  807   0.71  818   0.73  818
7       0.69  840   0.71  904   0.76  874   0.77  798   0.87  805   0.73  812   0.74  811
8       0.68  839   0.78  902   0.75  867   0.78  795   0.89  803   0.70  803   0.72  811
9       0.65  837   0.79  899   0.74  864   0.74  785   0.81  799   0.72  800   0.71  801
10      0.64  837   0.70  898   0.73  859   0.73  769   0.90  792   0.69  800   0.80  794
11      0.62  837   0.69  893   0.72  850   0.72  761   0.80  787   0.79  798   0.81  792
12      0.66  835   0.80  891   0.71  833   0.83  751   0.91  777   0.68  788   0.83  791
13      0.67  831   0.81  884   0.83  830   0.71  741   0.79  766   0.67  770   0.82  790
14      0.75  828   0.68  884   0.70  819   0.70  734   0.92  762   0.80  767   0.70  789
15      0.61  828   0.82  883   0.69  808   0.84  712   0.93  758   0.66  761   0.84  781
16      0.76  824   0.83  882   0.83  801   0.69  711   0.78  742   0.81  755   0.85  780
17      0.60  818   0.84  872   0.68  790   0.85  706   0.94  737   0.65  749   0.86  778
18      0.77  815   0.67  870   0.85  783   0.86  691   0.77  730   0.82  741   0.87  776
19      0.58  814   0.66  853   0.86  775   0.68  690   0.76  709   0.83  734   0.69  769
20      0.82  813   0.85  852   0.67  769   0.87  676   0.95  701   0.64  734   0.68  761
It is obvious from these results that the CSM is more effective than the PM and the Phi coefficient. The relationship between the exponential index of the denominator in the CSM, over its complete range, and the correct number can be seen in Figure 2. The most effective exponential index of the denominator in the CSM lies between 0.73 and 0.85, not at 0.5, as many researchers believe. It would therefore be hard to obtain the best result with the usual form of the CSM.
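As an illustration of the revised CSM explored in the third experiment, the sketch below generalizes the CSM by raising its denominator to a variable exponential index k and scans k from 0 to 1 in 0.01 steps. This is an illustrative reconstruction written for this article: the helper names are invented, and the top-1,000 cutoff follows the procedure described earlier rather than the author's actual program.

```python
def csm_variable_exponent(a, b, c, d, k):
    """CSM with the denominator raised to a variable exponential index k.

    k = 0.5 gives the usual square-root form; the experiment above
    iterates k from 0 to 1 in steps of 0.01.
    """
    return (a * d - b * c) / ((a + c) * (b + d)) ** k

def best_exponent(counts, gold, steps=101):
    """Scan k over [0, 1] and return the k with the highest correct number."""
    best = (None, -1)
    for i in range(steps):
        k = i / 100
        ranked = sorted(counts,
                        key=lambda p: csm_variable_exponent(*counts[p], k),
                        reverse=True)[:1000]
        correct = sum(1 for pair in ranked if pair in gold)
        if correct > best[1]:
            best = (k, correct)
    return best
```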
Determine the Most Effective Exponential Index of the Denominator in the CSM

To discover the most effective exponential index of the denominator in the CSM, the relationship between the exponential index and the total number of documents n was examined, but no relationship was found. In fact, it is hard to believe that the exponential index would vary with the number of documents.
Figure 2. Relationships between the exponential index of the denominator in the CSM (horizontal axis, 0 to 0.96) and the correct number (vertical axis), with one curve for each year, 1991 to 1997
The details of the four parameters a, b, c, and d behind Table 4 are listed in Table 5. To find the most effective exponential index of the denominator in the CSM, the author adopted the average of the cumulative exponential index. Based upon the results in Table 4, the averages of the cumulative exponential index of the top 20 were calculated, and the results are presented in Figure 3. From Figure 3, it is easy to see that the exponential index converges to a constant value of about 0.78. Therefore, the exponential index can be fixed at a certain value; it does not vary with the size of the document collection. In this article, the author sets the exponential index to 0.78 to forecast the correct number. The gap between the correct number forecast by this revised method and the maximum result of the third experiment is illustrated in Figure 4.
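One plausible way to reproduce the curve in Figure 3 is to average the exponential indexes of Table 4 across the seven yearly columns and then take cumulative averages down the ranked list. This reading of the "average of the cumulative exponential index" is an assumption made for illustration, and only the first two rows of Table 4 are typed into the sketch below.

```python
# Exponential indexes from Table 4: each row is one rank, columns are 1991-1997
ei = [
    [0.73, 0.75, 0.79, 0.81, 0.85, 0.77, 0.77],
    [0.72, 0.76, 0.80, 0.82, 0.84, 0.78, 0.78],
    # ... the remaining 18 rows of Table 4 would follow here
]

def cumulative_average(rows):
    """Cumulative average of the per-rank mean exponential index (cf. Figure 3)."""
    out, total = [], 0.0
    for i, row in enumerate(rows, start=1):
        total += sum(row) / len(row)   # mean exponential index at this rank
        out.append(total / i)          # running average over ranks 1..i
    return out

print(cumulative_average(ei))   # the first values already lie close to 0.78
```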
FUTURE TRENDS

Many indexes have been developed for discovering an effective measure of the implicit relationship between words in a large corpus or labels in a large database. Ishiduka, Yamamoto, and Umemura (2003) refer to part of them in their research paper. Almost all of these approaches were developed using a variety of mathematical and statistical techniques, concepts from neural networks, and association rules.
Many phenomena in the world, including parts of economics, follow the normal distribution assumed in statistics, but others do not; for this reason, the author varied the exponential index of the CSM based on the idea of fractal dimension in this article. Conventional approaches may be inherently inaccurate, because they are usually based upon linear mathematics. The concurrence patterns of correlated pair words may be explained better by nonlinear mathematics. Typical tools of nonlinear mathematics are complexity theories, such as chaos theory, cellular automata, percolation models, and fractal theory. It is not hard to predict that many more new measures will be developed in the near future based upon complexity theory.
CONCLUSION

Based upon previous research on distance measures, the author discovered an effective distance measure for one-to-many relationships using a heuristic approach. Three kinds of experiments were conducted, and it was confirmed that an effective measure is the CSM. In addition, it was discovered that the most effective exponential index of the denominator in the CSM is 0.78, not 0.50, as many researchers believe. A great deal of work still needs to be carried out; one open issue is the meaning of the value 0.78. The meaning of this most effective exponential index of the denominator in the CSM should be explained and proved mathematically, although its validity has been evaluated empirically in this article.
Table 5. Relationship between the maximum correct number and the corresponding parameters

Year  Exponential Index  Correct Number       a       b      c      d       n
1991  0.73               850              2,284  46,242  3,930  1,329  53,785
1992  0.75               923                453  57,636  1,556     27  59,672
1993  0.79               883                332  51,649  1,321     27  53,329
1994  0.81               820                365  65,290  1,435     36  67,126
1995  0.85               854              1,500  67,914  8,042    190  77,646
1996  0.77               832                636  56,529  2,237  2,873  62,275
Figure 3. Relationships between the number of values included in the cumulative average (horizontal axis, 1 to 20) and the average of the cumulative exponential index (vertical axis, approximately 0.76 to 0.795)
Figure 4. Gaps between the maximum correct number and the correct number forecast by the revised method, for the years 1991 to 1997

ACKNOWLEDGMENTS

The author would like to express his gratitude to the following: Kyoji Umemura, Professor, Toyohashi University of Technology; Taisuke Horiuchi, Associate Professor, Nagano National College of Technology; Eiko Yamamoto, Researcher, Communication Research Laboratories; Yuji Minami, Associate Professor, Ube National College of Technology; and Michael Hall, Lecturer, Nakamura Gakuen University, who suggested a large number of improvements, both in content and in the computer programs.

REFERENCES

Adriaans, P., & Zantinge, D. (1996). Data mining. Addison Wesley Longman Limited.

Agrawal, R., & Srikant, R. (1995). Mining of association rules between sets of items in large databases. Proceedings of the ACM SIGMOD Conference on Management of Data.

Agrawal, R. et al. (1996). Fast discovery of association rules. In Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.

Aho, A.V., Kernighan, B.W., & Weinberger, P.J. (1989). The AWK programming language. Addison-Wesley Publishing Company, Inc.

Berland, M., & Charniak, E. (1999). Finding parts in very large corpora. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Caraballo, S.A. (1999). Automatic construction of a hypernym-labeled noun hierarchy from text. Proceedings of the Association for Computational Linguistics.

Church, K.W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29.

Fayyad, U.M. et al. (1996). Advances in knowledge discovery and data mining. AAAI Press/MIT Press.

Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Glymour, C. et al. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1(1), 11-28.

Hagita, N., & Sawaki, M. (1995). Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning. Proceedings of the SPIE, The International Society for Optical Engineering.

Han, J., & Kamber, M. (2001). Data mining. Morgan Kaufmann Publishers.

Ishiduka, T., Yamamoto, E., & Umemura, K. (2003). Evaluation of a function that presumes the one-to-many relationship [unpublished research paper, Japanese edition].

Manning, C.D., & Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press.

Umemura, K. (2002). Selecting the most highly correlated pairs within a large vocabulary. Proceedings of the COLING Workshop SemaNet'02, Building and Using Semantic Networks.

Yamamoto, E., & Umemura, K. (2002). A similarity measure for estimation of one-to-many relationship in corpus [Japanese edition]. Journal of Natural Language Processing, 9, 45-75.

KEY TERMS

Complementary Similarity Measurement: An index developed experientially to recognize a poorly printed character by measuring the resemblance to the correct pattern of the character expressed as a vector. In this article, the author diverts this index to identify one-to-many relationships in the concurrence patterns of words in a large corpus or labels in a large database.

Confidence: An asymmetric index that shows the percentage of records for which A occurred within the group of records for which X and Y actually occurred, under the association rule X, Y => A.

Correct Number: For example, if the city name is geographically included in the prefecture name, the author calls the pair correct. The correct number is the total number of correct one-to-many relationships calculated on the basis of the distance measures.

Distance Measure: One of the calculation techniques for discovering the relationship between two implicit words in a large corpus or labels in a large database from the viewpoint of similarity.

Mutual Information: Shows the amount of information that one random variable x contains about another, y. In other words, it compares the probability of observing x and y together with the probabilities of observing x and y independently.

Phi Coefficient: One metric for corpus similarity based upon the chi-square test. It is an index that compares the frequencies of the four parameters a, b, c, and d in documents with the frequencies expected for independence.

Proposal Measure: An index to measure the concurrence frequency of correlated pair words in documents, based upon the frequency with which the two implicit words appear. It takes a high value when the two implicit words in a large corpus or labels in a large database frequently occur with each other.
Discovering Knowledge from XML Documents Richi Nayak Queensland University of Technology, Australia
INTRODUCTION XML is the new standard for information exchange and retrieval. An XML document has a schema that defines the data definition and structure of the XML document (Abiteboul et al., 2000). Due to the wide acceptance of XML, a number of techniques are required to retrieve and analyze the vast number of XML documents. Automatic deduction of the structure of XML documents for storing semi-structured data has been an active subject among researchers (Abiteboul et al., 2000; Green et al., 2002). A number of query languages for retrieving data from various XML data sources also has been developed (Abiteboul et al., 2000; W3c, 2004). The use of these query languages is limited (e.g., limited types of inputs and outputs, and users of these languages should know exactly what kinds of information are to be accessed). Data mining, on the other hand, allows the user to search out unknown facts, the information hidden behind the data. It also enables users to pose more complex queries (Dunham, 2003). Figure 1 illustrates the idea of integrating data mining algorithms with XML documents to achieve knowledge discovery. For example, after identifying similarities among various XML documents, a mining technique can analyze links between tags occurring together within the documents. This may prove useful in the analysis of e-commerce Web documents recommending personalization of Web pages.
Figure 1. XML mining scheme
BACKGROUND: WHAT IS XML MINING?

XML mining includes mining of structures as well as contents from XML documents, as depicted in Figure 2 (Nayak et al., 2002). Element tags and their nesting dictate the structure of an XML document (Abiteboul et al., 2000). For example, the textual structure enclosed by a pair of start and end tags forms an element, and the text between the tags is that element's content.
Intrastructure Mining

This type of mining is concerned with the structure within an XML document; knowledge is discovered about the internal structure of XML documents. The following mining tasks can be applied. The classification task of data mining maps a new XML document to a predefined class of documents. A schema is interpreted as a description of a class of XML documents. The classification procedure takes a collection of schemas as a training set and classifies new XML documents according to this training set.
Figure 2. A taxonomy of XML mining: XML mining comprises XML structure mining (intrastructure mining and interstructure mining) and XML content mining (content analysis and structure clarification)
The clustering task of data mining identifies similarities among various XML documents. A clustering algorithm takes a collection of schemas and groups them together on the basis of self-similarity. These similarities are then used to generate a new schema. As a generalization, the new schema is a superclass of the training set of schemas. This generated set of clustered schemas can then be used in classifying new schemas. The superclass schema can also be used in the integration of heterogeneous XML documents for each application domain. This allows users to find, collect, filter, and manage information sources more effectively on the Internet. Association mining describes relationships between tags that tend to occur together in XML documents and that can be useful in the future. By transforming the tree structure of XML into a pseudo-transaction, it becomes possible to generate rules of the form "if an XML document contains a particular element, then it is likely to contain certain other elements as well."
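As a small illustration of turning XML documents into "pseudo-transactions" of tags and counting tag associations, the following sketch uses Python's standard ElementTree module. It is a toy example written for this article, not one of the systems cited here, and the element names in the sample documents are invented.

```python
from itertools import combinations
from collections import Counter
from xml.etree import ElementTree as ET

def tag_transaction(xml_text):
    """Flatten one XML document into the set of element tags it contains."""
    root = ET.fromstring(xml_text)
    return {elem.tag for elem in root.iter()}

def tag_associations(documents, min_support=2):
    """Count tag pairs that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for doc in documents:
        tags = sorted(tag_transaction(doc))
        pair_counts.update(combinations(tags, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

docs = ['<order><customer/><item/><item/></order>',
        '<order><customer/><invoice/></order>']
print(tag_associations(docs, min_support=2))   # {('customer', 'order'): 2}
```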
Interstructure Mining

This type of mining is concerned with the structure between XML documents; knowledge is discovered about the relationships among subjects, organizations, and nodes on the Web. The following mining tasks can be applied. Clustering schemas involves identifying similar schemas. The clusters are used in defining hierarchies of schemas. The schema hierarchy overlaps instances on the Web, thus discovering authorities and hubs (Garofalakis et al., 1999). Creators of schemas are identified as authorities, and creators of instances are hubs. Additional mining techniques are required to identify all instances of a schema present on the Web. The following application of classification can identify the most likely places to mine for instances. Classification is applied with namespaces and URIs (Uniform Resource Identifiers). Having previously associated a set of schemas with a particular namespace or URI, this information is used to classify new XML documents originating from these places.

Content is the text between each start and end tag in XML documents. Mining for XML content is essentially mining for values (an instance of a relation), and it includes content analysis and structural clarification.
Content Analysis

Concerned with analysing texts within XML documents. The following mining tasks can be applied to contents.
Classification is performed on XML content, labeling new XML content as belonging to a predefined class. To reduce the number of comparisons, pre-existing schemas classify the new document’s schema. Then, only the instance classifications of the matching schemas need to be considered in classifying a new document. Clustering on XML content identifies the potential for new classifications. Again, consideration of schemas leads to quicker clustering; similar schemas are likely to have a number of value sets. For example, all schemas concerning vehicles have a set of values representing cars, another set representing boats, and so forth. However, schemas that appear dissimilar may have similar content. Mining XML content inherits some problems faced in text mining and analysis. Synonymy and polysemy can cause difficulties, but the tags surrounding the content usually can help resolve ambiguities.
Structural Clarification

Concerned with distinguishing similarly structured documents based on their contents. The following mining tasks can be performed. Content provides support for alternative clustering of similar schemas. Two distinctly structured schemas may have document instances with identical content; mining these yields new knowledge. Vice versa, schemas provide support for alternative clustering of content. Two XML documents with distinct content may be clustered together, given that their schemas are similar. Content may also prove important in clustering schemas that appear different but have instances with similar content. Due to heterogeneity, the incidence of synonyms is increased. Are separate schemas actually describing the same thing, only with different terms? While thesauruses are vital, it is impossible for them to be exhaustive for the English language, let alone to handle all languages. Conversely, schemas appearing similar may actually be completely different, given homographs. The similarity of the content does not distinguish the semantic intention of the tags. Mining, in this case, provides probabilities of a tag having a particular meaning, or of a relationship between a meaning and a URI.
METHODS OF XML STRUCTURE MINING

Mining of structures from a well-formed or valid document is straightforward, since a valid document has a schema mechanism that defines the syntax and structure of the document. However, since the presence of a schema is not mandatory for a well-formed XML document, the
document may not always have an accompanying schema. To describe the semantic structure of such documents, schema extraction tools are needed to generate schemas for the given well-formed XML documents. DTD Generator (Kay, 2000) generates a DTD for a given XML document. However, the DTD generator yields a distinct DTD for every XML document; hence, a set of DTDs is defined for a collection of XML documents rather than one overall DTD. The application of data mining operations is therefore difficult in this case. Tools such as XTRACT (Garofalakis, 2000) and DTD-Miner (Moh et al., 2000) infer an accurate and semantically meaningful DTD schema for a given collection of XML documents. However, these tools depend critically on being given a relatively homogeneous collection of XML documents. In such a heterogeneous and flexible environment as the Web, it is not reasonable to assume that XML documents related to the same topic have the same document structure. Due to a number of limitations of using DTDs as an internal structure, such as a limited set of data types, loose structure constraints, and the restriction of content to text, many researchers propose the extraction of XML Schema as an extension of XML DTDs (Feng et al., 2002; Vianu, 2001). In Chidlovskii (2002), a novel XML schema extraction algorithm is proposed, based on Extended Context-Free Grammars (ECFG) with a range of regular expressions. Feng et al. (2002) also presented a semantic network-based design to convey the semantics carried by the XML hierarchical data structures of XML documents and to transform the model into an XML schema. However, both of these proposed algorithms are very complex. Mining of structures from ill-formed XML documents (those that lack any fixed and rigid structure) is performed by applying the structure extraction approaches developed for semi-structured documents. But not all of these techniques can effectively support the structure extraction from XML documents that is required for the further application of data mining algorithms. For instance, the NoDoSe tool (Adelberg & Denny, 1999) determines the structure of a semi-structured document and then extracts the data. This system is based primarily on plain text and HTML files, and it does not support XML. Moreover, the extraction algorithm proposed in Green et al. (2002) considers both structure and contents in semi-structured documents, but its purpose is to query and build an index. These techniques are difficult to use without some alteration and adaptation for the application of data mining algorithms. An alternative method is to approach the document as Object Exchange Model (OEM) (Nestorov et al., 1999; Wang et al., 2000) data by using the corresponding
data graph to produce the most specific data guide (Nayak et al., 2002). The data graph represents the interactions between the objects in a given data domain. When extracting a schema from a data graph, the goal is to produce the most specific schema graph from the original graph. This way of extracting schema is more general than using the schema for a guide, because most of the XML documents do not have a schema, and sometimes, if they have a schema, they do not conform to it.
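The idea of deriving a most-specific structural description from the documents themselves can be illustrated with a small sketch that records, for every element tag, the child tags observed beneath it. This is a simplification written for this article (real schema extraction tools, such as those cited above, also handle ordering, cardinality, and data types), and the sample tags are invented.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

def infer_structure(xml_documents):
    """Infer a rough schema: for every element tag, the child tags seen under it."""
    children = defaultdict(set)
    for text in xml_documents:
        root = ET.fromstring(text)
        stack = [root]
        while stack:
            node = stack.pop()
            for child in node:
                children[node.tag].add(child.tag)
                stack.append(child)
    return dict(children)

docs = ['<book><title/><author/><author/></book>',
        '<book><title/><publisher/></book>']
print(infer_structure(docs))   # {'book': {'title', 'author', 'publisher'}}
```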
METHODS OF XML CONTENT MINING

Before knowledge discovery in XML documents occurs, it is necessary to query XML tags and content to prepare the XML material for mining. An SQL-based query can extract data from XML documents. There are a number of query languages, some designed specifically for XML and some for semi-structured data in general. Semi-structured data can be described by the grammar of SSD (semi-structured data) expressions. The translation of XML to SSD expressions is easily automated (Abiteboul et al., 2000). Query languages for semi-structured data exploit path expressions. In this way, data can be queried to an arbitrary depth. Path expressions are elementary queries with the results returned as a set of nodes. However, the ability to return results as semi-structured data is required, which path expressions alone cannot provide. Combining path expressions with SQL-style syntax provides greater flexibility in testing for equality, performing joins, and specifying the form of query results. Two such languages are Lorel (Abiteboul et al., 2000) and the Unstructured Query Language (UnQL) (Farnandez et al., 2000). UnQL requires more precision and is more reliant on path expressions. XML-QL, XML-GL, XSL, and XQuery are designed specifically for querying XML (W3c, 2004). XML-QL (Garofalakis et al., 1999) and XQuery bring together regular path expressions, SQL-style query techniques, and XML syntax. The great benefit is the construction of the result in XML and, thus, the transformation of XML data from one schema to another. Extensible Stylesheet Language (XSL) is not implemented as a query language but is intended as a tool for transforming XML to HTML. However, XSL's select pattern is a mechanism for information retrieval and, as such, is akin to a query (W3c, 2004). XML-GL (Ceri et al., 1999) is a graphical language for querying and restructuring XML documents.
FUTURE TRENDS There has been extensive effort to devise new technologies to process and integrate XML documents, but a lot of open possibilities still exist. For example, integration of data mining, XML data models and database languages will increase the functionality of relational database products, data warehouses, and XML products. Also, to satisfy the range of data mining users (from naive to expert users), future work should include mining user graphs that are structural information of Web usages, as well as visualization of mined data. As data mining is applied to large semantic documents or XML documents, extraction of information should consider rights management of shared data. XML mining should have the authorization level to empower security to restrict only appropriate users to discover classified information.
CONCLUSION

XML has proved effective in the process of transmitting and sharing data over the Internet. Companies want to bring this advantage into analytical data as well. As XML material becomes more abundant, the ability to gain knowledge from XML sources decreases due to their heterogeneity and structural irregularity; the idea behind XML data mining looks like a solution that can be put to work. Using XML data in the mining process has become possible through new Web-based technologies. Simple Object Access Protocol (SOAP) is a new technology that has enabled XML to be used in data mining. For example, the vTag Web Mining Server aims at monitoring and mining of the Web with the use of information agents accessed by SOAP (Vtag, 2003). Similarly, XML for Analysis defines a communication structure for an application programming interface, which aims at keeping client programming independent from the mechanics of data transport but, at the same time, providing adequate information concerning the data and ensuring that it is properly handled (XMLanalysis, 2003). Another development, YALE, is an environment for machine learning experiments that uses XML files to describe the setup of data mining experiments (Yale, 2004). The Data Miner's ARCADE also uses XML as the target language for all data mining tools within its environment (Arcade, 2004).

REFERENCES
Adelberg, B., & Denny, M. (1999). Nodose version 2.0. Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, Washington. Arcade. (2004). http://datamining.csiro.au/arcade.html Ceri, S. et al. (1999). XML—Gl: A graphical language for querying and restructuring XML documents. Proceedings of the 8th International WWW Conference, Toronto, Canada. Chidlovskii, B. (2002). Schema extraction from XML collections. Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, Portland, Oregon. Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall Farnandez, M., Buneman, P., & Suciu, D. (2000). UNQL: A query language and algebra for semistructured data based on structural recursion. VLDB JOURNAL: Very Large Data Bases, 9(1), 76-110. Feng, L., Chang, E., & Dillon, T. (2002). A semantic network-based design methodology for XML documents. ACM Transactions of Information Systems (TOIS), 20(4), 390-421. Garofalakis, M. et al. (1999). Data mining and the Web: Past, present and future. Proceedings of the Second International Workshop on Web Information and Data Management, Kansas City, Missouri. Garofalakis, M.N. et al. (2000). XTRACT: A system for extracting document type descriptors from XML documents. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas. Green, R., Bean, C.A., & Myaeng, S.H. (2002). The semantics of relationships: An interdisciplinary perspective. Boston: Kluwer Academic Publishers. Kay, M. (2000). SAXON DTD generator—A tool to generate XML DTDs. Retrieved January 2, 2003, from http:/ /home.iclweb.com/ic2/mhkay/dtdgen.html Moh, C.-H., & Lim, E.-P. (2000). DTD-miner: A tool for mining DTD from XML documents. Proceedings of the Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, California.
Nayak, R., Witt, R., & Tonev, A. (2002, June). Data mining and XML documents. Proceedings of the 2002 International Conference on Internet Computing, Nevada.
Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. San Francisco, CA: Morgan Kaufmann.
Nestorov, S. et al. (1999). Representative objects: Concise representation of semi-structured, hierarchical data. Proceedings of the IEEE Conference on Management of Data, Seattle, Washington.

Vianu, V. (2001). A Web odyssey: From Codd to XML. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, California.

Vtag. (2003). http://www.connotate.com/csp.asp

W3c. (2004). XML query (Xquery). Retrieved March 18, 2004, from http://www.w3c.org/XML/Query

Wang, Q., Yu, X.J., & Wong, K. (2000). Approximate graph scheme extraction for semi-structured data. Proceedings of the 7th International Conference on Extending Database Technology, Konstanz.

XMLanalysis. (2003). http://www.intelligenteai.com/feature/011004/editpage.shtml

Yale. (2004). http://yale.cs.uni-dortmund.de/
KEY TERMS

Ill-Formed XML Documents: Lack any fixed and rigid structure.

Valid XML Document: To be valid, an XML document additionally must conform (at least) to an explicitly associated document schema definition.
Well-Formed XML Documents: To be well-formed, a page's XML must have properly nested tags, unique attributes (per element), one or more elements, exactly one root element, and a number of schema-related constraints. Well-formed documents may have a schema, but they are not required to conform to it.

XML Content Analysis Mining: Concerned with analysing texts within XML documents.

XML Interstructure Mining: Concerned with the structure between XML documents. Knowledge is discovered about the relationships among subjects, organizations, and nodes on the Web.

XML Intrastructure Mining: Concerned with the structure within an XML document(s). Knowledge is discovered about the internal structure of XML documents.

XML Mining: Knowledge discovery from XML documents (heterogeneous and structurally irregular). For example, clustering data mining techniques can group a collection of XML documents together according to the similarity of their structures. Classification data mining techniques can classify a number of heterogeneous XML documents into a set of predefined classifications of schemas to improve XML document handling and achieve efficient searches.

XML Structural Clarification Mining: Concerned with distinguishing similarly structured documents based on contents.
Discovering Ranking Functions for Information Retrieval
Weiguo Fan Virginia Polytechnic Institute and State University, USA Praveen Pathak University of Florida, USA
INTRODUCTION

The field of information retrieval deals with finding relevant documents from a large document collection or the World Wide Web in response to a user's query seeking relevant information. Ranking functions play a very important role in the retrieval performance of such retrieval systems and search engines. A single ranking function does not perform well across different user queries and document collections. Hence it is necessary to "discover" a ranking function for a particular context. Adaptive algorithms like genetic programming (GP) are well suited for such discovery.
BACKGROUND

In an information retrieval (IR) system, given a user query, a matching function matches the information in the query with that in the documents to rank the documents in decreasing order of their predicted relevance to the user. The top-ranked documents are then presented to the user as a response to her query. To facilitate this relevance estimation process, both the documents and the queries need to be transformed into a form that can be easily processed by computers. The Vector Space Model (VSM) (Salton, 1989) is one of the most successful models for representing documents and queries. We choose this model as the underlying model in our research due to the ease of interpretation of the model and its tremendous success in retrieval performance studies. Moreover, most existing search engines and IR systems are based on this model. In VSM, both documents and queries are represented as vectors of terms. Suppose there are t terms in the collection; then a document D and a query Q are represented as:

D = (w_d1, w_d2, …, w_dt)
Q = (w_q1, w_q2, …, w_qt)
where w_di and w_qi (for i = 1 to t) are the weights assigned to the different terms in the document and the query, respectively. The similarity between the two vectors is calculated as the cosine of the angle between them. It is expressed as (Salton & Buckley, 1988):

$Similarity(Q, D) = \frac{\sum_{i=1}^{t} w_{qi}\, w_{di}}{\sqrt{\sum_{i=1}^{t} (w_{qi})^2 \times \sum_{i=1}^{t} (w_{di})^2}}$   (1)
The score, called retrieval status value (RSV), is calculated for each document in the collection and the documents are ordered and presented to the user in the decreasing order of RSV. Various content based features are available in VSM to compute the term weights. The most common ones are the term frequency (tf) and the inverse document frequency (idf). Term frequency measures the number of times a term appears in the document or the query. The higher this number, the more important the term is assumed to be in describing the document. Inverse document frequency is calculated as log( N / DF ) , where N is the total number of documents in the collection, and DF is the number of documents in which the term appears. A high value of idf means the term appears in a relatively few number of documents and hence the term is assumed to be important in describing the document. A lot of similar content based features are available in literature (Salton, 1989; Salton & Buckley, 1988). The features can be combined (e.g. tf*idf) to generate a variety of new composite features that can be used in term weighting. Equation (1) suggests that in order to discover a good ranking function, we need to discover the optimal way of assigning weights to document and query keywords. Change in term weighting strategy will essentially change the behavior of a ranking function. In this chapter we describe a method to discover an optimal ranking function.
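A minimal sketch of tf*idf term weighting combined with the cosine similarity of Equation (1) is given below. It is illustrative code written for this chapter, not the implementation used in the reported experiments, and the tiny corpus at the end is invented.

```python
import math
from collections import Counter

def tf_idf_vector(tokens, doc_freq, n_docs):
    """Weight each term by tf * idf, with idf = log(N / DF) as described above."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf if doc_freq.get(t)}

def cosine_similarity(q, d):
    """Equation (1): inner product of the weight vectors, normalized by length."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

docs = [["data", "mining", "in", "free", "text"],
        ["ranking", "functions", "for", "retrieval", "data"]]
df = Counter(t for d in docs for t in set(d))
vecs = [tf_idf_vector(d, df, len(docs)) for d in docs]
query = tf_idf_vector(["data", "retrieval"], df, len(docs))
print(sorted(range(len(docs)),
             key=lambda i: cosine_similarity(query, vecs[i]), reverse=True))
```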
Figure 1. A sample tree representation for a ranking function (internal nodes are operators such as +, -, *, and log; leaf nodes are features such as tf, df, n, and N)
MAIN THRUST In this chapter we present a systematic and automatic discovery process to discover ranking functions. The process is based on an artificial intelligence technique called genetic programming (GP). GP is based on genetic algorithms (GA) (Goldberg, 1989; Holland, 1992). Because of the intrinsic parallel search mechanism and powerful global exploration capability in high-dimensional space, both GA and GP have been used to solve a wide range of hard optimization problems. They are used in various optimal design and data-mining applications (Koza, 1992). GP represents the solution to a problem as a chromosome (or an individual) in a population pool. It evolves the population of chromosomes in successive generations by following the genetic transformation operations such as reproduction, crossover, and mutation to discover chromosomes with better fitness values. A fitness function assigns a fitness value for each chromosome that represents how good the chromosome is at solving the problem at hand. We use GP for discovering ranking functions because of four reasons. First, in GP there is no stringent requirement for an objective function to be continuous. All that is needed is that the objective function should be able to differentiate good solutions from the bad ones. This property allows us to use common IR performance measures, like “average precision” (P_Avg), which are non-linear in nature as objective functions. Second, GP is well suited to represent the common tree based
representations for the solutions. A tree-based representation allows for easier parsing and implementation. An example of a term weighting formula using tree structure is given in Figure 1. We will use such a tree based representation in this chapter. Third, GP is very effective for non-linear function and structure discovery problems where traditional optimization methods do not seem to work well (Banzhaf, Nordin, Keller, & Francone, 1998). Finally, it has been empirically found that GP discovers better solutions than those obtained by conventional heuristic algorithms. Ranking function discovery as presented in this chapter is different from classification in that we seek to find a function that will be used for ranking or prioritizing documents. There are efforts in IR that treat this as a classification problem in which a classifier or discriminant function is used for ranking. But evidence shows that ranking function discovery has yielded better retrieval results than the results obtained by treating this as a classification problem using Support Vector Machines and Neural Networks (Fan, Gordon, Pathak, Wensi, & Fox, 2004; Fuhr & Pfeifer, 1994). We now proceed to describe the discovery process using GP.
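A ranking-function individual of the kind shown in Figure 1 can be represented as a small expression tree and evaluated against a document's feature values, as in the sketch below. The sample tree and feature values are invented for illustration (the tree is only similar in spirit to Figure 1) and are not taken from the chapter's experiments.

```python
import math

# An individual as a nested tuple: (operator, left, right) or a terminal name
SAMPLE_TREE = ("+", ("-", "tf", ("log", "df", None)), ("*", "df", "N"))

def evaluate(tree, features):
    """Recursively evaluate an expression tree on a dictionary of feature values."""
    if isinstance(tree, str):
        return float(features[tree])
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    if op == "log":
        return math.log(max(evaluate(left, features), 1e-9))
    x, y = evaluate(left, features), evaluate(right, features)
    if op == "+": return x + y
    if op == "-": return x - y
    if op == "*": return x * y
    if op == "/": return x / y if y else 0.0
    raise ValueError(op)

print(evaluate(SAMPLE_TREE, {"tf": 3, "df": 20, "N": 1000}))
```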
Ranking Function Discovery by GP

In order to apply GP in our context, we need to define several components. We use a tree structure, as shown in Figure 1, to represent a term weighting formula. The components needed for such a representation are given in Table 1. For the purpose of our discovery framework, we define these parameters as follows:

•	An individual in the population is expressed in terms of a tree, which represents one possible ranking function. A population in a generation consists of P such trees.
•	Terminals: We use the features mentioned in Table 2 and real constants as the terminals.
•	Functions: We use +, -, *, /, and log as the allowed functions.
Table 1. Essential GP components

GP Parameters               Meaning
Terminals                   Leaf nodes in the tree data structure
Functions                   Non-leaf nodes used to combine the leaf nodes; typically numerical operations
Fitness Function            The objective function that needs to be optimized
Reproduction and Crossover  Genetic operators used to copy fit solutions from one generation to another and to introduce diversity in the population
•	Fitness Function: We use P_Avg as the fitness function, defined in Equation (2):

$P\_Avg = \frac{\sum_{i=1}^{|D|} r(d_i) \times \frac{\sum_{j=1}^{i} r(d_j)}{i}}{T_{Rel}}$   (2)
where r(d_i) ∈ {0, 1} is the relevance score assigned to a document, being 1 if the document is relevant and 0 otherwise, |D| is the total number of retrieved documents, and TRel is the total number of relevant documents for the query. This equation incorporates both of the standard retrieval measures of precision and recall. It also takes into account the ordering of the relevant retrieved documents. For example, if there are 20 documents retrieved and only 5 of them are relevant, then the P_Avg score is higher if the relevant documents are higher up in the order (say, the top 5 retrieved documents are relevant) than if the relevant documents are lower in the retrieval order (say, the 16th to 20th documents). This property is very important in many retrieval scenarios where the user is willing to see only the top few retrieved documents. P_Avg is the most widely used measure in retrieval studies for comparing the performance of different systems.
•	Reproduction: Reproduction copies the top (in terms of fitness) trees in the population into the next population. If P is the population size and the reproduction rate is rate_r, then the top rate_r * P trees are copied into the next generation. rate_r is set to 0.1.
•	Crossover: We use tournament selection to select, with replacement, 6 random trees from the population. The top two among the six trees (in terms of fitness) are selected for crossover, and they exchange sub-trees to form trees for the next generation.
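The fitness measure of Equation (2) reduces to the familiar average-precision computation over a ranked list of binary relevance judgments. The following sketch is illustrative rather than the authors' code; it assumes that all relevant documents appear in the retrieved list, so the list total is used for TRel.

```python
def average_precision(relevance):
    """Equation (2): average precision over a ranked list of 0/1 relevance values.

    relevance: r(d_i) for the retrieved documents, in ranked order.
    TRel is approximated by the number of relevant documents in the list,
    which assumes all relevant documents were retrieved.
    """
    total_relevant = sum(relevance)
    if not total_relevant:
        return 0.0
    score, hits = 0.0, 0
    for i, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            score += hits / i        # precision at position i
    return score / total_relevant

print(average_precision([1, 1, 0, 1, 0]))   # about 0.917
print(average_precision([0, 0, 1, 1, 1]))   # lower: relevant documents ranked late
```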
The detailed process for the ranking function discovery is as given in Figure 2. It is an iterative process. First a population of random ranking functions is created. The training documents are divided into a set of training set and a validation set. This two dataset training methodology has been commonly used in machine learning experiments (Mitchell, 1997). Relevance judgments for a query for each of the documents in the training and validation set are known. The random ranking functions are evaluated using relevance information for the training set. The performance measure used is given in Equation (2). The topmost ranking function in terms of performance is noted. The population is subjected to the genetic operations of selection, reproduction, and crossover to generate the next generation. The process is repeated for each generation. At the end of 30 generations the thirty ranking functions are applied to the validation set and the best performing ranking function is chosen as the discovered ranking function for the particular query. It is to be noted that over the 30 generations the best ranking function in each generation does successively improve retrieval performance on the training data set. This improvement in retrieval performance is as expected. However, to avoid overfitting problems that are very common in these techniques, we apply the best ranking function from each of the 30 generations to the unseen validation data set and choose the best performing ranking function on the validation data set to be applied to the test data set. Thus the final ranking function that we choose for applying to the test data set need not necessarily come from the last generation.
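The generation loop described above can be sketched generically as follows. The fitness and crossover routines are assumed to be supplied by the caller, the parameter values (30 generations, reproduction rate 0.1, tournament size 6) follow the text, and everything else is an illustrative choice made for this article.

```python
import random

def evolve(population, fitness, crossover, generations=30, rate_r=0.1,
           tournament_k=6):
    """Generic generational loop following the process described above.

    fitness(tree) must score an individual on the training set (e.g. P_Avg);
    crossover(t1, t2) must return two offspring trees.
    """
    pop_size = len(population)
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(ranked[0])
        next_gen = ranked[: int(rate_r * pop_size)]   # reproduction: copy the fittest
        while len(next_gen) < pop_size:               # crossover fills the rest
            contestants = random.choices(population, k=tournament_k)
            p1, p2 = sorted(contestants, key=fitness, reverse=True)[:2]
            next_gen.extend(crossover(p1, p2))
        population = next_gen[:pop_size]
    return best_per_generation   # validate these and keep the best one
```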
Experiments and Results

We have applied the discovery process described above to various TREC (Harman, 1996; Hawking, 2000; Hawking & Craswell, 2001) document collections. The TREC datasets were divided into training, validation, and test datasets. The discovered ranking function after the training and validation phase was applied to the test dataset. The performance results of our method were compared
Table 2. Features used in the discovery process

Feature      Statistical meaning
tf           Number of times a term appears in a document
tf_max       Maximum tf in the entire document collection
tf_avg       Average tf for a document
df           Number of documents in the collection the term appears in
df_max       Maximum df across all terms
N            Number of documents in the entire text collection
length       Length of a document
length_avg   Average length of a document in the entire collection
n            Number of unique terms in the document
Figure 2. Discovery process

INPUT: Query, training documents
OUTPUT: Discovered ranking function
Process:
•	Training documents are divided into a training set and a validation set
•	An initial population of random ranking functions is generated
•	The following is applied to the training dataset for 30 generations:
	- Use each ranking function in the generation to rank the documents in the training set
	- Create a new population by applying the genetic operators of reproduction and crossover to the existing population
•	The top-performing individual from each generation is applied to the documents in the validation set, and the best-performing ranking function is selected
•	This is the discovered ranking function
with the results obtained by applying well known ranking functions in the literature (OKAPI, Pivoted TFIDF, and INQUERY) (Singhal, Salton, Mitra, & Buckley, 1996) to the same test dataset. These well known functions essentially differ based on their term weighting strategies i.e. the way they combine various weighting features shown in Table 2. We have found that retrieval performance has significantly increased by using our method of discovering ranking functions. The performance improvement, in terms of P_Avg, has been anywhere from 11% to 75%, with the results being statistically significant. More details about the performance comparisons are available elsewhere (Fan, Gordon, & Pathak, 2004a, 2004b).
FUTURE TRENDS

We believe the ranking function discovery outlined in this chapter is just the beginning of a promising avenue of research. This discovery process can be applied to either the routing search task (search specific to a particular query) or the ad-hoc search task (a generalized search for a set of queries). Thus the process could be made more personalized for an individual user with persistent information requirements, or for a group of users (for example, in a department of an organization) with similar, but not identical, information requirements. Another promising avenue of research concerns intelligently leveraging both the structural and the statistical information available in documents, specifically HTML and XML documents. The work presented here utilizes just the statistical information in the documents in terms of frequencies of terms in documents or across documents. But structural information, such as where the terms appear (e.g., in the title, anchor, body, or abstract of the document), can also be leveraged using the discovery process described in the chapter. Such leveraging,
we believe, will yield even better retrieval performance. Finally, we believe the technique mentioned in this chapter can be combined with data-fusion techniques to combine successful traits from various algorithms to yield better retrieval performance.
CONCLUSION In this chapter we have presented a ranking function discovery process to discover new ranking functions. Ranking functions match the information in documents with that in the queries to rank the documents in the decreasing order of predicted relevance of the documents to the user. Although there are well known ranking functions in the IR literature we believe even better ranking functions can be discovered for specific queries or a set of queries. The discovery of ranking functions was accomplished using Genetic Programming, an artificial intelligence algorithm based on evolutionary theory. The results of GP based retrieval have been found to significantly outperform the results obtained by well known ranking functions. We believe this line of research will be potentially very rewarding in terms of much improved retrieval performance.
REFERENCES

Banzhaf, W., Nordin, P., Keller, R., & Francone, F. (1998). Genetic programming: An introduction - On the automatic evolution of computer programs and its applications. San Francisco, CA: Morgan Kaufmann Publishers.

Fan, W., Gordon, M., & Pathak, P. (2004a). Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4), 523-527.

Fan, W., Gordon, M., & Pathak, P. (2004b). A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management, 40(4), 587-602.

Fan, W., Gordon, M. D., Pathak, P., Wensi, X., & Fox, E. (2004). Ranking function optimization for effective web search by genetic programming: An empirical study. Proceedings of the 37th Hawaii International Conference on System Sciences, Big Island, Hawaii. IEEE.

Fuhr, N., & Pfeifer, U. (1994). Probabilistic information retrieval as combination of abstraction inductive learning and probabilistic assumptions. ACM Transactions on Information Systems, 12, 92-115.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.

Harman, D. K. (1996). Overview of the fourth text retrieval conference (TREC-4). In D. K. Harman (Ed.), Proceedings of the Fourth Text Retrieval Conference (Vol. 500-236, pp. 1-24). NIST Special Publication.

Hawking, D. (2000). Overview of the TREC-9 Web track. In E. Voorhees & D. K. Harman (Eds.), Ninth Text Retrieval Conference (Vol. 500-249, pp. 86-102). NIST Special Publication.

Hawking, D., & Craswell, N. (2001). Overview of the TREC-2001 Web track. In E. Voorhees & D. K. Harman (Eds.), Proceedings of the Tenth Text Retrieval Conference (Vol. 500-250, pp. 61-67). NIST.

Holland, J. H. (1992). Adaptation in natural and artificial systems (2nd ed.). MIT Press.

Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Mitchell, T. M. (1997). Machine learning. McGraw Hill.

Salton, G. (1989). Automatic text processing. Reading, MA: Addison-Wesley Publishing Co.

Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513-523.

Singhal, A., Salton, G., Mitra, M., & Buckley, C. (1996). Document length normalization. Information Processing and Management, 32(5), 619-633.

KEY TERMS

Average Precision (P_Avg): A well-known measure of retrieval performance. It is the average of the precision scores calculated every time a new relevant document is found, normalized by the total number of relevant documents in the collection.

Document Cut-Off Value (DCV): The number of documents that the user is willing to see as a response to the query.

Document Frequency: The number of documents in the document collection that the term appears in.

Genetic Programming: A stochastic search algorithm based on evolutionary theory, with the aim of optimizing a structure or functional form. A tree structure is commonly used to represent solutions.

Precision: The ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Ranking Function: A function that matches the information in documents with that in the user query to assign a score to each document in the collection.

Recall: The ratio of the number of relevant documents retrieved to the total number of relevant documents in the document collection.

Term Frequency: The number of times a term appears in a document.

Vector Space Model (VSM): A common IR model in which both documents and queries are represented as vectors of terms.
Discovering Unknown Patterns in Free Text Jan H. Kroeze University of Pretoria, South Africa
INTRODUCTION A very large percentage of business and academic data is stored in textual format. With the exception of metadata, such as author, date, title and publisher, these data are not overtly structured like the standard, mainly numerical, data in relational databases. Parallel to data mining, which finds new patterns and trends in numerical data, text mining is the process aimed at discovering unknown patterns in free text. Owing to the importance of competitive and scientific knowledge that can be exploited from these texts, “text mining has become an increasingly popular and essential theme in data mining” (Han & Kamber, 2001, p. 428). Text mining has a relatively short history: “Unlike search engines and data mining that have a longer history and are better understood, text mining is an emerging technical area that is relatively unknown to IT professions” (Chen, 2001, p. vi).
BACKGROUND

Definitions of text mining vary a great deal, from views that it is an advanced form of information retrieval (IR) to those that regard it as a sibling of data mining:

•	Text mining is the discovery of texts.
•	Text mining is the exploration of available texts.
•	Text mining is the extraction of information from text.
•	Text mining is the discovery of new knowledge in text.
•	Text mining is the discovery of new patterns, trends and relations in and among texts.
Han & Kamber (2001, pp. 428-435), for example, devote much of their rather short discussion of text mining to information retrieval. However, one should differentiate between text mining and information retrieval. Text mining does not consist of searching through metadata and full-text databases to find existing information. The point of view expressed by Nasukawa & Nagano (2001, p. 969), to wit that text mining “is a text version of generalized data mining,” is correct. Text mining should “focus on finding valuable patterns and
rules in text that indicate trends and significant features about specific topics” (Nasukawa & Nagano, 2001, p. 967).
MAIN THRUST Like data mining, text mining is a proactive process that automatically searches data for new relationships and anomalies to serve as a basis for making business decisions aimed at gaining competitive advantage (cf., Rob & Coronel, 2004, p. 597). Although data mining can require some interaction between the investigator and the data-mining tool, it can be considered as an automatic process because “data-mining tools automatically search the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user,” while mere data analysis “relies on the end users to define the problem, select the data, and initiate the appropriate data analyses to generate the information that helps model and solve problems those end-users uncover” (Rob & Coronel, 2004, p. 597). The same distinction is valid for text mining. Therefore, text-mining tools should also “initiate analyses to create knowledge” (Rob & Coronel, 2004, p. 598). In practice, however, the borders between data analysis, information retrieval and text mining are not always quite so clear. Montes-y-Gómez et al. (2004) proposed an integrated approach, called contextual exploration, which combines robust access (IR), non-sequential navigation (hypertext) and content analysis (text mining). According to Smallheiser (2001, pp. 690-691), text mining approaches can be divided into two main types: “Macro analyses perform data-crunching operations over a large, often global set of papers encompassing one or more fields, in order to identify large-scale trends or to classify and organize the literature…. In contrast, micro analyses pose a sharply focused question, in which one searches for complementary information that links two small, pre-specified fields of inquiry.”
The Need for Text Mining

Text mining can be used as an effective business intelligence tool for gaining competitive advantage through
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Discovering Unknown Patterns in Free Text
the discovery of critical, yet hidden, business information. In the field of academic research, text mining can be used to scan large numbers of publications in order to select the most relevant literature and to propose new links between independent research results. Text mining is also needed “to formulate and assess hypotheses arising in biomedical research … and … for helping make policy decisions regarding technical innovation” (Smallheiser, 2001, p. 690). Another application of text mining is in medical science, to discover gene interactions, functions and relations, or to build and structure medical knowledge bases, and to find undiscovered relations between diseases and medications (De Bruijn & Martin, 2002, p. 8).
Types of Text Mining

Keyword-Based Association Analysis

Association analysis looks for correlations between texts based on the occurrence of related keywords or phrases. Texts with similar terms are grouped together. The pre-processing of the texts is very important and includes parsing and stemming, and the removal of words with minimal semantic content. Another issue is the problem of compounds and non-compounds — should the analysis be based on singular words or should word groups be accounted for? (cf., Han & Kamber, 2001, p. 433). Kostoff et al. (2002), for example, have measured the frequencies and proximities of phrases regarding electrochemical power to discover central themes and relationships among them. This knowledge discovery, combined with the interpretation of human experts, can be regarded as an example of knowledge creation through intelligent text mining.
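As a minimal sketch of the keyword co-occurrence idea (not the bibliometric procedure of Kostoff et al., 2002), the following fragment counts how often pairs of keywords appear in the same document after crude stop-word removal; the corpus, stop-word list and threshold are invented placeholders.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; in practice each string would be a full document.
documents = [
    "fuel cell electrodes improve electrochemical power density",
    "battery electrodes and electrolyte materials for electrochemical power",
    "text mining of patents reveals trends in battery materials",
]

STOP_WORDS = {"and", "of", "for", "in", "the", "a"}  # placeholder list

def keywords(text):
    """Crude pre-processing: lower-case, split, drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

pair_counts = Counter()
for doc in documents:
    # Count each unordered keyword pair once per document.
    for pair in combinations(sorted(keywords(doc)), 2):
        pair_counts[pair] += 1

# Pairs appearing in more than one document hint at an association.
for pair, count in pair_counts.most_common():
    if count > 1:
        print(pair, count)
```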
Automatic Document Classification

Electronic documents are classified according to a predefined scheme or training set. The user compiles and refines the classification parameters, which are then used by a computer program to categorise the texts in the given collection automatically (cf., Sullivan, 2001, p. 198). Classification can also be based on the analysis of collocation ["the juxtaposition or association of a particular word with another particular word or words" (The Oxford Dictionary, 1995)]. Words that often appear together probably belong to the same class (Lopes et al., 2004). According to Perrin & Petry (2003) "useful text structure and content can be systematically extracted by collocational lexical analysis" with statistical methods. Text classification can be used by businesses, for example, to categorise customers' e-mails automatically and suggest the appropriate reply templates (Weng & Liu, 2004).
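The following sketch illustrates classification against a predefined training set, in the spirit of the e-mail routing example; scikit-learn is used purely as a convenient toolkit and is not prescribed by the article, and the training snippets and labels are invented.

```python
# Minimal supervised text classification sketch (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "my invoice shows the wrong amount",
    "please correct the billing error on my account",
    "the application crashes when I click save",
    "error message appears after the latest update",
]
train_labels = ["billing", "billing", "technical", "technical"]

vectorizer = CountVectorizer()                 # bag-of-words features
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

new_mail = ["I was charged twice this month"]
X_new = vectorizer.transform(new_mail)
print(classifier.predict(X_new))               # -> ['billing']
```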
Similarity Detection

Texts are grouped according to their own content into categories that were not previously known. The documents are analysed by a clustering computer program, often a neural network, but the clusters still have to be interpreted by a human expert (Hearst, 1999). Document pre-processing (tagging of parts of speech, lemmatisation, filtering and structuring) precedes the actual clustering phase (Iiritano et al., 2004). The clustering program finds similarities between documents, for example, common author, same themes, or information from common sources. The program does not need a training set or taxonomy, but generates it dynamically (cf., Sullivan, 2001, p. 201). One example of the use of text clustering is found in the work of Fattori et al. (2003), whose text-mining tool processes patent documents into dynamic clusters to discover patenting trends, which constitutes information that can be used as competitive intelligence.
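A minimal clustering sketch along these lines is shown below; k-means over TF-IDF vectors stands in for whatever clustering program is used in practice (the article mentions neural networks as one option), and the documents and cluster count are placeholders.

```python
# Group documents by content without a predefined taxonomy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "patent filed for lithium battery electrode design",
    "new electrode materials patented for batteries",
    "central bank raises interest rates again",
    "interest rate decision surprises financial markets",
]

X = TfidfVectorizer().fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(documents, labels):
    print(label, doc)   # documents with the same label ended up together
```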
Link Analysis

"Link analysis is the process of building up networks of interconnected objects through relationships in order to expose patterns and trends" (Westphal & Blaxton, 1998, p. 202). In text databases, link analysis is the finding of meaningful, high levels of correlations between text entities. The user can, for example, suggest a broad hypothesis and then analyse the data in order to prove or disprove this hunch. It can also be an automatic or semiautomatic process, in which a surprisingly high number of links between two or more nodes may indicate relations that have hitherto been unknown. Link analysis can also refer to the use of algorithms to build and exploit networks of hyperlinks in order to find relevant and related documents on the Web (Davison, 2003). Yoon & Park (2004) use link analysis to construct a visual network of patents, which facilitates the identification of a patent's relative importance: "The coverage of the application is wide, ranging from new idea generation to ex post facto auditing" (p. 49). Text mining is also used to identify experts by finding and evaluating links between persons and areas of expertise (Ibrahim, 2004).
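The sketch below illustrates the core of link analysis over text: entities that co-occur in the same document are linked, and heavily weighted links are surfaced for a human analyst. The entity lists are invented; a real system would extract them automatically.

```python
# Build a small network of interconnected objects: nodes are named
# entities, and an edge weight is how often two entities are mentioned
# in the same document.
from collections import Counter
from itertools import combinations

entities_per_document = [
    ["Acme Corp", "J. Smith", "patent 123"],
    ["Acme Corp", "J. Smith"],
    ["Beta Ltd", "patent 123"],
]

edges = Counter()
for entities in entities_per_document:
    for a, b in combinations(sorted(set(entities)), 2):
        edges[(a, b)] += 1

# Surprisingly heavy edges may point to previously unknown relations.
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```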
Sequence Analysis

A sequential pattern is the arrangement of a number of elements, in which one element leads to the next over time (Wong et al., 2000). Sequence analysis is the discovery of patterns that are related to time frames, for example, the origin and development of a news thread (cf., Montes-y-Gómez et al., 2001), or tracking a developing trend in politics or business. It can also be used to predict recurring events.
Anomaly Detection

Anomaly detection is the finding of information that violates the usual patterns, for example, a book that refers to a unique source, or a document lacking typical information. An example of anomaly detection is the detection of irregularities in news reports or different topic profiles in newspapers (Montes-y-Gómez et al., 2001).
Hypertext Analysis

"Text mining is about looking for patterns in natural language text…. Web mining is the slightly more general case of looking for patterns in hypertext and often applies graph theoretical approaches to detect and utilise the structure of web sites" (New Zealand Digital Library, 2002). Marked-up language, especially XML tags, facilitates text mining because the tags can often be used to simulate database attributes and to convert data-centric documents into databases, which can then be exploited (Tseng & Hwung, 2002). Mark-up tags also make it possible to create "artificial structures [that] help us understand the relationship between documents and document components" (Sullivan, 2001, p. 51).
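As an illustration of how mark-up can simulate database attributes, the fragment below reads XML-tagged records into flat rows that could be loaded into a table; the tag names and sample document are assumptions made for the example.

```python
# Turn XML-tagged content into flat records suitable for a relational table.
import xml.etree.ElementTree as ET

xml_text = """
<reports>
  <report><author>Lee</author><year>2004</year><topic>patents</topic></report>
  <report><author>Kim</author><year>2003</year><topic>genomics</topic></report>
</reports>
"""

rows = []
for report in ET.fromstring(xml_text).findall("report"):
    rows.append({
        "author": report.findtext("author"),
        "year": int(report.findtext("year")),
        "topic": report.findtext("topic"),
    })

print(rows)  # list of dict records, ready to be loaded into a database
```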
CRITICAL ISSUES

Many sources on text mining refer to text as "unstructured data." However, it is a fallacy that text data are unstructured. Text is actually highly structured in terms of morphology, syntax, semantics and pragmatics. On the other hand, it must be admitted that these structures are not directly visible: "… text represents factual information … in a complex, rich, and opaque manner"
(Nasukawa & Nagano, 2001, p. 967). Authors also differ on the issue of natural language processing within text mining. Some prefer a more statistical approach (cf., Hearst, 1999), while others feel that linguistic parsing is an essential part of text mining. Sullivan (2001, p. 37) regards the representation of meaning by means of syntactic-semantic representations as essential for text mining: “Text processing techniques, based on morphology, syntax, and semantics, are powerful mechanisms for extracting business intelligence information from documents…. We can scan text for meaningful phrase patterns and extract key features and relationships.” According to De Bruijn & Martin (2002, p. 16), “[l]arge-scale statistical methods will continue to challenge the position of the more syntax-semantics oriented approaches, although both will hold their own place.” In the light of the various definitions of text mining, it should come as no surprise that authors also differ on what qualifies as text mining and what does not. Building on Hearst (1999), Kroeze, Matthee, & Bothma (2003) use the parameters of novelty and data type to distinguish between information retrieval, standard text mining and intelligent text mining (see Figure 1). Halliman (2001, p. 7) also hints at a scale of newness of information: “Some text mining discussions stress the importance of ‘discovering new knowledge.’ And the new knowledge is expected to be new to everybody. From a practical point of view, we believe that business text should be ‘mined’ for information that is ‘new enough’ to give a company a competitive edge once the information is analyzed.” Another issue is the question of when text mining can be regarded as “intelligent.” Intelligent behavior is “the ability to learn from experience and apply knowledge acquired from experience, handle complex situations, solve problems when important information is missing, determine what is important, react quickly and correctly to a new situation, understand visual images, process and manipulate symbols, be creative and imaginative, and use heuristics” (Stair & Reynolds, 2001, p.
421). Intelligent text mining should therefore refer to the interpretation and evaluation of discovered patterns.

Figure 1. A differentiation between information retrieval, standard and intelligent metadata mining, and standard and intelligent text mining (abbreviated from Kroeze, Matthee, & Bothma, 2003)

Data type \ Novelty level: Non-novel investigation (information retrieval) | Semi-novel investigation (knowledge discovery) | Novel investigation (knowledge creation)
Metadata (overtly structured): Information retrieval of metadata | Standard metadata mining | Intelligent metadata mining
Free text (covertly structured): Information retrieval of full texts | Standard text mining | Intelligent text mining
FUTURE TRENDS

Mack and Hehenberger (2002, p. S97) regard the automation of "human-like capabilities for comprehending complicated knowledge structures" as one of the frontiers of "text-based knowledge discovery." Incorporating more artificial intelligence abilities into text-mining tools will facilitate the transition from mainly statistical procedures to more intelligent forms of text mining.
CONCLUSION

Text mining can be regarded as the next frontier in the science of knowledge discovery and creation, enabling businesses to acquire sought-after competitive intelligence, and helping scientists of all academic disciplines to formulate and test new hypotheses. The greatest challenges will be to select the most appropriate technology for specific problems and to popularise these new technologies so that they become instruments that are generally known, accepted and widely used.
REFERENCES

Chen, H. (2001). Knowledge management systems: A text mining perspective. Tucson, AZ: University of Arizona (Knowledge Computing Corporation).

Davison, B.D. (2003). Unifying text and link analysis. In Text-Mining & Link-Analysis Workshop of the 18th International Joint Conference on Artificial Intelligence. Retrieved from http://www-2.cs.cmu.edu/~dunja/TextLink2003/

De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67(1-3), 7-18.

Fattori, M., Pedrazzi, G., & Turra, R. (2003). Text mining applied to patent mapping: A practical business case. World Patent Information, 25(4), 335-342.

Halliman, C. (2001). Business intelligence using smart techniques: Environmental scanning using text mining and competitor analysis using scenarios and manual simulation. Houston, TX: Information Uncover.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.
Hearst, M.A. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics. University of Maryland, June 20-26 (invited paper). Retrieved from http://www.ai.mit.edu/people/jimmylin/papers/Hearst99a.pdf

Ibrahim, A. (2004). Expertise location: Can text mining help? In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 109-118). Southampton: WIT Press.

Iiritano, S., Ruffolo, M., & Rullo, P. (2004). Preprocessing method and similarity measures in clustering-based text mining: A preliminary study. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 73-79). Southampton: WIT Press.

Kostoff, R.N., Tshiteya, R., Pfeil, K.M., & Humenik, J.A. (2002). Electrochemical power text mining using bibliometrics and database tomography. Journal of Power Sources, 110(1), 163-176.

Kroeze, J.H., Matthee, M.C., & Bothma, T.J.D. (2003). Differentiating data- and text-mining terminology. In IT Research in Developing Countries - Proceedings of SAICSIT 2003 (Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists) (pp. 93-101), September 17-19, Pretoria. SAICSIT.

Lopes, M.C.S., Terra, G.S., Ebecken, N.F.F., & Cunha, G.G. (2004). Mining text databases on clients opinion for oil industry. In N.F.F. Ebecken, C.A. Brebbia, & A. Zanas (Eds.), Data mining IV (pp. 139-147). Southampton: WIT Press.

Mack, R., & Hehenberger, M. (2002). Text-based knowledge discovery: Search and mining of life-science documents. Drug Discovery Today, 7(11) (Suppl.), S89-S98.

Montes-y-Gómez, M., Gelbukh, A., & López-López, A. (2001). Mining the news: Trends, associations, and deviations. Computación y Sistemas, 5(1). Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2001/NewsMining-CyS01.pdf

Montes-y-Gómez, M., Pérez-Coutiño, M., Villaseñor-Pineda, L., & López-López, A. (2004). Contextual exploration of text collections. Lecture Notes in Computer Science (Vol. 2945). Berlin: Springer-Verlag. Retrieved from http://ccc.inaoep.mx/~mmontesg/publicaciones/2004/ContextualExploration-CICLing04.pdf

Nasukawa, T., & Nagano, T. (2001). Text analysis and knowledge mining system. IBM Systems Journal, 40(4), 967-984.
New Zealand Digital Library, University of Waikato. (2002). Text mining. Retrieved from http://www.cs.waikato.ac.nz/~nzdl/textmining/

Perrin, P., & Petry, F.E. (2003). Extraction and representation of contextual information for knowledge discovery in texts. Information Sciences, 151, 125-152.

Rob, P., & Coronel, C. (2004). Database systems: Design, implementation, and management (6th ed.). Boston, MA: Course Technology.

Smallheiser, N.R. (2001). Predicting emerging technologies with the aid of text-based data mining: The micro approach. Technovation, 21(10), 689-693.

Stair, R.M., & Reynolds, G.W. (2001). Principles of information systems: A managerial approach (5th ed.). Boston, MA: Course Technology.

Sullivan, D. (2001). Document warehousing and text mining: Techniques for improving business operations, marketing, and sales. New York, NY: John Wiley.

Tseng, F.S.C., & Hwung, W.J. (2002). An automatic load/extract scheme for XML documents through object-relational repositories. Journal of Systems and Software, 64(3), 207-218.

Weng, S.S., & Liu, C.K. (2004). Using text classification and multiple concepts to answer e-mails. Expert Systems with Applications, 26(4), 529-543.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. New York, NY: John Wiley.

Wong, P.K., Cowley, W., Foote, H., Jurrus, E., & Thomas, J. (2000). Visualizing sequential patterns for text mining. In Proceedings of the IEEE Symposium on Information Visualization 2000 (p. 105). Retrieved from http://portal.acm.org/citation.cfm

Yoon, B., & Park, Y. (2004). A text-mining-based patent network: Analytical tool for high-technology trend. The Journal of High Technology Management Research, 15(1), 37-50.

KEY TERMS

Business Intelligence: "Any information that reveals threats and opportunities that can motivate a company to take some action" (Halliman, 2001, p. 3).

Competitive Advantage: The head start a business has owing to its access to new or unique information and knowledge about the market in which it is operating.

Hypertext: A collection of texts containing links to each other to form an interconnected network (Sullivan, 2001, p. 46).

Information Retrieval: The searching of a text collection based on a user's request to find a list of documents organised according to its relevance, as judged by the retrieval engine (Montes-y-Gómez et al., 2004). Information retrieval should be distinguished from text mining.

Knowledge Creation: The evaluation and interpretation of patterns, trends or anomalies that have been discovered in a collection of texts (or data in general), as well as the formulation of its implications and consequences, including suggestions concerning reactive business decisions.

Knowledge Discovery: The discovery of patterns, trends or anomalies that already exist in a collection of texts (or data in general), but have not yet been identified or described.

Mark-Up Language: Tags that are inserted in free text to mark structure, formatting and content. XML tags can be used to mark attributes in free text and to transform free text into an exploitable database (cf., Tseng & Hwung, 2002).

Metadata: Information regarding texts, for example, author, title, publisher, date and place of publication, journal or series, volume, page numbers, key words, etc.

Natural Language Processing (NLP): The automatic analysis and/or processing of human language by computer software, "focussed on understanding the contents of human communications." It can be used to identify relevant data in large collections of free text for a data mining process (Westphal & Blaxton, 1998, p. 116).

Parsing: A (NLP) process that analyses linguistic structures and breaks them down into parts, on the morphological, syntactic or semantic level.

Stemming: Finding the root form of related words, for example singular and plural nouns, or present and past tense verbs, to be used as key terms for calculating occurrences in texts.

Text Mining: The automatic analysis of a large text collection in order to identify previously unknown patterns, trends or anomalies, which can be used to derive business intelligence for competitive advantage or to formulate and test scientific hypotheses.
Discovery Informatics
William W. Agresti
Johns Hopkins University, USA
INTRODUCTION

Discovery informatics is an emerging methodology that brings together several threads of research and practice aimed at making sense out of massive data sources. It is defined as "the study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data" (Agresti, 2003).
BACKGROUND

In this broad-based conceptualization, discovery informatics may be seen as taking shape by drawing on the following established disciplines:
• Database Management: Data models, data analysis, data structures, data management, federation of databases, data warehouses, database management systems.
• Pattern Recognition: Statistical processes, classifier design, image data analysis, similarity measures, feature extraction, fuzzy sets, clustering algorithms.
• Information Storage and Retrieval: Indexing, content analysis, abstracting, summarization, electronic content management, search algorithms, query formulation, information filtering, relevance and recall, storage networks, storage technology.
• Knowledge Management: Knowledge sharing, knowledge bases, tacit and explicit knowledge, relationship management, content structuring, knowledge portals, collaboration support systems.
• Artificial Intelligence: Learning, concept formation, neural nets, knowledge acquisition, intelligent systems, inference systems, Bayesian methods, decision support systems, problem solving, intelligent agents, text analysis, natural language processing.
What distinguishes discovery informatics is that it brings coherence across dimensions of technologies and domains to focus on discovery. It recognizes and builds upon excellent programs of research and practice in individual disciplines and application areas. It looks selectively across these boundaries to find anything (e.g., ideas, tools, strategies, and heuristics) that will help with the critical task of discovering new information. To help characterize discovery informatics, it may be useful to see if there are any roughly analogous developments elsewhere. Two examples—knowledge management and core competence—may be instructive as reference points. Knowledge management, which began its evolution in the early 1990s, is the practice of transforming the intellectual assets of an organization into business value (Agresti, 2000). Of course, before 1990, organizations, to varying degrees, knew that the successful delivery of products and services depended on the collective knowledge of employees. However, KM challenged organizations to focus on knowledge and recognize its key role in their success. They found value in addressing questions such as the following:

• What is the critical knowledge that should be managed?
• Where is the critical knowledge?
• How does knowledge get into products and services?
When C. K. Prahalad and Gary Hamel published their highly influential paper, “The Core Competence of the Corporation” (Prahalad & Hamel, 1990), companies had some capacity to identify what they were good at. However, as with KM, most organizations did not appreciate how identifying and cultivating core competences (CC) may make the difference between being competitive or not. A core competence is not the same as what you are good at or being more vertically integrated. It takes dedication, skill, and leadership to effectively identify, cultivate, and deploy core competences for organizational success. Both KM and CC illustrate the potential value of taking on a specific perspective. By doing so, an organization will embark on a worthwhile reexamination of familiar topics—its customers, markets, knowledge sources, competitive environment, operations, and success criteria. The claim of this article is that discovery informatics represents a distinct perspective, one that is potentially highly beneficial, because, like KM and CC,
it strikes at what is often an essential element for success and progress—discovery.
MAIN THRUST

Both the technology and application dimensions will be explored to help clarify the meaning of discovery informatics.
Discovery Across Technologies

The technology dimension is considered broadly to include automated hardware and software systems, theories, algorithms, architectures, techniques, methods, and practices. Included here are familiar elements associated with data mining and knowledge discovery, such as clustering, link analysis, rule induction, machine learning, neural networks, evolutionary computation, genetic algorithms, and instance-based learning (Wang, 2003). However, the discovery informatics viewpoint goes further, to activities and advances that are associated with other areas but should be seen as having a role in discovery. Some of these activities, like searching or knowledge sharing, are well known from everyday experiences. Conducting searches on the Internet is a common practice that needs to be recognized as part of a thread of information retrieval. Because it is practiced essentially by all Internet users and involves keyword search, there is a tendency to minimize its importance. Search technology is extremely sophisticated (Baeza-Yates & Ribiero-Neto, 1999). People always have some starting point for their searches. Often, it is not a keyword, but a concept. So people are forced to perform the transformation from a notional concept of what is desired to a list of one or more keywords. The net effect can be the familiar many-thousand hits from the search engine. Even though the responses are ranked for relevance (a rich and research-worthy subject itself), people may still find that the returned items do not match their intended concepts. Offering hope for improved search are advances in concept-based search (Houston & Chen, 2004), more intuitiveness to a person's sense of "find me content like this," where this can be a concept embodied in an entire document or series of documents. For example, a person may be interested in learning which parts of a new process guideline are being used in practice in the pharmaceutical industry. Trying to obtain that information through keyword searches typically would involve trial and error on various combinations of keywords. What the person would like to do is to point a search tool to an entire folder of multimedia electronic content and ask the tool to effectively integrate over the folder
contents and then discover new items that are similar. Current technology can support this ability to associate a fingerprint with a document (Heintze, 2004) in order to characterize its meaning, thereby enabling concept-based searching. Discovery informatics recognizes that advances in search and retrieval enhance the discovery process. This same semantic analysis can be exploited in other settings, such as within organizations. It is possible now to have your e-mail system prompt you, based on the content of messages you compose. When you click send, the e-mail system may open a dialogue box (e.g., Do you also want to send that to Mary?). The system has analyzed the content of your message, determining that, for messages in the past having similar content, you also have sent them to Mary. So the system is now asking you if you have perhaps forgotten to include her. While this feature can certainly be intrusive and bothersome unless it is wanted, the point is that the same semantic analysis advances are at work here as with the Internet search example. The informatics part of discovery informatics also conveys the breadth of science and technology needed to support discovery. There are commercially available computer systems and special-purpose software dedicated to knowledge discovery (see listings at http://www.kdnuggets.com/). The informatics support includes comprehensive hardware-software discovery platforms as well as advances in algorithms and data structures, which are core subjects of computer science. The latest developments in data sharing, application integration, and human-computer interfaces are used extensively in the automated support of discovery. Particularly valuable, because of the voluminous data and complex relationships, are advances in visualization (Marakas, 2003). Commercial visualization packages are used widely to display patterns and to enable expert interaction and manipulation of the visualized relationships.
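A minimal "find me content like this" sketch is given below: documents are represented as TF-IDF vectors, standing in for the document fingerprints discussed above, and ranked by cosine similarity to a query document. scikit-learn is used only for convenience, and the corpus and query are invented.

```python
# Rank stored documents by similarity to a query document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "process guideline adoption in pharmaceutical manufacturing",
    "quarterly earnings of the retail sector",
    "validation of manufacturing process guidelines for drug products",
]
query = ["which parts of the new process guideline are used in practice"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.2f}  {doc}")   # most concept-similar documents first
```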
Discovery Across Domains

Discovery informatics encourages a view that spans application domains. Over the past decade, the term has been associated most often with drug discovery in the pharmaceutical industry, mining biological data. The financial industry also was known for employing talented programmers to write highly sophisticated mathematical algorithms for analyzing stock trading data, seeking to discover patterns that could be exploited for financial gain. Retailers were prominent in developing large data warehouses that enabled mining across inventory, transaction, supplier, marketing, and demographic databases. The situation (marked by drug discovery informatics, financial discovery informatics, etc.) was
evolving into one in which discovery informatics was preceded by more and more words as it was being used in an increasing number of domain areas. One way to see the emergence of discovery informatics is to strip away the domain modifiers and recognize the universality of every application area and organization wanting to take advantage of its data. Discovery informatics techniques are very influential across professions and domains:

• In the financial community, the FinCEN Artificial Intelligence System (FAIS) uses discovery informatics methods to help identify money laundering and other crimes. Rule-based inference is used to detect networks of relationships among people and bank accounts so that human analysts can focus their attention on potentially suspicious behavior (Senator, Goldberg & Wooton, 1995).
• In the education profession, there are exciting scenarios that suggest the potential for discovery informatics techniques. In one example, a knowledge manager works with a special education teacher to evaluate student progress. The data warehouse holds extensive data on students, their performance, teachers, evaluations, and course content. The educators are pleased that the aggregate performance data for the school is satisfactory. However, with discovery informatics capabilities, the story does not need to end there. The data warehouse permits finer granularity examination, and the discovery informatics tools are able to find patterns that are hidden by the summaries. The tools reveal that examinations involving word problems show much poorer performance for certain students. Further analysis shows that this outcome was also true in previous courses for these students. Now, the educators have the specific data enabling them to take action in the form of targeted tutorials for specific students to address the problem (Tsantis & Castellani, 2001). A small sketch after this list illustrates this kind of drill-down.
• Bioinformatics requires the intensive use of information technology and computer science to address a wide array of challenges (Watkins, 2001). One example illustrates a multi-step computational method to predict gene models, an important activity that has been addressed to date by combinations of gene prediction programs and human experts. The new method is entirely automated, involving optimization algorithms and the careful integration of the results of several gene prediction programs, including evidence from annotation software. Statistical scoring is implemented with decision trees to show clearly the gene prediction results (Allen, Pertea & Salzberg, 2004).
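The following sketch illustrates the drill-down idea from the education example: an aggregate figure looks satisfactory, but grouping the same records at finer granularity exposes a pattern the summary hides. The records and field names are invented placeholders.

```python
# An aggregate mean hides a pattern that finer-grained grouping reveals.
from collections import defaultdict

records = [
    {"student": "s1", "question_type": "word_problem", "score": 40},
    {"student": "s1", "question_type": "numeric", "score": 85},
    {"student": "s2", "question_type": "word_problem", "score": 90},
    {"student": "s2", "question_type": "numeric", "score": 88},
]

overall = sum(r["score"] for r in records) / len(records)
print(f"overall mean: {overall:.1f}")          # looks satisfactory

by_group = defaultdict(list)
for r in records:
    by_group[(r["student"], r["question_type"])].append(r["score"])

for key, scores in sorted(by_group.items()):
    print(key, sum(scores) / len(scores))      # s1 struggles with word problems
```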
Common Elements of Discovery Informatics
The only constant in discovery informatics is data and an interacting entity with an interest in discovering new information from it. What varies and has an enormous effect on the ease of discovering new information is everything else, notably the following:
Data

• Volume: How much?
• Accessibility: Ease of access and analysis?
• Quality: How clean and complete is it? Can it be trusted as accurate?
• Uniformity: How homogeneous are the data? Are they in multiple forms, structures, formats, and locations?
• Medium: Text, numbers, audio, video, image, electronic, or magnetic signals or emanations?
• Structure of the Data: Formatted rigidly, partially, or not at all? If text, does it adhere to known language?
Interacting Entity

• Nature: Is it a person or intelligent agent?
• Need, Question, or Motivation of the User: What is prompting a person to examine these data? How sharply defined is the question or need? What expectations exist about what might be found? If the motivation is to find something interesting, what does that mean in context?
A wide variety of activities may be seen as variations on this scheme. An individual using a search engine to search the Internet for a home mortgage serves as an instance of query-driven discovery. A retail company looking for interesting patterns in its transaction data is engaged in data-driven discovery. An intelligent agent examining newly posted content to the Internet based on a person’s profile is also engaged in a process of discovery, only the interacting entity is not a person.
FUTURE TRENDS

There are many indications that discovery informatics will grow in relevance. Storage costs are dropping; for example, the cost per gigabyte of magnetic disk storage declined by a factor of 500 from 1990 to 2000 (University of California at Berkeley, 2003). Our stockpiles of data are expanding rapidly in every field of endeavor. Businesses at one time were comfortable with operational data summarized over days or even weeks. Increasing automation led to point-of-decision data on transactions. With online purchasing, it is now possible to know the sequence of clickstreams leading up to the sale. So the granularity of the data is becoming finer, as businesses are learning more about their customers and about ways to become more profitable. This business analytics is essential for organizations to be competitive. A similar process of finer granular data exists in bioinformatics. In the human body, there are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins (Watkins, 2001). The advances in decoding the human genome are remarkable, but it is proteins that "ultimately regulate metabolism and disease in the body" (Watkins, 2001, p. 27). So, the challenges for bioinformatics continue to grow along with the data.
CONCLUSION

Discovery informatics is an emerging methodology that promotes a crosscutting and integrative view. It looks across both technologies and application domains to identify and organize the techniques, tools, and models that improve data-driven discovery. There are significant research questions as this methodology evolves. Continuing progress will be received eagerly from efforts in individual strategies for knowledge discovery and machine learning, such as the excellent contributions in Koza, et al. (2003). An additional opportunity is to pursue the recognition of unifying aspects of practices now associated with diverse disciplines. While the anticipation of new discoveries is exciting, the evolving practical application of discovery methods needs to respect individual privacy and a diverse collection of laws and regulations. Balancing these requirements constitutes a significant and persistent challenge as new concerns emerge and as laws are drafted. Looking ahead to the challenges and opportunities of the 21st century, discovery informatics is poised to help people and organizations learn as much as possible from the world's abundant and ever-growing data assets.
REFERENCES

Agresti, W.W. (2000). Knowledge management. Advances in Computers, 53, 171-283.
Agresti, W.W. (2003). Discovery informatics. Communications of the ACM, 46(8), 25-28.

Allen, J.E., Pertea, M., & Salzberg, S.L. (2004). Computational gene prediction using multiple sources of evidence. Genome Research, 14, 142-148.

Baeza-Yates, R., & Ribiero-Neto, B. (1999). Modern information retrieval. Reading, MA: Addison-Wesley.

Bergeron, B. (2003). Bioinformatics computing. Upper Saddle River, NJ: Prentice Hall.

Heintze, N. (2004). Scalable document fingerprinting. Carnegie Mellon University. Retrieved from http://www2.cs.cmu.edu/afs/cs/user/nch/www/koala/main.html

Houston, A., & Chen, H. (2004). A path to concept-based information access: From national collaboratories to digital libraries. University of Arizona. Retrieved from http://ai.bpa.arizona.edu/go/intranet/papers/Book7.pdf

Koza, J.R. et al. (Eds.) (2003). Genetic programming IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Marakas, G.M. (2003). Modern data warehousing, mining, and visualization. Upper Saddle River, NJ: Prentice Hall.

Prahalad, C.K., & Hamel, G. (1990). The core competence of the corporation. Harvard Business Review, 3, 79-91.

Senator, T.E., Goldberg, H.G., & Wooton, J. (1995). The financial crimes enforcement network AI system (FAIS): Identifying potential money laundering from reports of large cash transactions. AI Magazine, 16, 21-39.

Tsantis, L., & Castellani, J. (2001). Enhancing learning environments through solution-based knowledge discovery tools: Forecasting for self-perpetuating systemic reform. Journal of Special Education Technology, 16, 39-52.

University of California at Berkeley. (2003). How much information? School of Information Management and Systems. Retrieved from http://www.sims.berkeley.edu/research/projects/how-much-info/how-much-info.pdf

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Watkins, K.J. (2001). Bioinformatics. Chemical & Engineering News, 79, 26-45.
KEY TERMS

Clickstream: The sequence of mouse clicks executed by an individual during an online Internet session.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping.

Discovery Informatics: The study and practice of employing the full spectrum of computing and analytical science and technology to the singular pursuit of discovering new information by identifying and validating patterns in data.

Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models and then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, which can be trained to find nonlinear relationships in data.

Rule Induction: Process of learning from cases or instances the if-then rule relationships consisting of an antecedent (i.e., if-part, defining the preconditions or coverage of the rule) and a consequent (i.e., then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).
Discretization for Data Mining

Ying Yang
Monash University, Australia

Geoffrey I. Webb
Monash University, Australia
INTRODUCTION

Discretization is a process that transforms quantitative data into qualitative data. Quantitative data are commonly involved in data mining applications. However, many learning algorithms are designed primarily to handle qualitative data. Even for algorithms that can directly deal with quantitative data, learning is often less efficient and less effective. Hence research on discretization has long been active in data mining.
BACKGROUND

Many data mining systems work best with qualitative data, where the data values are discrete descriptive terms such as young and old. However, lots of data are quantitative, for example, with age being represented by a numeric value rather than a small number of descriptors. One way to apply existing qualitative systems to such quantitative data is to transform the data. Discretization is a process that transforms data containing a quantitative attribute so that the attribute in question is replaced by a qualitative attribute. A many-to-one mapping function is created so that each value of the original quantitative attribute is mapped onto a value of the new qualitative attribute. First, discretization divides the value range of the quantitative attribute into a finite number of intervals. The mapping function associates all of the quantitative values in a single interval to a single qualitative value. A cut point is a value of the quantitative attribute where a mapping function locates an interval boundary. For example, a quantitative attribute recording age might be mapped onto a new qualitative age attribute with three values, pre-teen, teen, and post-teen. The cut points for such a discretization may be 13 and 18. Values of the original quantitative age attribute that are below 13 might get mapped onto the pre-teen value of the new attribute, values from 13 to 18 onto teen, and values above 18 onto post-teen.

Various discretization methods have been proposed. Diverse taxonomies exist in the literature to categorize discretization methods. These taxonomies are complementary, each relating to a different dimension along which discretization methods may differ. Typically, discretization methods can be either primary or composite. Primary methods accomplish discretization without reference to any other discretization method. Composite methods are built on top of some primary method(s). Primary methods can be classified as per the following taxonomies.

• Supervised vs. Unsupervised (Dougherty, Kohavi, & Sahami, 1995): Supervised methods are only applicable when mining data that are divided into classes. These methods refer to the class information when selecting discretization cut points. Unsupervised methods do not use the class information. For example, when trying to predict whether a customer will be profitable, the data might be divided into two classes, profitable and unprofitable. A supervised discretization technique would take account of how useful the selected cut point was for identifying whether a customer was profitable. An unsupervised technique would not. Supervised methods can be further characterized as error-based, entropy-based or statistics-based. Error-based methods apply a learner to the transformed data and select the intervals that minimize error on the training data. In contrast, entropy-based and statistics-based methods assess respectively the class entropy or some other statistic regarding the relationship between the intervals and the class.
• Parametric vs. Non-Parametric: Parametric discretization requires the user to specify parameters for each discretization performed. An example of such a parameter is the maximum number of intervals to be formed. Non-parametric discretization does not utilize user-specified parameters.
• Hierarchical vs. Non-Hierarchical: Hierarchical discretization utilizes an incremental process to select cut points. This creates an implicit hierarchy over the value range. Hierarchical discretization can be further characterized as either split or merge (Kerber, 1992). Split discretization starts with a single interval that encompasses the entire value range, then repeatedly splits it into sub-intervals until some stopping criterion is satisfied. Merge discretization starts with each value in a separate interval, then repeatedly merges adjacent intervals until a stopping criterion is met. It is possible to combine both split and merge techniques. For example, initial intervals may be formed by splitting, and a merge process is then applied to post-process these initial intervals. Non-hierarchical discretization creates intervals without forming a hierarchy. For example, many methods form the intervals sequentially in a single scan through the data.
• Univariate vs. Multivariate (Bay, 2000): Univariate methods discretize an attribute without reference to attributes other than the class. In contrast, multivariate methods consider relationships among attributes during discretization.
• Disjoint vs. Non-Disjoint (Yang & Webb, 2002): Disjoint methods discretize the value range of an attribute into intervals that do not overlap. Non-disjoint methods allow overlap between intervals.
• Global vs. Local (Dougherty, Kohavi, & Sahami, 1995): Global methods create a single mapping function that is applied throughout a given classification task. Local methods allow different mapping functions for a single attribute in different classification contexts. For example, decision tree learning may discretize a single attribute into different intervals at different nodes of a tree (Quinlan, 1993). Global techniques are more efficient, because one discretization is used throughout the entire data mining process, but local techniques may result in the discovery of more useful cut points.
• Eager vs. Lazy (Hsu, Huang, & Wong, 2000, 2003): Eager methods generate the mapping function prior to classification time. Lazy methods generate the mapping function as it is needed during classification time.
• Ordinal vs. Nominal: Ordinal discretization forms a mapping function from quantitative to ordinal qualitative data. It seeks to retain the ordering information implicit in quantitative attributes. In contrast, nominal discretization forms a mapping function from quantitative to nominal qualitative data, thereby discarding any ordering information. For example, suppose the value range 0-29 were discretized into three intervals, 0-9, 10-19 and 20-29. If the intervals are treated as nominal, then a value in the interval 0-9 will be treated as being as dissimilar to one in 20-29 as it is to one in 10-19. In contrast, while ordinal discretization will treat the difference between 9 and either 10 or 19 as equivalent, it retains the information that this difference is less than the difference between 9 and 29.
• Fuzzy vs. Non-Fuzzy (Ishibuchi, Yamamoto, & Nakashima, 2001; Wu, 1999): Fuzzy discretization creates a fuzzy mapping function. A value may belong to multiple intervals, each with varying degrees of strength. Non-fuzzy discretization forms exact cut points.
Composite methods first generate a mapping function using an initial primary method. They then use other primary methods to adjust the initial cut points.
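To make the mapping-function idea concrete, the sketch below implements the age example from this section: cut points at 13 and 18 define three intervals mapped onto qualitative labels. The boundary convention (a value equal to a cut point falls into the upper interval here) is a choice made for the example.

```python
# Map a quantitative value onto a qualitative interval label.
from bisect import bisect_right

CUT_POINTS = [13, 18]
LABELS = ["pre-teen", "teen", "post-teen"]

def discretize(age):
    """Return the label of the interval the age falls into."""
    return LABELS[bisect_right(CUT_POINTS, age)]

print([(a, discretize(a)) for a in (5, 13, 16, 30)])
# -> [(5, 'pre-teen'), (13, 'teen'), (16, 'teen'), (30, 'post-teen')]
```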
MAIN THRUST

The main thrust of this chapter deals with how to select a discretization method. This issue is particularly important since there exist a large number of discretization methods and none can be universally optimal. When selecting between discretization methods it is critical to take account of the learning context, in particular, of the learning algorithm, the nature of the data, and the learning objectives. Different learning contexts have different characteristics and hence have different requirements for discretization. It is unrealistic to pursue a universally optimal discretization approach that is blind to its learning context. Many discretization techniques have been developed primarily in the context of a specific type of learning algorithm, such as decision tree learning, decision rule learning, naive-Bayes learning, Bayes network learning, clustering, and association learning. Different types of learning have different characteristics and hence require different strategies of discretization. For example, decision tree learners can suffer from the fragmentation problem. If an attribute has many values, a split on this attribute will result in many branches, each of which receives relatively few training instances, making it difficult to select appropriate subsequent tests. Hence they may benefit more than other learners from discretization that results in few intervals. Decision rule learners may require pure intervals (containing instances dominated by a single class), while probabilistic learners such as naive-Bayes do not. The relations between attributes are key themes for association learning, and hence multivariate discretization that can capture the inter-dependencies among attributes is desirable. If coupled with lazy discretization, lazy learners can further save training effort. Non-disjoint discretization is not applicable if the learning algorithm, such as decision tree learning, requires disjoint attribute values. In order to facilitate understanding this issue, we contrast discretization strategies in two popular learning
contexts, decision tree learning and naive-Bayes learning. Although both are commonly used for data mining applications, they have very different inductive biases and learning mechanisms. As a result, they desire different discretization methodologies.
Discretization in Decision Tree Learning

The learned concept is represented by a decision tree in decision tree learning. Each non-leaf node tests an attribute. Each branch descending from that node corresponds to one of the attribute's values. Each leaf node assigns a class label. A decision tree classifies instances by sorting them down the tree from the root to some leaf node (Mitchell, 1997). Algorithms such as ID3 (Quinlan, 1986) and its successor C4.5 (Quinlan, 1993) are well known exemplars. Fayyad & Irani (1993) proposed multi-interval-entropy-minimization discretization (MIEMD), which has been one of the most popular discretization mechanisms for decision tree learning over the years. Briefly speaking, MIEMD discretizes a quantitative attribute by calculating the class information entropy as if the classification only uses that single attribute after discretization. This can be suitable for the divide-and-conquer strategy of decision tree learning, which handles one attribute at a time. However, it is not necessarily appropriate for other learning mechanisms such as naive-Bayes learning, which involves all the attributes simultaneously (Yang, 2003). Furthermore, MIEMD uses the minimum description length criterion (MDL) as its termination condition that decides when to stop further partitioning a quantitative attribute's value range. As An and Cercone (1999) indicate, this criterion has the effect of forming qualitative attributes with few values. This effect is desirable for decision tree learning, since it helps avoid the fragmentation problem by minimizing the number of values of an attribute. If an attribute has many values, forming a split on those values fragments the data into small subsets with respect to which it is difficult to perform further learning (Quinlan, 1993). However, this minimization effect is not so welcome in naive-Bayes learning, because it brings an adverse impact, as we detail in the next section.
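The fragment below is a simplified illustration of the class-information-entropy calculation that entropy-based methods such as MIEMD build on: candidate cut points are scored by the weighted class entropy of the resulting binary split. The recursion over sub-intervals and the MDL stopping test of the full algorithm are omitted, and the data are invented.

```python
# Score candidate cut points by the class information entropy of the split.
from math import log2

def entropy(labels):
    """Class entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def split_entropy(values, labels, cut):
    """Weighted class entropy after splitting at a candidate cut point."""
    left = [c for v, c in zip(values, labels) if v <= cut]
    right = [c for v, c in zip(values, labels) if v > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]

# Candidate cuts are midpoints between adjacent sorted values.
ordered = sorted(values)
candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
best = min(candidates, key=lambda cut: split_entropy(values, labels, cut))
print(best)   # 6.5, which cleanly separates the two classes
```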
Discretization in Naive-Bayes Learning

When classifying an instance, naïve-Bayes learning applies Bayes' theorem to calculate the probability of each class given this instance. The most probable class is chosen as the class of this instance. In order to simplify the calculation, an attribute independence assumption is made, assuming attributes conditionally independent of each other given the class. Although this assumption is
often violated in real-world applications, naive-Bayes learning still achieves surprisingly good classification performance. Domingos and Pazzani (1997) suggested one reason is that the classification estimation under zero-one loss is only a function of the sign of the probability estimation. The classification accuracy can remain high even while the assumption violation causes poor probability estimation, so long as the highest estimate relates to the correct class. Because they are simple, effective, efficient, robust to noise, and support incremental training, naïve-Bayes classifiers have been employed in numerous classification tasks. Appropriate discretization mechanisms for naive-Bayes learning include fixed-frequency discretization (Yang, 2003), proportional discretization (Yang, 2003) and non-disjoint discretization (Yang, 2003). For example, when discretizing a quantitative attribute, fixed-frequency discretization (FFD) predefines a sufficient interval frequency k. It then discretizes the sorted values into intervals so that each interval has approximately the same number k of training instances with adjacent (possibly identical) values. By this means, FFD fixes an interval frequency that is not arbitrary but can ensure that each interval contains sufficient instances to allow reasonable probability estimates. However, FFD may result in inferior performance for decision tree learning. FFD first ensures that each interval contains sufficient instances for estimating the naive-Bayes probabilities. On top of that, FFD tries to maximize the number of discretized intervals to reduce discretization bias (Yang, 2003). If employed in decision tree learning, this maximization effect of FFD tends to cause a severe fragmentation problem. The other way around, MIEMD is effective for decision tree learning but not for naive-Bayes learning. Because of its attribute independence assumption, naive-Bayes learning is not subject to the fragmentation problem. MIEMD's tendency to minimize the number of intervals has a strong potential to reduce the classification variance but increase the classification bias. As the data size becomes large, it is very likely that the loss through bias increase will soon overshadow the gain through variance reduction, resulting in inferior learning performance (Yang, 2003). This impact is particularly undesirable, since, due to its efficiency, naive-Bayes learning is very popular for learning from large data. Hence, MIEMD is not a desirable approach for discretization in naive-Bayes learning.
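A minimal sketch of the fixed-frequency idea follows: the sorted values are cut after every k instances so that each interval holds roughly k training instances. Tie handling and the subsequent naive-Bayes probability estimation are omitted; k and the data are illustrative only.

```python
# Fixed-frequency discretization sketch: one cut after every k sorted values.
def fixed_frequency_cut_points(values, k):
    ordered = sorted(values)
    cuts = []
    for i in range(k, len(ordered), k):
        # Place each cut point between two adjacent sorted values.
        cuts.append((ordered[i - 1] + ordered[i]) / 2)
    return cuts

ages = [3, 7, 9, 12, 15, 16, 21, 30, 34, 41, 52, 67]
print(fixed_frequency_cut_points(ages, k=4))   # two cuts -> three intervals
```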
Discretization in Association Rule Discovery

Association rules (Agrawal, Imielinski, & Swami, 1993) have quite distinct discretization requirements from the
classification learning techniques discussed above. Because any attribute may appear in the consequent of an association rule, there is no single class variable with respect to which global supervised discretization might be performed. Further, there is no clear evaluation criterion for the sets of rules that are produced. Whereas it is possible to assess the expected error rate of a decision tree or a naïve Bayes classifier, there is no corresponding metric for comparing the quality of two alternative sets of association rules. In consequence, it is not apparent how one might assess the quality of two alternative discretizations of an attribute. It is not even possible to run the system with each alternative and evaluate the quality of the respective results. For these reasons simple global discretization techniques such as fixed frequency discretization are often used. In contrast, Srikant & Agrawal (1996) present a hybrid global unsupervised and local multi-variate supervised discretization technique. Each attribute is initially discretized using fixed frequency discretization. Then, for each rule a locally optimal discretization is generated by considering new discretizations formed by joining neighboring intervals. This technique is more computationally demanding than a simple global approach, but may result in more useful rules.
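The sketch below shows only the mechanics of joining neighbouring intervals after an initial fixed-frequency discretization; it does not implement the partial-completeness criterion of Srikant & Agrawal (1996). Intervals carry their support (the fraction of records they cover), and adjacent intervals are merged while the combined support stays below a threshold chosen for the example.

```python
# Join neighbouring intervals produced by an initial discretization.
def merge_adjacent(intervals, max_support=0.4):
    """intervals: list of ((low, high), support) pairs, ordered by value."""
    merged = [intervals[0]]
    for (low, high), support in intervals[1:]:
        (mlow, mhigh), msupport = merged[-1]
        if msupport + support <= max_support:
            merged[-1] = ((mlow, high), msupport + support)  # join with previous
        else:
            merged.append(((low, high), support))
    return merged

initial = [((0, 10), 0.1), ((10, 20), 0.1), ((20, 30), 0.3), ((30, 40), 0.5)]
print(merge_adjacent(initial))
# -> [((0, 20), 0.2), ((20, 30), 0.3), ((30, 40), 0.5)]
```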
FUTURE TRENDS

Many problems still remain open with regard to discretization. The relationship between the nature of a learning task and the appropriate discretization strategies requires further investigation. One challenge is to identify what aspects of a learning task are relevant to the selection of discretization strategies. It would be useful to characterize tasks and discretization methods into abstract features that facilitate the selection. Discretization for time-series data is another interesting topic. In time-series data, each instance is associated with a time stamp. The concept that underlies the data may drift over time, which implies that the appropriate discretization cut points may change. It will be time consuming if discretization has to be conducted from scratch each time the data changes. In this case, incremental discretization that only needs the old cut points and new data to form new cut points can be of great utility. A third trend is discretization for stream data. A fundamental difference between time-series data and stream data is that for stream data, one only has access to data in the current time window, but not any previous data. The data may have large volume, change very fast and require fast response, for example, exchange data from the stock market and observation data from monitoring sensors. The key theme here is to boost discretization's efficiency (without any significant accuracy loss) and to quickly incorporate discretization's results into the learner.
CONCLUSION

The process that transforms quantitative data to qualitative data is discretization. The real world is abundant in quantitative data. In contrast, many learning algorithms are more adept at learning from qualitative data. This gap can be shrunk by discretization, which makes discretization an important research area for knowledge discovery. Numerous methods have been developed as understanding of discretization evolves. Different methods are tuned to different learning tasks. When seeking to employ an existing discretization method or develop a new discretization mechanism, it is very important to understand the learning context within which the discretization lies. For example, what learning algorithm will make use of the discretized values? Different learning algorithms have different characteristics and require different discretization strategies. There is no universally optimal discretization solution. There are significant research questions that still remain open in discretization. Continuing progress is both desirable and necessary.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (pp. 207-216).

An, A., & Cercone, N. (1999). Discretization of continuous attributes for learning classification rules. Proceedings of the 3rd Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (pp. 509-514).

Bay, S.D. (2000). Multivariate discretization of continuous variables for set mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 315-319).

Bluman, A.G. (1992). Elementary statistics: A step by step approach. Dubuque, IA: Wm.C.Brown Publishers.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).
Fayyad, U.M., & Irani, K.B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).
Hsu, C.N., Huang, H.J., & Wong, T.T. (2000). Why discretization works for naive Bayesian classifiers. Proceedings of the 17th International Conference on Machine Learning (pp. 309-406).
Hsu, C.N., Huang, H.J., & Wong, T.T. (2003). Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Machine Learning, 53(3), 235-263.
Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 241-248).
Kerber, R. (1992). Chimerge: Discretization for numeric attributes. Proceedings of the 10th National Conference on Artificial Intelligence (pp. 123-128).
Mitchell, T.M. (1997). Machine learning. New York, NY: McGraw-Hill Companies.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann Publishers.
Samuels, M.L., & Witmer, J.A. (1999). Statistics for the life sciences (2nd ed.). Prentice-Hall.
Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM-SIGMOD International Conference on Management of Data (pp. 1-12).
Wu, X. (1999). Fuzzy interpretation of discretized intervals. IEEE Transactions on Fuzzy Systems, 7(6), 753-759.
Yang, Y. (2003). Discretization for naive-Bayes learning. PhD thesis, School of Computer Science and Software Engineering, Monash University, Melbourne, Australia.
Yang, Y., & Webb, G.I. (2002). Non-disjoint discretization for naive-Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).
KEY TERMS
Turning to the authority of introductory statistical textbooks (Bluman, 1992; Samuels & Witmer, 1999), the following definitions are adopted.
Continuous Data: Can assume all values on the number line within their value range. The values are obtained by measuring. An example is: temperature.
Discrete Data: Assume values that can be counted. The data cannot assume all values on the number line within their value range. An example is: number of children in a family.
Discretization: A process that transforms quantitative data into qualitative data.
Nominal Data: Classified into mutually exclusive (nonoverlapping), exhaustive categories in which no meaningful order or ranking can be imposed on the data. An example is: blood type of a person: A, B, AB, O.
Ordinal Data: Classified into categories that can be ranked. However, the differences between the ranks cannot be calculated by arithmetic. An example is: assignment evaluation: fail, pass, good, excellent.
Qualitative Data: Also often referred to as categorical data; data that can be placed into distinct categories. Qualitative data sometimes can be arrayed in a meaningful order, but no arithmetic operations can be applied to them. Qualitative data can be further classified into two groups, nominal or ordinal.
Quantitative Data: Numeric in nature. They can be ranked in order and admit meaningful arithmetic operations. Quantitative data can be further classified into two groups, discrete or continuous.
Discretization of Continuous Attributes
Fabrice Muhlenbach, EURISE, Université Jean Monnet - Saint-Etienne, France
Ricco Rakotomalala, ERIC, Université Lumière - Lyon 2, France
INTRODUCTION In the data-mining field, many learning methods — such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001) — can handle only discrete attributes. Therefore, before the machine-learning process, it is necessary to re-encode each continuous attribute as a discrete attribute constituted by a set of intervals. For example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 or greater. This process, known as discretization, is an essential task of data preprocessing, not only because some learning methods do not handle continuous attributes, but also for other important reasons. Data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan, & Dash, 2002); the computation process goes faster with a reduced volume of data, particularly when some attributes are suppressed from the representation space of the learning problem if it is impossible to find a relevant cut (Mittal & Cheong, 2002); and discretization can capture nonlinear relations — for example, infants and elderly people are more sensitive to illness, so the relation between age and illness is not linear — which is why many authors propose to discretize the data even if the learning method can handle continuous attributes (Frank & Witten, 1999). Lastly, discretization can harmonize the nature of heterogeneous data — for example, in text categorization, the attributes are a mix of numerical values and occurrence terms (Macskassy, Hirsh, Banerjee, & Dayanik, 2001). An expert produces the best discretization, because he or she can adapt the interval cuts to the context of the study and can then make sense of the transformed attributes. As mentioned previously, the continuous attribute "age" can be divided into two categories. Take basketball as an example; what is interesting about this sport is that it has many categories: "mini-mite" (under 7), "mite" (7 to 8), "squirt" (9 to 10), "peewee" (11 to 12), "bantam" (13 to 14), "midget" (15 to 16), "junior" (17 to 20), and "senior" (over 20). Nevertheless, this approach is not feasible in the majority of machine-learning problem
cases, because there are no experts available, there is no a priori knowledge of the domain, or, for a big dataset, the human cost would be prohibitive. It is then necessary to have an automated method to discretize the predictive attributes and find the cut-points that are best adapted to the learning problem. Discretization was little studied in statistics — except in some rather old articles considering it as a special case of one-dimensional clustering (Fisher, 1958) — but from the beginning of the 1990s, research expanded very quickly with the development of supervised methods (Dougherty, Kohavi, & Sahami, 1995; Liu et al., 2002). More recently, discretization has reached other fields: An efficient discretization can also improve the performance of discrete methods such as association rule construction (Ludl & Widmer, 2000a) or the machine learning of a Bayesian network (Friedman & Goldszmidt, 1996). In this article, we present discretization as a preliminary step of the learning process. The presentation is limited to global discretization methods (Frank & Witten, 1999), because in a local discretization, the cutting process depends on the particularities of the model construction — for example, the discretization in rule induction associated with genetic algorithms (Divina, Keijzer, & Marchiori, 2003) or the lazy discretization associated with naïve Bayes classifier induction (Yang & Webb, 2002). Moreover, even if this article presents the different approaches to discretizing continuous attributes whatever learning method is used, in the supervised learning framework only the discretization of the predictive attributes will be presented. The cutting of the attribute to be predicted depends strongly on the particular properties of the problem at hand; discretizing the class attribute is not realistic, because this pretreatment, if performed, would be the learning process itself.
BACKGROUND The discretization of a continuous-valued attribute consists of transforming it into a finite number of intervals and re-encoding, for all instances, each value of this
attribute by associating it with its corresponding interval. There are many ways to realize this process. One of them consists of performing a discretization with a fixed number of intervals. In this situation, the user must choose the appropriate number a priori: Too many intervals will be unsuited to the learning problem, and too few intervals risk losing some interesting information. A continuous attribute can be divided into intervals of equal width (see Figure 1) or equal frequency (see Figure 2). Other methods exist to constitute the intervals based on clustering principles, for example, k-means clustering discretization (Monti & Cooper, 1999). Nevertheless, for supervised learning, these discretization methods ignore an important source of information: the instance labels of the class attribute. By contrast, supervised discretization methods use the distribution of the class labels to choose the cuts and find the most appropriate intervals. Figure 3 shows a situation where it is more efficient to have only two intervals for the continuous attribute instead of three: It is not relevant to separate two bordering intervals if they are composed of data of the same class. Therefore, whether a discretization method is supervised or unsupervised is an important criterion to take into consideration. Another important criterion to qualify a method is whether the discretization processes the different attributes one by one or takes into account the whole set of attributes to do an overall cutting. The second case, called multivariate discretization, is particularly interesting when some interactions exist between the different attributes. In Figure 4, a supervised discretization that attempts to find the correct cuts by taking into account only one attribute independently of the others will fail: It is necessary to represent the data with the attributes X1 and X2 together to find the appropriate intervals on each attribute.
Figure 1. Equal width discretization
Figure 2. Equal frequency discretization
MAIN THRUST The two criteria mentioned in the previous section — unsupervised/supervised and univariate/multivariate — will characterize the major discretization method families. In the following sections, we use these criteria to distinguish the particularities of each discretization method.
Univariate Unsupervised Discretization The simplest discretization methods make no use of the instance labels of the class attribute. For example, equal width interval binning consists of observing the values of the dataset to identify the minimum and maximum observed values and dividing the continuous attribute into the number of intervals chosen by the user (Figure 1). Nevertheless, in this situation, if uncharacteristic extreme values ("outliers") exist in the dataset, the range will be stretched and the intervals will be poorly placed. To avoid this problem, one can divide the continuous attribute into intervals containing the same number of instances (Figure 2): This method is called equal frequency discretization. Unsupervised discretization can also be viewed as the problem of sorting and separating intermingled probability distributions (Potzelberger & Felsenstein, 1993). The existence of an optimum was studied by Teicher (1963) and Yakowitz and Spragins (1968). Nevertheless, these methods are of limited use in data mining, because they rely on statistical hypotheses that are too strong and seldom satisfied by real data.
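A minimal sketch of the two unsupervised schemes just described; the boundary conventions (e.g., how ties are handled) are implementation choices, not prescribed by the methods themselves.

```python
import numpy as np

def equal_width_cuts(values, k):
    """k intervals of identical width between the observed min and max
    (sensitive to outliers, as noted above)."""
    lo, hi = float(np.min(values)), float(np.max(values))
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """k intervals containing (roughly) the same number of instances."""
    order = np.sort(np.asarray(values, dtype=float))
    return [float(order[int(i * len(order) / k)]) for i in range(1, k)]

x = [1.0, 1.2, 1.5, 2.0, 2.2, 9.5]        # 9.5 plays the role of an outlier
print(equal_width_cuts(x, 3))             # cut-points dragged toward the outlier
print(equal_frequency_cuts(x, 3))         # cut-points follow the data density
```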
Univariate Supervised Discretization To improve the quality of a discretization in supervised data-mining methods, it is important to take into account the instance labels of the class attribute. Figure 3 shows the problem of constituting intervals without the information of the class attribute. The intervals that are best adapted to a discrete machine-learning method are the pure intervals, containing only instances of a given class. To obtain such intervals, supervised discretization methods — such as the state-of-the-art Minimum Description Length Principle Cut (MDLPC) method — are based on statistical or information-theoretical criteria and heuristics (Fayyad & Irani, 1993). Even if, in a particular case, one supervised method can give better results than another (Kurgan & Krysztof, 2004), with real data, the improvements of one method
Figure 3. Supervised and unsupervised discretizations
Figure 4. Interaction between the attributes X1 and X2
compared to the other supervised methods are insignificant. Moreover, the performance of a discretization method is difficult to estimate without a learning algorithm; in addition, the final results can arise from the discretization processing, the learning processing, or the combination of both. Because the discretization is realized in an ad hoc way, independently of the learning algorithm's characteristics, there is no guarantee that the interval cuts will be optimal for the learning method. Only limited work has shown the relevance and optimality of global discretization for a very specific classifier such as naïve Bayes (Hsu, Huang, & Wong, 2003; Yang & Webb, 2003). The supervised discretization methods can be distinguished by the way the algorithm proceeds: bottom-up (each value represents an interval, which is merged progressively with the others to constitute the appropriate number of intervals) or top-down (the whole dataset represents one interval, which is progressively cut to constitute the appropriate number of intervals). However, there are no significant performance differences between these two approaches (Zighed, Rakotomalala, & Feschet, 1997). In brief, the different supervised (univariate) methods can be characterized by (a) the particular statistical criterion used to evaluate the degree of an interval's purity, and (b) the top-down or bottom-up strategy used to find the cut-points that determine the intervals.
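The following sketch illustrates the top-down, entropy-based family to which MDLPC belongs. For simplicity it stops at a fixed recursion depth and when an interval is class-pure, instead of applying the MDL stopping criterion of Fayyad & Irani (1993), so it should be read as an illustration of the search strategy rather than as an MDLPC implementation.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_cut(xs, ys):
    """Candidate cut-points are midpoints between consecutive distinct values;
    keep the one minimizing the weighted class entropy of the two halves."""
    pairs = sorted(zip(xs, ys))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

def top_down_cuts(xs, ys, depth=2):
    """Recursive binary splitting (top-down strategy); a real MDLPC
    implementation would replace the fixed depth and purity test with
    the MDL stopping criterion."""
    cut = best_cut(xs, ys)
    if cut is None or depth == 0 or entropy(ys) == 0:
        return []
    lx, ly = zip(*[(x, y) for x, y in zip(xs, ys) if x <= cut])
    rx, ry = zip(*[(x, y) for x, y in zip(xs, ys) if x > cut])
    return top_down_cuts(lx, ly, depth - 1) + [cut] + top_down_cuts(rx, ry, depth - 1)

age  = [18, 22, 25, 30, 35, 41, 52, 60, 63, 70]
sick = ["yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes"]
print(top_down_cuts(age, sick))           # two cut-points isolating class-pure regions
```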
Multivariate Unsupervised Discretization Association rules constitute an unsupervised learning method that needs discrete attributes. For such a method, the discretization of a continuous attribute can be realized in a univariate way but also in a multivariate way. In the latter case, each attribute is cut in relation to the other attributes of the database; this approach can provide interesting improvements when unsupervised univariate discretization methods do not yield satisfactory results.
The multivariate unsupervised discretizations can be performed by clustering techniques using all attributes globally. It is also possible to consider each cluster obtained as a class and to improve the discretization quality by using (univariate) supervised discretization methods (Chmielewski & Grzymala-Busse, 1994). An approach called multisupervised discretization (Ludl & Widmer, 2000a) can be seen as a particular unsupervised multivariate discretization. This method starts with a temporary univariate discretization of all attributes. Then the final cutting of a given attribute is based on the univariate supervised discretization of all the other, previously and temporarily discretized, attributes, which play the role of a class attribute one after another. Finally, the smallest intervals are merged. For supervised learning problems, a paving of the representation space can be obtained by cutting each continuous attribute into intervals. The discretization process then consists of merging the bordering intervals in which the data distribution is the same (Bay, 2001). Nevertheless, even if this strategy can introduce the class attribute into the discretization process, it does not give a particular role to the class attribute and can lead to discretizations with unimpressive results in the predictive model.
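A small sketch of the cluster-as-class idea attributed above to Chmielewski & Grzymala-Busse (1994): cluster on all attributes at once, then hand the cluster labels to any univariate supervised method. The use of k-means and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_labels_as_class(X, k):
    """Multivariate unsupervised step: cluster on all attributes at once and
    treat each instance's cluster id as a surrogate class label."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

X = np.array([[1.0, 2.1], [1.1, 2.0], [1.2, 2.2],
              [5.0, 8.1], [5.2, 8.0], [4.9, 7.9]])
surrogate = cluster_labels_as_class(X, 2)
# each column of X can now be discretized with any univariate *supervised*
# method (e.g., the entropy-based sketch above), using `surrogate` as the class
print(surrogate)
```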
Multivariate Supervised Discretization When the learning problem is supervised and the instance labels are scattered in the representation space with interactions between the continuous predictive attributes (as presented in Figure 4), the methods previously seen will not give satisfactory results. HyperCluster Finder is a method that fixes this problem by combining the advantages of the supervised and multivariate approaches (Muhlenbach & Rakotomalala, 2002). This method is based on clusters constituted as sets of same-class instances that are close to each other in the representation space. The clusters are identified in a multivariate and supervised way: First, a neighborhood graph is built by using all predictive attributes to determine which instances are close to others; second, the edges connecting two instances belonging
to different classes are cut on the graph to constitute the clusters; third, the minimal and maximal values of each relevant cluster are used as cut-points on each predictive attribute. The intervals found by this method are pure on a paving of the whole representation space, even if purity is not guaranteed for each attribute taken independently: It is the combination of the intervals of all predictive attributes that provides pure areas in the representation space.
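The sketch below follows the three steps just listed in a deliberately simplified form: it uses a k-nearest-neighbor graph as a stand-in for the geometrical neighborhood graph of the actual method and ignores the cluster-relevance test, so it only illustrates the overall idea of HyperCluster Finder, not the published algorithm.

```python
import numpy as np

def hypercluster_cuts(X, y, n_neighbors=2, min_size=2):
    """Simplified HyperCluster Finder-style sketch: connect each instance to
    its nearest neighbors, drop edges whose endpoints have different classes,
    take the connected components as clusters, and project the bounding box
    of each sufficiently large cluster onto every attribute as cut-points."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in np.argsort(dist[i])[1:n_neighbors + 1]:   # skip the point itself
            if y[i] == y[int(j)]:                           # keep same-class edges only
                adj[i].add(int(j)); adj[int(j)].add(i)
    seen, clusters = set(), []
    for s in range(n):                                      # connected components (DFS)
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v); comp.append(v); stack.extend(adj[v] - seen)
        clusters.append(comp)
    cuts = [set() for _ in range(d)]
    for comp in clusters:
        if len(comp) < min_size:                            # ignore tiny clusters
            continue
        for a in range(d):
            vals = X[comp, a]
            cuts[a].update((float(vals.min()), float(vals.max())))
    return [sorted(c) for c in cuts]

X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.9, 1.0], [1.0, 0.9], [0.8, 1.1]]
y = ["a", "a", "a", "b", "b", "b"]
print(hypercluster_cuts(X, y))   # per-attribute cut-points from the cluster bounding boxes
```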
FUTURE TRENDS Today, the discretization field is well studied in the supervised and unsupervised cases for a univariate process. However, there is little work in the multivariate case. A related problem exists in the feature selection domain, which needs to be combined with the aforementioned multivariate case. This should bring improved and more pertinent progress. It is virtually certain that better results can be obtained for a multivariate discretization if all attributes of the representation space are relevant for the learning problem.
CONCLUSION In a data-mining task, for a supervised or unsupervised learning problem, discretization turns out to be an essential preprocessing step on which the performance of the learning algorithm that uses the discretized attributes will depend. Many methods, supervised or not, multivariate or not, exist to perform this pretreatment, more or less adapted to a given dataset and learning problem. Furthermore, a supervised discretization can also be applied in a regression problem, when the attribute to be predicted is continuous (Ludl & Widmer, 2000b). The choice of a particular discretization method depends on (a) its algorithmic complexity (complex algorithms will take more computation time and will be unsuited to very large datasets), (b) its efficiency (the simple unsupervised univariate discretization methods are inappropriate for complex learning problems), and (c) its appropriate combination with the learning method using the discretized attributes (a supervised discretization is better adapted to a supervised learning problem). For the last point, it is also possible to significantly improve the performance of the learning method by choosing an appropriate discretization, for instance, a fuzzy discretization for the naïve Bayes algorithm (Yang & Webb, 2002). Nevertheless, it is unnecessary to employ a sophisticated discretization method if the learning method does not benefit from the discretized attributes (Muhlenbach & Rakotomalala, 2002).
ACKNOWLEDGMENT Edited with the aid of Christopher Yukna.
REFERENCES
Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge and Information Systems, 3(4), 491-512.
Chmielewski, M. R., & Grzymala-Busse, J. W. (1994). Global discretization of continuous attributes as preprocessing for machine learning. Proceedings of the Third International Workshop on Rough Sets and Soft Computing (pp. 294-301).
Divina, F., Keijzer, M., & Marchiori, E. (2003). A method for handling numerical attributes in GA-based inductive concept learners. Proceedings of the Genetic and Evolutionary Computation Conference (pp. 898-908).
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (pp. 1022-1027).
Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Society, 53, 789-798.
Frank, E., & Witten, I. (1999). Making better use of global discretization. Proceedings of the 16th International Conference on Machine Learning (pp. 115-123).
Friedman, N., & Goldszmidt, M. (1996). Discretization of continuous attributes while learning Bayesian networks from mixed data. Proceedings of the 13th International Conference on Machine Learning (pp. 157-165).
Grzymala-Busse, J. W., & Stefanowski, J. (2001). Three discretization methods for rule induction. International Journal of Intelligent Systems, 16, 29-38.
Hsu, H., Huang, H., & Wong, T. (2003). Implication of the Dirichlet assumption for discretization of continuous variables in naïve Bayes classifiers. Machine Learning, 53(3), 235-263.
Kurgan, L., & Krysztof, J. (2004). CAIM discretization algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), 145-153.
Liu, H., Hussain, F., Tan, C., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4), 393-423.
Ludl, M., & Widmer, G. (2000a). Relative unsupervised discretization for association rule mining. Principles of the Fourth European Conference on Data Mining and Knowledge Discovery (pp. 148-158).
Ludl, M., & Widmer, G. (2000b). Relative unsupervised discretization for regression problem. Proceedings of the 11th European Conference on Machine Learning (pp. 246-253).
Macskassy, S. A., Hirsh, H., Banerjee, A., & Dayanik, A. A. (2001). Using text classifiers for numerical classification. Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 885-890).
Mittal, A., & Cheong, L. (2002). Employing discrete Bayes error rate for discretization and feature selection tasks. Proceedings of the First IEEE International Conference on Data Mining (pp. 298-305).
Monti, S., & Cooper, G. F. (1999). A latent variable model for multivariate discretization. Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics.
Muhlenbach, F., & Rakotomalala, R. (2002). Multivariate supervised discretization: A neighborhood graph approach. Proceedings of the First IEEE International Conference on Data Mining (pp. 314-321).
Potzelberger, K., & Felsenstein, K. (1993). On the Fisher information of discretized data. Journal of Statistical Computation and Simulation, 46(3-4), 125-144.
Teicher, H. (1963). Identifiability of finite mixtures. Ann. Math. Statist., 34, 1265-1269.
Yakowitz, S. J., & Spragins, J. D. (1968). On the identifiability of finite mixtures. Ann. Math. Statist., 39, 209-214.
Yang, Y., & Webb, G. (2002). Non-disjoint discretization for naïve Bayes classifiers. Proceedings of the 19th International Conference on Machine Learning (pp. 666-673).
Yang, Y., & Webb, G. (2003). On why discretization works for naïve Bayes classifiers. Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (pp. 440-452).
Zighed, D., Rakotomalala, R., & Feschet, F. (1997). Optimal multiple intervals discretization of continuous attributes for supervised learning. Proceedings of the Third International Conference on Knowledge Discovery in Databases (pp. 295-298).
KEY TERMS
Cut-Points: A cut-point (or split-point) is a value that divides an attribute into intervals. A cut-point has to be included in the range of the continuous attribute to discretize. A discretization process can produce no cut-point or several cut-points.
Discrete/Continuous Attributes: An attribute is a quantity describing an example (or instance); its domain is defined by the attribute type, which denotes the values taken by an attribute. An attribute can be discrete (or categorical, indeed symbolic) when the number of values is finite. A continuous attribute corresponds to real numerical values (for instance, a measurement). The discretization process transforms an attribute from continuous to discrete.
Instances: An instance is an example (or record) of the dataset; it is often a row of the data table. Instances of a dataset are usually seen as a sample of the whole population (the universe). An instance is described by its attribute values, which can be continuous or discrete.
Number of Intervals: The number of intervals corresponds to the different values of a discrete attribute resulting from the discretization process. The number of intervals is equal to the number of cut-points plus 1. The minimum number of intervals of an attribute is equal to 1, and the maximum number of intervals is equal to the number of instances.
Representation Space: The representation space is formed with all the attributes of a learning problem. In supervised learning, it consists of the representation of the labeled instances in a multidimensional space, where all predictive attributes play the role of a dimension.
Supervised/Unsupervised: A supervised learning algorithm searches for a functional link between a class attribute (or dependent attribute, or attribute to be predicted) and predictive attributes (the descriptors). The supervised learning process aims to produce a predictive model that is as accurate as possible. In an unsupervised learning process, all attributes play the same role; the unsupervised learning method tries to group instances in clusters, where instances in the same cluster are similar, and instances in different clusters are dissimilar.
Univariate/Multivariate: A univariate (or monothetic) method processes a particular attribute independently of the others. A multivariate (or polythetic) method processes all attributes of the representation space, so it can fix some problems related to the interactions among the attributes.
Distributed Association Rule Mining
Mafruz Zaman Ashrafi, Monash University, Australia
David Taniar, Monash University, Australia
Kate A. Smith, Monash University, Australia
INTRODUCTION Data mining is an iterative and interactive process that explores and analyzes voluminous digital data to discover valid, novel, and meaningful patterns (Zaki, 1999). Since digital data may have terabytes of records, data mining techniques aim to find patterns using computationally efficient techniques. It is related to a subarea of statistics called exploratory data analysis. During the past decade, data mining techniques have been used in various business, government, and scientific applications. Association rule mining (Agrawal, Imielinsky & Sawmi, 1993) is one of the most studied fields in the data-mining domain. The key strength of association mining is completeness: It has the ability to discover all associations within a given dataset. Two important constraints of association rule mining are support and confidence (Agrawal & Srikant, 1994). These constraints are used to measure the interestingness of a rule. The motivation of association rule mining comes from market-basket analysis, which aims to discover customer purchase behavior. However, its applications are not limited to market-basket analysis; rather, they are used in other applications, such as network intrusion detection, credit card fraud detection, and so forth. The widespread use of computers and the advances in network technologies have enabled modern organizations to distribute their computing resources among different sites. Various business applications used by such organizations normally store their day-to-day data in each respective site. Data of such organizations increase in size every day. Discovering useful patterns from such organizations using a centralized data mining approach is not always feasible, because merging datasets from different sites into a centralized site incurs large network communication costs (Ashrafi, David & Kate, 2004). Furthermore, data from these organizations are not only distributed over various locations, but are also fragmented vertically. Therefore, it becomes more difficult, if not impossible, to combine them in a central
location. Therefore, Distributed Association Rule Mining (DARM) emerges as an active subarea of data-mining research. Consider the following example. A supermarket may have several data centers spread over various regions across the country. Each of these centers may have gigabytes of data. In order to find customer purchase behavior from these datasets, one can employ an association rule mining algorithm in one of the regional data centers. However, applying a mining algorithm to a particular data center will not allow us to obtain all the potential patterns, because customer purchase patterns of one region will vary from the others. So, in order to obtain all potential patterns, we rely on some kind of distributed association rule mining algorithm, which can incorporate all data centers. Distributed systems, by nature, require communication. Since distributed association rule mining algorithms generate rules from different datasets spread over various geographical sites, they consequently require external communications in every step of the process (Ashrafi, David & Kate, 2004; Assaf & Ron, 2002; Cheung, Ng, Fu & Fu, 1996). As a result, DARM algorithms aim to reduce communication costs in such a way that the total cost of generating global association rules is less than the cost of combining the datasets of all participating sites into a centralized site.
BACKGROUND DARM aims to discover rules from different datasets that are distributed across multiple sites and interconnected by a communication network. It tries to avoid the communication cost of combining datasets into a centralized site, which requires large amounts of network communication. It offers a new technique to discover knowledge or patterns from such loosely coupled distributed datasets and produces global rule models by using minimal network communication. Figure 1 illustrates a typical DARM framework. It shows three par-
ticipating sites, where each site generates local models from its respective data repository and exchanges local models with the other sites in order to generate global models.
Figure 1. A distributed data mining framework
Typically, rules generated by DARM algorithms are considered interesting if they satisfy both the minimum global support and the minimum confidence threshold. To find interesting global rules, DARM generally has two distinct tasks: (i) global support counting, and (ii) global rule generation. Let D be a virtual transaction dataset comprising the geographically distributed datasets D1, D2, D3, …, Dm; let n be the number of items and let I = {a1, a2, a3, …, an} be the set of items. Suppose N is the total number of transactions and T = {t1, t2, t3, …, tN} is the sequence of transactions, such that ti ∈ D. The support of a given itemset A ⊆ I is the fraction of the transactions in D that contain A, defined as follows:

Support(A) = |{ti ∈ T : A ⊆ ti}| / N     (1)

Itemset A is frequent if and only if Support(A) ≥ minsup, where minsup is a user-defined global support threshold. Once the algorithm discovers all global frequent itemsets, each site generates the global rules that have the user-specified confidence. The confidence of a rule R: F1 → F2 is calculated from the frequent itemset supports by using the following formula:

Confidence(R) = Support(F1 ∪ F2) / Support(F1)     (2)
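As a concrete reading of formulas (1) and (2), the following fragment computes relative support and rule confidence over a toy, in-memory transaction list; the items and transactions are invented for illustration only.

```python
def support(itemset, transactions):
    """Relative support as in formula (1): fraction of transactions containing the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence as in formula (2): Support(F1 ∪ F2) / Support(F1)."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

T = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"}, {"milk"}]
print(support({"bread", "milk"}, T))        # 0.5
print(confidence({"bread"}, {"milk"}, T))   # 0.666...
```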
CHALLENGES OF DARM All of the DARM algorithms are based on sequential association mining algorithms. Therefore, they inherit all drawbacks of sequential association mining. However, DARM not only deals with the drawbacks of it but also considers other issues related to distributed computing. For example, each site may have different platforms and datasets, and each of those datasets may have different schemas. In the following paragraphs, we discuss a few of them.
Frequent Itemset Enumeration
Frequent itemset enumeration is one of the main association rule mining tasks (Agrawal & Srikant, 1993; Zaki, 2000; Jiawei, Jian & Yiwen, 2000). Association rules are generated from frequent itemsets. However, enumerating all frequent itemsets is computationally expensive. For example, if a transaction of a database contains 30 items, one can generate up to 2³⁰ itemsets. To mitigate the enumeration problem, two basic search approaches are found in the data-mining literature. The first approach uses breadth-first search, which iterates over the dataset level by level, generating candidate itemsets; it works efficiently when the user-specified support threshold is high. The second approach uses depth-first search to enumerate frequent itemsets. This technique performs better when the user-specified support threshold is low, or when the dataset is dense (i.e., items frequently occur in transactions). For example, Eclat (Zaki, 2000) determines the support of k-itemsets by intersecting the tidlists (lists of transaction IDs) of the lexicographically first two (k-1)-length subsets that share a common prefix. However, this approach may run out of main memory when there are large numbers of transactions. DARM datasets, moreover, are spread over various sites, so DARM cannot take full advantage of these search techniques. For example, breadth-first search performs better when the support threshold is high, but in DARM the candidate itemsets are generated by combining the frequent items of all datasets; hence, a site enumerates itemsets that are not frequent at that particular site. As a result, DARM cannot exploit the advantage of breadth-first techniques when the user-specified support threshold is high. In contrast, if a depth-first search technique is employed in DARM, it needs large amounts of network communication. Therefore, without a very fast network connection, depth-first search is not feasible in DARM.
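The tidlist representation behind Eclat can be illustrated as follows; a real implementation intersects the tidlists of two (k-1)-subsets sharing a common prefix rather than those of all single items, so this is only a sketch of the underlying idea.

```python
def tidlists(transactions):
    """Map each item to the set of transaction ids (tids) in which it occurs."""
    tl = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tl.setdefault(item, set()).add(tid)
    return tl

# support of a k-itemset = size of the intersection of its members' tidlists
T = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
tl = tidlists(T)
support_abc = len(tl["a"] & tl["b"] & tl["c"])   # transactions containing {a, b, c}
print(support_abc)                                # 2
```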
Communication A fundamental challenge for DARM is to develop mining techniques that do not communicate data from the various sites unnecessarily. Since, in DARM, each site shares its frequent itemsets with the other sites to generate unambiguous association rules, each step of DARM requires communication. The cost of communication increases proportionally to the size of the candidate itemsets. For example, suppose there are three sites, S1, S2, and S3, involved in distributed association mining. Suppose that after the second pass, site S1 has candidate 2-itemsets equal to {AB, AC, BC, BD}, S2 = {AB, AD, BC, BD}, and S3 = {AC, AD, BC, BD}. To generate global frequent 2-itemsets, each site sends its respective candidate 2-itemsets to the other sites and receives their candidate 2-itemsets in return. If we calculate the total number of candidate 2-itemsets that each site receives, it is equal to 8. But if each site increases its number of candidate 2-itemsets by 1, each site will receive 10 candidate 2-itemsets, and, subsequently, this increases the communication cost. This cost will further increase when the number of sites is increased. For this reason, message optimization becomes an integral part of DARM algorithms. The basic message exchange technique (Agrawal & Shafer, 1996) incurs massive communication costs, especially when the number of local frequent itemsets is large. To reduce the message exchange cost, several message optimization techniques have been proposed (Ashrafi et al., 2004; Assaf & Ron, 2002; Cheung et al., 1996). Each of these optimizations focuses on the message exchange size and tries to reduce it in such a way that the overall communication cost is less than the cost of merging all datasets into a single site. However, those optimization techniques are based on some assumptions and, therefore, are not suitable in many situations. For example, DMA (Cheung et al., 1996) assumes that the number of disjoint itemsets among different sites is high. But this is only achievable when the different participating sites have vertically fragmented datasets. To mitigate these problems, another framework has been proposed (Ashrafi et al., 2004). It partitions the participating sites into two groups, senders and receivers, and uses itemset history to reduce the message exchange size. Each sender site sends its local frequent itemsets to a particular receiver site. When a receiver site has received all local frequent itemsets, it generates the global frequent itemsets for that iteration and sends those itemsets back to all sender sites.
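The message counts in the example above can be reproduced with a few lines; this assumes the naive all-to-all exchange described in the text, not any of the optimized protocols.

```python
def received_counts(candidates_per_site):
    """Under a naive all-to-all exchange, each site receives every other site's
    local candidate itemsets; this reproduces the counts used in the example."""
    sizes = {s: len(c) for s, c in candidates_per_site.items()}
    total = sum(sizes.values())
    return {s: total - sizes[s] for s in sizes}

C = {"S1": {"AB", "AC", "BC", "BD"},
     "S2": {"AB", "AD", "BC", "BD"},
     "S3": {"AC", "AD", "BC", "BD"}}
print(received_counts(C))   # {'S1': 8, 'S2': 8, 'S3': 8}
```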
Privacy Association rule-mining algorithms discover patterns from datasets based on some statistical measurements.
However, those statistical measurements have significant meaning and may breach privacy (Evfimievski, Srikant, Agrawal & Gehrke, 2002; Rizvi & Haritsa, 2002). Therefore, privacy becomes a key issue of association rule mining. Distributed association rule-mining algorithms discover association rules beyond the organization boundary. For that reason, the chances of breaching privacy in distributed association mining are higher than in centralized association mining (Vaidya & Clifton, 2002; Ashrafi, David & Kate, 2003), because distributed association mining builds the final rule model by combining various local patterns. For example, suppose there are three sites, S1, S2, and S3, with datasets DS1, DS2, and DS3, and suppose A and B are two items whose itemset AB satisfies the global support threshold. In order to find the rule A→B or B→A, we need to aggregate the local support of the itemset AB from all participating sites (i.e., sites S1, S2, and S3). When we do such an aggregation, each site learns the exact support counts of the other sites. However, participating sites generally are reluctant to disclose the exact support of itemset AB to the other sites, because the support count of an itemset has a statistical meaning, and this may threaten data privacy. For the above-mentioned reason, we need secure multi-party computation solutions to maintain the privacy of distributed association mining (Kantercioglu & Clifton, 2002). The goal of secure multi-party computation (SMC) in distributed association rule mining is to find the global support of all itemsets using a function where multiple parties hold their local support counts, and, at the end, all parties know the global support of all itemsets but nothing more. Finally, each participating site uses this global support for rule generation.
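One standard SMC building block for exactly this kind of aggregation is a ring-based secure sum. The sketch below shows the idea only; the protocols cited in the text add protections (e.g., against colluding neighbors) that are omitted here, and the modulus is an arbitrary illustrative choice that must exceed the true total.

```python
import random

def secure_sum(local_supports, modulus=10**6):
    """Ring-based secure sum sketch: the initiating site masks its value with a
    random number, each subsequent site adds its own local support to the
    masked running total, and the initiator removes the mask at the end, so no
    site sees another site's individual count."""
    r = random.randrange(modulus)
    running = (r + local_supports[0]) % modulus
    for s in local_supports[1:]:
        running = (running + s) % modulus   # each site only sees a masked partial sum
    return (running - r) % modulus

print(secure_sum([120, 75, 310]))   # 505, the global support count of the itemset
```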
Partition DARM deals with different possibilities of data distribution. Different sites may contain horizontally or vertically partitioned data. In a horizontal partition, the datasets of the participating sites share a common set of attributes. However, in a vertical partition, the datasets of different sites may have different attributes. Figure 2 illustrates horizontally and vertically partitioned datasets in a distributed context. Generating rules by using DARM becomes more difficult when the participating sites have vertically partitioned datasets. For example, if a participating site has only a subset of the items of the other sites, then that site may have only a few frequent itemsets and may finish the process earlier than the others. However, that site cannot move to the next iteration, because it needs to wait for the other sites to generate the global frequent itemsets of that iteration. Furthermore, the problem becomes more difficult when the dataset of each site does not have the same
Figure 2. Distributed (a) homogeneous and (b) heterogeneous datasets
hierarchical taxonomy. If a different taxonomy level exists in datasets of different sites, as shown in Figure 3, it becomes very difficult to maintain the accuracy of global models.
FUTURE TRENDS
DARM algorithms often consider the datasets of the various sites as a single virtual table. However, this assumption becomes incorrect when DARM uses datasets that are not from the same domain: Enumerating rules from such datasets may produce discrepancies if we assume that their semantic meanings are the same. Future DARM algorithms will investigate how such datasets can be used to find meaningful rules without increasing the communication cost.
Figure 3. A generalization scenario
CONCLUSION The widespread use of computers and the advances in database technology have provided a large volume of data distributed among various sites. The explosive growth of data in databases has generated an urgent need for efficient DARM to discover useful information and knowledge. Therefore, DARM becomes one of the active subareas of data-mining research. It not only promises to generate association rules with minimal communication cost, but it also utilizes the resources distributed among different sites efficiently. However, the acceptability of DARM depends to a great extent on the issues discussed in this article.
REFERENCES
Agrawal, R., Imielinsky, T., & Sawmi, A.N. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C.
Agrawal, R., & Shafer, J.C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Databases, Santiago de Chile, Chile.
Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2003). Towards privacy preserving distributed association rule mining. Proceedings of Distributed Computing, Lecture Notes in Computer Science, IWDC'03, Calcutta, India.
Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). Reducing communication cost in privacy preserving distributed association rule mining. Proceedings of Database Systems for Advanced Applications, DASFAA'04, Jeju Island, Korea.
Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2004). ODAM: An optimized distributed association rule mining algorithm. IEEE Distributed Systems Online, IEEE.
Assaf, S., & Ron, W. (2002). Communication-efficient distributed mining of association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, California.
Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996a). Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.
Cheung, D.W., Ng, V.T., Fu, A.W., & Fu, Y. (1996b). A fast distributed algorithm for mining association rules. Proceedings of the International Conference on Parallel and Distributed Information Systems, Florida.
Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.
Jiawei, H., Jian, P., & Yiwen, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas.
Kantercioglu, M., & Clifton, C. (2002). Privacy preserving distributed mining of association rules on horizontally partitioned data. Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Edmonton, Canada.
Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. Proceedings of the International Conference on Very Large Databases, Hong Kong, China.
Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.
Zaki, M.J. (1999). Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4), 14-25.
Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(2), 372-390.
Zaki, M.J., & Ya, P. (2002). Introduction: Recent developments in parallel and distributed data mining. Journal of Distributed and Parallel Databases, 11(2), 123-127.
KEY TERMS
DARM: Distributed Association Rule Mining.
Data Center: A centralized repository for the storage and management of information, organized for a particular area or body of knowledge.
Frequent Itemset: An itemset whose support is at least the user-specified support threshold.
SMC: Secure Multi-party Computation; SMC computes a function f(x1, x2, x3, …, xn) whose inputs are held by several parties, and, at the end, all parties know the result of the function f(x1, x2, x3, …, xn) and nothing else.
Network Intrusion Detection: A system that detects inappropriate, incorrect, or anomalous activity in a private network.
Taxonomy: A classification based on a pre-determined system that is used to provide a conceptual framework for discussion, analysis, or information retrieval.
Distributed Data Management of Daily Car Pooling Problems
Roberto Wolfler Calvo, Université de Technologie de Troyes, France
Fabio de Luigi, University of Ferrara, Italy
Palle Haastrup, European Commission, Italy
Vittorio Maniezzo, University of Bologna, Italy
INTRODUCTION Increased human mobility, combined with the high use of private cars, increases the load on the environment and raises issues about the quality of life. The use of private cars leads to high levels of air pollution in cities, parking problems, noise pollution, congestion, and the resulting low transfer velocity (and, thus, inefficiency in the use of public resources). Public transportation services are often incapable of effectively serving non-urban areas, where cost-effective transportation systems cannot be set up. Based on investigations during the last years, problems related to traffic have been among those most commonly mentioned as distressing, while public transportation systems are inherently incapable of facing the different transportation needs arising in modern societies. A solution to the problem of the increased passenger and freight transportation demand could be obtained by increasing both the efficiency and the quality of public transportation systems, and by developing systems that provide alternative solutions, in terms of flexibility and costs, between the public and the private ones. This is the rationale behind so-called Innovative Transport Systems (ITS) (Colorni et al., 1999), like car pooling, car sharing, dial-a-ride, park-and-ride, card car, park pricing, and road pricing, which are characterized by the exploitation of innovative organizational elements and by a large flexibility in their management (e.g., traffic restrictions and fares can vary with the time of day). Specifically, car pooling is a collective transportation system based on the idea that sets of car owners having the same travel destination can share their vehicles (private cars), with the objective of reducing the number of cars on the road. Until now, these systems have had limited use due to the lack of efficient information processing and communication support. This study presents an
integrated system for the organization of a car-pooling service and reports about a real-world case study.
BACKGROUND Car pooling can be operated in two main ways: Daily Car Pooling Problem (DCPP) or Long-Term Car Pooling Problem (LCPP). In the case of DCPP (Baldacci et al., 2004; Mingozzi et al., 2004), each day a number of users (hereafter called servers) declare their availability for picking up and later bringing back colleagues (hereafter clients) on that particular day. The problem is to assign clients to servers and to identify the routes to be driven by the servers in order to minimize service costs and a penalty due to unassigned clients, subject to user time window and car capacity constraints. In the case of LCPP, each user is available both as a server and as a client, and the objective is to define crews or user pools, where each user, in turn, on different days, will pick up the remaining pool members (Hildmann, 2001; Maniezzo et al., 2004). The objective here becomes that of maximizing pool sizes and minimizing the total distance traveled by all users when acting as servers, again subject to car capacity and time window constraints. Car pooling is most effective for daily commuters. It is an old idea but scarcely applied, at least in Europe, for many reasons. First of all, it is necessary to collect a lot of information about service users and to make them available to the service planner; it can be difficult to cluster people in such a way that everybody feels satisfied. Ultimately, it is important that all participants perceive a clear personal gain in order to continue using the service. Moreover, past experience has shown already that the effectiveness and competitiveness of public transporta-
tion, compared to private transportation, can be improved through an integrated process based on data management facilities. Usually, potential customers of a new transportation system face serious problems of knowledge retrieval, which is the reason for devoting substantial efforts to providing clients with an easy and powerful tool to find information. Current information systems give the possibility of real-time data management and include the possibility to react to unscheduled events, as they can be reached from almost anywhere. Architectures for information browsing, like the World Wide Web (WWW), are an easy and powerful means through which to deploy databases and, thus, represent obvious options for setting up services such as the one advocated in this work. The WWW is useful for providing access to central data storage, for collecting remote data, and for allowing GIS and optimization software to interact. All these elements have an important impact on the implementation of new transport services (Ridematching, 2004). Based on the way they use the WWW and on the services they provide, car-pooling systems range between two extremes (Carpoolmatch, 2004; Carpooltool, 2004; Ridepro, 2004; SAC, 2004, etc.): one where there is a WWW site collecting information about trips, open to every registered user, and another where the users of the system form a restricted group and are granted more functionalities. A main issue for the first type is guaranteeing the reliability of the information. Their interface often is designed mainly to help users specify service-related geographical information (customer delivery points, customer pickup points, paths). Such systems only rarely suggest a matching between clients and servers but operate as a post-it wall, where users can consult or leave information about travel routes. As for the latter type, the idea is normally to set up a car-pooling service among the users, usually employees of the same organization. This type of system is more structured. Moreover, the spontaneous user matching often is substituted by a solution found by means of an algorithmic approach. An example of this type of system is reported in Dailey, et al. (1999) or in Lexington-Fayette County (2004). These systems use the WWW and react to unexpected events with e-mail messages.
service is supported by a database of potential users (e.g., employees of a company) that daily commute from their houses to their workplace. A subset of them offers seats in their cars. Moreover, they specify the departure time (when they leave their house) and the mandatory arrival time at the office. The employees that offer seats in their cars are called servers. The employees asking for a lift are called clients. The set of servers and the set of clients need to be redefined once a day. The effectiveness of the proposed system is related strictly to the architecture and the techniques used to manage information. The objective of this research is to prove that, at least in a particular site (the Joint Research Center [JRC] of the European Commission located in Ispra, northern Italy), it could be possible to reduce the number of transport vehicles without significantly changing either the number of commuters or their comfort-level. The users of the system are commuters normally using their own cars for traveling between home and a workplace poorly served by public transportation.
System Architecture The architecture of the system developed for this problem is shown in Figure 1 (interested readers can find a complete description of the whole system in Wolfler et al., 2004). The system consists of the following five main modules:
• OPT: An optimization module that generates a feasible solution using an algorithm that defines the paths for the servers. The algorithm makes use of a heuristic approach and is used to assign the clients to the servers and to define the path for each server. Each path minimizes the travel time, maximizes the number of requests picked up, and satisfies the time and capacity constraints.
• {M-, S-, W-}CAR: Three modules that permit receiving, decrypting, and sending SMS (SCAR), e-mail (MCAR), and Web pages (WCAR) to the users, respectively. The module Sms Car Pooling (SCAR) allows the server to send and receive SMS messages. The module Mail Car Pooling (MCAR) supports e-mail communication and uses POP3 and SMTP as protocols. The module WCAR is the gateway for the Web interaction. All modules filter the customers' access, allowing the entitled user to insert new data, to query and modify the database (e.g., the desired departure and arrival times), and to access service data.
• GUI: A graphical user interface based on ESRI ArcView. It generates a view (a digital map) of the current problem instance and provides all relevant data management.
Figure 1. The architecture of the system
The system collects and uses data of two different types: geographic and alphanumeric. The geographic database contains the maps of the region of interest with a detailed road network, the geocoded client pickup sites, and all the geographic information needed to generate a GIS output on the Web. The alphanumeric data repository, maintained by a relational database, contains information about the employees and a representation of the road network of the area where the car-pooling service is active. Other data, input by the users, are related strictly to the daily situation: detailed information about the service, whether a user is a server or a client, the departure time from home and the maximal acceptable arrival time at work, the number of available seats in the cars offered by the servers, and the maximum accepted delay, which is the parameter used to specify how far from the shortest route a server is willing to drive in order to pick up colleagues. The users are permitted to consult, modify, or delete their own entries in the database at any time. The road network and all user-related data are processed by the optimization module in order to define user pools and car paths. The optimization algorithm can be activated on a regular schedule, without anybody's immediate presence, searching a local database and presenting the results on the Web. All these actions are performed on a periodic basis on the evening before each work day. The system is designed to support two types of users: the system administrator and the employee. The system administrator must update the static databases (users, maps) and guarantee all the functionalities. All other data are entered, edited, and deleted by the users through a distributed GUI.
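As an illustration of the daily, user-supplied records just described, a hypothetical data shape might look as follows; the field names and types are assumptions for exposition, not the actual JRC database schema.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class DailyEntry:
    """Hypothetical shape of a daily record; names are illustrative only."""
    employee_id: int
    is_server: bool          # offers seats (server) or asks for a lift (client)
    departure: time          # departure time from home
    latest_arrival: time     # latest acceptable arrival time at work
    seats: int = 0           # available seats, only meaningful for servers
    max_delay_min: int = 0   # extra driving time a server accepts for pickups

offer = DailyEntry(42, True, time(7, 30), time(8, 45), seats=3, max_delay_min=20)
print(offer)
```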
Communication Subsystems

The system supports three different communication channels: Web, SMS, and e-mail. The Web is the main interface connecting a user to the system. The user, by means of a standard GUI, can insert transportation requests and offers; the system then generates Web pages containing text and maps presenting the results of the optimization algorithm. Both e-mail and SMS can also be used for submitting requests and for receiving results or variations of previously proposed results (following real-time notifications). The clients learn whether their requests could be matched and the details of their pickup, while the servers simply receive an SMS notifying them of the pool formation; travel details are sent separately via e-mail. SMS is also used by the servers (via a service relay) to inform clients of delays.

Operationally, a user sending an e-mail to the system must insert the message string in the subject field, which is processed by a mail receiver agent and finally sent to the parser. Economic and privacy considerations suggested sending all SMS to an SMS engine, which is interfaced with the system and which transfers the string sent to the parser. Each time the system receives a syntactically correct e-mail or SMS, it automatically sends an acknowledgment SMS message. While the system generates messages to be sent directly to users, the messages that users generate, whether e-mail or SMS, must follow a rigidly structured format in order to allow automatic processing. We implemented a single parser for all these messages.

The Web interface is obviously more user-friendly. Each employee can specify through an ASP page the set of days (possibly empty) when he or she is willing to drive and the set of days when he or she asks to be picked up. The system then computes a matching of clients to servers. The result is a set of routes starting from the servers' houses, arriving at the workplace, and passing through the set of client houses without violating the time window constraints and the car capacity constraints. These routes are made available both to clients and to servers. A well-known GIS module, ArcView, loads the route information, given in alphanumeric format, after the optimization. Then, it transforms this information into a set of GIS views through a set of queries to its database. The last step transforms the views into bitmaps displayed by the Web server. Moreover, the system alerts clients and servers in real time by means of SMS whenever the schedule changes.
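The article does not specify the actual message grammar used by the prototype, so the sketch below assumes a purely hypothetical single-line format ("badge id, OFFER or REQUEST, date, departure, latest arrival, optional seats") only to illustrate how one parser can serve both the SMS channel and the e-mail subject channel:

import re

# Hypothetical message grammar (the real format used by the JRC prototype is not
# given in the article): "<badge> <OFFER|REQUEST> <date> <departure> <latest_arrival> [seats]"
_MSG = re.compile(
    r"^(?P<badge>[A-Z]\d{4})\s+(?P<kind>OFFER|REQUEST)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<dep>\d{2}:\d{2})\s+(?P<arr>\d{2}:\d{2})"
    r"(?:\s+(?P<seats>\d+))?$"
)

def parse_message(text: str):
    """Parse one SMS body or e-mail subject line; return a dict, or None if malformed."""
    m = _MSG.match(text.strip().upper())
    if not m:
        return None            # caller would then skip the acknowledgment SMS
    fields = m.groupdict()
    if fields["kind"] == "OFFER" and fields["seats"] is None:
        return None            # an offer must state the number of free seats
    return fields

print(parse_message("e1234 offer 2004-05-17 07:30 08:45 3"))

A syntactically correct message would then trigger the acknowledgment SMS described above, while a None result would not.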
Optimization Algorithms and Implementation Details

One of the most interesting features of the approach described in this article, with respect to the other systems currently in use, is the set of algorithms used to obtain a matching of clients and servers. Due to the NP-hardness of the underlying problem (Varrentrapp et al., 2002) and to the size of the instances to solve, a heuristic approach was in order. In designing it, efficiency was the main parameter, ensuring relatively fast response times even for larger instances. The matching obtained minimizes the total travel length of all servers going to the workplace and maximizes the number of clients serviced, while meeting the operational constraints. In addition, real-time services are supported; for example, the system sends warnings to employees when delays occur.

The system was implemented at the Joint Research Center of the European Commission, using standard commercial software and developing extra modules in C++ with the Microsoft Visual Studio C++ compiler. The database used is Microsoft Access. The geographical data are stored in shape files and used by an ArcView application (release 3.2). The system administrator can access data directly through MS Access and ArcView, which are both available and running on the same machine where the car-pooling application resides. Through ArcView, the system manager can start the optimization module.

The Joint Research Center of the European Commission (JRC) is situated in the northwest of Italy, not far from Milan. It covers an area of 2 km², and its mission is to carry out research useful to the European Commission. It has about 2,000 employees, divided into three main classes: administrative, staff, and researchers. They come from all around Europe and live in the area surrounding the center. The wide geographical area covered by the commuters is about 100 km². Since this is a sparsely populated area, public transportation (i.e., trains, buses, etc.) is definitely insufficient; thus, private cars are necessary and used.

Module OPT, in particular, was tested on a set of real-world problem instances derived from data provided by the JRC. The instances used are the same as those described in Baldacci et al. (2004); they are derived from the real-world instance defined by over 600 employees by randomly selecting the desired number of clients and servers. Computational results show that the CPU time used by our heuristic increases approximately linearly with the problem dimension, and the number of unserviced requests is a constant proportion of the total number of employees. Moreover, comparing two instances with the same dimension but different percentages of servers, one can see that the CPU time increases when fewer servers are available, as expected, while the total travel time decreases, since fewer paths are driven.
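The heuristic of Wolfler Calvo et al. (2004) is not reproduced in this article; the following sketch is only a generic greedy-insertion construction under simplifying assumptions (a single workplace "W", a travel-time matrix, and a per-server detour budget standing in for the full time-window checks), meant to illustrate the kind of assignment step such matching heuristics perform:

def best_insertion(route, client, travel):
    """Cheapest extra minutes (and position) for inserting client into the route."""
    best_cost, best_pos = float("inf"), None
    for i in range(len(route) - 1):
        a, b = route[i], route[i + 1]
        extra = travel[a][client] + travel[client][b] - travel[a][b]
        if extra < best_cost:
            best_cost, best_pos = extra, i + 1
    return best_cost, best_pos

def greedy_match(servers, clients, travel):
    """servers: {id: {"route": [home, "W"], "seats": n, "max_delay": minutes}}"""
    unserved = []
    for c in clients:
        best = None                                # (cost, position, server id)
        for s, info in servers.items():
            if info["seats"] == 0:
                continue
            cost, pos = best_insertion(info["route"], c, travel)
            if cost <= info["max_delay"] and (best is None or cost < best[0]):
                best = (cost, pos, s)
        if best is None:
            unserved.append(c)                     # request cannot be matched today
            continue
        cost, pos, s = best
        servers[s]["route"].insert(pos, c)
        servers[s]["seats"] -= 1
        servers[s]["max_delay"] -= cost            # remaining detour budget
    return servers, unserved

# Tiny invented example: one server at home "A", one client "C1".
travel = {"A": {"C1": 5, "W": 20}, "C1": {"W": 17}}
servers = {"s1": {"route": ["A", "W"], "seats": 2, "max_delay": 10}}
print(greedy_match(servers, ["C1"], travel))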
FUTURE TRENDS

Car-pooling services, as well as most other alternative public transportation services, are likely to become more and more common in the near future. Already, several software companies are marketing tailored solutions. Therefore, it is easy to envisage that systems featuring the services included in the prototype described in this work will define their own market niche. This, however, can happen only in parallel with an increasing sensitivity of local governments, which alone can define incentive policies for drivers who share their cars. These policies will reflect features of the implemented systems. As shown in this article, all the needed technological infrastructure is already available.
CONCLUSION

This article presents the essential elements of a prototypical deployment of ICT support for a car-pooling service in a large organization. The features that can make its actual use possible lie in the technological infrastructure, essentially in the distributed data management and the accompanying optimization, which help to provide real-time responses to user needs. The current possibility of using any data-access means as a system interface (we used e-mail, SMS, and the Web, but new smart phones and, more generally, UMTS can provide other access points) can ease system acceptance and end-user interest. However, ease of use alone would be of little interest if the solutions proposed, in terms of driver paths, were not of good quality. Current developments in optimization research, in the areas of both heuristic and exact approaches, provide a firm basis for designing support systems that can also deal with the instance sizes induced by large organizations.
REFERENCES

Baldacci, R., Maniezzo, V., & Mingozzi, A. (2004). An exact method for the car pooling problem based on Lagrangean column generation. Operations Research, 52(3), 422-439.

Carpoolmatch. (2004). http://www.carpoolmatchnw.org/

Carpooltool. (2004). http://www.carpooltool.com/en/my/

Colorni, A., Cordone, R., Laniado, E., & Wolfler Calvo, R. (1999). Innovation in transports: Planning and management [in Italian]. In S. Pallottino & A. Sciomachen (Eds.), Scienze delle decisioni per i trasporti. Franco Angeli.

Cordeau, J.-F., & Laporte, G. (2003). The dial-a-ride problem (DARP): Variants, modeling issues and algorithms. Quarterly Journal of the Belgian, French and Italian Operations Research Societies, 1, 89-101.
Dailey, D.J., Loseff, D., & Meyers, D. (1999). Seattle smart traveler: Dynamic ridematching on the World Wide Web. Transportation Research Part C, 7, 17-32.

Hildmann, H. (2001). An ants metaheuristic to solve car pooling problems [master's thesis]. University of Amsterdam, The Netherlands.

Lexington-Fayette County. (2004). Ride matching services. Retrieved from http://www.lfucg.com/mobility/rideshare.asp

Maniezzo, V. (2002). Decision support for location problems. Encyclopedia of Microcomputers, 8, 31-52. NY: Marcel Dekker.

Maniezzo, V., Carbonaro, A., & Hildmann, H. (2004). An ANTS heuristic for the long-term car pooling problem. In G.C. Onwuboulu & B.V. Babu (Eds.), New optimization techniques in engineering (pp. 411-430). Heidelberg, Germany: Springer-Verlag.

Mattarelli, M., Maniezzo, V., & Haastrup, P. (1998). A decision support system distributed on the Internet. Journal of Decision Systems, 6(4), 353-368.

Ridematching. (2004). University of South Florida, ridematching systems. Retrieved from http://www.nctr.usf.edu/clearinghouse/ridematching.htm

Ridepro. (2004). http://www.ridepro.net/index.asp

SAC. (2004). San Antonio College's carpool matching service. Retrieved from http://www.accd.edu/sac/carpool/

Varrentrapp, K., Maniezzo, V., & Stützle, T. (2002). The long term car pooling problem: On the soundness of the problem formulation and proof of NP-completeness. Technical Report AIDA-02-03. Darmstadt, Germany: Technical University of Darmstadt.

Wolfler Calvo, R., de Luigi, F., Haastrup, P., & Maniezzo, V. (2004). A distributed geographic information system for the daily car pooling problem. Computers & Operations Research, 31, 2263-2278.
KEY TERMS

Car Pooling: A collective transportation system based on the shared use of private cars (vehicles) with the objective of reducing the number of cars on the road.

Computational Problem: A relation between input and output data, where input data are known (and correspond to all possible different problem instances), and output data are to be identified, but predicates or assertions they must verify are given.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.

GIS: Geographic Information Systems; tools used to gather, transform, manipulate, analyze, and produce information related to the surface of the Earth.

Heuristic Algorithms: Optimization algorithms that do not guarantee to identify the optimal solution of the problem they are applied to, but which usually provide good quality solutions in an acceptable time.

NP-Hard Problems: Optimization problems for which a solution can be verified in polynomial time, but no polynomial solution algorithm is known, even though no one so far has been able to demonstrate that none exists.

Optimization Problem: A computational problem for which an objective function associates a merit figure with each problem solution, and it is asked to identify a feasible solution that minimizes or maximizes the objective function.
ENDNOTE

1. This article is an abridged and updated version of the paper, "A Distributed Geographic Information System for the Daily Car Pooling Problem," published as Wolfler et al., 2004.
Drawing Representative Samples from Large Databases Wen-Chi Hou Southern Illinois University, USA Hong Guo Southern Illinois University, USA Feng Yan Williams Power, USA Qiang Zhu University of Michigan, USA
INTRODUCTION Sampling has been used in areas like selectivity estimation (Hou & Ozsoyoglu, 1991; Haas & Swami, 1992, Jermaine, 2003; Lipton, Naughton & Schnerder, 1990; Wu, Agrawal, & Abbadi, 2001), OLAP (Acharya, Gibbons, & Poosala, 2000), clustering (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998; Palmer & Faloutsos, 2000), and spatial data mining (Xu, Ester, Kriegel, & Sander, 1998). Due to its importance, sampling has been incorporated into modern database systems. The uniform random sampling has been used in various applications. However, it has also been criticized for its uniform treatment of objects that have non-uniform probability distributions. Consider the Gallup poll for a Federal election as an example. The sample is constructed by randomly selecting residences’ telephone numbers. Unfortunately, the sample selected is not truly representative of the actual voters on the election. A major reason is that statistics have shown that most voters between ages 18 and 24 do not cast their ballots, while most senior citizens go to the poll-booths on Election Day. Since Gallup’s sample does not take this into account, the survey could deviate substantially from the actual election results. Finding representative samples is also important for many data mining tasks. For example, a carmaker may like to add desirable features in its new luxury car model. Since not all people are equally likely to buy the cars, only from a representative sample of potential luxury car buyers can most attractive features be revealed. Consider another example in deriving association rules from market basket data, recalling that the goal was to place items often purchased together in near locations. While serving ordinary customers, the store would like to pay some special
attention to customers who are handicapped, pregnant, or elderly. A uniform sampling may not be able to include enough such under-represented people. However, by giving higher inclusion probabilities to (the transaction records of) these under-represented customers in sampling, the special care can be reflected in the association rules. To find representative samples for populations with non-uniform probability distributions, some remedies, such as the density-biased sampling (Palmer & Faloutsos, 2000) and the Acceptance/Rejection (AR) sampling (Olken, 1993), have been proposed. The density-biased sampling is specifically designed for applications where the probability of a group of objects is inversely proportional to its size. The AR sampling, based on the "acceptance/rejection" approach (Rubinstein, 1981), aims at all probability distributions and is probably the most general approach discussed in the database literature. We are interested in finding a general, efficient, and accurate sampling method applicable to all probability distributions. In this research, we develop a Metropolis sampling method, based on the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953), to draw representative samples. As will become clear, the sample generated by this method is bona fide representative.
BACKGROUND

A representative sample must satisfy some criteria. First, the sample mean and variance must be good estimates of the population mean and variance, respectively, and converge to the latter as the sample size increases. In addition, a selected sample must have a
similar distribution to that of the underlying population. In the following, we briefly describe the population mean and variance and the Chi-square test (Spiegel, 1991) used to examine the similarity of distributions.

Mean and Variance Estimation

Let $\vec{x}$ be a d-dimensional vector representing a set of d attributes that characterizes an object in a population, and $\rho$ the quantity of interest of the object, denoted as $\rho(\vec{x})$. Our task is to calculate the mean and variance of $\rho(\vec{x})$ over the population (relation). Let $w(\vec{x}) \ge 0$ be the probability or weight function of $\vec{x}$. The probability distribution is required to satisfy

$$\sum_{\text{all } \vec{x}} w(\vec{x}) = 1. \qquad (1)$$

The population mean of $\rho(\vec{x})$, denoted by $\bar{\rho}$, is given by

$$\bar{\rho} = \sum_{\text{all } \vec{x}} \rho(\vec{x})\, w(\vec{x}). \qquad (2)$$

For example, let $\vec{x}$ be a registered voter. Assuming there are only two candidates, a Republican candidate and a Democratic candidate, we can let $\rho(\vec{x}) = 1$ if the voter $\vec{x}$ will cast a vote for the Republican candidate, and $\rho(\vec{x}) = -1$ otherwise; $w(\vec{x})$ is the weight of the registered voter $\vec{x}$. If $\bar{\rho}$ is positive, the Republican candidate is predicted to win the election. Otherwise, the Democratic candidate wins the election.

Another useful quantity is the population variance, which is defined as

$$\Delta^2 = \sum_{\text{all } \vec{x}} (\rho(\vec{x}) - \bar{\rho})^2\, w(\vec{x}). \qquad (3)$$

The variance specifies the variability of the $\rho(\vec{x})$ values relative to $\bar{\rho}$.

Chi-Square Test

To compare the distributions of a sample and its population, we perform the Chi-square test (Press, Teukolsky, Vetterling, & Flannery, 1994) by calculating

$$\chi^2 = \sum_{i=1}^{k} (r_i - N w_i)^2 / (N w_i), \qquad (4)$$

where $r_i$ is the number of sample objects drawn from the i-th bin, $\sum_{i=1}^{k} r_i = N$ is the sample size, $w_i$ is the probability of the i-th bin of the population, and $N w_i$ is the expected number of sample objects from that bin. A bin here refers to a designated group or range of values. The larger the $\chi^2$ value, the greater is the discrepancy between the sample and the population distributions. Usually, a level of significance $\alpha$ is specified as the uncertainty of the test. If the value of $\chi^2$ is less than $\chi^2_{1-\alpha}$, we are about $1-\alpha$ confident that the sample and population have similar distributions; customarily, $\alpha = 0.05$. The value of $\chi^2_{1-\alpha}$ is also determined by the degrees of freedom involved (i.e., $k - 1$).
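As a concrete reading of Equations (1)-(4), the following sketch computes the weighted population mean and variance and the chi-square statistic for a small enumerable population; the function and variable names are ours, not from the paper, and the numbers in the usage lines are invented:

from collections import Counter

def population_mean_and_variance(objects, rho, w):
    """Equations (2) and (3): weighted mean and variance of rho over the population.
    `w` must already be normalized so that the weights sum to 1 (Equation (1))."""
    mean = sum(rho(x) * w(x) for x in objects)
    var = sum((rho(x) - mean) ** 2 * w(x) for x in objects)
    return mean, var

def chi_square(sample, bin_of, bin_probs):
    """Equation (4): chi-square statistic comparing a sample with its population.
    bin_of maps an object to its bin; bin_probs[i] is the population probability w_i."""
    n = len(sample)
    counts = Counter(bin_of(x) for x in sample)
    return sum((counts.get(i, 0) - n * p) ** 2 / (n * p) for i, p in bin_probs.items())

print(population_mean_and_variance([1, 2, 3], rho=float, w={1: .2, 2: .3, 3: .5}.get))  # (2.3, 0.61)
print(chi_square([1, 1, 2, 3, 3, 3], bin_of=lambda x: x, bin_probs={1: .2, 2: .3, 3: .5}))  # about 0.89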
MAIN THRUST

The Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953) has been known as the most successful and influential Monte Carlo method. Unlike its usual use in numerical calculations, we shall use it to construct representative samples. In addition, we will also incorporate techniques for finding the best starting point for sampling into the algorithm, which can greatly improve the efficiency of the process.
Probability Distribution

The probability distribution $w(\vec{x})$ plays an important role in the Metropolis algorithm. Unfortunately, such information is usually unknown or difficult to obtain due to the incompleteness or size of a population. However, the relative probability distribution, or non-normalized probability distribution, denoted by $W(\vec{x})$, can often be obtained from, for example, preliminary analysis, knowledge, statistics, and so on. Take the Gallup poll for example. While it may be difficult or impossible to assign a weight (i.e., $w(\vec{x})$) to each individual voter, it can easily be known, for example from the Federal Election Commission, that the relative probabilities for people to vote on Election Day (i.e., $W(\vec{x})$) are 18.5%, 38.7%, 56.5%, and 61.5% for groups whose ages fall in 18-24, 25-44, 45-65, and 65+, respectively. Fortunately, the relative probability distribution $W(\vec{x})$ suffices to perform the calculations and construct a sample (Kalos & Whilock, 1986).

Sampling Procedure

Similar to simple random sampling, objects are selected randomly from the population one after another. The Metropolis sampling guarantees that the sample selected has a similar distribution to that of the population.

Selecting the Starting Point

Objects with higher weight are more important than others. If a method does not start with those objects, we could miss them in the process or take a long time to incorporate them into the sample, which is formidable in cost and detrimental to accuracy. In the following, we propose a general approach to selecting a starting point. We begin by searching for an object with the maximum $W(\vec{x})$. If there are several objects with the same maximum value, we can pick any of them as the "best starting host" $\vec{x}_1$. In many applications, such as the Gallup example, finding the maximal value of $W(\vec{x})$ is straightforward. For others, there are several useful approaches for searching for the maximum of functions, such as the Golden Section search, the Downhill Simplex method, the Conjugate Gradient method, and so on. A good review of these methods can be found in the reference (Press, Teukolsky, Vetterling, & Flannery, 1994).

Incorporating Objects into the Sample

Let the object last added to the sample be $\vec{x}_i$. Now, we pick a trial object $\vec{y}$ from the population randomly. In general, there is no restriction on how to select $\vec{y}$. The only requirement is that the random selection method must provide a chance for every element in the entire data space to be picked.

We now decide if the trial object $\vec{y}$ should be incorporated into the sample. We calculate the ratio $\theta = W(\vec{y}) / W(\vec{x}_i)$. If $\theta \ge 1$, we accept $\vec{y}$ into the sample and let $\vec{x}_{i+1} = \vec{y}$. That is, $\vec{y}$ becomes the last object added to the sample. If $\theta < 1$, we generate a random number R, which has an equal probability to lie between 0 and 1. If $R \le \theta$, the trial object $\vec{y}$ is accepted into the sample and we let $\vec{x}_{i+1} = \vec{y}$. Otherwise, the trial object $\vec{y}$ is rejected and we let $\vec{x}_{i+1} = \vec{x}_i$. It is noted that in the latter situation, we incorporate the just-selected object $\vec{x}_i$ into the sample again (i.e., $\vec{x}_{i+1} = \vec{x}_i$). Therefore, an object with a high probability may appear more than once in our sample.

The above step is called a Monte Carlo step. After each Monte Carlo step, we add one more object into the sample. The Monte Carlo step is repeated until a predefined sample size N is reached. It is expected that the sample average converges very fast as N increases and the fluctuation decreases in the order of $1/\sqrt{N}$ (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). The complete sampling procedure, which includes selecting the starting point and the Metropolis algorithm, is summarized in Figure 1.

Figure 1. The Metropolis sampling
(1) Locate an object with the maximum weight $W(\vec{x})$ as the first object $\vec{x}_1$ of the sample.
(2) for i from 1 to N-1 do
(3)   randomly select an object $\vec{y}$;
(4)   compute $\theta = W(\vec{y}) / W(\vec{x}_i)$;
(5)   if $\theta \ge 1$ then $\vec{x}_{i+1} = \vec{y}$;
(6)   else generate a random number R;
(7)     if $R \le \theta$ then $\vec{x}_{i+1} = \vec{y}$;
(8)     else $\vec{x}_{i+1} = \vec{x}_i$;
(9)     end if;
(10)  end if;
(11) end for.

Properties of a Metropolis Sample

Since the selection starts with an object having the maximum weight $W(\vec{x})$, it ensures that the sample always includes the most important objects. In addition,
the sample has a distribution close to that of the population when the sample size is large enough. Let $r_l$ be a random variable denoting the number of occurrences of object $x_l$ ($l = 1, 2, \ldots, k$) in the sample. Then

$$E(r_1) : E(r_2) : \ldots : E(r_k) \cong W(x_1) : W(x_2) : \ldots : W(x_k),$$

when the sample size is large enough, where $E(r_l)$ is the expected number of occurrences of object $x_l$ in the sample. We also have $E(r_1)/N \cong w(x_1)$, $E(r_2)/N \cong w(x_2)$, ..., $E(r_k)/N \cong w(x_k)$. Finally, it should be pointed out that when all objects have an equal weight, the Metropolis sampling degenerates to simple random sampling.
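A direct Python transcription of the procedure in Figure 1 is sketched below; it assumes the population fits in an indexable collection and that the relative weights W are available as a function. The age-group example at the end only reuses the voting rates quoted earlier, treating each group as a single object, purely to exercise the function:

import random

def metropolis_sample(population, W, n):
    """Draw a sample of size n following the steps of Figure 1. `population` is an
    indexable collection and W returns the relative (non-normalized) weight of an
    object; repeated objects in the output are intended."""
    current = max(population, key=W)               # step (1): maximum-weight start
    sample = [current]
    while len(sample) < n:
        trial = random.choice(population)          # step (3): random trial object
        theta = W(trial) / W(current)              # step (4): weight ratio
        if theta >= 1 or random.random() <= theta: # steps (5)-(7): accept the trial
            current = trial
        # step (8): otherwise the current object is re-added
        sample.append(current)
    return sample

groups = ["18-24", "25-44", "45-65", "65+"]
weight = dict(zip(groups, [18.5, 38.7, 56.5, 61.5]))
s = metropolis_sample(groups, weight.get, 10000)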
Experimental Results

Now, we report the results of our empirical evaluations of the Metropolis sampling and the AR sampling. We compare the efficiency of the methods and the quality of the samples yielded. To compare the methods quantitatively, we select a model for which we know the analytic values. Here, we have chosen the Gaussian model $w(\vec{x}) = (1/\pi)^{d/2} e^{-\vec{x}^2}$, where d is the dimension, and $\rho(\vec{x}) = \vec{x}^2$. The mean, the second moment, and the variance are:

$$\bar{\rho} = \int \vec{x}^2\, w(\vec{x})\, d\vec{x} = d/2, \qquad (5)$$

$$\overline{\rho^2} = \int (\vec{x}^2)^2\, w(\vec{x})\, d\vec{x} = (d^2 + 2d)/4, \qquad (6)$$

$$\Delta^2 = \overline{\rho^2} - (\bar{\rho})^2 = d/2. \qquad (7)$$

Figure 2 shows a 2-D Gaussian distribution. It has a high peak at the center. Only a small region near the center makes significant contributions to the integrations of Equations (5) and (6), and the vast remaining region has vanishing contributions. As the dimension increases, the peak becomes higher and narrower, and the distribution becomes more "skewed." In our experiments, we let d = 1, 3, 10, and 20.
Sampling Efficiency

First, let us compare the cost of sampling. As shown in Figure 3, the AR method roughly needs 5, 10, 75, and 1,750 trials to accept just one object into the sample in the 1-, 3-, 10-, and 20-dimensional cases, respectively. It is noted that the higher the dimension, the more "skewed" the Gaussian distribution and the greater the $W_{max}/W_{avg}$ value. This explains why the AR sampling becomes less efficient as the dimension increases. In comparison, our Metropolis sampling method accepts one object in every trial.

Figure 2. A Gaussian distribution: d=2
Figure 3. Cost of AR sampling (trials vs. sample size for d = 1, 3, 10, and 20)

Quality of the Samples

Instead of showing the variance of estimation $\langle \Delta^2 \rangle$, here we show the second moment of $\rho$, $\langle \rho^2 \rangle$. The second moment tells how different the estimates are from the actual values, especially when the estimate is biased. As shown in Figures 4 and 5, both methods yield pretty accurate estimates of the population mean (= 0.5) and second moment (= 0.75) for the one-dimensional case. Hence, from Equation (7), they also give good estimates of the variance. As shown in Figures 6 and 7 for d=3, the AR sampling yields estimates around 1.0 and 1.7 for the mean and second moment, respectively, which are below their respective exact values 3/2 and 15/4. On the other hand, our Metropolis sampling yields quite accurate estimates. As the sample size gets larger, our estimates get closer to the analytic values, but AR's do not. Similar results are also observed for higher dimensional cases. As shown in Figures 8 and 9, when d=20, the AR sampling yields estimates around 5.1 and 29 for the mean and second moment, respectively, which are well below the exact values 10 and 110. As for our method, a sample of size as small as 10,000 objects already gives $\langle \rho \rangle \approx 10$ and $\langle \rho^2 \rangle \approx 110$.

The underestimation of the AR sampling was attributed to two factors: the acceptance/rejection criterion $W(\vec{x})/W_{max}$ of the AR sampling and the random number generator. In the Gaussian model described earlier, $W_{max}$ appears at the center (i.e., $\vec{x} = 0$), and the weight $W(\vec{x})$ diminishes quickly as we move away from the center. Indeed, the majority of the points have very small weights and thus small $W(\vec{x})/W_{max} = e^{-\vec{x}^2}$ ratios, especially when the dimension is high. The low $W(\vec{x})/W_{max}$ values could make the remote points not selectable when compared with the random numbers generated in the process. The random number generators on most computers are based on the linear congruential method, which first generates a sequence of integers by the recurrence relation $I_{j+1} = a I_j + c \pmod{m}$, where m is the modulus, and a and c are positive integers (Press, Teukolsky, Vetterling, & Flannery, 1994). The random numbers generated are $I_1/m, I_2/m, I_3/m, \ldots$, and the smallest number generated is $1/m$. As a result, a trial point in the AR sampling whose $W(\vec{x})/W_{max}$ is smaller than $1/m$ cannot be accepted. Since these "remote" points have larger values than the points near the center (recalling that $\rho(\vec{x}) = \vec{x}^2$), underestimation sets in. The higher the dimension, the more "skewed" the distribution, and the more serious the underestimation. Increasing the sample size would not help because those points simply will not be accepted. On the other hand, our Metropolis sampling uses the weight ratio of the trial and the last accepted objects. As points away from the center are accepted, the chances of accepting remote points increase. Therefore, it is immune from the difficulty associated with the AR sampling method. Based on our experiments with the Gaussian distributions, a sufficient minimum size should be in the neighborhood of 500d to 1,000d, where d is the number of dimensions of the population. For smoother distributions, the results can be even better.

Figure 4. Sample means for d=1, $\langle \rho \rangle$ = 0.5
Figure 5. Sample means of the second moment for d=1, $\langle \rho^2 \rangle$ = 0.75
Figure 6. Sample means for d=3, $\langle \rho \rangle$ = 1.5
Figure 7. Sample means of the second moment for d=3, $\langle \rho^2 \rangle$ = 15/4
Figure 8. Sample means for d=20, $\langle \rho \rangle$ = 10
Figure 9. Sample means of the second moment for d=20, $\langle \rho^2 \rangle$ = 110
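To make the acceptance/rejection criterion $W(\vec{x})/W_{max}$ concrete, the following sketch accepts a trial point of the Gaussian model with probability $e^{-\vec{x}^2}$. It is a generic illustration with an invented uniform proposal over $[-3, 3]^d$, not the AR sampler of Olken (1993), but it shows how quickly the trials-per-accepted-object ratio grows with the dimension (compare Figure 3):

import math, random

def ar_sample_gaussian(d, n, proposal):
    """Accept a candidate x when a uniform random number is <= W(x)/Wmax = exp(-|x|^2)."""
    sample, trials = [], 0
    while len(sample) < n:
        x = proposal(d)
        trials += 1
        ratio = math.exp(-sum(c * c for c in x))   # W(x)/Wmax for this model
        if random.random() <= ratio:
            sample.append(x)
    return sample, trials

uniform_box = lambda d: [random.uniform(-3.0, 3.0) for _ in range(d)]
# Only d = 1 and d = 3 are run here; for d = 10 or 20 the number of trials per
# accepted object under this naive proposal becomes far too large for a quick demo.
for d in (1, 3):
    _, t = ar_sample_gaussian(d, 200, uniform_box)
    print(d, t / 200.0)        # rough trials per accepted object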
Distribution

To examine whether a sample has a similar distribution to the underlying population, generally a Chi-square test is performed. Since the Chi-square test is designed for discrete data, we divide each dimension into 7 bins of equal length to compute the $\chi^2$ values.

Figure 10 shows the $\chi^2$-test results of AR's and our samples. We have chosen the commonly used significance level $\alpha = 0.05$ for the test. It is observed that when the sample size N > 200, our $\chi^2$ is below $\chi^2_{1-\alpha} = \chi^2_{0.95} = 12.6$ with $\nu = 6$ (= 7 - 1). This indicates the agreement of the sample with the population distributions. The $\chi^2$ value stabilizes as N gets larger. We have also performed tests on other bin values and the results are similar, all verifying the agreement of the sample distribution with the population distribution. As for the AR sampling, it also performs very well, showing a strong agreement of the sample distribution with the underlying population distribution.

Extending this test to the three-dimensional Gaussian model, we concentrate on the cubic space divided into $7^3 = 343$ bins. The results are plotted in Figure 11. As observed, our $\chi^2$ values are always less than $\chi^2_{0.95}$ ($\approx 386$ with $\nu = 342$), indicating a good agreement with the population distribution. As for the AR sampling, it is observed that for any sample of reasonable size, its $\chi^2$ well exceeds $\chi^2_{0.95} \approx 386$ for $\nu = 342$; moreover, its $\chi^2$ increases with the sample size. This indicates that the samples have different distributions from the population. As explained earlier, this is because the AR sampling could not include points with very low weights, and thus the more points are included (or the larger the sample size), the more different the sample distribution is from the population distribution. These results are consistent with the evaluation of means and variances discussed in the previous sections. The AR sampling may work well when the probability distribution is not very skewed, but our approach works for all distributions. In addition, our approach is also more efficient.
Figure 10. Chi-square test on 1-d samples, with $\chi^2_{0.95} = 12.6$ and $\nu = 6$
Figure 11. Chi-square test on 3-d samples, with $\chi^2_{0.95} = 386$ and $\nu = 342$

FUTURE TRENDS

While modern computers become more and more powerful, many databases in social, economic, engineering, scientific, and statistical applications may still be too
large to handle. Sampling, therefore, becomes a necessity for analyses, surveys, and numerical calculations in these applications. In addition, for many modern applications in OLAP and data mining, where fast responses are required, sampling also becomes a viable approach for constructing an in-core representation of the data. A general sampling algorithm that applies to all distributions is needed more than ever.
CONCLUSION

The Metropolis sampling presented in this paper can be a useful and powerful tool for studying large databases of any distribution. We propose to start the sampling by taking an object from where the probability distribution has its maximum. This guarantees that the sample always includes the most important objects and improves the efficiency of the process. Our experiments also indicate a strong agreement between the selected sample and the population distributions. The selected sample is bona fide representative, better than the samples produced by other existing methods.

REFERENCES

Acharya, S., Gibbons, P., & Poosala, V. (2000). Congressional samples for approximate answering of group-by queries. In Proceedings of the ACM SIGMOD Conference (pp. 487-498).

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining application. In Proceedings of the ACM SIGMOD Conference (pp. 94-105).

Haas, P., & Swami, A. (1992). Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD Conference (pp. 341-350).

Hou, W.-C., & Ozsoyoglu, G. (1991). Statistical estimators for aggregate relational algebra queries. ACM Transactions on Database Systems, 16(4), 600-654.

Jermaine, C. (2003). Robust estimation with sampling and approximate pre-aggregation. In Proceedings of the VLDB Conference (pp. 886-897).

Kalos, M., & Whilock, P. (1986). Monte Carlo methods: Basics. New York: John Wiley & Sons.

Lipton, R., Naughton, J., & Schnerder, D. (1990). Practical selectivity estimation through adaptive sampling. In Proceedings of the ACM SIGMOD Conference (pp. 1-11).

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6), 1087-1092.

Olken, F. (1993). Random sampling from databases. Ph.D. dissertation. University of California.

Palmer, C., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. In Proceedings of the ACM SIGMOD Conference, 29(2), 82-92.

Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1994). Numerical recipes in C. Cambridge: Cambridge University Press.

Rubinstein, R. (1981). Simulation and the Monte Carlo method. New York: John Wiley & Sons.

Spiegel, M. (1991). Probability and statistics. McGraw-Hill, Inc.

Wu, Y., Agrawal, D., & Abbadi, A. (2001). Using the golden rule of sampling for query estimation. In Proceedings of the ACM SIGMOD Conference (pp. 279-290).

Xu, X., Ester, M., Kriegel, H., & Sander, J. (1998). A distribution-based clustering algorithm for mining in large spatial databases. In Proceedings of the IEEE ICDE Conference (pp. 324-331).
KEY TERMS

Metropolis Algorithm: Was proposed in 1953 by Metropolis et al. for studying statistical physics. Since then it has become a powerful tool for investigating thermodynamics, solid state physics, biological systems, and so on. The algorithm is known as the most successful and influential Monte Carlo method.

Monte Carlo Method: The heart of this method is a random number generator. The term "Monte Carlo Method" now stands for any sort of stochastic modeling.

OLAP: Online Analytical Processing.

Representative Sample: A sample whose distribution is the same as that of the underlying population.

Sample: A set of elements drawn from a population.

Selectivity: The ratio of the number of output tuples of a query to the total number of tuples in the relation.

Uniform Sampling: All objects or clusters of objects are drawn with equal probability.
Efficient Computation of Data Cubes and Aggregate Views Leonardo Tininini CNR - Istituto di Analisi dei Sistemi e Informatica “Antonio Ruberti,” Italy
INTRODUCTION This paper reviews the main techniques for the efficient calculation of aggregate multidimensional views and data cubes, possibly using specifically designed indexing structures. The efficient evaluation of aggregate multidimensional queries is obviously one of the most important aspects in data warehouses (OLAP systems). In particular, a fundamental requirement of such systems is the ability to perform multidimensional analyses in online response times. As multidimensional queries usually involve a huge amount of data to be aggregated, the only way to achieve this is by pre-computing some queries, storing the answers permanently in the database and reusing these almost exclusively when evaluating queries in the multidimensional database. These pre-computed queries are commonly referred to as materialized views and carry several related issues, particularly how to efficiently compute them (the focus of this paper), but also which views to materialize and how to maintain them.
BACKGROUND Multidimensional data are obtained by applying aggregations and statistical functions to elementary data, or more
precisely to data groups, each containing a subset of the data and homogeneous with respect to a given set of attributes. For example, the data “Average duration of calls in 2003 by region and call plan” is obtained from the so-called fact table, which is usually the product of complex source integration activities (Lenzerini, 2002) on the raw data corresponding to each phone call in that year. Several groups are defined, each consisting of calls made in the same region and with the same call plan, and finally applying the average aggregation function on the duration attribute of the data in each group (see Figure 1). The triple of values (region, call plan, year) is used to identify each group and is associated with the corresponding average duration value. In multidimensional databases, the attributes used to group data define the dimensions, whereas the aggregate values define the measures. The term multidimensional data comes from the wellknown metaphor of the data cube (Gray, Bosworth, Layman, & Pirahesh, 1996). For each of n attributes, used to identify a single measure, a dimension of an n-dimensional space is considered. The possible values of the identifying attributes are mapped to points on the dimension’s axis, and each point of this n-dimensional space is thus mapped to a single combination of the identifying attribute values and hence to a single aggregate value. The collection of all these points, along with
Figure 1. From facts to data cubes and drill-down on the time dimension (phone calls in 2003, averaged on the duration attribute and classified by region, call plan, and time; the drill-down refines the year 2003 into quarters)
all possible projections in lower dimensional spaces, constitutes the so-called data cube. In most cases, dimensions are structured in hierarchies, representing several granularity levels of the corresponding measures (Jagadish, Lakshmanan, & Srivastava, 1999). Hence a time dimension can be organized into days, months, quarters and years; a territorial dimension into towns, regions and countries; a product dimension into brands, families and types. When querying multidimensional data, the user specifies the measures of interest and the level of detail required by indicating the desired hierarchy level for each dimension. In a multidimensional environment querying is often an exploratory process, where the user "moves" along the dimension hierarchies by increasing or reducing the granularity of displayed data. The drill-down operation corresponds to an increase in detail, for example, by requesting the number of calls by region and quarter, starting from data on the number of calls by region or by region and year. Conversely, roll-up allows the user to view data at a coarser level of granularity. Multidimensional querying systems are commonly known as OLAP (Online Analytical Processing) Systems, in contrast to conventional OLTP (Online Transactional Processing) Systems. The two types have several contrasting features, although they share the same requirement of fast online response times:

• Number of records involved: One of the key differences between OLTP and multidimensional queries is the number of records required to calculate the answer. OLTP queries typically involve a rather limited number of records, accessed through primary key or other specific indexes, which need to be processed for short, isolated transactions or to be issued on a user interface. In contrast, multidimensional queries usually require the classification and aggregation of a huge amount of data.
• Indexing techniques: Transaction processing is mainly based on the access of a few records through primary key or other indexes on highly selective attribute combinations. Efficient access is easily achieved by well-known and established indexes, particularly B+-tree indexes. In contrast, multidimensional queries require a more articulated approach, as different techniques are required and each index performs well only for some categories of queries/aggregation functions (Jürgens & Lenz, 1999).
• Current state vs. historical DBs: OLTP operations require up-to-date data. Simultaneous information access/update is a critical issue and the database usually represents only the current state of the system. In OLAP systems, the data does not need to be the most recent available and should in fact be time-stamped, thus enabling the user to perform historical analyses with trend forecasts. However, the presence of this temporal dimension may cause problems in query formulation and processing, as schemes may evolve over time and conventional query languages are not adequate to cope with them (Vaisman & Mendelzon, 2001).
• Target users: Typical OLTP system users are clerks, and the types of query are rather limited and predictable. In contrast, multidimensional databases are usually the core of decision support systems, targeted at management level. Query types are only partly predictable and often require highly expressive (and complex) query language. However, the user usually has little experience even in "easy" query languages like basic SQL: the typical interaction paradigm is a spreadsheet-like environment based on iconic interfaces and the graphical metaphor of the multidimensional cube (Cabibbo & Torlone, 1998).
MAIN THRUST In this section we briefly analyze the techniques proposed to compute data cubes and, more generally, materialized views containing aggregates. The focus here is on the exact calculation of the views from scratch. In particular, we do not consider (a) the problems of aggregate view maintenance in the presence of insertions, deletions and updates (Kotidis & Roussopoulos, 1999; Riedewald, Agrawal, & El Abbadi, 2003) and (b) the approximated calculation of data cube views (Wu, Agrawal, & El Abbadi, 2000; Chaudhuri, Das, & Narasayya, 2001), as they are beyond the scope of this paper.
(Efficiently) Computing Data Cubes and Materialized Views

A typical multidimensional query consists of an aggregate group-by query applied to the join of the fact table with two or more dimension tables. In consequence, it has the form of an aggregate conjunctive query, for example:

SELECT D1.dim1, D2.dim2, AGG(F.measure)
FROM fact_table F, dim_table1 D1, dim_table2 D2
WHERE F.dimKey1 = D1.dimKey1 AND F.dimKey2 = D2.dimKey2
GROUP BY D1.dim1, D2.dim2 (Q1)

where AGG is an aggregation function, such as SUM, MIN, AVG, etc. For example, in the above-mentioned phone call data warehouse the fact table Phone_calls may have the (simplified) schema (call_id, territ_id, call_plan_id, duration), and the dimension tables Terr and Call_pl the simplified schemas (territ_id, town, region) and (call_plan_id, call_plan_name), respectively. The aggregation illustrated in Figure 1 would be performed by the following query:

SELECT D1.region, D2.call_plan_name, AVG(F.duration)
FROM Phone_calls F, Terr D1, Call_pl D2
WHERE F.territ_id = D1.territ_id AND F.call_plan_id = D2.call_plan_id
GROUP BY D1.region, D2.call_plan_name (Q2)
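As a small Python illustration of how the groups of a query like (Q2) can be formed by hashing, and of how, for a distributive function such as SUM or COUNT, a coarser aggregate can be derived from a finer one rather than from the fact table (a property exploited by the algorithms discussed below), consider the following sketch; the tuples and dimension values are invented and it is not meant to mirror any specific system:

from collections import defaultdict

facts = [  # (region, call_plan, year, duration)
    ("North", "CP1", 2003, 12.0),
    ("North", "CP2", 2003, 7.5),
    ("South", "CP1", 2003, 3.0),
    ("North", "CP1", 2003, 4.5),
]

def group_by_sum(rows, key, value):
    groups = defaultdict(float)            # hash table: group key -> running sum
    for row in rows:
        groups[key(row)] += value(row)
    return dict(groups)

fine = group_by_sum(facts, key=lambda r: (r[0], r[1], r[2]), value=lambda r: r[3])
# Roll-up: re-aggregate the finer cube cells instead of rescanning the fact table.
coarse = group_by_sum(fine.items(), key=lambda kv: (kv[0][0], kv[0][2]),
                      value=lambda kv: kv[1])
print(coarse)   # {('North', 2003): 24.0, ('South', 2003): 3.0}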
Traditional query processing systems first perform all joins expressed in the FROM and WHERE clause, and only afterwards perform the grouping on the result of the join and aggregation on each group. The algorithms producing the groups can be broadly classified as techniques based on (a) sorting on the GROUP BY attributes and (b) hashing tables [see Graefe (1993)]. However, there are many common cases where an early evaluation of the GROUP BY is possible. This can significantly reduce calculation time, as it (i) reduces the input size of the join (usually very large in the context of multidimensional databases) and (ii) enables the query processing engine to use indexes to perform the (early) GROUP BY on the base tables. In Chaudhuri & Shim (1995, 1996), some techniques and applicability conditions are proposed for transforming execution plans into equivalent (more efficient) ones. As is typical in query optimization, the technique is based on pull-up transformations, which delay the execution of a costly operation (e.g., a group by on a large dataset) by moving it towards the root of the query tree, and on pushdown transformations, used for example to anticipate an aggregation, thus decreasing the size of a join. Several transformations for multidimensional queries are also proposed in Gupta, Harinarayan, & Quass (1995), based on the concept of generalized projection (GP). Transformations enable the optimizer to (i) push a GP down the query tree; (ii) pull a GP up the query tree; (iii) coalesce two GPs into one, or conversely split one GP into two. Query tree transformations are also used in the rewriting process. In Agarwal et al. (1996) some algorithms are proposed and compared to extend the traditional techniques for the GROUP BY query evaluation to the CUBE operator processing. These are applicable in the case of distributive aggregate functions and are based on the property that higher-level aggregates can be calculated from lower levels in this case. In MOLAP systems, however, the above methods for CUBE calculation are inadequate, as they are substantially based on sorting and hashing techniques, which can not be applied to multidimensional arrays. In Zhao, Deshpande, & Naughton (1997) an algorithm is proposed to calculate CUBEs in a MOLAP environment. It is shown that this can
be made significantly more efficient (and even more so than ROLAP-based calculations) by exploiting the inherently compressed data representation of MOLAP systems. The algorithm particularly benefits from the compactness of multidimensional arrays, enabling the query processor to transfer larger “chunks” of data to the main memory and efficiently process them. In many practical cases GROUP BYs at the finest granularity level correspond to sparse data cubes, that is, cubes where a high percentage of points correspond to null values. Consider for instance the CUBE corresponding to the query “Number of calls by customer, day and antenna:” it is evident that a considerable number of combinations correspond to zero. Fast techniques to compute sparse data cubes are proposed in Ross & Srivastava (1997). They are based on (i) decomposing the fact table into fragments that can be stored in main memory; (ii) computing the data cube in the main memory for each fragment and finally (iii) combining the partial results obtained in (i) and (ii).
Using Indexes to Improve Efficiency The most common indexes used in traditional DBMSs are probably B+-trees, a particular form labeled with index keyvalues and having a list of record identifiers on the leaf level; that is, a list of elements specifying the actual position of each record on the disk. Index keyvalues may consist of one or more columns of the indexed table. OLTP queries usually retrieve a very limited number of tuples (or even a single tuple accessed through the primary key index) and in these cases B+-trees have been demonstrated as particularly efficient. In contrast, OLAP queries typically involve aggregation of large tuples groups, requiring specifically designed indexing structures. In contrast to the OLTP context, there is no “universally good” index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others. Let us again consider the typical multidimensional query expressed by the SQL query (Q1). The core operations related to its evaluation are: (1) the joins of the fact table with two or more dimension tables, (2) tuple grouping by various dimensional values, and (3) application of an aggregation function to each tuple group. An interesting index type which can be used to efficiently perform operation (1) is the join index; while conventional indexes map column values to records in one table, join indexes map them to records in two (or more) joined tables, thus constituting a particular form of materialized view. Join indexes in their original version can not be used directly for efficient evaluation of OLAP queries, but can be very effective in combination with other 423
indexing techniques, such as bitmaps and partitioning. Bitmap indexes (Chan & Ioannidis, 1998) are useful for tuple grouping and performing some aggregation forms. In practice, these indexes use a bitmap representation for the list of record identifiers in the tree’s leaf level: if table t contains n records, then each leaf of the bitmap index (corresponding to a specific value c of the indexed column C) contains a sequence of n bits, where the i-th bit is set to 1 if ti.C=c, and otherwise to zero. Bitmap representations are indicated when the number of distinct keyvalues is low and several predicates of the form (Column = value) are to be combined in AND/OR, as the operation can be efficiently performed by AND-/OR-ing the corresponding bitmap representation bit to bit. This operation can be performed in parallel and very efficiently on modern processors. Finally, bitmap indexes can be used for fast count query evaluation, as counting can be performed directly on the bitmap representation without even accessing the selected records. In projection indexes (O’Neil & Quass, 1997) the tree access structure is coupled with a sort of materialized view, representing the projection of the table on the indexed column. The technique has some analogies with vertically partitioned tables and is indicated when the aggregate operations need to be performed on one or more indexed columns. Bit-sliced indexes (O’Neil & Quass, 1997) can be considered as a combination of the two previous techniques. Values of the projected column are encoded and a bitmap associated with each resulting bit component. The technique has some analogies with bit transposed files (Wong, Li, Olken, Rotem, & Wong, 1986), which were proposed for query evaluation in very large scientific and statistical databases. These indexes work best for SUM and AVG aggregations, but are not well suited for aggregations involving more than one column. In Chan & Ioannidis (1998, 1999) some variations on the general idea of bitmap indexes are presented. Encoding schemes, time-optimal and space-optimal indexes, and trade-off solutions are studied. A comparison of the use of STR-tree based indexes (a particular form of spatial index) and some variations of bitmap indexes in the context of OLAP range queries can be found in Jürgens & Lenz (1999).
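The following sketch illustrates the bitmap idea on invented data, using one Python integer per key value as the bit vector: conjunctive predicates become bitwise ANDs and COUNT(*) becomes a population count, with no access to the underlying records. It is only a toy rendering of the principle, not a real index implementation:

regions = ["North", "South", "North", "Center", "North", "South"]
plans   = ["CP1",   "CP1",   "CP2",   "CP1",    "CP1",   "CP3"]

def build_bitmap_index(column):
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)   # set this row's bit
    return index

region_idx = build_bitmap_index(regions)
plan_idx = build_bitmap_index(plans)

# COUNT(*) WHERE region = 'North' AND call_plan = 'CP1'
hits = region_idx["North"] & plan_idx["CP1"]
print(bin(hits), bin(hits).count("1"))   # rows 0 and 4 -> count = 2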
FUTURE TRENDS The efficiency of many evaluation techniques is strictly related to the adopted query language and storage technique (e.g., MOLAP and ROLAP). This stresses the importance of a standardization process: there is indeed a general consensus on data warehouse key concepts, but a common unified framework for multidimensional data
querying is still lacking, particularly a standardized query language independent from the specific storage technique. Although research results on the exact computation of data cubes in the last few years have been mainly incremental, several interesting techniques have been proposed for the approximated computation, particularly some based on wavelets, which certainly require further investigations. In contrast to the OLTP context, we have shown that there is no “universally good” index for multidimensional queries, but rather a variety of techniques, each of which may perform well for specific data types and query forms but be inappropriate for others. Hence, an algorithm of index selection for data warehouses should determine not only the several sets of attributes to be indexed, but also the index type(s) to be used. The definition of an indexing structure enabling a good trade-off in the several cases of interest is also an interesting issue for future research.
CONCLUSION In this paper we have discussed the main issues related to the computation of data cubes and aggregate materialized views in a data warehouse environment. First of all, the main features of OLAP queries with respect to conventional OLTP queries have been summarized, particularly the number of records involved, the temporal aspects, the specific indexes. Various techniques for the exact computation of materialized views and data cubes from scratch in both ROLAP and MOLAP environments have been discussed. Finally, the main indexing techniques for OLAP queries and their applicability have been illustrated.
REFERENCES Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J.F., Ramakrishnan, R., & Sarawagi, S. (1996). On the computation of multidimensional aggregates. In International Conference on Very Large Data Bases (VLDB’96) (pp. 506-521). Cabibbo, L., & Torlone, R. (1998). From a procedural to a visual query language for OLAP. In International Conference on Scientific and Statistical Database Management (SSDBM’98) (pp. 74-83). Chan, C.Y., & Ioannidis, Y.E. (1998). Bitmap index design and evaluation. In ACM International Conference on Management of Data (SIGMOD’98) (pp. 355-366). Chan, C.Y., & Ioannidis, Y.E. (1999). An efficient bitmap encoding scheme for selection queries. In ACM Interna-
tional Conference on Management of Data (SIGMOD’99) (pp. 215-226). Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. In ACM International Conference on Management of Data (SIGMOD’01) (pp. 295-306). Chaudhuri, S., & Shim, K. (1995). An overview of costbased optimization of queries with aggregates. Data Engineering Bulletin, 18(3), 3-9. Chaudhuri, S., & Shim, K. (1996). Optimizing queries with aggregate views. In International Conference on Extending Database Technology (EDBT’96) (pp. 167-182).
Ross, K.A., & Srivastava, D. (1997). Fast computation of sparse datacubes. In International Conference on Very Large Data Bases (VLDB’97) (pp. 116-125). Vaisman A.A., & Mendelzon A.O. (2001). A temporal query language for OLAP: Implementation and a case study. In 8th International Workshop on Database Programming Languages (DBPL 2001) (pp. 78-96). Wong, H.K.T., Li, J., Olken, F., Rotem, D., & Wong, L. (1986). Bit transposition for very large scientific and statistical databases. Algorithmica, 1(3), 289-309.
Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-170.
Wu, Y., Agrawal, D., & El Abbadi, A. (2000). Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In Conference on Information and Knowledge Management (CIKM’00) (pp. 414-421).
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE’96) (pp. 152-159).
Zhao, Y., Deshpande, P., & Naughton, J.F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. In ACM International Conference on Management of Data (SIGMOD’97) (pp. 159-170).
Gupta, A., Harinarayan, V., & Quass, D. (1995). Aggregate-query processing in data warehousing environments. In International Conference on Very Large Data Bases (VLDB’95) (pp. 358-369).
Jagadish, H.V., Lakshmanan, L.V.S., & Srivastava, D. (1999). What can hierarchies do for data warehouses? In International Conference on Very Large Data Bases (VLDB'99) (pp. 530-541).

Jürgens, M., & Lenz, H.J. (1999). Tree based indexes vs. bitmap indexes: A performance study. In International Workshop on Design and Management of Data Warehouses (DMDW'99). Retrieved from
KEY TERMS
Aggregate Materialized View: A materialized view (see below) in which the results of a query containing aggregations (like count, sum, average, etc.) are stored.

B+-Tree: A particular form of search tree in which the keys used to access data are stored in the leaves. Particularly efficient for key-access to data stored in slow memory devices (e.g., disks).

Data Cube: A collection of aggregate values classified according to several properties of interest (dimensions). Combinations of dimension values are used to identify the single aggregate values in the data cube.

Dimension: A property of the data used to classify it and navigate the corresponding data cube. In multidimensional databases dimensions are often organized into several hierarchical levels; for example, a time dimension may be organized into days, months and years.

Drill-Down (Roll-Up): Typical OLAP operation, by which aggregate data are visualized at a finer (coarser) level of detail along one or more analysis dimensions.

Fact (Multidimensional Datum): A single elementary datum in an OLAP system, the properties of which correspond to dimensions and measures.

Fact Table: A table of (integrated) elementary data grouped and aggregated in the multidimensional querying process.
Materialized View: A particular form of query whose answer is stored in the database to accelerate the evaluation of further queries. Measure: A numeric value obtained by applying an aggregate function (such as count, sum, min, max or average) to groups of data in a fact table.
Multidimensional Query: A query on a collection of multidimensional data, which produces a collection of measures classified according to some specified dimensions. OLAP System: A particular form of information system specifically designed for processing, managing and reporting multidimensional data.
Embedding Bayesian Networks in Sensor Grids
Juan E. Vargas, University of South Carolina, USA
INTRODUCTION
In their simplest form, sensors are transducers that convert physical phenomena into electrical signals. By combining recent innovations in wireless technology, distributed computing, and transducer design, grids of sensors equipped with wireless communication can monitor large geographical areas. However, just getting the data is not enough. In order to react intelligently to the dynamics of the physical world, advances at the lower end of the computing spectrum are needed to endow sensor grids with some degree of intelligence at the sensor and the network levels. Integrating sensory data into representations conducive to intelligent decision making requires significant effort. By discovering relationships between seemingly unrelated data, efficient knowledge representations, known as Bayesian networks, can be constructed to endow sensor grids with the needed intelligence to support decision making under conditions of uncertainty. Because sensors have limited computational capabilities, methods are needed to reduce the complexity involved in Bayesian network inference. This paper discusses methods that simplify the calculation of probabilities in Bayesian networks and perform probabilistic inference with such a small footprint that the algorithms can be encoded in small computing devices, such as those used in wireless sensors and in personal digital assistants (PDAs).
BACKGROUND Recent innovations in wireless development, distributed computing, and sensing design have resulted in energyefficient sensor architectures with some computing capabilities. By spreading a number of smart sensors across a geographical area, wireless ad-hoc sensor grids can be configured as complex monitoring systems that can react intelligently to changes in the physical world. Wireless sensor architectures such as the University of California Berkeley’s Motes (Pister, Kahn, & Boser, 1999) are being increasingly used in a variety of fields to 1.
Monitor information about enemy movements, explosions, and other phenomena of interest (Vargas & Wu, 2003)
2. Monitor chemical, biological, radiological, nuclear, and explosive (CBRNE) attacks and materials
3. Monitor environmental changes in forests, oceans, and so forth
4. Monitor vehicle traffic
5. Provide security in shopping malls, parking garages, and other facilities
6. Monitor parking lots to determine which spots are occupied and which are free
Typically, sensor networks produce vast amounts of data, which may arrive at any time and may contain noise, missing values, or other types of uncertainty. Therefore, a theoretically sound framework is needed to construct virtual worlds populated by decision-making agents that may receive data from sensing agents or other decisionmaking entities. Specific aspects of the real world could be naturally distributed and mapped to devices acting as data collectors, data integrators, data analysts, effectors, and so forth. The data collectors would be devices equipped with sensors (optical, acoustical, radars, etc.) to monitor the environment or to provide domain-specific variables relevant to the overall decision-making process. The data analysts would be engaged in low-level data filtering or in high-level decision making. In all cases, a single principle should integrate these activities. Bayesian theory is the preferred methodology, because it offers a good balance between the need for separation at the data source level and the integrative needs at the analysis level (Stone, Lawrence, Barlow, & Corwin, 1999). Another major advantage of Bayesian theory is that data from different measurement spaces can be fused into a rich, unified, representation that supports inference under conditions of uncertainty and incompleteness. Bayesian theory offers the following additional advantages: •
• Robust operational behavior: Multisensor data fusion has an increased robustness when compared to single-sensor data fusion. When one sensor becomes unavailable or is inoperative, other sensors can provide information about the environment.
• Extended spatial and temporal coverage: Some parts of the environment may not be accessible to some sensors due to range limitations. This occurs especially when the environment being monitored is
vast. In such scenarios, multiple sensors that are mounted at different locations can maximize the regions of scanning. Multisensor data fusion provides increased temporal coverage, as some sensors can provide information when others cannot.
• Increased confidence: Single target location can be confirmed by more than one sensor, which increases the confidence in target detection.
• Reduced ambiguity: Joint information from multiple sensors can reduce the set of beliefs about data.
• Decreased costs: Multiple, inexpensive sensors can replace expensive single-sensor architectures at a significant reduction of cost.
• Improved detection: Integrating measurements from multiple sensors can reduce the signal-to-noise ratio, which ensures improved detection.
Bayesian-based or entropy-based algorithms can then be used to construct efficient data structures, known as Bayesian networks, to represent the relations and the uncertainties in the domain (Pearl, 1988). After the Bayesian networks are created, they can act as hyperdimensional knowledge representations that can be used for probabilistic inference. In situations when data is not as rich, the knowledge representations can still be created from statements of causality and independence formulated by expert opinions. Under this framework, activities such as vehicle control, maneuvering, and scheduling could be planned, and the effectiveness of those plans could be evaluated online as the actions of the plans are executed. To illustrate these ideas, consider a domain composed of threats, assets, and grids of sensors. Although unmanned vehicles loaded with sensors might be able to detect potential targets and provide data to guide the distribution of assets, integrating and transforming those data into meaningful information that is amenable for intelligent decisions is very demanding. The data must be filtered, the relationships between seemingly unrelated data sets must be determined, and knowledge representations must be created to support wise and timely decisions, because conditions of uncertain and incomplete information are the norm, not the exception. Therefore, a solution is to endow sensors with embedded local and global intelligence, as shown in Figure 1. In the figure, friendly airplanes and tanks, in red, use Bayesian networks (BNs) to make decisions. The BNs are illustrated as red boxes containing graphs. Each node in the graphs corresponds to a variable in the domain. The data for each variable may come from sensors spread in the battlefield. The red nodes are variables related to the friendly resources, and the blue variables to the enemy resources. The dotted red arrows connecting the BNs represent wireless communication between the BNs.
Figure 1. A domain scenario
The goal is to recognize situations locally and globally, identify the available options, and make global and local decisions quickly in order to reduce or eliminate the threats and optimize the use of assets. The task is difficult due to the dynamics and uncertainties in the domain. Threats may change in many ways, targets may move, enemy forces may identify the sensing capabilities and eliminate them, and so forth. In most cases, sensing information will contain noise and most likely will be inaccurate, unreliable, and uncertain. These constraints suggest a distributed, bottom-up approach to match the natural dynamics and uncertainties of the problem. Thus, at the core of this problem, a theoretical framework that effectively balances local and global conditions is needed. Distributed Bayesian networks offer that balance (Xiang, 2002; Valtorta, Kim, & Vomlel, 2002).
MAIN THRUST Embedding Bayesian Networks in Sensor Grids Recent advances in the theory of Bayesian network inference (Darwiche, 2003; Castillo, Gutierrez, & Hadi, 1996; Utete, 1998) have resulted in algorithms that can perform probabilistic inference on very small-scale computing devices that are comparable to commercially available PDAs. The algorithms can encode, in real-time, families of polynomial equations representing queries of the type p(e|h) involving sets of variables local to the device and its neighbors. Using the knowledge representations locally encoded into these devices, larger, distributed systems can be interconnected. The devices can assess their local conditions given local observations and engage with other devices in the system to gain better understanding of the global situation, to obtain more assets, or to convey
information needed by other devices engaged in larger scale tactical decisions. The devices can maintain models of the local parameters surrounding them, making them situationally aware and enabling them to participate as cells of a larger decision-making process. Sensory information might be redirected towards a device with no access to that sensor’s information. For example, an airplane might have full access to information about the topology of a certain area, but another vehicle may have access to this information only if it can communicate with the airplane. Global information known by groups of devices can be sent to the relevant devices such that whenever a device gets new input data, the findings are used to determine locally the most probable states for each of the variables within the devices and propagate those determinations to the neighboring devices. The situation is illustrated in Figure 2, in which the variables x3, x4, and x6 are shared by more than one device. The direction of the dotted red arrows indicate the direction of the flow of evidence.
Figure 2. A collection of three Bayesian networks

Symbolic Propagation of Evidence

Methods for exact and approximate propagation of evidence in Bayesian network models are discussed in other sections of this encyclopedia. A common requirement for both types of methods is that all the parameters of the joint probability distribution (JPD) must be known prior to propagation. However, complete specifications might not always be available. This may happen when the number of observations in some combinations of variables is not sufficient to support exact quantifications of conditional probabilities, or in cases when domain experts may only know ranges but may not know the exact values of the parameters. In such cases, symbolic propagation can be used to obtain the values of the unknown parameters to propagate and compute the probabilities of all the relevant variables in a query (Zhaoyu & D'Ambrosio, 1994; Castillo, Gutierrez, & Hadi, 1995; Castillo et al., 1996). Symbolic propagation of evidence consists of asserting polynomials involving the known and missing parameters and the available evidence. After those polynomials are found, customized code generators can produce code to solve the expressions in real time, obtain the exact values of the parameters, and complete the propagation of evidence.

In general, a node Xi having a conditional probability p(xi | πi) can be expressed as a parametric family of the form

θijπ = p(Xi = j | Πi = π),  j ∈ {0, ..., ri}    (1)

where i refers to the node number, j refers to the state of the node, and π is any instantiation of the parents Πi of Xi. Assume a JPD given on the binary variables {x1, x2, x3, x4} as

p(x1, x2, x3, x4) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2, x3)    (2)

The complete set of parameters for the binary variables of a Bayesian network compatible with the JPD is given in Table 1.

Table 1. Complete set of parameters for the binary variables of a Bayesian network compatible with the JPD of Equation 2

X1: θ10 = p(X1 = 0); θ11 = 1 − θ10
X2: θ200 = p(X2 = 0 | X1 = 0), θ201 = p(X2 = 0 | X1 = 1); θ210 = 1 − θ200, θ211 = 1 − θ201
X3: θ300 = p(X3 = 0 | X1 = 0), θ301 = p(X3 = 0 | X1 = 1); θ310 = 1 − θ300, θ311 = 1 − θ301
X4: θ4000 = p(X4 = 0 | X2 = 0, X3 = 0), θ4001 = p(X4 = 0 | X2 = 0, X3 = 1), θ4010 = p(X4 = 0 | X2 = 1, X3 = 0), θ4011 = p(X4 = 0 | X2 = 1, X3 = 1); θ4100 = 1 − θ4000, θ4101 = 1 − θ4001, θ4110 = 1 − θ4010, θ4111 = 1 − θ4011
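To make the parameterization concrete, the following Python sketch (an illustration added here, not code from the article) stores the Table 1 parameters as conditional probability tables for the network of Equation 2 and evaluates the joint probability of a complete instantiation; the numeric values assigned to the θ parameters are arbitrary placeholders.

```python
# Illustrative sketch: CPTs for the binary network of Equation 2.
# Numeric values of the theta parameters are arbitrary placeholders.

theta = {
    # theta_1j = p(X1 = j)
    "x1": {0: 0.3, 1: 0.7},
    # theta_2j_pi = p(X2 = j | X1 = pi), keyed by (j, pi)
    "x2": {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.8},
    # theta_3j_pi = p(X3 = j | X1 = pi), keyed by (j, pi)
    "x3": {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1},
    # theta_4j_(pi2, pi3) = p(X4 = j | X2 = pi2, X3 = pi3), keyed by (j, pi2, pi3)
    "x4": {(0, 0, 0): 0.7, (1, 0, 0): 0.3,
           (0, 0, 1): 0.4, (1, 0, 1): 0.6,
           (0, 1, 0): 0.1, (1, 1, 0): 0.9,
           (0, 1, 1): 0.5, (1, 1, 1): 0.5},
}

def joint(x1, x2, x3, x4):
    """Joint probability of Equation 2: p(x1) p(x2|x1) p(x3|x1) p(x4|x2, x3)."""
    return (theta["x1"][x1]
            * theta["x2"][(x2, x1)]
            * theta["x3"][(x3, x1)]
            * theta["x4"][(x4, x2, x3)])

# The joint distribution must sum to one over all 16 instantiations.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(round(total, 10))  # -> 1.0
```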
In general, because Σ_{j=0..ri} θijs = 1, the parameters for categorical variables can be obtained from

θiks = 1 − Σ_{j=0..ri, j≠k} θijs

Knowing the JPD in Equation 2 and the parameters in Table 1, the probability of any set of nodes can be calculated by extracting the marginal probabilities out of the JPD. Keeping with the example, I obtain p(x2 = 1 | x3 = 0) as follows:

p(x2 = 1 | x3 = 0) = [ Σ_{x1, x4} p(x1, 1, 0, x4) ] / [ Σ_{x1, x2, x4} p(x1, x2, 0, x4) ]    (3)

= (θ10θ300 − θ10θ200θ300 + θ301 − θ10θ301 − θ201θ301 + θ10θ201θ301) / (θ10θ300 + θ301 − θ10θ301)    (4)
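As a quick numerical check of Equations 3 and 4, the self-contained sketch below (with arbitrary placeholder values for θ10, θ200, θ201, θ300, and θ301) computes p(x2 = 1 | x3 = 0) both by marginalizing the JPD and by evaluating the closed-form polynomial ratio of Equation 4; the two results coincide.

```python
# Illustrative check of Equations 3 and 4 with arbitrary placeholder parameters.
# theta10 = p(X1=0); theta200/theta201 = p(X2=0|X1=0/1); theta300/theta301 = p(X3=0|X1=0/1).
theta10, theta200, theta201, theta300, theta301 = 0.3, 0.6, 0.2, 0.5, 0.9

def p_x1(x1):                 # p(X1 = x1)
    return theta10 if x1 == 0 else 1.0 - theta10

def p_x2_given_x1(x2, x1):    # p(X2 = x2 | X1 = x1)
    t = theta200 if x1 == 0 else theta201
    return t if x2 == 0 else 1.0 - t

def p_x3_given_x1(x3, x1):    # p(X3 = x3 | X1 = x1)
    t = theta300 if x1 == 0 else theta301
    return t if x3 == 0 else 1.0 - t

# Equation 3: marginalize the JPD (the x4 factor sums to one and drops out).
num = sum(p_x1(x1) * p_x2_given_x1(1, x1) * p_x3_given_x1(0, x1) for x1 in (0, 1))
den = sum(p_x1(x1) * p_x2_given_x1(x2, x1) * p_x3_given_x1(0, x1)
          for x1 in (0, 1) for x2 in (0, 1))
by_marginalization = num / den

# Equation 4: the same quantity as a ratio of polynomials in the theta parameters.
numerator = (theta10 * theta300 - theta10 * theta200 * theta300
             + theta301 - theta10 * theta301
             - theta201 * theta301 + theta10 * theta201 * theta301)
denominator = theta10 * theta300 + theta301 - theta10 * theta301
by_polynomial = numerator / denominator

print(by_marginalization, by_polynomial)  # the two values coincide
```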
As you can see in Equation 4, if all the parameters are known, then the expression can be solved by computing nine multiplications, three additions, four subtractions, and one division. These operations can be performed in real time by using simple computing devices, such as those available to PDAs or most modern wireless sensor architectures. If, on the other hand, some of the parameters are not available but the topology of the graph is known, then the use of clustering algorithms for exact inference can produce additional polynomial expressions that solve for the unknown parameters. In this case, the entire inference process consists of the following steps:

1. From the topology of the graph, obtain a junction tree, using clustering methods for exact propagation (Lauritzen & Spiegelhalter, 1988; Jensen, 2001; Olesen, Lauritzen, & Jensen, 1992).
2. Use the junction tree to generate the polynomial expressions that solve for the unknown parameters of conditional independence.
3. If the inference process involves goal-oriented queries, obtain expressions for each parameter, as in Equations 3 and 4, involving the parameters and the query. At this point, all the parameters are known, either numerically or symbolically.
4. Use a code generator to configure the equations resulting from steps 2 and 3.
5. Compute the probabilities by executing the generated expressions.
6. Complete the process by propagating the results to the rest of the network.
Castillo et al. (1995, 1996) present methods that take advantage of the polynomial structure of conditional probabilities. Their procedure consists of asserting polynomials involving known and unknown parameters as sets of parametric equations, which are solved symbolically by using computer packages such as Mathematica or Maple. Having such symbolic packages available to embedded systems is not a reasonable expectation. Nonetheless, two important results can be derived from that work: (a) Any parameter can be expressed as a first-degree polynomial and (b) the combined probability of any instantiation of a set of variables in a network can be expressed as a polynomial of degree less than or equal to the number of variables. Taking advantage of the parametric structure of the marginal and conditional probabilities is very valuable; the parametric functions can reduce the complexity in the calculation and propagation of probabilities and in turn makes it possible to have grids of sensing devices endowed with Bayesian intelligence. Symbolic propagation provides the theoretical framework for a distributed architecture that can be reconfigured in real-time as a grid of smart sensors, each performing calculations on variables only relevant to the query and the local reasoner.
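As an illustration of this kind of symbolic manipulation, the sketch below uses the SymPy library as a stand-in for the symbolic packages mentioned above: it derives the query p(x2 = 1 | x3 = 0) for the running example as a ratio of polynomials in the unknown θ parameters and then turns the expression into a plain callable, mimicking the code-generation step. This is a hypothetical workstation-side illustration, not the authors' embedded implementation; on the sensor itself only the generated arithmetic would need to run.

```python
# Illustrative sketch of symbolic derivation plus code generation (SymPy assumed available).
import sympy as sp

# Unknown network parameters for the example of Equation 2.
t10, t200, t201, t300, t301 = sp.symbols("theta10 theta200 theta201 theta300 theta301")

def p_x1(x1):
    return t10 if x1 == 0 else 1 - t10

def p_x2(x2, x1):
    t = t200 if x1 == 0 else t201
    return t if x2 == 0 else 1 - t

def p_x3(x3, x1):
    t = t300 if x1 == 0 else t301
    return t if x3 == 0 else 1 - t

# Symbolic form of the query p(X2 = 1 | X3 = 0), as in Equations 3 and 4.
num = sum(p_x1(x1) * p_x2(1, x1) * p_x3(0, x1) for x1 in (0, 1))
den = sum(p_x1(x1) * p_x2(x2, x1) * p_x3(0, x1) for x1 in (0, 1) for x2 in (0, 1))
query = sp.simplify(num / den)
print(query)  # a ratio of polynomials of low degree in the unknown parameters

# "Code generation" step: turn the symbolic expression into a plain callable
# that a small device could evaluate once parameter values become available.
evaluate = sp.lambdify((t10, t200, t201, t300, t301), query)
print(evaluate(0.3, 0.6, 0.2, 0.5, 0.9))
```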
FUTURE TRENDS Conservative estimates compare the impact of wireless sensors in the next decade to the microprocessor revolution of the 1980s. Several million efforts are already on their way: Wal-Mart announced in 2003 that it would require its main suppliers to use RFID tags on all pallets and cartons of goods, with an estimated savings of over $8 billion a year (Goode, 2003). It is expected that the future of the automobile and aircraft industries will heavily depend on sensors (Goode, 2004). Innovations in transducers and architectures are making worldwide news everyday, as industrial giants such as Intel, Microsoft (“Microsoft launches,” 2002), and IBM enter the competition with their own products (Goode, 2004). The capability of sensors will continue to increase, and their ability to process information locally will improve. However, despite the progress at the lower level, the fundamental issues related to the fusion of data and its analysis at the higher levels are not likely to undergo dramatic improvements. Bayesian theory, and in particular Bayesian networks, still offer the best framework for data fusion, knowledge discovery, and probabilistic inference. Until
now, the main barriers to the use of Bayesian networks in sensor grids have been the computational complexity of the inference process, the lack of efficient methods for learning the graph of a network, and the difficulty of adapting to the dynamics of the domain. The latter two, however, are problems that permeate the field of Bayesian networks and constitute fertile ground for future research.
CONCLUSION This paper outlines recent work aimed at endowing sensor grids with local and global intelligence. I describe the advantages of knowledge representations known as Bayesian networks and argue that Bayesian networks offer an excellent theoretical foundation to accomplish this goal. The power of Bayesian networks lends them to decomposing a problem into a set of smaller, distributed problem solvers. Each variable in a network could be associated to a different sensing agent represented at the local level as a Bayesian network. The sensing agents could decide if and when to send observations to the other agents. As information flows between the agents, they in turn could decide to send their own observations, their local inferences, or request local or remote agents for more information, given the available evidence. Another advantage of this approach is that it is scalable on demand; more agents can be added as the problem gets bigger. Recovery from loss of parts of the system is possible by introducing a set of redundant observations, each of which can be part of a local solver agent.
REFERENCES

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1995). Symbolic propagation in discrete and continuous Bayesian networks. In V. Keranen & P. Mitic (Eds.), Proceedings of the First International Mathematica Symposium: Vol. Mathematics with vision (pp. 77-84). Southampton, UK: Computational Mechanics Publications.

Castillo, E., Gutierrez, J. M., & Hadi, A. S. (1996). A new method for efficient symbolic propagation in discrete Bayesian networks. Networks, 28, 31-43.

Darwiche, A. (2003). Revisiting the problem of belief revision with uncertain evidence. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

Goode, B. (2003, September). Having wonder full time. Sensors Magazine.
Goode, B. (2004, August). Keys to keynotes. Sensors Magazine.

Jensen, F. V. (2001). Bayesian networks and decision graphs. New York: Springer.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50, 157-224.

Microsoft launches smart personal object technology initiative. (2002, November 17). Retrieved from http://www.microsoft.com/presspass/features/2002/nov02/1117SPOT.asp

Olesen, K. G., Lauritzen, S. L., & Jensen, F. V. (1992). Hugin: A system creating adaptive causal probabilistic networks. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence (pp. 223-229), USA.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Pister, K. S. J., Kahn, J. M., & Boser, B. E. (1999). Smart dust: Wireless networks of millimeter-scale sensor nodes. Highlight Article in Electronics Research Laboratory Research Summary. Retrieved from http://www.xbow.com/Products/Wireless_Sensor_Networks.htm

Sensors Magazine. (Ed.). (2004, August). Best of sensors expo awards [Special issue]. Sensors Magazine.

Stone, C. A., Lawrence, D., Barlow, C. A., & Corwin, T. L. (1999). Bayesian multiple target tracking. Boston: Artech House.

Utete, S. W. (1998). Local information processing for decision making in decentralised sensing networks. Proceedings of the 11th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, IEA/AIE-667-676, Castellon, Spain.

Valtorta, M., Kim, Y. G., & Vomlel, J. (2002). Soft evidential update for probabilistic multiagent systems. International Journal of Approximate Reasoning, 29(1), 71-106.

Vargas, J. E., Tvarlapati, K., & Wu, Z. (2003). Target tracking with Bayesian estimation. In V. Lesser, C. Ortiz, & M. Tambe (Eds.), Distributed sensor networks. Kluwer Academic Press.

Vargas, J. E., & Wu, Z. (2003). Real-time multiple-target tracking using networked wireless sensors. Proceedings of the Second Conference on Autonomous Intelligent Networks and Systems, Palo Alto, CA.
Xiang, Y. (2002). Probabilistic reasoning in multiagent systems: A graphical models approach. Cambridge, MA: Cambridge University Press. Zhaoyu, L., & D’Ambrosio, B. (1994). Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning, 11, 55-81.
KEY TERMS

Bayesian Network: A directed acyclic graph (DAG) that encodes the probabilistic dependencies between the variables within a domain and is consistent with a joint probability distribution (JPD) for that domain. For example, a domain with variables {A, B, C, D}, in which the variables B and C depend on A and the variable D depends on C and B, would have the following JPD: P(A,B,C,D) = p(A)p(B|A)p(C|A)p(D|B,C), and the corresponding graph has arcs from A to B, from A to C, from B to D, and from C to D.
Joint Probability Distribution: A function that encodes the probabilistic dependencies among a set of variables in a domain. Junction Tree: A representation that captures the joint probability distribution of a set of variables in a very
efficient data structure. A junction tree contains cliques, each of which is a set of variables from the domain. The junction tree is configured to maintain the probabilistic dependencies of the domain variables and provides a data structure over queries of the type "What is the most probable value of variable D given that the values of variables A, B, etc., are known?"

Knowledge Discovery: The process by which new pieces of information can be revealed from a set of data. For example, given a set of data for variables {A,B,C,D}, a knowledge discovery process could discover unknown probabilistic dependencies among variables in the domain, using measures such as Kullback's mutual information, which, for the discrete case, is given by the formula

I(A : B) = Σ_i Σ_j p(a_i, b_j) log [ p(a_i, b_j) / (p(a_i) p(b_j)) ]
Smart Sensors: Transducers that convert some physical parameters into an electrical signal and are equipped with some level of computing power for signal processing. Symbolic Propagation in Bayesian Networks: A method that reduces the number of operations required to compute a query in a Bayesian network by factorizing the order of operations in the joint probability distribution. Wireless Ad-Hoc Sensor Network: A number of sensors spread across a geographical area. Each sensor has wireless communication capability and some computing capabilities for signal processing and data networking.
Employing Neural Networks in Data Mining
Mohamed Salah Hamdi, UAE University, UAE
INTRODUCTION Data-mining technology delivers two key benefits: (i) a descriptive function, enabling enterprises, regardless of industry or size, in the context of defined business objectives, to automatically explore, visualize, and understand their data and to identify patterns, relationships, and dependencies that impact business outcomes (i.e., revenue growth, profit improvement, cost containment, and risk management); (ii) a predictive function, enabling relationships uncovered and identified through the datamining process to be expressed as business rules or predictive models. These outputs can be communicated in traditional reporting formats (i.e., presentations, briefs, electronic information sharing) to guide business planning and strategy. Also, these outputs, expressed as programming code, can be deployed or hard wired into business-operating systems to generate predictions of future outcomes, based on newly generated data, with higher accuracy and certainty. However, there also are barriers to effective largescale data mining. Barriers related to the technical aspects of data mining concern issues such as large datasets, highly complex data, high algorithmic complexity, heavy data management demands, and qualification of results (Musick, Fidelis & Slezak, 1997). Barriers related to attitudes, policies, and resources concern potential dangers such as privacy concerns (Kobsa, 2002). The number of research projects and publications reporting experiences with data mining has been growing steadily. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, data visualization, and Internet computing, have shown great interest in data mining. In this contribution, the focus is on the relationship between artificial neural networks and data mining. We review some of the related literature and report on our own experience in this context.
BACKGROUND Artificial Neural Networks Neural networks are patterned after the biological ganglia and synapses of the nervous system. The essential ele-
ment of the neural network is the neuron. A typical neuron j receives a set of input signals from the other connected neurons, each of which is multiplied by a synaptic weight of wij (weight for connection between neurons i and j). The resulting activation weights are then summed to produce the activation level for the neuron j. Learning is carried out by adjusting the weights in a neural network. Neurons that contribute to the correct answer have their weights strengthened, while other neurons have their weights reduced. Several architectures and error correction algorithms have been developed for neural networks (Haykin, 1999). In general, neural networks can help where an algorithmic solution cannot be formulated, where lots of examples of the required behavior are available, and where picking out the structure from existing data is needed. Neural networks work by feeding in some input variables and producing some output variables. Therefore, they can be used where some known information is available and some unknown information should be inferred.
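The following minimal sketch (illustrative only, not taken from the cited sources) shows the neuron model just described: the inputs of a neuron j are multiplied by the weights wij, summed into an activation level, squashed by a sigmoid, and the weights are then strengthened or weakened according to the error on one training example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# One neuron j with three incoming connections w_ij (arbitrary initial weights).
weights = [0.1, -0.4, 0.25]
bias = 0.0
inputs = [0.8, 0.3, 0.5]
target = 1.0
learning_rate = 0.5

for step in range(20):
    # Activation level: weighted sum of the input signals, then squashing.
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    output = sigmoid(activation)

    # Error correction: adjust each weight proportionally to its contribution
    # to the error (gradient of the squared error for a sigmoid unit).
    delta = (target - output) * output * (1.0 - output)
    weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    bias += learning_rate * delta

print(round(output, 3))  # moves toward the target as the weights are adjusted
```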
Data Mining The data-mining revolution started in the mid-1990s. It was characterized by the incorporation of existing and already well-established tools and algorithms such as machine learning. In 1995, the International Conference on Knowledge Discovery and Data Mining became the most important annual event for data mining. The framework of data mining also was outlined in many books, such as Advances in Knowledge Discovery and Data Mining (Fayyad et al., 1996). Data-mining conferences like ACM SIGKDD, SPIE, PKDD, and SIAM, and journals like Data Mining and Knowledge Discovery Journal (1997), Journal of Knowledge and Information Systems (1999), and IEEE Transactions on Knowledge and Data Engineering (1989) have become an integral part of the data-mining field. The trends in data mining over the last few years include OLAP (Online Analytical Processing), data warehousing, association rules, high performance data-mining systems, visualization techniques, and applications of data mining. Recently, new trends have emerged that have great potential to benefit the data-mining field, like XML (eXtensible Markup Language) and XML-related technologies, database products that incorporate datamining tools, and new developments in the design and
implementation of the data-mining process. Another important data-mining issue is concerned with the relationship between theoretical data-mining research and datamining applications. Data mining is an exponentially growing field with a strong emphasis on applications. A further issue of great importance is the research in data-mining algorithms and the discussion of issues of scale (Hearst, 1997). The commonly used tools may not scale up to huge volumes of data. Scalable data-mining tools are characterized by the linear increase of their runtime with the increase of the number of data points within a fixed amount of available memory. An overview of scalable data-mining tools is given in Ganti, Gehrke, and Ramakrishnan (1999). In addition to scalability, robust techniques to model noisy data sets containing an unknown number of overlapping categories are of great importance (Krishnapuram et al., 2001).
MAIN THRUST Exploiting Neural Networks in Data Mining How is data mining able to tell you important things that you didn’t know or tell you what is going to happen next? The technique that is used to perform these feats is called modeling. Modeling is simply the act of building a model based on data from situations where the answer is known, and then applying the model to other situations where the answers aren’t known. Modeling techniques have been around for centuries, but it is only recently that data storage and communication capabilities required to collect and to store huge amounts of data and the computational power to automate modeling techniques to work directly on the data have been available. Modeling techniques used for data mining include decision trees, rule induction, genetic algorithms, nearest neighbor, artificial neural networks, and many other techniques (Chen, Han & Yu, 1996; Cios, Pedrycz & Swiniarski, 1998; Hand, Mannila & Smyth, 2000). Exploiting artificial neural networks as a modeling technique for data mining is considered to be an important direction of research. Neural networks can be applied to a number of data-mining problems, including classification, regression, and clustering, and there are quite a few interesting developments and tools that are being developed in this field. Lu, Setiono, and Liu (1996) applied neural networks to mine symbolic classification rules from large databases. They report that neural networks were able to deliver a lower error rate and are more robust against noise than decision trees. Ainslie and Drèze
(1996) show how effective data mining can be achieved by combining the power of neural networks with the rigor of more traditional statistical tools. They argue that this alliance can generate important synergies. Craven and Shavlik (1997) describe neural network learning algorithms for data mining that are able to produce comprehensible models and that do not require excessive training times. They argue that neural network methods deserve a place in the toolboxes of data-mining specialists. Mitra, Pal, and Mitra (2002) provide a survey of the available literature on data mining using soft computing methodologies, including neural networks. They came to the conclusion that neural networks are suitable in data-rich environments and are typically used for extracting embedded knowledge in the form of rules, quantitative evaluation of these rules, clustering, self-organization, classification, and regression. Vesely (2003) argues that from the methods of data mining based on neural networks, the Kohonen’s self-organizing maps are the most promising, because, by using a self-organizing map, one can more easily visualize high-dimensional data. Self-organizing maps also outperform other conventional methods such as the popular Principal Component Analysis (PCA) method for screening analysis of high-dimensional data. Although highly successful in typical cases, PCA suffers from the drawback of being a linear method. Furthermore, real-world data manifolds, besides being nonlinear, often are corrupted by noise and embed into high-dimensional spaces. Self-organizing maps are more robust against noise and are often used to provide representations that can be analyzed successfully using conventional methods like PCA. In spite of their excellent performance in concept discovery, neural networks do suffer from some shortcomings. They are sensitive to the net topology, initial weights, and the selection of attributes. If the number of layers is not selected suitably, the learning efficiency will be affected. Too many irrelevant nodes can cause unnecessary computational expense and overfit (i.e., the network creates meaningless concepts); randomly selected initial weights sometimes can trap the nets in so-called pitfalls; that is, neural nets stabilize around local minima instead of the global minimum. Background knowledge remains unused in neural nets. The knowledge discovered by nets is not transparent to users. This is perhaps the main failing of neural networks, as they are unintelligible black boxes. Our own work is focused on mining educational data to assist e-learning in a variety of ways. In the following section, we report on the experience with MASACAD (Multi-Agent System for ACademic ADvising), a datamining, multi-agent system that advises students using neural networks.
Case Study: Academic Advising of Students At the UAE University, there is enormous interest in the area of online education. Rigorous steps are being taken toward the creation of the technological infrastructure and the academic infrastructure for the improvement of teaching and learning. MASACAD, the academic advising system described in the following, is to be understood as a tool that mines educational data to support learning. The general goal of academic advising is to assist students in developing educational plans that are consistent with academic, career, and life goals, and to provide students with information and skills needed to pursue those goals. In order to improve the advising process and make it easier, an intelligent assistant in the form of a computer program will be of great interest. The goal of academic advising, as stated previously, is too general, because many experts are involved and a huge amount of expertise is needed. This makes the realization of such an assistant too difficult, if not impossible. Therefore, in the implemented system, the scope of academic advising was restricted. It was understood as just being intended to provide the student with an opportunity to plan programs of study, select appropriate required and elective classes, and schedule classes in a way that provides the greatest potential for academic success.
Data Mined by the Advising System MASACAD To extract the academic advice (i.e., to provide the student with a set of appropriate courses he or she should register for in the coming term), MASACAD has to mine a huge amount of educational data available in different formats. The data contain the student profile, which includes the courses already attended, the corresponding grades, the desires of the student concerning the courses to be attended, and much other information. The part of the profile consisting of the courses already attended, the corresponding grades, and so forth, is maintained by the university administration in appropriate databases. The part of the profile consisting of the desires of the student concerning the courses to be attended should be asked for from the student before advising is performed. The data to be mined also include the courses that are offered in the semester for which advising is needed. This information is maintained by the university administration in appropriate Web sites. Finally, a very important component of the data that the system has to mine is expertise. For the problem of academic advising, expertise consists partly of the university laws concerning academic advising. These consist of all the details and regulations concerning courses,
programs, and curricula. This kind of information is published in Web pages, in booklets, and in many other forms such as printouts and announcements. The sources of knowledge are many; however, the primary source will be a human expert, who should possess more complex knowledge than can be found in documented sources.
The Advising System MASACAD MASACAD is a multi-agent system that offers academic advice to students by mining the educational data described in the previous section. It consists of a user system, a grading system, a course announcement system, and a mediation agent. The mediation agent provides the information retrieving service. It moves from the site of an application to another, where it interacts with the agent wrappers. The agent wrappers manage the states of the applications they are wrapped around, invoking them when necessary. The application grading system is a database application for answering queries about the students and the courses they have already taken. The application course announcement system is a Web application for answering queries about the courses that are expected to be offered in the semester for which advising is needed. The application user system is the heart of the advising system, and it is here where intelligence resides. The application gives students the opportunity to express their desires concerning the courses to be attended by choosing among the courses that are offered, initiating a query to obtain advice, and, finally, seeing the results returned by the advising system. The system also alerts the user automatically via e-mail when something changes in the offered courses or in the student profile. The advising procedure suggests courses according to university laws, in a way that provides the greatest potential for academic success, as seen by a human academic advisor. Taking into account the adequacy of the machinelearning approach for data mining, added to the availability of experience with advising students, made the adoption of a paradigm of supervised learning from examples using artificial neural networks interesting. For academic advising, the known information (input variables) consists of the profile of the student and of the offered courses. The unknown information (output variables) consists of the advice expected by the student. In order for the network to be able to infer the unknown information, prior training is needed. Training will integrate the expertise in academic advising into the network. The back-propagation algorithm was used for training the neural network. Information (i.e., training examples) is gained from information about students and courses they really took in previous semesters. The selection of these courses 435
-
Employing Neural Networks in Data Mining
was made, based on the advice of human experts specializing in academic advising. About 250 computer science students in different stages of study were available for the learning procedures. Each one of the 250 examples consisted of a pair of input-output vectors. The input vector summarized all the information needed for advising a particular student (85 real-valued components; each component encodes the information about one of the 85 courses of the curriculum). The output vector encodes the final decision concerning the courses in which the student actually enrolled, based on the advice of the human academic advisor (85 integer-valued components; each component represents a priority value for one of the 85 courses of the curriculum, and a higher priority value indicates a more appropriate course for the student). The aim of the learning phase was to determine the most suitable values for the learning rate, the size of the network (number of neurons, number of hidden layers was set to 2), and the number of training cycles that are needed for the convergence of the network. Many experiments were conducted to obtain these parameters and to test the system. With a network topology of 85-100-100-85 and systematically selected network parameters (50 different experiments chosen carefully were performed to obtain these parameters), the layered, fully connected backpropagation network was able to deliver a considerable performance. Fifty students participated in the evaluation of the system. In 92% of the cases, the network was able to produce very appropriate advice according to human experts in academic advising. In the remaining 8% of the cases (4 cases), some unsatisfactory course suggestions were produced by the network (Hamdi, 2004).
FUTURE TRENDS The rapid growth of business, industrial, and educational data sources has overwhelmed the traditional, interactive approaches to data analysis and created a need for a new generation of tools for intelligent and automated discovery in data. Neural networks are well suited for datamining tasks, due to their ability to model complex, multidimensional data. As data availability has magnified, so has the dimensionality of problems to be solved, thus limiting many traditional techniques, such as manual examination of the data and some statistical methods. Although there are many techniques and algorithms that can be used for data mining, some of which can be used effectively in combination, neural networks offer many desirable qualities, such as the automatic search of all possible interrelationships among key factors, the automatic modeling of complex problems without prior knowledge of the level of complexity, and the ability to extract
436
key findings much faster than many other tools. As computer systems become faster, the value of neural networks as a data-mining tool only will increase.
CONCLUSION The amount of raw data stored in databases and the Web is exploding. Raw data by itself, however, does not provide much information. One benefits when meaningful trends and patterns are extracted from the data. Datamining techniques help to recognize significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed. In this contribution, we have seen an example of how neural networks can be used to help mine data about students and courses with the aim of developing educational plans that are consistent with academic, career, and life goals, and providing students with information and skills needed to pursue those goals. The neural network paradigm seems interesting and viable enough to be used as a data-mining tool.
REFERENCES Ainslie, A., & Drèze, X. (1996). Data mining: Using neural networks as a benchmark for model building. Decision Marketing, 7, 77-86. Chen, M.S., Han, J., & Yu, P.S. (1996). Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866-883. Cios, K.J., Pedrycz, W., & Swiniarski, R.W. (1998). Data mining methods for knowledge discovery. Norwell, MA: Kluwer Academic Publishers. Craven, M.W., & Shavlik, J.W. (1997). Using neural networks for data mining. Future Generation Computer Systems, 13(2-3), 211-229. Fayyad, U.M., Piatesky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Menlo Park, CA: MIT Press. Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). Mining very large databases. IEEE Computer, 32(8), 38-45. Hamdi, M.S. (2004). MASACAD: A learning multi-agent system that mines the Web to advise students. Proceedings of the International Conference on Internet Computing (IC’04), Las Vegas, Nevada. Hand, D.J., Mannila, H., & Smyth, P. (2000). Principles of data mining. Cambridge, MA: MIT Press.
Employing Neural Networks in Data Mining
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Hearst, M. (1997). Distinguishing between web data mining and information access. The Internet
KEY TERMS Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another, where distance is measured with respect to specific variable(s) one is trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values good and bad. Supervised classification is when we know the class labels and the number of classes. Clustering: The process of dividing a dataset as in classification, but the distance is now measured with respect to all available variables. Unsupervised classification is when we do not know the class labels and may not know the number of classes. Data Mining: The extraction of hidden predictive information from large databases. Decision Tree: A tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset. Linear Regression: A classic statistical problem is to try to determine the relationship between two random variables X and Y. For example, we might consider height and weight of a sample of adults. Linear regression attempts to explain this relationship with a straight line fit to the data. OLAP: Online analytical processing. Refers to arrayoriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.
Enhancing Web Search through Query Log Mining
Ji-Rong Wen, Microsoft Research Asia, China
INTRODUCTION Web query log is a type of file keeping track of the activities of the users who are utilizing a search engine. Compared to traditional information retrieval setting in which documents are the only information source available, query logs are an additional information source in the Web search setting. Based on query logs, a set of Web mining techniques, such as log-based query clustering, log-based query expansion, collaborative filtering and personalized search, could be employed to improve the performance of Web search.
BACKGROUND Web usage mining is an application of data mining techniques to discovering interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Since the majority of usage data is stored in Web logs, usage mining is usually also referred to as log mining. Web logs can be divided into three categories based on the location of data collecting: server log, client log, and proxy log. Server log provides an aggregate picture of the usage of a service by all users, while client log provides a complete picture of usage of all services by a particular client, with the proxy log being somewhere in the middle (Srivastava, Cooley, Deshpande, & Tan, 2000). Query log mining could be viewed as a special kind of Web usage mining. While there is a lot of work about mining Website navigation logs for site monitoring, site adaptation, performance improvement, personalization and business intelligence, there is relatively little work of mining search engines’ query logs for improving Web search performance. In early years, researchers have proved that relevance feedback can significantly improve retrieval performance if users provide sufficient and correct relevance judgments for queries (Xu & Croft, 2000). However, in real search scenarios, users are usually reluctant to explicitly give their relevance feedback. A large amount of users’ past query sessions have been accumulated in the query logs of search engines. Each query session records a user query and the correspond-
ing pages the user has selected to browse. Therefore, a query log can be viewed as a valuable source containing a large amount of users’ implicit relevance judgments. Obviously, these relevance judgments can be used to more accurately detect users’ query intentions and improve the ranking of search results. One important assumption behind query log mining is that the clicked pages are “relevant” to the query. Although the clicking information is not as accurate as explicit relevance judgment in traditional relevance feedback, the user’s choice does suggest a certain degree of relevance. In the long run with a large amount of log data, query logs can be treated as a reliable resource containing abundant implicit relevance judgments from a statistical point of view.
MAIN THRUST Web Query Log Preprocessing Typically, each record in a Web query log includes the IP address of the client computer, timestamp, the URL of the requested item, the type of Web browser, protocol, etc. The Web log of a search engine records various kinds of user activities, such as submitting queries, clicking URLs in the result list, getting HTML pages and skipping to another result list. Although all these activities reflect, more or less, a user’s intention, the query terms and the Web pages the user visited are the most important data for mining tasks. Therefore, a query session, the basic unit of mining tasks, is defined as a query submitted to a search engine together with the Web pages the user visits in response to the query. Since the HTTP protocol requires a separate connection for every client-server interaction, the activities of multiple users usually interleave with each other. There are no clear boundaries among user query sessions in the logs, which makes it a difficult task to extract individual query sessions from Web query logs (Cooley, Mobasher, & Srivastava, 1999). There are mainly two steps to extract query sessions from query logs: user identification and session identification. User identification is the process of isolating from the logs the activities associated with an
individual user. Activities of the same user could be grouped by their IP addresses, agent types, site topologies, cookies, user IDs, etc. The goal of session identification is to divide the queries and page accesses of each user into individual sessions. Finding the beginning of a query session is trivial: a query session begins when a user submits a query to a search engine. However, it is difficult to determine when a search session ends. The simplest method of achieving this is through a timeout, where if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.
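A minimal sketch of this two-step preprocessing is given below, assuming hypothetical log fields (IP address, agent type, timestamp, requested URL) and a 30-minute inactivity threshold; the exact fields and timeout value vary across search engines.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical log records: (client IP, user agent, timestamp, requested URL).
records = [
    ("1.2.3.4", "Mozilla", datetime(2005, 3, 1, 9, 0), "/search?q=java"),
    ("1.2.3.4", "Mozilla", datetime(2005, 3, 1, 9, 2), "/click?url=java-island.org"),
    ("1.2.3.4", "Mozilla", datetime(2005, 3, 1, 11, 0), "/search?q=data+cubes"),
    ("5.6.7.8", "Opera",   datetime(2005, 3, 1, 9, 1), "/search?q=bayesian+networks"),
]

TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessions(log):
    # Step 1: user identification by (IP address, agent type).
    log = sorted(log, key=lambda r: (r[0], r[1], r[2]))
    for user, hits in groupby(log, key=lambda r: (r[0], r[1])):
        # Step 2: session identification by inactivity timeout.
        current, last_time = [], None
        for ip, agent, when, url in hits:
            if last_time is not None and when - last_time > TIMEOUT:
                yield user, current
                current = []
            current.append((when, url))
            last_time = when
        if current:
            yield user, current

for user, hits in sessions(records):
    print(user, [url for _, url in hits])
```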
Log-Based Query Clustering Query clustering is a technique aiming at grouping users’ semantically (not syntactically) related queries in Web query logs. Query clustering could be applied to FAQ detecting, index-term selection and query reformulation, which are effective ways to improve Web search. First of all, FAQ detecting means to detect Frequently Asked Questions (FAQs), which can be achieved by clustering similar queries in the query logs. A cluster being made up of many queries can be considered as a FAQ. Some search engines (e.g. Askjeeves) prepare and check the correct answers for FAQs by human editors, and a significant majority of users’ queries can be answered precisely in this way. Second, inconsistency between term usages in queries and those in documents is a well-known problem in information retrieval, and the traditional way of directly extracting index terms from documents will not be effective when the user submits queries containing terms different from those in the documents. Query clustering is a promising technique to provide a solution to the word mismatching problem. If similar queries can be recognized and clustered together, the resulting query clusters will be very good sources for selecting additional index terms for documents. For example, if queries such as “atomic bomb”, “Manhattan Project”, “Hiroshima bomb” and “nuclear weapon” are put into a query cluster, this cluster, not the individual terms, can be used as a whole to index documents related to atomic bomb. In this way, any queries contained in the cluster can be linked to these documents. Third, most words in the natural language have inherent ambiguity, which makes it quite difficult for user to formulate queries with appropriate words. Obviously, query clustering could be used to suggest a list of alternative terms for users to reformulate queries and thus better represent their information needs. The key problem underlying query clustering is to determine an adequate similarity function so that truly similar queries can be grouped together. There are mainly two categories of methods to calculate the similarity between queries: one is based on query content, and the
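As a small illustration of session-based similarity, the sketch below treats each query as the set of pages clicked in its sessions and scores two queries by the Jaccard overlap of those sets. This is one plausible choice of similarity function, not the exact measure used by Wen, Nie, and Zhang (2002).

```python
# Illustrative session-based query similarity: two queries are considered related
# when users who issued them clicked overlapping sets of result pages.
sessions = {
    "atomic bomb":       {"en.wikipedia.org/Atomic_bomb", "history.example/manhattan"},
    "manhattan project": {"history.example/manhattan", "energy.example/manhattan-project"},
    "java island":       {"travel.example/java", "wikitravel.example/Java"},
}

def similarity(q1, q2):
    a, b = sessions[q1], sessions[q2]
    return len(a & b) / len(a | b) if a | b else 0.0

print(similarity("atomic bomb", "manhattan project"))  # > 0: candidates for one cluster
print(similarity("atomic bomb", "java island"))        # 0.0: unrelated queries
```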
other on query session. Since queries with the same or similar search intentions may be represented with different words and the average length of Web queries is very short, content-based query clustering usually does not perform well. Using query sessions mined from query logs to cluster queries is proved to be a more promising method (Wen, Nie, & Zhang, 2002). Through query sessions, “query clustering” is extended to “query session clustering”. The basic assumption here is that the activities following a query are relevant to the query and represent, to some extent, the semantic features of the query. The query text and the activities in a query session as a whole can represent the search intention of the user more precisely. Moreover, the ambiguity of some query terms is eliminated in query sessions. For instance, if a user visited a few tourism Websites after submitting a query “Java”, it is reasonable to deduce that the user was searching for information about “Java Island”, not “Java programming language” or “Java coffee”. Moreover, query clustering and document clustering can be combined and reinforced with each other (Beeferman & Berger, 2000).
Log-Based Query Expansion Query expansion involves supplementing the original query with additional words and phrases, which is an effective way to overcome the term-mismatching problem and to improve search performance. Log-based query expansion is a new query expansion method based on query log mining. Taking query sessions in query logs as a bridge between user queries and Web pages, probabilistic correlations between terms in queries and those in pages can then be established. With these term-term correlations, relevant expansion terms can be selected from the documents for a query. For example, a recent work by Cui, Wen, Nie, and Ma (2003) shows that, from query logs, some very good terms, such as “personal computer”, “Apple Computer”, “CEO”, “Macintosh” and “graphical user interface”, can be detected to be tightly correlated to the query “Steve Jobs”, and using these terms to expand the original query can lead to more relevant pages. Experiments by Cui, Wen, Nie, and Ma (2003) show that mining user logs is extremely useful for improving retrieval effectiveness, especially for very short queries on the Web. The log-based query expansion overcomes several difficulties of traditional query expansion methods because a large number of user judgments can be extracted from user logs, while eliminating the step of collecting feedbacks from users for ad-hoc queries. Logbased query expansion methods have three other important properties. First, the term correlations are pre-
computed offline and thus the performance is better than traditional local analysis methods which need to calculate term correlations on the fly. Second, since user logs contain query sessions from different users, the term correlations can reflect the preference of the majority of the users. Third, the term correlations may evolve along with the accumulation of user logs. Hence, the query expansion process can reflect updated user interests at a specific time.
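A minimal sketch of how such term-term correlations might be accumulated offline from query sessions is shown below: it simply counts how often a query term and a document term co-occur in the same session and normalizes the counts. The probabilistic model of Cui et al. (2003) is more elaborate; the function names, the normalization, and the session layout here are assumptions made for illustration only.

```python
from collections import defaultdict

def term_correlations(sessions):
    """sessions: iterable of (query_terms, clicked_doc_terms) pairs.

    Returns an estimate of P(document term | query term) from
    co-occurrence counts in query sessions (a simplified stand-in
    for the full probabilistic model).
    """
    co_counts = defaultdict(lambda: defaultdict(int))
    q_counts = defaultdict(int)
    for query_terms, doc_terms in sessions:
        for qt in query_terms:
            q_counts[qt] += 1
            for dt in doc_terms:
                co_counts[qt][dt] += 1
    return {qt: {dt: c / q_counts[qt] for dt, c in dts.items()}
            for qt, dts in co_counts.items()}

def expand(query_terms, correlations, k=3):
    """Pick the k document terms most correlated with the query terms."""
    scores = defaultdict(float)
    for qt in query_terms:
        for dt, p in correlations.get(qt, {}).items():
            scores[dt] += p
    return sorted(scores, key=scores.get, reverse=True)[:k]

sessions = [({"steve", "jobs"}, {"apple", "macintosh", "ceo"}),
            ({"steve", "jobs"}, {"apple", "computer"})]
corr = term_correlations(sessions)
print(expand({"steve", "jobs"}, corr))
```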
Collaborative Filtering and Personalized Web Search
Collaborative filtering and personalized Web search are two of the most successful examples of personalization on the Web. Both rely heavily on mining query logs to detect users’ preferences and intentions.
Collaborative Filtering

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The basic assumption of collaborative filtering is that users with similar tastes on some items are likely to have similar preferences on other items. From Web logs, a collection of items (books, music CDs, movies, etc.) together with users’ searching, browsing and ranking information can be extracted to train prediction and recommendation models. For a new user with a few items he or she likes or dislikes, other items matching the user’s taste can then be selected based on the trained models and recommended (Shardanand & Maes, 1995; Konstan, Miller, Maltz, Herlocker, Gordon, & Riedl, 1997). Generally, vector space models and probabilistic models are the two major model families for collaborative filtering (Breese, Heckerman, & Kadie, 1998). Collaborative filtering systems have been implemented by several e-commerce sites, including Amazon. Collaborative filtering has two main advantages over content-based recommendation systems. First, content-based annotations of some data types (such as video and audio) are often unavailable, or the available annotations do not match the tastes of different users. Second, through collaborative filtering, the knowledge and experiences of multiple users can be remembered, analyzed and shared, which is especially useful for improving new users’ information-seeking experiences. The original form of collaborative filtering does not use the actual content of the items for recommendation, and it may suffer from the scalability, sparsity and synonymy problems. Most importantly, it cannot overcome the so-called first-rater problem. The first-rater problem
means that objects newly introduced into the system have not been rated by any users and therefore cannot be recommended. Because of the absence of recommendations, users tend not to be interested in these new objects, which in turn leaves the newly added objects unrecommendable. Combining collaborative and content-based filtering is therefore a promising approach to solving the above problems (Baudisch, 1999).
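The sketch below illustrates the memory-based (vector space) flavor of collaborative filtering mentioned above: a user's rating of an unseen item is predicted as a similarity-weighted average of other users' ratings. It is a generic illustration, not the algorithm of any particular cited system, and the rating data are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(r * r for r in u.values())) * \
          math.sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def predict(target, others, item):
    """Predict the target user's rating for item from similar users' ratings."""
    num = den = 0.0
    for other in others:
        if item in other:
            w = cosine(target, other)
            num += w * other[item]
            den += abs(w)
    return num / den if den else None

alice = {"book_a": 5, "cd_b": 4}
others = [{"book_a": 4, "cd_b": 5, "movie_c": 5},
          {"book_a": 2, "movie_c": 1}]
print(predict(alice, others, "movie_c"))
```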
Personalized Web Search

While most modern search engines return identical results for the same query submitted by different users, personalized search aims to return results related to each user’s preferences, which is a promising way to alleviate the growing information overload problem on the Web. The core task of personalization is to obtain the preferences of each individual user, captured in a user profile. User profiles can be provided explicitly by users or learned implicitly from users’ past behavior (Hirsh, Basu, & Davison, 2000; Mobasher, Cooley, & Srivastava, 2000). For example, Google’s personalized search requires users to enter their preferences explicitly. In the Web context, the query log is a very good source for learning user profiles, since it records detailed information about users’ past search behaviors. The typical method of mining a user’s profile from query logs is to first collect all of the user’s query sessions from the logs and then learn the user’s preferences on various topics from the queries submitted and the pages viewed (Liu, Yu, & Meng, 2004). User preferences can be incorporated into search engines to personalize both their relevance ranking and their importance ranking. A general way to personalize relevance ranking is through query reformulation, such as query expansion and query modification, based on user profiles. A main feature of modern Web search engines is that they assign importance scores to pages by mining the link structure, and these importance scores significantly affect the final ranking of search results. The most famous link analysis algorithms are PageRank (Page, Brin, Motwani, & Winograd, 1998) and HITS (Kleinberg, 1998). One main shortcoming of PageRank is that it assigns a static importance score to a page, no matter who is searching and what query is being used. Recently, a topic-sensitive PageRank algorithm (Haveliwala, 2002) and a personalized PageRank algorithm (Jeh & Widom, 2003) have been proposed to calculate different PageRank scores based on different users’ preferences and interests. A search engine using topic-sensitive PageRank and personalized PageRank is expected to retrieve Web pages closer to user preferences.
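As a rough illustration of how a mined profile could personalize importance ranking, the sketch below blends per-topic PageRank scores with a user's topic preferences learned from query logs, loosely in the spirit of topic-sensitive PageRank (Haveliwala, 2002). The data layout, the topic names, and the linear blending are assumptions, not details taken from the cited work.

```python
def personalized_score(page, topic_pageranks, user_profile):
    """Combine topic-specific PageRank scores with a user's topic weights.

    topic_pageranks: {topic: {page: score}} computed offline.
    user_profile: {topic: weight} mined from the user's query log,
    assumed to sum to 1.
    """
    return sum(weight * topic_pageranks[topic].get(page, 0.0)
               for topic, weight in user_profile.items())

topic_pr = {"travel": {"p1": 0.08, "p2": 0.01},
            "programming": {"p1": 0.01, "p2": 0.07}}
profile = {"travel": 0.8, "programming": 0.2}   # user mostly searches travel
print(personalized_score("p1", topic_pr, profile))  # favors the travel page
```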
FUTURE TRENDS

Although query log mining is a promising way to improve the information-seeking process on the Web, the number of publications in this field is relatively small. The main reason may lie in the difficulty of obtaining a large amount of query log data. Due to this lack of data, and especially the lack of a standard data set, it is difficult to reproduce and compare existing algorithms, which may block knowledge accumulation in the field. On the other hand, search engine companies are usually reluctant to make their log data public because of the costs and/or legal issues involved. Creating such a standard log data set would greatly benefit both the research community and the search engine industry, and thus may become a major challenge of this field in the future. Nearly all of the published experiments to date were conducted on snapshots of small- to medium-scale query logs. In practice, query logs are generated continuously and grow to huge sizes. Developing scalable and incremental query log mining algorithms capable of handling such huge and growing data sets is another challenge. We also foresee that client-side query log mining and personalized search will attract more and more attention from both academia and industry. Personalized information filtering and customization could be effective ways to ease the worsening information overload problem.

CONCLUSION

Web query log mining is an application of data mining techniques to discover interesting knowledge from Web query logs. In recent years, research work and industrial applications have demonstrated that Web log mining is an effective way to improve the performance of Web search. Web search engines have become the main entry points for people to explore the Web and find needed information. Therefore, continuing to improve the performance of search engines and to make them more user-friendly are important and challenging tasks for researchers and developers. Query log mining is expected to play an increasingly important role in enhancing Web search.

REFERENCES

Baudisch, P. (1999). Joining collaborative and content-based filtering. Proceedings of the Conference on Human Factors in Computing Systems (CHI’99).

Beeferman, D., & Berger, A. (2000). Agglomerative clustering of a search engine query log. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 407-416).

Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (pp. 43-52).
Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1).

Cui, H., Wen, J.-R., Nie, J.-Y., & Ma, W.-Y. (2003). Query expansion by mining user logs. IEEE Transactions on Knowledge and Data Engineering, 15(4), 829-839.

Haveliwala, T. H. (2002). Topic-sensitive PageRank. Proceedings of the Eleventh International World Wide Web Conference (WWW 2002).

Hirsh, H., Basu, C., & Davison, B. D. (2000). Learning to personalize. Communications of the ACM, 43(8), 102-106.

Jeh, G., & Widom, J. (2003). Scaling personalized Web search. Proceedings of the Twelfth International World Wide Web Conference (WWW 2003).

Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677).

Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., & Riedl, J. (1997). GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77-87.

Liu, F., Yu, C., & Meng, W. (2004). Personalized Web search for improving retrieval effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16(1), 28-40.

Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43(8), 142-151.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University.

Shardanand, U., & Maes, P. (1995). Social information filtering: Algorithms for automating word of mouth. Proceedings of the Conference on Human Factors in Computing Systems.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.
Wen, J.-R., Nie, J.-Y., & Zhang, H.-J. (2002). Query clustering using user logs. ACM Transactions on Information Systems (ACM TOIS), 20(1), 59-81. Xu, J., & Croft, W.B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 18(1), 79-112.
KEY TERMS
Query Log: A type of file keeping track of the activities of the users who are utilizing a search engine.
Collaborative Filtering: A method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating).
Query Log Mining: An application of data mining techniques to discover interesting knowledge from Web query logs. The mined knowledge is usually used to enhance Web search.
Log-Based Query Clustering: A technique aiming at grouping users’ semantically related queries collected in Web query logs.
Query Session: A query submitted to a search engine together with the Web pages the user visits in response to the query. The query session is the basic unit of many query log mining tasks.
Log-Based Query Expansion: A new query expansion method based on query log mining. Probabilistic correlations between terms in user queries and terms in documents are established through user logs. With these term-term correlations, relevant expansion terms can be selected from the documents for a query.

Log-Based Personalized Search: Personalized search aims to return results related to users’ preferences. The core task of personalization is to obtain the preference of each individual user, which can be learned from query logs.
Enhancing Web Search through Web Structure Mining

Ji-Rong Wen, Microsoft Research Asia, China
INTRODUCTION
The Web is an open and free environment for people to publish and get information. Everyone on the Web can be either an author, a reader, or both. The language of the Web, HTML (Hypertext Markup Language), is mainly designed for information display, not for semantic representation. Therefore, current Web search engines usually treat Web pages as unstructured documents, and traditional information retrieval (IR) technologies are employed for Web page parsing, indexing, and searching. The unstructured essence of Web pages seriously blocks more accurate search and advanced applications on the Web. For example, many sites contain structured information about various products. Extracting and integrating product information from multiple Web sites could lead to powerful search functions, such as comparison shopping and business intelligence. However, these structured data are embedded in Web pages, and there are no proper traditional methods to extract and integrate them. Another example is the link structure of the Web. If used properly, information hidden in the links could be taken advantage of to effectively improve search performance and make Web search go beyond traditional information retrieval (Page, Brin, Motwani, & Winograd, 1998, Kleinberg, 1998). Although XML (Extensible Markup Language) is an effort to structuralize Web data by introducing semantics into tags, it is unlikely that common users are willing to compose Web pages using XML due to its complication and the lack of standard schema definitions. Even if XML is extensively adopted, a huge amount of pages are still written in the HTML format and remain unstructured. Web structure mining is the class of methods to automatically discover structured data and information from the Web. Because the Web is dynamic, massive and heterogeneous, automated Web structure mining calls for novel technologies and tools that may take advantage of state-of-the-art technologies from various areas, including machine learning, data mining, information retrieval, and databases and natural language processing.
BACKGROUND

Web structure mining can be further divided into three categories, based on the kind of structured data used:

• Web graph mining: Compared to a traditional document set, in which documents are independent, the Web provides additional information about how different documents are connected to one another via hyperlinks. The Web can be viewed as a (directed) graph whose nodes are the Web pages and whose edges are the hyperlinks between them. There has been a significant body of work on analyzing the properties of the Web graph and mining useful structures from it (Page et al., 1998; Kleinberg, 1998; Bharat & Henzinger, 1998; Gibson, Kleinberg, & Raghavan, 1998). Because the Web graph structure spans multiple Web pages, it is also called interpage structure.

• Web information extraction (Web IE): Although the documents in a traditional information retrieval setting are treated as plain texts with no or few structures, the content within a Web page does have inherent structure based on the various HTML and XML tags within the page. While Web content mining pays more attention to the content of Web pages, Web information extraction focuses on automatically extracting structures, at various levels of accuracy and granularity, out of Web pages. Web content structure is embedded in a single Web page and is also called intrapage structure.

• Deep Web mining: Besides Web pages that are accessible or crawlable by following hyperlinks, the Web also contains a vast amount of noncrawlable content. This hidden part of the Web, referred to as the deep Web or the hidden Web (Florescu, Levy, & Mendelzon, 1998), comprises a large number of online Web databases. Compared to the static surface Web, the deep Web contains a much larger amount of high-quality structured information (Chang, He, Li, & Zhang, 2003). Automatically discovering the structures of Web databases and matching semantically related attributes between them is critical to understanding the structures and semantics of deep Web sites and to facilitating advanced search and other applications.
MAIN THRUST
Web Graph Mining

Mining the Web graph has attracted a lot of attention in the last decade. Some important algorithms have been proposed and have shown great potential in improving the performance of Web search. Most of these mining algorithms are based on two assumptions. (a) Hyperlinks convey human endorsement. If there exists a link from page A to page B, and these two pages are authored by different people, then the first author found the second page valuable; thus, the importance of a page can be propagated to the pages it links to. (b) Pages that are co-cited by a certain page are likely related to the same topic. Therefore, the popularity or importance of a page is correlated to the number of incoming links to some extent, and related pages tend to be clustered together through dense linkages among them.
Hub and Authority

In the Web graph, a hub is defined as a page containing pointers to many other pages, and an authority is defined as a page pointed to by many other pages. An authority is usually viewed as a good page containing useful information about one topic, and a hub is usually a good source to locate information related to one topic. Moreover, a good hub should contain pointers to many good authorities, and a good authority should be pointed to by many good hubs. Such a mutual reinforcement relationship between hubs and authorities is taken advantage of by an iterative algorithm called HITS (Kleinberg, 1998). HITS computes authority scores and hub scores for Web pages in a subgraph of the Web, which is obtained from the (subset of) search results of a query together with some predecessor and successor pages. Bharat and Henzinger (1998) addressed three problems in the original HITS algorithm: mutually reinforced relationships between hosts (where certain documents “conspire” to dominate the computation), automatically generated links (where no human’s opinion is expressed by the link), and irrelevant documents (where the graph contains documents irrelevant to the query topic). They assign each edge of the graph an authority weight and a hub weight to solve the first problem and combine connectivity and content analysis to solve the latter two. Chakrabarti, Joshi, and Tawde (2001) addressed another problem with HITS: regarding the whole page as a hub is not suitable, because a page often contains multiple regions in which the hyperlinks point to different topics. They proposed to disaggregate hubs into coherent regions by segmenting the DOM (document object model) tree of an HTML page.

PageRank

The main drawback of the HITS algorithm is that the hub and authority scores must be computed iteratively from the query result on the fly, which does not meet the real-time constraints of an online search engine. To overcome this difficulty, Page et al. (1998) suggested using a random surfing model to describe the probability that a page is visited and taking that probability as the importance measurement of the page. They approximated this probability with the famous PageRank algorithm, which computes the probability scores in an iterative manner. The main advantage of the PageRank algorithm over the HITS algorithm is that the importance values of all pages are computed off-line and can be directly incorporated into the ranking functions of search engines. Noisy links and topic drifting are two main problems in the classic Web graph mining algorithms. Some links, such as banners, navigation panels, and advertisements, can be viewed as noise with respect to the query topic and do not carry human editorial endorsement. Also, hubs may be mixed, which means that only a portion of the hub content may be relevant to the query. Most link analysis algorithms treat each Web page as an atomic, indivisible unit with no internal structure. This leads to false reinforcements of hub/authority and importance calculation. Cai, He, Wen, and Ma (2004) used a vision-based page segmentation algorithm to partition each Web page into blocks. By extracting the page-to-block and block-to-page relationships from the link structure and page layout analysis, a semantic graph over the Web can be constructed such that each node exactly represents a single semantic topic. This graph can better describe the semantic structure of the Web. Based on block-level link analysis, they proposed two new algorithms, Block Level PageRank and Block Level HITS, whose performances are shown to exceed the classic PageRank and HITS algorithms.
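For concreteness, a minimal power-iteration sketch of PageRank on a small link graph is given below. The damping factor of 0.85 and the uniform treatment of dangling pages are common conventions rather than details taken from the article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```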
Community Mining Many communities, either in an explicit or implicit form, exist in the Web today, and their number is grow-
ing at a very fast speed. Discovering communities from a network environment such as the Web has recently become an interesting research problem. The Web can be abstracted into directed or undirected graphs of nodes and links. It is usually rather difficult to understand a network’s nature directly from its graph structure, particularly when it is a large-scale, complex graph. Data mining is a method to discover the hidden patterns and knowledge in such a huge network. The mined knowledge can provide a higher-level logical view and more precise insight into the nature of a network and will also dramatically decrease the dimensionality when analyzing the structure and evolution of the network. Quite a lot of work has been done on mining the implicit communities of users, Web pages, or scientific literature from the Web or from document citation databases using content or link analysis. Several different definitions of community have also been proposed in the literature. In Gibson et al. (1998), a Web community is a number of representative authority Web pages linked by important hub pages that share a common topic. Kumar, Raghavan, Rajagopalan, and Tomkins (1999) define a Web community as a highly linked bipartite subgraph with at least one core containing a complete bipartite subgraph. In Flake, Lawrence, and Lee Giles (2000), a Web community is defined as a set of Web pages each of which has more links to pages inside the community than to pages outside of it. Also, a research community could be based on a single most-cited paper and could contain all papers that cite it (Popescul, Flake, Lawrence, Ungar, & Lee Giles, 2000).
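The sketch below checks the Flake-style condition informally restated above: every member of a candidate community should have at least as many links to other members as to non-members. It is a simplified, undirected reading of that definition, offered only as an illustration of what community mining tests on the Web graph.

```python
def is_community(members, links):
    """members: set of pages; links: {page: set of neighboring pages}."""
    for page in members:
        inside = len(links.get(page, set()) & members)
        outside = len(links.get(page, set()) - members)
        if inside < outside:
            return False
    return True

links = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"}, "x": {"c"}}
print(is_community({"a", "b", "c"}, links))  # True: dense inside, sparse outside
```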
Web Information Extraction Web IE has the goal of pulling out information from a collection of Web pages and converting it to a homogeneous form that is more readily digested and analyzed for both humans and machines. The results of IE could be used to improve the indexing process, because IE removes irrelevant information in Web pages and facilitates other advanced search functions due to the structured nature of data. The structuralization degrees of Web pages are diverse. Some pages can be just taken as plain text documents. Some pages contain a little loosely structured data, such as a product list in a shopping page or a price table in a hotel page. Some pages are organized with more rigorous structures, such as the home pages of the professors in a university. Other pages have very strict structures, such as the book description pages of Amazon, which are usually generated by a uniform template. Therefore, basically two kinds of Web IE techniques exist: IE from unstructured pages and IE from semistructured pages. IE tools for unstructured pages are similar to those classical IE tools that typically use natural language processing techniques such as syntactic
analysis, semantic analysis, and discourse analysis. IE tools for semistructured pages are different from the classical ones, as they utilize available structural information, such as HTML tags and page layouts, to infer the data formats of pages. Such methods are also called wrapper induction (Kushmerick, Weld, & Doorenbos, 1997; Cohen, Hurst, & Jensen, 2002). In contrast to classic IE approaches, wrapper induction depends less on the specific contents of Web pages and focuses mainly on page structure and layout. Existing approaches for Web IE include the manual approach, supervised learning, and unsupervised learning. Although some manually built wrappers exist, supervised learning and unsupervised learning are viewed as more promising ways to learn robust and scalable wrappers, because building IE tools manually is neither feasible nor scalable for the dynamic, massive, and diverse Web content. Moreover, because supervised learning still relies on manually labeled sample pages and thus also requires substantial human effort, unsupervised learning is the most suitable method for Web IE. There have been several successful, fully automatic IE tools using unsupervised learning (Arasu & Garcia-Molina, 2003; Liu, Grossman, & Zhai, 2003).
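To give a flavor of wrapper induction on semistructured pages, the toy sketch below learns a left and a right delimiter for one field from labeled example pages and then applies them to a new page, in the spirit of the LR wrappers of Kushmerick et al. (1997). The 20-character context window and the single-field setting are assumptions; real wrapper induction systems handle multiple fields, noise, and richer page structure.

```python
import re

def learn_wrapper(examples):
    """examples: list of (html, target_value). Learn (left, right) delimiters.

    Uses the longest common suffix/prefix of the text around the target;
    the fixed 20-character context window is an arbitrary assumption.
    """
    lefts, rights = [], []
    for html, value in examples:
        i = html.index(value)
        lefts.append(html[max(0, i - 20):i])
        rights.append(html[i + len(value):i + len(value) + 20])
    left = lefts[0]
    for l in lefts[1:]:                 # longest common suffix of left contexts
        while not l.endswith(left):
            left = left[1:]
    right = rights[0]
    for r in rights[1:]:                # longest common prefix of right contexts
        while not r.startswith(right):
            right = right[:-1]
    return left, right

def apply_wrapper(html, left, right):
    match = re.search(re.escape(left) + "(.*?)" + re.escape(right), html)
    return match.group(1) if match else None

examples = [("<b>Price:</b> $12.99<br>", "$12.99"),
            ("<b>Price:</b> $8.50<br>", "$8.50")]
left, right = learn_wrapper(examples)
print(apply_wrapper("<b>Price:</b> $20.00<br>", left, right))
```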
Deep Web Mining In the deep Web, it is usually difficult or even impossible to directly obtain the structures (i.e. schemas) of the Web sites’ backend databases without cooperation from the sites. Instead, the sites present two other distinguishing structures, interface schema and result schema, to users. The interface schema is the schema of the query interface, which exposes attributes that can be queried in the backend database. The result schema is the schema of the query results, which exposes attributes that are shown to users. The interface schema is useful for applications, such as a mediator that queries multiple Web databases, because the mediator needs complete knowledge about the search interface of each database. The result schema is critical for applications, such as data extraction, where instances in the query results are extracted. In addition to the importance of the interface schema and result schema, attribute matching across different schemas is also important. First, matching between different interface schemas and matching between different results schemas (intersite schema matching) are critical for metasearching and data-integration among related Web databases. Second, matching between the interface schema and the result schema of a single Web database (intrasite schema matching) enables automatic data annotation and database content crawling.
Most existing schema-matching approaches for Web databases focus primarily on matching query interfaces (He & Chang, 2003; He, Meng, Yu, & Wu, 2003; Raghavan & Garcia-Molina, 2001). They usually adopt a label-based strategy that identifies attribute labels from the descriptive text surrounding interface elements and then finds synonymous relationships between the identified labels. The performance of these approaches may suffer when no attribute description can be identified or when the identified description is not informative. In Wang, Wen, Lochovsky, and Ma (2004), an instance-based schema-matching approach was proposed to identify both the interface and the result schemas of Web databases. Instance-based approaches depend on content overlap or on statistical properties, such as data ranges and patterns, to determine the similarity of two attributes. Thus, they can effectively deal with cases where attribute names or labels are missing or unavailable, which are common for Web databases.
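The sketch below conveys the instance-based idea in the simplest possible form: two attributes are scored by the overlap of their observed values and by the agreement of coarse value patterns (numeric share and typical length). The pattern features, the equal weighting, and the example data are assumptions for illustration, not the method of Wang et al. (2004).

```python
def value_pattern(values):
    """Very coarse data-pattern signature: numeric share and average length."""
    numeric = sum(v.replace(".", "", 1).isdigit() for v in values) / len(values)
    avg_len = sum(len(v) for v in values) / len(values)
    return numeric, avg_len

def attribute_similarity(values_a, values_b):
    """Combine instance overlap with pattern agreement (equal weights assumed)."""
    a, b = set(values_a), set(values_b)
    overlap = len(a & b) / len(a | b)
    na, la = value_pattern(values_a)
    nb, lb = value_pattern(values_b)
    pattern = 1.0 - (abs(na - nb) + min(abs(la - lb) / max(la, lb), 1.0)) / 2
    return (overlap + pattern) / 2

prices_site1 = ["12.99", "8.50", "20.00"]
prices_site2 = ["8.50", "15.75"]
titles_site2 = ["Data Mining", "Web Search Basics"]
print(attribute_similarity(prices_site1, prices_site2))  # high: likely a match
print(attribute_similarity(prices_site1, titles_site2))  # low: unrelated attributes
```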
FUTURE TRENDS It is foreseen that the biggest challenge in the next several decades is how to effectively and efficiently dig out a machine-understandable information and knowledge layer from unorganized and unstructured Web data. However, Web structure mining techniques are still in their youth today. For example, the accuracy of Web information extraction tools, especially those automatically learned tools, is still not satisfactory to meet the requirements of some rigid applications. Also, deep Web mining is a new area, and researchers have many challenges and opportunities to further explore, such as data extraction, data integration, schema learning and matching, and so forth. Moreover, besides Web pages, various other types of structured data exist on the Web, such as e-mail, newsgroup, blog, wiki, and so forth. Applying Web mining techniques to extract structures from these data types is also a very important future research direction.
CONCLUSION

Despite the efforts of XML and the Semantic Web, which aim to bring structure and semantics to the Web, Web structure mining is considered a more promising way to structuralize the Web because of its automation, scalability, generality, and robustness. As a result, there has been rapid growth in technologies for automatically discovering structures from the Web, namely Web graph mining, Web information extraction, and deep Web mining. The mined information and knowledge will
greatly improve the effectiveness of current Web search and will enable much more sophisticated Web information retrieval technologies in the future.
REFERENCES Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. Proceedings of the ACM SIGMOD International Conference on Management of Data. Bharat, K., & Henzinger, M. R. (1998). Improved algorithms for topic distillation in a hyperlinked environment. Proceedings of the 21st Annual International ACM SIGIR Conference. Cai, D., He, X., Wen J.-R., & Ma, W.-Y. (2004). Blocklevel link analysis. Proceedings of the 27th Annual International ACM SIGIR Conference. Chakrabarti, S., Joshi, M., & Tawde, V. (2001). Enhanced topic distillation using text, markup tags, and hyperlinks. Proceedings of the 24th Annual International ACM SIGIR Conference (pp. 208-216). Chang, C. H., He, B., Li, C., & Zhang, Z. (2003). Structured databases on the Web: Observations and implications discovery (Tech. Rep. No. UIUCCDCS-R-2003-2321). Urbana-Champaign, IL: University of Illinois, Department of Computer Science. Cohen, W., Hurst, M., & Jensen, L. (2002). A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of the 11th World Wide Web Conference. Flake, G. W., Lawrence, S., & Lee Giles, C. (2000). Efficient identification of Web communities. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining. Florescu, D., Levy, A. Y., & Mendelzon, A. O. (1998). Database techniques for the World Wide Web: A survey. SIGMOD Record, 27(3), 59-74. Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link topology. Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia. He, B., & Chang, C. C. (2003). Statistical schema matching across Web query interfaces. Proceedings of the ACM SIGMOD International Conference on Management of Data. He, H., Meng, W., Yu, C., & Wu, Z. (2003). WISE-Integrator: An automatic integrator of Web search interfaces for
e-commerce. Proceedings of the 29th International Conference on Very Large Data Bases. Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677). Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for emerging cyber-communities. Proceedings of the Eighth International World Wide Web Conference. Kushmerick, N., Weld, D., & Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the International Joint Conference on Artificial Intelligence. Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web (Tech. Rep.). Stanford University. Popescul, A., Flake, G. W., Lawrence, S., Ungar, L. H., & Lee Giles, C. (2000). Clustering and identifying temporal trends in document databases. Proceedings of the IEEE Conference on Advances in Digital Libraries. Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. Proceedings of the 27th International Conference on Very Large Data Bases.
Wang, J., Wen, J.-R., Lochovsky, F., & Ma, W.-Y. (2004). Instance-based schema matching for Web databases by domain-specific query probing. Proceedings of the 30th International Conference on Very Large Data Bases.
KEY TERMS

Community Mining: A Web graph mining algorithm to discover communities from the Web graph in order to provide a higher logical view and more precise insight into the nature of the Web.

Deep Web Mining: Automatically discovering the structures of Web databases hidden in the deep Web and matching semantically related attributes between them.

HITS: A Web graph mining algorithm to compute authority scores and hub scores for Web pages.

PageRank: A Web graph mining algorithm that uses the probability that a page is visited by a random surfer on the Web as a key factor for ranking search results.

Web Graph Mining: The mining techniques used to discover knowledge from the Web graph.

Web Information Extraction: The class of mining methods to pull out information from a collection of Web pages and convert it to a homogeneous form that is more readily digested and analyzed by both humans and machines.

Web Structure Mining: The class of methods used to automatically discover structured data and information from the Web.
Ensemble Data Mining Methods

Nikunj C. Oza, NASA Ames Research Center, USA
INTRODUCTION Ensemble data mining methods, also known as committee methods or model combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: Each member of the committee should be as competent as possible, but the members should complement one another. If the members are not complementary, that is, if they always agree, then the committee is unnecessary — any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
BACKGROUND A supervised machine learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. In a classification learning task, each output is one or more classes to which the input belongs. The goal of classification learning is to develop a model that separates the data into the different classes, with the aim of classifying new examples in the future. For example, a credit card company may develop a model that separates people who defaulted on their credit cards from those who did not, based on other known information such as annual income. The goal would be to predict whether a new credit card applicant is likely to default on his or her credit card and thereby decide whether to approve or deny this applicant a new card. In a regression learning task, each output is a continuous value to be predicted (e.g., the average balance that a credit card holder carries over to the next month). Many traditional machine learning algorithms generate a single model (e.g., a decision tree or neural network). Ensemble learning methods instead generate multiple models. Given a new example, the ensemble passes it to each of its multiple base models, obtains their predictions,
and then combines them in some appropriate manner (e.g., averaging or voting). As mentioned earlier, it is important to have base models that are competent but also complementary. To further motivate this point, consider Figure 1. This figure depicts a classification problem in which the goal is to separate the points marked with plus signs from points marked with minus signs. None of the three individual linear classifiers (marked A, B, and C) is able to separate the two classes of points. However, a majority vote over all three linear classifiers yields the piecewise-linear classifier shown as a thick line. This classifier is able to separate the two classes perfectly. For example, the plusses at the top of the figure are correctly classified by A and B but are misclassified by C. The majority vote over these classifiers correctly identifies them as plusses. This happens because A and B are very different from C. If our ensemble instead consisted of three copies of C, then all three classifiers would misclassify the plusses at the top of the figure, and so would a majority vote over these classifiers.
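The sketch below reproduces the idea behind Figure 1 with three hand-chosen linear classifiers on 2-D points and a simple majority vote; the particular decision boundaries and the test point are made up for illustration and are not taken from the figure.

```python
def linear_classifier(w0, w1, w2):
    """Returns a function that labels a 2-D point +1 or -1 by a linear rule."""
    return lambda x, y: 1 if w0 + w1 * x + w2 * y > 0 else -1

def majority_vote(classifiers, x, y):
    votes = sum(c(x, y) for c in classifiers)
    return 1 if votes > 0 else -1

# Three different linear boundaries (assumed for illustration).
A = linear_classifier(-1.0, 1.0, 0.0)   # predicts +1 when x > 1
B = linear_classifier(-1.0, 0.0, 1.0)   # predicts +1 when y > 1
C = linear_classifier(3.0, -1.0, -1.0)  # predicts +1 when x + y < 3

point = (2.5, 2.5)
print([c(*point) for c in (A, B, C)])    # C is outvoted by A and B
print(majority_vote([A, B, C], *point))  # the combined decision
```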
Figure 1. An ensemble of linear classifiers

MAIN THRUST

We now discuss the key elements of an ensemble-learning method and ensemble model and, in the process, discuss several ensemble methods that have been developed.
Ensemble Methods The example shown in Figure 1 is an artificial example. We cannot normally expect to obtain base models that misclassify examples in completely separate parts of the input space and ensembles that classify all the examples correctly. However, many algorithms attempt to generate a set of base models that make errors that are as uncorrelated as possible. Methods such as bagging (Breiman, 1994) and boosting (Freund & Schapire, 1996) promote diversity by presenting each base model with a different subset of training examples or different weight distributions over the examples. For example, in Figure 1, if the plusses in the top part of the figure were temporarily removed from the training set, then a linear classifier learning algorithm trained on the remaining examples would probably yield a classifier similar to C. On the other hand, removing the
plusses in the bottom part of the figure would probably yield classifier B, or something similar. In this way, running the same learning algorithm on different subsets of training examples can yield very different classifiers, which can be combined to yield an effective ensemble. Input decimation ensembles (IDE) (Oza & Tumer, 2001; Tumer & Oza, 2003) and stochastic attribute selection committees (SASC) (Zheng & Webb, 1998) instead promote diversity by training each base model with the same training examples but different subsets of the input features. SASC trains each base model with a random subset of input features. IDE selects, for each class, a subset of features that has the highest correlation with the presence of that class. Each feature subset is used to train one base model. However, in both SASC and IDE, all the training patterns are used with equal weight to train all the base models. So far we have distinguished ensemble methods by the way they train their base models. We can also distinguish methods by the way they combine their base models’ predictions. Majority or plurality voting is frequently used for classification problems and is used in bagging. If the classifiers provide probability values, simple averaging is commonly used and is very effective (Tumer & Ghosh, 1996). Weighted averaging has also been used, and different methods for weighting the base models have been examined. Two particularly interesting methods for weighted averaging include mixtures of experts (Jordan & Jacobs, 1994) and Merz’s use of principal components analysis (PCA) to combine models (Merz, 1999). In the mixtures of experts method, the weights in the weighted average combination are determined by a gating network, which is a model that takes the same inputs that the base models take and returns a weight on each of the base models. The higher the weight for a base model, the more that base model is trusted to provide the correct answer. These weights are determined during training by how well the base models perform on the training examples. The
gating network essentially keeps track of how well each base model performs in each part of the input space. The hope is that each model learns to specialize in different input regimes and is weighted highly when the input falls into its specialty. Merz’s method uses PCA to lower the weights of base models that perform well overall but are redundant and, therefore, effectively give too much weight to one model. For example, in Figure 1, if an ensemble of three models instead had two copies of A and one copy of B, we may prefer to lower the weights of the two copies of A because, essentially, A is being given too much weight. Here, the two copies of A would always outvote B, thereby rendering B useless. Merz’s method also increases the weight on base models that do not perform as well overall but perform well in parts of the input space, where the other models perform poorly. In this way, a base model’s unique contributions are rewarded. When designing an ensemble learning method, in addition to choosing the method by which to bring about diversity in the base models and choosing the combining method, one has to choose the type of base model and base model learning algorithm to use. The combining method may restrict the types of base models that can be used. For example, to use average combining in a classification problem, one must have base models that can yield probability estimates. This precludes the use of linear discriminant analysis or support vector machines, which cannot return probabilities. The vast majority of ensemble methods use only one base model learning algorithm but use the methods described earlier to bring about diversity in the base models. Surprisingly, little work has been done (e.g., Merz, 1999) on creating ensembles with many different types of base models. Two of the most popular ensemble learning algorithms are bagging and boosting, which we briefly explain next.
Bagging

Bootstrap aggregating (bagging) generates multiple bootstrap training sets from the original training set (by using sampling with replacement) and uses each of them to generate a classifier for inclusion in the ensemble. The algorithms for bagging and sampling with replacement are given in Figure 2. In these algorithms, T is the original training set of N examples, M is the number of base models to be learned, Lb is the base model learning algorithm, the hm are the base models, random_integer(a,b) is a function that returns each of the integers from a to b with equal probability, and I(A) is the indicator function that returns 1 if A is true and 0 otherwise. To create a bootstrap training set from an original training set of size N, we perform N multinomial trials, where in each trial we draw one of the N examples. Each example has probability 1/N of being drawn in each trial.
Figure 2. Batch bagging algorithm and sampling with replacement

Bagging(T, M)
  For each m = 1, 2, ..., M:
    Tm = Sample_With_Replacement(T, |T|)
    hm = Lb(Tm)
  Return hfin(x) = argmax over y in Y of Σ_{m=1..M} I(hm(x) = y)

Sample_With_Replacement(T, N)
  S = {}
  For i = 1, 2, ..., N:
    r = random_integer(1, N)
    Add T[r] to S
  Return S

Figure 3. Batch boosting algorithm (AdaBoost)

AdaBoost({(x1, y1), (x2, y2), ..., (xN, yN)}, Lb, M)
  Initialize D1(n) = 1/N for all n in {1, 2, ..., N}
  For each m = 1, 2, ..., M:
    hm = Lb({(x1, y1), ..., (xN, yN)}, Dm)
    εm = Σ over n with hm(xn) ≠ yn of Dm(n)
    If εm ≥ 1/2, then set M = m − 1 and abort this loop
    Update distribution Dm:
      Dm+1(n) = Dm(n) × 1/(2(1 − εm))  if hm(xn) = yn
                Dm(n) × 1/(2εm)        otherwise
  Return hfin(x) = argmax over y in Y of Σ_{m=1..M} I(hm(x) = y) log((1 − εm)/εm)
The second algorithm shown in Figure 2 does exactly this: N times, the algorithm chooses a number r from 1 to N and adds the rth training example to the bootstrap training set S. Clearly, some of the original training examples will not be selected for inclusion in the bootstrap training set, and others will be chosen one time or more. In bagging, we create M such bootstrap training sets and then generate classifiers using each of them. Bagging returns a function hfin(x) that classifies new examples by returning the class y that gets the maximum number of votes from the base models h1, h2, …, hM. In bagging, the M bootstrap training sets that are created are likely to have some differences. If these differences are enough to induce noticeable differences among the M base models while leaving their performances reasonably good, then the ensemble will probably perform better than the base models individually. Breiman (1994) demonstrates that bagged ensembles tend to improve upon their base models more if the base model learning algorithms are unstable, meaning that differences in their training sets tend to induce significant differences in the models. He notes that decision trees are unstable, which explains why bagged decision trees often outperform individual decision trees; however, decision stumps (decision trees with only one variable test) are stable, which explains why bagging with decision stumps tends not to improve upon individual decision stumps.
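A compact, runnable rendering of the bagging procedure in Figure 2 is given below. The 1-nearest-neighbor base learner and the tiny one-dimensional data set are assumptions chosen only to keep the example self-contained; any learning algorithm could be plugged in as the `learn` argument.

```python
import random
from collections import Counter

def sample_with_replacement(dataset):
    """Draw |dataset| examples uniformly with replacement (a bootstrap sample)."""
    n = len(dataset)
    return [dataset[random.randrange(n)] for _ in range(n)]

def bagging(dataset, learn, m):
    """Train m base models, each on its own bootstrap sample, and vote."""
    models = [learn(sample_with_replacement(dataset)) for _ in range(m)]
    def ensemble(x):
        votes = Counter(h(x) for h in models)
        return votes.most_common(1)[0][0]
    return ensemble

# A deliberately simple base learner (assumed for illustration):
# 1-nearest-neighbor on one-dimensional inputs.
def learn_1nn(train):
    def h(x):
        return min(train, key=lambda ex: abs(ex[0] - x))[1]
    return h

data = [(0.1, "minus"), (0.3, "minus"), (0.7, "plus"), (0.9, "plus")]
model = bagging(data, learn_1nn, m=11)
print(model(0.2), model(0.8))
```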
Boosting We now explain the AdaBoost algorithm, because it is the most frequently used among all boosting algorithms. AdaBoost generates a sequence of base models with different weight distributions over the training set. The AdaBoost algorithm is shown in Figure 3. Its inputs are a set of N training examples, a base model learning algorithm Lb, and the number M of base models that we wish to combine. AdaBoost was originally designed for two-class classification problems; therefore, for this explanation, we will assume that two possible classes exist. However, AdaBoost is regularly used with a larger number of classes. The first step in AdaBoost is to construct an initial distribution of weights D1 over the training set. This distribution assigns equal weight to all N training examples. We now enter the loop in the algorithm. To construct the first base model, we call Lb with distribution D1 over the training set.1 After getting back a model h1, we calculate its error e1 on the training set itself, which is just the sum of the weights of the training examples that h1 misclassifies. We require that e1 < 1/2 (this is the weak learning assumption — the error should be less than what we would achieve through randomly guessing the class2). If this condition is not satisfied, then we stop and return the ensemble consisting of the previously gener-
ated base models. If this condition is satisfied, then we calculate a new distribution, D2, over the training examples as follows. Examples that were correctly classified by h1 have their weights multiplied by 1/(2(1 − e1)). Examples that were misclassified by h1 have their weights multiplied by 1/(2e1). Note that, because of our condition e1 < 1/2, correctly classified examples have their weights reduced and misclassified examples have their weights increased. Specifically, examples that h1 misclassified have their total weight increased to 1/2 under D2, and examples that h1 correctly classified have their total weight reduced to 1/2 under D2. We then go into the next iteration of the loop to construct base model h2 using the training set and the new distribution D2. The point is that the next base model will be generated by a weak learner (i.e., the base model will have an error less than 1/2); therefore, at least some of the examples misclassified by the previous base model will have to be correctly classified by the current base model. In this way, boosting forces subsequent base models to correct the mistakes made by earlier models. We construct M base models in this fashion. The ensemble returned by AdaBoost is a function that takes a new example as input and returns the class that gets the maximum weighted vote over the M base models, where each base model’s weight is log((1 − em)/em), which is proportional to the base model’s accuracy on the weighted training set presented to it. AdaBoost has performed very well in practice and is one of the few theoretically motivated algorithms that has turned into a practical algorithm. However, AdaBoost can perform poorly when the training data is noisy (Dietterich, 2000); that is, when the inputs or outputs have been randomly contaminated. Noisy examples are normally difficult to learn. Because of this, the weights assigned to noisy examples often become much higher than those of the other examples, causing boosting to focus too much on those noisy examples at the expense of the remaining data. Some work has been done to mitigate the effect of noisy examples on boosting (Oza, 2003, 2004; Ratsch, Onoda, & Muller, 2001).
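The following sketch mirrors the AdaBoost weight update just described for a two-class (±1) problem. The decision-stump base learner and the one-dimensional data are assumptions made to keep the example self-contained; they are not part of the algorithm in Figure 3.

```python
import math

def learn_stump(data, weights):
    """Weighted decision stump on 1-D inputs: threshold with lowest weighted error."""
    best = None
    for t in [x for x, _ in data]:
        for sign in (1, -1):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (1 if sign * (x - t) > 0 else -1) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return (lambda x: 1 if sign * (x - t) > 0 else -1), err

def adaboost(data, m):
    n = len(data)
    weights = [1.0 / n] * n
    models = []
    for _ in range(m):
        h, err = learn_stump(data, weights)
        if err >= 0.5 or err == 0:   # stop on a weak-learning violation (or a perfect stump)
            break
        alpha = math.log((1 - err) / err)
        models.append((alpha, h))
        # Reweight: misclassified examples get total weight 1/2, as in the text.
        weights = [w / (2 * err) if h(x) != y else w / (2 * (1 - err))
                   for (x, y), w in zip(data, weights)]
    def ensemble(x):
        score = sum(alpha * h(x) for alpha, h in models)
        return 1 if score > 0 else -1
    return ensemble

data = [(0.1, -1), (0.2, -1), (0.4, 1), (0.6, 1), (0.8, -1), (0.9, 1)]
model = adaboost(data, m=5)
print([model(x) for x, _ in data])
```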
FUTURE TRENDS The fields of machine learning and data mining are increasingly moving away from working on small datasets in the form of flat files that are presumed to describe a single process. They are changing their focus toward the types of data increasingly being encountered today: very large datasets, possibly distributed over different locations, describing operations with multiple regimes of operation, time-series data, online applications (the data is not a time series but nevertheless arrives continually and must be processed as it arrives), partially labeled data,
and documents. Research in ensemble methods is beginning to explore these new types of data. For example, ensemble learning traditionally has required access to the entire dataset at once; that is, it performs batch learning. However, this idea is clearly impractical for very large datasets that cannot be loaded into memory all at once. Oza and Russell (2001) and Oza (2001) apply ensemble learning to such large datasets. In particular, this work develops online bagging and boosting; that is, they learn in an online manner. Whereas standard bagging and boosting require at least one scan of the dataset for every base model created, online bagging and online boosting require only one scan of the dataset, regardless of the number of base models. Additionally, as new data arrive, the ensembles can be updated without reviewing any past data. However, because of their limited access to the data, these online algorithms do not perform as well as their standard counterparts. Other work has also been done to apply ensemble methods to other types of data, such as time-series data (Weigend, Mangeas, & Srivastava, 1995). However, most of this work is experimental. Theoretical frameworks that can guide us in the development of new ensemble learning algorithms specifically for modern datasets have yet to be developed.
CONCLUSION Ensemble methods began about 10 years ago as a separate area within machine learning and were motivated by the idea of wanting to leverage the power of multiple models and not just trust one model built on a small training set. Significant theoretical and experimental developments have occurred over the past 10 years and have led to several methods, especially bagging and boosting, being used to solve many real problems. However, ensemble methods also appear to be applicable to current and upcoming problems of distributed data mining and online applications. Therefore, practitioners in data mining should stay tuned for further developments in the vibrant area of ensemble methods. An excellent way to do this is to follow the series of workshops called the International Workshop on Multiple Classifier Systems. This series’ balance between theory, algorithms, and applications of ensemble methods gives a comprehensive idea of the work being done in the field.
REFERENCES Breiman, L. (1994). Bagging predictors (Tech. Rep. 421). Berkeley: University of California, Department of Statistics.
Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158. Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In M. Kaufmann (Ed.), Proceedings of the 13th International Conference on Machine Learning (pp. 148-156). Bari, Italy: Morgan Kaufmann Publishers. Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixture of experts and the EM algorithm. Neural Computation, 6, 181-214. Merz, C. J. (1999). A principal component approach to combining regression estimates. Machine Learning, 36, 9-32. Oza, N. C. (2001). Online ensemble learning. Unpublished doctoral dissertation, University of California, Berkeley. Oza, N. C. (2003). Boosting with averaged weight vectors. In T. Windeatt & F. Roli (Eds.), Proceedings of the Fourth International Workshop on Multiple Classifier Systems (pp. 15-24). Guildford, UK: Springer-Verlag. Oza, N. C. (2004). AveBoost2: Boosting with noisy data. In F. Roli, J. Kittler, & T. Windeatt (Eds.), Proceedings of the Fifth International Workshop on Multiple Classifier Systems (pp. 31-40). Cagliari, Italy: Springer-Verlag. Oza, N. C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA, (pp. 359-364). ACM Press. Oza, N. C., & Tumer, K. (2001). Input decimation ensembles: Decorrelation through dimensionality reduction. Proceedings of the Second International Workshop on Multiple Classifier Systems, Berlin (pp. 238-247). Springer-Verlag. Ratsch, G., Onoda, T., & Muller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287-320. Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4), 385-404. Tumer, K., & Oza, N. C. (2003). Input decimated ensembles. Pattern Analysis and Applications, 6(1), 65-77. Weigend, A. S., Mangeas, M., & Srivastava, A. N. (1995). Nonlinear gated experts for time-series: Discovering regimes and avoiding overfitting. International Journal of Neural Systems, 6(4), 373-399.
Zheng, Z., & Webb, G. (1998). Stochastic attribute selection committees. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (pp. 321-332), Brisbane, Australia. Springer-Verlag.
KEY TERMS

Batch Learning: Learning by using an algorithm that views the entire dataset at once and can access any part of the dataset at any time and as many times as desired.

Decision Tree: A model consisting of nodes that contain tests on a single attribute and branches representing the different outcomes of the test. A prediction is generated for a new example by performing the test described at the root node and then proceeding along the branch that corresponds to the outcome of the test. If the branch ends in a prediction, then that prediction is returned. If the branch ends in a node, then the test at that node is performed and the appropriate branch selected. This continues until a prediction is found and returned.

Ensemble: A function that returns a combination of the predictions of multiple machine learning models.

Machine Learning: The branch of artificial intelligence devoted to enabling computers to learn.

Neural Network: A nonlinear model derived through analogy with the human brain. It consists of a collection of elements that linearly combine their inputs and pass the result through a nonlinear transfer function.

Online Learning: Learning by using an algorithm that only examines the dataset once, in order. This paradigm is often used in situations when data arrives continually in a stream and when predictions must be obtainable at any time.

Principal Components Analysis (PCA): Given a dataset, PCA determines the axes of maximum variance. For example, if the dataset were shaped like an egg, then the long axis of the egg would be the first principal component, because the variance is greatest in this direction. All subsequent principal components are found to be orthogonal to all previous components.
ENDNOTES

1. If Lb cannot take a weighted training set, then one can call it with a training set that is generated by sampling with replacement from the original training set according to the distribution Dm.

2. This requirement is perhaps too strict when more than two classes exist. AdaBoost has a multiclass version (Freund & Schapire, 1997) that does not have this requirement. However, the AdaBoost algorithm presented here is often used even when more than two classes exist, if the base model learning algorithm is strong enough to satisfy the requirement.
Ethics of Data Mining

Jack Cook, Rochester Institute of Technology, USA
INTRODUCTION Decision makers thirst for answers to questions. As more data is gathered, more questions are posed: Which customers are most likely to respond positively to a marketing campaign, product price change or new product offering? How will the competition react? Which loan applicants are most likely or least likely to default? The ability to raise questions, even those that currently cannot be answered, is a characteristic of a good decision maker. Decision makers no longer have the luxury of making decisions based on gut feeling or intuition. Decisions must be supported by data; otherwise decision makers can expect to be questioned by stockholders, reporters, or attorneys in a court of law. Data mining can support and often direct decision makers in ways that are often counterintuitive. Although data mining can provide considerable insight, there is an “inherent risk that what might be inferred may be private or ethically sensitive” (Fule & Roddick, 2004, p. 159). Extensively used in telecommunications, financial services, insurance, customer relationship management (CRM), retail, and utilities, data mining more recently has been used by educators, government officials, intelligence agencies, and law enforcement. It helps alleviate data overload by extracting value from volume. However, data analysis is not data mining. Query-driven data analysis, perhaps guided by an idea or hypothesis, that tries to deduce a pattern, verify a hypothesis, or generalize information in order to predict future behavior is not data mining (Edelstein, 2003). It may be a first step, but it is not data mining. Data mining is the process of discovering and interpreting meaningful, previously hidden patterns in the data. It is not a set of descriptive statistics. Description is not prediction. Furthermore, the focus of data mining is on the process, not a particular technique, used to make reasonably accurate predictions. It is iterative in nature and generically can be decomposed into the following steps: (1) data acquisition through translating, cleansing, and transforming data from numerous sources, (2) goal setting or hypotheses construction, (3) data mining, and (4) validating or interpreting results. The process of generating rules through a mining operation becomes an ethical issue, when the results are used in decision-making processes that affect people or when mining customer data unwittingly compromises the privacy of those customers (Fule & Roddick, 2004). Data
miners and decision makers must contemplate ethical issues before encountering one. Otherwise, they risk not identifying when a dilemma exists or making poor choices, since all aspects of the problem have not been identified.
BACKGROUND Technology has moral properties, just as it has political properties (Brey 2000; Feenberg, 1999; Sclove, 1995; Winner, 1980). Winner (1980) argues that technological artifacts and systems function like laws, serving as frameworks for public order by constraining individuals’ behaviors. Sclove (1995) argues that technologies possess the same kinds of structural effects as other elements of society, such as laws, dominant political and economic institutions, and systems of cultural beliefs. Data mining, being a technological artifact, is worthy of study from an ethical perspective due to its increasing importance in decision making, both in the private and public sectors. Computer systems often function less as background technologies and more as active constituents in shaping society (Brey, 2000). Data mining is no exception. Higher integration of data mining capabilities within applications ensures that this particular technological artifact will increasingly shape public and private policies. Data miners and decision makers obviously are obligated to adhere to the law. But ethics are oftentimes more restrictive than what is called for by law. Ethics are standards of conduct that are agreed upon by cultures and organizations. Supreme Court Justice Potter Stewart defines the difference between ethics and laws as knowing the difference between what you have a right to do (legally, that is) and what is right to do. Sadly, a number of IS professionals either lack an awareness of what their company actually does with data and data mining results or purposely come to the conclusion that it is not their concern. They are enablers in the sense that they solve management’s problems. What management does with that data or results is not their concern. Most laws do not explicitly address data mining, although court cases are being brought to stop certain data mining practices. A federal court ruled that using data mining tools to search Internet sites for competitive information may be a crime under certain circumstances (Scott, 2002). In EF Cultural Travel BV vs. Explorica Inc.
(No. 01-2000 1st Cir. Dec. 17, 2001), the First Circuit Court of Appeals in Massachusetts held that Explorica, a tour operator for students, improperly obtained confidential information about how rival EF’s Web site worked and used that information to write software that gleaned data about student tour prices from EF’s Web site in order to undercut EF’s prices (Scott, 2002). In this case, Explorica probably violated the federal Computer Fraud and Abuse Act (18 U.S.C. Sec. 1030). Hence, the source of the data is important when data mining. Typically, with applied ethics, a morally controversial practice, such as how data mining impacts privacy, “is described and analyzed in descriptive terms, and finally moral principles and judgments are applied to it and moral deliberation takes place, resulting in a moral evaluation, and operationally, a set of policy recommendations” (Brey, 2000, p. 10). Applied ethics is adopted by most of the literature on computer ethics (Brey, 2000). Data mining may appear to be morally neutral, but appearances in this case are deceiving. This paper takes an applied perspective to the ethical dilemmas that arise from the application of data mining in specific circumstances as opposed to examining the technological artifacts (i.e., the specific software and how it generates inferences and predictions) used by data miners.
MAIN THRUST
Many times, the objective of data mining is to build a customer profile based on two types of data—factual (who the customer is) and transactional (what the customer does) (Adomavicius & Tuzhilin, 2001). Often, consumers object to transactional analysis. What follows are two examples; the first (identifying successful students) creates a profile based primarily on factual data, and the second (identifying criminals and terrorists) primarily on transactional.
Computer technology has redefined the boundary between public and private information, making much more information public. Privacy is the freedom granted to individuals to control their exposure to others. A customary distinction is between relational and informational privacy. Relational privacy is the control over one’s person and one’s personal environment, and concerns the freedom to be left alone without observation or interference by others. Informational privacy is one’s control over personal information in the form of text, pictures, recordings, and so forth (Brey, 2000). Technology cannot be separated from its uses. It is the ethical obligation of any information systems (IS) professional, through whatever means he or she finds out that the data that he or she has been asked to gather or mine is going to be used in an unethical way, to act in a socially and ethically responsible manner. This might mean nothing more than pointing out why such a use is unethical. In other cases, more extreme measures may be warranted. As data mining becomes more commonplace and as companies push for even greater profits and market share, ethical dilemmas will be increasingly encountered. Ten common blunders that a data miner may cause, resulting in potential ethical or possibly legal dilemmas, are (Skalak, 2001):
1. Selecting the wrong problem for data mining.
2. Ignoring what the sponsor thinks data mining is and what it can and cannot do.
3. Leaving insufficient time for data preparation.
4. Looking only at aggregated results, never at individual records.
5. Being nonchalant about keeping track of the mining procedure and results.
6. Ignoring suspicious findings in a haste to move on.
7. Running mining algorithms repeatedly without thinking hard enough about the next stages of the data analysis.
8. Believing everything you are told about the data.
9. Believing everything you are told about your own data mining analyses.
10. Measuring results differently from the way the sponsor will measure them.
These blunders are hidden ethical dilemmas faced by those who perform data mining. In the next subsections, sample ethical dilemmas raised with respect to the application of data mining results in the public sector are examined, followed briefly by those in the private sector.
Ethics of Data Mining in the Public Sector
Identifying Successful Students

Probably the most common and well-developed use of data mining is the attraction and retention of customers. At first, this sounds like an ethically neutral application. Why not apply the concept of students as customers to the academe? When students enter college, the transition from high school for many students is overwhelming, negatively impacting their academic performance. High school is a highly structured Monday-through-Friday schedule. College requires students to study at irregular hours that constantly change from week to week, depending on the workload at that particular point in the course. Course materials are covered at a faster pace; the duration of a single class period is longer; and subjects are often more difficult. Tackling the changes in a student's academic environment and living arrangement as well as developing new interpersonal relationships is daunting for students. Identifying students prone to difficulties and intervening early with support services could significantly improve student success and, ultimately, improve retention and graduation rates. Consider the following scenario that realistically could arise at many institutions of higher education. Admissions at the institute has been charged with seeking applicants who are more likely to be successful (i.e., graduate from the institute within a five-year period). Someone suggests data mining existing student records to determine the profile of the most likely successful student applicant. With little more than this loose definition of success, a great deal of disparate data is gathered and eventually mined. The results indicate that the most likely successful applicant, based on factual data, is an Asian female whose family's household income is between $75,000 and $125,000 and who graduates in the top 25% of her high school class. Based on this result, admissions chooses to target market such high school students. Is there an ethical dilemma? What about diversity? What percentage of limited marketing funds should be allocated to this customer segment? This scenario highlights the importance of having well-defined goals before beginning the data mining process. The results would have been different if the goal were to find the most diverse student population that achieved a certain graduation rate after five years. In this case, the process was flawed fundamentally and ethically from the beginning.
Identifying Criminals and Terrorists

The key to the prevention, investigation, and prosecution of criminals and terrorists is information, often based on transactional data. Hence, government agencies increasingly desire to collect, analyze, and share information about citizens and aliens. However, according to Rep. Curt Weldon (R-PA), chairman of the House Subcommittee on Military Research and Development, there are 33 classified agency systems in the federal government, but none of them link their raw data together (Verton, 2002). As Steve Cooper, CIO of the Office of Homeland Security, said, "I haven't seen a federal agency yet whose charter includes collaboration with other federal agencies" (Verton, 2002, p. 5). Weldon lambasted the federal government for failing to act on critical data mining and integration proposals that had been authored before the terrorists' attacks on September 11, 2001 (Verton, 2002). Data to be mined is obtained from a number of sources. Some of these are relatively new and unstructured in nature, such as help desk tickets, customer service complaints, and complex Web searches. In other circumstances, data miners must draw from a large number of sources. For example, the following databases represent some of those used by the U.S. Immigration and Naturalization Service (INS) to capture information on aliens (Verton, 2002):

• Employment Authorization Document System
• Marriage Fraud Amendment System
• Deportable Alien Control System
• Reengineered Naturalization Application Casework System
• Refugees, Asylum, and Parole System
• Integrated Card Production System
• Global Enrollment System
• Arrival Departure Information System
• Enforcement Case Tracking System
• Student and Schools System
• General Counsel Electronic Management System
• Student Exchange Visitor Information System
• Asylum Prescreening System
• Computer-Linked Application Information Management System (two versions)
• Non-Immigrant Information System
There are islands of excellence within the public sector. One such example is the U.S. Army's Land Information Warfare Activity (LIWA), which is credited with "having one of the most effective operations for mining publicly available information in the intelligence community" (Verton, 2002, p. 5). Businesses have long used data mining. However, recently, governmental agencies have shown growing interest in using "data mining in national security initiatives" (Carlson, 2003, p. 28). Two government data mining projects, the latter renamed by the euphemism "factual data analysis," have been under scrutiny (Carlson, 2003). These projects are the U.S. Transportation Security Administration's (TSA) Computer Assisted Passenger Prescreening System II (CAPPS II) and the Defense Advanced Research Projects Agency's (DARPA) Total Information Awareness (TIA) research project (Gross, 2003). TSA's CAPPS II will analyze the name, address, phone number, and birth date of airline passengers in an effort to detect terrorists (Gross, 2003). James Loy, director of the TSA, stated to Congress that, with CAPPS II, the percentage of airplane travelers going through extra screening is expected to drop significantly from the 15% that undergo it today (Carlson, 2003). Decreasing the number of false positive identifications will shorten lines at airports. TIA, on the other hand, is a set of tools to assist agencies such as the FBI with data mining. It is designed to detect extremely rare patterns. The program will include terrorism scenarios based on previous attacks, intelligence analysis, "war games in which clever people imagine ways to attack the United States and its deployed forces," testified Anthony Tether, director of DARPA, to Congress (Carlson, 2003, p. 22). When asked how DARPA will ensure that personal information caught in TIA's net is correct, Tether stated that "we're not the people who collect the data. We're the people who supply the analytical tools to the people who collect the data" (Gross, 2003, p. 18). "Critics of data mining say that while the technology is guaranteed to invade personal privacy, it is not certain to enhance national security. Terrorists do not operate under discernable patterns, critics say, and therefore the technology will likely be targeted primarily at innocent people" (Carlson, 2003, p. 22). Congress voted to block funding of TIA. But privacy advocates are concerned that the TIA architecture, dubbed "mass dataveillance," may be used as a model for other programs (Carlson, 2003). Systems such as TIA and CAPPS II raise a number of ethical concerns, as evidenced by the overwhelming opposition to these systems. One system, the Multistate Anti-TeRrorism Information EXchange (MATRIX), represents how data mining has a bad reputation in the public sector. MATRIX is self-defined as "a pilot effort to increase and enhance the exchange of sensitive terrorism and other criminal activity information between local, state, and federal law enforcement agencies" (matrixat.org, accessed June 27, 2004). Interestingly, MATRIX states explicitly on its Web site that it is not a data-mining application, although the American Civil Liberties Union (ACLU) openly disagrees. At the very least, the perceived opportunity for creating ethical dilemmas and ultimately abuse is something the public is very concerned about, so much so that the project felt that the disclaimer was needed. Due to the extensive writings on data mining in the private sector, the next subsection is brief.
Ethics of Data Mining in the Private Sector Businesses discriminate constantly. Customers are classified, receiving different services or different cost structures. As long as discrimination is not based on protected characteristics such as age, race, or gender, discriminating is legal. Technological advances make it possible to track in great detail what a person does. Michael Turner, executive director of the Information Services Executive Council, states, “For instance, detailed consumer information lets apparel retailers market their products to consumers with more precision. But if privacy rules impose restrictions and barriers to data collection, those limitations could increase the prices consumers pay when they buy from catalog or online apparel retailers by 3.5% to 11%” (Thibodeau, 2001, p. 36). Obviously, if retailers cannot target their advertising, then their only option is to mass advertise, which drives up costs.
With this profile of personal details comes a substantial ethical obligation to safeguard this data. Ignoring any legal ramifications, the ethical responsibility is placed firmly on IS professionals and businesses, whether they like it or not; otherwise, they risk lawsuits and harming individuals. “The data industry has come under harsh review. There is a raft of federal and local laws under consideration to control the collection, sale, and use of data. American companies have yet to match the tougher privacy regulations already in place in Europe, while personal and class-action litigation against businesses over data privacy issues is increasing” (Wilder & Soat, 2001, p. 38).
FUTURE TRENDS Data mining traditionally was performed by a trained specialist, using a stand-alone package. This once nascent technique is now being integrated into an increasing number of broader business applications and legacy systems used by those with little formal training, if any, in statistics and other related disciplines. Only recently has privacy and data mining been addressed together, as evidenced by the fact that the first workshop on the subject was held in 2002 (Clifton & Estivill-Castro, 2002). The challenge of ensuring that data mining is used in an ethically and socially responsible manner will increase dramatically.
CONCLUSION Several lessons should be learned. First, decision makers must understand key strategic issues. The data miner must have an honest and frank dialog with the sponsor concerning objectives. Second, decision makers must not come to rely on data mining to make decisions for them. The best data mining is susceptible to human interpretation. Third, decision makers must be careful not to explain away with intuition data mining results that are counterintuitive. Decision making inherently creates ethical dilemmas, and data mining is but a tool to assist management in key decisions.
REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles. Computer, 34(2), 74-82.
Brey, P. (2000). Disclosive computer ethics. Computers and Society, 30(4), 10-16.
Carlson, C. (2003a). Feds look at data mining. eWeek, 20(19), 22.
Carlson, C. (2003b). Lawmakers will drill down into data mining. eWeek, 20(13), 28.
Clifton, C., & Estivill-Castro, V. (Eds.). (2002). Privacy, security and data mining. Proceedings of the IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan.
Edelstein, H. (2003). Description is not prediction. DM Review, 13(3), 10.
Feenberg, A. (1999). Questioning technology. London: Routledge.
Fule, P., & Roddick, J.F. (2004). Detecting privacy and ethical sensitivity in data mining results. Proceedings of the 27th Conference on Australasian Computer Science, Dunedin, New Zealand.
Gross, G. (2003). U.S. agencies defend data mining plans. ComputerWorld, 37(19), 18.
Sclove, R. (1995). Democracy and technology. New York: Guilford Press.
Scott, M.D. (2002). Can data mining be a crime? CIO Insight, 1(10), 65.
Skalak, D. (2001). Data mining blunders exposed! 10 data mining mistakes to avoid making today. DB2 Magazine, 6(2), 10-13.
Thibodeau, P. (2001). FTC examines privacy issues raised by data collectors. ComputerWorld, 35(13), 36.
Verton, D. (2002a). Congressman says data mining could have prevented 9-11. ComputerWorld, 36(35), 5.
Verton, D. (2002b). Database woes thwart counterterrorism work. ComputerWorld, 36(49), 14.
Wilder, C., & Soat, J. (2001). The ethics of data. Information Week, 1(837), 37-48.
Winner, L. (1980). Do artifacts have politics? Daedalus, 109, 121-136.

KEY TERMS

Applied Ethics: The study of a morally controversial practice, whereby the practice is described and analyzed, and moral principles and judgments are applied, resulting in a set of recommendations.

Ethics: The study of the general nature of morals and values as well as specific moral choices; it also may refer to the rules or standards of conduct that are agreed upon by cultures and organizations that govern personal or professional conduct.

Factual Data: Data that include demographic information such as name, gender, and birth date. It also may contain information derived from transactional data such as someone's favorite beverage.

Factual Data Analysis: Another term for data mining, often used by government agencies. It uses both factual and transactional data.

Informational Privacy: The control over one's personal information in the form of text, pictures, recordings, and such.

Mass Dataveillance: Suspicion-less surveillance of large groups of people.

Relational Privacy: The control over one's person and one's personal environment.

Transactional Data: Data that contains records of purchases over a given period of time, including such information as date, product purchased, and any special requests.
Ethnography to Define Requirements and Data Model Gary J. DeLorenzo Robert Morris University, USA
INTRODUCTION Ethnographic research offers an orientation to understand the process and structure of a social setting and employs research techniques consistent with this orientation. The ethnographic study is rooted in gaining an understanding of cultural knowledge used to interpret the experience and behavior patterns of the researched participants. The ethnographic method aids the researcher in identifying descriptions and in interpreting language, terms, and meanings as the basis of defining user requirements for information systems. In the business sector, enterprises that use thirdparty application providers to support day-to-day operations have limited independence to access, review, and analyze the data needed for strategic and tactical decision making. Because of security and firewall protection controls maintained by the provider, enterprises do not have direct access to the data needed to build analytical reporting solutions for decision support. This article reports a unique methodology used to define the conceptual model of one enterprise’s (PPG Industries, Inc.) business needs to reduce working capital when using application service providers (ASPs).
BACKGROUND The Methodology Ethnographic research is the study of both explicit and tacit knowledge within a culture. The technique is based to gain an understanding of the acquired knowledge people use to interpret experience and generate behavior. Whereas explicit cultural knowledge can be communicated at a conscious level and with relative ease, tacit cultural knowledge remains largely outside of people’s awareness (Creswell, 1994). The ethnographic technique aids the researcher in identifying cultural descriptions and uncovering meaning to language within an organization. From my 12 years of experience at PPG Industries, Inc. (PPG), a U.S.-based
$8 billion per year manufacturing company specializing in coatings, chemicals, glass and fiberglass, the normal workflow process to define user requirements for decision support is through a quick, low cost, rapid application development approach. To gain an in-depth understanding of the culture and knowledge base of the Credit Services Department, the ethnographic research technique was selected to construct an understanding to questions such as “why do they do what they do?” and more importantly for me, “what reporting solutions are needed to understand customer payments so that actions can be prioritized accordingly?” The ability to answer the second question was critical in order to define the conceptual model and information system specifications based on Credit’s user requirements. While using a third-party vendor solution thorough an ASP model, another challenge was in gaining access to data that was not readily accessible (Boyd, 2001).
Why the Ethnographic Technique? Ethnographic research is based on understanding the language, terms, and meaning within a culture (Agar, 1980). Ethnography is an appealing approach, because it promotes an understanding of process, terms and meaning, and beliefs of the people in a particular culture. Interviews, discussions, participant observation, and social interaction would help me attain an understanding of a culture. As opposed to the quantitative approach, which reflects on researching a situation, problem, or hypothesis at a given point in time, the qualitative approach using ethnographic techniques is more of an evolutionary process over time, which covers months, if not years, for the researcher. The time consideration is important because of the ever changing and evolving environment of the enterprise. New business plans, direction, and project activities frequently change within a business. Emerging technology processes such as RAD (rapid application development) to streamline the process of developing and implementing information systems continue to change the way information systems are created.
MAIN THRUST

Cultural Setting and Contextual Baseline

To gain an understanding of what the Credit Services Department actually does, an overview of the specific department goals, objectives, and responsibilities is noted in this section, called the contextual baseline, which uses the native, emic terms of the department. A contextual baseline identifies the driving motivators of Credit Services to understand what people do and how they accomplish their jobs. The General Credit Manager noted how the Credit Services Department is viewed as the "financial conscience of PPG in the relationship between sales and receiving payments against those sales by our customers" (Camsuzou, 2001). The Credit Services Department monitors credit ratings to ensure that customers have the financial stability to pay invoices while also tracking and researching with outside banking services on the application of payments against invoices.
Field Work The interview process included different informants throughout PPG with informants from strategic associates (business unit directors), tactical members (credit managers, business unit managers), and operational members (users with the business units). Fourteen individuals from throughout the enterprise were interviewed as key informants. Also, general observations were captured in the field notebook by sitting and working on a daily basis in the Credit Department area (over 50 employees) from June, 2001 through December, 2001. Geertz (1973) suggests that qualitative research often is presented as working “up from data… [whereas] in ethnography, the techniques and procedures [do not] define the enterprise. … [what defines it] is the intellectual effort [borrowed] from Ryle called thick descriptions” (p 6). Thick descriptions are a reference to getting an understanding and insight into a culture through the uncovering of facts, at a very detailed level. It’s the ability to gain a truer, real understanding of a culture through an interpretation of language, symbols, and signs. Geertz (1973) summarized that the challenge of the ethnographer “does not rest on the ability to capture facts and carry them home like a mask or carving, but on a degree to which he is able to clarify what goes on in such places, to reduce the puzzlement [where] unfamiliar acts naturally give rise [to meaning]” (p. 17). Ethnography is a technique based on a theory where the understanding of a culture can be gained through the analysis of the cumulative, unstructured, and disconnected findings captured during the 460
field research. These findings are assembled into a coherent sequence of bold, structured, and related stories.
Initial Themes

According to Spradley (1979), the next step is the identification of initial themes. The themes are thin descriptions of high-level, overview insights toward a culture. Geertz (1973) states that "systematic modes of assessment [are needed in] interpretative approaches. For a field of study to assert itself as a science, … the conceptual structure of a cultural interpretation should be formulable" (p. 24). He continues to state that theoretical formulations tend to start as general interpretations, but the scientific, theoretical process pushing forward "should [include] the essential task of theory building, not to codify abstract regularities, but to make thick descriptions possible" (p. 26). This theory-building approach, from an overview perspective, helped me to define the initial themes of the Credit Services Department. After analyzing the transcribed interviews, field notes, and results from categorizing the field-work terms, three general themes began to surface. The themes, also called categories by Spradley (1979), are the following: (1) At the enterprise level of PPG, there is a strategic mission to reduce working capital through improved cash flow. (2) At the functional level for both the Credit Services Department and the business units, there is a tactical vision to decrease daily sales outstanding. (3) In pursuing the first two themes, information systems are used to track customer payment patterns and to monitor incoming cash flow.
Domain Analysis For Spradley (1979), a domain structure includes the “terms that belong to a category …. and the semantic relationship of terms and categories [that] are linked together”(p. 100). The process of domain identification included contexualizing the terms and categories, drawing overview diagrams of terms and categories, and identifying how the terms and categories related to each other. Spradley (1979) summarized domain analysis as the process “to gain a better understanding, from the gathered field data, the semantic relationships of terms and categories” (p. 107). Further review of the language to identify the thick descriptions and rich points for PPG to reduce working capital and decrease sales outstanding resulted in defining 43 critical and essential terms. These terms were used as a baseline to define the domain analysis. The semantic relationships are links between two categories that have boundaries either inside or outside of the domain. The domain analysis identifies the relationships among categories.
The domain analysis concept also is used in information systems to show how data relate to each other. Peter Chen (1976) proposed the Entity-Relationship (ER) model as a way to define and unify relational database views; it can be found in popular relational database management systems such as Oracle and Microsoft SQL Server. In the entity-relationship model, tables are shown as entities, data fields are shown as attributes, and the line shows how the entities and attributes relate to each other. Entities are categories such as a person, place, or thing. The domain analysis identified in Figure 1 is based on the initial themes of reducing working capital and decreasing sales outstanding, where rectangles identify primary domain entities for both the internal PPG associates and external PPG customers.

Figure 1. Domain analysis (DeLorenzo, 2001)

Taxonomic Analysis

Further analysis of the fields in the domain analysis included a review of the specific terms that were considered critical because of the frequency and consistency with which these terms were used in the language. These terms eventually became the baseline data field attributes required for the decision support system model. A taxonomic analysis is a technique used to capture the essential terms, also called fields, in a hierarchical fashion to define not only the relationship among the fields but also to understand what terms are a subset meaning to a grander, higher term. The design of the taxonomic analysis (see Figure 2) is based on a spiraling concept where all of the terms and relationships defined previously are dependent on the central theme; that is, to get dollars for sales through reducing working capital, which is noted in the center of the drawing. The taxonomic analysis is broken into five component areas of who, what, why, when and how. Each area identifies the term, relationship, and hierarchical structure of each term within the area. Structured questions allow the researcher to find out how informants have organized their knowledge. A review of the transcribed words from the interviews, additional notes, and the categorized terms from the field work began to identify key, structured questions that were important to the Credit Services Department. These in-depth, structured, thick-description questions began to emerge out of the terms and categories used in the language.

Componential Analysis

Spradley (1979) states that the development research sequence evolves through locating informants, conducting interviews, and collecting the terms and symbols used for language and communication. This research included the domain analysis process with an overview of terms that make up categories and the relationship among categories. The domain analysis provided an overview of the language in the cultural scene. The technique of taxonomic analysis identified relationships and differences among terms within each domain. The next technique in the process included searching and organizing the attributes associated with cultural terms and categories. Attributes are specific terms that have significant meaning in the language. This process is called componential analysis. Spradley (1979) states that, in drafting a schematic diagram on domains, "This thinking process is one of the best strategies for discovering cultural themes" (p. 199). The schematic diagram in Figure 3 identifies the one central theme that motivates Credit Services and this research project, and is based on one principle—understanding customer payment behavior.
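As a small illustration of the entity-relationship view used in the domain analysis above, the sketch below encodes a few entities and attributes of the kind shown in Figure 1 in Python; the entity and field names are illustrative guesses, not the article's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Entities correspond to categories (a person, place, or thing);
# attributes correspond to the data fields attached to them.

@dataclass
class Customer:                      # external PPG customer (illustrative)
    customer_id: str
    parent_headquarters: str
    credit_limit: float

@dataclass
class Invoice:                       # presented for payment, may be disputed
    invoice_number: str
    customer_id: str                 # relationship back to Customer
    amount: float
    invoice_date: str
    due_date: str
    disputed: bool = False

@dataclass
class CashApplication:               # payment applied against invoices
    payment_date: str
    amount: float
    applied_invoices: List[str] = field(default_factory=list)
```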
Figure 2. Taxonomic analysis (DeLorenzo, 2001)
Figure 3. Cultural themes (DeLorenzo, 2002)
FUTURE TRENDS

Requirements Definition

The final effort in the ethnography included the drafting of the decision support system model to assist PPG in reducing working capital. The model is based on evolutionary changes in computer architecture over the past 40 years that businesses have taken to manage data. It addresses the information system needs at PPG to better understand customer payment behavior (Craig & Jutla, 2001; Forcht & Cochran, 1999; Inmon, 1996). The accumulation of the domain analysis and important entities, the taxonomic analysis with critical categories and attributes, and the componential analytical efforts identifying cultural themes brought me to my last activity. The final stage in the ethnographic process was to define the decision support system model to assist PPG in tracking customer payments. The model was based on using an information systems-based approach to capture the data required to understand customer payment behavior and to provide trend analysis capabilities to gain knowledge and insight from that understanding.
CONCLUSION
An analysis of a culture's language and communication mechanisms is an important area to understand in building information systems models. The research included the gathering of data requirements for the Credit Services Department at PPG by living and working within the researched community to gain a better understanding of its problem-solving and reporting needs to monitor customer payment behavior. The data requirements became the baseline prerequisite in defining the conceptual decision support model and database schema for the Credit Services Department's decision support project. In 2003, the conceptual model and schema were used by PPG's Information Technology Department to develop an application system to assist the Credit Services Department in tracking and managing customer payment behavior.

REFERENCES

Agar, M. (1980). The professional stranger: An informal introduction to ethnography. San Diego, CA: Academic Press, Inc.
Boyd, J. (2001, April 16). Think ASPs make sense. Internet Week, 48.
Camsuzou, C. (2001). The ecommerce vision of the credit services department. PPG Industries IT Strategy, 15-17.
Chen, P. (1976). The entity-relationship model—Toward a unified view of data. ACM Transactions on Database Systems, 1(1), 9-36.
Craig, J., & Jutla, D. (2001). eBusiness readiness: A customer-focused framework. Upper Saddle River, NJ: Addison Wesley Publishers.
Creswell, J. (1994). Research design: Qualitative and quantitative approaches. Thousand Oaks, CA: Sage Publications.
Forcht, K.A., & Cochran, K. (1999). Using data mining and data warehousing techniques. Industrial Management and Data Systems, 189-196.
Geertz, C. (1973). Emphasizing interpretation from the interpretation of cultures. New York: Basic Books, Inc.
Inmon, W. (1996). Building the data warehouse. New York: Wiley Computer Publishing.
Spradley, J.P. (1979). The ethnographic interview. Orlando, FL: Harcourt Brace.

KEY TERMS

Application Solution Providers: Third-party vendors who provide data center, telecommunications, and application options for major companies.

Business Intelligence: Synonym for online analytical processing system.

Data Mining: Decision support technique that can predict future trends and behaviors based on past historical data.

Data Warehouse: The capturing of attributes from multiple databases in a centralized repository to assist in the decision-making and problem-solving needs throughout the organization.

Decision Support: Evolutionary step in the 1990s with characteristics to review retrospective, dynamic data. OLAP is an example of an enabling technology in this area.

Dimensions: Static, fact-oriented information used in decision support to drill down into measurements.

Domain Analysis Technique: Search for the larger units of cultural knowledge called domains, a synonym for person, place, or thing. Used to gain an understanding of the semantic relationships of terms and categories.

Drill Down: User interface technique to navigate into lower levels of information in decision support systems.

Ethnography: The work of describing a culture. The essential core aims to understand another way of life from the native point of view.

Measurements: Dynamic, numeric values associated with dimensions found through drilling down into lower levels of detail within decision support systems.

Online Analytical Processing Systems (OLAP): Technology-based solution with data delivered at multiple dimensions to allow drilling down at multiple levels.

Requirements: Specifics based on defined criterion used as the basis for information systems design.

Systems Development Life Cycle: A controlled, phased approach in building information systems from understanding wants, defining specifications, designing and coding the system through implementing the final solution.
Evaluation of Data Mining Methods Paolo Giudici University of Pavia, Italy
INTRODUCTION

Several classes of computational and statistical methods for data mining are available. Each class can be parameterised so that models within the class differ in terms of such parameters (see, for instance, Giudici, 2003; Hastie et al., 2001; Han & Kamber, 2001; Hand et al., 2001; Witten & Frank, 1999): for example, the class of linear regression models, which differ in the number of explanatory variables; the class of Bayesian networks, which differ in the number of conditional dependencies (links in the graph); the class of tree models, which differ in the number of leaves; and the class of multilayer perceptrons, which differ in terms of the number of hidden strata and nodes. Once a class of models has been established, the problem is to choose the "best" model from it.
BACKGROUND A rigorous method to compare models is statistical hypothesis testing. With this in mind one can adopt a sequential procedure that allows a model to be chosen through a sequence of pairwise test comparisons. However, we point out that these procedures are generally not applicable in particular to computational data mining models, which do not necessarily have an underlying probabilistic model and, therefore, do not allow the application of statistical hypotheses testing theory. Furthermore, it often happens that for a data problem it is possible to use more than one type of model class, with different underlying probabilistic assumptions. For example, for a problem of predictive classification it is possible to use both logistic regression and tree models as well as neural networks. We also point out that model specification and, therefore, model choice is determined by the type of variables used. These variables can be the result of transformations or of the elimination of observations, following an exploratory analysis. We then need to compare models based on different sets of variables present at the start. For example, how do we compare a linear model with the original explanatory variables with one with a set of transformed explanatory variables?
The previous considerations suggest the need for a systematic study of the methods for comparison and evaluation of data mining models.
MAIN THRUST

Comparison criteria for data mining models can be classified schematically into: criteria based on statistical tests, criteria based on scoring functions, computational criteria, Bayesian criteria, and business criteria.
Criteria Based on Statistical Tests

The first are based on the theory of statistical hypothesis testing and, therefore, there is a lot of detailed literature related to this topic. See, for example, a text about statistical inference, such as Mood, Graybill, & Boes (1991) and Bickel & Doksum (1977). A statistical model can be specified by a discrete probability function or by a probability density function, f(x). Such a model is usually left unspecified, up to unknown quantities that have to be estimated on the basis of the data at hand. Typically, the observed sample is not sufficient to reconstruct each detail of f(x), but can indeed be used to approximate f(x) with a certain accuracy. Often a density function is parametric, so that it is defined by a vector of parameters Θ = (θ1, …, θI), such that each value θ of Θ corresponds to a particular density function, pθ(x). In order to measure the accuracy of a parametric model, one can resort to the notion of distance between a model f, which underlies the data, and an approximating model g (see, for instance, Zucchini, 2000). Notable examples of distance functions are, for categorical variables: the entropic distance, which describes the proportional reduction of the heterogeneity of the dependent variable; the chi-squared distance, based on the distance from the case of independence; and the 0-1 distance, which leads to misclassification rates. For quantitative variables, the typical choice is the Euclidean distance, representing the distance between two vectors in a Cartesian space. Another possible choice is the uniform distance, applied when nonparametric models are being used. Any of the previous distances can be employed to define the notion of discrepancy of a statistical model.
The discrepancy of a model, g, can be obtained by comparing the unknown probabilistic model, f, and the best parametric statistical model. Since f is unknown, closeness can be measured with respect to a sample estimate of the unknown density f. A common choice of discrepancy function is the Kullback-Leibler divergence, which can be applied to any type of observations. In such a context, the best model can be interpreted as that with a minimal loss of information from the true unknown distribution. It can be shown that the statistical tests used for model comparison are generally based on estimators of the total Kullback-Leibler discrepancy; the most commonly used is the log-likelihood score. Statistical hypothesis testing is based on subsequent pairwise comparisons of log-likelihood scores of alternative models. Hypothesis testing allows one to derive a threshold below which the difference between two models is not significant and, therefore, the simpler models can be chosen. Therefore, with statistical tests it is possible to make an accurate choice among the models. The defect of this procedure is that it allows only a partial ordering of models, requiring a comparison between model pairs and, therefore, with a large number of alternatives it is necessary to make heuristic choices regarding the comparison strategy (such as choosing among the forward, backward and stepwise criteria, whose results may diverge). Furthermore, a probabilistic model must be assumed to hold, and this may not always be possible.
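A minimal numerical sketch of the Kullback-Leibler divergence for discrete distributions, the quantity whose sample estimate underlies the log-likelihood score discussed above, follows; the example distributions are invented for illustration.

```python
import numpy as np

def kullback_leibler(f, g):
    """KL divergence KL(f || g) between two discrete distributions,
    interpreted as the information lost when g approximates f."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    mask = f > 0                      # terms with f(x) = 0 contribute nothing
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

# The "true" distribution f and two candidate models g1, g2.
f  = [0.5, 0.3, 0.2]
g1 = [0.4, 0.4, 0.2]
g2 = [0.1, 0.1, 0.8]
print(kullback_leibler(f, g1))   # smaller divergence: g1 is the closer model
print(kullback_leibler(f, g2))   # larger divergence: g2 loses more information
```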
Criteria Based on Scoring Functions

A less structured approach has been developed in the field of information theory, giving rise to criteria based on score functions. These criteria give each model a score, which puts them into some kind of complete order. We have seen how the Kullback-Leibler discrepancy can be used to derive statistical tests to compare models. In many cases, however, a formal test cannot be derived. For this reason, it is important to develop scoring functions that attach a score to each model. The Kullback-Leibler discrepancy estimator is an example of such a scoring function that, for complex models, can often be approximated asymptotically. A problem with the Kullback-Leibler score is that it depends on the complexity of a model as described, for instance, by the number of parameters. It is thus necessary to employ score functions that penalise model complexity. The most important of such functions is the AIC (Akaike Information Criterion) (Akaike, 1974). From its definition, notice that the AIC score essentially penalises the log-likelihood score with a term that increases linearly with model complexity. The AIC criterion is based on the implicit assumption that q remains constant when the size of the sample increases. However, this assumption is not always valid and therefore the AIC criterion does not lead to a consistent estimate of the dimension of the unknown model. An alternative, and consistent, scoring function is the BIC criterion (Bayesian Information Criterion), also called SBC, formulated by Schwarz (1978). As can be seen from its definition, the BIC differs from the AIC only in the second part, which now also depends on the sample size n. Compared to the AIC, when n increases the BIC favours simpler models. As n gets large, the first term (linear in n) will dominate the second term (logarithmic in n). This corresponds to the fact that, for a large n, the variance term in the mean squared error expression tends to be negligible. We also point out that, despite the superficial similarity between the AIC and the BIC, the first is usually justified by resorting to classical asymptotic arguments, while the second by appealing to the Bayesian framework. To conclude, the scoring function criteria for selecting models are easy to calculate and lead to a total ordering of the models. From most statistical packages we can get the AIC and BIC scores for all the models considered. A further advantage of these criteria is that they can also be used to compare non-nested models and, more generally, models that do not belong to the same class (for instance a probabilistic neural network and a linear regression model). However, the limit of these criteria is the lack of a threshold, as well as the difficult interpretability of their measurement scale. In other words, it is not easy to determine if the difference between two models is significant or not, and how it compares to another difference. These criteria are indeed useful in a preliminary exploratory phase. To examine these criteria and to compare them with the previous ones see, for instance, Zucchini (2000), or Hand, Mannila, & Smyth (2001).
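The AIC and BIC scores can be computed directly from the maximised log-likelihood. A minimal sketch follows, using the standard definitions AIC = -2 log L + 2q and BIC = -2 log L + q log n (q = number of parameters, n = sample size); the numbers in the example are invented.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: penalises complexity linearly in q."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: the penalty also grows with log(n),
    so for large samples BIC favours simpler models than AIC."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Comparing two fitted models on the same data (lower score = preferred).
candidates = {
    "model_A": {"loglik": -1520.3, "q": 5},    # illustrative values
    "model_B": {"loglik": -1512.8, "q": 12},
}
n = 1000
for name, m in candidates.items():
    print(name, round(aic(m["loglik"], m["q"]), 1),
          round(bic(m["loglik"], m["q"], n), 1))
```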
Bayesian Criteria

A possible "compromise" between the two previous approaches is given by Bayesian criteria, which can be developed in a rather coherent way (see, e.g., Bernardo & Smith, 1994). They appear to combine the advantages of the two previous approaches: a coherent decision threshold and a complete ordering. One of the problems that may arise is connected to the absence of general-purpose software. For data mining works using Bayesian criteria the reader could see, for instance, Giudici (2001), Giudici & Castelo (2003) and Brooks et al. (2003).
Computational Criteria

The intensive, widespread use of computational methods has led to the development of computationally intensive model comparison criteria. These criteria are usually based on using a dataset different from the one being analysed (external validation) and are applicable to all the models considered, even when they belong to different classes (for example in the comparison between logistic regression, decision trees and neural networks, even when the latter two are non probabilistic). A possible problem with these criteria is that they take a long time to be designed and implemented, although general purpose software has made this task easier. The most common such criterion is based on cross-validation. The idea of the cross-validation method is to divide the sample into two sub-samples, a "training" sample, with n - m observations, and a "validation" sample, with m observations. The first sample is used to fit a model and the second is used to estimate the expected discrepancy or to assess a distance. Using this criterion the choice between two or more models is made by evaluating an appropriate discrepancy function on the validation sample. Notice that the cross-validation idea can be applied to the calculation of any distance function. One problem regarding the cross-validation criterion is in deciding how to select m, that is, the number of the observations contained in the "validation sample." For example, if we select m = n/2 then only n/2 observations would be available to fit a model. We could reduce m but this would mean having few observations for the validation sampling group and therefore reducing the accuracy with which the choice between models is made. In practice proportions of 75% and 25% are usually used, respectively, for the training and the validation samples. To summarise, these criteria have the advantage of being generally applicable but have the disadvantage of taking a long time to be calculated and of being sensitive to the characteristics of the data being examined. A way to overcome this problem is to consider model combination methods, such as bagging and boosting. For a thorough description of these recent methodologies, see Hastie, Tibshirani, & Friedman (2001).
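A minimal sketch of the 75%/25% training/validation split described above, written with NumPy only; the model interface (`fit`/`predict`) and the use of the misclassification rate as the discrepancy are assumptions for illustration.

```python
import numpy as np

def holdout_evaluate(model, X, y, validation_fraction=0.25, rng=None):
    """Fit on the training portion and measure discrepancy (here, the
    misclassification rate) on the held-out validation sample."""
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    m = int(round(validation_fraction * n))      # size of the validation sample
    valid_idx, train_idx = idx[:m], idx[m:]
    model.fit(X[train_idx], y[train_idx])        # assumed fit/predict interface
    predictions = model.predict(X[valid_idx])
    return float(np.mean(predictions != y[valid_idx]))
```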
Business Criteria One last group of criteria seem specifically tailored for the data mining field. These are criteria that compare the performance of the models in terms of their relative losses, connected to the errors of approximation made by fitting data mining models. Criteria based on loss functions have appeared recently, although related ideas are long time known in Bayesian decision theory (see, for instance, Bernardo & Smith, 1984). They have a great application potential, although at present they are mainly concerned with classification problems. For a more detailed examination of these criteria the reader can see, for example, Hand (1997), Hand, Mannila, & Smyth (2001), or the reference manuals on data mining software, such as 466
that of SAS Enterprise Miner (SAS Institute, 2004). The idea behind these methods is to focus the attention, in the choice among alternative models, to the utility of the obtained results. The best model is the one that leads to the least loss. Most of the loss function based criteria are based on the confusion matrix. The confusion matrix is used as an indication of the properties of a classification rule. On its main diagonal it contains the number of observations that have been correctly classified for each class. The off-diagonal elements indicate the number of observations that have been incorrectly classified. If it is assumed that each incorrect classification has the same cost, the proportion of incorrect classifications over the total number of classifications is called rate of error, or misclassification error, and it is the quantity that must be minimised. The assumption of equal costs can be replaced by weighting errors with their relative costs. The confusion matrix gives rise to a number of graphs that can be used to assess the relative utility of a model, such as the Lift Chart, and the ROC Curve (see Giudici, 2003). The lift chart puts the validation set observations, in increasing or decreasing order, on the basis of their score, which is the probability of the response event (success), as estimated on the basis of the training set. Subsequently, it subdivides such scores in deciles. It then calculates and graphs the observed probability of success for each of the decile classes in the validation set. A model is valid if the observed success probabilities follow the same order (increasing or decreasing) as the estimated ones. Notice that, in order to be better interpreted, the lift chart of a model is usually compared with a baseline curve, for which the probability estimates are drawn in the absence of a model, that is, taking the mean of the observed success probabilities. The ROC (Receiver Operating Characteristic) curve is a graph that also measures predictive accuracy of a model. It is based on four conditional frequencies that can be derived from a model, and the choice of a cut-off points for its scores: a) the observations predicted as events and effectively such (sensitivity); b) the observations predicted as events and effectively non-events; c) the observations predicted as non-events and effectively events; d) the observations predicted as nonevents and effectively such (specificity). The ROC curve is obtained representing, for any fixed cut-off value, a point in the plane having as x-value the false positive value (1-specificity) and as y-value the sensitivity value. Each point in the curve corresponds therefore to a particular cut-off. In terms of model comparison, the best curve is the one that is leftmost, the ideal one coinciding with the y-axis.
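A minimal sketch of the quantities described above — the confusion counts for a chosen cut-off, the sensitivity/specificity pair behind each ROC point, and a simple decile lift table; the scores and labels are invented for illustration.

```python
import numpy as np

def confusion_at_cutoff(scores, labels, cutoff):
    """2x2 confusion counts for a given cut-off on the estimated success scores."""
    pred = scores >= cutoff
    tp = int(np.sum(pred & (labels == 1)))
    fp = int(np.sum(pred & (labels == 0)))
    fn = int(np.sum(~pred & (labels == 1)))
    tn = int(np.sum(~pred & (labels == 0)))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    return (tp, fp, fn, tn), sensitivity, specificity

def decile_lift(scores, labels, n_bins=10):
    """Observed success rate per score decile (highest scores first),
    compared with the overall baseline rate."""
    order = np.argsort(scores)[::-1]
    bins = np.array_split(order, n_bins)
    baseline = labels.mean()
    return [(labels[b].mean(), labels[b].mean() / baseline) for b in bins]

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(labels * 0.3 + rng.random(1000) * 0.7, 0, 1)  # noisy scores
print(confusion_at_cutoff(scores, labels, 0.5))
print(decile_lift(scores, labels)[:3])
```

Sweeping the cut-off and plotting (1 - specificity, sensitivity) for each value traces the ROC curve described in the text.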
To summarise, criteria based on loss functions have the advantage of being easy to understand but, on the other hand, they still need formal improvements and mathematical refinements.
FUTURE TRENDS
It is well known that data mining methods can be classified into exploratory, descriptive (or unsupervised), predictive (or supervised), and local (see, e.g., Hand et al., 2001). Exploratory methods are preliminary to the others and, therefore, do not need a performance measure. Predictive problems, on the other hand, are the setting where model comparison methods are most needed, mainly because of the abundance of available models. All the criteria presented here can be applied to predictive models, which is an important aid for model choice. For descriptive models aimed at summarising variables, such as clustering methods, the evaluation of the results typically proceeds on the basis of the Euclidean distance, leading to the R² index. We remark that it is important to examine, for each variable in the dataset, the ratio between the "between" and "total" sums of squares that leads to R²; this gives a variable-specific measure of the goodness of the cluster representation. Finally, it is difficult to assess local models, such as association rules, for the simple reason that a global evaluation measure contradicts the very notion of a local model. The idea that prevails in the literature is to measure the utility of patterns in terms of how interesting or unexpected they are to the analyst. As measures of interest one can consider, for instance, the support, the confidence, and the lift. The first can be used to assess the importance of a rule in terms of its frequency in the database; the second can be used to investigate possible dependences between variables; finally, the lift can be employed to measure the distance from the situation of independence.
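As an illustration of the three interestingness measures just mentioned, the following sketch (Python; illustrative only, not from the article, with made-up transactions) computes support, confidence, and lift for a single association rule A → B.

# Illustrative sketch (not from the article): support, confidence, and lift
# for an association rule A -> B over a list of transactions (sets of items).

def rule_measures(transactions, antecedent, consequent):
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_b = sum(1 for t in transactions if consequent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ab / n                             # frequency of the rule in the database
    confidence = n_ab / n_a if n_a else 0.0        # estimate of P(B | A)
    lift = confidence / (n_b / n) if n_b else 0.0  # distance from independence (lift = 1)
    return support, confidence, lift

baskets = [{"milk", "bread"}, {"milk", "butter"}, {"bread", "butter"},
           {"milk", "bread", "butter"}, {"bread"}]
print(rule_measures(baskets, {"milk"}, {"bread"}))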
CONCLUSION
The evaluation of data mining methods requires a great deal of attention. A valid model evaluation and comparison can improve considerably the efficiency of a data mining process. We have presented several ways to perform model comparison; each has its advantages and disadvantages. Choosing which of them is most suited for a particular application depends on the specific problem and on the resources (e.g., computing tools and time) available. Furthermore, the choice must also be based upon the kind of usage of the final results. This implies that, for example, in a business setting, comparison criteria based on business quantities are extremely useful.
REFERENCES
Akaike, H. (1974). A new look at statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian theory. New York: Wiley.
Bickel, P.J., & Doksum, K.A. (1977). Mathematical statistics. New York: Prentice Hall.
Brooks, S.P., Giudici, P., & Roberts, G.O. (2003). Efficient construction of reversible jump MCMC proposal distributions. Journal of the Royal Statistical Society, Series B, 1, 1-37.
Castelo, R., & Giudici, P. (2003). Improving Markov chain model search for data mining. Machine Learning, 50, 127-158.
Giudici, P. (2001). Bayesian data mining, with application to credit scoring and benchmarking. Applied Stochastic Models in Business and Industry, 17, 69-81.
Giudici, P. (2003). Applied data mining. London: Wiley.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. New York: Morgan Kaufmann.
Hand, D. (1997). Construction and assessment of classification rules. London: Wiley.
Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. New York: MIT Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.
Mood, A.M., Graybill, F.A., & Boes, D.C. (1991). Introduction to the theory of statistics. Tokyo: McGraw Hill.
SAS Institute Inc. (2004). SAS enterprise miner reference manual. Cary: SAS Institute.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Witten, I., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. New York: Morgan Kaufmann.
Zucchini, W. (2000). An introduction to model selection. Journal of Mathematical Psychology, 44, 41-61.
KEY TERMS

0-1 Distance: The 0-1 distance between a vector of predicted values, X_g, and a vector of observed values, X_f, is
d_{0-1} = \sum_{r=1}^{n} 1(X_{fr}, X_{gr}),
where 1(w, z) = 1 if w = z and 0 otherwise.

AIC Criterion: The AIC criterion is defined by the following equation:
AIC = -2 \log L(\hat{\vartheta}; x_1, \ldots, x_n) + 2q,
where \log L(\hat{\vartheta}; x_1, \ldots, x_n) is the logarithm of the likelihood function calculated in the maximum likelihood parameter estimate and q is the number of parameters of the model.

BIC Criterion: The BIC criterion is defined by the following expression:
BIC = -2 \log L(\hat{\vartheta}; x_1, \ldots, x_n) + q \log(n).
(A small numerical sketch of the AIC and BIC computations appears after this list.)

Chi-Squared Distance: The chi-squared distance of a distribution g from a target distribution f is
d_{\chi^2} = \sum_i \frac{(f_i - g_i)^2}{g_i}.

Discrepancy of a Model: Assume that f represents the unknown density of the population, and let g = p_\vartheta be a family of density functions (indexed by a vector of parameters \vartheta) that approximates it. Using, to exemplify, the Euclidean distance, the discrepancy of a model g with respect to a target model f is
\Delta(f, p_\vartheta) = \sum_{i=1}^{n} (f(x_i) - p_\vartheta(x_i))^2.

Entropic Distance: The entropic distance of a distribution g from a target distribution f is
d_E = \sum_i f_i \log \frac{f_i}{g_i}.

Euclidean Distance: The Euclidean distance between a vector of n predicted values, X_g, and a vector of observed values, X_f, is expressed by the equation
d(X_f, X_g) = \left[ \sum_{r=1}^{n} (X_{fr} - X_{gr})^2 \right]^{1/2}.

Kullback-Leibler Divergence: The Kullback-Leibler divergence of a parametric model p_\vartheta with respect to an unknown density f is defined by
\Delta_{K-L}(f, p_\vartheta) = \sum_i f(x_i) \log \frac{f(x_i)}{p_{\hat{\theta}}(x_i)},
where the suffix \hat{\theta} indicates the values of the parameters that minimize the divergence with respect to f.

Log-Likelihood Score: The log-likelihood score is defined by
-2 \sum_{i=1}^{n} \log p_{\hat{\theta}}(x_i).

Uniform Distance: The uniform distance between two distribution functions, F and G, with values in [0, 1], is defined by
\sup_{0 \le t \le 1} |F(t) - G(t)|.
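As referenced above, here is a small numerical sketch (Python; purely illustrative, assuming a Gaussian model fitted by maximum likelihood to a toy sample) showing how the AIC and BIC formulas are evaluated once the maximised log-likelihood and the number of parameters q are known.

# Illustrative sketch (not from the article): AIC and BIC from a maximised
# log-likelihood, here for a Gaussian model fitted to a small sample.
import math

def aic(log_likelihood, q):
    return -2.0 * log_likelihood + 2 * q

def bic(log_likelihood, q, n):
    return -2.0 * log_likelihood + q * math.log(n)

x = [4.1, 3.8, 5.0, 4.4, 4.7]
n = len(x)
mu = sum(x) / n
sigma2 = sum((v - mu) ** 2 for v in x) / n     # maximum likelihood variance estimate
loglik = sum(-0.5 * math.log(2 * math.pi * sigma2) - (v - mu) ** 2 / (2 * sigma2) for v in x)
q = 2                                          # parameters: mean and variance
print(aic(loglik, q), bic(loglik, q, n))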
Evolution of Data Cube Computational Approaches
Rebecca Boon-Noi Tan, Monash University, Australia
INTRODUCTION
Aggregation is a commonly used operation in decision support database systems. Users of decision support queries are interested in identifying trends rather than looking at individual records in isolation. Decision support system (DSS) queries consequently make heavy use of aggregations, and the ability to simultaneously aggregate across many sets of dimensions (in SQL terms, this translates to many simultaneous group-bys) is crucial for Online Analytical Processing (OLAP) or multidimensional data analysis applications (Datta, VanderMeer, & Ramamritham, 2002; Dehne, Eavis, Hambrusch, & Rau-Chaplin, 2002; Elmasri & Navathe, 2004; Silberschatz, Korth & Sudarshan, 2002).
BACKGROUND
Although aggregation functions, together with the Group-by operator (in SQL terms), have been widely used in business applications for the past few decades, common forms of data analysis such as histograms, roll-up totals and sub-totals for drill-downs, and cross-tabulations are difficult to obtain with these SQL aggregation constructs (Gray, Bosworth, Lyaman, & Pirahesh, 1996). An explanation of the three common problems, (a) histograms, (b) roll-up totals and sub-totals for drill-downs, and (c) cross-tabulation, is presented next. The first problem is that the standard SQL Group-by operator does not allow a direct construction of histograms (aggregation over computed categories). Not all SQL systems, including the standard, directly support histograms; in standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated. The cube-by operator, in contrast, allows a direct construction: the histogram can be easily obtained from the data cube computed over the raw data. The histogram on the right-hand side of Figure 1 shows the sales amounts, as well as the sub-totals and total, for a range of products over a range of time-frames at a particular location.
Figure 1. An example showing how a histogram is formed from results obtained from a data cube
Figure 2. An example of cross-tabulation from a data cube
The second problem relates to roll-up totals and sub-totals for drill-downs. Reports commonly aggregate data first at a coarse level and then at successively finer levels. This type of report is difficult to produce with the normal SQL constructs, whereas the cube-by operator can present roll-up totals and sub-totals for drill-downs easily. The third problem relates to cross-tabulation, which is difficult to construct with the current SQL standard. The symmetric aggregation result is a cross-tabulation table, or cross-tab for short (known as a pivot table in spreadsheets). Using the cube-by operator, cross-tab data can readily be obtained and routinely displayed in the more compact format shown in Figure 2. This cross-tab is a two-dimensional aggregation (within the red dotted line); if we add another location, such as L002, it becomes a three-dimensional aggregation. In summary, representing aggregate data in a relational data model with standard SQL can be a difficult and daunting task. A six-dimensional cross-tab requires a 64-way union of 64 different Group-by operators to build the underlying representation. This is an important reason why Group-bys alone are inadequate: the resulting representation of aggregation is too complex for optimal analysis.
MAIN THRUST

Birth of the Cube-By Operator
To overcome the difficulty with these SQL aggregation constructs, Gray et al. (1996) proposed using data cube
operators (also known as cube-by) to conveniently support such aggregates. The data cube is identified as the core operator in data warehousing and OLAP (Lakshmanan, Pei & Zhao, 2003). The cube-by operator computes the group-bys corresponding to all possible combinations of a list of attributes. An example of a data cube query is as follows:
SELECT Product, Year, City, SUM(amount)
FROM Sales
CUBE BY Product, Year, City
The above query produces the SUM of amount of all tuples in the database according to the 7 group-bys (Product, Year, City), (Product, Year), (Product, City), (Year, City), (Product), (Year), (City). The 8th group-by is denoted ALL; it contains an empty attribute list, so as to make all group-by results union compatible. In general, a cube-by on three attributes (ABC) generates eight, or 2³, group-bys: [ABC], [AB], [AC], [BC], [A], [B], [C], and [ALL]. The most straightforward way to execute the data cube query is to rewrite it as a collection of eight group-by queries and execute them separately, as shown in Figure 3. This means that the eight group-by queries access the raw data eight times, which is likely to be quite expensive in execution time. If the number of dimension attributes increases, computing the data cube becomes very expensive, because the required computation cost grows exponentially with the number of dimension attributes: N dimension attributes in the cube-by clause form 2^N group-bys. However, there are a number of ways in which this simple solution can be improved.
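The following sketch (Python; not from the article, with made-up data and names) makes the straightforward strategy explicit: it enumerates all 2^N combinations of the cube-by attributes and computes SUM(amount) for each group-by with a separate scan of the raw data, which is exactly the repeated-scan cost discussed above.

# Illustrative sketch (not from the article): the naive evaluation of a cube-by,
# computing SUM(amount) for every subset of {Product, Year, City}.
from itertools import combinations

sales = [
    {"Product": "P1", "Year": 2004, "City": "Melbourne", "amount": 10},
    {"Product": "P1", "Year": 2005, "City": "Melbourne", "amount": 5},
    {"Product": "P2", "Year": 2004, "City": "Sydney", "amount": 7},
]
dims = ("Product", "Year", "City")

cube = {}
for k in range(len(dims), -1, -1):               # from (Product, Year, City) down to ALL
    for group in combinations(dims, k):
        totals = {}
        for row in sales:                        # one scan of the raw data per group-by
            key = tuple(row[d] for d in group)
            totals[key] = totals.get(key, 0) + row["amount"]
        cube[group if group else ("ALL",)] = totals

for group, totals in cube.items():
    print(group, totals)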
Figure 3. A straightforward way to execute each of the group-bys separately
Figure 4. A lattice of the cube-by operator
Lattice Framework
Figure 4 shows the pattern of the eight group-bys for a cube-by with three attributes. One unique feature of the cube-by computation is that the computations of the various group-bys are not independent of each other, but are closely related, as some of them can be computed using others. Figure 5 shows a simple example where the results of A and B are both derived from the group-by on the two attributes AB (T# and P#). Why is AB chosen in preference to AC or BC in order to get both the results of A and B? A number of factors need to be considered when choosing the best candidate from which to obtain the results of the other group-bys. Table 1 shows how complex this can become, especially when the number of attributes increases.
Harinarayan, Rajaraman, and Ullman (1996) noted that some of the group-by queries in the data cube query can be answered using the results of others. This is known as a dependence relation on queries. Consider two queries, Q1 and Q2: we say that Q1 is a subset of Q2 when Q1 can be answered using the result of Q2; in this case, Q1 is dependent on Q2. This results in the formation of a lattice diagram, as shown in Figure 4. A cuboid may or may not have a common prefix with its parent. As a simple illustration, [AB] and [AC] have the common prefix [A] with their parent [ABC], whereas there is no common prefix between [BC] and its parent [ABC]. An edge from a cuboid to its child indicates that the child cuboid can be computed from its parent.
Figure 5. An example of how A and B can be derived from AB
Table 1. An illustration of the increment in the number of paths and group-bys
Table 1 illustrates the increase in the number of paths and group-bys whenever one attribute is added. For example, three attributes [ABC] result in a 3-D cuboid level and 12 paths; adding one more attribute [ABCD] generates a 4-D cuboid level and 32 paths. When the number of attributes (Na) is small, the number of group-bys (Ng) and the number of paths (Np) are relatively similar. However, as Na increases, the number of paths grows significantly faster than the number of group-bys. For instance, when Na is equal to 2, Ng and Np are both equal to 4; when Na is equal to 3, Ng = 2³ = 8 and Np = 2³ + 4 = 12. It is important to note that, with the number of different paths available, it is difficult to decide which path should be used, since some of the group-bys can be computed using others.
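A short sketch (Python; illustrative only, not from the article) that builds the lattice of group-bys for a set of cube-by attributes and counts its parent-child edges; for three attributes it reproduces the 8 group-bys and 12 paths discussed above.

# Illustrative sketch (not from the article): building the dependence lattice
# for cube-by attributes; an edge parent -> child means the child group-by
# can be computed from its parent (it has exactly one attribute less).
from itertools import combinations

def lattice_edges(attrs):
    nodes = [frozenset(c) for k in range(len(attrs) + 1) for c in combinations(attrs, k)]
    edges = [(parent, child)
             for parent in nodes for child in nodes
             if len(child) == len(parent) - 1 and child <= parent]
    return nodes, edges

nodes, edges = lattice_edges("ABC")
print(len(nodes), "group-bys,", len(edges), "paths")   # 8 group-bys, 12 paths
for parent, child in edges:
    print("".join(sorted(parent)) or "ALL", "->", "".join(sorted(child)) or "ALL")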
Hierarchies in Lattice Diagram
In real-life environments, the dimensions of a data cube normally consist of more than one attribute, and the dimensions are organized as hierarchies of these attributes. Harinarayan, Rajaraman, and Ullman (1996) used a simple time dimension example to illustrate a hierarchy: day, month, and year (Figure 6). Hierarchies are very important, as they form the basis of two frequently used querying operations: drill-down and roll-up. Drill-down is the process of viewing data at gradually more detailed levels, while roll-up is just the opposite. More details can be found in Ramakrishnan and Gehrke (2003, p. 852). Figure 7 shows an example of a real-life data set, which can be viewed conceptually as a two-dimensional array
Figure 6. A hierarchy example of time attributes (day, week, month, year, none)
with hierarchies on the dimensions. The Li's represent stores and the Pi's represent products. Stores L1–L3 are in Churchill, while L4–L6 are in Morwell, and these roll up into two towns. Products P1–P3 are of type Cup, while products P4–P5 are of type Plate; both cup and plate are further grouped into the category Kitchenware. The x's are sales volumes; entries that are blank correspond to (product, store) combinations for which there are no sales. In summary, there are two types of query dependencies: dimension dependency and attribute dependency. Dimension dependency is present when the different dimensions interact with one another, as shown in Figures 4 and 5. Attribute dependency arises within a dimension and is caused by attribute hierarchies, as shown in Figures 6 and 7.
Optimisations in Existing Approaches
Sarawagi, Agrawal, and Megiddo (1996) adapted these methods to compute group-bys by incorporating a number of optimizations:
• Smallest-Parent: This optimization was first proposed in Gray et al. (1996). It aims at computing a group-by from the smallest (either the closest, or the smallest in size) of the previously computed parent group-bys, since each group-by can be computed from a number of other group-bys. Figures 8 and 9 show a three-attribute cube (ABC). There are several options for computing group-by A from its parent group-bys (ABC, AB, or AC): A can be computed from ABC, AB, or AC. Figure 8 shows an example where it is a better choice to compute A from the smaller, closer parent group-bys AB or AC rather than from ABC. Figure 9 shows another scenario, where one parent is smaller than the other: AC is a better choice than AB in terms of size.
Figure 7. A hierarchy example of two attributes
Figure 8. AB or AC is clearly a better choice for computing A
Figure 9. An example of different sizes of AB and AC
Figure 10. Reduction in disk I/O by computing A from AB instead of AC
Figure 11. Reduction of disk reads by computing as many group-bys as possible in memory
• Cache-Results: This optimization aims at caching (in memory) the results of a group-by from which other group-bys are computed, to reduce disk I/O. Figure 10 shows an example where AB is a better choice than AC for computing A: A can be computed while AB is still in memory, so no disk I/O cost is involved.
• Amortize-Scans: This optimization aims at amortizing disk reads by computing as many group-bys as possible together in memory. In Figure 11, as many group-bys as possible (AB, AC, and A) are computed from ABC while ABC is still in memory, so no additional disk reads are needed.
• Share-Sorts: This optimization is specific to sort-based algorithms and aims at sharing the sorting cost across multiple group-bys. Share-sorts uses data sorted in a particular order to compute all group-bys that are prefixes of that order. There is no need to re-sort for the subsequent group-bys, as the sort has brought equal items together, so duplicate removal is easy (Figure 12).
• Share-Partitions: This optimization is specific to hash-based algorithms and shares the partitioning cost across multiple group-bys. The data to be aggregated is usually too large for the hash tables to fit in memory, so the conventional way to deal with limited memory when constructing hash tables is to partition the data on one or more attributes. If the data is partitioned on an attribute, say A, then all group-bys that contain A can be computed by independently grouping each partition (a small sketch of this idea follows this list).
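As referenced above, here is a minimal sketch of the share-partitions idea (Python; not the algorithm from any of the cited papers, and the data and names are made up): the data is partitioned on attribute A, and every group-by containing A is then computed independently within each partition.

# Illustrative sketch (not from the article): share-partitions -- partition the
# data on attribute A, then compute every group-by containing A independently
# within each partition, so each partition's hash tables can fit in memory.

def partition_on(rows, attr):
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return parts

def group_by(rows, attrs, measure="amount"):
    table = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        table[key] = table.get(key, 0) + row[measure]
    return table

sales = [
    {"A": "a1", "B": "b1", "C": "c1", "amount": 3},
    {"A": "a1", "B": "b2", "C": "c1", "amount": 4},
    {"A": "a2", "B": "b1", "C": "c2", "amount": 5},
]

result = {}
for value, part in partition_on(sales, "A").items():
    for attrs in (("A", "B", "C"), ("A", "B"), ("A", "C"), ("A",)):  # group-bys containing A
        # Keys from different partitions never collide because A is part of every key,
        # so the per-partition results can simply be merged.
        result.setdefault(attrs, {}).update(group_by(part, attrs))
print(result)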
Unfortunately, the above optimizations are often contradictory for OLAP databases, especially when the size of the data to be aggregated is much larger than the available main memory.
Figure 12. An example of share-sorts
Figure 13. An example of sharing partitioning cost
For instance, group-by [A] can be computed from one of several parent group-bys ([AB], [ABC], [AC]), but the bigger one, AC (in terms of size), is in memory and the smallest one, AB, is not; in this case, based on the cache-results optimization, AC may be the better choice. Using these optimization techniques, Agarwal et al. (1996) proposed two approaches, the sort-based PipeSort and the hash-based PipeHash. Both, however, need some global planning, which uses the search lattice introduced in Harinarayan et al. (1996).
Search Lattice
The search lattice is a graph where a vertex represents a group-by of the cube as shown in Figure 14. A directed edge connects group-by i to group-by j whenever j can be generated from i and j has exactly one attribute less than i (i is called the parent of j, for instance, AB is called the parent of A). Level k of the search lattice denotes all group-bys that contain exactly k attributes.
Data Cube Computation Methodology
• PipeSort: Agarwal et al. (1996) incorporated share-sorts, cache-results, and amortize-scans into the PipeSort algorithm. The aim is to obtain the minimum total cost and to use pipelining to achieve cache-results and amortize-scans (a much-simplified sketch of this pipelined evaluation appears after this list). The main limitation is that PipeSort does not scale well with the number of cube-by attributes: it performs one sort operation for the pipelined evaluation of each path, and when the underlying relation is sparse and much larger than the available memory, many of the cuboids that PipeSort sorts are also larger than the available memory, resulting in a considerable amount of I/O.
• PipeHash: The PipeHash algorithm can include the four stated optimizations only if enough memory is available. The first optimization, smallest-parent, is handled by fixing the parent group-by for each group-by, which biases the algorithm towards optimizing for smallest-parent. The second optimization, share-partitions, is achieved by computing, from the same partition, all group-bys that contain the partitioning attribute. The third and fourth optimizations are achieved because, when computing a subtree, the algorithm maintains the hash tables of all group-bys in the subtree in memory until all their children are created, so the children of each group-by can be computed in one scan of that group-by. The limitation of PipeHash is that minimizing the overall disk-scan cost is an NP-hard problem.
• Overlap: Deshpande et al. (1998) proposed the OVERLAP algorithm for data cube computation, a sorting-based method executed in four stages. It minimizes the number of sorting steps required to compute many sub-aggregates and minimizes the number of disk accesses by overlapping the computation of the cuboids. Its limitation is that memory may not be large enough to store more than one partition simultaneously: each of the O(k) nodes has O(k) such descendants, so there are at least O(k²) nodes in the search tree that involve additional disk I/O. As a result, the total I/O cost of OVERLAP is at least quadratic in k for sparse data.
• Partitioning: Ross et al. (1997) take into consideration the fact that real data is frequently sparse, for two reasons: (a) large domain sizes of some cube-by attributes, and (b) a large number of cube-by attributes in the data cube query. In Ross et al. (1997), large relations are partitioned into fragments that fit in memory, so there is always enough memory for the fragments of a large relation. Memory-Cube is similar to PipeSort in that it computes the various cuboids (group-bys) of the data cube using pipelined paths; it performs multiple in-memory sorts and does not incur any I/O beyond reading the input relation and writing the data cube itself.
Figure 14. A lattice search of the cube-by operator
• Array Computation: Zhao, Deshpande, Naughton, and Shukla (1997) introduced the multi-way array cubing algorithm, in which the cells are visited in the right order so that a cell does not have to be revisited for each sub-aggregate. Two main issues arise: (a) the array itself is far too large to fit in memory, especially in a multidimensional application; and (b) many of the cells in the array are empty. The goal is to overlap the computation of all these group-bys and finish the cube in one scan of the array while minimizing the memory requirement.
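As referenced in the PipeSort item above, here is a much-simplified sketch of pipelined evaluation along one sorted path (Python; an illustration of the idea only, not the actual PipeSort algorithm): with the data sorted on (A, B, C), the group-bys ABC, AB, A, and ALL are all produced in a single pass by keeping one running total per prefix.

# Much-simplified sketch (not the actual PipeSort algorithm): pipelined evaluation
# of one sorted path ABC -> AB -> A -> ALL over data sorted on (A, B, C).

def pipeline_prefixes(sorted_rows, dims=("A", "B", "C"), measure="amount"):
    # One running total per prefix length; a prefix's total is emitted as soon
    # as the sorted input moves on to a different prefix value.
    current = [None] * (len(dims) + 1)
    totals = [0] * (len(dims) + 1)
    out = []
    for row in sorted_rows:
        for k in range(len(dims), 0, -1):
            key = tuple(row[d] for d in dims[:k])
            if key != current[k]:
                if current[k] is not None:
                    out.append((dims[:k], current[k], totals[k]))
                current[k], totals[k] = key, 0
            totals[k] += row[measure]
        totals[0] += row[measure]              # the ALL group-by
    for k in range(len(dims), 0, -1):          # flush the last open prefixes
        if current[k] is not None:
            out.append((dims[:k], current[k], totals[k]))
    out.append(((), (), totals[0]))
    return out

rows = sorted(
    [{"A": "a1", "B": "b1", "C": "c1", "amount": 2},
     {"A": "a1", "B": "b1", "C": "c2", "amount": 1},
     {"A": "a2", "B": "b1", "C": "c1", "amount": 4}],
    key=lambda r: (r["A"], r["B"], r["C"]))
for attrs, key, total in pipeline_prefixes(rows):
    print(attrs or ("ALL",), key, total)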
Table 2 shows a summary of the benefits and optimizations of the existing techniques. These techniques were all evaluated in uni-processor environments, and they either relied heavily on estimates of memory requirements or simply partitioned the data into equal-sized partitions based on a uniform distribution assumption.
FUTURE TRENDS
Other approaches have utilized parallel techniques (Datta et al., 2002; Tan, Taniar & Lu, 2003; Taniar & Tan, 2002). Dehne et al. (2003) have proposed the use of parallel techniques on data cubes, whereas Lu, Yu, Feng, and Li (2003) focused on the handling of data skew in parallel data cube computations. However, due to the complexity of the data cube, there is still capacity for computations to be further optimized, and thus still scope for researchers to address these new challenges.
CONCLUSION
Data cube computation requires all possible combinations of group-bys, which correspond to the attributes. However, due to the complexity associated with data cube computation, a number of approaches have been proposed. In spite of these approaches, there are a number of research issues that still need to be resolved.
REFERENCES
Agarwal, S., Agrawal, R., Deshpande, P.M., Gupta, A., Naughton, J.F., Ramakrishnan, R. et al. (1996, September). On the computation of multidimensional aggregates. International VLDB Conference (pp. 506-521), Bombay, India.
Datta, A., VanderMeer, D., & Ramamritham, K. (2002). Parallel star join + DataIndexes: Efficient query processing in data warehouses and OLAP. IEEE Transactions on Knowledge & Data Engineering, 14, 1299-1316.
Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A. (2002). Parallelizing the data cube. Journal of Distributed & Parallel Databases, 11, 181-201.
Elmasri, R., & Navathe, S.B. (2004). Fundamentals of database systems. Boston, MA: Addison Wesley.
Gray, J., Bosworth, A., Lyaman, A., & Pirahesh, H. (1996, June). Data cube: A relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. International ICDE Conference (pp. 152-159), New Orleans, Louisiana.
Table 2. Benefits and optimization of existing techniques
Harinarayan, V., Rajaraman, A., & Ullman, J. (1996, June). Implementing data cubes efficiently. International ACM SIGMOD Conference (pp. 205-216), Montreal, Canada.
Lakshmanan, L.V.S., Pei, J., & Zhao, Y. (2003, September). Efficacious data cube exploration by semantic summarization and compression. International VLDB Conference (pp. 1125-1128), Berlin, Germany.
Lu, H.J., Yu, J.X., Feng, L., & Li, Z.X. (2003). Fully dynamic partitioning: Handling data skew in parallel data cube computation. Journal of Distributed & Parallel Databases, 13, 181-202.
Ramakrishnan, R., & Gehrke, J. (2003). Database management systems. New York: McGraw-Hill.
Ross, K.A., & Srivastava, D. (1997, August). Fast computation of sparse datacubes. International VLDB Conference (pp. 116-185), Athens, Greece.
Sarawagi, S., Agrawal, R., & Megiddo, N. (1998, March). Discovery-driven exploration of OLAP data cubes. International EDBT Conference (pp. 168-182), Valencia, Spain.
Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Database system concepts. New York: McGraw-Hill.
Tan, R.B.N., Taniar, D., & Lu, G.J. (2003, March). Efficient execution of parallel aggregate data cube queries in data warehouse environments. International IDEAL Conference (pp. 709-716), Hong Kong, China.
Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing of multi-join expansion_aggregate data cube query in high performance database systems. International ISPAN Conference (pp. 51-58), Manila, Philippines.
Zhao, Y., Deshpande, P., Naughton, J., & Shukla, A. (1998, June). Simultaneous optimization and evaluation of multiple dimensional queries. International ACM SIGMOD Conference (pp. 271-282), Seattle, Washington.

KEY TERMS

Amortize-Scans: Amortizing disk reads by computing as many group-bys as possible simultaneously in memory.

Attribute Dependency: A dependency that arises within a dimension, caused by attribute hierarchies.

Cache-Results: The result of a group-by is obtained from the computation of other group-bys held in memory.

Data Cube: Known as the core operator in data warehousing and OLAP (Lakshmanan, Pei & Zhao, 2003).

Data Cube Operator: Computes the group-bys corresponding to all possible combinations of the attributes in the cube-by clause.

Dependence Relation: A relation on data cube queries whereby some of the group-by queries can be answered using the results of others.

Dimension Dependency: A dependency that is present when the different dimensions interact with one another.

OLAP: Describes a technology that uses a multidimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.

Smallest-Parent: The parent group-by that is either the closest or the smallest in size.
Evolutionary Computation and Genetic Algorithms
William H. Hsu, Kansas State University, USA
INTRODUCTION
A genetic algorithm (GA) is a procedure used to find approximate solutions to search problems through the application of the principles of evolutionary biology. Genetic algorithms use biologically inspired techniques, such as genetic inheritance, natural selection, mutation, and sexual reproduction (recombination, or crossover). Along with genetic programming (GP), they are one of the main classes of genetic and evolutionary computation (GEC) methodologies. Genetic algorithms typically are implemented using computer simulations in which an optimization problem is specified. For this problem, members of a space of candidate solutions, called individuals, are represented using abstract representations called chromosomes. The GA consists of an iterative process that evolves a working set of individuals, called a population, toward an objective function, or fitness function (Goldberg, 1989; Wikipedia, 2004). Traditionally, solutions are represented using fixed-length strings, especially binary strings, but alternative encodings have been developed. The evolutionary process of a GA is a highly simplified and stylized simulation of the biological version. It starts from a population of individuals randomly generated according to some probability distribution (usually uniform) and updates this population in steps called generations. For each generation, multiple individuals are selected randomly from the current population, based upon some application of fitness, bred using crossover, and modified through mutation, to form a new population (a minimal sketch of this generational loop follows the definitions below).
• Crossover: Exchange of genetic material (substrings) denoting rules, structural components, or features of a machine learning, search, or optimization problem.
• Selection: The application of the fitness criterion to choose which individuals from a population will go on to reproduce.
• Replication: The propagation of individuals from one generation to the next.
• Mutation: The modification of chromosomes for single individuals.
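As referenced above, a minimal sketch of the generational loop (Python; illustrative only — the fitness function, operator choices, and parameter values are stand-ins, not those of any system discussed in this article):

# Minimal illustrative GA sketch: tournament selection, one-point crossover,
# and bit-flip mutation over fixed-length bit strings. The fitness function
# (count of 1 bits) is a stand-in for any problem-specific objective.
import random

def fitness(ind):
    return sum(ind)

def tournament(pop, k=2):
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(ind, rate=0.01):
    return [1 - bit if random.random() < rate else bit for bit in ind]

def evolve(pop_size=30, length=20, generations=50):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            child1, child2 = crossover(tournament(pop), tournament(pop))
            nxt.extend([mutate(child1), mutate(child2)])
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

random.seed(0)
print(evolve())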
This article begins with a survey of the following GA variants: the simple genetic algorithm, evolutionary algorithms, and extensions to variable-length individuals. It then discusses GA applications to data-mining problems, such as supervised inductive learning, clustering, and feature selection and extraction. It concludes with a discussion of current issues in GA systems, particularly alternative search techniques and the role of building block (schema) theory.
BACKGROUND
The field of genetic and evolutionary computation (GEC) was first explored by Turing, who suggested an early template for the genetic algorithm. Holland (1975) performed much of the foundational work in GEC in the 1960s and 1970s. His goal of understanding the processes of natural adaptation and designing biologically inspired artificial systems led to the formulation of the simple genetic algorithm (Holland, 1975).
• State of the Field: To date, GAs have been applied successfully to many significant problems in machine learning and data mining, most notably classification, pattern detectors (González & Dasgupta, 2003; Rizki et al., 2002) and predictors (Au et al., 2003), and payoff-driven reinforcement learning1 (Goldberg, 1989).
• Theory of GAs: Current GA theory consists of two main approaches: Markov chain analysis and schema theory. Markov chain analysis is primarily concerned with characterizing the stochastic dynamics of a GA system (i.e., the behavior of the random sampling mechanism of a GA over time). The most severe limitation of this approach is that, while crossover is easy to implement, its dynamics are difficult to describe mathematically. Markov chain analysis of simple GAs has therefore been more successful at capturing the behavior of evolutionary algorithms with selection and mutation only. These include evolutionary algorithms (EAs) and evolution strategies (Schwefel, 1977).
Successful building blocks can become redundant in a GA population. This can slow down processing and also can result in a phenomenon called takeover, where the population collapses to one or a few individuals. Goldberg (2002) characterizes steady-state innovation in GAs as the situation where time to produce a new, more highly-fit building block (the innovation time, ti) is lower than the expected time for the most fit individual to dominate the entire population (the takeover time, t*). Steady state innovation is achieved, facilitating convergence toward an optimal solution, when ti < t*, because the countdown to takeover, or race, between takeover and innovation is reset.
MAIN THRUST
The general strengths of genetic algorithms lie in their ability to explore the search space efficiently through parallel evaluation of fitness (Cantú-Paz, 2000) and mixing of partial solutions through crossover (Goldberg, 2002); to maintain a search frontier to seek global optima (Goldberg, 1989); and to solve multi-criterion optimization problems. The basic units of partial solutions are referred to in the literature as building blocks, or schemata. Modern GEC systems also are able to produce solutions of variable length (De Jong et al., 1993; Kargupta & Ghosh, 2002). A more specific advantage of GAs is their ability to represent rule-based, permutation-based, and constructive solutions to many pattern-recognition and machine-learning problems. Examples of this include induction of decision trees (Cantú-Paz & Kamath, 2003), among several other recent applications surveyed in the following.
Types of GAs
The simplest genetic algorithm represents each chromosome as a bit string (containing the binary digits 0 and 1) of fixed length. Numerical parameters can be represented by integers, though it is possible to use floating-point representations for reals. The simple GA performs crossover and mutation at the bit level for all of these (Goldberg, 1989; Wikipedia, 2004). Other variants treat the chromosome as a parameter list, containing indices into an instruction table or an arbitrary data structure with pre-defined semantics (e.g., nodes in a linked list, hashes, or objects). Crossover and mutation are required to preserve semantics by respecting object boundaries, and formal invariants for each generation can be specified according to these semantics. For most data types, operators can be specialized, with differing levels of effectiveness that generally are domain-dependent (Wikipedia, 2004).
Applications
Genetic algorithms have been applied to many classification and performance tuning applications in the domain of knowledge discovery in databases (KDD). De Jong et al. produced GABIL (Genetic Algorithm-Based Inductive Learning), one of the first general-purpose GAs for learning disjunctive normal form concepts (De Jong et al., 1993). GABIL was shown to produce rules achieving validation set accuracy comparable to that of decision trees induced using ID3 and C4.5. Since GABIL, there has been work on inducing rules (Zhou et al., 2003) and decision trees (Cantú-Paz & Kamath, 2003) using evolutionary algorithms. Other representations that can be evolved using a genetic algorithm include predictors (Au et al., 2003) and anomaly detectors (González & Dasgupta, 2003). Unsupervised learning methodologies, such as data clustering (Hall et al., 1999; Lorena & Furtado, 2001), also admit GA-based representation, with application to such current data-mining problems as gene expression profiling in the domain of computational biology (Iba, 2004). KDD from text corpora is another area where evolutionary algorithms have been applied (Atkinson-Abutridy et al., 2003). GAs can be used to perform meta-learning, or higher-order learning, by extracting features (Raymer et al., 2000), selecting features (Hsu, 2003), or selecting training instances (Cano et al., 2003). They also have been applied to combine, or fuse, classification functions (Kuncheva & Jain, 2000).
FUTURE TRENDS
Some limitations of GAs are that, in certain situations, they are overkill compared to more straightforward optimization methods, such as hill-climbing, feed-forward artificial neural networks using back propagation, and even simulated annealing and deterministic global search. In global optimization scenarios, GAs often manifest their strengths: efficient, parallelizable search; the ability to evolve solutions with multiple objective criteria (Llorà & Goldberg, 2003); and a characterizable and controllable process of innovation. Several current controversies arise from open research problems in GEC:
• Selection is acknowledged to be a fundamentally important genetic operator. Opinion, however, is divided over the importance of crossover vs. mutation. Some argue that crossover is the most important, while mutation is only necessary to ensure that potential solutions are not lost. Others argue that crossover in a largely uniform population only serves to propagate innovations originally found by mutation, and that in a non-uniform population, crossover is nearly always equivalent to a very large mutation, which is likely to be catastrophic.
• In the field of GEC, basic building blocks for solutions to engineering problems primarily have been characterized using schema theory, which has been critiqued as being insufficiently exact to characterize the expected convergence behavior of a GA. Proponents of schema theory have shown that it provides useful normative guidelines for the design of GAs and automated control of high-level GA properties (e.g., population size, crossover parameters, and selection pressure).
Looking ahead to future opportunities and challenges in data mining, genetic algorithms are widely applicable to classification by means of inductive learning. GAs also provide a practical method for optimization of data preparation and data transformation steps. The latter includes clustering, feature selection and extraction, and instance selection. In data mining, GAs likely are most useful where high-level, fitness-driven search is needed. Non-local search (global search or search with an adaptive step size) and multi-objective data mining are also problem areas where GAs have proven promising.
Recent and current research in GEC relates certain evolutionary algorithms to ant colony optimization (Parpinelli, Lopes & Freitas, 2002).

CONCLUSION

Genetic algorithms provide a comprehensive search methodology for machine learning and optimization. They have been shown to be efficient and powerful in many data-mining applications that use optimization and classification. The current literature (Goldberg, 2002; Wikipedia, 2004) contains several general observations about the generation of solutions using a genetic algorithm:
• GAs are sensitive to deceptivity, the irregularity of the fitness landscape. This includes locally optimal solutions that are not globally optimal, lack of a fitness gradient for a given step size, and jump discontinuities in fitness.
• In general, GAs have difficulty with adaptation to dynamic concepts or objective criteria. This phenomenon, called concept drift in supervised learning and data mining, is a problem because GAs traditionally are designed to evolve highly fit solutions (populations containing building blocks of high relative and absolute fitness) with respect to stationary concepts.
• GAs are not always effective at finding globally optimal solutions but can rapidly locate good solutions, even for difficult search spaces. This makes steady-state GAs (i.e., Bayesian optimization GAs that collect and integrate solution outputs after convergence to an accurate representation of building blocks) a useful alternative to generational GAs (maximization GAs that seek the best individual of the final generation after convergence).

REFERENCES

Atkinson-Abutridy, J., Mellish, C., & Aitken, S. (2003). A semantically guided and domain-independent evolutionary model for knowledge discovery from texts. IEEE Transactions on Evolutionary Computation, 7(6), 546-560.
Au, W.-H., Chan, K.C.C., & Yao, X. (2003). A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532-545.
Cano, J.R., Herrera, F., & Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation, 7(6), 561-575. Cantú-Paz, E. (2000). Efficient and accurate parallel genetic algorithms. Norwell, MA: Kluwer. Cantú-Paz, E., & Kamath, C. (2003). Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1), 54-68. De Jong, K.A., Spears, W.M., & Gordon, F.D. (1993). Using genetic algorithms for concept learning. Machine Learning, 13, 161-188. Goldberg, D.E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley. Goldberg, D.E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA: Kluwer. González, F.A., & Dasgupta, D. (2003). Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4(4), 383-403. Hall, L.O., Ozyurt, I.B., & Bezdek, J.C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2), 103-112.
Holland, J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press.
Hsu, W.H. (2003). Control of inductive bias in supervised learning using evolutionary computation: A wrapper-based approach. In J. Wang (Ed.), Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.
Iba, H. (2004). Classification of gene expression profile using combinatory method of evolutionary computation and machine learning. Genetic Programming and Evolvable Machines, Special Issue on Biological Applications of Genetic and Evolutionary Computation, 5(2), 145-156.
Kargupta, H., & Ghosh, S. (2002). Toward machine learning through genetic code-like transformations. Genetic Programming and Evolvable Machines, 3(3), 231-258.
Kuncheva, L.I., & Jain, L.C. (2000). Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(4), 327-336.
Llorà, X., & Goldberg, D.E. (2003). Bounding the effect of noise in multiobjective learning classifier systems. Evolutionary Computation, 11(3), 278-297.
Lorena, L.A.N., & Furtado, J.C. (2001). Constructive genetic algorithm for clustering problems. Evolutionary Computation, 9(3), 309-328.
Parpinelli, R.S., Lopes, H.S., & Freitas, A.A. (2002). Data mining with an ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation, 6(4), 321-332.
Raymer, M.L., Punch, W.F., Goodman, E.D., Kuhn, L.A., & Jain, A.K. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164-171.
Rizki, M.M., Zmuda, M.A., & Tamburino, L.A. (2002). Evolving pattern recognition systems. IEEE Transactions on Evolutionary Computation, 6(6), 594-609.
Schwefel, H.-P. (1977). Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie. Basel: Birkhauser Verlag.
Wikipedia. (2004). Genetic algorithm. Retrieved from http://en.wikipedia.org/wiki/Genetic_algorithm
Zhou, C., Xiao, W., Tirpak, T.M., & Nelson, P.C. (2003). Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation, 7(6), 519-531.

KEY TERMS

Crossover: In biology, a process of sexual recombination, by which two chromosomes are paired up and exchange some portion of their genetic sequence. Crossover in GAs is highly stylized and typically involves exchange of substrings. It can be performed using a crossover bit mask in bit-string GAs but requires complex exchanges (i.e., partial-match, order, and cycle crossover) in permutation GAs.
Evolutionary Computation: A solution approach based on simulation models of natural selection, which begins with a set of potential solutions and then iteratively applies algorithms to generate new candidates and select the fittest from this set. The process leads toward a model that has a high proportion of fit individuals.

Generation: The basic unit of progress in genetic and evolutionary computation, a step in which selection is applied over a population. Usually, crossover and mutation are applied once per generation, in strict order.

Individual: A single candidate solution in genetic and evolutionary computation, typically represented using strings (often of fixed length) and permutations in genetic algorithms, or using problem-solver representations (programs, generative grammars, or circuits) in genetic programming.

Meta-Learning: Higher-order learning by adapting the parameters of a machine-learning algorithm or the algorithm definition itself, in response to data. An example of this approach is the search-based wrapper of inductive learning, which wraps a search algorithm around an inducer to find locally optimal parameter values, such as the relevant feature subset for a given classification target and data set. Validation set accuracy typically is used as fitness. Genetic algorithms have been used to implement such wrappers for decision tree and Bayesian network inducers.

Mutation: In biology, a permanent, heritable change to the genetic material of an organism. Mutation in GAs involves string-based modifications to the elements of a candidate solution. These include bit-reversal in bit-string GAs and shuffle and swap operators in permutation GAs.

Permutation GA: A type of GA where individuals represent a total ordering of elements, such as cities to be visited in a minimum-cost graph tour (the Traveling Salesman Problem). Permutation GAs use specialized crossover and mutation operators compared to the more common bit-string GAs.
Schema (pl. Schemata): An abstract building block of a GA-generated solution, corresponding to a set of individuals. Schemata typically are denoted by bit strings with don't-care symbols '#' (e.g., 1#01#00# is a schema with 2³ = 8 possible instances, one for each instantiation of the # symbols to 0 or 1). Schemata are important in GA research because they form the basis of an analytical approach called schema theory, for characterizing building blocks and predicting their proliferation and survival probability across generations, thereby describing the expected relative fitness of individuals in the GA.

Selection: In biology, a mechanism by which the fittest individuals survive to reproduce, and the basis of speciation according to the Darwinian theory of evolution. Selection in GP involves evaluation of a quantitative criterion over a finite set of fitness cases, with the combined evaluation measures being compared in order to choose individuals.
ENDNOTE 1
Payoff-driven reinforcement learning describes a class of learning problems for intelligent agents that receive rewards, or reinforcements, from the environment in response to actions selected by a policy function. These rewards are transmitted in the form of payoffs, sometimes strictly non-negative. A GA acquires policies by evolving individuals, such as condition-action rules, that represent candidate policies.
Evolutionary Data Mining for Genomics
Laetitia Jourdan, LIFL, University of Lille 1, France
Clarisse Dhaenens, LIFL, University of Lille 1, France
El-Ghazali Talbi, LIFL, University of Lille 1, France
INTRODUCTION
Knowledge discovery from genomic data has become an important research area for biologists. Nowadays, a lot of data is available on the Web, but it is wrong to say that corresponding knowledge is also available. For example, the first draft of the human genome, which contains 3,000,000,000 letters, was achieved in June 2000, but, up to now, only a small part of the hidden knowledge has been discovered. This is the aim of bioinformatics, which brings together biology, computer science, mathematics, statistics, and information theory to analyze biological data for interpretation and prediction. Hence, many problems encountered while studying genomic data may be modeled as data mining tasks, such as feature selection, classification, clustering, or association rule discovery. An important characteristic of genomic applications is the large amount of data to analyze, and, most of the time, it is not possible to enumerate all the possibilities. Therefore, we propose to model these knowledge discovery tasks as combinatorial optimization tasks in order to apply efficient optimization algorithms to extract knowledge from large datasets. To design an efficient optimization algorithm, several aspects have to be considered. The main one is the choice of the type of resolution method according to the characteristics of the problem. Is it an easy problem, for which a polynomial algorithm may be found? If the answer is yes, then let us design such an algorithm. Unfortunately, most of the time, the response to the question is no, and only heuristics that may find good but not necessarily optimal solutions can be used. In our approach, we focus on evolutionary computation, which has already shown an interesting ability to solve highly complex combinatorial problems. In this article, we will show the efficacy of such an approach while describing the main steps required to solve data mining problems from genomics with evolutionary algorithms. We will illustrate these steps with a real problem.

BACKGROUND
Evolutionary data mining for genomics groups three important fields: evolutionary computation, knowledge discovery, and genomics. It is now well known that evolutionary algorithms are well suited for some data mining tasks (Freitas, 2002). Here, we want to show the interest of dealing with genomic data, thanks to evolutionary approaches. A first proof of this interest may be the recent book by Gary Fogel and David Corne, Evolutionary Computation in Bioinformatics, which groups several applications of evolutionary computation to problems in the biological sciences and, in particular, in bioinformatics (Fogel & Corne, 2002). In this article, several data mining tasks are addressed, such as feature selection or clustering, and solved, thanks to evolutionary approaches. Another proof of the interest of such approaches is the number of sessions around evolutionary computation in bioinformatics and computational biology that have been organized during the last Congress on Evolutionary Computation (CEC) in Portland, Oregon in 2004. The aim of genomic studies is to understand the function of genes, to determine which genes are involved in a given process, and how genes are related. Hence, experiments are conducted, for example, to localize coding regions in DNA sequences and/or to evaluate the expression level of genes in certain conditions. Resulting from this, data available for the bioinformatics researcher may deal with DNA sequence information that are related to other types of data. The example used to illustrate this article may be classified in this category. Another type of data deals with the recent technology called microarray, which allows the simultaneous measurement of the expression level of thousands of genes under different conditions (i.e., various time points of a process, absorption of different drugs, etc.). This new type of data requires specific data mining tasks, as the number of genes to study is very large and the
number of conditions may be limited. Classical questions are the classification or the clustering of genes based on their expression pattern, and commonly used approaches may vary from statistical approaches (Yeung & Ruzzo, 2001) to evolutionary approaches (Merz, 2002) and may use additional biological information, such as gene ontology (GO) (Speer, Spieth & Zell, 2004). Recently, the biclustering that allows the grouping of instances having similar characteristic for a subset of attributes (here, genes having the same expression patterns for a subset of conditions) has been applied to this type of data and evolutionary approaches proposed (Bleuler, Prelié & Ziztler, 2004). In this context of microarray data analysis, association rule discovery also has been realized using evolutionary algorithms (Khabzaoui, Dhaenens & Talbi, 2004).
MAIN THRUST
In order to extract knowledge from genomic data using evolutionary algorithms, several steps have to be considered:
1. Identification of the knowledge discovery task from the biological problem under study;
2. Design of this task as an optimization problem;
3. Resolution using an evolutionary approach.
Hence, in this section, we will focus on each of these steps. First, we will present the genomic application that we will use to illustrate the rest of the article and indicate the knowledge discovery tasks that have been extracted. Then, we will show the challenges and some proposed solutions for the two other steps.
Genomics Application
The genomic problem under study is to formulate hypotheses on predisposition factors of different multifactorial diseases, such as diabetes and obesity. In such diseases, one of the difficulties is that healthy people can become affected during their life, so only the affected status is relevant. This work has been done in collaboration with the Biology Institute of Lille (IBL, France). One approach aims to discover the contribution of environmental factors and genetic factors in the pathogenesis of the disease under study by discovering complex interactions, such as ([gene A and gene B] or [gene C and environmental factor D]), in one or more populations. The rest of the article will use this problem as an illustration. To solve such a problem, the first thing is to formulate it as a classical data mining task. The difficulty of
such a formulation is to identify the task. This work must be done through discussions and cooperation with biologists in order to agree on the objective of the problems. For example, in our data, identifying groups of people can be modeled as a clustering task, as we cannot take into account non-affected people. Moreover, a lot of loci have to be studied (3,652 points of comparison on the 23 chromosomes and two environmental factors) and classical clustering algorithms are not able to cope with so many points. So, we decided first to execute a feature selection in order to reduce the number of loci in consideration and to extract the most influential features that will be used for the clustering. Hence, the model of this problem is decomposed into two phases: feature selection and clustering.
From a Data Mining Task to an Optimization Problem The most difficult aspect of turning a data mining task into an optimization problem is to define the criterion to optimize. The choice of the optimization criterion, which measures the quality of candidate knowledge to be extracted, is very important, and the quality of the results of the approach depends on it. Indeed, developing a very efficient method that does not use the right criterion will lead to obtaining the right answer to the wrong question. The optimization criterion either can be specific to the data mining task or dependent of the biological application. Several different choices exist. For example, considering the gene clustering, the optimization criterion can be the minimization of the minimum sum-of-squares (MSS) (Merz, 2002), while for the determination of the members of a predictive gene group, the criterion can be the maximization of the classification success using a maximum likelihood (MLHD) classification method (Ooi & Tan, 2003). Once the optimization criterion is defined, the second step of the design of the data mining task into an optimization problem is to define the encoding of a solution, which may be independent of the resolution method. For example, for clustering problems in gene expression mining with evolutionary algorithm, Faulkenauer and Marchand (2001) use the specific CGA encoding that is dedicated to grouping problems and is well suited to clustering. Regarding the genomic application used to illustrate this article, two phases have been isolated. For the feature selection, an optimization approach has been adopted, using an evolutionary algorithm (see next paragraph), whereas a classical approach (k-means) has been chosen for the clustering phase. Determining the optimization criterion for the feature selection was not an easy task, as it was difficult not to favor small sets of features. 483
A corrective factor has been introduced (Jourdan et al., 2002).
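As a rough illustration of such a criterion, the sketch below scores a candidate subset of loci with a clustering-quality term multiplied by a hypothetical corrective term that discourages trivially small subsets. It does not reproduce the actual corrective factor of Jourdan et al. (2002), so every formula and parameter here should be read as an assumption.

```python
# Illustrative fitness for the feature-selection phase: a clustering-quality
# score plus a corrective term penalizing very small subsets.
# This is a sketch under assumed definitions, not the published criterion.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fitness(data, subset, n_clusters=3, min_size=5):
    """Score a candidate subset of locus indices (higher is better)."""
    if len(subset) < 2:
        return -np.inf
    reduced = data[:, subset]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
    quality = silhouette_score(reduced, labels)
    # Hypothetical corrective factor: penalize subsets below a minimal size.
    correction = min(1.0, len(subset) / min_size)
    return quality * correction

rng = np.random.default_rng(1)
data = rng.random((100, 50))
print(fitness(data, rng.choice(50, size=10, replace=False)))
```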
Solving with Evolutionary Algorithms
Once the formalization of the data mining task as an optimization problem is done, resolution methods can be exact methods, specific heuristics, or metaheuristics. As the space of potential knowledge is exponential in genomics problems (Zaki & Ho, 2000), exact methods are almost always discarded. The drawbacks of heuristic approaches are that it is difficult to cope with multiple solutions and not easy to integrate specific knowledge into a general approach. The advantage of metaheuristics is that one can define a general framework to solve the problem while specializing some agents in order to suit a specific problem. Genetic Algorithms (GAs), which represent a class of evolutionary methods, have given good results on hard combinatorial problems (Michalewicz, 1996). In order to develop a genetic algorithm for knowledge discovery, we have to focus on the following:
• Operators
• Diversification mechanisms
• Intensification mechanisms
Operators allow GAs to explore the search space and must be adapted to the problem. Generally, there are two classes of operators: mutation and crossover. The mutation allows diversity. For the feature selection task under study, the mutation flips n bits (Jourdan, Dhaenens & Talbi, 2001). The crossover produces one, two, or more children solutions by recombining two or more parents. The objective of this mechanism is to keep useful information from the parents in order to improve the solutions. In the considered problem, the subset-oriented common feature crossover operator (SSOCF) has been used. Its objective is to produce offspring that have the same distribution as the parents. This operator is well adapted to feature selection (Emmanouilidis, Hunter & MacIntyre, 2000). Another advantage of evolutionary algorithms is that other data mining algorithms can easily be used as operators; for example, a k-means iteration may be used as an operator in a clustering problem (Krishna & Murty, 1999). Working on knowledge discovery in a particular domain by optimization leads to the definition and use of several operators, where some may use domain knowledge and others may be specific to the model and/or the encoding. To take advantage of all the operators, the idea is to use adaptive mechanisms (Hong, Wang & Chen, 2000) that adjust the application probabilities of these operators according to the progress they produce. Hence,
operators that are less efficient are used less often, although this may change during the search if they become more efficient. Diversification mechanisms are designed to avoid premature convergence. Several such mechanisms exist. The most classical are sharing and random immigrants. Sharing boosts the selection of individuals that lie in less crowded areas of the search space (Mafhoud, 1995). To apply such a mechanism, a distance between solutions has to be defined. In the feature selection for the genomic association discovery, a distance has been defined by integrating knowledge of the application domain. The distance is correlated to a Hamming distance, which integrates biological notions (chromosomal cut, inheritance notion, etc.). A further approach to the diversification of the population, the random immigrant, introduces new individuals. One idea is to generate new individuals by recording statistics on previous selections. Assessing the efficiency of such algorithms applied to genomic data is not easy, as most of the time, biologists have no exact idea about what must be found. Hence, one step in this analysis is to develop simulated data for which optimal results are known. In this manner, it is possible to measure the efficiency of the proposed method. For example, in the problem under study, predetermined genomic associations were constructed to form simulated data. Then, the algorithm was tested on these data and found these associations.
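The sketch below illustrates the two operators described above on bit-string encodings of locus subsets: an n-bit flip mutation and a simplified, commonality-preserving crossover in the spirit of SSOCF. The exact published operators are not reproduced; the details are illustrative assumptions.

```python
# Sketch of GA operators for bit-string subset encodings: flip mutation and
# a simplified SSOCF-like crossover. Treat the details as assumptions.
import numpy as np

rng = np.random.default_rng(0)

def flip_mutation(individual, n_flips=2):
    """Flip n randomly chosen bits of the subset-membership string."""
    child = individual.copy()
    child[rng.choice(len(child), size=n_flips, replace=False)] ^= 1
    return child

def ssocf_like_crossover(p1, p2):
    """Children inherit the loci selected by both parents; the remaining
    selected loci are split so each child keeps its parent's subset size."""
    shared = (p1 == 1) & (p2 == 1)
    non_shared = np.flatnonzero((p1 == 1) ^ (p2 == 1))
    rng.shuffle(non_shared)
    k1 = int(p1.sum() - shared.sum())
    child1, child2 = shared.astype(p1.dtype), shared.astype(p2.dtype)
    child1[non_shared[:k1]] = 1
    child2[non_shared[k1:]] = 1
    return child1, child2

p1 = rng.integers(0, 2, size=30)
p2 = rng.integers(0, 2, size=30)
print(p1.sum(), p2.sum(), [c.sum() for c in ssocf_like_crossover(p1, p2)])
```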
FUTURE TRENDS
There has been much work on evolutionary data mining for genomics. In order to be more efficient and to propose more interesting solutions for decision makers, researchers are investigating multi-criteria design of the data mining tasks. Indeed, we noted that one of the critical phases was the determination of the optimization criterion, and it may be difficult to select a single one. In response to this problem, the multi-criteria design allows us to take into account some criteria dedicated to a specific data mining task and some criteria coming from the application domain. Evolutionary algorithms that work on a population of solutions are well adapted to multi-criteria problems, as they can exploit Pareto approaches and propose several good solutions (i.e., solutions of best compromise). For data mining in genomics, rule discovery has not been widely applied and should be studied carefully, as it is a very general model. Moreover, an interesting multi-criteria model has been proposed for this task and is starting to give interesting results using multi-criteria genetic algorithms (Jourdan et al., 2004).
CONCLUSION
Genomics is a real challenge for researchers in data mining, as most of the problems encountered are difficult to address for several reasons: (1) the objectives are not always very clear, either for the data mining specialist or for the biologist, and a real dialog has to exist between the two; (2) data may be uncertain and noisy; (3) the number of experiments may not be sufficient in comparison to the number of elements that have to be studied. In this article, we presented the different phases required to deal with data mining problems arising in genomics by means of optimization approaches. The objective was to give the main challenges of each phase and to illustrate some responses through an example. We saw that the definition of the optimization criterion is a central point of this approach, and that multi-criteria optimization may be used as a response to this difficulty.
REFERENCES
Bleuler, S., Prelić, A., & Zitzler, E. (2004). An EA framework for biclustering of gene expression data. Proceedings of the Congress on Evolutionary Computation (CEC04), Portland, Oregon.
Emmanouilidis, C., Hunter, A., & MacIntyre, J. (2000). A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Proceedings of the Congress on Evolutionary Computation, San Diego, California.
Falkenauer, E., & Marchand, A. (2001). Using k-means? Consider ArrayMiner. Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS'2001), Las Vegas, Nevada.
Fogel, G., & Corne, D. (2002). Evolutionary computation in bioinformatics. Morgan Kaufmann.
Freitas, A.A. (2002). Data mining and knowledge discovery with evolutionary algorithms. Springer-Verlag.
Hong, T-P., Wang, H-S., & Chen, W-C. (2000). Simultaneously applying multiple mutation operators in genetic algorithms. Journal of Heuristics, 6(4), 439-455.
Jourdan, L., Dhaenens, C., & Talbi, E-G. (2001). An optimization approach to mine genetic data. Proceedings of Biological Data Mining and Knowledge Discovery (METMBS'01).
Jourdan, L., Dhaenens, C., Talbi, E.-G., & Gallina, S. (2002). A data mining approach to discover genetic factors involved in multifactorial diseases. Knowledge Based Systems (KBS) Journal, 15(4), 235-242.
Jourdan, L., Khabzaoui, M., Dhaenens, C., & Talbi, E-G. (2004). A hybrid metaheuristic for knowledge discovery in microarray experiments. Handbook of bioinspired algorithms and applications. CRC Press.
Khabzaoui, M., Dhaenens, C., & Talbi, E-G. (2004). Association rules discovery for DNA microarray data. Proceedings of the Bioinformatics Workshop of SIAM International Conference on Data Mining, Orlando, Florida.
Krishna, K., & Murty, M. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, 29(3), 433-439.
Mafhoud, S.W. (1995). Niching method for genetic algorithms [doctoral thesis]. IL: University of Illinois.
Merz, P. (2002). Clustering gene expression profiles with memetic algorithms. Proceedings of the 7th International Conference on Parallel Problem Solving from Nature (PPSN VII).
Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs. Springer-Verlag.
Ooi, C.H., & Tan, P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics, 19, 37-44.
Speer, N., Spieth, C., & Zell, A. (2004). A memetic co-clustering algorithm for gene expression profiles and biological annotation. Proceedings of the Congress on Evolutionary Computation (CEC04), Portland, Oregon.
Yeung, K.Y., & Ruzzo, W.L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763-774.
Zaki, M., & Ho, C.T. (2000). Large-scale parallel data mining. State-of-the-Art Survey, LNAI.
KEY TERMS
Association Rule Discovery: Implication of the form X ⇒ Y, where X and Y are sets of items; implies that if the condition X is verified, the prediction Y is valid.
Bioinformatics: Field of science in which biology, computer science, and information technology merge into a single discipline.
Clustering: Data mining task in which the system has to classify a set of objects without any information on the characteristics of the classes.
Feature (or Attribute): Quantity or quality describing an instance.
Feature Selection: Task of identifying and selecting a useful subset of features from a large set of redundant, perhaps irrelevant, features.
Genetic Algorithm: Evolutionary algorithm using a population and based on the Darwinian principle, the survival of the fittest.
Multi-Factorial Disease: Disease caused by several factors. Often, multi-factorial diseases are due to two kinds of causality that interact: one is genetic (and often polygenetic) and the other is environmental.
Optimization Criterion: Criterion that gives the quality of a solution of an optimization problem.
Evolutionary Mining of Rule Ensembles
Jorge Muruzábal
University Rey Juan Carlos, Spain
INTRODUCTION
Ensemble rule based classification methods have been popular for a while in the machine-learning literature (Hand, 1997). Given the advent of low-cost, high computing power, we are curious to see how far we can go by repeating some basic learning process, obtaining a variety of possible inferences, and finally basing the global classification decision on some sort of ensemble summary. Some general benefits of this idea have indeed been observed, and we are gaining wider and deeper insights into exactly why this is the case on many fronts of interest. There are many ways to approach the ensemble-building task. Instead of locating ensemble members independently, as in Bagging (Breiman, 1996), or with little feedback from the joint behavior of the forming ensemble, as in Boosting (see, e.g., Schapire & Singer, 1998), members can be created at random and then made subject to an evolutionary process guided by some fitness measure. Evolutionary algorithms mimic the process of natural evolution and thus involve populations of individuals (rather than a single solution iteratively improved by hill climbing or otherwise). Hence, they are naturally linked to ensemble-learning methods. Based on the long-term processing of the data and the application of suitable evolutionary operators, fitness landscapes can be designed in intuitive ways to prime the ensemble's desired properties. Most notably, beyond the intrinsic fitness measures typically used in pure optimization processes, fitness can also be endogenous, that is, it can prime the context of each individual as well.
BACKGROUND
A number of evolutionary mining algorithms are available nowadays. These algorithms may differ in the nature of the evolutionary process or in the basic models considered for the data or in other ways. For example, approaches based on the genetic programming (GP), evolutionary programming (EP), and classifier system (CS) paradigms have been considered, while predictive rules, trees, graphs, and other structures have been evolved. See Eiben and Smith (2003) for an introduction to the general GP, EP, and CS frameworks and Koza, Keane, Streeter, Mydlowec, Yu, and Lanza (2003) for an idea of performance by GP algorithms at the patent level. Here I focus on ensemble rule-based methods for classification tasks or supervised learning (Hand, 1997). The CS architecture is naturally suitable for this sort of rule assembly problem, for its basic representation unit is the rule, or classifier (Holland, Holyoak, Nisbett, & Thagard, 1986). Interestingly, tentative ensembles in CS algorithms are constantly tested for successful cooperation (leading to correct predictions). The fitness measure seeks to reinforce those classifiers leading to success in each case. However interesting, the CS approach in no way exhausts the scope of evolutionary computation ideas for ensemble-based learning; see, for example, Kuncheva and Jain (2000), Liu, Yao, and Higuchi (2000), and Folino, Pizzuti, and Spezzano (2003). Ensembles of trees or rules are the natural reference for evolutionary mining approaches. Smaller trees, made of rules (leaves) with just a few tests, are of particular interest. Stumps place a single test and constitute an extreme case (which is nevertheless used often). These rules are more general and hence tend to make more mistakes, yet they are also easier to grasp and explain. A related notion is that of support, the estimated probability that new data satisfy a given rule's conditions. A great deal of effort has been made in the contemporary CS literature to discern the idea of adequate generality, a recurrent topic in the machine-learning arena.
MAIN THRUST
Evolutionary and Tree-Based Rule Ensembles
In this section, I review various methods for ensemble formation. As noted earlier, in this article, I use the ensemble to build averages of rules. Instead of averaging, one could also select the most suitable classifier in each case and make the decision on the basis of that rule alone (Hand, Adams, & Kelly, 2001). This alternative idea may provide additional insights of interest, but I do not analyze it further here.
It is conjectured that maximizing the degree of interaction amongst the rules already available is critical for efficient learning (Kuncheva & Jain, 2000; Hand et al., 2001). A fundamental issue concerns then the extent to which tentative rules work together and are capable of influencing the learning of new rules. Conventional methods like Bagging and Boosting show at most moderate amounts of interaction in this sense. While Bagging and Boosting are useful, well-known data-mining tools, it is appropriate to explore other ensemble-learning ideas as well. In this article, I focus mainly on the CS algorithm. CS approaches provide interesting architectures and introduce complex nonlinear processes to model prediction and reinforcement. I discuss a specific CS algorithm and show how it opens interesting pathways for emergent cooperative behaviour.
Conventional Rule Assembly
In Bagging methods, different training samples are created by bootstrapping, and the same basic learning procedure is applied on each bootstrapped sample. In Bagging trees, predictions are decided by majority voting or by averaging the various opinions available in each case. This idea is known to reduce the basic instability of trees (Breiman, 1996). A distinctive feature of the Boosting approach is the iterative calling of a basic weak learner (WL) algorithm (Schapire & Singer, 1998). Each time the WL is invoked, it takes as input the training set, together with a dynamic (probability) weight distribution over the data, and returns a single tree. The output of the algorithm is a weighted sum itself, where the weights are proportional to individual performance error. The WL learning algorithm needs only to produce moderately successful models. Thus, trees and simplified trees (stumps) constitute a popular choice. Several weight updating schemes have been proposed. Schapire and Singer update weights according to the success of the last model incorporated, whereas in their LogitBoost algorithm, Friedman, Hastie, and Tibshirani (2000) let the weights depend on overall probabilistic estimates. This latter idea better reflects the joint work of all classifiers available so far and hence should provide a more effective guide for the WL in general. The notion of abstention brings a connection with the CS approach that will become apparent as I discuss the match set idea in the following sections. In standard boosting trees, each tree contributes a leaf to the overall prediction for any new x input data vector, so the number of expressing rules is the number of boosting rounds, independently of x. In the system proposed by Cohen and Singer (1999), the WL essentially produces rules or single leaves C (rather than whole trees).
Their classifiers are then maps taking only two values, a real number for those x verifying the leaf and 0 elsewhere. The final boosting aggregation for x is thus unaffected by all abstaining rules (with x ∉ C), so the number of expressing rules may be a small fraction of the total number of rules.
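A toy sketch of this kind of aggregation is given below: each rule casts a real-valued vote only on the inputs it covers and abstains (contributes 0) elsewhere, so only a fraction of the rules express on any given x. The rule representation and the numbers are invented for illustration and are not the published learner.

```python
# Aggregation with abstaining rules: a rule votes only on examples it covers.
def covers(rule_condition, x):
    """A condition is a dict of attribute -> required value."""
    return all(x.get(attr) == val for attr, val in rule_condition.items())

def ensemble_score(rules, x):
    # rules: list of (condition, vote); abstaining rules add nothing.
    return sum(vote for condition, vote in rules if covers(condition, x))

rules = [({"colour": "red"}, 0.9),
         ({"size": "small"}, -0.4),
         ({"colour": "blue", "size": "small"}, 1.2)]
x = {"colour": "red", "size": "small"}
score = ensemble_score(rules, x)
print("+" if score > 0 else "-", score)
```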
The General CS-Based Evolutionary Approach
The general classifier system (CS) architecture invented by John Holland constitutes perhaps one of the most sophisticated classes of evolutionary computation algorithms (Holland et al., 1986). Originally conceived as a model for cognitive tasks, it has been considered in many (simplified) forms to address a number of learning problems. The nowadays standard stimulus-response (or single-step) CS architecture provides a fascinating approach to the representation issue. Straightforward rules (classifiers) constitute the CS building blocks. CS algorithms maintain a population of such predictive rules whose conditions are hyperplanes involving the wild-card character #. If we generalize the idea of hyperplane to mean "conjunctions of conditions on predictors where each condition involves a single predictor," we see that these rules are also used by many other learning algorithms. Undoubtedly, hyperplane interpretability is a major factor behind this popularity. Critical subsystems in CS algorithms are the performance, credit-apportionment, and rule discovery modules (Eiben & Smith, 2003). As regards credit-apportionment, the question has recently been raised about the suitability of endogenous reward schemes, where endogenous refers to the overall context in which classifiers act, versus other schemes based on intrinsic value measures (Booker, 2000). A well-known family of algorithms is XCS (and descendants), some of which have been previously advocated as data-mining tools (see, e.g., Wilson, 2001). The complexity of the CS dynamics has been analyzed in detail in Westerdale (2001). The match set M=M(x) is the subset of matched (concurrently activated) rules, that is, the collection of all classifiers whose condition is verified by the input data vector x. The (point) prediction for a new x will be based exclusively on the information contained in this ensemble M.
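The following minimal sketch illustrates the match set idea for toy classifiers whose conditions are strings over {0, 1, #}, with # matching either bit; the tiny population is invented purely for illustration.

```python
# Build the match set M(x): all classifiers whose wildcard condition covers x.
def matches(condition, x):
    return all(c == "#" or c == b for c, b in zip(condition, x))

def match_set(population, x):
    return [rule for rule in population if matches(rule["condition"], x)]

population = [{"condition": "1##0", "prediction": "A"},
              {"condition": "10#1", "prediction": "B"},
              {"condition": "####", "prediction": "A"}]
x = "1010"
M = match_set(population, x)
print([rule["prediction"] for rule in M])   # the prediction is formed from M only
```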
A System Based on Support and Predictive Scoring
Support is a familiar notion in various data-mining scenarios. There is a general trade-off between support and
predictive accuracy: the larger the support, the lower the accuracy. The importance of explicitly bounding support (or deliberately seeking high support) has been recognized often in the literature (Greene & Smith, 1994; Friedman & Fisher, 1999; Muselli & Liberati, 2002). Because the world of generality intended by using only high-support rules introduces increased levels of uncertainty and error, statistical tools would seem indispensable for its proper modeling. To this end, classifiers in the BYPASS algorithm (Muruzábal, 2001) differ from other CS alternatives in that they enjoy (support-minded) probabilistic predictions (thus extending the more common single-label predictions). Support plays an outstanding role: a minimum support level b is input by the analyst at the outset, and consideration is restricted to rules with (estimated) support above b. The underlying predictive distributions, R, are easily constructed and coherently updated following a standard Bayesian Multinomial-Dirichlet process, whereas their toll on the system is minimal memory-wise. Actual predictions are built by first averaging the matched predictive distributions and then picking the maximum a posteriori label of the result. Hence, by promoting mixtures of probability distributions, the BYPASS algorithm connects readily with mainstream ensemble-learning methods. We can sometimes find perfect regularities, that is, subsets C for which the conditional distribution of the response Y (given X ∈ C) equals 1 for some output class: P(Y=j | X ∈ C)=1 for some j and 0 elsewhere. In the well-known multiplexer environment, for example, there exists a set of such classifiers such that 100% performance can be achieved. But in real situations, it will be difficult to locate strictly neat C unless its support is quite small. Moreover, putting too much emphasis on error-free behavior may increase the risk of overfitting; that is, we may infer rules that do not apply (or generalize poorly) over a test sample. When restricting the search to high-support rules, probability distributions are well equipped to represent high-uncertainty patterns. When the largest P(Y=j | X ∈ C) is small, it may be especially important to estimate P(Y=j | X ∈ C) for all j.
Furthermore, the use of probabilistic predictions R(j) for j=1, ..., k, where k is the number of output classes, makes possible a natural ranking of the M=M(x) assigned probabilities Ri(y) for the true class y related to the current x. Rules i with large Ri(y) (scoring high) are generally preferred in each niche. In fact, only a few rules are rewarded at each step, so rules compete with each other for the limited amount of resources. Persistent lack of reward means extinction — to survive, classifiers must get reward from time to time. Note that newly discovered, more effective rules with even better scores may cut dramatically the reward given previously to other rules in certain niches. The fitness landscape is thus highly dynamic, and lower scores pi(y) may get reward and survive provided they are the best so far at some niche. An intrinsic measure of fitness for a classifier C →R (such as the lifetime average score -log R(Y), where Y is conditioned by X ∈C) could hardly play the same role. It is worth noting that BYPASS integrates three learning modes in its classifiers: Bayesian at the data-processing level, reinforcement at the survival (competition) level, and genetic at the rule-discovery (exploration) level. Standard genetic algorithms (GAs) are triggered by system failure and act always circumscribed to M. Because the Bayesian updating guarantees that in the long run, predictive distributions R reflect the true conditional probabilities P(Y=j | X ∈C), scores become highly reliable to form the basis of learning engines such as reward or crossover selection. The BYPASS algorithm is sketched in Table 1. Note that utility reflects accumulated reward (Muruzábal, 2001). Because only matched rules get reward, high support is a necessary (but not sufficient) condition to have high utility. Conversely, low-uncertainty regularities need to comply with the (induced) bound on support. The background generalization rate P# (controlling the number of #s in the random C built along the run) is omitted for clarity, although some tuning of P# with regard to threshold u is often required in practice.
Table 1. The skeleton BYPASS algorithm
Input parameters (h: reward intensity) (u: minimum utility)
Initialize classifier population, say P
Iterate for a fixed number of cycles:
o Sample training item (x,y) from file
o Build match set M=M(x) including all rules with x ∈ C
o Build point prediction y*
o Check for success y* = y
§ If successful, reward the h ≥ 2 highest scores Ri(y) in M
§ If unsuccessful, add a new rule to P via a GA or otherwise
o Update predictive distributions R (in M)
o Update utility counters (in P)
o Check for low utility via threshold u and possibly delete some rules
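The sketch below condenses only the prediction and updating steps of Table 1, assuming the match set M(x) has already been formed as in the earlier example: each classifier keeps Dirichlet counts over the output classes, the matched predictive distributions are averaged, the maximum a posteriori label is returned, and the matched counts are then updated with the true class. Reward, utility, and rule discovery are omitted, and all data structures are assumptions rather than the published implementation.

```python
# Prediction and Multinomial-Dirichlet updating for a match set of classifiers.
import numpy as np

K = 3  # number of output classes (assumed)

class Classifier:
    def __init__(self):
        self.counts = np.ones(K)        # uniform Dirichlet prior over classes

    def predictive(self):
        return self.counts / self.counts.sum()

def point_prediction(M):
    """Average the matched predictive distributions and pick the MAP label."""
    avg = np.mean([cl.predictive() for cl in M], axis=0)
    return int(np.argmax(avg))

def bayesian_update(M, y_true):
    for cl in M:                        # Multinomial-Dirichlet: add one count
        cl.counts[y_true] += 1

M = [Classifier(), Classifier(), Classifier()]   # a toy match set
y_star = point_prediction(M)
bayesian_update(M, y_true=2)
print(y_star, M[0].predictive().round(2))
```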
BYPASS has been tested on various tasks under demanding b (and not very large h), and the results have been satisfactory in general. Comparatively smaller populations are used, and low-uncertainty but high-support rules are uncovered. Very high values for P# (as high as 0.975) have been tested successfully in some cases (Muruzábal, 2001). In the juxtaposed (or concatenated) multiplexer environment, BYPASS is shown to maintain a compact population of relatively high uncertainty rules that solves the problem by bringing about appropriate match sets (nearly) all the time. Recent work by Butz, Goldberg, and Tharakunnel (2003) shows that XCS also solves this problem, although working at a lower level of support (generality). To summarize, BYPASS does not rely on rule plurality for knowledge encoding because it uses compact probabilistic predictions (bounded by support). It requires no intrinsic value for rules and no added tailor-made heuristics. Besides, it tends to keep population size under control (with increased processing speed and memory savings). The ensembles (match sets) derived from evolution in BYPASS have shown good promise of cooperation.
FUTURE TRENDS
Quick interactive data-mining algorithms and protocols are nice when human judgment is available. When it is not, computer-intensive, autonomous algorithms capable of thoroughly squeezing the data are also nice for preliminary exploration and other purposes. In a sense, we should tend to rely on the latter to mitigate the nearly ubiquitous data overflow problem. Representation schemes and learning engines are crucial to the success of these unmanned agents and need, of course, further investigation. Ensemble methods have lots of appealing features and will be subject to further analysis and testing. Evolutionary algorithms will continue to rise and succeed in yet other application areas. Additional commercial spin-offs will keep coming. Although great progress has been made in identifying many key insights in the CS framework, some central points still need further discussion. Specifically, the idea of rules that perform their prediction following some kind of more elaborate computation is appealing, and indeed more functional representations of classifiers (such as multilayer perceptrons) have been proposed in the CS literature (see, e.g., Bull, 2002). On the theoretical side, a formal framework for more rigorous analysis in high-support learning is much needed. The task is not easy, however, because the target is somewhat more vague, and individual as well as collective interests should be brought to terms when evaluating the generality and
uncertainty associated with rules. Also, further research should be conducted to clearly delineate the strengths of the various CS approaches against current alternative methods for rule ensemble formation and data mining.
CONCLUSION
Evolutionary rule mining is a successful, promising research area. Evolutionary algorithms constitute by now a very useful and wide class of stochastic optimization methods. The evolutionary CS approach is likely to provide interesting insights and cross-fertilization of ideas with other data-mining methods. The BYPASS algorithm discussed in this article has been shown to tolerate the high support constraint well, leading to pleasant and unexpected results in some problems. These results stress the latent predictive power of the ensembles formed by high uncertainty rules.
REFERENCES
Booker, L. B. (2000). Do we really need to estimate rule utilities in classifier systems? Lecture Notes in Artificial Intelligence, 1813, 125-142.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Bull, L. (2002). On using constructivism in neural classifier systems. Lecture Notes in Computer Science, 2439, 558-567.
Butz, M. V., Goldberg, D. E., & Tharakunnel, K. (2003). Analysis and improvement of fitness exploitation in XCS: Bounding models, tournament selection, and bilateral accuracy. Evolutionary Computation, 11(3), 239-277.
Cohen, W. W., & Singer, Y. (1999). A simple, fast, and effective rule learner. Proceedings of the 16th National Conference on Artificial Intelligence.
Eiben, A. E., & Smith, J. E. (2003). Introduction to evolutionary computing. Springer.
Folino, G., Pizzuti, C., & Spezzano, G. (2003). Ensemble techniques for parallel genetic programming based classifiers. Lecture Notes in Computer Science, 2610, 59-69.
Friedman, J. H., & Fisher, N. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9(2), 1-20.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337-407. Greene, D. P., & Smith, S. F. (1994). Using coverage as a model building constraint in learning classifier systems. Evolutionary Computation, 2(1), 67-91. Hand, D. J. (1997). Construction and assessment of classification rules. Wiley. Hand, D. J., Adams, N. M., & Kelly, M. G. (2001). Multiple classifier systems based on interpretable linear classifiers. Lecture Notes in Computer Science, 2096, 136-147. Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning and discovery. MIT Press. Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming IV: Routine human-competitive machine intelligence. Kluwer. Kuncheva, L. I., & Jain, L. C. (2000). Designing classifier fusion systems by genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(4), 327-336. Liu, Y., Yao, X., & Higuchi, T. (2000). Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4(4), 380-387. Muruzábal, J. (2001). Combining statistical and reinforcement learning in rule-based classification. Computational Statistics, 16(3), 341-359. Muselli, M., & Liberati, D. (2002). Binary rule generation via hamming clustering. IEEE Transactions on Knowledge and Data Engineering, 14, 1258-1268. Schapire, R. E., & Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297-336. Westerdale, T. H. (2001). Local reinforcement and recombination in classifier systems. Evolutionary Computation, 9(3), 259-281.
Wilson, S. W. (2001). Mining oblique data with XCS. Lecture Notes in Artificial Intelligence, 1996, 158-176.
KEY TERMS
Classification: The central problem in (supervised) data mining. Given a training data set, classification algorithms provide predictions for new data based on predictive rules and other types of models.
Classifier System: A rich class of evolutionary computation algorithms building on the idea of evolving a population of predictive (or behavioral) rules under the enforcement of certain competition and cooperation processes. Note that classifier systems can also be understood as systems capable of performing classification. Not all CSs in the sense meant here qualify as classifier systems in the broader sense, but a variety of CS algorithms concerned with classification do.
Ensemble-Based Methods: A general technique that seeks to profit from the fact that multiple rule generation followed by prediction averaging reduces test error.
Evolutionary Computation: The solution approach guided by artificial evolution, which begins with random populations (of solution models), then iteratively applies algorithms of various kinds to find the best or fittest models.
Fitness Landscape: Optimization space due to the characteristics of the fitness measure used to define the evolutionary computation process.
Predictive Rules: Standard if-then rules with the consequent expressing some form of prediction about the output variable.
Rule Mining: A computer-intensive task whereby data sets are extensively probed for useful predictive rules.
Test Error: Learning systems should be evaluated with regard to their true error rate, which in practice is approximated by the error rate on test data, or test error.
Explanation-Oriented Data Mining
Yiyu Yao
University of Regina, Canada
Yan Zhao
University of Regina, Canada
INTRODUCTION Data mining concerns theories, methodologies, and, in particular, computer systems for knowledge extraction or mining from large amounts of data (Han & Kamber, 2000). The extensive studies on data mining have led to many theories, methodologies, efficient algorithms and tools for the discovery of different kinds of knowledge from different types of data. In spite of their differences, they share the same goal, namely, to discover new and useful knowledge, in order to gain a better understanding of nature. The objective of data mining is, in fact, the goal of scientists when carrying out scientific research, independent of their various disciplines. Data mining, by combining research methods and computer technology, should be considered as a research support system. This goal-oriented view enables us to re-examine data mining in the wider context of scientific research. Such a re-examination leads to new insights into data mining and knowledge discovery. The result, after an immediate comparison between scientific research and data mining, is that an explanation construction and evaluation task is added to the existing data mining framework. In this chapter, we elaborate upon the basic concerns and methods of explanation construction and evaluation. Explanation-oriented association mining is employed as a concrete example to demonstrate the entire framework.
BACKGROUND Scientific research and data mining have much in common in terms of their goals, tasks, processes and methodologies. As a recently emerged area of multi-disciplinary study, data mining and knowledge discovery research can benefit from the long established studies of scientific research and investigation (Martella, Nelson, & Marchand-Martella, 1999). By viewing data mining in a wider context of scientific research, we can obtain insights into the necessities and benefits of explanation construction. The model of explanation-
oriented data mining is a recent result from such an investigation (Yao, 2003; Yao, Zhao, & Maguire, 2003).
Common Goals of Scientific Research and Data Mining Scientific research is affected by the perceptions and the purposes of science. Martella et al. summarized the main purposes of science, namely, to describe and predict, to improve or manipulate the world around us, and to explain our world (Martella, et al., 1999). The results of the scientific research process provide a description of an event or a phenomenon. The knowledge obtained from this research helps us to make predictions about what will happen in the future. Research findings are a useful tool for making an improvement in the subject matter. Research findings also can be used to determine the best or the most effective ways of bringing about desirable changes. Finally, scientists develop models and theories to explain why a phenomenon occurs. Goals similar to those of scientific research have been discussed by many researchers in data mining. For example, Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy identified two high-level goals of data mining as prediction and description (Fayyad, et al., 1996). Prediction involves the use of some variables to predict the values of some other variables, while description focuses on patterns that describe the data. Ling, Chen, Yang and Cheng studied the issue of manipulation and action based on discovered knowledge (Ling et al., 2002). Yao, Zhao, et al. introduced the notion of explanation-oriented data mining, which focuses on constructing models for the explanation of data mining results (2003).
Common Processes of Scientific Research and Data Mining Research is a highly complex and subtle human activity, which is difficult to formally define. It seems impossible to give any formal instruction on how to do research. On the other hand, some lessons and general principles can be learnt from the experience of scien-
Table 1. The model of scientific research processes
§ Idea-generation phase: to identify a topic of interest.
§ Problem-definition phase: to precisely and clearly define and formulate vague and general ideas generated in the previous phase.
§ Procedure-design/planning phase: to make a workable research plan by considering all issues involved.
§ Observation/experimentation phase: to observe real-world phenomena, collect data, and carry out experiments.
§ Data-analysis phase: to make sense out of the data collected.
§ Results-interpretation phase: to build rational models and theories that explain the results from the data-analysis phase.
§ Communication phase: to present the research results to the research community.
Table 2. The model of data mining processes
§ Data pre-processing phase: to select and clean working data.
§ Data transformation phase: to change the working data into the required form.
§ Pattern discovery and evaluation phase: to apply algorithms to identify knowledge embedded in data, and to evaluate the discovered knowledge.
§ Explanation construction and evaluation phase: to construct plausible explanations for discovered knowledge, and to evaluate different explanations.
§ Pattern presentation: to present the extracted knowledge and explanations.
tists. There are some basic principles and techniques that are commonly used in most types of scientific investigations. We adopt the model of the research process from Graziano and Raulin (2000), and combine it with other models (Martella, et al., 1999). The basic phases and their objectives are summarized in Table 1. It is possible to combine several phases into one, or to divide one phase into more detailed steps. The division between phases is not clear-cut. The research process does not follow a rigid sequencing of the phases. Iteration of different phases may be necessary (Graziano & Raulin, 2000). Many researchers have proposed and studied models of data mining processes (Fayyad, et al. 1996; Mannila, 1997; Yao, Zhao, et al., 2003; Zhong, Liu, & Ohsuga, 2001). A model that adds the explanation facility to the commonly used models has been recently proposed by Yao, Zhao, et al.; it is remarkably similar to the model of scientific research. The basic phases and their objectives are summarized in Table 2. Like the research process, the data mining process is also an iterative process, and there is no clear-cut difference among the different phases. In fact, Zhong, et al. argue that it should be a dynamically organized process (Zhong, et al., 2001). The entire framework is illustrated in Figure 1. There is a parallel correspondence between the processes of scientific research and data mining. Their main difference lies in the subjects that perform the tasks. Research is carried out by scientists, and data mining is done by computer systems. In particular, data mining may be viewed as a study of domain-independent research methods with emphasis on data analysis. The higher and more abstract level of comparisons and connections between scientific research and data mining can be further studied at more concrete levels.
There are bi-directional benefits. The experiences and results from the studies of research methods can be applied to data mining problems; the data mining algorithms can be used to support scientific research.
MAIN THRUST
Explanations of data mining address several important questions. What needs to be explained? How to explain the discovered knowledge? Moreover, is an explanation correct and complete? By answering these questions, one can better understand explanation-oriented data mining. The ideas and processes of explanation construction and explanation evaluation are demonstrated by explanation-oriented association mining.
Figure 1. A framework of explanation-oriented data mining (the process runs from data through data pre-processing, data transformation, and pattern discovery and evaluation, then to explanation construction and evaluation supported by background knowledge, attribute selection, and explanation-profile construction, and ends with pattern representation of the explained patterns)
Basic Issues
• Explanation-oriented data mining explains and interprets the knowledge discovered from data.
Knowledge can be discovered by unsupervised learning methods. Unsupervised learning studies how systems can learn to represent, summarize, and organize the data in a way that reflects the internal structure (namely, a pattern) of the overall collection. This process does not explain the patterns, but describes them. The primary unsupervised techniques include clustering mining, belief (usually Bayesian) network learning, and association mining. The criteria for choosing which pattern is to be explained are directly related to the pattern evaluation step of data mining.
• Explanation-oriented data mining requires background knowledge to infer features that can possibly explain a discovered pattern.
Understanding the theory of explanation requires considerations from many branches of inquiry: physics, chemistry, meteorology, human culture, logic, psychology, and, above all, the methodology of science. In data mining, explanation can be made at a shallow, syntactic level based on statistical information, or at a deep, semantic level based on domain knowledge. The required information and knowledge for explanation may not necessarily be inside the original dataset. Additional information often is required for an explanation construction. It is argued that the power of explanation involves the power of insight and anticipation. One collects certain features based on the underlying hypothesis that these may provide explanations of the discovered pattern. That something is unexplainable may simply be an expression of the inability to discover an explanation of a desired sort. The process of selecting the relevant and explanatory features may be subjective and may require trial and error. In general, the better the background knowledge is, the more accurate the inferred explanations are likely to be.
• Explanation-oriented data mining utilizes induction, namely, drawing an inference from a set of acquired training instances, and justifying or predicting the instances one might observe in the future.
Supervised learning methods can be applied to this explanation construction. The goal of supervised learning is to find a model that will correctly associate the input patterns with the classes. In real world applications, supervised learning models are extremely useful analytic techniques. The widely used supervised learning methods include decision tree learning, rule-based learning, and decision graph learning. The learned results are represented as either a tree, or a set of if-then rules.
• The constructed explanations provide some evidence about conditions (within the background knowledge) under which the discovered pattern is most likely to happen, or how the background knowledge is related to the pattern.
The role of explanation in data mining is linked to proper description, relation, and causality. Comprehensibility is the key factor. The accuracy of the constructed explanations depends on the amount of training examples. Explanation-oriented data mining performs poorly with insufficient data or poor presuppositions. Different background knowledge may yield different explanations. There is no reason to believe that only one unique explanation exists. One can use statistical measures and domain knowledge to evaluate different explanations.
A Concrete Example: Explanation-Oriented Association Mining
Explanations, also expressed as conditions, can provide additional semantics to a standard association. For example, by adding time, place, and/or customer profiles as conditions, one can identify when, where, and/or to whom an association occurs.
Explanation Construction
The approach of explanation-oriented data mining combines unsupervised and supervised learning methods, namely, forming a concept first, and then explaining the concept. It consists of two main steps and uses two data tables. One table is used to learn a pattern. The other table, an explanation profile constructed with respect to an interesting pattern, is used to search for explanations of the pattern. In the first step, an unsupervised learning algorithm, like the Apriori algorithm (Agrawal, Imielinski, & Swami, 1993), can be applied to discover frequent associations. To discover other types of associations, different algorithms can be applied. In the second step, an explanation profile is constructed. Objects in the profile are labelled as positive instances if they satisfy the desired pattern, and as negative instances if they do not. Conditions that explain the pattern are then searched for using a supervised learning algorithm. A classical supervised learning algorithm, such as ID3 (Quinlan, 1983), C4.5 (Quinlan, 1993), or PRISM (Cendrowska, 1987), may be used to construct explanations.
Table 3. The Apriori-ID3 algorithm
Input: A transaction table, and a related explanation profile table.
Output: Associations and explained associations.
1. Use the Apriori algorithm to generate a set of frequent associations in the transaction table. For each association φ ∧ ϕ in the set, support(φ ∧ ϕ) ≥ minsup, and confidence(φ ⇒ ϕ) ≥ minconf.
2. If the association φ ∧ ϕ is considered interesting (with respect to the user feedback or interestingness measures), then
a. Introduce a binary attribute named Decision. Given a transaction, its value on Decision is "+" if it satisfies φ ∧ ϕ in the original transaction table, otherwise its value is "-".
b. Construct an information table by using the attribute Decision and explanation profiles. The new table is called an explanation table.
c. By treating Decision as the target class, one can apply the ID3 algorithm to derive classification rules of the form λ ⇒ (Decision = "+"). The condition λ is a formula discovered in the explanation table, which states that under λ the association φ ∧ ϕ occurs.
d. Evaluate the constructed explanation(s).
The Apriori-ID3 algorithm, which can be regarded as an example of an explanation-oriented association mining method, is described in Table 3.
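As a rough, hedged sketch of these two steps, the example below uses a scikit-learn decision tree in place of ID3 and invented transaction and profile columns: transactions satisfying the chosen association are labelled "+", the rest "-", and a small tree is grown on the explanation-profile attributes to expose conditions under which the association holds.

```python
# Two-step sketch: label transactions by a chosen association, then learn
# explanatory conditions over the explanation profile (tree stands in for ID3).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

transactions = pd.DataFrame({"bread": [1, 1, 0, 1], "milk": [1, 1, 0, 0]})
profile = pd.DataFrame({"weekday": [1, 1, 0, 0], "downtown": [1, 0, 1, 0]})

# Step 1 (assumed already produced by Apriori): the association of interest is
# "bread and milk"; label each transaction accordingly.
decision = ((transactions["bread"] == 1) & (transactions["milk"] == 1)).map({True: "+", False: "-"})

# Step 2: learn conditions over the explanation profile that separate "+" from "-".
tree = DecisionTreeClassifier(max_depth=2).fit(profile, decision)
print(export_text(tree, feature_names=list(profile.columns)))
```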
Explanation Evaluation
Once explanations are generated, it is necessary to evaluate them. For explanation-oriented association mining, we want to compare a conditional association (explained association) with its unconditional counterpart, in addition to comparing different conditions. Let T be a transaction table, and E be an explanation profile table associated with T. Suppose that for a desired pattern φ generated by an unsupervised learning algorithm from T, there is a set K of conditions (explanations) discovered by a supervised learning algorithm from E, and λ ∈ K is one explanation. Two points are noted. First, the set K of explanations can be different according to various explanation profile tables, or various supervised learning algorithms. Second, not all explanations in K are equally interesting. Different conditions may have different degrees of interestingness. Suppose ε is a quantitative measure used to evaluate plausible explanations, which can be the support measure for an undirected association, the confidence or coverage measure for a one-way association, or the similarity measure for a two-way association (Yao & Zhong, 1999). A condition λ ∈ K provides an explanation of a discovered pattern φ if ε(φ | λ) > ε(φ). One can further evaluate explanations quantitatively based on several measures, such as absolute difference (AD), relative difference (RD) and ratio of change (RC):
AD(φ | λ) = ε(φ | λ) − ε(φ),
RD(φ | λ) = (ε(φ | λ) − ε(φ)) / ε(φ),
RC(φ | λ) = (ε(φ | λ) − ε(φ)) / (1 − ε(φ)).
The absolute difference represents the disparity between the pattern and the pattern under the condition. For a positive value, one may say that the condition supports φ; for a negative value, one may say that the condition rejects φ. The relative difference is the ratio of the absolute difference to the value of the unconditional pattern. The ratio of change compares the actual change and the maximum potential change. Generality is the measure that quantifies the size of a condition with respect to the whole data, defined by generality(λ) = |λ| / |U|. When the generality of conditions is essential, a compound measure should be applied. For example, one may be interested in discovering an accurate explanation with a high ratio of change and a high generality. However, it often happens that an explanation has a high generality but a low RC value, while another explanation has a low generality but a high RC value. A trade-off between these two explanations does not necessarily exist. A good explanation system must be able to rank the constructed explanations and be able to reject the bad explanations. It should be realized that evaluation is a difficult process because so many different kinds of knowledge can come into play. In many cases, one must rely on domain experts to reject uninteresting explanations.
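The following small example computes these measures for a toy case in which the measure ε is taken to be confidence; all counts and values are invented for illustration.

```python
# Compute AD, RD, RC, and generality for a toy explained association.
def evaluate(eps_phi, eps_phi_given_lambda, n_lambda, n_total):
    ad = eps_phi_given_lambda - eps_phi          # absolute difference
    rd = ad / eps_phi                            # relative difference
    rc = ad / (1 - eps_phi)                      # ratio of change
    generality = n_lambda / n_total              # |lambda| / |U|
    return ad, rd, rc, generality

# Say the association holds for 40% of all transactions, but for 70% of the
# 250 (out of 1,000) transactions satisfying the condition lambda.
ad, rd, rc, g = evaluate(eps_phi=0.40, eps_phi_given_lambda=0.70,
                         n_lambda=250, n_total=1000)
print(f"AD={ad:.2f} RD={rd:.2f} RC={rc:.2f} generality={g:.2f}")
```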
FUTURE TRENDS
Considerable research remains to be done on explanation construction and evaluation. In this chapter, rule-based explanation is constructed by inductive supervised learning algorithms. Considering the structure of explanation, case-based explanations also need to be addressed. Based on the case-based explanation, a pattern is explained if an actual prior case is presented to provide compelling
support. One of the perceived benefits of case-based explanation is that the rule generation effort is saved. Instead, similarity functions need to be studied in order to evaluate the distance between the description of the new pattern and an existing case, and to retrieve the most similar case as an explanation. The constructed explanations of the discovered pattern provide conclusive evidence for the new instances. In other words, the new instances can be explained and implied by the explanations. This is normally true when the explanations are sound and complete. However, sometimes, the constructed explanations cannot guarantee that a certain instance is a perfect fit. Even worse, a new data set, as a whole, may show a change or a conflict with the learnt explanations. This is because the explanations may be context-dependent on certain spatial and/or temporal intervals. To consolidate the explanations we have constructed, we cannot simply logically "and", "or", or ignore the new explanation. Instead, a spatial-temporal reasoning model needs to be introduced to show the trend and evolution of the pattern to be explained. The explanations we have introduced so far are not necessarily the causal interpretation of the discovered pattern, i.e., the relationships expressed in the form of deterministic and functional equations. They can be inductive generalizations, descriptions, or deductive implications. Explanation as causality is the strongest form of explanation and coherence. We might think of Bayesian networks as an inference that unveils the internal relationship between attributes. Searching for an optimal model is difficult and NP-hard. Arrow direction is not guaranteed. Expert knowledge could be integrated into the a priori search function, such as the presence of links and orders.
CONCLUSION
Explanation-oriented data mining offers a new perspective. It closely relates scientific research and data mining, which have bi-directional benefits. The ideas of explanation-oriented mining can have a significant impact on the understanding of data mining and effective applications of data mining results.
REFERENCES
Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of ACM Special Interest Group on Management of Data 1993 (pp. 207-216).
Brodie, M. & Dejong, G. (2001). Iterated phantom induction: A knowledge-based approach to learning control. Machine Learning, 45(1), 45-76.
Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27, 349-370.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. (Eds.). (1996). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
Graziano, A.M. & Raulin, M.L. (2000). Research methods: A process of inquiry (4th ed.). Boston: Allyn & Bacon.
Han, J. & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.
Ling, C.X., Chen, T., Yang, Q. & Cheng, J. (2002). Mining optimal actions for profitable CRM. Proceedings of International Conference on Data Mining (pp. 767-770).
Mannila, H. (1997). Methods and problems in data mining. Proceedings of International Conference on Database Theory (pp. 41-55).
Martella, R.C., Nelson, R. & Marchand-Martella, N.E. (1999). Research methods: Learning to become a critical research consumer. Boston: Allyn & Bacon.
Mitchell, T. (1999). Machine learning and data mining. Communications of the ACM, 42(11), 30-36.
Quinlan, J.R. (1983). Learning efficient classification procedures. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 463-482). Palo Alto, CA: Morgan Kaufmann.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Yao, Y.Y. (2003). A framework for web-based research support systems. Proceedings of the Twenty-Seventh Annual International Computer Software and Applications Conference (pp. 601-606).
Yao, Y.Y., Zhao, Y. & Maguire, R.B. (2003). Explanation-oriented association mining using rough set theory. Proceedings of Rough Sets, Fuzzy Sets and Granular Computing (pp. 165-172).
Yao, Y.Y. & Zhong, N. (1999). An analysis of quantitative measures associated with rules. Proceedings Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 479-488).
Zhong, N., Liu, C. & Ohsuga, S. (2001). Dynamically organizing KDD processes. International Journal of Pattern Recognition and Artificial Intelligence, 15, 451-473.
KEY TERMS
Absolute Difference: A measure that represents the difference between an association and a conditional association based on a given measure. The condition provides a plausible explanation.
Explanation-Oriented Data Mining: A general framework that includes data pre-processing, data transformation, pattern discovery and evaluation, pattern explanation and explanation evaluation, and pattern presentation. This framework is consistent with the general model of scientific research processes.
Generality: A measure that quantifies the coverage of an explanation in the whole data set.
Goals of Scientific Research: The purposes of science are to describe and predict, to improve or to manipulate the world around us, and to explain our world. One goal of scientific research is to discover new and useful knowledge for the purpose of science. As a specific research field, data mining shares this common goal, and may be considered as a research support system.
Method of Explanation-Oriented Data Mining: The method consists of two main steps and uses two data tables. One table is used to learn a pattern. The other table, an explanation table, is used to explain one desired pattern. In the first step, an unsupervised learning algorithm is used to discover a pattern of interest. In the second step, by treating objects satisfying the pattern as positive instances, and treating the rest as negative instances, one can search for conditions that explain the pattern by a supervised learning algorithm.
Ratio of Change: A ratio of actual change (absolute difference) to the maximum potential change.
Relative Difference: A measure that represents the difference between an association and a conditional association relative to the association based on a given measure.
Scientific Research Processes: A general model consists of the following phases: idea generation, problem definition, procedure design/planning, observation/experimentation, data analysis, results interpretation, and communication. It is possible to combine several phases, or to divide one phase into more detailed steps. The division between phases is not clear-cut. Iteration of different phases may be necessary.
Factor Analysis in Data Mining

Zu-Hsu Lee, Montclair State University, USA
Richard L. Peterson, Montclair State University, USA
Chen-Fu Chien, National Tsing Hua University, Taiwan
Ruben Xing, Montclair State University, USA
INTRODUCTION
The rapid growth and advances of information technology enable data to be accumulated faster and in much larger quantities (i.e., data warehousing). Faced with vast new information resources, scientists, engineers, and business people need efficient analytical techniques to extract useful information and effectively uncover new, valuable knowledge patterns. Data preparation is the beginning activity of exploring for potentially useful information. However, there may be redundant dimensions (i.e., variables) in the data, even after the data are well prepared. In this case, the performance of data-mining methods will be affected negatively by this redundancy. Factor Analysis (FA) is a commonly used method, among others, to reduce data dimensions to a small number of substantial characteristics. FA is a statistical technique used to find an underlying structure in a set of measured variables. FA proceeds by finding new independent variables (factors) that describe the patterns of relationships among the original dependent variables. With FA, a data miner can determine whether or not some variables should be grouped as a distinguishing factor, based on how these variables are related. Thus, the number of factors will be smaller than the number of original variables in the data, enhancing the performance of the data-mining task. In addition, the factors may be able to reveal underlying attributes that cannot be observed or interpreted explicitly so that, in effect, a reconstructed version of the data is created and used to make hypothesized conclusions. In general, FA is used with many data-mining methods (e.g., neural network, clustering).

BACKGROUND

The concept of FA was created in 1904 by Charles Spearman, a British psychologist. The term factor analysis was first introduced by Thurstone in 1931. Exploratory FA and confirmatory FA are two main types of modern FA techniques. The goals of FA are (1) to reduce the number of variables and (2) to classify variables through detection of the structure of the relationships between variables. FA achieves these goals by creating a smaller number of new dimensions (i.e., factors) that carry potentially useful knowledge. The applications of FA techniques can be found in various disciplines in science, engineering, and social sciences, such as chemistry, sociology, economics, and psychology. To sum up, FA can be considered a broadly used statistical approach that explores the interrelationships among variables and determines a smaller set of common underlying factors. Furthermore, the information contained in the original variables can be explained by these factors with a minimum loss of information.
MAIN THRUST

In order to represent the important structure of the data efficiently (i.e., in a reduced number of dimensions), there are a number of techniques that can be used for data mining. These generally are referred to as multi-dimensional scaling methods. The most basic one is Principal Component Analysis (PCA). By transforming the original variables in the data into the same number of new ones, which are mutually orthogonal (uncorrelated), PCA sequentially extracts most of the variance (variability) of
the data. The hope is that most of the information in the data might be contained in the first few components. FA also extracts a reduced number of new factors from the original data set, although it has different aims from PCA. FA usually starts with a survey or a number of observed traits. Before FA is applied, the assumptions of correlations in the data (normality, linearity, homogeneity of sample, and homoscedasticity) need to be satisfied. In addition, the factors to extract should all be orthogonal to one another. After defining the measured variables to represent the data, FA considers these variables as a linear combination of latent factors that cannot be measured explicitly. The objective of FA is to identify these unobserved factors, reflect what the variables share in common, and provide further information about them. Mathematically, let X represent a column vector that contains p measured variables and has a mean vector µ , F stand for a column vector which contains q latent factors, and L be a p × q matrix that transforms F to X. The elements of L (i.e., factor loadings) give the weights that each factor contributes to each measured variable. In addition, let ε be a column vector containing p uncorrelated random errors. Note that q is smaller than p. The following equation simply illustrates the general model of FA (Johnson & Wichern, 1998): X - µ = LF + ε . FA and PCA yield similar results in many cases, but, in practice, PCA often is preferred for data reduction, while FA is preferred to detect structure of the data. In any experiment, any one scenario may be delineated by a large number of factors. Identifying important factors and putting them into more general categories generates an environment or structure that is more advantageous to data analysis, reducing the large number of variables to smaller, more manageable, interpretable factors (Kachigan, 1986). Technically, FA allows the determination of the interdependency and pattern delineation of data. It “untangles the linear relationships into their separate patterns as each pattern will appear as a factor delineating a distinct cluster of interrelated data” (Rummel, 2002, Section 2.1). In other words, FA attempts to take a group of interdependent variables and create separate descriptive categories and, after this transformation, thereby decrease the number of variables that are used in an experi-
ment (Rummel, 2002). The analysis procedures can be performed through a geometrical presentation by plotting data points in a multi-dimensional coordinate axis (exploratory FA) or through mathematical techniques to test the specified model and suspected relationship among variables (confirmatory FA). In order to illustrate how FA proceeds step by step, here is an example from a case study on the key variables (or characteristics) for induction machines, conducted by Maté and Calderón (2000). The sample of a group of motors was selected from a catalog published by Siemens (1988). It consists of 134 cases with no missing values and 13 variables that are power (P), speed (W), efficiency (E), power factor (PF), current (I), locked-rotor current (ILK), torque (M), locked-rotor torque (MLK), breakdown torque (MBD), inertia (J), weight (WG), slip (S), and slope of Ms curve (M_S). FA can be implemented in the following procedures using this sample data.
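As an illustration of the factor model X − µ = LF + ε introduced above, the following minimal Python sketch (not part of the original case study; the sizes and names are illustrative) simulates data from a known loading matrix and recovers a reduced set of factors with scikit-learn.

```python
# Simulate data following X - mu = L F + eps and fit a factor model to it.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_samples, p, q = 500, 13, 5                 # p measured variables, q latent factors (illustrative)
L = rng.normal(size=(p, q))                  # true loading matrix (p x q)
F = rng.normal(size=(n_samples, q))          # latent factor scores
eps = 0.1 * rng.normal(size=(n_samples, p))  # uncorrelated random errors
mu = rng.normal(size=p)
X = mu + F @ L.T + eps                       # the factor model, written row-wise

fa = FactorAnalysis(n_components=q).fit(X)
loadings = fa.components_.T                  # estimated p x q loading matrix
print(loadings.shape)
```

Here fa.components_.T plays the role of the loading matrix L; in the case study, X would instead be the 134 × 13 matrix of motor data.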
Step 1: Ensuring the Adequacy of the Data

The correlation matrix containing correlations between the variables is first examined to identify the variables that are statistically significant. In the case study, this matrix from the sample data showed that the correlations between the variables are satisfactory, and thus, all variables are kept for the next step. Meanwhile, preliminary tests, such as the Bartlett test, the Kaiser-Meyer-Olkin (KMO) test, and the Measures of Sampling Adequacy (MSA) test, are used to evaluate the overall significance of the correlation. Table 1 shows that the values of MSA (rounded to two decimal places) are higher than 0.5 for all variables but variable W. However, the MSA value of variable W is close to 0.5 (MSA should be higher than 0.5, according to Hair, et al. (1998)).
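For Step 1, the Bartlett test of sphericity mentioned above can be computed directly from the correlation matrix. The sketch below is a generic implementation of the standard formula, not code from the case study; it assumes the measurements are stored row-wise in a NumPy array.

```python
# Sketch of Bartlett's test of sphericity on a data matrix X (rows = cases, columns = variables).
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)                       # correlation matrix of the variables
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2.0
    p_value = chi2.sf(statistic, dof)                      # small p-value: correlations are significant
    return statistic, p_value
```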
Step 2: Finding the Number of Factors

There are many approaches available for this purpose (e.g., common factor analysis, parallel analysis). The case study first employed the plot of eigenvalues vs. the factor number (the number of factors may be 1 to 13) and found that choosing three factors accounts for 91.3% of the total variance. Then, it suggested that the solution be checked
Table 1. Measures of the adequacy of FA to the sample: MSA

Variable:  E     W     I     WG    ILK   J     M     M_S   MBD   MLK   P     PF    S
MSA:       0.75  0.39  0.73  0.87  0.86  0.78  0.76  0.82  0.79  0.74  0.74  0.76  0.85
with the attempt to extract two or three more factors. Based on the comparison between results from different selected methods, the study ended up with five factors.
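The eigenvalue plot used in Step 2 amounts to inspecting the spectrum of the correlation matrix. The following is a hedged, generic sketch of that computation rather than the study's own code.

```python
# Eigenvalues of the correlation matrix and the cumulative share of variance
# explained by the first k factors (e.g., roughly 0.913 after 3 factors in the case study).
import numpy as np

def cumulative_variance(X):
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues in decreasing order
    return np.cumsum(eigvals) / eigvals.sum()
```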
Step 3: Determining the Transformation (or Rotation) Matrix

A commonly used method is orthogonal rotation. Table 2 can represent the transformation matrix, if each cell shows the factor loading of each variable on each factor. For the sample size, only loadings with an absolute value bigger than 0.5 were accepted (Hair et al., 1998), and they are marked 'X' in Table 2 (other loadings lower than 0.5 are not listed). From the table, J, I, M_S, M, WG, and P can be grouped into the first factor; PF, ILK, E, and S belong to the second factor; W and MLK can each be considered a single factor. Note that MBD can go to factor 2 or be an independent factor (factor 5) of the other four. The case study settled on retaining it in the fifth factor, based on the results obtained from other samples. In this step, an oblique rotation is another method to determine the transformation matrix. Since it is a nonorthogonal rotation, the factors are not required to be uncorrelated with each other. This gives better flexibility than an orthogonal rotation. Using the sample data of the case study, a new transformation matrix may be obtained from an oblique rotation, which provides new loadings to group variables into factors.
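Step 3 can be reproduced in outline with scikit-learn, which provides a varimax option for FactorAnalysis in recent versions; the grouping rule below mirrors the 0.5 loading threshold, while the function and variable names are illustrative assumptions rather than the study's own code.

```python
# Orthogonal (varimax) rotation followed by grouping variables on a 0.5 loading threshold.
# Requires scikit-learn >= 0.24 for the rotation argument.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def group_variables(X, names, n_factors=5, threshold=0.5):
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(X)
    loadings = fa.components_.T                      # shape: (n_variables, n_factors)
    groups = {}
    for j in range(n_factors):
        members = [names[i] for i in range(len(names)) if abs(loadings[i, j]) > threshold]
        groups[f"Factor {j + 1}"] = members
    return groups
```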
Step 4: Interpreting the Factors

In the case study, the factor consisting of J, I, M_S, M, WG, and P was named size, because higher values of the weight (WG) and power (P) reflect the larger size of the machine. The factor containing PF, ILK, E, and S is explained as global efficiency. This example provides a general demonstration of the application of FA techniques to data analysis. Various
methods may be incorporated within the concept of FA. Proper selection of methods should depend on the nature of the data and the problem to which FA is applied. Garson (2003) points out the abilities of FA: determining the number of different factors needed to explain the pattern of relationships among the variables, describing the nature of those factors, knowing how well the hypothesized factors explain the observed data, and finding the amount of purely random or unique variance that each observed variable includes. Because of these abilities, FA has been used for various data analysis problems and may be used in a variety of applications in data mining, from science-oriented to business applications. One of the most important uses is to provide a summary of the data. The summary facilitates learning the data structure via an economic description. For example, Pan et al. (1997) employed Artificial Neural Network (ANN) techniques combined with FA for spectroscopic quantization of amino acids. Through FA, the number of input nodes for neural networks was compressed effectively, which greatly sped up the calculations of neural networks. Tan and Wisner (2001) used FA to reduce a set of factors affecting operations management constructs and their relationships. Kiousis (2004) applied an exploratory FA to the New York Times news coverage of eight major political issues during the 2000 presidential election. FA identified two indices that measured the construct of the key independent variable in agenda-setting research, which could then be used in future investigations. Screening variables is another important function of FA. A co-linearity problem will appear if the factors of the variables in the data are very similar to each other. In order to avoid this problem, a researcher can group closely related variables into one category and then extract the one that would have the greatest use in determining a solution (Kachigan, 1986). For example, Borovec (1996) proposed a six-step sequential extraction
Table 2. Grouping of variables using the orthogonal rotation matrix (an 'X' marks each loading with an absolute value above 0.5)

Factor 1: I, J, M, M_S, P, WG
Factor 2: E, ILK, PF, S (MBD also loads on this factor)
Factors 3 and 4: W and MLK (each forms a single-variable factor)
Factor 5: MBD
procedure and applied FA, which found three dominant trace elements from 12 surface stream sediments. These three factors accounted for 78% of the total variance. In another example, Chen, et al. (2001) performed an exploratory FA on 48 financial ratios from 63 firms. Four critical financial ratios were concluded, which explained 80% of the variation in productivity. FA can be used as a scaling method, as well. Oftentimes, after the data are collected, the development of scales is needed among individuals, groups, or nations, when they are intended to be compared and rated. As the characteristics are grouped to independent factors, FA assigns weights to each characteristic according to the observed relationships among the characteristics. For instance, Tafeit, et al. (1999, 2000) provided a comparison between FA and ANN for low-dimensional classification of highdimensional body fat topography data of healthy and diabetic subjects with a high-dimensional and partly highly intercorrelated set of data. They found that the analysis of the extracted weights yielded useful information about the structure of the data. As the weights for each characteristic are obtained by FA, the score (by summing characteristics times these weights) can be used to represent the scale of the factor to facilitate the rating of factors. In addition, FA’s ability to divide closely related variables into different groups is also useful for statistical hypothesis testing, as Rummel (2002) stated, when hypotheses are about the dimensions that can be a group of highly intercorrelated characteristics, such as personality, attitude, social behavior, and voting. For instance, in a study of resource investments in tourism business, Morais, et al. (2003) use confirmatory FA to find that preestablished resource investment scales could not fit their model well. They reexamined each subscale with exploratory FA to identify factors that should not have been included in the original model. There have been controversies about uses of FA. Hand, et al. (2001) pointed out that one important reason is that FA’s solutions are not invariant to various transformations. More precisely, “the extracted factors are basically non-unique, unless extra constraints are imposed” (Hand et al., 2001, p. 84). The same information may reach different interpretations with personal judgment. Nevertheless, no method is perfect. In some situations, other statistical methods, such as regression analysis and cluster analysis, may be more appropriate than FA. However, FA is a well-known and useful tool among datamining techniques.
FUTURE TRENDS

At the Factor Analysis at 100 Conference held in May 2004, the future of FA was discussed. Millsap and Meredith (2004) suggested further research in the area of ordinal measures in multiple populations and technical issues of small samples. These conditions can generate bias in current FA methods, causing results to be suspect. They also suggested further study of the impact of violations of factorial invariance and explanations for these violations. Wall and Amemiya (2004) feel that there are challenges in the area of non-linear FA. Although models exist for non-linear analysis, there are aspects of this area that are not fully understood. However, the flexibility of FA and its ability to reduce the complexity of the data still make FA one of the most commonly used techniques. Incorporated with advances in information technologies, the future of FA shows great promise for applications in the area of data mining.
CONCLUSION

FA is a useful multivariate statistical technique that has been applied in a wide range of disciplines. It enables researchers to effectively extract information from huge databases and attempts to organize and minimize the number of variables used in collecting or measuring data. However, the application of FA in business sectors (e.g., e-business) is relatively new. Currently, the increasing volumes of data in databases and data warehouses are the key issue governing their future development. By allowing the effective mining of potentially useful information from huge databases with many dimensions, FA is definitely helpful in sorting out the significant parts of information for decision makers, if it is used appropriately.
REFERENCES

Borovec, Z. (1996). Evaluation of the concentrations of trace elements in stream sediments by factor and cluster analysis and the sequential extraction procedure. The Science of the Total Environment, 117, 237-250.

Chen, L., Liaw, S., & Chen, Y. (2001). Using financial factors to investigate productivity: An empirical study in Taiwan. Industrial Management & Data Systems, 101(7), 378-384.
Garson, D. (2003). Factor analysis. Retrieved from http:/ /www2.chass.ncsu.edu/garson/pa765/factor.htm Hair, J., Anderson, R., Tatham R., & Black, W. (1998). Multivariate data analysis with readings. Englewood Cliffs, NJ: Prentice-Hall, Inc. Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA. MIT Press. Johnson, R., & Wichern, D. (1998). Applied multivariate statistical analysis. Englewood Cliffs, NJ: Prentice Hall, Inc. Kachigan, S. (1986). Statistical analysis: An interdisciplinary introduction to univariate and multivariate methods. New York: Radius Press. Kiousis, S. (2004). Explicating media salience: A factor analysis of New York Times issue coverage during the 2000 U.S. presidential election. Journal of Communication, 54(1), 71-87. Mate, C., & Calderon, R. (2000). Exploring the characteristics of rotating electric machines with factor analysis. Journal of Applied Statistics, 27(8), 991-1006. Millsap, R., & Meredith, W. (2004). Factor invariance: Historical trends and new developments. Proceedings of the Factor Analysis at 100: Historical Developments and Future Directions Conference, Chapel Hill, North Carolina. Morais, D., Backman, S., & Dorsch, M. (2003). Toward the operationalization of resource investments made between customers and providers of a tourism service. Journal of Travel Research, 41, 362-374. Pan, Z. et al. (1997). Spectroscopic quantization of amino acids by using artificial neural networks combined with factor analysis. Spectrochimica Acta Part A 53, 16291632. Rummel, R.J. (2002). Understanding factor analysis. Retrieved from http://www.hawaii.edu/powerkills/ UFA.HTM Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (1999). The determination of three subcutaneous adipose tissue compartments in non-insulin-dependent diabetes mellitus women with artificial neural networks and factor analysis. Artificial Intelligence in Medicine, 17, 181-193. Tafeit, E., Moller, R., Sudi, K., & Reibnegger, G. (2000). Artificial neural networks compared to factor analysis for low-dimensional classification of high-dimensional body fat topography data of healthy and diabetic subjects. Computers and Biomedical Research, 33, 365-374.
Tan, K., & Wisner, J. (2003). A study of operations management constructs and their relationships. International Journal of Operations & Production Management, 23(11), 1300-1325. Wall, M., & Amemiya, Y. (2004). A review of nonlinear factor analysis methods and applications. Proceedings of the Factor Analysis at 100: Historical Developments and Future Directions Conference, Chapel Hill, North Carolina. Williams, R.H., Zimmerman, D.W., Zumbo, B.D., & Ross, D. (2003). Charles Spearman: British behavioral scientist. Human Nature Review. Retrieved from http://humannature.com/nibbs/03/spearman.html
KEY TERMS

Cluster Analysis: A multivariate statistical technique that assesses the similarities between individuals of a population. Clusters are groups or categories formed so members within a cluster are less different than members from different clusters.

Eigenvalue: The quantity representing the variance of a set of variables included in a factor.

Factor Score: A measure of a factor's relative weight to others, which is obtained using linear combinations of variables.

Homogeneity: The degree of similarity or uniformity among individuals of a population.

Homoscedasticity: A statistical assumption for linear regression models. It requires that the variations around the regression line be constant for all values of input variables.

Matrix: An arrangement of rows and columns to display quantities. A p × q matrix contains p × q quantities arranged in p rows and q columns (i.e., each row has q quantities, and each column has p quantities).

Normality: A statistical assumption for linear regression models. It requires that the errors around the regression line be normally distributed for each value of the input variable.

Variance: A statistical measure of dispersion around the mean within the data. Factor analysis divides the variance of a variable into three elements: common, specific, and error.

Vector: A quantity having both direction and magnitude. This quantity can be represented by an array of components in a column (column vector) or in a row (row vector).
Financial Ratio Selection for Distress Classification
Roberto Kawakami Harrop Galvão, Instituto Tecnológico de Aeronáutica, Brazil
Victor M. Becerra, University of Reading, UK
Magda Abou-Seada, Middlesex University, UK
INTRODUCTION

Prediction of corporate financial distress is a subject that has attracted the interest of many researchers in finance. The development of prediction models for financial distress started with the seminal work by Altman (1968), who used discriminant analysis. Such a technique is aimed at classifying a firm as bankrupt or nonbankrupt on the basis of the joint information conveyed by several financial ratios. The assessment of financial distress is usually based on ratios of financial quantities, rather than absolute values, because the use of ratios deflates statistics by size, thus allowing a uniform treatment of different firms. Moreover, such a procedure may be useful to reflect a synergy or antagonism between the constituents of the ratio.

BACKGROUND

The classification of companies on the basis of financial distress can be performed by using linear discriminant models (also called Z-score models) of the following form (Duda, Hart, & Stork, 2001):

Z(x) = (µ1 − µ2)^T S^(−1) x                                  (1)

where x = [x1 x2 ... xn]^T is a vector of n financial ratios, µ1 ∈ ℜ^n and µ2 ∈ ℜ^n are the sample mean vectors of each group (continuing and failed companies), and S (n × n) is the common sample covariance matrix. Equation 1 can also be written as

Z = w1x1 + w2x2 + ... + wnxn = w^T x                         (2)

where w = [w1 w2 ... wn]^T is a vector of coefficients obtained as

w = S^(−1)(µ1 − µ2)                                          (3)

The optimal cut-off value for classification zc can be calculated as

zc = 0.5 (µ1 − µ2)^T S^(−1) (µ1 + µ2)                        (4)
A given vector x should be assigned to Population 1 if Z(x) > zc, and to Population 2 otherwise. The generalization (or prediction) performance of the Z-score model, that is, its ability to classify objects not used in the modeling phase, can be assessed by using an independent validation set or cross-validation methods (Duda et al., 2001). The simplest cross-validation technique, termed "leave-one-out," consists of separating one of the m modeling objects and obtaining a Z-score model with the remaining m − 1 objects. This model is used to classify the object that was left out. The procedure is repeated for each object in the modeling set in order to obtain a total number of cross-validation errors. Resampling techniques (Good, 1999), such as the Bootstrap method (Davison & Hinkley, 1997), can also be used to assess the sensitivity of the analysis to the choice of the training objects.
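A minimal NumPy sketch of Equations 1-4 and of the classification rule is given below. It assumes the ratio vectors of the two groups are stored in two arrays, one row per firm; it is an illustrative implementation rather than the authors' own code.

```python
# Z-score discriminant of Equations 1-4 for two groups of firms.
import numpy as np

def fit_z_score(X1, X2):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled (common) sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    S_inv = np.linalg.inv(S)
    w = S_inv @ (mu1 - mu2)                          # Equation 3
    z_c = 0.5 * (mu1 - mu2) @ S_inv @ (mu1 + mu2)    # Equation 4
    return w, z_c

def classify(x, w, z_c):
    return 1 if w @ x > z_c else 2                   # Population 1 if Z(x) > z_c
```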
The Financial Ratio Selection Problem

The selection of appropriate ratios from the available financial information is an important and nontrivial stage in building distress classification models. The best choice of ratios will normally depend on the types of companies under analysis and also on the economic context. Although the analyst's market insight plays an important role at this point, the use of data-driven selection techniques can be of value, because the relevance of certain ratios may only become apparent when their joint contribution is considered in a multivariate
context. Moreover, some combinations of ratios may not satisfy the statistical assumptions required in the modeling process, such as normal distribution and identical covariances in the groups being classified, in the case of standard linear discriminant analysis (Duda et al., 2001). Finally, collinearity between ratios may cause the model to have poor prediction ability (Naes & Mevik, 2001). Techniques proposed for ratio selection include normality tests (Taffler, 1982), and clustering followed by stepwise discriminant analysis (Alici, 1996). Most of the works cited in the preceding paragraph begin with a set of ratios chosen from either popularity in the literature, theoretical arguments, or suggestions by financial analysts. However, this article shows that it is possible to select ratios on the basis of data taken directly from the financial statements. For this purpose, we compare two selection methods proposed by Galvão, Becerra, and Abou-Seada (2004). A case study involving 60 failed and continuing British firms in the period from 1997 to 2000 is employed for illustration.
MAIN THRUST

It is not always advantageous to include all available variables in the building of a classification model (Duda et al., 2001). Such an issue has been studied in depth in the context of spectrometry (Andrade, Gomez-Carracedo, Fernandez, Elbergali, Kubista, & Prada, 2003), in which the variables are related to the wavelengths monitored by an optical instrumentation framework. This concept also applies to the Z-score modeling process described in the preceding section. In fact, numerical ill-conditioning tends to increase with (m – n)^(−1), where m is the size of the modeling sample, and n is the number of variables (Tabachnick & Fidell, 2001). If n > m, matrix S becomes singular, thus preventing the use of Equation 1. In this sense, it may be more appropriate to select a subset of the available variables for inclusion in the classification model. The selection procedures to be compared in this article search for a compromise between maximizing the amount of discriminating information available for the model and minimizing collinearity between the classification variables, which is a known cause of generalization problems (Naes & Mevik, 2001). These goals are usually conflicting, because the larger the number of variables, the more information is available, but also the more difficult it is to avoid collinearity.
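Because the condition number recurs in both selection procedures below, it is worth noting that it can be obtained in one line from the singular values of the modeling-data matrix (NumPy's numpy.linalg.cond computes the same quantity); the helper below is a generic sketch, not code from the article.

```python
# Condition number of a data matrix: ratio of its largest to smallest singular value.
import numpy as np

def condition_number(X):
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s[-1]
```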
Algorithm A (Preselection Followed by Exhaustive Search)

If N variables are initially available for selection, they can be combined in 2^N – 1 different subsets (each subset with a number of variables between 1 and N). Thus, the computational workload can be substantially reduced if some variables are preliminarily excluded. In this algorithm, such a preselection is carried out according to a multivariate relevance index W(x) that measures the contribution of each variable x to the classification output when a Z-score model is employed. This index is obtained by using all variables to build a model as in Equation 1 and by multiplying the absolute value of each model weight by the sample standard deviation (including both groups) of the respective variable. An appropriate threshold value for the relevance index W(x) can be determined by augmenting the modeling data with artificial uninformative variables (noise) and then obtaining a Z-score model. Those variables whose relevance is not considerably larger than the average relevance of the artificial variables are then eliminated (Centner, Massart, Noord, Jong, Vandeginste, & Sterna, 1996). After the preselection phase, all combinations of the remaining variables are tested. Subsets with the same number of variables are compared on the basis of the number of classification errors on the modeling set for a Z-score model and the condition number of the matrix of modeling data. The condition number (the ratio between the largest and smallest singular value of the matrix) should be small to avoid collinearity problems (Navarro-Villoslada, Perez-Arribas, Leon-Gonzalez, & Polodiez, 1995). After the best subset has been determined for each given number of variables, a cross-validation procedure is employed to find the optimum number of variables.
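The preselection stage of Algorithm A can be sketched as follows. The code takes a fit_z_score function like the one shown earlier, augments the data with N(0,1) noise variables, and keeps only the ratios whose relevance index clears a multiple of the average noise relevance; the names and the threshold factor of five are working assumptions drawn from the case study, not a definitive implementation.

```python
# Relevance index |model weight| * pooled standard deviation, with a noise-based threshold.
import numpy as np

def relevance_index(X1, X2, w):
    X = np.vstack([X1, X2])
    return np.abs(w) * X.std(axis=0, ddof=1)          # one relevance value per variable

def preselect(X1, X2, fit_z_score, n_noise=7, factor=5.0, seed=0):
    rng = np.random.default_rng(seed)
    X1a = np.hstack([X1, rng.normal(size=(len(X1), n_noise))])   # augment with N(0,1) noise
    X2a = np.hstack([X2, rng.normal(size=(len(X2), n_noise))])
    w, _ = fit_z_score(X1a, X2a)
    rel = relevance_index(X1a, X2a, w)
    threshold = factor * rel[-n_noise:].mean()        # e.g. five times the average noise relevance
    return np.where(rel[:-n_noise] > threshold)[0]    # indices of retained variables
```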
Algorithm B (Genetic Selection)

The drawback of the preselection procedure employed in Algorithm A is that some variables that display a small relevance index when all variables are considered together could be useful in smaller subsets. An alternative to such a preselection consists of employing a genetic algorithm (GA), which tests subsets of variables in an efficient way instead of performing an exhaustive search (Coley, 1999; Lestander, Leardi, & Geladi, 2003). The GA represents subsets of variables as individuals competing for survival in a population. The genetic
code of each individual is stored in a chromosome, which is a string of N binary genes, each gene associated with one of the variables available for selection. The genes with value 1 indicate the variables that are to be included in the classification model. In the formulation adopted here, the measure F of the survival fitness of each individual is defined as follows. A Z-score model is obtained from Equation 1 with the variables indicated in the chromosome, and then F is calculated as

F = (e + ρr)^(−1)                                            (5)
where e is the number of classification errors in the modeling set, r is the condition number associated to the variables included in the model, and ρ > 0 is a design parameter that balances modeling accuracy against collinearity prevention. The larger ρ is, the more emphasis is placed on avoiding collinearity. After a random initialization of the population, the algorithm proceeds according to the classic evolutionary cycle (Coley, 1999). At each generation, the roulette method is used for mating pool selection, followed by the genetic operators of one-point crossover and point mutation. The population size is kept constant, with each generation replacing the previous one completely. However, the best-fitted individual is preserved from one generation to the next (“elitism”) in order to prevent good solutions from being lost.
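A compact sketch of the genetic selection loop is shown below. The fitness follows Equation 5, and evaluate_subset is a placeholder callback assumed to return the number of modeling errors e and the condition number r for a candidate subset; population size, crossover, and mutation settings are parameters rather than fixed choices, so this is an illustration of the scheme, not the authors' implementation.

```python
# Genetic selection of variable subsets with roulette selection, one-point crossover,
# point mutation, and elitism; fitness F = 1 / (e + rho * r) as in Equation 5.
import numpy as np

def genetic_selection(evaluate_subset, n_vars, rho=1.0, pop_size=100, generations=100,
                      p_cross=0.6, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_vars))           # random initial chromosomes

    def fitness(chrom):
        if chrom.sum() == 0:
            return 0.0                                          # empty subsets are not viable
        e, r = evaluate_subset(np.flatnonzero(chrom))
        return 1.0 / (e + rho * r)

    best, best_fit = None, -np.inf
    for _ in range(generations):
        fits = np.array([fitness(c) for c in pop])
        if fits.max() > best_fit:                               # remember the best individual
            best_fit, best = fits.max(), pop[fits.argmax()].copy()
        probs = fits / fits.sum() if fits.sum() > 0 else np.full(pop_size, 1.0 / pop_size)
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]   # roulette-wheel selection
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):                     # one-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_vars)
                children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
        mutate = rng.random(children.shape) < p_mut             # point mutation
        children[mutate] = 1 - children[mutate]
        children[0] = best                                      # elitism: keep the best individual
        pop = children
    return np.flatnonzero(best)                                 # indices of the selected variables
```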
CASE STUDY

Table 1. Numbering of financial ratios (Num/Den). In the original table, conventional ratios are displayed in boxes. WC = working capital, PBIT = profit before interest and tax, EQ = equity, S = sales, TL = total liabilities, ARP = accumulated retained profit, RPY = retained profit for the year, TA = total assets.

              Num:  WC   PBIT  EQ   S    TL   ARP  RPY
Den:  WC            -
      PBIT          1    -
      EQ            2    8     -
      S             3    9     14   -
      TL            4    10    15   19   -
      ARP           5    11    16   20   23   -
      RPY           6    12    17   21   24   26   -
      TA            7    13    18   22   25   27   28

This example employs financial data from 29 failed and 31 continuing British corporations in the period from 1997 to 2000. The data for the failed firms were taken from the last financial statements published prior to the start of insolvency proceedings. Eight financial quantities
were extracted from the statements, allowing 28 ratios to be built, as shown in Table 1. Quantities WC, PBIT, EQ, S, TL, ARP, and TA are commonly found in the financial distress literature (Altman, 1968; Taffler, 1982; Alici, 1996), and the ratios shown in boxes are those adopted by Altman (1968). It is worth noting that the book value of equity was used rather than the market value of equity to allow the inclusion of firms not quoted in the stock market. Quantity RPY is not typically employed in distress models, but we include it here to illustrate the ability of the selection algorithms to discard uninformative variables. The data employed in this example are given in Galvão, Becerra, and Abou-Seada (2004). The data set was divided into a modeling set (21 failed and 21 continuing firms) and a validation set (8 failed and 10 continuing firms). In what follows, the errors will be divided into Type 1 (failed company classified as continuing) and 2 (continuing company classified as failed).
Conventional Financial Ratios

Previous studies (Becerra, Galvão, & Abou-Seada, 2001) with this data set revealed that when the five conventional ratios are employed, Ratio 13 (PBIT/TA) is actually redundant and should be excluded from the Z-score model in order to avoid collinearity problems. Thus, Equation 1 was applied only to the remaining four ratios, leading to the results shown in Table 2. It is worth noting that if Ratio PBIT/TA is not discarded, the number of validation errors increases from four to seven.
Algorithm A

The preselection procedure was carried out by augmenting the 28 financial ratios with seven uninformative variables yielded by an N(0,1) random number generator. The relevance index thus obtained is shown in Figure 1. The threshold value, represented by a horizontal line, was set to five times the average relevance of the uninformative variables. As a result, 13 ratios were discarded. After the preselection phase, combinations of the 15 remaining ratios were tested for modeling accuracy and condition number.
Table 2. Results of a Z-score model using four conventional ratios

Data set            Type 1 errors   Type 2 errors   Percent accuracy
Modeling            2               7               79%
Cross-validation    3               8               74%
Validation          0               4               78%
Figure 1. Relevance index of the financial ratios. Log 10 values are displayed for the convenience of visualization.
Table 3. Results of a Z-score model using the five ratios selected by Algorithm A

Data set            Type 1 errors   Type 2 errors   Percent accuracy
Modeling            1               3               90%
Cross-validation    2               5               83%
Validation          1               2               83%
Figure 2 displays the number of modeling and cross-validation errors for the best subsets obtained as a function of the number n of ratios included. The cross-validation curve reaches a minimum between five and seven ratios. In this situation, the use of five ratios was deemed more appropriate due to the use of the well-known Parsimony Principle, which states that given models with similar prediction ability, the simplest should be favored (Duda et al., 2001). The selected ratios were 7 (WC/TA), 9 (PBIT/S), 10 (PBIT/TL), 13 (PBIT/TA), and 25 (TL/TA). Interestingly, two of these ratios (WC/TA and PBIT/TA) belong to the set advocated by Altman (1968). Note that no ratio involving quantity RPY was selected, which is in agreement with the fact that, in comparison with the other financial quantities used in this study, RPY has not often been used in the financial distress literature. The results of using the five selected ratios are summarized in Table 3.

Figure 2. Modeling (dashed line) and cross-validation (solid line) errors as a function of the number of ratios included in the Z-score model. The arrow indicates the final choice of ratios.

Algorithm B
The genetic algorithm was employed with a population size of 100 individuals. The crossover and mutation probabilities were set to 60% and 5% , respectively, and the number of generations was set to 100. Three values were tested for the design parameter ρ: 10, 1, and 0,1. For each value of ρ, the algorithm was run five times, starting from different populations, and the best result (in terms of fitness) was kept. The selected ratios are shown in Table 4, which also presents the number of modeling and cross-validation errors in each case. Regardless of the value of ρ, no ratio involving quantity RPY was selected (see Table 1). On the basis of cross-validation performance, ρ= 1 is seen to be the best choice (Ratios 2, 9, and 15). In fact, a smaller value for ρ, which causes the algorithm to place less emphasis on collinearity avoidance, results in the selection of more ratios. As a result, there is a gain in modeling accuracy but not in generalization ability (as assessed by cross-validation), which means that the model has become excessively complex. On the other hand, a larger value for ρ discouraged the selection of sets with more than one ratio. In this case, there is not enough discriminating information, which results in poor modeling and cross-validation performances. Table 5 details the results for ρ = 1. Notice that one of the ratios obtained in this case belongs to the set used by Altman (EQ/TL). It is also worth pointing out that Ratios 2 and 9 were discarded in the preselection phase of Algorithm A, which prevented the algorithm from taking the GA solution {2, 9, 15} into account.
Resampling Study

The results in the preceding two sections were obtained for one given partition of the available data into modeling and validation sets. In order to assess the validity of
Table 4. GA results for different values of the weight parameter ρ

ρ      Selected ratios              Modeling errors   Cross-validation errors
10     {7}                          11                11
1      {2, 9, 15}                   4                 5
0.1    {2, 9, 13, 18, 19, 22, 25}   3                 7
Table 6. Resampling results (average number of errors)

Data set     Conventional (4 ratios)   Algorithm A (5 ratios)   Algorithm B (3 ratios)
Modeling     8.52                      5.70                     5.36
Validation   4.69                      3.22                     2.89
the ratio selection scheme employed, a resampling study was carried out. For this purpose, 1,000 different modeling/validation partitions were randomly generated with the same size as the one employed before (42 modeling companies and 18 validation companies). For each partition, a Z-score model was built and validated by using the subsets of ratios obtained in the previous subsections. Table 6 presents the average number of resulting errors. As can be seen, the ratios selected by Algorithms A and B lead to better classifiers than the conventional ones. The best result, in terms of both modeling and validation performances, was obtained by Algorithm B. Such a finding is in line with the parsimony of the associated classifier, which employs only three ratios.
Table 5. Results of a Z-score model using the three ratios selected by Algorithm B

Data set            Type 1 errors   Type 2 errors   Percent accuracy
Modeling            4               0               90%
Cross-validation    4               1               88%
Validation          1               3               78%

FUTURE TRENDS

The research on distress prediction has been moving towards nonparametric modeling in order to circumvent the limitations of discriminant analysis (such as the need for the classes to exhibit multivariate normal distributions with identical covariances). In this context, neural networks have been found to be a useful alternative, as demonstrated in a number of works (Wilson & Sharda, 1994; Alici, 1996; Atiya, 2001; Becerra et al., 2001). In light of the utility of data-driven ratio selection techniques for discriminant analysis, as discussed in this article, it is to be expected that ratio selection may also play an important role for neural network models. However, future developments along this line will have to address the difficulty that nonlinear classifiers usually have to be adjusted by numerical search techniques (training algorithms), which may be affected by local minima problems (Duda et al., 2001).

CONCLUSION

The selection of appropriate ratios from the available financial quantities is an important and nontrivial step in building models for corporate financial distress classification. This article shows how data-driven variable selection techniques can be useful tools in building distress classification models. The article compares the results of two such techniques, one involving preselection followed by exhaustive search, and the other employing a genetic algorithm.

REFERENCES

Alici, Y. (1996). Neural networks in corporate failure prediction: The UK experience. In A. P. Refenes, Y. Abu-Mostafa, & J. Moody (Eds.), Neural networks in financial engineering. London: World Scientific.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23, 505-609. Andrade, J. M., Gomez-Carracedo, M. P., Fernandez, E., Elbergali, A., Kubista, M., & Prada, D. (2003). Classification of commercial apple beverages using a minimum set of mid-IR wavenumbers selected by Procrustes rotation. Analyst, 128(9), 1193-1199. Atiya, A. F. (2001). Bankruptcy prediction for credit risk using neural networks: A survey and new results. IEEE Transactions on Neural Networks, 12(4), 929-935. Becerra, V. M., Galvão, R. K. H., & Abou-Seada, M. (2001). Financial distress classification employing neural networks. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (pp. 4549). Centner, V., Massart, D. L., Noord, O. E., Jong, S., Vandeginste, B. M., & Sterna, C. (1996). Elimination of uninformative variables for multivariate calibration. Analytical Chemistry, 68, 3851-3858.
Coley, D. A. (1999). An introduction to genetic algorithms for scientists and engineers. Singapore: World Scientific.
Wilson, R. L., & Sharda, R. (1994). Bankruptcy prediction using neural networks. Decision Support Systems, 11, 545-557.
Davison, A. C., & Hinkley, D. V. (Eds.). (1997). Bootstrap methods and their application. Cambridge, MA: Cambridge University Press.
KEY TERMS
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley. Galvão, R. K. H., Becerra, V. M., & Abou-Seada, M. (2004). Ratio selection for classification models. Data Mining & Knowledge Discovery, 8, 151-170.
Condition Number: Ratio between the largest and smallest singular values of a matrix, often employed to assess the degree of collinearity between variables associated to the columns of the matrix.
Good, P. I. (1999). Resampling methods: A practical guide to data analysis. Boston, MA: Birkhauser.
Cross-Validation: Resampling method in which elements of the modeling set itself are alternately removed and reinserted for validation purposes.
Lestander, T. A., Leardi, R., & Geladi, P. (2003). Selection of near infrared wavelengths using genetic algorithms for the determination of seed moisture content. Journal of Near Infrared Spectroscopy, 11(6), 433-446.
Financial Distress: A company is said to be under financial distress if it is unable to pay its debts as they become due, which is aggravated if the value of the firm’s assets is lower than its liabilities.
Naes, T., & Mevik, B. H. (2001). Understanding the collinearity problem in regression and discriminant analysis. Journal of Chemometrics, 15(4), 413-426.
Financial Ratio: Ratio formed from two quantities taken from a financial statement.
Navarro-Villoslada, F., Perez-Arribas, L. V., LeonGonzalez, M. E., & Polodiez, L. M. (1995). Selection of calibration mixtures and wavelengths for different multivariate calibration methods. Analytica Chimica Acta, 313(1-2), 93-101. Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston, MA: Allyn & Bacon. Taffler, R. J. (1982). Forecasting company failure in the UK using discriminant analysis and financial ratio data. Journal of the Royal Statistical Society, Series A, 145, 342-358.
Genetic Algorithm: Optimization technique inspired by the mechanisms of evolution by natural selection, in which the possible solutions are represented as the chromosomes of individuals competing for survival in a population. Linear Discriminant Analysis: Multivariate classification technique that models the classes under consideration by normal distributions with equal covariances, which leads to hyperplanes as the optimal decision surfaces. Resampling: Validation technique employed to assess the sensitivity of the classification method with respect to the choice of modeling data.
Flexible Mining of Association Rules

Hong Shen, Japan Advanced Institute of Science and Technology, Japan
INTRODUCTION
The discovery of association rules showing conditions of data co-occurrence has attracted the most attention in data mining. An example of an association rule is the rule "the customer who bought bread and butter also bought milk," expressed by T(bread; butter)→T(milk). Let I = {x1, x2, …, xm} be a set of (data) items, called the domain; let D be a collection of records (transactions), where each record, T, has a unique identifier and contains a subset of items in I. We define an itemset to be a set of items drawn from I and denote an itemset containing k items to be a k-itemset. The support of itemset X, denoted by σ(X/D), is the ratio of the number of records (in D) containing X to the total number of records in D. An association rule is an implication rule X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The confidence of X ⇒ Y is the ratio of σ(X ∪ Y/D) to σ(X/D), indicating the percentage of those containing X that also contain Y. Based on the user-specified minimum support (minsup) and confidence (minconf), the following statements hold: an itemset X is frequent if σ(X/D) > minsup, and an association rule X ⇒ Y is strong if X ∪ Y is frequent and σ(X ∪ Y/D)/σ(X/D) ≥ minconf.

The problem of mining association rules is to find all strong association rules, which can be divided into two subproblems:

1. Find all the frequent itemsets.
2. Generate all strong rules from all frequent itemsets.

Because the second subproblem is relatively straightforward (we can solve it by extracting every subset from an itemset and examining the ratio of its support), most of the previous studies (Agrawal, Imielinski, & Swami, 1993; Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995) emphasized developing efficient algorithms for the first subproblem. This article introduces two important techniques for association rule mining: (a) finding the N most frequent itemsets and (b) mining multiple-level association rules.

BACKGROUND

An association rule is called a binary association rule if all items (attributes) in the rule have only two values: 1 (yes) or 0 (no). Mining binary association rules was the first proposed data mining task and was studied most intensively. Centralized on the Apriori approach (Agrawal et al., 1993), various algorithms were proposed (Savasere et al., 1995; Shen, 1999; Shen, Liang, & Ng, 1999; Srikant & Agrawal, 1996). Almost all the algorithms observe the downward property that all the subsets of a frequent itemset must also be frequent, with different pruning strategies to reduce the search space. Apriori works by finding frequent k-itemsets from frequent (k-1)-itemsets iteratively for k = 1, 2, …, m-1. Two alternative approaches, mining on domain partition (Shen, L., Shen, H., & Cheng, 1999) and mining based on knowledge network (Shen, 1999), were proposed. The first approach partitions items suitably into disjoint itemsets, and the second approach maps all records to individual items; both approaches aim to alleviate the bottleneck of Apriori, which requires multiple phases of scans (reads) over the database. Finding all the association rules that satisfy minimal support and confidence is undesirable in many cases for a user's particular requirements. It is therefore necessary to mine association rules more flexibly according to the user's needs. Mining different sets of association rules of a small size for the purposes of prediction and classification was proposed (Li, Shen, & Topor, 2001; Li, Shen, & Topor, 2002; Li, Shen, & Topor, 2004; Li, Topor, & Shen, 2002).
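For concreteness, a generic sketch of the Apriori iteration and its downward-closure pruning is given below; it is a textbook-style illustration rather than code from any of the cited algorithms.

```python
# Level-wise Apriori: frequent (k-1)-itemsets generate candidate k-itemsets, and a candidate
# is pruned if any of its (k-1)-subsets is not frequent (downward-closure property).
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
    k = 2
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates if support(c) >= minsup})
        k += 1
    return set().union(*frequent[:-1]) if frequent else set()
```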
MAIN THRUST

Association rule mining can be carried out flexibly to suit different needs. We illustrate this by introducing important techniques to solve two interesting problems.
Finding N Most Frequent Itemsets

Given x, y ⊆ I, we say that x is greater than y, or y is less than x, if σ(x/D) > σ(y/D). The largest itemset in D is the itemset that occurs most frequently in D. We want to find the N largest itemsets in D, where N is a user-specified number of interesting itemsets. Because users are usually interested in those itemsets with larger supports, finding the N most frequent itemsets is significant, and its solution can be used to generate an appropriate number of interesting itemsets for mining association rules (Shen, L., Shen, H., Pritchard, & Topor, 1998). We define the rank of itemset x, denoted by θ(x), as θ(x) = |{y | σ(y/D) > σ(x/D), ∅ ⊂ y ⊆ I}| + 1. Call x a winner if θ(x) ≤ N.
To find all the winners, the algorithm makes multiple passes over the data. In the first pass, we count the supports of all 1-itemsets, select the N largest ones from them to form W1, and then use W1 to generate the potential 2-winners of size 2. Each subsequent pass k involves three steps: first, we count the support for the potential k-winners of size k (called candidates) during the pass over D; then we select the N largest ones from a pool precisely containing the supports of all these candidates and all (k-1)-winners to form Wk; finally, we use Wk to generate the potential (k+1)-winners of size k+1, which will be used in the next pass. This process continues until we cannot get any potential (k+1)-winners of size k+1, which implies that Wk+1 = Wk. From Property 2, we know that the last Wk exactly contains all winners. We assume that Mk is the number of itemsets with support equal to k-crisup and a size not greater than k, where 1
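A simplified sketch of this pass-wise search is given below. It keeps the N best itemsets seen so far as the winner set and joins them to form the next level's candidates; the article's exact candidate-generation and pruning rules (Properties 1 and 2, crisup) are not reproduced here, so this should be read as an approximation of the idea rather than the algorithm itself.

```python
# Hedged sketch: maintain the N itemsets of highest support while growing candidate sizes.
def n_most_frequent(transactions, N):
    transactions = [frozenset(t) for t in transactions]
    count = lambda c: sum(c <= t for t in transactions)
    items = {i for t in transactions for i in t}
    scored = {frozenset([i]): count(frozenset([i])) for i in items}
    winners = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:N])   # W1: N largest 1-itemsets
    k = 2
    while True:
        candidates = {a | b for a in winners for b in winners if len(a | b) == k}
        if not candidates:
            break
        pool = dict(winners)
        pool.update({c: count(c) for c in candidates})                  # candidates plus current winners
        new_winners = dict(sorted(pool.items(), key=lambda kv: -kv[1])[:N])
        if new_winners == winners:                                      # no new itemsets entered the set
            break
        winners = new_winners
        k += 1
    return winners
```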
Mining Multiple-Level Association Rules

Although most previous research emphasized mining association rules at a single concept level (Agrawal et al., 1993; Agrawal et al., 1996; Park et al., 1995; Savasere et al., 1995; Srikant & Agrawal, 1996), some techniques were also proposed to mine rules at generalized abstract (multiple) levels (Han & Fu, 1995). However, they can only find multiple-level rules in a fixed concept hierarchy. Our study in this fold is motivated by the goal of mining multiple-level rules in all concept hierarchies (Shen, L., & Shen, H., 1998). A concept hierarchy can be defined on a set of database attribute domains such as D(a1), …, D(an), where, for i ∈ [1, n], ai denotes an attribute, and D(ai) denotes the domain of ai. The concept hierarchy is usually partially ordered according to a general-to-specific ordering. The most general concept is the null description ANY, whereas the most specific concepts correspond to the specific attribute values in the database. Given a set of D(a1), …, D(an), we define a concept hierarchy H as follows: H^n → H^(n-1) → … → H^0, where H^i = D(a1^i) × … × D(ai^i) for i ∈ [0, n], and {a1^n, …, an^n} = {a1, …, an} ⊃ {a1^(n-1), …, a(n-1)^(n-1)} ⊃ … ⊃ Ø. Here, H^n represents the set of concepts at the primitive level, H^(n-1) represents the concepts at one level higher than those at H^n, and so forth; H^0, the highest level hierarchy, may contain solely the most general concept,
ANY. We also use {a1^n, …, an^n} → {a1^(n-1), …, a(n-1)^(n-1)} → … → {a1^1} to denote H directly, and H^0 may be omitted here. We introduce FML items to represent concepts at any level of a hierarchy. Let *, called a trivial digit, be a "don't-care" digit. An FML item is represented by a sequence of digits, x = x1x2…xn, xi ∈ D(ai) ∪ {*}. The flat-set of x is defined as Sf(x) = {(i, xi) | i ∈ [1, n] and xi ≠ *}. Given two items x and y, x is called a generalized item of y if Sf(x) ⊆ Sf(y), which means that x represents a higher-level concept that contains the lower-level concept represented by y. Thus, *5* is a generalized item of 35* due to Sf(*5*) = {(2,5)} ⊆ Sf(35*) = {(1,3),(2,5)}. If Sf(x) = Ø, then x is called a trivial item, which represents the most general concept, ANY. Let T be an encoded transaction table, t a transaction in T, x an item, and c an itemset. We can say that (a) t supports x if an item y exists in t such that x is a generalized item of y and (b) t supports c if t supports every item in c. The support of an itemset c in T, σ(c/T), is the ratio of the number of transactions (in T) that support c to the total number of transactions in T. Given a minsup, an itemset c is large if σ(c/T) ≥ minsup; otherwise, it is small. Given an itemset c, we define its simplest form as Fs(c) = {x ∈ c | ∀y ∈ c, Sf(x) ⊄ Sf(y)} and its complete form as Fc(c) = {x | Sf(x) ⊆ Sf(y), y ∈ c}. Given an itemset c, we call the number of elements in Fs(c) its size and the number of elements in Fc(c) its weight. An itemset of size j and weight k is called a (j)-itemset, [k]-itemset, or (j)[k]-itemset. Let c be a (j)[k]-itemset. Use Gi(c) to indicate the set of all [i]-generalized-subsets of c, where i
start with Lk-1, the set of all large [k-1]-itemsets, and use Lk-1 to generate Ck, a superset of all large [k]-itemsets. Call the elements in Ck candidate itemsets, and count the support for these itemsets during the pass over the data. At the end of the pass, we determine which of these itemsets are actually large and obtain Lk for the next pass. This process continues until no new large itemsets are found. Note that L1 = {{trivial item}}, because {trivial item} is the unique [1]-itemset and is supported by all transactions. The computation cost of the preceding algorithm, which finds all frequent FML itemsets, is O(Σ_{c∈C} (g(c) + s(c))), where C is the set of all candidates, g(c) is the cost for generating c as a candidate, and s(c) is the cost for counting the support of c (Shen, L., & Shen, H., 1998). The algorithm is optimal if the method of support counting is optimal. After all frequent FML itemsets have been found, we can proceed with the construction of strong FML rules. Use r(l,a) to denote the rule Fs(a) → Fs(Fc(l) - Fc(a)), where l is an itemset and a ⊂ l. Use Fo(l,a) to denote Fc(l) - Fc(a) and say that Fo(l,a) is an outcome form of l or the outcome form of r(l,a). Note that Fo(l,a) represents only a specific form rather than a meaningful itemset, so it is not equivalent to any other itemset whose simplest form is Fs(Fo(l,a)). Outcome forms are also called outcomes directly. Clearly, the corresponding relationship between rules and outcomes is one to one. An outcome is strong if it corresponds to a strong rule. Thus, all strong rules related to a large itemset can be obtained by finding all strong outcomes of this itemset. Let l be an itemset. Use O(l) to denote the set of all outcomes of l; that is, O(l) = {Fo(l,a) | a ⊂ l}. Thus, from O(l), we can output all rules related to l: Fs(Fc(l) - o) ⇒ Fs(o) (denoted by r̂(l,o)), where o ∈ O(l). Clearly, r(l,a) and r̂(l,o) denote the same rule if o = Fo(l,a). Let o, ô ∈ O(l). We can say two things: (a) o is a |k|-outcome of l if o exactly
contains k elements and (b) ô is a sub-outcome of o versus l if ô ⊂ o. Use Ok(l) to denote the set of all the |k|-outcomes of l and use Vm(o,l) to denote the set of all the |m|-sub-outcomes of o versus l. Let o be an |m+1|-outcome of l and m ≥ 1. If |Vm(o,l)| = 1, then o is called an elementary outcome; otherwise, o is a non-elementary outcome. Let r(l,a) and r(l,b) be two rules. We can say that r(l,a) is an instantiated rule of r(l,b) if b ⊂ a. Clearly, b ⊂ a implies σ(b/T) > σ(a/T) and σ(l/T)/σ(a/T) ≥ σ(l/T)/σ(b/T). Hence, all instantiated rules of a strong rule must also be strong. Let l be a large itemset, o1, o2 ∈ O(l), o1 = Fo(l,a), and o2 = Fo(l,b). The three straightforward conclusions are (a) o1 ⊆ o2 iff b ⊂ a, (b) r̂(l,o1) = r(l,a), and (c) r̂(l,o2) = r(l,b).
Therefore, r̂(l,o1) is an instantiated rule of r̂(l,o2) iff o1 ⊆ o2, which implies that all the sub-outcomes of a strong outcome must also be strong. This characteristic is similar to the property that all generalized subsets of a large itemset must also be large. Hence, the algorithm works as follows. From a large itemset l, we first generate all the strong rules with |1|-outcomes (we ignore the |0|-outcome Ø, which only corresponds to the trivial strong rule Fs(l) ⇒ Ø for every large itemset l). We then use all these strong |1|-outcomes to generate all the possible strong |2|-outcomes of l, from which all the strong rules with |2|-outcomes can be produced, and so forth. The computation cost of the algorithm constructing all the strong FML rules is O(|C|t + |C|s) (Shen, L., & Shen, H., 1998), which is optimal for bounded costs t and s, where C is the set of all the candidate rules, s is the average cost for generating one element in C, and t is the average cost for computing the confidence of one element in C.
FUTURE TRENDS The discovery of data association is a central topic of data mining. Mining this association flexibly to suit different needs has an increasing significance for applications. Future research trends include mining association rules in data of complex types such as multimedia, time-series, text, and biological databases, in which certain properties (e.g., multidimensionality), constraints (e.g., time variance), and structures (e.g., sequence) of the data must be taken into account during the mining process; mining association rules in databases containing incomplete, missing, and erroneous data, which requires people to produce correct results regardless of the unreliability of data; mining association rules online for streaming data that come from an openended data stream and flow away instantaneously at a high rate, which usually adopts techniques such as active learning, concept drift, and online sampling and proceeds in an incremental manner.
CONCLUSION

We have summarized research activities in mining association rules and some recent results of our research in different streams of this topic. We have introduced important techniques to solve two interesting problems of mining the N most frequent itemsets and mining multiple-level association rules, respectively. These techniques show
that mining association rules can be performed flexibly in user-specified forms to suit different needs. More relevant results can be found in Shen (1999). We hope that this article provides useful insight to researchers for a better understanding of more complex problems and their solutions in this research direction.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216).

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In U. Fayyad (Ed.), Advances in knowledge discovery and data mining (pp. 307-328). MIT Press.

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. Proceedings of the 21st VLDB International Conference (pp. 420-431).

Li, J., Shen, H., & Topor, R. (2001). Mining the smallest association rule set for predictions. Proceedings of the IEEE International Conference on Data Mining (pp. 361-368).

Li, J., Shen, H., & Topor, R. (2002). Mining the optimum class association rule set. Knowledge-Based Systems, 15(7), 399-405.

Li, J., Shen, H., & Topor, R. (2004). Mining the informative rule set for prediction. Journal of Intelligent Information Systems, 22(2), 155-174.

Li, J., Topor, R., & Shen, H. (2002). Construct robust rule sets for classification. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 564-569).

Park, J. S., Chen, M., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 175-186).

Savasere, A., Omiecinski, R., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the 21st VLDB International Conference (pp. 432-443).

Shen, H. (1999). New achievements in data mining. In J. Sun (Ed.), Science: Advancing into the new millennium (pp. 162-178). Beijing: People's Education.

Shen, H., Liang, W., & Ng, J. K-W. (1999). Efficient computation of frequent itemsets in a subcollection of multiple set families. Informatica, 23(4), 543-548.
Shen, L., & Shen, H. (1998). Mining flexible multiple-level association rules in all concept hierarchies. Proceedings of the Ninth International Conference on Database and Expert Systems Applications (pp. 786-795).

Shen, L., Shen, H., & Cheng, L. (1999). New algorithms for efficient mining of association rules. Information Sciences, 118, 251-268.

Shen, L., Shen, H., Pritchard, P., & Topor, R. (1998). Finding the N largest itemsets. Proceedings of the IEEE International Conference on Data Mining, 19 (pp. 211-222).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. ACM SIGMOD Record, 25(2), 1-8.

KEY TERMS

Association Rule: An implication rule X ⇒ Y that shows the conditions of co-occurrence of disjoint itemsets (attribute value sets) X and Y in a given database.

Concept Hierarchy: The organization of a set of database attribute domains into different levels of abstraction according to a general-to-specific ordering.

Confidence of Rule X ⇒ Y: The fraction of the database containing X that also contains Y, which is the ratio of the support of X ∪ Y to the support of X.

Flexible Mining of Association Rules: Mining association rules in user-specified forms to suit different needs, such as on dimension, level of abstraction, and interestingness.

Frequent Itemset: An itemset that has a support greater than user-specified minimum support.

Strong Rule: An association rule whose support (of the union of itemsets) and confidence are greater than user-specified minimum support and confidence, respectively.

Support of Itemset X: The fraction of the database that contains X, which is the ratio of the number of records containing X to the total number of records in the database.
Formal Concept Analysis Based Clustering Jamil M. Saquer Southwest Missouri State University, USA
INTRODUCTION Formal concept analysis (FCA) is a branch of applied mathematics with roots in lattice theory (Wille, 1982; Ganter & Wille, 1999). It deals with the notion of a concept in a given universe, which it calls context. For example, consider the context of transactions at a grocery store where each transaction consists of the items bought together. A concept here is a pair of two sets (A, B). A is the set of transactions that contain all the items in B and B is the set of items common to all the transactions in A. A successful area of application for FCA has been data mining. In particular, techniques from FCA have been successfully used in the association mining problem and in clustering (Kryszkiewicz, 1998; Saquer, 2003; Zaki & Hsiao, 2002). In this article, we review the basic notions of FCA and show how they can be used in clustering.
BACKGROUND

A fundamental notion in FCA is that of a context, which is defined as a triple (G, M, I), where G is a set of objects, M is a set of features (or attributes), and I is a binary relation between G and M. For object g and feature m, gIm if and only if g possesses the feature m. An example of a context is given in Table 1, where an "X" is placed in the ith row and jth column to indicate that the object in row i possesses the feature in column j. The set of features common to a set of objects A is denoted by β(A) and is defined as {m ∈ M | gIm ∀ g ∈ A}.

Table 1. A context excerpted from (Ganter & Wille, 1999, p. 18)
a = needs water to live; b = lives in water; c = lives on land; d = needs chlorophyll; e = two seeds leaf; f = one seed leaf; g = can move around; h = has limbs; i = suckles its offsprings

             | a | b | c | d | e | f | g | h | i
1 Leech      | X | X |   |   |   |   | X |   |
2 Bream      | X | X |   |   |   |   | X | X |
3 Frog       | X | X | X |   |   |   | X | X |
4 Dog        | X |   | X |   |   |   | X | X | X
5 Spike-weed | X | X |   | X |   | X |   |   |
6 Reed       | X | X | X | X |   | X |   |   |
7 Bean       | X |   | X | X | X |   |   |   |
8 Maize      | X |   | X | X |   | X |   |   |
Similarly, the set of objects possessing all the features in a set of features B is denoted by α(B) and is given by {g ∈ G | gIm ∀ m ∈ B}. The operators α and β satisfy the assertions given in the following lemma.

Lemma 1 (Wille, 1982): Let (G, M, I) be a context. Then the following assertions hold:
1. A1 ⊆ A2 implies β(A2) ⊆ β(A1) for every A1, A2 ⊆ G, and B1 ⊆ B2 implies α(B2) ⊆ α(B1) for every B1, B2 ⊆ M.
2. A ⊆ α(β(A)) and β(A) = β(α(β(A))) for all A ⊆ G, and B ⊆ β(α(B)) and α(B) = α(β(α(B))) for all B ⊆ M.

A formal concept in the context (G, M, I) is defined as a pair (A, B) where A ⊆ G, B ⊆ M, β(A) = B, and α(B) = A. A is called the extent of the formal concept and B is called its intent. For example, the pair (A, B) where A = {2, 3, 4} and B = {a, g, h} is a formal concept in the context given in Table 1. A subconcept/superconcept order relation on concepts is defined as follows: (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (or equivalently, iff B2 ⊆ B1). The fundamental theorem of FCA states that the set of all concepts on a given context is a complete lattice, called the concept lattice (Ganter & Wille, 1999). Concept lattices are drawn using Hasse diagrams, where concepts are represented as nodes. An edge is drawn between concepts C1 and C2 iff C1 ≤ C2 and there is no concept C3 distinct from C1 and C2 such that C1 ≤ C3 ≤ C2. The concept lattice for the context in Table 1 is given in Figure 1. A less condensed representation of a concept lattice is possible using reduced labeling (Ganter & Wille, 1999). Figure 2 shows the concept lattice in Figure 1 with reduced labeling. It is easier to see the relationships and similarities among objects when reduced labeling is used. The extent of a concept C in Figure 2 consists of the objects at C and the objects at the concepts that can be reached from C going downward following descending paths towards the bottom concept. Similarly, the intent of C consists of the features at C and the features at the concepts that can be reached from C going upwards following ascending paths to the top concept. Consider the context presented in Table 1. Let B = {a, f}. Then α(B) = {5, 6, 8}, and β(α(B)) = β({5, 6, 8}) = {a, d, f} ≠ {a, f}; therefore, in general, β(α(B)) ≠ B. A set of features B that satisfies the condition β(α(B)) = B is called a closed feature set. Intuitively, a closed feature set is a
Figure 1. Concept lattice for the context in Table 1
Figure 2. Concept lattice for the context in Table 1 with reduced labeling
maximal set of features shared by a set of objects. It is easy to show that intents of the concepts of a concept lattice are all closed feature sets. The support of a set of features B is defined as the percentage of objects that possess every feature in B. That is, support(B) = |α(B)|/|G|, where |B| is the cardinality of B. Let minSupport be a user-specified threshold value for minimum support. A feature set B is frequent iff support(B) ≥ minSupport. A frequent closed feature set is a closed feature set, which is also frequent. For example, for minSupport = 0.3, {a, f} is frequent, {a, d, f} is frequent closed, while {a, c, d, f} is closed but not frequent.
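As a small illustration (not part of the article), the following Python sketch re-enters the context of Table 1 by hand and computes α(B), β(A), closedness, and support:

# Sketch: derivation operators alpha/beta and (frequent) closed feature sets
# for the context of Table 1.
context = {  # object -> set of features it possesses
    1: set("abg"),  2: set("abgh"),  3: set("abcgh"), 4: set("acghi"),
    5: set("abdf"), 6: set("abcdf"), 7: set("acde"),  8: set("acdf"),
}

def alpha(B):
    """Objects possessing every feature in B."""
    return {g for g, feats in context.items() if set(B) <= feats}

def beta(A):
    """Features common to all objects in A."""
    feats = [context[g] for g in A]
    return set.intersection(*feats) if feats else set("abcdefghi")

def is_closed(B):
    return beta(alpha(B)) == set(B)

def support(B):
    return len(alpha(B)) / len(context)

B = set("af")
print(sorted(alpha(B)))                                   # [5, 6, 8]
print(sorted(beta(alpha(B))))                             # ['a', 'd', 'f'] -> {a, f} is not closed
print(is_closed(set("adf")), support(set("adf")) >= 0.3)  # True True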
CLUSTERING BASED ON FCA

It is believed that the method described below is the first to use FCA for disjoint clustering. Using FCA for conceptual clustering to gain more information about data is discussed in Carpineto and Romano (1993) and Mineau and Godin (1995). In the remainder of this article we show how FCA can be used for clustering.
Traditionally, most clustering algorithms do not allow clusters to overlap. However, this is not a valid assumption for many applications. For example, in Web document clustering, many documents have more than one topic and need to reside in more than one cluster (Beil, Ester, & Xu, 2002; Hearst, 1999; Zamir & Etzioni, 1998). Similarly, in market basket data, items purchased in a transaction may belong to more than one category of items. The concept lattice structure provides a hierarchical clustering of objects, where the extent of each node could be a cluster and the intent provides a description of that cluster. There are two main problems, though, that make it difficult to recognize the clusters to be used. First, not all objects are present at all levels of the lattice. Second, the presence of overlapping clusters at different levels is not acceptable for disjoint clustering. The techniques described in this article solve these problems. For example, for a node to be a cluster candidate, its intent must be frequent (meaning a minimum percentage of objects must possess all the features of the intent). The intuition is that the objects within a cluster must contain many
features in common. Overlapping is resolved by using a score function that measures the goodness of a cluster for an object and keeps the object in the cluster where it scores best.
Formalizing the Clustering Problem

Given a set of objects G = {g1, g2, …, gn}, where each object is described by the set of features it possesses (i.e., gi is described by β(gi)), ℒ = {C1, C2, …, Ck} is a clustering of G if and only if each Ci is a subset of G and ∪Ci = G, where i ranges from 1 to k. For disjoint clustering, the additional condition Ci ∩ Cj = ∅ must be satisfied for all i ≠ j. Our method for disjoint clustering consists of two steps. First, assign objects to their initial clusters. Second, make these clusters disjoint. For overlapping clustering, only the first step is needed.

Assigning Objects to Initial Clusters

Each frequent closed feature set (FCFS) is a cluster candidate, with the FCFS serving as a label for that cluster. Each object g is assigned to the cluster candidates described by the maximal frequent closed feature sets (MFCFS) contained in β(g). These initial clusters may not be disjoint because an object may contain several MFCFS. For example, for minSupport = 0.375, object 2 in Table 1 contains the MFCFSs agh and abg. Notice that all of the objects in a cluster must contain all the features in the FCFS describing that cluster (which is also used as the cluster label). This is always true even after any overlapping is removed. This means that this method produces clusters with their descriptions. This is a desirable property. It helps a domain expert to assign labels and descriptions to clusters.

Making Clusters Disjoint

To make the clusters disjoint, we find the best cluster for each overlapping object g and keep g only in that cluster. To achieve this, a score function is used. The function score(g, Ci) measures the goodness of cluster Ci for object g. Intuitively, a cluster is good for an object g if g has many frequent features which are also frequent in Ci. On the other hand, Ci is not good for g if g has frequent features that are not frequent in Ci. Define global-support(f) as the percentage of objects possessing f in the whole database, and cluster-support(f) as the percentage of objects possessing f in a given cluster Ci. We say f is cluster-frequent in Ci if cluster-support(f) is at least a user-specified minimum threshold value θ. For cluster Ci, let positive(g, Ci) be the set of features in β(g) which are both global-frequent (i.e., frequent in the whole database) and cluster-frequent. Also let negative(g, Ci) be the set of features in β(g) which are global-frequent but not cluster-frequent. The function score(g, Ci) is then given by the following formula:

score(g, Ci) = Σ f ∈ positive(g, Ci) cluster-support(f) − Σ f ∈ negative(g, Ci) global-support(f).

The first term in score(g, Ci) favors Ci for every feature in positive(g, Ci) because these features contribute to intra-cluster similarity. The second term penalizes Ci for every feature in negative(g, Ci) because these features contribute to inter-cluster similarities. An overlapping object will be deleted from all initial clusters except for the cluster where it scores highest. Ties are broken by assigning the object to the cluster with the longest label. If this does not resolve ties, then one of the clusters is chosen randomly.

An Illustrating Example

Consider the objects in Table 1. The closed feature sets are the intents of the concept lattice in Figure 1. For minSupport = 0.35, a feature set must appear in at least 3 objects to be frequent. The frequent closed feature sets are a, ag, ac, ab, ad, agh, abg, acd, and adf. These are the candidate clusters. Using the notation C[x] to indicate the cluster with label x, and assigning objects to MFCFS, results in the following initial clusters: C[agh] = {2, 3, 4}, C[abg] = {1, 2, 3}, C[acd] = {6, 7, 8}, and C[adf] = {5, 6, 8}. To find the most suitable cluster for object 6, we need to calculate its score in each cluster containing it. For a cluster-support threshold value θ of 0.7, it is found that score(6, C[acd]) = 1 − 0.63 + 1 + 1 − 0.38 = 1.99 and score(6, C[adf]) = 1 − 0.63 − 0.63 + 1 + 1 = 1.74. We will use score(6, C[acd]) to explain the calculation. All features in β(6) are global-frequent. They are a, b, c, d, and f, with global frequencies 1, 0.63, 0.63, 0.5, and 0.38, respectively. Their respective cluster-support values are 1, 0.33, 1, 1, and 0.67. For a feature to be cluster-frequent in C[acd], it must appear in at least ⌈θ × |C[acd]|⌉ = 3 of its objects. Therefore, a, c, d ∈ positive(6, C[acd]), and b, f ∈ negative(6, C[acd]). Substituting these values into the formula for the score function, it is found that score(6, C[acd]) = 1.99. Since score(6, C[acd]) > score(6, C[adf]), object 6 is assigned to C[acd]. Tables 2 and 3 show the score calculations for all overlapping objects. Table 2 shows initial cluster assignments. Features that are both global-frequent and cluster-frequent are shown in the column labeled positive(g, Ci), and features that are only global-frequent are in the column labeled negative(g, Ci). For elements in the column labeled positive(g, Ci), the cluster-support values are listed between parentheses after each feature name. The same format is used for global-support values for features in the column labeled negative(g, Ci).
Table 2. Initial cluster assignments and support values (minSupport = 0.37, θ = 0.7)

Ci     | initial clusters             | positive(g, Ci)  | negative(g, Ci)
C[agh] | 2: abgh, 3: abcgh, 4: acghi  | a(1), g(1), h(1) | b(0.63), c(0.63)
C[abg] | 1: abg, 2: abgh, 3: abcgh    | a(1), b(1), g(1) | c(0.63), h(0.38)
C[acd] | 6: abcdf, 7: acde, 8: acdf   | a(1), c(1), d(1) | b(0.63), f(0.38)
C[adf] | 5: abdf, 6: abcdf, 8: acdf   | a(1), d(1), f(1) | b(0.63), c(0.63)

Table 3. Score calculations and final cluster assignments (minSupport = 0.37, θ = 0.7)

Ci     | score(g, Ci)                                                                    | final clusters
C[agh] | 2: 1 - .63 + 1 + 1 = 2.37; 3: 1 - .63 - .63 + 1 + 1 = 1.74; 4: does not overlap | 4
C[abg] | 1: does not overlap; 2: 1 + 1 + 1 - .38 = 2.62; 3: 1 + 1 - .63 + 1 - .38 = 1.99 | 1, 2, 3
C[acd] | 6: 1 - .63 + 1 + 1 - .38 = 1.99; 7: does not overlap; 8: 1 + 1 + 1 - .38 = 2.62 | 6, 7, 8
C[adf] | 5: does not overlap; 6: 1 - .63 - .63 + 1 + 1 = 1.74; 8: 1 - .63 + 1 + 1 = 2.37 | 5
It is only a coincidence that in this example all the cluster-support values listed in the positive(g, Ci) column are 1 (try θ = 0.5 for different values). Table 3 shows the score calculations and the final cluster assignments. Notice that different threshold values may result in different clusters. The value of minSupport affects cluster labels and initial clusters, while that of θ affects the final elements in clusters. For example, for minSupport = 0.375
and θ = 0.5, we get the following final clusters C[agh] = {3, 4}, C[abg] = {1, 2}, C[acd] = {7, 8}, and C[adf] = {5, 6}.
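The following Python sketch is not from the article; it recomputes the running example end-to-end (supports are not rounded, and ties are broken simply by the maximum score rather than by label length):

# Sketch: score(g, Ci) and the disjoint-assignment step for the Table 1 example.
objects = {1: set("abg"), 2: set("abgh"), 3: set("abcgh"), 4: set("acghi"),
           5: set("abdf"), 6: set("abcdf"), 7: set("acde"), 8: set("acdf")}
init_clusters = {"agh": {2, 3, 4}, "abg": {1, 2, 3},
                 "acd": {6, 7, 8}, "adf": {5, 6, 8}}
min_support, theta = 0.35, 0.7

def global_support(f):
    return sum(f in feats for feats in objects.values()) / len(objects)

def cluster_support(f, members):
    return sum(f in objects[g] for g in members) / len(members)

def score(g, members):
    s = 0.0
    for f in objects[g]:
        if global_support(f) < min_support:
            continue                                # ignore globally infrequent features
        if cluster_support(f, members) >= theta:
            s += cluster_support(f, members)        # f in positive(g, Ci)
        else:
            s -= global_support(f)                  # f in negative(g, Ci)
    return s

final = {label: set(members) for label, members in init_clusters.items()}
for g in objects:
    homes = [lbl for lbl, m in init_clusters.items() if g in m]
    if len(homes) > 1:                              # overlapping object: keep the best cluster only
        best = max(homes, key=lambda lbl: score(g, init_clusters[lbl]))
        for lbl in homes:
            if lbl != best:
                final[lbl].discard(g)
print(final)   # object 6 stays in C[acd], since 2.0 > 1.75 (the article rounds to 1.99 and 1.74)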
FUTURE TRENDS We have shown how FCA can be used for clustering. Researchers have also used the techniques of FCA and frequent itemsets in the association mining problem (Fung, Wang, & Ester, 2003; Wang, Xu, & Liu, 1999; Yun, Chuang, & Chen, 2001). We anticipate this framework to be suitable for other problems in data mining such as classification. Latest developments in efficient algorithms for generating frequent closed itemsets make this approach efficient for handling large amounts of data (Gouda & Zaki, 2001; Pei, Han, & Mao, 2000; Zaki & Hsiao, 2002).
CONCLUSION This chapter introduces formal concept analysis (FCA); a useful framework for many applications in computer science. We also showed how the techniques of FCA can be used for clustering. A global support value is used to specify which concepts can be candidate clusters. A score function is then used to determine the best cluster for each object. This approach is appropriate for cluster517
ing categorical data, transaction data, text data, Web documents, and library documents. These data usually suffer from the problem of high dimensionality with only few items or keywords being available in each transaction or document. FCA contexts are suitable for representing this kind of data.

REFERENCES

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. In The 8th International Conference on Knowledge Discovery and Data Mining (KDD 2002) (pp. 436-442).

Carpineto, C., & Romano, G. (1993). GALOIS: An order-theoretic approach to conceptual clustering. In Proceedings of 1993 International Conference on Machine Learning (pp. 33-40).

Fung, B., Wang, K., & Ester, M. (2003). Large hierarchical document clustering using frequent itemsets. In Third SIAM International Conference on Data Mining (pp. 59-70).

Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin: Springer-Verlag.

Gouda, K., & Zaki, M. (2001). Efficiently mining maximal frequent itemsets. In First IEEE International Conference on Data Mining (pp. 163-170). San Jose, USA.

Hearst, M. (1999). The use of categories and clusters for organizing retrieval results. In T. Strzalkowski (Ed.), Natural language information retrieval (pp. 333-369). Boston: Kluwer Academic Publishers.

Kryszkiewicz, M. (1998). Representative association rules. In Proceedings of PAKDD '98. Lecture Notes in Artificial Intelligence (Vol. 1394) (pp. 198-209). Berlin: Springer-Verlag.

Mineau, G., & Godin, R. (1995). Automatic structuring of knowledge bases by conceptual clustering. IEEE Transactions on Knowledge and Data Engineering, 7(5), 824-829.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 21-30). Dallas, USA.

Saquer, J. (2003). Using concept lattices for disjoint clustering. In The Second IASTED International Conference on Information and Knowledge Sharing (pp. 144-148).

Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. In ACM International Conference on Information and Knowledge Management (pp. 483-490).

Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). Dordrecht-Boston: Reidel.

Yun, C., Chuang, K., & Chen, M. (2001). An efficient clustering algorithm for market basket data based on small large ratios. In The 25th COMPSAC Conference (pp. 505-510).

Zaki, M. J., & Hsiao, C. (2002). CHARM: An efficient algorithm for closed itemset mining. In Second SIAM International Conference on Data Mining.

Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In The 21st Annual International ACM SIGIR (pp. 46-54).
KEY TERMS Cluster-Support of Feature F In Cluster Ci: Percentage of objects in Ci possessing f. Concept: A pair (A, B) of a set A of objects and a set B of features such that B is the maximal set of features possessed by all the objects in A and A is the maximal set of objects that possess every feature in B. Context: A triple (G, M, I) where G is a set of objects, M is a set of features and I is a binary relation between G and M such that gIm if and only if object g possesses the feature m. Formal Concept Analysis: A mathematical framework that provides formal and mathematical treatment of the notion of a concept in a given universe. Negative(g, Ci): Set of features possessed by g which are global-frequent but not cluster-frequent. Positive(g, Ci): Set of features possessed by g which are both global-frequent and cluster-frequent. Support or Global-Support of Feature F: Percentage of object “transactions” in the whole context (or whole database) that possess f.
Fuzzy Information and Data Analysis
Reinhard Viertl Vienna University of Technology, Austria
INTRODUCTION

The results of data warehousing and data mining depend essentially on the quality of the data. Usually data are assumed to be numbers or vectors, but this is often not realistic. In particular, the result of a measurement of a continuous quantity is never a precise number, but more or less non-precise. This kind of uncertainty is also called fuzziness and should not be confused with errors. Data mining techniques have to take care of fuzziness in order to avoid unrealistic results.

BACKGROUND

In standard data warehousing and data analysis, data are treated as numbers, vectors, words, or symbols. These data types do not take care of the fuzziness of data and prior information. Whereas some methodology for fuzzy data analysis was developed, statistical data analysis is usually not taking care of fuzziness. Recently some methods for statistical analysis of non-precise data were published (Viertl, 1996, 2003). Historically, fuzzy sets were first introduced by K. Menger in 1951 (Menger, 1951). Later L. Zadeh made fuzzy models popular. For more information on fuzzy modeling compare (Dubois & Prade, 2000). Most data analysis techniques are statistical techniques. Only in the last 20 years have alternative methods using fuzzy models been developed. For a detailed discussion compare (Bandemer & Näther, 1992; Berthold & Hand, 2003).

MAIN THRUST

The main thrust of this article is to provide the quantitative description of fuzzy data, as well as generalized methods for the statistical analysis of fuzzy data.

Non-Precise Data

The result of one measurement of a continuous quantity is not a precise real number but more or less non-precise. For details see (Viertl, 2002). This kind of uncertainty can be best described by a so-called fuzzy number. A fuzzy number x∗ is defined by a so-called characterizing function ξ : IR → [0,1] which obeys the following:

(1) ∃ x0 ∈ IR : ξ(x0) = 1
(2) ∀ δ ∈ (0,1] the so-called δ-cut Cδ[ξ(·)], defined by Cδ[ξ(·)] := {x ∈ IR : ξ(x) ≥ δ} = [aδ, bδ], is a finite closed interval.

Examples of non-precise data are results on analogue measurement equipment as well as readings on digital instruments. For continuous vector quantities, real measurements are not precise vectors but also non-precise. This imprecision can result in a vector (x1∗, …, xk∗) of fuzzy numbers xi∗, or more generally, in a so-called k-dimensional fuzzy vector x∗. Using the notation x = (x1, …, xk) ∈ IRk, a k-dimensional fuzzy vector is defined by its vector-characterizing function ζ : IRk → [0,1] obeying

(1) ∃ x0 ∈ IRk : ζ(x0) = 1
(2) ∀ δ ∈ (0,1] the δ-cut Cδ[ζ(·)], defined by Cδ[ζ(·)] := {x ∈ IRk : ζ(x) ≥ δ}, is a compact simply connected subset of IRk.

Remark: A vector of fuzzy numbers is essentially different from a fuzzy vector. But it is possible to construct a fuzzy vector x∗ from a vector ξ1(·), …, ξk(·) of fuzzy numbers. The vector-characterizing function ζ(·, …, ·) of x∗ can be obtained by

ζ(x1, …, xk) = min{ξ1(x1), …, ξk(xk)}  ∀ (x1, …, xk) ∈ IRk.

Examples of 2-dimensional fuzzy data are light points on radar screens.

Descriptive Statistics with Fuzzy Data

Analysis of variable data by forming histograms has to take care of fuzziness. This is possible based on the characterizing functions ξi(·) of the observations xi∗ for i = 1(1)n. The height hj∗ over a class Kj, j = 1(1)k, of the histogram is a fuzzy number whose characterizing function ηj(·) is obtained in the following way: For each δ-level, the δ-cut Cδ[ηj(·)] = [hn,δ(Kj), h̄n,δ(Kj)] of ηj(·) is defined by

hn,δ(Kj) = #{xi∗ : Cδ[ξi(·)] ⊆ Kj} / n
h̄n,δ(Kj) = #{xi∗ : Cδ[ξi(·)] ∩ Kj ≠ ∅} / n.

By the representation lemma of fuzzy numbers, the characterizing functions ηj(·), j = 1(1)k, are hereby determined:

ηj(x) = max{δ · ICδ[ηj(·)](x) : δ ∈ (0,1]}  ∀ x ∈ IR,

where IA(·) denotes the indicator function of a set A. The resulting generalized histogram is also called a fuzzy histogram. For more details compare (Viertl & Hareter, 2004).

Fuzzy Probability Distributions

In standard data analysis, probability densities are considered as limits of histograms. For fuzzy data, limits of fuzzy histograms are fuzzy valued functions f∗(·) whose values f∗(x) are fuzzy numbers. These fuzzy valued functions are normalized by generalizing classical integration with the help of so-called δ-level curves fδ(·) and f̄δ(·) of f∗(·), which are defined by the endpoints of the δ-cuts of f∗(x) for all x ∈ Def[f∗(·)]:

Cδ[f∗(x)] = [fδ(x), f̄δ(x)]  for all x ∈ Def[f∗(·)].

The generalized integral of a fuzzy valued function f∗(·) defined on M is a fuzzy number I∗, denoted by I∗ = ∫M f∗(x)dx, which is defined via its δ-cuts

Cδ[I∗] = [ ∫M fδ(x)dx , ∫M f̄δ(x)dx ]  ∀ δ ∈ (0,1]

in case of integrable δ-level curves fδ(·) and f̄δ(·). Fuzzy probability densities on measurable spaces (M, A) are special fuzzy valued functions f∗(·) defined on M with integrable δ-level curves, for which

∫M f∗(x)dx = 1∗+ ,

where 1∗+ is a fuzzy number fulfilling 1 ∈ C1[1∗+] and Cδ[1∗+] ⊆ (0, ∞) ∀ δ ∈ (0,1].

Based on fuzzy probability densities, so-called fuzzy probability distributions P∗ on A are defined in the following way: Denoting the set of all classical probability densities φ(·) on M which are bounded by the δ-level curves fδ(·) and f̄δ(·) by

Sδ = {φ(·) : fδ(x) ≤ φ(x) ≤ f̄δ(x) ∀ x ∈ M},

the fuzzy associated probability P∗(A) for all A ∈ A is the fuzzy number whose δ-cuts Cδ[P∗(A)] = [Pδ(A), P̄δ(A)] are defined by

P̄δ(A) = sup{ ∫A φ(x)dx : φ ∈ Sδ }
Pδ(A) = inf{ ∫A φ(x)dx : φ ∈ Sδ }.

By this definition the probabilities of the extreme events ∅ and M are precise numbers, i.e. P∗(∅) = 0 and P∗(M) = 1.

Statistical Inference Based on Fuzzy Information

Statistical inference procedures can be generalized to the situation of fuzzy data. Moreover, also Bayesian inference procedures can be generalized for fuzzy a-priori distributions and fuzzy sample data. For a stochastic model X ~ f(· | θ), θ ∈ Θ, with observation space M and sample x1, …, xn, let ϑ(x1, …, xn) be an estimator for the true parameter, i.e. a measurable function from the sample space Mn to Θ. In case of a fuzzy sample x1∗, …, xn∗ the estimator can be adapted using the so-called extension principle from fuzzy set theory in the following way. In order to do this, first the fuzzy sample x1∗, …, xn∗ with characterizing functions ξ1(·), …, ξn(·) has to be combined into a fuzzy element x∗ of the sample space Mn with vector-characterizing function ζ(·, …, ·) whose values are defined by

ζ(x1, …, xn) = min{ξi(xi) : i = 1(1)n}  ∀ (x1, …, xn) ∈ Mn.

Let ϑ : Mn → Θ be an estimator for the parameter θ of a stochastic model X ~ f(· | θ), θ ∈ Θ. Using the so-called fuzzy combined sample x∗, the characterizing function ψ(·) of the generalized fuzzy estimator θ̂∗ is given by its values

ψ(θ) = sup{ζ(x) : ϑ(x) = θ}  if ϑ⁻¹(θ) ≠ ∅
ψ(θ) = 0                     if ϑ⁻¹(θ) = ∅      ∀ θ ∈ Θ.

The generalized estimator θ̂∗ is a fuzzy element of the parameter space Θ. A similar construction is possible to generalize the concept of confidence sets. Let κ(X1, …, Xn) be a confidence function for θ at given confidence level 1 − α. For fuzzy data with fuzzy combined sample x∗ with vector-characterizing function ζ(·, …, ·), a generalized fuzzy confidence region Θ∗1−α with membership function φ(·) is defined by

φ(θ) = sup{ζ(x) : θ ∈ κ(x)}  if there exists x with θ ∈ κ(x)
φ(θ) = 0                     if there is no x with θ ∈ κ(x)      ∀ θ ∈ Θ.

For more details compare the book (Viertl, 1996).

Remark: For precise data xi ∈ IR the characterizing functions are the one-point indicator functions I{xi}(·), and the resulting characterizing functions are the indicator functions of the classical results.

Statistical tests in case of fuzzy data yield fuzzy values t∗ of test statistics t(x1∗, …, xn∗). Therefore simple significance tests based on rejection regions do not apply. A solution to this problem is to use p-values. An alternative approach is so-called fuzzy p-values as described in (Filzmoser & Viertl, 2004).

Fuzzy Bayesian Inference

The second kind of fuzzy information appears as "fuzzy a-priori information" in the context of Bayesian data analysis. Fuzzy a-priori information is expressed by fuzzy probability distributions on the parameter space Θ in connection with a stochastic model X ~ f(· | θ), θ ∈ Θ, with observation space M. Frequently such fuzzy probability distributions are generated by fuzzy probability densities as described in the subsection Fuzzy Probability Distributions above. Using the concept of δ-level curves πδ(·) and π̄δ(·) of fuzzy a-priori densities π∗(·) on the parameter space Θ, it is possible to generalize Bayes' theorem to the situation of fuzzy a-priori information and fuzzy data D∗. The resulting a-posteriori distribution is a fuzzy probability distribution with fuzzy density π∗(· | D∗) on the parameter space. The fuzzy density π∗(· | D∗) can be used to calculate predictive densities p(· | D∗) by the generalized integration from above in the following way:

p(x | D∗) = ∫Θ f(x | θ) π∗(θ | D∗) dθ  for all x ∈ M.

This is again a fuzzy probability distribution, now on the observation space M. For more details see (Viertl & Hareter, 2004).
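As an illustration of the descriptive-statistics construction above (this code is not part of the article; the class limits and the δ-cuts of the observations are invented), the following Python sketch computes the lower and upper fuzzy-histogram heights hn,δ(Kj) and h̄n,δ(Kj) for one fixed δ-level:

# Sketch: lower/upper fuzzy-histogram heights for one delta-level.
# Each non-precise observation is represented here only by its delta-cut [a, b].
def fuzzy_histogram_heights(delta_cuts, classes):
    """delta_cuts: list of intervals (a, b) = C_delta[xi*]
       classes:    list of intervals (lo, hi) = histogram classes Kj
       returns for each class the pair (lower height, upper height)."""
    n = len(delta_cuts)
    heights = []
    for lo, hi in classes:
        # lower height: the delta-cut is completely contained in Kj
        lower = sum(1 for a, b in delta_cuts if lo <= a and b <= hi) / n
        # upper height: the delta-cut intersects Kj
        upper = sum(1 for a, b in delta_cuts if a <= hi and b >= lo) / n
        heights.append((lower, upper))
    return heights

# Hypothetical observations, given by their delta-cuts for delta = 0.5
cuts = [(1.8, 2.4), (2.9, 3.1), (3.4, 4.2), (4.9, 5.3)]
classes = [(1.0, 3.0), (3.0, 5.0), (5.0, 7.0)]
print(fuzzy_histogram_heights(cuts, classes))   # [(0.25, 0.5), (0.25, 0.75), (0.0, 0.25)]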
Fuzzy Data Analysis

In recent times non-statistical approaches to data analysis problems were developed using ideas from fuzzy theory. Especially for linguistic data this approach was successful. Instead of stochastic models to describe uncertainty it is possible to use fuzzy models. Such alternative methods are fuzzy clustering, fuzzy regression analysis, integration of fuzzy valued functions, fuzzy optimization, and fuzzy approximation. For more details compare (Bandemer & Näther, 1992).
FUTURE TRENDS

Up to now, fuzzy models were mainly used to describe vague data in the form of linguistic data. But precision measurements are also connected with uncertainty which is not only of a stochastic nature. Therefore hybrid methods of data analysis, which take care of the fuzziness of data as well as of their statistical variation, have to be applied and further developed. In particular, research on how to obtain the characterizing functions of non-precise data is necessary.
CONCLUSION
In this contribution different kinds of uncertainty are discussed. These are variation and "fuzziness". Variation is usually described by stochastic models and fuzziness by fuzzy numbers and fuzzy vectors, respectively. Looking at real data it is necessary to distinguish between fuzziness and variability in order to obtain realistic results from data analyses.

REFERENCES

Bandemer, H., & Näther, W. (1992). Fuzzy data analysis. Dordrecht: Kluwer.

Berthold, M., & Hand, D. (Eds.). (2003). Intelligent data analysis (2nd ed.). Berlin: Springer.

Bertoluzza, C., Gil, M., & Ralescu, D. (Eds.). (2002). Statistical modeling, analysis and management of fuzzy data. Heidelberg: Physica-Verlag.

Dubois, D., & Prade, H. (Eds.). (2000). Fundamentals of fuzzy sets. Boston: Kluwer.

Filzmoser, P., & Viertl, R. (2004). Testing hypotheses with fuzzy data: The fuzzy p-value. Metrika, 59, 21-29.

Grzegorzewski, P., Hryniewicz, O., & Gil, M. (Eds.). (2002). Soft methods in probability, statistics and data analysis. Heidelberg: Physica-Verlag.

Menger, K. (1951). Ensembles flous et fonctions aléatoires. Comptes Rendus Acad. Sci., 232, 2001-2003.

Möller, B., & Beer, M. (2004). Fuzzy randomness – Uncertainty in civil engineering and computational mechanics. Berlin: Springer.

Ross, R., Booker, J., & Parkinson, J. (Eds.) (2002). Fuzzy logic and probability applications – Bridging the gap. Philadelphia: American Statistical Association and Society of Industrial and Applied Mathematics.

Viertl, R. (1996). Statistical methods for non-precise data. Boca Raton: CRC Press.

Viertl, R. (2002). On the description and analysis of measurements of continuous quantities. Kybernetika, 38, 353-362.

Viertl, R. (2003). Statistical inference with imprecise data. Encyclopedia of life support systems. Retrieved from UNESCO, Paris, Web site: www.eolss.unesco.org

Viertl, R., & Hareter, D. (2004). Fuzzy information and imprecise probability. Zeitschrift für Angewandte Mathematik und Mechanik, 84(10-11), 1-10.

Wolkenhauer, O. (2001). Data engineering – Fuzzy mathematics in systems theory and data analysis. New York: Wiley.

KEY TERMS

Characterizing Function: Mathematical description of a fuzzy number.

Fuzzy Estimate: Generalized statistical estimation technique for the situation of non-precise data.

Fuzzy Histogram: A generalized histogram based on non-precise data, whose heights are fuzzy numbers.

Fuzzy Information: Information which is not given in form of precise numbers, precise vectors, or precisely defined terms.

Fuzzy Number: Quantitative mathematical description of fuzzy information concerning a one-dimensional numerical quantity.

Fuzzy Statistics: Statistical analysis methods for the situation of fuzzy information.

Fuzzy Valued Functions: Generalized real valued functions whose values are fuzzy numbers.

Fuzzy Vector: Mathematical description of non-precise vector quantities.

Hybrid Data Analysis Methods: Data analysis techniques using methods from statistics and from fuzzy modeling.

Non-Precise Data: Data which are not precise numbers or not precise vectors.
A General Model for Data Warehouses Michel Schneider Blaise Pascal University, France
INTRODUCTION Basically, the schema of a data warehouse lies on two kinds of elements: facts and dimensions. Facts are used to memorize measures about situations or events. Dimensions are used to analyse these measures, particularly through aggregation operations (counting, summation, average, etc.). To fix the ideas let us consider the analysis of the sales in a shop according to the product type and to the month in the year. Each sale of a product is a fact. One can characterize it by a quantity. One can calculate an aggregation function on the quantities of several facts. For example, one can make the sum of quantities sold for the product type “mineral water” during January in 2001, 2002 and 2003. Product type is a criterion of the dimension Product. Month and Year are criteria of the dimension Time. A quantity is so connected both with a type of product and with a month of one year. This type of connection concerns the organization of facts with regard to dimensions. On the other hand a month is connected to one year. This type of connection concerns the organization of criteria within a dimension. The possibilities of fact analysis depend on these two forms of connection and on the schema of the warehouse. This schema is chosen by the designer in accordance with the users needs. Determining the schema of a data warehouse cannot be achieved without adequate modelling of dimensions and facts. In this article we present a general model for dimensions and facts and their relationships. This model will facilitate greatly the choice of the schema and its manipulation by the users.
BACKGROUND Concerning the modelling of dimensions, the objective is to find an organization which corresponds to the analysis operations and which provides strict control over the aggregation operations. In particular it is important to avoid double-counting or summation of nonadditive data. Many studies have been devoted to this problem. Most recommend organizing the criteria (we said also members) of a given dimension into hierarchies with which the aggregation paths can be explicitly defined. In (Pourabbas, 1999), hierarchies are defined
by means of a containment function. In (Lehner, 1998), the organization of a dimension results from the functional dependences which exist between its members, and a multi-dimensional normal form is defined. In (Hùsemann, 2000), the functional dependences are also used to design the dimensions and to relate facts to dimensions. In (Abello, 2001), relationships between levels in a hierarchy are apprehended through the PartWhole semantics. In (Tsois, 2001), dimensions are organized around the notion of a dimension path which is a set of drilling relationships. The model is centered on a parent-child (one to many) relationship type. A drilling relationship describes how the members of a children level can be grouped into sets that correspond to members of the parent level. In (Vassiliadis, 2000), a dimension is viewed as a lattice and two functions “anc” and “desc” are used to perform the roll up and the drill down operations. Pedersen (1999) proposes an extended multidimensional data model which is also based on a lattice structure, and which provides non-strict hierarchies (i.e. too many relationships between the different levels in a dimension). Modelling of facts and their relationships has not received so much attention. Facts are generally considered in a simple fashion which consists in relating a fact with the roots of the dimensions. However, there is a need for considering more sophisticated structures where the same set of dimensions are connected to different fact types and where several fact types are inter-connected. The model described in (Pedersen, 1999) permits some possibilities in this direction but is not able to represent all the situations. Apart from these studies it is important to note various propositions (Agrawal, 1997; Datta, 1999; Gyssens, 1997; Nguyen, 2000) for cubic models where the primary objective is the definition of an algebra for multidimensional analysis. Other works must also be mentioned. In (Golfarelli, 1998), a solution is proposed to derive multidimensional structures from E/R shemas. In (Hurtado, 2001) are established conditions for reasoning about summarizability in multidimensional structures.
MAIN THRUST Our objective in this article is to propose a generic model based on our personal research work and which
integrates existing models. This model can be used to apprehend the sharing of dimensions in various ways and to describe different relationships between fact types. Using this model, we will also define the notion of wellformed warehouse structures. Such structures have desirable properties for applications. We suggest a graph representation for such structures which can help the users in designing and requesting a data warehouse.
Modelling Facts

A fact is used to record measures or states concerning an event or a situation. Measures and states can be analysed through different criteria organized in dimensions. A fact type is a structure fact_name[(fact_key), (list_of_reference_attributes), (list_of_fact_attributes)] where
• fact_name is the name of the type;
• fact_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type;
• list_of_reference_attributes is a list of attribute names; each attribute references a member in a dimension or another fact instance;
• list_of_fact_attributes is a list of attribute names; each attribute is a measure for the fact.
The set of referenced dimensions comprises the dimensions which are directly referenced through the list_of_reference_attributes, but also the dimensions which are indirectly referenced through other facts. Each fact attribute can be analysed along each of the referenced dimensions. Analysis is achieved through the computing of aggregate functions on the values of this attribute. As an example let us consider the following fact type for memorizing the sales in a set of stores. Sales[(ticket_number, product_key), (time_key, product_key, store_key), (price_per_unit, quantity)] The key is (ticket_number, product_key). This means that there is an instance of Sales for each different product of a ticket. There are three dimension references: time_key, product_key, store_key. There are two fact attributes: price_per_unit, quantity. The fact attributes can be analysed through aggregate operations by using the three dimensions. 524
There may be no fact attribute; in this case a fact records the occurrence of an event or a situation. In such cases, analysis consists in counting occurrences satisfying a certain number of conditions. For the needs of an application, it is possible to introduce different fact types sharing certain dimensions and having references between them.
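As an illustration only (the article does not prescribe any particular encoding), the Sales fact type above can be written down as a small Python structure that keeps the three parts of the definition separate:

# Sketch: a fact type as a plain Python structure (illustrative only).
from dataclasses import dataclass

@dataclass
class FactType:
    name: str
    key: tuple                 # fact_key: attributes identifying an instance
    references: tuple          # attributes referencing dimension members or other facts
    fact_attributes: tuple     # the measures

sales = FactType(
    name="Sales",
    key=("ticket_number", "product_key"),
    references=("time_key", "product_key", "store_key"),
    fact_attributes=("price_per_unit", "quantity"),
)

# A fact instance is then a record supplying values for all three parts,
# e.g. one product line of one ticket (the values below are hypothetical):
fact = {"ticket_number": 1127, "product_key": "P42", "time_key": "2004-06-03",
        "store_key": "S7", "price_per_unit": 1.20, "quantity": 3}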
Modelling Dimensions The different criteria which are needed to conduct analysis along a dimension are introduced through members. A member is a specific attribute (or a group of attributes) taking its values on a well defined domain. For example, the dimension TIME can include members such as DAY, MONTH, YEAR, etc. Analysing a fact attribute A along a member M means that we are interested in computing aggregate functions on the values of A for any grouping defined by the values of M. In the article we will also use the notation Mij for the j-th member of i-th dimension. Members of a dimension are generally organized in a hierarchy which is a conceptual representation of the hierarchies of their occurrences. Hierarchy in dimensions is a very useful concept that can be used to impose constraints on member values and to guide the analysis. Hierarchies of occurrences result from various relationships which can exist in the real world: categorization, membership of a subset, mereology. Figure 1 illustrates a typical situation which can occur. Note that a hierarchy is not necessarily a tree. We will model a dimension according to a hierarchical relationship (HR) which links a child member Mij (i.e. town) to a parent member Mik (i.e. region) and we will use the notation Mij–Mik. For the following we consider only situations where a child occurrence is linked to a unique parent occurrence in a type. However, a child occurrence, as in case (b) or (c), can have several parent occurrences but each of different types. We will also suppose that HR is reflexive, antisymmetric and transitive. This kind of relationship covers the great majority of real situations. Existence of this HR is very important since it means that the members of a dimension Figure 1. A typical hierarchy in a dimension ck1
cust_key
town
demozone
region
ck2
ck3
ck4 ….
Dallas Mesquite Akron …. North
Texas
North ….
Ohio ….
A General Model for Data Warehouses
can be organized into levels and correct aggregation of fact attribute values along levels can be guaranteed. Each member of a dimension can be an entry for this dimension i.e. can be referenced from a fact type. This possibility is very important since it means that dimensions between several fact types can be shared in various ways. In particular, it is possible to reference a dimension at different levels of granularity. A dimension root represents a standard entry. For the three dimensions in Figure 1, there is a single root. However, definition 3 authorizes several roots. As in other models (Hùsemann, 2000), we consider property attributes which are used to describe the members. A property attribute is linked to its member through a functional dependence, but does not introduce a new member and a new level of aggregation. For example the member town in the customer dimension may have property attributes such as population, administrative position, etc. Such attributes can be used in the selection predicates of requests to filter certain groups. We now define the notion of member type, which incorporates the different features presented above. A member type is a structure: member_name[(member_key), dimension_name, (list_of_reference_attributes)] where • • •
member_name is the name of the type; member_key is a list of attribute names; the concatenation of these attributes identifies each instance of the type; list_of_reference_attributes is a list of attribute names where each attribute is a reference to the successors of the member instance in the cover graph of the dimension.
Only the member_key is mandatory. Using this model, the representations of the members of dimension customer in Figure 1 are the following: cust_root[(cust_key), customer, (town_name, zone_name)] town[(town_name), customer,(region_name)] demozone[(zone_name), customer, (region_name)] region[(region_name), customer, ()]
Well-Formed Structures
First, a fact can directly reference any member of a dimension. Usually a dimension is referenced through one of its roots (as we saw above, a dimension can have several roots). But it is also interesting and useful to have references to members other than the roots. This means that a dimension can be used by different facts with different granularities. For example, a fact can directly reference town in the customer dimension and another can directly reference region in the same dimension. This second reference corresponds to a coarser granule of analysis than the first. Moreover, a fact F1 can reference any other fact F2. This type of reference is necessary to model real situations. This means that a fact attribute of F1 can be analysed by using the key of F2 (acting as the grouping attribute of a normal member) and also by using the dimensions referenced by F2. To formalise the interconnection between facts and dimensions, we thus suggest extending the HR relationship of section 3 to the representation of the associations between fact types and the associations between fact types and member types. We impose the same properties (reflexivity, anti-symmetry, transitivity). We also forbid cyclic interconnections. This gives a very uniform model since fact types and member types are considered equally. To maintain a traditional vision of the data warehouses, we also ensure that the members of a dimension cannot reference facts. Figure 2 illustrates the typical structures we want to model. Case (a) corresponds to the simple case, also known as star structure, where there is a unique fact type F1 and several separate dimensions D1, D2, etc. Cases (b) and (c) correspond to the notion of facts of fact. Cases (d), (e) and (f) correspond to the sharing of the same dimension. In case (f) there can be two different paths starting at F2 and reaching the same member M of the sub-dimension D21. So analysis using these Figure 2. Typical warehouse structures F1
F1
D2
F2
D1
D1
D2
(a)
F1
F1
F2 D1
D2
D3
F1
D2
(c)
F2
F2
F1
D1 (d)
F3
(b)
D1
D21
In this section we explain how the fact types and the member types can be interconnected in order to model various warehouse structures.
F2
D21 (e)
(f)
525
/
A General Model for Data Warehouses
Figure 3. The DWG of a star-snowflake structure sales [(sales_key), (time_key, product_key, cust_key), (quantity, price_per_unit)]
time_root [(time_key), …]
product_root [(product_key), …]
day [(day_no),...]
category [(category_no), …]
year [(year_no),…]
town [(town_name),… ]
Definition of a Well-Formed Warehouse Structure A warehouse structure is well-formed when the DWG is acyclic and the path coherence constraint is satisfied for any couple of paths having the same starting node and the same ending node. A well-formed warehouse structure can thus have several roots. The different paths from the roots can be always divided into two sub-paths: the first one with only fact nodes and the second one with only member nodes. So, roots are fact types. Since the DWG is acyclic, it is possible to distribute its nodes into levels. Each level represents a level of aggregation. Each time we follow a directed edge, the
demozone [(zone_name), ...]
region [(region_name),…]
family [(family_name),…]
two paths cannot give the same results when reaching M. To pose this problem we introduce the DWG and the path coherence constraint. To represent data warehouse structures, we suggest using a graph representation called DWG (data warehouse graph). It consists in representing each type (fact type or member type) by a node containing the main information about this type, and representing each reference by a directed edge. Suppose that in the DWG graph, there are two different paths P1 and P2 starting from the same fact type F, and reaching the same member type M. We can analyse instances of F by using P1 or P2. The path coherence constraint is satisfied if we obtain the same results when reaching M. For example in case of figure 1 this constraint means the following: for a given occurrence of cust_key, whether the town path or the demozone path is used, one always obtains the same occurrence of region. We are now able to introduce the notion of wellformed structures.
526
cust_root [(cust_key), …]
level increases (by one or more depending on the used path). When using aggregate operations, this action corresponds to a ROLLUP operation (corresponding to the semantics of the HR) and the opposite operation to a DRILLDOWN. Starting from the reference to a dimension D in a fact type F, we can then roll up in the hierarchies of dimensions by following a path of the DWG.
Illustrating the Modelling of a Typical Case with Well-Formed Structures Well-formed structures are able to model correctly and completely the different cases of Figure 2. We illustrate in this section the modelling for the star-snowflake structure. We have a star-snowflake structure when: • •
there is a unique root (which corresponds to the unique fact type); each reference in the root points towards a separate subgraph in the DWG (this subgraph corresponds to a dimension).
Our model does not differentiate star structures from snowflake structures. The difference will appear with the mapping towards an operational model (relational model for example). The DWG of a star-snowflake structure is represented in Figure 3. This representation is well-formed. Such a representation can be very useful to a user for formulating requests.
FUTURE TRENDS A first opened problem is that concerning the property of summarizability between the levels of the different dimensions. For example, the total of the sales of a
A General Model for Data Warehouses
product for 2001 must be equal to the sum of the totals for the sales of this product for all months of 2001. Any model of data warehouse has to respect this property. In our presentation we supposed that function HR verified this property. In practice, various functions were proposed and used. It would be interesting and useful to begin a general formalization which would regroup all these propositions. Another opened problem concerns the elaboration of a design method for the schema of a data warehouse. A data warehouse is a data base and one can think that its design does not differ from that of a data base. In fact a data warehouse presents specificities which it is necessary to take into account, notably the data loading and the performance optimization. Data loading can be complex since sources schemas can differ from the data warehouse schema. Performance optimization arises particularly when using relational DBMS for implementing the data warehouse.
CONCLUSION In this paper we propose a model which can describe various data warehouse structures. It integrates and extends existing models for sharing dimensions and for representing relationships between facts. It allows for different entries in a dimension corresponding to different granularities. A dimension can also have several roots corresponding to different views and uses. Thanks to this model, we have also suggested the concept of Data Warehouse Graph (DWG) to represent a data warehouse schema. Using the DWG, we define the notion of well-formed warehouse structures which guarantees desirable properties. We have illustrated how typical structures such as star-snowflake structures can be advantageously represented with this model. Other useful structures like those depicted in Figure 2 can also be represented. The DWG gathers the main information from the warehouse and it can be very useful to users for formulating requests. We believe that the DWG can be used as an efficient support for a graphical interface to manipulate multidimensional structures through a graphical language. The schema of a data warehouse represented with our model can be easily mapped into an operational model. Since our model is object-oriented a mapping towards an object model is straightforward. But it is possible also to map the schema towards a relational model or an object relational model. It appears that our model has a natural place between the conceptual schema of the application and an object relational implementation of the warehouse. It can thus serve as a helping support for the design of relational data warehouses.
KEY TERMS

Data Warehouse: A database specifically built to support analyses of data. Analysis generally consists of aggregation operations (count, sum, average, etc.). A data warehouse differs from a transactional database in that it accumulates data along time and other dimensions. The data of a warehouse are loaded and updated at regular intervals from the transactional databases of the company.

Dimension: A set of members (criteria) used to drive the analysis (example for the Product dimension: product type, manufacturer type). Members are used to drive the aggregation operations.

Drilldown: The opposite of the rollup operation: moving to a less aggregated (more detailed) level in a hierarchy.
Fact: An element recorded in a warehouse (example: each product sold in a shop) whose characteristics (i.e., measures) are the object of the analysis (example: the quantity of a product sold in a shop).

Galaxy Structure: A warehouse structure in which two different types of facts share the same dimension.

Hierarchy: The members of a dimension are generally organized along levels into a hierarchy.

Member: Each criterion in a dimension is materialized as a member.

Rollup: The operation of moving to a more aggregated level in a hierarchy.

Star Structure: A warehouse structure in which a fact is directly connected to several dimensions and can thus be analyzed according to these dimensions. It is the simplest and most widely used structure.
Genetic Programming
William H. Hsu Kansas State University, USA
INTRODUCTION

Genetic programming (GP) is a subarea of evolutionary computation first explored by John Koza (1992) and independently developed by Nichael Lynn Cramer (1985). It is a method for producing computer programs through adaptation according to a user-defined fitness criterion, or objective function. GP systems and genetic algorithms (GAs) are related but distinct approaches to problem solving by simulated evolution. As in the GA methodology, GP uses a representation related to some computational model, but in GP, fitness is tied to task performance by specific program semantics. Instead of strings or permutations, genetic programs most commonly are represented as variable-sized expression trees in imperative or functional programming languages, as grammars (O'Neill & Ryan, 2001) or as circuits (Koza et al., 1999). GP uses patterns from biological evolution to evolve programs:

• Crossover: Exchange of genetic material such as program subtrees or grammatical rules.
• Selection: The application of the fitness criterion in order to choose which individuals from a population will go on to reproduce.
• Replication: The propagation of individuals from one generation to the next.
• Mutation: The structural modification of individuals.

To work effectively, GP requires an appropriate set of program operators, variables, and constants. Fitness in GP typically is evaluated over fitness cases. In data mining, this usually means training and validation data, but cases also can be generated dynamically using a simulator or be directly sampled from a real-world problem-solving environment. GP uses evaluation over these cases to measure performance over the required task, according to the given fitness criterion. This article begins with a survey of the design of GP systems and their applications to data-mining problems, such as pattern classification, optimization of representations for inputs and hypotheses in machine learning, grammar-based information extraction, and problem transformation by reinforcement learning. It concludes with a discussion of current issues in GP systems (i.e., scalability, human-comprehensibility, code growth and reuse, and incremental learning).

BACKGROUND

Although Cramer (1985) first described the use of crossover, selection, mutation, and tree representations for using genetic algorithms to generate programs, Koza, et al. (1992) is indisputably the field's most prolific and influential author (Wikipedia, 2004). In four books, Koza, et al. (1992, 1994, 1999, 2003) have described GP-based solutions to numerous toy problems and several important real-world problems.

• State of the Field: To date, GPs have been applied successfully to a few significant problems in machine learning and data mining, most notably symbolic regression and feature construction. The method is very computationally intensive, however, and it is still an open question in current research whether simpler methods can be used instead. These include supervised inductive learning, deterministic optimization, randomized approximation using non-evolutionary algorithms (i.e., Markov chain Monte Carlo approaches), genetic algorithms, and evolutionary algorithms. It is postulated by GP researchers that the adaptability of GPs to structural, functional, and structure-generating solutions of unknown forms makes them more amenable to solving complex problems. Specifically, Koza, et al. (1999, 2003) demonstrate that, in many domains, GP is capable of human-competitive automated discovery of concepts deemed to be innovative through technical review such as patent evaluation.
MAIN THRUST

The general strengths of genetic programming lie in its ability to produce solutions of variable functional form, reuse partial solutions, solve multi-criterion optimization problems, and explore a large search space of solutions in parallel. Modern GP systems also are able to produce structured, object-oriented, and functional programming solutions involving recursion or iteration, subtyping, and higher-order functions. A more specific advantage of GPs is their ability to represent procedural, generative solutions to pattern recognition and machine-learning problems. Examples of this include image compression and reconstruction (Koza, 1992) and several of the recent applications surveyed in the following.
GP for Pattern Classification GP in pattern classification departs from traditional supervised inductive learning in that it evolves solutions whose functional form is not determined in advance, and in some cases can be theoretically arbitrary. Koza (1992, 1994) developed GPs for several pattern reproduction problems, such as the multiplexer and symbolic regression problems. Since then, there has been continuing work on inductive GP for pattern classification (Kishore et al., 2000), prediction (Brameier & Banzhaf, 2001), and numerical curve fitting (Nikolaev & Iba, 2001). GP has been used to boost performance in learning polynomial functions (Nikolaev & Iba, 2001). More recent work on tree-based multi-crossover schemes has produced positive results in GP-based design of classification functions (Muni et al., 2004).
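To make the mechanics of tree-based GP concrete, the following is a minimal, illustrative Python sketch for a symbolic regression task. It is not the implementation used in any of the systems cited above; the function set, tournament size, mutation rate, and the target function y = x² + x are assumptions chosen only for the demonstration, and the fitness cases play the role described earlier (evaluation of each individual over a set of input/output pairs).

```python
import math
import operator
import random

FUNCS = [(operator.add, 2), (operator.sub, 2), (operator.mul, 2)]   # assumed function set
TERMS = ["x", -2.0, -1.0, 0.0, 1.0, 2.0]                             # assumed terminal set

def random_tree(depth=3):
    """Grow a random expression tree: a terminal, or [function, child, child]."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    func, arity = random.choice(FUNCS)
    return [func] + [random_tree(depth - 1) for _ in range(arity)]

def evaluate(tree, x):
    if isinstance(tree, list):
        return tree[0](*(evaluate(child, x) for child in tree[1:]))
    return x if tree == "x" else tree

def fitness(tree, cases):
    """Mean squared error over the fitness cases (lower is better)."""
    try:
        err = sum((evaluate(tree, x) - y) ** 2 for x, y in cases) / len(cases)
    except OverflowError:
        return float("inf")
    return err if math.isfinite(err) else float("inf")

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace(tree[path[0]], path[1:], new)
    return copy

def crossover(a, b):
    """Subtree crossover: splice a random subtree of b into a random point of a."""
    point, _ = random.choice(list(subtrees(a)))
    _, donor = random.choice(list(subtrees(b)))
    return replace(a, point, donor)

def mutate(tree):
    """Structural mutation: overwrite a random subtree with a fresh random one."""
    point, _ = random.choice(list(subtrees(tree)))
    return replace(tree, point, random_tree(2))

def evolve(cases, pop_size=100, generations=20):
    population = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        best = min(population, key=lambda t: fitness(t, cases))
        def select():          # tournament selection on the fitness criterion
            return min(random.sample(population, 5), key=lambda t: fitness(t, cases))
        offspring = [best]     # replication: the best individual is copied unchanged
        while len(offspring) < pop_size:
            child = crossover(select(), select())
            if random.random() < 0.2:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return min(population, key=lambda t: fitness(t, cases))

# fitness cases sampled from an assumed target function, y = x^2 + x
cases = [(x / 10.0, (x / 10.0) ** 2 + x / 10.0) for x in range(-20, 21)]
best = evolve(cases)
print("training MSE:", fitness(best, cases))
```

Note that the sketch deliberately leaves code growth uncontrolled; the bloat and parsimony issues discussed later in this article are exactly what a production GP system adds on top of this loop.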
GP for Control of Inductive Bias, Feature Construction, and Feature Extraction

GP approaches to inductive learning face the general problem of optimizing inductive bias: the preference for groups of hypotheses over others on bases other than pure consistency with training data or other fitness cases. Krawiec (2002) approaches this problem by using GP to preserve useful components of representation (features) during an evolutionary run, validating them using the classification data and reusing them in subsequent generations. This technique is related to the wrapper approach to knowledge discovery in databases (KDD), where validation data is held out and used to select examples for supervised learning or to construct or select variables given as input to the learning system. Because GP is a generative problem-solving approach, feature construction in GP tends to involve production of new variable definitions rather than merely selecting a subset. Evolving dimensionally-correct equations on the basis of data is another area where GP has been applied. Keijzer and Babovic (2002) provide a study of how GP formulates its declarative bias and preferential (search-based) bias. In this and related work, it is shown that a proper units of measurement (strong typing) approach can capture declarative bias toward correct equations, whereas type coercion can implement even better preferential bias.
Grammar-Based GP for Data Mining Not all GP-based approaches use expression tree-based representations or functional program interpretation as the computational model. Wong and Leung (2000) survey data mining using grammars and formal languages. This general approach has been shown to be effective for some natural language learning problems, and extension of the approach to procedural information extraction is a topic of current research in the GP community.
GP Software Packages: Functionality and Research Features A number of GP software packages are available publicly and commercially. General features common to most GP systems for research and development include a very high-period random number generator, such as the Mersenne Twister for random constant generation and GP operations; a variety of selection, crossover, and mutation operations; and trivial parallelism (e.g., through multithreading). One of the most popular packages for experimentation with GP is Evolutionary Computation in Java, or ECJ (Luke et al., 2004). ECJ implements the previously discussed features as well as parsimony, strongly-typed GP, migration strategies for exchanging individual subpopulations in island mode GP (a type of GP featuring multiple demes— local populations or breeding groups), vector representations, and reconfigurability using parameter files.
Other Applications: Optimization, Policy Learning Like other genetic and evolutionary computation methodologies, GP is driven by fitness and suited to optimization approaches to machine learning and data mining. Its program-based representation makes it good for acquiring policies by reinforcement learning.1 Many GP problems are error-driven or payoff-driven (Koza, 1992), including the ant trail problems and foraging problems now explored more heavily by the swarm intelligence and ant colony optimization communities. A few problems use specific information-theoretic criteria, such as maximum entropy or sequence randomization.
FUTURE TRENDS

Limitations: Scalability and Solution Comprehensibility

Genetic programming remains a controversial approach due to its high computational cost, scalability issues, and current gaps in fundamental theory for relating its performance to traditional search methods, such as hill climbing. While GP has achieved results in design, optimization, and intelligent control that are as good as and sometimes better than those produced by human engineers, it is not yet widely used as a technique due to these limitations in theory. An additional controversy, stemming from the intelligent systems community, is the role of knowledge in search-driven approaches such as GP. Some proponents of GP view it as a way to generate innovative solutions with little or no domain knowledge, while critics have expressed skepticism over original results due to the lower human comprehensibility of some results. The crux of this debate is a tradeoff between innovation and originality vs. comprehensibility, robustness, and ease of validation. Successes in replicating previously patented engineering designs such as analog circuits using GP (Koza et al., 2003) have increased its credibility in this regard.

Open Issues: Code Growth, Diversity, Reuse, and Incremental Learning

Some of the most important open problems in GP deal with the proliferation of solution code (called code growth or code bloat), the reuse of previously evolved partial solutions, and incremental learning. Code growth is an increase in solution size across generations and generally refers to one that is not matched by a proportionate increase in fitness. It has been studied extensively in the field of GP by many researchers. Luke (2000) provides a survey of known and hypothesized causes of code growth, along with methods for monitoring and controlling growth. Recently, Burke, et al. (2004) explored the relationship between diversity (variation among solutions) and code growth and fitness. Some techniques for controlling code growth include reuse of partial solutions through such mechanisms as automatically defined functions, or ADFs (Koza, 1994), and incremental learning; that is, learning in stages. One incremental approach in GP is to specify criteria for a simplified problem and then transfer the solutions to a new GP population (Hsu & Gustafson, 2002).

CONCLUSION

Genetic programming (GP) is a search methodology that provides a flexible and complete mechanism for machine learning, automated discovery, and cost-driven optimization. It has been shown to work well in many optimization and policy-learning problems, but scaling GP up to most real-world data mining domains is a challenge, due to its high computational complexity. More often, GP is used to evolve data transformations by constructing features or to control the declarative and preferential inductive bias of the machine-learning component. Making GP practical poses several key questions dealing with how to scale up, make solutions comprehensible to humans and statistically validate them, control the growth of solutions, reuse partial solutions efficiently, and learn incrementally. Looking ahead to future opportunities and challenges in data mining, genetic programming provides one of the more general frameworks for machine learning and adaptive problem solving. In data mining, GP systems are likely to be most useful where a generative or procedural solution is desired, or where the exact functional form of the solution (whether a mathematical formula, grammar, or circuit) is not known in advance.
REFERENCES Brameier, M., & Banzhaf, W. (2001). Evolving teams of predictors with linear genetic programming. Genetic Programming and Evolvable Machines, 2(4), 381-407. Burke, E.K., Gustafson, S., & Kendall, G. (2004). Diversity in genetic programming: An analysis of measures and correlation with fitness. IEEE Transactions on Evolutionary Computation, 8(1), 47-62. Cramer, N.L. (1985). A representation for the adaptive generation of simple sequential programs. Proceedings of the International Conference on Genetic Algorithms and Their Applications. Hsu, W.H., & Gustafson, S.M. (2002). Genetic programming and multi-agent layered learning by reinforcements. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002), New York. Keijzer, M., & Babovic, V. (2002). Declarative and preferential bias in GP-based scientific discovery. Genetic Programming and Evolvable Machines, 3(1), 41-79.
Kishore, J.K., Patnaik, L.M., Mani, V., & Agrawal, V.K. (2000). Application of genetic programming for multicategory pattern classification. IEEE Transactions on Evolutionary Computation, 4(3), 242-258. Koza, J.R. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press. Koza, J.R. (1994). Genetic programming II: Automatic discovery of reusable programs. Cambridge, MA: MIT Press. Koza, J.R. et al. (2003). Genetic programming IV: Routine human-competitive machine intelligence. San Mateo, CA: Morgan Kaufmann. Koza, J.R., Bennett III, F.H., André, D., & Keane, M.A. (1999). Genetic programming III: Darwinian invention and problem solving. San Mateo, CA: Morgan Kaufmann. Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines, 3(4), 329-343. Kushchu, I. (2002). Genetic programming and evolutionary generalization. IEEE Transactions on Evolutionary Computation, 6(5), 431-442. Luke, S. (2000). Issues in scaling genetic programming: Breeding strategies, tree generation, and code bloat [doctoral thesis]. College Park, MD: University of Maryland. Luke, S. et al. (2004). Evolutionary computation in Java v11. Retrieved from http://www.cs.umd.edu/projects/plus/ ec/ecj/ Muni, D.P., Pal, N.R., & Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2), 183-196. Nikolaev, N.Y., & Iba, H. (2001a). Regularization approach to inductive genetic programming. IEEE Transactions on Evolutionary Computation, 5(4), 359-375. Nikolaev, N.Y., & Iba, H. (2001b). Accelerated genetic programming of polynomials. Genetic Programming and Evolvable Machines, 2(3), 231-257. O’Neill, M., & Ryan, C. (2001). Grammatical evolution. IEEE Transactions on Evolutionary Computation. Wikipedia. (2004). Genetic programming. Retrieved from http://en.wikipedia.org/wiki/Genetic_programming Wong, M.L., & Leung, K.S. (2000). Data mining using grammar based genetic programming and applications. Norwell, MA: Kluwer.
KEY TERMS Automatically-Defined Function (ADF): Parametric functions that are learned and assigned names for reuse as subroutines. ADFs are related to the concept of macrooperators or macros in speedup learning. Code Growth (Code Bloat): The proliferation of solution elements (e.g., nodes in a tree-based GP representation) that do not contribute toward the objective function. Crossover: In biology, a process of sexual recombination, by which two chromosomes are paired up and exchange some portion of their genetic sequence. Crossover in GP is highly stylized and involves structural exchange, typically using subexpressions (subtrees) or production rules in a grammar. Evolutionary Computation: A solution approach based on simulation models of natural selection, which begins with a set of potential solutions and then iteratively applies algorithms to generate new candidates and select the fittest from this set. The process leads toward a model that has a high proportion of fit individuals. Generation: The basic unit of progress in genetic and evolutionary computation, a step in which selection is applied over a population. Usually, crossover and mutation are applied once per generation, in strict order. Individual: A single candidate solution in genetic and evolutionary computation, typically represented using strings (often of fixed length) and permutations in genetic algorithms, or using problem-solver representations (i.e., programs, generative grammars, or circuits) in genetic programming. Island Mode GP: A type of parallel GP, where multiple subpopulations (demes) are maintained and evolve independently, except during scheduled exchanges of individuals. Mutation: In biology, a permanent, heritable change to the genetic material of an organism. Mutation in GP involves structural modifications to the elements of a candidate solution. These include changes, insertion, duplication, or deletion of elements (subexpressions, parameters passed to a function, components of a resistor-capacitor-inducer circuit, non-terminals on the righthand side of a production rule). Parsimony: An approach in genetic and evolutionary computation related to minimum description length, which rewards compact representations by imposing a penalty for individuals in direct proportion to their size (e.g., number of nodes in a GP tree). The rationale for parsimony is that it promotes generalization in supervised inductive
learning and produces solutions with less code, which can be more efficient to apply.

Selection: In biology, a mechanism by which the fittest individuals survive to reproduce, and the basis of speciation, according to the Darwinian theory of evolution. Selection in GP involves evaluation of a quantitative criterion over a finite set of fitness cases, with the combined evaluation measures being compared in order to choose individuals.

ENDNOTE

1. Reinforcement learning describes a class of problems in machine learning, in which an agent explores an environment, selecting actions based upon perceived state and an internal policy. This policy is updated, based upon a (positive or negative) reward that it receives from the environment as a result of its actions and that it seeks to maximize in the expected case.
Graph Transformations and Neural Networks

Ingrid Fischer
Friedrich-Alexander University Erlangen-Nürnberg, Germany
INTRODUCTION

The area of artificial neural networks is considered to have begun with the introduction of the artificial neuron of McCulloch and Pitts, which was inspired by the biological neuron. Since then, many new networks and new algorithms for neural networks have been invented, with the result that this area is not very clearly laid out. In most textbooks on (artificial) neural networks (Rojas, 2000; Silipo, 2002), there is no general definition of what a neural net is, but rather an example-based introduction leading from the biological model to some artificial successors. Perhaps the most promising approach to defining a neural network is to see it as a network of many simple processors (units), each possibly having a small amount of local memory. The units are connected by communication channels (connections) that usually carry numeric (as opposed to symbolic) data called the weight of the connection. The units operate only on their local data and on the inputs they receive via the connections. It is typical of neural networks that they have great potential for parallelism, since the computations of the components are largely independent of each other. Neural networks work best if the system modeled by them has a high tolerance to error. Therefore, one would not be advised to use a neural network to balance one's checkbook. However, they work very well for:

• capturing associations or discovering regularities within a set of patterns;
• any application where the number of variables or diversity of the data is very great;
• any application where the relationships between variables are vaguely understood; or
• any application where the relationships are difficult to describe adequately with conventional approaches.
Neural networks are not programmed but can be trained in different ways. In supervised learning, examples are presented to an initialized net. From the input and the output of these examples, the neural net learns somehow. There are as many learning algorithms as there are types of neural nets. Also, learning is motivated physiologically. When an example is presented to a neural network that it cannot recalculate, several different steps are
possible: changes can be done for a neuron, for the connection’s weight or new connections, and neurons can be inserted. For unsupervised learning, the results of an input are not known. There are many advantages and limitations to neural network analysis, and to discuss this subject properly, one must look at each individual type of network. Nevertheless, there is one specific limitation of neural networks that potential users should be aware of. Neural networks are more or less, depending on the different types, the ultimate black boxes. The final result of the learning process is a trained network that provides no equations or coefficients defining a relationship beyond its own internal mathematics. Graphs are widely-used concepts within computer science; in nearly every field, graphs serve as a tool for visualization, summarization of dependencies, explanation of connections, and so forth. Famous examples are all kinds of different nets and graphs as semantic nets, petri nets, flow charts, interaction diagrams, or neural networks. Invented first 35 years ago, graph transformations have been expanding constantly. Wherever graphs are used, graph transformations also are applied (Blostein & Schürr, 1999; Ehrig et al., 1999a; Ehrig et al., 1999b; Rozenberg, 1997). Graph transformations are a very promising method for modeling and programming neural networks. The graph part is automatically given, as the name neural network already indicates. Having graph transformations as methodology, it is easy to model algorithms on this graph structure. Structure-preserving and structurechanging algorithms can be modeled equally well. This is not the case for the widely used matrices programmed mostly in C or C++. In these approaches, modeling structure change becomes more difficult. This leads directly to a second advantage. Graph transformations have proven useful for visualizing the network and its algorithms. Most modern neural network simulators have some kind of visualization tool. Graph transformations offer a basis for this visualization, as the algorithms are already implemented in visual rules. Also, in nearly all books, neural networks are visualized as graphs. When having a formal methodology at hand, it is also possible to use it for proving properties of nets and algorithms. Especially in this area, earlier results for graph
transformation systems can be used. Three possibilities are especially promising: first, it is interesting whether an algorithm is terminating. Though this question is undecidable in the general case, the formal methods of graph rewriting and general rewriting offer some chances to prove termination for neural network algorithms. The same holds for the question whether the result produced by an algorithm is useful, whether the learning of a neural network was successful. Then it helps to prove whether two algorithms are equivalent. Finally, possible parallelism in algorithms can be detected and described, based on results for graph transformation systems.
BACKGROUND

A Short Introduction to Graph Transformations

Despite the different approaches to handling graph transformations, there are some properties that all approaches have in common. When transforming a graph G somehow, it is necessary to specify what part of the graph, what subgraph L, has to be exchanged. For this subgraph, a new graph R must be inserted. When applying such a rule to a graph G, three steps are necessary:

• Choose an occurrence of L in G.
• Delete L from G.
• Insert R into the remainder of G.
In Figure 1, a sample application of a graph transformation rule is shown. The left-hand side L consists of three nodes (1:, 2:, 3:) and three edges. This graph is embedded into a graph G. Numbers in G indicate how the nodes of L are matched. The embedding of edges is straightforward. In the next step L is deleted from G, and R is inserted. If L is simply deleted from G, hanging edges remain. All edges ending/starting at 1:,2:,3: are missing one node after deletion. With the help of numbers 1:,2:,3: in the right-hand side R, it is indicated how these hanging edges are attached to R inserted in G/L. The resulting graph is H. Simple graphs are not enough for modeling real-world applications. Among the different extensions, two are of special interest. First, graphs and graph rules can be labeled. When G is labeled with numbers, L is labeled with variables, and R is labeled with terms over L’s variables. This way, calculations can be modeled. Taking our example and extending G with numbers 1,2,3, the left-hand side L with variables x,y,z and the right-hand side with terms x+y, x-y, x×y, x y is shown in Figure 1. When L is embedded in G, the variables are set to the numbers of the corresponding nodes. The nodes in H are labeled with the result of the terms in R when the variable settings resulting from the embedding of L in G are used. Also, application conditions can be added, restricting the application of a rule. For example, the existence of a certain subgraph A in G can be allowed or forbidden. A rule only can be applied if A can be found resp. not found in G. Additionally, label-based application conditions are possible. This rule could be extended by asking for x < y. Only in this case would the rule be applied.
Figure 1. The application of a graph rewrite rule L → R to a graph G
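To make the three steps above concrete, here is a small Python sketch (using the networkx library) in the spirit of the rule in Figure 1. The particular edge structure, variable names, and relabeling terms are illustrative assumptions, not a reconstruction of the figure; to keep the sketch short, the right-hand side R reuses the matched nodes, which sidesteps the hanging-edge embedding problem discussed above.

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso

def apply_rule(G, L, relabel, rewire):
    """Apply one labeled rewrite step L -> R: choose an occurrence, delete it, insert R.

    `relabel` maps each L-node to a function of the variable binding (the matched
    numeric labels); `rewire` lists the edges R should have between the matched nodes.
    """
    matcher = iso.DiGraphMatcher(G, L)        # structural match; L's labels are variables
    for mapping in matcher.subgraph_isomorphisms_iter():    # G-node -> L-node
        inv = {ln: gn for gn, ln in mapping.items()}
        binding = {L.nodes[ln]["var"]: G.nodes[gn]["label"] for gn, ln in mapping.items()}
        # delete the occurrence's edges (the matched nodes are kept)
        G.remove_edges_from([(inv[a], inv[b]) for a, b in L.edges])
        # insert R: labels computed as terms over the binding, plus R's internal edges
        for ln, term in relabel.items():
            G.nodes[inv[ln]]["label"] = term(binding)
        G.add_edges_from([(inv[a], inv[b]) for a, b in rewire])
        return True                            # apply to a single occurrence
    return False

# a rule loosely modeled on Figure 1: three nodes with variables x, y, z (assumed shape)
L = nx.DiGraph()
L.add_nodes_from([(1, {"var": "x"}), (2, {"var": "y"}), (3, {"var": "z"})])
L.add_edges_from([(1, 2), (2, 3), (1, 3)])
relabel = {1: lambda b: b["x"] + b["y"],
           2: lambda b: b["x"] * b["y"],
           3: lambda b: b["x"] - b["y"]}
rewire = [(1, 2), (1, 3)]

G = nx.DiGraph()
G.add_nodes_from([(10, {"label": 1}), (11, {"label": 2}),
                  (12, {"label": 3}), (13, {"label": 7})])
G.add_edges_from([(10, 11), (11, 12), (10, 12), (12, 13)])   # (12, 13) is context
apply_rule(G, L, relabel, rewire)
print({n: d["label"] for n, d in G.nodes(data=True)})
```

Because the matched nodes survive the rewrite, edges from the rest of G (such as the context edge above) stay attached automatically; a general graph transformation engine instead needs explicit embedding information for such hanging edges.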
Combining Neural Networks and Graph Transformations

Various proposals exist as to how graph transformations and neural networks can be combined, having different goals in mind. Several ideas originate from evolutionary computing (Curran & O'Riodan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998); others stem from electrical engineering (Wan & Beaufays, 1998). The approach of Fischer (2000) has its roots in graph transformations itself. Despite these sources, only three really different ideas can be found:

• In most of the papers (Curran & O'Riodan, 2002; De Jong & Pollack, 2001; Siddiqi & Lucas, 1998), some basic graph operators like insert-node, delete-node, change-attribute, and so forth are used. This set of operators differs between the approaches. In addition to these operators, some kind of application condition exists, stating which rule has to be applied when and on which nodes or edges. These application conditions can be directed acyclic graphs giving the sequence of the rule applications. They can also be tree-based, where newly created nodes are handed to different paths in the tree. The main application area of these approaches is to grow neural networks from just one node. In this field, other grammar-based approaches can also be found (Cantu-Paz & Kamath, 2002), where matrices are rewritten (Browse, Hussain & Smillie, 1999), taking attributed grammars. In Tsakonas and Dounias (2002), feed-forward neural networks are grown with the help of a grammar in Backus-Naur form.

• In Wan and Beaufays (1998), signal flow graphs known from electrical engineering are used to model the information flow through the net. With the help of rewrite rules, the elements can be reversed, so that the signal flow goes in the opposite direction as before. This way, gradient descent-based training methods such as back propagation, a famous training algorithm (Rojas, 2000), can be derived. A disadvantage of this method is that no topology-changing algorithms can be modeled.

• In Fischer (2000), arbitrary neural nets are modeled as graphs and transformed by arbitrary transformation rules. Because this is the most general approach, it will be explained in detail in the following sections.

Figure 2. A probabilistic neural network seen as a graph (not all edges are labeled, due to space reasons)
MAIN THRUST

In the remainder of this article, one special sort of neural network, the so-called probabilistic neural network, together with its training algorithms, is explained in detail.
Probabilistic Neural Networks

The main purpose of a probabilistic neural network is to sort patterns into classes. It always has three layers: an input layer, a hidden layer, and an output layer. The input neurons are connected to each neuron in the hidden layer. The neurons of the hidden layer are connected to one output neuron each. Hidden neurons are modeled with two nodes, as they have an input value and do some calculations on it, resulting in an output value. Neurons and connections are labeled with values and weights, respectively. In Figure 2, a probabilistic neural network is shown.
Calculations in Probabilistic Neural Networks

First, input is presented to the input neurons. The next step is to calculate the input value of the hidden neurons. The main purpose of the hidden neurons is to represent examples of classes. Each neuron represents one example. This example forms the weights from the input neurons to this special hidden neuron. When an input is presented to the net, each neuron in the hidden layer computes the probability that it is the example it models. Therefore, first, the Euclidean distance between the activation of the input neurons and the weight of the connection to the hidden layer's neuron is computed. The result coming via the connections is summed up within a hidden neuron. This is the distance that the current input has from the example modeled by the neuron. If the exact example is inserted into the net, the result is 0. In Figure 3, this calculation is shown in detail. The given graph rewrite rule can be applied to the net shown in Figure 2. The labels i are variables modeling the input values of the input neurons, and w models the weights of the connections. In the right-hand side R, a formula is given to calculate the input activation. Then, a Gaussian function is used to calculate the probability that the inserted pattern is the example pattern of the neuron. The radius σ of the Gaussian function has to be chosen during the training of the net, or it has to be adapted by the user. The result is the output activation of the neuron. The main purpose of the connections between the hidden and output layers is to sum up the activations of the hidden layer's neurons for one class. These values are multiplied with a weight associated to the corresponding connection. The output neuron with the highest activation represents the class of the input pattern.

Figure 3. Calculating the input activation of a neuron (the rule's right-hand side labels the hidden neuron's input with Σ_j ||i_j − w_j||)

Figure 4. A new neuron is inserted into a net. This example is taken from Dynamic Decay Adjustment (Silipo, 2002)

Training Neural Networks

For the probabilistic neural networks, different methods are used. The classical training method is the following: the complete net is built up by inserting one hidden neuron for each training pattern. As weights of the connections from the input neurons to the hidden neuron representing this example, the training pattern itself is taken. The connections from the hidden layer to the output layer are weighted with 1/m if there are m training patterns for one class. The radius σ of the Gaussian function used is usually simply set to the average distance between the centers of the Gaussians modeling the training patterns.
A more sophisticated training method is called Dynamic Decay Adjustment (Silipo, 2002). The algorithm also starts with no neuron in the hidden layer. If a pattern is presented to the net, first, the activation of all existing hidden neurons is calculated. If there is a neuron whose activation is equal to or higher than a given threshold θ+, this neuron covers the input pattern. If this is not the case, a new hidden neuron is inserted into the net, as shown in Figure 4. With the help of this algorithm, fewer hidden neurons are inserted.
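As a concrete illustration, the following Python sketch mirrors the forward pass and the two training styles just described. The distance-then-Gaussian formulation follows the prose above; the values of σ and of the coverage threshold θ+ are assumptions, the 1/m output weighting is approximated by averaging over the hidden neurons of a class, and the full Dynamic Decay Adjustment algorithm (which also shrinks the radii of conflicting neurons of other classes) is reduced here to its neuron-insertion step.

```python
import math
from collections import defaultdict

class PNN:
    """Sketch of the probabilistic neural network described above."""

    def __init__(self, sigma=1.0, theta_plus=0.4):    # assumed parameter values
        self.sigma = sigma
        self.theta_plus = theta_plus                   # DDA-style coverage threshold
        self.hidden = []                               # list of (weights, class label)

    def _activation(self, weights, pattern):
        # summed per-connection distance, pushed through a Gaussian
        d = sum(abs(i - w) for i, w in zip(pattern, weights))
        return math.exp(-(d ** 2) / (2 * self.sigma ** 2))

    def class_scores(self, pattern):
        per_class = defaultdict(list)
        for w, c in self.hidden:
            per_class[c].append(self._activation(w, pattern))
        # output layer: average the activations of each class's hidden neurons
        return {c: sum(a) / len(a) for c, a in per_class.items()}

    def classify(self, pattern):
        scores = self.class_scores(pattern)
        return max(scores, key=scores.get) if scores else None

    def train(self, pattern, label):
        """Insert a new hidden neuron only if no same-class neuron covers the pattern."""
        covered = any(c == label and self._activation(w, pattern) >= self.theta_plus
                      for w, c in self.hidden)
        if not covered:
            self.hidden.append((list(pattern), label))

net = PNN()
for x, y in [((0.0, 0.1), "a"), ((0.1, 0.0), "a"), ((0.9, 1.0), "b"), ((1.0, 0.9), "b")]:
    net.train(x, y)
print(len(net.hidden), "hidden neurons")          # fewer neurons than training patterns
print(net.classify((0.05, 0.05)), net.classify((0.95, 0.95)))
```

Setting θ+ close to 1 makes the insertion test fail almost always, which recovers the classical training scheme of one hidden neuron per training pattern.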
FUTURE TRENDS

In the area of graph grammars and graph transformation systems, several programming environments, especially AGG and Progress (Ehrig, Engels, Kreowski, & Rozenberg, 1999), have been developed. They can be used to obtain new implementations of neural network algorithms, especially of algorithms that change the net's topology. Pruning means the deletion of neurons and connections between neurons in already trained neural networks to minimize their size. Several mechanisms exist that mainly differ in the evaluation function determining the neurons/connections to be deleted. It is an interesting idea to model pruning with graph transformation rules. A factor graph is a bipartite graph that expresses how a global function of many variables factors into a product of local functions. It subsumes other graphical means of artificial intelligence such as Bayesian networks or Markov random fields. It would be worth checking whether graph transformation systems are helpful in formulating algorithms for these soft computing approaches. The use of graph transformations is also interesting for more biological applications.
CONCLUSION

Modeling neural networks as graphs and algorithms on neural networks as graph transformations is an easy-to-use, straightforward method. It has several advantages. First, the structure of neural nets is supported. When modeling the net topology and topology-changing algorithms, graph transformation systems can present their full power. This might be of special interest for educational purposes where it is useful to visualize step-by-step what algorithms do. Finally, the theoretical background of graph transformation and rewriting systems offers several possibilities for proving termination, equivalence, and the like, of algorithms.
REFERENCES

Rojas, R. (2000). Neural networks: A systematic introduction. New York, NY: Springer Verlag.

Rozenberg, G. (Ed.). (1997). Handbook of graph grammars and computing by graph transformations. Singapore: World Scientific.
Siddiqi, A., & Lucas, S.M. (1998). A comparison of matrix rewriting versus direct encoding for evolving neural networks. Proceedings of the IEEE International Conference on Evolutionary Computation, Anchorage, Alaska.
Blostein, D., & Schürr, A. (1999). Computing with graphs and graph transformation. Software Practice and Experience, 29(3), 1-21.
Silipo, R. (2002). Artificial neural networks. In M. Berthold, & D. Hand (Eds.), Intelligent data analysis (pp. 269-319). New York, NY: Springer Verlag.
Browse, R.A., Hussain, T.S., & Smillie, M.B. (1999). Using attribute grammars for the genetic selection of backpropagation networks for character recognition. Proceedings of Applications of Artificial Neural Networks in Image Processing IV. San Jose, California.
Tsakonas, A., & Dounias, D. (2002). A scheme for the evolution of feedforward neural Networks using BNFgrammar driven genetic programming. Proceedings of the EUNITE—EUropean Network on Intelligent Technologies for Smart Adaptive Systems, Algarve, Portugal.
Cantu-Paz, E., & Kamath, C. (2002). Evolving neural networks for the classification of galaxies. Proceedings of the Genetic and Evolutionary Computation Conference.
Wan, E., & Beaufays, F. (1998). Diagrammatic methods for deriving and relating temporal neural network algorithms. In C. Giles, & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 63-98). Salerno, Italy: International Summer School on Neural Networks.
Curran, D., & O’Riordan, C. (2002). Applying evolutionary computation to designing neural networks: A study of the state of the art [technical report]. Galway, Ireland: Department of Information Technology, National University of Ireland. De Jong, E., & Pollack, J. (2001). Utilizing bias to evolve recurrent neural networks. Proceedings of the International Joint Conference on Neural Networks, Washington, D.C. Ehrig, H., Engels, G., Kreowski, H.-J., & Rozenberg, G. (Eds.). (1999a). Handbook on graph grammars and computing by graph transformation. Singapore: World Scientific. Ehrig, H., Kreowski, H.-J., Montanari, U., & Rozenberg, G. (Eds.). (1999b). Handbook on graph grammars and computing by graph transformation. Singapore: World Scientific. Fischer, I. (2000). Describing neural networks with graph transformations [doctoral thesis]. Nuremberg: FriedrichAlexander University Erlangen-Nuremberg. Klop, J.W., De Vrijer, R.C., & Bezem, M. (2003). Term rewriting systems. Cambridge, MA: Cambridge University Press. Nipkow, T., & Baader, F. (1999). Term rewriting and all that. Cambridge, MA: Cambridge University Press.
KEY TERMS Confluence: A rewrite system is confluent, if no matter in which order rules are applied, they lead to the same result. Graph: A graph consists of vertices and edges. Each edge is connected to a source node and a target node. Vertices and edges can be labeled with numbers and symbols. Graph Production: Similar to productions in general Chomsky grammars, a graph production consists of a lefthand side and a right-hand side. The left-hand side is embedded in a host graph. Then, it is removed, and in the resulting hole, the right-hand side of the graph production is inserted. To specify how this right-hand side is attached into this hole, how edges are connected to the new nodes, some additional information is necessary. Different approaches exist how to handle this problem. Graph Rewriting: The application of a graph production to a graph is also called graph rewriting. Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, which can be trained to find nonlinear relationships in data. Several neurons are connected to form the neural networks.
Neuron: The smallest processing unit in a neural network.

Probabilistic Neural Network: One of the many different kinds of neural networks, whose application area is to classify input data into different classes.

Rewrite System: Consists of a set of configurations and a relation x → y denoting that the configuration x follows the configuration y with the help of a rule application.

Termination: A rewrite system terminates if it has no infinite chain.

Weight: Connections between neurons of neural networks have a weight. This weight can be changed during the training of the net.
Graph-Based Data Mining

Lawrence B. Holder
University of Texas at Arlington, USA

Diane J. Cook
University of Texas at Arlington, USA
INTRODUCTION
Graph-based data mining represents a collection of techniques for mining the relational aspects of data represented as a graph. Two major approaches to graph-based data mining are frequent subgraph mining and graph-based relational learning. This article will focus on one particular approach embodied in the Subdue system, along with recent advances in graph-based supervised learning, graph-based hierarchical conceptual clustering, and graph-grammar induction. Most approaches to data mining look for associations among an entity's attributes, but relationships between entities represent a rich source of information, and ultimately knowledge. The field of multi-relational data mining, of which graph-based data mining is a part, is a new area investigating approaches to mining this relational information by finding associations involving multiple tables in a relational database. Two main approaches have been developed for mining relational information: logic-based approaches and graph-based approaches. Logic-based approaches fall under the area of inductive logic programming (ILP). ILP embodies a number of techniques for inducing a logical theory to describe the data, and many techniques have been adapted to multi-relational data mining (Dzeroski & Lavrac, 2001; Dzeroski, 2003). Graph-based approaches differ from logic-based approaches to relational mining in several ways, the most obvious of which is the underlying representation. Furthermore, logic-based approaches rely on the prior identification of the predicate or predicates to be mined, while graph-based approaches are more data-driven, identifying any portion of the graph that has high support. However, logic-based approaches allow the expression of more complicated patterns involving, for example, recursion, variables, and constraints among variables. These representational limitations of graphs can be overcome, but at a computational cost.

BACKGROUND
Graph-based data mining (GDM) is the task of finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data. Several approaches to GDM exist based on the task of identifying frequently occurring subgraphs in graph transactions, that is, those subgraphs meeting a minimum level of support. Washio & Motoda (2003) provide an excellent survey of these approaches. We here describe four representative GDM methods. Kuramochi and Karypis (2001) developed the FSG system for finding all frequent subgraphs in large graph databases. FSG starts by finding all frequent single and double edge subgraphs. Then, in each iteration, it generates candidate subgraphs by expanding the subgraphs found in the previous iteration by one edge. In each iteration the algorithm checks how many times the candidate subgraph occurs within an entire graph. The candidates, whose frequency is below a user-defined level, are pruned. The algorithm returns all subgraphs occurring more frequently than the given level. Yan and Han (2002) introduced gSpan, which combines depth-first search and lexicographic ordering to find frequent subgraphs. Their algorithm starts from all frequent one-edge graphs. The labels on these edges together with labels on incident vertices define a code for every such graph. Expansion of these one-edge graphs maps them to longer codes. Since every graph can map to many codes, all but the smallest code are pruned. Code ordering and pruning reduces the cost of matching frequent subgraphs in gSpan. Yan & Han (2003) describe a refinement to gSpan, called CloseGraph, which identifies only subgraphs satisfying the minimum support, such that no supergraph exists with the same level of support. Inokuchi et al. (2003) developed the Apriori-based Graph Mining (AGM) system, which searches the space of frequent subgraphs in a bottom-up fashion, beginning
with a single vertex, and then continually expanding by a single vertex and one or more edges. AGM also employs a canonical coding of graphs in order to support fast subgraph matching. AGM returns association rules satisfying user-specified levels of support and confidence. The last approach to GDM, and the one discussed in the remainder of this chapter, is embodied in the Subdue system (Cook & Holder, 2000). Unlike the above systems, Subdue seeks a subgraph pattern that not only occurs frequently in the input graph, but also significantly compresses the input graph when each instance of the pattern is replaced by a single vertex. Subdue performs a greedy search through the space of subgraphs, beginning with a single vertex and expanding by one edge. Subdue returns the pattern that maximally compresses the input graph. Holder & Cook (2003) describe current and future directions in this graph-based relational learning variant of GDM.
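As a minimal illustration of the support-counting step shared by these transaction-based miners, the Python sketch below (using networkx) checks each candidate subgraph against every graph transaction. Candidate generation, canonical codes, and gSpan's DFS-code machinery are omitted; `min_support` is an assumed user parameter, and the induced-subgraph test is a simplification of the embedding semantics used by FSG and AGM.

```python
from networkx.algorithms import isomorphism as iso

def support(pattern, transactions):
    """Fraction of graph transactions that contain the candidate subgraph.

    Uses GraphMatcher for undirected networkx Graphs (use DiGraphMatcher for
    directed transactions); labels are compared on the 'label' attribute.
    """
    nm = iso.categorical_node_match("label", None)
    em = iso.categorical_edge_match("label", None)
    hits = sum(
        1 for g in transactions
        if iso.GraphMatcher(g, pattern, node_match=nm,
                            edge_match=em).subgraph_is_isomorphic()
    )
    return hits / len(transactions)

def prune(candidates, transactions, min_support):
    """The pruning step: keep only candidates whose support meets the threshold."""
    return [(p, s) for p in candidates
            if (s := support(p, transactions)) >= min_support]
```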
MAIN THRUST As a representative of GDM methods, this section will focus on the Subdue graph-based data mining system. The input data is a directed graph with labels on vertices and edges. Subdue searches for a substructure that best compresses the input graph. A substructure consists of a subgraph definition and all its occurrences throughout the graph. The initial state of the search is the set of substructures consisting of all uniquely labeled vertices. The only operator of the search is the Extend Substructure operator. As its name suggests, it extends a substructure in all possible ways by a single edge and a vertex, or by only a single edge if both vertices are already in the subgraph. Subdue’s search is guided by the minimum description length (MDL) principle, which seeks to minimize the description length of the entire data set. The evaluation heuristic based on the MDL principle assumes that the best substructure is the one that minimizes the description length of the input graph when compressed by the substructure. The description length of the substructure S given the input graph G is calculated as DL(G,S) = DL(S)+DL(G|S), where DL(S) is the description length of the substructure, and DL(G|S) is the description length of the input graph compressed by the substructure. Subdue seeks a substructure S that minimizes DL(G,S). The search progresses by applying the Extend Substructure operator to each substructure in the current state. The resulting state, however, does not contain all the substructures generated by the Extend Substructure operator. The substructures are kept on a queue and are ordered based on their description length (or some-
times referred to as value) as calculated using the MDL principle. The queue’s length is bounded by a userdefined constant. The search terminates upon reaching a user-specified limit on the number of substructures extended, or upon exhaustion of the search space. Once the search terminates and Subdue returns the list of best substructures found, the graph can be compressed using the best substructure. The compression procedure replaces all instances of the substructure in the input graph by single vertices, which represent the substructure’s instances. Incoming and outgoing edges to and from the replaced instances will point to, or originate from the new vertex that represents the instance. The Subdue algorithm can be invoked again on this compressed graph. Figure 1 illustrates the GDM process on a simple example. Subdue discovers substructure S1, which is used to compress the data. Subdue can then run for a second iteration on the compressed graph, discovering substructure S2. Because instances of a substructure can appear in slightly different forms throughout the data, an inexact graph match, based on graph edit distance, is used to identify substructure instances. Most GDM methods follow a similar process. Variations involve different heuristics (e.g., frequency vs. MDL) and different search operators (e.g., merge vs. extend).
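The following Python sketch (again using networkx for subgraph matching) imitates the Extend Substructure / evaluate / beam loop described above. It is an assumed, rough approximation rather than the Subdue system itself: the description-length computation is replaced by a crude vertex-plus-edge count, duplicate candidate patterns are not canonicalized, and instances are matched exactly rather than inexactly.

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso

NODE_MATCH = iso.categorical_node_match("label", None)
EDGE_MATCH = iso.categorical_edge_match("label", None)

def embeddings(graph, pattern):
    """All embeddings of `pattern` in `graph`, as dicts mapping graph nodes to pattern nodes."""
    matcher = iso.DiGraphMatcher(graph, pattern, node_match=NODE_MATCH, edge_match=EDGE_MATCH)
    return list(matcher.subgraph_isomorphisms_iter())

def instances(mappings):
    """Greedily keep vertex-disjoint embeddings as the substructure's instances."""
    used, kept = set(), []
    for m in mappings:
        if used.isdisjoint(m):
            used.update(m)
            kept.append(m)
    return kept

def dl(g):
    """Crude stand-in for an MDL encoding: one unit per vertex and per edge."""
    return g.number_of_nodes() + g.number_of_edges()

def value(graph, pattern, inst):
    """DL(S) + DL(G|S), with every instance collapsed to a single vertex."""
    saved = len(inst) * (dl(pattern) - 1)
    return dl(pattern) + max(dl(graph) - saved, 0)

def extend(graph, pattern, mappings):
    """Extend Substructure: grow the pattern by one edge found around an embedding."""
    grown = []
    for m in mappings:
        for u, v, data in graph.edges(data=True):
            if u not in m and v not in m:
                continue
            p = pattern.copy()
            pu, pv = m.get(u, ("ext", u)), m.get(v, ("ext", v))
            if pu not in p:
                p.add_node(pu, label=graph.nodes[u]["label"])
            if pv not in p:
                p.add_node(pv, label=graph.nodes[v]["label"])
            if p.has_edge(pu, pv):
                continue
            p.add_edge(pu, pv, label=data.get("label"))
            grown.append(p)
    return grown

def discover(graph, beam=4, limit=40):
    """Beam search for the substructure with the lowest (best) compression value."""
    frontier = []
    for lbl in {d["label"] for _, d in graph.nodes(data=True)}:
        seed = nx.DiGraph()
        seed.add_node(0, label=lbl)
        frontier.append(seed)
    best, best_value, expanded = None, float("inf"), 0
    while frontier and expanded < limit:
        scored = []
        for pattern in frontier:
            maps = embeddings(graph, pattern)
            val = value(graph, pattern, instances(maps))
            if pattern.number_of_edges() > 0 and val < best_value:
                best, best_value = pattern, val
            scored.append((val, maps, pattern))
            expanded += 1
        scored.sort(key=lambda t: t[0])
        frontier = [p for _, maps, pat in scored[:beam] for p in extend(graph, pat, maps)]
    return best, best_value

# toy input graph: three copies of an A->B edge plus an unrelated C vertex (assumed data)
G = nx.DiGraph()
for i in range(3):
    G.add_node(("a", i), label="A"); G.add_node(("b", i), label="B")
    G.add_edge(("a", i), ("b", i), label="on")
G.add_node("c", label="C")
print(discover(G))
```

On the toy graph the repeated A→B edge is returned, because collapsing its three instances yields a smaller total description than leaving the graph uncompressed.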
Figure 1. Graph-based data mining: A simple example

Graph-Based Hierarchical Conceptual Clustering

Given the ability to find a prevalent subgraph pattern in a larger graph and then compress the graph with this pattern, iterating over this process until the graph can no longer be compressed will produce a hierarchical, conceptual clustering of the input data. On the ith iteration, the best subgraph Si is used to compress the input graph, introducing new vertices labeled Si in the graph input to the next iteration. Therefore, any subsequently discovered subgraph Sj can be defined in terms of one or more of the Si, where i < j. The result is a lattice, where each cluster can be defined in terms of more than one parent subgraph. For example, Figure 2 shows such a clustering done on a DNA molecule. Note that the ordering of pattern discovery can affect the parents of a pattern. For instance, the lower-left pattern in Figure 2 could have used the C-C-O pattern, rather than the C-C pattern, but in fact, the lower-left pattern is discovered before the C-C-O pattern. For more information on graph-based clustering, see Jonyer et al. (2001).
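A small Python/networkx sketch of the compress-and-relabel step that produces this lattice is shown below. How the substructure and its vertex-disjoint instances are discovered is left out (any discovery routine, such as the beam search sketched earlier, could supply them), and the returned `parents` set is an assumed, simplified stand-in for the lattice's parent links.

```python
import networkx as nx

def compress(graph, instance_node_sets, new_label):
    """Collapse each vertex-disjoint instance into one vertex labeled `new_label`.

    Assumes a networkx DiGraph whose nodes carry a 'label' attribute.  Returns the
    compressed graph and the set of labels seen inside the instances; labels of
    previously introduced substructure vertices (S1, S2, ...) among them are the
    parents of the new cluster in the lattice.
    """
    g = graph.copy()
    parents = set()
    for k, nodes in enumerate(instance_node_sets):
        rep = (new_label, k)
        g.add_node(rep, label=new_label)
        for n in nodes:
            parents.add(g.nodes[n].get("label"))
            for _, tgt, data in list(g.out_edges(n, data=True)):
                if tgt not in nodes:
                    g.add_edge(rep, tgt, **data)
            for src, _, data in list(g.in_edges(n, data=True)):
                if src not in nodes:
                    g.add_edge(src, rep, **data)
        # edges internal to the instance are dropped: the compression is lossy
        g.remove_nodes_from(nodes)
    return g, parents
```

Iterating discovery and `compress` until no substructure improves the description length yields the hierarchy: each iteration's label becomes a node of the lattice, with its recorded parents as incoming links.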
Graph-Based Supervised Learning

Extending a graph-based data mining approach to perform supervised learning involves the need to handle negative examples (focusing on the two-class scenario). In the case of a graph, the negative information can come in three forms. First, the data may be in the form of numerous smaller graphs, or graph transactions, each labeled either positive or negative. Second, data may be composed of two large graphs: one positive and one negative. Third, the data may be one large graph in which the positive and negative labeling occurs throughout. We will talk about the third scenario in the section on future directions. The first scenario is closest to the standard supervised learning problem in that we have a set of clearly defined examples. Let G+ represent the set of positive graphs, and G- represent the set of negative graphs. Then, one approach to supervised learning is to find a subgraph that appears often in the positive graphs, but not in the negative graphs. This amounts to replacing the information-theoretic measure with simply an error-based measure. This approach will lead the search toward a small subgraph that discriminates well. However, such a subgraph does not necessarily compress well, nor represent a characteristic description of the target concept.

Figure 2. Graph-based hierarchical, conceptual clustering of a DNA molecule
We can bias the search toward a more characteristic description by using the information-theoretic measure to look for a subgraph that compresses the positive examples, but not the negative examples. If I(G) represents the description length (in bits) of the graph G, and I(G|S) represents the description length of graph G compressed by subgraph S, then we can look for an S that minimizes I(G +|S) + I(S) + I(G-) – I(G-|S), where the last two terms represent the portion of the negative graph incorrectly compressed by the subgraph. This approach will lead the search toward a larger subgraph that characterizes the positive examples, but not the negative examples. Finally, this process can be iterated in a set-covering approach to learn a disjunctive hypothesis. If using the error measure, then any positive example containing the learned subgraph would be removed from subsequent iterations. If using the information-theoretic measure, then instances of the learned subgraph in both the positive and negative examples (even multiple instances per example) are compressed to a single vertex. Note that the compression is a lossy one, that is we do not keep enough information in the compressed graph to know how the instance was connected to the rest of the graph. This approach is consistent with our goal of learning general patterns, rather than mere compression. For more information on graph-based supervised learning, see Gonzalez et al. (2002).
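The supervised evaluation measure can be sketched in a few lines of Python. The description lengths here are plain numbers supplied by the caller (in practice they would come from an MDL encoder), the compression model is the same crude "collapse each instance to one vertex" proxy used earlier, and the toy numbers at the end are assumptions for illustration only.

```python
def compressed_dl(graph_dl, pattern_dl, num_instances):
    """Description length of a graph after collapsing each pattern instance to one vertex."""
    return max(graph_dl - num_instances * (pattern_dl - 1), 0)

def supervised_value(pos_dl, neg_dl, pattern_dl, pos_instances, neg_instances):
    """Proxy for I(G+|S) + I(S) + I(G-) - I(G-|S).

    The value drops when S compresses the positive graphs and rises when it also
    compresses the negative ones, steering the search toward characteristic,
    class-specific substructures.  In a set-covering loop, instances of the learned
    subgraph are compressed away before the next disjunct is learned.
    """
    i_pos_given_s = compressed_dl(pos_dl, pattern_dl, pos_instances)
    i_neg_given_s = compressed_dl(neg_dl, pattern_dl, neg_instances)
    return i_pos_given_s + pattern_dl + (neg_dl - i_neg_given_s)

# assumed toy numbers: a 5-unit pattern occurring 6 times in G+ but once in G-
print(supervised_value(pos_dl=100, neg_dl=100, pattern_dl=5,
                       pos_instances=6, neg_instances=1))   # 76 + 5 + 4 = 85
```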
Graph Grammar Induction

As mentioned earlier, two of the advantages of the logic-based approach to relational learning are the ability to learn recursive hypotheses and constraints among variables. However, there has been much work in the area of graph grammars, which overcome this limitation. Graph grammars are similar to string grammars except that terminals can be arbitrary graphs rather than symbols from an alphabet. While much of the work on graph grammars involves the analysis of various classes of graph grammars, recent research has begun to develop techniques for learning graph grammars (Doshi et al., 2002; Jonyer et al., 2002). Figure 3b shows an example of a recursive graph grammar production rule learned from the graph in Figure 3a. A GDM approach can be extended to consider graph grammar productions by analyzing the instances of a subgraph to see how they are related to each other. If two or more instances are connected to each other by one or more edges, then a recursive production rule generating an infinite sequence of such connected subgraphs can be constructed. A slight modification to the information-theoretic measure, taking into account the extra information needed to describe the recursive component of the production, is all that is needed to allow such a hypothesis to compete alongside simple subgraphs (i.e., terminal productions) for maximizing compression. These graph grammar productions can include non-terminals on the right-hand side. These productions can be disjunctive, as in Figure 3c, which represents the final production learned from Figure 3a using this approach. The disjunction rule is learned by looking for similar, but not identical, extensions to the instances of a subgraph. A new rule can be constructed that captures the disjunctive nature of this extension, and included in the pool of production rules competing based on their ability to compress the input graph. With a proper encoding of this disjunction information, the MDL criterion will trade off the complexity of the rule with the amount of compression it affords in the input graph. An alternative to defining these disjunction non-terminals is to instead construct a variable whose range consists of the different disjunctive values of the production. In this way we can introduce constraints among variables contained in a subgraph by adding a constraint edge to the subgraph. For example, if the four instances of the triangle structure in Figure 3a each had another edge to a c, d, f and f vertex respectively, then we could propose a new subgraph, where these two vertices are represented by variables, and an equality constraint is introduced between them. If the range of the variable is numeric, then we can also consider inequality constraints between variables and other vertices or variables in the subgraph pattern. Jonyer (2003) has developed a graph grammar learning approach with the above capabilities. The approach has shown promise both in handling noise and learning recursive hypotheses in many different domains, including learning the building blocks of proteins and communication chains in organized crime.

Figure 3. Graph grammar learning example with (a) the input graph, (b) the first grammar rule learned, and (c) the second and third grammar rules learned
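The test that triggers a recursive production can be sketched in a few lines of Python. The data layout (instances as vertex sets, edges as pairs) and the example values are assumptions; constructing the actual production and adjusting the MDL measure are not shown.

```python
def recursive_rule_candidate(instances, edges):
    """Return the edges that link different instances of a subgraph to each other.

    A non-empty result is the cue described above for proposing a recursive
    production of the form S -> subgraph [connected to another S].
    """
    owner = {v: i for i, inst in enumerate(instances) for v in inst}
    return [(u, v) for u, v in edges
            if u in owner and v in owner and owner[u] != owner[v]]

# toy input: two instances of a three-vertex pattern joined by the edge (3, 4)
instances = [{1, 2, 3}, {4, 5, 6}]
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (3, 4)]
print(recursive_rule_candidate(instances, edges))   # -> [(3, 4)]
```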
FUTURE TRENDS The field of graph-based relational learning is still young, but the need for practical algorithms is growing fast. Therefore, we need to address several challenging scalability issues, including incremental learning in dynamic graphs. Another issue regarding practical applications involves the blurring of positive and negative examples in a supervised learning task, that is, the graph has many positive and negative parts, not easily separated, and with varying degrees of class membership.
Partitioning and Incremental Mining for Scalability Scaling GDM approaches to very large graphs, graphs too big to fit in main memory, is an ever-growing challenge. Two approaches to address this challenge are being investigated. One approach involves partitioning the graph into smaller graphs that can be processed in a distributed fashion (Cook et al., 2001). A second approach involves implementing GDM within a relational database management system, taking advantage of user-defined functions and the optimized storage capabilities of the RDBMS. A newer issue regarding scalability involves dynamic graphs. With the advent of real-time streaming data, many data mining systems must mine incrementally, rather than off-line from scratch. Many of the domains we wish to mine in graph form are dynamic domains. We do not have the time to periodically rebuild graphs of all the data to date and run a GDM system from scratch. We must develop methods to incrementally update the graph and the patterns currently prevalent in the graph. One approach is similar to the graph partitioning approach for distributed processing. New data can be stored in an increasing number of partitions. Information within partitions can be exchanged, or a repartitioning can be performed if the information loss exceeds some threshold. GDM can then search the new partitions, suggesting new subgraph patterns when they evaluate highly in both new and old partitions.
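As a rough illustration of the incremental, partition-based idea, the sketch below counts simple labeled-edge patterns per partition and only ever updates the newest partition; the capacity bound and the reduction of patterns to single edges are illustrative simplifications of the approach described above.

```python
from collections import Counter

PARTITION_CAPACITY = 10_000  # hypothetical bound on edges per partition

class IncrementalGDM:
    """Keep per-partition counts of simple edge patterns; new data only touches
    the newest partition, so old partitions never need to be re-mined."""

    def __init__(self):
        self.partitions = [Counter()]

    def add_edge(self, src_label, edge_label, dst_label):
        if sum(self.partitions[-1].values()) >= PARTITION_CAPACITY:
            self.partitions.append(Counter())  # start a new partition
        self.partitions[-1][(src_label, edge_label, dst_label)] += 1

    def prevalent_patterns(self, min_partitions=2, min_count=5):
        """Patterns that evaluate highly in several partitions, new and old."""
        votes = Counter()
        for part in self.partitions:
            for pattern, count in part.items():
                if count >= min_count:
                    votes[pattern] += 1
        return [p for p, v in votes.items() if v >= min_partitions]

gdm = IncrementalGDM()
gdm.add_edge("person", "calls", "person")
print(gdm.prevalent_patterns(min_partitions=1, min_count=1))
```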
Learning with Supervised Graphs
In a highly relational domain, the positive and negative examples of a concept are not easily separated. Such a graph is called a supervised graph, in that the graph as a whole contains class information, but perhaps not individual components of the graph. This scenario presents a challenge to any data mining system, but especially to a GDM system, where clearly classified data may be tightly related to less clearly classified data. Two approaches to this task are being investigated. The first involves modifying the MDL encoding to take into account the amount of information necessary to describe the class membership of compressed portions of the graph. The second approach involves treating the class membership of a vertex or edge as a cost and weighting the information-theoretic value of the subgraph patterns by the costs of the instances of the pattern. The ability to learn from supervised graphs will allow the user more flexibility in indicating class membership where known, and to varying degrees, without having to clearly separate the graph into disjoint examples.
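The second, cost-based approach can be illustrated as follows; the specific weighting (multiplying a pattern's compression value by the average class-membership cost of its instances) is a hypothetical formulation for illustration only, not the exact scheme under investigation.

```python
def weighted_value(compression_value, instance_costs):
    """Weight a subgraph's information-theoretic value by the class-membership
    costs of its instances (cost in [0, 1]: 1 = clearly positive, 0 = clearly negative)."""
    if not instance_costs:
        return 0.0
    avg_cost = sum(instance_costs) / len(instance_costs)
    return compression_value * avg_cost

# A pattern whose instances are mostly positive outranks one with mixed instances,
# even if both compress the graph equally well.
print(weighted_value(12.5, [1.0, 0.9, 0.8]))   # strongly positive instances
print(weighted_value(12.5, [0.9, 0.2, 0.1]))   # mixed instances
```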
CONCLUSION Graph-based data mining (GDM) is a fast-growing field due to the increasing interest in mining the relational aspects of data. We have described several approaches to GDM, including logic-based approaches in ILP systems, graph-based frequent subgraph mining approaches in AGM, FSG, and gSpan, and a graph-based relational learning approach in Subdue. We described the Subdue approach in detail, along with recent advances in supervised learning, clustering, and graph-grammar induction. However, much work remains to be done. Because many of the graph-theoretic operations inherent in GDM are NP-complete or at least not known to be in P, scalability is a constant challenge. With the increased need for mining streaming data, the development of new methods for incremental learning from dynamic graphs is important. Also, the blurring of example boundaries in a supervised learning scenario gives rise to graphs in which the class membership of even nearby vertices and edges can vary considerably. We need to develop better methods for learning in these supervised graphs. Finally, we discussed several domains throughout this paper that benefit from a graphical representation and the use of GDM to extract novel and useful patterns. As more and more domains realize the increased predictive power of patterns in relationships between entities, rather than just attributes of entities, graph-based data mining will become foundational to our ability to better understand the ever-increasing amount of data in our world.
REFERENCES Cook, D., & Holder, L. (2000). Graph-based data mining. IEEE Intelligent Systems, 15(2), 32-41. Cook, D., Holder, L., Galal, G., & Maglothin, R. (2001). Approaches to parallel graph-based knowledge discovery. Journal of Parallel and Distributed Computing, 61(3), 427-446. Doshi, S., Huang, F., & Oates, T. (2002). Inferring the structure of graph grammars from data. In Proceedings of the International Conference on Knowledge-based Computer Systems. Džeroski, S. (2003). Multi-relational data mining: An introduction. SIGKDD Explorations, 5(1), 1-16. Džeroski, S., & Lavrač, N. (2001). Relational data mining. Berlin: Springer Verlag. Gonzalez, J., Holder, L., & Cook, D. (2002). Graph-based relational concept learning. In Proceedings of the Nineteenth International Conference on Machine Learning. Holder, L., & Cook, D. (2003). Graph-based relational learning: Current and future directions. SIGKDD Explorations, 5(1), 90-93. Inokuchi, A., Washio, T., & Motoda, H. (2003). Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50, 321-354. Jonyer, I. (2003). Context-free graph grammar induction using the minimum description length principle. Ph.D. thesis. Department of Computer Science and Engineering, University of Texas at Arlington. Jonyer, I., Cook, D., & Holder, L. (2001). Graph-based hierarchical conceptual clustering. Journal of Machine Learning Research, 2, 19-43. Jonyer, I., Holder, L., & Cook, D. (2002). Concept formation using graph grammars. In Proceedings of the KDD Workshop on Multi-Relational Data Mining. Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In Proceedings of the First IEEE Conference on Data Mining. Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations, 5(1), 59-68. Yan, X., & Han, J. (2002). Graph-based substructure pattern mining. In Proceedings of the International Conference on Data Mining. Yan, X., & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. In Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining.
KEY TERMS
Conceptual Graph: Graph representation described by a precise semantics based on first-order logic.
Dynamic Graph: Graph representing a constantly changing stream of data.
Frequent Subgraph Mining: Finding all subgraphs within a set of graph transactions whose frequency satisfies a user-specified level of minimum support.
Graph-Based Data Mining: Finding novel, useful, and understandable graph-theoretic patterns in a graph representation of data.
Graph Grammar: Grammar describing the construction of a set of graphs, where terminals and non-terminals represent vertices, edges, or entire subgraphs.
Inductive Logic Programming: Techniques for learning a first-order logic theory to describe a set of relational data.
Minimum Description Length (MDL) Principle: Principle stating that the best theory describing a set of data is the one minimizing the description length of the theory plus the description length of the data described (or compressed) by the theory.
Multi-Relational Data Mining: Mining patterns that involve multiple tables in a relational database.
Supervised Graph: Graph in which each vertex and edge can belong to multiple categories to varying degrees. Such a graph complicates the ability to clearly define transactions on which to perform data mining.
Group Pattern Discovery Systems for Multiple Data Sources Shichao Zhang University of Technology Sydney, Australia Chengqi Zhang University of Technology Sydney, Australia
INTRODUCTION Multiple data source mining is the process of identifying potentially useful patterns from different data sources, or datasets (Zhang et al., 2003). Group pattern discovery systems for mining different data sources are based on a local pattern-analysis strategy and mainly include logical systems for information enhancement, a pattern discovery system, and a post-pattern-analysis system.
BACKGROUND Many large organizations have multiple data sources, such as different branches of a multinational company. Also, as the Web has emerged as a large distributed data repository, it is easy nowadays to access a multitude of data sources. Therefore, individuals and organizations have taken into account the Internet’s low-cost information and knowledge when making decisions (Lesser et al., 2000). Although the data collected from the Internet (called external data) brings us opportunities in improving the quality of decisions, it generates a significant challenge: efficiently identifying quality knowledge from different data sources (Kargupta et al., 2000; Liu et al., 2001; Prodromidis et al., 1998; Zhong et al., 1999). Potentially, almost every company must confront the multiple data source (MDS) problem (Hurson et al., 1994). This problem is difficult to solve, due to the facts that multiple data source mining is a procedure of searching for useful patterns in multidimensional spaces; and putting all data together from different sources might amass a huge database for centralized processing and cause problems, such as data privacy breaches, data inconsistency, and data conflict. Recently, the authors have developed local pattern analysis, a new multi-database mining strategy for discovering some kinds of potentially useful patterns that cannot be mined in traditional multi-database mining techniques (Zhang et al., 2003). Local pattern analysis delivers high-performance pattern discovery from MDSs.
This effort provides a good insight into knowledge discovery from multiple data sources. However, there are two fundamental problems that prevent local pattern analysis from widespread application. First, when the data collected from the Internet is of poor quality, the poor-quality data can disguise useful patterns. For example, a stock investor might need to collect information from outside data sources when making an investment decision, such as news. If fraudulent information collected is applied directly to investment decisions, the investor might lose money. In particular, much work has been built on consistent data. In the input to the distributed data mining algorithms, it is assumed that the data sources do not conflict with each other. However, reality is much more inconsistent than the ideal; the inconsistency must be resolved before any mining algorithms can be applied. These generate a crucial requirement: ontology-based data enhancement. The second fundamental challenge is the efficiency of mining algorithms for identifying potentially useful patterns in MDSs. Over the years, there has been a great deal of work in multiple source data mining (Aounallah et al., 2004; Krishnaswamy et al., 2000; Li et al., 2001; Yin et al., 2004). However, traditional multiple data source mining still utilizes mono-database mining techniques. That is, all the data from relevant data sources is pooled to amass a huge dataset for discovery. These algorithms cannot discover some useful patterns; for example, the pattern that 70% of the branches within a company agrees that a married customer usually has at least two cars if his or her age is between 45 and 65. On the other hand, using our local pattern analysis, there can be huge amounts of the local patterns. These generate a strong requirement: the development of efficient algorithms for identifying useful patterns in MDSs. This article introduces a group of pattern discovery systems for dealing with the MDS problem, mainly (1) a logical system for enhancing data quality, a logical system for resolving conflicts, a data cleaning system, and a database clustering system, for solving the first problem;
and (2) a pattern discovery system and a post-mining system for solving the second problem.
MAIN THRUST Group pattern discovery systems are able to (i) effectively enhance data quality for mining MDSs and (ii) automatically identify potentially useful patterns from the multidimensional data in MDSs.
Data Enhancement Data enhancement includes the following:
1. The data cleaning system mainly includes these functions: recovering incomplete data (filling in missing values or resolving ambiguity); purifying data (consistency of data names, consistency of data formats, correcting errors, or removing outliers); and resolving data conflicts (using domain knowledge or expert decisions to settle discrepancies).
2. The logical system for enhancing data quality focuses on the following epistemic properties: veridicality, introspection, and consistency.
3. The logical system for resolving conflicts obeys the weighted majority principle in case of conflicts.
4. The fuzzy database clustering system generates good database clusters.
Identifying Interesting Patterns A local pattern may be a frequent itemset, an association rule, a causal rule, a dependency, or some other expression. Local pattern analysis is an in-place strategy specifically designed for mining MDSs, providing a feasible way to generate globally interesting models from data in multidimensional spaces. Based on our local pattern analysis, three key systems can be developed for automatically searching for potentially useful patterns from local patterns: (a) identifying high-vote patterns; (b) finding exceptional patterns; and (c) synthesizing patterns by weighting majority.
(a) Identifying High-Vote Patterns: Within an MDS environment, each data source, large or small, has an equal power to vote for its patterns in the decision-making of a company. Some patterns receive votes from most of the data sources; these patterns are referred to as high-vote patterns. High-vote patterns represent the commonness of the
branches. Therefore, these patterns may be far more important in terms of decision-making within the company. The key problem is how to efficiently search for high-vote patterns of interest in multidimensional spaces; it can be attacked by mining the distribution of all patterns.
(b) Finding Exceptional Patterns: Like high-vote patterns, exceptional patterns are also regarded as novel patterns in multiple data sources; they reflect the individuality of data sources. While high-vote patterns are useful when a company is reaching common decisions, headquarters is also interested in viewing exceptional patterns, which are used when special decisions are made at only a few of the branches, perhaps for predicting the sales of a new product. Exceptional patterns can capture the individuality of branches. Therefore, although an exceptional pattern receives votes from only a few branches, it is extremely valuable information in MDSs. The key problem is how to construct efficient methods for measuring the interestingness of exceptional patterns.
(c) Searching for Synthesizing Patterns by Weighting Majority: Although each data source can have an equal power to vote for patterns for making decisions, data sources may differ in importance to a company. For example, if the sale of branch A is four times that of branch B, branch A is certainly more important than branch B in the company (here, each branch in a company is viewed as a data source in an MDS environment), and the decisions of the company are reasonably partial to high-sale branches. Also, local patterns may have different supports. For example, if the supports of patterns X1 and X2 are 0.9 and 0.4 in a branch, pattern X1 is far more believable than pattern X2. These two examples illustrate the importance of branches and of individual patterns for decision making within a company. Therefore, synthesizing patterns is very useful. (A small sketch of all three kinds of analysis follows this list.)
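The following sketch illustrates the three kinds of analysis over a set of local patterns, assuming each data source simply reports its local patterns with their supports; the thresholds and the weighted-average synthesis used here are illustrative choices, not the specific measures proposed in the cited work.

```python
def analyze_local_patterns(local, weights, high=0.8, low=0.4):
    """local: {source: {pattern: support}}; weights: {source: relative importance}."""
    n = len(local)
    votes = {}
    for patterns in local.values():
        for p in patterns:
            votes[p] = votes.get(p, 0) + 1

    # High-vote patterns: voted for by most of the data sources.
    high_vote = [p for p, v in votes.items() if v / n >= high]

    # Exceptional patterns: voted for by only a few sources, but strongly supported there.
    exceptional = [p for p, v in votes.items()
                   if v / n <= low
                   and max(pats.get(p, 0.0) for pats in local.values()) >= 0.5]

    # Synthesized patterns: a global support computed as a weighted average,
    # giving more say to more important (e.g., higher-sale) branches.
    total_w = sum(weights.values())
    synthesized = {p: sum(w * local[src].get(p, 0.0) for src, w in weights.items()) / total_w
                   for p in votes}
    return high_vote, exceptional, synthesized

local = {"branch_A": {"married->2cars": 0.9, "promo->sales": 0.4},
         "branch_B": {"married->2cars": 0.7},
         "branch_C": {"married->2cars": 0.8, "new_product": 0.6}}
weights = {"branch_A": 4.0, "branch_B": 1.0, "branch_C": 1.0}  # branch A sells four times more
print(analyze_local_patterns(local, weights))
```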
Post Pattern Analysis In an MDS environment, a pattern (e.g., a high-vote association rule) is attached to certain factors, including name, vote, vsupp, and vconf. For a very large set of data sources, a high-vote association rule may be supported by many data sources, so the sets of its supports and confidences in these data sources are too large to be browsed by users, and it is therefore rather difficult for users to apply the rule to decision making. Post-pattern analysis is thus very important in MDS mining. The key problem is how to construct an effective partition for classifying the mined patterns.
FUTURE TRENDS The data that are distributed in multiple data sources present many complications, and the associated mining tasks and mining approaches are diverse. Therefore, many challenging research issues have emerged. Important tasks that present themselves to multiple data source mining researchers and to developers of multiple data source mining systems and applications are as follows:
• Establishment of a powerful representation for patterns in distributed data;
• Ontology-based methodologies and technologies for data preparation/enhancement;
• Development of efficient and effective mining strategies and systems; and
• Construction of interactive and integrated multiple data source mining environments.
CONCLUSION As stated previously, many companies/organizations have utilized the Internet’s low-cost information when making decisions. Potentially, almost every company must confront the MDS problem. Although the data from the Internet assists in improving the quality of decisions, the data is of poor quality. This leaves a large gap between the available data and the machinery available to process the data. It generates a crucial need: efficiently identifying quality knowledge from MDSs. To support this need, group pattern discovery systems have been introduced in the context of real MDS mining systems.
REFERENCES Aounallah, M. et al. (2004). Distributed data mining vs. sampling techniques: A comparison. Proceedings of the Canadian Conference on AI 2004. Grossman, R., Bailey, S., Ramu, A., Malhi, B., & Turinsky, A. (2000). The preliminary design of papyrus: A system for high performance, distributed data mining over clusters. Proceedings of Advances in Distributed and Parallel Knowledge Discovery. Hurson, A., Bright, M., & Pakzad, S. (1994). Multidatabase systems: An advanced solution for global information sharing. IEEE Computer Society Press. Kargupta, H., Huang, W., Sivakumar, K., Park, B., & Wang, S. (2000). Collective principal component analysis from distributed, heterogeneous data. Principles of Data Mining and Knowledge Discovery (pp. 452-457).
Krishnaswamy, S. et al. (2000). An architecture to support distributed data mining services in e-commerce environments. Proceedings of the WECWIS 2000. Lesser, V. et al. (2000). BIG: An agent for resourcebounded information gathering and decision making. Artificial Intelligence, 118(1-2), 197-244. Li, B. et al. (2001). An architecture for multidatabase systems based on CORBA and XML. Proceedings of the 12th International Workshop on Database and Expert Systems Applications. Liu, H., Lu, H., & Yao, J. (2001). Towards multi-database mining: Identifying relevant databases. IEEE Transactions on Knowledge and Data Engineering, 13(4), 541-553. Prodromidis, A., & Stolfo, S. (1998). Pruning meta-classifiers in a distributed data mining system. Proceedings of the First National Conference on New Information Technologies. Wu, X., & Zhang, S. (2003). Synthesizing high-frequency patterns from different data sources. IEEE Transactions on Knowledge and Data Engineering, 15(2), 353-367. Yin, X., Han, J., Yang, J., & Yu, P. (2004). CrossMine: Efficient classification across multiple database relations. Proceedings of the ICDE 2004. Zhang, S., Wu, X., & Zhang, C. (2003). Multi-database mining. IEEE Computational Intelligence Bulletin, 2(1), 5-13. Zhang, S., Zhang, C., & Wu, X. (2004). Knowledge discovery in multiple databases. Springer. Zhong, N., Yao, Y., & Ohsuga, S. (1999). Peculiarity oriented multi-database mining. Proceedings of PKDD.
KEY TERMS
Database Clustering: The process of grouping similar databases together.
Exceptional Pattern: A pattern that is strongly supported (voted for) by only a few of a group of data sources.
High-Vote Pattern: A pattern that is supported (voted for) by most of a group of data sources.
Knowledge Discovery in Databases (KDD): Also referred to as data mining; the extraction of hidden predictive information from large databases.
Local Pattern Analysis: An in-place strategy specifically designed for generating globally interesting models from local patterns in multi-dimensional spaces.
Multi-Database Mining: The mining of potentially useful patterns in multi-databases.
Multiple Data Source Mining: The process of identifying potentially useful patterns from different data sources.
Heterogeneous Gene Data for Classifying Tumors Benny Yiu-ming Fung The Hong Kong Polytechnic University, Hong Kong Vincent To-yee Ng The Hong Kong Polytechnic University, Hong Kong
INTRODUCTION When classifying tumors using gene expression data, mining tasks commonly make use of only a single data set. However, classification models based on patterns extracted from a single data set are often not indicative of an entire population and heterogeneous samples subsequently applied to these models may not fit, leading to performance degradation. In short, it is not possible to guarantee that mining results based on a single gene expression data set will be reliable or robust (Miller et al., 2002). This problem can be addressed using classification algorithms capable of handling multiple, heterogeneous gene expression data sets. Apart from improving mining performance, the use of such algorithms would make mining results less sensitive to the variations of different microarray platforms and to experimental conditions embedded in heterogeneous gene expression data sets.
BACKGROUND Recent research into the mining of gene expression data has operated upon multiple, heterogeneous gene expression data sets. This research has taken two broad approaches, addressing issues related either to the theoretical flexibility that is required to integrate gene expression data sets with various microarray platforms and technologies (Lee et al., 2003), or – the focus of this chapter - issues related to tumor classification using an integration of multiple, heterogeneous gene expression data sets (Bloom et al., 2004; Ng, Tan, & Sundarajan, 2003). This type of tumor classification is made more difficult by three types of variation, variation in the available microarray technologies, experimental and biological variations, and variation in the types of cancers themselves. The first type of variation is caused by different probe array notations of available microarray technologies. The two most common microarray technologies
are photolithographically synthesized oligonucleotide probe arrays and spotted cDNA probe arrays. These have both been reviewed by Sebastiani, Gussoni, Kohane, and Ramoni (2003). They differ in their criteria for measuring gene expression levels. Oligonucleotide probe arrays measure mRNA abundance indirectly, while spotted cDNA probe arrays measure cDNA relative to hybridized reference mRNA samples. With the two common microarray technologies, there exist different probe array notations (Lee et al., 2003). For example, human probe array notations include GeneChip (Affymetrix, Santa Clara, CA) U133, U95, and U35 accession number sets, BMR chips (Stanford University), UniGene clusters, cDNA clone ID and GenBank identifiers. Although the notations used in different technologies sometimes referred to the same set of genes, this does not indicate a simple one-to-one mapping (Ramaswamy, Ross, Lander, & Golub, 2003). Users of these notations should be aware of the potential for duplicated accession numbers in mapped results. Another type of variation, the statistical variation among different gene expression data sets is unavoidable because experimental and biological variations are embedded in data sets (Miller et al., 2002). First of all, individual gene expression data sets are conducted by different laboratories with different experimental objectives and conditions even when using the same microarray technology. Integration of them is a painful task. Secondly, the expression levels of genes in experiments are normally measured by the ratio of the expression levels of the genes in the varying conditions of interest to the expression levels of the genes in some reference conditions. These reference conditions are varied from experiment to experiment. This is not a problem if sample sizes are large enough. Zien, Fluck, Zimmer, and Lengauer (2003) proposed that the use of larger sample sizes (e.g. 20 samples) can prevent mining results of gene expression data from suffering technical and biological variations, and produce more reliable results. Most gene expression data sets, however, contain fewer than 20 samples per class. A more flexible
solution would be to meta-analyze multiple, heterogeneous gene expression data sets, forming meta-decisions from a number of individual decisions. The last difficulty is to find common features in various cancer types. These features can be referred to as sets of significant genes that are most likely expressed in most cancer types, although they may be expressed differently in different cancer types. The study of human cancer has recently discovered that the development of antigen-specific cancer vaccines leads to the discovery of immunogenic genes. This group of tumor antigens has been introduced under the term "cancer-testis (CT) antigen" (Coulie et al., 2002). Discovered CT antigens have recently been grouped into distinct subsets and named "cancer/testis (CT) immunogenic gene families". Some works show that most CT immunogenic gene families are expressed in more than one cancer type, but with various expression frequencies. Researchers have reviewed and summarized the discovery, to date, of 44 CT immunogenic gene families consisting of 89 individual genes in total (Scanlan, Simpson, & Old, 2004).
MAIN THRUST It is possible to make classification algorithms more reliable and robust by combining multiple, heterogeneous gene expression data sets. A simple combination method is to merge or append one data set to another. Unfortunately, this method is inflexible because data sets have various scales and ranges of variations, which are required to be the same in order to have consistent scales for comparisons after the combination. In this chapter, we discuss two approaches to combining data sets that exhibit variation in the available microarray technologies. The first, and simplest, approach is to normalize the expression levels of genes in the data sets to mean zero and standard deviation one (i.e., the standard normal distribution, N(0, 1)) according to the means and standard deviations across samples in individual data sets. While this approach is simple to apply, it assumes that all genes have the same or similar expression rates. However, this assumption is incorrect. The fact is that only a small subset of genes reflects the existence of tumors, and the remaining genes in a tumor are not epidemiologically significant. It should also be noted that the reflected genes do not all express at the same rate. Therefore, when all genes in data sets are normalized to have N(0, 1), the variations of the reflected genes may be underestimated and the variations of genes that are stable and irrelevant may be overestimated. This situation worsens as the number of genes in data sets increases.
The second and a better approach is to select a subset of reference genes, also known as significant genes, and to use the expression levels of these genes to estimate scaling factors which are used to rescale the expression levels of genes in other data sets with the same set of reference genes as in the original subset. This approach has two advantages. The first is that it allows the effects of outliers caused by non-significant genes to be eliminated while using only a subset of significant genes. In a gene expression data set, only a proportion of genes is tumor-specific. Because gene expression data contains high-dimensional data, by focusing on such tumor-specific genes in classification would reduce computational costs. The second advantage is that it improves the quality of the normalization or re-scaling since it avoids the underestimation of expression level of significant genes, a problem which may arise because of the presence of large amounts of non-significant genes. We also note that the selection algorithms are the focus of much current research. Some works that utilize existing features selection algorithms include Dudoit, Yang, Callow, and Speed (2002), Bloom et al. (2004), and Lee et al. (2003). New or enhanced algorithms have been proposed by Park et al. (2003), Ng, Tan, and Sundarajan. (2003), Choi, Yu, Kim, and Yoo (2003), Storey & Tibshirani (2003), Chilingaryan, Gevorgyan, Vardanyan, Jones, & Szabo (2002), and Golub et al. (1999). In recent years, detection of significant genes was mainly done using fold-change detection. This detection method is unreliable because it does not take into account statistical variability. Currently, however, most algorithms that are used to select significant genes apply statistical methods. In the rest of the chapter, we first present some recent works on the identification of significant genes using statistical methods. We then briefly describe our proposed measure, Impact Factors (IFs), which can be used to carry out tumor classification using heterogeneous gene expression data (Fung & Ng, 2003).
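The two combination approaches just described can be sketched with NumPy as follows; using the median expression of the chosen reference genes as the scaling statistic is a simplifying assumption made for illustration, not the exact estimation procedure of the cited work.

```python
import numpy as np

def zscore_per_dataset(X):
    """Approach 1: standardize every gene to N(0, 1) within one data set.
    X: samples x genes expression matrix."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return (X - mu) / np.where(sd == 0, 1.0, sd)

def rescale_by_reference_genes(X_target, X_source, ref_idx):
    """Approach 2: rescale a target data set so that a subset of reference
    (significant) genes matches the source data set; ref_idx indexes those genes."""
    source_level = np.median(X_source[:, ref_idx])
    target_level = np.median(X_target[:, ref_idx])
    scale = source_level / target_level if target_level != 0 else 1.0
    return X_target * scale

rng = np.random.default_rng(0)
A = rng.normal(5.0, 1.0, size=(20, 100))   # 20 samples, 100 genes
B = rng.normal(9.0, 2.0, size=(15, 100))   # a heterogeneous data set on another scale
B_rescaled = rescale_by_reference_genes(B, A, ref_idx=np.arange(10))
```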
Statistical Methods The most common statistical method for identifying significant genes is the two-sample t-test (Cui & Churchill, 2003). The advantage of this test is that, because it requires only one gene to be studied for each t-test, it is insensitive to heterogeneity in variance across genes. However, while reliable t-values require large sample sizes, gene expression data sets normally consist of small sample sizes. This problem of small sample sizes can be overcome using global t-tests, but these assume that the variance is homogeneous between different genes (Tusher, Tibshirani, & Chu, 2001). Tusher, Tibshirani, and Chu (2001) proposed a
modified t-test that they called "significance analysis of microarrays (SAM)". SAM identifies significant genes in microarray experiments by measuring fluctuations of the expression levels of genes across a number of microarray experiments. These fluctuations are estimated using permutation tests and, to avoid inflated t-values for genes with low variance, a small constant is added to the denominator of the two-sample t-statistic. Tibshirani, Hastie, Narasimhan, and Chu (2002) proposed a modified nearest-centroid classification. For each gene, it uses a t-statistic to calculate a centroid distance, defined as the distance from the class centroids to the overall centroid. The centroid distance is then used to shrink the class centroids towards the overall centroid in order to reduce overfitting. Correlation analysis is another common statistical method used to rank the significance of genes. Kuo, Jenssen, Butte, Ohno-Machado, and Kohane (2002) applied the Pearson linear and Spearman rank-order correlation coefficients to study the flexibility of cross-platform utilization of data from multiple gene expression data sets. Lee et al. (2003) also used the Pearson and Spearman correlation coefficients to study the correlation among NCI-60 cancer data sets consisting of different cancer types. They, however, proposed a measure to rank the "correlation of correlation" among various data sets. Later studies of correlation focused on multi-platform, multi-type tumor data sets. Bloom et al. (2004) used the Kruskal-Wallis H-test to identify significant genes within multi-type, multi-platform tumor data sets consisting of 21 data sets and 15 different cancer types.
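As an illustration of the statistical route, the following sketch ranks genes by a plain two-sample t-test with SciPy; it is not SAM's permutation-adjusted statistic, and no multiple-testing correction is applied.

```python
import numpy as np
from scipy import stats

def significant_genes(X, y, alpha=0.01):
    """X: samples x genes expression matrix; y: binary class labels.
    Returns indices of genes whose two-sample t-test p-value falls below alpha."""
    tumor, normal = X[y == 1], X[y == 0]
    t_vals, p_vals = stats.ttest_ind(tumor, normal, axis=0, equal_var=False)
    return np.where(p_vals < alpha)[0], t_vals, p_vals

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(30, 500))   # 30 samples, 500 genes
y = np.array([1] * 15 + [0] * 15)
X[y == 1, :5] += 2.0                       # make the first five genes differentially expressed
idx, t_vals, p_vals = significant_genes(X, y)
print(idx)
```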
Impact Factors Recently, we proposed a dissimilarity measure called Impact Factors (IFs) that measure the inter-experimental variations between individual classes in training samples and heterogeneous testing samples (Fung & Ng, 2003). The calculation of IFs takes place in two stages of selections, selection for re-scaling and selection for classification. In the first stage, we first use SAM to select a set of significant genes. From these, we calculate individual reference points corresponding to different classes in training set. These reference points are then used to calculate their own scaling factors for the corresponding classes. The factors are used to rescale the expression levels of all genes in testing samples. There are two advantages to using individual scaling factors corresponding to different classes: they ensure that the different gene expression levels of one class are not underestimated or overestimated because of unbalanced sample sizes between classes, and they allow individual testing samples to be compared with individual classes.
The second stage of selection is selection for classification. This is done by calculating the differences between the rescaled testing samples (since there are two scaling factors, there are two rescaled samples in binary-class tumor classification) and the individual classes in the training set. It should be noted that, to improve the discriminative power of the IFs, only those genes with higher differences are selected and used in classification. IFs have been integrated into classifiers to perform meta-classification of heterogeneous cancer gene expression data (Fung & Ng, 2003). For most classifiers using either similarity or dissimilarity measures for making classification decisions, IFs can be integrated into the classifier by multiplying the IFs directly with the original measures. The actual multiplication to be carried out depends on whether the classifier makes its decisions with dissimilarity or similarity measures. If it is being applied to dissimilarity measures, the IF of a class is multiplied by the measure having the same class as the corresponding IF. In contrast, if it is being applied to similarity measures, the IF of a class is multiplied by the measure having another class as the corresponding IF.
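As an illustration of the integration step, the sketch below multiplies a per-class Euclidean dissimilarity by the impact factor of the same class in a nearest-centroid style decision; the impact factor values are hypothetical inputs here, since their two-stage computation is only summarized above and not reproduced.

```python
import numpy as np

def classify_with_impact_factors(sample, centroids, impact_factors):
    """centroids: {label: gene-expression centroid}; impact_factors: {label: IF}.
    For a dissimilarity measure, multiply by the IF of the *same* class."""
    scores = {}
    for label, centroid in centroids.items():
        dissimilarity = np.linalg.norm(sample - centroid)  # Euclidean distance
        scores[label] = dissimilarity * impact_factors[label]
    return min(scores, key=scores.get), scores

centroids = {"tumor": np.array([2.0, 1.5, 0.2]), "normal": np.array([0.1, 0.3, 1.8])}
impact_factors = {"tumor": 1.1, "normal": 0.9}  # hypothetical values
label, scores = classify_with_impact_factors(np.array([1.8, 1.2, 0.5]),
                                             centroids, impact_factors)
print(label, scores)
```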
FUTURE TRENDS Although DNA microarrays can be used to predict patients' responses to medical treatment as well as clinical outcomes, tumor classification using gene expression data is as yet unreliable. This unreliability has multiple causes. First of all, while some international organizations, such as the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI), have created their own gene expression data repositories, there are still no international benchmark data repositories. This makes it difficult for researchers to validate findings. Certainly, there is a need for integrated gene expression databases that can help researchers validate their findings against data produced by different laboratories around the world, so that the efficiency and effectiveness of different proposed mining algorithms can be compared objectively. Unreliability could also be said to arise from inadequate interdisciplinary communication between professionals and researchers in relevant fields. Recently, a number of promising mining algorithms have been proposed, but much work of this kind still remains to be analyzed and validated in molecular and biological terms (Sevenet & Cussenot, 2003). The developers of these algorithms would welcome such input, as it would be an invaluable assistance to them in constructing
more general frameworks for defining different mining tasks in bioinformatics. Lastly, in terms of clinical studies, findings that have been made using tumor classification based on gene expression data are still not sufficiently reliable to draw conclusions. This is because the performance of tumor classification may be influenced by many as yet unidentified factors. Such factors may include the effects of tumor heterogeneity and inflammatory cells. Unreliability also arises from the narrow use of gene expression data. Data mining algorithms mainly focus on a set of genes, which are used as diagnostic markers in clinical diagnosis. It could be more interesting to look for algorithms which use as diagnostic markers not only the expression levels of genes, but also other categories of data, such as histopathology and patients' information.
CONCLUSION DNA microarray techniques are nowadays important in cancer-related studies. These techniques operate on large, publicly available gene expression data sets, controlled by dispersed laboratories operating under various experimental conditions and using a variety of microarray technologies and platforms. A realistic response to these facts is to mine data using integrated, multiple, heterogeneous gene expression data. The robustness and reliability of mining algorithms developed to handle such data can be validated on a new set of heterogeneous gene expression data in addition to cross-validation within a single data set. Algorithms that pass such validation are less sensitive to variations induced by experimental conditions and to different microarray platforms, and thus the robustness and reliability of mining results can be guaranteed. In this chapter, we have discussed the difficulties of three types of variation in tumor classification of multiple, heterogeneous gene expression data: variation in the available microarray technologies, experimental and biological variation, and variation in the types of cancers themselves. We further point out that identification of significant genes is an important process in tumor classification using gene expression data, especially for multiple, heterogeneous data. We have reviewed some recent works on the identification of significant genes. Finally, we have presented the concept of Impact Factors (IFs), which are used to combine and classify heterogeneous gene expression data.
REFERENCES Bloom, G., Yang, I.V., Boulware, D., Kwong, K.Y., Coppola, D., Eschrich, S. et al. (2004). Multi-platform, multi-site, microarray-based human tumor classification. American Journal of Pathology, 164(1), 9-16. Chilingaryan, A., Gevorgyan, N., Vardanyan, A., Jones, D., & Szabo, A. (2002). Multivariate approach for selecting sets of differentially expressed genes. Mathematical Biosciences, 176(1), 59-69. Cui, X., & Churchill, G.A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(4), 210. Choi, J.K., Yu, U., Kim, S., & Yoo, O.J. (2003). Combining multiple microarray studies and modeling interstudy variation. Bioinformatics, 19(Suppl. 1), i84-i90. Coulie, P.G., Karanikas, V., Lurquin, C., Colau, D., Connerotte, T., Hanagiri, T. et al. (2002). Cytolytic T-cell responses of cancer patients vaccinated with a MAGE antigen. Immunological Reviews, 188(1), 33-42. Dudoit, S., Yang, Y.H., Callow, M.J., & Speed T.P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-139. Fung, B.Y.M., & Ng, V.T.Y. (2003). Classification of heterogeneous gene expression data. SIGKDD Special Issue on Microarray Data Mining, 5(2), 69-78. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., GaasenBeek, M., Mesirov, J.P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene-expression monitoring. Science, 286, 531-537. Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., & Kohane, I.S. (2002). Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics, 18(3), 405-412. Lee, J.K., Bussey, K.J., Gwadry, F.G., Reinhold, W., Riddick, G., Pelletier, S.L. et al. (2003). Comparing cDNA and oligonucleotide array data: concordance of gene expression across platforms for the NCI-60 cancer cells. Genome Biology, 4(12), R82. Miller, L.D., Long, P.M., Wong, L., Mukherjee, S., McShane, L.M., & Liu, E.T. (2002). Optimal gene expression analysis by microarrays. Cancer Cell, 2(5), 353-361.
Ng, S.K., Tan, S.H., & Sundarajan, V.S. (2003). On combining multiple microarray studies for improved functional classification by whole-dataset feature selection. Genome Informatics, 14, 44-53. Park, T., Yi, S.G., Lee, S., Lee, S.Y., Yoo, D.H., Ahn, J.I. et al. (2003). Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics, 19(6), 694-703. Ramaswamy, S., Ross, K.N., Lander, E.S., & Golub, T.R. (2003). Evidence for a molecular signature of metastasis in primary solid tumors. Nature Genetics, 33, 49-54. Scanlan, M.J., Simpson, A.J.G., & Old, L.J. (2004). The cancer/testis genes: Review, standardization, and commentary. Cancer Immunity, 4, 1. Sebastiani, P., Gussoni, E., Kohane, I.S., & Ramoni, M.F. (2003). Statistical challenges in functional genomics. Statistical Science, 18(1), 33-70. Sevenet, N., & Cussenot, O. (2003). DNA microarrays in clinical practice: Past, present, and future. Clinical and Experimental Medicine, 3(1), 1-3. Storey, J.D., & Tibshirani, R.J. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100(16), 9440-9445. Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America, 99(10), 6567-6572. Tusher, V.G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98(9), 5116-5121.
Zien, A., Fluck, J., Zimmer, R., & Lengauer, T. (2003). Microarrays: How many do you need? Journal of Computational Biology, 10(3-4), 653-667.
KEY TERMS
Bioinformatics: It is an integration of mathematical, statistical, and computational methods to analyze and handle biological, biomedical, biochemical, and biophysical information.
Cancer-Testis (CT) Antigen: It is an antigen that is immunogenic in cancer patients, exhibits highly tissue-restricted expression, and is considered a promising target molecule for cancer vaccines.
Classification: It is the process of distributing things into classes or categories of the same type by a learnt mapping function.
Gene Expression: It describes how the information of transcription and translation encoded in a segment of DNA is converted into proteins in a cell.
Microarrays: It is the technology for biological exploration which allows one to simultaneously measure the amount of mRNA in up to tens of thousands of genes in a single experiment.
Normalization: In terms of gene expression data, it is a pre-processing step to minimize systematic bias and remove the impact of non-biological influences before data analysis is performed.
Probe Arrays: They are a list of labeled, single-stranded DNA or RNA molecules in specific nucleotide sequences, which are used to detect the complementary base sequence by hybridization.
Hierarchical Document Clustering Benjamin C. M. Fung Simon Fraser University, Canada Ke Wang Simon Fraser University, Canada Martin Ester Simon Fraser University, Canada
INTRODUCTION Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), no labeled documents are provided in clustering; hence, clustering is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree or a hierarchy that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. State-of-the-art document clustering algorithms are reviewed: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last one, which was recently developed by the authors, is further elaborated since it has been specially designed to address the hierarchical document clustering problem.
by the overall frequency of the term in the entire document set. The idea is that if a term is too common across different documents, it has little discriminating power (Rijsbergen, 1979). Although many clustering algorithms have been proposed in the literature, most of them do not satisfy the special requirements for clustering documents: •
•
•
•
BACKGROUND Document clustering is widely applicable in areas such as search engines, web mining, information retrieval, and topological analysis. Most document clustering methods perform several preprocessing steps including stop words removal and stemming on the document set. Each document is represented by a vector of frequencies of remaining terms within the document. Some document clustering algorithms employ an extra preprocessing step that divides the actual term frequency
•
High Dimensionality: The number of relevant terms in a document set is typically in the order of thousands, if not tens of thousands. Each of these terms constitutes a dimension in a document vector. Natural clusters usually do not exist in the full dimensional space, but in the subspace formed by a set of correlated dimensions. Locating clusters in subspaces can be challenging. Scalability: Real world data sets may contain hundreds of thousands of documents. Many clustering algorithms work fine on small data sets, but fail to handle large data sets efficiently. Accuracy: A good clustering solution should have high intra-cluster similarity and low inter-cluster similarity, i.e., documents within the same cluster should be similar but are dissimilar to documents in other clusters. An external evaluation method, the F-measure (Rijsbergen, 1979), is commonly used for examining the accuracy of a clustering algorithm. Easy to Browse with Meaningful Cluster Description: The resulting topic hierarchy should provide a sensible structure, together with meaningful cluster descriptions, to support interactive browsing. Prior Domain Knowledge: Many clustering algorithms require the user to specify some input parameters, e.g., the number of clusters. However, the user often does not have such prior domain knowledge. Clustering accuracy may degrade drastically if an algorithm is too sensitive to these input parameters.
DOCUMENT CLUSTERING METHODS Hierarchical Clustering Methods One popular approach in document clustering is agglomerative hierarchical clustering (Kaufman & Rousseeuw, 1990). Algorithms in this family build the hierarchy bottom-up by iteratively computing the similarity between all pairs of clusters and then merging the most similar pair. Different variations may employ different similarity measuring schemes (Karypis, 2003; Zhao & Karypis, 2001). Steinbach (2000) shows that Unweighted Pair Group Method with Arithmatic Mean (UPGMA) (Kaufman & Rousseeuw, 1990) is the most accurate one in its category. The hierarchy can also be built top-down which is known as the divisive approach. It starts with all the data objects in the same cluster and iteratively splits a cluster into smaller clusters until a certain termination condition is fulfilled. Methods in this category usually suffer from their inability to perform adjustment once a merge or split has been performed. This inflexibility often lowers the clustering accuracy. Furthermore, due to the complexity of computing the similarity between every pair of clusters, UPGMA is not scalable for handling large data sets in document clustering as experimentally demonstrated in (Fung, Wang, & Ester, 2003).
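For readers who want to experiment, UPGMA-style agglomerative clustering of document vectors can be reproduced with SciPy's hierarchical clustering, where method='average' corresponds to the UPGMA linkage; the toy term-frequency matrix below is made up purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = np.array([[3, 0, 1, 0],    # rows: documents, columns: term frequencies
                 [2, 1, 0, 0],
                 [0, 4, 0, 2],
                 [0, 3, 1, 3]], dtype=float)

distances = pdist(docs, metric="cosine")     # pairwise document dissimilarity
tree = linkage(distances, method="average")  # UPGMA builds the hierarchy bottom-up
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```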
Partitioning Clustering Methods K-means and its variants (Cutting, Karger, Pedersen, & Tukey, 1992; Kaufman & Rousseeuw, 1990; Larsen & Aone, 1999) represent the category of partitioning clustering algorithms that create a flat, non-hierarchical clustering consisting of k clusters. The k-means algorithm iteratively refines a randomly chosen set of k initial centroids, minimizing the average distance (i.e., maximizing the similarity) of documents to their closest (most similar) centroid. The bisecting k-means algorithm first selects a cluster to split, and then employs basic k-means to create two sub-clusters, repeating these two steps until the desired number k of clusters is reached. Steinbach (2000) shows that the bisecting kmeans algorithm outperforms basic k-means as well as agglomerative hierarchical clustering in terms of accuracy and efficiency (Zhao & Karypis, 2002). Both the basic and the bisecting k-means algorithms are relatively efficient and scalable, and their complexity is linear to the number of documents. As they are easy to implement, they are widely used in different clustering applications. A major disadvantage of kmeans, however, is that an incorrect estimation of the input parameter, the number of clusters, may lead to poor clustering accuracy. Also, the k-means algorithm 556
is not suitable for discovering clusters of largely varying sizes, a common scenario in document clustering. Furthermore, it is sensitive to noise that may have a significant influence on the cluster centroid, which in turn lowers the clustering accuracy. The k-medoids algorithm (Kaufman & Rousseeuw, 1990; Krishnapuram, Joshi, & Yi, 1999) was proposed to address the noise problem, but this algorithm is computationally much more expensive and does not scale well to large document sets.
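The bisecting strategy can be sketched on top of scikit-learn's KMeans as follows; the rule used to pick the cluster to split (always the largest) and the absence of refinement steps are simplifications of the published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Start with one cluster and repeatedly bisect the largest cluster until k remain."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # split the largest cluster with basic 2-means
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[target])
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    labels = np.empty(X.shape[0], dtype=int)
    for cid, members in enumerate(clusters):
        labels[members] = cid
    return labels

X = np.random.default_rng(0).normal(size=(100, 20))
print(bisecting_kmeans(X, 4)[:10])
```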
Frequent Itemset-Based Methods Wang et al. (1999) introduced a new criterion for clustering transactions using frequent itemsets. The intuition of this criterion is that many frequent items should be shared within a cluster while different clusters should have more or less different frequent items. By treating a document as a transaction and a term as an item, this method can be applied to document clustering; however, the method does not create a hierarchy of clusters. The Hierarchical Frequent Term-based Clustering (HFTC) method proposed by (Beil, Ester, & Xu, 2002) attempts to address the special requirements in document clustering using the notion of frequent itemsets. HFTC greedily selects the next frequent itemset, which represents the next cluster, minimizing the overlap of clusters in terms of shared documents. The clustering result depends on the order of selected itemsets, which in turn depends on the greedy heuristic used. Although HFTC is comparable to bisecting k-means in terms of clustering accuracy, experiments show that HFTC is not scalable (Fung, Wang, & Ester, 2003).
A Scalable Algorithm for Hierarchical Document Clustering: FIHC A scalable document clustering algorithm, Frequent Itemset-based Hierarchical Clustering (FIHC) (Fung, Wang, & Ester, 2003), is discussed in greater detail because this method satisfies all of the requirements of document clustering mentioned above. We use “item” and “term” as synonyms below. In classical hierarchical and partitioning methods, the pairwise similarity between documents plays a central role in constructing a cluster; hence, those methods are “document-centered”. FIHC is “cluster-centered” in that it measures the cohesiveness of a cluster directly using frequent itemsets: documents in the same cluster are expected to share more common itemsets than those in different clusters. A frequent itemset is a set of terms that occur together in some minimum fraction of documents. To illustrate the usefulness of this notion for the task of clustering, let us consider two frequent items, “win-
dows” and “apple”. Documents that contain the word “windows” may relate to renovation. Documents that contain the word “apple” may relate to fruits. However, if both words occur together in many documents, then another topic that talks about operating systems should be identified. By precisely discovering these hidden topics as the first step and then clustering documents based on them, the quality of the clustering solution can be improved. This approach is very different from HFTC where the clustering solution greatly depends on the order of selected itemsets. Instead, FIHC assigns documents to the best cluster from among all available clusters (frequent itemsets). The intuition of the clustering criterion is that there are some frequent itemsets for each cluster in the document set, but different clusters share few frequent itemsets. FIHC uses frequent itemsets to construct clusters and to organize clusters into a topic hierarchy. The following definitions are introduced in (Fung, Wang, & Ester, 2003): A global frequent itemset is a set of items that appear together in more than a minimum fraction of the whole document set. A global frequent item refers to an item that belongs to some global frequent itemset. A global frequent itemset containing k items is called a global frequent k-itemset. A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of documents in Ci. FIHC uses only the global frequent items in document vectors; thus, the dimensionality is significantly reduced. The FIHC algorithm can be summarized in three phases: First, construct initial clusters. Second, build a cluster (topic) tree. Finally, prune the cluster tree in case there are too many clusters.
Figure 1. Initial Clusters
Constructing Clusters
For each global frequent itemset, an initial cluster is constructed to include all the documents containing this itemset. Initial clusters are overlapping because one document may contain multiple global frequent itemsets. FIHC utilities this global frequent itemset as the cluster label to identify the cluster. For each document, the “best” initial cluster is identified and the document is assigned only to the best matching initial cluster. The goodness of a cluster Ci for a document docj is measured by some score function using cluster frequent items of initial clusters. After this step, each document belongs to exactly one cluster. The set of clusters can be viewed as a set of topics in the document set. •
Example: Figure 1 depicts a set of initial clusters. Each of them is labeled with a global frequent
itemset. A document Doc1 containing global frequent items “Sports”, “Tennis”, and “Ball” is assigned to clusters {Sports}, {Sports, Ball}, {Sports, Tennis} and {Sports, Tennis, Ball}. Suppose {Sports, Tennis, Ball} is the “best” cluster for Doc1 measured by some score function. Doc1 is then removed from {Sports}, {Sports, Ball}, and {Sports, Tennis}.
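The assignment of each document to its best initial cluster can be illustrated with the sketch below; the goodness score used here (the overlap between the document's global frequent items and a candidate cluster's cluster-frequent items, with ties broken toward the more specific label) is a simplified stand-in for FIHC's actual score function.

```python
def best_initial_cluster(doc_items, initial_clusters, cluster_frequent):
    """doc_items: set of global frequent items in the document.
    initial_clusters: {cluster_label (frozenset): set of doc ids}.
    cluster_frequent: {cluster_label: set of items that are cluster frequent}."""
    candidates = [label for label in initial_clusters if label <= doc_items]
    # Simplified goodness: overlap with the cluster's cluster-frequent items,
    # preferring the more specific (larger) cluster label on ties.
    return max(candidates,
               key=lambda label: (len(doc_items & cluster_frequent[label]), len(label)))

clusters = {frozenset({"sports"}): set(),
            frozenset({"sports", "tennis"}): set(),
            frozenset({"sports", "tennis", "ball"}): set()}
cluster_freq = {frozenset({"sports"}): {"sports"},
                frozenset({"sports", "tennis"}): {"sports", "tennis"},
                frozenset({"sports", "tennis", "ball"}): {"sports", "tennis", "ball"}}
print(best_initial_cluster({"sports", "tennis", "ball"}, clusters, cluster_freq))
```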
Building Cluster Tree In the cluster tree, each cluster (except the root node) has exactly one parent. The topic of a parent cluster is more general than the topic of a child cluster and they are “similar” to a certain degree (see Figure 2 for an example). Each cluster uses a global frequent k-itemset as its cluster label. A cluster with a k-itemset cluster label appears at level k in the tree. The cluster tree is built bottom up by choosing the “best” parent at level k1 for each cluster at level k. The parent’s cluster label must be a subset of the child’s cluster label. By treating all documents in the child cluster as a single document, the criterion for selecting the best parent is similar to the one for choosing the best cluster for a document. Example: Cluster {Sports, Tennis, Ball} has a global frequent 3-itemset label. Its potential parents are {Sports, Ball} and {Sports, Tennis}. Suppose {Sports, Tennis} has a higher score. It becomes the parent cluster of {Sports, Tennis, Ball}.
Pruning Cluster Tree
The cluster tree can be broad and deep, which makes it unsuitable for browsing. The goal of tree pruning is to efficiently remove the overly specific clusters based on the notion of inter-cluster similarity. The idea is that if two sibling clusters are very similar, they should be merged into one cluster. If a child cluster is very similar to its parent (high inter-cluster similarity), then the child cluster is replaced with its parent cluster. The parent cluster will then also include all documents of the child cluster.
Figure 2. Sample cluster tree
(In the sample tree, each cluster carries its global frequent itemset as cluster label: {Sports} is the root, its children are {Sports, Ball} and {Sports, Tennis}, and {Sports, Tennis, Ball} and {Sports, Tennis, Racket} appear below {Sports, Tennis}.)
•	Example: Suppose the cluster {Sports, Tennis, Ball} is very similar to its parent {Sports, Tennis} in Figure 2. {Sports, Tennis, Ball} is pruned and its documents, e.g., Doc1, are moved up into cluster {Sports, Tennis}.
Evaluation of FIHC
The FIHC algorithm was experimentally evaluated and compared to state-of-the-art document clustering methods; see (Fung, Wang, & Ester, 2003) for more details. FIHC uses only the global frequent items in document vectors, drastically reducing the dimensionality of the document set. Experiments show that clustering with reduced dimensionality is significantly more efficient and scalable. FIHC can cluster 100K documents within several minutes, while HFTC and UPGMA cannot even produce a clustering solution. FIHC is not only scalable, but also accurate: its clustering accuracy consistently outperforms that of the other methods. FIHC allows the user to specify an optional parameter, the desired number of clusters in the solution; however, close-to-optimal accuracy can still be achieved even if the user does not specify this parameter. The cluster tree provides a logical organization of clusters which facilitates browsing documents. Each cluster carries a cluster label that summarizes the documents in the cluster. Unlike other clustering methods, no separate post-processing is required for generating these meaningful cluster descriptions.
RELATED LINKS
The following are some clustering tools available on the Internet:
•	Tools: FIHC implements Frequent Itemset-based Hierarchical Clustering. Website: http://www.cs.sfu.ca/~ddm/
•	Tools: CLUTO implements Basic/Bisecting K-means and Agglomerative methods. Website: http://www-users.cs.umn.edu/~karypis/cluto/
•	Tools: Vivísimo® is a clustering search engine. Website: http://vivisimo.com/
FUTURE TRENDS
Incrementally Updating the Cluster Tree
One potential research direction is to incrementally update the cluster tree (Guha, Mishra, Motwani & O’Callaghan, 2000). In many cases, the number of documents in a document set grows continuously, and it is infeasible to rebuild the cluster tree upon every arrival of a new document. Using FIHC, one can simply assign the new document to the most similar existing cluster, but the clustering accuracy may degrade over time, since the original global frequent itemsets may no longer reflect the current state of the overall document set. Incremental clustering is closely related to some of the recent research on data mining in stream data (Ordonez, 2003).
CONCLUSION
Most traditional clustering methods do not completely satisfy the special requirements of hierarchical document clustering, such as high dimensionality, high volume, and ease of browsing. In this chapter, we review several document clustering methods in the context of these requirements, and a new document clustering method, FIHC, is discussed in more detail. Due to the massive volumes of unstructured data generated in the globally networked environment, the importance of document clustering will continue to grow.
REFERENCES
Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. International Conference on Knowledge Discovery and Data Mining, KDD’02 (pp. 436-442), Edmonton, Alberta, Canada.
Cutting, D.R., Karger, D.R., Pedersen, J.O., & Tukey, J.W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. International Conference on Research and Development in Information Retrieval, SIGIR’92 (pp. 318-329), Copenhagen, Denmark.
Fung, B., Wang, K., & Ester, M. (2003, May). Hierarchical document clustering using frequent itemsets. SIAM International Conference on Data Mining, SDM’03 (pp. 59-70), San Francisco, CA, United States.
Guha, S., Mishra, N., Motwani, R., & O’Callaghan, L. (2000). Clustering data streams. Symposium on Foundations of Computer Science (pp. 359-366).
Karypis, G. (2003). Cluto 2.1.1: A software package for clustering high dimensional datasets. Retrieved from http://www-users.cs.umn.edu/~karypis/cluto/
Kaufman, L., & Rousseeuw, P.J. (1990, March). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons, Inc.
Krishnapuram, R., Joshi, A., & Yi, L. (1999, August). A fuzzy relative of the k-medoids algorithm with application to document and snippet clustering. IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 99, Korea.
Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. International Conference on Knowledge Discovery and Data Mining, KDD’99 (pp. 16-22), San Diego, California, United States.
Ordonez, C. (2003). Clustering binary data streams with K-means. Workshop on Research Issues in Data Mining and Knowledge Discovery, SIGMOD’03 (pp. 12-19), San Diego, California, United States.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Workshop on Text Mining, SIGKDD’00.
van Rijsbergen, C.J. (1979). Information retrieval (2nd ed.). London: Butterworth Ltd.
Wang, K., Xu, C., & Liu, B. (1999). Clustering transactions using large items. International Conference on Information and Knowledge Management, CIKM’99 (pp. 483-490), Kansas City, Missouri, United States.
Wang, K., Zhou, S., & He, Y. (2001, April). Hierarchical classification of real life documents. SIAM International Conference on Data Mining, SDM’01, Chicago, United States.
Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota.
Zhao, Y., & Karypis, G. (2002, November). Evaluation of hierarchical clustering algorithms for document datasets. International Conference on Information and Knowledge Management (pp. 515-524), McLean, Virginia, United States.

KEY TERMS
Cluster Frequent Item: A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of the documents in Ci.
Document Clustering: The automatic organization of documents into clusters or groups so that documents within a cluster have high similarity in comparison to one another, but are very dissimilar to documents in other clusters.
Document Vector: Each document is represented by a vector of the frequencies of the items remaining in the document after preprocessing.
Global Frequent Itemset: A set of words that occur together in some minimum fraction of the whole document set.
Inter-Cluster Similarity: The overall similarity among documents from two different clusters.
Intra-Cluster Similarity: The overall similarity among documents within a cluster.
Medoid: The most centrally located object in a cluster.
Stemming: For text mining purposes, morphological variants of words that have the same or similar semantic interpretations can be considered equivalent. For example, the words “computation” and “compute” can be stemmed into “comput”.
Stop Words Removal: A preprocessing step for text mining. Stop words, like “the” and “this”, which rarely help the mining process, are removed from the input data.
High Frequency Patterns in Data Mining
Tsau Young Lin
San Jose State University, USA
INTRODUCTION
The principal focus is to examine the foundation of association (rule) mining (AM) via granular computing (GrC). The main result is: the set of all high frequency patterns can be found by solving linear inequalities within polynomial time.
BACKGROUND
Emerging Data Mining Method: Granular Computing
Some Foundation Issues in Data Mining
What is data mining? The following informal paraphrase of Fayad et al.’s (1996) definition seems quite universal: deriving useful patterns from data. The keys are data, patterns, derivation system, and usefulness. We will examine critically the current practices of AM.
Some Basic Terms in Association Mining (AM)
In AM, two measures, support and confidence, are the main criteria. It is well known among researchers that support is the main hurdle; in other words, high frequency patterns are the main focus. AM originated from market basket data (Agrawal, 1993). However, we will be interested in AM for relational tables. To be definite, we assert:
1.	A relational table is a bag relation, that is, repetitions of tuples are permissible (Garcia-Molina et al., 2002).
2.	An item is an attribute value.
3.	A q-itemset is a subtuple of length q.
4.	A high frequency pattern of length q is a q-subtuple whose number of occurrences is greater than or equal to a given threshold.
Bitmap index is a common notion in database theory. The advantage of the bitmap representation is computational efficiency (Louie & Lin, 2000), and the drawback is that the order of the table has to be fixed (Garcia-Molina, 2002). Based on granular computing, we propose a new method, called granular representations, that avoids this drawback. We will illustrate the idea by examples. The following example is modified from the text cited above (p. 702). A relational table K is viewed as a knowledge representation of a set V, called the universe, of real world entities by tuples of data; see Table 1. A bitmap index for an attribute is a collection of bit-vectors, one for each possible value that may appear in the attribute. For the first attribute, BusinesSize (the amount of business in millions), the bitmap index has three bit-vectors, each nine bits long. The first bit-vector, for value TWENTY, is 100011100, because the first, fifth, sixth, and seventh tuples have BusinesSize = TWENTY. The other two, for values TEN and THIRTY, are 011100000 and 000000011 respectively; Table 1 shows both the original
Table 1. K and B are isomorphic

Relational Table K                        Bitmap Table B
V    BusinesSize   Bmonth   City          BusinesSize   Bmonth      City
v1   TWENTY        MAR      NY            100011100     110011000   101000000
v2   TEN           MAR      SJ            011100000     110011000   010011100
v3   TEN           FEB      NY            011100000     001100000   101000000
v4   TEN           FEB      LA            011100000     001100000   000100011
v5   TWENTY        MAR      SJ            100011100     110011000   010011100
v6   TWENTY        MAR      SJ            100011100     110011000   010011100
v7   TWENTY        APR      SJ            100011100     000000100   010011100
v8   THIRTY        JAN      LA            000000011     000000011   000100011
v9   THIRTY        JAN      LA            000000011     000000011   000100011
Table 2a. Granular data model (GDM) for BusinesSize attribute
BusinesSize   Granular Representation (GDM in Granules)   Bitmap Representation (GDM in Bitmaps)
TWENTY        {v1, v5, v6, v7}                            100011100
TEN           {v2, v3, v4}                                011100000
THIRTY        {v8, v9}                                    000000011

Table 2b. Granular data model (GDM) for Bmonth attribute
Bmonth        Granular Representation (GDM in Granules)   Bitmap Representation (GDM in Bitmaps)
JAN           {v8, v9}                                    000000011
FEB           {v3, v4}                                    001100000
MAR           {v1, v2, v5, v6}                            110011000
APR           {v7}                                        000000100

Table 2c. Granular data model (GDM) for City attribute
City          Granular Representation (GDM in Granules)   Bitmap Representation (GDM in Bitmaps)
LA            {v4, v8, v9}                                000100011
NY            {v1, v3}                                    101000000
SJ            {v2, v5, v6, v7}                            010011100
Table 2. K and G are isomorphic

Bag Relation K                            Granular Table G
V    BusinesSize   Bmonth   City          BusinesSize        Bmonth             City
v1   TWENTY        MAR      NY            {v1, v5, v6, v7}   {v1, v2, v5, v6}   {v1, v3}
v2   TEN           MAR      SJ            {v2, v3, v4}       {v1, v2, v5, v6}   {v2, v5, v6, v7}
v3   TEN           FEB      NY            {v2, v3, v4}       {v3, v4}           {v1, v3}
v4   TEN           FEB      LA            {v2, v3, v4}       {v3, v4}           {v4, v8, v9}
v5   TWENTY        MAR      SJ            {v1, v5, v6, v7}   {v1, v2, v5, v6}   {v2, v5, v6, v7}
v6   TWENTY        MAR      SJ            {v1, v5, v6, v7}   {v1, v2, v5, v6}   {v2, v5, v6, v7}
v7   TWENTY        APR      SJ            {v1, v5, v6, v7}   {v7}               {v2, v5, v6, v7}
v8   THIRTY        JAN      LA            {v8, v9}           {v8, v9}           {v4, v8, v9}
v9   THIRTY        JAN      LA            {v8, v9}           {v8, v9}           {v4, v8, v9}
table and bitmap table. Bmonth means Birth month; City means the location of the entities. Next, we will interpret the bit-vectors in terms of set theory. A bit-vector can be viewed as a representation of a subset of V. For example, the bit-vector, 100011100, of BusinesSize = TWENTY says that the first, fifth, sixth, and seventh entities have been selected; in other words, the bit-vector represents the subset {v1, v5, v6, v7}. The other two bit-vectors, for values TEN and THIRTY, represent the subsets {v2, v3, v4} and {v8, v9} respectively. We summarize such translations in Tables 2a, 2b, and 2c, and refer to these subsets as elementary granules.
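As a small illustration (added here, not part of the original chapter), the following Python sketch derives both representations for the BusinesSize column of Table 1: the elementary granules (subsets of the universe V) and the corresponding bit-vectors under the fixed ordering v1, ..., v9.

```python
# Sketch: turning one column of a relational table into elementary granules and
# bit-vectors, reproducing the BusinesSize example of Tables 1 and 2a.

universe = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9"]
business_size = ["TWENTY", "TEN", "TEN", "TEN", "TWENTY",
                 "TWENTY", "TWENTY", "THIRTY", "THIRTY"]

def column_granules(universe, column):
    """Elementary granules: attribute value -> set of entities carrying that value."""
    granules = {}
    for entity, value in zip(universe, column):
        granules.setdefault(value, set()).add(entity)
    return granules

def as_bitvector(granule, universe):
    """Bitmap representation of a granule with respect to a fixed ordering of V."""
    return "".join("1" if entity in granule else "0" for entity in universe)

granules = column_granules(universe, business_size)
print(granules["TWENTY"])                          # {'v1', 'v5', 'v6', 'v7'} (order may vary)
print(as_bitvector(granules["TWENTY"], universe))  # 100011100
```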
Some easy observations:
1.	The collection of elementary granules of an attribute (column) forms a partition, that is, all granules of this attribute are pairwise disjoint. This fact was observed by Pawlak (1982) and Tony Lee (1983).
2.	From Tables 1 and 2, one can easily conclude that the relational table K, the bitmap table B and the granular table G are isomorphic. Two tables are isomorphic if one can transform a table to the other by renaming all attribute values in a one-to-one fashion.
Granular Data Model (GDM) – Uninterpreted Relational Table in Free Format
The middle columns of Tables 2a, 2b and 2c define 3 partitions. The universe and these 3 partitions, denoted by (V, {EBusinesSize, EBmonth, ECity}), determine the granular table G and vice versa. More generally, a 3-tuple (V, E, C) is called a GDM, where E is a finite family of partitions, and C consists of the names of all elementary granules. A partition (equivalence relation) of V that is not in the given E is referred to as an uninterpreted attribute of the GDM, and its elementary granules are uninterpreted attribute values.
•	GDM Theorem: The granular table G determines the GDM and vice versa.
In view of the Isomorphic Theorem below, it is sufficient to do AM in the GDM.
MAIN THRUST
Analysis of Association Mining (AM)
To understand the mathematical mechanics of AM, let us examine how the information has been created and processed. We will take the deductive data mining approach. First, let us set up some terminology. A symbol is a string of “bits and bytes” that represents a slice of the real world; however, such a real world meaning does not participate in the formal processing or computing. We term such processing computing with symbols. In AI, such a symbol is termed a semantic primitive (Feigenbaum, 1981). A symbol is termed a word if the intended real world meaning participates in the formal processing or computing. We term such processing computing with words. Note that mathematicians use
words (in group theory) as symbols; their words are our symbols.
Data Processing and Computing with Words
In traditional data processing (TDP), a relational table is a knowledge representation of a slice of the real world. So each symbol of the table represents (to humans) a piece of the real world; however, such a representation is not implemented in the system. Nevertheless, the DBMS, under human commands, does process the data, for example, Bmonth (attribute) and April and March (attribute values), with human-perceived semantics. So in TDP the relational table is a table of words; TDP is human-directed computing with words.
Data Mining and Computing with Symbols
In (automated) AM we use the table created in TDP. However, AM algorithms regard the TDP data as symbols; no real world meaning of each word participates in the process of AM. High frequency patterns are completely deduced from the counting of the symbols. AM is computing with symbols. The input data of AM is a relational table of symbols, whose real world meaning does not participate in formal computing. Under such a circumstance, if we replace the given set of symbols by a new set, then we can derive new patterns by simply replacing the symbols in the “old” patterns. Formally, we have (Lin, 2002):
•	Isomorphic Theorem: Isomorphic relational tables have isomorphic patterns.
This theorem implies that the theory of AM is a syntactic theory.
Table 3. The isomorphism of Tables K and K’

Bag Relation K                            Bag Relation K’
V    BusinesSize   Bmonth   CITY          U    W’T   NAME     Material
v1   TWENTY        MAR      NY            u1   20    SCREW    STEEL
v2   TEN           MAR      SJ            u2   10    SCREW    BRASS
v3   TEN           FEB      NY            u3   10    NAIL     STEEL
v4   TEN           FEB      LA            u4   10    NAIL     ALLOY
v5   TWENTY        MAR      SJ            u5   20    SCREW    BRASS
v6   TWENTY        MAR      SJ            u6   20    SCREW    BRASS
v7   TWENTY        APR      SJ            u7   20    PIN      BRASS
v8   THIRTY        JAN      LA            u8   30    HAMMER   ALLOY
v9   THIRTY        JAN      LA            u9   30    HAMMER   ALLOY
Table 4. Three isomorphic 2-patterns; support = cardinality of granules
K               K’              GDM in Granules                        Support
(TWENTY, MAR)   (20, SCREW)     {v1, v5, v6, v7} ∩ {v1, v2, v5, v6}    3
(MAR, SJ)       (SCREW, BRASS)  {v1, v2, v5, v6} ∩ {v2, v5, v6, v7}    3
(TWENTY, SJ)    (20, BRASS)     {v1, v5, v6, v7} ∩ {v2, v5, v6, v7}    3

•	Example: From Table 3, it should be clear that the one-to-one correspondence between K and K’ consistently induces a one-to-one correspondence between the two sets of distinct attribute values. We describe such a phenomenon by the statement: K and K’ are isomorphic.
In Table 4, we display the high frequency patterns of length 2 from Table K, Table K’ and the GDM; the three sets of patterns are isomorphic to each other. So for AM, we can use any one of the three tables. An observation: in using K or K’ for AM, one needs to scan the table to get the support, while in using the GDM, the support can be read from the cardinality of the granules; no database scan is required – one strength of GDM. Another observation: from the definition of elementary granules, it should be obvious that subtuples are mapped to the intersections of elementary granules.
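The following short sketch (an illustration added here, not taken from the chapter) shows how the supports in Table 4 can be read off directly by intersecting elementary granules, with no scan of the underlying table; the granule contents are those of Tables 2a-2c.

```python
# Support of a subtuple in the GDM = cardinality of the intersection of the
# elementary granules of its attribute values (no table scan needed).

granules = {
    "TWENTY": {"v1", "v5", "v6", "v7"},
    "MAR":    {"v1", "v2", "v5", "v6"},
    "SJ":     {"v2", "v5", "v6", "v7"},
}

def support(values, granules):
    """Return (support, common granule) of the subtuple formed by the given values."""
    common = set.intersection(*[granules[v] for v in values])
    return len(common), common

print(support(["TWENTY", "MAR"], granules))  # (3, {'v1', 'v5', 'v6'})
print(support(["MAR", "SJ"], granules))      # (3, {'v2', 'v5', 'v6'})
print(support(["TWENTY", "SJ"], granules))   # (3, {'v5', 'v6', 'v7'})
```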
Patterns and Granular Formulas
Implicitly, AM has assumed that high frequency patterns are “expressions” of the input symbols (elements of the input relational table). Such assumptions are not made in other techniques. In neural network techniques, the input data are numerical, but the patterns are not numerical “expressions”; they are essentially functions that are derived from activation functions (Lin, 1996; Park & Sandberg, 1991). Let us go back to AM, where the implicit assumption simplifies the problem. What are the possible “expressions” of the input symbols? There are two possible formalisms: logic formulas and set theoretical algebraic expressions. In logic form, we have several choices, deductive database systems, datalog, or decision logic among others (Pawlak, 1991; Ullman, 1988-89); we choose decision logic because it is simpler. In set theoretical form, we use the GDM (Lin, 2000).
Expressions of High Frequency Patterns
1.	Logic Based: A high frequency pattern in decision logic is a logic formula whose meaning set (support) has cardinality greater than or equal to the threshold.
2.	Set Based: A high frequency pattern in the GDM is a granular expression, which is a set theoretical algebraic expression of elementary granules; when the expression is evaluated set theoretically, the cardinality of the resultant set is greater than or equal to the threshold. We will call such algebraic expressions granular patterns. Note that several distinct algebraic expressions of elementary granules may have the same resultant set.
Informally, the logical formula of a granular pattern is the “logic formula” of the names of its elementary granules (Lin, 2000); more precisely, we translate elementary granules, ∪ and ∩ into their names, “or” and “and” respectively. Next, we note that there are only finitely many distinct subsets that can be generated by the intersections and unions of elementary granules in the GDM. If we only consider the disjunctive normal form, the total number of possible high frequency patterns in AM is finite.
Finding High Frequency Patterns by Solving a Set of Linear Inequalities
Let B be the Boolean algebra generated by the elementary granules; the partial order is the set theoretical inclusion ⊇. Then B is the set of all granular expressions. Let O be the smallest element (it is not necessarily the empty set) and I be the greatest element (I is the universe V). An element p is an atom if p ⊃ O and there is no element x such that p ⊃ x ⊃ O. Each atom p is an intersection of some elementary granules. Let S(b) be the set of all atoms pj such that pj ⊆ b, and let s(b) be its cardinality. From (Birkhoff & MacLane, 1977, Chapter 11), we have
•	Proposition: Every b ∈ B can be expressed in the form b = p1 ∪ . . . ∪ ps(b).
For convenience, let us define an “operation” of a binary number x and a set S. We write S*x to mean the following:
S*x = S, if x = 1 and S ≠ ∅;
S*x = ∅, if x = 0 or S = ∅.
Let p1, p2, . . . , pm be the set of all atoms in B. Then a granular expression b can be expressed as
b = p1*x1 ∪ . . . ∪ pm*xm,
and its cardinality can be expressed as
|b| = ∑ |pi|*xi,
where |•| is the cardinality of •.
•	Main Theorem: Let s be the threshold. Then b is a high frequency pattern if
|b| = ∑ |pi|*xi ≥ s.	(*)
In applications, the pi’s are readily computable; they are the elementary granules of the intersection of all the partitions (defined by the attributes); see Tables 1 and 2. So we only need to find all binary solutions for the xi. The generators of the solutions can be enumerated along the hyperplanes of the inequalities of the constraints.
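A brute-force sketch of the main theorem follows. The atoms are obtained as the nonempty intersections of one elementary granule per attribute, which for Table 1 amounts to grouping the entities that share an identical tuple, and every binary vector x with ∑|pi|*xi ≥ s then describes a candidate high frequency granular pattern. The exhaustive enumeration below is meant only to illustrate inequality (*); it does not implement the hyperplane-based enumeration mentioned above, and it quickly becomes impractical as the number of atoms grows.

```python
# Brute-force illustration of the main theorem on the data of Table 1.

from itertools import product

table = {  # entity -> (BusinesSize, Bmonth, City), as in Table 1
    "v1": ("TWENTY", "MAR", "NY"), "v2": ("TEN", "MAR", "SJ"),
    "v3": ("TEN", "FEB", "NY"),    "v4": ("TEN", "FEB", "LA"),
    "v5": ("TWENTY", "MAR", "SJ"), "v6": ("TWENTY", "MAR", "SJ"),
    "v7": ("TWENTY", "APR", "SJ"), "v8": ("THIRTY", "JAN", "LA"),
    "v9": ("THIRTY", "JAN", "LA"),
}

# Atoms: intersections of one elementary granule per attribute, i.e., groups of
# entities that share the same full tuple.
atoms = {}
for entity, tup in table.items():
    atoms.setdefault(tup, set()).add(entity)
atom_sizes = [len(members) for members in atoms.values()]

def solutions(atom_sizes, s):
    """All binary vectors x with sum(|p_i| * x_i) >= s, i.e., candidate patterns."""
    return [x for x in product((0, 1), repeat=len(atom_sizes))
            if sum(size * xi for size, xi in zip(atom_sizes, x)) >= s]

print(len(atoms), "atoms with sizes", atom_sizes)
print(len(solutions(atom_sizes, s=3)), "binary solutions satisfy (*) for s = 3")
```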
Observations
Theoretically, this is a remarkable theorem. It says all possible high frequency patterns can be found by solving linear inequalities. However, the practicality of the main theorem is highly dependent on the complexity of the problem. If both |pi| and s are small, then the number of solutions will get out of hand, simply due to its sheer size. We would like to stress that the difficulty is simply due to the number of possible solutions, not the methodology. The result implies that the notion of high frequency patterns may not be tight enough. At this moment, (*) is useful only if the number of attributes under consideration is small.
FUTURE TRENDS
Tighter Notion of Patterns
Let us consider the real world meaning of the patterns of length 2, namely, (TWENTY, MAR) and (20, SCREW). What does the subtuple (TWENTY, MAR) mean? 20 million dollar business in March? The last statement is not the original meaning of the schema: originally it means that v1, v5, v6 have 20 million dollar business and they were born in March. This subtuple has no meaning on its own. On the other hand, (20, SCREW) from K’ is a valid pattern (most screws have weight 20). In summary, we have
(TWENTY, MAR) from K has no meaning on its own, while (20, SCREW) from K’ has a valid meaning. Let RW(K) be the Real World that K is representing. The summary implies that the subtuple (TWENTY, MAR), even though it occurs very frequently in the table, corresponds to no real world event. The data implies that the three entities v1, v5, v6 have common properties encoded by “Twenty” and “Mar.” In the table K, they are “naively” summarized into one concept “(TWENTY, MAR).” Unfortunately, in the real world RW(K), the three occurrences of “Twenty” and “Mar” (from the three entities v1, v5, v6) do not integrate into an appropriate new concept “(TWENTY, MAR).” Such an “error” occurs because high frequency is an inadequate or inaccurate criterion. We need a tighter notion of patterns.
Semantic Oriented Data Mining
If we do know how to compute the semantics, then the computation should tell us that the two repeated words “TWENTY” and “MAR” cannot be combined into a new concept regardless of high repetition, and should be dropped out. So semantic oriented data mining is needed (Lin & Louie, 2001, 2002). As ontology, the semantic web, and computing with words (semantic computing) are heating up, it could be the right time to move onto semantic oriented data mining.
New Notions of Patterns and Algorithmic Information Theory
In (Lin, 1993), based on algorithmic information theory or Kolmogorov complexity theory, we proposed that a non-random (compressible) string is a string with patterns, and that the shortest Turing machine that generates this string is the pattern. We concluded, then, that a finite sequence (a relational table is a finite sequence) with long constant subsequences (the length of such a constant subsequence is the support) is trivially compressible (having a pattern). High frequency patterns are such patterns. Following the same thought, what would be the next less trivial compressible finite sequences?
CONCLUSION
Our analysis of association mining seems fruitful: (1) High frequency patterns are natural generalizations of association rules. (2) All high frequency patterns (generalized associations) can be found by solving linear inequalities. (3) High frequency patterns are rather lean in semantics (Isomorphic Theorem). So semantic oriented AM or a new notion of patterns may be needed.
REFERENCES
Agrawal, R., Imielinski, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data (pp. 207-216), Washington, DC.
Barr, A., & Feigenbaum, E.A. (1981). The handbook of artificial intelligence. William Kaufmann.
Fayad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery: An overview. In Fayad, Piatetsky-Shapiro, Smyth, & Uthurusamy (Eds.), Knowledge Discovery in Databases. AAAI/MIT Press.
Garcia-Molina, H., Ullman, J.D., & Widom, J. (2002). Database systems: The complete book. Prentice Hall.
Lee, T.T. (1983). Algebraic theory of relational databases. The Bell System Technical Journal, 62(10), 3159-3204.
Lin, T.Y. (1993). Rough patterns in data - Rough sets and foundation of intrusion detection systems. Journal of Foundation of Computer Science and Decision Support, 18(3-4), 225-241.
Lin, T.Y. (1996, July). The power and limit of neural networks. In Proceedings of the 1996 Engineering Systems Design and Analysis Conference, 7 (pp. 49-53), Montpellier, France.
Lin, T.Y. (2000). Data mining and machine oriented modeling: A granular computing approach. Journal of Applied Intelligence, 13(2), 113-124.
Lin, T.Y., & Louie, E. (2001). Semantics oriented association rules. In 2002 World Congress of Computational Intelligence (pp. 956-961), Honolulu, Hawaii, May 12-17 (paper #5754).
Louie, E., & Lin, T.Y. (2000, October). Finding association rules using fast bit computation: Machine-oriented modeling. In Z. Ras & S. Ohsuga (Eds.), Foundations of Intelligent Systems, Lecture Notes in Artificial Intelligence #1932 (pp. 486-494). Springer-Verlag. 12th International Symposium on Methodologies for Intelligent Systems, Charlotte, NC.
Park, J., & Sandberg, I.W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246-257.
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341-356.
Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer Academic Publishers.
Ullman, J. (1988). Principles of database and knowledge-base systems, vol. I. Computer Science Press.
Ullman, J. (1989). Principles of database and knowledge-base systems, vol. II. Computer Science Press.
KEY TERMS
Algorithmic Information Theory: “Absolute information theory” based on Kolmogorov complexity theory.
Association (Undirected Association Rule): A subtuple of a bag relation whose support is greater than a given threshold.
Bag Relation: A relation that permits repetition of tuples.
Computing with Symbols: The interpretations of the symbols do not participate in the formal data processing or computing.
Computing with Words: In this article, computing with words means a form of formal data processing or computing in which the interpretations of the symbols do participate. L.A. Zadeh uses this term in a much deeper way.
Deductive Data Mining: A data mining methodology that requires one to list explicitly the input data and background knowledge. Roughly, it treats data mining as a deductive science (axiomatic method).
Granulation and Partition: A partition is a decomposition of a set into a collection of mutually disjoint subsets. Granulation is defined similarly, but allows the subsets to be generalized subsets, such as fuzzy sets, and permits overlapping.
Kolmogorov Complexity of a String: The length of the shortest program that can generate the given string.
Semantic Primitive: A symbol that has an interpretation, but the interpretation is not implemented in the system. So in automated computing, a semantic primitive is treated as a symbol. However, in interactive computing, it may be treated as a word (though not necessarily).
Support: The percentage of the tuples in which the subtuple occurs.
Homeland Security Data Mining and Link Analysis
Bhavani Thuraisingham
The MITRE Corporation, USA
INTRODUCTION
Data mining is the process of posing queries to large quantities of data and extracting information often previously unknown using mathematical, statistical, and machine-learning techniques. Data mining has many applications in a number of areas, including marketing and sales, medicine, law, manufacturing, and, more recently, homeland security. Using data mining, one can uncover hidden dependencies between terrorist groups as well as possibly predict terrorist events based on past experience. One particular data-mining technique that is being investigated a great deal for homeland security is link analysis, where links are drawn between various nodes, possibly detecting some hidden links. This article provides an overview of the various developments in data-mining applications in homeland security. The organization of this article is as follows. First, we provide some background on data mining and the various threats. Then, we discuss the applications of data mining and link analysis for homeland security. Privacy considerations are discussed next as part of future trends. The article is then concluded.
BACKGROUND
We provide background information on both data mining and security threats.
Data Mining
Data mining is the process of posing various queries and extracting useful information, patterns, and trends often previously unknown from large quantities of data possibly stored in databases. Essentially, for many organizations, the goals of data mining include improving marketing capabilities, detecting abnormal patterns, and predicting the future, based on past experiences and current trends. There is clearly a need for this technology. There are large amounts of current and historical data being stored. Therefore, as databases become larger, it becomes increasingly difficult to support decision making. In addition, the data could be from multiple
sources and multiple domains. There is a clear need to analyze the data to support planning and other functions of an enterprise. Some of the data-mining techniques include those based on statistical reasoning techniques, inductive logic programming, machine learning, fuzzy sets, and neural networks, among others. The data-mining problems include classification (finding rules to partition data into groups), association (finding rules to make associations between data), and sequencing (finding rules to order data). Essentially, one arrives at some hypotheses, which are the information extracted from examples and patterns observed. These patterns are observed from posing a series of queries; each query may depend on the responses obtained to the previous queries posed. Data mining is an integration of multiple technologies. These include data management, such as database management, data warehousing, statistics, machine learning, decision support, and others, such as visualization and parallel computing. There is a series of steps involved in data mining. These include getting the data organized for mining, determining the desired outcomes to mining, selecting tools for mining, carrying out the mining, pruning the results so that only the useful ones are considered further, taking actions from the mining, and evaluating the actions to determine benefits. There are various types of data mining. By this we do not mean the actual techniques used to mine the data, but what the outcomes will be. These outcomes also have been referred to as data-mining tasks. These include clustering, classification, anomaly detection, and forming associations. While several developments have been made, there also are many challenges. For example, due to the large volumes of data, how can the algorithms determine which technique to select and what type of data mining to do? Furthermore, the data may be incomplete and/or inaccurate. At times, there may be redundant information, and at times, there may not be sufficient information. It is also desirable to have data-mining tools that can switch to multiple techniques and support multiple outcomes. Some of the current trends in data mining include mining Web data, mining distributed and heterogeneous databases, and privacy-preserving data mining, where one ensures that one can get useful results from mining and at the same time maintain the privacy of
individuals (Berry & Linoff, 1997; Han & Kamber, 2000; Thuraisingham, 1998).
Security Threats
Security threats have been grouped into many categories (Thuraisingham, 2003). These include information-related threats, where information technologies are used to sabotage critical infrastructures, and non-information-related threats, such as bombing buildings. Threats also may be real-time threats and non-real-time threats. Real-time threats are threats where attacks have timing constraints associated with them, such as “building X will be attacked within three days.” Non-real-time threats are those threats that do not have timing constraints associated with them. Note that non-real-time threats could become real-time threats over time. Threats also include bioterrorism, where biological and possibly chemical weapons are used to attack, and cyberterrorism, where computers and networks are attacked. Bioterrorism could cost millions of lives, and cyberterrorism, such as attacks on banking systems, could cost millions of dollars. Some details on the threats and countermeasures are discussed in various texts (Bolz, 2001). The challenge is to come up with techniques to handle such threats. In this article, we discuss data-mining techniques for security applications.
MAIN THRUST
First, we will discuss data mining for homeland security. Then, we will focus on a specific data-mining technique called link analysis for homeland security. An aspect of homeland security is cyber security. Therefore, we also will discuss data mining for cyber security.
Applications of Data Mining for Homeland Security
Data-mining techniques are being examined extensively for homeland security applications. The idea is to gather information about various groups of people and study their activities and determine if they are potential terrorists. As we have stated earlier, data-mining outcomes include making associations, link analysis, forming clusters, classification, and anomaly detection. The techniques that result in these outcomes are techniques based on neural networks, decision trees, market-basket analysis techniques, inductive logic programming, rough sets, link analysis based on graph theory, and nearest-neighbor techniques. The methods used for data mining include top-down reasoning, where we start with
a hypothesis and then determine whether the hypothesis is true, or bottom-up reasoning, where we start with examples and then come up with a hypothesis (Thuraisingham, 1998). In the following, we will examine how data-mining techniques may be applied for homeland security applications. Later, we will examine a particular data-mining technique called link analysis (Thuraisingham, 2003). Data-mining techniques include techniques for making associations, clustering, anomaly detection, prediction, estimation, classification, and summarization. Essentially, these are the techniques used to obtain the various data-mining outcomes. We will examine a few of these techniques and show how they can be applied to homeland security. First, consider association rule mining techniques. These techniques produce results, such as John and James travel together or Jane and Mary travel to England six times a year and to France three times a year. Essentially, they form associations between people, events, and entities. Such associations also can be used to form connections between different terrorist groups. For example, members from Group A and Group B have no associations, but Groups A and B have associations with Group C. Does this mean that there is an indirect association between A and B? Next, let us consider clustering techniques. Clusters essentially partition the population based on a characteristic such as spending patterns. For example, those living in the Manhattan region form a cluster, as they spend over $3,000 on rent. Those living in the Bronx form another cluster, as they spend around $2,000 on rent. Similarly, clusters can be formed based on terrorist activities. For example, those living in region X bomb buildings, and those living in region Y bomb planes. Finally, we will consider anomaly detection techniques. A good example here is learning to fly an airplane without wanting to learn to take off or land. The general pattern is that people want to get a complete training course in flying. However, there are now some individuals who want to learn to fly but do not care about taking off or landing. This is an anomaly. Another example is that John always goes to the grocery store on Saturdays. But on Saturday, October 26, 2002, he went to a firearms store and bought a rifle. This is an anomaly and may need some further analysis as to why he is going to a firearms store when he has never done so before. Some details on data mining for security applications have been reported recently (Chen, 2003).
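As a toy illustration of the anomaly-detection idea sketched above (the events, counts, and threshold below are invented for the example and are not from the article), a profile of past behavior can be built and any new event whose frequency in that profile falls below a threshold can be flagged for further analysis:

```python
# Minimal anomaly-detection sketch: flag events that fall outside a behaviour profile.

from collections import Counter

past_visits = ["grocery store"] * 51 + ["gas station"] * 12   # hypothetical history
profile = Counter(past_visits)

def is_anomalous(event, profile, min_fraction=0.05):
    """Flag an event whose relative frequency in the profile is below min_fraction."""
    total = sum(profile.values())
    return profile.get(event, 0) / total < min_fraction

print(is_anomalous("grocery store", profile))   # False: usual behaviour
print(is_anomalous("firearms store", profile))  # True: never seen before, flag it
```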
Applications of Link Analysis
Link analysis is being examined extensively for applications in homeland security. For example, how do we
connect the dots describing the various events and make links and connections between people and events? One challenge to using link analysis for counterterrorism is reasoning with partial information. For example, agency A may have a partial graph, agency B another partial graph, and agency C a third partial graph. The question is how do you find the associations between the graphs, when no agency has the complete picture? One would argue that we need a data miner that would reason under uncertainty and be able to figure out the links between the three graphs. This would be the ideal solution, and the research challenge is to develop such a data miner. The other approach is to have an organization above the three agencies that will have access to the three graphs and make the links. The strength behind link analysis is that by visualizing the connections and associations, one can have a better understanding of the associations among the various groups. Associations such as A and B, B and C, D and A, C and E, E and D, F and B, and so forth can be very difficult to manage, if we assert them as rules. However, by using nodes and links of a graph, one can visualize the connections and perhaps draw new connections among different nodes. Now, in the real world, there would be thousands of nodes and links connecting people, groups, events, and entities from different countries and continents as well as from different states within a country. Therefore, we need link analysis techniques to determine the unusual connection, such as a connection between G and P, for example, which is not obvious with simple reasoning strategies or by human analysis. Link analysis is one of the data-mining techniques that is still in its infancy. That is, while much has been written about techniques such as association rule mining, automatic clustering, classification, and anomaly detection, very little material has been published on link analysis. We need interdisciplinary researchers such as mathematicians, computational scientists, computer scientists, machine-learning researchers, and statisticians working together to develop better link analysis tools.
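The following toy sketch (with invented agencies, entities, and links) illustrates the partial-graph problem described above: each agency's edge list reveals nothing unusual on its own, but pooling the partial graphs and searching the merged graph exposes a chain of associations between two entities, in the spirit of the G-P connection mentioned above.

```python
# Toy link-analysis sketch: merge partial graphs from several sources and search
# for a chain of associations that no single source could reveal.

from collections import defaultdict, deque

agency_a = [("G", "H"), ("H", "K")]   # hypothetical partial edge lists
agency_b = [("K", "M")]
agency_c = [("M", "P")]

def merged_graph(*edge_lists):
    graph = defaultdict(set)
    for edges in edge_lists:
        for u, v in edges:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def connection(graph, start, goal):
    """Breadth-first search for a chain of associations between two entities."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

print(connection(merged_graph(agency_a, agency_b, agency_c), "G", "P"))
# expected: ['G', 'H', 'K', 'M', 'P'], a chain no single agency's graph reveals
```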
Applications of Data Mining for Cyber Security
Data mining also has applications in cyber security, which is an aspect of homeland security. The most prominent application is in intrusion detection. For example, our computers and networks are being intruded upon by unauthorized individuals. Data-mining techniques, such as those for classification and anomaly detection, are being used extensively to detect such unauthorized intrusions. For example, data about normal behavior is gathered, and when something occurs out of the ordinary, it is flagged as an unauthorized intrusion. Normal behavior could be
that John’s computer is never used between 2:00 A.M. and 5:00 A.M. When John’s computer is in use at 3:00 A.M., for example, then this is flagged as an unusual pattern. Data mining is also being applied to other applications in cyber security, such as auditing. Here again, data on normal database access is gathered, and when something unusual happens, then this is flagged as a possible access violation. Digital forensics is another area where data mining is being applied. Here again, by mining the vast quantities of data, one could detect the violations that have occurred. Finally, data mining is being used for biometrics. Here, pattern recognition and other machine-learning techniques are being used to learn the features of a person and then to authenticate the person, based on the features.
FUTURE TRENDS
While data mining has many applications in homeland security, it also causes privacy concerns. This is because we need to collect all kinds of information about people, which causes private information to be divulged. Privacy and data mining have been the subject of much debate during the past few years, although some early discussions also have been reported (Thuraisingham, 1996). One promising direction is privacy-preserving data mining. The challenge here is to carry out data mining but at the same time ensure privacy. For example, one could use randomization as a technique and give out approximate values instead of the actual values. The challenge is to ensure that the approximate values are still useful. Many papers on privacy-preserving data mining have been published recently (Agrawal & Srikant, 2000).
CONCLUSION
This article has discussed data-mining applications in homeland security. Applications in national security and cyber security both are discussed. We first provided an overview of data mining and security threats and then discussed data-mining applications. We also emphasized a particular data-mining technique—link analysis. Finally, we discussed privacy-preserving data mining. It is only during the past three years that data mining for security applications has received a lot of attention. Although a lot of progress has been made, there is also a lot of work that needs to be done. First, we need to have a better understanding of the various threats. We need to determine which data-mining techniques are
Homeland Security Data Mining and Link Analysis
applicable to which threats. Much research is also needed on link analysis. To develop effective solutions, datamining specialists have to work with counter-terrorism experts. We also need to motivate the tool vendors to develop tools to handle terrorism.
Thuraisingham, B. (2003). Web data mining technologies and their applications in business intelligence and counter-terrorism. FL: CRC Press.
KEY TERMS NOTE The views and conclusions expressed in this article are those of the author and do not reflect the policies of the MITRE Corporation or of the National Science Foundation.
REFERENCES
Cyber Security: Techniques used to protect the computer and networks from the threats. Data Management: Techniques used to organize, structure, and manage the data, including database management and data administration. Digital Forensics: Techniques to determine the root causes of security violations that have occurred in a computer or a network.
Agrawal, R, & Srikant, R. (2000). Privacy-preserving data mining. Proceedings of the ACM SIGMOD Conference, Dallas, Texas.
Homeland Security: Techniques used to protect a building or an organization from threats.
Berry, M., & Linoff, G. (1997). Data mining techniques for marketing, sales, and customer support. New York: John Wiley.
Intrusion Detection: Techniques used to protect the computer system or a network from a specific threat, which is unauthorized access.
Bolz, F. (2001). The counterterrorism handbook: Tactics, procedures, and techniques. CRC Press.
Link Analysis: A data-mining technique that uses concepts and techniques from graph theory to make associations.
Chen, H. (Ed.). (2003). Proceedings of the 1st Conference on Security Informatics, Tucson, Arizona. Han, J., & Kamber, M. (2000). Data mining, concepts and techniques. CA: Morgan Kaufman. Thuraisingham, B. (1996). Data warehousing, data mining and security. Proceedings of the IFIP Database Security Conference, Como, Italy. Thuraisingham, B. (1998). Data mining: Technologies, techniques, tools and trends, FL: CRC Press.
Privacy: Process of ensuring that information deemed personal to an individual is protected. Privacy-Preserving Data Mining: Data-mining techniques that extract useful information but at the same time ensure the privacy of individuals. Threats: Events that disrupt the normal operation of a building, organization, or network of computers.
Humanities Data Warehousing
Janet Delve
University of Portsmouth, UK
INTRODUCTION
Data Warehousing is now a well-established part of the business and scientific worlds. However, up until recently, data warehouses were restricted to modeling essentially numerical data – examples being sales figures in the business arena (e.g. Wal-Mart’s data warehouse) and astronomical data (e.g. SKICAT) in scientific research, with textual data playing a descriptive rather than a central role. The inability of data warehouses to cope with mainly non-numeric data is particularly problematic for humanities research utilizing material such as memoirs and trade directories. Recent innovations have opened up possibilities for non-numeric data warehouses, making them widely accessible to humanities research for the first time. Due to its irregular and complex nature, humanities research data is often difficult to model, and manipulating time shifts in a relational database is problematic, as is fitting such data into a normalized data model. History and linguistics are exemplars of areas where relational databases are cumbersome and which would benefit from the greater freedom afforded by data warehouse dimensional modeling.
BACKGROUND
Hudson (2001, p. 240) declared relational databases to be the predominant software used in recent historical research involving computing. Historical databases have been created using different types of data from diverse countries and time periods. Some databases are modest and independent, others part of a larger conglomerate like the North Atlantic Population Project (NAPP) project that entails integrating international census data. One issue that is essential to good database creation is data modeling, which has been contentiously debated recently in historical circles. When reviewing relational modeling in historical research, Bradley (1994) contrasted “straightforward” business data with incomplete, irregular, complex- or semi-structured historical data. He noted that the relational model worked well for simply-structured business data, but could be tortuous to use for historical data. Breure (1995) pointed out the advantages of inputting data into a model that matches it closely, something that is very hard to achieve with the relational model. Burt and James (1996) considered the relative freedom of using source-oriented data modeling (Denley, 1994) as compared to relational modeling with its restrictions due to normalization (which splits data into many separate tables), and highlighted the possibilities of data warehouses. Normalization is not the only hurdle historians encounter when using the relational model. Date and time fields provide particular difficulties: historical dating systems encompass a number of different calendars, including the Western, Islamic, Revolutionary and Byzantine. Historical data may refer to “the first Sunday after Michaelmas”, requiring calculation before a date may be entered into a database. Unfortunately, some databases and spreadsheets cannot handle dates falling outside the late 20th century. Similarly, for researchers in historical geography, it might be necessary to calculate dates based on the local introduction of the Gregorian calendar, for example. These difficulties can be time-consuming and arduous for researchers. Awkward and irregular data with abstruse dating systems thus do not fit easily into a relational model that does not lend itself to hierarchical data. Many of these problems also occur in linguistics computing. Linguistics is a data-rich field, with multifarious forms for words, multitudinous rules for coding sounds, words and phrases, and also numerous other parameters - geography, educational and social status. Databases are used for housing many types of linguistic data from a variety of research domains - phonetics, phonology, morphology, syntax, lexicography, computer-assisted learning (CAL), historical linguistics and dialectology. Data integrity and consistency are of utmost importance in this field. Relational DataBase Management Systems (RDBMSs) are able to provide this, together with powerful and flexible search facilities (Nerbonne, 1998). Bliss and Ritter (2001) discussed the constraints imposed on them when using “the rigid coding structure of the database” developed to house pronoun systems from 109 languages. They observed that coding introduced interpretation of data and concluded that designing “a typological database is not unlike trying to fit a square object into a round hole. Linguistic data is highly variable, database structures are highly rigid, and the two do not always ‘fit’.” Brown (2001) outlined the fact
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Humanities Data Warehousing
that different database structures may reflect a particular linguistic theory, and also mentioned the trade-off between quality and quantity in terms of coverage. The choice of data model thus has a profound effect on the problems that can be tackled and the data that can be interrogated. For both historical and linguistic research, relational data modeling using normalization often appears to impose data structures which do not fit naturally with the data and which constrain subsequent analysis. Coping with complicated dating systems can also be very problematic. Surprisingly, similar difficulties have already arisen in the business community, and have been addressed by data warehousing.
MAIN THRUST
Data Warehousing in the Business Context
Data warehouses came into being as a response to the problems caused by large, centralized databases which users found unwieldy to query. Instead, they extracted portions of the databases which they could then control, resulting in the “spider-web” problem where each department produces queries from its own, uncoordinated extract database (Inmon, 2001, pp. 6-14). The need was thus recognized for a single, integrated source of clean data to serve the analytical needs of a company. A data warehouse can provide answers to a completely different range of queries than those aimed at a traditional database. Using an estate agency as a typical business, the type of question their local databases should be able to answer might be “How many three-bedroomed properties are there in the Botley area up to the value of £150,000?” The type of over-arching question a business analyst (and CEOs) would be interested in might be of the general form “Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?” (Begg & Connolly, 2004, p. 1154). To trawl through each local estate agency database and corresponding local county council database, then amalgamate the results into a report would take a long time and a lot of resources. The data warehouse was created to answer this type of need.
Basic Components of a Data Warehouse
Inmon (2002, p. 31), the “father of data warehousing”, defined a data warehouse as being subject-oriented, integrated, non-volatile and time-variant. Emphasis
is placed on choosing the right subjects to model as opposed to being constrained to model around applications. Data warehouses do not replace databases as such - they co-exist alongside them in a symbiotic fashion. Databases are needed both to serve the clerical community who answer day-to-day queries such as “what is A.R. Smith’s current overdraft?” and also to “feed” a data warehouse. To do this, snapshots of data are extracted from a database on a regular basis (daily, hourly and in the case of some mobile phone companies almost realtime). The data is then transformed (cleansed to ensure consistency) and loaded into a data warehouse. In addition, a data warehouse can cope with diverse data sources, including external data in a variety of formats and summarized data from a database. The myriad types of data of different provenance create an exceedingly rich and varied integrated data source opening up possibilities not available in databases. Thus all the data in a data warehouse is integrated. Crucially, data in a warehouse is not updated - it is only added to, thus making it “nonvolatile”, which has a profound effect on data modeling, as the main function of normalization is to obviate update anomalies. Finally, a data warehouse has a time horizon (that is contains data over a period) of five to ten years, whereas a database typically holds data that is current for two to three months.
Data Modeling in a Data Warehouse – Dimensional Modeling
There is a fundamental split in the data warehouse community as to whether to construct a data warehouse from scratch, or to build it via data marts. A data mart is essentially a cut-down data warehouse that is restricted to one department or one business process. Inmon (2001, p. 142) recommended building the data warehouse first, then extracting the data from it to fill up several data marts. The data warehouse modeling expert Kimball (2002) advised the incremental building of several data marts that are then carefully integrated into a data warehouse. Whichever way is chosen, the data is normally modeled via dimensional modeling. Dimensional models need to be linked to the company’s corporate ERD (Entity Relationship Diagram) as the data is actually taken from this (and other) source(s). Dimensional models are somewhat different from ERDs, the typical star model having a central fact table surrounded by dimension tables. Kimball (2002, pp. 16-18) defined a fact table as “the primary table in a dimensional model where the numerical performance measurements of the business are stored…Since measurement data is overwhelmingly the largest part of any data mart, we avoid duplicating it in multiple places around the enterprise.” Thus the fact table contains dynamic numerical data such as sales
quantities and sales and profit figures. It also contains key data in order to link to the dimension tables. Dimension tables contain the textual descriptors of the business process being modeled and their depth and breadth define the analytical usefulness of the data warehouse. As they contain descriptive data, it is assumed they will not change at the same rapid rate as the numerical data in the fact table that will certainly change every time the data warehouse is refreshed. Dimension tables typically have 50-100 attributes (sometimes several hundreds) and these are not usually normalized. The data is often hierarchical in the tables and can be an accurate reflection of how data actually appears in its raw state (Kimball 2002, pp. 19-21). There is not the need to normalize as data is not updated in the data warehouse, although there are variations on the star model such as the snowflake and starflake models which allow varying degrees of normalization in some or all of their dimension tables. Coding is disparaged due to the long-term view that definitions may be lost and that the dimension tables should contain the fullest, most comprehensible descriptions possible (Kimball 2002, p. 49). The restriction of data in the fact table to numerical data has been a hindrance to academic computing. However, Kimball has recently developed “factless” fact tables (Kimball, 2002) which do not contain measurements, thus opening the door to a much broader spectrum of possible data warehouses.
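To make the star model concrete, the following small Python sketch (the table names, attributes, and figures are invented for illustration, loosely following the estate-agency example rather than any real schema) represents a fact table of sale measurements keyed to two denormalized dimension tables, and answers an analytical question by joining and aggregating:

```python
# Hypothetical star schema: a fact table of numerical measures plus foreign keys,
# and denormalized dimension tables holding descriptive (and hierarchical) attributes.

property_dim = {  # property_key -> descriptive attributes, including a location hierarchy
    1: {"type": "flat", "bedrooms": 2, "city": "Portsmouth", "region": "South East"},
    2: {"type": "detached house", "bedrooms": 4, "city": "Leeds", "region": "Yorkshire"},
}
date_dim = {      # date_key -> calendar attributes (easily extended with other calendars)
    101: {"year": 2004, "quarter": "Q1"},
    102: {"year": 2004, "quarter": "Q2"},
}
sales_fact = [    # each row: keys into the dimensions + a numerical measurement
    {"property_key": 1, "date_key": 101, "selling_price": 120_000},
    {"property_key": 2, "date_key": 101, "selling_price": 250_000},
    {"property_key": 2, "date_key": 102, "selling_price": 265_000},
]

def average_price_by(attribute):
    """Average selling price grouped by a descriptive attribute of the property dimension."""
    totals = {}
    for row in sales_fact:
        group = property_dim[row["property_key"]][attribute]
        total, count = totals.get(group, (0, 0))
        totals[group] = (total + row["selling_price"], count + 1)
    return {group: total / count for group, (total, count) in totals.items()}

print(average_price_by("type"))  # {'flat': 120000.0, 'detached house': 257500.0}
```

Because the dimension rows are wide and denormalized, a query like this needs no further joins; that is the trade-off dimensional modeling makes, which is acceptable precisely because warehouse data is added to rather than updated.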
Applying the Data Warehouse Architecture to Historical and Linguistic Research One of the major advantages of data warehousing is the enormous flexibility in modeling data. Normalization is no longer an automatic straightjacket and hierarchies can be represented in dimension tables. The expansive time dimension (Kimball, 2002, p. 39) is a welcome by-product of this modeling freedom, allowing country-specific calendars, synchronization across multiple time zones and the inclusion of multifarious time periods. It is possible to add external data from diverse sources and summarized data from the source database(s). The data warehouse is built for analysis which immediately makes it attractive to humanities researchers. It is designed to continuously receive huge volumes (terabytes) of data, but is sensitive enough to cope with the idiosyncrasies of geographic location dimensions within GISs (Kimball, 2002, p. 227). Additionally a data warehouse has advanced indexing facilities that make it desirable for those controlling vast quantities of data. With a data warehouse it is theoretically possible to publish the “right data” that has been collected from a variety of sources and edited for quality and consistency. In a data warehouse all data is collated so a variety of different subsets can be analyzed whenever 572
required. It is comparatively easy to extend a data warehouse and add material from a new source. The data cleansing techniques developed for data warehousing are of interest to researchers, as is the tracking facility afforded by the meta data manager (Begg & Connolly, 2004, p. 1169-1170). In terms of using data warehouses “off the shelf”, some humanities research might fit into the “numerical fact” topology, but some might not. The “factless fact table” has been used to create several American university data warehouses, but expertise in this area would not be as widespread as that with normal fact tables. The whole area of data cleansing may perhaps be daunting for humanities researchers (as it is to those in industry). Ensuring vast quantities of data is clean and consistent may be an unattainable goal for humanities researchers without recourse to expensive data cleansing software. The data warehouse technology is far from easy and is based on having existing databases to extract from, hence double the work. It is unlikely that researchers would be taking regular snapshots of their data, as occurs in industry, but they could equate to data sets taken at different periods of time to data warehouse snapshots (e.g. 1841 census, 1861 census). Whilst many data warehouses use familiar WYSIWYGs and can be queried with SQL-type commands, there is undeniably a huge amount to learn in data warehousing. Nevertheless, there are many areas in both linguistics and historical research where data warehouses may prove attractive.
FUTURE TRENDS Data Warehouses and Linguistics Research The problems related by Bliss and Ritter (2001) concerning rigid relational data structures and pre-coding problems would be alleviated by data warehousing. Brown (2001) outlined the dilemma arising from the alignment of database structures with particular linguistic theories, and also the conflict of quality and quantity of data. With a data warehouse there is room for both vast quantities of data and a plethora of detail. No structure is forced onto the data so several theoretical approaches can be investigated using the same data warehouse. Dalli (2001) observed that many linguistic databases are standalone with no hope of interoperability. His proffered solution to create an interoperable database of Maltese linguistic data involved an RDBMS and XML. Using data warehouses to store linguistic data should ensure interoperability. There is growing interest in corpora databases, with the
Humanities Data Warehousing
recent dedicated conference at Portsmouth, November 2003. Teich, Hansen and Fankhauser drew attention to the multi-layered nature of corpora and speculated as to how “multi-layer corpora can be maintained, queried and analyzed in an integrated fashion.” A data warehouse would be able to cope with this complexity. Nerbonne (1998) alluded to the “importance of coordinating the overwhelming amount of work being done and yet to be done.” Kretzschmar (2001) delineated “the challenge of preservation and display for massive amounts of survey data.” There appears to be many linguistics databases containing data from a range of locations/countries. For example, ALAP, the American Linguistic Atlas Project; ANAE, the Atlas of North American English (part of the TELSUR Project); TDS, the Typological Database System containing European data; AMPER, the Multimedia Atlas of the Romance Languages. Possible research ideas for the future may include a broadening of horizons - instead of the emphasis on individual database projects, there may develop an “integrated warehouse” approach with the emphasis on larger scale, collaborative projects. These could compare different languages or contain many different types of linguistic data for a particular language, allowing for new orders of magnitude analysis.
Data Warehouses and Historical Research There are inklings of historical research involving data warehousing in Britain and Canada. A data warehouse of current census data is underway at the University of Guelph, Canada and the Canadian Century Research Infrastructure aims to house census data from the last 100 years in data marts constructed using IBM software at several sites based in universities across the country. At the University of Portsmouth, UK, a historical data warehouse of American mining data is under construction using Oracle Warehouse Builder (Delve, Healey, & Fletcher, 2004). These projects give some idea of the scale of project a data warehouse can cope with that is, really large country-/state-wide problems. Following these examples, it would be possible to create a data warehouse to analyze all British censuses from 1841 to 1901 (approximately 108 bytes of data). Data from a variety of sources over time such as hearth tax, poor rates, trade directories, census, street directories, wills and inventories, GIS maps for a city such as Winchester could go into a city data warehouse. Such a project is under active consideration for Oslo, Norway. Similarly, a Voting data warehouse could contain voting data – poll book data and rate book data up to 1870 for the whole country. A Port data warehouse could contain all data from portbooks for all British ports together with yearly
trade figures. Similarly a Street directories data warehouse would contain data from this rich source for whole country for the last 100 years. Lastly, a Taxation data warehouse could afford an overview of taxation of different types, areas or periods. 19th century British census data does not fit into the typical data warehouse model as it does not have the numerical facts to go into a fact table, but with the advent of factless fact tables a data warehouse could now be made to house this data. The fact that some institutions have Oracle site licenses opens to way for humanities researchers with Oracle databases to use Oracle Warehouse Builder as part of the suite of programs available to them. These are practical project suggestions which would be impossible to construct using relational databases, but which, if achieved, could grant new insights into our history. Comparisons could be made between counties and cities and much broader analysis would be possible than has previously been the case.
CONCLUSION The advances made in business data warehousing are directly applicable to many areas of historical and linguistics research. Data warehouse dimensional modeling would allow historians and linguists to model vast amounts of data on a countrywide basis (or larger), incorporating data from existing databases and other external sources. Summary data could also be included, and this would all lead to a data warehouse containing more data than is currently possible, plus the fact that the data would be richer than in current databases due to the fact that normalization is no longer obligatory. Whole data sources could be captured, and more post-hoc analysis would result. Dimension tables particularly lend themselves to hierarchical modeling, so data would not need splitting into many tables thus forcing joins while querying. The time dimension particularly lends itself to historical research where significant difficulties have been encountered in the past. These suggestions for historical and linguistics research will undoubtedly resonate in other areas of humanities research, such as historical geography, and any literary or cultural studies involving textual analysis (for example biographies, literary criticism and dictionary compilation).
REFERENCES Begg, C., & Connolly, T. (2004). Database systems. Harlow: Addison-Wesley. Bliss, & Ritter (2001). IRCS (Institute for Research into Cognitive Science) Conference Proceedings. Re573
0
Humanities Data Warehousing
trieved from http://www.ldc.upenn.edu/annotation/database/proceedings.html Bradley, J. (1994). Relational database design. History and Computing, 6(2), 71-84. Breure, L. (1995). Interactive data entry. History and Computing, 7(1), 30-49. Brown. (2001). IRCS Conference Proceedings. Retrieved from http://www.ldc.upenn.edu/annotation/ database/proceedings.html Burt2, J., & James, T. B. (1996). Source-Oriented Data Processing: The triumph of the micro over the macro. History and Computing, 8(3), 160-169. Dalli. (2001). IRCS Conference Proceedings. Retrieved from http://www.ldc.upenn.edu/annotation/database/ proceedings.html Delve, J., Healey, R., & Fletcher, A. (2004, July). Teaching data warehousing as a standalone unit using Oracle Warehouse Builder. Proceedings of the British Computer Society TLAD Conference, Edinburgh, UK. Denley, P. (1994). Models, sources and users: Historical database design in the 1990s. History and Computing, 6(1), 33-43. Hudson, P. (2000). History by numbers. An introduction to quantitative approaches. London: Arnold. Inmon, W. H. (2002). Building the data warehouse. New York: Wiley. Kimball, R., & Ross, M. (2002). The data warehouse toolkit. New York: Wiley. Kretzschmar. (2001). IRCS Conference Proceedings. Retrieved from http://www.ldc.upenn.edu/annotation/ database/proceedings.html Nerbonne, J. (Ed.). (1998). Linguistic databases. Stanford, CA: CSLI Publications. SKICAT. Retrieved from http://www-aig.jpl.nasa.gov/ public/mls/home/fayyad/SKICAT-PR1-94.html Teich, Hansen, & Fankhauser. (2001). IRCS Conference Proceedings. Retrieved from http://www.ldc.upenn.edu/ annotation/database/proceedings.html
574
Wal-Mart. Retrieved from http://www.tgc.com/dsstar/99/ 0824/100966.html
KEY TERMS Data Modeling: The process of producing a model of a collection of data which encapsulates its semantics and hopefully its structure. Dimension Table: Dimension tables contain the textual descriptors of the business process being modeled and their depth and breadth define the analytical usefulness of the data warehouse. Dimensional Model: The dimensional model is the data model used in data warehouses and data marts, the most common being the star schema, comprising a fact table surrounded by dimension tables. Fact Table: “the primary table in a dimensional model where the numerical performance measurements of the business are stored” (Kimball, 2002). Factless Tact Table: A fact table which contains no measured facts (Kimball 2002). Normalization: The process developed by E.C. Codd whereby each attribute in each table of a relational database depend entirely on the key(s) of that table. As a result, relational databases comprise many tables, each containing data relating to one entity. Snowflake Schema: A dimensional model having some or all of the dimension tables normalized.
ENDNOTES 1
2
learning or literature concerned with human culture (Compact OED) Now Delve
575
Hyperbolic Space for Interactive Visualization
0
Jörg Andreas Walter University of Bielefeld, Germany
INTRODUCTION For many tasks of exploratory data analysis, visualization plays an important role. It is a key for efficient integration of human expertise — not only to include his background knowledge, intuition and creativity, but also his powerful pattern recognition and processing capabilities. The design goals for an optimal user interaction strongly depend on the given visualization task, but they certainly include an easy and intuitive navigation with strong support for the user’s orientation. Since most of available data display devices are twodimensional — paper and screens — the following problem must be solved: finding a meaningful spatial mapping of data onto the display area. One limiting factor is the “restricted neighborhood” around a point in a Euclidean 2-D surface. Hyperbolic spaces open an interesting loophole. The extraordinary property of exponential growth of neighborhood with increasing radius — around all points — enables us to build novel displays and browsers. We describe a versatile hyperbolic data viewer for building data landscapes in a nonEuclidean space, which is intuitively browseable with a very pleasing focus and context technique.
BACKGROUND The Hyperbolic Space H2 2,300 years ago, the Greek mathematician Euclid founded his geometry on five axioms. The fifth, the “parallel axiom,” appeared unnecessary to many of his colleagues. And they tried hard to prove it derivable — without success. After almost 2,000 years Lobachevsky (1793-1856), Bolyai (1802-1860), and Gauss (17771855) negated the axiom and independently discovered the non-Euclidean geometry. There exist only two geometries with constant non-zero curvature. Through our sight of common spherical surfaces (e.g., earth, orange) we are familiar with the spherical geometry and its constant positive curvature. Its counterpart with constant negative curvature is known as the hyperbolic plane H2 (with analogous generalizations to higher di-
mensions). Unfortunately, there is no “good” embedding of the H2 in R3, which makes it harder to grasp the unusual properties of the H2. Local patches resemble the situation at a saddle point, where the neighborhood grows faster than in flat space. Standard textbooks on Riemannian geometry (see, e.g., Morgan, 1993) examine the relationship and expose that the area a of a circle of radius r are given by a(r) = 4 π sinh2(r/2)
(1)
This bears two remarkable asymptotic properties: (i) for small radius r the space “looks flat” since a(r) = π r2. (ii) For larger r both grow exponentially with the radius. As observed in Lamping & Rao (1994, 1999) and Lamping et al. (1995), this property makes the hyperbolic space ideal for embedding hierarchical structures (since the number of leaves grows exponentially with the tree depth). This led to the development of the hyperbolic tree browser at Xerox Parc (today starviewer: see www.inxight.com). The question how effective is visualization and navigation in the hyperbolic space was studied by Pirolli et al. (2001). By conducting eye-tracker experiments they found that the focus+context navigation can significantly accelerate the “information foraging.” Risden et al. (2000) compared traditional and hyperbolic browsers and found significant improvement in task execution time for this novel display type.
MAIN THRUST In order to use the visualization potential of the H2 we must solve two problems: • •
(P1) How can we “accommodate” the data in the hyperbolic space, and (P2) how to project the H2 onto a suitable display?
In the following the answers are described in reverse sequence, first for the second problem (P2) and then three principal techniques for P1, which are today available for different data types.
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Hyperbolic Space for Interactive Visualization
Figure 1. Regular H2 tessellation with equilateral triangles (here 8 triangles meet at each vertex). Three snapshots of a simple focus transfer are visible. Note the circular appearance of lines in the Poincaré disk PD and the fish-eye-lens effect: the triangles in the center appear larger and take less space in regions further away. Poincare
Solution for P2: Poincaré Disk PD For practical and technological reasons most available displays are flat. The perfect projection into the flat display area should preserve length, area, and angles (=form). But it lays in the nature of a curved space to resist the attempt to simultaneously achieve these goals. Consequently several projections or maps of the hyperbolic space were developed, four are especially well examined: (i) the Minkowski, (ii) the upper-half plane, (iii) the KleinBeltrami, and (iv) the Poincaré or disk mapping. For our purpose the latter is particularly suitable. Its main characteristics are:
• • •
•
•
576
Poincare
Poincare
Display Compatibility: The infinite large area of the H2 is mapped entirely into a circle, the Poincaré disk PD. Circle Rim is Infinity ∞: All remote points are close to the rim, without touching it. Focus+Context: The focus can be moved to each location in H2, like a “fovea.” The zooming factor is 0.5 in the center and falls (exponentially) off with distance to the fovea. Therefore, the context appears very natural. As more remote things are, the less spatial representation is assigned in the current display (compare Figure 1). Lines Become Circles: All H2-lines appear as circle arc segments of centered straight lines in PD (both belong to the set of so-called “generalized circles”). There extensions cross the PD-rim always perpendicular on both ends. Conformal Mapping: Angles (and therefore form) relations are preserved in PD, area and length relations obviously not.
•
•
Regular Tessellations with triangles offer richer possibilities than the R2. It turns out that there is an infinite set of choices to tessellate H2: for any ≥ 7, one can construct a regular tessellainteger n≥ tions in which n triangles meet at each vertex (in contrast to the plane with allows only n=3,4,6 and the sphere only n=3,4,5). Figure 1 depicts an examples for n=8. Moving Around and Changing the Focus: For changing the focus point in PD we need a translation operation, which can be bound to mouse click and drag events. In the Poincaré disk model the Möbius transformation T(z) is the appropriate solution. By describing the Poincaré disk PD as the unit circle in the complex plane, the isometric transformations for a point z⊂PD can be written as z’ = T(z; c, θ) = (θ z+c)/(c* θz+1), with | θ|=1, |c|<1. (2)
Here the complex number θ describes a pure rotation of PD around the origin 0 (the star * denotes complex conjugation). The following translation by c maps the origin to c and -c becomes the new center 0 (if θ=1). The Möbius transformations are also called the “circle automorphies” of the complex plane, since they describe the transformations from circles to (generalized) circles. Here they serve to translate H2 straight lines to lines — both appearing as generalized circles in the PD projection. For further details, see for example, Lamping & Rao (1999) or Walter (2004).
Three Layout Techniques in H2 Now we turn to the question raised earlier: how to accommodate data in the hyperbolic space. In the following
Hyperbolic Space for Interactive Visualization
section three known layout techniques for the H2 are shortly reviewed.
Hyperbolic Tree Layout (HTL) for TreeLike Graph Data A first solution to this question for the case of acyclic, treelike graph data in H2 was provided by Lamping & Rao (1994, 1999). Each tree node receives a certain open space “pie segment,” where the node chooses the locations of its siblings. For all its siblings i it calls recursively the layoutroutine after applying the Möbius transformation in order to center i. Tamara Munzner developed another graph layout algorithm for the three-dimensional hyperbolic space (1997). While she gains much more space for the layout, the problem of more complex navigation (and viewport control) in 3-D and, more serious, the problem of occlusion appears. The next two layout techniques are freed from the requirement of hierarchical data.
Hyperbolic Self-Organizing Map (HSOM) The standard Self-Organizing Map (SOM) algorithm is used in many applications for learning and visualization (Kohonen, 2001). Figure 2 illustrates the basic operation. The feature map is built by a lattice of nodes (or formal neurons) a⊂ A, each with a reference vector or “prototype vector” wa attached, projecting into the input space X. The response of a SOM to an input vector x is determined by the reference vector of wa of the discrete “best-matching unit” (BMU) a BMU, that is, the node which
Figure 2. The Hyperbolic Self-Organizing Map is formed by a grid of processing units, called neurons. Here the usual rectangular grid is replaced by a part of a tessellation grid seen in Figure 1.
a*
x w a*
Grid of Neurons a Input Space X
has its prototype vector wa closest to the given input a = argmin a| wa - x |. The distribution of the reference vectors wa, is iteratively adapted by a sequence of training vectors x. After finding the aBMU all reference vectors are updated towards the stimulus x: wa new:= wa old +ε h(d a, aBMU)(x - wa). Here h(.) is a bell-shaped Gaussian centered at the BMU and decaying with increasing distance d a, aBMU =|ga - g aBMU| in the node ga. Thus, each node or “neuron” in the neighborhood of the aBMU participates in the current learning step (as indicated by the gray shading in Figure 2). This neighborhood cooperation in the adaptation algorithm has important advantages: (i) it is able to generate topological order between the wa, which means that similar inputs are mapped to neighboring nodes; (ii) As a result, the convergence of the algorithm can be sped up by involving a whole group of neighboring neurons in each learning step. The structure of this neighborhood is essentially governed by the structure of h(a, aBMU) = h(d a, aBMU) — therefore also called the neighborhood function. While most learning and visualization applications choose d a, aBMU as distances in an rectangular (2-D, 3-D) lattice this can be generalized to the non-Euclidean case as suggested by Ritter (1999). The core idea of the Hyperbolic Self-Organizing Map (HSOM) is to employ an H2grid of nodes. A particular convenient choice is to take the ga in PD of a finite patch of the triangular tessellation grid as displayed in Figure 2. The internode distance is computed in the appropriate Poincaré metric (see equation 4). BMU
∀
Hyperbolic Multidimensional Scaling (HMDS) Multidimensional scaling refers to a class of algorithms for finding a suitable embedding of N objects in a low dimensional — usually Euclidean — space given only pairwise proximity values (in the following we represent proximity as dissimilarity or distance values between pairs of objects, mathematically written as dissimilarity δij≥ 0 between the i and j item). Often the raw dissimilarity distribution is not suitable for the low-dimensional embedding and an additional preprocessing step is applied. We model it here as a monotonic transformation D(.) of dissimilarities δij into disparities Dij = D(δij). The goal of the MDS algorithm is to find a spatial representation xi of each object i in the M-dimensional space, where the pair distances d ij = d(xi , xj) (in the target space) match the disparities Dij as faithfully as possible ∀ i ≠ j: Dij ≅δij. One of the most widely known MDS algorithms was introduced by Sammon (1969). He formulates a mini577
H
Hyperbolic Space for Interactive Visualization
mization problem of a cost function which sums over the squares of disparities-distance misfits E({xi}) = Σi=1N Σj>i wi j (δij - Dij)2.
(3)
The factors wi j are introduced to weight the disparities individually and also to normalize the cost function E() to be independent to the absolute scale of the disparities Dij. The set of xi is found by a gradient descent procedure, minimizing iteratively the cost or stress function [see Sammon (1969) or Cox (1994) for further details on this and other MDS algorithms]. The recently introduced Hyperbolic Multi-Dimensional Scaling (HMDS) combines the concept of MDS and hyperbolic geometry (Walter & Ritter, 2002). Instead of finding a MDS solution in the low-dimensional Euclidean RM and transferring it to the H2 (which can not work well), the MDS formalism operates in the hyperbolic space from the beginning. The key is to replace the Euclidean distance in the target space by the appropriate distance metric for the Poincaré model. dij = 2 arctanh [ | xi - xj | / |1- x i xj* | ] with xi , xj ⊂ PD; (4) While the gradients ∂δij,q/∂ xi, q required for the gradient descent are rather simple to compute for the Euclidean geometry, the case becomes complex for HMDS, see Walter (2002, 2004) for details.
•
Disparity preprocessing: Due to the non-linearity of the distance function above, the preprocessing function D(.) has more influence in H2. Consider, for example, linear rescaling of the dissimilarities Dij = α dij. In the Euclidean case the visual structure is not affected — only magnified by α. In contrast in H2, a scales the distribution and with it the amount of curvature felt by the data. The optimal α depends on the given task, the dataset, and its dissimilarity structure. One way is to set α manually and choose a compromise between vis-
•
ibility of the entire structure and space for navigation in the detail-rich areas. Comparison: The following table compares the main properties of the three available layout techniques.
All three techniques share the concept of spatial proximity representing similarities in the data. In HTL close relatives are directly connected. Skupin (2002) pointed out that we humans learn early the usage and the versatile concepts of maps and suggested to build displays implementing this underlying map metaphor. The HSOM can handle many objects and can generate topic maps, while the HMDS is ideal to smaller sets, presented on the object level (see below Figures 4 & 5).
Application Examples Even though the look and feel of an interactive visualization and navigation is hardly compressible to paper format, we present some application screenshots of visualization experiments. Figure 3 displays an application of browsing image collections with an HMDS. Here the direct image features based on color are employed to define the concept of similarity. Figures 4 and 5 depict the hybrid combination of HSOM and HMDS for an application from the field of text mining of newsfeed articles. The similarity concept is based on semantic information gained via a standard bag-of-words model of the unstructured text, here Reuters news articles [see Walter et al. (2003) for further details].
FUTURE TRENDS A closer look at the comparative table above suggests a hybrid architecture of techniques. This includes the combination of HSOM and HMDS for a two-stage navigation and retrieval process. This embraces a coarse grain theme map (e.g., with the HSOM) and a detailed map using
Table 1. HTL Data type Scaling #objects N Layout New object Result
578
Acyclic graph data O( N ) Recursive space partitioning Partial re-layout Good for tree data
HSOM
HMDS
Dissimilarity, distance Vectorial data data O( N ) O( N 2 ) Spatial arrangement Assignment to grid of neurons with topographic resembles similarity structure of the data ordering Map to BMU (possibly Iterate cost minimization adapt) procedure Good for up to a few Handles large data collections; coarse topic / hundred object; detail level map theme maps
Hyperbolic Space for Interactive Visualization
Figure 3. Snapshot of an Hyperbolic Image Viewer using a color distance metric for pairwise dissimilarity definition for 100 pictures. Note the dynamic zooming of images is dependent on the distance to the current focus point.
Figure 4. A HSOM projection of a large collection of newswire articles (“Reuter-21578”) forms semantically related category clusters as seen by the glyphs indicating the dominant out-of-band category label of the news objects gathered by the HSOM in each node. The similarity of objects is here derived from the angle of the feature vectors in the “bag-of-word” or “vector space” model of unstructured text [standard model in information retrieval, after Salton (1988)]
Figure 5. Screenshot of the HMDS visualizing all documents in the node previously selected in the HSOM (direction 5’o clock). The news headlines where enabled for the articles cluster now in focus They reveal that the object group are semantically close and all related to a strike in the oilseed company Cargill U.K.
HMDS, as suggested in the Hyperbolic Hybrid Data Viewer prototype in Walter et al. (2003). The HMDS allows to support different notions of similarity and furthermore to dynamically modulate the actual distance metric while observing the change in the spatial arrangement of the object. This feature is valuable in various multi-media applications with various natural and task-depended notions of similarity. For example, when browsing a collection of images, the similarity of objects can be based on textual description, metadata, or image features, for example, using color, shape, and texture. Due to the general data type dissimilarity the dynamic modulation of the distance metric is possible and can be controlled via the GUI or other techniques like relevance feedback — technique proved successful in the field of information retrieval.
H2-MDS
Earn Acquisition Money-FX Crude Grain Trade Interest Wheat Ship Corn Other
CARGILL_U.K._STRIKE_TALKS_POSTPONED-486 CARGILL_U.K._STRIKE_TALKS_POSTPONED_TILL_MONDAY-1966 CARGILL_STRIKE_TALKS_CONTINUING_TODAY-5833 CARGILL_U.K._STRIKE_TALKS_TO_RESUME_THIS_AFTERNOO CARGILL_U.K._STRIKE_TALKS_BREAK_OFF_WITHOUT_RESULT-12425 CARGILL_U.K._STRIKE_TALKS_TO_RESUME_TUESDAYCARGILL_U.K._STRIKE_TALKS_TO_CONTINUE_MONDAY-3710
CONCLUSION Information visualization can benefit from the use of the hyperbolic space as a projection manifold. On the one hand it gains the exponentially growing space around each point, which provides extra space for compressing object relationships. On the other hand the Poincaré model offers superb visualization and navigation properties, which were found to yield significant improvement in task time compared to traditional browsing methods 579
0
Hyperbolic Space for Interactive Visualization
(Risden et al., 2000; Pirolli et al., 2001). By simple mouse interaction the focus can be transferred to any location of interest. The core area close to the center of the Poincaré disk magnifies the data with a maximal zoom factor and decreases gradually to the outer area. The object placeholder (text box, image thumbnails, etc.) is scaled in proportion. The fovea is an area with high resolution, while remote areas are gradually compressed but are still visible as context. Interestingly, this situation resembles the logpolar density distribution of neurons in the retina, which governs the natural resolution allocation in our visual perception system.
REFERENCES Cox, T., & Cox, M. (1994). Multidimensional scaling. In Monographs on statistics and appied probability. Chapman & Hall. Kohonen, T. (2001). Self-organizing maps. Berlin: Springer. Lamping, J., & Rao, R. (1994). Laying out and visualizing large trees using a hyperbolic space. In ACM Symp User Interface Software and Technology (pp. 13-14). Lamping, J., & Rao, R. (1999). The hyperbolic browser: A Focus+Context technique for visualizing large hierarchies. In Readings in Information Visualization (pp. 382-408). Morgan Kaufmann. Lamping, J., Rao, R., & Pirolli, P. (1995). A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In ACM SIGCHI Conf Human Factors in Computing Systems (pp. 401-408). Morgan, F. (1993). Riemannian geometry: A beginner’s guide. Jones and Bartlett Publishers. Munzner, T. (1997). H3: Laying out large directed graphs in 3d hyperbolic space. In Proceedings of IEEE Symposium on Information Visualization (pp. 2-10). Pirolli, P., Card, S.K., & Van Der Wege. (2001). Visual information foraging in a focus + context visualization. In CHI (pp. 506-513).
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 5(24), 513-523. Sammon, J. (1969). A non-linear mapping for data structure analysis. IEEE Transactions Computers, 18, 401-409. Skupin, A. (2002). A cartographic approach to visualizing conference abstracts. In IEEE Computer Graphics and Applications (pp. 50-58). Walter, J. (2004). H-MDS: A new approach for interactive visualization with multidimensional scaling in the hyperbolic space. Information Systems, 29(4), 273-292. Walter, J., Ontrup, J., Wessling, D., & Ritter, H. (2003). Interactive visualization and navigation in large data collections using the hyperbolic space. In IEEE International Conference on Data Mining (ICDM’03) (pp. 355362). Walter, J., & Ritter, H. (2002). On interactive visualization of high-dimensional data using the hyperbolic plane. In ACM SIGKDD International Conference of Knowledge Discovery and Data Mining (pp. 123-131).
KEY TERMS Focus+Context Technique: Allows to interactively transfer the focus as desired while the context of the region in focus remains in view with gradually degrading resolution to the rim. In contrast allows the standard zoom+panning technique to select the magnification factor, trading detail and overview and involving harsh view cut-offs. H2: Two-dimensional hyperbolic space (plane). H2DV: The Hybrid Hyperbolic Data Viewer incorporates a two stage browsing and navigation using the HSOM for a coarse thematic mapping of large object collections and the HMDS for detailed inspection of smaller subsets in the object level. HMDS: Hyperbolic Multi-Dimensional Scaling for laying out objects in the H2 such that the spacial arrangement resembles the dissimilarity structure of the data as close as possible.
Risden, K., Czerwinski, M., Munzner, T., & Cook, D. (2000). An initial examination of ease of use for 2D and 3D information visualizations of Web content. International Journal of Human Computer Studies, 53(5), 695714.
HSOM: Hyperbolic Self-Organizing Mapping. Extension of Kohonen’s topographic map, which offers the exponential growth of neighborhood for the inner nodes.
Ritter, H. (1999). Self-organizing maps on non-Euclidean spaces. In Kohonen Maps (pp. 97-110). Elsevier.
Hyperbolic Geometry: Geometry with constant negative curvature in contrast to the flat Euclidean geometry.
580
Hyperbolic Space for Interactive Visualization
The space around each point grows exponentially and allows better layout results than the corresponding flat space.
0
Hyperbolic Tree Viewer: Visualizes tree-like graphs in the Poincaré mapping of the H2 or H3. Poincaré Mapping Of The H2: A conformal mapping from the entire H2 into the unit circle in the flat R2 (disk model)
581
582
Identifying Single Clusters in Large Data Sets Frank Klawonn University of Applied Sciences Braunschweig/Wolfenbuettel, Germany Olga Georgieva Institute of Control and System Research, Bulgaria
INTRODUCTION Most clustering methods have to face the problem of characterizing good clusters among noise data. The arbitrary noise points that just do not belong to any class being searched for are of a real concern. The outliers or noise data points are data that severely deviate from the pattern set by the majority of the data, and rounding and grouping errors result from the inherent inaccuracy in the collection and recording of data. In fact, a single outlier can completely spoil the least squares (LS) estimate and thus the results of most LS based clustering techniques such as the hard C-means (HCM) and the fuzzy C-means algorithm (FCM) (Bezdek, 1999). For these reasons, a family of robust clustering techniques has emerged. There are two major families of robust clustering methods. The first includes techniques that are directly based on robust statistics. The second family, assuming a known number of clusters, is based on modifying the objective function of FCM in order to make the parameter estimates more resistant to the data noise. Among them one promising approach is the noise clustering (NC) technique (Dave, 1991; Klawonn, 2004). It maintains the principle of probabilistic clustering, but an additional noise cluster is introduced. NC was developed and investigated in the context of a variety of objective function-based clustering algorithms and it has demonstrated its reliable ability to detect clusters amongst noise data.
BACKGROUND Objective function-based clustering aims at minimizing an objective function that indicates a kind of fitting error of the determined clusters to the given data. In this objective function, the number of clusters has to be fixed in advance. However, as the number of clusters is usually unknown, an additional scheme has to be applied to determine the number of clusters (Guo, 2002; Tao, 2002). The parameters to be optimized are the membership degrees that are values of belonging of each data
point to every cluster, and the parameters, characterizing the cluster, which finally determine the distance values. In the simplest case, a single vector named cluster centre (prototype) represents each cluster. The distance of a data point to a cluster is simply the Euclidean distance between the cluster centre and the corresponding data point. More generally, one can use the squared innerproduct distance norm, in which by a norm inducing symmetric and positive matrix different variances in the directions of the coordinate axes of the data space are accounted for. If the norm inducing matrix is the identity matrix we obtain the standard Euclidean distance that form spherical clusters. Clustering approaches that use more complex cluster prototypes than only the cluster centres, leading to adaptive distance measures, are for instance the Gustafson-Kessel (GK) algorithm (Gustafson, 1979), the volume adaptation strategy (Höppner, 1999; Keller, 2003) and the Gath-Geva (GG) algorithm (Gath, 1989). The latter one is not a proper objective function algorithm, but corresponds to a fuzzified expectation maximization strategy. No matter which kind cluster prototype is used, the assignment of the data to the clusters is based on the corresponding distance measure. In hard clustering, a data object is assigned to the closest cluster, whereas in fuzzy clustering a membership degree, that is, a value that belongs to the interval [0,1] is computed. The highest membership degree of a data corresponds to the closest cluster. Noise clustering has a benefit of the collection of the noise points in one single cluster. A virtual noise prototype with no parameters to be adjusted is introduced that has always the same distance to all points in the data set. The remaining clusters are assumed to be the good clusters in the data set. The objective function that considers the noise cluster is defined in the same manner as the general scheme for the clustering minimization functional. The main problem of NC is the proper choice of the noise distance. If it is set too small, then most of the points will get classified as noise points, while for a large noise distance most of the points will be classified into clusters other than the noise cluster. A right selection of the distance will result in a classification where the points that are actu-
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Identifying Single Clusters in Large Data Sets
ally close enough to the good clusters will get correctly classified into a good cluster, while the points that are away from the good clusters will get classified into the noise cluster. The detection of noise and outliers is a serious problem and has been addressed in various approaches (Leski, 2003; Yang, 2004; Zhang, 2003). However, all these methods need complex additional computations.
MAIN THRUST The purpose of most clustering algorithms is to partition a given data set into clusters. However, in data mining tasks partitioning is not always the main goal. Finding interesting patterns or substructures in a large data set in the form of one or a few clusters that do not cover or partition the data set completely is an important issue in data mining. For this purpose a new clustering algorithm named noise clustering with one good cluster, based on the noise clustering technique and able to detect single clusters step-by-step in a given data set, has been recently developed (Georgieva, 2004). In addition to identifying clusters step by step, as a side-effect noise data are detected automatically. The algorithm assesses the dynamics of the number of points that are assigned to only one good cluster of the data set by slightly decreasing the noise distance. Starting with some large enough noise distance it is decreased with a prescribed decrement till the reasonable smallest distance is reached. The number of data belonging to the good cluster is calculated for every noise distance using the formula for the hard membership values or fuzzy membership values, respectively. Note that in this scheme only one cluster centre has to be computed, which in the case of hard noise clustering is the mean value of the good cluster data points and in the case of fuzzy noise clustering is the weighted average of all data points. It is obvious that by decreasing the noise distance a process of “loosing” data, that is, separating them to the noise cluster, will begin. Continuing to decrease the noise distance, we will start to separate points from the good cluster and add them to the noise cluster. A further reduction of the noise distance will lead to a decreasing amount of data in the good cluster until the cluster will be entirely empty, as all data will be assigned to the noise cluster. The described dynamics can be illustrated in a curve viewing the number of data points assigned to the good cluster over the noise distance. In this curve a plateau will indicate that we are in a phase of assigning proper noise data to the noise cluster, whereas a strong slope means that we actually loose data belonging to the good cluster to the noise cluster.
Generally, a number of clusters with different shapes and densities exist in large data sets and thus a complicated dynamics of the data assigned to the single cluster will be observed. However, the smooth part of the considered curve corresponds to the situation that a relatively small amount of data is removed, which is usually caused by loosing noise data. When we loose data from a good cluster (with higher density than the noise data), a small decrease of the noise distance will lead to a large amount of data we loose to the noise cluster, so that we will see a strong slope in our curve instead of a plateau. Thus, a strong slope indicates that at least one cluster is just removed and separated from the (single) good cluster we try to find. By this way the algorithm determines the number of clusters and detects noise data. It does not depend on the initialisation, so that the danger of converging into local optima is further reduced compared to standard fuzzy clustering (Höppner, 2003). The described procedure is implemented in a cluster identification algorithm that assesses the dynamics of the quantity of the points assigned to the good cluster or equivalently assigned to the noise cluster through the slight decrease of the noise distance. By detecting the strong slopes the algorithm separates one cluster at every algorithm pass. A significant reduction of the noise is achieved even in the first algorithm pass. The clustering procedure is repeated by proceeding with a smaller data set as the original one is reduced by the identified noise data and data belonging to the already identified cluster(s). The curve that is determined by the dynamics of the data assigned to the noise cluster is smoother in case of fuzzy noise clustering compared to the hard clustering case due to the fuzzily defined membership values. Also local minima of this curve could be observed due to the given freedom of the points to belong to both good and noise cluster simultaneously. Fuzzy clustering can deal better with complex data sets than hard clustering due to the given relative degree of membership of a point to the good cluster. However, for the same reason the amount of the identified noise points is less than in the hard clustering case.
FUTURE TRENDS Whereas the standard clustering partitions the whole data set, the main goal of the noise clustering with one good cluster is to identify single clusters even in the case when a large part of the data does not have any kind of group structure at all. This will have a large benefit in some application areas of cluster analysis like, for
583
I
Identifying Single Clusters in Large Data Sets
instance, gene expression data and astrophysics data analysis, where the ultimate goal is not to partition the data, but to find some well-defined clusters that only cover a small fraction of the data. By removing a large amount of the noise data the obtained clusters are used to find some interesting substructures in the data set. One future improvement will lead to an extended procedure that first finds the location of the centres of the interesting clusters using the standard Euclidean distance. Then, the algorithm could be started again with this cluster centres but using more sophisticated distance measures as GK or volume adaptation strategy that can better adapt to the shape of the identified cluster. For extremely large data sets the algorithm can be combined with speed-up techniques as the one proposed by Höppner (2002). Another further application consists in incorporation of the proposed algorithm as an initialisation step for other more sophisticated clustering algorithm providing the number of clusters and their approximate location.
CONCLUSION A clustering algorithm, based on the principles of noise clustering and able to find good clusters with variable shape and density step-by-step, is proposed. It can be applied to find just a few substructures, but also as an iterative method to partition a data set including the identification of the number of clusters and noise data. The algorithm is applicable in terms of both hard and fuzzy clustering. It automatically determines the number of clusters, while in the standard objective functionbased clustering additional strategies have to be applied in order to define the number of clusters. The identification algorithm is independent of the dimensionality of the considered problem. It does not require comprehensive or costly computations and the computation time increases linearly only with the number of data points.
REFERENCES Bezdek, J.C., Keller, J., Krisnapuram, R., & Pal, N.R. (1999). Fuzzy models and algorithms for pattern recognition and image processing. Kluwer Academic Publishers. Dave, R.N. (1991). Characterization and detection of noise in clustering. Pattern Recognition Letters, 12, 657-664.
584
Gath, I., & Geva, A.B. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 773-781. Georgieva, O., & Klawonn, F. (2004). A clustering algorithm for indentification of single clusters in large datat sets. In Proceedings of East West Fuzzy Colloquium, 81(pp. 118-125), Sept. 8-10, Zittau, Germany. Guo, P., Chen, C.L.P., & Lyu, M.R. (2002). Cluster number selection for a small set of samples using the Bayesian Ying-Yang model. IEEE Transactions on Neural Networks, 13, 757-763. Gustafson, D., & Kessel, W. (1979). Fuzzy clustering with a fuzzy covariance matrix. In Advances in fuzzy set theory and applications (pp. 605-620). North-Holland. Höppner, F. (2002). Speeding up fuzzy c-means: Using a hierarchical data organisation to control the precision of membership calculation. Fuzzy Sets and Systems, 12 (3), 365-378. Höppner, F., & Klawonn, F. (2003). A contribution to convergence theory of fuzzy c-means and derivatives. IEEE Transactions on Fuzzy Systems, 11, 682-694. Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis. Chichester: John Wiley & Sons. Keller, A., & Klawonn, F. (2003). Adaptation of cluster sizes in objective function based fuzzy clustering. In C.T. Leondes (ed.), Intelligent Systems: Technology and Applications IV. Database and Learning Systems (pp. 181-199). Boca Raton, FL: CRC Press. Klawonn, F. (2004). Noise clustering with a fixed fraction of noise. In A. Lotfi & J.M. Garibaldi (Eds.), Applications and Science in Soft Computing (pp. 133-138). Berlin: Springer. Leski, J.M. (2003). Generalized weighted conditional fuzzy clustering. IEEE Transactions on Fuzzy Systems, 11, 709-715. Tao, C.W. (2002). Unsupervised fuzzy clustering with multi-center clusters. Fuzzy Sets and Systems, 128, 305-322. Yang, T.-N., & Wang, S.-D. (2004). Competitive algorithms for the clustering of noisy data. Fuzzy Sets and Systems, 141, 281-299. Zhang, J.-S., & Leung, Y.-W. (2003). Robust clustering by pruning outliers. IEEE Transactions on Systems, Man and Cybernetics – Part B, 33, 983-999.
Identifying Single Clusters in Large Data Sets
KEY TERMS
Hard Clustering: Cluster analysis where each data object is assigned to a unique cluster.
Cluster Analysis: Partition a given data set into clusters where data assigned to the same cluster should be similar, whereas data from different clusters should be dissimilar.
Noise Clustering: An additional noise cluster is induced in objective function-based clustering to collect the noise data or outliers. All data objects are assumed to have a fixed (large) distance to the noise cluster, so that only data far away from all other clusters will be assigned to the cluster.
Cluster Centres (Prototypes): Clusters in objective function-based clustering are represented by prototypes that define how the distance of a data object to the corresponding cluster is computed. In the simplest case a single vector represents the cluster, and the distance to the cluster is the Euclidean distance between cluster centre and data object. Fuzzy Clustering: Cluster analysis where a data object can have membership degrees to different clusters. Usually it is assumed that the membership degrees of a data object to all clusters sum up to one, so that a membership degree can also be interpreted as the probability the data object belongs to the corresponding cluster.
Objective Function-Based Clustering: Cluster analysis is carried on minimizing an objective function that indicates a fitting error of the clusters to the data. Robust Clustering: Refers to clustering techniques that behave robust with respect to noise, i.e.that is, adding some noise to the data in the form changing the values of the data objects slightly as well as adding some outliers will not drastically influence the clustering result.
585
I
586
Immersive Image Mining in Cardiology Xiaoqiang Liu Delft University of Technology, The Netherlands and Donghua University, China Henk Koppelaar Delft University of Technology, The Netherlands Ronald Hamers Erasmus Medical Thorax Center, The Netherlands Nico Bruining Erasmus Medical Thorax Center, The Netherlands
INTRODUCTION Buried within the human body, the heart prohibits direct inspection, so most knowledge about heart failure is obtained by autopsy (in hindsight). Live immersive inspection within the human heart requires advanced data acquisition, image mining and virtual reality techniques. Computational sciences are being exploited as means to investigate biomedical processes in cardiology. IntraVascular UltraSound (IVUS) has become a clinical tool in recent several years. In this immersive data acquisition procedure, voluminous separated slice images are taken by a camera, which is pulled back in the coronary artery. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the image databases (Hsu, Lee, & Zhang, 2002). Human medical data are among the most rewarding and difficult of all biological data to mine and analyze, which has the uniqueness of heterogeneity and are privacy-sensitive (Cios & Moore, 2002). The goals of immersive IVUS image mining are providing medical quantitative measurements, qualitative assessment, and cardiac knowledge discovery to serve clinical needs on diagnostics, therapies, and safety level, cost and risk effectiveness etc.
tricles to pump blood and lead to overt failure. To diagnose the many possible anomalies and heart diseases is difficult because physicians can’t literally see in the human heart. Various data acquisition techniques have been invented to partly remedy the lack of sight: noninvasive inspection including CT (Computered Tomography), Angiography, MRI (Magnetic Resonance Imaging), ECG signals etc. These techniques do not take into account crucial features of lesion physiology and vascular remodeling to really mine blood-plaque. IVUS, a minimal-invasive technique, in which a camera is pulled back inside the artery, and the resulting immersive tomographic images are used to remodel the vessel. This remodeling vessel and its virtual reality (VR) aspect offer interesting future alternatives for mining these data to unearth anomalies and diseases in the moving heart and coronary vessels at earlier stage. It also serves in clinical trials to evaluate results of novel interventional techniques, e.g. local kill by heating cancerous cells via an electrical current through a piezoelectric transducer as well as local nano-technology pharmaceutical treatments. Figure 1 explains some aspects of IVUS technology. However, IVUS images are more complicated than medical data in general since they suffer from some artifacts during immersed data acquisition (Mintz et al., 2001): •
BACKGROUND Heart disease is the leading cause of death in industrialized nations and is characterized by diverse cellular abnormalities associated with decreased ventricular function. At the onset of many forms of heart disease, cardiac hypertrophy and ventricular changes in wall thickness or chamber volume occur as a compensatory response to maintain cardiac output. These changes eventually lead to greater vascular resistance, chamber dilation, wall fibrosis, which ultimately impair the ability of the ven-
• • •
Non-uniform rotational distortion and motion artifacts. Ring-down, blood speckle, and near field artifacts. Obliquity, eccentricity, and problems of vessel curvature. Problems of spatial orientation.
The second type of artifacts is treated by image processing and therefore falls outside the scope of this paper. The pumping heart, respiring lungs and moving immersed catheter camera cause the other three types of artifacts. These cause distortion on longitudinal position, x-y po-
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Immersive Image Mining in Cardiology
Figure 1. IVUS immersive data acquisition, measurements and remodeling
I
A: IVUS catheter/endosonics and the wire. B: angiograophy of a contrast-filled section of coronary artery. C: two IVUS cross-sectional images and some quantitative measurements on them. D: Virtual L-View of the vessel reconstruction. E: Virtual 3D-Impression of the vessel reconstruction.
sition, and spatial orientation of the slices on the coronary vessel. For example, it has been reported that more than 5 mm of longitudinal catheter motion relative to the vessel may occur during one cardiac cycle (Winter et al., 2004), when the catheter was pulled back at 0.5 mm/sec and the non-gated samples were stored on S-VHS videotape at a rate of 25 images/sec. Figure 2 explains the longitudinal displacement caused by cardiac cycles during a camera pullback in a segment of coronary artery. The catheter
position equals to the sum of the pullback distance and the longitudinal catheter displacement. In F, the absolute catheter positions of solid dots are in disorder, which will cause a disordered sequence of camera images. The consecutive image samples selected in relation to the positions of the catheter relative to the coronary vessel wall are highlighted in G. In conclusion, these samples used for analysis are anatomically dispersed in space (III, I, V, II, IV, and VI).
Figure 2. Trajectory position anomalies of the invasive camera
587
Immersive Image Mining in Cardiology
The data of IVUS sources are voluminous separated slices with artifacts, and the quantitative measurements, the physicians’ interpretations are also essential components to remodel the vessel, but the accompanying mathematical models are poorly characterized. Mining in these datasets is more difficult and it inspires interdisciplinary work in medicine, physics and computer science.
MAIN THRUST

Immersive Image Mining Processes in Cardiology

IVUS image mining includes a series of complicated mining procedures, owing to the complex immersive data: data reconciliation, image mining, remodeling and VR display, and knowledge discovery. Figure 3 shows the function-driven processes. Data reconciliation compensates for or decreases artifacts to improve the data, usually by fusion with non-invasive cardiology inspection data such as angiography and ECG. This data mining extracts features from the individual datasets for use in data reconciliation. Cardiac computation is then carried out on the reconciled individual dataset. Quantitative measurement calculates features such as lumen, atheroma, calcium, and stent on every slice. Vessel remodeling based on these measurements forms a straight 3D volume, and fusion with the pullback path determined from angiography yields an accurate spatio-temporal (4D) model of the coronary vasculature.
This 4D model is used for further cardiac computation to obtain VR displays and volume measurements of the vessel. IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. However, based on the quantitative measurements and the VR display, combined with cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions, and complications after intervention may be obtained semi-automatically or manually by the physicians. These quantitative measurements and qualitative assessments may be organized and stored in a database or data warehouse. Statistics-based mining on colony (population) data and mining on individual history data lead to knowledge discovery about heart diseases.
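The function-driven process just described can be summarized as a simple pipeline skeleton. The sketch below is only an illustration of the stage ordering; every function name, signature, and data structure in it is an assumption made for this article rather than an existing system's API, and the bodies are placeholders that only document the intended data flow.

```python
# A minimal, function-driven skeleton of the IVUS mining stages described above.

def reconcile(ivus_frames, angiography, ecg):
    """Compensate for or reduce artifacts, fusing non-invasive data (gating, fusion)."""
    return ivus_frames  # placeholder: artifact-reduced, gated slices

def quantify(slices):
    """Per-slice quantitative measurements (lumen, atheroma, calcium, stent)."""
    return [{"lumen_area": None, "eem_area": None} for _ in slices]  # placeholder

def remodel(slices, pullback_path):
    """Stack slices into a straight 3D volume and fuse it with the angiographic
    pullback path to obtain a spatio-temporal (4D) model of the vasculature."""
    return {"model_4d": (slices, pullback_path)}  # placeholder

def assess(model_4d, measurements, cardiac_knowledge):
    """Semi-automatic qualitative assessment (atheroma morphology, lesions, ...)."""
    return []  # placeholder: list of qualitative assessments

def discover(measurements, assessments, history_db):
    """Statistics-based mining on population and individual history data."""
    return []  # placeholder: discovered patterns / knowledge

def mine_ivus(ivus_frames, angiography, ecg, pullback_path, cardiac_knowledge, history_db):
    slices = reconcile(ivus_frames, angiography, ecg)
    measurements = quantify(slices)
    model_4d = remodel(slices, pullback_path)
    assessments = assess(model_4d, measurements, cardiac_knowledge)
    return discover(measurements, assessments, history_db)
```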
Individual Data Mining to Reconcile the Data

Data reconciliation is the basis of, and the most important step toward, cardiac computation, and more effective methods to attack the data artifacts are still needed. Three types of methods are applied: parsimonious data acquisition, data fusion of invasive and non-invasive data, and hybrid motion compensation.
Figure 3. Immersive IVUS image mining. The process blocks are: immersive image acquisition and non-invasive data acquisition; data reconciliation; cardiac computation; vessel remodeling and visual display (L-Mode, 3D reconstruction, picture-in-picture); quantitative measurements (border identification, lumen measurements, calcium measurements); qualitative assessments (atheroma morphology, unstable lesions, ruptured plaque, complications after intervention); and knowledge discovery, supported by cardiac and expert knowledge.

• Parsimonious Phase-Dependent Data Acquisition: Cardiac knowledge dictates systolic/diastolic timing features. It has been suggested to select IVUS images recorded in the end-diastolic phase of the cardiac cycle, in which the heart is motionless and
blood flow has ceased, so that their influence on the catheter can be neglected. Online ECG-gated pullback has been used to acquire phase-dependent data, but the technology is expensive and prolongs the acquisition procedure. Instead, a retrospective image-based gating mining method has been studied (Winter et al., 2004; Zhu, Oakeson, & Friedman, 2003). In this method, different features are mined from the IVUS images sampled over time by transforming the data with spectral analysis, in order to discover the most prominent repetition frequencies in the appearance of these image features. From this mining, the images near the end-diastolic phases can be inferred (a spectral-analysis sketch follows this list). The selection of images is parsimonious: only about 5% of the dataset is selected, and about 10% of the selections are mispositioned.
• Data Fusion: The moving vessel courses provide helpful information for identifying the position in space and time of the IVUS camera in the coronary arteries of the moving heart. Fusion of complementary information from two or more differing modalities enhances insight into the underlying anatomy and physiology, and combining non-invasive data mining techniques to battle measurement errors is a preferred method. The positioning of the camera can be remedied if the outer form of a vessel is available from angiograms as a road path for the camera. By fusing this route with the IVUS data, a simulator generates a VR reconstruction of the vessel (Wahle, Mitchell, Olszewski, Long, & Sonka, 2000; Sarry & Boire, 2001; Ding & Friedman, 2000; Rotger, Radeva, Mauri, & Fernandez-Nofrerias, 2002; Weichert, Wawro, & Wilke, 2004). This should help to detect the absolute spatial positions and orientations of the catheter, but usually the routes are static and the data are parsimonious, phase-dependent, or do not exhibit the distortion. There are a few papers on consecutive motion tracking of the coronary tree (Chen & Carroll, 2003; Shechter, Devernay, Coste-Maniere, Quyyumi, & McVeigh, 2003), but they omit the stretching of the arteries, which is an important property for accurate positioning analyses.
• Hybrid Motion Compensation: Physical prior knowledge could predict the global position of the IVUS camera in space and time if it were fully modeled. The many mechanical and electrical mechanisms in a heart make a full model intractable, but if most of these mechanisms could be dispensed with, the prediction model would become considerably simplified. A hybrid approach could solve the problem if a full 'Courant-type' model were available, as envisioned under the term "Computational Cardiology" (Cipra, 1999). An effective and pragmatic model of this kind is still futuristic. Timinger and colleagues reconstructed the catheter position on a 3D roadmap with a motion compensation algorithm, based on an affine model for compensating the respiratory motion and an ECG gating method, for catheter positions acquired using a magnetic tracking system (Timinger, Krueger, Borgert, & Grewer, 2004). Fusing the empirical features of the coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset (Liu, Koppelaar, Koffijberg, Bruining, & Hamers, 2004).
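The following minimal sketch illustrates retrospective image-based gating in the spirit of the approach cited in the first bullet: a scalar feature is computed per frame, spectral analysis locates the most prominent repetition (cardiac) frequency, and roughly one quiescent frame per cycle is retained. The feature (mean gray level), the heart-rate search band, and the frame-selection rule are simplifying assumptions made here; the published methods mine richer image features.

```python
import numpy as np

def gate_frames(frames, frame_rate_hz):
    """Select roughly one frame per cardiac cycle from an IVUS pullback.

    frames: numpy array of shape (n_frames, height, width) of gray-level images.
    Returns the indices of the retained (gated) frames.
    """
    # 1. A simple per-frame feature; real methods mine richer image features.
    feature = frames.reshape(len(frames), -1).mean(axis=1)
    feature = feature - feature.mean()

    # 2. Spectral analysis: find the most prominent repetition frequency,
    #    searched in a plausible heart-rate band (40-150 beats/min assumed).
    spectrum = np.abs(np.fft.rfft(feature))
    freqs = np.fft.rfftfreq(len(feature), d=1.0 / frame_rate_hz)
    band = (freqs > 40 / 60.0) & (freqs < 150 / 60.0)
    cardiac_hz = freqs[band][np.argmax(spectrum[band])]

    # 3. Within each cardiac period, keep the frame whose feature changes least
    #    relative to its neighbour (a crude proxy for the quiescent,
    #    end-diastolic phase); this selection rule is an assumption.
    period = int(round(frame_rate_hz / cardiac_hz))
    stability = np.abs(np.diff(feature, prepend=feature[0]))
    kept = [start + int(np.argmin(stability[start:start + period]))
            for start in range(0, len(feature) - period, period)]
    return np.array(kept)
```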
Cardiac Computation for Quantitative and Qualitative Analysis

Cardiac computation aims at quantitative measurements, remodeling, and qualitative assessments of the vessel and the lesion zones. The technologies of border detection in image processing and of pattern recognition of the vessel layers are important in the process of cardiac calculation (Koning, Dijkstra, von Birgelen, Tuinenburg, et al., 2002). For qualitative assessments, the expert knowledge of physicians must be fused into the reasoning to obtain the assessments.
Quantitative Measurements

• Cross-Sectional Slice Calculation: The normal coronary artery consists of the lumen surrounded by the intima, media, and adventitia of the vessel wall (Halligan & Higano, 2003). The innermost layer consists of a complex of three elements: intima, atheroma (in diseased arteries), and the internal elastic membrane. After identifying the vessel layers and their attributes through edge identification and pattern recognition, every slice is calculated separately, and measurements such as lumen area, EEM (external elastic membrane) area, and maximum atheroma thickness can be reported in detail (a minimal computation sketch follows this list).
• Vessel Remodeling and Virtual Display: Based on the above calculation results, the vessel layers, including plaques, can be remodeled and visualized longitudinally. VR display is an important way to assist in inspecting the artery and the distribution of plaque and lesions, and to help navigation in guided-surgery facilities for minimally invasive surgery. For example, the L-Mode displays sets of slices taken from a single cut plane in longitude (Figure 1, Panel D), and the 3D reconstruction displays a shaded or wire-frame image of the vessel to give an entire view (Figure 1, Panel E).
• Derived Measurements: Calculating on the virtually remodeled vessel, derived measurements can be obtained, such as hemodynamics, length, and the volumes of the vessel and of specific lesion zones.
• Qualitative Assessments: IVUS images are fundamentally different from histology and cannot be used to detect and quantify specific histologic contents directly. However, based on quantitative measurements and a virtual model, combined with cardiac knowledge, qualitative assessments such as atheroma morphology, unstable lesions, and complications after intervention may be obtained semi-automatically or manually to serve physicians.
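As a concrete illustration of the cross-sectional calculations, the sketch below derives lumen area, EEM area, atheroma (plaque plus media) area, and plaque burden from two detected border contours using the shoelace formula. The contour representation is assumed to be supplied by the border-detection step, and the exact measurement set of any particular analysis system may differ.

```python
import numpy as np

def polygon_area(contour):
    """Shoelace formula for a closed contour given as an (n, 2) array of x-y points (mm)."""
    x, y = contour[:, 0], contour[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def slice_measurements(lumen_contour, eem_contour):
    """Quantitative measurements for one IVUS cross-section.

    lumen_contour, eem_contour: (n, 2) arrays of border points produced by
    edge identification / pattern recognition (assumed given here).
    """
    lumen_area = polygon_area(lumen_contour)
    eem_area = polygon_area(eem_contour)
    atheroma_area = eem_area - lumen_area          # plaque + media area
    plaque_burden = atheroma_area / eem_area       # fraction of the EEM area
    return {"lumen_area_mm2": lumen_area,
            "eem_area_mm2": eem_area,
            "atheroma_area_mm2": atheroma_area,
            "plaque_burden": plaque_burden}

# Example: circular lumen (r = 1.5 mm) inside a circular EEM border (r = 2.5 mm)
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
lumen = np.column_stack((1.5 * np.cos(theta), 1.5 * np.sin(theta)))
eem = np.column_stack((2.5 * np.cos(theta), 2.5 * np.sin(theta)))
print(slice_measurements(lumen, eem))
```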
Knowledge Discovery in Cardiology

Once a large amount of quantitative and qualitative features is in hand, mining these data can reveal knowledge about the formation and regression of some heart diseases, and can simulate the different stages of a disease as well as assess surgical treatment procedures and risk levels. Several papers report on medical knowledge discovery in cardiology using data mining. Pressure calculation using fluid dynamic equations on 3D IVUS volumetric measurements predicts the physiologic severity of coronary lesions (Takayama & Hodgson, 2001). Artificial intelligence methods, namely structural description, syntactic reasoning, and pattern recognition, have been applied to angiography to recognize stenosis of the coronary artery lumen (Ogiela & Tadeusiewicz, 2002). Logical Analysis of Data, a method based on combinatorics, optimization, and the theory of Boolean functions, has been used on a 21-variable dataset to predict coronary risk, but that dataset does not include any medical image information (Alexe, 2003). Mining in single photon emission computed tomography images, accompanied by clinical information and physician interpretation, has used inductive machine learning and heuristic approaches to mimic the cardiologist's diagnosis (Kurgan, Cios, Tadeusiewicz, Ogiela, & Goodenday, 2001). Immersive data mining, or fusing the mined data with other cardiac data, is a challenge for improving medical knowledge in cardiology. Mining will contribute to some difficult cardiology applications, for example: interventional procedures and healthcare; mining coronary vessel movement anomalies; highlighting local abnormal cellular growth; accurate heart and virtual vessel reconstruction; adjusting ferro-electric materials to monitor heart movement anomalies; and prophylactic patient monitoring. Some technicalities should be considered in this mining field:
• Volume and complexity of data: First, IVUS data have a large volume from one acquisition procedure for one person. Second, tracing a patient usually takes a long time: more than several years. Third, the data from patients may be incomplete. Finally, immersive data acquisition is a complicated procedure depending on body condition, the immersive mechanical system, and even the operating procedure. The immersive complexity may even bias historical data from the same heart.
• Data standards and semantics: IVUS images, and especially the qualitative assessments, need consistent standards and semantics to support mining. The IVUS consensus documents (Mintz et al., 2001) and DICOM IVUS standards may be the base to follow. It is necessary to consider the importance of relative parameters, spatial positions, and multiple interpretations of image quantitative measurements, which also need to be addressed in the IVUS standards.
• Data fusion: Since heart diseases are complicated and IVUS is only one of the preferred diagnostic tools, mining IVUS data requires considering other clinical information, including the physician's interpretation, at the same time.
FUTURE TRENDS

Immersive image mining in cardiology is a new challenge for medical informatics. Mining inside the body inspires interdisciplinary work among medicine, physics, and computer science, which will improve the monitoring of heart data for many deep applications serving clinical needs in diagnostics, therapy, safety, and cost and risk effectiveness. For instance, in due course nanotechnology will mature to the degree of immersive medical mining equipment, and physicians will directly control medication by instruction via mobile communication to a transducer inside the human body.
CONCLUSION

Over the last several years, IVUS has developed into an important clinical tool in the assessment of atherosclerosis. The limitations caused by data artifacts and by the difficulty of discerning between entities with similar echodensities (Halligan & Higano, 2003) are still waiting for better solutions. Hospitals have stored a large amount of immersive data, which may be mined for effective application in cardiac knowledge discovery. Data mining technologies play crucial roles in the whole procedure of IVUS application:
from data acquisition and data reconciliation, through image processing, vessel remodeling and virtual display, and knowledge discovery, to disease diagnosis and clinical treatment. Reconciling coronary longitudinal movement with a motion compensation model is a novel way to resolve the longitudinal distortion of the IVUS dataset. Fusion with other cardiac datasets and online processing are very important for future applications, which means that more effective and efficient mining methods for these complicated datasets need to be studied.
REFERENCES

Alexe, S. (2003). Coronary risk prediction by logical analysis of data. Annals of Operations Research, 119(1-4), 15-42.

Chen, S.-Y. J., & Carroll, J. D. (2003). Kinematic and deformation analysis of 4-D coronary arterial trees reconstructed from cine angiograms. IEEE Transactions on Medical Imaging, 22(6), 710-720.

Cios, K. J., & Moore, W. (2002). Uniqueness of medical data mining. Artificial Intelligence in Medicine, 26(1-2), 1-24.

Cipra, B. A. (1999). Failure in sight for a mathematical model of the heart. SIAM News, 32(8).

Ding, Z., & Friedman, M. H. (2000). Quantification of 3-D coronary arterial motion using clinical biplane cineangiograms. The International Journal of Cardiac Imaging, 16(5), 331-346.

Halligan, S., & Higano, S. T. (2003). Coronary assessment beyond angiography. Applications in Imaging Cardiac Interventions, 12, 29-35.

Hsu, W., Lee, M. L., & Zhang, J. (2002). Image mining: Trends and developments. Journal of Intelligent Information Systems: Special Issue on Multimedia Data Mining, 19(1), 7-23.

Koning, G., Dijkstra, J., von Birgelen, C., Tuinenburg, J. C., et al. (2002). Advanced contour detection for three-dimensional intracoronary ultrasound: A validation—in vitro and in vivo. The International Journal of Cardiovascular Imaging, 18, 235-248.

Kurgan, L. A., Cios, K. J., Tadeusiewicz, R., Ogiela, M., & Goodenday, L. S. (2001). Knowledge discovery approach to automated cardiac SPECT diagnosis. Artificial Intelligence in Medicine, 23(2), 149-169.

Liu, X., Koppelaar, H., Koffijberg, H., Bruining, N., & Hamers, R. (2004, October). Data reconciliation of immersive heart inspection. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics, The Hague, Netherlands.

Mintz, G. S., Nissen, S. E., Anderson, W. D., Bailey, S. R., Erbel, R., Fitzgerald, P. J., Pinto, F. J., Rosenfield, K., Siegel, R. J., Tuzcu, E. M., & Yock, P. G. (2001). ACC clinical expert consensus document on standards for the acquisition, measurement and reporting of intravascular ultrasound studies: A report of the American College of Cardiology task force on clinical expert consensus documents. Journal of the American College of Cardiology, 37, 1478-1492.

Ogiela, M. R., & Tadeusiewicz, R. (2002). Syntactic reasoning and pattern recognition for analysis of coronary artery images. Artificial Intelligence in Medicine, 26, 145-159.

Rotger, B., Radeva, P., Mauri, J., & Fernandez-Nofrerias, E. (2002). Internal and external coronary vessel images registration. In M. T. Escrig, F. Toledo, & E. Golobardes (Eds.), Topics in Artificial Intelligence, 5th Catalonian Conference on AI, Castellón, Spain (pp. 408-418).

Sarry, L., & Boire, J. Y. (2001). Three-dimensional tracking of coronary arteries from biplane angiographic sequences using parametrically deformable models. IEEE Transactions on Medical Imaging, 20(12), 1341-1351.

Shechter, G., Devernay, F., Coste-Maniere, E., Quyyumi, A., & McVeigh, E. R. (2003). Three-dimensional motion tracking of coronary arteries in biplane cineangiograms. IEEE Transactions on Medical Imaging, 2, 1-16.

Takayama, T., & Hodgson, J. M. (2001). Prediction of the physiologic severity of coronary lesions using 3D IVUS: Validation by direct coronary pressure measurements. Catheterization and Cardiovascular Interventions, 53(1), 48-55.

Timinger, H., Krueger, S., Borgert, J., & Grewer, R. (2004). Motion compensation for interventional navigation on 3D static roadmaps based on an affine model and gating. Physics in Medicine and Biology, 49, 719-732.

Wahle, A., Mitchell, S. C., Olszewski, M. E., Long, R. M., & Sonka, M. (2000). Accurate visualization and quantification of coronary vasculature by 3-D/4-D fusion from biplane angiography and intravascular ultrasound. In EBiOS 2000, Biomonitoring and Endoscopy Technologies (pp. 144-155). Amsterdam, NL: SPIE Europto, 4158.

Weichert, F., Wawro, M., & Wilke, C. (2004). A 3D computer graphics approach to brachytherapy planning. The International Journal of Cardiovascular Imaging, 20, 173-182.

Winter, S. A. de, Hamers, R., Degertekin, M., Tanabe, K., Lemos, P. A., Serruys, P. W., Roelandt, J. R. T. C., & Bruining, N. (2004). Retrospective image-based gating of intracoronary ultrasound images for improved quantitative analysis: The intelligate method. Catheterization and Cardiovascular Interventions, 61(1), 84-94.

Zhu, H., Oakeson, K. D., & Friedman, M. H. (2003). Retrieval of cardiac phase from IVUS sequences. In W. F. Walker & M. F. Insana (Eds.), Medical Imaging 2003: Ultrasonic Imaging and Signal Processing, Proceedings of SPIE 5035 (pp. 135-146).
KEY TERMS

Cardiac Data Fusion: Fusion of complementary information from two or more differing cardiac modalities (IVUS, ECG, CT, physician's interpretation, etc.) enhances insights into the underlying anatomy and physiology.

Computational Cardiology: Using mathematical and computer models to simulate the heart's motion and its properties as a whole.

Immersive IVUS Images: The real-time cross-sectional images obtained from a pullback IntraVascular UltraSound transducer in human arteries. The dataset is usually a volume with artifacts caused by the complicated immersed environments.

IVUS Data Reconciliation: Mining in the individual IVUS dataset to compensate for or decrease artifacts, in order to obtain improved data for further cardiac calculation and medical knowledge discovery.

IVUS Standards and Semantics: The standards and semantics of IVUS data and of their medical quantitative measurements and qualitative assessments. Consistent definition and description improve medical data management and mining.

Medical Image Mining: Extracting the most relevant image features into a form suitable for data mining for medical knowledge discovery, or generating image patterns to improve the accuracy of images retrieved from image databases.

Virtual Reality Vasculature Reconstruction: For effective applications of intravascular analyses and brachytherapy, reconstructing and visualizing the vessel wall's interior structure in a single 3D/4D model by fusing invasive IVUS data and non-invasive angiography.
Imprecise Data and the Data Mining Process

Marvin L. Brown, Grambling State University, USA
John F. Kros, East Carolina University, USA
INTRODUCTION

Missing or inconsistent data has been a pervasive problem in data analysis since the origin of data collection. The management of missing data in organizations has recently been addressed as more firms implement large-scale enterprise resource planning systems (see Vosburg & Kumar, 2001; Xu et al., 2002). The issue of missing data becomes an even more pervasive dilemma in the knowledge discovery process, in that as more data is collected, the higher the likelihood of missing data becomes. The objective of this research is to discuss imprecise data and the data mining process. The article begins with a background analysis, including a brief review of both seminal and current literature. The main thrust of the chapter focuses on reasons for data inconsistency along with definitions of various types of missing data. Future trends followed by concluding remarks complete the chapter.

BACKGROUND

The analysis of missing data is a comparatively recent discipline. However, the literature holds a number of works that provide perspective on missing data and data mining. Afifi and Elashoff (1966) provide an early seminal paper reviewing the missing data and data mining literature. Little and Rubin's (1987) milestone work defined three unique types of missing data mechanisms and provided parametric methods for handling these types of missing data. These papers sparked numerous works in the area of missing data. Lee and Siau (2001) present an excellent review of data mining techniques within the knowledge discovery process. The references in this section are given as suggested reading for any analyst beginning their research in the area of data mining and missing data.

MAIN THRUST

The article focuses on the reasons for data inconsistency and the types of missing data. In addition, trends regarding missing data and data mining are discussed along with future research opportunities and concluding remarks.

REASONS FOR DATA INCONSISTENCY

Data inconsistency may arise for a number of reasons, including:

• Procedural Factors
• Refusal of Response
• Inapplicable Responses

These three reasons tend to cover the largest areas of missing data in the data mining process.

Procedural Factors

Data entry errors are common and their impact on the knowledge discovery process and data mining can generate serious problems. Inaccurate classifications, erroneous estimates, predictions, and invalid pattern recognition may also take place. In situations where databases are being refreshed with new data, blank responses from questionnaires further complicate the data mining process. If a large number of similar respondents fail to complete similar questions, the deletion or misclassification of these observations can take the researcher down the wrong path of investigation or lead to inaccurate decision-making by end users.

Refusal of Response

Some respondents may find certain survey questions offensive or they may be personally sensitive to certain
questions. For example, some respondents may have no opinion regarding certain questions such as political or religious affiliation. In addition, questions that refer to one's education level, income, age, or weight may be deemed too private for some respondents to answer. Furthermore, respondents may simply have insufficient knowledge to accurately answer particular questions. Students or inexperienced individuals may have insufficient knowledge to answer certain questions (such as salaries in various regions of the country, retirement options, insurance choices, etc.).
Inapplicable Responses

Sometimes questions are left blank simply because the questions apply to a more general population rather than to an individual respondent. If a subset of questions on a questionnaire does not apply to the individual respondent, data may be missing for a particular expected group within a data set. For example, adults who have never been married or who are widowed or divorced are likely not to answer a question regarding years of marriage.
TYPES OF MISSING DATA

The following is a list of the standard types of missing data:

• Data Missing at Random
• Data Missing Completely at Random
• Non-Ignorable Missing Data
• Outliers Treated as Missing Data
It is important for an analyst to understand the different types of missing data before they can address the issue. Each type of missing data is defined next.
[Data] Missing At Random (MAR)

Rubin (1978), in a seminal missing data research paper, defined missing data as MAR "when given the variables X and Y, the probability of response depends on X but not on Y." Cases containing incomplete data must be treated differently than cases with complete data. For example, if the likelihood that a respondent will provide his or her weight depends on the probability that the respondent will not provide his or her age, then the missing data is considered to be Missing At Random (MAR) (Kim, 2001).
[Data] Missing Completely At Random (MCAR)

Kim (2001), based on an earlier work, classified data as MCAR when "the probability of response [shows that] independence exists between X and Y." MCAR data exhibits a higher level of randomness than does MAR. In other words, the observed values of Y are truly a random sample of all values of Y, and no other factors included in the study bias the observed values of Y. Consider the case of a laboratory providing the results of a chemical compound decomposition test in which a significant level of iron is being sought. If certain levels of iron are met or are missing entirely, and no other elements in the compound are identified as correlated, then it can be determined that the identified or missing data for iron is MCAR.
Non-Ignorable Missing Data

In contrast to the MAR situation, where data missingness is explained by other measured variables in a study, non-ignorable missing data arise due to the data missingness pattern being explainable, and only explainable, by the very variable(s) on which the data are missing. For example, given two variables, X and Y, data is deemed non-ignorable when the probability of response depends on variable Y (the variable with the missing values) and possibly on variable X as well. For example, if the likelihood of an individual providing his or her weight varied within various age categories, the missing data is non-ignorable (Kim, 2001). Thus, the pattern of missing data is non-random and possibly predictable from other variables in the database. In practice, the MCAR assumption is seldom met. Most missing data methods are applied upon the assumption of MAR. And, as Kim (2001) notes, "Non-Ignorable missing data is the hardest condition to deal with, but unfortunately, the most likely to occur as well."
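To make the three mechanisms concrete, the hedged sketch below simulates them for the weight/age example used in this section. The variable names, probabilities, and the relationship between age and weight are illustrative assumptions; note how the observed mean is biased only when missingness depends on the unobserved values themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
age = rng.integers(20, 80, size=n)
weight = rng.normal(170 + 0.3 * (age - 50), 25)   # illustrative relationship

# MCAR: every weight value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance that weight is missing depends only on the observed age.
mar_mask = rng.random(n) < np.where(age > 60, 0.40, 0.10)

# Non-ignorable: the chance that weight is missing depends on weight itself.
ni_mask = rng.random(n) < np.where(weight > 200, 0.50, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("Non-ignorable", ni_mask)]:
    observed_mean = weight[~mask].mean()
    print(f"{name}: {mask.mean():.0%} missing, observed mean weight = {observed_mean:.1f}")
```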
Outliers Treated As Missing Data

Many times it is necessary to classify outliers as missing data. Pre-testing and the calculation of threshold boundaries are necessary in the pre-processing of data in order to identify those values which are to be classified as missing. Data whose values fall outside of predefined ranges may skew test results. Consider the case of a laboratory providing the results of a chemical compound decomposition test. If it has been predetermined that the maximum amount of iron that can be
contained in a particular compound is 500 parts/million, then the value for the variable “iron” should never exceed that amount. If, for some reason, the value does exceed 500 parts/million, then some visualization technique should be implemented to identify that value. Those offending cases are then presented to the end users.
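A minimal sketch of this pre-processing step, using the iron example from the text (the 500 parts/million threshold comes from the passage; the readings themselves are made up): values outside the predefined range are flagged as offending cases and treated as missing.

```python
import numpy as np

IRON_MAX_PPM = 500.0                      # predetermined threshold from the example
iron_ppm = np.array([120.0, 480.0, 515.0, 333.0, 890.0])  # made-up lab readings

out_of_range = iron_ppm > IRON_MAX_PPM    # offending cases to present to end users
cleaned = np.where(out_of_range, np.nan, iron_ppm)   # treat outliers as missing

print("Offending case indices:", np.flatnonzero(out_of_range))
print("Iron readings with outliers set to missing:", cleaned)
```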
COMMONLY USED METHODS OF ADDRESSING MISSING DATA

Several methods have been developed for the treatment of missing data. The simplest of these methods can be broken down into the following categories:

• Use of Complete Data Only
• Deleting Selected Cases or Variables
• Data Imputation
These categories are based on the randomness of the missing data, and how the missing data is estimated and used for replacement. Each category is discussed next.
Use of Complete Data Only

Use of complete data only is generally referred to as the "complete case approach" and is readily available in all statistical analysis packages. When the relationships within a data set are strong enough not to be significantly affected by missing data, large sample sizes may allow for the deletion of a predetermined percentage of cases. While the use of complete data only is a common approach, the cost of lost data and information when cases containing missing values are simply deleted can be dramatic. Overall, this method is best suited to situations where the amount of missing data is small and the missing data is classified as MCAR.
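A minimal pandas sketch of the complete case approach (the small data frame is invented for illustration): any case containing a missing value is simply dropped before analysis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [26, 45, np.nan, 28],
                   "income": [95131.25, np.nan, 98356.67, 94420.33]})

complete_cases = df.dropna()      # keep only rows with no missing values
print(f"Kept {len(complete_cases)} of {len(df)} cases")
```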
Delete Selected Cases or Variables

The simple deletion of data that contains missing values may be utilized when a non-random pattern of missing data is present. However, it may be ill-advised to eliminate ALL of the samples taken from a test. This method tends to be the method of last choice.
Data Imputation Methods

A number of researchers have discussed specific imputation methods. Seminal works include Little and Rubin (1987), and Rubin (1978) has also published articles regarding imputation methodologies. Imputation methods are
procedures resulting in the replacement of missing values by attributing them to other available data. This research investigates the most common imputation methods including:

• Case Deletion
• Mean Substitution
• Cold Deck Imputation
• Hot Deck Imputation
• Regression Imputation
These methods were chosen mainly because they are very common in the literature and because of their ease and expediency of application (Little & Rubin, 1987). However, it has been concluded that although imputation is a flexible method for handling missing-data problems, it is not without its drawbacks. Caution should be used when employing imputation methods, as they can generate substantial biases between real and imputed data. Nonetheless, imputation methods tend to be a popular way of addressing the issue of missing data.
Case Deletion

The simple deletion of data that contains missing values may be utilized when a nonrandom pattern of missing data is present. Large sample sizes permit the deletion of a predetermined percentage of cases, as does a data set whose relationships are strong enough not to be significantly affected by the missing data. Case deletion is not recommended for small sample sizes or when the user knows that strong relationships exist within the data.
Mean Substitution This type of imputation is accomplished by estimating missing values by using the mean of the recorded or available values. It is important to calculate the mean only from valid responses that are chosen from a verified population having a normal distribution. If the data distribution is skewed, the median of the available data can be used as a substitute. The main advantage of mean substitution is its ease of implementation and ability to provide all cases with complete information. Although a common imputation method, three main disadvantages to mean substitution do exist: • •
Understatement of variance Distortion of actual distribution of values – mean substitution allows more observations to fall into the category containing the calculated mean than may actually exist 595
I
Imprecise Data and the Data Mining Process
•
Depression of observed correlations due to the repetition of a constant value
Obviously, a researcher must weigh the advantages against the disadvantages before implementation.
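A minimal sketch of mean substitution, with the median alternative mentioned above for skewed data (the income values are invented for illustration); the final line also shows the understatement of variance that the first disadvantage warns about.

```python
import numpy as np
import pandas as pd

income = pd.Series([95131.25, 108664.75, np.nan, 94420.33, np.nan, 104432.04])

mean_imputed = income.fillna(income.mean())      # mean of the available values
median_imputed = income.fillna(income.median())  # preferred if the distribution is skewed

# Note the understatement of variance that the text warns about:
print(f"variance before: {income.var():.3e}, after mean substitution: {mean_imputed.var():.3e}")
```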
Cold Deck Imputation

Cold deck imputation methods select values or use relationships obtained from sources other than the current data. With this method, the end user substitutes a constant value derived from external sources or from previous research for the missing values. It must be ascertained by the end user that the replacement value used is more valid than any internally derived value. Unfortunately, feasible values are not always provided by cold deck imputation methods. Many of the same disadvantages that apply to the mean substitution method apply to cold deck imputation. Cold deck imputation methods are rarely used as the sole method of imputation and instead are generally used to provide starting values for hot deck imputation methods.
Hot Deck Imputation

Generally speaking, hot deck imputation replaces missing values with values drawn from the most similar case. The implementation of this imputation method results in the replacement of a missing value with a value selected from an estimated distribution of similar responding units for each missing value. In most instances, the empirical distribution consists of values from responding units. For example, Table 1 displays a data set containing missing data. From Table 1, it is noted that case three is missing data for item four. In this example, cases one, two, and four are examined. Using hot deck imputation, each of the other cases with complete data is examined, and the value from the most similar case is substituted for the missing data value. Case four is easily eliminated, as it has nothing in common with case three. Cases one and two both have similarities with case three. Case one has one item in common, whereas case two has two items in common. Therefore, case two is the most similar to case three.

Table 1. Illustration of hot deck imputation: incomplete data set

Case   Item 1   Item 2   Item 3   Item 4
1      10       22       30       25
2      23       20       30       23
3      25       20       30       ???
4      11       25       10       12
Once the most similar case has been identified, hot deck imputation substitutes the most similar complete case's value for the missing value. Since case two contains the value of 23 for item four, a value of 23 replaces the missing data point for case three. The advantages of hot deck imputation include conceptual simplicity, maintenance of the proper measurement level of variables, and the availability of a complete set of data at the end of the imputation process that can be analyzed like any complete set of data. One of hot deck's disadvantages is the difficulty of defining what is "similar"; hence, many different schemes for deciding what is "similar" may evolve.
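A minimal sketch of hot deck imputation over the Table 1 data. Similarity is measured here as the count of matching values on the complete items, which is one reasonable reading of "similar"; as noted above, other similarity schemes are possible.

```python
import numpy as np
import pandas as pd

# Data set from Table 1 (case 3 is missing Item 4)
df = pd.DataFrame({"Item 1": [10, 23, 25, 11],
                   "Item 2": [22, 20, 20, 25],
                   "Item 3": [30, 30, 30, 10],
                   "Item 4": [25, 23, np.nan, 12]},
                  index=[1, 2, 3, 4])

def hot_deck(df, case, column):
    """Fill df.loc[case, column] with the value from the most similar complete case."""
    donors = df[df[column].notna()]
    other_cols = [c for c in df.columns if c != column]
    # Similarity = number of exactly matching values on the other items
    similarity = (donors[other_cols] == df.loc[case, other_cols]).sum(axis=1)
    donor = similarity.idxmax()
    df.loc[case, column] = donors.loc[donor, column]
    return donor

donor = hot_deck(df, case=3, column="Item 4")
print(f"Donor case: {donor}, imputed value: {df.loc[3, 'Item 4']}")  # expects case 2 -> 23
```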
Regression Imputation

Regression analysis is used to predict missing values based on the variable's relationship to other variables in the data set. Single and/or multiple regression can be used to impute missing values. The first step consists of identifying the independent variables and the dependent variable. In turn, the dependent variable is regressed on the independent variables. The resulting regression equation is then used to predict the missing values. Table 2 displays an example of regression imputation. From the table, twenty cases with three variables (income, age, and years of college education) are listed. Income contains missing data and is identified as the dependent variable, while age and years of college education are identified as the independent variables. The following regression equation is produced for the example.
Table 2. Illustration of regression imputation

Case   Income        Age   Years of College Education   Regression Prediction
1      $95,131.25    26    4                            $96,147.60
2      $108,664.75   45    6                            $104,724.04
3      $98,356.67    28    5                            $98,285.28
4      $94,420.33    28    4                            $96,721.07
5      $104,432.04   46    3                            $100,318.15
6      $97,151.45    38    4                            $99,588.46
7      $98,425.85    35    4                            $98,728.24
8      $109,262.12   50    6                            $106,157.73
9      $95,704.49    45    3                            $100,031.42
10     $99,574.75    52    5                            $105,167.00
11     $96,751.11    30    0                            $91,037.71
12     $111,238.13   50    6                            $106,157.73
13     $102,386.59   46    6                            $105,010.78
14     $109,378.14   48    6                            $105,584.26
15     $98,573.56    50    4                            $103,029.32
16     $94,446.04    31    3                            $96,017.08
17     $101,837.93   50    4                            $103,029.32
18     ???           55    6                            $107,591.43
19     ???           35    4                            $98,728.24
20     ???           39    5                            $101,439.40
ŷ = $79,900.95 + $268.50 (age) + $2,180.97 (years of college education)
Predictions of income can be made using the regression equation, and the right-most column of the table displays these predictions. For cases eighteen, nineteen, and twenty, income is predicted to be $107,591.43, $98,728.24, and $101,439.40, respectively. An advantage of regression imputation is that it preserves the variance and covariance structures of variables with missing data. Although regression imputation is useful for simple estimates, it has several inherent disadvantages:

• Regression imputation reinforces relationships that already exist within the data: as the method is utilized more often, the resulting data becomes more reflective of the sample and less generalizable
• It understates the variance of the distribution
• It carries an implied assumption that the variable being estimated has a substantial correlation to other attributes within the data set
• The estimated value is not constrained and therefore may fall outside predetermined boundaries for the given variable
In addition to these points, there is also the problem of over-prediction. Regression imputation may lead to over-prediction of the model's explanatory power. For example, if the regression R2 is very high, multicollinearity most likely exists; otherwise, if the R2 value is modest, errors in the regression prediction equation will be substantial. Overall, regression imputation not only estimates the missing values but also derives inferences for the population.
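A minimal sketch of regression imputation consistent with the Table 2 example: income is regressed on age and years of college education over the seventeen complete cases, and the fitted equation predicts the three missing incomes. Ordinary least squares via numpy is used here; the fitted coefficients and predictions may differ somewhat from the rounded figures quoted in the text.

```python
import numpy as np

# Complete cases (1-17) and incomplete cases (18-20) from Table 2
age = np.array([26, 45, 28, 28, 46, 38, 35, 50, 45, 52, 30, 50, 46, 48, 50, 31, 50, 55, 35, 39])
edu = np.array([4, 6, 5, 4, 3, 4, 4, 6, 3, 5, 0, 6, 6, 6, 4, 3, 4, 6, 4, 5])
income = np.array([95131.25, 108664.75, 98356.67, 94420.33, 104432.04, 97151.45,
                   98425.85, 109262.12, 95704.49, 99574.75, 96751.11, 111238.13,
                   102386.59, 109378.14, 98573.56, 94446.04, 101837.93,
                   np.nan, np.nan, np.nan])

observed = ~np.isnan(income)
X = np.column_stack((np.ones(len(age)), age, edu))      # intercept, age, education

# Ordinary least squares fit on the complete cases only
coef, *_ = np.linalg.lstsq(X[observed], income[observed], rcond=None)
print("intercept, age, education coefficients:", np.round(coef, 2))

# Impute the missing incomes from the fitted equation
imputed = income.copy()
imputed[~observed] = X[~observed] @ coef
print("Imputed incomes for cases 18-20:", np.round(imputed[~observed], 2))
```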
FUTURE TRENDS

Poor data quality has plagued the knowledge discovery process and all associated data mining techniques. Future data mining systems should be sensitive to noise and have the ability to deal with all types of data pollution, both internally and in conjunction with end users. Systems should still produce the most significant findings possible from the data set even if noise is present. As data mining continues to evolve and mature as a viable business tool, it will be monitored to address its role in the technological life cycle. Tools for dealing with missing data will grow from being used as a horizontal solution (not designed to provide business-specific end solutions) into a type of vertical solution
(integration of domain-specific logic into data mining solutions). As gigabyte-, terabyte-, and petabyte-size data sets become more prevalent in data warehousing applications, the issue of dealing with missing data will itself become an integral solution for the use of such data rather than simply existing as a component of the knowledge discovery and data mining processes (Han & Kamber, 2001). Although the issue of statistical analysis with missing data has been addressed since the early 1970s, the advent of data warehousing, knowledge discovery, data mining and data cleansing has pushed the concept of dealing with missing data into the limelight. Brown & Kros (2003) provide an overview of the trends, techniques, and impacts of imprecise data on the data mining process related to the k-Nearest Neighbor Algorithm, Decision Trees, Association Rules, and Neural Networks.
CONCLUSION

It can be seen that future generations of data miners will be faced with many challenges concerning the issues of missing data. This article gives a background analysis and brief literature review of missing data concepts. The authors addressed reasons for data inconsistency and methods for addressing missing data. Finally, the authors offered their opinions on future developments and trends on issues expected to face developers of knowledge discovery software and the needs of end users when confronted with the issues of data inconsistency and missing data.
REFERENCES

Afifi, A., & Elashoff, R. (1966). Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association, 61, 595-604.

Brown, M. L., & Kros, J. F. (2003). The impact of missing data on data mining. In J. Wang (Ed.), Data mining opportunities and challenges (pp. 174-198). Hershey, PA: Idea Group Publishing.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Academic Press.

Kim, Y. (2001). The curse of the missing data. Retrieved from http://209.68.240.11:8080/2ndMoment/978476655/addPostingForm/

Lee, S., & Siau, K. (2001). A review of data mining techniques. Industrial Management & Data Systems, 101(1), 41-46.
Little, R., & Rubin, D. (1987). Statistical analysis with missing data. New York: Wiley. Rubin, D. (1978). Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse. In Imputation and Editing of Faulty or Missing Survey Data (pp. 1-23). Washington, DC: U.S. Department of Commerce. Vosburg, J., & Kumar, A. (2001). Managing dirty data in organizations using ERP: Lessons from a case study. Industrial Management & Data Systems, 101(1), 21-31. Xu, H., Horn Nord, J., Brown, N., & Nord, G.D. (2002). Data quality issues in implementing an ERP. Industrial Management & Data Systems, 102(1), 47-58.
KEY TERMS

[Data] Missing Completely at Random (MCAR): When the observed values of a variable are truly a random sample of all values of that variable (i.e., the response exhibits independence from any variables).
Data Imputation: The process of estimating missing data of an observation based on the valid values of other variables.

Data Missing at Random (MAR): When given the variables X and Y, the probability of response depends on X but not on Y.

Inapplicable Responses: Respondents omit an answer due to doubts of applicability.

Knowledge Discovery Process: The overall process of information discovery in large volumes of warehoused data.

Non-Ignorable Missing Data: Arise due to the data missingness pattern being explainable, non-random, and possibly predictable from other variables.

Procedural Factors: Inaccurate classifications of new data, resulting in classification error or omission.

Refusal of Response: Respondents' outward omission of a response due to personal choice, conflict, or inexperience.
Incorporating the People Perspective into Data Mining

Nilmini Wickramasinghe, Cleveland State University, USA
INTRODUCTION Today’s economy is increasingly based on knowledge and information (Davenport & Grover, 2001). Knowledge is now recognized as the driver of productivity and economic growth, leading to a new focus on the roles of information technology and learning in economic performance. Organizations trying to survive and prosper in such an economy are turning their focus to strategies, processes, tools, and technologies that can facilitate the creation of knowledge. A vital and well-respected technique in knowledge creation is data mining, which enables critical knowledge to be gained from the analysis of large amounts of data and information. Traditional data mining and the KDD process (knowledge discovery in data bases) tends to view the knowledge product as a homogeneous product. Knowledge, however, is a multifaceted construct, drawing upon various philosophical perspectives including Lockean/Leibnitzian and Hegelian/Kantian, exhibiting subjective and objective aspects, as well as having tacit and explicit forms (Nonaka, 1994; Alavi & Leidner, 2001; Schultze & Leidner, 2002; Wickramasinghe et al., 2003). The thesis of this article is that taking a broader perspective of the resultant knowledge product from the KDD process,
namely by incorporating a people-based perspective into the traditional KDD process, not only provides a more complete and macro perspective on knowledge creation but also a more balanced approach, which in turn serves to enhance the knowledge base of an organization and facilitates the realization of effective knowledge. The implications for data mining are clearly far-reaching and are certain to help organizations more effectively realize the full potential of their knowledge assets, improve the likelihood of using/reusing the created knowledge, and thereby enables them to be well positioned in today’s knowledgeable economy.
BACKGROUND

Knowledge Creation through Data Mining and the KDD Process

KDD, and more specifically data mining, approaches knowledge creation from a primarily technology-driven perspective. In particular, the KDD process focuses on how data are transformed into knowledge by identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Spiegler, 2003; Fayyad, Piatetsky-Shapiro, & Smyth, 1996).
Figure 1. Integrated view of the knowledge discovery process (Adapted from Wickramasinghe et al., 2003). The figure traces the evolution from data through information to knowledge across the steps in knowledge discovery (selection, preprocessing, transformation, data mining, and interpretation/evaluation, yielding target data, preprocessed data, transformed data, patterns, and finally knowledge) and distinguishes the two types of data mining, exploratory and predictive.
KDD is primarily used on data sets for creating knowledge through model building or by finding patterns and relationships in data. From an application perspective, data mining and KDD are often used interchangeably. Figure 1 presents a generic representation of a typical knowledge discovery process. This figure not only depicts each stage within the KDD process but also highlights the evolution of knowledge from data through information in this process, as well as the two major types of data mining, namely exploratory and predictive; the last two steps (i.e., data mining and interpretation/evaluation) in the KDD process are considered predictive data mining. It is important to note in Figure 1 that, typically in the KDD process, the knowledge component itself is treated as a homogeneous block. Given the well-established multifaceted nature of the knowledge construct (Boland & Tenkasi, 1995; Malhotra, 2000; Alavi & Leidner, 2001; Schultze & Leidner, 2002; Wickramasinghe et al., 2003), this would appear to be a significant limitation or oversimplification of knowledge creation through data mining as a technique and of the KDD process in general.
The Psychosocial-Driven Perspective to Knowledge Creation

Knowledge can exist in essentially two forms: explicit, or factual, knowledge and tacit, or experiential (i.e., "know how"), knowledge (Polyani, 1958, 1966). Of equal significance is the fact that organizational knowledge is not static; rather, it changes and evolves during the lifetime of an organization (Becerra-Fernandez & Sabherwal, 2001; Bendoly, 2003; Choi & Lee, 2003). Furthermore, it is possible to change the form of knowledge, that is, transform existing tacit knowledge into new explicit knowledge, and existing explicit knowledge into new tacit knowledge, or to transform the subjective form of knowledge into the objective form of knowledge (Nonaka & Nishiguchi, 2001; Nonaka, 1994). This process of transforming the form of knowledge, and thus increasing the extant knowledge base as well as the amount and utilization of the knowledge within the organization, is known as the knowledge spiral (Nonaka & Nishiguchi, 2001). In each of these instances, the overall extant knowledge base of the organization grows to a new superior knowledge base. According to Nonaka and Nishiguchi (2001), four things are true: a) Tacit-to-tacit knowledge transformation usually occurs through apprenticeship-type relations, where the teacher or master passes on the skill to the apprentice, b) Explicit-to-explicit knowledge transformation usually occurs via formal learning of facts, c) Tacit-to-explicit knowledge transformation usually occurs when
there is an articulation of nuances; for example, as in health care, if a renowned surgeon is questioned as to why he does a particular procedure in a certain manner, by his articulation of the steps, the tacit knowledge becomes explicit, and d) Explicit-to-tacit knowledge transformation usually occurs as new explicit knowledge is internalized; it can then be used to broaden, reframe, and extend one’s tacit knowledge. These transformations are often referred to as the modes of socialization, combination, externalization, and internalization, respectively (Nonaka, 1994). Integral to this changing of knowledge through the knowledge spiral is that new knowledge is created (Nonaka & Nishiguchi, 2001), which can bring many benefits to organizations. Specifically, in today’s knowledge-centric economy, processes that effect a positive change to the existing knowledge base of the organization and facilitate better use of the organization’s intellectual capital, as the knowledge spiral does, are of paramount importance. Two other primarily people-driven frameworks that focus on knowledge creation as a central theme are Spender’s and Blackler’s respective frameworks (Newell, Robertson, Scarbrough, & Swan, 2002; Swan, Scarbrough, & Preston, 1999). Spender draws a distinction between individual knowledge and social knowledge, each of which he claims can be implicit or explicit (Newell et al.). From this framework, you can see that Spender’s definition of implicit knowledge corresponds to Nonaka’s tacit knowledge. However, unlike Spender, Nonaka doesn’t differentiate between individual and social dimensions of knowledge; rather, he focuses on the nature and types of the knowledge itself. In contrast, Blackler (Newell et al.) views knowledge creation from an organizational perspective, noting that knowledge can exist as encoded, embedded, embodied, encultured, and/or embrained. In addition, Blackler emphasized that for different organizational types, different types of knowledge predominate, and he highlighted the connection between knowledge and organizational processes (Newell et al.). Blackler’s types of knowledge can be thought of in terms of spanning a continuum of tacit (implicit) through to explicit with embrained being predominantly tacit (implicit) and encoded being predominantly explicit while embedded, embodied, and encultured types of knowledge exhibit varying degrees of a tacit (implicit)/ explicit combination. An integrated view of all three frameworks is presented in Figure 2. Specifically, from Figure 2, Spender’s and Blackler’s perspectives complement Nonaka’s conceptualization of knowledge creation and, more importantly, do not contradict his thesis of the knowledge spiral, wherein the extant knowledge base is continually being expanded to a new knowledge base, be it tacit/explicit (in Nonaka’s terminology), implicit/explicit (in Spender’s terminology), or
embrained/encultured/embodied/embedded/encoded (in Blackler's terminology).

Figure 2. People-driven knowledge creation map. The map plots the knowledge continuum (from tacit/implicit/embrained, through the other knowledge types of embodied, encultured, and embedded, to explicit) against Spender's actors (individual to social), with the knowledge spiral operating across both dimensions.
ENRICHING DATA MINING WITH THE KNOWLEDGE SPIRAL

To conceive of knowledge as a collection of information seems to rob the concept of all of its life. . . . Knowledge resides in the user and not in the collection. It is how the user reacts to a collection of information that matters (Churchman, 1971, p. 10).

Churchman is clearly underscoring the importance of people in the process of knowledge creation. However, most formulations of information technology (IT) enabled knowledge management, and data mining in particular, seem to have not only ignored the human element but also taken a very myopic and homogeneous perspective on the knowledge construct itself. Recent research that has surveyed the literature on KM indicates the need for more frameworks for knowledge management, particularly a metaframework to facilitate more successful realization of the KM steps (Wickramasinghe & Mills, 2001; Holsapple & Joshi, 2002; Alavi & Leidner, 2001; Schultze & Leidner, 2002). From a macro knowledge management perspective, the knowledge spiral is the cornerstone of knowledge creation. From a micro data-mining perspective, one of the key strengths of data mining as a technique is that it facilitates knowledge creation from data. Therefore, by integrating the algorithmic approach of knowledge creation (in particular data mining) with the psychosocial approach of knowledge creation (i.e., the people-driven frameworks of knowledge creation, in particular the knowledge spiral), it is indeed possible to develop a
metaframework for knowledge creation. By so doing, a richer and more complete approach to knowledge creation is realized. Such an approach not only leads to a deeper understanding of the knowledge creation process but also offers a knowledge creation methodology that is more customizable to specific organizational contexts, structures, and cultures. Furthermore, it brings the human factor back into the knowledge creation process and doesn't oversimplify the complex knowledge construct as a homogeneous product. Specifically, in Figure 3, the knowledge product of data mining is broken into its constituent components based on the people-driven perspectives (i.e., Blackler, Spender, and Nonaka, respectively) of knowledge creation. On the other hand, the specific modes of transformation of the knowledge spiral discussed by Nonaka should benefit from the algorithmic, structured nature of both exploratory and predictive data-mining techniques. For example, if you consider socialization, which is described in Nonaka and Nishiguchi (2001) and Nonaka (1994) as the process of creating new tacit knowledge through discussion within groups, more specifically groups of experts, and you then incorporate the results of data-mining techniques into this context, this provides a structured forum and hence a jump start for guiding the dialogue and, consequently, knowledge creation. Note, however, that this only enriches the socialization process without restricting the actual brainstorming activities, and thus does not necessarily lead to the side effect of truncating divergent thoughts. This also holds for Nonaka's other modes of knowledge transformation.
An Example within a Health Care Context

An illustration from health care serves to show the potential of combining the people perspective with data mining. Health care is a very information-rich industry. The collecting of data and information permeates most if not all areas of this industry. By incorporating a people perspective with data mining, it will be possible to realize the full potential of these data assets. Table 1 details some specific instances of each of the transformations identified in Figure 3, and Table 2 provides an example of explicit knowledge stored in a medication repository. Using the association rules data-mining algorithm, the following patterns can be discovered:
• D1 is administered to 60% of the patients (i.e., 3/5).
• D1 and D2 are administered together to 40% of the patients (i.e., 2/5).
• D2 is administered to 67% of the patients who are given drug D1 (i.e., 2/3).
Figure 3. Integrating data mining with the knowledge spiral. The figure maps Nonaka's/Spender's/Blackler's knowledge types (tacit/implicit/embrained; other knowledge types such as embodied, encultured, and embedded; and explicit) against Spender's actors (individual/explicit and social/tacit), with exploratory and predictive data mining at the center of the knowledge spiral. Specific modes of transformation and the role of data mining in realizing them:
1. Socialization: experiential to experiential, via practice-group interactions on data-mining results
2. Externalization: written extensions to data-mining results
3. Internalization: enhancing experiential skills by learning from data-mining results
4. Combination: gaining insights through applying various data-mining visualization techniques
Table 1. Data mining as an enabler of the knowledge spiral (each entry gives an example of how data mining enables the knowledge transfer from the knowledge type in the row to the knowledge type in the column)

From Explicit to Explicit: The performance of exploratory data mining, such as summarization and visualization, makes it possible to upgrade, expand, and/or revise current facts and protocols.

From Explicit to Tacit: By assimilating and internalizing knowledge discovered through data mining, physicians can turn this explicit knowledge into tacit knowledge, which they can apply to treating new patients.

From Tacit to Explicit: Interpretation of findings from data mining helps reveal tacit knowledge, which can then be articulated and stored as explicit cases in the case repository.

From Tacit to Tacit: Interpreting treatment patterns (for example, for hip disease) discovered through data mining enables interaction among physicians, hence making it possible to stimulate and grow their own tacit knowledge.
Table 2. Drugs administered to patients

Patient ID   Drug
1            D1, D2
2            S3, D4, D5
3            D3, D1, D2
4            D3, D5, D1
5            D5, D2
As the physicians try to understand these findings, one physician could explain that D2 has to be given with D1 for patients who had a heart attack at age 40 or less. Thus, from this observation, the following rule can be added to the rule repository: if a patient's age is <= 40 years, the patient has had a heart attack, and D1 is administered to the patient, then D2 should also be administered to that patient. This is an example of existing tacit knowledge, originating in the physician's head, being transformed into new explicit knowledge that is now recorded for everyone to use (as Figure 3 demonstrates).
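A minimal sketch that reproduces the support and confidence figures quoted above from the Table 2 medication data, using only the standard library; the candidate-generation step of a full association-rules algorithm such as Apriori is omitted.

```python
# Transactions from Table 2: drugs administered to each patient
transactions = {1: {"D1", "D2"}, 2: {"S3", "D4", "D5"}, 3: {"D3", "D1", "D2"},
                4: {"D3", "D5", "D1"}, 5: {"D5", "D2"}}

def support(itemset):
    """Fraction of patients whose medication set contains all items in itemset."""
    return sum(itemset <= drugs for drugs in transactions.values()) / len(transactions)

def confidence(antecedent, consequent):
    """Support of antecedent plus consequent, divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(f"support(D1) = {support({'D1'}):.0%}")                      # 60%
print(f"support(D1, D2) = {support({'D1', 'D2'}):.0%}")            # 40%
print(f"confidence(D1 -> D2) = {confidence({'D1'}, {'D2'}):.0%}")  # 67%
```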
As physicians discuss the implications of these findings, tacit knowledge held by some of them is transformed into tacit knowledge for other physicians. Thus, during the interaction of physicians from different specialties, existing tacit knowledge is transformed into new tacit knowledge. These knowledge transformations are summarized in Table 3. Figure 3 and Table 3 illustrate how data mining helps to realize the four modes, or transformations, of the knowledge spiral (socialization, externalization, internalization, and combination).
FUTURE TRENDS

The two significant ways to create knowledge are through (a) the synthesis of new knowledge through socialization with experts (a primarily people-dominated perspective) and (b) discovery, by finding interesting patterns through observation and combination of explicit data (a primarily technology-driven perspective) (Becerra-Fernandez et al., 2004). In today's knowledge economy, knowledge creation and the maximization of an organization's knowledge and data assets are key strategic necessities.
Table 3. Data mining as an enabler of the knowledge spiral from the example data set¹

From EXPLICIT (the patient/drug data of Table 2) to EXPLICIT: D1 is administered to 60% of the patients; D1 and D2 are administered together to 40% of the patients; D2 is administered to 67% of the patients who are given drug D1.
From TACIT to EXPLICIT: If a patient's age is <= 40 years, the patient has had a heart attack, and D1 is administered to the patient, then D2 should also be administered to that patient.
From TACIT to TACIT: Interaction among the physicians as they discuss the mined patterns.
Furthermore, techniques such as business intelligence and business analytics, which have their foundations in traditional data mining, are being embraced by organizations to facilitate the discovery of novel and unique patterns in data that will lead to new knowledge and to the maximization of an organization's data assets. Full maximization of an organization's data assets, however, will not be realized until the people perspective is incorporated into these data-mining techniques to enable the full potential of knowledge creation to occur. Thus, as organizations strive to survive and thrive in today's competitive business environment, incorporating a people perspective into their data-mining initiatives will increasingly become a competitive necessity.
CONCLUSION

Sustainable competitive advantage is dependent on building and exploiting core competencies (Newell et al., 2002). In order to sustain competitive advantage, resources that are idiosyncratic (and thus scarce) and difficult to transfer or replicate are required (Grant, 1991). A knowledge-based view of the firm identifies knowledge as the organizational asset that enables sustainable competitive advantage, especially in hypercompetitive environments (Wickramasinghe, 2003; Davenport & Prusak, 1998; Zack, 1999). This is attributed to the fact that barriers exist regarding the transfer and replication of knowledge (Wickramasinghe, 2003), thus making knowledge and knowledge management of strategic significance (Kanter, 1999). The key to maximizing the knowledge asset lies in finding novel and actionable patterns and in continuously creating new knowledge, thereby increasing the extant knowledge base of the organization. By incorporating a people perspective into data mining, it becomes truly possible to support both major types of knowledge creation scenarios and thereby realize the synergistic effect of the respective strengths of these approaches in enabling superior knowledge creation to ensue.

REFERENCES

Alavi, M., & Leidner, D. (2001). Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS Quarterly, 25(1), 107-136.

Becerra-Fernandez, I., Gonzalez, A., & Sabherwal, R. (2004). Knowledge management. Upper Saddle River, NJ: Prentice Hall.

Becerra-Fernandez, I., & Sabherwal, R. (2001). Organizational knowledge management: A contingency perspective. Journal of Management Information Systems, 18(1), 23-55.

Bendoly, E. (2003). Theory and support for process frameworks of knowledge discovery and data mining from ERP systems. Information & Management, 40, 639-647.

Boland, R., & Tenkasi, R. (1995). Perspective making and perspective taking. Organization Science, 6, 350-372.

Choi, B., & Lee, H. (2003). An empirical investigation of KM styles and their effect on corporate performance. Information & Management, 40, 403-417.

Churchman, C. (1971). The design of inquiring systems: Basic concepts of systems and organizations. New York: Basic Books.

Davenport, T., & Grover, V. (2001). Knowledge management. Journal of Management Information Systems, 18(1), 3-4.

Davenport, T., & Prusak, L. (1998). Working knowledge. Boston: Harvard Business School Press.

Fayyad, Piatetsky-Shapiro, & Smyth (1996). From data mining to knowledge discovery: An overview. In Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy (Eds.), Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press/MIT Press.

Grant, R. (1991). The resource-based theory of competitive advantage: Implications for strategy formulation. California Management Review, 33(3), 114-135.

Holsapple, C., & Joshi, K. (2002). Knowledge manipulation activities: Results of a Delphi study. Information & Management, 39, 477-419.

Kanter, J. (1999). Knowledge management practically speaking. Information Systems Management.

Malhotra, Y. (2000). Knowledge management and new organizational form. In Malhotra (Ed.), Knowledge management and virtual organizations. Hershey, PA: Idea Group Publishing.

Newell, S., Robertson, M., Scarbrough, H., & Swan, J. (2002). Managing knowledge work. New York: Palgrave Macmillan.

Nonaka, I. (1994). A dynamic theory of organizational knowledge creation. Organization Science, 5, 14-37.

Nonaka, I., & Nishiguchi, T. (2001). Knowledge emergence. Oxford, UK: Oxford University Press.

Polanyi, M. (1958). Personal knowledge: Towards a post-critical philosophy. Chicago: University of Chicago Press.

Polanyi, M. (1966). The tacit dimension. London: Routledge & Kegan Paul.

Schultze, U., & Leidner, D. (2002). Studying knowledge management in information systems research: Discourses and theoretical assumptions. MIS Quarterly, 26(3), 212-242.

Spiegler, I. (2003). Technology and knowledge: Bridging a generating gap. Information & Management, 40, 533-539.

Swan, J., Scarbrough, H., & Preston, J. (1999). Knowledge management: The next fad to forget people? Proceedings of the Seventh European Conference on Information Systems.

Wickramasinghe, N. (2003). Do we practise what we preach: Are knowledge management systems in practice truly reflective of knowledge management systems in theory? Business Process Management Journal, 9(3), 295-316.

Wickramasinghe, N., Fadlalla, A., Geisler, E., & Schaffer, J. (2003). Knowledge management and data mining: Strategic imperatives for healthcare. Proceedings of the Third Hospital of the Future Conference, Warwick, UK.

Wickramasinghe, N., & Mills, G. (2001). MARS: The electronic medical record system the core of the Kaiser galaxy. International Journal of Healthcare Technology Management, 3(5/6), 406-423.

Zack, M. (1999). Knowledge and strategy. Boston: Butterworth Heinemann.

KEY TERMS

Combination: A knowledge transfer mode that involves new explicit knowledge being derived from existing explicit knowledge.

Explicit Knowledge: Also known as factual knowledge (i.e., "know what") (Cabena et al., 1998), it represents knowledge that is well established and documented.

Externalization: A knowledge transfer mode that involves new explicit knowledge being derived from existing tacit knowledge.

Hegelian/Kantian Perspective of Knowledge Management: Refers to the subjective component of knowledge management; it can be viewed as an ongoing phenomenon shaped by the social practices of communities, encouraging discourse and divergence of meaning.

Internalization: A knowledge transfer mode that involves new tacit knowledge being derived from existing explicit knowledge.

Knowledge Spiral: The process of transforming the form of knowledge and thus increasing the extant knowledge base as well as the amount and utilization of the knowledge within the organization.

Lockean/Leibnitzian Perspective of Knowledge Management: Refers to the objective aspects of knowledge management, where the need for knowledge is to improve effectiveness and efficiency.

Socialization: A knowledge transfer mode that involves new tacit knowledge being derived from existing tacit knowledge.

Tacit Knowledge: Also known as experiential knowledge (i.e., "know how") (Cabena et al., 1998), it represents knowledge that is gained through experience.

ENDNOTES

1. Each entry explains how data mining enables the knowledge transfer from the type of knowledge in the cell row to the type of knowledge in the cell column.
2. The example is kept small and simple for illustrative purposes; naturally, in large medical databases the data would be much larger.
3. Each entry gives an example of how data mining enables the knowledge transfer from the knowledge type in the cell row to the knowledge type in the cell column.
Incremental Mining from News Streams

Seokkyung Chung, University of Southern California, USA
Jongeun Jun, University of Southern California, USA
Dennis McLeod, University of Southern California, USA
INTRODUCTION

With the rapid growth of the World Wide Web, Internet users are now experiencing overwhelming quantities of online information. Since manually analyzing the data has become nearly impossible, the analysis must be performed by automatic data mining techniques so that users' information needs can be fulfilled quickly. On most Web pages, vast amounts of useful knowledge are embedded in text. Given such large text collections, mining tools that organize the text datasets into structured knowledge would enhance efficient document access. This facilitates information search and, at the same time, provides an efficient framework for document repository management as the number of documents becomes extremely large. Given that the Web has become a vehicle for the distribution of information, many news organizations are providing newswire services through the Internet. Given this popularity of Web news services, text mining on news datasets has received significant attention during the past few years. In particular, as several hundred news stories are published every day at a single Web news site, triggering the whole mining process whenever a document is added to the database is computationally impractical. Therefore, efficient incremental text mining tools need to be developed.
BACKGROUND

The simplest document access method within Web news services is keyword-based retrieval. Although this method seems effective, it has at least three serious drawbacks. First, if a user chooses irrelevant keywords, then retrieval accuracy will be degraded. Second, since keyword-based retrieval relies on the syntactic properties of information (e.g., keyword counting), the semantic gap cannot be overcome (Grosky, Sreenath, & Fotouhi, 2002). Third, only expected information can be retrieved, since
the specified keywords are generated from the users' knowledge space. Thus, if users are unaware of an airplane crash that occurred yesterday, they cannot issue a query about that accident even though they might be interested in it. The first two drawbacks stated above have been addressed by query expansion based on domain-independent ontologies. However, it is well known that this approach leads to a degradation of precision. That is, given that the words introduced by term expansion may have more than one meaning, using additional terms can improve recall but decrease precision. Exploiting a manually developed ontology with a controlled vocabulary would be helpful in this situation (Khan, McLeod, & Hovy, 2004). However, although ontology-authoring tools have been developed in the past decades, manually constructing ontologies whenever new domains are encountered is an error-prone and time-consuming process. Therefore, the integration of knowledge acquisition with data mining, which is referred to as ontology learning, becomes essential (Maedche & Staab, 2001). To facilitate information navigation and search in a news database, clustering can be utilized. Since a collection of documents is easy to skim if similar articles are grouped together, a query can be formulated while a user navigates a cluster hierarchy if the news articles are hierarchically classified according to their topics. Moreover, clustering can be used to identify and deal with near-duplicate articles. That is, when news feeds repeat stories with minor changes from hour to hour, presenting only the most recent articles is probably sufficient. In particular, a sophisticated incremental hierarchical document clustering algorithm can be effectively used to address the high rate of document update. Moreover, in order to achieve rich semantic information retrieval, an ontology-based approach would be provided. However, one of the main problems with concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between topically related terms such as court and attorney, or kidnap and police. Thus, concept-based
ontologies have a limitation in supporting a topical search. In sum, it is essential to develop incremental text mining methods for intelligent news information presentation.
MAIN THRUST

In the following, we will explore text mining approaches that are relevant for news streams data.

Requirements of Document Clustering in News Streams

The data we are considering are high dimensional, large in size, noisy, and arrive as a continuous stream of documents. Many previously proposed document clustering algorithms do not perform well on such data for a variety of reasons. In the following, we define the application-dependent (in terms of news streams) constraints that the clustering algorithm must satisfy.

1. Ability to determine input parameters: Many clustering algorithms require a user to provide input parameters (e.g., the number of clusters), which are difficult to determine in advance, particularly when we are dealing with incremental datasets. Thus, we expect the clustering algorithm not to need this kind of knowledge.
2. Scalability with a large number of documents: The number of documents to be processed is extremely large. In general, the problem of clustering n objects into k clusters is NP-hard. Successful clustering algorithms should be scalable with the number of documents.
3. Ability to discover clusters with different shapes and sizes: The shape of a document cluster can be arbitrary; hence, we cannot assume a particular cluster shape (e.g., a hyper-sphere, as in k-means). In addition, cluster sizes can vary widely, so clustering algorithms should identify clusters with wide variance in size.
4. Outlier identification: In news streams, outliers have significant importance. For instance, a unique document in a news stream may imply a new technology or event that has not been mentioned in previous articles. Thus, forming a singleton cluster for the outlier is important.
5. Efficient incremental clustering: Given different orderings of the same dataset, many incremental clustering algorithms produce different clusters, which is an unreliable phenomenon. Thus, incremental clustering should be robust to the input sequence. Moreover, due to frequent document insertion into the database, whenever a new document is inserted the algorithm should perform a fast update of the existing cluster structure.
6. Meaningful theme of clusters: We expect each cluster to reflect a meaningful theme. We define "meaningful theme" in terms of precision and recall. That is, if a cluster (C) is about "Turkey earthquake," then all documents about "Turkey earthquake" should belong to C, and documents that do not talk about "Turkey earthquake" should not belong to C.
7. Interpretability of resulting clusters: A clustering structure needs to be tied to a succinct summary of each cluster. Consequently, clustering results should be easily comprehensible by users.

Previous Document Clustering Approaches
The most widely used document clustering algorithms fall into two categories: partition-based clustering and hierarchical clustering. In the following, we provide a concise overview of each of them and discuss why these approaches fail to address the requirements discussed above.

Partition-based clustering decomposes a collection of documents into groups that are optimal with respect to some predefined criterion function (Duda, Hart, & Stork, 2001; Liu, Gong, Xu, & Zhu, 2002). Typical methods in this category include center-based clustering, Gaussian mixture models, and so on. Center-based algorithms identify the clusters by partitioning the entire dataset into a pre-determined number of clusters (e.g., k-means clustering). Although center-based clustering algorithms have been widely used in document clustering, they have at least five serious drawbacks. First, in many center-based clustering algorithms, the number of clusters needs to be determined beforehand. Second, the algorithm is sensitive to the initial seed selection. Third, it can model only a spherical (k-means) or ellipsoidal (k-medoid) cluster shape. Furthermore, it is sensitive to outliers, since a small number of outliers can substantially influence the mean value; note that capturing an outlier document and forming a singleton cluster is important. Finally, due to the iterative scheme used to produce clustering results, it is not suitable for incremental datasets.

Hierarchical (agglomerative) clustering (HAC) identifies the clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met (Zhao & Karypis, 2002). Consequently, its result is in the form of a tree, which is referred to as a dendrogram. A dendrogram is represented as a tree with numeric levels associated with its branches. The main advantage of HAC lies in its ability
to provide a view of the data at multiple levels of abstraction. Although HAC can model arbitrary shapes and different sizes of clusters, and can be extended to a robust version (in the outlier-handling sense), it is not suitable for the news streams application for two reasons. First, since HAC builds a dendrogram, a user must determine where to cut the dendrogram to produce the actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Second, the computational complexity of HAC is high, since pairwise similarities between clusters need to be computed.
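As a concrete illustration of the dendrogram-cutting step described above, the following sketch clusters a few TF-IDF document vectors with agglomerative clustering and then flattens the dendrogram at an arbitrary distance threshold. The documents, the threshold value, and the library choices (scikit-learn and SciPy) are our own illustrative assumptions, not part of the framework discussed in this article.

```python
# Minimal sketch of hierarchical agglomerative clustering (HAC) on documents,
# showing why a user-chosen dendrogram cut is needed to obtain actual clusters.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = [
    "earthquake hits turkey overnight",
    "rescue teams search turkey earthquake rubble",
    "stock markets rally on tech earnings",
    "tech shares lead market rally",
]

# Represent documents as TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Build the dendrogram with average-link HAC over cosine distances.
dendrogram = linkage(pdist(vectors, metric="cosine"), method="average")

# The subjective step: cut the dendrogram at a distance threshold to get flat clusters.
labels = fcluster(dendrogram, t=0.8, criterion="distance")
print(labels)  # e.g., [1 1 2 2]: two topical clusters for this toy corpus
```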
Topic Detection and Tracking

Over the past six years, the information retrieval community has developed a new research area called TDT (Topic Detection and Tracking) (Makkonen, Ahonen-Myka, & Salmenkivi, 2004; Allan, 2002). The main goal of TDT is to detect the occurrence of a novel event in a stream of news stories and to track known events. In particular, there are three major components in TDT.

1. Story segmentation: It segments a news stream (e.g., including transcribed speech) into topically cohesive stories. Since online Web news (in HTML format) is supplied in segmented form, this task only applies to audio or TV news.
2. First Story Detection (FSD): It identifies whether a new document belongs to an existing topic or a new topic.
3. Topic tracking: It tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed before. It can also be asked to monitor the news stream for further stories on the same topic.
An event is defined as "some unique thing that happens at some point in time." Hence, an event is different from a topic. For example, "airplane crash" is a topic, while "Chinese airplane crash in Korea in April 2002" is an event. Note that it is important to identify events as well as topics. Although a user may not be interested in the flood topic in general, the user may be interested in a news story on a Texas flood if the user's hometown is in Texas. Thus, a news recommendation system must be able to distinguish different events within the same topic. Single-pass document clustering (Chung & McLeod, 2003) has been used extensively in TDT research. However, the major drawback of this approach lies in its order-sensitive property. Although the order of documents is already fixed, since documents are inserted into the database in chronological order, order sensitivity implies that the resulting clusters are unreliable. Thus, a new
methodology is required in terms of incremental news article clustering.
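To make the single-pass scheme discussed above concrete, here is a minimal sketch of single-pass incremental document clustering over term vectors with a cosine-similarity threshold. The threshold value, data structures, and helper names are illustrative assumptions and not the algorithms actually used in the cited systems.

```python
# Minimal sketch of single-pass incremental document clustering.
# Each incoming document joins the most similar existing cluster if the cosine
# similarity to that cluster's centroid exceeds a threshold; otherwise it seeds
# a new (possibly singleton) cluster. Note the dependence on arrival order.
import math
from collections import Counter

THRESHOLD = 0.3  # illustrative value; tuning it is part of the design problem

def vectorize(text):
    """Very rough term-frequency vector (a stand-in for TF-IDF weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

clusters = []  # each cluster is a dict: {"centroid": Counter, "docs": [str]}

def add_document(doc):
    vec = vectorize(doc)
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= THRESHOLD:
        best["docs"].append(doc)
        best["centroid"].update(vec)  # incremental centroid update
    else:
        clusters.append({"centroid": Counter(vec), "docs": [doc]})

for article in ["turkey earthquake kills dozens",
                "rescuers dig through turkey earthquake rubble",
                "tech stocks rally on strong earnings"]:
    add_document(article)

print([c["docs"] for c in clusters])
```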
Dynamic Topic Mining

Dynamic topic mining is a framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data (Chung & McLeod, 2003). To build a novel paradigm for an intelligent news database management and navigation scheme, it utilizes techniques from information retrieval, data mining, machine learning, and natural language processing. In dynamic topic mining, a Web crawler downloads news articles from a news Web site on a daily basis. Retrieved news articles are processed by diverse information retrieval and data mining tools to produce useful higher-level knowledge, which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to present an answer in response to a user request (in terms of topic detection and tracking, keyword-based retrieval, document cluster visualization, etc.).

The key contributions of the dynamic topic mining framework are the development of a novel hierarchical incremental document clustering algorithm and a topic ontology learning framework. Despite the huge body of research on document clustering, previously proposed document clustering algorithms are limited in that they cannot address the special requirements of a news environment; that is, an algorithm must address the seven application-dependent constraints discussed before. Toward this end, the dynamic topic mining framework presents a sophisticated incremental hierarchical document clustering algorithm that utilizes a neighborhood search. The algorithm was tested to demonstrate its effectiveness in terms of the seven constraints. The novelty of the algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally.

In addition, to overcome the lack of topical relations in conceptual ontologies, a topic ontology learning framework is presented. The proposed topic ontologies provide interpretations of news topics at different levels of abstraction. For example, regarding a Winona Ryder court trial news topic (T), dynamic topic mining could capture "winona, ryder, actress, shoplift, beverly" as specific terms describing T (i.e., the specific concept for T) and "attorney, court, defense, evidence, jury, kill, law, legal, murder, prosecutor, testify, trial" as general terms representing T (i.e., the general concept for T). There exists research work on extracting hierarchical relations between terms from a set of documents (Tseng,
2002). However, the dynamic topic mining framework is unique in that the topical relations are dynamically generated based on incremental hierarchical clustering rather than on human-defined topics, such as the Yahoo directory.
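The specific/general term split described above can be approximated in a very simple way: terms that are frequent in a leaf (event-level) cluster but rare in its sibling clusters are treated as specific, while terms shared across the parent topic's children are treated as general. The sketch below illustrates this idea; the scoring rule and all names are our own illustrative choices, not the actual dynamic topic mining algorithm.

```python
# Illustrative sketch: splitting a news cluster's vocabulary into "specific"
# and "general" topic terms by contrasting a child (event) cluster with its
# siblings under the same parent (topic) cluster in a cluster hierarchy.
from collections import Counter

def term_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def split_terms(child_docs, sibling_docs, top_n=5):
    child = term_counts(child_docs)
    siblings = term_counts(sibling_docs)
    # General terms: appear in the child and in its siblings (shared by the parent topic).
    general = [t for t in child if t in siblings]
    # Specific terms: frequent in the child but absent from the siblings.
    specific = [t for t, _ in child.most_common() if t not in siblings][:top_n]
    return specific, general

event_docs = ["ryder shoplifting trial opens in beverly hills court",
              "actress ryder faces jury in shoplifting case"]
other_trial_docs = ["murder trial jury hears prosecutor evidence",
                    "court appoints defense attorney in assault trial"]

specific, general = split_terms(event_docs, other_trial_docs)
print("specific:", specific)  # e.g., ryder, shoplifting, beverly, actress, ...
print("general:", general)    # e.g., trial, court, jury
```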
FUTURE TRENDS

There are many future research opportunities in news streams mining. First, although a document hierarchy can be obtained using unsupervised clustering, as shown in Aggarwal, Gates, and Yu (2004), the cluster quality can be enhanced if a pre-existing knowledge base is exploited; that is, based on this prior knowledge, we can exert some control while building a document hierarchy. Second, the document representation for clustering can be augmented with phrases by employing different levels of linguistic analysis (Hatzivassiloglou, Gravano, & Maganti, 2000). That is, the representation model can be augmented by adding n-grams (Peng & Schuurmans, 2003) or frequent itemsets obtained through association rule mining (Hand, Mannila, & Smyth, 2001). Investigating how different feature selection algorithms affect the accuracy of clustering results is an interesting research direction. Third, besides text data, other information can be utilized, since Web news articles are composed of text, hyperlinks, and multimedia data. For example, both terms and hyperlinks (which point to related news articles or Web pages) can be used for feature selection. Finally, the topic ontology learning framework can be extended to accommodate rich semantic information extraction. For example, topic ontologies can be annotated within the Protégé (Noy, Sintek, Decker, Crubezy, Fergerson, & Musen, 2001) WordNet tab. In addition, a query expansion algorithm based on ontologies needs to be developed for intelligent news information presentation.

CONCLUSION

Incremental text mining from news streams is an emerging technology, as many news organizations are providing newswire services through the Internet. In order to accommodate dynamically changing topics, efficient incremental document clustering algorithms need to be developed. The algorithms must address the special requirements of news clustering, such as a high rate of document update and the ability to identify event-level clusters as well as topic-level clusters. In order to achieve rich semantic information retrieval within Web news services, an ontology-based approach would be provided. To overcome the problem of concept-based ontologies (i.e., topically related concepts and terms are not explicitly linked), topic ontologies are presented to characterize news topics at multiple levels of abstraction. In sum, by coupling topic ontologies with concept-based ontologies, both topical search and semantic information retrieval can be supported.

REFERENCES

Aggarwal, C.C., Gates, S.C., & Yu, P.S. (2004). On using partial supervision for text categorization. IEEE Transactions on Knowledge and Data Engineering, 16(2), 245-255.

Allan, J. (2002). Detection as multi-topic tracking. Information Retrieval, 5(2-3), 139-157.

Chung, S., & McLeod, D. (2003, November). Dynamic topic mining from news stream data. In ODBASE'03 (pp. 653-670). Catania, Sicily, Italy.

Duda, R.O., Hart, P.E., & Stork, D.G. (2001). Pattern classification. New York: Wiley Interscience.

Grosky, W.I., Sreenath, D.V., & Fotouhi, F. (2002). Emergent semantics and the multimedia semantic web. SIGMOD Record, 31(4), 54-58.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining (adaptive computation and machine learning). Cambridge, MA: The MIT Press.

Hatzivassiloglou, V., Gravano, L., & Maganti, A. (2000, July). An investigation of linguistic features and clustering algorithms for topical document clustering. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'00) (pp. 224-231). Athens, Greece.

Khan, L., McLeod, D., & Hovy, E. (2004). Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1), 71-85.

Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002, August). Document clustering with cluster refinement and model selection capabilities. In ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'02) (pp. 91-198). Tampere, Finland.

Maedche, A., & Staab, S. (2001). Ontology learning for the semantic Web. IEEE Intelligent Systems, 16(2), 72-79.

Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7(3-4), 347-368.

Noy, N.F., Sintek, M., Decker, S., Crubezy, M., Fergerson, R.W., & Musen, M.A. (2001). Creating semantic Web contents with Protégé 2000. IEEE Intelligent Systems, 6(12), 60-71.

Peng, F., & Schuurmans, D. (2003, April). Combining naive Bayes and n-gram language models for text classification. In European Conference on IR Research (ECIR'03) (pp. 335-350). Pisa, Italy.

Tseng, Y. (2002). Automatic thesaurus generation for Chinese documents. Journal of the American Society for Information Science and Technology, 53(13), 1130-1138.

Zhao, Y., & Karypis, G. (2002, November). Evaluations of hierarchical clustering algorithms for document datasets. In ACM International Conference on Information and Knowledge Management (CIKM'02) (pp. 515-524). McLean, VA.

KEY TERMS

Clustering: An unsupervised process of dividing data into meaningful groups such that each identified cluster can explain the characteristics of the underlying data distribution. Examples include the characterization of different customer groups based on purchasing patterns, the categorization of documents on the World Wide Web, and the grouping of spatial locations of the earth where neighboring points in each region have similar short-term and long-term climate patterns.

Dynamic Topic Mining: A framework that supports the identification of meaningful patterns (e.g., events, topics, and topical relations) from news stream data.

First Story Detection: A TDT component that identifies whether a new document belongs to an existing topic or a new topic.

Ontology: A collection of concepts and inter-relationships.

Text Mining: A process of identifying patterns or trends in natural language text, including document clustering, document classification, ontology learning, and so on.

Topic Detection and Tracking (TDT): A DARPA-sponsored initiative to investigate the state of the art for news understanding systems. Specifically, TDT is composed of the following three major components: (1) segmenting a news stream (e.g., including transcribed speech) into topically cohesive stories; (2) identifying novel stories that are the first to discuss a new event; and (3) tracking known events given sample stories.

Topic Ontology: A collection of terms that characterize a topic at multiple levels of abstraction.

Topic Tracking: A TDT component that tracks events of interest based on sample news stories. It associates incoming news stories with related stories that were already discussed before, or it monitors the news stream for further stories on the same topic.
Inexact Field Learning Approach for Data Mining

Honghua Dai, Deakin University, Australia
INTRODUCTION

Inexact field learning (IFL) (Ciesielski & Dai, 1994; Dai & Ciesielski, 1994a, 1994b, 1995, 2004; Dai & Li, 2001) is a rough-set-theory-based (Pawlak, 1982) machine learning approach that derives inexact rules from the fields of each attribute. In contrast to a point-learning algorithm (Quinlan, 1986, 1993), which derives rules by examining individual values of each attribute, a field-learning approach (Dai, 1996) derives rules by examining the fields of each attribute. In contrast to an exact rule, an inexact rule is a rule with uncertainty. The advantages of the IFL method are its capability to discover high-quality rules from low-quality data, its tolerance of low-quality data (Dai & Ciesielski, 1994a, 2004), its high efficiency in discovery, and the high accuracy of the discovered rules.
BACKGROUND

Achieving high prediction accuracy rates is crucial for all learning algorithms, particularly in real applications. In the area of machine learning, a well-recognized problem is that the derived rules can fit the training data very well but fail to achieve a high accuracy rate on new, unseen cases. This is particularly true when the learning is performed on low-quality databases. Such a problem is referred to as the Low Prediction Accuracy (LPA) problem (Dai & Ciesielski, 1994b, 2004; Dai & Li, 2001), which could be caused by several factors. In particular, overfitting low-quality data and being misled by it seem to be the significant problems that can hamper a learning algorithm from achieving high accuracy. Traditional learning methods derive rules by examining individual values of instances (Quinlan, 1986, 1993). To generate classification rules, these methods always try to find cut-off points, as in the well-known decision tree algorithms (Quinlan, 1986, 1993). What we present here is an approach that derives rough classification rules from large, low-quality numerical databases and appears to be able to overcome these two problems. The algorithm works on the fields of continuous numeric variables; that is, the intervals of possible
values of each attribute in the training set, rather than on individual point values. The discovered rule is in a form called a b-rule and is somewhat analogous to a decision tree found by an induction algorithm. The algorithm is linear in both the number of attributes and the number of instances (Dai & Ciesielski, 1994a, 2004). The advantages of this inexact field-learning approach are its capability of inducing high-quality classification rules from low-quality data and its high efficiency, which make it an ideal algorithm for discovering reliable knowledge from large and very large low-quality databases, as required in data mining.
INEXACT FIELD-LEARNING ALGORITHM

Detailed descriptions and applications of the algorithm can be found in the listed articles (Ciesielski & Dai, 1994a; Dai & Ciesielski, 1994a, 1994b, 1995, 2004; Dai, 1996; Dai & Li, 2001). The following is a description of the inexact field-learning algorithm, the Fish-net algorithm.

Input: a training data set with $m$ instances and $n$ attributes:

$$\begin{array}{c|cccc|c}
\text{Instances} & x_1 & x_2 & \cdots & x_n & \text{Class} \\
\hline
\text{Instance}_1 & a_{11} & a_{12} & \cdots & a_{1n} & \gamma_1 \\
\text{Instance}_2 & a_{21} & a_{22} & \cdots & a_{2n} & \gamma_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\text{Instance}_m & a_{m1} & a_{m2} & \cdots & a_{mn} & \gamma_m
\end{array} \qquad (1)$$

Learning Process:

• Step 1: Work out the field of each attribute $x_j$ ($1 \le j \le n$) with respect to each class:

$$h_j^{(k)} = [h_{jl}^{(k)}, h_{ju}^{(k)}] \quad (k = 1, 2, \ldots, s;\ j = 1, 2, \ldots, n) \qquad (2)$$

$$h_{ju}^{(k)} = \max_{a_{ij} \in a_j^{(k)}} \{a_{ij} \mid i = 1, 2, \ldots, m\} \qquad (3)$$

$$h_{jl}^{(k)} = \min_{a_{ij} \in a_j^{(k)}} \{a_{ij} \mid i = 1, 2, \ldots, m\} \qquad (4)$$

• Step 2: Construct a contribution function based on the fields found in Step 1:

$$\mu_{c_k}(x_j) = \begin{cases} 0 & x_j \in \bigcup_{i \ne k} h_j^{(i)} - h_j^{(k)} \\ 1 & x_j \in h_j^{(k)} - \bigcup_{i \ne k} h_j^{(i)} \\ \dfrac{x_j - a}{b - a} & x_j \in h_j^{(k)} \cap \left(\bigcup_{i \ne k} h_j^{(i)}\right) \end{cases} \quad (k = 1, 2, \ldots, s) \qquad (5)$$

Formula (5) is given on the assumption that $[a, b] = h_j^{(k)} \cap (\bigcup_{i \ne k} h_j^{(i)})$ and that, for any small number $\varepsilon > 0$, $b \pm \varepsilon \in h_j^{(k)}$ and $a + \varepsilon \notin \bigcup_{i \ne k} h_j^{(i)}$ or $a - \varepsilon \notin \bigcup_{i \ne k} h_j^{(i)}$. Otherwise, formula (5) becomes

$$\mu_{c_k}(x_j) = \begin{cases} 0 & x_j \in \bigcup_{i \ne k} h_j^{(i)} - h_j^{(k)} \\ 1 & x_j \in h_j^{(k)} - \bigcup_{i \ne k} h_j^{(i)} \\ \dfrac{x_j - b}{a - b} & x_j \in h_j^{(k)} \cap \left(\bigcup_{i \ne k} h_j^{(i)}\right) \end{cases} \quad (k = 1, 2, \ldots, s) \qquad (6)$$

• Step 3: Work out contribution fields by applying the constructed contribution functions to the training data set. Calculate the contribution of each instance:

$$\alpha(I_i) = \frac{1}{n}\sum_{j=1}^{n} \mu(x_{ij}) \quad (i = 1, 2, \ldots, m) \qquad (7)$$

Then work out the contribution field for each class:

$$h^{+} = \langle h_l^{+}, h_u^{+} \rangle \qquad (8)$$

$$h_u^{+} = \max_{\alpha(I_i),\ I_i \in +} \{\alpha(I_i) \mid i = 1, 2, \ldots, m\} \qquad (9)$$

$$h_l^{+} = \min_{\alpha(I_i),\ I_i \in +} \{\alpha(I_i) \mid i = 1, 2, \ldots, m\} \qquad (10)$$

Similarly, we can find $h^{-} = \langle h_l^{-}, h_u^{-} \rangle$.

• Step 4: Construct a belief function using the derived contribution fields:

$$Br(C) = \begin{cases} -1 & \text{Contribution} \in \text{NegativeRegion} \\ \dfrac{c - a}{b - a} & \text{Contribution} \in \text{RoughRegion} \\ 1 & \text{Contribution} \in \text{PositiveRegion} \end{cases} \qquad (11)$$

• Step 5: Decide the threshold. Six different cases could be considered; the simplest case is to take the threshold $\alpha$ as the midpoint of $h^{+}$ and $h^{-}$.

• Step 6: Form the inexact rule:

$$\text{If } \frac{1}{N}\sum_{i=1}^{N} \mu(x_i) > \alpha \text{ Then } \gamma = 1\ (Br(c)) \qquad (12)$$

This algorithm was tested on three large real observational weather data sets containing both high-quality and low-quality data. The accuracy rates of the forecasts were 86.4%, 78%, and 76.8%. These are significantly better than the accuracy rates achieved by C4.5 (Quinlan, 1986, 1993), feed-forward neural networks, discrimination analysis, k-nearest neighbor classifiers, and human weather forecasters. The Fish-net algorithm exhibited significantly less overfitting than the other algorithms. The training times were shorter, in some cases by orders of magnitude (Dai & Ciesielski, 1994a, 2004; Dai, 1996).
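The following is a minimal, self-contained sketch of the field/contribution idea behind the algorithm for the special case of two classes. The synthetic data, function names, and simplifications (for example, collapsing the belief function and the threshold cases into a single midpoint rule) are ours and do not constitute the full Fish-net implementation.

```python
# Simplified two-class sketch of the field-learning idea behind Fish-net:
# learn per-attribute value intervals ("fields") for each class, score each
# instance by how much its values fall inside its own class field and outside
# the other class's field, and threshold the average score. This is an
# illustrative reduction of equations (2)-(7) and (12), not the full algorithm.

def fields(rows):
    """Per-attribute [min, max] interval over a list of attribute vectors."""
    return [(min(col), max(col)) for col in zip(*rows)]

def contribution(x, own, other):
    """Rough analogue of the contribution function mu for one attribute value."""
    (ol, ou), (xl, xu) = own, other
    in_own = ol <= x <= ou
    in_other = xl <= x <= xu
    if in_own and not in_other:
        return 1.0
    if in_other and not in_own:
        return 0.0
    # Overlap region: interpolate linearly across the overlap [a, b].
    a, b = max(ol, xl), min(ou, xu)
    return (x - a) / (b - a) if b > a else 0.5

def instance_score(instance, pos_fields, neg_fields):
    scores = [contribution(x, p, q) for x, p, q in zip(instance, pos_fields, neg_fields)]
    return sum(scores) / len(scores)

def train(pos_rows, neg_rows):
    pos_f, neg_f = fields(pos_rows), fields(neg_rows)
    pos_scores = [instance_score(r, pos_f, neg_f) for r in pos_rows]
    neg_scores = [instance_score(r, pos_f, neg_f) for r in neg_rows]
    # Simplest thresholding case: midpoint between the two contribution fields.
    threshold = (min(pos_scores) + max(neg_scores)) / 2
    return pos_f, neg_f, threshold

# Tiny synthetic example (the values are made up for illustration only).
rain    = [[0.9, 20.1], [0.8, 19.5], [0.95, 21.0]]   # humidity, temperature
no_rain = [[0.3, 25.0], [0.4, 27.5], [0.2, 26.0]]
pos_f, neg_f, thr = train(rain, no_rain)
print(instance_score([0.85, 20.0], pos_f, neg_f) > thr)  # expected: True
```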
FUTURE TRENDS
The inexact field-learning approach has led to a successful algorithm in a domain where there is a high level of noise. We believe that other algorithms based on fields can also be developed. The b-rules produced by the current Fish-net algorithm involve linear combinations of attributes; non-linear rules may be even more accurate.
Inexact Field Learning Approach for Data Mining
While extensive tests have been done on the fish-net algorithm with large meteorological databases, nothing in the algorithm is specific to meteorology. It is expected that the algorithm will perform equally well in other domains. In parallel to most existing exact machine-learning methods, the inexact field-learning approaches can be used for large or very large noisy data mining, particularly where the data quality is a major problem that may not be dealt with by other data-mining approaches. Various learning algorithms can be created, based on the fields derived from a given training data set. There are several new applications of inexact field learning, such as Zhuang and Dai (2004) for Web document clustering and some other inexact learning approaches (Ishibuchi et al., 2001; Kim et al., 2003). The major trends of this approach are in the following: 1. 2. 3.
Heavy application for all sorts of data-mining tasks in various domains. Developing new powerful discovery algorithms in conjunction with IFL and traditional learning approaches. Extend current IFL approach to deal with high dimensional, non-linear, and continuous problems.
CONCLUSION The inexact field-learning algorithm: Fish-net is developed for the purpose of learning rough classification/forecasting rules from large, low-quality numeric databases. It runs high efficiently and generates robust rules that do not overfit the training data nor result in low prediction accuracy. The inexact field-learning algorithm, fish-net, is based on fields of the attributes rather than the individual point values. The experimental results indicate that: 1.
2.
3.
The fish-net algorithm is linear both in the number of instances and in the number of attributes. Further, the CPU time grows much more slowly than the other algorithms we investigated. The Fish-net algorithm achieved the best prediction accuracy tested on new unseen cases out of all the methods tested (i.e., C4.5, feed-forward neural network algorithms, a k-nearest neighbor method, the discrimination analysis algorithm, and human experts. The fish-net algorithm successfully overcame the LPA problem on two large low-quality data sets examined. Both the absolute LPA error rate and the relative LPA error rate (Dai & Ciesieski, 1994b) of the fish-net were very low on these data sets. They were significantly lower than that of point-learning ap-
4.
proach, such as C4.5, on all the data sets and lower than the feed-forward neural network. A reasonably low LPA error rate was achieved by the feedforward neural network but with the high time cost of error back-propagation. The LPA error rate of the KNN method is comparable to fish-net. This was achieved after a very high-cost genetic algorithm search. The FISH-NET algorithm obviously was not affected by low-quality data. It performed equally well on low-quality data and high-quality data.
REFERENCES

Ciesielski, V., & Dai, H. (1994a). FISHERMAN: A comprehensive discovery, learning and forecasting system. Proceedings of the 2nd Singapore International Conference on Intelligent Systems, Singapore.

Dai, H. (1994c). Learning of forecasting rules from large noisy meteorological data [doctoral thesis]. RMIT, Melbourne, Victoria, Australia.

Dai, H. (1996a). Field learning. Proceedings of the 19th Australian Computer Science Conference.

Dai, H. (1996b). Machine learning of weather forecasting rules from large meteorological data bases. Advances in Atmospheric Science, 13(4), 471-488.

Dai, H. (1997). A survey of machine learning [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (1994a). Learning of inexact rules by the FISH-NET algorithm from low quality data. Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, Brisbane, Australia.

Dai, H., & Ciesielski, V. (1994b). The low prediction accuracy problem in learning. Proceedings of the Second Australian and New Zealand Conference on Intelligent Systems, Armidale, NSW, Australia.

Dai, H., & Ciesielski, V. (1995). Inexact field learning using the FISH-NET algorithm [technical report]. Monash University, Melbourne, Victoria, Australia.

Dai, H., & Ciesielski, V. (2004). Learning of fuzzy classification rules by inexact field learning approach [technical report]. Deakin University, Melbourne, Australia.

Dai, H., & Li, G. (2001). Inexact field learning: An approach to induce high quality rules from low quality data. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM-01), San Jose, California.
Ishibuchi, H., Yamamoto, T., & Nakashima, T. (2001). Fuzzy data mining: Effect of fuzzy discretization. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Kim, M., Ryu, J., Kim, S., & Lee, J. (2003). Optimization of fuzzy rules for classification using genetic algorithm. Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Seoul, Korea.

Pawlak, Z. (1982). Rough sets. International Journal of Information and Computer Science, 11(5), 145-172.

Quinlan, R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.

Zhuang, L., & Dai, H. (2004). Maximal frequent itemset approach for Web document clustering. Proceedings of the 2004 International Conference on Computer and Information Technology (CIT'04), Wuhan, China.
KEY TERMS

b-Rule: A type of inexact rule that represents uncertainty with contribution functions and belief functions.

Exact Learning: Learning approaches that are capable of inducing exact rules.

Exact Rules: Rules without uncertainty.

Field Learning: Derives rules by looking at the field of the values of each attribute over all the instances of the training data set.

Inexact Learning: Learning by which inexact rules are induced.

Inexact Rules: Rules with uncertainty.

Low-Quality Data: Data with lots of noise, missing values, redundant features, mistakes, and so forth.

LPA (Low Prediction Accuracy) Problem: The problem in which derived rules fit the training data very well but fail to achieve a high accuracy rate on new, unseen cases.

Point Learning: Derives rules by looking at each individual point value of the attributes in every instance of the training data set.
Information Extraction in Biomedical Literature
Min Song, Drexel University, USA
Il-Yeol Song, Drexel University, USA
Xiaohua Hu, Drexel University, USA
Hyoil Han, Drexel University, USA
INTRODUCTION

Information extraction (IE) technology has been defined and developed through the US DARPA Message Understanding Conferences (MUCs). IE refers to the identification of instances of particular events and relationships in unstructured natural language text documents and their conversion into a structured representation or relational table in a database. It has proved successful at extracting information in various domains; for example, it was used to identify patterns related to terrorist activities in Latin America (MUC-4). Another application, in light of the wealth of natural language documents, is to extract the knowledge or information in these unstructured plain-text files into a structured or relational form. This form is suitable for sophisticated query processing, for integration with relational databases, and for data mining. Thus, IE is a crucial step in making text files more easily accessible.
BACKGROUND

The advent of large volumes of text databases and search engines has made them readily available to domain experts and has significantly accelerated research on bioinformatics. With the size of a digital library commonly exceeding millions of documents, rapidly increasing, and covering a wide range of topics, efficient and automatic extraction of meaningful data and relations has become a challenging issue. To tackle this issue, rigorous studies have recently been carried out to apply IE to biomedical data. Such research efforts began to be called biomedical literature mining or text mining in bioinformatics (de Bruijn & Martin, 2002; Hirschman et al., 2002; Shatkay & Feldman, 2003). In this article, we review recent advances in applying IE techniques to biomedical literature.

Figure 1. Overview of a typical biomedical literature mining system: literature collections (e.g., MEDLINE, patient records) and ontologies (e.g., MeSH, UMLS, SNOMED CT, gene ontologies) feed text mining systems that extract entities and relationships and build a biomedical knowledge base, which is integrated into the system and evaluated against curated databases (e.g., GenBank, SwissProt, BLAST).

MAIN THRUST

This article attempts to synthesize the works that have been done in the field. A taxonomy helps us understand the accomplishments and challenges in this emerging field. In this article, we use the following set of criteria to classify studies related to biomedical literature mining:

1. What are the target objects that are to be extracted?
2. What techniques are used to extract the target objects from the biomedical literature?
3. How are the techniques or systems evaluated?
From what data sources are the target objects extracted?
Target Objects In terms of what is to be extracted by the systems, most studies can be broken into the following two major areas: (1) named entity extraction such as proteins or genes; and (2) relation extraction, such as relationships between proteins. Most of these studies adopt information extraction techniques using curated lexicon or natural language processing for identifying relevant tokens such as words or phrases in text (Shatkay & Feldman, 2003). In the area of named entity extraction, Proux et al. (2000) use single word names only with selected test set from 1,200 sentences coming from Flybase. Collier, et al. (2000) adopt Hidden Markov Models (HMMs) for 10 test classes with small training and test sets. Krauthammer et al. (2000) use BLAST database with letters encoded as 4-tuples of DNA. Demetriou and Gaizuaskas (2002) pipeline the mining processes, including hand-crafted components and machine learning components. For the study, they use large lexicon and morphology components. Narayanaswamy et al. (2003) use a part of speech (POS) tagger for tagging the parsed MEDLINE abstracts. Although Narayanaswamy and his colleagues (2003) implement an automatic protein name detection system, the number of words used is 302, and, thus, it is difficult to see the quality of their system, since the size of the test data is too small. Yamamoto, et al. (2003) use morphological analysis techniques for preprocessing protein name tagging and apply support vector machine (SVM) for extracting protein names. They found that increasing training data from 390 abstracts to 1,600 abstracts improved F-value performance from 70% to 75%. Lee et al. (2003) combined an SVM and dictionary lookup for named entity recognition. Their approach is based on two phases: the first phase is
identification of each entity with an SVM classifier, and the second phase is post-processing to correct the errors by the SVM with a simple dictionary lookup. Bunescu, et al. (2004) studied protein name identification and proteinprotein interaction. Among several approaches used in their study, the main two ways are one using POS tagging and the other using the generalized dictionary-based tagging. Their dictionary-based tagging presents higher F-value. Table 1 summarizes the works in the areas of named entity extraction in biomedical literature. The second target object type of biomedical literature extraction is relation extraction. Leek (1997) applies HMM techniques to identify gene names and chromosomes through heuristics. Blaschke et al. (1999) extract proteinprotein interactions based on co-occurrence of the form “… p1…I1… p2” within a sentence, where p1, p2 are proteins, and I1 is an interaction term. Protein names and interaction terms (e.g., activate, bind, inhibit) are provided as a dictionary. Proux (2000) extracts an interact relation for the gene entity from Flybase database. Pustejovsky (2002) extracts an inhibit relation for the gene entity from MEDLINE. Jenssen, et al. (2001) extract a genegene relations based on co-occurrence of the form “… g1…g2…” within a MEDLINE abstracts, where g1 and g2 are gene names. Gene names are provided as a dictionary, harvested from HUGO, LocusLink, and other sources. Although their study uses 13,712 named human genes and millions of MEDLINE abstracts, no extensive quantitative results are reported and analyzed. Friedman, et al. (2001) extract a pathway relation for various biological entities from a variety of articles. In their work, the precision of the experiments is high (from 79-96%). However, the recalls are relatively low (from 21-72%). Bunescu et al. (2004) conducted protein/protein interaction identification with several learning methods, such as pattern matching rule induction (RAPIER), boosted wrapper induction (BWI), and extraction using longest common subsequences (ELCS). ELCS automatically learns rules for extracting protein interactions using a bottom-up
Table 1. A summary of works in biomedical entity extraction Author Collier, et al. (2000) Krauthammer, et al. (2000)
Database MEDLINE Review articles
No. of Words 30,000 5,000
Demetriou and Gaizauskas (2002) Narayanaswamy (2003)
Protein, Species, and 10 more Protein
MEDLINE
30,000
MEDLINE
302
Yamamoto, et al. (2003) Lee, et al. (2003)
Protein
GENIA
1,600 abstracts
Protein DNA RNA Protein
GENIA
10,000
Bunescu (2004)
616
Named Entities Proteins and DNA Gene and Protein
MEDLINE
5,206
Learning Methods HMM Character sequence mapping PASTA template filing Hand-crafted rules and cooccurrence BaseNP recognition SVM RAPIER, BWI, TBL, k-NN , SVMs, MaxEnt
F Value 73 75 83 75.86 75 77 57.86
Information Extraction in Biomedical Literature
Table 2. A summary of relation extraction for biomedical data Authors
Relation
Entity
DB
Leek (1997) Blaschke (1999) Proux (2000) Pustejovsky (2001) Jenssen (2001) Friedman (2001)
Location Interact Interact Inhibit Location Pathway
Gene Protein Gene Gene Gene Many
OMIM MEDLINE Flybase MEDLINE MEDLINE Articles
Bunescu (2004)
Interact
Protein
MEDLINE
approach. They conducted experiments in two ways: one with manually crafted protein names and the other with the extracted protein names by their name identification method. In both experiments, Bunescu, et al. compared their results with human-written rules and showed that machine learning methods provide higher precisions than human-written rules. Table 2 summarizes the works in the areas of relation extraction in biomedical literature.
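As a concrete illustration of the co-occurrence pattern "... p1 ... I1 ... p2 ..." discussed above, the snippet below extracts candidate protein-protein interactions from single sentences using a protein dictionary and a list of interaction terms. The dictionary entries, interaction terms, and sentences are invented for illustration and are not taken from any of the cited systems.

```python
# Minimal sketch of co-occurrence-based protein-protein interaction extraction
# in the spirit of the "... p1 ... I1 ... p2 ..." pattern described above.
# The protein dictionary, interaction terms, and sentences are illustrative only.
import re

proteins = {"RAD51", "BRCA2", "TP53", "MDM2"}
interaction_terms = {"binds", "activates", "inhibits", "interacts"}

def extract_interactions(sentence):
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    found_proteins = [t for t in tokens if t.upper() in proteins]
    has_interaction = any(t.lower() in interaction_terms for t in tokens)
    pairs = []
    if has_interaction and len(found_proteins) >= 2:
        # Report every protein pair co-occurring with an interaction term.
        for i in range(len(found_proteins)):
            for j in range(i + 1, len(found_proteins)):
                pairs.append((found_proteins[i], found_proteins[j]))
    return pairs

print(extract_interactions("RAD51 binds BRCA2 in the nucleus"))
# [('RAD51', 'BRCA2')]
print(extract_interactions("TP53 was sequenced in several samples"))
# []
```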
Techniques Used The most commonly used extraction technique is cooccurrence based. The basic idea of this technique is that entities are extracted based on frequency of co-occurrence of biomedical named entities such as proteins or genes within sentences. This technique was introduced by Blaschke, et al. (1999). Their goal was to extract information from scientific text about protein interactions among a predetermined set of related programs. Since Blaschke and his colleagues’ study, numerous other co-occurrence-based systems have been proposed in the literature. All are associated with information extraction of biomedical entities from the unstructured text corpus. The common denominator of the co-occurrence-based systems is that they are based on co-occurrences of names or identifiers of entities, typically along with activation/dependency terms. These systems are differentiated one from another by integrating different machine learning techniques such as syntactical analysis or POS tagging, as well as ontologies and controlled vocabularies (Hahn et al., 2002; Pustejovsky et al., 2002; Yakushiji et al., 2001). Although these techniques are straightforward and easy to develop, from the performance standpoint, recall and precision are much lower than any other machine-learning techniques (Ray & Craven, 2001). In parallel with co-occurrence-based systems, the researchers began to investigate other machine learning or NLP techniques. One of the earliest studies was done by Leek (1997), who utilized Hidden Markov Models (HMMs) to extract sentences discussing gene location of chromosomes. HMMs are applied to represent sentence structures for natural language processing, where states
Learning Precision Methods HMM 80% Co-occurrence n/a Co-occurrence 81% Co-occurrence 90% Co-occurrence n/a Co-occurrence 96% and thesauri RAPIER, BWI, n/a ELCS
I
Recall 36% n/a 44% 57% n/a 63% n/a
of an HMM correspond to candidate POS tags, and probabilistic transitions among states represent possible parses of the sentence, according to the matches of the terms occurring in it to the POSs. In the context of biomedical literature mining, HMM is also used to model families of biological sequences as a set of different utterances of the same word generated by an HMM technique (Baldi et al., 1994). Ray and Craven (2001) have proposed a more sophisticated HMM-based technique to distinguish fact-bearing sentences from uninteresting sentences. The target biological entities and relations that they intend to extract are protein subcellular localizations and gene-disorder associations. With a predefined lexicon of locations and proteins and several hundreds of training sentences derived from Yeast database, they trained and tested the classifiers over a manually labeled corpus of about 3,000 MEDLINE abstracts. There have been several studies applying natural language tagging and parsing techniques to biomedical literature mining. Friedman, et al. (2001) propose methods parsing sentences and using thesauri to extract facts about genes and proteins from biomedical documents. They extract interactions among genes and proteins as part of regulatory pathways.
Evaluation One of the pivotal issues yet to be explored further in biomedical literature mining is how to evaluate the techniques or systems. The focus of the evaluation conducted in the literature is on extraction accuracy. The accuracy measures used in IE are precision and recall ratio. For a set of N items, where N is either terms, sentences, or documents, and the system needs to label each of the terms as positive or negative, according to some criterion (positive, if a term belongs to a predefined document category or a term class). As discussed earlier, the extraction accuracy is measured by precision and recall ratio. Although these evaluation techniques are straightforward and are well accepted, recall ratios often are criticized in the field of information retrieval, when the total number of true positive terms is not clearly defined. 617
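As a concrete reminder of how these accuracy measures are computed, the snippet below evaluates a toy set of extracted protein names against a gold standard; the names and counts are invented purely for illustration.

```python
# Toy illustration of precision, recall, and F-measure for entity extraction.
# The "gold" and "predicted" sets below are invented for illustration only.
gold = {"p53", "BRCA1", "MDM2", "EGFR"}          # annotated protein mentions
predicted = {"p53", "BRCA1", "actin", "kinase"}  # mentions returned by a system

tp = len(gold & predicted)          # true positives
precision = tp / len(predicted)     # fraction of extracted items that are correct
recall = tp / len(gold)             # fraction of gold items that were extracted
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# precision=0.50 recall=0.50 F1=0.50
```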
Information Extraction in Biomedical Literature
In IE, an evaluation forum similar to TREC in information retrieval (IR) is the Message Understanding Conference (MUC). Participants in MUC tested the ability of their systems to identify entities in text to resolve coreference, extract and populate attributes of entities, and perform various other extraction tasks from written text. As identified by Shatkay and Feldman (2003), the important challenge in biomedical literature mining is the creation of gold-standards and critical evaluation methods for systems developed in this very active field. The framework of evaluating biomedical literature mining systems was recently proposed by Hirschman, et al. (2002). According to Hirschman, et al. (2002), the following elements are needed for a successful evaluation: (1) challenging problem; (2) task definition; (3) training data; (4) test data; (5) evaluation methodology and implementation; (6) evaluator; (7) participants; and (8) funding. In addition to these elements for evaluation, the existing biomedical literature mining systems encounter the issues of portability and scalability, and these issues need to be taken into consideration of the framework for evaluation.
Data Sources In terms of data sources from which target biomedical objects are extracted, most of the biomedical data mining systems focus on mining MEDLINE abstracts of National Library of Medicine. The principal reason for relying on MEDLINE is related to complexity. Abstracts occasionally are easier to mine, since many papers contain less precise and less well supported sections in the text that are difficult to distinguish from more informative sections by machines (Andrade & Bork, 2000). The current version of MEDLINE contains nearly 12 million abstracts stored on approximately 43GB of disk space. A prominent example of methods that target entire papers is still restricted to a small number of journals (Friedman et al., 2000; Krauthammer et al., 2002). The task of unraveling information about function from MEDLINE abstracts can be approached from two different viewpoints. One approach is based on computational techniques for understanding texts written in natural language with lexical, syntactical, and semantic analysis. In addition to indexing terms in documents, natural language processing (NLP) methods extract and index higher-level semantic structures composed of terms and relationships between terms. However, this approach is confronted with the variability, fuzziness, and complexity of human language (Andrade & Bork, 2000). The Genies system (Friedman et al., 2000; Krauthammer et al., 2002), for automatically gathering and processing of knowledge about molecular pathways, and the Information Finding from Biological 618
Papers (IFBP) transcription factor database are natural language processing based systems. An alternative approach that may be more relevant in practice is based on the treatment of text with statistical methods. In this approach, the possible relevance of words in a text is deduced from the comparison of the frequency of different words in this text with the frequency of the same words in reference sets of text. Some of the major methods using the statistical approach are AbXtract and the automatic pathway discovery tool of Ng and Wong (1999). There are advantages to each of these approaches (i.e., grammar or pattern matching). Generally, the less syntax that is used, the more domain-specific the system is. This allows the construction of a robust system relatively quickly, but many subtleties may be lost in the interpretation of sentences. Recently, GENIA corpus has been used for extracting biomedical-named entities (Collier et al., 2000; Yamamoto et al., 2003). The reason for the recent surge of using GENIA corpus is because GENIA provides annotated corpus that can be used for all areas of NLP and IE applied to the biomedical domain that employs supervised learning. With the explosion of results in molecular biology, there is an increased need for IE to extract knowledge to build databases and to search intelligently for information in online journal collections.
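The statistical approach described above can be illustrated with a short Python sketch that ranks words by how much more frequent they are in a target text than in a reference corpus; the function, its add-one smoothing, and its inputs are assumptions made for illustration and do not reproduce AbXtract or any other cited system.

```python
import math
from collections import Counter

def salient_terms(doc_tokens, reference_tokens, top_n=10):
    """Rank words by the log-ratio of their relative frequency in the document
    to their relative frequency in a reference corpus (add-one smoothed)."""
    doc, ref = Counter(doc_tokens), Counter(reference_tokens)
    d_total, r_total = sum(doc.values()), sum(ref.values())
    score = {w: math.log(((c + 1) / (d_total + 1)) / ((ref[w] + 1) / (r_total + 1)))
             for w, c in doc.items()}
    return sorted(score, key=score.get, reverse=True)[:top_n]
```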
FUTURE TRENDS With the taxonomy proposed here, we now identify the research trends of applying IE to mine biomedical literature.
1. A variety of biomedical objects and relations remain to be extracted.
2. Rigorous studies are being conducted to apply advanced IE techniques, such as conditional random fields and maximum entropy-based HMMs, to biomedical data.
3. Collaborative efforts are under way to standardize the evaluation methods and procedures for biomedical literature mining.
4. The coverage of curated databases continues to broaden, and the size of the biomedical databases continues to grow.
CONCLUSION The sheer size of the biomedical literature triggers an intensive pursuit of effective information extraction tools. To cope with this demand, biomedical literature mining has emerged as an interdisciplinary field in which information extraction and machine learning are applied to the biomedical text corpus.
In this article, we approached biomedical literature mining from an IE perspective. We attempted to synthesize the research efforts made in this emerging field. In doing so, we showed how current information extraction can be used successfully to extract and organize information from the literature. We surveyed the prominent methods used for information extraction and demonstrated their applications in the context of biomedical literature mining. The following four aspects were used in classifying the current works done in the field: (1) what to extract; (2) what techniques are used; (3) how to evaluate; and (4) what data sources are used. The taxonomy proposed in this article should help identify the recent trends and issues pertinent to biomedical literature mining.
MEDSYNDIKATE text mining system. Proceedings of the Pacific Symposium on Biocomputing. Hirschman, L., Park, J.C., Tsujii, J., Wong, L., & Wu, C.H. (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12), 1553-1561. Jenssen, T.K., Laegreid, A., Komorowski, J., & Hovig, E. (2001). A literature network of human genes for highthroughput analysis of gene expression. Nature Genetics, 28(1), 21-8. Krauthammer, M., Rzhetsky, A., Morozov P., & Friedman, C. (2000). Using BLAST for identifying gene and protein names in journal articles. Gene, 259(1-2), 245-252.
REFERENCES
Lee, K., Hwang, Y., & Rim, H. (2003). Two-phase biomedical NE recognition based on SVMs. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.
Andrade, M.A., & Bork, P. (2000). Automated extraction of information in molecular biology. FEBS Letters, 476,12-7.
Leek, T.R. (1997). Information extraction using hidden Markov models [master’s theses]. San Diego, CA: Department of Computer Science, University of California.
Blaschke, C., Andrade, M.A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction of biological information from scientific text: Protein-protein interactions, Proceedings of the First International Conference on Intelligent Systems for Molecular Biology.
Narayanaswamy, M., Ravikumar, K.E., & Vijay-Shanker, K. (2003). A biological named entity recognizer. Proceedings of the Pacific Symposium on Biocomputing.
Bunescu, R. et al. (2004). Comparative experiments on learning information extractors for proteins and their interactions [to be published]. Journal Artificial Intelligence in Medicine on Summarization and Information Extraction from Medical Documents. Collier, N., Nobata,C., & Tsujii, J. (2000). Extracting the names of genes and gene products with a hidden Markov model. Proceedings of the 18th International Conference on Computational Linguistics (COLING2000). De Bruijn, B., & Martin, J. (2002). Getting to the (c)ore of knowledge: Mining biomedical literature. International Journal of Medical Informatics, 67, 7-18. Demetriou, G., & Gaizauskas, R. (2002). Utilizing text mining results: The pasta Web system. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain. Friedman, C., Kra, P., Yu, H., Krauthammer, M., & Rzhetsky, A. (2001). GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74-82. Hahn, U., Romacker, M., & Schulz, S. (2002). Creating knowledge repositories from biomedical reports: The
Ng, S.K., & Wong, M. (1999). Toward routine automatic pathway discovery from on-line scientific text abstracts. Proceedings of the Genome Informatics Series: Workshop on Genome Informatics. Proux, D., Rechenmann, F., & Julliard, L. (2000). A pragmatic information extraction strategy for gathering data on genetic interactions. Proceedings of the International Conference on Intelligent System for Molecular Biology. Pustejovsky, J., Castano, J., Zhang, J., Kotecki, M., & Cochran, B. (2002). Robust relational parsing over biomedical literature: extracting inhibit relations. Pacific Symposium on Biocomputing (pp. 362-73). Ray, S., & Craven, M. (2001). Representing sentence structure in hidden Markov models for information extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, Washington. Shatkay, H., & Feldman, R. (2003). Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10(6), 821-855. Yakushiji, A., Tateisi, Y., Miyao,Y., & Tsujii, J. (2001). Event extraction from biomedical papers using a full parser. Proceedings of the Pacific Symposium on Biocomputing.
Yamamoto, K., Kudo, T., Konagaya, A., & Matsumoto, Y. (2003). Protein name tagging for biomedical annotation in text. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.
KEY TERMS F-Value: Combines recall and precision in a single efficiency measure (it is the harmonic mean of precision and recall): F = 2 * (recall * precision) / (recall + precision). Hidden Markov Model (HMM): A statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. Natural Language Processing (NLP): A subfield of artificial intelligence and linguistics. It studies the prob-
lems inherent in the processing and manipulation of natural language. Part of Speech (POS): A classification of words according to how they are used in a sentence and the types of ideas they convey. Traditionally, the parts of speech are the noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. Precision: The ratio of the number of correctly filled slots to the total number of slots the system filled. Recall: Denotes the ratio of the number of slots the system found correctly to the number of slots in the answer key. Support Vector Machine (SVM): A learning machine that can perform binary classification (pattern recognition) as well as multi-category classification and real valued function approximation (regression estimation) tasks.
Instance Selection
I
Huan Liu Arizona State University, USA Lei Yu Arizona State University, USA
INTRODUCTION The amounts of data have become increasingly large in recent years as the capacity of digital data storage worldwide has significantly increased. As the size of data grows, the demand for data reduction increases for effective data mining. Instance selection is one of the effective means to data reduction. This article introduces the basic concepts of instance selection and its context, necessity, and functionality. The article briefly reviews the state-of-the-art methods for instance selection. Selection is a necessity in the world surrounding us. It stems from the sheer fact of limited resources, and data mining is no exception. Many factors give rise to data selection: Data is not purely collected for data mining or for one particular application; there are missing data, redundant data, and errors during collection and storage; and data can be too overwhelming to handle. Instance selection is one effective approach to data selection. This process entails choosing a subset of data to achieve the original purpose of a data-mining application. The ideal outcome is a model-independent, minimum sample of data that can accomplish tasks with little or no performance deterioration.
BACKGROUND AND MOTIVATION When we are able to gather as much data as we wish, a natural question is “How do we efficiently use it to our advantage?” Raw data is rarely of direct use, and manual analysis simply cannot keep pace with the fast accumulation of massive data. Knowledge discovery and data mining (KDD), an emerging field comprising disciplines such as databases, statistics, and machine learning, comes to the rescue. KDD aims to turn raw data into nuggets and create special edges in this ever-competitive world for science discovery and business intelligence. The KDD process is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy,
1996). It includes data selection, preprocessing, data mining, interpretation, and evaluation. The first two processes (data selection and preprocessing) play a pivotal role in successful data mining (Han & Kamber, 2001). Facing the mounting challenges of enormous amounts of data, much of the current research concerns itself with scaling up data-mining algorithms (Provost & Kolluri, 1999). Researchers have also worked on scaling down the data — an alternative to the scaling up of the algorithms. The major issue of scaling down data is to select the relevant data and then present it to a datamining algorithm. This line of work is parallel with the work on scaling up algorithms, and the combination of the two is a two-edged sword in mining nuggets from massive data. In data mining, data is stored in a flat file and described by terms called attributes or features. Each line in the file consists of attribute-values and forms an instance, which is also called a record, tuple, or data point in a multidimensional space defined by the attributes. Data reduction can be achieved in many ways (Liu & Motoda, 1998; Blum & Langley, 1997; Liu & Motoda, 2001). By selecting features, we reduce the number of columns in a data set; by discretizing feature values, we reduce the number of possible values of features; and by selecting instances, we reduce the number of rows in a data set. We focus on instance selection here. Instance selection reduces data and enables a datamining algorithm to function and work effectively with huge data. The data can include almost everything related to a domain (recall that data is not solely collected for data mining), but one application normally involves using one aspect of the domain. It is natural and sensible to focus on the relevant part of the data for the application so that the search is more focused and mining is more efficient. Cleaning data before mining is often required. By selecting relevant instances, we can usually remove irrelevant, noise, and redundant data. The high-quality data will lead to high-quality results and reduced costs for data mining.
MAJOR LINES OF RESEARCH AND DEVELOPMENT A spontaneous response to the challenge of instance selection is, without fail, some form of sampling. Although sampling is an important part of instance selection, other approaches do not rely on sampling but resort to search or take advantage of data-mining algorithms. In this section, we start with sampling methods and proceed to other instance-selection methods associated with data-mining tasks, such as classification and clustering.
Sampling Methods Sampling methods are useful tools for instance selection (Gu, Hu, & Liu, 2001). Simple random sampling is a method of selecting n
instances out of the N such that every one of the (N choose n) distinct samples has an equal chance of being drawn. If an instance that has been drawn is removed from the data set for all subsequent draws, the method is called random sampling without replacement. Random sampling with replacement is entirely feasible: At any draw, all N instances of the data set have an equal chance of being drawn, no matter how often they have already been drawn. Stratified random sampling divides the data set of N instances into subsets of N1, N2, …, Nl instances, respectively. These subsets are nonoverlapping, and together they comprise the whole data set (i.e., N1 + N2 + … + Nl = N). The subsets are called strata. When the strata have been determined, a sample is drawn from each stratum, the drawings being made independently in different strata. If a simple random sample is taken in each stratum, the whole procedure is described as stratified random sampling. It is often used in applications when one wishes to divide a heterogeneous data set into subsets, each of which is internally homogeneous. Adaptive sampling refers to a sampling procedure that selects instances depending on results obtained from the sample. The primary purpose of adaptive sampling is to take advantage of data characteristics in order to obtain more precise estimates. It takes advantage of the result of preliminary mining for more effective sampling, and vice versa. Selective sampling is another way of exploiting data characteristics to obtain more precise estimates in sampling. All instances are first divided into partitions according to some homogeneity criterion, and then random sampling is performed to select instances from each partition. Because instances in each partition are more similar to each other than instances in other
partitions, the resulting sample is more representative than a randomly generated one. Recent methods can be found in Liu, Motoda, and Yu (2002), in which samples selected from partitions based on data variance result in better performance than samples selected with random sampling.
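A minimal Python sketch of stratified random sampling, assuming the stratum of each instance is already given as an array; the names and the sampling fraction are illustrative assumptions, and selective sampling follows the same pattern with partitions in place of strata.

```python
import numpy as np

def stratified_sample(X, strata, frac, seed=0):
    """Draw a simple random sample (without replacement) of roughly frac of the
    instances from every stratum independently and pool the results."""
    rng = np.random.default_rng(seed)
    keep = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)          # instances in this stratum
        n = max(1, int(round(frac * len(idx))))    # sample size for the stratum
        keep.append(rng.choice(idx, size=n, replace=False))
    return X[np.concatenate(keep)]
```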
Methods for Labeled Data One key data-mining application is classification — predicting the class of an unseen instance. The data for this type of application is usually labeled with class values. Instance selection in the context of classification has been attempted by researchers according to the classifiers being built. In this section, we include five types of selected instances. Critical points are the points that matter the most to a classifier. The issue originated from the learning method of Nearest Neighbor (NN) (Cover & Thomas, 1991). NN usually does not learn during the training phase. Only when it is required to classify a new sample does NN search the data to find the nearest neighbor for the new sample, using the class label of the nearest neighbor to predict the class label of the new sample. During this phase, NN can be very slow if the data are large and can be extremely sensitive to noise. Therefore, many suggestions have been made to keep only the critical points, so that noisy ones are removed and the data set is reduced. Examples can be found in Yu, Xu, Ester, and Kriegel (2001) and Zeng, Xing, and Zhou (2003), in which critical data points are selected to improve the performance of collaborative filtering. Boundary points are the instances that lie on borders between classes. Support vector machines (SVM) provide a principled way of finding these points through minimizing structural risk (Burges, 1998). Using a nonlinear function ∅ to map data points to a high-dimensional feature space, a nonlinearly separable data set becomes linearly separable. Data points on the boundaries, which maximize the margin band, are the support vectors. Support vectors are instances in the original data sets and contain all the information a given classifier needs for constructing the decision function. Boundary points and critical points are different in the ways they are found. Prototypes are representatives of groups of instances via averaging (Chang, 1974). A prototype that represents the typicality of a class is used in characterizing a class rather than describing the differences between classes. Therefore, they are different from critical points or boundary points. Tree-based sampling is a method involving decision trees (Quinlan, 1993), which are commonly used classification tools in data mining and machine learn-
ing. Instance selection can be done via the decision tree built. Breiman and Friedman (1984) propose delegate sampling. The basic idea is to construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed. Delegate sampling then samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sampled points that are proportional to the leaf density. In real-world applications, although large amounts of data are potentially available, the majority of data are not labeled. Manually labeling the data is a labor-intensive and costly process. Researchers investigate whether experts can be asked to label only a small portion of the data that is most relevant to the task if labeling all data is too expensive and time-consuming, a process that is called instance labeling. Usually an expert can be engaged to label a small portion of the selected data at various stages. So we wish to select as little data as possible at each stage and use an adaptive algorithm to guess what else should be selected for labeling in the next stage. Instance labeling is closely associated with adaptive sampling, clustering, and active learning.
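The idea of keeping only critical points for a nearest-neighbor classifier can be sketched as follows; this is a simplified, greedy condensing procedure written for illustration under assumed names, not the exact method of any paper cited above.

```python
import numpy as np

def keep_critical_points(X, y):
    """Greedily keep instances that the currently kept subset misclassifies
    with a 1-NN rule; correctly handled instances are discarded."""
    keep = [0]                                   # seed with an arbitrary instance
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(d))]    # 1-NN among kept instances
            if y[nearest] != y[i]:               # misclassified: a critical point
                keep.append(i)
                changed = True
    return np.array(sorted(keep))
```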
Methods for Unlabeled Data When data are unlabeled, methods for labeled data cannot be directly applied to instance selection. The widespread use of computers results in huge amounts of data stored without labels, for example, Web pages, transaction data, newspaper articles, and e-mail messages (BaezaYates & Ribeiro-Neto, 1999). Clustering is one approach to finding regularities from unlabeled data. We discuss three types of selected instances here. Prototypes are pseudo data points generated from the formed clusters. The idea is that after the clusters are formed, one may just keep the prototypes of the clusters and discard the rest of the data points. The k-means clustering algorithm is a good example of this sort. Given a data set and a constant k, the k-means clustering algorithm is to partition the data into k subsets such that instances in each subset are similar under some measure. The k means are iteratively updated until a stopping criterion is satisfied. The prototypes in this case are the k means. Bradley, Fayyad, and Reina (1998) extend the kmeans algorithm to perform clustering in one scan of the data. By keeping some points that defy compression plus some sufficient statistics, they demonstrate a scalable kmeans algorithm. From the viewpoint of instance selection, prototypes plus sufficient statistics is a method of representing a cluster by using both defiant points and
pseudo points that can be reconstructed from sufficient statistics rather than keeping only the k means. Squashed data are some pseudo data points generated from the original data. In this aspect, they are similar to prototypes, as both may or may not be in the original data set. Squashed data points are different from prototypes in that each pseudo data point has a weight, and the sum of the weights is equal to the number of instances in the original data set. Presently, two ways of obtaining squashed data are (a) model free (DuMouchel, Volinsky, Johnson, Cortes, & Pregibon, 1999) and (b) model dependent, or likelihood based (Madigan, Raghavan, DuMouchel, Nason, Posse, & Ridgeway, 2002).
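A small Python sketch of the prototype idea for unlabeled data: run a plain Lloyd-style k-means loop and keep only the k cluster means in place of the full data set. It is written for illustration and does not implement the scalable one-scan variant or likelihood-based squashing discussed above.

```python
import numpy as np

def kmeans_prototypes(X, k, iters=100, seed=0):
    """Return k prototype points (cluster means) that can stand in for X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every instance to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```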
FUTURE TRENDS As shown in this article, instance selection has been studied and employed in various tasks, such as sampling, classification, and clustering. Each task is unique, as each has different information available and different requirements. Clearly, a universal model of instance selection is out of the question. This short article provides some starting points that can hopefully lead to more concerted study and development of new methods for instance selection. Instance selection deals with scaling down data. When we better understand instance selection, we will naturally investigate whether this work can be combined with other lines of research, such as algorithm scaling-up, feature selection, and construction, to overcome the problem of huge amounts of data. Integrating these different techniques to achieve the common goal of effective and efficient data mining is a big challenge.
CONCLUSION With the constraints imposed by computer memory and mining algorithms, we experience selection pressures more than ever. The central point of instance selection is approximation. Our task is to achieve as good of mining results as possible by approximating the whole data with the selected instances and, hopefully, to do better in data mining with instance selection, as it is possible to remove noisy and irrelevant data in the process. In this short article, we have presented an initial attempt to review and categorize the methods of instance selection in terms of sampling, classification, and clustering.
REFERENCES Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley and ACM Press. Blum, A., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271. Bradley, P., Fayyad, U., & Reina, C. (1998). Scaling clustering algorithms to large databases. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 9-15). Breiman, L. & Friedman, J. (1984). Tools for large data set analysis. In E.J. Wegman & J.G. Smith (Eds.), Statistical signal processing. New York: M. Dekker. Burges, C. (1998). A tutorial on support vector machines. Journal of Data Mining and Knowledge Discovery, 2, 121-167. Chang, C. (1974). Finding prototypes for nearest neighbor classifiers. IEEE Transactions on Computers, C-23.
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., & Ridgeway, G. (2002). Likelihood-based data squashing: A modeling approach to instance construction. Journal of Data Mining and Knowledge Discovery, 6(2), 173-190. Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Journal of Data Mining and Knowledge Discovery, 3, 131-169. Quinlan, R. J. (1993). C4.5: Programs for machine learning. Morgan Kaufmann. Yu, K., Xu, X., Ester, M., & Kriegel, H. (2001). Selecting relevant instances for efficient and accurate collaborative filtering. Proceedings of the 10th International Conference on Information and Knowledge Management (pp.239-46). Zeng, C., Xing, C., & Zhou, L. (2003). Similarity measure and instance selection for collaborative filtering. Proceedings of the 12th International Conference on World Wide Web (pp. 652-658).
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley.
KEY TERMS
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., & Pregibon, D. (1999). Squashing flat files flatter. Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining (pp. 6-15).
Classification: A process of predicting the classes of unseen instances based on patterns learned from available instances with predefined classes.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). From data mining to knowledge discovery. Advances in Knowledge Discovery and Data Mining.
Clustering: A process of grouping instances into clusters so that instances are similar to one another within a cluster but dissimilar to instances in other clusters.
Gu, B., Hu, F., & Liu, H. (2001). Sampling: Knowing whole from its part. In H. Liu & H. Motoda (Eds.), Instance selection and construction for data mining. Boston: Kluwer Academic.
Data Mining: The application of analytical methods and tools to data for the purpose of discovering patterns, statistical or predictive models, and relationships among massive data.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.
Data Reduction: A process of removing irrelevant information from data by reducing the number of features, instances, or values of the data.
Liu, H., & Motoda, H., (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic. Liu, H., & Motoda, H. (Eds.). (2001). Instance selection and construction for data mining. Boston: Kluwer Academic. Liu, H., Motoda, H., & Yu, L. (2002). Feature selection with selective sampling. Proceedings of the 19th International Conference on Machine Learning (pp. 395-402).
Instance: A vector of attribute values in a multidimensional space defined by the attributes, also called a record, tuple, or data point. Instance Selection: A process of choosing a subset of data to achieve the original purpose of a data-mining application as if the whole data is used. Sampling: A procedure that draws a sample, Si, by a random process in which each Si receives its appropriate probability, Pi, of being selected.
Integration of Data Sources through Data Mining Andreas Koeller Montclair State University, USA
INTRODUCTION Integration of data sources refers to the task of developing a common schema as well as data transformation solutions for a number of data sources with related content. The large number and size of modern data sources make manual approaches at integration increasingly impractical. Data mining can help to partially or fully automate the data integration process.
for each datum object. Database integration or migration projects often deal with hundreds of tables and thousands of fields (Dasu, Johnson, Muthukrishnan, & Shkapenyuk, 2002), with some tables having 100 or more fields and/or hundreds of thousands of rows. Methods of improving the efficiency of integration projects, which still rely mostly on manual work (Kang & Naughton, 2003), are critical for the success of this important task.
MAIN THRUST BACKGROUND Many fields of business and research show a tremendous need to integrate data from different sources. The process of data source integration has two major components. Schema matching refers to the task of identifying related fields across two or more databases (Rahm & Bernstein, 2001). Complications arise at several levels, for example •
• Source databases can be organized by using several different models, such as the relational model, the object-oriented model, or semistructured models (e.g., XML).
• Information stored in a single table in one relational database can be stored in two or more tables in another. This problem is common when source databases show different levels of normalization and also occurs in nonrelational sources.
• A single field in one database, such as Name, could correspond to multiple fields, such as First Name and Last Name, in another.
Data transformation (sometimes called instance matching) is a second step in which data in matching fields must be translated into a common format. Frequent reasons for mismatched data include data format (such as 1.6.2004 vs. 6/1/2004), numeric precision (3.5kg vs. 3.51kg), abbreviations (Corp. vs. Corporation), or linguistic differences (e.g., using different synonyms for the same concept across databases). Today’s databases are large both in the number of records stored and in the number of fields (dimensions)
In this article, I explore the application of data-mining methods to the integration of data sources. Although data transformation tasks can sometimes be performed through data mining, such techniques are most useful in the context of schema matching. Therefore, the following discussion focuses on the use of data mining in schema matching, mentioning data transformation where appropriate.
Schema-Matching Approaches Two classes of schema-matching solutions exist: schema-only-based matching and instance-based matching (Rahm & Bernstein, 2001). Schema-only-based matching identifies related database fields by taking only the schema of input databases into account. The matching occurs through linguistic means or through constraint matching. Linguistic matching compares field names, finds similarities in field descriptions (if available), and attempts to match field names to names in a given hierarchy of terms (ontology). Constraint matching matches fields based on their domains (data types) or their key properties (primary key, foreign key). In both approaches, the data in the sources are ignored in making decisions on matching. Important projects implementing this approach include ARTEMIS (Castano, de Antonellis, & de Capitani di Vemercati, 2001) and Microsoft’s CUPID (Madhavan, Bernstein, & Rahm, 2001). Instance-based matching takes properties of the data into account as well. A very simple approach is to conclude that two fields are related if their minimum and maximum values and/or their average values are
I
equal or similar. More sophisticated approaches consider the distribution of values in fields. A strong indicator of a relation between fields is a complete inclusion of the data of one field in another. I take a closer look at this pattern in the following section. Important instance-based matching projects are SemInt (Li & Clifton, 2000) and LSD (Doan, Domingos, & Halevy, 2001). Some projects explore a combined approach, in which both schema-level and instance-level matching is performed. Halevy and Madhavan (2003) present a Corpusbased schema matcher. It attempts to perform schema matching by incorporating known schemas and previous matching results and to improve the matching result by taking such historical information into account. Data-mining approaches are most useful in the context of instance-based matching. However, some mining-related techniques, such as graph matching, are employed in schema-only-based matching as well.
Instance-Based Matching through Inclusion Dependency Mining An inclusion dependency is a pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database. Such subsets are relevant to data integration for two reasons. First, fields that stand in an inclusion dependency to one another might represent related data. Second, knowledge of foreign keys is essential in successful schema matching. Because a foreign key is necessarily a subset of the corresponding key in another table, foreign keys can be discovered through inclusion dependency discovery. The discovery of inclusion dependencies is a very complex process. In fact, the problem is in general NPhard as a function of the number of fields in the largest inclusion dependency between two tables. However, a number of practical algorithms have been published. De Marchi, Lopes, and Petit (2002) present an algorithm that adopts the idea of levelwise discovery used in the famous Apriori algorithm for association rule mining. Inclusion dependencies are discovered by first comparing single fields with one another and then combining matches into pairs of fields, continuing the process through triples, then 4-sets of fields, and so on. However, due to the exponential growth in the number of inclusion dependencies in larger tables, this approach does not scale beyond inclusion dependencies with a size of about eight fields. A more recent algorithm (Koeller & Rundensteiner, 2003) takes a graph-theoretic approach. It avoids enumerating all inclusion dependencies between two tables and finds candidates for only the largest inclusion de626
pendencies by mapping the discovery problem to a problem of discovering patterns (specifically cliques) in graphs. This approach is able to discover inclusion dependencies with several dozens of attributes in tables with tens of thousands of rows. Both algorithms rely on the antimonotonic property of the inclusion dependency discovery problem. This property is also used in association rule mining and states that patterns of size k can only exist in the solution of the problem if certain patterns of sizes smaller than k exist as well. Therefore, it is meaningful to first discover small patterns (e.g., single-attribute inclusion dependency) and use this information to restrict the search space for larger patterns.
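As a minimal illustration of the starting point of such algorithms, the following Python sketch finds all single-attribute inclusion dependencies between two tables represented as dictionaries from column names to value lists; the representation and names are assumptions for illustration, not part of either published implementation. Because of the antimonotonic property, any larger candidate whose single-attribute projections do not all appear in this result can be pruned immediately.

```python
def unary_inclusion_dependencies(r, s):
    """All single-attribute inclusion dependencies r.A contained in s.B, where
    each table is given as a dict mapping a column name to its list of values."""
    r_cols = {a: set(v) for a, v in r.items()}
    s_cols = {b: set(v) for b, v in s.items()}
    return [(a, b) for a, va in r_cols.items()
                   for b, vb in s_cols.items() if va <= vb]

# orders = {"cust": [1, 3], "id": [7, 8]}
# customers = {"cust_id": [1, 2, 3], "country": ["NZ", "US", "NZ"]}
# unary_inclusion_dependencies(orders, customers) -> [("cust", "cust_id")]
```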
Instance-Based Matching in the Presence of Data Mismatches Inclusion dependency discovery captures only part of the problem of schema matching, because only exact matches are found. If attributes across two relations are not exact subsets of each other (e.g., due to entry errors), then data mismatches requiring data transformation, or partially overlapping data sets, it becomes more difficult to perform data-driven mining-based discovery. Both false negatives and false positives are possible. For example, matching fields might not be discovered due to different encoding schemes (e.g., use of a numeric identifier in one table, where text is used to denote the same values in another table). On the other hand, purely data-driven discovery relies on the assumption that semantically related values are also syntactically equal. Consequently, fields that are discovered by a mining algorithm to be matching might not be semantically related.
Data Mining by Using Database Statistics The problem of false negatives in mining for schema matching can be addressed by more sophisticated mining approaches. If it is known which attributes across two relations relate to one another, data transformation solutions can be used. However, automatic discovery of matching attributes is also possible, usually through the evaluation of statistical patterns in the data sources. In the classification of Kang and Naughton (2003), interpreted matching uses artificial intelligence techniques, such as Bayesian classification or neural networks, to establish hypotheses about related attributes. In the uninterpreted matching approach, statistical features, such as the unique value count of an attribute or its frequency distribution, are taken into consideration. The underlying assumption is that two
attributes showing a similar distribution of unique values might be related even though the actual data values are not equal or similar. Another approach for detecting a semantic relationship between attributes is to use information entropy measures. In this approach, the concept of mutual information, which is based on entropy and conditional entropy of the underlying attributes, “measures the reduction in uncertainty of one attribute due to the knowledge of the other attribute” (Kang & Naughton, 2003, p. 207).
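A hedged Python sketch of the statistics involved: entropy of a column and the mutual information between two columns, computed directly from their values. This is only the basic building block; the function names and the column pairing are illustrative assumptions rather than the full published matching procedure.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (in bits) of the value distribution of one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def mutual_information(col_a, col_b):
    """I(A;B) = H(A) + H(B) - H(A,B): the reduction in uncertainty about one
    attribute obtained from knowing the other, using row-paired values."""
    return entropy(col_a) + entropy(col_b) - entropy(list(zip(col_a, col_b)))
```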
Further Problems in Information Integration through Data Mining In addition to the approaches mentioned previously, several other data-mining and machine-learning approaches, in particular classification and rule-mining techniques, are used to solve special problems in information integration. For example, a common problem occurring in realworld integration projects is related to duplicate records across two databases, which must be identified. This problem is usually referred to as the record-linking problem or the merge/purge problem (Hernandéz & Stolfo, 1998). Similar statistical techniques as the ones described previously are used to approach this problem. The Commonwealth Scientific and Industrial Research Organisation (2003) gives an overview of approaches, which include decision models, predictive models such as support vector machines, and a Bayesian decision cost model. In a similar context, data-mining and machine-learning solutions are used to improve the data quality of existing databases as well. This important process is sometimes called data scrubbing. Lübbers, Grimmer, and Jarke (2003) present a study of the use of such techniques and refer to the use of data mining in data quality improvement as data auditing. Recently, the emergence of Web Services such as XML, SOAP, and UDDI promises to open opportunities for database integration. Hansen, Madnick, and Siegel (2002) argue that Web services help to overcome some of the technical difficulties of data integration, which mostly stem from the fact that traditional databases are not built with integration in mind. On the other hand, Web services by design standardize data exchange protocols and mechanisms. However, the problem of identifying semantically related databases and achieving schema matching remains.
FUTURE TRENDS Increasing amounts of data are being collected at all levels of business, industry, and science. Integration of data also becomes more and more important as businesses merge and research projects increasingly require interdisciplinary efforts. Evidence for the need for solutions in this area is provided by the multitude of partial software solutions for such business applications as ETL (Pervasive Software, Inc., 2003), and by the increasing number of integration projects in the life sciences, such as Genbank by the National Center for Biotechnology Information (NCBI) or Gramene by the Cold Spring Harbor Laboratory and Cornell University. Currently, the integration of data sources is a daunting task, requiring substantial human resources. If automatic methods for schema matching were more readily available, data integration projects could be completed much faster and could incorporate many more databases than is currently the case. Furthermore, an emerging trend in data source integration is the move from batch-style integration, where a set of given data sources is integrated at one time into one system, to real-time integration, where data sources are immediately added to an integration system as they become available. Solutions to this new challenge can also benefit tremendously from semiautomatic or automatic methods of identifying database structure and relationships.
CONCLUSION Information integration is an important and difficult task for businesses and research institutions. Although data sources can be integrated with each other by manual means, this approach is not very efficient and does not scale to the current requirements. Thousands of databases with the potential for integration exist in every field of business and research, and many of those databases have a prohibitively high number of fields and/or records to make manual integration feasible. Semiautomatic or automatic approaches to integration are needed. Data mining provides very useful tools to automatic data integration. Mining algorithms are used to identify schema elements in unknown source databases, to relate those elements to each other, and to perform additional tasks, such as data transformation. Essential
business tasks such as extraction, transformation, and loading (ETL) and data integration and migration in general become more feasible when automatic methods are used. Although the underlying algorithmic problems are difficult and often show exponential complexity, several interesting solutions to the schema-matching and data transformation problems in integration have been proposed. This is an active area of research, and more comprehensive and beneficial applications of data mining to integration are likely to emerge in the near future.
REFERENCES Castano, S., de Antonellis, V., & de Capitani di Vemercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Knowledge and Data Engineering, 13(2), 277-297. Commonwealth Scientific and Industrial Research Organisation. (2003, April). Record linkage: Current practice and future directions (CMIS Tech. Rep. No. 03/83). Canberra, Australia: L. Gu, R. Baxter, D. Vickers, & C. Rainsford. Retrieved July 22, 2004, from http:// www.act.cmis.csiro.au/rohanb/PAPERS/record _linkage.pdf Dasu, T., Johnson, T., Muthukrishnan, S., & Shkapenyuk, V. (2002). Mining database structure; or, how to build a data quality browser. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, USA (pp. 240-251). de Marchi, F., Lopes, S., & Petit, J.-M. (2002). Efficient algorithms for mining inclusion dependencies. Proceedings of the Eighth International Conference on Extending Database Technology, Prague, Czech Republic, 2287 (pp. 464-476). Doan, A. H., Domingos, P., & Halevy, A. Y. (2001). Reconciling schemas of disparate data sources: A machinelearning approach. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 509-520). Halevy, A. Y., & Madhavan, J. (2003). Corpus-based knowledge representation. Proceedings of the 18th International Joint Conference on Artificial Intelligence, Mexico (pp. 1567-1572). Hernández, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2(1), 9-37.
Kang, J., & Naughton, J. F. (2003). On schema matching with opaque column names and data values. Proceedings of the ACM SIGMOD International Conference on Management of Data, USA (pp. 205-216). Koeller, A., & Rundensteiner, E. A. (2003). Discovery of high-dimensional inclusion dependencies. Proceedings of the 19th IEEE International Conference on Data Engineering, India (pp. 683-685). Li, W., & Clifton, C. (2000). SemInt: A tool for identifying attribute correspondences in heterogeneous databases using neural network. Journal of Data and Knowledge Engineering, 33(1), 49-84. Lübbers, D., Grimmer, U., & Jarke, M. (2003). Systematic development of data mining-based data quality tools. Proceedings of the 29th International Conference on Very Large Databases, Germany (pp. 548-559). Madhavan, J., Bernstein, P. A., & Rahm, E. (2001) Generic schema matching with CUPID. Proceedings of the 27th International Conference on Very Large Databases, Italy (pp. 49-58). Massachusetts Institute of Technology, Sloan School of Management. (2002, May). Data integration using Web services (Working Paper 4406-02). Cambridge, MA. M. Hansen, S. Madnick, & M. Siegel. Retrieved July 22, 2004, from http://hdl.handle.net/1721.1/1822 Pervasive Software, Inc. (2003). ETL: The secret weapon in data warehousing and business intelligence. [Whitepaper]. Austin, TX: Pervasive Software. Rahm, E., & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 334-350.
KEY TERMS Antimonotonic: A property of some pattern-finding problems stating that patterns of size k can only exist if certain patterns with sizes smaller than k exist in the same dataset. This property is used in levelwise algorithms, such as the Apriori algorithm used for association rule mining or some algorithms for inclusion dependency mining. Database Schema: A set of names and conditions that describe the structure of a database. For example, in a relational database, the schema includes elements such as table names, field names, field data types, primary key constraints, or foreign key constraints.
Domain: The set of permitted values for a field in a database, defined during database design. The actual data in a field are a subset of the field’s domain. Extraction, Transformation, and Loading (ETL): Describes the three essential steps in the process of data source integration: extracting data and schema from the sources, transforming it into a common format, and loading the data into an integration database. Foreign Key: A key is a field or set of fields in a relational database table that has unique values, that is, no duplicates. A field or set of fields whose values form a subset of the values in the key of another table is called a foreign key. Foreign keys express relationships between fields of different tables. Inclusion Dependency: A pattern between two databases, stating that the values in a field (or set of fields) in one database form a subset of the values in some field (or set of fields) in another database.
Levelwise Discovery: A class of data-mining algorithms that discovers patterns of a certain size by first discovering patterns of size 1, then using information from that step to discover patterns of size 2, and so on. A well-known example of a levelwise algorithm is the Apriori algorithm used to mine association rules. Merge/Purge: The process of identifying duplicate records during the integration of data sources. Related data sources often contain overlapping information extents, which have to be reconciled to improve the quality of an integrated database. Relational Database: A database that stores data in tables, which are sets of tuples (rows). A set of corresponding values across all rows of a table is called an attribute, field, or column. Schema Matching: The process of identifying an appropriate mapping from the schema of an input data source to the schema of an integrated database.
Intelligence Density David Sundaram The University of Auckland, New Zealand Victor Portougal The University of Auckland, New Zealand
INTRODUCTION
Figure 1. Steps for increasing intelligence density (Dhar & Stein, 1997)
The amount of information that decision makers have to process has been increasing at a tremendous pace. A few years ago it was suggested that information in the world was doubling every 16 months. The very volume has prevented this information from being used effectively. Another problem that compounds the situation is the fact that the information is neither easily accessible nor available in an integrated manner. This has led to the oft-quoted comment that though computers have promised a fount of wisdom, they have swamped us with a flood of data. Decision Support Systems (DSS) and related decision support tools like data warehousing and data mining have been used to glean actionable information and nuggets from this flood of data.
BACKGROUND Dhar and Stein (1997) define Intelligence Density (ID) as the amount of useful “decision support information” that a decision maker gets from using a system for a certain amount of time. Alternately ID can be defined as the amount of time taken to get the essence of the underlying data from the output. This is done using the “utility” concept, initially developed in decision theory and game theory (Lapin & Whisler, 2002). Numerical utility values, referred to as utilities (sometimes called utiles) express the true worth of information. These values are obtained by constructing a special utility function. Thus intelligence density can be defined more formally as follows: Intelligence Density =
Utilities of decision-making power gleaned (quality) / Units of analytic time spent by the decision maker
Increasing the intelligence density of its data enables an organization to be more effective, productive, and flexible. Key processes that allow one to increase the ID of data are illustrated in Figure 1. Mechanisms that will allow us to access different types of data need to be in place first. Once we have access to the data we
need to have the ability to scrub or cleanse the data of errors. After scrubbing the data we need to have tools and technologies that will allow us to integrate data in a flexible manner. This integration should support not only data of different formats but also data that are not of the same type. Enterprise Systems/Enterprise Resource Planning (ERP) systems with their integrated databases have provided clean and integrated view of a large amount of information within the organization thus supporting the lower levels of the intelligence density pyramid (Figure 2). But even in the biggest and best organizations with massive investments in ERP systems we still find the need for data warehouses and OLAP even though they predominantly support the lower levels of the intelligence density pyramid. Once we have an integrated view of the data we can use data mining and other decision support tools to transform the data and discover patterns and nuggets of information from the data.
MAIN THRUST Three key technologies that can be leveraged to overcome the problems associated with information of low
Figure 2. ERP and DSS support for increasing intelligence density (Adapted from Shafiei and Sundaram, 2004)
homonyms and synonyms. The key steps that need to be undertaken to transform raw data to a form that can be stored in a Data Warehouse for analysis are:
intelligence density are Data Warehousing (DW), Online Analytical Processing (OLAP), and Data Mining (DM). These technologies have had a significant impact on the design and implementation of DSS. A generic decision support architecture that incorporates these technologies is illustrated in Figure 3. This architecture highlights the complimentary nature of data warehousing, OLAP, and data mining. The data warehouse and its related components support the lower end of the intelligence density pyramid by providing tools and technologies that allow one to extract, load, cleanse, convert, and transform the raw data available in an organisation into a form that then allows the decision maker to apply OLAP and data mining tools with ease. The OLAP and data mining tools in turn support the middle and upper levels of the intelligence density pyramid. In the following paragraphs we look at each of these technologies with a particular focus on their ability to increase the intelligence density of data.
The extraction and loading of the data into the Data Warehouse environment from a number of systems on a periodic basis Conversion of the data into a format that is appropriate to the Data Warehouse Cleansing of the data to remove inconsistencies, inappropriate values, errors, etc Integration of the different data sets into a form that matches the data model of the Data Warehouse Transformation of the data through operations such as summarisation, aggregation, and creation of derived attributes.
Once all these steps have been completed the data is ready for further processing. While one could use different programs/packages to accomplish the various steps listed above they could also be conducted within a single environment. For example, Microsoft SQL Server (2004) provides the Data Transformation Services by which raw data from organisational data stores can be loaded, cleansed, converted, integrated, aggregated, summarized, and transformed in a variety of ways.
Figure 3. DSS architecture incorporating data warehouses, OLAP, and data mining (Adapted from Srinivasan et al., 2000)
Data Warehousing Data warehouses are fundamental to most information system architectures and are even more crucial in DSS architectures. A Data Warehouse is not a DSS, but a Data Warehouse provides data that is integrated, subjectoriented, time-variant, and non-volatile in a bid to support decision making (Inmon, 2002). There are a number of processes that needs to be undertaken before data can enter a Data Warehouse or be analysed using OLAP or Data Mining Tools. Most Data Warehouses reside on relational DBMS like ORACLE, Microsoft SQL Server, or DB2. The data from which the Data Warehouses are built can exist on varied hardware and software platforms. The data quite often also needs to be extracted from a number of different sources from within as well as without the organization. This requires the resolution of many data integration issues such as 631
I
OLAP
FUTURE TRENDS
OLAP can be defined as the creation, analysis, ad hoc querying, and management of multidimensional data (Thomsen, 2002). Predominantly the focus of most OLAP systems is on the analysis and ad hoc querying of the multidimensional data. Data warehousing systems are usually responsible for the creation and management of the multidimensional data. A superficial understanding might suggest that there does not seem to be much of a difference between data warehouses and OLAP. This is due to the fact, that both are complimentary technologies with the aim of increasing the intelligence density of data. OLAP is a logical extension to the data warehouse. OLAP and related technologies focus on providing support for the analytical, modelling, and computational requirements of decision makers. While OLAP systems provide a medium level of analysis capabilities most of the current crop of OLAP systems do not provide the sophisticated modeling or analysis functionalities of data mining, mathematical programming, or simulation systems.
There are two key trends that are evident in the commercial as well as the research realm. The first is the complementary use of various decision support tools such as data warehousing, OLAP, and data mining in a synergistic fashion leading to information of high intelligence density. Another subtle but vital trend is the ubiquitous inclusion of data warehousing, OLAP, and data mining in most information technology architectural landscapes. This is especially true of DSS architectures.
Data Mining Data mining can be defined as the process of identifying valid, novel, useful, and understandable patterns in data through automatic or semiautomatic means (Berry & Linoff, 1997). Data mining borrows techniques that originated from diverse fields such as computer science, statistics, and artificial intelligence. Data mining is now being used in a range of industries and for a range of tasks in a variety of contexts (Wang, 2003). The complexity of the field of data mining makes it worthwhile to structure it into goals, tasks, methods, algorithms, and algorithm implementations. The goals of data mining drive the tasks that need to be undertaken, and the tasks drive the methods that will be applied. The methods that will be applied, drives the selections of algorithms followed by the choice of algorithm implementations. The goals of data mining are description, prediction, and/or verification. Description oriented tasks include clustering, summarisation, deviation detection, and visualization. Prediction oriented tasks include classification and regression. Statistical analysis techniques are predominantly used for verification. Methods or techniques to carry out these tasks are many, chief among them are: neural networks, rule induction, market basket, cluster detection, link, and statistical analysis. Each method may have several supporting algorithms and in turn each algorithm may be implemented in a different manner. Data mining tools such as Clementine (SPSS, 2004) not only support the discovery of nuggets but also support the entire intelligence density pyramid by providing a sophisticated visual interactive environment. 632
CONCLUSION

In this chapter we first defined intelligence density and the need for decision support tools that provide intelligence of a high density. We then introduced three emergent technologies that are integral to the design and implementation of DSS architectures whose prime purpose is to increase the intelligence density of data. We briefly described data warehousing, OLAP, and data mining from the perspective of their ability to increase the intelligence density of data, and we also proposed a generic decision support architecture that uses data warehousing, OLAP, and data mining in a complementary fashion.
REFERENCES

Berry, M.J.A., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons Inc.

Berson, A., & Smith, S.J. (1997). Data warehousing, data mining, & OLAP. McGraw-Hill.

Dhar, V., & Stein, R. (1997). Intelligent decision support methods: The science of knowledge work. Prentice Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Inmon, W.H. (2002). Building the data warehouse. John Wiley & Sons.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. John Wiley & Sons.

Lapin, L., & Whisler, W.D. (2002). Quantitative decision making with spreadsheet applications. Belmont, CA: Duxbury/Thomson Learning.
Microsoft. (2004). Microsoft SQL Server. Retrieved from http://www.microsoft.com/

Shafiei, F., & Sundaram, D. (2004, January 5-8). Multienterprise collaborative enterprise resource planning and decision support systems. Thirty-Seventh Hawaii International Conference on System Sciences (CD/ROM).

SPSS. (2004). Clementine. Retrieved from http://www.spss.com

Srinivasan, A., Sundaram, D., & Davis, J. (2000). Implementing decision support systems. McGraw Hill.

Thomsen, E. (2002). OLAP solutions: Building multidimensional information systems (2nd ed.). New York; Chichester, UK: Wiley.

Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Westphal, C., & Blaxton, T. (1998). Data mining solutions: Methods and tools for solving real-world problems. John Wiley & Sons.
KEY TERMS

Data Mining: Can be defined as the process of identifying valid, novel, useful, and understandable patterns in data through automatic or semiautomatic means, ultimately leading to an increase in the intelligence density of the raw input data.
Data Warehouses: Provide data that are integrated, subject-oriented, time-variant, and non-volatile, thereby increasing the intelligence density of the raw input data.

Decision Support Systems/Tools: In a wider sense, systems/tools that affect the way people make decisions; in the present context, systems that increase the intelligence density of data.

Enterprise Resource Planning/Enterprise Systems: Integrated information systems that support most of the business processes and information system requirements in an organization.

Intelligence Density: The useful “decision support information” that a decision maker gets from using a system for a certain amount of time, or alternatively the amount of time taken to get the essence of the underlying data from the output.

Online Analytical Processing (OLAP): Enables the creation, management, analysis, and ad hoc querying of multidimensional data, thereby increasing the intelligence density of the data already available in data warehouses.

Utilities (Utiles): Numerical utility values expressing the true worth of information. These values are obtained by constructing a special utility function.
Intelligent Data Analysis

Xiaohui Liu
Brunel University, UK
INTRODUCTION

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand, 2003; Liu, 1999).
BACKGROUND

The job of a data analyst typically involves problem formulation, advice on data collection (though it is not uncommon for the analyst to be asked to analyze data that have already been collected), effective data analysis, and interpretation and reporting of the findings. Data analysis is about the extraction of useful information from data and is often performed through an iterative process in which exploratory analysis and confirmatory analysis are the two principal components. Exploratory data analysis, or data exploration, resembles the job of a detective: understanding the evidence collected, looking for clues, applying relevant background knowledge, and pursuing and checking the possibilities that clues suggest. Data exploration is not only useful for data understanding but also helpful in generating possibly interesting hypotheses for a later study, normally a more formal or confirmatory procedure for analyzing data. Such procedures often assume a potential model structure for the data and may involve estimating the model parameters and testing hypotheses about the model. Over the last 15 years, we have witnessed two phenomena that have affected the work of modern data analysts more than any others. First, the size and variety of machine-readable data sets have increased dramatically, and the problem of data explosion has become apparent. Second, recent developments in computing have provided the basic infrastructure for fast data access as well as many advanced computational methods for extracting information from large quantities of data. These developments have created a new range of problems and
challenges for data analysts as well as new opportunities for intelligent systems in data analysis, and have led to the emergence of the field of Intelligent Data Analysis (IDA), which draws techniques from diverse fields, including artificial intelligence (AI), databases, high-performance computing, pattern recognition, and statistics. What distinguishes IDA is that it brings together often complementary methods from these diverse disciplines to solve challenging problems with which any individual discipline would find it difficult to cope, and to explore the most appropriate strategies and practices for complex data analysis.
MAIN THRUST

In this paper, we will explore the main disciplines and associated techniques as well as applications to help clarify the meaning of intelligent data analysis, followed by a discussion of several key issues.
Statistics and Computing: Key Disciplines

IDA has its origins in many disciplines, principally statistics and computing. For many years, statisticians have studied the science of data analysis and have laid many of the important foundations; many of the analysis methods and principles were established long before computers were born. Given that statistics is often regarded as a branch of mathematics, there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before trying it out on practical problems (Berthold & Hand, 2003). On the other hand, the computing community, particularly in machine learning (Mitchell, 1997) and data mining (Wang, 2003), is much more willing to try something out (e.g., designing new algorithms) to see how it performs on real-world datasets, without worrying too much about the theory behind it. Statistics is probably the oldest ancestor of IDA, but what kind of contributions has computing made to the subject? These may be classified into three categories. First, the basic computing infrastructure has been put in place during the last decade or so, which enables large-scale data analysis (e.g., advances in data warehousing
and online analytic processing, computer networks, and desktop technologies have made it possible to easily organize and move data around for analysis purposes). Modern processing power has also made it possible to efficiently implement some very computationally intensive analysis methods such as statistical resampling, visualization, large-scale simulation, neural networks, and stochastic search and optimization methods. Second, there has been much work on extending traditional statistical and operational research methods to handle challenging problems arising from modern data sets. For example, in Bayesian networks (Ramoni et al., 2002), where the work is based on Bayesian statistics, one tries to make the ideas work on large-scale practical problems by making appropriate assumptions and developing computationally efficient algorithms; in support vector machines (Cristianini & Shawe-Taylor, 2000), one tries to see how statistical learning theory (Vapnik, 1998) can be utilized to handle very high-dimensional datasets in linear feature spaces; and in evolutionary computation (Eiben & Michalewicz, 1999), one tries to extend traditional operational research search and optimization methods. Third, new kinds of IDA algorithms have been proposed to respond to new challenges. Here are several examples of novel methods with distinctive computing characteristics: powerful three-dimensional virtual reality visualization systems that allow gigabytes of data to be visualized interactively by teams of scientists in different parts of the world (Cruz-Neira, 2003); parallel and distributed algorithms for different data analysis tasks (Zaki & Pan, 2002); so-called any-time analysis algorithms designed for real-time tasks, where the system, if stopped at any time from its starting point, would be able to give some satisfactory (not optimal) solution (of course, the more time it has, the better the solution); inductive logic programming, which extends the deductive power of classic logic programming methods to induce structures from data (Mooney, 2004); association rule learning algorithms, motivated by needs in the retail industry, where customers tend to buy related items (Nijssen & Kok, 2001); and work on inductive databases, which attempts to supply users with queries involving inductive capabilities (De Raedt, 2002). Of course, this list is not meant to be exhaustive, but it gives some idea of the kind of IDA work going on within the computing community.
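To make the association rule learning mentioned above concrete, the following is a minimal, self-contained Python sketch; the toy transactions and the support/confidence thresholds are illustrative assumptions, not data or settings from any of the cited systems.

```python
# A small sketch of retail-style association rule learning:
# count frequent item pairs and report rules that clear the thresholds.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "beer"},
]
min_support, min_confidence = 0.4, 0.6

n = len(transactions)
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

for (a, b), count in pair_counts.items():
    support = count / n
    if support < min_support:
        continue
    # Rule lhs -> rhs holds with confidence P(rhs | lhs).
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if confidence >= min_confidence:
            print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Running the sketch prints rules such as bread -> milk together with their support and confidence, which is the basic kind of pattern an association rule learner searches for at scale.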
IDA Applications

Data analysis is performed for a variety of reasons by scientists, engineers, business communities, medical and government researchers, and so forth. The increasing size and variety of data as well as new exciting applications
such as bioinformatics and e-science have called for new ways of analyzing the data. It is therefore very difficult to give a sensible summary of the types of IDA applications that are possible. The following is a partial list.

• Bioinformatics: A huge amount of data has been generated by genome-sequencing projects and other experimental efforts to determine the structures and functions of biological molecules and to understand the evolution of life (Orengo et al., 2003). One of the most significant developments in bioinformatics is the use of high-throughput devices such as DNA microarray technology to study the activities of thousands of genes in a single experiment and to provide a global view of the underlying biological process by revealing, for example, which genes are responsible for a disease process, how they interact and are regulated, and which genes are being co-expressed and participate in common biological pathways. Major IDA challenges in this area include the analysis of very high dimensional but small sample microarray data, the integration of a variety of data for constructing biological networks and pathways, and the handling of very noisy microarray image data.

• Medicine and Healthcare: With the increasing development of electronic patient records and medical information systems, a large amount of clinical data is available online. Regularities, trends, and surprising events extracted from these data by IDA methods are important in assisting clinicians to make informed decisions, thereby improving health services (Bellazzi et al., 2001). Examples of such applications include the development of novel methods to analyze time-stamped data in order to assess the progression of disease, autonomous agents for monitoring and diagnosing intensive care patients, and intelligent systems for screening early signs of glaucoma. It is worth noting that research in bioinformatics can have significant impact on the understanding of disease and consequently better therapeutics and treatments. For example, it has been found using DNA microarray technology that the current taxonomy of cancer in certain cases appears to group together molecularly distinct diseases with distinct clinical phenotypes, suggesting the discovery of subgroups of cancer (Alizadeh et al., 2000).

• Science and Engineering: Enormous amounts of data have been generated in science and engineering (Cartwright, 2000) (e.g., in cosmology, chemical engineering, or molecular biology, as discussed previously). In cosmology, advanced computational tools are needed to help astronomers understand the origin of large-scale cosmological structures
as well as the formation and evolution of their astrophysical components (i.e., galaxies, quasars, and clusters). In chemical engineering, mathematical models have been used to describe interactions among various chemical processes occurring inside a plant. These models are typically very large systems of nonlinear algebraic or differential equations. Challenges for IDA in this area include the development of scalable, approximate, parallel, or distributed algorithms for large-scale applications.

• Business and Finance: There is a wide range of successful business applications reported, although the retrieval of technical details is not always easy, perhaps for obvious reasons. These applications include fraud detection, customer retention, cross selling, marketing, and insurance. Fraud is costing industries billions of pounds, so it is not surprising to see that systems have been developed to combat fraudulent activities in such areas as credit card, health care, stock market dealing, or finance in general. Interesting challenges for IDA include timely integration of information from different resources and the analysis of local patterns that represent deviations from a background model (Hand et al., 2002).
IDA Key Issues

In responding to challenges of analyzing complex data from a variety of applications, particularly emerging ones such as bioinformatics and e-science, the following issues are receiving increasing attention in addition to the development of novel algorithms to solve new emerging problems.
• Strategies: There is a strategic aspect to data analysis beyond the tactical choice of this or that test, visualization, or variable. Analysts often bring exogenous knowledge about data to bear when they decide how to analyze it. The question of how data analysis may be carried out effectively should lead us to look closely not only at the individual components of the data analysis process but also at the process as a whole, asking what would constitute a sensible analysis strategy. The strategy should describe the steps, decisions, and actions that are taken during the process of analyzing data to build a model or answer a question.

• Data Quality: Real-world data contain errors and are incomplete and inconsistent. It is commonly accepted that data cleaning is one of the most difficult and most costly tasks in large-scale data analysis and often consumes most of a project's resources. Research on data quality has attracted a significant amount of attention from different communities, including statistics, computing, and information
systems. Important progress has been made, but further work is urgently needed to come up with practical and effective methods for managing different kinds of data quality problems in large databases.

• Scalability: Currently, technical reports on analyzing really big data are still sketchy. Analysis of big, opportunistic data (i.e., data collected for an unrelated purpose) is beset with many statistical pitfalls. Much research has been done to develop efficient, heuristic, parallel, and distributed algorithms that are able to scale well. We will be eager to see more practical experience shared on analyzing large, complex, real-world datasets in order to obtain a deep understanding of the IDA process.

• Mapping Methods to Applications: Given that there are so many methods developed in different communities for essentially the same task (e.g., classification), what are the important factors in choosing the most appropriate method(s) for a given application? The most commonly used criterion is prediction accuracy. However, it is not always the only, or even the most important, criterion for evaluating competing methods. Credit scoring is one of the most quoted applications where misclassification cost is more important than predictive accuracy. Other important factors in deciding that one method is preferable to another include the computational efficiency and the interpretability of the methods.

• Human-Computer Collaboration: Data analysis is often an iterative, complex process in which both the analyst and the computer play an important part. An interesting issue is how one can have an effective analysis environment in which the computer performs complex and laborious operations and provides essential assistance, while the analyst is allowed to focus on the more creative part of the data analysis using knowledge and experience.
FUTURE TRENDS

There is strong evidence that IDA will continue to generate a lot of interest in both academic and industrial communities, given the number of related conferences, journals, working groups, books, and successful case studies already in existence. It is almost inconceivable that this topic will fade in the foreseeable future, since there are so many important and challenging real-world problems that demand solutions from this area, and there are still so many unanswered questions. The debate on what constitutes intelligent or unintelligent data analysis will carry on for a while.
More analysis tools and methods will inevitably appear, but guidance on their proper use will not keep pace. A tool can be used without an essential understanding of what it can offer and how the results should be interpreted, despite the best intentions of the user. Research will be directed toward the development of more helpful middle-ground tools, those that are less generic than current data analysis software tools but more general than specialized data analysis applications. Much of the current work in the area is empirical in nature, and we are still in the process of accumulating more experience in analyzing large, complex data. A lot of heuristics and trial and error have been used in exploring and analyzing these data, especially data collected opportunistically. As time goes by, we will see more theoretical work that attempts to establish a sounder foundation for analysts of the future.
CONCLUSION

Statistical methods have been the primary analysis tool, but many new computing developments have been applied to the analysis of large and challenging real-world datasets. Intelligent data analysis requires careful thinking at every stage of the analysis process, assessment and selection of the most appropriate approaches for the analysis tasks at hand, and intelligent application of relevant domain knowledge. This is an area with enormous potential, as it seeks to answer the following key questions. How can one perform data analysis most effectively (intelligently) to gain new scientific insights, to capture bigger portions of the market, to improve the quality of life, and so forth? What are the guiding principles that enable one to do so? How can one reduce the chance of performing unintelligent data analysis? Modern datasets are getting larger and more complex, but the number of trained data analysts is certainly not keeping pace. This poses a significant challenge for the IDA and other related communities such as statistics, data mining, machine learning, and pattern recognition. The quest for bridging this gap and for crucial insights into the process of intelligent data analysis will require an interdisciplinary effort from all these disciplines.
REFERENCES

Alizadeh, A.A. et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503-511.

Bellazzi, R., Zupan, B., & Liu, X. (Eds). (2001). Intelligent data analysis in medicine and pharmacology. London.

Berthold, M., & Hand, D.J. (Eds). (2003). Intelligent data analysis: An introduction. Springer-Verlag.

Cartwright, H. (Ed). (2000). Intelligent data analysis in science. Oxford University Press.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge University Press.

Cruz-Neira, C. (2003). Computational humanities: The new challenge for virtual reality. IEEE Computer Graphics and Applications, 23(3), 10-13.

De Raedt, L. (2002). A perspective on inductive databases. ACM SIGKDD Explorations Newsletter, 4(2), 69-77.

Eiben, A.E., & Michalewicz, Z. (Eds). (2003). Evolutionary computation. IOS Press.

Hand, D.J., Adams, N., & Bolton, R. (2002). Pattern detection and discovery. Lecture Notes in Artificial Intelligence, 2447.

Liu, X. (1999). Progress in intelligent data analysis. International Journal of Applied Intelligence, 11(3), 235-240.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Mooney, R. et al. (2004). Relational data mining with inductive logic programming for link discovery. In H. Kargupta et al. (Eds.), Data mining: Next generation challenges and future directions. AAAI Press.

Nijssen, S., & Kok, J. (2001). Faster association rules for multiple relations. Proceedings of the International Joint Conference on Artificial Intelligence.

Orengo, C., Jones, D., & Thornton, J. (Eds). (2003). Bioinformatics: Genes, proteins & computers. BIOS Scientific Publishers.

Ramoni, M., Sebastiani, P., & Cohen, P. (2002). Bayesian clustering by dynamics. Machine Learning, 47(1), 91-121.

Vapnik, V.N. (1998). Statistical learning theory. Wiley.

Wang, J. (Ed.). (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.

Zaki, M., & Pan, Y. (2002). Recent developments in parallel and distributed data mining. Distributed and Parallel Databases: An International Journal, 11(2), 123-127.
KEY TERMS

Bioinformatics: The development and application of computational and mathematical methods for organizing, analyzing, and interpreting biological data.
E-Science: The large-scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.

Intelligent Data Analysis: An interdisciplinary study concerned with the effective analysis of data, which draws the techniques from diverse fields including AI, databases, high-performance computing, pattern recognition, and statistics.

Machine Learning: A study of how computers can be used automatically to acquire new knowledge from past cases or experience or from the computer's own experiences.

Noisy Data: Real-world data often contain errors due to the nature of data collection, measurement, or sensing procedures. They can be incomplete, inaccurate, out-of-date, or inconsistent.

Support Vector Machines (SVM): Learning machines that can perform difficult classification and regression estimation tasks. SVM non-linearly map their n-dimensional input space into a high-dimensional feature space. In this high-dimensional feature space, a linear classifier is constructed.

Visualization: Visualization tools to graphically display data in order to facilitate better understanding of their meanings. Graphical capabilities range from simple scatter plots to three-dimensional virtual reality systems.
Intelligent Query Answering

Zbigniew W. Ras
University of North Carolina, Charlotte, USA

Agnieszka Dardzinska
Bialystok Technical University, Poland
INTRODUCTION

One way to make a query answering system (QAS) intelligent is to assume a hierarchical structure of its attributes. Such systems have been investigated by Cuppens & Demolombe (1988), Gal & Minker (1988), and Gaasterland et al. (1992), and they are called cooperative. Any attribute value listed in a query submitted to a cooperative QAS is seen as a node of the tree representing that attribute. If QAS retrieves an empty set of objects matching query q in a target information system S, then any attribute value listed in q can be generalized, and thus the number of objects that can possibly match q in S increases. In cooperative systems, these generalizations are usually controlled by users. Another way to make QAS intelligent is to use knowledge discovery methods to increase the number of queries which QAS can answer: the knowledge discovery module of QAS extracts rules from a local system S and requests their extraction from remote sites (if the system is distributed). These rules are used to construct new attributes and/or impute null or hidden values of attributes in S. By enlarging the set of attributes from which queries are built and by making information systems less incomplete, we not only increase the number of queries which QAS can handle but also increase the number of retrieved objects. So, QAS based on knowledge discovery has two classical scenarios that need to be considered:

• In a standalone and incomplete system, association rules are extracted from that system and used to predict what values should replace null values before queries are answered.

• When the system is distributed with autonomous sites and the user needs to retrieve objects from one of these sites (called the client) satisfying a query q based on attributes which are not local to that site, we search for definitions of these non-local attributes at remote sites and use them to approximate q (Ras, 2002; Ras & Joshi, 1997; Ras & Dardzinska, 2004).
The goal of this article is to provide foundations and basic results for knowledge discovery-based QAS.
BACKGROUND

The modern query answering systems area of research is concerned with enhancing query answering systems into intelligent systems. The emphasis is on problems in users posing queries and systems producing answers. This becomes more and more relevant as the amount of information available from local or distributed information sources increases. We need systems that are not only easy to use but also intelligent in answering users' needs. A query answering system often replaces a human with expertise in the domain of interest; thus it is important, from the user's point of view, to compare the system and the human expert as alternative means for accessing information. Knowledge systems are defined as information systems coupled with a knowledge base, simplified in Ras (2002), Ras and Joshi (1997), and Ras and Dardzinska (1997) to a set of rules treated as definitions of attribute values. If the information system is distributed with autonomous sites, these rules can be extracted either from the information system which is seen as local (the query was submitted to that system) or from remote sites. Domains of attributes in the local information system S and the set of decision values used in rules from the knowledge base associated with S form the initial alphabet for the local query answering system. When the knowledge base associated with S is updated (new rules are added or some deleted), the alphabet for the local query answering system is automatically changed. In this paper we assume that knowledge bases for all sites are initially empty. A collaborative information system (Ras, 2002) learns rules describing values of incomplete attributes and of attributes classified as foreign for one of its sites, called the client. These rules can be extracted at any site, but their condition part should use, if possible, only terms that can be processed by the query-answering system associated with the client. As
time progresses, more and more rules can be added to the local knowledge base, which means that some attribute values (decision parts of rules) foreign for the client are also added to its local alphabet. The choice of which site should be contacted first, in the search for definitions of foreign attribute values, is mainly based on the number of attribute values common to the client and server sites. The solution to this problem is given in Ras (2002).
MAIN THRUST

The technology dimension will be explored to help clarify the meaning of intelligent query answering based on knowledge discovery and chase.
Intelligent Query Answering for a Standalone Information System

QAS for an information system is concerned with identifying all objects in the system satisfying a given description. For example, an information system might contain information about students in a class and classify them using four attributes: “hair color,” “eye color,” “gender,” and “size.” A simple query might be to find all students with brown hair and blue eyes. When an information system is incomplete, students having brown hair and unknown eye color can be handled by either including or excluding them from the answer to the query. In the first case we talk about an optimistic approach to query evaluation, while in the second case we talk about a pessimistic approach. Another option for handling such a query would be to discover rules for eye color in terms of the attributes hair color, gender, and size. These rules could then be applied to students with unknown eye color to generate values that could be used in answering the query. Consider that in our example one of the generated rules said: (hair, brown) ∧ (size, medium) → (eye, brown). Thus, if one of the students having brown hair and medium size has no value for eye color, then the query answering system should not include this student in the list of students with brown hair and blue eyes. The attributes hair color and size are classification attributes, and eye color is the decision attribute. We are also interested in how to use this strategy to build intelligent QAS for incomplete information systems. If a query is submitted to information system S, the first step of QAS is to make S as complete as possible. The approach proposed in Dardzinska & Ras
(2003b) is to use not only functional dependencies to chase S (Atzeni & DeAntonellis, 1992) but also rules discovered from a complete subsystem of S to do the chasing. In the first step, intelligent QAS identifies all incomplete attributes used in a query. An attribute is incomplete in S if there is an object in S with incomplete information on this attribute. The values of all incomplete attributes are treated as concepts to be learned (in the form of rules) from S. Incomplete information in S is replaced by new data provided by the Chase algorithm based on these rules. When the process of removing incomplete values in the local information system is completed, QAS finds the answer to the query in the usual way.
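As an illustration only, and not the authors' Chase implementation, the following Python sketch walks through the idea just described on the student example from the text: a discovered rule fills in a missing eye color before the query is answered. The record names and the single rule are illustrative.

```python
# Minimal sketch of rule-based imputation followed by query answering.
students = [
    {"name": "s1", "hair": "brown", "eye": "blue", "gender": "f", "size": "small"},
    {"name": "s2", "hair": "brown", "eye": None,   "gender": "m", "size": "medium"},
    {"name": "s3", "hair": "brown", "eye": None,   "gender": "f", "size": "large"},
]

# Rule from the text: (hair, brown) and (size, medium) -> (eye, brown).
rules = [({"hair": "brown", "size": "medium"}, ("eye", "brown"))]

def chase_step(records, rules):
    for r in records:
        for conditions, (attr, value) in rules:
            if r[attr] is None and all(r.get(k) == v for k, v in conditions.items()):
                r[attr] = value  # impute the missing value predicted by the rule

chase_step(students, rules)

# Query: students with brown hair and blue eyes. s2 is now excluded (imputed
# brown eyes); s3 still has a null and is excluded under a pessimistic policy.
answer = [r["name"] for r in students if r["hair"] == "brown" and r["eye"] == "blue"]
print(answer)  # ['s1']
```

Under an optimistic policy the last record, still null after chasing, would be included rather than excluded; which policy to use is the choice discussed earlier in this section.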
Intelligent Query Answering for Distributed Autonomous Information Systems

Semantic inconsistencies are due to different interpretations of attributes and their values among sites (for instance, one site can interpret the concept “young” differently than other sites). Different interpretations are also due to the way each site handles null values. Null value replacement by values suggested either by statistical or by knowledge discovery methods is quite common before a user query is processed by QAS. An ontology (Guarino, 1998; Sowa, 1999, 2000; Van Heijst et al., 1997) is a set of terms of a particular information domain and the relationships among them. Currently, there is a great deal of interest in the development of ontologies to facilitate knowledge sharing among information systems. Ontologies and inter-ontology relationships between them are created by experts in the corresponding domain, but they can also represent a particular point of view of the global information system by describing customized domains. To allow intelligent query processing, it is often assumed that an information system is coupled with some ontology. Inter-ontology relationships can be seen as semantic bridges between the ontologies built for each of the autonomous information systems so that they can collaborate and understand each other. In Ras and Dardzinska (2004), the notion of optimal rough semantics and the method of its construction have been proposed. Rough semantics can be used to model semantic inconsistencies among sites due to different interpretations of incomplete values of attributes. Distributed chase (Ras & Dardzinska, 2004) is a chase-type algorithm, driven by a client site of a distributed information system (DIS), which is similar to the chase algorithms based on knowledge discovery presented in
Dardzinska & Ras (2003a, 2003b). Distributed chase has one extra feature in comparison to other chase-type algorithms: the dynamic creation of knowledge bases at all sites of DIS involved in the process of solving a query submitted to the client site of DIS. The knowledge base at the client site may contain rules extracted from the client information system and also rules extracted from information systems at remote sites in DIS. These rules are dynamically updated through the incomplete values replacement process (Ras & Dardzinska, 2004). Although the names of attributes are often the same among sites, their semantics and granularity levels may differ from site to site. As a result of these differences, the knowledge bases at the client site and at remote sites have to satisfy certain properties in order to be applicable in a distributed chase. So, assume that system S = (X,A,V), which is a part of DIS, is queried by a user. The Chase algorithm, to be applicable to S, has to be based on rules from the knowledge base D associated with S, which satisfies the following conditions:

1. An attribute value used in the decision part of a rule from D has a granularity level either equal to or finer than the granularity level of the corresponding attribute in S.

2. The granularity level of any attribute used in the classification part of a rule from D is either equal to or softer than the granularity level of the corresponding attribute in S.

3. An attribute used in the decision part of a rule from D either does not belong to A or is incomplete in S.
Assume again that S = (X,A,V) is an information system (Pawlak, 1991; Ras & Dardzinska, 2004), where X is a set of objects, A is a set of attributes (seen as partial functions from X into 2^(V×[0,1])), and V is a set of values of attributes from A. By [0,1] we mean the set of real numbers from 0 to 1. Let L(D) = {[t → v_c] ∈ D : c ∈ In(A)} be the set of all rules (called a knowledge base) extracted initially from the information system S by ERID (Dardzinska & Ras, 2003c), where In(A) is the set of incomplete attributes in S. Assume now that query q(B) is submitted to system S = (X,A,V), where B is the set of all attributes used in q(B) and A ∩ B ≠ ∅. All attributes in B - [A ∩ B] are called foreign for S. If S is a part of a distributed information system, definitions of foreign attributes for S can be extracted at its remote sites (Ras, 2002). Clearly, all semantic inconsistencies and differences in granularity of attribute values among sites have to be resolved first. In Ras & Dardzinska (2004), only different granularity of attribute values and different semantics related to different interpretations of incomplete attribute values among sites have been considered. In Ras (2002), it was shown that query q(B) can be processed at site S by discovering definitions of values of attributes from B - [A ∩ B] at the remote sites for S and then using them to answer q(B). Foreign attributes for S in B can also be seen as attributes entirely incomplete in S, which means that values (either exact or partially incomplete) of such attributes should be ascribed by chase to all objects in S before query q(B) is answered. The question remains whether the values discovered by chase are really correct. The classical approach to this kind of problem is to build a simple DIS environment (mainly to avoid difficulties related to different granularity and different semantics of attributes at different sites). As the testing data set, Ras & Dardzinska (2005) took 10,000 tuples randomly selected from a database of an insurance company. This sample table, containing 100 attributes, was randomly partitioned into four subtables of equal size containing 2,500 tuples each. Next, from each of these subtables 40 attributes (columns) were randomly removed, leaving four data tables of size 2,500×60 each. One of these tables was called the client and the remaining three were called servers. Now, for all objects at the client site, the values of one randomly chosen attribute were hidden. This attribute is denoted by d. At each server site, if attribute d was listed in its domain schema, descriptions of d were learned using the See5 software (the data are complete, so it was not necessary to use ERID). All these descriptions, in the form of rules, were stored in the knowledge base of the client. Distributed Chase was then applied to predict the real value of the hidden attribute for each object x at the client site. The threshold value λ = 0.125 was used to rule out all values predicted by distributed Chase with confidence below that threshold. Almost all hidden values (2,476 out of 2,500) were discovered correctly (assuming λ = 0.125).
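The following Python sketch illustrates, under simplifying assumptions, the distributed-chase step described above: rules learned at remote sites predict the hidden attribute d for a client object, predictions for the same value are combined, and a prediction is kept only if its combined confidence clears the threshold λ. The rules, their confidences, the attribute names, and the normalisation used to combine predictions are all illustrative assumptions, not the published experiment.

```python
# Simplified sketch of thresholded prediction of a hidden attribute d
# from rules supplied by remote server sites.
from collections import defaultdict

LAMBDA = 0.125

# Rules from remote sites: (conditions, predicted value of d, confidence).
server_rules = [
    ({"region": "north", "age_band": "young"}, "low_risk", 0.8),
    ({"region": "north"},                      "low_risk", 0.3),
    ({"age_band": "young"},                    "high_risk", 0.1),
]

def predict_hidden(client_object, rules, threshold=LAMBDA):
    scores = defaultdict(float)
    total = 0.0
    for conditions, value, confidence in rules:
        if all(client_object.get(k) == v for k, v in conditions.items()):
            scores[value] += confidence
            total += confidence
    if not scores:
        return None
    best_value, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Normalise so the threshold is comparable across objects with different
    # numbers of applicable rules; predictions below the threshold are ruled out.
    return best_value if best_score / total >= threshold else None

x = {"region": "north", "age_band": "young"}
print(predict_hidden(x, server_rules))  # 'low_risk'
```

In the reported experiment, such thresholded predictions were computed for every object at the client site.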
Distributed Chase and Security Problem of Hidden Attributes

Assume now that an information system S = (X,A,V) is a part of DIS and attribute b ∈ A has to be hidden. For that purpose, we construct S_b = (X,A,V) to replace S, where:

1. a_S(x) = a_{S_b}(x), for any a ∈ A - {b} and x ∈ X;
2. b_{S_b}(x) is undefined, for any x ∈ X;
3. b_S(x) ∈ V_b.
Users are allowed to submit queries to S_b and not to S. What about the information system Chase(S_b)? How does it differ from S? If b_S(x) = b_{Chase(S_b)}(x), where x ∈ X, then the values of additional attributes for object x have to be hidden in S_b to guarantee that the value b_S(x) cannot be reconstructed by Chase. In Ras and Dardzinska (2005) it was shown how to identify the minimal number of such values.
FUTURE TRENDS

One of the main problems related to the semantics of an incomplete information system S is the freedom in how new values are constructed to replace incomplete values in S before any rule extraction process begins. This replacement of incomplete attribute values in some of the slots in S can be done by chase and/or by a number of available statistical methods (Giudici, 2003). This implies that the semantics of queries submitted to S and driven (defined) by a query answering system QAS based on chase may often differ. Although rough semantics can be used by QAS to handle this problem, we still have to look for new alternative methods. Assuming different semantics of attributes among sites in DIS, the use of a global ontology or of local ontologies built jointly with inter-ontology relationships among them seems to be necessary for solving queries in DIS using knowledge discovery and chase. A lot of research still has to be done in this area.
CONCLUSION

Assume that the client site in DIS is represented by a partially incomplete information system S. When a query is submitted to S, its query answering system QAS will replace S by Chase(S) and then solve the query using, for instance, the strategy proposed in Ras & Joshi (1997). Rules used by Chase can be extracted from S or from its remote sites in DIS, assuming that all differences in semantics of attributes and differences in granularity levels of attributes are resolved first. One might ask why the resulting information system obtained by Chase cannot be stored aside and reused when a new query is submitted to S. If system S is not frequently updated, we can do that by keeping a copy of Chase(S) and reusing that copy when a new query is submitted to S. But the original information system S still has to be kept, so that when a user wants to enter new data into S, they can be stored in the original system. System Chase(S), if stored aside, cannot be reused by QAS when the number of updates in the original S exceeds a
given threshold value. This means that the updated information system S has to be chased again before any query is answered by QAS.
REFERENCES

Atzeni, P., & DeAntonellis, V. (1992). Relational database theory. The Benjamin Cummings Publishing Company.

Cuppens, F., & Demolombe, R. (1988). Cooperative answering: A methodology to provide intelligent access to databases. In Proceedings of the Second International Conference on Expert Database Systems (pp. 333-353).

Dardzinska, A., & Ras, Z.W. (2003a). Rule-based Chase algorithm for partially incomplete information systems. In Proceedings of the Second International Workshop on Active Mining, Maebashi City, Japan (pp. 42-51).

Dardzinska, A., & Ras, Z.W. (2003b). Chasing unknown values in incomplete information systems. In Proceedings of ICDM’03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 24-30). IEEE Computer Society.

Dardzinska, A., & Ras, Z.W. (2003c). On rule discovery from incomplete information systems. In Proceedings of ICDM’03 Workshop on Foundations and New Directions of Data Mining, Melbourne, Florida (pp. 31-35). IEEE Computer Society.

Gaasterland, T., Godfrey, P., & Minker, J. (1992). Relaxation as a platform for cooperative answering. Journal of Intelligent Information Systems, 1(3), 293-321.

Gal, A., & Minker, J. (1988). Informative and cooperative answers in databases using integrity constraints. In Natural language understanding and logic programming (pp. 288-300). North Holland.

Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. West Sussex, UK: Wiley.

Guarino, N. (Ed.). (1998). Formal ontology in information systems. Amsterdam: IOS Press.

Pawlak, Z. (1991). Rough sets-theoretical aspects of reasoning about data. Kluwer.

Ras, Z. (2002). Reducts-driven query answering for distributed knowledge systems. International Journal of Intelligent Systems, 17(2), 113-124.
Ras, Z., & Dardzinska, A. (2004). Ontology based distributed autonomous knowledge systems. Information Systems International Journal, 29(1), 47-58.

Ras, Z., & Dardzinska, A. (2005). Data security and null value imputation in distributed information systems. In Advances in Soft Computing, Proceedings of MSRAS’04 Symposium (pp. 133-146). Poland: Springer-Verlag.

Ras, Z., & Joshi, S. (1997). Query approximate answering system for an incomplete DKBS. Fundamenta Informaticae, 30(3), 313-324.

Sowa, J.F. (1999). Ontological categories. In L. Albertazzi (Ed.), Shapes of forms: From Gestalt psychology and phenomenology to ontology and mathematics (pp. 307-340). Kluwer.

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing.

Van Heijst, G., Schreiber, A., & Wielinga, B. (1997). Using explicit ontologies in KBS development. International Journal of Human and Computer Studies, 46(2/3), 183-292.
KEY TERMS

Autonomous Information System: Information system existing as an independent entity.

Chase: Kind of a recursive strategy applied to a database V, based on functional dependencies or rules extracted from V, by which a null value or an incomplete value in V is replaced by a new more complete value.
Distributed Chase: Kind of a recursive strategy applied to a database V, based on functional dependencies or rules extracted both from V and from other autonomous databases, by which a null value or an incomplete value in V is replaced by a new more complete value. Any differences in semantics among attributes in the involved databases have to be resolved first.

Intelligent Query Answering: Enhancements of query answering systems into a sort of intelligent system (capable of being adapted or molded). Such systems should be able to interpret incorrectly posed questions and compose an answer not necessarily reflecting precisely what is directly referred to by the question, but rather reflecting what the intermediary understands to be the intention linked with the question.

Knowledge Base: A collection of rules defined as expressions written in predicate calculus. These rules have the form of associations between conjuncts of values of attributes.

Ontology: An explicit formal specification of how to represent objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships holding among them. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology.

Query Semantics: The meaning of a query with an information system as its domain of interpretation. Application of knowledge discovery and Chase in query evaluation makes semantics operational.

Semantics: The meaning of expressions written in some language, as opposed to their syntax, which describes how symbols may be combined independently of their meaning.
Interactive Visual Data Mining

Shouhong Wang
University of Massachusetts Dartmouth, USA

Hai Wang
Saint Mary's University, Canada
INTRODUCTION

In the data mining field, people have no doubt that high-level information (or knowledge) can be extracted from a database through the use of algorithms. However, a one-shot knowledge deduction is based on the assumption that the model developer knows the structure of the knowledge to be deduced. This assumption may not be valid in general. Hence, a general proposition for data mining is that, without human-computer interaction, any knowledge discovery algorithm (or program) will fail to meet the needs of a data miner who has a novel goal (Wang, S. & Wang, H., 2002). Recently, interactive visual data mining techniques have opened new avenues in the data mining field (Chen, Zhu, & Chen, 2001; de Oliveira & Levkowitz, 2003; Han, Hu & Cercone, 2003; Shneiderman, 2002; Yang, 2003). Interactive visual data mining differs from traditional data mining, standalone knowledge deduction algorithms, and one-way data visualization in many ways. Briefly, interactive visual data mining is human centered and is implemented through knowledge discovery loops coupled with human-computer interaction and visual representations. Interactive visual data mining attempts to extract unsuspected and potentially useful patterns from the data for data miners with novel goals, rather than to use the data to derive certain information based on an a priori human knowledge structure.
BACKGROUND

A single generic knowledge deduction algorithm is insufficient to handle the variety of goals of data mining, since a goal of data mining is often related to its specific problem domain. In fact, knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). By this definition, two aspects of knowledge discovery are important to meaningful data mining. First, the criteria of validity, novelty, and usefulness of the knowledge to be discovered can be subjective. That is, the usefulness of a data
pattern depends on the data miner and does not solely depend on the statistical strength of the pattern. Second, heuristic search in combinatorial spaces built on computer and human interaction is useful for effective knowledge discovery. One strategy for effective knowledge discovery is the use of human-computer collaboration. One technique used for human-computer collaboration in the business information systems field is data visualization (Bell, 1991; Montazemi & Wang, 1988), which is particularly relevant to data mining (Keim & Kriegel, 1996; Wang, 2002). From the human side of data visualization, graphics cognition and problem solving are the two major concepts of data visualization. It is a commonly accepted principle that visual perception is compounded out of processes in a way which is adaptive to the visual presentation and the particular problem to be solved (Kosslyn, 1980; Newell & Simon, 1972).
MAIN THRUST

The major components of interactive visual data mining, and the functions that make data mining more effective, are the current research theme in this field. Wang, S. and Wang, H. (2002) have developed a model of interactive visual data mining for human-computer collaborative knowledge discovery. According to this model, an interactive visual data mining system has three components on the computer side, besides the database: the data visualization instrument, the data and model assembly, and the human-computer interface.
Data Visualization Instrument

Data visualization instruments are tools for presenting data in human-understandable graphics, images, or animation. While there have been many techniques for data visualization, such as various statistical charts with colors and animations, the self-organizing map (SOM) method based on the Kohonen neural network (Kohonen, 1989) has become one of the promising techniques of data visualization in data mining. SOM is a dynamic system that can learn the topological relations and abstract structures
in the high-dimensional input vectors using a low-dimensional space for representation. These low-dimensional presentations can be viewed and interpreted by humans in discovering knowledge (Wang, 2000).
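As a concrete, hedged illustration of the SOM idea described above (not Kohonen's original formulation nor any particular tool's implementation), the following NumPy sketch trains a small map and projects input vectors onto it; the grid size, learning rate, and decay schedule are illustrative choices.

```python
# Minimal self-organizing map: high-dimensional vectors are mapped onto a 2-D grid.
import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    weights = rng.random((n_rows, n_cols, data.shape[1]))
    # Grid coordinates, used to compute neighbourhood influence around the winner.
    coords = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # Best-matching unit: the node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Learning rate and neighbourhood radius shrink over time.
        frac = t / iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
        influence = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)
    return weights

def project(data, weights):
    # Map each input vector to the grid position of its best-matching unit.
    return [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                             weights.shape[:2]) for x in data]
```

Plotting how many records land on each grid node, or coloring nodes by an attribute of interest, gives the kind of low-dimensional picture a data miner can inspect for clusters.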
Data and Model Assembly

The data and model assembly is a set of query functions that assemble the data and the data visualization instruments for data mining. Query tools are characterized by structured query language (SQL), the standard query language for relational database systems. To support human-computer collaboration effectively, query processing is necessary in data mining. As the ultimate objective of data retrieval and presentation is the formulation of knowledge, it is difficult to create a single standard query language for all purposes of data mining. Nevertheless, the following functionalities can be implemented through the design of queries that support the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge.

1. Schematics Examination: Through this query function, the data miner is allowed to set different values for the parameters of the data visualization instrument to perceive various schematic visual presentations.

2. Consistency Examination: To cross-check the data mining results, the data miner may choose different sets of data from the database to check whether the conclusion from one set of data is consistent with the others (a sketch of this idea follows the list). This query function allows the data miner to make such a consistency examination.

3. Relevancy Examination: It is a fundamental law that, to validate a data mining result, one must use external data that are not used in generating this result but are relevant to the problem being investigated. For instance, the data on customer attributes can be used for clustering to identify significant market segments for the company. However, to establish whether the market segments are relevant to a particular product, one must use separate product survey data. This query function allows the data miner to use various external data to examine the data mining results.

4. Dependability Examination: The concept of dependability examination in interactive visual data mining is similar to that of factor analysis in traditional statistical analysis, but the dependability examination query function is more comprehensive in determining whether a variable contributes to the data mining results in a certain way.

5. Homogeneousness Examination: Knowledge formulation often needs to identify the ranges of values of a determinant variable so that observations with values in a certain range of this variable have a homogeneous behavior. This query function provides an interactive mechanism for the data miner to decompose variables for homogeneousness examination.
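A hedged sketch of the consistency examination referred to above: the same mining procedure (here, k-means clustering with scikit-learn) is run on two disjoint subsets of the data and the resulting structures are compared on a held-out sample. The library, the choice of k, and the agreement score are illustrative assumptions, not part of the original framework.

```python
# Consistency examination: do two subsets of the database yield the same clusters?
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

data, _ = make_blobs(n_samples=600, centers=3, n_features=4, random_state=0)
half_a, half_b, held_out = data[:250], data[250:500], data[500:]

model_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half_a)
model_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(half_b)

# If the two models assign the held-out records to clusters in a similar way,
# the discovered segmentation is consistent across subsets of the database.
agreement = adjusted_rand_score(model_a.predict(held_out), model_b.predict(held_out))
print(f"cross-subset agreement (adjusted Rand index): {agreement:.2f}")
```

A high agreement score suggests that the discovered segmentation is stable across subsets of the database; a low score is a cue for the data miner to query further before trusting the pattern.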
Human-Computer Interface

The human-computer interface allows the data miner to dialog with the computer. It integrates the database, the data visualization instruments, and the data and model assembly into a single computing environment. Through the human-computer interface, the data miner is able to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results.
FUTURE TRENDS

Interactive visual data mining techniques will become key components of data mining instruments. More theories and techniques of interactive visual data mining will be developed in the near future, followed by comprehensive comparisons of these theories and techniques. Query systems along with data visualization functions on large-scale database systems for data mining will become available to data mining practitioners.
CONCLUSION

Given that a one-shot knowledge deduction may not provide an alternative result if it fails, we must provide an integrated computing environment for the data miner through interactive visual data mining. An interactive visual data mining system consists of three intertwined components, besides the database: the data visualization instrument, the data and model assembly instrument, and the human-computer interface. In interactive visual data mining, the human-computer interaction and effective visual presentations of multivariate data allow the data miner to interpret the data mining results based on the particular problem domain and his/her perception, specialty, and creativity. The ultimate objective of interactive visual data mining is to allow the data miner to conduct the experimental process and examination simultaneously through human-computer collaboration in order to obtain a “satisfactory” result.
REFERENCES

Bell, P. C. (1991). Visual interactive modelling: The past, the present, and the prospects. European Journal of Operational Research, 54(3), 274-286.

Chen, M., Zhu, Q., & Chen, Z. (2001). An integrated interactive environment for knowledge discovery from heterogeneous data resources. Information and Software Technology, 43(8), 487-496.

de Oliveira, M. C. F., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11), 27-34.

Han, J., Hu, X., & Cercone, N. (2003). A visualization model of interactive knowledge discovery systems and its implementations. Information Visualization, 2(2), 105-112.

Keim, D. A., & Kriegel, H. P. (1996). Visualization techniques for mining large databases: A comparison. IEEE Transactions on Knowledge & Data Engineering, 8(6), 923-938.

Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.

Kosslyn, S. M. (1980). Image and mind. Cambridge, MA: Harvard University Press.

Montazemi, A., & Wang, S. (1988). The impact of information presentation modes on decision making: A meta-analysis. Journal of Management Information Systems, 5(3), 101-127.

Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice Hall.

Shneiderman, B. (2002). Inventing discovery tools: Combining information visualization with data mining. Information Visualization, 1, 5-12.

Wang, S. (2000). Neural networks. In M. Zeleny (Ed.), IEBM Handbook of IT in Business (pp. 382-391). London, UK: International Thomson Business Press.

Wang, S. (2002). Nonlinear pattern hypothesis generation for data mining. Data & Knowledge Engineering, 40(3), 273-283.
Wang, S., & Wang, H. (2002). Knowledge discovery through self-organizing maps: Data visualization and query processing. Knowledge and Information Systems, 4(1), 31-45. Yang, L. (2003). Visual exploration of large relational data sets through 3D projections and footprint splatting. IEEE Transactions on Knowledge and Data Engineering, 15(6), 1460-1471.
KEY TERMS
Data and Model Assembly: A set of query functions that assemble the data and data visualization instruments for data mining.
Data Visualization: Presentation of data in human-understandable graphics, images, or animation.
Human-Computer Interface: Integrated computing environment that allows the data miner to access the data visualization instruments, select data sets, invoke the query process, organize the screen, set colors and animation speed, and manage the intermediate data mining results.
Interactive Data Mining: Human-computer collaborative knowledge discovery process through the interaction between the data miner and the computer to extract novel, plausible, useful, relevant, and interesting knowledge from the database.
Query Tool: Structured query language that supports the examination of the relevancy, usefulness, interestingness, and novelty of extracted knowledge for interactive data mining.
Self-Organizing Map (SOM): Two-layer neural network that maps high-dimensional data onto low-dimensional pictures through an unsupervised or competitive learning process. It allows the data miner to view the clusters on the output maps.
Visual Data Mining: Data mining process through data visualization. The fundamental concept of visual data mining is the interaction between data visual presentation, human graphics cognition, and problem solving.
Interscheme Properties' Role in Data Warehouses
Pasquale De Meo, Università “Mediterranea” di Reggio Calabria, Italy
Giorgio Terracina, Università della Calabria, Italy
Domenico Ursino, Università “Mediterranea” di Reggio Calabria, Italy
INTRODUCTION
In this article, we illustrate a general approach for the semi-automatic construction and management of data warehouses. Our approach is particularly suited when the number or the size of involved sources is large and/or when they change quite frequently over time. Our approach is based mainly on the semi-automatic derivation of interschema properties (i.e., terminological and structural relationships holding among concepts belonging to different input schemas). It consists of the following steps: (1) enrichment of schema descriptions obtained by the semi-automatic extraction of interschema properties; (2) exploitation of derived interschema properties for obtaining in a data repository an integrated and abstracted view of available data; and (3) design of a three-level data warehouse having as its core the derived data repository.
BACKGROUND In the last years, an enormous increase of data available in electronic form has been witnessed, as well as a corresponding proliferation of query languages, data models, and data management systems. Traditional approaches to data management do not seem to guarantee, in these cases, the needed level of access transparency to stored data while preserving the autonomy of local data sources. This situation contributed to push the development of new architectures for data source interoperability, allowing users to query preexisting autonomous data sources in a way that guarantees model language and location transparency. In all the architectures for data source interoperability, components handling the reconciliation of involved information sources play a relevant role. In the construction of these components, schema integration (Chua, Chiang, & Lim, 2003; dos Santos
Mello, Castano & Heuser, 2002; McBrien & Poulovassilis, 2003) plays a key role. However, when involved systems are numerous and/or large, schema integration alone typically ends up producing a too complex global schema that may, in fact, fail to supply a satisfactory and convenient description of available data. In these cases, schema integration steps must be completed by executing schema abstraction steps (Palopoli, Pontieri, Terracina & Ursino, 2000). Carrying out a schema abstraction activity amounts to clustering objects belonging to a schema into homogeneous subsets and producing an abstracted schema obtained by substituting each subset with one single object representing it. In order for schema integration and abstraction to be correctly carried out, the designer has to understand clearly the semantics of involved information sources. One of the most common ways for deriving and representing schema semantics consists in detecting the so-called interschema properties (Castano, De Antonellis & De Capitani di Vimercati, 2001; Doan, Madhavan, Dhamankar, Domingos & Levy, 2003; Gal, Anaby-Tavor, Trombetta & Montesi, 2004; Madhavan, Bernstein & Rahm, 2001; Melnik, Garcia-Molina & Rahm, 2002; Palopoli, Saccà, Terracina & Ursino, 2003; Palopoli, Terracina & Ursino, 2001; Rahm & Bernstein, 2001). These are terminological and structural properties relating concepts belonging to different schemas. In the literature, several manual methods for deriving interschema properties have been proposed (see Batini, Lenzerini, and Navathe (1986) for a survey about this argument). These methods can produce very precise and satisfactory results. However, since they require a great amount of work from the human expert, they are difficult to apply when involved sources are numerous and large. To handle large amounts of data, various semi-automatic methods also have been proposed. These are much less resource consuming than manual ones; moreover,
interschema properties obtained by semi-automatic techniques can be updated and maintained more simply. In the past, semi-automatic methods were based on considering only structural similarities among objects belonging to different schemas. Presently, all interschema property derivation techniques also take into account the context in which schema concepts have been defined (Rahm & Bernstein, 2001). The dramatic increase of available data sources led also to a large variety of both structured and semistructured data formats; in order to uniformly manage them, it is necessary to exploit a unified paradigm. In this context, one of the most promising solutions is XML. Due to its semi-structured nature, XML can be exploited as a unifying formalism for handling the interoperability of information sources characterized by heterogeneous data representation formats.
MAIN THRUST
Overview of the Approach
In this article, we define a new framework for uniformly and semi-automatically constructing a data warehouse from numerous and large information sources characterized by heterogeneous data representation formats. In more detail, the proposed framework consists of the following steps:
• Translation of involved information sources into XML ones.
• Application to the XML sources derived in the previous step of almost automatic techniques for detecting interschema properties, specifically conceived to operate on XML environments.
• Exploitation of derived interschema properties for constructing an integrated and uniform representation of involved information sources.
• Exploitation of this representation as the core of the reconciled data level of a data warehouse1.
In the following subsections, we will illustrate the last three steps of this framework. The translation step is not discussed here, because it is performed by applying the translation rules from the involved source formats to XML already proposed in the literature.
Extraction of Interschema Properties
A possible taxonomy classifies interschema properties into terminological properties, subschema similarities, and structural properties.
Terminological properties are synonymies, homonymies, hyponymies, overlappings, and type conflicts. A synonymy between two concepts A and B indicates that they have the same meaning. A homonymy between two concepts A and B indicates that they have the same name but different meanings. Concept A is said to be a hyponym of concept B (which, in turn, is a hypernym of A), if A has a more specific meaning than B. An overlapping exists between concepts A and B, if they are neither synonyms nor hyponyms of the other but share a significant set of properties; more formally, there exists an overlapping between A and B, if there exist non-empty sets of properties {pA1, pA2, …, pAn} of A and {pB1, pB2, …, pBn} of B such that, for 1 ≤ i ≤ n, pAi is a synonym of pBi. A type conflict indicates that the same concept is represented by different constructs (e.g., an element and an attribute in an XML source) in different schemas. A subschema similarity represents a similitude between fragments of different schemas. Structural properties are inclusions and assertions between knowledge patterns. An inclusion between two concepts A and B indicates that the instances of A are a subset of the instances of B. An assertion between knowledge patterns indicates either a subsumption or an equivalence between knowledge patterns. Roughly speaking, knowledge patterns can be seen as views on involved information sources.
Our interschema property extraction approach is characterized by the following features: (1) it is XML-based; (2) it is almost automatic; and (3) it is semantic. Given two concepts belonging to different information sources, one of the most common ways for determining their semantics consists of examining their neighborhoods, since the concepts and the relationships in which they are involved contribute to define their meaning. In addition, our approach exploits two further indicators for defining in a more precise fashion the semantics of involved data sources. These indicators are the types and the cardinalities of the elements, taking the attributes belonging to the XML schemas into consideration. It is clear from this reasoning that concept neighborhood plays a crucial role in the interschema property computation. In XML schemas, concepts are expressed by elements or attributes. Since, for the interschema property extraction, it is not important to distinguish concepts represented by elements from concepts represented by attributes, we introduce the term x-component for denoting an element or an attribute in an XML schema. In order to compute the neighborhood of an x-component, it is necessary to define a semantic distance2 between two x-components of the same schema; it takes
into account how much they are related. The formulas for computing this distance are quite complex; due to space limitations we cannot show them here. However, the interested reader can refer to De Meo, Quattrone, Terracina, and Ursino (2003) for a detailed illustration of them. We can now define the neighborhood of an x-component. In particular, given an x-component xS of an XML schema S, the neighborhood of level j of xS consists of all x-components of S whose semantic distance from xS is less than or equal to j.
In order to verify if two x-components x1j, belonging to an XML schema S1, and x2k, belonging to an XML schema S2, are synonymous, it is necessary to examine their neighborhoods. More specifically, first, it is necessary to verify if their nearest neighborhoods (i.e., the neighborhoods of level 0) are similar. This decision is made by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the x-components of the neighborhoods under consideration and the lexical synonymies stored in a thesaurus (e.g., WordNet)3. If these two neighborhoods are similar, then x1j and x2k are assumed to be synonymous. However, observe that the neighborhoods of level 0 of x1j and x2k provide quite a limited vision of their contexts. If a higher certainty on the synonymy between x1j and x2k is required, it is necessary to verify the similarity, not only of their neighborhoods of level 0, but also of the other neighborhoods. As a consequence, it is possible to introduce a severity level u against which interschema properties can be determined, and to say that x1j and x2k are synonymous with a severity level u, if all neighborhoods of x1j and x2k of level less than or equal to u are similar.
After all synonymies of S1 and S2 have been extracted, homonymies can be derived. In particular, there exists a homonymy between two x-components x1j and x2k with a severity level u if: (1) x1j and x2k have the same name; (2) both of them are elements or both of them are attributes; and (3) they are not synonymous with a severity level u. In other words, a homonymy indicates that two concepts having the same name represent different meanings. Due to space constraints, we cannot describe in this article the derivation of all the other interschema properties mentioned; however, it follows the same philosophy as the detection of synonymies and homonymies. The interested reader can find a detailed description of it in Ursino (2002).
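A minimal sketch of this severity-level synonymy test is given below, in Python. It is only an illustration of the idea, not the authors' exact objective function: the thesaurus is reduced to a plain set of synonym pairs standing in for WordNet, the similarity score is a simple normalized matching weight, and the threshold value is an assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

def name_similarity(a, b, synonyms):
    # 1.0 when the two names are equal or recorded as lexical synonyms, 0.0 otherwise.
    return 1.0 if a == b or (a, b) in synonyms or (b, a) in synonyms else 0.0

def neighborhood_similarity(nbh1, nbh2, synonyms):
    # Objective value of a maximum weight matching between the two neighborhoods,
    # normalized by the size of the larger neighborhood.
    nbh1, nbh2 = list(nbh1), list(nbh2)
    if not nbh1 or not nbh2:
        return 0.0
    weights = np.array([[name_similarity(x, y, synonyms) for y in nbh2] for x in nbh1])
    rows, cols = linear_sum_assignment(-weights)  # negate to maximize the total weight
    return weights[rows, cols].sum() / max(len(nbh1), len(nbh2))

def synonymous(neighborhoods1, neighborhoods2, synonyms, severity, threshold=0.5):
    # x1 and x2 are declared synonymous with the given severity level if all their
    # neighborhoods of level 0..severity are similar enough.
    return all(
        neighborhood_similarity(neighborhoods1[l], neighborhoods2[l], synonyms) >= threshold
        for l in range(severity + 1)
    )

Here neighborhoods1[l] and neighborhoods2[l] are assumed to be the (precomputed) sets of x-component names in the level-l neighborhoods of the two x-components being compared.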
Construction of a Uniform Representation
Detected interschema properties can be exploited for constructing a global representation of the involved information
sources; this becomes the core of the reconciled data level in a three-level DW. Generally, in classical approaches, this global representation is obtained by integrating all involved data sources into a unique one. However, when involved sources are numerous and large, a unique global schema presumably encodes an enormous number and variety of objects and becomes far too complex to be used effectively. In order to overcome the drawbacks mentioned previously, our approach does not directly integrate involved source schemas to construct a global flat schema. Rather, it first groups them into homogeneous clusters and then integrates schemas on a cluster-by-cluster basis. Each integrated schema thus obtained is then abstracted to construct a global schema representing the cluster. The aforementioned process is iterated over the set of obtained cluster schemas, until one schema is left. In this way, a hierarchical structure is obtained, which is called Data Repository (DR). Each cluster of a DR represents a group of homogeneous schemas and is, in turn, represented by a schema (hereafter called C-schema). Clusters of level n of the hierarchy are obtained by grouping some C-schemas of level n-1; clusters of level 0 are obtained by grouping input source schemas. Therefore, each cluster Cl is characterized by (1) its identifier C-id; (2) its C-schema; (3) the group of identifiers of clusters whose C-schemas originated the C-schema of Cl (hereafter called O-identifiers); (4) the set of interschema properties involving objects belonging to the C-schemas that originated the C-schema of Cl; and (5) a level index. It is clear from this reasoning that the three fundamental operations for obtaining a DR are (1) schema clustering (Han & Kamber, 2001), which takes a set of schemas as input and groups them into semantically homogeneous clusters; (2) schema integration, which produces a global schema from a set of heterogeneous input schemas; and (3) schema abstraction, which groups concepts of a schema into homogeneous clusters and, in the abstracted schema, represents each cluster with only one concept.
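The bookkeeping behind the Data Repository construction can be sketched as follows. This is only an illustrative Python outline under the assumption that the clustering, integration, abstraction, and property-extraction operations are supplied as functions by the surrounding system; their names and signatures are not part of the original approach.

from dataclasses import dataclass

@dataclass
class DRCluster:
    c_id: int            # cluster identifier (C-id)
    c_schema: object     # schema representing the cluster (C-schema)
    o_identifiers: list  # identifiers of the clusters/sources that originated it
    properties: list     # interschema properties among the originating schemas
    level: int           # level index in the hierarchy

def build_data_repository(source_schemas, cluster_schemas, integrate, abstract, extract_properties):
    # Level -1 entries wrap the original source schemas; clusters of level 0 group them.
    current = [DRCluster(i, s, [], [], -1) for i, s in enumerate(source_schemas)]
    repository, next_id, level = list(current), len(current), 0
    while len(current) > 1:
        new_level = []
        for member_idxs in cluster_schemas([c.c_schema for c in current]):
            members = [current[i] for i in member_idxs]
            props = extract_properties([m.c_schema for m in members])
            c_schema = abstract(integrate([m.c_schema for m in members], props))
            new_level.append(DRCluster(next_id, c_schema, [m.c_id for m in members], props, level))
            next_id += 1
        repository.extend(new_level)
        current, level = new_level, level + 1
    return repository

The loop stops when a single C-schema is left, which plays the role of the root of the DR hierarchy.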
Exploitation of the Uniform Representation for Constructing a DW
The Data Repository can be exploited as the core structure of the reconciled level of a new three-level DW architecture. Indeed, different from classical three-level architectures, in order to reconcile data, we do not directly integrate involved schemas to construct a flat global schema. Rather, we first collect subsets of involved schemas into homogeneous clusters and construct a DR that is used as the core of the reconciled data level.
In order to pinpoint the differences between classical three-level DW architectures and ours, the following observations can be drawn:
• A classical three-level DW architecture is a particular case of the one proposed here, since it corresponds to a case where involved sources are all grouped into one cluster and no abstraction is carried out over the associated C-schema.
• The architecture we propose here is naturally conducive to an incremental DW construction.
• In a classical three-level architecture designed over a large number of sources, presumably hundreds of concepts are represented in the global schema. In particular, the global schema can be seen as partitioned into subschemas loosely related to each other, whereas each subschema contains objects tightly related to each other. Such a difficulty does not characterize our architecture, where a source cluster is associated with a precise semantics.
• In our architecture, data mart design is presumably simpler than with classical architectures, since each data mart will be defined over a group of data sources spanned by a subtree of our core DR rooted at some C-schema of level k, for some k.
• In our architecture, reconciled data are (virtually) represented at various abstraction levels within the core DR.
It follows from these observations that, by paying a limited price in terms of required space and computation time, we obtain an architecture that retains all the strengths of classical three-level architectures but overcomes some of their limitations arising when involved data sources are numerous and large.
FUTURE TRENDS
In the next years, interschema properties presumably will play a relevant role in various applications involving heterogeneous data sources. Among them, we cite ontology matching, the semantic Web, e-services, semantic query processing, Web-based financial services, and biological data management. As an example, in this last application field, the high-throughput techniques for data collection developed in the last years have led to an enormous increase of available biological databases; for instance, the largest public DNA database contains much more than 20 GB of data (Hunt, Atkinson & Irving, 2002). Interschema properties can play a relevant role in this context; indeed, they can allow the manipulation of biological data sources possibly relative to different, yet related and complementary, application contexts.
CONCLUSION
In this article, we have illustrated an approach for the semi-automatic construction and management of data warehouses. We have shown that our approach is particularly suited when the number or the size of involved sources is large and/or when they change quite frequently over time. Various experiments have been conducted for verifying the applicability of our approach. The interested reader can find many of them in Ursino (2002), as far as their application to Italian Central Government Offices databases is concerned, and in De Meo, Quattrone, Terracina, and Ursino (2003) for their application to semantically heterogeneous XML sources. In the future, we plan to extend our studies on the role of interschema properties to the various application fields mentioned in the previous section.
REFERENCES Batini, C., Lenzerini, M., & Navathe, S.B. (1986). A comparative analysis of methodologies for database scheme integration. ACM Computing Surveys, 15(4), 323-364. Castano, S., De Antonellis, V., & De Capitani di Vimercati, S. (2001). Global viewing of heterogeneous data sources. IEEE Transactions on Data and Knowledge Engineering, 13(2), 277-297. Chua, C.E.H, Chiang, R.H.L., & Lim, E.P. (2003). Instancebased attribute identification in database integration. The International Journal on Very Large Databases, 12(3), 228-243. De Meo, P., Quattrone, G., Terracina, G., & Ursino, D. (2003). “Almost automatic” and semantic integration of XML schemas at various “severity levels.” Proceedings of the International Conference on Cooperative Information Systems (CoopIS 2003), Taormina, Italy. Doan, A., Madhavan, J., Dhamankar, R., Domingos, P., & Halevy, A. (2003). Learning to match ontologies on the semantic Web. The International Journal on Very Large Databases, 12(4), 303-319. dos Santos Mello, R., Castano, S., & Heuser, C.A. (2002). A method for the unification of XML schemata. Information & Software Technology, 44(4), 241-249.
Gal, A., Anaby-Tavor, A., Trombetta, A., & Montesi, D. (2004). A framework for modeling and evaluating automatic semantic reconciliation. The International Journal on Very Large Databases [forthcoming]. Han, J. & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers. Hunt, E., Atkinson, M.P., & Irving, R.W. (2002). Database indexing for large DNA and protein sequence collections. The International Journal on Very Large Databases, 11(3), 256-271. Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with cupid. Proceedings of the International Conference on Very Large Data Bases (VLDB 2001), Rome, Italy. McBrien, P., & Poulovassilis, A. (2003). Data integration by bi-directional schema transformation rules. Proceedings of the International Conference on Data Engineering (ICDE 2003), Bangalore, India. Melnik, S., Garcia-Molina, H., & Rahm, E. (2002). Similarity flooding: A versatile graph matching algorithm and its application to schema matching. Proceedings of the International Conference on Data Engineering (ICDE 2002), San Josè, California. Palopoli, L., Pontieri, L., Terracina, G., & Ursino, D. (2000). Intensional and extensional integration and abstraction of heterogeneous databases. Data & Knowledge Engineering, 35(3), 201-237. Palopoli, L., Saccà, D., Terracina, G., & Ursino, D. (2003). Uniform techniques for deriving similarities of objects and subschemas in heterogeneous databases. IEEE Transactions on Knowledge and Data Engineering, 15(2), 271-294. Palopoli, L., Terracina, G., & Ursino, D. (2001). A graphbased approach for extracting terminological properties of elements of XML documents. Proceedings of the International Conference on Data Engineering (ICDE 2001), Heidelberg, Germany. Rahm, E., & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The International Journal on Very Large Databases, 10(4), 334-350. Ursino, D. (2002). Extraction and exploitation of intensional knowledge from heterogeneous information sources: Semi-automatic approaches and tools. Springer.
KEY TERMS
Assertion Between Knowledge Patterns: A particular interschema property. It indicates either a subsumption or an equivalence between knowledge patterns. Roughly speaking, knowledge patterns can be seen as views on involved information sources.
Data Repository: A complex catalogue of a set of sources organizing both their description and all associated information at various abstraction levels.
Homonymy: A particular interschema property. A homonymy between two concepts A and B indicates that they have the same name but different meanings.
Hyponymy/Hypernymy: A particular interschema property. Concept A is said to be a hyponym of a concept B (which, in turn, is a hypernym of A), if A has a more specific meaning than B.
Interschema Properties: Terminological and structural relationships involving concepts belonging to different sources.
Overlapping: A particular interschema property. An overlapping exists between two concepts A and B, if they are neither synonyms nor hyponyms of the other but share a significant set of properties; more formally, there exists an overlapping between A and B, if there exist non-empty sets of properties {pA1, pA2, …, pAn} of A and {pB1, pB2, …, pBn} of B such that, for 1 ≤ i ≤ n, pAi is a synonym of pBi.
Schema Abstraction: The activity that clusters objects belonging to a schema into homogeneous groups and produces an abstracted schema obtained by substituting each group with one single object representing it.
Schema Integration: The activity by which different input source schemas are merged into a global structure representing all of them.
Subschema Similarity: A particular interschema property. It represents a similitude between fragments of different schemas.
Synonymy: A particular interschema property. A synonymy between two concepts A and B indicates that they have the same meaning.
Type Conflict: A particular interschema property. It indicates that the same concept is represented by different constructs (e.g., an element and an attribute in an XML source) in different schemas.
ENDNOTES
1. Here and in the following, we shall consider a three-level data warehouse architecture.
2. Semantic distance is often called connection cost in the literature.
3. Clearly, if necessary, a more specific thesaurus, possibly constructed with the support of a human expert, might be used.
Inter-Transactional Association Analysis for Prediction
Ling Feng, University of Twente, The Netherlands
Tharam Dillon, University of Technology Sydney, Australia
INTRODUCTION
The discovery of association rules from large amounts of structured or semi-structured data is an important data-mining problem (Agrawal et al., 1993; Agrawal & Srikant, 1994; Braga et al., 2002, 2003; Cong et al., 2002; Miyahara et al., 2001; Termier et al., 2002; Xiao et al., 2003). It has crucial applications in decision support and marketing strategy. The most prototypical application of association rules is market-basket analysis using transaction databases from supermarkets. These databases contain sales transaction records, each of which details items bought by a customer in the transaction. Mining association rules is the process of discovering knowledge such as, 80% of customers who bought diapers also bought beer, and 35% of customers bought both diapers and beer, which can be expressed as “diaper ⇒ beer” (35%, 80%), where 80% is the confidence level of the rule, and 35% is the support level of the rule indicating how frequently the customers bought both diapers and beer. In general, an association rule takes the form X ⇒ Y (s, c), where X and Y are sets of items, and s and c are support and confidence, respectively.
BACKGROUND
While the traditional association rules have demonstrated strong potential in areas such as improving marketing strategies for the retail industry (Dunham, 2003; Han & Kamber, 2001), their emphasis is on description rather than prediction. Such a limitation comes from the fact that traditional association rules only look at association relationships among items within the same transactions, whereas the notion of the transaction could be the items bought by the same customer, the atmospheric events that happened at the same time, and so on. To overcome this limitation, we extend the scope of mining association rules from such traditional intratransactional associations to intertransactional associations for prediction (Feng et al., 1999, 2001; Lu et al., 2000). Compared to
intratransactional associations, an intertransactional association describes the association relationships across different transactions, such as, if (company) A’s stock goes up on day one, B’s stock will go down on day two but go up on day four. In this case, whether we treat company or day as the unit of transaction, the associated items belong to different transactions.
MAIN THRUST
Extensions from Intratransaction to Intertransaction Associations
We extend a series of concepts and terminologies for intertransactional association analysis. Throughout the discussion, we assume that the following notation is used:
• A finite set of literals called items I = {i1, i2, …, in}.
• A finite set of transaction records T = {t1, t2, …, tl}, where for ∀ti ∈ T, ti ⊆ I.
• A finite set of attributes called dimensional attributes A = {a1, a2, …, am}, whose domains are finite subsets of nonnegative integers.
An Enhanced Transactional Database Model
In classical association analysis, records in a transactional database contain only items. Although transactions occur under certain contexts, such as time, place, customers, and so forth, such contextual information has been ignored in classical association rule mining, due to the fact that such rule mining was intratransactional in nature. However, when we talk about intertransactional associations across multiple transactions, the contexts of occurrence of transactions become important and must be taken into account. Here, we enhance the traditional transactional database model by associating each transaction record with
a number of attributes that describe the context within which the transaction happens. We call them dimensional attributes, because, together, these attributes constitute a multi-dimensional space, and each transaction can be mapped to a certain point in this space. Basically, dimensional attributes can be of any kind, as long as they are meaningful to applications. Time, distance, temperature, latitude, and so forth are typical dimensional attributes.
Multidimensional Contexts
An m-dimensional mining context can be defined through m dimensional attributes a1, a2, …, am, each of which represents a dimension. When m = 1, we have a single-dimensional mining context. Let ni = (ni.a1, ni.a2, …, ni.am) and nj = (nj.a1, nj.a2, …, nj.am) be two points in an m-dimensional space, whose values on the m dimensions are represented as ni.a1, ni.a2, …, ni.am and nj.a1, nj.a2, …, nj.am, respectively. Two points ni and nj are equal if and only if, for ∀k (1 ≤ k ≤ m), ni.ak = nj.ak. A relative distance between ni and nj is defined as ∆〈ni, nj〉 = (nj.a1-ni.a1, nj.a2-ni.a2, …, nj.am-ni.am). We also use the notation ∆(d1, d2, …, dm), where dk = nj.ak-ni.ak (1 ≤ k ≤ m), to represent the relative distance between two points ni and nj in the m-dimensional space. Besides the absolute representation (ni.a1, ni.a2, …, ni.am) for point ni, we can also represent it by indicating its relative distance ∆〈n0, ni〉 from a certain reference point n0 (i.e., as n0+∆〈n0, ni〉, where ni = n0+∆〈n0, ni〉). Note that ni, ∆〈n0, ni〉, and ∆(ni.a1-n0.a1, ni.a2-n0.a2, …, ni.am-n0.am) can be used interchangeably, since each of them refers to the same point ni in the space. Let N = {n1, n2, …, nu} be a set of points in an m-dimensional space. We construct the smallest reference point of N, n*, where for ∀k (1 ≤ k ≤ m), n*.ak = min (n1.ak, n2.ak, …, nu.ak).
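The point arithmetic just defined is simple enough to state directly in code. The following Python sketch represents points as tuples of nonnegative integers over the m dimensional attributes; the function names are illustrative.

def relative_distance(ni, nj):
    # Delta<ni, nj> = (nj.a1 - ni.a1, ..., nj.am - ni.am)
    return tuple(b - a for a, b in zip(ni, nj))

def smallest_reference_point(points):
    # n* with n*.ak = min over the set for every dimension k
    return tuple(min(coords) for coords in zip(*points))

# Example: two records indexed by (day,) in a single-dimensional context.
n0, n1 = (3,), (5,)
assert relative_distance(n0, n1) == (2,)
assert smallest_reference_point([(3, 7), (5, 2), (4, 4)]) == (3, 2)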
Extended Items (Transactions)
The traditional concepts regarding item and transaction can be extended accordingly under an m-dimensional context. We call an item ik ∈ I happening at the point ∆(d1, d2, …, dm) (i.e., at the point (n0.a1+d1, n0.a2+d2, …, n0.am+dm)) an extended item and denote it as ∆(d1, d2, …, dm)(ik). In a similar fashion, we call a transaction tk ∈ T happening at the point ∆(d1, d2, …, dm) an extended transaction and denote it as ∆(d1, d2, …, dm)(tk). The set of all possible extended items, IE, is defined as the set of ∆(d1, d2, …, dm)(ik) for any ik ∈ I at all possible points ∆(d1, d2, …, dm) in the m-dimensional space. TE is the set of all extended transactions, each of which contains a set of extended items, in the mining context.
Normalized Extended Item (Transaction) Sets
We call an extended itemset a normalized extended itemset if all its extended items are positioned with respect to the smallest reference point of the set. In other words, the extended items in the set have the minimal relative distance 0 for each dimension. Formally, let Ie = {∆(d1,1, d1,2, …, d1,m)(i1), ∆(d2,1, d2,2, …, d2,m)(i2), …, ∆(dk,1, dk,2, …, dk,m)(ik)} be an extended itemset. Ie is a normalized extended itemset if and only if, for ∀i (1 ≤ i ≤ m), min1≤j≤k (dj,i) = 0. The normalization concept can be applied to an extended transaction set as well. We call an extended transaction set a normalized extended transaction set if all its extended transactions are positioned with respect to the smallest reference point of the set. Any non-normalized extended item (transaction) set can be transformed into a normalized one through a normalization process, where the intention is to reposition all the involved extended items (transactions) based on the smallest reference point of this set. We use INE and TNE to denote the set of all possible normalized extended itemsets and normalized extended transaction sets, respectively. According to the above definitions, any superset of a normalized extended item (transaction) set is also a normalized extended item (transaction) set.
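A small Python sketch of the normalization process follows; it assumes extended items are represented as (offset, item) pairs, which is an illustrative encoding rather than the original notation.

def normalize_extended_itemset(extended_items):
    # Reposition all offsets relative to the smallest reference point of the set,
    # so every dimension has minimal relative distance 0.
    offsets = [offset for offset, _ in extended_items]
    reference = tuple(min(coords) for coords in zip(*offsets))
    return {(tuple(o - r for o, r in zip(offset, reference)), item)
            for offset, item in extended_items}

# {Delta(2)(a), Delta(5)(b)} becomes the normalized set {Delta(0)(a), Delta(3)(b)}.
assert normalize_extended_itemset([((2,), "a"), ((5,), "b")]) == {((0,), "a"), ((3,), "b")}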
Multidimensional Intertransactional Association Rule Framework
With the above extensions, we are now in a position to formally define intertransactional association rules and related measurements.
Definition 1
A multidimensional intertransactional association rule is an implication of the form X ⇒ Y, where:
(1) X ⊂ INE and Y ⊂ IE;
(2) the extended items in X and Y are positioned with respect to the same reference point;
(3) for ∀∆(x1, x2, …, xm)(ix) ∈ X and ∀∆(y1, y2, …, ym)(iy) ∈ Y, xj ≤ yj (1 ≤ j ≤ m);
(4) X ∩ Y = ∅.
Different from classical intratransactional association rules, the intertransactional association rules capture the occurrence contexts of associated items. The first clause
of the definition requires the antecedent and consequent of an intertransactional association rule to be a normalized extended itemset and an extended itemset, respectively. The second clause of the definition ensures that items in X and Y are comparable in terms of their contextual positions. For prediction, each of the consequent items in Y takes place in a context later than any of its precedent items in X, as stated by the third clause.
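The clauses of Definition 1 can be checked mechanically. The Python sketch below is an illustration under the assumption that X and Y are sets of (offset, item) extended items already positioned with respect to the same reference point (clause 2 is taken for granted by that representation); names and the example rule are hypothetical.

def is_valid_intertransactional_rule(X, Y):
    x_offsets = [offset for offset, _ in X]
    # Clause 1: X must be normalized (minimal relative distance 0 per dimension).
    normalized = all(min(coords) == 0 for coords in zip(*x_offsets))
    # Clause 3: every consequent item is positioned no earlier than every antecedent item.
    ordered = all(xj <= yj
                  for x_off, _ in X for y_off, _ in Y
                  for xj, yj in zip(x_off, y_off))
    # Clause 4: X and Y share no extended item.
    disjoint = not (set(X) & set(Y))
    return normalized and ordered and disjoint

X = {((0,), "stock_A_up")}
Y = {((1,), "stock_B_down"), ((3,), "stock_B_up")}
assert is_valid_intertransactional_rule(X, Y)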
Definition 2
Given a normalized extended itemset X and an extended itemset Y, let |Txy| be the total number of minimal extended transaction sets that contain X∪Y, |Tx| be the total number of minimal extended transaction sets that contain X, and |Te| be the total number of extended transactions in the database. The support and confidence of an intertransactional association rule X ⇒ Y are support(X ⇒ Y) = |Txy|/|Te| and confidence(X ⇒ Y) = |Txy|/|Tx|.
Discovery of Intertransactional Association Rules
To investigate the feasibility of mining intertransactional rules, we extended the classical a priori algorithm and applied it to weather forecasting. To limit the search space, we set a maximum span threshold, maxspan, to define a sliding window along time dimensions. Only the associations among the items that co-occurred within the window are of interest. Basically, the mining process of intertransactional association rules can be divided into two steps: frequent extended itemset discovery and association rule generation.
1. Frequent Extended Itemset Discovery
In this phase, we find the set of all frequent extended itemsets. For simplicity, in the following, we use itemset and extended itemset, transaction and extended transaction interchangeably. Let Lk represent the set of frequent k-itemsets and Ck the set of candidate k-itemsets. The algorithm makes multiple passes over the database. Each pass consists of two phases. First, the set of all frequent (k-1)-itemsets Lk-1 found in the (k-1)th pass is used to generate the candidate itemset Ck. The candidate generation procedure ensures that Ck is a superset of the set of all frequent k-itemsets. The algorithm now scans the database. For each list of consecutive transactions, it determines which candidates in Ck are contained and increments their counts. At the end of the pass, Ck is examined to check which of the candidates actually are frequent, yielding Lk. The algorithm terminates when Lk becomes empty. In the following, we detail the procedures for candidate generation and support counting.
Candidate Generation
• Pass 1: Let I = {1, 2, ..., m} be a set of items in a database. To generate the candidate set C1 of 1-itemsets, we need to associate all possible intervals with each item. That is,
C1 = { ∆0(1), ∆1(1), …, ∆maxspan(1), ∆0(2), ∆1(2), …, ∆maxspan(2), …, ∆0(m), ∆1(m), …, ∆maxspan(m) }.
Starting from transaction t at the reference point ∆s (i.e., extended transaction ∆s(t)), the transaction t´ at the point ∆s+x in the dimensional space (i.e., extended transaction ∆s+x(t´)) is scanned to determine whether item in exists. If so, the count of ∆x(in) increases by one. One scan of the database will deliver the frequent set L1.
• Pass 2: We generate a candidate 2-itemset {∆0(im), ∆x(in)} from any two frequent 1-itemsets in L1, ∆0(im) and ∆x(in), and obtain C2 = {{∆0(im), ∆x(in)} | (x = 0 ∧ im ≠ in) ∨ x ≠ 0}.
• Pass k>2: Given Lk-1, the set of all frequent (k-1)-itemsets, the candidate generation procedure returns a superset of the set of all frequent k-itemsets. This procedure has two parts. In the join phase, we join two frequent (k-1)-itemsets in Lk-1 to derive a candidate k-itemset. Let p = {∆u1(i1), ∆u2(i2), …, ∆uk-1(ik-1)} and q = {∆v1(j1), ∆v2(j2), …, ∆vk-1(jk-1)}, where p, q ∈ Lk-1; we have
insert into Ck
select p.∆u1(i1), …, p.∆uk-1(ik-1), q.∆vk-1(jk-1)
from p in Lk-1, q in Lk-1
where (i1 = j1 ∧ u1 = v1) ∧ … ∧ (ik-2 = jk-2 ∧ uk-2 = vk-2) ∧ ((uk-1 < vk-1) ∨ (uk-1 = vk-1 ∧ ik-1 < jk-1)). In the prune phase, every candidate k-itemset in Ck that has some (k-1)-subset not contained in Lk-1 is deleted. A sketch of this level-wise candidate generation is given below.
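The following Python outline illustrates the level-wise loop and a generic join step. It is a simplification under stated assumptions: extended items are (offset, item) pairs, itemsets are kept as sorted tuples, the C2 construction above is approximated by the generic join, and the prune step and the window-based support counting are left to a user-supplied counting function.

from itertools import combinations

def join_candidates(frequent_prev):
    # Join frequent (k-1)-itemsets that share their first k-2 extended items.
    candidates = set()
    for p, q in combinations(sorted(frequent_prev), 2):
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            candidates.add(p + (q[-1],))
    return candidates

def mine_frequent_extended_itemsets(count_support, frequent_1, min_support):
    # count_support performs one scan over the extended transactions within the
    # maxspan sliding window and returns a dict of candidate -> count.
    frequents, k, current = {1: frequent_1}, 1, frequent_1
    while current:
        candidates = join_candidates(current)
        counts = count_support(candidates)
        current = {c for c in candidates if counts.get(c, 0) >= min_support}
        if current:
            k += 1
            frequents[k] = current
    return frequents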
Support Counting
To facilitate the efficient support counting process, a candidate set Ck of k-itemsets is divided into k groups, with each group Go containing the itemsets that have o items whose intervals are 0 (1 ≤ o ≤ k). For example, a candidate set of 3-itemsets
C3 = { {∆0(a), ∆1(a), ∆2(b)}, {∆0(c), ∆0(d), ∆2(d)}, {∆0(a), ∆0(b), ∆3(h)}, {∆0(l), ∆0(m), ∆0(n)}, {∆0(p), ∆0(q), ∆0(r)} } is divided into three groups: G1 = {{∆0(a), ∆1(a), ∆2(b)}}, G2 = {{∆0(c), ∆0(d), ∆2(d)}, {∆0(a), ∆0(b), ∆3(h)}}, G3 = {{∆0(l), ∆0(m), ∆0(n)}, {∆0(p), ∆0(q), ∆0(r)}}. Each group is stored in a modified hash-tree. Only those items with interval 0 participate in the construction of this hash tree (e.g., in G2, only {∆0(c), ∆0(d)} and {∆0(a), ∆0(b)} enter the hash-tree). The construction process is similar to that of a priori. The remaining items, ∆2(d) and ∆3(h), are simply attached to the corresponding itemsets, {∆0(c), ∆0(d)} and {∆0(a), ∆0(b)}, respectively, in the leaves of the tree. Upon reading one transaction of the database, every hash tree is tested. If one itemset is contained, its attached itemsets whose intervals are larger than 0 will be checked against the consecutive transactions. In the previous example, if {∆0(a), ∆0(b)} exists in the current extended transaction at the point ∆s, then the extended transaction ∆s+3(t´) will be scanned to see whether it contains item h. If so, the support of the 3-itemset {∆0(a), ∆0(b), ∆3(h)} will increase by 1.
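The grouping step that precedes hash-tree construction can be sketched as follows in Python; the encoding of extended items as (offset, item) pairs mirrors the C3/G1/G2/G3 example above, while the hash-tree itself is omitted.

from collections import defaultdict

def group_by_zero_intervals(candidates):
    # Partition candidate k-itemsets by the number o of their extended items whose interval is 0.
    groups = defaultdict(list)
    for itemset in candidates:
        o = sum(1 for offset, _ in itemset if all(d == 0 for d in offset))
        groups[o].append(itemset)
    return dict(groups)

C3 = [
    (((0,), "a"), ((1,), "a"), ((2,), "b")),
    (((0,), "c"), ((0,), "d"), ((2,), "d")),
    (((0,), "a"), ((0,), "b"), ((3,), "h")),
    (((0,), "l"), ((0,), "m"), ((0,), "n")),
    (((0,), "p"), ((0,), "q"), ((0,), "r")),
]
groups = group_by_zero_intervals(C3)
assert len(groups[1]) == 1 and len(groups[2]) == 2 and len(groups[3]) == 2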
Association Rule Generation Using sets of frequent itemsets, we can find the desired intertransactional association rules. The generation of intertransaction association rules is similar to the generation of the classical association rules.
Application of Intertransactional Association Rules to Weather Prediction
We apply the previous algorithm to studying Hong Kong meteorological data. The database records wind direction,
wind speed, temperature, relative humidity, rainfall, and mean sea level pressure, and so forth every six hours each day. In some data records, certain atmospheric observations, such as relative humidity, are missing. Since the context constituted by the time dimension is valid for the whole set of meteorological records (transactions), we fill in these empty fields by averaging their nearby values. In this way, the data to be mined contains no missing fields; in addition, no database holes (i.e., meaningless contexts) exist in the mining space. Essentially, there is one dimension in this case; namely, time. After preprocessing the data set, we discover intertransactional association rules from the 1995 meteorological records and then examine their prediction accuracy using the 1996 meteorological data from the same area in Hong Kong. Considering seasonal changes of weather, we extract records from May to October for our experiments, totaling 736 records (total_days * record_num_per_day = (31+30+31+ 31+30+31) * 4 = 736) for each year. These raw data sets, containing continuous data, are further converted into appropriate formats with which the algorithms can work. Each record has six meteorological elements (items). The interval of every two consecutive records is six hours. We set maxspan=11 in order to detect the association rules for a three-day horizon (i.e., (11+1) / 4 = 3). At support=45% and confidence=92%, from the 1995 meteorological data, we found only one classical association rule: if the humidity is medium wet, then there is no rain at the same time (which is quite obvious), but we found 580 intertransactional association rules. Note that the number of intertransactional association rules returned depends on the maxspan parameter setting. Table 1 lists some significant intertransactional association rules found from the single-station meteorological data. We measure their predictive capabilities using the 1996 meteorological data recorded by the same station through Prediction-Rate (X⇒Y) = sup(X∪Y) / sup(X), which can achieve more than 90% prediction rate. From the test results, we find that, with intertransactional association rules, more comprehensive and interesting knowledge can be discovered from the databases.
Table 1. Some significant intertransactional association rules found from meteorological data
Extended Association Rule | Sup. | Conf. | Pred. Rate
∆0(2), ∆3(13) ⇒ ∆4(13) (If there is an east wind direction and no rain in 18 hours, then there will also be no rain in 24 hours.) | 46% | 92% | 90%
∆0(2), ∆0(25), ∆2(25) ⇒ ∆3(25) (If it is currently warm with an east wind direction, and still warm 12 hours later, then it will be continuously warm until 18 hours later.) | 45% | 95% | 91.8%
FUTURE TRENDS
The intertransactional association rule framework provides a general view of associations among events (Feng et al., 2001).
• Traditional Association Rule Mining: If the dimensional attributes are not present in the knowledge to be discovered, or those attributes are not interrelated, then they can be ignored in analysis, and the problem becomes the traditional transaction-based association rule problem.
• Mining in Multidimensional Databases or Data Warehouses: A multidimensional database organizes data into data cubes with dimensional attributes and measures, on which multidimensional intertransactional association mining can be performed.
• Mining Spatial Association Rules: Existence of spatial objects can be viewed as events. The coordinates are dimensional attributes. Related properties can be added as other dimensional attributes.
CONCLUSION
While the classical association rules have demonstrated strong potential value, such as improving marketing strategies for the retail industry, they are limited to finding associations among items within a transaction. In this article, we propose a more general form of association rules, named multidimensional intertransactional association rules. These rules can represent not only the associations of items within transactions but also associations of items among different transactions. The classical association rule can be viewed as a special case of multidimensional intertransactional association rules.
REFERENCES
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th Conference on Very Large Data Bases, Santiago de Chile, Chile.
Braga, D. et al. (2003). Discovering interesting information in XML data with association rules. Proceedings of the 18th Symposium on Applied Computing. Cong, G., Yi, L., Liu, B., & Wang, K. (2002). Discovering frequent substructures from hierarchical semi-structured data. Proceedings of the SIAM DM’02, Arlington, Virginia. Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Upper Saddle River, NJ: Prentice Hall. Feng, L. et al. (1999). Mining multi-dimensional intertransaction association rules with templates. Proceedings of the International ACM Conference on Information and Knowledge Management, Kansas City, USA. Feng, L. et al. (2002). A template model for multidimensional inter-transactional association rules. International Journal of VLDB, 11(2), 153-175. Feng, L., Dillon, T., & Liu, J. (2001). Inter-transactional association rules for multidimensional contexts for prediction and their applications to studying meteorological data. International Journal of Data and Knowledge Engineering, 37, 85-115. Feng, L., Li, Q., & Wong, A.K.Y. (2001). Mining intertransactional association rules: Generalization and empirical evaluation. Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery, Munich, Germany. Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Diego: Morgan Kaufmann Publishers. Lu, H., Feng, L., & Han, J. (2000). Beyond intra-transaction association analysis: Mining multi-dimensional intertransaction association rules. ACM Transactions on Information Systems, 18(4), 423-454. Miyahara, T. et al. (2001). Discovery of frequent tree structured patterns in semistructured Web documents. Proceedings of the 5th PAKDD, Hong Kong, China. Termier, A., Rousset, M., & Sebag, M. (2002). Treefinder: A first step towards XML data mining. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan. Xiao, Y., Yao, J., Li, Z., & Dunham, M. (2003). Efficient data mining for maximal frequent subtrees. Proceedings of the 3rd IEEE International Conference on Data Mining, Melbourne, Florida.
Braga, D. et al. (2002). Mining association rules from XML data. Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, Aix-en-Provence, France.
KEY TERMS
Intratransactional Association: Correlations among items within the same transactions.
Dimensional Attribute: One dimension of an m-dimensional context.
Multidimensional Context: The context where database transactions, and thus the association relationships, happen.
Extended Item: An item associated with the context where it happens. Extended Transaction: A transaction associated with the context where it happens. Intertransactional Association: Correlations among items not only within the same transactions but also across different transactions.
Normalized Extended Itemset: A set of extended items whose contextual positions have been positioned with respect to the smallest reference point of the set.
Normalized Extended Transaction Set: A set of extended transactions whose contextual positions have been positioned with respect to the smallest reference point of the set.
Interval Set Representations of Clusters
Pawan Lingras, Saint Mary's University, Canada
Rui Yan, Saint Mary's University, Canada
Mofreh Hogo, Czech Technical University, Czech Republic
Chad West, IBM Canada Limited, Canada
INTRODUCTION The amount of information that is available in the new information age has made it necessary to consider various summarization techniques. Classification, clustering, and association are three important data-mining features. Association is concerned with finding the likelihood of co-occurrence of two different concepts. For example, the likelihood of a banana purchase given that a shopper has bought a cake. Classification and clustering both involve categorization of objects. Classification processes a previously known categorization of objects from a training sample so that it can be applied to other objects whose categorization is unknown. This process is called supervised learning. Clustering groups objects with similar characteristics. As opposed to classification, the grouping process in clustering is unsupervised. The actual categorization of objects, even for a sample, is unknown. Clustering is an important step in establishing object profiles. Clustering in data mining faces several additional challenges compared to conventional clustering applications (Krishnapuram, Joshi, Nasraoui, & Yi, 2001). The clusters tend to have fuzzy boundaries. There is a likelihood that an object may be a candidate for more than one cluster. Krishnapuram et al. argued that clustering operations in data mining involve modeling an unknown number of overlapping sets. This article describes fuzzy and interval set clustering as alternatives to conventional crisp clustering.
BACKGROUND
Conventional clustering assigns various objects to precisely one cluster. Figure 1 shows a possible clustering of 12 objects. In order to assign all the objects to precisely one cluster, we had to assign objects 4 and 9 to Cluster B. However, the dotted circle seems to represent a more natural representation of Cluster B. An ability to specify that Object 9 may either belong to Cluster B or Cluster C, and Object 4 may belong to Cluster A or Cluster B, will provide a better representation of the clustering of the 12 objects in Figure 1. Rough-set theory provides such an ability.
Figure 1. Conventional clustering
The notion of rough sets was proposed by Pawlak (1992). Let U denote the universe (a finite ordinary set), and let R be an equivalence (indiscernibility) relation on U. The pair A = (U, R) is called an approximation space. The equivalence relation R partitions the set U into disjoint subsets. Such a partition of the universe is denoted by U/R = {E1, E2, …, Em}, where Ei is an equivalence class of R. If two elements u, v ∈ U belong to the same equivalence class, we say that u and v are indistinguishable. The equivalence classes of R are called the elementary or atomic sets in the approximation space A = (U, R). Because it is not possible to differentiate the elements within the same equivalence class, one may not be able to obtain a precise representation for an arbitrary set X ⊆ U in terms of elementary sets in A. Instead, its lower and upper bounds may represent the set X. The lower bound A(X) is the union of all the elementary sets that are subsets of X. The upper bound A(X) is the union of all the elementary sets that have a nonempty intersection with X. The pair (A(X), A(X)) is the representation of an ordinary set X in the approximation space A = (U, R), or simply the rough set of X. The elements in the lower bound of X definitely belong to X, while elements in the upper bound of X may or may not belong to X.
Figure 2. Rough set approximation
The pair
( A( X ), A( X )) also provides a set theoretic interval for
the set X. Figure 2 illustrates the lower and upper approximation of a set.
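The lower and upper bounds defined above can be computed directly from a partition of the universe. The Python sketch below is a minimal illustration of those definitions; the function names and example data are hypothetical.

def lower_approximation(partition, X):
    # Union of all equivalence classes that are entirely contained in X.
    X = set(X)
    return set().union(*[set(E) for E in partition if set(E) <= X])

def upper_approximation(partition, X):
    # Union of all equivalence classes that intersect X.
    X = set(X)
    return set().union(*[set(E) for E in partition if set(E) & X])

# Universe {1..6} partitioned into atoms; X cuts across one atom.
partition = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}
assert lower_approximation(partition, X) == {1, 2}
assert upper_approximation(partition, X) == {1, 2, 3, 4}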
MAIN THRUST Interval Set Clustering Rough sets were originally used for supervised learning. There are an increasing number of research efforts on clustering in relation to rough-set theory (do Prado, Engel, & Filho, 2002; Hirano & Tsumoto, 2003; Peters, Skowron, Suraj, Rzasa, & Borkowski, 2002). Lingras (2001) developed rough-set representation of clusters. Figure 3 shows how the 12 objects from Figure 1 could be clustered by using rough sets. Instead of Object 9 belonging to any one cluster, it belongs to the upper bounds of Clusters B and C. Similarly, Object 4 belongs to the upper bounds of Clusters A and B. Lingras (2001; Lingras & West, 2004; Lingras, Hogo, & Snorek, 2004) proposed three different approaches for unsupervised creation of rough or interval set representations of clusters: evolutionary, statistical, and neural. Lingras (2001) described how a roughset theoretic clustering scheme could be represented by using a rough-set genome. The rough-set genome con660
sists of one gene per object. The gene for an object is a string of bits that describes which lower and upper approximations the object belongs to. The string for a gene can be partitioned into two parts, lower and upper, as shown in Figure 4 for three clusters. Both lower and upper parts of the string consist of three bits each. The ith bit in the lower/upper string tells whether the object is in the lower/upper approximation of the ith cluster. Figure 4 shows examples of all the valid genes for three clusters. An object represented by g1 belongs to the upper bounds of the first and second clusters. An object represented by g6 belongs to the lower and upper bounds of the second cluster. Any other value not given by g1 to g7 is not valid. The objective of the genetic algorithms (GAs) is to minimize the within-group-error. Lingras provided a formulation of within-group-error for roughset based clustering. The resulting GAs were used to evolve interval clustering of highway sections. Lingras (2002) applied the unsupervised rough-set clustering based on GAs for grouping Web users. However, the clustering process based on GAs seemed computationally expensive for scaling to larger datasets. The K-means algorithm is one of the most popular statistical techniques for conventional clustering (Hartigan & Wong, 1979). Lingras and West (2004) provided a theoretical and experimental analysis of a modified K-means clustering based on the properties of rough sets. It was used to create interval set representa-
Figure 3. Rough set based clustering
I
Figure 4. Rough set gene
Let d(v, xi) = min1≤j≤k d(v, xj) and T = {j : d(v, xi)/d(v, xj) ≤ threshold and i ≠ j}.
1. If T ≠ ∅, v exists in the upper bounds of cluster i and all the clusters j ∈ T.
2. Otherwise, v exists in the lower and upper bounds of cluster i. Moreover, v does not belong to any other lower or upper bounds.
It was shown (Lingras & West, 2004) that the above method of assignment preserves important rough-set properties (Pawlak, 1992; Skowron, Stepaniuk, & Peters, 2003; Szmigielski & Polkowski, 2003; Yao, 2001).
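A Python sketch of this assignment principle is given below. It follows the common reading of the ratio test: besides the nearest centroid i, every cluster whose centroid is nearly as close (within a multiplicative threshold of the nearest distance) also receives v in its upper bound, and v enters a lower bound only when no such competing cluster exists. The exact ratio and threshold convention should be taken from Lingras and West (2004); the distance function and threshold value here are only illustrative.

import math

def assign_interval(v, centroids, distance, threshold=1.4):
    d = [distance(v, x) for x in centroids]
    i = min(range(len(centroids)), key=d.__getitem__)
    # Competing clusters whose centroids are nearly as close as the nearest one.
    T = [j for j in range(len(centroids)) if j != i and d[j] <= threshold * d[i]]
    lower = set() if T else {i}
    upper = {i} | set(T)
    return lower, upper

dist = lambda p, q: math.dist(p, q)
lower, upper = assign_interval((0.0, 0.0), [(1.0, 0.0), (1.2, 0.1), (5.0, 5.0)], dist)
# The object lands in the upper bounds of the two nearby clusters and in no lower bound.
assert lower == set() and upper == {0, 1}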
Fuzzy Set Clustering
tions of clusters of Web users as well as a large set of supermarket customers. The Kohonen (1988) neural network or self-organizing map is another popular clustering technique. It is desirable in some applications due to its adaptive capabilities. Lingras et al. (2004) introduced interval set clustering, using a modification of the Kohonen self-organizing maps, based on rough-set theory. Both the modified statistical- and neural-network-based approaches used the following principle for assigning objects to lower and upper bounds of clusters. For each object, v, let d ( v, x i ) be the distance between v and the centroid of cluster i. The ratio d ( v, x i ) / d ( v, x j ) , 1 ≤ j ≤ k , is used to determine the membership of v.
A fuzzy generalization of clustering uses a fuzzy membership function to describe the degree of membership (between [0, 1]) of an object in a given cluster. The sum of fuzzy memberships of an object to all clusters must be equal to 1. Fuzzy C-means (FCMs) is a landmark algorithm in the area of fuzzy C-partition clustering. It was first proposed by Bezdek (1981). Rhee and Hwang (2001) proposed a type-2 fuzzy C-means algorithm to solve membership typicality. They point out that because the memberships generated are relative numbers, they may not be suitable for applications in which the memberships are supposed to represent typicality. Moreover, the conventional fuzzy C-means process suffers from noisy data, that is, when a noise point is located far from all the clusters, an undesirable clustering result may 661
occur. The algorithm of Rhee and Hwang is based on the fact that when updating the cluster centers, higher membership values should contribute more than memberships with smaller values. Rough-set theory and fuzzy-set theory complement each other (Shi, Shen, & Liu, 2003). It is possible to create interval clusters based on the fuzzy memberships obtained by using the fuzzy C-means algorithm described in the previous section. Let 1 ≥ α ≥ β ≥ 0 . An object v belongs in the lower bound of cluster i if its membership in the cluster is more than α . Similarly, if its membership in cluster i is greater than β , the object belongs in the
article briefly describes several approaches to creating interval and fuzzy set representations of clusters. The interval set clustering is based on the theory of rough sets. Changes in clusters can provide important clues about the changing nature of the usage of a facility as well as the changing nature of its users. Use of fuzzy and interval set clustering also adds an interesting dimension to cluster migration studies. Due to rough boundaries of interval clusters, it may be possible to get early warnings of potential significant changes in clustering patterns.
upper bound of cluster i. Because 1 ≥ α ≥ β ≥ 0 , if an object belongs to the lower bound of a cluster, it will also belong to its upper bound. Lingras and Yan (2004) describe further conditions on α and β that ensure satisfaction of other important properties of rough sets.
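A minimal sketch of this conversion, assuming the fuzzy memberships have already been computed (for example, by fuzzy C-means) and that `alpha` and `beta` are user-chosen with 1 ≥ alpha ≥ beta ≥ 0; the function name is made up for illustration:

```python
import numpy as np

def interval_clusters_from_memberships(U, alpha, beta):
    """U: membership matrix of shape (n_objects, k), rows summing to 1.

    Returns (lower, upper) boolean matrices: lower[v, i] is True when the
    membership of object v in cluster i exceeds alpha, and upper[v, i] is
    True when it exceeds beta. Since alpha >= beta, lower implies upper.
    """
    U = np.asarray(U)
    lower = U > alpha
    upper = U > beta
    return lower, upper

# Example with illustrative memberships for three objects and two clusters.
U = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
lower, upper = interval_clusters_from_memberships(U, alpha=0.7, beta=0.3)
```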
FUTURE TRENDS

Temporal data mining is the application of data-mining techniques to data that takes the time dimension into account, and it is assuming increasing importance. Most temporal data-mining tasks are related to the use and analysis of temporal sequences of raw data; there is little work that analyzes the results of data mining over a period of time. Changes in the cluster characteristics of objects, such as supermarket customers or Web users, over a period of time can be useful in data mining. Such an analysis can be useful for formulating marketing strategies. Marketing managers may want to focus on specific groups of customers; therefore, they may need to understand the migrations of customers from one group to another. The marketing strategies may depend on the desirability of these cluster migrations. The overlapping clusters created by using interval or fuzzy set clustering can be especially useful in such studies. Overlapping clusters make it possible for an object to transition from the core of one cluster to the core of another cluster through the overlapping region. When an object moves from the core of a cluster to an overlapping region, it may be possible to provide an early warning that can trigger an appropriate marketing campaign.

CONCLUSION

Clusters in data mining tend to have fuzzy and rough boundaries. An object may potentially belong to more than one cluster. Interval or fuzzy set representation enables modeling of such overlapping clusters. This article briefly describes several approaches to creating interval and fuzzy set representations of clusters. The interval set clustering is based on the theory of rough sets. Changes in clusters can provide important clues about the changing nature of the usage of a facility as well as the changing nature of its users. Use of fuzzy and interval set clustering also adds an interesting dimension to cluster migration studies. Due to the rough boundaries of interval clusters, it may be possible to get early warnings of potential significant changes in clustering patterns.

REFERENCES
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function. New York: Plenum Press. Do Prado, H. A., Engel, P. M., & Filho, H. C. (2002). Rough clustering: An alternative to finding meaningful clusters by using the reducts from a dataset. In J. Alpigini, J. F. Peters, A. Skowron, & N. Zhong, (Eds.), Proceedings of the Symposium on Rough Sets and Current Trends in Computing: Vol. 2475. Springer-Verlag. Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28, 100-108. Hirano, S., & Tsumoto, S. (2003). Dealing with relatively proximity by rough clustering. Proceedings of the 22nd International Conference of the North American Fuzzy Information Processing Society (pp. 260-265). Kohonen, T. (1988). Self-organization and associative memory. Berlin: Springer-Verlag. Krishnapuram, R., Joshi, A., Nasraoui, O., & Yi, L. (2001). Low-complexity fuzzy relational clustering algorithms for Web mining. IEEE Transactions on Fuzzy Systems, 9(4), 595-607. Lingras, P. (2001). Unsupervised rough set classification using GAs. Journal of Intelligent Information Systems, 16(3), 215-228. Lingras, P. (2002). Rough set clustering for Web mining. Proceedings of the IEEE International Conference on Fuzzy Systems. Lingras, P., Hogo, M., & Snorek, M. (2004). Interval set clustering of Web users using modified Kohonen selforganizing maps based on the properties of rough sets. Web Intelligence and Agent Systems: An International Journal.
Lingras, P., & West, C. (2004). Interval set clustering of Web users with rough k-means. Journal of Intelligent Information Systems, 23(1), 5-16. Lingras, P., & Yan, R. (2004). Interval clustering using fuzzy and rough set theory. Proceedings of the 23rd International Conference of the North American Fuzzy Information Processing Society (pp. 780-784), Canada. Pawlak, Z. (1992). Rough sets: Theoretical aspects of reasoning about data. New York: Kluwer Academic. Peters, J. F., Skowron, A., Suraj, Z., Rzasa, W., & Borkowski, M. (2002). Clustering: A rough set approach to constructing information granules, soft computing and distributed processing. Proceedings of the Sixth International Conference (pp. 57-61). Rhee, F. C. H., & Hwang, C. (2001). A type-2 fuzzy cmeans clustering algorithm. Proceedings of the IFSA World Congress and the 20th International Conference of the North American Fuzzy Information Processing Society (Vol. 4, pp. 1926-1929). Shi, H., Shen, Y., & Liu, Z. (2003). Hyperspectral bands reduction based on rough sets and fuzzy C-means clustering. Proceedings of the 20th IEEE Instrumentation and Measurement Technology Conference (Vol. 2, pp. 1053-1056). Skowron, A., Stepaniuk, J., & Peters, J. F. (2003). Rough sets and infomorphisms: Towards approximation of relations in distributed environments. Fundamental Informatica, 54(2-3), 263-277. Szmigielski, A., & Polkowski, L. (2003). Computing from words via rough mereology in mobile robot navigation. Proceedings of the IEE/RSJ International Conference on Intelligent Robots and Systems (Vol. 4, pp. 3498-3503). Yao, Y. Y. (2001). Information granulation and rough set approximation. International Journal of Intelligent Systems, 16(1), 87-104.
KEY TERMS Clustering: A form of unsupervised learning that divides a data set so that records with similar content are in the same group and groups are as different from each other as possible. Evolutionary Computation: A solution approach guided by biological evolution that begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data. Fuzzy C-Means Algorithms: Clustering algorithms that assign a fuzzy membership in various clusters to an object instead of assigning the object precisely to a cluster. Fuzzy Membership: Instead of specifying whether an object precisely belongs to a set, fuzzy membership specifies a degree of membership between [0,1]. Interval Set Clustering Algorithms: Clustering algorithms that assign objects to lower and upper bounds of a cluster, making it possible for an object to belong to more than one cluster. Interval Sets: If a set cannot be precisely defined, one can describe it in terms of a lower bound and an upper bound. The set will contain its lower bound and will be contained in its upper bound. Rough Sets: Rough sets are special types of interval sets created by using equivalence relations. Self-Organization: A system structure that often appears without explicit pressure or involvement from outside the system. Supervised Learning: A learning process in which the exemplar set consists of pairs of inputs and desired outputs. The process learns to produce the desired outputs from the given inputs. Temporal Data Mining: An application of datamining techniques to the data that takes the time dimension into account. Unsupervised Learning: Learning in the absence of external information on outputs.
Kernel Methods in Chemoinformatics Huma Lodhi Imperial College London, UK
INTRODUCTION Millions of people are suffering from fatal diseases such as cancer, AIDS, and many other bacterial and viral illnesses. The key issue is now how to design lifesaving and cost-effective drugs so that the diseases can be cured and prevented. It would also enable the provision of medicines in developing countries, where approximately 80% of the world population lives. Drug design is a discipline of extreme importance in chemoinformatics. Structure-activity relationship (SAR) and quantitative SAR (QSAR) are key drug discovery tasks. During recent years great interest has been shown in kernel methods (KMs) that give state-of-the-art performance. The support vector machine (SVM) (Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) is a wellknown example. The building block of these methods is an entity known as the kernel. The nondependence of KMs on the dimensionality of the feature space and the flexibility of using any kernel function make them an optimal choice for different tasks, especially modeling SAR relationships and predicting biological activity or toxicity of compounds. KMs have been successfully applied for classification and regression in pharmaceutical data analysis and drug design.
BACKGROUND Recently, I have seen a swift increase in the interest and development of data mining and learning methods for problems in chemoinformatics. There exist a number of challenges for these techniques for chemometric problems. High dimensionality is one of them. There are datasets with a large number of dimensions and a few data points. Label and feature noise are other important problems. Despite these challenges, learning methods are a common choice for applications in the domain of chemistry. This is due to automated building of predictors that have strong theoretical foundations. Learning techniques, including neural networks, decision trees, inductive logic programming, and kernel methods such as SVMs, kernel principal component analysis, and kernel partial least square, have been applied to chemometric problems with great success. Among these methods,
KMs are new to such tasks. They have been applied for applications in chemoinformatics since the late 1990s. KMs and their utility for applications in chemoinformatics are the focus of the research presented in this article. These methods possess special characteristics that make them very attractive for tasks such as the induction of SAR/QSAR. KMs such as SVMs map the data into some higher dimensional feature space and train a linear predictor in this higher dimensional space. The kernel trick offers an effective way to construct such a predictor by providing an efficient method of computing the inner product between mapped instances in the feature space. One does not need to represent the instances explicitly in the feature space. The kernel function computes the inner product by implicitly mapping the instances to the feature space. These methods can handle very high-dimensional noisy data and can avoid overfitting. SVMs suffer from a drawback that is the difficulty of interpretation of the models for nonlinear kernel functions.
MAIN THRUST I now present basic principles for the construction of SVMs and also explore empirical findings.
Support Vector Machines and Kernels

The support vector machine was proposed in 1992 (Boser, Guyon, & Vapnik, 1992). A detailed analysis of SVMs can be found in Vapnik (1995) and Cristianini and Shawe-Taylor (2000). An SVM works by embedding the input data, d_1, …, d_n, into a Hilbert space through a nonlinear mapping, φ, and constructing a linear function in this space. The mapping φ may not be known explicitly but can be accessed via the kernel function described in the later section on kernels. The kernel function returns the inner product between the mapped instances d_i and d_j in a higher dimensional space; that is, for any mapping φ : D → F, k(d_i, d_j) = ⟨φ(d_i), φ(d_j)⟩. I now briefly describe SVMs for classification, regression, and kernel functions.
Support Vector Classification

The support vector machine for classification (SVC) (Vapnik, 1995) is based on the idea of constructing the maximal margin hyperplane in feature space. This unique hyperplane separates the data into two categories with maximum margin (hard margin). The maximal margin hyperplane fails to generalize well when there is a high level of noise in the data. In order to handle noise, margin errors are allowed, hence achieving a soft margin instead of a hard margin (no margin errors). The generalization performance is improved by maintaining the right balance between margin maximization and error. To find a hyperplane, one has to solve a convex quadratic optimization problem. The support vector classifier is constructed by using only the inner products between the mapped instances. The classification function for a new example d is given by

f(d) = sgn( Σ_{i=1}^{n} α_i c_i k(d_i, d) + b ),

where α_i are Lagrange multipliers, c_i ∈ {−1, +1} are categories, and b ∈ R.
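The classification rule can be mirrored directly in code. The sketch below assumes the Lagrange multipliers, labels, support vectors, and bias have already been obtained from a trained SVC; the function and argument names are hypothetical:

```python
import numpy as np

def svc_predict(d, support_vectors, alphas, labels, b, kernel):
    """Evaluate f(d) = sgn( sum_i alpha_i * c_i * k(d_i, d) + b )."""
    s = sum(a * c * kernel(d_i, d)
            for a, c, d_i in zip(alphas, labels, support_vectors))
    return int(np.sign(s + b))
```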
Support Vector Regression Support vector machines for regression (SVR) (Vapnik, 1995) inherit all the main properties that characterize SVC. SVR embeds the input data into a Hilbert space through a nonlinear mapping φ and constructs a linear regression function in this space. In order to apply support vector technique to regression, a reasonable loss function is used. Vapnik’s ε − insensitive loss function is a popular choice that is defined by
|c − f(d)|_ε = max(0, |c − f(d)| − ε), where c ∈ R. This loss function allows errors below some ε > 0 and controls the width of the insensitive band. Regression estimation is performed by solving an optimization problem. The corresponding regression function f is given by

f(d) = Σ_{i=1}^{n} (α̃_i − α_i) k(d_i, d) + b,

where α̃_i, α_i are Lagrange multipliers and b ∈ R.
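For completeness, the ε-insensitive loss and the corresponding prediction rule look like this in code; again a sketch, with the coefficients assumed to come from a trained SVR:

```python
def eps_insensitive_loss(c, f_d, eps):
    """Vapnik's loss: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(c - f_d) - eps)

def svr_predict(d, train_points, alphas_tilde, alphas, b, kernel):
    """f(d) = sum_i (alpha_tilde_i - alpha_i) * k(d_i, d) + b."""
    return sum((at - a) * kernel(d_i, d)
               for at, a, d_i in zip(alphas_tilde, alphas, train_points)) + b
```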
Kernels

A kernel function calculates an inner product between mapped instances in a feature space and allows implicit feature extraction. The mathematical foundation of such a function was established during the first decade of the 20th century (Mercer, 1909). A kernel function is a symmetric function, k(d_i, d_j) = k(d_j, d_i) for all d_i, d_j, that satisfies positive (semi)definiteness:

Σ_{i,j=1}^{n} a_i a_j k(d_i, d_j) ≥ 0 for all a_i, a_j ∈ R.

The n × n matrix with entries of the form K_ij = k(d_i, d_j) is known as the kernel matrix, or the Gram matrix; it is a symmetric positive (semi)definite matrix. Linear, polynomial, and Gaussian Radial Basis Function (RBF) kernels are well-known examples of general-purpose kernel functions. The linear kernel is given by k_linear(d_i, d_j) = d_i′ d_j. Given a kernel k, the polynomial construction is given by

k_poly(d_i, d_j) = (k(d_i, d_j) + c)^p,

where p is a positive integer and c is a nonnegative constant. Clearly, this incurs only a small computational cost to define a new feature space. The feature space corresponding to a degree-p polynomial kernel includes all products of at most p input features. Note that for p = 1 and c = 0, I recover the linear construction. Furthermore, the RBF kernel defines a feature space with an infinite number of dimensions. Given a set of instances, the RBF kernel is given by

K_RBF(d_i, d_j) = exp( −‖d_i − d_j‖² / (2σ²) ).
New kernels can be designed by keeping in mind that kernel functions are closed under addition and multiplication. Kernel functions can be defined over general sets (Watkins, 2000; Haussler, 1999). This important fact has allowed successful exploration of novel kernels for discrete spaces such as strings, graphs, and trees (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002; Kashima, Tsuda, & Inokuchi, 2003; Mahe, Ueda, Akutsu, Perret, & Vert, 2004).
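A small sketch of these general-purpose kernels, the resulting Gram matrix, and the closure properties just mentioned; the helper names and parameter defaults are made up for illustration:

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def poly_kernel(x, y, c=1.0, p=3):
    return (linear_kernel(x, y) + c) ** p

def rbf_kernel(x, y, sigma=1.0):
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): symmetric positive (semi)definite by construction."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Closure under addition and multiplication: these combinations are again valid kernels.
def sum_kernel(x, y):
    return rbf_kernel(x, y) + poly_kernel(x, y)

def product_kernel(x, y):
    return rbf_kernel(x, y) * linear_kernel(x, y)
```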
Applications Given that I have presented basic concepts of SVMs, I now describe the applications of SVMs in chemical domains. SAR/QSAR analysis plays a crucial role in the design and development of drugs. It is based on the assumption that the chemical structure and activity of compounds are correlated. The aim of mining molecular databases is to select a set of important compounds, hence forming a small collection of useful molecules. The prediction of a new compound with low error probability is an important factor, as the false prediction can be costly
and result in a loss of information. Furthermore, accurate prediction can speed up the drug design process. In order to apply learning techniques such as KMs to chemometric data, the compounds are transformed into a form amenable to these techniques. Modeling SAR/QSAR analysis can be viewed as comprising two stages. In the first stage, descriptors (features) are extracted or computed, and molecules are transformed into vectors. The dimensionality of the vector space can be very high, where, generally, molecular datasets comprise several tens to hundred compounds. In the second stage, an induction algorithm (SVC or SVR) is applied to learn an SAR/QSAR model. The similarity between two compounds is measured by the inner product between two vectors. The similarity between compounds is inversely proportional to the cosine of the angle between the vectors. It is worth noting that kernel methods can be applied not only for the induction of model but also for the feature extraction through specialized kernels. I now describe the application of kernel methods to inducing structure activity relationship models. I first focus on the classification task for chemometric data. Trotter, Buxton, and Holden (2001) studied the efficacy of an SVC to separating the compounds that will cross the blood/brain barrier from those that will not cross the barrier. SVC substantially improved other machine-learning techniques, including neural networks and decision trees. In another study, a classification problem has been formulated in order to predict the reduction of dihydrofolate reductase by pyrimidines (Burbidge, Trotter, Holden, & Buxton, 2001). An SVC in conjunction with an RBF kernel is used to conduct experiments. Experimental results show that SVC outperforms the other classification techniques, including artificial neural networks, C5.0 decision tree, and nearest neighbor. Structural feature extraction is an important problem in chemoinformatics. Structural feature extraction refers to the problem of computing the features that describe the chemical structure. In order to compute structural features, Kramer, Frank, and Helma (2000) viewed compounds as labeled graphs where vertices depict atoms and edges describe bonds between them. The method performs automated construction of structural features of two-dimensional represented compounds. Features are constructed by retrieving sequences of linearly connected atoms and bonds on the basis of some statistical criterion. The authors argue that SVMs’ ability to handle high-dimensional data makes them attractive in this scenario, as the dimensionality of the feature space may be very large. The authors applied an SVC to predict mutageneicity and carcinogenicity of compounds with promising results. In another study, an SVC in conjunction with structural features is applied to model mutageneicity structure activity relationships from 666
noncongeneric datasets (Helma, Cramer, Kramer, & De Raedt, in press). The performance of SVC is compared with decision trees C4.5 and a rule learner. SVC and rule learner showed excellent results. In chemoinformatics, extracted features play a key role in SAR/QSAR analysis. The feature extraction module is constructed in such a way so as the loss of information is at a minimum. This can make feature extraction tasks as complex and expensive as solving the entire problem. Kernel methods are an effective alternative to explicit feature extraction. Kernels that transform the compounds into feature vectors without explicitly representing them can be constructed. This can be achieved by viewing compounds as graphs (Kashima, Tsuda, & Inokuchi, 2003; Mahe, Ueda, Akutsu, Perret, & Vert, 2004). In this way, kernel methods can be utilized to generate or extract feature for SAR/QSAR analysis. Kashima et al. proposed a kernel function that computes the similarity between compounds (labeled graphs). The compounds are implicitly transformed into feature vectors, where each entry of feature vector represents the number of label paths. Label paths are generated by random walks on graphs. In order to perform binary classification of compounds, voted kernel perceptron (Freund & Shapire, 1999) is employed with promising results. Mahe, et al (2004) have improved the graph kernels in terms of efficiency (computational time) and classification accuracy. An SVC in conjunction with graph kernels is used to predict mutageneicity of aromatic and hetroaromatics nitro compounds. SVC in conjunction with graph kernels validates the efficacy of KMs for feature extraction and induction of SAR models. I now focus on the regression task for chemometric data. In a chemical domain, datasets are generally characterized as having high dimensionality and few data points. Partial least square (PLS) (a regression method) is very useful for such scenarios as compared to linear least square regression. Demiriz, Bemmett, Breneman, and Embrechts (2001) have applied an SVR for QSAR analysis. The authors performed feature selection by removing high-variance variables and induced a model by using a linear programming SVR. The experimental results show that SVR in conjunction with RBF kernel outperforms PLS for chemometric datasets. PLS’s ability of handling high-dimensional data makes it an optimal choice to combine it with KMs. There exist two techniques to kernelize PLS (Bennett & Embrechts, 2003; Rosipal, Trejo, Matthews, & Wheeler, 2003). One technique is based on mapping the data into a higher dimensional space and constructing a linear regression function in the space (Rosipal, Trejo, Matthews, & Wheeler, 2003). Alternatively, PLS is kernelized by obtaining a low-rank approximation of kernel matrix and computing a regression function based
on this approximation (Bennett & Embrechts). Kernel partial least square is used to predict the binding affinities of molecules to human serum albumin. Experimental results show that kernelized versions of PLS outperform other methods, including PLS and SVR (Bennett & Embrechts). Different principal component techniques, including power method, singular value decomposition, and eigen-value decomposition, have also been kernelized to show the efficiency of KMs for datasets comprising few points and a large number of variables (Wu, Massarat, & de Jong, 1997). I conclude this section with a case study.
Gram-Schmidt Kernels for Prediction of Activity of Compounds As an example, I now present the results of a novel approach for modeling structure-activity relationships (Lodhi & Guo, 2002) based on the use of a special kernel namely Gram-Schmidt kernel (GSK) (Cristianini, Shawe-Taylor, & Lodhi, 2002). The kernel efficiently performs Gram-Schmidt orthogonalisation in a kernelinduced feature space. It is based on the idea of building a more informative kernel matrix, as compared to Gram matrices, which are constructed by standard kernels. An SVC in conjunction with GSK is used to perform QSAR analysis. The learning process of an SVC in conjunction with GSK comprises two stages. In the first stage, highly informative features are extracted in a kernel-induced feature space. In the next stage, a soft margin classifier is trained. In order to perform the analysis, the GSK algorithm requires a set of compounds — an underlying kernel function and the number T , which specifies the dimensionality of the feature space. For the underlying kernel function, an RBF kernel is employed. GSK in conjunction with SVC is applied on a benchmark dataset (King, Muggleton, Lewis, & Sternberg, 1992) to predict the inhibition of dihydrofolate reductase by pyrimidines. The dataset contains 55 compounds that are divided into 5-fold cross-validation series. For each drug there are three positions of possible substitution, and the number of attributes for each substitution is nine. The experimental results show that SVC in conjunction with GSK achieves lower classification error than the best reported results (Burbidge, Trotter, Holden, & Buxton, 2001). Burbidge et al. performed SAR analysis on this dataset by using a number of learning techniques with SVC in conjunction with RBF kernels, achieving a classification error of 0.1269. An SVC in conjunction with GSK improved these results, achieving a classification error of 0.1120 (Lodhi & Guo, 2002).
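The two-stage idea of combining an SVC with a specially constructed kernel matrix can be sketched, in spirit only, with a precomputed Gram matrix. This is not the Gram-Schmidt kernel of Cristianini, Shawe-Taylor, and Lodhi (2002), just an illustration of plugging a custom kernel matrix into a support vector classifier; scikit-learn is assumed to be available, and the data are random placeholders standing in for compound descriptors and activities:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix between row sets A and B under an RBF kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 9)), rng.integers(0, 2, 40)
X_test = rng.normal(size=(10, 9))

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_gram(X_train, X_train), y_train)          # train on K(train, train)
predictions = clf.predict(rbf_gram(X_test, X_train))  # predict with K(test, train)
```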
FUTURE TRENDS Researchers using kernel methods have handled a principal area in chemoinformatics; however, there exist other important application areas, such as structure elucidation, that require the utilization of such methods. There is a need to develop new methods for challenging tasks that are difficult to solve from existing methods. Interest in and development of KMs for applications in chemoinformatics are swiftly increasing, and I believe they will lead to progress in both fields.
CONCLUSION Kernel methods have strong theoretical foundations. These methods combine the principles of statistical learning theory and functional analysis. SVMs and other kernel methods have been applied in different domains, including text mining, face recognition, protein homology detection, analyses and classification of gene expression data, and many others with great success. They have shown impressive performance in chemoinformatics. I believe that the growing popularity of machine-learning techniques, especially kernel methods in chemoinformatics, will lead to significant development in both disciplines.
REFERENCES Bennett, K. P., & Embrechts, M. J. (2003). Advances in learning theory: Methods, models and applications. Nato Science Series III: Computer & Systems Science, 190, 227-250. Boser, B. E., Guyon, I. M., & Vapnik, V. (1992). A training algorithm for optimal margin classifier. Proceedings of the Fifth Annual ACM workshop on Computational Learning Theory (pp. 144-152). Burbidge, R., Trotter, M., Holden, S., & Buxton, B. (2001). Drug design by machine learning: Support vector machines for pharmaceutical data. Computers and Chemistry, 26(1), 4-15. Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge, MA: Cambridge University Press. Cristianini, N., Shawe-Taylor, J., & Lodhi, H. (2002). Latent semantic kernels. Journal of Intelligent Information Systems, 18(2/3), 127-152.
Demiriz, A., Bemmett, K. P., Breneman, C. M., & Embrechts, M. J. (2001). Support vector machine regression in chemometrics. Computing Science and Statistics. Freund, Y., & Shapire, R. (1999). Large margin classification using perceptron algorithm. Machine Learning, 37(3), 277-296. Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz: University of California, Computer Science Department. Helma, C., Cramer, T., Kramer, S., & De Raedt, L. (in press). Data mining and machine learning techniques for identification of mutageneicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Systems. Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. Proceedings of the 20th International Conference on Machine Learning. King, R. D., Muggleton, S., Lewis, R. A., & Sternberg, M. J. E. (1992). Drug design by machine learning: The use of inductive logic programming to model the structure activity relationships of timethoprim analogues binding to dihydrofolate reeducates. Proceedings of the National Academy of Sciences, USA, 89 (pp. 1132211326). Kramer, S., Frank, E., & Helma, C. (2002). Fragment generation and support vector machines for inducing SARs. SAR and QSAR in Environmental Research, 13(5), 509-523. Lodhi, H., & Guo, Y. (2002). Gram-Schmidt kernels applied to structure activity analysis for drug design. Proceedings of the Second ACM SIGKDD Workshop on Data Mining in Bioinformatics (pp. 37-42). Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419-444. Mahe, P., Ueda, N., Akutsu, T., Perret, J.-L., & Vert, J.P. (2004). Extension of marginalized graph kernels. Proceedings of the 21st International Conference on Machine Learning. Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society London (A), 209, 415-446.
Rosipal, R., Trejo, L., Matthews, B., & Wheeler, K. (2003). Nonlinear kernel-based chemometric tools: A machine learning approach. Proceedings of the Third International Symposium on PLS and Related Methods (pp. 249260). Trotter, M., Buxton, B., & Holden, S. (2001). Support vector machines in combinatorial chemistry. Measurement and Control, 34(8), 235-239. Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag. Watkins, C. (2000). Dynamic alignment kernels. In P. J. Bartlett, B. Schlkopf, D. Schuurmans, & A. J. Smola, Advances in large-margin classifiers (pp. 39-50). Cambridge, MA: MIT Press. Wu, W., Massarat, D. L., & de Jong, S. (1997). The kernel PCA algorithm for wide data. Part I: Theory and algorithms. Chemometrics and Intelligent Laboratory Systems, 36, 165-172.
KEY TERMS Chemoinformatics: Storage, analysis, and drawing inferences from chemical information (obtained from chemical data) by using computational methods for drug discovery. Kernel Function: A function that computes the inner product between mapped instances in a feature space. It is a symmetric, positive definite function. Kernel Matrix: A matrix that contains almost all the information required by kernel methods. It is obtained by computing the inner product between n instances. Machine Learning: A discipline that comprises the study of how machines learn from experience. Margin: A real-valued function. The sign and magnitude of the margin give insight into the prediction of an instance. Positive margin indicates correct prediction, whereas negative margin shows incorrect prediction. Quantitative Structure-Activity Relationship (QSAR): Illustrates quantitative relationships between chemical structures and the biological and pharmacological activity of chemical compounds. Support Vector Machines (SVMs): SVMs (implicitly) map input examples into a higher dimensional feature space via a kernel function and construct a linear function in this space.
Knowledge Discovery with Artificial Neural Networks Juan R. Rabuñal Dopico University of A Coruña, Spain Daniel Rivero Cebrián University of A Coruña, Spain Julián Dorado de la Calle University of A Coruña, Spain Nieves Pedreira Souto University of A Coruña, Spain
INTRODUCTION The world of Data Mining (Cios, Pedrycz & Swiniarrski, 1998) is in constant expansion. New information is obtained from databases thanks to a wide range of techniques, which are all applicable to a determined set of domains and count with a series of advantages and inconveniences. The Artificial Neural Networks (ANNs) technique (Haykin, 1999; McCulloch & Pitts, 1943; Orchad, 1993) allows us to resolve complex problems in many disciplines (classification, clustering, regression, etc.), and presents a series of advantages that convert it into a very powerful technique that is easily adapted to any environment. The main inconvenience of ANNs, however, is that they can not explain what they learn and what reasoning was followed to obtain the outputs. This implies that they can not be used in many environments in which this reasoning is essential. This article presents a hybrid technique that not only benefits from the advantages of ANNs in the data-mining field, but also counteracts their inconveniences by using other knowledge extraction techniques. Firstly we extract the requested information by applying an ANN, then we apply other Data Mining techniques to the ANN in order to explain the information that is contained inside the network. We thus obtain a two-levelled system that offers the advantages of the ANNs and compensates for its shortcomings with other Data Mining techniques.
BACKGROUND

Ever since artificial intelligence appeared, ANNs have been widely studied. Their generalisation capacity and their inductive learning convert them into a very robust technique that can be used in almost any domain. An ANN is an information processing technique that is inspired by neuronal biology and consists of a large number of interconnected computational units (neurons), usually arranged in different layers. When an input vector is presented to the input neurons, this vector is propagated and processed in the network until it becomes an output vector in the output neurons. Figure 1 shows an example of a neural network with 3 inputs, 2 outputs, and three layers with 3, 4, and 2 neurons, respectively.

Figure 1. Example of Artificial Neural Network

The ANNs have proven to be a very powerful tool in a lot of applications, but they present a big problem: their reasoning process cannot be explained; that is, there is no clear relationship between the inputs that are presented to the network and the outputs it produces. This means that ANNs cannot be used in certain domains, even though several approaches and attempts to explain their behaviour have tried to solve this problem. The reasoning of ANNs, and the rule extraction that explains their functioning, have been addressed in various ways. One of the first attempts established an equivalence between ANNs and fuzzy rules (Benítez, Castro, & Requena, 1997; Buckley, Hayashi, & Czogala, 1993; Jang & Sun, 1992), obtaining only theoretical solutions. Other works were based on the individual analysis of each
neuron in the network, and its connection with the consecutive neurons. Towell and Shavlik (1994) in particular see the connections between neurons as rules, and Andrews and Geva (1994) uses networks with functions that allow a clear identification of the dominant inputs. Other approaches are the RULENEG (Rule-Extraction from Neural Networks by Step-wise Negation) (Pop, Hayward, & Diederich, 1994) and TREPAN (Craven, 1996) algorithms. The first approach however modifies the training set and therefore loses the generalisation capacity of the ANNs. The TREPAN approach is similar to decision tree algorithms such as CART (Classification and Regression Trees) or C4.5, which turns the ANN into a MofN (Mof-N) decision tree. The DEDEC (Decision Detection) algorithm (Tickle, Andrews, Golea, & Tickle, 1998) extracts rules by finding minimal information sufficient to distinguish, from the neural network point of view, between a given pattern and all other patterns. The DEDEC algorithm uses the trained ANN to create examples from which rules can be extracted. Unlike other approaches, it also uses the weight vectors of the network to obtain an additional analysis that improves the extraction of rules. This information is then used to direct the strategy for generating a (minimal) set of examples for the learning phase. It also uses an efficient algorithm for the rule extraction phase. Based on these and other already mentioned techniques, Chalup, Hayward and Diedrich (1998); Visser, Tickle, Hayward and Andrews (1996); and Tickle, Andrews, Golea and Tickle (1998) also presented their solutions. The methods for the extraction of logical rules, developed by Duch, Adamczak and Grabczewski (2001), are based on multilayer perceptron networks with the MLP2LN (Multi-Layer Perceptron converted to Logical Network) method and its constructive version C-MLP2LN. MLP2LN consists in taking a multilayer and already trained perceptron and simplify it in order to obtain a network with weights 0, +1 or -1. C-MLP2LN acts in a similar way. After this process, the dominant rules are easily extracted, and the weights of the input layer allow us to deduct which parameters are relevant. More recently, genetic algorithms (GAs) have been used to discover rules in ANNs. Keedwell, Narayanan and Savic (2000) use a GA in which the chromosomes are rules based on value intervals or ranges applied to the inputs of the ANN. The values are obtained from the training patterns. The most recent works in rules extraction from ANNs are presented by Rivero, Rabuñal, Dorado, Pazos and Pedreira (2004) and Rabuñal, Dorado, Pazos and Rivero (2003). They extract rules by applying a symbolic regression system, based on Genetic Programming (GP) (Engelbrecht, Rouwhorst & Schoeman, 2001; Koza, Keane,
Streeter, Mydlowec, Yu, & Lanza, 2003; Wong & Leung, 2000), to a set of inputs / outputs produced by the ANN. The set of network inputs / produced outputs is dynamically modified, as explained in this paper.
ARCHITECTURE This article presents an architecture in two levels for the extraction of knowledge from databases. In a first level, we apply an ANN as Data Mining technique; in the second level, we apply a knowledge extraction technique to this network.
Data Mining with ANNs

Artificial Neural Networks constitute a Data Mining technique that has been widely used as a technique for the extraction of knowledge from databases. Their training process is based on examples, and presents several advantages that other models do not offer:

• A high generalisation level. Once ANNs are trained with a training set, they produce outputs (close to desired or supposed outputs) for inputs that were never presented to them before.
• A high error tolerance. Since ANNs are based on the successive and parallel interconnection between many processing elements (neurons), the output of the system is not significantly affected if one of them fails.
• A high noise tolerance.
All these advantages turn ANNs into the ideal technique for the extraction of knowledge in almost any domain. They are trained with many different training algorithms. The most famous one is the backpropagation algorithm (Rumelhart, Hinton & Williams, 1986), but many other training algorithms are applied according to the topology of the network and the use that is given to it. In the course of recent years, Evolutionary Computation techniques such as Genetic Algorithms (Holland, 75) (Goldberg, 89) (Rabuñal, Dorado, Pazos, Gestal, Rivero & Pedreira, 2004a; Rabuñal, Dorado, Pazos, Pereira & Rivero, 2004b) are gaining ground, because they correct the defects of other training algorithms, such as the tendency to generate local minimums or to overtrain the network. Even so, and in spite of these algorithms that train the network automatically (and even search for the topology of the network), ANNs present a series of defects that make them useless in many application fields. As we already said, their main defect is the fact that in general they are not interpretable: once an input is applied to the
network and an output obtained, we cannot follow the steps of the reasoning that was used by the network to produce this output. This is why ANNs are not applicable in those domains in which we need to know why an output is the result of a certain input. The best example of this type of domain is the medical field: if we use an ANN to obtain the diagnosis for a certain disease, we need to know why that specific diagnosis or output value is produced. The purpose of this article is to correct this defect through the use of another knowledge extraction technique. This technique is applied to the ANN in order to extract its internal knowledge and express it in terms that are proper of the used technique (i.e. decision trees, semantic networks, IF-THEN-ELSE rules, etc.).
Extracting Knowledge from ANNs The second level of the architecture is based on the application of a knowledge extraction technique to the ANN that was used for Data Mining. This technique depends on how we wish to express the internal knowledge of the network: if we want decision trees, we could use the CART algorithm, if we want rules, we could use Genetic Programming, etc. In general, the second knowledge extraction technique is applied to the network to explain its behaviour when it produces a set of outputs from a set of inputs, without considering its internal functioning. This means that the network is treated as a “black box”: we obtain information on its behaviour by generating a set of inputs / obtained outputs to which we apply the knowledge extraction technique. Obtaining this set is complicated, because it has to be so representative that we can extract from it all the knowledge that a network really contains, not just part of the knowledge. A first option could be to use the inputs of the training set that was used to create the network, and the outputs generated from this inputs by the network. This method supposes that the training set is already sufficiently ample and representative of the search space in which the network will be applied. However, our aim is not to explain the behaviour of the network only in those patterns in which it was trained or in patterns that are close (i.e., the search area in which the network was trained); we want to explain the functioning of the network as a whole, including in areas in which it was not trained. We discard the option of elaborating an exhaustive inputs set by taking all the combinations of all the possible values in each network input with small intervals, because the combinatorial explosion would make this set too large. For instance, for a network with 5 inputs that each takes values between 0 and 1, and with intervals of 0.1, we would obtain
((1 − 0) / 0.1 + 1)^5 = 11^5 = 161,051
patterns, an amount that is excessive for any Data Mining tool. The problem of choosing the input set that will be applied to the network is solved with a hybrid solution: the inputs of the training set are multiplied by duplicating some elements of the set with certain modifications. These inputs are presented to the network so that for each of them an output vector is obtained, and as such we shape a set of patterns that is applied to the selected algorithm for the extraction of knowledge. It is important to note that the duplicated inputs were modified so that the knowledge extraction algorithm already starts to explore the areas that lie beyond those that were used to train the network. These first duplicated inputs are nevertheless insufficient to explore these areas, which is why the knowledge extraction algorithm is controlled by a system for the generation of new patterns. This new patterns generation system takes those patterns whose adjustment is sufficiently good in the knowledge extraction algorithm, replaces them with new patterns that are created from existing ones (either those that will be replaced, or others from the training set), and applies to them modifications that create patterns in areas that were not explored before. When an area is already explored by the algorithm and is already represented (by trees, rules, etc.), it is deleted and the system continues exploring new areas. The whole process is illustrated in Figure 2. Figure 2. ANN knowledge extraction process Pattern Generation System
In this way, we create a closed system in which the patterns set is constantly updated and the search continues for new areas that are not yet covered by the knowledge that we have of the network. A detailed description of the method and of its application to concrete problems can be found in Rabuñal, Dorado, Pazos and Rivero (2003); Rabuñal, Dorado, Pazos, Gestal, Rivero and Pedreira (2004a); and Rabuñal, Dorado, Pazos, Pereira & Rivero (2004b).
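A highly simplified pseudocode sketch of this closed loop follows. The function names are placeholders, and several details (how patterns are perturbed, how a "good adjustment" is measured) are left abstract, since they depend on the concrete method described in the papers cited above:

```python
def extract_knowledge(ann, train_inputs, rule_learner, perturb, is_covered,
                      n_iterations=10):
    """Sketch of the closed loop described above, treating the ANN as a black box.

    ann: callable mapping an input vector to the network's output.
    rule_learner: callable that induces rules from (input, output) pairs.
    perturb: callable producing a modified copy of a pattern (to explore new areas).
    is_covered: callable saying whether the current rules already explain a pattern well.
    """
    patterns = list(train_inputs) + [perturb(x) for x in train_inputs]
    rules = None
    for _ in range(n_iterations):
        labelled = [(x, ann(x)) for x in patterns]        # query the ANN
        rules = rule_learner(labelled)                    # extract rules from its behaviour
        kept = [x for x, y in labelled if not is_covered(rules, x, y)]
        replacements = [perturb(x) for x, y in labelled if is_covered(rules, x, y)]
        patterns = kept + replacements                    # keep exploring uncovered areas
    return rules
```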
the Australian Conference on Neural Networks, Brisbane, Queensland (pp. 9-12).
FUTURE TRENDS
Chalup, S., Hayward, R., & Diedrich, J. (1998). Rule extraction from artificial neural networks trained on elementary number classification task. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report.
The proposed architecture is based on the application of a knowledge extraction technique to an ANN, whereas the technique that will be used depends on the type of knowledge that we wish to obtain. Since there is a great variety of techniques and algorithms that generate information of the same kind (IF-THEN rules, trees, etc.), we need to study them and carry out experiments to test their functioning in the proposed system, and in particular their adequacy for the new patterns generation system. Also, this system for the generation of new patterns involves a large number of parameters, such as the percentage of patterns change, the number of new patterns, or the maximum error with which we consider that a pattern is represented by a rule. Since there are so many parameters, we need to study not only each parameter separately but also the influence of all the parameters on the final result.
CONCLUSION This article proposes a system architecture that makes good use of the advantages of the ANNs, such as Data Mining, and avoids its inconveniences. On a first level, we apply an ANN to extract and model a set of data. The resulting model offers all the advantages of ANNs, such as noise tolerance and generalisation capacity. On a second level, we apply another knowledge extraction technique to the ANN, and thus obtain the knowledge of the ANN, which is the generalisation of the knowledge that was used for its learning; this knowledge is expressed in the shape that is decided by the user. It is obvious that the union of various techniques in a hybrid system conveys a series of advantages that are associated to them.
REFERENCES Andrews, R., & Geva, S. (1994). Rule extraction from a constrained error backpropagation MLP. Proceedings of 672
Benítez, J.M., Castro, J.L., & Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8, 1156-1164. Buckley, J.J., Hayashi, Y., & Czogala, E. (1993). On the equivalence of neural nets and fuzzy expert systems. Fuzzy Sets Systems, 53, 129-134.
Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery. The 1st Edition, Kluver International Series in Engineering and Computer Science (pp. 495). Boston: Kluwer Academic Publishers. Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. PhD Thesis, University of Wisconsin, Madison. Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new methodology of extraction, optimisation and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12, 277-306. Engelbrecht, A.P., Rouwhorst, S.E., & Schoeman L. (2001). A building block approach to genetic programming for rule discovery. In Abbass, R. Sarkar, & C. Newton (Eds.), Data mining: A heuristic approach (pp. 175-189). Hershey: Idea Group Publishing. Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine learning. Reading, MA: Addison-Wesley. Haykin, S. (1999). Neural networks (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. Holland, J. H. (1975). Adaptation in natural and artificial systems. University of Michigan Press. Jang, J., & Sun, C. (1992). Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4, 156-158. Keedwell E., Narayanan A., & Savic D. (2000). Creating rules from trained neural networks using genetic algorithms. Proceedings of the International Journal of Computers, Systems and Signals (IJCSS) (Vol. 1, pp. 3042). Koza, J. R., Keane, M. A., Streeter, M. J., Mydlowec, W., Yu, J., & Lanza, G. (Eds.). (2003). Genetic programming
Knowledge Discovery with Artificial Neural Networks

IV: Routine human-competitive machine intelligence. Dordrecht, The Netherlands: Kluwer Academic Publishers. McCulloch, W.S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, (5), 115-133. Orchard, G. (Ed.) (1993). Neural computing. Research and Applications. Ed. London: Institute of Physics Publishing, Londres. Pop, E., Hayward, R., & Diederich, J. (1994). RULENEG: Extracting rules from a trained ANN by stepwise negation. Queensland University of Technology, Neurocomputing Research Centre. QUT NRC technical report. Rabuñal, J.R., Dorado, J., Pazos, A., & Rivero, D. (2003). Rules and generalization capacity extraction from ANN with GP. Lecture notes in Computer Science, 606-613. Rabuñal, J.R., Dorado, J., Pazos, A., Gestal, M., Rivero, D., & Pedreira, N. (2004a). Search the optimal RANN architecture, reduce the training set and make the training process by a distribute genetic algorithm. Artificial Intelligence and Applications, 1, 415-420. Rabuñal, J.R., Dorado, J., Pazos, A., Pereira, J., & Rivero, D. (2004 b). A new approach to the extraction of ANN rules and to their generalization capacity through GP. Neural computation, 7(16), 1483-1523. Rivero, D., Rabuñal, J.R., Dorado, J., Pazos, A., & Pedreira, N. (2004). Extracting knowledge from databases and ANNs with genetic programming: Iris flower classification problem. Intelligent agents for data mining and information retrieval (pp. 136-152). Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536. Tickle, A.B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transaction on Neural Networks. (9), 1057-1068. Towell G., & Shavlik J.W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70, 119-165. Visser, U., Tickle, A., Hayward, R., & Andrews, R. (1996). Rule-extraction from trained neural networks: Different techniques for the determination of herbicides for the plant protection advisory system PRO_PLANT. Proceedings of the Rule Extraction from Trained Artificial Neural Networks Workshop (pp. 133-139), Brighton, UK.
Wong, M.L., & Leung, K.S. (2000). Data mining using grammar based genetic programming and applications. Series in Genetic Programming, 3, 232. Boston: Kluwer Academic Publishers.
KEY TERMS Area of the Search Space: Set of specific ranges or values of the input variables that constitute a subset of the search space. Artificial Neural Networks: A network of many simple processors (“units” or “neurons”) that imitates a biological neural network. The units are connected by unidirectional communication channels, which carry numeric data. Neural networks can be trained to find nonlinear relationships in data, and are used in applications such as robotics, speech recognition, signal processing or medical diagnosis. Backpropagation Algorithm: Learning algorithm of ANNs, based on minimising the error obtained from the comparison between the outputs that the network gives after the application of a set of network inputs and the outputs it should give (the desired outputs). Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping. Evolutionary Computation: Solution approach guided by biological evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a model that best represents the data. Knowledge Extraction: Explicitation of the internal knowledge of a system or set of data in a way that is easily interpretable by the user. Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent). Search Space: Set of all possible situations of the problem that we want to solve could ever be in.
673
Learning Bayesian Networks Marco F. Ramoni Harvard Medical School, USA Paola Sebastiani Boston University School of Public Health, USA
INTRODUCTION Born at the intersection of artificial intelligence, statistics, and probability, Bayesian networks (Pearl, 1988) are a representation formalism at the cutting edge of knowledge discovery and data mining (Heckerman, 1997). Bayesian networks belong to a more general class of models called probabilistic graphical models (Whittaker, 1990; Lauritzen, 1996) that arise from the combination of graph theory and probability theory, and their success rests on their ability to handle complex probabilistic models by decomposing them into smaller, amenable components. A probabilistic graphical model is defined by a graph, where nodes represent stochastic variables and arcs represent dependencies among such variables. These arcs are annotated by probability distribution shaping the interaction between the linked variables. A probabilistic graphical model is called a Bayesian network, when the graph connecting its variables is a directed acyclic graph (DAG). This graph represents conditional independence assumptions that are used to factorize the joint probability distribution of the network variables, thus making the process of learning from a large database amenable to computations. A Bayesian network induced from data can be used to investigate distant relationships between variables, as well as making prediction and explanation, by computing the conditional probability distribution of one variable, given the values of some others.
BACKGROUND The origins of Bayesian networks can be traced back as far as the early decades of the 20th century, when Sewell Wright developed path analysis to aid the study of genetic inheritance (Wright, 1923, 1934). In their current form, Bayesian networks were introduced in the early 1980s as a knowledge representation formalism to encode and use the information acquired from human experts in automated reasoning systems in order to perform diagnostic, predictive, and explanatory tasks (Charniak, 1991; Pearl, 1986, 1988). Their intuitive graphical nature and their principled probabilistic foundations were very at-
tractive features to acquire and represent information burdened by uncertainty. The development of amenable algorithms to propagate probabilistic information through the graph (Lauritzen, 1988; Pearl, 1988) put Bayesian networks at the forefront of artificial intelligence research. Around the same time, the machine-learning community came to the realization that the sound probabilistic nature of Bayesian networks provided straightforward ways to learn them from data. As Bayesian networks encode assumptions of conditional independence, the first machine-learning approaches to Bayesian networks consisted of searching for conditional independence structures in the data and encoding them as a Bayesian network (Glymour, 1987; Pearl, 1988). Shortly thereafter, Cooper and Herskovitz (1992) introduced a Bayesian method that was further refined by Heckerman, et al. (1995) to learn Bayesian networks from data. These results spurred the interest of the data-mining and knowledge-discovery community in the unique features of Bayesian networks (Heckerman, 1997); that is, a highly symbolic formalism, originally developed to be used and understood by humans, well-grounded on the sound foundations of statistics and probability theory, able to capture complex interaction mechanisms and to perform prediction and classification.
MAIN THRUST A Bayesian network is a graph, where nodes represent stochastic variables and (arrowhead) arcs represent dependencies among these variables. In the simplest case, variables are discrete, and each variable can take a finite set of values.
Representation Suppose we want to represent the variable gender. The variable gender may take two possible values: male and female. The assignment of a value to a variable is called the state of the variable. So, the variable gender has two states: Gender = Male and Gender = Female. The graphical structure of a Bayesian network looks like this:
Figure 1.
The network represents the notion that obesity and gender affect the heart condition of a patient. The variable obesity can take three values: yes, borderline and no. The variable heart condition has two states: true and false. In this representation, the node heart condition is said to be a child of the nodes gender and obesity, which, in turn, are the parents of heart condition. The variables used in a Bayesian networks are stochastic, meaning that the assignment of a value to a variable is represented by a probability distribution. For instance, if we do not know for sure the gender of a patient, we may want to encode the information so that we have better chances of having a female patient rather than a male one. This guess, for instance, could be based on statistical considerations of a particular population, but this may not be our unique source of information. So, for the sake of this example, let’s say that there is an 80% chance of being female and a 20% chance of being male. Similarly, we can encode that the incidence of obesity is 10%, and 20% are borderline cases. The following set of distributions tries to encode the fact that obesity increases the cardiac risk of a patient, but this effect is more significant in men than women: The dependency is modeled by a set of probability distributions, one for each combination of states of the variables gender and obesity, called the parent variables of heart condition.
Figure 2. The conditional probability distributions of Heart Condition, one for each combination of states of its parents Gender and Obesity (figure not reproduced)
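The same structure and distributions can also be written down directly as tables. The following Python sketch encodes the example network; the priors for Gender and Obesity are the ones stated above, while the numbers in the Heart Condition table are illustrative placeholders (the published values belong to Figure 2, which is not reproduced here), chosen only to respect the qualitative statement that obesity raises cardiac risk more sharply for men.

```python
# A minimal, dictionary-based encoding of the example Bayesian network.
# Structure: Gender -> Heart Condition <- Obesity

priors = {
    "Gender":  {"male": 0.2, "female": 0.8},
    "Obesity": {"yes": 0.1, "borderline": 0.2, "no": 0.7},
}

# P(Heart Condition = true | Gender, Obesity): placeholder values, not the
# published ones; they only encode "obesity increases risk, more so in men".
p_heart_true = {
    ("male", "yes"): 0.40, ("male", "borderline"): 0.20, ("male", "no"): 0.10,
    ("female", "yes"): 0.25, ("female", "borderline"): 0.15, ("female", "no"): 0.08,
}

def joint(gender, obesity, heart):
    """P(Gender, Obesity, Heart Condition) by the chain rule of the network."""
    p = p_heart_true[(gender, obesity)]
    return priors["Gender"][gender] * priors["Obesity"][obesity] * (p if heart else 1.0 - p)

# Sanity check: the joint distribution sums to one.
total = sum(joint(g, o, h)
            for g in priors["Gender"]
            for o in priors["Obesity"]
            for h in (True, False))
print(round(total, 10))  # 1.0
```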
Learning Learning a Bayesian network from data consists of the induction of its two different components: (1) the graphical structure of conditional dependencies (model selection) and (2) the conditional distributions quantifying the dependency structure (parameter estimation). There are two main approaches to learning Bayesian networks from data. The first approach, known as the constraint-based approach, is based on conditional independence tests: as the network encodes assumptions of conditional independence, we identify conditional independence constraints in the data by testing and then encode them into a Bayesian network (Glymour, 1987; Pearl, 1988; Whittaker, 1990). The second approach is Bayesian (Cooper & Herskovitz, 1992; Heckerman et al., 1995) and regards model selection as a hypothesis testing problem. In this approach, we suppose we have a set M = {M0, M1, ..., Mg} of Bayesian networks for the random variables Y1, ..., Yv, and each Bayesian network represents a hypothesis on the dependency structure relating these variables. Then, we choose one Bayesian network after observing a sample of data D = {y1k, ..., yvk}, for k = 1, ..., n. If p(Mh) is the prior probability of model Mh, a Bayesian solution to the model selection problem consists of choosing the network with maximum posterior probability: p(Mh|D) ∝ p(Mh)p(D|Mh). The quantity p(D|Mh) is the marginal likelihood, and its computation requires the specification of a parameterization of each model Mh and the elicitation of a prior distribution for the model parameters. When all variables are discrete, or when all variables are continuous, follow Gaussian distributions, and the dependencies are linear, the marginal likelihood factorizes into the product of the marginal likelihoods of each node and its parents. An important property of this likelihood modularity is that, in the comparison of models that differ only in the parent structure of a variable Yi, only the local marginal likelihood matters. Thus, the comparison of two local network structures that specify different parents for Yi can be done simply by evaluating the product of the local Bayes factor BFh,k = p(D|Mhi) / p(D|Mki) and the ratio p(Mhi) / p(Mki), to compute the posterior odds of one model vs. the other as p(Mhi|D) / p(Mki|D). In this way, we can learn a model locally by maximizing the marginal likelihood node by node. Still, the space of the possible sets of parents for each variable grows exponentially with the number of parents involved, but successful heuristic search procedures (both deterministic and stochastic) exist to render the task more amenable (Cooper & Herskovitz, 1992; Larranaga et al., 1996; Singh & Valtorta, 1995).
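As an illustration of this decomposability, the sketch below scores candidate parent sets for a single discrete node with the local marginal likelihood under uniform Dirichlet priors (the K2-style metric of Cooper and Herskovitz), and compares two candidates through the local Bayes factor. The dataset, the variable names, and the candidate parent sets are made up for the example; a real learner would wrap this local score in a heuristic search over parent sets.

```python
# Local (per-node) marginal likelihood under uniform Dirichlet priors.
# Comparing two candidate parent sets for the same node then reduces to a
# difference of log scores (the local Bayes factor).
from collections import Counter
from itertools import product
from math import lgamma, exp

# Toy complete dataset over three binary variables (hypothetical values).
data = [
    {"A": 0, "B": 0, "C": 0}, {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1}, {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 0},
]
states = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}

def log_local_score(child, parents):
    """log p(D | parents -> child) with a uniform Dirichlet prior (K2 metric)."""
    r = len(states[child])
    score = 0.0
    for config in product(*(states[p] for p in parents)):
        rows = [row for row in data
                if all(row[p] == v for p, v in zip(parents, config))]
        n_jk = Counter(row[child] for row in rows)
        n_j = sum(n_jk.values())
        score += lgamma(r) - lgamma(n_j + r)
        score += sum(lgamma(n_jk[k] + 1) for k in states[child])
    return score

# Local Bayes factor for two competing parent sets of node C.
bf = exp(log_local_score("C", ["A"]) - log_local_score("C", ["B"]))
print(bf)  # 9.0: the structure A -> C is favoured over B -> C on this toy sample
```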
Once the structure has been learned from a dataset, we still need to estimate the conditional probability distributions associated to each dependency in order to turn the graphical model into a Bayesian network. This process, called parameter estimation, takes a graphical structure and estimates the conditional probability distributions of each parent-child combination. When all the parent variables are discrete, we need to compute the conditional probability distribution of the child variable, given each combination of states of its parent variables. These conditional distributions can be estimated either as relative frequencies of cases or, in a Bayesian fashion, by using these relative frequencies to update some, possibly uniform, prior distribution. A more detailed description of these estimation procedures for both discrete and continuous cases is available in Ramoni and Sebastiani (2003).
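A minimal sketch of this estimation step, assuming a fully discrete network and a toy dataset: raw frequency counts give the maximum-likelihood estimate, while adding a uniform pseudo-count implements the Bayesian update of a uniform prior mentioned above. The data are hypothetical.

```python
from collections import Counter, defaultdict

# Toy complete data: (gender, obesity, heart_condition) triples, made up for the example.
rows = [
    ("female", "no", False), ("female", "no", False), ("female", "yes", True),
    ("male", "yes", True), ("male", "no", False), ("female", "borderline", False),
]
heart_states = [True, False]

def cpt_heart(prior_count=0.0):
    """P(heart | gender, obesity): relative frequencies, optionally smoothed
    by a uniform Dirichlet prior (prior_count > 0)."""
    counts = defaultdict(Counter)
    for gender, obesity, heart in rows:
        counts[(gender, obesity)][heart] += 1
    table = {}
    for parents, c in counts.items():
        total = sum(c.values()) + prior_count * len(heart_states)
        table[parents] = {h: (c[h] + prior_count) / total for h in heart_states}
    return table

print(cpt_heart())               # maximum-likelihood (relative frequencies)
print(cpt_heart(prior_count=1))  # Bayesian update of a uniform prior
```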
Prediction and Classification Once a Bayesian network has been defined, either by hand or by an automated discovery process from data, it can be used to reason about new problems for prediction, diagnosis, and classification. Bayes’ theorem is at the heart of the propagation process. One of the most useful properties of a Bayesian network is the ability to propagate evidence irrespective of the position of a node in the network, contrary to standard classification methods. In a typical classification system, for instance, the variable to predict (i.e., the class) must be chosen in advance before learning the classifier. Information about single individuals then will be entered, and the classifier will predict the class (and only the class) of these individuals. In a Bayesian network, on the other hand, the information about a single individual will be propagated in any direction in the network so that the variable(s) to predict must not be chosen in advance. Although the problem of propagating probabilistic information in Bayesian networks is known to be, in the general case, NP-complete (Cooper, 1990), several scalable algorithms exist to perform this task in networks with hundreds of nodes (Castillo, et al., 1996; Cowell et al., 1999; Pearl, 1988). Some of these propagation algorithms have been extended, with some restriction or approximations, to networks containing continuous variables (Cowell et al., 1999).
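The propagation step can be illustrated, for small networks, by brute-force enumeration over the joint distribution; this is exact but exponential, which is why the junction-tree and related algorithms cited above matter in practice. The same routine answers a diagnostic query such as P(Gender | Heart Condition = true) and a predictive one such as P(Heart Condition | Gender = male) without nominating a class variable in advance. The tables are the same illustrative placeholders used in the representation sketch.

```python
# Exact inference by enumeration over the three-variable example network.
priors = {
    "Gender":  {"male": 0.2, "female": 0.8},
    "Obesity": {"yes": 0.1, "borderline": 0.2, "no": 0.7},
}
p_heart_true = {  # illustrative placeholder CPT, as before
    ("male", "yes"): 0.40, ("male", "borderline"): 0.20, ("male", "no"): 0.10,
    ("female", "yes"): 0.25, ("female", "borderline"): 0.15, ("female", "no"): 0.08,
}

def joint(g, o, h):
    p = p_heart_true[(g, o)]
    return priors["Gender"][g] * priors["Obesity"][o] * (p if h else 1.0 - p)

def posterior(target, evidence):
    """P(target | evidence), with evidence given as a dict of fixed values."""
    scores = {}
    for g in priors["Gender"]:
        for o in priors["Obesity"]:
            for h in (True, False):
                vals = {"Gender": g, "Obesity": o, "Heart": h}
                if all(vals[k] == v for k, v in evidence.items()):
                    scores[vals[target]] = scores.get(vals[target], 0.0) + joint(g, o, h)
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

print(posterior("Heart", {"Gender": "male"}))   # predictive direction
print(posterior("Gender", {"Heart": True}))     # diagnostic direction
```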
FUTURE TRENDS The technical challenges of current research in Bayesian networks are focused mostly on overcoming their current limitations. Established methods to learn Bayesian networks from data work under the assumption that each variable is either discrete or normally distributed around a mean that linearly depends on its parent variables. The
latter networks are termed linear Gaussian networks, which still enjoy the decomposability properties of the marginal likelihood. Imposing the assumption that continuous variables follow linear Gaussian distributions and that discrete variables can only be parent nodes in the network but cannot be children of any continuous node leads to a closed-form solution for the computation of the marginal likelihood (Lauritzen, 1992). The second technical challenge is the identification of sound methods to handle incomplete information, either in the form of missing data (Sebastiani & Ramoni, 2001) or completely unobserved variables (Binder et al., 1997). A third important area of development is the extension of Bayesian networks to represent dynamic processes (Ghahramani, 1998) and to decode control mechanisms. The most fundamental challenge of Bayesian networks today, however, is the full deployment of their potential in groundbreaking applications and their establishment as a routine analytical technique in science and engineering. Bayesian networks are becoming increasingly popular in various fields of genomic and computational biology, from gene expression analysis (Friedman, 2004) to proteomics (Jansen et al., 2003) and genetic analysis (Lauritzen & Sheehan, 2004), but they are still far from being a received approach in these areas. Still, these areas of application hold the promise of turning Bayesian networks into a common tool of statistical data analysis.
CONCLUSION Bayesian networks are a representation formalism born at the intersection of statistics and artificial intelligence. Thanks to their solid statistical foundations, they have been turned successfully into a powerful data-mining and knowledge-discovery tool that is able to uncover complex models of interactions from large databases. Their high symbolic nature makes them easily understandable to human operators. Contrary to standard classification methods, Bayesian networks do not require the preliminary identification of an outcome variable of interest, but they are able to draw probabilistic inferences on any variable in the database. Notwithstanding these attractive properties and the continuous interest of the data-mining and knowledge-discovery community, Bayesian networks still are not playing a routine role in the practice of science and engineering.
REFERENCES
Binder, J. et al. (1997). Adaptive probabilistic networks with hidden variables. Mach Learn, 29(2-3), 213-244.
Castillo, E. et al. (1996). Expert systems and probabilistic network models. New York: Springer.
Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12(8), 50-63.
Cooper, G.F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell, 42(2-3), 393-405.
Cooper, G.F., & Herskovitz, G.F. (1992). A Bayesian method for the induction of probabilistic networks from data. Mach Learn, 9, 309-347.
Cowell, R.G., et al. (1999). Probabilistic networks and expert systems. New York: Springer.
Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303, 799-805.
Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In C.L. Giles & M. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 168-197). New York: Springer.
Glymour, C., Scheines, R., Spirtes, P., & Kelly, K. (1987). Discovering causal structure: Artificial intelligence, philosophy of science, and statistical modeling. San Diego, CA: Academic Press.
Heckerman, D. (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1), 79-119.
Heckerman, D. et al. (1995). Learning Bayesian networks: The combinations of knowledge and statistical data. Mach Learn, 20, 197-243.
Jansen, R. et al. (2003). A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449-453.
Larranaga, P., Kuijpers, C., Murga, R., & Yurramendi, Y. (1996). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE T Syst Man Cyb, 26, 487-493.
Lauritzen, S.L. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J Roy Stat Soc B Met, 50, 157-224.
Lauritzen, S.L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. J Amer Statist Assoc, 87, 1098-1108.
Lauritzen, S.L. (1996). Graphical models. Oxford: Clarendon Press.
Lauritzen, S.L., & Sheehan, N.A. (2004). Graphical models for genetic analysis. Statist Sci, 18(4), 489-514.
Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artif Intell, 29(3), 241-288.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.
Ramoni, M., & Sebastiani, P. (2003). Bayesian methods. In M. Berthold & D.J. Hand (Eds.), Intelligent data analysis: An introduction (pp. 128-166). New York: Springer.
Sebastiani, P., & Ramoni, M. (2001). Bayesian selection of decomposable models with incomplete data. J Am Stat Assoc, 96(456), 1375-1386.
Singh, M., & Valtorta, M. (1995). Construction of Bayesian network structures from data: A brief survey and an efficient algorithm. Int J Approx Reason, 12, 111-131.
Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: John Wiley & Sons.
Wright, S. (1923). The theory of path coefficients: A reply to Niles' criticisms. Genetics, 8, 239-255.
Wright, S. (1934). The method of path coefficients. Ann Math Statist, 5, 161-215.

KEY TERMS
Bayes Factor: The ratio of the probability of the observed data under one hypothesis to its probability under an alternative hypothesis.
Conditional Independence: Let X, Y, and Z be three sets of random variables; then X and Y are said to be conditionally independent given Z, if and only if p(x|z,y)=p(x|z) for all possible values x, y, and z of X, Y, and Z.
Directed Acyclic Graph (DAG): A graph with directed arcs containing no cycles; in this type of graph, for any node, there is no directed path returning to it.
Probabilistic Graphical Model: A graph with nodes representing stochastic variables annotated by probability distributions and representing assumptions of conditional independence among its variables.
Statistical Independence: Let X and Y be two disjoint sets of random variables; then X is said to be independent of Y, if and only if p(x)=p(x|y) for all possible values x and y of X and Y.
Learning Information Extraction Rules for Web Data Mining Chia-Hui Chang National Central University, Taiwan Chun-Nan Hsu Institute of Information Science, Academia Sinica, Taiwan
INTRODUCTION The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Webmining applications, such as comparison shopping, require expensive maintenance costs to deal with different data formats. The problem in translating the contents of input documents into structured data is called information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents from a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and search tools. Formally, an information extraction task is defined by its input and its extraction target. The input can be unstructured documents like free text that are written in natural language or semi-structured documents that are pervasive on the Web, such as tables, itemized and enumerated lists, and so forth. The extraction target of an IE task can be a relation of k-tuple (where k is the number of attributes in a record), or it can be a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. The difficulty of an IE task can be complicated further when various permutations of attributes or typographical errors occur in the input documents. Programs that perform the task of information extraction are referred to as extractors or wrappers. A wrapper is originally defined as a component in an information integration system that aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that wraps an information source (e.g., a database server or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism. In the case where the information source is a Web server, a
wrapper must perform information extraction in order to extract the contents in HTML documents. Wrapper induction (WI) systems are software tools that are designed to generate wrappers. A wrapper usually performs a pattern-matching procedure (e.g., a form of finite-state machines), which relies on a set of extraction rules. Tailoring a WI system to a new requirement is a task that varies in scale, depending on the text type, domain, and scenario. To maximize reusability and minimize maintenance cost, designing a trainable WI system has been an important topic in research fields, including message understanding, machine learning, pattern mining, and so forth. The task of Web IE differs largely from traditional IE tasks in that traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. In contrast, Web IE processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually take advantage of natural language processing techniques such as lexicons and grammars, while Web IE usually applies machine learning and pattern-mining techniques to exploit the syntactical patterns or to lay out structures of the template-based documents.
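As a concrete, deliberately simplified illustration of what an extraction rule is, the sketch below wraps a made-up, template-generated page fragment with a single regular expression; the HTML snippet, field names, and delimiters are hypothetical, and the wrappers generated by the systems surveyed below are far more expressive than one regex.

```python
import re

# A made-up fragment of a template-generated page.
page = """
<tr><td class="name">USB cable</td><td class="price">$3.99</td></tr>
<tr><td class="name">HDMI cable</td><td class="price">$7.49</td></tr>
"""

# The "extraction rule": left and right delimiters around each attribute.
rule = re.compile(
    r'<td class="name">(?P<name>.*?)</td>'
    r'<td class="price">\$(?P<price>[\d.]+)</td>'
)

# The wrapper turns the semi-structured input into k-tuples (here k = 2).
records = [(m["name"], float(m["price"])) for m in rule.finditer(page)]
print(records)  # [('USB cable', 3.99), ('HDMI cable', 7.49)]
```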
BACKGROUND In the past few years, many approaches of WI systems and how to apply machine learning and pattern mining techniques to train WI systems have been proposed with various degrees of automation. Kushmerick and Thomas (2003) conducted a survey that categorizes WI systems based on the wrapper programs’ underlying formalism (whether they are finite-state approaches or Prolog-like logic programming systems). Sarawagi (2002) further distinguished deterministic finite-state approaches from probabilistic hidden Markov models in her 2002 VLDB tutorial. Another survey can be found in Laender, et al. (2002), which categorizes Web extraction tools into six classes based on their underlying techniques—declara-
tive languages, HTML structure analysis, natural language processing, machine learning, data modeling, and ontology.
MAIN THRUST We classify previous work in Web IE into three categories. The first category contains the systems that require users to possess programming expertise. This category of wrapper generation systems provides specialized languages or toolkits for wrapper construction, such as W4F (Sahuguet & Azavant, 2001) and XWrap (Liu et al., 2000). Such languages or toolkits were proposed as alternatives to general-purpose languages in order to allow programmers to concentrate on formulating the extraction rules without being concerned about the detailed process of input strings. To apply these systems, users must learn the language in order to write their extraction rules. Therefore, such systems also feature user-friendly interfaces for easy use of the toolkits. However, writing correct extraction rules requires significant programming expertise. In addition, since the structures of Web pages are not always obvious and change frequently, writing specialized extraction rules can be time-consuming, error-prone, and not scalable to a large number of Web sites. Therefore, there is a need for automatic wrapper induction that can generalize extraction rules for each distinct IE task. The second category contains the WI systems that require users to label some extraction targets as training examples for WI systems to apply a machine-learning algorithm to learn extraction rules from the training examples. No programming is needed to configure these WI systems. Many IE tasks for Web mining belong to this category; for example, IE for semi-structured text such as RAPIER (Califf & Mooney, 1999), SRV (Freitag, 2000), WHISK (Soderland, 1999), and for IE for template-based pages such as WIEN (Kushmerick et al., 2000), SoftMealy (Hsu and Dung, 1998), STALKER (Muslea, et al., 2001), and so forth. Compared to the first category, these WI systems are preferable, since general users, instead of only programmers, can be trained to use these WI systems for wrapper construction. However, since the learned rules only apply to Web pages from a particular Web site, labeling training examples can be laborious, especially when we need to extract contents from thousands of data sources. Therefore, researchers have focused on developing tools that can reduce labeling effort. For instance, Muslea, et al. (2002) proposed selective sampling, a form of active learning that reduces the number of training examples. Chidlovskii, et al. (2000) designed a wrapper generation system that requires a small amount (one training record) of labeling by the user. Earlier annotation-based WI
systems place emphasize on the learning techniques in their paper. Recently, several works have been proposed to simplify the annotation process. For example, Lixto (Baumgartner et al., 2001), DEByE (Laender et al., 2002) and OLERA (Chang & Kuo, 2004) are three such systems that stress the importance of how annotation or examples are received from users. Note that OLERA also features the so-called semi-supervised approach, which receives rough rather than exact and perfect examples from users to reduce labeling effort. The third category contains the WI systems that do not require any preprocessing of the input documents by the users. We call them annotation-free WI systems. Example systems include IEPAD (Chang & Lui, 2001), RoadRunner (Crescenzi et al., 2001), DeLa (Wang & Lochovsky, 2003), and EXALG (Arasu & Garcia-Molina, 2003). Since no extraction targets are specified, such WI systems make heuristic assumptions about the data to be extracted. For example, the first three systems assume the existence of multiple tuples to be extracted in one page; therefore, the approach is to discover repeated patterns in the input page. With such an assumption, IEPAD and DeLa only apply to Web pages that contain multiple data tuples. RoadRunner and EXALG, on the other hand, try to extract structured data by deducing the template and the schema of the whole page from multiple Web pages. Their assumption is that strings that are stationary across pages are presumably template, and strings that are variant are presumably schema and need to be extracted. However, as commercial Web pages often contain multiple topics where a lot of information is embedded in a page for navigation, decoration, and interaction purposes, their systems may extract both useful and useless information from a page. However, the criterion of what is useful is quite subjective and depends on the application. In summary, these approaches are, in fact, not fully automatic. Rather, post-processing is required for users to select useful data and to assign the data to a proper attribute.
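The core idea behind the second, annotation-based category can be conveyed with a toy sketch in the spirit of LR (left-right) wrappers such as WIEN: from pages in which a user has labeled the target values, induce for each attribute a pair of delimiter strings and then apply them to unseen pages of the same template. The pages, the labels, and the simple longest-common-suffix/prefix heuristic are illustrative assumptions, not the algorithm of any particular system.

```python
import os

def common_suffix(strings):
    rev = os.path.commonprefix([s[::-1] for s in strings])
    return rev[::-1]

def learn_lr(pages, labels):
    """Induce (left, right) delimiters for one attribute from labeled pages."""
    lefts, rights = [], []
    for page, value in zip(pages, labels):
        i = page.index(value)
        lefts.append(page[:i])
        rights.append(page[i + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, left, right):
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

# Hypothetical training pages from one template, with the country code labeled.
pages = ["<b>Congo</b> <i>242</i><br>", "<b>Egypt</b> <i>20</i><br>"]
labels = ["242", "20"]
left, right = learn_lr(pages, labels)
print((left, right))                                        # ('</b> <i>', '</i><br>')
print(extract("<b>Spain</b> <i>34</i><br>", left, right))   # '34'
```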
Task Difficulties A critical issue of WI systems is what types of documents and structuring variations can be handled. Documents can be classified into structured, semi-structured, and unstructured sets (Hsu & Dung, 1998). Early IE systems like RAPIER, SRV, and WHISK are designed to handle documents that contain semi-structured texts, while recent IE systems are designed mostly to handle documents that contain semi-structured data (Laender et al., 2002). In this survey, we focus on semi-structured data extraction and possible structure variation. These include missing data attributes, multi-valued attributes, attribute permutations, nested data structures, and so forth. Table 1 lists
these variations and whether a WI system can correctly extract the contents from a document with a layout variation. Most WI systems can handle missing attributes except for WIEN. Multi-valued attributes can be considered as a special case of nested objects. However, it is neither right nor wrong to say annotation-free WI systems can handle multi-valued attributes, because it depends on how the values are delimited (by HTML tags or by delimiters like comma, spaces, etc.). Similarly, though RoadRunner and EXALG support the extraction of nested objects in general, whether they can handle a particular set of documents with nested objects depends, in fact, on the quality of template-based input. Permutations of attributes refer to multiple attribute orders in different data tuples in the target documents (see the PMI example in Chang & Kuo, 2004). Note that both missing attributes and permutations of attributes can lead to multiple attribute orders. Some approaches (e.g., STALKER and multi-pass SoftMealy) utilize multiple scans to deal with attribute permutation. Some (e.g., IEPAD, DeLa) employ string alignment technique for this issue. However, the way they handle the extract data results in different power. EXALG combines two equivalence classes to produce disjunction rules for handling permutated attributes. However, the procedure may fail when all tokens fall in one equivalence class or when no equivalence class is formed. Another difficult issue is what we called common delimiters (CD) for attributes and record boundaries. An example can be found in an Internet address finder document set from the repository of information sources for information extraction (RISE, http://www.isi.edu/infoagents/RISE/repository.html). In that document set, HTML tags like and are used as delimiters for both record boundaries and attributes. Such document sets are especially difficult for annotation-free systems. Even for annotation-based systems, the problem cannot be completely handled with their default extraction mechanism. Nonexistent delimiters (ND) also cause problems for some WI systems that rely on delimiter-based extraction rules. For example, suppose we want to extract the department code and course number from the following string: COMP4016. WI systems that depend on recognizing de-
limiters will fail in this case, because no delimiter exists between the department codes COMP and the course number 4016. Note that delimiter-related issues sometimes are caused by the tokenization/encoding method (see next paragraph) of the WI systems. Therefore, they do not necessarily cause problems for all WI systems. Finally, most WI systems assume that the attributes of a data object occur in a contiguous string, which does not interleave with other data objects except for MDR (Liu et al., 2003), which is able to handle non-contiguous (NC) data records.
Encoding Scheme, Scanning Passes, and Other Features Table 2 compares important features of WI systems. In order to learn the set of extraction rules, WI systems need to know how to segment a string of characters (the input) into tokens. For example, SoftMealy segments a page into tokens including HTML tags as well as words separated by spaces and uses a token taxonomy tree for extraction rule generation. On the other hand, IEPAD and RoadRunner regard every text string between two HTML tags as one token, which leads to coarser extraction granularity. Most WI systems have a predefined feature set for rule generalization. Still, some systems such as OLERA (Chang & Kuo, 2004) explicitly allow user-defined encoding schemes for tokenization. The extraction mechanisms of various WI systems also play an important role for extraction efficiency. Some WI systems scan the input document once, referred to as single-pass extractor. Others scan the input document several times to complete the extraction. Generally speaking, single-pass wrappers are more efficient than multi-pass wrappers. However, multi-pass wrappers are more effective at handling data objects with unrestricted attribute permutations or complex object extraction. Some of the WI systems have special characteristics or requirements that deserve discussion. For example, EXALG limits extraction failure to only part of the records (a few attributes) instead of the entire page. STALKER
Table 1. The expressive power of various WI systems

                    WI systems   Nested Objects  Missing  Multi-valued  Permute  ND    CD    NC
Annotation-based    WIEN         No              No       No            No       No    No    No
                    SoftMealy    No              Yes      Yes           Yes      Yes   Part  No
                    STALKER      Yes             Yes      Yes           Yes      Yes   Part  No
                    OLERA        Yes             Yes      Yes           Limited  Yes   No    No
Annotation-free     IEPAD        No              Yes      --            Limited  Part  No    No
                    DeLa         Yes             Yes      --            Yes      Part  No    No
                    RoadRunner   Yes*            Yes      --            No       No    No    No
                    EXALG        Yes*            Yes      --            Limited  No    No    No

Table 2. Features of WI systems (C* = consecutive, E* = episodic)

                    WI systems   Encoding     Passes             EF                ER        Requirement
Annotation-based    WIEN         Word         Single             Entire            1, C*
                    SoftMealy    Word         Single / Multiple  Entire / Partial  Disj, C*
                    STALKER      Word         Multiple           Partial           Disj, E*
                    OLERA        Multiple     Both               Partial           Disj, C*
Annotation-free     IEPAD        HTML         Single             Entire            Disj, C*  Only multi-record pages
                    DeLa         HTML + Word  Single             Entire            Disj, C*
                    RoadRunner   HTML         Single             Entire            1, C*     At least two input pages required
                    EXALG        Word         Single             Partial           Disj, C*  At least two input pages required
and multi-pass SoftMealy also feature this characteristic (see the extraction failure [EF] column). In such cases, lower-level encoding schemes may be necessary for finegrain extraction tasks. Therefore, multi-level encodings are even more important. As for requirements, IEPAD cannot handle single-record pages, for it assumes that there are multiple records in one page. RoadRunner and EXALG require at least two pages as the input in order to generate extraction templates. In contrast, most other WI systems can potentially generate extraction rules from just one example page. The supports for disjunctive rules or extraction rules that allow intermittent delimiters (i.e., episodic rules) in addition to consecutive delimiters also are an indication of the expressive power of WI systems (see the extraction rule [ER] column). These supports are helpful when no single consecutive extraction rule can be generalized; thus, disjunctive rules and episodic rules serve as alternatives. Finally, accuracy and automation are two of the major concerns for all WI systems. For annotation-free systems, they may reach a status where several patterns or grammars are induced. Some systems (e.g., IEPAD) leave the decision to users to choose the correct pattern or grammar; therefore, they are not fully automatic. Fully automatic systems may require further analysis on the derived schema to ensure good quality data extraction (Yang et al., 2003).
FUTURE TRENDS The most critical issue for these state-of-the-art WI systems is that the generated extraction rules only apply to a small collection of documents that share the same layout structure. Usually, those rules only apply to documents from one Web site. Generalization across different layout structures is still an extremely difficult and challenging research problem. Before this problem can be resolved, information integration and data mining to the scale of billions of Web sites is still too expensive and impractical.
CONCLUSION In this overview, we present our taxonomy of WI systems from the user’s viewpoint and compare important features of WI systems that affect their effectiveness. We focus on semi-structured documents that usually are produced with templates by a server-side Web application. The extension to non-HTML can be achieved for some WI systems, if some proper tokenization or feature set is used. Some of the WI systems surveyed here have been applied successfully in large-scale real-world applications, including Web intelligence, comparison shopping, knowledge management, and bioinformatics (Chang et al., 2003; Hsu et al., 2005).
REFERENCES Arasu, A., & Garcia-Molina, H. (2003). Extracting structured data from Web pages. Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 337348), San Diego, CA, USA. Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Supervised wrapper generation with Lixto. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (pp. 715-716), Roma, Italy. Califf, M., & Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. Proceedings of the 16th National Conference on Artificial Intelligence (AAAI) (pp. 328-334), Orlando, FL, USA. Chang, C.-H., & Kuo, S.-C. (2004). OLERA: A semi-supervised approach for Web data extraction with visual support. IEEE Intelligent Systems, 19(6), 56-64. Chang, C.-H., & Lui, S.-C. (2001). IEPAD: Information extraction based on pattern discovery. Proceedings of the 10th International World Wide Web (WWW) Conference (pp. 681-688), Hong Kong, China.
Chang, C.-H., Siek, H., Lu, J.-J., Chiou, J.-J., & Hsu, C.-N. (2003). Reconfigurable Web wrapper agents. IEEE Intelligent Systems, 18(5), 34-40. Chidlovskii, B., Ragetli, J., & Rijke, M. (2000). Automatic wrapper generation for Web search engines. Proceedings of WAIM (pp. 399-410), Shanghai, China. Crescenzi, V., Mecca, G., & Merialdo, P. (2001). Roadrunner: Towards automatic data extraction from large Web sites. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) (pp. 109-118), Roma, Italy. Freitag, D. (2000). Machine learning for information extraction in informal domains. Machine Learning, 39(2/3), 169-202. Hsu, C.-N., Chang, C.-H., Hsieh, C.-H., Lu, J.-J., & Chang, C.-C. (2004). Reconfigurable Web wrapper agents for biological information integration. Journal of the American Society for Information Science and Technology (JASIST), 56(5), 505-517.
Muslea, I. (1999). Extraction patterns for information extraction tasks: A survey. The AAAI-99 Workshop on Machine Learning for Information Extraction (pp. 435442), Sydney, Australia. Muslea, I., Minton, S., & Knoblock, C.A. (2001). Hierarchical wrapper induction for semi-structured information sources. Journal of Autonomous Agents and Multi-Agent Systems, 4, 93-114. Muslea, I., Minton, S., & Knoblock, C. (2002). Active + semi-supervised learning = Robust multi-view learning. Proceedings of the 19th International Conference on Machine Learning (ICML), Sydney, Australia. Sahuguet, A., & Azavant, F. (2001). Building intelligent Web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3), 283-316. Sarawagi, S. (2002). Automation in information extraction and integration. Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Tutorial, Hong Kong, China.
Hsu, C.-N., & Dung, M.-T. (1998). Generating finite-state transducers for semi-structured data extraction from the Web. Information Systems, 23(8), 521-538.
Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Journal of Machine Learning, 34(1-3), 233-272.
Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence Journal, 118(12), 15-68.
Wang, J., & Lochovsky, F.H. (2003). Data extraction and label assignment. Proceedings of the 10th International World Wide Web Conference (pp. 187-196), Budapest, Hungary.
Kushmerick, N., & Thomas, B. (2002). Adaptive information extraction: Core technologies for information agents. Intelligent Information Agents R&D In Europe: An AgentLink Perspective. Lecture Notes in Computer Science, 2586, 2003, 79-103. Springer. Laender, A.H.F., Ribeiro-Neto, B.A., & Da Silva, A.S. (2002). DEByE—Data extraction by example. Data and Knowledge Engineering, 40(2), 121-154. Laender, A.H.F., Ribeiro-Neto, B.A., Da Silva, A.S., & Teixeira, J.S. (2002). A brief survey of Web data extraction tools. SIGMOD Record, 31(2), 84-93. Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 601-606), Washington, D.C., USA. Liu, L., Pu, C., & Han, W. (2000). Xwrap: An Xml-enabled wrapper construction system for Web information sources. Proceedings of the 16th International Conference on Data Engineering (ICDE) (pp. 611-621), San Diego, California, USA.
Yang, G., Ramakrishnan, I.V., & Kifer, M. (2003). On the complexity of schema inference from Web pages in the presence of nullable data attributes. Proceedings of the 12th International Conference on Information and Knowledge Management (pp. 224-231), New Orleans, Louisiana, USA.
KEY TERMS Hidden Markov Model: A variant of a finite state machine having a set of states, an output alphabet, transition probabilities, output probabilities, and initial state probabilities. It is only the outcome, not the state visible to an external observer, and, therefore, states are hidden to the outside; hence, the name Hidden Markov Model. Information Extraction: An information extraction task is to extract or pull out of user-defined and pertinent information from input documents. Logic Programming: A declarative, relational style of programming based on first-order logic. The original logic
programming language was Prolog. The concept is based on Horn clauses. Semi-Structured Documents: Semi-structured (or template-based) documents refer to documents that are formatted between structured and unstructured documents. Software Agents: An artificial agent that operates in a software environment such as operating systems, computer applications, databases, networks, and virtual domains.
Transducer: A finite state machine specifically with a read-only input and a write-only output. The input and output cannot be reread or changed. Wrapper: A program that extracts data from the input documents and wraps them in user desired and structured form.
Locally Adaptive Techniques for Pattern Classification Carlotta Domeniconi George Mason University, USA Dimitrios Gunopulos University of California, USA
INTRODUCTION Pattern classification is a very general concept with numerous applications ranging from science, engineering, target marketing, medical diagnosis, and electronic commerce to weather forecast based on satellite imagery. A typical application of pattern classification is mass mailing for marketing. For example, credit card companies often mail solicitations to consumers. Naturally, they would like to target those consumers who are most likely to respond. Often, demographic information is available for those who have responded previously to such solicitations, and this information may be used in order to target the most likely respondents. Another application is electronic commerce of the new economy. E-commerce provides a rich environment to advance the state of the art in classification, because it demands effective means for text classification in order to make rapid product and market recommendations. Recent developments in data mining have posed new challenges to pattern classification. Data mining is a knowledge-discovery process whose aim is to discover unknown relationships and/or patterns from a large set of data, from which it is possible to predict future outcomes. As such, pattern classification becomes one of the key steps in an attempt to uncover the hidden knowledge within the data. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. While pattern classification has shown promise in many areas of practical significance, it faces difficult challenges posed by real-world problems, of which the most pronounced is Bellman’s curse of dimensionality, which states that the sample size required to perform accurate prediction on problems with high dimensionality is beyond feasibility. This is because, in high dimensional spaces, data become extremely sparse and are apart from each other. As a result, severe bias that affects any estimation process can be introduced in a high-dimensional feature space with finite samples.
Learning tasks with data represented as a collection of a very large number of features abound. For example, microarrays contain an overwhelming number of genes relative to the number of samples. The Internet is a vast repository of disparate information growing at an exponential rate. Efficient and effective document retrieval and classification systems are required to turn the ocean of bits around us into useful information and, eventually, into knowledge. This is a challenging task, since a word level representation of documents easily leads 30,000 or more dimensions. This paper discusses classification techniques to mitigate the curse of dimensionality and to reduce bias by estimating feature relevance and selecting features accordingly. This paper has both theoretical and practical relevance, since many applications can benefit from improvement in prediction performance.
BACKGROUND In a classification problem, an observation is character-
ized by q feature measurements x = (x1, ..., xq) ∈ ℜ^q and is presumed to be a member of one of J classes, Lj, j = 1, ..., J. The particular group is unknown, and the
goal is to assign the given object to the correct group, using its measured features x . Feature relevance has a local nature. Therefore, any chosen fixed metric violates the assumption of locally constant class posterior probabilities, and fails to make correct predictions in different regions of the input space. In order to achieve accurate predictions, it becomes crucial to be able to estimate the different degrees of relevance that input features may have in various locations of the feature space. Consider, for example, the rule that classifies a new data point with the label of its closest training point in the measurement space (1-Nearest Neighbor rule). Suppose
that each instance is described by 20 features, but only three of them are relevant to classifying a given instance. In this case, two points that have identical values for the three relevant features may, nevertheless, be distant from one another in the 20-dimensional input space. As a result, the similarity metric that uses all 20 features will be misleading, since the distance between neighbors will be dominated by the large number of irrelevant features. This shows the effect of the curse of dimensionality phenomenon; that is, in high dimensional spaces, distances between points within the same class or between different classes may be similar. This fact leads to highly-biased estimates. Nearest neighbor approaches (Ho, 1998; Lowe, 1995) are especially sensitive to this problem. In many practical applications, things often are further complicated. In the previous example, the three relevant features for the classification task at hand may be dependent on the location of the query point (i.e., the point to be classified) in the feature space. Some features may be relevant within a specific region, while other features may be more relevant in a different region. Figure 1 illustrates a case in point, where class boundaries are parallel to the coordinate axes. For query a, dimension X is more relevant, because a slight move along the X axis may change the class label, while for query b, dimension Y is more relevant. For query c, however, both dimensions are equally relevant. These observations have two important implications. Distance computation does not vary with equal strength or in the same proportion in all directions in the feature space emanating from the input query. Moreover, the value of such strength for a specific feature may vary from location to location in the feature space. Capturing such information, therefore, is of great importance to any classification procedure in high-dimensional settings.
Figure 1. Feature relevance varies with query locations
MAIN THRUST Severe bias can be introduced in pattern classification in a high dimensional input feature space with finite samples. In the following, we introduce adaptive metric techniques
for distance computation capable of reducing the bias of the estimation. Friedman (1994) describes an adaptive approach (the Machete and Scythe algorithms) for classification that combines some of the best features of kNN learning and recursive partitioning. The resulting hybrid method inherits the flexibility of recursive partitioning to adapt the shape of the neighborhood N (x 0 ) of query x0 , as well as the ability of nearest neighbor techniques to keep the points within N (x 0 ) close to the point being predicted. The method is capable of producing nearly continuous probability estimates with the region N (x 0 ) centered at x0 and the shape of the region separately customized for each individual prediction point. The major limitation concerning the Machete/Scythe method is that, like recursive partitioning methods, it applies a greedy strategy. Since each split is conditioned on its ancestor split, minor changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes the predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data and, therefore, may lead to high variance predictions. In Hastie and Tibshirani (1996), the authors propose a discriminant adaptive nearest neighbor classification method (DANN), based on linear discriminant analysis. Earlier related proposals appear in Myles and Hand (1990) and Short and Fukunaga (1981). The method in Hastie and Tibshirani (1996) computes a local distance metric as a product of weighted within and between the sum of squares matrices. The authors also describe a method of performing global dimensionality reduction by pooling the local dimension information over all points in the training set (Hastie & Tibshirani, 1996a, 1996b). While sound in theory, DANN may be limited in practice. The main concern is that in high dimensions, one may never have sufficient data to fill in q × q (within and between sum of squares) matrices (where q is the dimensionality of the problem). Also, the fact that the distance metric computed by DANN approximates the weighted Chi-squared distance only when class densities are Gaussian and have the same covariance matrix may cause a performance degradation in situations where data do not follow Gaussian distributions or are corrupted by noise, which is often the case in practice. A different adaptive nearest neighbor classification method (ADAMENN) has been introduced to try to minimize bias in high dimensions (Domeniconi, Peng & Gunopulos, 2002) and to overcome the previously mentioned limitations. ADAMENN performs a Chi-squared distance analysis to compute a flexible metric for produc-
ing neighborhoods that are highly adaptive to query locations. Let x be the nearest neighbor of a query x0 computed according to a distance metric D(x, x0). The goal is to find a metric D(x, x0) that minimizes E[r(x0, x)], where r(x0, x) = ∑_{j=1}^{J} Pr(j|x0)(1 − Pr(j|x)). Here Pr(j|x) is the class conditional probability at x. That is, r(x0, x) is the finite sample error risk given that the nearest neighbor to x0 by the chosen metric is x. It can be shown (Domeniconi, Peng & Gunopulos, 2002) that the weighted Chi-squared distance

D(x, x0) = ∑_{j=1}^{J} [Pr(j|x) − Pr(j|x0)]² / Pr(j|x0)    (1)

approximates the desired metric, thus providing the foundation upon which the ADAMENN algorithm computes a measure of local feature relevance, as shown in the following. The first observation is that Pr(j|x) is a function of x. Therefore, one can compute the conditional expectation of Pr(j|x), denoted by Pr(j|xi = z), given that xi assumes value z, where xi represents the i-th component of x. That is,

Pr(j|xi = z) = E[Pr(j|x) | xi = z] = ∫ Pr(j|x) p(x|xi = z) dx.

Here, p(x|xi = z) is the conditional density of the other input variables defined as p(x|xi = z) = p(x)δ(xi − z) / ∫ p(x)δ(xi − z) dx, where δ(x − z) is the Dirac delta function having the properties δ(x − z) = 0 if x ≠ z and ∫_{−∞}^{+∞} δ(x − z) dx = 1. Let

ri(z) = ∑_{j=1}^{J} [Pr(j|z) − Pr(j|xi = zi)]² / Pr(j|xi = zi)    (2)

represent the ability of feature i to predict the Pr(j|z)'s at xi = zi. The closer Pr(j|xi = zi) is to Pr(j|z), the more information feature i carries for predicting the class posterior probabilities locally at z. We now can define a measure of feature relevance for x0 as

r̄i(x0) = (1/K) ∑_{z ∈ N(x0)} ri(z),    (3)

where N(x0) denotes the neighborhood of x0 containing the K nearest training points, according to a given metric. r̄i measures how well on average the class posterior probabilities can be approximated along input feature i within a local neighborhood of x0. Small r̄i implies that the class posterior probabilities will be well approximated along dimension i in the vicinity of x0. Note that r̄i(x0) is a function of both the test point x0 and the dimension i, thereby making r̄i(x0) a local relevance measure in dimension i. The relative relevance, as a weighting scheme, can then be given by wi(x0) = Ri(x0)^t / ∑_{l=1}^{q} Rl(x0)^t, where t = 1, 2, giving rise to linear and quadratic weightings, respectively, and Ri(x0) = max_{j=1,...,q} {r̄j(x0)} − r̄i(x0) (i.e., the larger the Ri, the more relevant dimension i). We propose the following exponential weighting scheme

wi(x0) = exp(c Ri(x0)) / ∑_{l=1}^{q} exp(c Rl(x0))    (4)

where c is a parameter that can be chosen to maximize (minimize) the influence of r̄i on wi. When c = 0, we have wi = 1/q, which has the effect of ignoring any difference among the r̄i's. On the other hand, when c is large, a change in r̄i will be exponentially reflected in wi. The exponential weighting is more sensitive to changes in local feature relevance and, in general, gives rise to better performance improvement. In fact, it is more stable, because it prevents neighborhoods from extending infinitely in any direction (i.e., zero weight). However, this can occur when either linear or quadratic weighting is used. Thus, equation (4) can be used to compute the weight associated with each feature, resulting in the weighted distance computation:

D(x, y) = ∑_{i=1}^{q} wi (xi − yi)²    (5)
The weights w i enable the neighborhood to elongate less important feature dimensions and, at the same time, to constrict the most influential ones. Note that
the technique is query-based, because the weights depend on the query (Aha, 1997; Atkeson, Moore & Shaal, 1997). An intuitive explanation for (2) and, hence, (3) goes as follows. Suppose that the value of ri (z) is small, which implies a large weight along dimension i . Consequently, the neighborhood is shrunk along that direction. This, in turn, penalizes points along dimension i that are moving away from z i . Now, ri (z ) can be small, only if the subspace spanned by the other input dimensions at xi = z i likely contains samples similar to z in terms of the class conditional probabilities. Then, a large weight assigned to dimension i based on (4) says that moving away from the subspace and, hence, from the data similar to z is not a good thing to do. Similarly, a large value of ri (z) and, hence, a small weight indicates that in the vicinity of z i along dimension i , one is unlikely to find samples similar to z . This corresponds to an elongation of the neighborhood along dimension i . Therefore, in this situation, in order to better predict the query, one must look farther away from z i . One of the key differences between the relevance measure (3) and Friedman’s is the first term in the squared difference. While the class conditional probability is used in (3), its expectation is used in Friedman’s. This difference is driven by two different objectives: in the case of Friedman’s, the goal is to seek a dimension along which the expected variation of Pr( j | x) is maximized, whereas in (3) a dimension is found that minimizes the difference between the class probability distribution for a given query and its conditional expectation along that dimension (2). Another fundamental difference is that the machete/scythe methods, like recursive partitioning, employ a greedy peeling strategy that removes a subset of data points permanently from further consideration. As a result, changes in an early split, due to any variability in parameter estimates, can have a significant impact on later splits, thereby producing different terminal regions. This makes predictions highly sensitive to the sampling fluctuations associated with the random nature of the process that produces the training data, thus leading to high variance predictions. In contrast, ADAMENN employs a patient averaging strategy that takes into account not only the test point x0 , but also its K 0 nearest neighbors. As such, the resulting relevance estimates (3) are, in general, more robust and have the potential to reduce the variance of the estimates. In Hastie and Tibshirani (1996), the authors show that the resulting metric approximates the weighted
Chi-squared distance (1) by a Taylor series expansion, given that class densities are Gaussian and have the same covariance matrix. In contrast, ADAMENN does not make such assumptions, which are unlikely in realworld applications. Instead, it attempts to approximate the weighted Chi-Squared distance (1) directly. The main concern with DANN is that, in high dimensions, we may never have sufficient data to fill in q × q matrices. It is interesting to note that the ADAMENN algorithm potentially can serve as a general framework upon which to develop a unified adaptive metric theory that encompasses both Friedman’s work and that of Hastie and Tibshirani.
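A compact sketch of the core computation described by equations (2)-(5) is given below. It is a simplification for illustration, not the published ADAMENN procedure: class posteriors are estimated by neighborhood counts with Laplace smoothing, fixed neighborhood sizes are used throughout, and the synthetic two-class data are made up so that only the first feature is relevant.

```python
# Simplified local relevance estimation and exponential weighting (eqs. (2)-(5)).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: feature 0 determines the class, features 1-2 are noise.
n, q = 200, 3
X = rng.uniform(0, 1, size=(n, q))
y = (X[:, 0] > 0.5).astype(int)
classes = np.unique(y)

def knn_idx(dists, k):
    return np.argsort(dists)[:k]

def posterior(idx):
    """Class posteriors over a subset of training points, with Laplace smoothing."""
    labels = y[idx]
    return np.array([(labels == j).sum() + 1 for j in classes]) / (len(idx) + len(classes))

def local_weights(x0, K=50, K1=20, c=5.0):
    r_bar = np.zeros(q)
    # Average the relevance measure over the neighborhood N(x0) (equation (3)).
    for z in X[knn_idx(np.linalg.norm(X - x0, axis=1), K)]:
        p_z = posterior(knn_idx(np.linalg.norm(X - z, axis=1), K1))  # Pr(j | z)
        for i in range(q):
            # Pr(j | xi = zi): points closest to z along feature i alone.
            p_i = posterior(knn_idx(np.abs(X[:, i] - z[i]), K1))
            r_bar[i] += np.sum((p_z - p_i) ** 2 / p_i)               # equation (2)
    r_bar /= K
    R = r_bar.max() - r_bar        # small r_bar -> large R -> relevant feature
    return np.exp(c * R) / np.exp(c * R).sum()                       # equation (4)

def classify(x0, k=5):
    w = local_weights(x0)
    d = ((X - x0) ** 2 * w).sum(axis=1)                              # equation (5)
    return np.bincount(y[np.argsort(d)[:k]]).argmax(), w

label, w = classify(np.array([0.52, 0.5, 0.5]))
print(label, np.round(w, 3))   # the weight of feature 0 should dominate
```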
FUTURE TRENDS Almost all problems of practical interest are high dimensional. With the recent technological trends, we can expect an intensification of research efforts in the area of feature relevance estimation and selection. In bioinformatics, the analysis of micro-array data poses challenging problems. Here, one has to face the problem of dealing with more dimensions (genes) than data points (samples). Biologists want to find marker genes that are differentially expressed in a particular set of conditions. Thus, methods that simultaneously cluster genes and samples are required to find distinctive checkerboard patterns in matrices of gene expression data. In cancer data, these checkerboards correspond to genes that are up- or down-regulated in patients with particular types of tumors. Increased research efforts in this area are needed and expected. Clustering is not exempt from the curse of dimensionality. Several clusters may exist in different subspaces, comprised of different combinations of features. Since each dimension could be relevant to at least one of the clusters, global dimensionality reduction techniques are not effective. We envision further investigation on this problem with the objective of developing robust techniques in the presence of noise. Recent developments on kernel-based methods suggest a framework to make the locally adaptive techniques discussed previously more general. One can perform feature relevance estimation in an induced feature space and then use the resulting kernel metrics to compute distances in the input space. The key observation is that kernel metrics may be non-linear in the input space but are still linear in the induced feature space. Hence, the use of suitable non-linear features allows the computation of locally adaptive neighborhoods with arbitrary orientations and shapes in input space. Thus, more powerful classification techniques can be generated.
CONCLUSION
Pattern classification faces a difficult challenge in finite settings and high dimensional spaces, due to the curse of dimensionality. In this paper, we have presented and compared techniques to address data exploration tasks such as classification and clustering. All methods design adaptive metrics or parameter estimates that are local in input space in order to dodge the curse of dimensionality phenomenon. Such techniques have been demonstrated to be effective for the achievement of accurate predictions.

REFERENCES
Aha, D. (1997). Lazy learning. Artificial Intelligence Review, 11, 1-5.
Atkeson, C., Moore, A.W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11, 11-73.
Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1281-1285.
Friedman, J.H. (1994). Flexible metric nearest neighbor classification. Technical Report. Stanford University.
Hastie, T., & Tibshirani, R. (1996a). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607-615.
Hastie, T., & Tibshirani, R. (1996b). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, 58, 155-176.
Ho, T.K. (1998). Nearest neighbors in random subspaces. Proceedings of the Joint IAPR International Workshops on Advances in Pattern Recognition.
Lowe, D.G. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1), 72-85.
Myles, J.P., & Hand, D.J. (1990). The multi-class metric problem in nearest neighbor discrimination rules. Pattern Recognition, 23(11), 1291-1297.
Short, R.D., & Fukunaga, K. (1981). Optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27(5), 622-627.

KEY TERMS
Classification: The task of inferring concepts from observations. It is a mapping from a measurement space into the space of possible meanings, viewed as finite and discrete target points (class labels). It makes use of training data.
Clustering: The process of grouping objects into subsets, such that those within each cluster are more closely related to one another than objects assigned to different clusters, according to a given similarity measure.
Curse of Dimensionality: Phenomenon that refers to the fact that, in high-dimensional spaces, data become extremely sparse and are far apart from each other. As a result, the sample size required to perform an accurate prediction in problems with high dimensionality is usually beyond feasibility.
Kernel Methods: Pattern analysis techniques that work by embedding the data into a high-dimensional vector space and by detecting linear relations in that space. A kernel function takes care of the embedding.
Local Feature Relevance: Amount of information that a feature carries to predict the class posterior probabilities at a given query.
Nearest Neighbor Methods: Simple approach to the classification problem. It finds the K nearest neighbors of the query in the training set and then predicts the class label of the query as the most frequent one occurring in the K neighbors.
Pattern: A structure that exhibits some form of regularity, able to serve as a model representing a concept of what was observed.
Recursive Partitioning: Learning paradigm that employs local averaging to estimate the class posterior probabilities for a classification problem.
Subspace Clustering: Simultaneous clustering of both row and column sets in a data matrix.
Training Data: Collection of observations (characterized by feature measurements), each paired with the corresponding class label.
Logical Analysis of Data
Endre Boros, RUTCOR, Rutgers University, USA
Peter L. Hammer, RUTCOR, Rutgers University, USA
Toshihide Ibaraki, Kwansei Gakuin University, Japan
INTRODUCTION The logical analysis of data (LAD) is a methodology aimed at extracting or discovering knowledge from data in logical form. The first paper in this area was published as Crama, Hammer, & Ibaraki (1988) and precedes most of the data mining papers appearing in the 1990s. Its primary target is a set of binary data belonging to two classes for which a Boolean function that classifies the data into two classes is built. In other words, the extracted knowledge is embodied as a Boolean function, which then will be used to classify unknown data. As Boolean functions that classify the given data into two classes are not unique, there are various methodologies investigated in LAD to obtain compact and meaningful functions. As will be mentioned later, numerical and categorical data also can be handled, and more than two classes can be represented by combining more than one Boolean function. Many of the concepts used in LAD have much in common with those studied in other areas, such as data mining, learning theory, pattern recognition, possibility theory, switching theory, and so forth, where different terminologies are used in each area. A more detailed description of LAD can be found in some of the references listed in the following and, in particular, in the forthcoming book (Crama & Hammer, 2004).
BACKGROUND

Let us consider a binary dataset (T, F), where T ⊆ {0,1}^n (resp., F ⊆ {0,1}^n) is the set of positive (resp., negative) data (i.e., those belonging to the positive (resp., negative) class). The pair (T, F) is called a partially defined Boolean function (or, in short, a pdBf), which is the principal mathematical notion behind the theory of LAD. A Boolean function f: {0,1}^n → {0,1} is called an extension of the pdBf (T, F) if f(x)=1 for all vectors x in T, and f(y)=0 for all vectors y in F. The construction of extensions that carry the essential information of the given dataset is the main theme of LAD. In principle, any Boolean function that agrees with the given data is a potential extension and is considered in LAD. However, as in general learning theory, simplicity and generalization power of the chosen extensions are the main objectives. To achieve these goals, LAD breaks up the problem of finding a most promising extension into a series of optimization problems, each with its own objective: first finding a smallest subset of the variables needed to distinguish the vectors in T from those in F (finding a so-called support set), next finding all the monomials which have the highest agreement with the given data (finding the strongest patterns), and finally finding a best combination of the generated patterns (finding a theory). In what follows, we shall explain briefly each of these steps. More details about the theory of partially defined Boolean functions can be found in Boros, et al. (1998, 1999).

Table 1. An example for pdBf (T, F)

        x1  x2  x3  x4  x5  x6  x7  x8
   T     0   1   0   1   0   1   1   0
         1   1   0   1   1   0   0   1
         0   1   1   0   1   0   0   1
   F     1   0   1   0   1   0   1   0
         0   0   0   1   1   1   0   0
         1   1   0   1   0   1   0   1
         0   0   1   0   1   0   1   0

Table 2. Some extensions of the pdBf (T, F) given in Table 1

   f1 = x̄5 x̄8 ∨ x5 x8
   f2 = x̄1 x̄5 ∨ x3 x̄7 ∨ x1 x5 x̄7
   f3 = x5 x8 ∨ x6 x7

An example of a binary dataset (or pdBf) is shown in Table 1; Table 2 gives three extensions of it, among many others. Extension f1 may be considered as a most compact one, as it contains only two variables, while extension f3
also may be interesting, as it reveals a monotone dependence on the involved variables.
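As a quick illustration of the definition (this sketch is illustrative only and is not part of any LAD implementation), one can encode the Table 1 data and verify that f1 and f3 satisfy the requirements of an extension:

```python
# Minimal check that a candidate function is an extension of the pdBf (T, F) of Table 1:
# f(x) = 1 for every x in T and f(y) = 0 for every y in F.
T = [(0,1,0,1,0,1,1,0), (1,1,0,1,1,0,0,1), (0,1,1,0,1,0,0,1)]
F = [(1,0,1,0,1,0,1,0), (0,0,0,1,1,1,0,0), (1,1,0,1,0,1,0,1), (0,0,1,0,1,0,1,0)]

def f1(x):   # f1 = x̄5 x̄8 ∨ x5 x8 (1-based variable indices)
    return int((not x[4] and not x[7]) or (x[4] and x[7]))

def f3(x):   # f3 = x5 x8 ∨ x6 x7
    return int((x[4] and x[7]) or (x[5] and x[6]))

def is_extension(f, T, F):
    return all(f(x) == 1 for x in T) and all(f(y) == 0 for y in F)

print(is_extension(f1, T, F), is_extension(f3, T, F))   # True True
```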
MAIN THRUST

In many applications, the available data is not binary, which necessitates the generation of relevant binary features, to which the three-staged LAD analysis can then be applied. Following are some details about all the main stages of LAD, including the generation of binary features, finding support sets, generating patterns, and constructing theories.
Binarization of Numerical and Categorical Data

Many of the existing datasets contain numerical and/or categorical variables. Such data also can be handled by LAD after converting them into binary data (i.e., binarization). A typical method for this is to introduce a cut point αi for a numerical variable xi; if xi ≥ αi holds, the corresponding binary attribute takes value 1, and 0 otherwise. It is possible to use more than one cut point for a variable, creating one or more regions that are mapped to 1. Similar ideas also can be used for a categorical variable, defining a region converted into 1 by a subset of values of the variable. Binarization aims at finding the cut points for numerical attributes (and the subsets of values for categorical attributes) such that the given positive and negative observations can be distinguished with the resulting binary variables, and such that the number of binary variables is as small as possible. Mathematical properties of and algorithms for binarization are studied in Boros, et al. (1997). Efficient methods for the binarization of categorical data are proposed in Boros and Meñkov (2004).
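A minimal sketch of the cut-point idea follows; the midpoint rule used here is an assumption for illustration and is not the exact optimization procedure of Boros, et al. (1997):

```python
# Binarize one numerical attribute: candidate cut points are midpoints between
# consecutive sorted values whose class labels differ; xi >= cut maps to 1.
def candidate_cuts(values, labels):
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            cuts.append((v1 + v2) / 2.0)
    return cuts

def binarize(values, cut):
    return [int(v >= cut) for v in values]

ages = [23, 35, 41, 52, 60]
label = [0, 0, 1, 1, 1]
cuts = candidate_cuts(ages, label)        # -> [38.0]
print(cuts, binarize(ages, cuts[0]))      # [38.0] [0, 0, 1, 1, 1]
```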
Support Sets

It is desirable from the viewpoint of simplicity to build an extension by using as small a set of variables as possible. A subset S of the n original variables is called a support set if the projections of T and F on S still have an extension. The pdBf in Table 1 has the following minimal support sets: S1={5,8}, S2={6,7}, S3={1,2,5}, S4={1,2,6}, S5={2,5,7}, S6={2,6,8}, S7={1,3,5,7}, S8={1,4,5,7}. For example, f1 in Table 2 is constructed from S1. Several methods to find small support sets are discussed and compared in Boros, et al. (2003).
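The definition can be checked directly. The brute-force sketch below (practical only for small n, unlike the methods compared in Boros, et al. (2003)) tests whether a subset S is a support set and enumerates the minimal support sets:

```python
from itertools import combinations

def is_support_set(S, T, F):
    """S: tuple of 1-based variable indices; the projections of T and F on S must not overlap."""
    proj = lambda x: tuple(x[i - 1] for i in S)
    return not ({proj(a) for a in T} & {proj(b) for b in F})

def minimal_support_sets(T, F, n):
    found = []
    for size in range(1, n + 1):                       # smallest subsets first
        for S in combinations(range(1, n + 1), size):
            if is_support_set(S, T, F) and not any(set(M) <= set(S) for M in found):
                found.append(S)
    return found

# With the T and F of the Table 1 sketch above, minimal_support_sets(T, F, 8)
# reproduces {5,8}, {6,7}, {1,2,5}, {1,2,6}, {2,5,7}, {2,6,8}, {1,3,5,7}, {1,4,5,7}.
```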
Pattern Generation

As a basic tool to construct an extension f, a conjunction of a set of literals is called a pattern of (T, F) if there is at least one vector in T satisfying this conjunction but no vector in F satisfies it, where a literal is either a variable or its complement. In the example of Tables 1 and 2, x̄5 x̄8, x5 x8, x̄1 x̄5, … are patterns. The notion of a pattern is closely related to the association rule, which is commonly used in data mining. Each pattern captures a certain characteristic of (T, F) and forms a part of the knowledge about the data set. Several types of patterns (prime, spanned, strong) have been analyzed in the literature (Alexe et al., 2002), and efficient algorithms for enumerating large sets of patterns have been described (G. Alexe & Hammer, 2004; S. Alexe & Hammer, 2004; Eckstein et al., 2004).
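A pattern can likewise be tested directly from its definition, as in the following sketch (illustrative only):

```python
def covers(term, x):
    """term: iterable of (i, v) with 1-based index i and required value v in {0, 1}."""
    return all(x[i - 1] == v for i, v in term)

def is_pattern(term, T, F):
    return any(covers(term, a) for a in T) and not any(covers(term, b) for b in F)

# Using T and F from the Table 1 sketch above:
# is_pattern([(5, 1), (8, 1)], T, F) -> True   (the term x5 x8)
# is_pattern([(1, 0), (5, 0)], T, F) -> True   (the complemented term x̄1 x̄5)
```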
Theories of a pdBf

A set of patterns that together cover T (i.e., for any x∈T, there is at least one pattern that x satisfies) is called a theory of (T, F). Note that a theory defines an extension of (T, F), since the disjunction of the patterns yields such a Boolean function. The Boolean functions fi in Table 2 correspond to theories constructed for the dataset of Table 1. Exchanging the roles of T and F, we can define co-patterns and co-theories, which play a similar role in LAD. Efforts in LAD have been directed to the construction of compact patterns and theories from (T, F), as such theories are expected to perform better in classifying unknown data. An implementation in this direction and its results can be found in Boros, et al. (2000). The tradeoff between comprehensive and comprehensible theories is studied in Alexe, et al. (2002).
FUTURE TRENDS

Extensions by Special Types of Boolean Functions

In many cases, it is known in advance that the given dataset has certain properties, such as monotone dependence on variables. To utilize such information, extensions by Boolean functions with the corresponding properties are important. For special classes of Boolean functions, such as monotone (or positive), Horn, k-DNF, decomposable, and threshold, the algorithms and complexity of finding extensions were investigated (Boros et al., 1995; Boros et al., 1998). If there is no extension in the specified class of Boolean functions, we may still want to find an extension in the class with the minimum number of errors. Such an extension is called the best-fit extension and is studied in Boros, et al. (1998).
Logical Analysis of Data
Imperfect Datasets

In practice, the available data may contain missing or erroneous bits, due to various reasons (e.g., obtaining their values is dangerous or expensive). More typically, the binary attributes obtained by binarization can involve some uncertainty; for example, for a numerical variable xi the binary attribute xi ≥ αi may not be evaluated with high confidence if the actual value of xi is too close to the value of the cut point αi (e.g., if it is within the error rate of the measurement procedure used to obtain the value of xi). For binary datasets involving such missing or uncertain bits, three types of extensions (robust, consistent, and most robust) were defined. For example, a robust extension is one that remains an extension of (T, F) for any assignment of 0, 1 values to the missing bits. Theoretical and algorithmic studies in this direction are made in Boros, et al. (1999, 2003).
CONCLUSION

The LAD approach has been applied successfully to a number of real-world datasets. Among the published results, Boros, et al. (2000) deals with datasets from the repository of the University of California at Irvine (http://www.ics.uci.edu/~mlearn/MLRepository.html), as well as some other pilot studies. LAD has been applied successfully to several medical problems, including the detection of ovarian cancer (Alexe et al., 2004), stratification of risk among cardiac patients (Alexe et al., 2003), and polymer design for artificial bones (Abramson et al., 2002). Among other applications of LAD, we mention those concerning economic problems, including an analysis of labor productivity in China (Hammer et al., 1999) and a recent analysis of country risk ratings (Hammer et al., 2003).
REFERENCES Abramson, S., Alexe, G., Hammer, P.L., Knight, D., & Kohn, J. (2002). A computational approach to predicting cell growth on polymeric biomaterials. To appear in J. Biomed. Mater. Res. Alexe, G. et al. (2004). Ovarian cancer detection by logical analysis of proteomic data. Proteomics, 4(3), 766-783. Alexe, G., Alexe, S., Hammer, P.L., & Kogan, A. (2002). Comprehensive vs. comprehensible classifiers in LAD. To appear in Discrete Applied Mathematics. Alexe, G., & Hammer, P.L. (2004). Spanned patterns in the logical analysis of data. Discrete Applied Mathematics.
Alexe, S. et al. (2003). Coronary risk prediction by logical analysis of data. Annals of Operations Research, 119, 15-42. Alexe, S., & Hammer, P.L. (2004). Accelerated algorithm for pattern detection in logical analysis of data. Discrete Applied Mathematics. Boros, E. et al. (2000). An implementation of logical analysis of data. IEEE Trans. on Knowledge and Data Engineering, 12, 292-306. Boros, E., Gurvich, V., Hammer, P.L., Ibaraki, T., & Kogan, A. (1995). Decompositions of partially defined Boolean functions. Discrete Applied Mathematics, 62, 51-75. Boros, E., Hammer, P.L., Ibaraki, T., & Kogan, A. (1997). Logical analysis of numerical data. Mathematical Programming, 79, 163-190. Boros, E., Horiyama, T., Ibaraki, T., Makino, K., & Yagiura, M. (2003). Finding essential attributes from binary data. Annals of Mathematics and Artificial Intelligence, 39, 223-257. Boros, E., Ibaraki, T., & Makino, K. (1998). Error-free and best-fit extensions of a partially defined Boolean function. Information and Computation, 140, 254-283. Boros, E., Ibaraki, T., & Makino, K. (1999). Logical analysis of data with missing bits. Artificial Intelligence, 107, 219-264. Boros, E., Ibaraki, T., & Makino K. (2003). Variations on extending partially defined Boolean functions with missing bits. Information and Computation, 180(1), 53-70. Boros, E., & Meñkov, V. (2004). Exact and approximate discrete optimization algorithms for finding useful disjunctions of categorical predicates in data analysis. Discrete Applied Mathematics, 144(1-2), 43-58. Crama, Y., & Hammer, P.L. (2004). Boolean Functions, forthcoming. Crama, Y., Hammer, P.L., & Ibaraki, T. (1988). Causeeffect relationship and partially defined Boolean functions. Annals of Operations Research, 16, 299-326. Eckstein, J., Hammer, P.L., Liu, Y., Nediak, M., & Simeone, B. (2004). The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3), 285-298. Hammer, A., Hammer, P.L., & Muchnik, I. (1999). Logical analysis of Chinese labor productivity patterns. Annals of Operations Research, 87, 165-176.
Hammer, P.L., Kogan, A., & Lejeune, M.A. (2003). A nonrecursive regression model for country risk rating. RUTCOR Research Report RRR 9-2003.
KEY TERMS

Binarization: The process of deriving a binary representation for numerical and/or categorical attributes.

Boolean Function: A function from {0,1}^n to {0,1}. A function from a subset of {0,1}^n to {0,1} is called a partially defined Boolean function (pdBf). A pdBf is defined by a pair of datasets (T, F), where T (resp., F) denotes a set of data vectors belonging to the positive (resp., negative) class.

Extension: A Boolean function f that satisfies f(x)=1 for x∈T and f(x)=0 for x∈F for a given pdBf (T, F).
LAD (Logical Analysis of Data): A methodology that tries to extract and/or discover knowledge from datasets by utilizing the concept of Boolean functions.

Pattern: A conjunction of literals that is true for some data vectors in T but is false for all data vectors in F, where (T, F) is a given pdBf. A co-pattern is similarly defined by exchanging the roles of T and F.

Support Set: A set of variables S for a data set (T, F) such that the projections of T and F on S still have an extension.

Theory: A set of patterns of a pdBf (T, F), such that each data vector in T has a pattern satisfying it.
The Lsquare System for Mining Logic Data

Giovanni Felici, Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy
Klaus Truemper, University of Texas at Dallas, USA
INTRODUCTION

The method described in this chapter is designed for data mining and learning on logic data. This type of data is composed of records that can be described by the presence or absence of a finite number of properties. Formally, such records can be described by variables that may assume only the values true or false, usually referred to as logic (or Boolean) variables. In real applications, it may also happen that the presence or absence of some property cannot be verified for some record; in such a case we consider that variable to be unknown (the capability to formally treat data with missing values is a feature of logic-based methods). For example, to describe patient records in medical diagnosis applications, one may use the logic variables healthy, old, has_high_temperature, among many others. A very common data mining task is to find, based on training data, the rules that separate two subsets of the available records, or explain the belonging of the data to one subset or the other. For example, one may desire to find a rule that, based on the many variables observed in patient records, is able to distinguish healthy patients from sick ones. Such a rule, if sufficiently precise, may then be used to classify new data and/or to gain information from the available data. This task is often referred to as machine learning or pattern recognition and accounts for a significant portion of the research conducted in the data mining community. When the data considered is in logic form or can be transformed into it by some reasonable process, it is of great interest to determine explanatory rules in the form of combinations of logic variables, or logic formulas. In the example above, a rule derived from data could be: if (has_high_temperature is true) and (running_nose is true) then (the patient is not healthy). Clearly such rules convey a lot of information and can be easily understood and interpreted by domain experts. Despite the apparent simplicity of this setting, the problem of determining, if possible, a logic formula
that holds true for all the records in one set while it is false for all records in another set, can become extremely difficult when the dimension involved is not trivial, and many different techniques and approaches have been proposed in the literature. In this article we describe one of them, the Lsquare System, developed in collaboration between IASI-CNR and UTD and described in detail in Felici & Truemper (2002), Felici, Sun, & Truemper (2004), and Truemper (2004). The system is freely distributed for research and study purposes at www.leibnizsystem.com. Data mining in logic domains is becoming a very interesting topic for both research and applications. The motivations for the study of such models are frequently found in real life situations where one wants to extract usable information from data expressed in logic form. Besides medical applications, these types of problems often arise in marketing, production, banking, finance, and credit rating. A quick scan of the updated Irvine Repository (see Murphy & Aha, 1994) is sufficient to show the relevance of logic-based models in data mining and learning applications. The literature describes several methods that address learning in logic domains, for example the very popular decision trees (Breiman et al., 1984), the highly combinatorial approach of Boros et al. (1996), the interior point method of Kamath et al. (1992), or the partial enumeration scheme proposed by Triantaphyllou & Soyster (1996). While the problem formulation adopted by Lsquare is somewhat related to the work in Kamath et al. (1992) and Triantaphyllou et al. (1994), substantial differences are found in the solution method adopted. Most of the methods considered in this area are of an intrinsically deterministic nature, being based on the formal description of a problem in mathematical form and on its solution by a specific algorithm. Nevertheless, some real life situations present uncertainty and errors in the data that are often successfully dealt with by the use of fuzzy set and fuzzy membership theory. In such cases the proposed system may embed the uncertainty and the fuzziness of the data in a pre-processing step, providing fuzzy functions that determine the value of the Boolean variables.
BACKGROUND This section provides the needed definitions and concepts.
Propositional Logic, SAT and MINSAT A Boolean variable may take on the values true or false. One or more Boolean variables can be assembled in a Boolean formula using the operators ¬, ∧, and ∨. A Boolean formula where Boolean variables, possibly negated, are joined by the operator ∧ (resp. ∨) is called a conjunction (resp. disjunction). A conjunctive normal form system (CNF) is a Boolean formula where some disjunctions are joined with the operator ∧. A disjunctive normal form system (DNF) is a Boolean formula where some conjunctions are joined with the operator ∨. Each disjunction (resp. conjunction) of a CNF (resp. DNF) system is called a clause. A Boolean formula is satisfiable if there exists an assignment of true/false values to its Boolean variables so that the value of the formula is true. The problem of deciding whether a CNF formula is satisfiable is known as the satisfiability problem (SAT). In the affirmative case, one also must find an assignment of true/false values for the Boolean variables of the CNF that make the formula true. A variation of SAT is the minimum cost satisfiability problem (MINSAT), where rational costs are associated with the true value of each logic variable and the solution to be determined has minimum sum of costs for the variables with value true.
Logic Data and Logic Separation We consider {0, +/-1} vectors of given length n, each of which has an associated outcome with value true or false. We call these vectors records of logic data and view them as an encoding of logic information. A 1 in a record means that a certain Boolean variable has value true, and a –1 that the variable has value false. The value 0 is used for unknown. The outcome is considered to be the value of a Boolean variable t that we want to explain or predict. We collect the records for which the property t is present in a set A, and those for which t is not present in set B. For ease of recognition, we usually denote a member of A by a, and of B by b. The Lsquare system deduces {0, +/-1} separating vectors that effectively represent logic formulas and may be used to compute for each record the associated outcome, that is, to separate the records in A from the records in B. A separating set is a collection of separating vectors. The separation of A and B makes sense only when both A and B are non-empty, and when each record of A or B contains at least one {+/-1} entry. Consider two records 694
of logic data, say g and f. We say that f is nested in g if, for any entry fi of f equal to +1 or –1, the corresponding entry gi of g satisfies gi = fi. It is easy to show that if A and B are sets of {0, +/-1} records of the same length, a separating set S exists if and only if no record a ∈ A is nested in any record b ∈ B. A clear characterization of separating vectors can thus be given: a {0, +/-1} vector s separates a record a ∈ A from B if:

   s is not nested in any b ∈ B      (1)
   s is nested in a                  (2)

Accordingly, we say that a separating set S separates B from A if, for each a ∈ A, there exists a separating vector s ∈ S that separates a from B.
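The nestedness relation and the separation conditions (1) and (2) translate directly into code; the sketch below is illustrative and is not the Lsquare implementation:

```python
def nested(f, g):
    """f is nested in g if every nonzero entry of f is matched exactly in g."""
    return all(fi == 0 or fi == gi for fi, gi in zip(f, g))

def separates(s, a, B):
    """Conditions (1) and (2): s is not nested in any b of B, and s is nested in a."""
    return not any(nested(s, b) for b in B) and nested(s, a)
```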
MAIN THRUST

The problem of finding a separating set for A and B is decomposed into a sequence of subproblems, each of which identifies a vector s that separates a nonempty subset of A from B by solving two minimization problems. To formulate these problems we introduce Boolean variables linked with the elements si of the vector s to be found. More precisely, we introduce Boolean variables pi and qi and state that si = 1 if pi = True and qi = False, si = −1 if pi = False and qi = True, and si = 0 if pi = qi = False. The case pi = qi = True is ruled out by enforcing the following conditions:

   ¬pi ∨ ¬qi,   i = 1, 2, ..., n      (3)
We can express the separation conditions (1) and (2) with the Boolean variables pi and qi. For (1) we have that s must not be nested in any b ∈ B. Defining b+ as the set of indices i for which bi equals 1, that is, b+ = {i | bi = 1}, and similarly b- = {i | bi = -1} and b0 = {i | bi = 0}, we summarize condition (1) by writing

   (∨i∈(b+ ∪ b0) qi) ∨ (∨i∈(b- ∪ b0) pi),   ∀ b ∈ B      (4)
For condition (2) we have to enforce that, if s separates a from B, then s is nested in a. In order to do so we introduce a new Boolean variable da that determines whether s must separate a from B. That is, da = true means that s need not separate a from B, while da = false requires that separation. For a ∈ A, the separation condition is:

   ¬qi ∨ da,   ∀ i ∈ (a+ ∪ a0)
   ¬pi ∨ da,   ∀ i ∈ (a- ∪ a0)      (5)
We now formulate the problem of determining a vector s that separates as many a ∈ A from B as possible (this amounts to a satisfying solution for (3)-(5) that assigns value true to as few variables da as possible). For each a ∈ A, define a rational cost ca that is equal to 1 if da is true, and equal to 0 otherwise. Using these costs and (3)-(5), the desired s may be found by solving the following MINSAT problem, with variables da, a ∈ A, and pi, qi, i = 1, 2, …, n:

   min  Σa∈A ca
   s.t. ¬pi ∨ ¬qi,                                     ∀ i = 1, 2, ..., n
        (∨i∈(b+ ∪ b0) qi) ∨ (∨i∈(b- ∪ b0) pi),         ∀ b ∈ B
        ¬qi ∨ da,                                      ∀ a ∈ A, ∀ i ∈ (a+ ∪ a0)
        ¬pi ∨ da,                                      ∀ a ∈ A, ∀ i ∈ (a- ∪ a0)      (6)
The solution of problem (6) identifies an s and a largest subset A’ = {a ∈ A | da = False} that is separated from B by s. Restricting the problem to A’ and using costs c(pi) and c(qi) associated with the variables pi and qi, we define a second objective function and solve the problem again, obtaining a separating vector whose properties depend on the c(pi) and c(qi). A simple example of the role of the cost values c(pi) and c(qi) is the following. Assume that c(pi) and c(qi) assign a cost of 1 when pi and qi are true and a cost of 0 when they are false. The separating vector will then use the minimum number of logic variables, that is, the minimum amount of information contained in the data to separate the sets. Conversely, if we assign cost 0 for true and 1 for false, it will use the maximum amount of information to define the separating sets. If one separating vector is not sufficient to separate all of A, we replace A with A − A’ and iterate. The disjunction of all separating vectors constitutes the final logic formula that separates A from B.
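To make formulation (6) concrete, the sketch below (a didactic construction, not the Leibniz System) generates its clauses for a given pair of record sets A and B; the resulting clauses and costs would then be handed to a MINSAT or weighted MaxSAT solver:

```python
# Build the clauses of problem (6) for {0,+1,-1} records of length n.
# Literal encoding: ('p', i), ('q', i), ('d', a_index); negation wrapped as ('not', literal).
def neg(lit):
    return ('not', lit)

def minsat_clauses(A, B, n):
    clauses = []
    for i in range(1, n + 1):                                   # (3): not both p_i and q_i
        clauses.append([neg(('p', i)), neg(('q', i))])
    for b in B:                                                 # (4): s not nested in b
        clause = [('q', i) for i in range(1, n + 1) if b[i - 1] >= 0]
        clause += [('p', i) for i in range(1, n + 1) if b[i - 1] <= 0]
        clauses.append(clause)
    for a_idx, a in enumerate(A):                               # (5): s nested in a, unless d_a
        for i in range(1, n + 1):
            if a[i - 1] >= 0:
                clauses.append([neg(('q', i)), ('d', a_idx)])
            if a[i - 1] <= 0:
                clauses.append([neg(('p', i)), ('d', a_idx)])
    costs = {('d', a_idx): 1 for a_idx in range(len(A))}        # minimize the number of true d_a
    return clauses, costs
```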
The Leibniz System

The MINSAT problems considered belong to the class of NP-hard problems, which means, briefly, that no algorithm is known that solves them in time bounded by a polynomial function of the size of the input, as explained by the modern theory of computational complexity (Garey & Johnson, 1979). We solve the MINSAT instances with the Leibniz System, a logic programming solver developed at the University of Texas at Dallas. The solver is based on decomposition and combinatorial optimization results described in Truemper (1998).
Error Control and Voting System

In learning problems one views the sets A and B as training sets, and considers them to be subsets of two larger sets A* and B*, where A* consists of all {0, +/-1} records of length n with property t, and B* consists of all such records without
property t. One then determines a set S that separates B from A, and uses that set to guess whether a new vector r is in A* or B*. That is, we guess r to be in A* if at least one s ∈ S is nested in r, and to be in B* otherwise. Of course, the classification of r based on S is correct if r is in A or B, but otherwise it need not be correct. Specifically, we may guess a record of A* − A to be in B*, and a record of B* − B to be in A*. Usually such errors are referred to as type α errors and type β errors, respectively. The utility of S depends on how often each type of error is made. In some settings, an error of one of the two types is bad, but an error of the other type is worse. For example, a noninvasive diagnostic system for cancer that claims a case to be benign when a malignancy is present has failed badly. On the other hand, prediction of a malignancy for an actually benign case triggers additional tests, and thus is annoying but not nearly as objectionable as an error of the first type. We can influence the extent of type α and type β errors by an appropriate choice of the objective function c(pi) and c(qi). As anticipated, when c(pi) = c(qi) = 1 for all i = 1, …, n, the formulas determined by the solution of the sequence of MINSAT problems will have minimal support, that is, will try to use the minimum number of variables to separate A’ from B. On the other hand, setting c(pi) = c(qi) = -1 for all i = 1, …, n, we will obtain the opposite effect, that is, a formula with maximum support. If we use a single vector s’ to classify a vector r, then we guess r to be in A* if s’ is nested in r. The latter condition tends to become less stringent when the number of nonzero entries in s’ is reduced. Hence, we heuristically guess that a solution vector s’ with minimum support tends to avoid type α errors. Conversely, an s’ with maximum support tends to avoid type β errors. We apply this heuristic argument to the separating set S produced under one of the two choices of objective functions, and thus expect that a set S with minimum (resp. maximum) support tends to avoid type α (resp. β) errors. The above considerations are combined with a sophisticated voting scheme embedded in Lsquare, partly inspired by the notion of stacked generalization originally described in Wolpert (1992) (see also Breiman, 1996). This scheme refines the separating procedure described in the previous section through a resampling technique applied to the logic records available for training. For each subsample we determine 4 separating formulas, by switching the roles of A and B in the MINSAT formulation and by switching the signs of the cost coefficients. Then, each formula so determined is used to produce a vote for each element of A and B, the vote being 1 (resp. –1) if the element is recognized as belonging to A (resp. B). The sum of all the votes so obtained is the vote total V. If the vote V is positive (resp. negative), then the record belongs to A (resp. to B). If A and B are representative subsets of A* and B*, then the
votes V for records of A* (resp. of B*) tend to be positive (resp. negative). To assess the reliability of the classification of records of A* and B*, Lsquare computes two estimated probability distributions. For any odd integer k, the first (resp. second) distribution supplies an estimate α’ (resp. β’) of the probability that the vote V for a record of A* (resp. B*) is less (resp. greater) than k. For example, if one declares records with votes V greater than a given k to be in A* and the remaining records to be in B*, then α’ is an estimate of the probability that a record of A* will be misclassified. The threshold k can be chosen in such a way that the total estimated error (α’+β’) is small but also balanced in the desired direction, according to which error is more serious for the specific application. An extensive treatment of this feature and of its statistical and mathematical background may be found in Felici, Sun, & Truemper (2004).
FUTURE TRENDS

Data mining in a logic setting presents several interesting features, among which is the interpretability of its results, that are making it an increasingly appreciated tool. The mathematical problems generated by this type of data mining are typically very complex and require very powerful computing machines and sophisticated solution approaches. All efforts made to design and test new efficient algorithmic frameworks to solve the mathematical problems related to data mining in logic will thus be a rewarding topic for future research. Moreover, the size of the available databases is increasing at a high rate, and these methods suffer both from the computational complexity of the algorithms and from the explosion of the dimensions often related to the conversion of rational variables into Boolean ones. Performing data mining in a logic setting in an effective and scalable way is thus an exciting challenge. Only recently have researchers started to consider distributed and grid computing as a potential key feature to win it.
CONCLUSION

In this chapter we have considered a particular type of data mining and learning problem and have described one method to solve such problems. Despite their intrinsic difficulty, these problems are of high interest for their applicability and for their theoretical properties. For brevity, we have omitted many details and also results and comparisons with other methods on real life and test instances. Extensive information on the behavior of Lsquare on many data sets coming from the Irvine Repository (Murphy & Aha, 1994) is available in Felici & Truemper (2002) and Felici, Sun, and Truemper (2004).

REFERENCES

Boros, E. et al. (1996). An implementation of logical analysis of data. RUTCOR Research Report 29-96. NJ: Rutgers University.

Breiman, L. (1996). Stacked regressions. Machine Learning, 24, 49-64.

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and regression trees. Wadsworth International.

Felici, G., Sun, F-S., & Truemper, K. (2004). A method for controlling errors in two-class classifications. IASI Technical Report n. 980, November, 1998.

Felici, G., & Truemper, K. (2002). A Minsat approach for learning in logic domains. INFORMS Journal on Computing, 14(1), 20-36.

Garey, M.R., & Johnson, D.S. (1979). Computers and intractability: A guide to the theory of NP-completeness. San Francisco: Freeman.

Kamath, A.P., Karmarkar, N.K., Ramakrishnan, K.J., & Resende, M.G.C. (1992). A continuous approach to inductive inference. Mathematical Programming, 57, 215-238.

Murphy, P.M., & Aha, D.W. (1994). UCI repository of machine learning databases: Machine readable data repository. Department of Computer Science, University of California, Irvine.

Triantaphyllou, E., Allen, L., Soyster, L., & Kumara, S.R.T. (1994). Generating logical expressions from positive and negative examples via a branch-and-bound approach. Computers and Operations Research, 21, 185-197.

Triantaphyllou, E., & Soyster, L. (1996). On the minimum number of logical clauses inferred from examples. Computers and Operations Research, 21, 783-799.

Truemper, K. (1998). Effective logic computation. New York: Wiley-Interscience.

Truemper, K. (2004). Design of logic-based intelligent systems. New York: Wiley.

Truemper, K. (2004). The Leibniz system for logic programming. Version 7.1 Leibniz. Plano, Texas.

Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5, 241-259.
KEY TERMS
Logic Data Mining: The application of data mining techniques where both data and extracted information are expressed by logic variables.
Classification Error: Number of elements that are classified in the wrong class by a classification rule. In two class problems, the classification error is divided into the so-called false positive and false negative.
MINSAT: Given a Boolean expression E and a cost function on the variables, determine, if one exists, an assignment to the variables in E such that E is true and the sum of the costs is minimum.
Combinatorial Optimization: Branch of optimization in applied mathematics and computer science related to algorithm theory and computational complexity theory.
Propositional Logic: A system of symbolic logic based on the symbols “and,” “or,” “not” to stand for whole propositions and logical connectives. It only considers whether a proposition is true or false.
Error Control: Selection of a classification rule in such a way as to obtain a desired proportion of false positives and false negatives.
Voting: Classification method based on the combination of several different classifiers obtained by different methods and/or different data.
Marketing Data Mining

Victor S.Y. Lo, Fidelity Personal Investments, USA
INTRODUCTION

Data mining has been widely applied over the past two decades. In particular, marketing is an important application area. Many companies collect large amounts of customer data to understand their customers’ needs and predict their future behavior. This article discusses selected data mining problems in marketing and provides solutions and research opportunities.

BACKGROUND

Analytics are heavily used in two marketing areas: market research and database marketing. The former addresses strategic marketing decisions through the analysis of survey data, and the latter handles campaign decisions through the analysis of behavioral and demographic data. Due to the limited sample size of a survey, market research normally is not considered data mining. This article focuses on database marketing, where data mining is used extensively to maximize marketing return on investment by finding the optimal targets. A typical application is developing response models to identify likely campaign responders. As summarized in Figure 1, a previous campaign provides data on the dependent variable (responded or not), which is merged with individual characteristics, including behavioral and demographic variables, to form an analyzable data set. A response model is then developed to predict the response rate given the individual characteristics. The model is then used to score the population and predict response rates for all individuals. Finally, the best list of individuals will be targeted in the next campaign in order to maximize effectiveness and minimize expense. Response modeling can be applied in the following activities (Peppers & Rogers, 1997, 1999):

(1) Acquisition: Which prospects are most likely to become customers?
(2) Development: Which customers are most likely to purchase additional products (cross-selling) or to add monetary value (up-selling)?
(3) Retention: Which customers are most retainable? This can be relationship or value retention.
MAIN THRUST Standard database marketing problems have been described in literature (e.g., Hughes, 1994; Jackson & Wang, 1996; and Roberts & Berger, 1999). In this article, I describe the problems that are infrequently mentioned in the data mining literature as well as their solutions and research opportunities. These problems are embedded in various components of the campaign process, from campaign design to response modeling to campaign optimization; see Figure 2. Each problem is described in the Problem-Solution-Opportunity format.
Figure 1. Response modeling process: Collect data → Develop model → Score population → Select best targets

Figure 2. Database marketing campaign process: Design → Execute → Measure → Model → Optimize

CAMPAIGN DESIGN

The design of a marketing campaign is the starting point of a campaign process. It often does not receive enough attention in data mining. If a campaign is not designed properly, postcampaign learning can be infeasible (e.g., insufficient sample size). On the other hand, if a campaign is scientifically designed, learning opportunities can be maximized. The design process includes activities such as determining the sample size for treatment and control groups (both have to be sufficiently large such that measurement and modeling, when required, are feasible), deciding on sampling methods (pure or stratified random sampling), and creating a cell design structure (testing various offers and also by age, income, or other variables). I focus on the latter here.
Problem 1

Classical designs often test one variable at a time. For example, in a cell phone direct mail campaign, you may test a few price levels of the phone. After launching the campaign and uncovering the price level that led to the highest revenue, another campaign is launched to test a monthly fee, a third campaign tests the direct mail message, and so forth. A more efficient way is to structure the cell design such that all these variables are testable in one campaign. Consider an example: A credit card company would like to determine the best combination of treatments for each prospect; the treatment attributes and attribute levels are summarized in Table 1. The number of all possible combinations = 4^4 x 2^2 = 1024 cells, which is not practical to test.
Solution 1 To reduce the number of cells, a fractional factorial design can be applied (full factorial refers to the design that includes all possible combinations); see Montgomery (1991) and Almquist and Wyner (2001). Two types of fractional factorials are a) an orthogonal design where all attributes are made orthogonal (uncorrelated) with each other and b) an optimal design where a certain criterion related to the variance-covariance matrix of parameter estimates is optimized; see Kuhfeld (1997, 2004) for the applications of SAS PROC FACTEX and PROC OPTEX in market research. (Kuhfeld’s market research applications are also applicable to database marketing). For the preceding credit card problem, an orthogonal fractional factorial design using PROC FACTEX in
SAS with estimable main effects, all two-way interaction effects, and quadratic effects on quantitative variables generates a design of 256 cells, which may still be considered large. An optimal design using PROC OPTEX with the same estimable effects generates a design of only 37 cells (see Table 2; refer to Table 1 for attribute level definitions). Fractional factorial design has been used in credit card acquisition but is not widely used in other industries for marketing. Two reasons are (a) a lack of experimental design knowledge and experience and (b) business process requirement — tight coordination with list selection and creative design professionals is required.
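For readers without access to SAS, the following rough sketch (an assumed greedy exchange on log det(X'X) for a dummy-coded main-effects model, not PROC FACTEX or PROC OPTEX) illustrates how a small D-optimal-style fraction of the 1024-cell full factorial can be searched for:

```python
# Greedy exchange heuristic for a D-optimal-ish fraction of the 4x4x4x4x2x2 factorial.
import itertools
import numpy as np

levels = [4, 4, 4, 4, 2, 2]                    # APR, credit limit, color, rebate, brand, creative
full = np.array(list(itertools.product(*[range(k) for k in levels])))

def model_matrix(cells):
    cols = [np.ones(len(cells))]               # intercept
    for attr, k in enumerate(levels):
        for lev in range(1, k):                # dummy code non-reference levels
            cols.append((cells[:, attr] == lev).astype(float))
    return np.column_stack(cols)

X = model_matrix(full)

def log_det(rows):
    m = X[rows].T @ X[rows] + 1e-8 * np.eye(X.shape[1])   # small ridge avoids singularity
    return np.linalg.slogdet(m)[1]

def greedy_design(n_cells, n_passes=3, seed=0):
    rng = np.random.default_rng(seed)
    design = list(rng.choice(len(full), n_cells, replace=False))
    for _ in range(n_passes):
        for pos in range(n_cells):
            best = log_det(design)
            for cand in rng.choice(len(full), 200, replace=False):   # sampled exchange candidates
                trial = design[:pos] + [cand] + design[pos + 1:]
                if log_det(trial) > best:
                    design, best = trial, log_det(trial)
    return full[design]

# design = greedy_design(37)   # a 37-cell fraction, comparable in size to Table 2
```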
Opportunity 1 Mayer and Sarkissien (2003) proposed using individual characteristics as attributes in the optimal design, where individuals are chosen “optimally.” Using both individual characteristics and treatment attributes as design attributes is theoretically interesting. In practice, we should compare this optimal selection of individuals with stratified random sampling. Simulation and theoretical and empirical studies are required to evaluate this idea. Additionally, if many individual variables (say, hundreds) are used in the design, then constructing an optimal design may be very computationally intensive due to the large design matrix and thus the design may require a unique optimization technique to solve.
Table 1. Example of design attributes and attribute levels

   Attribute       Attribute level
   APR             5.9% (0), 7.9% (1), 10.9% (2), 12.9% (3)
   Credit limit    $2,000 (0), $5,000 (1), $7,000 (2), $10,000 (3)
   Color           Platinum (0), Titanium (1), Ruby (2), Diamond (3)
   Rebate          None (0), 0.5% (1), 1% (2), 1.5% (3)
   Brand           SuperCard (0), AdvantagePlus (1)
   Creative        A (0), B (1)
Table 2. An optimal design example

   Cell  APR  Limit  Rebate  Color  Brand  Creative
     1    0     0      0       3      0       1
     2    0     0      3       3      1       1
     3    0     0      3       3      0       0
     4    0     0      0       2      1       1
     5    0     0      0       2      0       0
     6    0     0      0       1      0       1
     7    0     0      0       0      1       0
     8    0     0      3       0      0       1
     9    0     1      3       2      1       0
    10    0     3      0       3      1       0
    11    0     3      3       3      0       1
    12    0     3      3       2      0       1
    13    0     3      2       1      1       1
    14    0     3      3       1      0       0
    15    0     3      0       0      0       1
    16    0     3      3       0      1       1
    17    1     0      3       1      1       0
    18    1     2      0       1      1       0
    19    1     3      2       0      0       0
    20    2     0      1       0      1       –
    21    2     1      3       1      1       1
    22    2     3      0       1      0       1
    23    3     0      0       3      1       0
    24    3     0      3       3      0       1
    25    3     0      3       2      1       1
    26    3     0      3       2      0       0
    27    3     0      0       1      1       1
    28    3     0      0       0      0       0
    29    3     0      3       0      1       0
    30    3     1      0       2      0       1
    31    3     1      2       1      0       0
    32    3     3      0       3      1       1
    33    3     3      0       3      0       0
    34    3     3      3       3      1       0
    35    3     3      0       2      1       0
    36    3     3      0       0      1       0
    37    3     3      3       0      0       1

RESPONSE MODELING

Problem 2

As stated in the introduction, response modeling uses data from a previous marketing campaign to identify
likely responders given individual characteristics (Berry & Linoff, 1997, 2000; Jackson & Wang, 1996; Rud, 2001). Treatment and control groups were set up in the campaign for measurement. The methodology typically uses treatment data to identify characteristics of customers who are likely to respond. In other words, for individual i∈W, P(Y i = 1| Xi; treatment) = f(Xi),
(1)
where W represents all the individuals in the campaign, Yi is the dependent variable (defined by the campaign’s call to action, indicating whether or not those people responded), Xi is a vector of independent variables that represent individual characteristics, and f(.) is the functional form. For examples, in logistic regression, f(.) is a logistic function of Xi (Hosmer & Lemeshow, 1989); in decision trees, f(.) is a step function of Xi (Zhang & Singer, 1999); and in neural network, f(.) is a nonlinear function of Xi (Haykin, 1999). A problem may arise when we apply Model (1) to the next campaign. Depending on the industry and product, the model may select individuals who are likely to respond regardless of the marketing campaign. The results of the new campaign may show equal or similar (treatment minus control) performance in model and random cells; see Table 3 for an example.
Table 3. Sample campaign measurement of model effectiveness (response rates)

            Treatment (e.g., mail)   Control (e.g., no mail)   Incremental (treatment minus control)
   Model    1%                       0.8%                      0.2%
   Random   0.5%                     0.3%                      0.2%
Solution 2

The key issue is that fitting Model (1) and applying it to find the best individuals is inappropriate. The correct way is to find those who are most positively influenced by the campaign. Doing so requires you to predict the lift, P(Yi = 1 | Xi; treatment) – P(Yi = 1 | Xi; control), for each i and then select those with the highest values of estimated lift. Alternative solutions to this problem are:

(1) Fitting two separate treatment and control models: This solution is relatively straightforward except that estimated lift can be sensitive to statistically insignificant differences in parameters of the treatment and control models.

(2) Using a dummy variable and interaction effects: Lo (2002) proposed using the independent variables Xi, Ti, and Xi*Ti, where Ti equals 1 if i is in the treatment group and 0 otherwise, and modeling the response rate by using a logistic regression:

   Pi = exp(α + β'Xi + γTi + δ'XiTi) / [1 + exp(α + β'Xi + γTi + δ'XiTi)]      (2)
In Equation 2, α denotes the intercept, β is a vector of parameters measuring the main effects of the independent variables, γ denotes the main treatment effect, and δ measures additional effects of the independent variables due to treatment. (See Russek-Cohen and Simon (1997) for similar problems in biomedicine.) To predict the lift, use the following equation:
   Pi | treatment − Pi | control
   = exp(α + γ + β'Xi + δ'Xi) / [1 + exp(α + γ + β'Xi + δ'Xi)] − exp(α + β'Xi) / [1 + exp(α + β'Xi)]      (3)
That is, we can use the parameter estimates from Model (2) in Equation (3) to predict the lift for each i in a new data set. Individuals with high positive predicted lift values will be selected for the next campaign. We can apply a similar technique of using a dummy variable and interactions in other supervised learning algorithms, such as neural networks. See Lo (2002) for a detailed description of the methodology.

(3) Decision tree approach: To maximize lift, the only known commercial software is Quadstone, based on decision trees where each split at every parent node is processed to maximize the difference between lift (treatment minus control response rates) in the left and right children nodes; see Radcliffe and Surry (1999) for their Uplift Analysis.

Opportunity 2

(1) Alternative modeling methods for this lift-based problem should exist, and readers are encouraged to uncover them. (2) As in typical data mining, the number of potential predictors (independent variables) is large. Variable reduction techniques are available for standard problems (i.e., finding the best subset associated with the dependent variable), but a unique variable reduction technique for the lift-based problem is needed (i.e., finding the best subset associated with the lift).

Problem 3

An extension of Problem 2 is the presence of multiple treatments. Data can be obtained from an experimentally designed campaign where multiple treatment attributes were tested. The question is how to incorporate treatment attributes and individual characteristics in the same response model.

Solution 3

An extension of Equation (2) is to incorporate both treatment attributes and individual characteristics in the same lift-based response model:

   Pi = exp(α + β'Xi + γTi + δ'XiTi + λ'Zi + θ'ZiXi) / [1 + exp(α + β'Xi + γTi + δ'XiTi + λ'Zi + θ'ZiXi)]      (4)

where Zi is the vector of treatment attributes, and λ and θ are additional parameters (Zi = 0 if Ti = 0).
In practice, Equation (4) can have many variables including many interaction effects, and thus multicollinearity may be a concern. Developing two separate treatment and control models may be more practical. Also, unlike the single-treatment problem, no commercial software is known to solve this problem directly.
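For readers who want to experiment, the following sketch (an illustration under assumed data layouts, not Lo's original implementation) fits the single-treatment model of Equation (2) with scikit-learn and scores lift as in Equation (3); the multiple-treatment model of Equation (4) would simply add the Zi and Zi×Xi columns to the design matrix:

```python
# Dummy-variable/interaction lift model: fit one logistic regression on [X, T, X*T]
# and score lift as P(Y=1 | X, T=1) - P(Y=1 | X, T=0).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lift_model(X, treat, y):
    """X: (n, p) covariates; treat: (n,) 0/1 treatment flag; y: (n,) 0/1 response."""
    T = treat.reshape(-1, 1)
    design = np.hstack([X, T, X * T])            # main effects, treatment dummy, interactions
    return LogisticRegression(max_iter=1000).fit(design, y)

def predict_lift(model, X_new):
    n = X_new.shape[0]
    ones, zeros = np.ones((n, 1)), np.zeros((n, 1))
    d_treat = np.hstack([X_new, ones, X_new * ones])     # as if treated
    d_ctrl = np.hstack([X_new, zeros, X_new * zeros])    # as if not treated
    return model.predict_proba(d_treat)[:, 1] - model.predict_proba(d_ctrl)[:, 1]

# Example usage: score a holdout file and mail the top decile by predicted lift.
# lift = predict_lift(model, X_holdout); targets = np.argsort(-lift)[: len(lift) // 10]
```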
Opportunity 3 Similar to Opportunity 2, alternative methods to handle multiple treatment response modeling should exist.
OPTIMIZATION Marketing optimization refers to delivering the best treatment combination to the right people so as to optimize certain criteria subject to constraints. The step naturally follows campaign design and response modeling, especially when multiple treatment combinations were tested in the previous campaign(s), and the response model score for each treatment combination and each individual is available (see Problem 3).
Problem 4

The main issue is that the number of decision variables equals the number of individuals in the targets (n) times the number of treatment combinations (m). For example, consider an acquisition campaign where the target population has 30 million individuals and the number of treatment combinations is 100; then the number of decision variables, n times m, equals 3 billion. Consider the following binary integer programming problem (Bertsimas & Tsitsiklis, 1997):

   Maximize π = Σi Σj πij xij
   subject to:
      Σi xij ≤ max. # of individuals receiving treatment combination j, for each j,
      Σi Σj cij xij ≤ expense budget, plus other relevant constraints,
      xij = 0 or 1,                                                      (5)
   where πij = incremental value received by sending treatment combination j to individual i,
         cij = cost of sending treatment combination j to individual i.
In Model (5), πij is linked to the predicted incremental response rate of individual i and treatment combination j. For example, if πij is the predicted incremental response rate (i.e., lift), the objective function is to maximize the total number of incremental responders; if πij is incremental revenue, it may be a constant times the incremental response rate. Similar forms of Model 5 appear in Storey and Cohen (2002) and Ansari and Mela (2003); here, however, I am maximizing incremental value or responders. In the literature, similar formulations for aggregate-level marketing optimization, such as Stapleton, Hanna, and Markussen (2003) and Larochelle and Sanso (2000), are common, but not for individual-level optimization.
Solution 4 One way to address such a problem is through heuristics, which do not guarantee global optimum. For example, to reduce the size of the problem, you may group the large number of individuals in the target population into clusters and then apply linear programming to solve at the cluster level. This is exactly the first stage outlined in Storey and Cohen (2002). The second stage of their approach is to optimize at the individual level for each of the optimal clusters obtained from the first stage. SAS has implemented this methodology in their new software known as Marketing Optimization. For exact solutions, consider Marketswitch at www.marketswitch.com, where the problem is solved with an exact solution through a mathematical transformation. The software has been applied at various companies (Leach, 2001).
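As a toy illustration of the idea behind Model (5) (this is a simple greedy heuristic written for this article, not the SAS Marketing Optimization or Marketswitch implementations), the sketch below assumes that the "other relevant constraints" include at most one treatment combination per individual:

```python
# Greedy heuristic for a small instance of Model (5): pick candidate (individual, treatment)
# pairs in decreasing incremental value per dollar, respecting capacity and budget.
import numpy as np

def greedy_assign(pi, cost, capacity, budget):
    """pi, cost: (n, m) arrays; capacity: (m,) per-treatment limits; budget: scalar."""
    n, m = pi.shape
    assigned = np.full(n, -1)                     # -1 = not contacted
    cap = capacity.astype(float)
    spent = 0.0
    ratio = pi / np.maximum(cost, 1e-9)           # incremental value per unit cost
    for flat in np.argsort(-ratio, axis=None):    # best candidate pairs first
        i, j = divmod(int(flat), m)
        if assigned[i] != -1 or pi[i, j] <= 0:
            continue
        if cap[j] >= 1 and spent + cost[i, j] <= budget:
            assigned[i] = j
            cap[j] -= 1
            spent += cost[i, j]
    return assigned, spent

# Example: 5 individuals, 2 treatment combinations (all values hypothetical).
# pi = np.array([[.02,.01],[.005,.02],[.03,.00],[.01,.01],[.00,.04]])
# greedy_assign(pi, np.full((5, 2), 1.0), capacity=np.array([2, 2]), budget=3.0)
```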
Opportunity 4 (1) Simulation and empirical studies can be used to compare existing heuristics to exact global optimization. (2) General heuristics such as simulated annealing, genetic algorithms, and tabu search can be attempted (Goldberg, 1989; Michalewicz & Fogel, 2002).
Problem 5 Predictive models never produce exact results. Random variation of predicted response rates can be estimated in various ways. For example, in the holdout sample, the modeler can compute the standard deviation of lift by decile. More advanced methods, such as bootstrapping, can also be applied to assess variability (Efron & Tibshirani, 1998). Then how should the variability be accounted for in the optimization stage?
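The bootstrap idea mentioned above can be sketched as follows (variable names and shapes are assumptions for illustration): resample the holdout file with replacement and recompute decile-level lift to obtain standard errors.

```python
# Bootstrap standard errors of decile-level lift on a holdout sample.
# score: predicted lift; treat: 0/1 flag; y: 0/1 response (all length n).
import numpy as np

def decile_lift(score, treat, y):
    order = np.argsort(-score)
    out = []
    for idx in np.array_split(order, 10):
        t, c = idx[treat[idx] == 1], idx[treat[idx] == 0]
        out.append(y[t].mean() - y[c].mean())   # treatment minus control response rate (NaN if a cell is empty)
    return np.array(out)

def bootstrap_se(score, treat, y, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(score)
    reps = [decile_lift(score[s], treat[s], y[s])
            for s in (rng.integers(0, n, n) for _ in range(n_boot))]
    return np.std(reps, axis=0)
```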
Solution 5 The simplest way is to perform sensitivity analysis of response rates on optimization. The downside is that if the variability is high and/or the number of treatment combinations is large, the large number of possibilities will make it difficult. An alternative is to solve the stochastic version of the optimization problem by using stochastic programming (see Birge & Louveaux, 1997). Solving such a large stochastic programming problem typically relies on the Monte Carlo simulation method (Ahmed & Shapiro, 2002) and can easily become a large project.
Opportunity 5 The two alternatives in Solution 5 may not be practical for large optimization problems. Thus, researchers have an opportunity to develop a practical solution. You may consider using robust optimization, which incorporates the trade-off between optimal solution and conservatism in linear programming; see Ben-Tal, Margalit, and Nemirovski (2000) and Bertsimas and Sim (2004).
FUTURE TRENDS

(1) Experimental design is expected to be more utilized to maximize learning.

(2) More marketers are now aware of the lift problem in response modeling (see Solution 2), and alternative methods will be developed to address the problem.

(3) Optimization can be performed across campaigns, channels, customer segments, and business initiatives so that a more global optimal solution is achieved. This requires not only data and response models but also coordination and cultural shift.
Marketing Data Mining
REFERENCES
Kuhfeld, W. (2004). Experimental design, efficiency, coding, and choice designs. SAS Technical Note, TS-694C.
Ahmed, S., & Shapiro, A. (2002). The sample average approximation method for stochastic programs with integer recourse (Tech. Rep.). Georgia Institute of Technology, School of Industrial & Systems Engineering.
Larochelle, J., & Sanso, B. (2000). An optimization model for the marketing-mix problem in the banking industry. INFOR, 38(4), 390-406.
Almquist, E., & Wyner, G. (2001, October). Boost your marketing ROI with experimental design. Harvard Business Review, 135-141. Ansari, A., & Mela, C. (2003). E-customization. Journal of Marketing Research, XL, 131-145. Ben-Tal, A., Margalit, T., & Nemirovski, A. (2000). Robust modeling of multi-stage portfolio problems. In H. Frenk, K. Roos, T. Terlaky, & S. Zhang (Eds.), High performance optimization. Kluwer. Berry, M. J. A., & Linoff, G. S. (1997). Data mining techniques: For marketing, sales, and customer support. Wiley. Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. Wiley. Bertsimas, D., & Sim, M. (2004). The price of robustness. Operations Research, 52(1), 35-53. Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to linear optimization. Athena Scientific. Birge, J. R., & Louveaux, F. (1997). Introduction to stochastic programming. Springer. Efron, B., & Tibshirani, R. J. (1998). An introduction to the bootstrap. CRC. Goldberg, D. E. (1989). Genetic algorithms in search, optimization & machine learning. Addison-Wesley. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer. Haykin, S. (1999). Neural networks: A comprehensive foundation. Prentice Hall. Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. Wiley. Hughes, A. M. (1994). Strategic database marketing. McGraw-Hill. Jackson, R., & Wang, P. (1996). Strategic database marketing. NTC. Kuhfeld, W. (1997). Efficient experimental designs using computerized searches. Sawtooth Software Research Paper Series.
Leach, P. (2001, August). Capital One optimizes customer relationships. 1to1 Magazine. Lo, V. S. Y. (2002). The true-lift model — a novel data mining approach to response modeling in database marketing. SIGKDD Explorations, 4(2), 78-86. Mayer, U. F., & Sarkissien, A. (2003). Experimental design for solicitation campaigns. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 717-722). Michalewicz, Z., & Fogel, D. B. (2002). How to solve it: Modern heuristics. Springer. Montgomery, D. C. (1991). Design and analysis of experiments (3rd ed.). Wiley. Peppers, D., & Rogers, M. (1997). Enterprise one-toone. Doubleday. Peppers, D., & Rogers, M. (1999). The one-to-one fieldbook. Doubleday. Radcliffe, N. J., & Surry, P. (1999). Differential response analysis: Modeling true response by isolating the effect of a single action. Proceedings of Credit Scoring and Credit Control VI, Credit Research Centre, University of Edinburgh Management School. Roberts, M. L., & Berger, P. D. (1999). Direct marketing management. Prentice Hall. Rud, O. P. (2001). Data mining cookbook. Wiley. Russek-Cohen, E., & Simon, R. M. (1997). Evaluating treatments when a gender by treatment interaction may exist. Statistics in Medicine, 16(4), 455-464. Stapleton, D. M., Hanna, J. B., & Markussen, D. (2003). Marketing strategy optimization: Using linear programming to establish an optimal marketing mixture. American Business Review, 21(2), 54-62. Storey, A., & Cohen, M. (2002). Exploiting response models: Optimizing cross-sell and up-sell opportunities in banking. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 325-331). Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. Springer.
KEY TERMS
Campaign Design: The art and science of designing a marketing campaign, including cell structure, sampling, and sizing.

Control Group: Individuals who look like those in the treatment group but are not contacted.

Database Marketing: A branch of marketing that applies database technology and analytics to understand customer behavior and improve effectiveness in marketing campaigns.

Fractional Factorial: A subset of the full factorial design; that is, a subset of all possible combinations.

Marketing Optimization: Delivering the best treatments to the right individuals.

Response Modeling: Predicting a response given individual characteristics by using data from a previous campaign.

Treatment Group: Individuals who are contacted (e.g., mailed) in a campaign.
Material Acquisitions Using Discovery Informatics Approach
Chien-Hsing Wu, National University of Kaohsiung, Taiwan, ROC
Tzai-Zang Lee, National Cheng Kung University, Taiwan, ROC
INTRODUCTION Material acquisition is a time-consuming but important task for a library, because the quality of a library is not in the number of materials that are available but in the number of materials that are actually utilized. Its goal is to predict the users’ needs for information with respect to the materials that will most likely be used (Bloss, 1995). Discovery informatics using the technology of knowledge discovery in databases can be used to investigate in-depth how the acquired materials are being used and, in consequence, can be a predictive sign of information needs.
BACKGROUND Material searchers regularly spend large amounts of time to acquire resources for enormous numbers of library users. Therefore, something significant should be relied on to produce the acquisition recommendation list for the limited budget (Whitmire, 2002). Major resources for material acquisitions are, in general, the personal collections of the librarians and recommendations by users, departments, and vendors (Stevens, 1999). The collections provided by these collectors are usually determined by their individual preferences, rather than by a global view, and thus may not be adequate for the material acquisitions to rely on. Information in the usage data may show something different from the collectors’ recommendations (Hamaker, 1995). For example, knowing which materials were most utilized by the patrons would be highly useful for material acquisitions. First, circulation statistics is one of the most significant references for library material acquisition decisions (Budd & Adams, 1989; Tuten & Lones, 1995; Pu, Lin, Chien, & Juan, 1999). It is a reliable factor by which to evaluate the success of material utilization (Wise & Perushek, 2000). Second, the data-mining technique with a capability of description and predic-
tion can explore patterns in databases that are meaningful, interpretable, and decision supportable (Wang, 2003). The discovery informatics in circulation databases using induction mechanism are used in the decision of library acquisition budget allocation (Kao, Chang, & Lin, 2003; Wu, 2003). The circulation database is one of the important knowledge assets for library managerial decisions. For example, information such as “75.4% of patrons who made use of Organizations also made use of Financial Economics” via association relationship discovery is supportive for the material acquisition operation. Consequently, the data-mining technique with an association mechanism can be utilized to explore informatics that are useful.
MAIN THRUST Utilization discovery as a base of material acquisitions, comprising a combination of association utilization and statistics utilization, is discussed in this article. The association utilization is derived by a data-mining technique. Systematically, when data mining is applied in the field of material acquisitions, it follows six stages: collecting datasets, preprocessing collected datasets, mining preprocessed datasets, gathering discovery informatics, interpreting and implementing discovered informatics, and evaluating discovered informatics. The statistics utilization is simply the sum of numeric values of strength for all different types of categories in preprocessed circulation data tables (Kao et al., 2003). These stages need both domain experts and data miners to accomplish the tasks successfully.
Collecting Datasets Most libraries have employed computer information systems to collect circulation data that mainly includes users identifier, name, address, and department for a user; identifier, material category code, name, author, publisher, and publication date for a material; and users
identifier, material identifier, material category code, date borrowed, and date returned for a transaction. In order to consider the importance of a material category, a data table must be created to define the degree of importance that a material presents to a department (or a group of users). For example, five scales of degree can be “absolutely matching,” “highly matching,” “matching,” “likely matching,” and “absolutely not matching,” and their importance strength can be defined as 0.4, 0.3, 0.2, 0.1, and 0.0, respectively (Kao et al., 2003).
Preprocessing Data Preprocessing data may have operations of refinement and reconstruction of data tables, consistency of multityped data tables, elimination of redundant (or unnecessary) attributes, combination of highly correlated attributes, and discretization of continuous attributes. Two operations in this stage for material acquisitions are the elimination of unnecessary attributes and the reconstruction of data tables. For the elimination of unnecessary attributes, four data tables are preprocessed to derive the material utilization. They are users tables (two attributes: department identifier and user identifier), category tables (two attributes: material identifiers and material category code), circulation table (three attributes: user identifier, material category code, and date borrowed), and importance table (three attributes: department identifier, material category identifier, and importance). For the reconstruction of data tables, a new table can be generated that contains attributes of department identifier, user identifier, material category code, strength, and date borrowed.
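As a rough illustration of this preprocessing step, the following sketch (in Python, using the pandas library) joins hypothetical users, circulation, and importance tables into the reconstructed working table; the table and column names, values, and strengths are illustrative assumptions, not the schema of the cited studies.

import pandas as pd

users = pd.DataFrame({"user_id": ["u1", "u2"], "dept_id": ["d1", "d2"]})
circulation = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "category_code": ["C1", "C2", "C2"],
    "date_borrowed": ["2004-03-01", "2004-03-02", "2004-03-05"],
})
importance = pd.DataFrame({
    "dept_id": ["d1", "d1", "d2"],
    "category_code": ["C1", "C2", "C2"],
    "strength": [0.4, 0.2, 0.3],  # e.g. "absolutely matching" = 0.4, "matching" = 0.2
})

# Reconstructed working table: department, user, category, strength, date borrowed.
table = (
    circulation
    .merge(users, on="user_id")                                      # attach the user's department
    .merge(importance, on=["dept_id", "category_code"], how="left")  # attach the importance strength
    [["dept_id", "user_id", "category_code", "strength", "date_borrowed"]]
)
print(table)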
Mining Data Mining mechanisms can perform knowledge discovery with the form of association, classification, regression, clustering, and summarization/generalization (Hirota & Pedrycz, 1999). The association with a form of “If Condition Then Conclusion” captures relationships between variables. The classification is to categorize a set of data based on their values of the defined attributes. The regression is to derive a prediction model by altering the independent variables for dependent one(s) in a defined database. Clustering is to put together the physical or abstract objects into a class based on similar characteristics. The summarization/generalization is to abridge the general characteristics over a set of defined attributes in a database. The association informatics can be employed in material acquisitions. Like a rule, it takes the form of P=>Q (α, β), where P and Q are material categories, and
α and β are support and confidence, respectively (Meo, Psaila, & Ceri, 1998). The P is regarded as the condition, and Q as the conclusion, meaning that P can produce Q implicitly. For example, an association rule “Systems => Organizations & Management (0.25, 0.33)” means, “If materials in the category of Systems were borrowed in a transaction, materials in Organizations & Management were also borrowed in the same transaction with a support of 0.25 and a confidence of 0.33.” Support is defined as the ratio of the number of transactions in which both categories appear to the total number of transactions, whereas confidence is the ratio of the number of transactions in which both categories appear to the number of transactions in which the condition category appears. Although association rules having the form of P=>Q (α, β) can be generated in a transaction, the inverse association rules and single material category in a transaction also need to be considered. When two categories (C1 and C2) are utilized in a transaction, it is difficult to determine the association among C1=>C2, C2=>C1, and both. A suggestion from librarians is to take the third one (both) as the decision for this problem (Wu, Lee, & Kao, 2004). This is also supported by the study of Meo et al. (1998), which deals with association rule generation in customer purchasing transactions. The support and confidence of C1=>C2 may be different from those of C2=>C1. As a result, the inverse rules are considered as an extension for the transactions that contain more than two categories to determine the number of association rules. The number of rules can be determined via 2*[n*(n-1)/2], where n is the number of categories in a transaction. For example, if {C1, C2, C3} are the categories of a transaction, 6 association rules are then produced: {C1=>C2, C1=>C3, C2=>C3, C2=>C1, C3=>C1, C3=>C2}. Unreliable association rules may occur because their supports and confidences are too small. Normally, there is a predefined threshold on the values of support and confidence used to filter out unreliable association rules. Only when the support and confidence of a rule satisfy the defined threshold is the rule regarded as a reliable rule. However, no reliable evidence exists so far for determining the threshold; it mostly depends on how reliable the management would like the discovered rules to be. For a single category in a transaction, only the condition part, without support and confidence, is considered, because support and confidence are computed over the other transactions. Another problem is the redundant rules in a transaction. An association rule is meant to reveal the company of a certain kind of material category, independent of the number of its occurrences; therefore, all redundant rules are eliminated. In other words, there is only one rule for a particular condition and conclusion in a transaction. Also, the importance of a material to a
department is omitted. However, the final material utilization will take into account this concern when the combination with statistics utilization is performed.
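The rule-generation step just described can be sketched as follows in Python; the transactions, category names, and thresholds are invented for illustration, and support and confidence are computed as defined above.

from itertools import permutations

transactions = [
    {"Systems", "Organizations & Management"},
    {"Systems", "Financial Economics"},
    {"Systems", "Organizations & Management", "Financial Economics"},
    {"Organizations & Management"},
]

n_trans = len(transactions)
pair_count, cond_count = {}, {}
for t in transactions:
    for p, q in set(permutations(t, 2)):      # 2*[n*(n-1)/2] ordered rules, one per pair and direction
        pair_count[(p, q)] = pair_count.get((p, q), 0) + 1
    for c in t:
        cond_count[c] = cond_count.get(c, 0) + 1

MIN_SUPPORT, MIN_CONFIDENCE = 0.25, 0.33      # predefined threshold (illustrative values)
rules = []
for (p, q), n_pq in pair_count.items():
    support = n_pq / n_trans                  # transactions containing both P and Q / all transactions
    confidence = n_pq / cond_count[p]         # transactions containing both / transactions containing P
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        rules.append((p, q, round(support, 2), round(confidence, 2)))

for p, q, s, c in sorted(rules):
    print(f"{p} => {q} ({s}, {c})")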
Gathering Discovery Informatics The final material utilization as the discovery informatics contains two parts. One is statistics utilization, and the other is association utilization (Wu et al., 2004). For a material category C, it is expressed as Formula 1:

MatU(C) = nC + Σk [ nk * (α * supportk + β * confidencek) ]   (1)

where MatU(C) is the material utilization for category C; nC is the statistics utilization of C; nk is the statistics utilization of the k-th category that can produce C through an association rule; α and β are the intensity weights given to support and confidence; and supportk and confidencek are the support and confidence of the rule from the k-th category to C.
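A minimal Python sketch of Formula 1, with invented utilization figures and weights, may clarify how the two parts are combined.

def material_utilization(category, stats_util, rules, alpha=1.0, beta=1.0):
    """rules: list of (condition, conclusion, support, confidence) tuples."""
    n_c = stats_util[category]                      # statistics utilization of C
    assoc = sum(                                    # association utilization contributed by rules producing C
        stats_util[cond] * (alpha * sup + beta * conf)
        for cond, concl, sup, conf in rules
        if concl == category
    )
    return n_c + assoc

stats_util = {"Systems": 120, "Organizations & Management": 80}
rules = [("Systems", "Organizations & Management", 0.25, 0.33)]
print(material_utilization("Organizations & Management", stats_util, rules))
# 80 + 120 * (0.25 + 0.33) = 149.6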
Evaluating Discovered Informatics Performance of the discovered informatics needs to be tested. Criteria used can be validity, significance/uniqueness, effectiveness, simplicity, and generality (Hirota & Pedrycz, 1999). The validity looks at whether the discovered informatics is practically applicable. Uniqueness/ significance deals with how different the discovered informatics are to the knowledge that library management already has. Effectiveness is to see the impact the discovered informatics has on the decision that has been made and implemented. Simplicity looks at the degree of understandability, while generality looks at the degree of scalability. The criteria used to evaluate the discovered material utilization for material acquisitions can be in particular the uniqueness/significance and effectiveness. The uniqueness/significance can show that material utilization is based not only on statistics utilization, but also on association utilization. The solution of effectiveness evaluation can be found by answering the questions “Do the discovered informatics significantly help reflect the information categories and subject areas of materials requested by users?” and “Do the discovered informatics significantly help enhance material utilizations for next year?”
Interpreting and Implementing Discovery Informatics

Interpretation of discovery informatics can be performed by any visualization techniques, such as table, figure, graph, animation, diagram, and so forth. The main discovery informatics for material acquisition have three tables indicating statistics utilization, association rules, and material utilization. The statistics utilization table lists each material category and its utilization. The association rule table has four attributes, including condition, conclusion, support, and confidence. Each tuple in this table represents an association rule. The association utilization is computed according to this table. The material utilization table has five attributes, including material category code, statistics utilization, association utilization, material utilization, and percentage. In this table, the value of the material utilization is the sum of statistics utilization and association utilization. For each material category, the percentage is the ratio of its utilization to the total utilization. Implementation deals with how to utilize the discovered informatics. The material utilization can be used as a base of material acquisitions by which informed decisions about allocating budget are made.

FUTURE TRENDS
Digital libraries using innovative Internet technology promise a new information service model, where library materials are digitized for users to access anytime from anywhere. In fact, it is almost impossible for a library to provide patrons with all the materials available because of budget limitations. Having the collections that closely match the patrons’ needs is a primary goal for material acquisitions. Libraries must be centered on users and based on contents while building a global digital library (Kranich, 1999). This results in the increased necessity of discovery informatics technology. Advanced research tends to the integrated studies that may have requests for information for different subjects (material categories). The material associations discovered in circulation databases may reflect these requests. Library management has paid increased attention to easing access, filtering and retrieving knowledge sources, and bringing new services onto the Web, and users are industriously looking for their needs and figuring out what is really good for them. Personalized information service becomes urgent. The availability
of accessing materials via the Internet is rapidly changing the strategy from print to digital forms for libraries. For example, what can be relied on while making the decision on which electronic journals or e-books are required for a library, how do libraries deal with the number of login names and the number of users entering when analysis of arrival is concerned, and how do libraries create personalized virtual shelves for patrons by analyzing their transaction profiles? Furthermore, data collection via daily circulation operation may be greatly impacted by the way a user makes use of the online materials and, as a consequence, makes the material acquisitions operation even more difficult. Discovery informatics technology can help find the solutions for these issues.
CONCLUSION Material acquisition is an important operation for library that needs both technology and management involved. Circulation data are more than data that keep material usage records. Discovery informatics technology is an active domain that is connected to data processing, machine learning, information representation, and management, particularly when it has shown a substantial aid in decision making. Data mining is an application-dependent issue, and applications in domain will need adequate techniques to deal with. Although discovery informatics depends highly on the technologies used, its use with respect to applications in domain still needs more efforts to concurrently benefit management capability.
REFERENCES Bloss, A. (1995). The value-added acquisitions librarian: Defining our role in a time of change. Library Acquisitions: Practice & Theory, 19(3), 321-330. Budd, J. M., & Adams, K. (1989). Allocation formulas in practice. Library Acquisitions: Practice & Theory, 13, 381-390. Hamaker, C. (1995). Time series circulation data for collection development; Or, you can’t intuit that. Library Acquisition: Practice & Theory, 19(2), 191-195. Hirota, K., & Pedrycz, W. (1999). Fuzzy computing for data mining. Proceedings of the IEEE, 87(9), 1575-1600. Kao, S. C., Chang, H. C., & Lin, C. H. (2003). Decision support for the academic library acquisition budget allocation via circulation data base mining. Information Processing & Management, 39(1), 133-147.
Kranich, N. (1999). Building a global digital library. In C.C. Chen (Ed.), IT and global digital library development (pp. 251-256). West Newton, MA: MicroUse Information. Lu, H., Feng, L., & Han, J. (2000). Beyond intratransaction association analysis: Mining multidimensional intertransaction association rules. ACM Transaction on Information Systems, 18(4), 423-454. Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Data Mining & Knowledge Discovery, 2, 195-224. Pu, H. T., Lin, S. C., Chien, L. F., & Juan, Y. F. (1999). Exploration of practical approaches to personalized library and networked information services. In C.-C. Chen (Ed.), IT and global digital library development (pp. 333-343). West Newton, MA: MicroUse Information. Stevens, P. H. (1999). Who’s number one? Evaluating acquisitions departments. Library Collections, Acquisitions, & Technical Services, 23, 79-85. Tuten, J. H., & Lones, B. (1995). Allocation formulas in academic libraries. Chicago, IL: Association of College and Research Libraries. Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group. Whitmire, E. (2002). Academic library performance measures and undergraduates’ library use and educational outcomes. Library & Information Science Research, 24, 107-128. Wise, K., & Perushek, D. E. (2000). Goal programming as a solution technique for the acquisition allocation problem. Library & Information Science Research, 22(2), 165183. Wu, C. H. (2003). Data mining applied to material acquisition budget allocation for libraries: Design and development. Expert Systems with Applications, 25(3), 401-411. Wu, C. H., Lee, T. Z., & Kao, S. C. (2004). Knowledge discovery applied to material acquisitions for libraries. Information Processing & Management, 40(4), 709-725.
KEY TERMS Association Rule: The implication of connections for variables that are explored in databases, having a form of A→B, where A and B are disjoint subsets of a dataset of binary attributes.
Circulation Database: The information of material usages that are stored in a database, including user identifier, material identifier, date the material is borrowed and returned, and so forth.

Digital Library: A library that provides the resources to select, structure, offer, access, distribute, preserve, and maintain the integrity of the collections of digital works.

Discovery Informatics: Knowledge explored in databases with the form of association, classification, regression, summarization/generalization, and clustering.

Material Acquisition: A process of material information collection by recommendations of users, vendors, colleges, and so forth. Information explored in databases can be also used. The collected information is, in general, used in purchasing materials.

Material Category: A set of library materials with similar subjects.
Materialized Hypertext View Maintenance
Giuseppe Sindoni, ISTAT - National Institute of Statistics, Italy
INTRODUCTION A hypertext view is a hypertext containing data from an underlying database. The materialization of such hypertexts, that is, the actual storage of their pages in the site server, is often a valid option1. Suitable auxiliary data structures and algorithms must be designed to guarantee consistency between the structures and contents of each heterogeneous component where base data is stored and those of the derived hypertext view. This topic covers the maintenance features required by the derived hypertext to enforce consistency between page content and database status (Sindoni, 1998). Specifically, it discusses the general problem of maintaining hypertexts after changes in the base data and how to incrementally and automatically maintain the hypertext view, and it presents a solution based on a Definition Language for Web page generation together with an algorithm and an auxiliary data structure for automatic and incremental hypertext view maintenance.
BACKGROUND Some additional maintenance features are required by a materialized hypertext to enforce consistency between page contents and the current database status. In fact, every time a transaction is issued on the database, its updates must be efficiently and effectively extended to the derived hypertext. In particular, (i) updates must be incremental, that is, only the hypertext pages dependent on database changes must be updated and (ii) all database updates must propagate to the hypertext. The principle of incremental maintenance has been previously explored by several authors in the context of materialized database views (Blakeley et al., 1986; Gupta et al., 2001; Paraboschi et al., 2003; Vista, 1998; Zhuge et al., 1995). Paraboschi et al. (2003) give a useful overview of the materialized view maintenance problem in the context of multidimensional databases. Blakeley et al. (1986) propose a method in which all database updates are first filtered to remove those that cannot possibly affect the view. For the remaining updates, they apply a differential algorithm to re-evaluate the view expression. This ex-
ploits the knowledge provided by both the view definition expression and the database update operations. Gupta et al. (2001) consider a variant of the view maintenance problem: to keep a materialized view up-to-date when the view definition itself changes. They try to “adapt” the view in response to changes in the view definition. Vista (1998) reports on the integration of view maintenance policies into a database query optimizer. She presents the design, implementation and use of a query optimizer responsible for the generation of both maintenance expressions to be used for view maintenance and execution plans. Zhuge et al. (1995) show that decoupling of the base data (at the sources) from the view definition and view maintenance machinery (at the warehouse) can lead the warehouse to compute incorrect views. They introduce an algorithm that eliminates the anomalies. Fernandez et al. (2000), Sindoni (1998) and Labrinidis & Roussopoulos (2000) have brought these principles to the Web hypertext field. Fernandez et al. (2000) provide a declarative query language for hypertext view specification and a template language for specification of its HTML representation. Sindoni (1998) deals with the maintenance issues required by a derived hypertext to enforce consistency between page content and database state. Hypertext views are defined as nested oid-based views over the set of base relations. A specific logical model is used to describe the structure of the hypertext and a nested relational algebra extended with an oid invention operator is proposed, which allows views and view updates to be defined. Labrinidis & Roussopoulos (2000) analytically and quantitatively compare three materialization policies (inside the DBMS, at the web server and virtual). Their results indicate that materialization at the Web server is a more scalable solution and can facilitate an order of magnitude more users than the other two policies, even under high update workloads. The orthogonal problem of deferring maintenance operations, thus allowing the definition of different policies, has been studied by Bunker et al. (2001), who provide an overview of the view maintenance subsystem of a commercial data warehouse system. They describe optimizations and discuss how the system’s focus on star schemas and data warehousing influences the maintenance subsystem.
MAIN THRUST With respect to incremental view maintenance, the heterogeneity caused by the semistructured nature of views and by the different data models and formats of base data and derived views makes the materialized hypertext view setting different from the database view context and introduces some new issues. Hypertexts are normally modeled by object-like models, because of their nested and network-like structure. Thus, with respect to relational materialized views, where base tables and views are modeled using the same logical model, maintaining a hypertext view derived from relational base tables involves the additional challenge of handling, for the materialized views, a data model different from that of the base tables. In addition, each page is physically stored as a marked-up text file, possibly on a remote server. Direct access to single values on the page is thus not permitted. Whenever a page needs to be updated, it must therefore be completely regenerated from the new database status. Furthermore, consistency between pages must be preserved, which is an operation analogous to the one of preserving consistency between nested objects. The problem of dynamically maintaining consistency between base data and derived hypertext is the hypertext view maintenance problem. It has been addressed in the framework of the STRUDEL project (Fernandez et al., 2000) as the problem of incremental view updates for semistructured data, by Sindoni (1998) and by Labrinidis & Roussopoulos (2000). There are a number of related issues:
• different maintenance policies should be allowed (immediate or deferred);
• this implies the design of auxiliary data structures to keep track of database updates, but their management overloads the system and they must therefore be as light as possible;
• finally, due to the particular network structure of a hypertext, consistency must be maintained not only between the single pages and the database, but also between page links.
To deal with such issues, a manipulation language for derived hypertexts; an auxiliary data structure for (i) representing the dependencies between the database and hypertext and (ii) logging database updates; and an algorithm for automatic hypertext incremental maintenance can be introduced. For example, hypertext views and view updates can be defined using a logical model and an algebra (Sindoni, 1998). An auxiliary data structure allows information on the dependencies between database tables and hypertext
pages to be maintained. It may be based on the concept of view dependency graph, for the maintenance of the hypertext class described by the logical model. A view dependency graph stores information about the base tables, which are used in the hypertext view definitions. Finally, incremental page maintenance can be performed by a maintenance algorithm that takes as its input a set of changes on the database and produces a minimal set of update instructions for hypertext pages. The algorithm can be used whenever hypertext maintenance is required.
A Manipulation Language for Materialized Hypertexts Once a derived hypertext has been designed with a logical model and an algebra has been used to define its materialization as a view on base tables, a manipulation language is of course needed to populate the site with page scheme instances2 and maintain them when database tables are updated. The language is based on invocations of algebra expressions. The languages used for page creation and maintenance may be very simple, such as that composed of only two instructions: GENERATE and REMOVE. They allow manipulation of hypertext pages and can refer to the whole hypertext, to all instances of a page scheme, or to pages that satisfy a condition. The GENERATE statement has the following general syntax:

GENERATE ALL | <page-scheme list> [WHERE <condition>]
Its semantics essentially create the proper set of pages, taking data from the base tables as specified by the page scheme definitions. The ALL keyword allows generation of all instances of each page scheme. The REMOVE statement has a similar syntax and allows the specified sets of pages to be removed from the hypertext.
Incremental Maintenance of Materialized Hypertexts Whenever new data are inserted in the database, the status of all affected tables changes and the hypertexts whose content derives from those tables no longer reflect the database's current status. These pages must therefore be updated in line with the changes. An extension to the system is thus needed to incrementally enforce consistency between database and hypertext. The simple “brute force” approach to the problem would simply regenerate the whole hypertext from the new
database status, thus performing a huge number of unnecessary page creation operations. A more sophisticated approach would regenerate only the instances of the involved page schemes, again however unnecessarily regenerating all page instances not actually containing the newly inserted data. The optimal solution is hence to regenerate only the page instances containing the newly inserted data. To perform such an incremental maintenance process, the database transaction must be extended by a GENERATE statement with a suitable WHERE condition restricting the generation of pages to only those containing new data. This is done by using a list of URLs of pages affected by the database change, which is produced by the maintenance algorithm. Let us now consider a deletion of data from the database. In principle, this corresponds to the deletion and/or modification of derived hypertext instances. Pages will be deleted whenever all the information they contain has been deleted from the database, and re-generated whenever only part of their information has been deleted. Deletion transactions from the database are therefore extended by either REMOVE or GENERATE statements on the derived hypertext. The main maintenance problems in this framework are: (i) to produce the proper conditions for the GENERATE and REMOVE statements automatically and (ii) to distinguish database updates corresponding to page removals from those corresponding to page replacements. These problems are normally solved by allowing the system to log deletions from each base table and maintain information on which base tables are involved in the generation of each page scheme instance. When the underlying database is relational, this can be done by systems that effectively manage the sets of inserted and deleted tuples of each base relation since the last hypertext maintenance operation. These sets will be addressed as ∆+ and ∆-. ∆ relations were first introduced in the database programming language field, while Sindoni (1998) and Labrinidis & Roussopoulos (2001) used them in the context of hypertext views. In this framework, for a database relation, a ∆ is essentially a relation with the same scheme storing the set of inserted (∆+) or deleted (∆-) tuples since the last maintenance operation. A view dependency graph is also implemented, allowing information on dependencies between base relations and derived views to be maintained. It also maintains information on oid attributes, that is, the attributes of each base relation whose values are used to generate hypertext URLs. The maintenance algorithm takes ∆+, ∆- and the view dependency graph as inputs and produces as output the sequence of GENERATE and REMOVE statements necessary for hypertext maintenance. It allows maintenance to be postponed and performed in batch mode, by using the table update logs and implementing proper triggering mechanisms. For more details of the algorithm see Sindoni (1998).
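The following condensed Python sketch illustrates the flavor of such an algorithm; it is not the published algorithm, and the relation names, page schemes, oid attributes, and the way remaining data is checked are assumptions made only for the example.

view_dependency_graph = {
    # page scheme -> base relations it is derived from, and the oid attribute used in its URLs
    "AuthorPage": {"relations": ["Author", "Work"], "oid_attr": "author_id"},
}

def maintenance_statements(delta_plus, delta_minus, graph, still_in_db):
    """delta_plus / delta_minus: {relation: [changed tuples as dicts]} logged since the last run.
    still_in_db(scheme, key) -> bool: does any base data for the page still exist?"""
    stmts = []
    for scheme, info in graph.items():
        oid = info["oid_attr"]
        inserted = {t[oid] for rel in info["relations"] for t in delta_plus.get(rel, []) if oid in t}
        deleted = {t[oid] for rel in info["relations"] for t in delta_minus.get(rel, []) if oid in t}
        for key in sorted(inserted | deleted):
            if still_in_db(scheme, key):
                stmts.append(f"GENERATE {scheme} WHERE {oid} = '{key}'")   # (re)build the affected page
            else:
                stmts.append(f"REMOVE {scheme} WHERE {oid} = '{key}'")     # all of its data is gone
    return stmts

delta_plus = {"Work": [{"author_id": "a12", "title": "A new paper"}]}
delta_minus = {"Author": [{"author_id": "a07"}]}
print(maintenance_statements(delta_plus, delta_minus, view_dependency_graph,
                             still_in_db=lambda scheme, key: key != "a07"))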
FUTURE TRENDS There are a number of connected problems that may merit further investigation. •
• Defining the most suitable maintenance policy for each page class involves a site analysis process to extract data on page access and, more generally, on site quality and impact.
• If a deferred policy is chosen for a given class of pages, this causes part of the hypertext to become temporarily inconsistent with its definition. Consequently, transactions reading multiple page schemes may not perform efficiently. The concurrency control problem needs to be extended to the context of hypertext views and suitable algorithms must be developed. There is a particular need to address the problem of avoiding dangling links, that is, links pointing nowhere because the former target page has been removed.
• Performance analysis is needed to show how transaction overhead and view refresh time are affected in the above approaches. Its results should be used to define a set of system-tuning parameters, to be used by site administrators for optimization purposes.
CONCLUSION We have shown that one of the most important problems in managing information repositories containing structured and semistructured data is to provide applications able to manage their materialization. This approach to the definition and generation of a database-derived Web site is particularly suitable for defining a framework for its incremental updates. In fact, the ability to formally define mappings between base data and derived hypertext allows easy definition of hypertext updates in the same formalism. To enable the system to propagate database updates to the relevant pages, a Data Manipulation Language must be defined, which allows automatic generation or removal of a derived site or specific subset of its pages. The presence of a DML for pages allows an algorithm for their automatic maintenance to be defined. This can use a specific data structure that keeps track of view-table dependencies and logs table updates. Its output is a set of the Data Manipulation Language statements to be executed in order to update the hypertext. The algorithm may be used to implement different view maintenance policies. It allows maintenance to be deferred and performed in batch mode, by using the table update logs and implementing proper triggering mechanisms.
These techniques, like many others defined previously, are now being applied to XML data by various researchers (Braganholo et al., 2003; Chen et al., 2002; Alon et al., 2003; Zhang et al., 2003; Shanmugasundaram et al., 2001). The most challenging issue for long-term research is probably that of extending hypertext incremental maintenance to the case where data come from many heterogeneous, autonomous and distributed databases.
Vista, D. (1998). Integration of incremental view maintenance into query optimizers. In EDBT (pp. 374-388). Zhang, X. et al. (2003). Rainbow: Multi-XQuery optimization using materialized XML views. SIGMOD Conference (pp. 671). Zhuge, Y. et al. (1995). View maintenance in a warehousing environment. In SIGMOD Conference (pp. 316-327).
KEY TERMS REFERENCES Alon, N. et al. (2003). Typechecking XML views of relational databases. ACM Transactions on Computational Logic, 4 (3), 315-354. Blakeley, J. et al. (1986). Efficiently updating materialized views. In ACM SIGMOD International Conf. on Management of Data (SIGMOD’86) (pp. 61-71). Braganholo, V. P. et al. (2003). On the updatability of XML views over relational databases. In WebDB (pp. 31-36). Bunker, C.J. et al. (2001). Aggregate maintenance for data warehousing in Informix Red Brick Vista. In VLDB 2001 (pp. 659-662). Chen, Y.B. et al. (2000). Designing valid XML views. In Entity Relationship Conference (pp. 463-478). Fernandez, M.F. et al. (2000). Declarative specification of Web sites with Strudel. The VLDB Journal, 9(1), 38-55. Gupta, A. et al. (2001). Adapting materialized views after redefinitions: Techniques and a performance study. Information Systems, 26 (5), 323-362. Labrinidis, A., & Roussopoulos, N. (2000). WebView materialization. In SIGMOD’00 (pp. 367-378). Labrinidis, A., & Roussopoulos, N. (2001). Update propagation strategies for improving the quality of data on the Web. In VLDB (pp. 391-400). Paraboschi, S. et al. (2003). Materialized views in multidimensional databases. In Multidimensional databases (pp. 222-251). Hershey, PA: Idea Group Publishing. Shanmugasundaram, J. et al. (2001). Querying XML views of relational data. In VLDB (pp. 261-270). Sindoni, G. (1998). Incremental maintenance of hypertext views. In Proceedings of the Workshop on the Web and Databases (WebDB’98) (in conjunction with EDBT’98). LNCS 1590 (pp. 98-117). Berlin: Springer-Verlag.
Database Status: The structure and content of a database at a given time stamp. It comprises the database object classes, their relationships and their object instances. Deferred Maintenance: The policy of not performing database maintenance operations when their need becomes evident, but postponing them to a later moment. Dynamic Web Pages: Virtual pages dynamically constructed after a client request. The request is usually managed by a specific program or described using a specific query language whose statements are embedded into pages. Immediate Maintenance: The policy of performing database maintenance operations as soon as their need becomes evident. Link Consistency: The ability of a hypertext network links to always point to an existing and semantically coherent target. Materialized Hypertext: A hypertext dynamically generated from an underlying database and physically stored as a marked-up text file. Semistructured Data: Data with a structure not as rigid, regular, or complete as that required by traditional database management systems.
ENDNOTES 1
2
For more details, see the paper Materialized Hypertext Views. A page scheme is essentially the abstract representation of pages with the same structure and a page scheme instance is a page with the structure described by the page scheme. For the definition of page scheme and instance, see paper Materialized Hypertext Views.
713
M
Materialized Hypertext Views
Giuseppe Sindoni, ISTAT - National Institute of Statistics, Italy
INTRODUCTION A materialized hypertext view can be defined as “a hypertext containing data coming from a database and whose pages are stored in files” (Sindoni, 1999). A Web site presenting data from a data warehouse is an example of such a view. Even if the most popular approach to the generation of such sites is based on dynamic Web pages, a rationale for the materialized approach has produced many research efforts. The topic will cover logical models to describe the structure of the hypertext.
BACKGROUND Hypertext documents in the Web are in essence collections of HTML (HyperText Markup Language) or XML (the eXtensible Markup Language) files and are delivered to users by an HTTP (HyperText Transfer Protocol) server. Hypertexts are very often used to publish very large amounts of data on the Web, in what are known as data intensive Web sites. These sites are characterized by a large number of pages sharing the same structure, such as in a University Web site, where there are numerous pages containing staff information, that is, “Name,” “Position,” “Department,” and so on. Each staff page is different, but they all share the types of information and their logical organization. A group of pages sharing the same structure is called page class. Similarly to databases, where it is possible to distinguish between the intensional (database structure) and extensional (the database records) levels, in data intensive Web sites it is possible to distinguish between the site structure (the structure of the different page classes and page links) and site pages (instances of page classes). Pages of a data intensive site may be generated dynamically, that is, on demand, or be materialized, as will be clarified in the following. In both approaches, each page corresponds to an HTML file and the published data normally come from a database, where they can be updated more efficiently than in the hypertext files themselves. The database is queried to extract records relevant to the hypertext being generated and page instances are filled with values according to a suitable hypertext model describing page classes (Agosti et al., 1995; Aguilera et al., 2002; Beeri et al., 1998; Crestani & Melucci, 2003;
Baresi et al., 2000; Balasubramanian et al., 2001; Merialdo et al., 2003; Rossi & Schwabe, 2002; Simeon & Cluet, 1998). The hypertext can then be accessed by any Internetenabled machine running a Web browser. In such a framework, hypertext can be regarded as database views, but in contrast with classic databases, such as relational databases, the model describing the view cannot be the same as the one describing the database storing the published data. The most relevant hypertext logical models proposed so far can be classified into three major groups, according to the purpose of the hypertext being modeled. Some approaches are aimed at building hypertext as integration views of distributed data sources (Aguilera et al., 2002; Beeri et al., 1998; Simeon & Cluet, 1998), other as views of an underlying local database (Fernandez et al., 2000; Merialdo et al., 2003). There are also proposals for models and methods to build hypertext independently from the source of the data that they publish (Agosti et al., 1995; Baresi et al., 2000; Balasubramanian et al., 2001; Rossi & Schwabe, 2002; Simeon & Cluet, 1998). Models can be based on graphs (Fernandez et al., 2000; Simeon & Cluet, 1998), on XML Data Type Definitions (Aguilera et al., 2002), extension of the Entity Relationship model (Balasubramanian et al., 2001), logic rules (Beeri et al., 1998) or object-like paradigms (Merialdo et al., 2003; Rossi & Schwabe, 2002).
MAIN THRUST The most common way to automatically generate a derived hypertext is based principally on the dynamic construction of virtual pages following a client request. Usually, the request is managed by a specific program (for example a Common Gateway Interface – CGI – called as a link in HTML files) or described using a specific query language, whose statements are embedded into pages. These pages are often called “pull pages,” because it is up to the client browser to pull out the interesting information. Unfortunately, this approach has some major drawbacks:
• it involves a degree of Data Base Management System overloading, because every time a page is requested by a client browser, a query is issued to the database in order to extract the relevant data;
• it introduces some platform-dependence, because the embedded queries are usually written in a proprietary language and the CGIs must be compiled on the specific platform;
• it hampers site mirroring, because if the site needs to be moved to another server, either the database needs to be replicated, or some network overload is introduced due to remote queries;
• it doesn’t allow the publication of some site metadata, more specifically information about the structure of the site, which may be very useful to querying applications.
An alternative approach is based on the concept of materialized hypertext view: a derived hypertext whose pages are actually stored by the system on a server or directly on the client machine, using a mark-up language like HTML. This approach overcomes the above disadvantages because: (i) pages are static, so the HTTP server can work on its own; (ii) there is no need to embed queries or script calls in the pages, as standard sites are generated; (iii) due to their standardization, sites can be mirrored more easily, as they are not tied to a specific technology; and finally, (iv) metadata can be published by either embedding them into HTML comments or directly generating XML files. A data model, preferably object-oriented, is used to describe a Web hypertext. This allows the system to manage nested objects by decomposition. This means that each hypertext page is seen as an object with attributes, which can be atomic or complex, such as a list of values. Complex attributes are also modeled as nested objects into the page object. These objects can also have both atomic and complex attributes (objects) and the nesting mechanism is virtually unlimited. Below we will describe the Araneus Data Model (ADM) (Merialdo et al., 2003), as an example of a hypertext data model. Different models can be found in (Fernandez et al., 2000; Fraternali & Paolini, 1998). ADM is a page-oriented model, as page is the main concept. Each hypertext page is seen as an object having an identifier (its Uniform Resource Locator - URL) and a number of attributes. Its structure is abstracted by its page scheme and each page is an instance of a page scheme. The notion of page scheme may be compared to that of relation scheme, in the relational data model, or object class, in object oriented databases. The following example describes the page of an author in a bibliographic site, described by the AUTHOR PAGE page scheme.
PAGE SCHEME AuthorPage
  Name: TEXT;
  WorkList: LIST OF (
    Authors: TEXT;
    Title: TEXT;
    Reference: TEXT;
    Year: TEXT;
    ToRefPage: LINK TO ConferencePage UNION JournalPage;
    AuthorList: LIST OF (
      Name: TEXT;
      ToAuthorPage: LINK TO AuthorPage OPTIONAL;
    );
  );
END PAGE SCHEME
Each AuthorPage instance has a simple attribute (Name). Pages can also have complex attributes: lists, possibly nested at an arbitrary level, and links to other pages. The example shows the page scheme with a list attribute (WorkList). Its elements are tuples, formed by three simple attributes (Authors, Title and Year), a link to an instance of either a ConferencePage or a JournalPage and the corresponding anchor (Reference), and a nested list (AuthorList) of other authors of the same work. Once a description of the hypertext is available, its materialization is made possible using a mapping language, such as those described in Merialdo et al. (2003).
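As a rough illustration of materialization, the following Python sketch renders one AuthorPage instance, represented as a nested object, into a static HTML file; it is not the Araneus mapping language, and the data values and file name are invented for the example.

from html import escape

author_page = {
    "Name": "J. Smith",
    "WorkList": [
        {"Authors": "J. Smith, A. Jones", "Title": "On hypertext views",
         "Reference": "WebDB 1998", "Year": "1998",
         "ToRefPage": "conferences/webdb1998.html",
         "AuthorList": [{"Name": "A. Jones", "ToAuthorPage": "authors/jones.html"}]},
    ],
}

def render(page):
    rows = []
    for w in page["WorkList"]:
        coauthors = ", ".join(
            f'<a href="{a["ToAuthorPage"]}">{escape(a["Name"])}</a>'
            for a in w["AuthorList"])
        rows.append(f'<li>{escape(w["Title"])} ({w["Year"]}), '
                    f'<a href="{w["ToRefPage"]}">{escape(w["Reference"])}</a>, '
                    f'with {coauthors}</li>')
    return (f"<html><body><h1>{escape(page['Name'])}</h1>"
            f"<ul>{''.join(rows)}</ul></body></html>")

# the page instance is stored as a marked-up text file, ready to be served statically
with open("smith.html", "w", encoding="utf-8") as f:
    f.write(render(author_page))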
FUTURE TRENDS One of the topics currently attracting the interest of many researchers and practitioners of the Web and databases fields is XML. Most efforts are aimed at modeling XML repositories and defining query languages for querying and transforming XML sources (World Wide Web Consortium, 2004). One of the current research directions is to explore XML as both a syntax for metadata publishing and a document model to be queried and restructured. Mecca, Merialdo, & Atzeni (1999) show that XML modeling primitives may be considered as a subset of the Object Data Management Group standard enriched with union types and XML repositories may in principle be queried using a language like the Object Query Language.
CONCLUSION Data-intensive hypertext can be published by assuming that the pages contain data coming from an underlying database and that their logical structure is described
according to a specific model. Pages may be mapped on the database and automatically generated using a programming language. To allow external applications to access these metadata, a materialized approach to page generation can be adopted. The massive diffusion of XML as a preferred means for describing a Web page’s structure and publishing it on the Internet is facilitating integrated access to heterogeneous, distributed data sources: the Web is rapidly becoming a repository of global knowledge. The research challenge for the 21st century will probably be to provide global users with applications to efficiently and effectively find the required information. This could be achieved by utilizing models, methods and tools which have been already developed for knowledge discovery and data warehousing in more controlled and local environments.
REFERENCES Agosti, M. et al. (1995). Automatic authoring and construction of hypertext for information retrieval. Multimedia Systems, 3, 15-24. Aguilera, V. et al. (2002). Views in a large-scale XML repository. Very Large DataBase Journal, 11(3), 238-255. Balasubramanian, V. et al. (2001). A case study in systematic hypertext design. Information Systems, 26(4), 295-320. Baresi, L. et al. (2000). From Web sites to Web applications: New issues for conceptual modeling. In Entity Relationship (Workshops) (pp. 89-100). Beeri, C. et al. (1998). WebSuite: A tools suite for harnessing Web data. In Proceedings of the Workshop on the Web and Databases (Web and DataBases 98) (in conjunction with Extending DataBase Technology 98). Lecture Notes in Computer Science (Vol. 1590) (pp. 152-171). Crestani, F., & Melucci, M. (2003). Automatic construction of hypertexts for self-referencing: The hyper-text book project. Information Systems, 28(7), 769-790. Fernandez, M. F. et al. (2000). Declarative specification of Web sites with Strudel. Very Large DataBase Journal, 9(1), 38-55. Fraternali, P., & Paolini, P. (1998). A conceptual model and a tool environment for developing more scalable, dynamic, and customizable Web applications. In VI Intl. Conference on Extending Database Technology (EDBT 98) (pp. 421435).
Mecca, G. et al. (1999). Araneus in the Era of XML. IEEE Data Engineering Bulletin, 22(3), 19-26. Merialdo, P. et al. (2003). Design and development of data-intensive Web sites: The araneus approach. ACM Transactions on Internet Technology, 3(1), 49-92. Rossi, G., & Schwabe, D. (2002). Object-oriented design structures in Web application models. Annals of Software Engineering, 13(1-4), 97-110. Simeon, G., & Cluet, S. (1998). Using YAT to build a Web server. In Proceedings of the Workshop on the Web and Databases (Web and DataBases 98) (in conjunction with Extending DataBase Technology 98). Lecture Notes in Computer Science (Vol. 1590) (pp. 118-135). Sindoni, G. (1999). Maintenance of data and metadata in Web-based information systems. PhD Thesis. Università degli studi di Roma La Sapienza. World Wide Web Consortium. (2004). XML Query (XQuery). Retrieved August 23, 2004, from http:// www.w3.org/XML/Query
KEY TERMS

Dynamic Web Pages: Virtual pages dynamically constructed after a client request. Usually, the request is managed by a specific program or is described using a specific query language whose statements are embedded into pages.

HTML: The Hypertext Markup Language. A language based on labels to describe the structure and layout of a hypertext.

HTTP: The HyperText Transfer Protocol. An Internet protocol, used to implement communication between a Web client, which requests a file, and a Web server, which delivers it.

Knowledge Management: The practice of transforming the intellectual assets of an organization into business value.

Materialized Hypertext View: A hypertext containing data coming from a database and whose pages are stored in files.

Metadata: Data about data. Structured information describing the nature and meaning of a set of data.

XML: The eXtensible Markup Language. An evolution of HTML, aimed at separating the description of the hypertext structure from that of its layout.
Materialized View Selection for Data Warehouse Design
Dimitri Theodoratos, New Jersey Institute of Technology, USA
Alkis Simitsis, National Technical University of Athens, Greece
INTRODUCTION A data warehouse (DW) is a repository of information retrieved from multiple, possibly heterogeneous, autonomous, distributed databases and other information sources for the purpose of complex querying, analysis, and decision support. Data in the DW are selectively collected from the sources, processed in order to resolve inconsistencies, and integrated in advance (at design time) before data loading. DW data are usually organized multi-dimensionally to support online analytical processing (OLAP). A DW can be seen abstractly as a set of materialized views defined over the source relations. During the initial design of a DW, the DW designer faces the problem of deciding which views to materialize in the DW. This problem has been addressed in the literature for different classes of queries and views, and with different design goals.
BACKGROUND
MAIN THRUST
Figure 1 shows a simplified DW architecture. The DW contains a set of materialized views. The users address their queries to the DW. The materialized views are used partially or completely for the evaluation of the user queries. This is achieved through partial or complete rewritings of the queries using the materialized views. Figure 1. A simplified DW architecture
When selecting views to materialize in a DW, one attempts to satisfy one or more design goals. A design goal is either the minimization of a cost function or a constraint. A constraint can be classified as user-oriented or system-oriented. Attempting to satisfy the constraints can result in no feasible solution to the view selection problem. The design goals determine the design of the algorithms that select views to materialize from the space of alternative view sets.
Minimization of Cost Functions Most approaches comprise in their design goals the minimization of a cost function.
Maintenance Expressions
When the source relations change, the materialized views need to be updated. The materialized views are usually maintained using an incremental strategy. In such a strategy, the changes to the source relations are propagated to the DW. The changes to the materialized views are computed using the changes of the source relations and are eventually applied to the materialized views. The expressions used to compute the view changes involve the changes of the source relations and are called maintenance expressions. Maintenance expressions are issued by the DW against the data sources, and the answers are sent back to the DW. When the source relation changes affect more than one materialized view, multiple maintenance expressions need to be evaluated. The techniques of multi-query optimization can be used to detect common subexpressions among maintenance expressions in order to derive an efficient global evaluation plan for all the maintenance expressions.
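As a toy illustration of a maintenance expression, consider a materialized view defined as the join of two source relations: an insertion ∆R into one relation can be propagated by evaluating ∆R joined with the other relation, rather than recomputing the whole view. The following Python sketch, with invented relation and attribute names, shows the idea.

def join(left, right, key):
    # naive nested-loop join of two lists of dicts on a common attribute
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

S = [{"prod_id": 1, "category": "books"}, {"prod_id": 2, "category": "music"}]
V = []                                    # materialized view V = R join S, initially empty

delta_R = [{"prod_id": 1, "amount": 30}]  # newly inserted tuples of R
delta_V = join(delta_R, S, "prod_id")     # maintenance expression: only the change is joined
V.extend(delta_V)                         # apply the computed change to the materialized view
print(V)                                  # [{'prod_id': 1, 'amount': 30, 'category': 'books'}]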
Query Evaluation Cost: Often, the queries that the DW has to satisfy are given as input to the view selection problem. The overall query evaluation cost is the sum of the cost of evaluating each input
•
•
718
query rewritten (partially or completely) over the materialized views. This sum also can be weighted, each weight indicating the frequency or importance of the corresponding query. Several approaches aim at minimizing the query evaluation cost (Gupta & Mumick, 1999; Harinarayan et al., 1996; Shukla et al., 1998). View Maintenance Cost: The view maintenance cost is the sum of the cost of propagating each source relation change to the materialized views. This sum can be weighted, each weight indicating the frequency of propagation of the changes of the corresponding source relation. The maintenance expressions can be evaluated more efficiently if they can be partially rewritten over views already materialized at the DW; the evaluation of parts of the maintenance expression is avoided since their materializations are present at the DW. Moreover, access of the remote data sources and expensive data transmissions are reduced. Materialized views that are added to the DW for reducing the view maintenance cost are called auxiliary views (Ross et al., 1996; Theodoratos & Sellis, 1999). Obviously, maintaining the auxiliary views incurs additional maintenance cost. However, if this cost is less than the reduction to the maintenance cost of the initially materialized views, it is worth keeping the auxiliary views in the DW. Ross, et al. (1996) derive auxiliary views to materialize in order to minimize the view maintenance cost. Operational Cost: Minimizing the query evaluation cost and the view maintenance cost are conflicting requirements. Low view maintenance cost can be obtained by replicating source relations at the DW. In this case, though, the query evaluation cost is high, since queries need to be computed from the replicas of the source relations. Low query evaluation cost can be obtained by materializing at the DW all the input queries. In this case, all the input queries can be answered by a simple lookup, but the view maintenance cost is high, since complex maintenance expressions over the source relations need to be computed. The input queries may overlap; that is, they may share many common subexpressions. By materializing common subexpressions and other views over the source relations, it is possible, in general, to reduce the view maintenance cost. These savings must be balanced against higher query evaluation cost. For this reason, one can choose to minimize a linear combination of the query evaluation and view maintenance cost, which is called operational cost. Most approaches endeavor to minimize the operational cost (Baralis et al., 1997; Gupta, 1997; Theodoratos & Sellis, 1999; Yang et al., 1997).
System-Oriented Constraints

System-oriented constraints are dictated by the restrictions of the system and are transparent to the users.

• Space Constraint: Although the falling cost of disk space allows for massive storage of data, one cannot assume that disk space is unlimited. The space constraint restricts the space occupied by the selected materialized views not to exceed the space allocated to the DW for this purpose. Space constraints are adopted in many works (Gupta, 1997; Golfarelli & Rizzi, 2000; Harinarayan et al., 1996; Theodoratos & Sellis, 1999).
• View Maintenance Cost Constraint: In many practical cases, the limiting factor in materializing all the views in the DW is not the space constraint but the view maintenance cost. Usually, DWs are updated periodically (e.g., at nighttime) in a large batch update transaction. Therefore, the update window must be sufficiently short so that the DW is available for querying and analysis during the daytime. The view maintenance cost constraint states that the total view maintenance cost should be less than a given amount of view maintenance time. Gupta and Mumick (1999), Golfarelli and Rizzi (2000), and Lee and Hammer (2001) consider a view maintenance cost constraint in selecting materialized views.
• Self-Maintainability: A materialized view is self-maintainable if it can be maintained, for any instance of the source relations over which it is defined and for all source relation changes, using only these changes, the view definition, and the view materialization. The notion is extended to a set of views in a straightforward manner. By adding auxiliary views to a set of materialized views, one can make the whole view set self-maintainable. There are different reasons for making a view set self-maintainable: (a) the remote source relations need not be contacted for evaluating maintenance expressions during view updating; (b) anomalies due to concurrent changes are eliminated, and the view maintenance process is simplified; and (c) the materialized views can be maintained efficiently even if the sources are not able to answer queries (e.g., legacy systems) or are temporarily unavailable (e.g., in mobile systems). Self-maintainability can be trivially achieved by replicating at the DW all the source relations used in the view definitions. Self-maintainability, viewed as a constraint, requires that the set of materialized views taken together is self-maintainable. Quass et al. (1996), Akinde et al. (1998), Liang et al. (1999), and Theodoratos (2000) aim at making the DW self-maintainable.
• Answering the Input Queries Using Exclusively the Materialized Views: This constraint requires the existence of a complete rewriting of the input queries, initially defined over the source relations, over the materialized views. Clearly, if this constraint is satisfied, the remote data sources need not be contacted for evaluating queries. This way, expensive data transmissions from the DW to the sources, and conversely, are avoided. Some approaches assume a centralized DW environment, where the source relations are present at the DW site. In this case, the answerability of the queries from the materialized views is trivially guaranteed by the presence of the source relations. The answerability of the queries can also be trivially guaranteed by appropriately defining select-project views on the source relations and replicating them at the DW. This approach also assures the self-maintainability of the materialized views. Theodoratos and Sellis (1999) do not assume a centralized DW environment or replication of part of the source relations at the DW and explicitly impose this constraint in selecting views for materialization.
User-Oriented Constraints

User-oriented constraints express requirements of the users.

• Answer Data Currency Constraints: An answer data currency constraint sets an upper bound on the time elapsed between the point in time the answer to a query is returned to the user and the point in time the most recent changes of a source relation that are taken into account in the computation of this answer are read (this time reflects the currency of the answer data). Currency constraints are associated with every source relation in the definition of every input query. The upper bound in an answer data currency constraint (the minimal currency required) is set by the users according to their needs. This formalization of data currency constraints allows stating currency constraints at the query level and not at the materialized view level, as is the case in some approaches. Therefore, currency constraints can be exploited by DW view selection algorithms, where the queries are the input, while the materialized views are the output (and, therefore, are not available). Furthermore, it allows stating different currency constraints for different relations in the same query.
• Query Response Time Constraints: A query response time constraint states that the time needed to evaluate an input query using the views materialized at the DW should not exceed a given bound. The bound for each query is given by the users and reflects their needs for fast answers. For some queries, fast answers may be required, while for others, the response time may be less critical.
Search Space and Algorithms

Solving the problem of selecting views for materialization involves addressing two main tasks: (a) generating a search space of alternative view sets for materialization and (b) designing optimization algorithms that select an optimal or near-optimal view set from the search space.

A DW is usually organized according to a star schema, where a fact table is surrounded by a number of dimension tables. The dimension tables define hierarchies of aggregation levels. Typical OLAP queries involve star joins (key/foreign key joins between the fact table and the dimension tables) and grouping and aggregation at different levels of granularity. For queries of this type, the search space can be formed in an elegant way as a multidimensional lattice (Baralis et al., 1997; Harinarayan et al., 1996).

Gupta (1997) states that the view selection problem is NP-hard, and most approaches to the problem avoid exhaustive algorithms. The adopted algorithms fall into two categories: deterministic and randomized. The first category includes greedy algorithms with a performance guarantee (Gupta, 1997; Harinarayan et al., 1996), 0-1 integer programming algorithms (Yang et al., 1997), A* algorithms (Gupta & Mumick, 1999), and various other heuristic algorithms (Baralis et al., 1997; Ross et al., 1996; Shukla et al., 1998; Theodoratos & Sellis, 1999). The second category includes simulated annealing algorithms (Kalnis et al., 2002; Theodoratos et al., 2001), iterative improvement algorithms (Kalnis et al., 2002), and genetic algorithms (Lee & Hammer, 2001). Both categories of algorithms exploit the particularities of the specific view selection problem and the restrictions of the class of queries considered.
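The sketch below is a rough illustration in the spirit of the benefit-driven greedy algorithms cited above (e.g., Harinarayan et al., 1996), not a faithful reimplementation. The lattice nodes, view sizes, and ancestor relation are made-up inputs; each step adds the affordable view with the largest total benefit until the space budget is exhausted.

```python
# Greedy selection over a tiny, hypothetical aggregation lattice.
SIZE = {"base": 100, "pc": 50, "ps": 30, "cs": 40, "p": 6, "c": 5, "s": 4}
ANCESTORS = {  # for each view, the views it can be answered from (incl. itself)
    "pc": {"pc", "base"}, "ps": {"ps", "base"}, "cs": {"cs", "base"},
    "p": {"p", "pc", "ps", "base"}, "c": {"c", "pc", "cs", "base"},
    "s": {"s", "ps", "cs", "base"}, "base": {"base"},
}

def cheapest(view, materialized):
    """Cost of answering a query at 'view' from the cheapest materialized ancestor."""
    return min(SIZE[a] for a in ANCESTORS[view] if a in materialized)

def benefit(candidate, materialized):
    """Total reduction in query cost if 'candidate' is added."""
    return sum(max(cheapest(v, materialized) - SIZE[candidate], 0)
               for v in SIZE if candidate in ANCESTORS[v])

def greedy_select(space_budget):
    chosen = {"base"}                        # the base cube is always available
    while True:
        affordable = [v for v in SIZE
                      if v not in chosen and SIZE[v] <= space_budget]
        if not affordable:
            return chosen - {"base"}
        best = max(affordable, key=lambda v: benefit(v, chosen))
        if benefit(best, chosen) <= 0:
            return chosen - {"base"}
        chosen.add(best)
        space_budget -= SIZE[best]

print(greedy_select(space_budget=60))
```

Variants of this scheme weight the benefit by space or by query frequency, and the randomized algorithms mentioned above explore the same search space by perturbing candidate view sets instead of growing them greedily.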
FUTURE TRENDS

The view selection problem has been addressed for different types of queries. Research has focused mainly on queries over star schemas. Newer applications (e.g., XML or Web-based applications) require different types of queries. This topic has only been partially investigated (Golfarelli et al., 2001; Labrinidis & Roussopoulos, 2000). A relevant issue that needs further investigation is the construction of the search space of alternative view sets for materialization. Even though the construction of such a search space for grouping and aggregation queries is straightforward (Harinarayan et al., 1996), it becomes an intricate problem for general queries (Golfarelli & Rizzi, 2001). Indexes can be seen as special types of views. Gupta et al. (1997) show that a two-step process that divides the space available for materialization and picks views first and then indexes can perform very poorly. More work needs to be done on the problem of automating the selection of views and indexes together. DWs are dynamic entities that evolve continuously over time. As time passes, new queries need to be satisfied. A dynamic version of the view selection problem chooses additional views for materialization and avoids redesigning the DW from scratch (Theodoratos & Sellis, 2000). A system that dynamically materializes views in the DW at multiple levels of granularity in order to match the workload (Kotidis & Roussopoulos, 2001) is a current trend in the design of a DW.
CONCLUSION

A DW can be seen as a set of materialized views. A central problem in the design of a DW is the selection of views to materialize in it. Depending on the requirements of the prospective users of the DW, the materialized view selection problem can be formulated with various design goals that comprise the minimization of cost functions and the satisfaction of user- and system-oriented constraints. Because of its importance, different versions of the problem have been the focus of attention of many researchers in recent years. Papers in the literature deal mainly with the issue of determining a search space of alternative view sets for materialization and with the issue of designing optimization algorithms that avoid examining exhaustively the usually huge search space. Some results of this research have already been used in commercial database management systems (Agrawal et al., 2000).

REFERENCES

Agrawal, S., Chaudhuri, S., & Narasayya, V.R. (2000). Automated selection of materialized views and indexes in SQL databases. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt.
Akinde, M.O., Jensen, O.G., & Böhlen, H.M. (1998). Minimizing detail data in data warehouses. International Conference on Extending Database Technology (EDBT), Valencia, Spain.
Baralis, E., Paraboschi, S., & Teniente, E. (1997). Materialized views selection in a multidimensional database. International Conference on Very Large Data Bases (VLDB), Athens, Greece.
Golfarelli, M., & Rizzi, S. (2000). View materialization for nested GPSJ queries. International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden.
Golfarelli, M., Rizzi, S., & Vrdoljak, B. (2001). Data warehouse design from XML sources. ACM International Workshop on Data Warehousing and OLAP (DOLAP), Atlanta, Georgia.
Gupta, H. (1997). Selection of views to materialize in a data warehouse. International Conference on Database Theory (ICDT), Delphi, Greece.
Gupta, H., Harinarayan, V., Rajaraman, A., & Ullman, J.D. (1997). Index selection for OLAP. IEEE International Conference on Data Engineering, Birmingham, UK.
Gupta, H., & Mumick, I.S. (1999). Selection of views to materialize under a maintenance cost constraint. International Conference on Database Theory (ICDT), Jerusalem, Israel.
Harinarayan, V., Rajaraman, A., & Ullman, J. (1996). Implementing data cubes efficiently. ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Canada.
Kalnis, P., Mamoulis, N., & Papadias, D. (2002). View selection using randomized search. Data & Knowledge Engineering, 42(1), 89-111.
Kotidis, Y., & Roussopoulos, N. (2001). A case for dynamic view management. ACM Transactions on Database Systems, 26(4), 388-423.
Labrinidis, A., & Roussopoulos, N. (2000). WebView materialization. ACM SIGMOD International Conference on Management of Data (SIGMOD), Dallas, Texas.
Lee, M., & Hammer, J. (2001). Speeding up materialized view selection in data warehouses using a randomized algorithm. International Journal of Cooperative Information Systems (IJCIS), 10(3), 327-353.
Liang, W. (1999). Making multiple views self-maintainable in a data warehouse. Data & Knowledge Engineering, 30(2), 121-134.
Quass, D., Gupta, A., Mumick, I.S., & Widom, J. (1996). Making views self-maintainable for data warehousing. International Conference on Parallel and Distributed Information Systems (PDIS), Florida.
Ross, K., Srivastava, D., & Sudarshan, S. (1996). Materialized view maintenance and integrity constraint checking: Trading space for time. ACM SIGMOD International Conference on Management of Data (SIGMOD), Montreal, Canada.
Shukla, A., Deshpande, P., & Naughton, J. (1998). Materialized view selection for multidimensional datasets. International Conference on Very Large Data Bases (VLDB), New York.
Theodoratos, D. (2000). Complex view selection for data warehouse self-maintainability. International Conference on Cooperative Information Systems (CoopIS), Eilat, Israel.
Theodoratos, D., Dalamagas, T., Simitsis, A., & Stavropoulos, M. (2001). A randomized approach for the incremental design of an evolving data warehouse. International Conference on Conceptual Modeling (ER), Yokohama, Japan.
Theodoratos, D., & Sellis, T. (1999). Designing data warehouses. Data & Knowledge Engineering, 31(3), 279-301.
Theodoratos, D., & Sellis, T. (2000). Incremental design of a data warehouse. Journal of Intelligent Information Systems (JIIS), 15(1), 7-27.
Yang, J., Karlapalem, K., & Li, Q. (1997). Algorithms for materialized view design in data warehousing environment. International Conference on Very Large Data Bases (VLDB), Athens, Greece.

KEY TERMS

Auxiliary View: A view materialized in the DW exclusively for reducing the view maintenance cost.
Materialized View: A view whose answer is stored in the DW.
Operational Cost: A linear combination of the query evaluation and view maintenance costs.
Query Evaluation Cost: The sum of the cost of evaluating each input query rewritten over the materialized views.
Self-Maintainable View: A materialized view that can be maintained, for any instance of the source relations and for all source relation changes, using only these changes, the view definition, and the view materialization.
View: A named query.
View Maintenance Cost: The sum of the cost of propagating each source relation change to the materialized views.
Methods for Choosing Clusters in Phylogenetic Trees
Tom Burr
Los Alamos National Laboratory, USA
INTRODUCTION

One data mining activity is cluster analysis, of which there are several types. One type deserving special attention is clustering that arises due to evolutionary relationships among organisms. Genetic data are often used to infer evolutionary relations among a collection of species, viruses, bacteria, or other taxonomic units (taxa). A phylogenetic tree (Figure 1, top) is a visual representation of either the true or the estimated branching order of the taxa, depending on the context. Because the taxa often cluster in agreement with auxiliary information, such as geographic or temporal isolation, a common activity associated with tree estimation is to infer the number of clusters and the cluster memberships, which is also a common goal in most applications of cluster analysis. However, tree estimation is unique because of the types of data used and the use of probabilistic evolutionary models, which lead to computationally demanding optimization problems. Furthermore, novel methods to choose the number of clusters and cluster memberships have been developed and will be described here. The methods include a unique application of model-based clustering, a maximum likelihood plus bootstrap method, and a Bayesian method based on obtaining samples from the posterior probability distribution on the space of possible branching orders.
BACKGROUND

Tree estimation is frequently applied to genetic data of various types; we focus here on applications involving DNA data, such as that from HIV. Trees are intended to convey information about the genealogy of such viruses, and the most genetically similar viruses are likely to be the most closely related. However, because the evolutionary process includes random effects, there is no guarantee that "closer in genetic distance" implies "closer in time" for every pair of sequences. Sometimes the cluster analysis must be applied to large numbers of taxa, or applied repeatedly to the same number of taxa. For example, Burr, Myers, and Hyman (2001) recently investigated how many subtypes (clusters) arise under a simple model of how the env (gp120) region of HIV-1, group M sequences (Figure 1, top) are
evolving. One question was whether the subtypes of group M could be explained by the past population dynamics of the virus. For each of many simulated data sets, each having approximately 100 taxa, model-based clustering was applied to automate the process of choosing the number of clusters. The novel application of model-based clustering and its potential for scaling to large numbers of taxa will be described, along with the two other methods mentioned in the introduction.

It is well known that cluster analysis results can depend strongly on the metric. There are at least three metric-related features unique to DNA data. First, DNA data are categorical. Second, a favorable trend in phylogenetic analysis of DNA data is to choose the evolutionary model using goodness-of-fit or likelihood ratio tests (Huelsenbeck & Rannala, 1997); for nearly all of the currently used evolutionary models there is an associated distance measure, so there is the potential to make an objective metric choice. Third, the evolutionary model is likely to depend on the region of the genome: DNA regions that code for amino acids are more constrained over time due to selective pressure and therefore are expected to have a smaller rate of change than non-coding sequences.

A common evolutionary model is as follows (readers who are uninterested in the mathematical detail should skip this paragraph). Consider a pair of taxa denoted x and y, and define the matrix F_xy by

$$N F_{xy} = \begin{pmatrix} n_{AA} & n_{AC} & n_{AG} & n_{AT}\\ n_{CA} & n_{CC} & n_{CG} & n_{CT}\\ n_{GA} & n_{GC} & n_{GG} & n_{GT}\\ n_{TA} & n_{TC} & n_{TG} & n_{TT} \end{pmatrix},$$

where N is the number of base pairs (sites) in the set of aligned sequences, n_AA is the number of sites at which taxa x and y both have an A, n_AC is the number of sites at which taxon x has an A and taxon y has a C, and so on. The most general time-reversible model (GTR) for which a distance measure has been defined (Swofford, Olsen, Waddell, & Hillis, 1996) defines the distance between taxa x and y as d_xy = -trace{Π log(Π⁻¹F_xy)}, where Π is a diagonal matrix of the average base frequencies in taxa x and y and the trace is the sum of the diagonal elements. The GTR model is fully specified by five relative rate parameters (a, b, c, d, e; the sixth rate, f, is conventionally fixed to set the scale) and three relative frequency parameters (πA, πC, and πG, with πT determined via πA + πC + πG + πT = 1) in the rate matrix Q defined as

$$Q/\mu = \begin{pmatrix} - & a\pi_C & b\pi_G & c\pi_T\\ a\pi_A & - & d\pi_G & e\pi_T\\ b\pi_A & d\pi_C & - & f\pi_T\\ c\pi_A & e\pi_C & f\pi_G & - \end{pmatrix},$$

where µ is the overall substitution rate and each diagonal entry (shown as "-") is set so that the corresponding row sums to zero. The rate matrix Q is related to the substitution probability matrix P via P(t) = e^{Qt}, where P_ij(t) is the probability of a change from nucleotide i to j in time t and P_ij(t) satisfies the time-reversibility and stationarity criteria π_i P_ij = π_j P_ji. Commonly used models such as Jukes-Cantor (Swofford et al., 1996) assume that a = b = c = d = e = 1 and πA = πC = πG = πT = 0.25. For the Jukes-Cantor model, it follows that P_ii(t) = 0.25 + 0.75 e^{-µt} and that the distance between taxa x and y is -(3/4) log(1 - (4/3)D), where D is the proportion of sites at which x and y differ (regardless of the kind of difference, because all relative substitution rates and base frequencies are assumed to be equal). Important generalizations include allowing unequal relative frequencies and/or rate parameters and allowing the rate µ to vary across DNA sites. Allowing µ to vary across sites via a gamma-distributed rate parameter is one way to model the fact that sites often have different observed rates. If the rate µ is assumed to follow a gamma distribution with shape parameter γ, then these "gamma distances" can be obtained from the original distances by replacing the function log(x) with γ(1 - x^{-1/γ}) in the d_xy = -trace{Π log(Π⁻¹F_xy)} formula (Swofford et al., 1996). Generally, this rate heterogeneity, and the fact that multiple substitutions at the same site tend to saturate any distance measure, make it a practical challenge to find a metric such that the distance between any two taxa increases linearly with time.
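As a small worked example of the Jukes-Cantor distance just described, the sketch below computes it from two aligned sequences; the toy sequences are invented for illustration.

```python
# Jukes-Cantor distance: D is the proportion of differing sites and the
# corrected distance is -(3/4) * log(1 - (4/3) * D), which is undefined
# (saturated) once D >= 0.75.
import math

def jukes_cantor_distance(seq_x, seq_y):
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_x, seq_y))
    d = diffs / len(seq_x)
    if d >= 0.75:
        return float("inf")        # correction breaks down; distance saturates
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)

print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))  # one difference in 8 sites
```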
Figure 1. HIV data (env region). (Top) Hierarchical clustering; (Middle) principal coordinate plot; (Bottom) results of model-based clustering (BIC versus number of clusters) under six different assumptions regarding volume (V), shape (S), and orientation (O). E denotes "equal" among clusters and V denotes "varying" among clusters, for V, S, and O, respectively. For example, case 6 has varying V, equal S, and varying O among clusters. Models 1 and 2 each assume a spherical shape (I denotes the identity matrix, so S and O are equal among clusters, while V is equal for case 1 and varying for case 2). Note that the B and D subtypes tend to be merged.
MAIN THRUST

Much more can be said about evolutionary models, but the background should suffice to convey the notion that relationships among taxa determine the probability (albeit in a complicated way, via the substitution probability matrix P) of observing various DNA character states among the taxa. Therefore, it is feasible to turn things around and estimate taxa relationships on the basis of observed DNA. The quality of the estimate depends on the adequacy of the model, the number of observed DNA sites, the number of taxa, and the complexity of the true tree. These important issues are discussed, for example, in Swofford et al. (1996), and nearly all published methods are now related to varying degrees to the fundamental result of Felsenstein (1981) for calculating the likelihood of a given set of character states for a given tree. Likelihood-based estimation is typically computationally demanding, and rarely (only for a small number of taxa, say 15 or fewer) is the likelihood evaluated for all possible branching orders. Instead, various search strategies are used that begin with a subset of the taxa. More recently, branch rearrangement strategies are also employed that allow exploration of the space of likelihoods (Li, Pearl, & Doss, 2000; Simon & Larget, 1998) and approximate probabilities via Markov chain Monte Carlo (MCMC, see below) of each of the most likely trees. Finding the tree that optimizes some criterion such as the likelihood is a large topic that is largely outside our scope here (see Salter, 2000). Our focus is on choosing groups of taxa, because knowledge of group structure adds considerably to our understanding of how the taxa are evolving (Korber et al., 2000) and perhaps also leads to efficient ways to estimate trees containing a large number of taxa (Burr, Skourikhine, Macken & Bruno, 1999). Here we focus on choosing groups of taxa without concern for the intended use of the estimated group structure, and consider three strategies.

One option for clustering taxa is to construct trees that have maximum likelihood (not necessarily a true maximum likelihood, because of the incomplete searches used), identify groups, and then repeat using resampled data sets. The resampled ("bootstrap") data sets are obtained from the original data sets by sampling DNA sites with replacement. This resampling captures some of the variation inherent in the evolutionary process. If, for example, 999 of 1,000 bootstrapped ML trees each show a particular group being monophyletic (i.e., the group coalesces to a common ancestor before any taxa from outside the group), then the "consensus" is that this group is strongly supported (Efron, Halloran, & Holmes, 1996).

A second option is to evaluate the probability of each of the most likely branching orders. This is computationally demanding and relies heavily on efficient branch rearrangement methods (Li, Pearl, & Doss, 2000) to implement
MCMC as a way to evaluate the likelihood of many different branching orders. In MCMC, the likelihood ratio of branching orders (and branch lengths) is used to generate candidate branching orders according to their relative probabilities and, therefore, to evaluate the relative probability of many different branching orders.

A third option (Figure 1, middle) is to represent the DNA character data using: (a) the substitution probability matrix P to define distances between each pair of taxa (described above); (b) multidimensional scaling to reduce the data dimension and represent each taxon in two to five dimensions in such a way that the pairwise distances are closely approximated by distances computed using the low-dimensional representation; and (c) model-based clustering of the low-dimensional data.

Several clustering methods could be applied to the data if the data (A, C, T, and G) were coded so that distances could be computed. An effective way to do this is to represent the pairwise distance data via multidimensional scaling. Multidimensional scaling represents the data in new coordinates such that distances computed in the new coordinates very closely approximate the original distances computed using the chosen metric. For n taxa with an n-by-n distance matrix, the result of multidimensional scaling is an n-by-p matrix (p coordinates) that can be used to closely approximate the original distances. Therefore, multidimensional scaling provides a type of data compression, with the new coordinates being suitable for input to clustering methods. For example, one could use the cmdscale function in S-PLUS (2003) to implement classical multidimensional scaling. The implementation of model-based clustering we consider includes k-means as a special case, as we will describe.

In model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component of the mixture represents a cluster. Given n p-dimensional observations x = (x_1, x_2, …, x_n), assume there are G clusters and let f_k(x_i | θ_k) be the probability density for cluster k. The model for the composite of clusters is typically formulated in one of two ways. The classification likelihood approach maximizes

L_C(θ_1, …, θ_G; γ_1, …, γ_n | x) = ∏_i f_{γ_i}(x_i | θ_{γ_i}),

where the γ_i are discrete labels satisfying γ_i = k if x_i belongs to cluster k. The mixture likelihood approach maximizes

L_M(θ_1, …, θ_G; τ_1, …, τ_G | x) = ∏_i Σ_k τ_k f_k(x_i | θ_k),

where τ_k is the probability that an observation belongs to cluster k. Fraley and Raftery (1999) describe their latest version of model-based clustering, where the f_k are assumed to be multivariate Gaussian with mean µ_k and covariance matrix
Σ_k. Banfield and Raftery (1993) developed a model-based framework by parameterizing the covariance matrix in terms of its eigenvalue decomposition in the form Σ_k = λ_k D_k A_k D_k^T, where D_k is the orthonormal matrix of eigenvectors, A_k is a diagonal matrix with elements proportional to the eigenvalues of Σ_k, and λ_k is a scalar (under one convention, the largest eigenvalue of Σ_k). The orientation of cluster k is determined by D_k, A_k determines the shape, and λ_k specifies the volume. Each of the volume, shape, and orientation (VSO) can be variable among groups or fixed at one value for all groups.

One advantage of the mixture-model approach is that it allows the use of approximate Bayes factors to compare models, giving a means of selecting the model parameterization (which of V, S, and O are variable among groups) and the number of clusters (Figure 1, bottom). The Bayes factor is the posterior odds for one model against another model, assuming that neither model is favored a priori (uniform prior). When the EM (expectation-maximization) algorithm (Dempster, Laird, & Rubin, 1977) is used to find the maximum mixture likelihood, the most reliable approximation to twice the log Bayes factor (called the Bayesian Information Criterion, BIC) is

BIC = 2 l_M(x, θ̂) − m_M log(n),

where l_M(x, θ̂) is the maximized mixture log-likelihood for the model and m_M is the number of independent parameters to be estimated in the model. A convention for calibrating BIC differences is that differences less than 2 correspond to weak evidence, differences between 2 and 6 to positive evidence, differences between 6 and 10 to strong evidence, and differences of more than 10 to very strong evidence.

The two clustering methods, "ML + bootstrap" and model-based clustering, have been compared on the same data sets (Burr, Gattiker, & LaBerge, 2002b), and the differences were small. For example, "ML + bootstrap" suggested 7 clusters for 95 HIV env sequences and 6 clusters in HIV gag (p17) sequences, while model-based clustering suggests 6 for env (it tends to merge the so-called B and D subtypes; see Figure 1) and 6 for gag. Note from Figure 1 (top) that only 7 of the 10 recognized subtypes were included among the 95 sequences. However, it is likely that case-specific features will determine the extent of difference between the methods. Also, model-based clustering provides a more natural and automatic way to identify candidate groups. Once these candidate groups have been identified, either method is reasonable for assigning confidence measures to the resulting cluster assignments.

The two clustering methods "ML + bootstrap" and an MCMC-based estimate of the posterior probability on the space of branching orders have also been compared on the same data sets (Burr, Doak, Gattiker, & Stanbro, 2002a) with respect to the confidence that each method assigns to particular groups
that were chosen in advance of the analysis. The plot of the MCMC-based estimate versus the “ML + bootstrap” based estimate was consistent with the hypothesis that both methods assign (on average) the same probability to a chosen group, unless the number of DNA sites was very small, in which case there can be a non-negligible bias in the ML method, resulting in bias in the “ML + bootstrap” results. Because the group was chosen in advance, the method for choosing groups was not fully tested, so there is a need for additional research.
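A rough analogue of the third option, under assumptions that differ in detail from the article (scikit-learn's MDS uses the SMACOF algorithm rather than classical scaling, and its GaussianMixture reports BIC on a "lower is better" scale), is sketched below: embed a pairwise distance matrix in a few dimensions, fit Gaussian mixtures with 1 to K_MAX components, and choose the number of clusters by BIC.

```python
# Distance matrix -> low-dimensional embedding -> mixture fits -> BIC choice.
import numpy as np
from sklearn.manifold import MDS
from sklearn.mixture import GaussianMixture

def choose_clusters(dist_matrix, n_dims=3, k_max=8, seed=0):
    coords = MDS(n_components=n_dims, dissimilarity="precomputed",
                 random_state=seed).fit_transform(dist_matrix)
    bics = {k: GaussianMixture(n_components=k, covariance_type="full",
                               random_state=seed).fit(coords).bic(coords)
            for k in range(1, k_max + 1)}
    return min(bics, key=bics.get), bics   # smallest BIC wins in sklearn's convention

# Toy distance matrix: two well-separated groups standing in for taxa.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
best_k, _ = choose_clusters(dists, n_dims=2, k_max=5)
print("estimated number of clusters:", best_k)
```

In a real analysis the distance matrix would come from one of the evolutionary distances described in the Background section, and the covariance parameterization (which of V, S, and O vary) would be selected by BIC along with the number of clusters.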
FUTURE TRENDS

The recognized subtypes of HIV-1 were identified using informal observations of tree shapes followed by "ML + bootstrap" (Korber & Myers, 1992). Although identifying such groups is common in phylogenetic trees, there have been only a few attempts to formally evaluate clustering methods for the underlying genetic data. "ML + bootstrap" remains the standard way to assign confidence to hypothesized groups, and the group structure is usually hypothesized either by using auxiliary information (geographic, temporal, or other) or by visual inspection of trees (which often display distinct groups). A more thorough evaluation could be performed using realistic simulated data with known branching orders. Having a known branching order is almost the same as having known groups; however, choosing the number of groups is likely to involve arbitrary decisions even when the true branching order is known.
CONCLUSION

One reason to cluster taxa is that evolutionary processes can sometimes be revealed once the group structure is recognized. Another reason is that phylogenetic trees are complicated objects that can often be effectively summarized by identifying the major groups together with a description of the typical between- and within-group variation. Also, if we correctly choose the number of clusters present in the tree for a large number of taxa (100 or more), we can then use these groups to rapidly construct a good approximation to the true tree. One strategy for doing this is to repeatedly apply model-based clustering to relatively small numbers of taxa (100 or fewer) and check for consistent indications of the number of groups. We described two other clustering strategies ("ML + bootstrap" and MCMC-based) and note that two studies have made limited comparisons of these three methods on the same genetic data.
REFERENCES

Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
Burr, T., Charlton, W., & Stanbro, W. (2000). Comparison of signature pattern analysis methods in molecular epidemiology. Proc. Mathematical and Engineering Methods in Medicine and Biological Sciences, 1 (pp. 473-479).
Burr, T., Doak, J., Gattiker, J., & Stanbro, W. (2002a). Assessing confidence in phylogenetic trees: Bootstrap versus Markov Chain Monte Carlo. Mathematical and Engineering Methods in Medicine and Biological Sciences, 1, 181-187.
Burr, T., Gattiker, J., & LaBerge, G. (2002b). Genetic subtyping using cluster analysis. Special Interest Group on Knowledge Discovery and Data Mining Explorations, 3, 33-42.
Burr, T., Myers, G., & Hyman, J. (2001). The origin of AIDS - Darwinian or Lamarkian? Phil. Trans. R. Soc. Lond. B, 356, 877-887.
Burr, T., Skourikhine, A.N., Macken, C., & Bruno, W. (1999). Confidence measures for evolutionary trees: Applications to molecular epidemiology. Proc. of the 1999 IEEE Inter. Conference on Information, Intelligence and Systems (pp. 107-114).
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Efron, B., Halloran, E., & Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA, 93, 13429.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17, 368-376.
Fraley, C., & Raftery, A. (1999). MCLUST: Software for model-based cluster analysis. Journal of Classification, 16, 297-306.
Huelsenbeck, J., & Rannala, B. (1997). Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science, 276, 227-232.
Korber, B., Muldoon, N., Theiler, J., Gao, R., Gupta, R., Lapedes, A., Hahn, B., Wolinsky, W., & Bhattacharya, T. (2000). Timing the ancestor of the HIV-1 pandemic strains. Science, 288, 1788-1796.
Korber, B., & Myers, G. (1992). Signature pattern analysis: A method for assessing viral sequence relatedness. AIDS Research and Human Retroviruses, 8, 1549-1560.
Li, S., Pearl, D., & Doss, H. (2000). Phylogenetic tree construction using Markov Chain Monte Carlo. Journal of the American Statistical Association, 95(450), 493-508.
Salter, L. (2000). Algorithms for phylogenetic tree reconstruction. Proc. Mathematical and Engineering Methods in Medicine and Biological Sciences, 2 (pp. 459-465).
S-PLUS - Statistical Programming Language. (2003). Insightful Corp., Seattle, Washington.
Swofford, D.L., Olsen, G.J., Waddell, P.J., & Hillis, D.M. (1996). Phylogenetic inference. In Hillis et al. (Eds.), Molecular systematics (2nd ed., pp. 407-514). Sunderland, Massachusetts: Sinauer Associates.

KEY TERMS

Bayesian Information Criterion: An approximation to the Bayes factor, which can be used to estimate the Bayesian posterior probability of a specified model.
Bootstrap: A resampling scheme in which surrogate data is generated by resampling the original data or sampling from a model that was fit to the original data.
Coalesce: In the context of phylogenetic trees, two lineages coalesce at the time that they most recently share a common ancestor (and hence "come together" in the tree).
Expectation-Maximization (EM) Algorithm: An algorithm for computing maximum likelihood estimates from incomplete data. In the case of fitting mixtures, the group labels are the missing data.
HIV: Human Immunodeficiency Virus.
Markov Chain Monte Carlo: A stochastic method to approximate probabilities, available in many situations for which analytical methods are not available. The method involves generating observations from the probability distribution by evaluating the likelihood ratio of any two candidate solutions.
Mixture of Distributions: A combination of two or more distributions in which observations are generated from distribution i with probability p_i (Σ p_i = 1).
Model-Based Clustering: A clustering method with relatively flexible assumptions regarding the volume, shape, and orientation of each cluster.
Phylogenetic Tree: A representation of the branching order and branch lengths of a collection of taxa, which, in its most common display form, looks like the branches of a tree.
Probability Density Function: A function that can be summed (for discrete-valued random variables) or integrated (for interval-valued random variables) to give the probability of observing values in a specified set.
Substitution Probability Matrix: A matrix whose (i, j) entry is the probability of substituting DNA character j (C, G, T, or A) for character i over a specified time period.
Microarray Data Mining
Li M. Fu
University of Florida, USA
INTRODUCTION

Based on the concept of simultaneously studying the expression of a large number of genes, a DNA microarray is a chip on which numerous probes are placed for hybridization with a tissue sample. Biological complexity encoded by a deluge of microarray data is being translated into all sorts of computational, statistical, or mathematical problems bearing on biological issues ranging from genetic control to signal transduction to metabolism. Microarray data mining aims to identify biologically significant genes and find patterns that reveal molecular network dynamics for reconstruction of genetic regulatory networks and pertinent metabolic pathways.

BACKGROUND

The idea of microarray-based assays seemed to emerge as early as the 1980s (Ekins & Chu, 1999). In that period, a computer-based scanning and image-processing system was developed to quantify the expression level in tissue samples of each cloned complementary DNA sequence spotted in a two-dimensional array on strips of nitrocellulose, which could be considered the first prototype of the DNA microarray. Microarray-based gene expression technology was actively pursued in the mid-1990s (Schena, Heller, & Theriault, 1998) and has seen rapid growth since then. Microarray technology has catalyzed the development of the field known as functional genomics by offering high-throughput analysis of the functions of genes on a genomic scale (Schena et al., 1998). There are many important applications of this technology, including elucidation of the genetic basis of health and disease, discovery of biomarkers of therapeutic response, identification and validation of new molecular targets and modes of action, and so on. The decoding of the human genome sequence, together with recent advances in biochip technology, has ushered in genomics-based medical therapeutics, diagnostics, and prognostics.

MAIN THRUST

The laboratory information management system (LIMS) keeps track of and manages data produced from each step in a microarray experiment, such as hybridization, scanning, and image processing. As microarray experiments generate a vast amount of data, the efficient storage and use of the data require a database management system. Although some databases are designed to be data archives only, other databases such as ArrayDB (Ermolaeva, Rastogi, & Pruitt, 1998) and Argus (Comander, Weber, Gimbrone, & Garcia-Cardena, 2001) allow information storage, query, and retrieval, as well as data processing, analysis, and visualization. These databases also provide a means to link microarray data to other bioinformatics databases (e.g., NCBI Entrez systems, UniGene, KEGG, and OMIM). The integration with external information is instrumental to the interpretation of patterns recognized in the gene-expression data. To facilitate the development of microarray databases and analysis tools, there is a need to establish a standard for recording and reporting microarray gene expression data. The MIAME (Minimum Information About a Microarray Experiment) standard includes a description of experimental design, array design, samples, hybridization, measurements, and normalization controls (Brazma, Hingamp, & Quackenbush, 2001).
Data Mining Objectives

Data mining addresses the question of how to discover a gold mine from historical or experimental data, particularly in a large database. The goal of data mining and knowledge discovery algorithms is to extract implicit and previously unknown nontrivial patterns, regularities, or knowledge from large data sets that can be used to improve strategic planning and decision making. The discovered knowledge capturing the relations among the variables of interest can be formulated as a function for making prediction and classification or as a model for understanding the problem in a given domain. In the context of microarray data, the objectives are identifying significant genes and finding gene expression patterns associated with known or unknown categories. Microarray data mining is an important topic in bioinformatics, dealing with information processing on biological data, particularly genomic data.
Practical Factors Prior to Data Mining

Some practical factors should be taken into account prior to microarray data mining. First of all, microarray data produced by different platforms vary in their formats and may need to be processed differently. For example, one type of microarray with cDNA as probes produces ratio data from two channel outputs, whereas another type of microarray using oligonucleotide probes generates non-ratio data from a single channel. Not only may different platforms pick up gene expression activity with different levels of sensitivity and specificity, but different data processing techniques may also be required for different data formats.

Normalizing data to allow direct array-to-array comparison is a critical issue in array data analysis, because several variables in microarray experiments can affect measured mRNA levels (Schadt, Li, Ellis, & Wong, 2001; Yang, Dudoit, & Luu, 2002). Variations may occur during sample handling, slide preparation, hybridization, or image analysis. Normalization is essential for correct microarray data interpretation. In simple schemes, data can be normalized by dividing or subtracting expression values by a representative value (e.g., the mean or median in an array) or by taking a linear transformation to zero mean and unit variance. As an example, data normalization in the case of cDNA arrays may proceed as follows: the local background intensity is subtracted from the value of each spot on the array; the two channels are normalized against the median values on that array; and the Cy5/Cy3 fluorescence ratios and log10-transformed ratios are calculated from the normalized values. In addition, genes that do not change significantly can be removed through a filter in a process called data filtration.
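The sketch below walks through the simple cDNA normalization just described, using made-up intensity values; it is an illustration of the steps, not a production normalization routine.

```python
# Background subtraction, per-channel median scaling, and log10 ratios.
import numpy as np

def normalize_cdna(cy5, cy3, cy5_bg, cy3_bg, eps=1e-9):
    cy5_corr = np.clip(cy5 - cy5_bg, eps, None)   # subtract local background
    cy3_corr = np.clip(cy3 - cy3_bg, eps, None)
    cy5_norm = cy5_corr / np.median(cy5_corr)     # normalize each channel to its median
    cy3_norm = cy3_corr / np.median(cy3_corr)
    ratio = cy5_norm / cy3_norm                   # Cy5/Cy3 fluorescence ratio
    return ratio, np.log10(ratio)

cy5 = np.array([1500.0, 800.0, 120.0, 4000.0])
cy3 = np.array([700.0, 850.0, 100.0, 1900.0])
ratio, log_ratio = normalize_cdna(cy5, cy3, cy5_bg=50.0, cy3_bg=40.0)
print(np.round(log_ratio, 3))
```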
Differential Gene Expression

To identify genes differentially expressed across two conditions is one of the most important issues in microarray data mining. In cancer research, for example, we wish to understand which genes are abnormally expressed in a certain type of cancer, so we conduct a microarray experiment and collect the gene expression profiles of normal and cancer tissues, respectively, as the control and test samples. The information regarding differential expression is derived from comparing the test against the control sample.
To determine which genes are differentially expressed, a common approach is based on fold change: we simply decide a fold-change threshold (e.g., twofold) and select genes associated with changes greater than that threshold. If a cDNA microarray is used, the ratio of the test over control expression in a single array can be converted easily to fold change in both cases of up-regulation (induction) and down-regulation (suppression). For oligonucleotide chips, fold change is computed from two arrays, one for the test and the other for the control sample. In this case, if multiple samples in each condition are available, the statistical t-test or Wilcoxon test can be applied, but the catch is that the Bonferroni adjustment to the level of significance in hypothesis testing is necessary to account for the presence of multiple genes. The t-test determines the difference in mean expression values between two conditions and identifies genes with a significant difference. The nonparametric Wilcoxon test is a good alternative in the case of a non-Gaussian data distribution. SAM (Significance Analysis of Microarrays) (Tusher, Tibshirani, & Chu, 2001) is a state-of-the-art technique based on balanced perturbation of repeated measurements and minimization of the false discovery rate.
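A hedged example of the per-gene t-test with a Bonferroni-adjusted significance level follows; the expression matrices are simulated stand-ins for real array data, and the sample sizes are arbitrary.

```python
# Gene-wise two-sample t-tests with a Bonferroni-corrected threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes = 1000
control = rng.normal(0.0, 1.0, (n_genes, 6))     # 6 control samples per gene
test = rng.normal(0.0, 1.0, (n_genes, 6))        # 6 test samples per gene
test[:20] += 2.0                                 # 20 truly induced genes

t_stat, p_val = stats.ttest_ind(test, control, axis=1)
alpha = 0.05 / n_genes                           # Bonferroni adjustment
significant = np.flatnonzero(p_val < alpha)
print(f"{significant.size} genes pass the Bonferroni-adjusted threshold")
```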
Coordinated Gene Expression

Identifying genes that are co-expressed across multiple conditions is an issue with significant implications in microarray data mining. For example, given gene expression profiles measured over time, we are interested in knowing which genes are functionally related. The answer to this question also leads us to deduce the functions of unknown genes from their correlation with genes of known functions. Equally important is the problem of organizing samples based on their gene expression profiles so that distinct phenotypes or disease processes may be recognized or discovered.

The solutions to both problems are based on so-called cluster analysis, which is meant to group objects into clusters according to their similarity. For example, genes are clustered by their expression values across multiple conditions; samples are clustered by their expression values across genes. The issue is how to measure the similarity between objects. Two popular measures are the Euclidean distance and Pearson's correlation coefficient. Clustering algorithms can be divided into hierarchical and nonhierarchical (partitional). Hierarchical clustering is either agglomerative (starting with singletons and progressively merging) or divisive (starting with a single cluster and progressively breaking). Hierarchical agglomerative clustering is most commonly used in the cluster analysis of microarray data. In this method, the two most similar clusters are merged at each stage until all the objects are included in a single cluster. The result is a dendrogram (a hierarchical tree) that encodes the relationships among objects by showing how clusters merge at each stage. Partitional clustering algorithms are best exemplified by k-means and self-organization maps (SOMs).
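An illustrative sketch of hierarchical agglomerative clustering of genes with a correlation-based distance (1 minus Pearson's correlation) is shown below; the expression profiles are simulated for the example, and the choice of three clusters is arbitrary.

```python
# Agglomerative clustering of gene expression profiles and a dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
base = rng.normal(size=(3, 10))                      # 3 underlying expression patterns
profiles = np.repeat(base, 5, axis=0) + rng.normal(scale=0.2, size=(15, 10))

dist = pdist(profiles, metric="correlation")         # 1 - Pearson correlation
tree = linkage(dist, method="average")               # agglomerative merging (dendrogram)
labels = fcluster(tree, t=3, criterion="maxclust")   # cut the dendrogram into 3 groups
print(labels)
```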
Gene Selection for Discriminant Analysis

Taking an action based on the category of the pattern recognized in microarray gene expression data is an increasingly important approach to medical diagnosis and management (Furey, Cristianini, & Duffy, 2000; Golub, Slonim, & Tamayo, 1999; Khan, Wei, & Ringner, 2001). A class predictor derived on this basis can automatically discover the distinction between different classes of samples, independent of previous biological knowledge (Golub et al., 1999). Gene expression information appears to be a more reliable indicator than phenotypic information for categorizing the underlying causes of diseases. The microarray approach has offered hope for clinicians to arrive at more objective and accurate cancer diagnoses and hence choose more appropriate forms of treatment (Tibshirani, Hastie, Narasimhan, & Chu, 2002).

The central question is how to construct a reliable classifier that predicts the class of a sample on the basis of its gene expression profile. This is a pattern recognition problem, and the type of analysis involved is referred to as discriminant analysis. In practice, given a limited number of samples, correct discriminant analysis must rely on the use of an effective gene selection technique to reduce the gene number and, hence, the data dimensionality. The objective of gene selection is to select genes that most contribute to classification as well as provide biological insight. Approaches to gene selection range from statistical analysis (Golub et al., 1999) and a Bayesian model (Lee, Sha, Dougherty, Vannucci, & Mallick, 2003) to Fisher's linear discriminant analysis (Xiong, Li, Zhao, Jin, & Boerwinkle, 2001) and support vector machines (SVMs) (Guyon, Weston, Barnhill, & Vapnik, 2002). This is one of the most challenging areas in microarray data mining. Despite good progress, the reliability of selected genes should be further improved. Table 1 summarizes some of the most important microarray data-mining problems and their solutions.
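As a hedged sketch of this workflow (not any of the published methods cited above), the example below ranks genes by an absolute t-statistic, keeps the top-ranked ones, and trains a linear support vector machine; the data are simulated and the number of selected genes is an arbitrary choice.

```python
# Simple filter-style gene selection followed by a linear SVM classifier.
import numpy as np
from scipy import stats
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_samples, n_genes = 40, 500
X = rng.normal(size=(n_samples, n_genes))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :10] += 1.5                               # 10 genes carry the class signal

t_stat, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
top_genes = np.argsort(-np.abs(t_stat))[:10]        # keep the 10 top-ranked genes

clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, X[:, top_genes], y, cv=5)
print("selected genes:", top_genes)
print("cross-validated accuracy: %.2f" % scores.mean())
```

Note that in a rigorous evaluation the gene selection step should be nested inside the cross-validation loop; selecting genes on all samples first, as in this toy example, gives optimistically biased accuracy estimates.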
Table 1. Three common computational problems in microarray data mining

Problem 1: To identify differentially expressed genes, given microarray gene expression data collected in two conditions, types, or states. Solutions:
• Fold change
• t-test or Wilcoxon rank sum test (with Bonferroni's correction)
• Significance analysis of microarrays

Problem 2: To identify genes expressed in a coordinated manner, given microarray gene expression data collected across a set of conditions or time points. Solutions:
• Hierarchical clustering
• Self-organization maps
• k-means clustering

Problem 3: To select genes for discriminant analysis, given microarray gene expression data of two or more classes. Solutions:
• Neighborhood analysis
• Support vector machines
• Principal component analysis
• Bayesian analysis
• Fisher's linear discriminant analysis

Microarray Data-Mining Applications

Microarray technology permits a large-scale analysis of gene functions in a genomic perspective and has brought about important changes in how we conduct basic research and practice clinical medicine. There have been an increasing number of applications of this technology. Here, the role of data mining in discovering biological and clinical knowledge from microarray data is examined.

Consider that only a minority of all the yeast (Saccharomyces cerevisiae) open reading frames in the genome sequence could be functionally annotated on the basis of sequence information alone (Zweiger, 1999), although microarray results showed that nearly 90% of all yeast mRNAs (messenger RNAs) are observed to be present (Wodicka, Dong, Mittmann, Ho, & Lockhart, 1997). Functional annotation of a newly discovered gene based on sequence comparison with other known gene sequences is sometimes misleading. Microarray-based genome-wide gene expression analysis has made it possible to deduce the functions of novel or poorly characterized genes from co-expression with already known genes (Eisen, Spellman, Brown, & Botstein, 1998). The microarray technology is a valuable tool for measuring whole-genome mRNA and enables system-level exploration of transcriptional regulatory networks (Cho, Campbell, & Winzeler, 1998; DeRisi, Iyer, & Brown, 1997; Laub, McAdams, Feldblyum, Fraser, & Shapiro, 2000; Tavazoie, Hughes, Campbell, Cho, & Church, 1999). Hierarchical clustering can help us recognize genes whose cis-regulatory elements are bound by the same proteins (transcription factors) in vivo. Such a set of coregulated genes is known as a regulon. Statistical characterization of known regulons is used to derive criteria for inferring new regulatory elements. Identifying regulatory elements and associated transcription factors is fundamental to building a global gene regulatory network essential for understanding the genetic control and biology of living cells. Thus, determining gene functions and gene networks from microarray data is an important application of data mining.

The limitation of the morphology-based approach to cancer classification has led to molecular classification. Techniques such as immunohistochemistry and RT-PCR are used to detect cancer-specific molecular markers, but pathognomonic molecular markers are unfortunately unavailable for most solid tumors (Ramaswamy, Tamayo, & Rifkin, 2001). Furthermore, molecular markers do not guarantee a definitive diagnosis, owing to possible failure of detection or the presence of marker variants. The approach of constructing a classifier based on gene expression profiles has gained increasing interest, following the success in demonstrating that microarray data differentiated between two types of leukemia (Golub et al., 1999). In this application, the two data-mining problems are to identify gene expression patterns or signatures associated with each type of leukemia and to discover subtypes within each. The first problem is dealt with by gene selection, and the second one by cluster analysis. Table 2 illustrates some applications of microarray data mining.
Table 2. Examples of microarray data mining applications

Classical Work:
• Identified functionally related genes and their genetic control upon metabolic shift from fermentation to respiration (DeRisi et al., 1997).
• Explored co-expressed or coregulated gene families by cluster analysis (Eisen et al., 1998).
• Determined genetic network architecture based on coordinated gene expression analysis and promoter motif analysis (Tavazoie et al., 1999).
• Differentiated acute myeloid leukemia from acute lymphoblastic leukemia by selecting genes and constructing a classifier for discriminant analysis (Golub et al., 1999).
• Selected genes differentially expressed in response to ionizing radiation based on significance analysis (Tusher et al., 2001).

Recent Work:
• Analyzed gene expression in the Arabidopsis genome (Yamada, Lim, & Dale, 2003).
• Discovered conserved genetic modules (Stuart, Segal, Koller, & Kim, 2003).
• Elucidated functional properties of genetic networks and identified regulatory genes and their target genes (Gardner, di Bernardo, Lorenz, & Collins, 2003).
• Identified genes associated with Alzheimer's disease (Roy Walker, Smith, & Liu, 2004).

FUTURE TRENDS

The future challenge is to realize biological networks that provide qualitative and quantitative understanding of molecular logic and dynamics. To meet this challenge, recent research has begun to focus on leveraging prior biological knowledge and integration with biological analysis in quest of biological truth. In addition, there is increasing interest in applying statistical bootstrapping and data permutation techniques to mining microarray data for appraising the reliability of learned patterns.

CONCLUSION

Microarray technology has rapidly emerged as a powerful tool for biological research and clinical investigation. However, the large quantity and complex nature of data produced in microarray experiments often plague researchers who are interested in using this technology. Microarray data mining uses specific data processing and normalization strategies and has its own objectives, requiring effective computational algorithms and statistical techniques to arrive at valid results. The microarray technology has been perceived as a revolutionary technology in biomedicine, but the hardware device does not pay off unless backed up with sound data-mining software.
ACKNOWLEDGMENT

This work is supported by the National Science Foundation under Grant IIS-0221954.
REFERENCES Brazma, A., Hingamp, P., & Quackenbush, J. (2001). Minimum information about a microarray experiment (MIAME) toward standards for microarray data. Nat Genet, 29(4), 365-371. Cho, R. J., Campbell, M. J., & Winzeler, E. A. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1), 65-73. Comander, J., Weber, G. M., Gimbrone, M. A., Jr., & GarciaCardena (2001). Argus: A new database system for Webbased analysis of multiple microarray data sets. Genome Res, 11(9), 1603-1610. DeRisi, J. L., Iyer, V. R., & Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338), 680-686. Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Acad Sci, USA, 95(25), 14863-14868.
Guyon, I., Weston, J., Barnhill, S., & Vapnik (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1/3), 389-422. Khan, J., Wei, J. S., & Ringner, M. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, 7(6), 673-679. Laub, M. T., McAdams, H. H., Feldblyum, T., Fraser & Shapiro (2000). Global analysis of the genetic network controlling a bacterial cell cycle. Science, 290(5499), 2144-2148. Lee, K. E., Sha, N., Dougherty, E. R., Vannucci & Mallick (2003). Gene selection: A Bayesian variable selection approach. Bioinformatics, 19(1), 90-97. Ramaswamy, S., Tamayo, P., & Rifkin, R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Acad Sci, USA, 98(26), 15149-15154. Roy Walker, P., Smith, B., & Liu, Q. Y. (2004). Data mining of gene expression changes in Alzheimer brain. Artif Intell Med, 31(2), 137-154. Schadt, E. E., Li, C., Ellis, B., & Wong (2001). Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. Journal of Cell Biochemistry, (Suppl. 37), 120-125. Schena, M., Heller, R. A., & Theriault, T. P. (1998). Microarrays: Biotechnology’s discovery platform for functional genomics. Trends Biotechnol, 16(7), 301-306.
Ekins, R., & Chu, F. W. (1999). Microarrays: Their origins and applications. Trends Biotechnol, 17(6), 217-218.
Stuart, J. M., Segal, E., Koller, D., & Kim (2003). A genecoexpression network for global discovery of conserved genetic modules. Science, 302(5643), 249-255.
Ermolaeva, O., Rastogi, M., & Pruitt, K. D., (1998). Data management and analysis for gene expression arrays. Nat Genet, 20(1), 19-23.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho & Church (1999). Systematic determination of genetic network architecture. Nat Genet, 22(3), 281-285.
Furey, T. S., Cristianini, N., & Duffy, N., (2000). Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906-914.
Tibshirani, R., Hastie, T., Narasimhan, B., & Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Acad Sci, USA, 99(10), 6567-6572.
Gardner, T. S., di Bernardo, D., Lorenz, D., & Collins (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629), 102-105.
Tusher, V. G., Tibshirani, R., & Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Acad Sci, USA, 98(9), 5116-5121.
Golub, T. R., Slonim, D. K., & Tamayo, P. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
Wodicka, L., Dong, H., Mittmann, M., Ho & Lockhart (1997). Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat Biotechnol, 15(13), 1359-1367.
Xiong, M., Li, W., Zhao, J., Jin & Boerwinkle (2001). Feature (gene) selection in gene expression-based tumor classification. Mol Genet Metab, 73(3), 239-247. Yamada, K., Lim, J., & Dale, J. M. (2003). Empirical analysis of transcriptional activity in the Arabidopsis genome. Science, 302(5646), 842-846. Yang, Y. H., Dudoit, S., & Luu, P. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, 30(4), e15. Zweiger, G. (1999). Knowledge discovery in gene-expression-microarray data: Mining the information output of the genome. Trends Biotechnol, 17(11), 429-436.
KEY TERMS Bioinformatics: All aspects of information processing on biological data, in particular genomic data. The rise of bioinformatics is driven by the genomic projects.
Cis-Regulatory Element: The genetic region that affects the activity of a gene on the same DNA molecule. Clustering: The process of grouping objects according to their similarity. This is an important approach to microarray data mining. Functional Genomics: The study of gene functions on a genomic scale, especially based on microarrays. Gene Expression: Production of mRNA from DNA (a process known as transcription) and production of protein from mRNA (a process known as translation). Microarrays are used to measure the level of gene expression in a tissue or cell. Genomic Medicine: Integration of genomic and clinical data for medical decision. Microarray: A chip on which numerous probes are placed for hybridization with a tissue sample to analyze its gene expression. Postgenome Era: The time after the complete human genome sequence is decoded. Transcription Factor: A protein that binds to the cis-element of a gene and affects its expression.
Microarray Databases for Biotechnology
Richard S. Segall, Arkansas State University, USA
INTRODUCTION
Microarray informatics is a rapidly expanding discipline in which large amounts of multi-dimensional data are compressed into small storage units. Data mining of microarrays can be performed using techniques such as drill-down analysis rather than classical data analysis on a record-by-record basis. Both data and metadata can be captured in microarray experiments. The latter may be constructed by obtaining data samples from an experiment. Extractions can be made from these samples and formed into the homogeneous arrays that are needed for higher-level analysis and mining. Biologists and geneticists find microarray analysis to be both a practical and an appropriate method of storing images, together with pixel or spot intensities and identifiers, and other information about the experiment.
BACKGROUND A Microarray has been defined by Schena (2003) as “an ordered array of microscopic elements in a planar substrate that allows the specific binding of genes or gene products.” Schena (2003) claims microarray databases as “a widely recognized next revolution in molecular biology that enables scientists to analyze genes, proteins, and other biological molecules on a genomic scale.” According to an article (2004) on the National Center for Biotechnology Information (NCBI) Web site, “because microarrays can be used to examine the expression of hundreds or thousands of genes at once, it promises to revolutionize the way scientists examine gene expression,” and “this technology is still considered to be in its infancy.” The following Figure 1 is from a presentation by Kennedy (2003) of CSIRO (Commonwealth Scientific & Industrial Research Organisation) in Australia as available on the Web, and illustrates an overview of the microarray process starting with sequence data of individual clones that can be organized into libraries. Individual samples are taken from the library as spots and arranged by robots onto slides that are then scanned by lasers. The image scanned by lasers is than quantified according to the color generated by each individual spot
, and these quantified values are then organized into a results set as a text file that can then be subjected to analyses such as data mining. Jagannathan (2002) of the Swiss Institute of Bioinformatics (SIB) described databases for microarrays, including their construction from microarray experiments such as gathering data from cells subjected to more than one condition. The latter are hybridized to a microarray that is stored after the experiment by methods such as scanned images. Hence, data must be stored both before and after the experiments, and the software used must be capable of dealing with large volumes of both numeric and image data. Jagannathan (2002) also discussed some of the most promising existing non-commercial microarray databases: ArrayExpress, which is a public microarray gene expression repository; the Gene Expression Omnibus (GEO), which is a gene expression database hosted at the National Library of Medicine; and GeneX, which is an open source database and integrated tool set released by the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico. Grant (2001) wrote an entire thesis on microarray databases, setting the scene for their application to genetics and to the human genome with its three-billion-letter sequence.
Figure 1. Overview of the microarray process (Kennedy, 2003): sequences, libraries, samples, slides, scans, result sets, analyses
Kim (2002) presented improved analytical methods for microarray-based genome composition analysis by selecting a signal value that is used as a cutoff to discriminate present
and divergent genes. Do et al. (2003) provided comparative evaluation of microarray-based gene expression databases by analyzing the requirements for microarray data management, and Sherlock (2003) discussed storage and retrieval of microarray data for molecular biology. Kemmeren (2001) described a bioinformatics pipeline for supporting microarray analysis with example of production and analysis of DNA (Deoxyribonucleic Acid) microarrays that require informatics support. Gonclaves & Marks (2002) discussed roles and requirements for a research microarray database. An XML description language called MAML (Microarray Annotation Markup Language) has been developed to allow communication with other databases worldwide (Cover Pages 2002). Liu (2004) discusses microarray databases and MIAME (Minimal Information about a Microarray Experiment) that defines what information at least should be stored. For example, the MIAME for array design would be the definite structure and definition of each array used and their elements. The Microarray Gene Expression Database Group (MGED) composed and developed the recommendations for microarray data annotations for both MAIME and MAML in 2000 and 2001 respectively in Cambridge, United Kingdom. Jonassen (2002) presents a microarray informatics resource Web page that includes surveys and introductory papers on informatics aspects, and database and software links. Another resourceful Web site is that from the Lawrence Livermore National Labs (2003) entitled Microarray Links that provides an extensive list of active Web links for the categories of databases, microarray labs, and software and tools including data mining tools. University-wide database systems have been established such as at Yale as the Yale Microarray Database (YMD) to support large-scale integrated analysis of large amounts of gene expression data produced by a wide variety of microarray experiments for different organisms as described by Cheung (2004), and similarly at Stanford with Stanford Microarray Database (SMD) as described by both Sherlock (2001) and Selis (2003). Microarray Image analysis is currently included in university curricula, such as in Rouchka (2003) Introduction to Bioinformatics graduate course at University of Louisville. In relation to the State of Arkansas, the medical school is situated in Little Rock and is known as the University of Arkansas for Medical Sciences (UAMS). A Bioinformatics Center is housed within UAMS that is involved with the management of microarray data. The software utilized at UAMS for microarray analysis includes BASE (BioArray Software Environment) and
AMAD, which is a Web driven database system written entirely in PERL and JavaScript (UAMS, Bioinformatics Center, 2004).
MAIN THRUST The purpose of this article is to help clarify the meaning of microarray informatics. The latter is addressed by summarizing some illustrations of applications of data mining to microarray databases specifically for biotechnology. First, it needs to be stated which data mining tools are useful in data mining of microarrays. SAS Enterprise Miner, which was used in Segall et al. (2003, 2004a, 2004b) as discussed below contains the major data mining tools of decisions trees, regression, neural networks, and clustering, and also other data mining tools such as association rules, variable selection, and link analysis. All of these are useful data mining tools for microarray databases regardless if using SAS Enterprise Miner or not. In fact, an entire text has been written by Draghici (2003) on data analysis tools for DNA microarrays that includes these data mining tools as well as numerous others tools such as analysis of functional categories and statistical procedure of corrections for multiple comparisons.
Scientific and Statistical Data Mining and Visual Data Mining for Genomes Data mining of microarray databases has been discussed by Deyholos (2002) for bioinformatics by methods that include correlation of patterns and identifying the significance analysis of microarrays (SAM) for genes within DNA. Visual data mining was utilized to distinguish the intensity of data filtering and the effect of normalization of the data using regression plots. Tong (2002) discusses supporting microarray studies for toxicogenomic databases through data integration with public data and applying visual data mining such as ScatterPlot viewer. Chen et al. (2003) presented a statistical approach using a Gene Expression Analysis Refining System (GEARS). Piatetsky-Shapiro and Tamayo (2003) discussed the main types of challenges for microarrray data mining as including gene selection, classification, and clustering. According to Piatetsky-Shapiro and Tamayo (2003), one of the important challenges for data mining of microarrays is that “the difficulty of collecting microarray samples causes the number of samples to remain small” and “while
the number of fields corresponding to the number of genes is typically in the thousands” this “creates a high likelihood of finding false positives.” Piatetsky-Shapiro and Tamayo (2003) identify areas in which micorarrays and data mining tools can be improved that include “better accuracy, more robust models and estimators” as well as better appropriate biological interpretation of the computational or statistical results for those microarrays constructed from biomedical or DNA data. Piatetsky-Shapiro and Tamayo (2003) summarize up the areas in which microarray and microarry data mining tools can be improved by stating: Typically a computational researcher will apply his or her favorite algorithm to some microarray dataset and quickly obtain a voluminous set of results. These results are likely to be useful but only if they can be put in context and followed up with more detailed studies, for example by a biologist or a clinical researcher. Often this follow up and interpretation is not done carefully enough because of the additional significant research involvement, the lack of domain expertise or proper collaborators, or due to the limitations of the computational analysis itself. Draghici (2003) discussed in-depth other challenges in using microarrays specifically for gene expression studies, such as being very noisy or prone to error after the scanning and image processing steps, consensus as to how to perform normalization, and the fact that microarrays are not necessarily able to substitute completely other biological factors or tools in the realm of the molecular biologist. Mamitsuka et al. (2003) mined biological active patterns in metabolic pathways using microarray expression profiles. Mamitsuka (2003) utilized microarray data sets of gene expressions on yeast proteins. Curran et al. (2003) performed statistical methods for joint data mining of gene expressions and DNA sequence databases. The statistical methods used include linear mixed effect model, cluster analysis, and logistic regression. Zaki et al. (2003) reported on an overview of the papers on data mining in bioinformatics as presented at the International Conference on Knowledge Discovery and Data Mining held in Washington, DC in August 2003. Some of the novel data mining techniques discussed in papers at this conference included gene expression analysis, protein/RNA (ribonucleic acid) structure prediction, and gene finding.
Scientific and Statistical Data Mining and Visual Data Mining for Plants Segall et al. (2003, 2004a, 2004b) performed data mining for assessing the impact of environmental stresses on plant geonomics and specifically for plant data from the Osmotic Stress Microarray Information Database (OSMID). The latter databases are considered to be representative of those that could be used for biotech application such as the manufacture of plant-made-pharmaceuticals (PMP) and genetically modified (GM) foods. The Osmotic Stress Microarray Information Database (OSMID) database that was used in the data mining in Segall et al. (2003, 2004a, 2004b) contains the results of approximately 100 microarray experiments performed at the University of Arizona as part of a National Science Foundation (NSF) funded project named the “The Functional Genomics of Plant Stress” whose data constitutes a data warehouse. The OSMID microarray database is available for public access on the Web hosted by Universite Montpellier II (2003) in France, and the OSMID contains information about the more than 20,000 ESTs (Experimental Stress Tolerances) that were used to produce these arrays. These 20,000 ESTs could be considered as components of data warehouse of plant microarray databases that was subjected to data mining in Segall et al. (2003, 2004a, 2004b). The data mining was performed using SAS Enterprise Miner and its cluster analysis module that yielded both scientific and statistical data mining as well as visual data mining. The conclusions of Segall et al. (2003, 2004a, 2004b) included the facts about the twenty-five different variations or levels of the environmental factor of salinity on plant of corn, as also evidenced by the visualization of the clusters formed as a result of the data mining.
Other Useful Sources of Tools and Projects for Microarray Informatics
• A bibliography on microarray data analysis, available on the Web, created by Li (2004), which includes books and reprints for the last ten years.
• The Rosalind Franklin Centre for Genomics Research (RFCGR) of the Medical Research Council (MRC) (2004) in the UK provides a Web site with links for data mining tools and descriptions of their specific applications to gene expressions and microarray databases for genomics and genetics.
• Reviews of data mining software as applied to genetic microarray databases are included in an annotated list of references for microarray software review compiled by Leung et al. (2002).
• Web links for the statistical analysis of microarray data are provided by van Helden (2004).
• Reid (2004) provides Web links to software tools for microarray data analysis, including image analysis.
• The Bio-IT World Journal Web site has a Microarray Resource Center that includes a link to extensive resources for microarray informatics at the European Bioinformatics Institute (EBI).
FUTURE TRENDS
The wealth of resources available on the Web for microarray informatics only supports the premise that microarray informatics is a rapidly expanding field. This growth is in both software and methods of analysis that include techniques of data mining. Future research opportunities in microarray informatics include the biotech applications for manufacture of plant-made-pharmaceuticals (PMP) and genetically modified (GM) foods.
CONCLUSION
Because data within genome databases is composed of micro-level components such as DNA, microarray databases are a critical tool for analysis in biotechnology. Data mining of microarray databases opens up the field of microarray informatics as a multi-faceted tool for knowledge discovery.
ACKNOWLEDGMENT The author wishes to acknowledge the funding provided by a block grant from the Arkansas Biosciences Institute (ABI) as administered by Arkansas State University (ASU) to encourage development of a focus area in Biosciences Institute Social and Economic and Regulatory Studies (BISERS) for which he served as CoInvestigator (Co-I) in 2003, and with which funding the analyses of the Osmotic Stress Microarray Information Database (OSMID) discussed within this article were performed. The author also wishes to acknowledge a three-year software grant from SAS Incorporated to the College of Business at Arkansas State University for SAS Enterprise Miner that was used in the data mining of the OSMID microarrays discussed within this article.
Finally, the author also wishes to acknowledge the useful reviews of the three anonymous referees of the earlier version of this article without whose constructive comments the final form of this article would not have been possible.
REFERENCES
Bio-IT World Inc. (2004). Microarray resources and articles. Retrieved from http://www.bio-itworld.com/resources/microarray/ Chen, C.H. et al. (2003). Gene expression analysis refining system (GEARS) via statistical approach: A preliminary report. Genome Informatics, 14, 316-317. Cheung, K.H. et al. (2004). Yale Microarray Database System. Retrieved from http://crcjs.med.utah.edu/bioinfo/abstracts/Cheung,%20Kei.doc Curran, M.D., Liu, H., Long, F., & Ge, N. (2003, December). Machine learning in low-level microarray analysis. SIGKDD Explorations, 5(2), 122-129. Deyholos, M. (2002). An introduction to exploring genomes and mining microarrays. In O’Reilly Bioinformatics Technology Conference, January 28-31, 2002, Tucson, AZ. Retrieved from http://conferences.oreillynet.com/cs/bio2002/view/e_sess/1962 Do, H., Toralf, K., & Rahm, E. (2003). Comparative evaluation of microarray-based gene expression databases. Retrieved from http://www.btw2003.de/proceedings/paper/96.pdf Draghici, S. (2003). Data analysis tools for DNA microarrays. Boca Raton, FL: Chapman & Hall/CRC. Goncalves, J., & Marks, W.L. (2002). Roles and requirements for a research microarray database. IEEE Engineering Medical Biol Magazine, 21(6), 154-157. Grant, E. (2001, September). A microarray database. Thesis for Masters of Science in Information Technology, The University of Glasgow. Jagannathan, V. (2002). Databases for microarrays. Presentation at Swiss Institute of Bioinformatics (SIB), University of Lausanne, Switzerland. Retrieved from http://www.ch.embnet.org/CoursEMBnet/CHIP02/ppt/Vidhya.ppt Jonassen, I. (2002). Microarray informatics resource page. Retrieved from http://www.ii.uib.no/~inge/micro Kemmeren, P.C., & Holstege, F.C. (2001). A bioinformatics pipeline for supporting microarray analysis. Retrieved
from http://www.genomics.med.uu.nl/presentations/ Bioinformatics-2001-Patrick2.ppt Kennedy, G. (2003). GENAdb: Genomics Array Database. CSIRO (Commonwealth Scientific & Industrial Research Organisation) Plant Industry, Australia. Retrieved from http://www.pi.csiro.au/gena/repository/GENAdb.ppt Kim, C.K., Joyce E.A., Chan, K., & Falkow, S. (2002). Improved analytical methods for microarray-based genome composition analysis. Genome Biology, 3(11). Retrieved from http://genomebiology.com/2002/3/11/research/0065 Lawrence Livermore National Labs. (2003). Microarray Links. Retrieved from http://microarray.llnl.gov/ links.html. Leung, Y.F. (2002). Microarray software review. In D. Berrar, W. Dubitzky & M. Granzow (Eds.), A practical approach to microarray data analysis (pp. 326-344). Boston: Kluwer Academic Publishers. Li, W. (2004). Bibliography on microarray data analysis.Retrieved from http://www.nslij-genetics.org. microarray/2004.html Liu, Y. (2004). Microarray Databases and MIAME (Minimum Information About a Microarray Experiment). Retrieved from http://titan.biotec.uiuc.edu/cs491jh/slides/ cs491jh-Yong.ppt Mamitsuka, H., Okuno, Y., & Yamaguchi, A. (2003, December). Mining biological active patterns in metabolic pathways using microarray expression profiles. SIGKDD Explorations, 5(2), 113-121. Medical Research Council. (2004). Genome Web: Gene expression and microarrays. Retrieved from http:// www.rfcgr.mrc.ac.uk/GenomeWeb/nuc-genexp.html Microarray Markup Language (MAML). (2002, February 8). Cover Pages Technology Reports. Retrieved from http://xml.coverpages.org/maml.html National Center for Biotechnology Information (NCBI). (2004, March 30). Microarrays: Chipping away at the mysteries of science and medicine. National Library of Medicine (NLM), National Institutes of Health (NIH). Retrieved from http://www.ncbi.nlm.nih.gov/About/ primer/microarrys.html Piatetsky-Shapiro, G., & Tamayo, P. (2003, December). Microarray data mining: Facing the challenges. SIGKDD Explorations, 5(2), 1-5. Reid, J.F. (2004). Software tools for microarray data analysis. Retrieved from http://www.ifom-firc.it/ MICROARRAY/data_analysis.htm 738
Rouchka, E. (2003). CECS 694 Introduction to Bioinformatics. Lecture 12. Microarray Image Analysis, University of Louisville. Retrieved from http:// kbrin.a-bldg.louisville.edu/~rouchka/CECS694_ 2003/ Week12.html Schena, M. (2003). Microarray analysis. New York: John Wiley & Sons. Segall, R.S., Guha, G.S., & Nonis, S. (2003). Data mining for analyzing the impact of environmental stress on plants: A case study using OSMID. Manuscript in preparation for journal submission. Segall, R.S., Guha, G.S., & Nonis, S. (2004b, May). Data mining for assessing the impact of environmental stresses on plant geonomics. In Proceedings of the Thirty-Fifth Meeting of the Southwest Decision Sciences Institute (pp. 23-31). Orlando, FL. Segall, R.S., & Nonis, S. (2004a, February). Data mining for analyzing the impact of environmental stress on plants: A case study using OSMID. Accepted for publication in Acxiom Working Paper Series of Acxiom Laboratory of Applied Research (ALAR) and presented at Acxiom Conference on Applied Research and Information Technology, University of Arkansas at Little Rock (UALR). Selis, S. (2003, February 15). Stanford researcher advocates far-reaching microarray data exchange. News release of Stanford University School of Medicine. Retrieved from http://www.stanfordhospital.com/ newsEvents/mewsReleases/2003/02/aaasSherlock.html Sherlock, G. et al. (2001). The Stanford microarray database. Nucleic Acids Research, 29(1), 152-155. Sherlock, G., & Ball, C.A. (2003). Microarray databases: Storage and retrieval of microarray data. In M.J. Brownstein & A. Khodursky (Eds.), Functional genomics: Methods and protocols (pp. 235-248). Methods in Molecular Biology Series (Vol. 224). Totowa, NJ, Humana Press. Tong, W. (2002, December). ArrayTrack-Supporting microarray studies through data integration. U. S. Food and Drug Administration (FDA)/National Center for Toxicological Research (NCTR) Toxioinformatics Workshop: Toxicogenomics Database, Study Design and Data Analysis. Universite Montpellier II. (2003). The Virtual Library of Plant-Array: Databases. Retrieved from http://www.univmontp2.fr/~plant_arrays/databases.html University of Arkansas for Medical Sciences (UAMS) Bioinformatics Center. (2004). Retrieved from http:// bioinformatics.uams.edu/microarray/database.html
Van Helden, J. (2004). Statistical analysis of microarray data: Links.Retrieved from http://www.scmbb.ulb.ac.be/ ~jvanheld/web_course_microarrays/links.html Zaki, M.J., Wang, H.T., & Toivonen, H.T. (2003, December). Data mining in bioinformatics. SIGKDD Explorations, 5(2), 198-199.
KEY TERMS Data Warehouses: A huge collection of consistent data that is both subject-oriented and time variant, and used in support of decision-making. Genomic Databases: Organized collection of data pertaining to the genetic material of an organism. Metadata: Data about data, for example, data that describes the properties or characteristics of other data.
MIAME (Minimal Information about a Microarray Experiment): Defines what information at least should be stored. Microarray Databases: Store large amounts of complex data as generated by microarray experiments (e.g., DNA). Microarray Informatics: The study of the use of microarray databases to obtain information about experimental data. Microarray Markup Language (MAML): An XML (Extensible Markup Language)-based format for communicating information about data from microarray experiments. Scientific and Statistical Data Mining: The use of data and image analyses to investigate knowledge discovery of patterns in the data. Visual Data Mining: The use of computer generated graphics in both 2-D and 3-D for the use in knowledge discovery of patterns in data.
Mine Rule
Rosa Meo, Universitá degli Studi di Torino, Italy
Giuseppe Psaila, Universitá degli Studi di Bergamo, Italy
INTRODUCTION Mining of association rules is one of the most adopted techniques for data mining in the most widespread application domains. A great deal of work has been carried out in the last years on the development of efficient algorithms for association rules extraction. Indeed, this problem is a computationally difficult task, known as NP-hard (Calders, 2004), which has been augmented by the fact that normally association rules are being extracted from very large databases. Moreover, in order to increase the relevance and interestingness of obtained results and to reduce the volume of the overall result, constraints on association rules are introduced and must be evaluated (Ng et al.,1998; Srikant et al., 1997). However, in this contribution, we do not focus on the problem of developing efficient algorithms but on the semantic problem behind the extraction of association rules (see Tsur et al. [1998] for an interesting generalization of this problem). We want to put in evidence the semantic dimensions that characterize the extraction of association rules; that is, we describe in a more general way the classes of problems that association rules solve. In order to accomplish this, we adopt a general-purpose query language designed for the extraction of association rules from relational databases. The operator of this language, MINE RULE, allows the expression of constraints, constituted by standard SQL predicates that make it suitable to be employed with success in many diverse application problems. For a comparison between this query language and other state-of-the-art languages for data mining, see Imielinski, et al. (1996); Han, et al. (1996); Netz, et al. (2001); Botta, et al. (2004). In Imielinski, et al. (1996), a new approach to data mining is proposed, which is constituted by a new generation of databases called Inductive Databases (IDBs). With an IDB, the user/analyst can use advanced query languages for data mining in order to interact with the knowledge discovery (KDD) system, extract data mining descriptive and predictive patterns from the database, and store them in the database. Boulicaut, et al.
(1998) and Baralis, et al. (1999) discuss the usage of MINE RULE in this context. We want to show that, thanks to a highly expressive query language, it is possible to exploit all the semantic possibilities of association rules and to solve very different problems with a unique language, whose statements are instantiated along the different semantic dimensions of the same application domain. We discuss examples of statements solving problems in different application domains that nowadays are of a great importance. The first application is the analysis of a retail data, whose aim is market basket analysis (Agrawal et al., 1993) and the discovery of user profiles for customer relationship management (CRM). The second application is the analysis of data registered in a Web server on the accesses to Web sites by users. Cooley, et al. (2000) present a study on the same application domain. The last domain is the analysis of genomic databases containing data on micro-array experiments (Fayyad, 2003). We show many practical examples of MINE RULE statements and discuss the application problems that can be solved by analyzing the association rules that result from those statements.
BACKGROUND An association rule has the form B ⇒ H, where B and H are sets of items, respectively called body (the antecedent) and head (the consequent). An association rule (also denoted for short with rule) intuitively means that items in B and H often are associated within the observed data. Two numerical parameters denote the validity of the rule: support is the fraction of source data for which the rule holds; confidence is the conditional probability that H holds, provided that B holds. Two minimum thresholds for support and confidence are specified before rules are extracted, so that only significant rules are extracted. This very general definition, however, is incomplete and very ambiguous. For example, what is the meaning of “fraction of source data for which the rule holds”? Or what are the items associated by a rule? If we do not answer these basic questions, an association rule does
not have a precise meaning. Consider, for instance, the original problem for which association rules were initially proposed in Agrawal, et al. (1993)—the market baskets analysis. If we have a database collecting single purchase transactions (i.e., transactions performed by customers in a retail store), we might wish to extract association rules that associate items sold within the same transactions. Intuitively, we are defining the semantics of our problem—items are associated by a rule if they appear together in the same transaction. Support denotes the fraction of the total transactions that contain all the items in the rule (both B and H), while confidence denotes the conditional probability that, found B in a transaction, also H is found in the same transaction. Thus a rule {pants, shirt} ⇒ {socks, shoes} support=0.02 confidence=0.23 means that the items pants, shirt, socks, and shoes appear together in 2% of the transactions, while having found items pants and shirt in a transaction, the probability that the same transaction also contains socks and shoes is 23%.
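To make the two measures concrete, the statements below compute the support and confidence of the example rule in plain SQL over a purchase relation SALES(trans_id, item), one row per item sold in a transaction. The relation name and layout are assumptions for this sketch (they mirror the kind of transaction table used by SQL-based miners such as SETM), not part of the MINE RULE proposal itself.

-- transactions containing the whole body {pants, shirt}
SELECT COUNT(*) AS body_count
FROM (SELECT trans_id
      FROM SALES
      WHERE item IN ('pants', 'shirt')
      GROUP BY trans_id
      HAVING COUNT(DISTINCT item) = 2) AS body_trans;

-- transactions containing body and head {pants, shirt, socks, shoes}
SELECT COUNT(*) AS rule_count
FROM (SELECT trans_id
      FROM SALES
      WHERE item IN ('pants', 'shirt', 'socks', 'shoes')
      GROUP BY trans_id
      HAVING COUNT(DISTINCT item) = 4) AS rule_trans;

Dividing rule_count by the total number of distinct trans_id values gives the support (0.02 in the example), while dividing it by body_count gives the confidence (0.23).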
Semantic Dimensions MINE RULE puts in evidence the semantic dimensions that characterize the extraction of association rules from within relational databases and force users (typically analysts) to understand these semantic dimensions. Indeed, extracted association rules describe the most recurrent values of certain attributes that occur in the data (in the previous example, the names of the purchased product). This is the first semantic dimension that characterizes the problem. These recurrent values are observed within sets of data grouped by some common features (i.e., the transaction identifier in the previous example but, in general, the date, the customer identifier, etc.). This constitutes the second semantic dimension of the association rule problem. Therefore, extracted association rules describe the observed values of the first dimension, which are recurrent in entities identified by the second dimension. When values belonging to the first dimension are associated, it is possible that not every association is suitable, but only a subset of them should be selected, based on a coupling condition on attributes of the analyzed data (e.g., a temporal sequence between events described in B and H). This is the third semantic dimension of the problem; the coupling condition is called mining condition. It is clear that MINE RULE is not tied to any particular application domain, since the semantic dimensions allow
the discovery of significant and unexpected information in very different application domains.
The main features and clauses of MINE RULE are as follows (see Meo, et al. [1998] for a detailed description); a schematic statement combining these clauses is sketched right after the list.
• Selection of the relevant set of data for a data mining process: This feature is specified by the FROM clause.
• Selection of the grouping features w.r.t. which data are observed: These features are expressed by the GROUP BY clause.
• Definition of the structure of rules and cardinality constraints on body and head, specified in the SELECT clause: Elements in rules can be single values or tuples.
• Definition of coupling constraints: These are constraints applied at the rule level (the mining condition, instantiated by a WHERE clause associated to SELECT) for coupling values.
• Definition of rule evaluation measures and minimum thresholds: These are support and confidence (even if, theoretically, other statistical measures would also be possible). Support of a rule is computed on the total number of groups in which it occurs and satisfies the given constraints. Confidence is the ratio between the rule support and the support of the body satisfying the given constraints. Thresholds are specified by the clause EXTRACTING RULES WITH.
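Putting the clauses together, a MINE RULE statement has the general shape sketched below. This template is reconstructed from the examples given in this article (the precise grammar is in Meo et al., 1998); the bracketed placeholders are to be filled for a specific mining task.

MINE RULE <RuleSetName> AS
SELECT DISTINCT <min>..<max> <body attributes> AS BODY,
                <min>..<max> <head attributes> AS HEAD,
                SUPPORT, CONFIDENCE
[ WHERE <mining condition coupling BODY and HEAD> ]
FROM <source relation(s)>
GROUP BY <grouping attributes>
EXTRACTING RULES WITH SUPPORT:<s>, CONFIDENCE:<c>

The attributes listed in the SELECT clause carry the first semantic dimension (the values that rules associate), the GROUP BY clause carries the second (the entities within which co-occurrence is counted), and the optional mining condition carries the third.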
MAIN THRUST
In this section, we introduce MINE RULE in the context of the three application domains. We describe many examples of queries that can be conceived as a sort of template, because they are instantiated along the relevant dimensions of an application domain and solve some frequent, similar, and critical situations for users of different applications.
First Application: Retail Data Analysis
We consider a typical data warehouse gathering information on customers’ purchases in a retail store:
FactTable (TransId, CustId, TimeId, ItemId, Num, Discount)
Customer (CustId, Profession, Age, Sex)
Rows in FactTable describe sales. The dimensions of data are the customer (CustId), the time (TimeId), and the purchased item (ItemId); each sale is characterized by the
number of sold pieces (Num) and the discount (Discount); the transaction identifier (TransId) is reported, as well. We also report table Customer. •
Example 1: We want to extract a set of association rules, named FrequentItemSets, that finds the associations between sets of items (first dimension of the problem) purchased together in a sufficient number of dates (second dimension), with no specific coupling condition (third dimension). These associations provide the business relevant sets of items, because they are the most frequent in time. The MINE RULE statement is now reported.
MINE RULE FrequentItemSets AS
SELECT DISTINCT 1..n ItemId AS BODY, 1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable
GROUP BY TimeId
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4

The first dimension of the problem is specified in the SELECT clause, which specifies the schema of each element in association rules, the cardinality of body and head (in terms of lower and upper bound), and the statistical measures for the evaluation of association rules (support and confidence); in the example, body and head are non-empty sets of items, and their upper bound is unlimited (denoted as 1..n). The GROUP BY clause provides the second dimension of the problem: since attribute TimeId is specified, rules denote that the associated items have been sold on the same date (intuitively, rows are grouped by values of TimeId, and rules associate values of attribute ItemId appearing in the same group). Support of an association rule is computed in terms of the number of groups in which all the elements of the rule co-occur; confidence is computed analogously. In this example, support is computed over the different instants of time, since grouping is made according to the time identifier. Support and confidence of rules must not be lower than the values in the EXTRACTING clause (respectively, 0.2 and 0.4).
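For comparison, here is a rough sketch of what the operator abstracts away: restricted to rules with a single item in body and head, the frequent combinations of Example 1 could be computed with ordinary SQL over the same FactTable. The absolute threshold :minimum_support_count (the minimum number of dates) is a host variable introduced for this sketch.

SELECT a.ItemId AS body_item, b.ItemId AS head_item,
       COUNT(DISTINCT a.TimeId) AS support_count
FROM FactTable a, FactTable b
WHERE a.TimeId = b.TimeId
  AND a.ItemId < b.ItemId
GROUP BY a.ItemId, b.ItemId
HAVING COUNT(DISTINCT a.TimeId) >= :minimum_support_count

Handling bodies and heads of arbitrary cardinality, and computing confidence, is precisely what MINE RULE adds on top of such hand-written queries.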
Example 2: Customer profiling is a key problem in CRM applications. Association rules make it possible to obtain a description of customers (e.g., w.r.t. age and profession) in terms of frequently purchased products. To do that, values coming from two distinct dimensions of data must be associated.
MINE RULE CustomerProfiles AS
SELECT DISTINCT 1..1 Profession, Age AS BODY, 1..n Item AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable JOIN Customer ON FactTable.CustId = Customer.CustId
GROUP BY CustId
EXTRACTING RULES WITH SUPPORT:0.6, CONFIDENCE:0.9

The observed entity is the customer (first dimension of data), described by a single pair in the body (cardinality constraint 1..1); the head associates products frequently purchased by customers (second dimension of data) with the profile reported in the body (see the SELECT clause). Thus a rule

{(employee, 35)} ⇒ {socks, shoes} support=0.7 confidence=0.96

means that customers that are employees and 35 years old often (96% of cases) buy socks and shoes. Support tells about the absolute frequency of the profile in the customer base (GROUP BY clause). This solution can be generalized easily for any profiling problem.
Second Application: Web Log Analysis
Typically, Web servers record accesses to Web sites in a standard log file. This is a relational table (WebLogTable) that typically contains at least the following attributes:
• RequestID: identifier of the request;
• IPcaller: IP address from which the request originated;
• Date: date of the request;
• TS: time stamp;
• Operation: kind of operation (for instance, get or put);
• Page URL: URL of the requested page;
• Protocol: transfer protocol (such as TCP/IP);
• Return Code: code returned by the Web server;
• Dimension: dimension of the page (in bytes).
Example 1: To discover Web communities of users on the basis of the pages they visited frequently, we might find associations between sets of users (first dimension) that have all visited a certain number of pages (second dimension); no coupling conditions are necessary (third dimension). Users are observed by means of their IP address, Ipcaller, whose values are associated by rules (see SELECT). In this case, support and confidence of association rules are computed, based on the num-
ber of pages visited by users in rules (see GROUP BY). Thus, rule
Third Application: Genes Classification by Micro-Array Experiments
{Ip1, Ip2} ⇒ {Ip3, Ip4} support=0.4 confidence=0.45
We consider information on a single micro-array experiment containing data on several samples of biological tissue tied to correspondent probes on a silicon chip. Each sample is treated (or hybridized) in various ways and under different experimental conditions; these can determine the over-expression of a set of genes. This means that the sets of genes are active in the experimental conditions (or inactive, if, on the contrary, they are under-expressed). Biologists are interested in discovering which sets of genes are expressed similarly and under what conditions. A micro-array typically contains hundreds of samples, and for each sample, several thousands of genes are measured. Thus, input relation, called MicroArrayTable, contains the following information:
means that users operating from Ip1, Ip2, Ip3 and Ip4 visited the same set of pages, which constitute 40% of the total pages in the site. MINE RULE UsersSamePages AS SELECT DISTINCT 1..n IPcaller AS BODY, 1..n IPcaller AS HEAD, SUPPORT, CONFIDENCE FROM WebLogTable GROUP BY PageUrl EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4 •
Example 2: In Web log analysis, it is interesting to discover the most frequent crawling paths.
MINE RULE FreqSeqPages AS SELECT DISTINCT 1..n PageUrl AS BODY, 1..n PageUrl AS HEAD, SUPPORT, CONFIDENCE WHERE BODY.Date < HEAD.Date FROM WebLogTable GROUP BY IPcaller EXTRACTING RULES WITH SUPPORT:0.3, CONFIDENCE:0.4 Rows are grouped by user (IPcaller) and sets of pages frequently visited by a sufficient number of users are associated. Furthermore, pages are associated only if they denote a sequential pattern (third dimension); in fact, the mining condition WHERE BODY.Date < HEAD.Date constrains the temporal ordering between pages in antecedent and consequent of rules. Consequently, rule {P1, P2} ⇒ {P3, P4, P5} support=0.5 confidence=0.6 means that 50% of users visit pages P3, P4, and P5 after pages P1 and P2. This solution can be generalized easily for any problem requiring the search for sequential patterns. Many other examples are possible, such as rules that associate users to frequently visited Web pages (highlight the fidelity of the users to the service provided by a Web site) or frequent requests of a page by a browser that cause an error in the Web server (interesting because it constitutes a favorable situation to hackers’ attacks).
• SampleID: identifier of the sample of biological tissue tied to a probe on the microchip;
• GeneId: identifier of the gene measured in the sample;
• TreatmentConditionId: identifier of the experimental conditions under which the sample has been treated;
• LevelOfExpression: measured value; if higher than a threshold T2, the genes are over-expressed; if lower than another threshold T1, genes are under-expressed.
MINE RULE SimilarlyCorrelatedGenes AS SELECT DISTINCT 1..n GeneId AS BODY, 1..n GeneId AS HEAD, SUPPORT, CONFIDENCE WHERE BODY.LevelOfExpression < T1 AND HEAD.LevelOfExpression < T1 OR BODY.LevelOfExpression > T2 AND HEAD.LevelOfExpression > T2 FROM MicroArrayTable GROUP BY SampleId, TreatmentConditionId EXTRACTING RULES WITH SUPPORT:0.95, CONFIDENCE:0.8 The mining condition introduced by WHERE constrains both the sets of genes to be similarly expressed in the same experimental conditions (i.e., samples of tissue treated in the same conditions). Support thresh-
743
(0.95) determines the proportion of samples in which the sets of genes must be expressed similarly; confidence determines how strongly the two sets of genes are correlated. This statement might help biologists to discover the sets of genes that are involved in the production of proteins involved in the development of certain diseases (e.g., cancer).
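As with the retail example, the core of this analysis can be approximated in ordinary SQL for the special case of two single genes per rule. The query below counts, for each pair of genes, the (sample, treatment) groups in which both are expressed on the same side of the thresholds; :T1, :T2, and :min_group_count are host variables assumed for this sketch, and one row per (sample, condition, gene) is assumed in MicroArrayTable.

SELECT a.GeneId AS body_gene, b.GeneId AS head_gene, COUNT(*) AS support_count
FROM MicroArrayTable a, MicroArrayTable b
WHERE a.SampleId = b.SampleId
  AND a.TreatmentConditionId = b.TreatmentConditionId
  AND a.GeneId < b.GeneId
  AND ((a.LevelOfExpression < :T1 AND b.LevelOfExpression < :T1)
    OR (a.LevelOfExpression > :T2 AND b.LevelOfExpression > :T2))
GROUP BY a.GeneId, b.GeneId
HAVING COUNT(*) >= :min_group_count

Again, the MINE RULE statement generalizes this to gene sets of any size and adds the confidence computation.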
FUTURE TRENDS This contribution wants to evaluate the usability of a mining query language and its results—association rules—in some practical applications. We identified many useful patterns that corresponded to concrete user problems. We showed that the exploitation of the nuggets of information embedded in the databases and of the specialized mining constructs provided by the query languages enables the rapid customization of the mining procedures leading to the real users’ needs. Given our experience, we also claim that, independently of the application domain, the use of queries in advanced languages, as opposed to ad-hoc heuristics, eases the specification and the discovery of a large spectrum of patterns. This motivates the need for powerful query languages in KDD systems. For the future, we believe that a critical point will be the availability of powerful query optimizers, such as the one proposed in Meo (2003). This one is able to solve data mining queries incrementally; that is, by modification of the previous queries results, materialized in the database.
CONCLUSION
In this contribution, we focused on the semantic problem behind the extraction of association rules. We put in evidence the semantic dimensions that characterize the extraction of association rules; we did this by applying a general-purpose query language designed for the extraction of association rules, named MINE RULE, to three important application domains. The query examples we provided show that the mining language is powerful and, at the same time, versatile, because its operational semantics seems to be the basic one. Indeed, these experiments allow us to claim that Imielinski and Mannila’s (1996) initial view on inductive databases was correct.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the International Conference on Management of Data, Washington, D.C. Baralis, E., & Psaila, G. (1999). Incremental refinement of mining queries. Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery, Florence, Italy. Botta, M., Boulicaut, J.-F., Masson, C., & Meo, R. (2004). Query languages supporting descriptive rule mining: A comparative study. In R. Meo, P. Lanzi, & M. Klemettinen (Eds.), Database support for data mining applications. (pp. 24-51). Berlin: Springer-Verlag. Boulicaut, J.-F., Klemettinen, M., & Mannila, H. (1998). Querying inductive databases: A case study on the MINE RULE operator Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery, Nantes, France. Calders, T. (2004). Computational complexity of itemset frequency satisfiability. Proceedings of the Symposium on Principles Of Database Systems, Paris, France. Cooley, R., Tan, P.N., & Srivastava, J. (2000). Discovery of interesting usage patterns from Web data. In Proceedings of WEBKDD-99 International Workshop on Web Usage Analysis and User Profiling, San Diego, California. Berlin: Springer Verlag. Fayyad, U.M. (2003). Special issue on microarray data mining. SIGKDD Explorations, 5(2), 1-139. Han, J., Fu, Y., Wang, W., Koperski, K., & Zaiane, O. (1996). DMQL: A data mining query language for relational databases. Proceedings of the Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada. Imielinski, T., & Mannila, H. (1996). A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64. Imielinski, T., Virmani, A., & Abdoulghani, A. (1996). DataMine: Application programming interface and query language for database mining. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, Oregon. Meo, R. (2003). Optimization of a language for data mining. Proceedings of the Symposium on Applied Computing, Melbourne, Florida.
Meo, R., Psaila, G., & Ceri, S. (1998). An extension to SQL for mining association rules. Journal of Data Mining and Knowledge Discovery, 2(2), 195-224. Netz, A., Chaudhuri, S., Fayyad, U.M., & Bernhardt, J. (2001). Integrating data mining with SQL databases: OLE DB for data mining Proceedings of the International Conference on Data Engineering, Heidelberg, Germany. Ng, R.T., Lakshmanan, V.S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the International Conference Management of Data, Seattle, Washington. Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. Proceedings of the International Conference on Knowledge Discovery from Databases, Newport Beach, California. Tsur, D. et al. (1998). Query flocks: A generalization of association-rule mining. Proceedings of the International Conference Management of Data, Seattle, Washington.
KEY TERMS Association Rule: An association between two sets of items co-occurring frequently in groups of data. Constraint-Based Mining: Data mining obtained by means of evaluation of queries in a query language allowing predicates. CRM: Management, understanding, and control of data on the customers of a company for the purposes of enhancing business and minimizing the customers churn. Inductive Database: Database system integrating in the database source data and data mining patterns defined as the result of data mining queries on source data. KDD: Knowledge Discovery Process from the database, performing tasks of data pre-processing, transformation and selection, and extraction of data mining patterns and their post-processing and interpretation. Semantic Dimension: Concept or entity of the studied domain that is being observed in terms of other concepts or entities. Web Log: File stored by the Web server containing data on users’ accesses to a Web site.
Mining Association Rules on a NCR Teradata System
Soon M. Chung, Wright State University, USA
Murali Mangamuri, Wright State University, USA
INTRODUCTION
Data mining from relations is becoming increasingly important with the advent of parallel database systems. In this paper, we propose a new algorithm for mining association rules from relations. The new algorithm is an enhanced version of the SETM algorithm (Houtsma & Swami, 1995), and it reduces the number of candidate itemsets considerably. We implemented and evaluated the new algorithm on a parallel NCR Teradata database system. The new algorithm is much faster than the SETM algorithm, and its performance is quite scalable.
BACKGROUND
Data mining, also known as knowledge discovery from databases, is the process of finding useful patterns from databases. One of the useful patterns is the association rule, which is formally described in Agrawal, Imielinski, and Swami (1993) as follows: Let I = {i1, i2, . . . , im} be a set of items. Let D represent a set of transactions, where each transaction T contains a set of items, such that T ⊆ I. Each transaction is associated with a unique identifier, called the transaction identifier (TID). A set of items X is said to be in transaction T if X ⊂ T. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X => Y holds in the database D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X => Y has a support s if s% of the transactions in D contain X ∪ Y. For example, beer and disposable diapers are items such that beer => diapers is an association rule mined from the database if the co-occurrence rate of beer and disposable diapers (in the same transaction) is not less than the minimum support, and the occurrence rate of diapers in the transactions containing beer is not less than the minimum confidence. The problem of mining association rules is to find all the association rules that have support and confidence greater than or equal to the user-specified minimum sup-
port and minimum confidence, respectively. This problem can be decomposed into the following two steps: 1.
2.
Find all sets of items (called itemsets) that have support above the user-specified minimum support. These itemsets are called frequent itemsets or large itemsets. For each frequent itemset, all the association rules that have minimum confidence are generated as follows: For every frequent itemset f, find all nonempty subsets of f. For every such subset a, generate a rule of the form a => (f - a) if the ratio of support(f) to support(a) is at least the minimum confidence.
Finding all the frequent itemsets is a very resourceconsuming task, but generating all the valid association rules from the frequent itemsets is quite straightforward. There are many association rule-mining algorithms proposed (Agarwal, Aggarwal & Prasad, 2000; Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994; Bayardo, 1998; Burdick, Calimlim & Gehrke, 2001; Gouda & Zaki, 2001; Holt & Chung, 2001, 2002; Houtsma & Swami, 1995; Park, Chen & Yu, 1997; Savasere, Omiecinski & Navathe, 1995; Zaki, 2000). However, most of these algorithms are designed for data stored in file systems. Considering that relational databases are used widely to manage the corporation data, integrating the data mining with the relational database system is important. A methodology for tightly coupling a mining algorithm with relational database using user-defined functions is proposed in Agrawal and Shim (1996), and a detailed study of various architectural alternatives for coupling mining with database systems is presented in Sarawagi, Thomas, and Agrawal (1998). The SETM algorithm proposed in Houtsma and Swami (1995) was expressed in the form of SQL queries. Thus, it can be applied easily to relations in the relational databases and can take advantage of the functionalities provided by the SQL engine, such as the query optimization, efficient execution of relational algebra operations, and indexing. SETM also can be implemented easily on a
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
Mining Association Rules on a NCR Teradata System
parallel database system that can execute the SQL queries in parallel on different processing nodes. By processing the relations directly, we can easily relate the mined association rules to other information in the same database, such as the customer information. In this paper, we propose a new algorithm named Enhanced SETM (ESETM), which is an enhanced version of the SETM algorithm. We implemented both ESETM and SETM on a parallel NCR Teradata database system and evaluated and compared their performance for various cases. It has been shown that ESETM is considerably faster than SETM.
MAIN THRUST NCR Teradata Database System The algorithms are implemented on an NCR Teradata database system. It has two nodes, where each node consists of 4 Intel 700MHz Xeon processors, 2GB shared memory, and 36GB disk space. The nodes are interconnected by a dual BYNET interconnection network supporting 960Mbps of data bandwidth for each node. Moreover, nodes are connected to an external disk storage subsystem configured as a level-5 RAID (Redundant Array of Inexpensive Disks) with 288GB disk space. The relational DBMS used here is Teradata RDBMS (version 2.4.1), which is designed specifically to function in the parallel environment. The hardware that supports Teradata RDBMS software is based on off-the-shelf Symmetric Multiprocessing (SMP) technology. The hardware is combined with a communication network (BYNET) that connects the SMP systems to form Massively Parallel Processing (MPP) systems, as shown in Figure 1 (NCR Teradata Division, 2002). The versatility of the Teradata RDBMS is based on virtual processors (vprocs) that eliminate the dependency on specialized physical processors. Vprocs are a set of software processes that run on a node within the multitasking environment of the operating system. Each vproc is a separate, independent copy of the processor Figure 1. Teradata system architecture
software, isolated from other vprocs but sharing some of the physical resources of the node, such as memory and CPUs (NCR Teradata Division, 2002). Vprocs and the tasks running under them communicate using unique-address messaging, as if they were physically isolated from one another. The Parsing Engine (PE) and the Access Module Processor (AMP) are the two types of vprocs. Each PE executes the database software that manages sessions, decomposes SQL statements into steps, possibly parallel, and returns the answer rows to the requesting client. The AMP is the heart of the Teradata RDBMS. The AMP is a vproc that performs many database and file-management tasks. The AMPs control the management of the Teradata RDBMS and the disk subsystem. Each AMP manages a portion of the physical disk space and stores its portion of each database table within that disk space, as shown in Figure 2 (NCR Teradata Division, 2002).
SETM Algorithm

The SETM algorithm proposed in Houtsma and Swami (1995) for finding frequent itemsets, and the corresponding SQL queries used, are as follows. The input relation SALES contains one (trans_id, item) row for each item purchased in each transaction.

In this algorithm, initially, all frequent 1-itemsets and their respective counts (F1) are obtained from SALES, and R1 holds the (trans_id, item) pairs of SALES whose items are frequent. Then, for each k >= 2, the candidate k-itemsets are generated by joining the Rk-1
and R1 tables. The R'k table can be viewed as the set of candidate k-itemsets coupled with their transaction identifiers. SQL query for generating R'k:

INSERT INTO R'k
SELECT p.trans_id, p.item1, . . . , p.itemk-1, q.item
FROM Rk-1 p, R1 q
WHERE q.trans_id = p.trans_id AND q.item > p.itemk-1
Frequent k-itemsets are generated by a sequential scan over R'k, selecting only those itemsets that meet the minimum support constraint. SQL query for generating Fk:

INSERT INTO Fk
SELECT p.item1, . . . , p.itemk, COUNT(*)
FROM R'k p
GROUP BY p.item1, . . . , p.itemk
HAVING COUNT(*) >= :minimum_support
The Rk table is created by filtering the R'k table using Fk. The Rk table can be viewed as the set of frequent k-itemsets coupled with their transaction identifiers. This step is performed to ensure that only the candidate k-itemsets (R'k) corresponding to frequent k-itemsets are used to generate the candidate (k+1)-itemsets. SQL query for generating Rk:

INSERT INTO Rk
SELECT p.trans_id, p.item1, . . . , p.itemk
FROM R'k p, Fk q
WHERE p.item1 = q.item1 AND . . . AND
p.itemk-1 = q.itemk-1 AND p.itemk = q.itemk
ORDER BY p.trans_id, p.item1, . . . , p.itemk
A loop is used to implement the procedure described above, and the number of iterations depends on the size of the largest frequent itemset, as the procedure is repeated until Fk is empty.
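To make the pass structure concrete, below is a minimal Python driver sketch, not part of the original SETM paper, that issues per-pass SQL of the form shown above through a generic PEP 249 (DB-API) connection. The connection object, the table names (R'k is written here as RPk), the column naming, and the positional INSERTs are illustrative assumptions.

def setm_passes(conn, minimum_support_count, max_k=20):
    """Run SETM passes k = 2, 3, ... until no frequent k-itemset is found.
    Assumes tables SALES, F1, R1 and empty RPk, Fk, Rk tables already exist
    with the column layouts used in the SQL queries above; R1 is assumed to
    store its single item in a column named item."""
    cur = conn.cursor()
    for k in range(2, max_k + 1):
        prev_cols = ["item"] if k == 2 else [f"item{i}" for i in range(1, k)]
        prev = ", ".join(f"p.{c}" for c in prev_cols)
        cols = ", ".join(f"item{i}" for i in range(1, k + 1))
        # Candidate k-itemsets with transaction ids (R'k): merge R(k-1) with R1.
        cur.execute(
            f"INSERT INTO RP{k} SELECT p.trans_id, {prev}, q.item "
            f"FROM R{k-1} p, R1 q "
            f"WHERE q.trans_id = p.trans_id AND q.item > p.{prev_cols[-1]}"
        )
        # Frequent k-itemsets (Fk): group the candidates and apply the threshold.
        cur.execute(
            f"INSERT INTO F{k} SELECT {cols}, COUNT(*) FROM RP{k} "
            f"GROUP BY {cols} HAVING COUNT(*) >= {int(minimum_support_count)}"
        )
        cur.execute(f"SELECT COUNT(*) FROM F{k}")
        if cur.fetchone()[0] == 0:            # Fk is empty: stop iterating
            break
        # Rk: keep only the candidates that correspond to frequent k-itemsets.
        match = " AND ".join(f"p.item{i} = q.item{i}" for i in range(1, k + 1))
        cur.execute(
            f"INSERT INTO R{k} SELECT p.trans_id, {cols.replace('item', 'p.item')} "
            f"FROM RP{k} p, F{k} q WHERE {match}"
        )
    conn.commit()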
Enhanced SETM (ESETM)

The Enhanced SETM (ESETM) algorithm makes three modifications to the original SETM algorithm:

1. Create frequent 2-itemsets without materializing R1 and R'2.
2. Create candidate (k+1)-itemsets in R'k+1 by joining Rk with itself.
3. Use a subquery to generate Rk rather than materializing it, thereby generating R'k+1 directly from R'k.
The number of candidate 2-itemsets can be very large, so it is inefficient to materialize the R'2 table. Instead of creating the R'2 table, ESETM creates a view or a subquery to
generate candidate 2-itemsets and directly generates frequent 2-itemsets. This view or subquery is also used to create candidate 3-itemsets.

CREATE VIEW R'2 (trans_id, item1, item2) AS
SELECT P1.trans_id, P1.item, P2.item
FROM (SELECT p.trans_id, p.item FROM SALES p, F1 q WHERE p.item = q.item) AS P1,
     (SELECT p.trans_id, p.item FROM SALES p, F1 q WHERE p.item = q.item) AS P2
WHERE P1.trans_id = P2.trans_id AND P1.item < P2.item
Note that R1 is not created, since it will not be used for the generation of R'k. The set of frequent 2-itemsets, F2, can be generated directly by using this R'2 view.

INSERT INTO F2
SELECT item1, item2, COUNT(*)
FROM R'2
GROUP BY item1, item2
HAVING COUNT(*) >= :minimum_support
The second modification is to generate R'k+1 using the join of Rk with itself, instead of the merge-scan of Rk with R1. SQL query for generating R'k+1:

INSERT INTO R'k+1
SELECT p.trans_id, p.item1, . . . , p.itemk, q.itemk
FROM Rk p, Rk q
WHERE p.trans_id = q.trans_id AND
p.item1 = q.item1 AND . . . AND
p.itemk-1 = q.itemk-1 AND
p.itemk < q.itemk
This modification reduces the number of candidate (k+1)-itemsets generated compared to the original SETM algorithm. The performance of the algorithm can be improved further if candidate (k+1)-itemsets are generated directly from candidate k-itemsets using a subquery, as follows. SQL query for generating R'k+1 using R'k:

INSERT INTO R'k+1
SELECT P1.trans_id, P1.item1, . . . , P1.itemk, P2.itemk
FROM (SELECT p.* FROM R'k p, Fk q
      WHERE p.item1 = q.item1 AND . . . AND p.itemk = q.itemk) AS P1,
     (SELECT p.* FROM R'k p, Fk q
      WHERE p.item1 = q.item1 AND . . . AND p.itemk = q.itemk) AS P2
WHERE P1.trans_id = P2.trans_id AND
P1.item1 = P2.item1 AND . . . AND
P1.itemk-1 = P2.itemk-1 AND
P1.itemk < P2.itemk
Rk is thus generated as a derived table using a subquery, thereby saving the cost of materializing the Rk table.
In the ESETM algorithm, candidate (k+1)-itemsets in R'k+1 are generated by joining Rk with itself on the first k-1 items, as described previously. For example, a 4-itemset {1, 2, 3, 9} becomes a candidate 4-itemset only if {1, 2, 3} and {1, 2, 9} are frequent 3-itemsets. This is different from the subset-infrequency-based pruning of the candidates used in the Apriori algorithm, where a (k+1)-itemset becomes a candidate (k+1)-itemset only if all of its k-subsets are frequent. So, {2, 3, 9} and {1, 3, 9} also should be frequent for {1, 2, 3, 9} to be a candidate 4-itemset. The above SQL query for generating R'k+1 can be modified such that all the k-subsets of each candidate (k+1)-itemset are checked. To simplify the presentation, we divide the query into subqueries. Candidate (k+1)-itemsets are generated by Subquery Q1 using Fk.

Subquery Q0:
SELECT item1, item2, . . . , itemk
FROM Fk

Subquery Q1:
SELECT p.item1, p.item2, . . . , p.itemk, q.itemk
FROM Fk p, Fk q
WHERE p.item1 = q.item1 AND . . . AND
p.itemk-1 = q.itemk-1 AND
p.itemk < q.itemk AND
(p.item2, . . . , p.itemk, q.itemk) IN (Subquery Q0) AND
. . .
(p.item1, . . . , p.itemj-1, p.itemj+1, . . . , p.itemk, q.itemk) IN (Subquery Q0) AND
. . .
(p.item1, . . . , p.itemk-2, p.itemk, q.itemk) IN (Subquery Q0)

Subquery Q2:
SELECT p.* FROM R'k p, Fk q
WHERE p.item1 = q.item1 AND . . . AND p.itemk = q.itemk

INSERT INTO R'k+1
SELECT p.trans_id, p.item1, . . . , p.itemk, q.itemk
FROM (Subquery Q2) p, (Subquery Q2) q
WHERE p.trans_id = q.trans_id AND
p.item1 = q.item1 AND . . . AND
p.itemk-1 = q.itemk-1 AND
p.itemk < q.itemk AND
(p.item1, . . . , p.itemk, q.itemk) IN (Subquery Q1)
The Subquery Q1 joins Fk with itself to generate the candidate (k+1)-itemsets, and all candidate (k+1)-itemsets having any infrequent k-subset are pruned. The Subquery Q2 derives Rk, and R’k+1 is generated as: R’k+1 = (Rk JOIN Rk) JOIN (Subquery Q1). However, it is not efficient to prune all the candidates in all the passes, since the cost of pruning the candidates
in the Subquery Q1 is too high when there are not many candidates to be pruned. In our implementation, the pruning is performed until the number of rows in Fk becomes less than 1,000, or up to five passes. The difference between the total execution times with and without pruning was very small for most of the databases we tested.
Performance Analysis

In this section, the performance of the Enhanced SETM (ESETM), ESETM with pruning (PSETM), and SETM is evaluated and compared. We used synthetic transaction databases generated according to the procedure described in Agrawal and Srikant (1994). The total execution times of ESETM, PSETM, and SETM are shown in Figure 3 for the database T10.I4.D100K, where Txx.Iyy.DzzzK indicates that the average number of items in a transaction is xx, the average size of the maximal potentially frequent itemsets is yy, and the number of transactions in the database is zzz in thousands.

Figure 3. Total execution times (for T10.I4.D100K)

ESETM is more than three times faster than SETM for all minimum support levels, and the performance gain increases as the minimum support level decreases. ESETM and PSETM have almost the same total execution time, because the effect of the reduced number of candidates in PSETM is offset by the extra time required for the pruning. The time taken for each pass by the algorithms for the T10.I4.D100K database with the minimum support of 0.25% is shown in Figure 4. The second-pass execution time of ESETM is much smaller than that of SETM, because the R'2 table (containing candidate 2-itemsets together with the transaction identifiers) and the R2 table (containing frequent 2-itemsets together with the transaction identifiers) are not materialized. In the later passes, the performance of ESETM is much better than that of SETM, because ESETM generates far fewer candidate itemsets and does not materialize the Rk tables for k > 2. In Figure 5, the size of the R'k table containing candidate k-itemsets is shown for each pass when the T10.I4.D100K database is used with the minimum support of 0.25%. From the third pass, the size of the R'k table for ESETM is much
smaller than that of SETM because of the reduced number of candidate itemsets. PSETM performs additional pruning of candidate itemsets, but the difference in the number of candidates is very small in this case. The scalability of the algorithms is evaluated by increasing the number of transactions and the average size of transactions. Figure 6 shows how the three algorithms scale up as the number of transactions increases. The database used here is T10.I4, and the minimum support is 0.5%. The number of transactions ranges from 100,000 to 400,000. SETM performs poorly as the number of transactions increases, because it generates many more candidate itemsets than the others. The effect of the transaction size on the performance is shown in Figure 7. In this case, the size of the database was kept constant by fixing the product of the average transaction size and the number of transactions. The number of transactions was 20,000 for the average transaction size of 50 and 100,000 for the average transaction size of 10. We used a fixed minimum support count of 250 transactions, regardless of the number of transactions. The performance of SETM deteriorates as the transaction size increases, because the number of candidate itemsets generated is very large. On the other hand, the total execution times of ESETM and PSETM are stable,
because the number of candidate itemsets generated in the later passes is small.
Figure 4. Per pass execution times (for T10.I4.D100K)
Figure 5. Size of R'k (for T10.I4.D100K)
Figure 6. Effect of the number of transactions

Figure 7. Effect of the transaction size
FUTURE TRENDS

Relational database systems are used widely, and the size of existing relational databases grows quite rapidly. Thus, mining the relations directly, without transforming them into certain file structures, is very useful. However, due to the high operational complexity of the mining process, parallel data mining is essential for very large databases. Currently, we are developing an algorithm for mining association rules across multiple relations using our parallel NCR Teradata database system.
CONCLUSION

In this paper, we proposed a new algorithm, named Enhanced SETM (ESETM), for mining association rules from relations. ESETM is an enhanced version of the SETM algorithm (Houtsma & Swami, 1995), and its performance is much better than that of SETM, because it generates far fewer
candidate itemsets to count. ESETM and SETM were implemented on a parallel NCR database system, and we evaluated their performance in various cases. ESETM is at least three times faster than SETM in most of our test cases, and its performance is quite scalable.
ACKNOWLEDGMENTS

This research was supported in part by NCR, LexisNexis, the Ohio Board of Regents (OBR), and the AFRL/Wright Brothers Institute (WBI).
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA.

Agrawal, R., & Shim, K. (1996). Developing tightly-coupled data mining applications on a relational database system. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the VLDB Conference.

Agarwal, R.C., Aggarwal, C.C., & Prasad, V.V.V. (2000). Depth first generation of long patterns. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA.

Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transaction databases. Proceedings of the International Conference on Data Engineering, Heidelberg, Germany.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the 1st IEEE International Conference on Data Mining, San Jose, CA, USA.

Holt, J.D., & Chung, S.M. (2001). Multipass algorithms for mining association rules in text databases. Knowledge and Information Systems, 3(2), 168-183.
Holt, J.D., & Chung, S.M. (2002). Mining association rules using inverted hashing and pruning. Information Processing Letters, 83(4), 211-220.
Houtsma, M., & Swami, A. (1995). Set-oriented mining for association rules in relational databases. Proceedings of the International Conference on Data Engineering, Taipei, Taiwan.

NCR Teradata Division (2002). Introduction to Teradata RDBMS.

Park, J.S., Chen, M.S., & Yu, P.S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813-825.

Sarawagi, S., Thomas, S., & Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the VLDB Conference, Zurich, Switzerland.

Zaki, M.J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3), 372-390.
KEY TERMS

Association Rule: Implication of the form X => Y, meaning that database tuples satisfying the conditions of X are also likely to satisfy the conditions of Y.

Data Mining: Process of finding useful data patterns hidden in large data sets.

Parallel Database System: Database system supporting the parallel execution of the individual basic database operations, such as relational algebra operations and aggregate operations.
Mining Association Rules Using Frequent Closed Itemsets

Nicolas Pasquier
Université de Nice-Sophia Antipolis, France
INTRODUCTION
In the domain of knowledge discovery in databases and its computational part called data mining, many works addressed the problem of association rule extraction that aims at discovering relationships between sets of items (binary attributes). An example association rule fitting in the context of market basket data analysis is cereal ∧ milk → sugar (support 10%, confidence 60%). This rule states that 60% of the customers who buy cereal and milk also buy sugar, and that 10% of all customers buy all three items. When an association rule's support and confidence exceed some user-defined thresholds, the rule is considered relevant to support decision making. Association rule extraction has proved useful to analyze large databases in a wide range of domains, such as marketing decision support; diagnosis and medical research support; telecommunication process improvement; Web site management and profiling; spatial, geographical, and statistical data analysis; and so forth. The first phase of association rule extraction is the data selection from data sources and the generation of the data mining context, which is a triplet D = (O, I, R), where O and I are finite sets of objects and items respectively, and R ⊆ O × I is a binary relation. An item is most often an attribute value or an interval of attribute values. Each couple (o, i) ∈ R denotes the fact that the object o ∈ O is related to the item i ∈ I. If an object o is in relation with all items of an itemset I (a set of items), we say that o contains I. This phase helps to improve the extraction efficiency and enables the treatment of all kinds of data, often mixed in operational databases, with the same algorithm. Data mining contexts are large relations that do not fit in main memory and must be stored in secondary memory. Consequently, each context scan is very time consuming.

BACKGROUND
The support of an itemset I is the proportion of objects containing I in the context. An itemset is frequent if its support is greater or equal to the minimal support threshold defined by the user. An association rule r is an implication with the form r: I1 → I2 - I1 where I1 and I2 are frequent itemsets such that I1 ⊂ I2. The confidence of r is the number of objects containing I2 divided by the number of objects containing I1. An association rule is generated if its support and confidence are at least equal to the minsupport and minconfidence thresholds. Association rules with 100% confidence are called exact association rules; others are called approximate association rules. The natural decomposition of the association rule-mining problem is:
1. Extracting frequent itemsets and their support from the context.
2. Generating all valid association rules from frequent itemsets and their support.

Table 1. Example context

OID   Items
1     ACD
2     BCE
3     ABCE
4     BE
5     ABCE
6     BCE

Figure 1. Itemset lattice
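To make the support and confidence definitions concrete, the following minimal Python sketch, an added illustration rather than part of the original article, computes them on the example context of Table 1.

# The example context of Table 1: each object (OID) with its set of items.
context = {1: "ACD", 2: "BCE", 3: "ABCE", 4: "BE", 5: "ABCE", 6: "BCE"}
objects = {oid: set(items) for oid, items in context.items()}

def support(itemset):
    """Proportion of objects containing the itemset."""
    covered = [oid for oid, obj in objects.items() if set(itemset) <= obj]
    return len(covered) / len(objects)

def confidence(antecedent, consequent):
    """Objects containing antecedent and consequent over objects containing antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support("BCE"))          # 4/6: itemset BCE is contained in objects 2, 3, 5, 6
print(confidence("BE", "C"))   # 4/5: confidence of the rule BE -> C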
The first phase is the most computationally expensive part of the process, since the number of potential frequent itemsets, 2^|I|, is exponential in the size of the set of items, and context scans are required. A trivial approach would consider all potential frequent itemsets at the same time, but this approach cannot be used for large databases, where I is large. Then, the set of potential frequent itemsets, which constitutes a lattice called the itemset lattice, must be decomposed into several subsets considered one at a time.
Level-Wise Algorithms for Extracting Frequent Itemsets

These algorithms consider all itemsets of a given size (i.e., all itemsets of a level in the itemset lattice) at a time. They are based on the properties that all supersets of an infrequent itemset are infrequent and all subsets of a frequent itemset are frequent (Agrawal et al., 1995). Using this property, the candidate k-itemsets (itemsets of size k) of the kth iteration are generated by joining two frequent (k-1)-itemsets discovered during the preceding
iteration, if their first k-2 items are identical. Then, one database scan is performed to count the supports of the candidates, and infrequent ones are pruned. This process is repeated until no new candidate can be generated. This approach is used in the well-known APRIORI and OCD algorithms. Both carry out a number of context scans equal to the size of the largest frequent itemsets. Several optimizations have been proposed to improve the efficiency by avoiding several context scans. The COFI* (El-Hajj & Zaïane, 2004) and FP-GROWTH (Han et al., 2004) algorithms use specific data structures for that, and the PASCAL algorithm (Bastide et al., 2000) uses a method called pattern counting inference to avoid counting all supports.
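To illustrate the level-wise candidate generation just described, here is a small Python sketch, an added illustration rather than published APRIORI code, of the join-and-prune step that builds candidate k-itemsets from the frequent (k-1)-itemsets; the sample frequent 2-itemsets are hypothetical.

from itertools import combinations

def generate_candidates(frequent_prev, k):
    """frequent_prev: set of frozensets, the frequent (k-1)-itemsets.
    Returns the candidate k-itemsets of the kth iteration."""
    # Work on sorted tuples so that "the first k-2 items are identical" is well defined.
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    candidates = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:                       # join step
            candidate = frozenset(a) | frozenset(b)
            # Prune step: every (k-1)-subset of the candidate must be frequent.
            if all(frozenset(sub) in frequent_prev
                   for sub in combinations(candidate, k - 1)):
                candidates.add(candidate)
    return candidates

# Example with hypothetical frequent 2-itemsets of a second iteration.
frequent_2 = {frozenset(p) for p in ("AB", "AC", "BC", "BE", "CE")}
print(sorted("".join(sorted(c)) for c in generate_candidates(frequent_2, 3)))
# ['ABC', 'BCE']  (ABE, for instance, is never produced because AE is not frequent)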
Algorithms for Extracting Maximal Frequent Itemsets

Maximal and minimal itemsets are defined according to the inclusion relation. Maximal frequent itemsets are frequent itemsets of which all supersets are infrequent. They form a border under which all itemsets are frequent; knowing all maximal frequent itemsets, we can deduce all frequent itemsets, but not their support. Then, the following approach for mining association rules was proposed:

1. Extracting maximal frequent itemsets and their supports from the context.
2. Deriving frequent itemsets from maximal frequent itemsets and counting their support in the context during one final scan.
3. Generating all valid association rules from frequent itemsets.
These algorithms perform an iterative search in the itemset lattice, advancing during each iteration by one level from the bottom upwards, as in APRIORI, and by one or more levels from the top downwards. Compared to the preceding algorithms, the number of iterations and, thus, the number of context scans and the number of CPU operations carried out are reduced. The most well-known algorithms based on this approach are PINCER-SEARCH (Lin & Kedem, 1998) and MAX-MINER (Bayardo, 1998).
Relevance of Extracted Association Rules

For many datasets, a huge number of association rules is extracted, even for high minsupport and minconfidence values. This problem is crucial with correlated data, for which several million association rules sometimes are extracted. Moreover, a majority of these rules bring the same information and, thus, are redundant. To illustrate this problem, nine rules extracted from the mushroom dataset (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/mushroom/) are presented in the following. All have the same support (51%) and confidence (54%), and the item free_gills in the antecedent:

1. free_gills → edible
2. free_gills → edible, partial_veil
3. free_gills → edible, white_veil
4. free_gills → edible, partial_veil, white_veil
5. free_gills, partial_veil → edible
6. free_gills, partial_veil → edible, white_veil
7. free_gills, white_veil → edible
8. free_gills, white_veil → edible, partial_veil
9. free_gills, partial_veil, white_veil → edible
The most relevant rule from the viewpoint of the user is rule 4, since all other rules can be deduced from this one, including support and confidence. This rule is a non-redundant association rule with minimal antecedent and maximal consequent, or minimal non-redundant rule, for short.
Association Rule Reduction Methods

Several approaches for reducing the number of rules and selecting the most relevant ones have been proposed. The application of templates (Baralis & Psaila, 1997) or Boolean operators (Bayardo, Agrawal & Gunopulos, 2000) allows selecting rules according to the user's preferences. When taxonomies of items exist, generalized association rules (Han & Fu, 1999) (i.e., rules between items of different levels of taxonomies) can be extracted. This produces fewer but more general associations. Other statistical measures, such as Pearson's correlation or χ2,
also can be used instead of the confidence to determine the rule precision (Silverstein, Brin & Motwani, 1998). Several methods to prune similar rules by analyzing their structures also have been proposed. This allows the extraction of only those rules with maximal antecedents among the rules with the same support and the same consequent (Bayardo & Agrawal, 1999), for instance.
MAIN THRUST

Algorithms for Extracting Frequent Closed Itemsets

In contrast with the (maximal) frequent itemsets-based approaches, the frequent closed itemsets approach (Pasquier et al., 1998; Zaki & Ogihara, 1998) is based on the closure operator of the Galois connection. This operator γ associates with an itemset I the maximal set of items common to all the objects containing I (i.e., the intersection of these objects). The frequent closed itemsets are frequent itemsets with γ(I) = I. An itemset C is a frequent closed itemset if no other item i ∉ C is common to all objects containing C. The frequent closed itemsets, together with their supports, constitute a generating set for all frequent itemsets and their supports and, thus, for all association rules, their supports, and their confidences (Pasquier et al., 1999a). This property relies on the properties that the support of a frequent itemset is equal to the support of its closure and that the maximal frequent itemsets are maximal frequent closed itemsets. Using these properties, a new approach for mining association rules was proposed:

1. Extracting frequent closed itemsets and their supports from the context.
2. Deriving frequent itemsets and their supports from frequent closed itemsets.
3. Generating all valid association rules from frequent itemsets.
The search space in the first phase is reduced to the closed itemset lattice, which is a sublattice of the itemset lattice.

Figure 2. Closed itemset lattice

The first algorithms based on this approach proposed are CLOSE (Pasquier et al., 1999a) and A-CLOSE (Pasquier et al., 1999b). To improve the extraction efficiency, both perform a level-wise search for generators of frequent closed itemsets. The generators of a closed itemset C are the minimal itemsets whose closure is C; an itemset G is a generator of C if there is no other itemset G' ⊂ G whose closure is C. During an iteration k, CLOSE considers a set of candidate k-generators. One context scan is performed to
compute their supports and closures; for each generator G, the intersection of all objects containing G gives its closure, and counting them gives its support. Then, infrequent generators and generators of frequent closed itemsets previously discovered are pruned. During the (k+1)th iteration, candidate (k+1)-generators are constructed by joining two frequent k-generators having identical first k-1 items. In the A-CLOSE algorithm, generators are identified by comparing supports only, since the support of a generator is different from the supports of all its subsets. Then, one more context scan is performed at the end of the algorithm to compute the closures of all frequent generators discovered. Recently, the CHARM (Zaki & Hsiao, 2002), CLOSET+ (Wang, Han & Pei, 2003), and BIDE (Wang & Han, 2004) algorithms have been proposed. These algorithms efficiently extract frequent closed itemsets but not their generators. The TITANIC algorithm (Stumme et al., 2002) can extract frequent closed sets according to different closures, such as functional dependencies or Galois closures, for instance.
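The closure operator γ described above is easy to state in code. The following minimal Python sketch, an added illustration using the example context of Table 1, computes γ(I) as the intersection of all objects containing I and tests whether an itemset is closed.

# Example context of Table 1: object identifier -> set of items.
objects = {1: set("ACD"), 2: set("BCE"), 3: set("ABCE"),
           4: set("BE"), 5: set("ABCE"), 6: set("BCE")}
all_items = set().union(*objects.values())

def closure(itemset):
    """gamma(I): the maximal set of items common to all objects containing I."""
    itemset = set(itemset)
    covering = [obj for obj in objects.values() if itemset <= obj]
    if not covering:                      # contained in no object
        return set(all_items)             # conventionally, the top of the lattice
    return set.intersection(*covering)

print(sorted(closure("A")))    # ['A', 'C']: {A} is not closed, its closure is AC
print(sorted(closure("BE")))   # ['B', 'E']: {B, E} is closed
print(sorted(closure("C")))    # ['C']: {C} is closed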
Comparing Execution Times

Experiments conducted on both synthetic and operational datasets showed that (maximal) frequent itemsets-based approaches are more efficient than closed itemsets-based approaches on weakly correlated data, such as market-basket data. In such data, nearly all frequent itemsets also are frequent closed itemsets (i.e., the closed itemset lattice and the itemset lattice are nearly identical), and closure computations add execution times. Correlated data constitute a challenge for efficiently extracting association rules, since the number of frequent itemsets is most often very important, even for
high minsupport values. On these data, few frequent itemsets are also frequent closed itemsets. Thus, the closure helps to reduce the search space; fewer itemsets are tested, and the number of context scans is reduced. On such data, maximal frequent itemsets-based approaches suffer from the time needed to compute frequent itemset supports that require accessing the dataset. With the closure, these supports are derived from the supports of frequent closed itemsets without accessing the dataset.
Extracting Bases for Association Rules

Bases are minimal sets, with respect to some criteria, from which all rules can be deduced with support and confidence. The Duquenne-Guigues and the Luxenburger basis for global and partial implications were adapted to the association rule framework in Pasquier et al. (1999c) and Zaki (2000). These bases are minimal regarding the number of rules; no smaller set allows the deduction of all rules with support and confidence. However, they do not contain the minimal non-redundant rules. An association rule is redundant if it brings the same information or less general information than that conveyed by another rule with identical support and confidence. Then, an association rule r is a minimal non-redundant association rule if there is no association rule r' with the same support and confidence whose antecedent is a subset of the antecedent of r and whose consequent is a superset of the consequent of r. An inference system based on this definition was proposed in Cristofor and Simovici (2002). The Min-Max basis for exact association rules contains all rules G → γ(G) - G between a generator G and its closure γ(G) such that γ(G) ≠ G. The Min-Max basis for approximate association rules contains all rules G → C - G between a generator itemset G and a frequent closed itemset C that is a superset of its closure: γ(G) ⊂ C. These bases, also called informative bases, contain, respectively, the minimal non-redundant exact and approximate association rules. Their union constitutes a basis for all association rules: they all can be deduced with their support and confidence (Bastide et al., 2000). The objective is to capture the essential knowledge in a minimal number of rules without information loss. Algorithms for determining generators, frequent closed itemsets, and the min-max bases from frequent itemsets and their supports are presented in Pasquier et al. (2004).
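To connect these definitions, the following Python sketch, an added brute-force illustration for the small context of Table 1 and not an implementation of the algorithms cited above, finds the generators of each frequent closed itemset and lists the resulting min-max exact rules G → γ(G) - G; the minimum support threshold is an arbitrary choice for the example.

from itertools import combinations

objects = {1: set("ACD"), 2: set("BCE"), 3: set("ABCE"),
           4: set("BE"), 5: set("ABCE"), 6: set("BCE")}
items = sorted(set().union(*objects.values()))
minsupport = 2 / len(objects)        # illustrative threshold of 2 objects

def support(itemset):
    return sum(1 for obj in objects.values() if set(itemset) <= obj) / len(objects)

def closure(itemset):
    covering = [obj for obj in objects.values() if set(itemset) <= obj]
    return set.intersection(*covering) if covering else set(items)

# Group every frequent itemset by its closure; the minimal ones are the generators.
by_closure = {}
for size in range(1, len(items) + 1):
    for cand in map(set, combinations(items, size)):
        if support(cand) >= minsupport:
            by_closure.setdefault(frozenset(closure(cand)), []).append(cand)

for closed, candidates in sorted(by_closure.items(), key=lambda kv: sorted(kv[0])):
    generators = [g for g in candidates
                  if not any(other < g for other in candidates)]
    for g in generators:
        if set(closed) != g:           # min-max exact rule G -> gamma(G) - G
            print(sorted(g), "->", sorted(set(closed) - g), "(exact)")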
Comparing Sizes of Association Rule Sets

Results of experiments conducted on both synthetic and operational datasets show that the generation of the bases can reduce substantially the number of rules.
For weakly correlated data, very few exact rules are extracted, and the reduction for approximate rules is in the order of five for both the min-max and the Luxenburger bases. For correlated data, the Duquenne-Guigues basis reduces exact rules to a few tens; for the min-max exact basis, the reduction factor is about some tens. For approximate association rules, both the Luxenburger and the min-max bases reduce the number of rules by a factor of some hundreds. Even when the number of rules is reduced from several million to a few hundred or a few thousand, visualization tools such as templates and/or generalization tools such as taxonomies are still required to explore so many rules.
FUTURE TRENDS

Most recent research on association rule extraction concerns applications to natural phenomena modeling, gene expression analysis (Creighton & Hanash, 2003), biomedical engineering (Gao, Cong et al., 2003), and geospatial, telecommunications, Web, and semi-structured data analysis (Han et al., 2002). These applications most often require extending existing methods; for instance, to extract only rules with low support and high confidence in semi-structured (Cohen et al., 2001) or medical data (Ordonez et al., 2001), to extract temporal association rules in Web data (Yang & Parthasarathy, 2002), or to extract adaptive sequential association rules in long-term medical observation data (Brisson et al., 2004). Frequent closed itemsets extraction also is applied as a conceptual analysis technique to explore biological (Pfaltz & Taylor, 2002) and medical data (Cremilleux, Soulet & Rioult, 2003). These domains are promising fields of application for association rules and frequent closed itemsets-based techniques, particularly in combination with other data mining techniques, such as clustering and classification.
CONCLUSION

Next-generation data mining systems should answer the analysts' requirements for high-level, ready-to-use knowledge that will be easier to exploit. This implies the integration of data mining techniques into DBMSs and domain-specific applications (Ansari et al., 2001). This integration should incorporate the use of knowledge visualization and exploration techniques, knowledge consolidation by cross-analysis of the results of different techniques, and the incorporation of background knowledge, such as taxonomies or gene annotations for gene expression data, for example, into the process.
REFERENCES

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A.I. (1995). Fast discovery of association rules. Advances in knowledge discovery and data mining. AAAI/MIT Press.

Ansari, S., Kohavi, R., Mason, L., & Zheng, Z. (2001). Integrating e-commerce and data mining: Architecture and challenges. Proceedings of the ICDM Conference.

Baralis, E., & Psaila, G. (1997). Designing templates for mining association rules. Journal of Intelligent Information Systems, 9(1), 7-32.
Gao Cong, F.P., Tung, A., Yang, J., & Zaki, M.J. (2003). CARPENTER: Finding closed patterns in long biological datasets. Proceedings of the KDD Conference. Han, J., & Fu, Y. (1999). Mining multiple-level association rules in large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5), 798-804. Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1), 53-87. Han, J., Russ, B., Kumar, V., Mannila, H., & Pregibon, D. (2002). Emerging scientific applications in data mining. Communications of the ACM, 45(8), 54-58.
Bastide, Y., Pasquier, N., Taouil, R., Lakhal, L., & Stumme, G. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the DOOD Conference.
Lin, D., & Kedem, Z.M. (1998). PINCER-SEARCH: A new algorithm for discovering the maximum frequent set. Proceedings of the EDBT Conference.
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent closed itemsets with counting inference. SIGKDD Explorations, 2(2), 66-75.
Ordonez, C. et al. (2001). Mining constrained association rules to predict heart disease. Proceedings of the ICDM Conference.
Bayardo, R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the SIGMOD Conference.
Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1998). Pruning closed itemset lattices for association rules. Proceedings of the BDA Conference.
Bayardo, R.J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the KDD Conference. Bayardo, R.J., Agrawal, R., & Gunopulos, D. (2000). Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3), 217-240. Brisson, L., Pasquier, N., Hebert, C., & Collard, M. (2004). HASAR: Mining sequential association rules for atherosclerosis risk factor analysis. Proceedings of the PKDD Discovery Challenge. Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Transaction on Knowledge and Data Engineering, 13(1), 64,78. Creighton, C., & Hanash, S. (2003). Mining gene expression databases for association rules. Bioinformatics, 19(1), 79-86. Cremilleux, B., Soulet, A., & Rioult, F. (2003). Mining the strongest emerging patterns characterizing patients affected by diseases due to atherosclerosis. Proceedings of the PKDD Discovery Challenge. Cristofor, L., & Simovici, D.A. (2002). Generating an informative cover for association rules. Proceedings of the ICDM Conference. El-Hajj, M., & Zaïane, O.R. (2004). COFI approach for mining frequent itemsets revisited. Proceedings of the SIGMOD/DMKD Workshop. 756
Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999a). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25-46. Pasquier N., Bastide, Y., Taouil, R., & Lakhal, L. (1999b). Discovering frequent closed itemsets for association rules. Proceedings of the ICDT Conference. Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999c). Closed set based discovery of small covers for association rules. Proceedings of the BDA Conference. Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., & Lakhal, L. (2004). Generating a condensed representation for association rules. Journal of Intelligent Information Systems. Pfaltz J., & Taylor C. (2002, July). Closed set mining of biological data. Proceedings of the KDD/BioKDD Conference. Silverstein, C., Brin, S., & Motwani, R. (1998). Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1), 39-68. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., & Lakhal, L. (2002). Computing iceberg concept lattices with TITANIC. Data and Knowledge Engineering, 42(2), 189222.
Wang, J., & Han, J. (2004). BIDE: Efficient mining of frequent closed sequences. Proceedings of the ICDE Conference. Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. Proceedings of the KDD Conference. Yang, H., & Parthasarathy, S. (2002). On the use of constrained associations for Web log mining. Proceedings of the KDD/WebKDD Conference. Zaki, M.J. (2000). Generating non-redundant association rules. Proceedings of the KDD Conference. Zaki, M.J., & Hsiao, C.-J. (2002). CHARM: An efficient algorithm for closed itemset mining. Proceedings of the SIAM International Conference on Data Mining. Zaki, M.J., & Ogihara, M. (1998). Theoretical foundations of association rules. Proceedings of the SIGMOD/DMKD Workshop.
KEY TERMS

Association Rules: An implication rule between two itemsets with statistical measures of range (support) and precision (confidence).

Basis for Association Rules: A set of association rules that is minimal with respect to some criteria and from which all association rules can be deduced with support and confidence.

Closed Itemset: An itemset that is a maximal set of items common to a set of objects. An itemset is closed if it is equal to the intersection of all objects containing it.

Frequent Itemset: An itemset contained in a number of objects at least equal to some user-defined threshold.

Itemset: A set of binary attributes, each corresponding to an attribute value or an interval of attribute values.
Mining Chat Discussions

Stanley Loh
Catholic University of Pelotas, Brazil, and Lutheran University of Brasil, Brazil

Daniel Licthnow
Catholic University of Pelotas, Brazil

Thyago Borges
Catholic University of Pelotas, Brazil

Tiago Primo
Catholic University of Pelotas, Brazil

Rodrigo Branco Kickhöfel
Catholic University of Pelotas, Brazil

Gabriel Simões
Catholic University of Pelotas, Brazil

Gustavo Piltcher
Catholic University of Pelotas, Brazil

Ramiro Saldaña
Catholic University of Pelotas, Brazil
INTRODUCTION

According to Nonaka and Takeuchi (1995), the majority of organizational knowledge comes from interactions between people. People tend to reuse solutions from other persons in order to gain productivity. When people communicate to exchange information or acquire knowledge, the process is named Collaboration. Collaboration is one of the most important tasks for innovation and competitive advantage within learning organizations (Senge, 2001). It is important to record knowledge for later reuse and analysis. If knowledge is not adequately recorded, organized, and retrieved, the consequence is re-work, low productivity, and lost opportunities. Collaboration may be realized through synchronous interactions (e.g., exchange of messages in a chat), asynchronous interactions (e.g., electronic mailing lists or forums), direct contact (e.g., two persons talking), or indirect contact (when someone stores knowledge and others retrieve this knowledge in a remote place or time). In particular, chat rooms are becoming important tools for collaboration among people and knowledge exchange. Intelligent software systems may be integrated into chat rooms in order to help people in this collaboration task. For example, systems can identify the theme being
discussed and then offer new information, or they can remind people of existing information sources. This kind of system is named a recommender system. Furthermore, chat sessions carry implicit knowledge about what the participants know and how they view the world. Analyzing chat discussions allows understanding what people are looking for and how people collaborate with each other. Intelligent software systems can analyze discussions in chats to extract knowledge about the group or about the subject being discussed. Mining tools can analyze chat discussions to understand what is being discussed and help people. For example, a recommender system can analyze textual messages posted in a web chat, identify the subject of the discussion, and then look for items stored in a Digital Library to recommend individually to each participant of the discussion. Items can be electronic documents, web pages and bibliographic references stored in a Digital Library, past discussions, and authorities (people with expertise in the subject being discussed). Besides that, mining tools can analyze the whole discussion to map the knowledge exchanged among the chat participants. The benefits of such technology include supporting learning environments, knowledge management efforts within organizations, advertisement, and support to decisions.
BACKGROUND

Several works have investigated the analysis of online discussions. Brutlag and Meek (2000) studied the identification of themes in e-mails. The work compares identification based only on the subject line of the e-mails against identification based on the message bodies. One conclusion is that e-mail headers perform as well as message bodies, with the additional advantage of reducing the number of features to be analyzed. Busemann, Schmeier, and Arens (2000) investigated the special case of messages registered in call centers. The work showed that it is possible to identify themes in this kind of message, despite the informality of the language used in the messages. This informality causes mistakes due to jargon, misspellings, and grammatical inaccuracy. The work of Durbin, Richter, and Warner (2003) showed that it is possible to identify affective opinions about products and services in e-mails sent by customers, in order to alert responsible people or to evaluate the organization and customers' satisfaction. Furthermore, the work identifies the intensity of the rating, allowing the separation of moderate and intensive opinions. Tong (2001) investigated the analysis of online discussions about movies, where messages represent comments about movies. This work showed that it is feasible to find positive and negative opinions by analyzing key or cue words. Furthermore, the work also extracts information about the movies, like directors and actors, and then examines opinions about these particular characteristics. The only work found in the scientific literature that analyzes chat messages is the one from Khan, Fisher, Shuler, Wu, and Pottenger (2002). They apply mining techniques over chat messages in order to find social interactions among people. The goal is to find who is related to whom inside a specific area, by analyzing the exchange of messages in a chat and the subject of the discussion.
MAIN THRUST

In the following, the chapter explains how messages can be mined, how recommendations can be made, and how a whole discussion (an entire chat session) can be analyzed.
Identifying Themes in Chat Messages

To provide people with useful information during a collaboration session, the system has to identify what is being discussed. Textual messages sent by the users in the chat can be analyzed for this purpose. Texts can lead to the identification of the subject discussed because
the words and the grammar present in the texts represent knowledge from people, expressed in written form (Sowa, 2000). An ontology or thesaurus can be used to help identify cue words for each subject. The ontology or thesaurus holds the concepts of a domain or knowledge area, including relations between concepts and the terms used in written language to express these concepts (Gilchrist, 2003). The ontology can be created by machine learning methods (supervised learning), where human experts select training cases for each subject (e.g., texts of positive and negative examples) and an intelligent software system identifies the keywords that define each subject. The TFIDF method from Salton and McGill (1983) is the most used in this kind of task. If the terms that compose the messages are treated as a bag of words (i.e., with no difference in importance), probabilistic techniques can be used to identify the subject. On the other hand, natural language processing techniques can identify syntactic elements and relations, thus supporting more precise subject identification. The identification of themes should consider the context of the messages to determine whether the concept identified is really present in the discussion. A group of messages is better for inferring the subject than a single message, which avoids misunderstandings due to word ambiguity and the use of synonyms.
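As a rough illustration of keyword-based theme identification, the following Python sketch, an added example rather than the system described in this chapter, scores a window of chat messages against cue words selected for each subject of a hypothetical domain ontology, using simple TFIDF-style weights learned from labeled training texts; the subjects, training texts, and chat messages are all illustrative assumptions.

import math
from collections import Counter

# Hypothetical training texts (positive examples) for each subject of the ontology.
training = {
    "databases": ["sql query optimization on relational tables",
                  "indexing and joins in a relational database"],
    "networks":  ["tcp packets routed through the network switch",
                  "latency and bandwidth of the network link"],
}

def tfidf_cue_words(training, top_n=5):
    """Pick the top TFIDF-weighted terms of each subject as its cue words."""
    doc_freq = Counter()
    for texts in training.values():
        for term in set(" ".join(texts).split()):
            doc_freq[term] += 1
    n_subjects = len(training)
    cues = {}
    for subject, texts in training.items():
        tf = Counter(" ".join(texts).split())
        weights = {t: tf[t] * math.log(n_subjects / doc_freq[t]) for t in tf}
        cues[subject] = set(sorted(weights, key=weights.get, reverse=True)[:top_n])
    return cues

def identify_theme(messages, cues):
    """Score a window of chat messages against each subject's cue words."""
    words = Counter(" ".join(messages).lower().split())
    scores = {s: sum(words[w] for w in cue) for s, cue in cues.items()}
    return max(scores, key=scores.get), scores

cues = tfidf_cue_words(training)
window = ["anyone knows how to speed up this sql query",
          "maybe add an index on the join column"]
print(identify_theme(window, cues))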
Making Recommendations in a Chat Discussion

A recommender system is a software system whose main goal is to aid in the social collaborative process of indicating or receiving indications (Resnick & Varian, 1997). Recommender systems are broadly used in electronic commerce for suggesting products or providing information about products and services, helping people to decide in the shopping process (Lawrence et al., 2001; Schafer et al., 2001). The offered gain is that people do not need to request recommendations or to perform a query over an information base; the system decides what and when to suggest. The recommendation is usually based on user profiles and the reuse of solutions. When a subject is identified in a message, the recommender searches for items classified under this subject. Items can come from different databases. For example, a Digital Library may provide electronic documents, links to Web pages, and bibliographic references. A profile database may contain information about people, including the interest areas of each person, as well as an associated degree informing the user's knowledge level on the subject or how much competence he or she has in the area (his or her expertise). This can be used to
indicate the most active user in the area or who is the authority on the subject. A database of past discussions records everything that occurs in the chat during every discussion session. Discussions may be stored by session, identified by date and themes discussed, and can include who participated in the session, all the messages exchanged (with a label indicating who sent each one), the concept identified in each message, the recommendations made during the session for each user, and the documents downloaded or read during the session. Past discussions may be recommended during a chat session, reminding the participants that similar discussions have already happened. This database also allows users to review the whole discussion after the session. The great benefit is that users do not re-discuss the same question.
Mining a Chat Session

Analyzing the themes discussed in a chat session can provide an important overview of the discussion and also of the subject. Statistical tools applied over the messages sent and the subjects identified in each message can help users understand which themes were discussed the most. By counting the messages associated with each subject, it is possible to infer the central point of the discussion and the peripheral themes. The list of subjects identified during the chat session forms an interesting ordering, allowing users to analyze the path followed by the participants during the discussion. For example, it is possible to observe which was the central point of the discussion, whether the discussion deviated from the main subject, and whether the subjects present in the beginning of the discussion were also present at the end. The coverage of the discussion may be identified by the number of different themes discussed. Furthermore, this analysis allows identifying the depth of the discussion, that is, whether more specific themes were discussed or whether the discussion occurred superficially, at a higher conceptual level. Analyzing the messages sent by every participant allows determining the degree of participation of each person in the discussion: who participated more and who participated less. Furthermore, it is possible to observe which areas are interesting for each person and, in some way, to determine the expertise of the group and of the participants (i.e., the areas where the group is more competent). Association techniques can be used to identify correlations between themes or between themes and persons. For example, it is possible to find that some theme is present whenever another theme is also present, or that every discussion in which some person participated had a certain theme as the principal one.
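The session-level statistics described above amount to simple counting over the stored discussion. The following Python sketch, an added illustration with a hypothetical session log rather than the chapter's actual system, computes messages per subject, participation per user, the theme ordering, and theme/participant relationships for one session.

from collections import Counter
from itertools import combinations

# Hypothetical session log: (participant, subject identified in each message).
session = [("ana", "databases"), ("bob", "databases"), ("ana", "networks"),
           ("bob", "databases"), ("carla", "networks"), ("ana", "databases")]

messages_per_subject = Counter(subject for _, subject in session)
messages_per_user = Counter(user for user, _ in session)
central_theme = messages_per_subject.most_common(1)[0][0]
theme_order = list(dict.fromkeys(subject for _, subject in session))  # first-appearance order
coverage = len(theme_order)                      # number of distinct themes discussed

# Theme co-occurrence within the session (a simple association indicator).
theme_pairs = [tuple(sorted(p)) for p in combinations(theme_order, 2)]

# Which participant touched which themes (theme/person correlation).
themes_by_user = {}
for user, subject in session:
    themes_by_user.setdefault(user, set()).add(subject)

print("central theme:", central_theme)
print("coverage:", coverage, "themes, discussed in order:", theme_order)
print("participation:", dict(messages_per_user))
print("themes per participant:", themes_by_user)
print("theme pairs occurring together:", theme_pairs)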
FUTURE TRENDS

Recommender systems are still an emerging area, and there are open issues and doubts. For example, it is not clear whether it is good or bad to recommend items already suggested in past discussions (to re-recommend, as if reminding the person). Besides that, it is important to analyze the level of the participants in order to recommend only basic or advanced items. Collaborative filtering techniques can be used to recommend items already seen by other users (Resnick et al., 1994; Terveen & Hill, 2001). Grouping people with similar characteristics allows the crossing of recommended items, for example, to offer documents read by one person to others. In the same way, software systems can capture relevance feedback from users to narrow the list of recommendations. Users should read some items of the list and rate them, so that the system can use this information to eliminate items from the list or to reorder the items in a new ranking. The context of the messages needs to be studied further. To infer the subject being discussed, the system can analyze a group of messages, but it is necessary to determine how many (a fixed number, or all messages sent in the past N minutes?). An orthographic corrector is necessary to clean the messages posted to the chat. Many linguistic mistakes are expected, since people use chats in a hurry, with little attention to the language, without revision, and in an informal way. Furthermore, the text mining tools must handle special signs like novel abbreviations, emoticons, and slang expressions. Special words may be added to the domain ontology in order to accommodate these differences in the language.
CONCLUSION

An example of the kind of system discussed in this chapter is available at http://gpsi.ucpel.tche.br/sisrec. Currently, the system uses a domain ontology for computer science, but others can be used. Similarly, the current digital library only has items related to computer science. The recommendation system facilitates organizational learning because people receive suggestions of information sources during online discussions. The main advantage of the system is to free the user from the burden of searching for information sources during the online discussion. Users do not have to choose attributes or requirements from a menu of options in order to retrieve items from a database; the system decides when and what information to recommend to the user. This proactive
approach is useful for non-experienced users, who receive hints about what to read on a specific subject. Users' information needs are discovered naturally during the conversation. Furthermore, when the system indicates people who are authorities in each subject, naïve users can contact these authorities to get more knowledge. Another advantage of the system is that part of the knowledge shared in the discussion can be made explicit through the record of the discussion for future retrieval. Besides that, the system allows the posterior analysis of each discussion, presenting the subjects discussed, the messages exchanged, the items recommended, and the order in which the subjects were discussed. An important feature is the statistical analysis of the discussion, allowing users to understand the central point, the peripheral themes, the order of the discussion, and its coverage and depth. The benefit of mining chat sessions is of special interest for knowledge management efforts. Organizations can store tacit knowledge formatted as discussions. The discussions can be retrieved, so that knowledge can be reused. In the same way, the contents of a Digital Library (or Organizational Memory) can be better used through recommendations. People do not have to search for contents nor remember items in order to suggest them to others. Recommendations play this role in a proactive way, examining what people are discussing as well as users' profiles, and selecting interesting new contents. In particular, such systems (that mine chat sessions) can be used in e-learning environments, supporting the construction of knowledge by individuals or groups. Recommendations help the learning process by suggesting complementary contents (documents and sites stored in the Digital Library). Recommendations also include authorities in the topics being discussed, that is, people with high degrees of knowledge.
ACKNOWLEDGMENTS

This research group is partially supported by CNPq, an entity of the Brazilian government for scientific and technological development.
REFERENCES

Brutlag, J.D., & Meek, C. (2000). Challenges of the email domain for text classification. In Proceedings of the 7th International Conference on Machine Learning (ICML 2000) (pp. 103-110), Stanford University, Stanford, CA, USA.

Busemann, S., Schmeier, S., & Arens, R.G. (2000). Message classification in the call center. In Proceedings of the Applied Natural Language Processing Conference – ANLP'2000 (pp. 159-165), Seattle, WA.

Durbin, S.D., Richter, J.N., & Warner, D. (2003). A system for affective rating of texts. In Proceedings of the 3rd Workshop on Operational Text Classification, 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD-2003), Washington, DC.

Gilchrist, A. (2003). Thesauri, taxonomies and ontologies – an etymological note. Journal of Documentation, 59(1), 7-18.

Khan, F.M., Fisher, T.A., Shuler, L., Wu, T., & Pottenger, W.M. (2002). Mining chat-room conversations for social and semantic interactions. Technical Report LU-CSE-02-011, Lehigh University, Bethlehem, Pennsylvania, USA.

Lawrence, R.D. et al. (2001). Personalization of supermarket product recommendations. Journal of Data Mining and Knowledge Discovery, 5(1/2), 11-32.

Nonaka, I., & Takeuchi, T. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. Cambridge: Oxford University Press.

Resnick, P. et al. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. In Proceedings of the Conference on Computer Supported Cooperative Work (pp. 175-186).

Resnick, P., & Varian, H. (1997). Recommender systems. Communications of the ACM, 40(3), 56-58.

Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Schafer, J.B. et al. (2001). E-commerce recommendation applications. Journal of Data Mining and Knowledge Discovery, 5(1/2), 115-153.

Senge, P.M. (2001). The fifth discipline: The art and practice of the learning organization (9th ed.). São Paulo: Best Seller (in Portuguese).

Sowa, J.F. (2000). Knowledge representation: Logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole Publishing Co.
Terveen, L., & Hill, W. (2001). Human-computer collaboration in recommender systems. In J. Carroll (Ed.), Human computer interaction in the new millennium. Boston: Addison-Wesley.

Tong, R. (2001). Detecting and tracking opinions in online discussions. In Proceedings of the Workshop on Operational Text Classification, SIGIR, New Orleans, Louisiana, USA.
KEY TERMS

Chat: A software system that enables real-time communication among users through the exchange of textual messages.

Collaboration: The process of communication among people with the goal of sharing information and knowledge.

Digital Library: A set of electronic resources (usually documents) combined with a software system which allows storing, organizing and retrieving the resources.

Knowledge Management: Systems and methods for storing, organizing and retrieving explicit knowledge.

Mining: The application of statistical techniques to infer implicit patterns or rules in a collection of data, in order to discover new and useful knowledge.

Ontology: A formal and explicit definition of concepts (classes or categories) and their attributes and relations.

Recommendations: Results of the process of providing useful resources to a user, like products, services or information.

Recommender System: A software system that makes recommendations to a user, usually analyzing the user's interest or need.

Text Mining: The process of discovering new information by analyzing textual collections.
Mining Data with Group Theoretical Means

Gabriele Kern-Isberner
University of Dortmund, Germany
INTRODUCTION Knowledge discovery refers to the process of extracting new, interesting, and useful knowledge from data and presenting it in an intelligible way to the user. Roughly, knowledge discovery can be considered a three-step process: preprocessing data; data mining, in which the actual exploratory work is done; and interpreting the results to the user. Here, I focus on the data-mining step, assuming that a suitable set of data has been chosen properly. The patterns that we search for in the data are plausible relationships, which agents may use to establish cognitive links for reasoning. Such plausible relationships can be expressed via association rules. Usually, the criteria to judge the relevance of such rules are either frequency based (Bayardo & Agrawal, 1999) or causality based (for Bayesian networks, see Spirtes, Glymour, & Scheines, 1993). Here, I will pursue a different approach that aims at extracting what can be regarded as structures of knowledge — relationships that may support the inductive reasoning of agents and whose relevance is founded on information theory. The method that I will sketch in this article takes numerical relationships found in data and interprets these relationships as structural ones, using mostly algebraic techniques to elaborate structural information.
BACKGROUND Common sense and expert knowledge is most generally expressed by rules, connecting a precondition and a conclusion by an if-then construction. For example, you avoid puddles on sidewalks because you are aware of the fact that if you step into a puddle, then your feet might get wet; similarly, a physician would likely expect a patient showing the symptoms of fever, headache, and a sore throat to suffer from a flu, basing his diagnosis on the rule that if a patient has a fever, headache, and sore throat, then the ailment is a flu, equipped with a sufficiently high probability. If-then rules are more formally denoted as conditionals. The crucial point with conditionals is that they carry generic knowledge that is applicable to different situations. This fact makes them most interesting ob-
jects in artificial intelligence, in a theoretical as well as in a practical respect. For instance, a sales assistant who has a general knowledge about the preferences of his or her customers can use this knowledge when consulting any new customer. Typically, two central problems have to be solved in practical applications: First, where do the rules come from? How can they be extracted from statistical data? And second, how should rules be represented? How should conditional knowledge be propagated and combined for further inferences? Both of these problems can be dealt with separately, but it is most rewarding to combine them, that is, to discover rules that are most relevant with respect to some inductive inference formalism and to build up the best model from the discovered rules that can be used for queries.
MAIN THRUST This article presents an approach to discover association rules that are most relevant with respect to the maximum entropy methods. Because entropy is related to information, this approach can be considered as aiming to find the most informative rules in data. The basic idea is to exploit numerical relationships that are observed by comparing (relative) frequencies, or ratios of frequencies, and so forth, as manifestations of interactions of underlying conditional knowledge. My approach differs from usual knowledge discovery and data-mining methods in various respects:

• It explicitly takes the instrument of inductive inference into consideration.
• It is based on statistical information but not on probabilities close to 1; actually, it mostly uses only structural information obtained from the data.
• It is not based on observing conditional independencies (as for learning causal structures), but aims at learning relevant conditional dependencies in a nonheuristic way.
• As a further novelty, it does not compute single, isolated rules, but yields a set of rules by taking into account highly complex interactions of rules.
• Zero probabilities computed from data are interpreted as missing information, not as certain knowledge.
The resulting set of rules may serve as a basis for maximum entropy inference. Therefore, the method described in this article addresses minimality aspects, as in Padmanabhan and Tuzhilin (2000), and makes use of inference mechanisms, as in Cristofor and Simovici (2002). Different from most approaches, however, it exploits the inferential power of the maximum entropy methods in full consequence and in a structural, nonheuristic way.
Modelling Conditional Knowledge by Maximum Entropy (ME) Suppose a set R* = {(B1|A1)[x1], …, (Bn|An)[xn]} of probabilistic conditionals is given. For instance, R* may describe the knowledge available to a physician when he has to make a diagnosis. Or R* may express common sense knowledge, such as "Students are young with a probability of (about) 80%" and "Singles (i.e., unmarried people) are young with a probability of (about) 70%", the latter knowledge being formally expressed by R* = { (young|student)[0.8], (young|single)[0.7] }. Usually, such rule bases represent incomplete knowledge, in that many probability distributions are compatible with them. So learning, or inductively representing, the rules means taking them as a set of conditional constraints and selecting a unique probability distribution as the best model that can be used for queries and further inferences. Paris (1994) investigates several inductive representation techniques in a probabilistic framework and proves that the principle of maximum entropy (ME-principle) yields the only method to represent incomplete knowledge in an unbiased way, satisfying a set of postulates describing sound common sense reasoning. The entropy H(P) of a probability distribution P is defined as H(P) = − Σ_w P(w) log P(w), where the sum is taken over all possible worlds w; it measures the amount of indeterminateness inherent to P. Applying the principle of maximum entropy then means selecting the unique distribution P* = ME(R*) that maximizes H(P) among all distributions P that satisfy the rules in R*. In this way, the ME-method ensures that no further information is added, so the knowledge R* is represented most faithfully. Indeed, the ME-principle provides a most convenient and well-founded method to represent incomplete probabilistic knowledge (efficient implementations of ME-systems are described in Roedder & Kern-Isberner, 2003). In an ME-environment, the expert has to list only whatever relevant conditional probabilities he or she is aware of. Furthermore, ME-modelling preserves the
generic nature of conditionals by minimizing the amount of information being added, as shown in Kern-Isberner (2001). Nevertheless, modelling ME-rule bases has to be done carefully so as to ensure that all relevant dependencies are taken into account. This task can be difficult and troublesome. Usually, the modelling rules are based somehow on statistical data. So, a method to compute rule sets appropriate for ME-modelling from statistical data is urgently needed.
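To make the ME-principle concrete, the following sketch (hypothetical code, not part of the cited ME-systems; it assumes NumPy and SciPy are available) computes the maximum entropy distribution for the small student/single example above by direct numerical optimization over the eight possible worlds of the three binary attributes student, single, and young. Dedicated ME-systems use far more efficient, specialized algorithms; this brute-force version only illustrates the principle.

    import itertools
    import numpy as np
    from scipy.optimize import minimize

    worlds = list(itertools.product([0, 1], repeat=3))   # (student, single, young)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(np.sum(p * np.log(p)))              # minimizing this maximizes H(P)

    def conditional(attr, prob):
        # P(young | attr) = prob  <=>  P(young & attr) - prob * P(attr) = 0
        def constraint(p):
            joint = sum(pi for pi, w in zip(p, worlds) if w[attr] and w[2])
            marginal = sum(pi for pi, w in zip(p, worlds) if w[attr])
            return joint - prob * marginal
        return constraint

    constraints = [
        {"type": "eq", "fun": lambda p: float(np.sum(p)) - 1.0},  # normalization
        {"type": "eq", "fun": conditional(0, 0.8)},               # (young|student)[0.8]
        {"type": "eq", "fun": conditional(1, 0.7)},               # (young|single)[0.7]
    ]

    p0 = np.full(len(worlds), 1.0 / len(worlds))                  # uniform starting point
    res = minimize(neg_entropy, p0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * len(worlds), constraints=constraints)
    me = dict(zip(worlds, res.x))

    # Query the ME model, e.g. P(young | student & single):
    num = sum(p for (st, si, yo), p in me.items() if st and si and yo)
    den = sum(p for (st, si, yo), p in me.items() if st and si)
    print(round(num / den, 3))

Once the ME-distribution has been computed, generic queries such as P(young | student and single) can be answered from it, which is exactly the kind of inference the rule-discovery method described below is designed to support.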
Structures of Knowledge The most typical approach to discover interesting rules from data is to look for rules with a significantly high (conditional) probability and a concise antecedent (Bayardo & Agrawal, 1999; Agarwal, Aggarwal, & Prasad, 2000; Fayyad & Uthurusamy, 2002; Coenen, Goulbourne, & Leng, 2001). Basing relevance on frequencies, however, is sometimes unsatisfactory and inadequate, particularly in complex domains such as medicine. Further criteria to measure the interestingness of the rules or to exclude redundant rules have also been brought forth (Jaroszewicz & Simovici, 2001; Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Zaki, 2000). Some of these algorithms also make use of optimization criteria, which are based on entropy (Jaroszewicz & Simovici, 2002). Mostly, the rules are considered as isolated pieces of knowledge; no interaction between rules can be taken into account. In order to obtain more structured information, one often searches for causal relationships by investigating conditional independencies and thus noninteractivity between sets of variables (Spirtes et al., 1993). Although causality is undoubtedly most important for human understanding, the concept seems to be too rigid to represent human knowledge in an exhaustive way. For instance, a person suffering from a flu is certainly sick (P(sick | flu) = 1), and he or she often will complain about headaches (P(headache | flu) = 0.9). Then you have P(headache | flu) = P(headache | flu & sick), but you would surely expect that P(headache | not flu) is different from P(headache | not flu & sick)! Although the first equality suggests a conditional independence between sick and headache, due to the causal dependency between headache and flu, the second inequality shows this to be (of course) false. Furthermore, a physician might also state some conditional probability involving sickness and headache, so you obtain a complex network of rules. Each of these rules will be considered relevant by the expert, but none will be found when searching for conditional independencies! So what, exactly, are the structures of knowledge by which conditional dependencies (not indepen-
dencies! See also Simovici, Cristofor, D., & Cristofor, L., 2000) manifest themselves in data? To answer this question, the theory of conditional structures has been presented in Kern-Isberner (2000). Conditional structures are an algebraic means to make the effects of conditionals on possible worlds (i.e., possible combinations or situations) transparent, in that they reflect whether the corresponding world verifies the conditional or falsifies it, or whether the conditional cannot be applied to the world because the if-condition is not satisfied. Consider, for instance, the conditional “If you step in a puddle, then your feet might get wet.” In a particular situation, the conditional is applicable (you actually step into a puddle) or not (you simply walk around it), and it can be found verified (you step in a puddle and indeed, your feet get wet) or falsified (you step in a puddle, but your feet remain dry because you are wearing rain boots). This intuitive idea of considering a conditional as a three-valued event is generalized in Kern-Isberner (2000) to handle the simultaneous impacts of a set of conditionals by using algebraic symbols for positive and negative impact, respectively. Then for each world, a word of these symbols can be computed, which shows immediately how the conditionals interact on this world. The proper mathematical structure for building words are (semi)groups, and indeed, group theory provides the basis for connecting numerical to structural information in an elegant way. In short, a probability (or frequency) distribution is called (conditionally) indifferent with respect to a set of conditionals R* iff its numerical information matches the structural information provided by conditional structures. In particular, each ME-distribution turns out to be indifferent with respect to a generating set of conditionals.
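The three-valued reading of a conditional can be illustrated with a small, hypothetical sketch (plain Python; it is a toy illustration, not the algebraic group machinery of conditional structures itself): each possible world either verifies a conditional (B|A), falsifies it, or is not affected by it because the if-part does not apply.

    from itertools import product

    worlds = list(product([0, 1], repeat=2))      # worlds over two attributes (A, B)

    def effect(world, antecedent=0, consequent=1):
        a, b = world[antecedent], world[consequent]
        if not a:
            return "not applicable"               # the if-condition is not satisfied
        return "verified" if b else "falsified"

    for w in worlds:
        print(w, effect(w))                       # the effect of (B|A) on each world

Conditional structures generalize this bookkeeping to whole sets of conditionals by attaching group elements to the verifying and falsifying cases, so that the combined effect of all conditionals on each world becomes a word over these symbols.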
Data Mining and Group Theory — A Strange Connection? The concept of conditional structures, however, is not only an algebraic means to judge well-behavedness with respect to conditional information. The link between numerical and structural information, which is provided by the concept of conditional indifference, can also be used in the other direction, that is, to derive structural information about the underlying conditional relationships from numerical information. More precisely, finding a set of rules with the ability to represent a given probability distribution P via ME-methods can be done by elaborating numerical relationships in P, interpreting them as manifestations of underlying conditional dependencies. The procedure to discover appropriate sets of rules is sketched in the following steps:
• Start with a set B of simple rules, the length of which is considered to be large enough to capture all relevant dependencies.
• Search for numerical relationships in P by investigating which products of probabilities match.
• Compute the corresponding conditional structures with respect to B, yielding equations of group elements.
• Solve these equations by forming appropriate factor groups.
• Building these factor groups corresponds to eliminating and joining the basic conditionals in B to make their information more concise, in accordance with the numerical structure of P. Actually, the antecedents of the conditionals in B are shortened so as to comply with the numerical relationships in P.
So the basic idea of this algorithm is to start with long rules and to shorten them in accordance with the probabilistic information provided by P without losing information. Group theory actually provides an elegant framework, on the one hand, to disentangle highly complex conditional interactions in a systematic way, and on the other hand, to make operations on the conditionals computable, which is necessary to make information more concise.
How to Handle Sparse Knowledge The frequency distributions calculated from data are mostly not positive — just to the contrary, they would be sparse, full of zeros, with only scattered clusters of nonzero probabilities. This overload of zeros is also a problem with respect to knowledge representation, because a zero in such a frequency distribution often merely means that such a combination has not been recorded. The strict probabilistic interpretation of zero probabilities, however, is that such a combination does not exist, which does not seem to be adequate. The method sketched in the preceding section is also able to deal with that problem in a particularly adequate way: The zero values in frequency distributions are taken to be unknown but equal probabilities, and this fact can be exploited by the algorithm. So they actually help to start with a tractable set B of rules right from the beginning (see also Kern-Isberner & Fisseler, 2004). In summary, zeros occurring in the frequency distribution computed from data are considered as missing information, and in my algorithm, they are treated as non-knowledge without structure.
FUTURE TRENDS Although, by and large, the domain of knowledge discovery and data mining is dominated by statistical techniques and the problem of how to manage vast amounts of data, the increasing need for and popularity of human-machine interactions will make it necessary to search for more structural knowledge in data that can be used to support (humanlike) reasoning processes. The method described in this article offers an approach to realize this aim. The conditional relationships that my algorithm reveals can be considered as a kind of cognitive link of an ideal agent, and the ME-technology takes on the task of inductive reasoning to make use of this knowledge. Combined with clustering techniques in large databases, for example, it may turn out to be a useful method for discovering relationships that go far beyond the results provided by other, more standard data-mining techniques.
CONCLUSION In this article, I have developed a new method for discovering conditional dependencies from data. This method is based on information-theoretical concepts and grouptheoretical techniques, considering knowledge discovery as an operation inverse to inductive knowledge representation. By investigating relationships between the numerical values of a probability distribution P, the effects of conditionals are analyzed and isolated, and conditionals are joined suitably so as to fit the knowledge structures inherent to P.
REFERENCES Agarwal, R. C., Aggarwal, C. C., & Prasad, V. V. V. (2000). Depth first generation of long patterns. Proceedings of the Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 108-118). Bastide, Y., Pasquier, N., Taouil, R., Stumme, G. & Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. Proceedings of the First International Conference on Computational Logic (pp. 972-986). Bayardo, R. J., & Agrawal, R. (1999). Mining the most interesting rules. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Coenen, F., Goulbourne, G., & Leng, P. H. (2001). Computing association rules using partial totals. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 54-66). Cristofor, L., & Simovici, D. (2002). Generating an informative cover for association rules. Proceedings of the IEEE International Conference on Data Mining (pp. 597-600). Fayyad, U., & Uthurusamy, R. (2002). Evolving data mining into solutions for insights. Communications of the ACM, 45(8), 28-61. Jaroszewicz, S., & Simovici, D. A. (2001). A general measure of rule interestingness. Proceedings of the Fifth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 253-265). Jaroszewicz, S., & Simovici, D. A. (2002). Pruning redundant association rules using maximum entropy principle. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Kern-Isberner, G. (2000). Solving the inverse representation problem. Proceedings of the 14th European Conference on Artificial Intelligence (pp. 581-585). Kern-Isberner, G. (2001). Conditionals in nonmonotonic reasoning and belief revision. Lecture Notes in Artificial Intelligence. Kern-Isberner, G., & Fisseler, J. (2004). Knowledge discovery by reversing inductive knowledge representation. Proceedings of the Ninth International Conference on the Principles of Knowledge Representation and Reasoning. Padmanabhan, B., & Tuzhilin, A. (2000). Small is beautiful: Discovering the minimal set of unexpected patterns. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 54-63). Paris, J. B. (1994). The uncertain reasoner’s companion: A mathematical perspective. Cambridge University Press. Roedder, W., & Kern-Isberner, G. (2003). From information to probability: An axiomatic approach. International Journal of Intelligent Systems, 18(4), 383-403. Simovici, D.A., Cristofor, D., & Cristofor, L. (2000). Mining for purity dependencies in databases (Tech. Rep. No. 00-2). Boston: University of Massachusetts. Spirtes, P., Glymour, C., & Scheines, R.. (1993). Causation, prediction and search. Lecture Notes in Statistics, 81.
Zaki, M. J. (2000). Generating non-redundant association rules. Proceedings of the Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 34-43).
KEY TERMS Conditional: The formal algebraic term for a rule that need not be strict, but also can be based on plausibility, probability, and so forth. Conditional Independence: A generalization of plain statistical independence that allows you to take a context into account. Conditional independence is often associated with causal effects.
Conditional Structure: An algebraic expression that makes the effects of conditionals on possible worlds transparent and computable. Entropy: Measures the indeterminateness inherent to a probability distribution and is dual to information. Possible World: Corresponds to the statistical notion of an elementary event. Probabilities over possible worlds, however, have a more epistemic, subjective meaning, in that they are assumed to reflect an agent’s knowledge. Principle of Maximum Entropy: A method to complete incomplete probabilistic knowledge by minimizing the amount of information added. Probabilistic Conditional: A conditional that is assigned a probability. To match the notation of conditional probabilities, a probabilistic conditional is written as (B|A)[x] with the meaning “If A holds, then B holds with probability x.”
Mining E-Mail Data

Steffen Bickel
Humboldt-Universität zu Berlin, Germany

Tobias Scheffer
Humboldt-Universität zu Berlin, Germany
INTRODUCTION E-mail has become one of the most important communication media for business and private purposes. Large amounts of past e-mail records reside on corporate servers and desktop clients. There is a huge potential for mining this data. E-mail filing and spam filtering are well-established e-mail mining tasks. E-mail filing addresses the assignment of incoming e-mails to predefined categories to support selective reading and organize large e-mail collections. First research on e-mail filing was conducted by Green and Edwards (1996) and Cohen (1996). Pantel and Lin (1998) and Sahami, Dumais, Heckerman, and Horvitz (1998) first published work on spam filtering; here, the goal is to filter out unsolicited messages. Recent research on e-mail mining addresses automatic e-mail answering (Bickel & Scheffer, 2004) and mining social networks from e-mail logs (Tyler, Wilkinson, & Huberman, 2004). In the Background section we categorize common e-mail mining tasks according to their objective and give an overview of the research literature. The Main Thrust section addresses e-mail mining with the objective of supporting the message creation process. Finally, we discuss future trends and conclude.
BACKGROUND There are two objectives for mining e-mail data: supporting communication and discovering hidden properties of communication networks.
Support of Communication The problems of filing e-mails and filtering spam are text classification problems. Text classification is a well studied research area; a wide range of different methods is available. Most of the common text classification algorithms have been applied to the problem of e-mail classification and their performance has been compared in several studies. Because publishing an e-mail data set involves disclosure of private e-mails, there are only a small number of standard e-mail classification data sets.
Since there is no study that compares large numbers of data sets, different classifiers, and different types of extracted features, it is difficult to judge which text classifier performs best specifically for e-mail classification. Against this background, we try to draw some conclusions about which text classifier is best suited for e-mail. Cohen (1996) applies rule induction to the e-mail classification problem, and Provost (1999) finds that Naïve Bayes outperforms rule induction for e-mail filing. Naïve Bayes classifiers are widely used for e-mail classification because of their simple implementation and low computation time (Pantel & Lin, 1998; Rennie, 2000; Sahami, Dumais, Heckerman, & Horvitz, 1998). Joachims (1997, 1998) shows that Support Vector Machines (SVMs) are superior to the Rocchio classifier and Naïve Bayes for many text classification problems. Drucker, Wu, and Vapnik (1999) compare SVM with boosting on decision trees; SVM and boosting show similar performance, but SVM proves to be much faster and has a preferable distribution of errors. The performance of an e-mail classifier depends on the extraction of appropriate features. Joachims (1998) shows that applying feature selection for text classification with SVM does not improve performance. Hence, using SVM one can bypass the expensive feature selection process and simply include all available features. Features that are typically used for e-mail classification include all tokens in the e-mail body and header in bag-of-words representation using TF- or TFIDF-weighting. HTML tags and single URL elements also provide useful information (Graham, 2003). Boykin and Roychowdhury (2004) propose a spam filtering method that is not based on text classification but on graph properties of message sub-graphs. All addresses that appear in the headers of the inbound mails are graph nodes; an edge is added between all pairs of addresses that jointly appear in at least one header. The resulting sub-graphs exhibit graph properties that differ significantly for spam and non-spam sub-graphs. Based on this finding, "black-" and "whitelists" can be constructed for spam and non-spam addresses. While this idea is appealing, it should be noted that the approach is not immediately practical, since most headers of spam e-mails do not
contain other spam recipients’ addresses, and most senders’ addresses are used only once. Additionally, the “semantic e-mail” approach (McDowell, Etzioni, Halevy, & Levy, 2004) aims at supporting communication by allowing automatic e-mail processing and facilitating e-mail mining; it is the equivalent of semantic web for e-mail. The goal is to make e-mails human- and machine-understandable with a standardized set of e-mail processes. Each e-mail has to follow a standardized process definition that includes specific process relevant information. An example for a semantic e-mail process is meeting coordination. Here, the individual process tasks (corresponding to single e-mails) are issuing invitations and collecting responses. In order to work, semantic e-mail would require a global agreement on standardized semantic processes, special e-mail clients and training for all users. Additional mining tasks for support of communication are automatic e-mail answering and sentence completion. They are described in Section Main Thrust.
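As a concrete illustration of the classification setup discussed above, the following hypothetical sketch (it assumes scikit-learn is available and uses a tiny invented toy corpus; the cited studies worked with their own implementations and much larger mail collections) combines TFIDF-weighted bag-of-words features with a linear SVM:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy training data; a real system would train on a user's labeled mail folders.
    mails = [
        "Cheap watches, click here now",
        "Meeting moved to 3pm, see agenda attached",
        "You won a prize, claim immediately",
        "Please review the draft report before Friday",
    ]
    labels = ["spam", "work", "spam", "work"]

    classifier = make_pipeline(
        TfidfVectorizer(lowercase=True),  # tokens from body/header in TFIDF weighting
        LinearSVC(),                      # linear SVM, following the studies cited above
    )
    classifier.fit(mails, labels)
    print(classifier.predict(["Claim your cheap prize now"]))

The same pipeline serves for e-mail filing by replacing the spam/non-spam labels with folder names.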
Discovering Hidden Properties of Communication Networks E-mail communication patterns reveal much information about hidden social relationships within organizations. Conclusions about informal communities and informal leadership can be drawn from e-mail graphs. Differences between informal and formal structures in business organizations can provide clues for improvement of formal structures, which may lead to enhanced productivity. In the case of terrorist networks, the identification of communities and potential leaders is obviously helpful as well. Additional potential applications lie in marketing, where companies – especially communication providers – can target communities as a whole. In social science, it is common practice for studies on electronic communication within organizations to derive the network structure by means of personal interviews or surveys (Garton, Haythornthwaite, & Wellman, 1997; Hinds & Kiesler, 1995). For large organizations, this is not feasible. Building communication graphs from e-mail logs is a very simple and accurate alternative, provided that the data is available. Tyler, Wilkinson, and Huberman (2004) derive a network structure from e-mail logs and apply a divisive clustering algorithm that decomposes the graph into communities. Tyler, Wilkinson, and Huberman verify the resulting communities by interviewing the communication participants; they find that the derived communities correspond to informal communities. Tyler et al. also apply a force-directed spring algorithm (Fruchterman & Rheingold, 1991) to identify leadership hierarchies. They find that with increasing distance of
vertices from the “spring” (center) there is a tendency of decreasing real hierarchy depth. E-mail graphs can also be used for controlling virus attacks. Ebel, Mielsch, and Bornholdt (2002) show that vertex degrees of e-mail graphs are governed by power laws. By equipping the small number of highly connected nodes with anti-virus software the spreading of viruses can be prevented easily.
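A minimal, hypothetical sketch of this kind of analysis is given below (it assumes the networkx library and an invented toy log; the cited studies use real corporate e-mail logs and the original divisive algorithm at much larger scale):

    import networkx as nx
    from networkx.algorithms import community

    log = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
           ("dave", "erin"), ("erin", "frank"), ("dave", "frank"),
           ("carol", "dave")]                      # toy (sender, recipient) pairs

    g = nx.Graph()
    for sender, recipient in log:
        if g.has_edge(sender, recipient):
            g[sender][recipient]["weight"] += 1    # count exchanged messages
        else:
            g.add_edge(sender, recipient, weight=1)

    # First split produced by the divisive (Girvan-Newman) algorithm: candidate communities.
    first_split = next(community.girvan_newman(g))
    print([sorted(c) for c in first_split])

    # Degree distribution; Ebel et al. (2002) report power-law behaviour on real e-mail graphs.
    print(sorted(dict(g.degree()).items()))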
MAIN THRUST In the last section we categorized e-mail mining tasks according to their objective and gave a short explanation of each task. We will now focus on the ones that we consider to be most interesting and potentially most beneficial for users and describe them in greater detail. These tasks aim at supporting the message creation process. Many e-mail management systems allow the definition of message templates that simplify the message creation for recurring topics. This is a first step towards supporting the message creation process, but the past e-mails that are available for mining are disregarded. We describe two approaches for supporting the message creation process by mining historic data: mining question-answer pairs and mining sentences.
Mining Question-Answer Pairs We consider the problem of learning to answer incoming e-mails from records of past communication. We focus on environments in which large amounts of similar answers to frequently asked questions are sent – such as call centers or customer support departments. In these environments, it is possible to manually identify equivalence classes of answers in the records of outbound communication. Each class then corresponds to a set of semantically equivalent answers sent in the past; it depends strongly on the application context which fraction of the outbound communication falls into such classes. Mapping inbound messages to one of the equivalence classes of answers is now a multi-class text classification problem that can be solved with text classifiers. This procedure requires a user to manually group previously sent answers into equivalence classes which can then serve as class labels for training a classifier. This substantial manual labeling effort reduces the benefit of the approach. Even though it can be reduced by employing semi-supervised learning (Nigam, McCallum, Thrun, & Mitchell, 2000; Scheffer, 2004), it would still be much preferable to learn from only the available data: stored inbound and outbound messages. Bickel and Scheffer (2004) discuss an algorithm that learns to answer ques-
tions from only the available data and does not require additional manual labeling. The key idea is to replace the manual assignment of outbound messages to equivalence classes by a clustering step. The algorithms for training (learning from message pairs) and answering a new question are shown in Table 1. In the training phase, a clustering algorithm identifies groups of similar outbound messages. Each cluster then serves as class label; the corresponding questions which have been answered by a member of the cluster are used as training examples for a multi-class text classifier. The medoid of each cluster (the outbound message closest to the center) is used as an answer template. The classifier maps a newly incoming question to one of the clusters; this cluster’s medoid is then proposed as answer to the question. Depending on the user interface, high confidence messages might be answered automatically, or an answer is proposed which the user may then accept, modify, or reject (Scheffer, 2004). The approach can be extended in many ways. Multiple topics in a question can be identified to mix different corresponding answer templates and generate a multitopic answer. Question specific information can be extracted in an additional information extraction step and automatically inserted into answer templates. In this extraction step also customer identifications can be extracted and used for a database lookup that provides customer and order specific information for generating more customized answers. Bickel and Scheffer (2004) analyze the relationship of answer classes regarding the separability of the corresponding questions using e-mails sent by the service department of an online shop. By analyzing this relationship one can draw conclusions about the amount of addi-
tional information that is needed for answering specific types of questions. This information can be visualized in an inseparability graph, where each class of equivalent answers is represented by a vertex, and an edge is drawn when a classifier that discriminates between these classes achieves only a low AUC performance (the AUC performance is the probability that, when a positive and a negative example are drawn at random, a discriminator assigns a higher value to the positive than to the negative one). Typical examples of inseparable answers are “your order has been shipped this morning” and “your order will be shipped tomorrow”. Intuitively, it is not possible to predict which of these answers a service employee will send, based on only the question “when will I receive my shipment?”
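A heavily simplified, hypothetical sketch of the learning step summarized in Table 1 (shown below) is the following; it assumes scikit-learn, uses flat k-means instead of the bisecting, variance-thresholded clustering of the original algorithm, and omits pruning, the miscellaneous class, and template instantiation:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    questions = ["when will my order arrive", "how do I reset my password",
                 "where is my shipment", "password reset link does not work"]
    answers = ["your order has been shipped", "use the reset form on our site",
               "your order will be shipped tomorrow", "we have sent a new reset link"]

    ans_vec = TfidfVectorizer()
    A = ans_vec.fit_transform(answers)
    clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(A)

    # Answer template per cluster: the answer closest to the cluster centroid (medoid).
    templates = {}
    for c in range(clustering.n_clusters):
        idx = [i for i, l in enumerate(clustering.labels_) if l == c]
        centroid = clustering.cluster_centers_[c]
        dists = [np.linalg.norm(A[i].toarray().ravel() - centroid) for i in idx]
        templates[c] = answers[idx[int(np.argmin(dists))]]

    # Train a classifier that maps an incoming question to an answer cluster.
    q_vec = TfidfVectorizer()
    Q = q_vec.fit_transform(questions)
    clf = LinearSVC().fit(Q, clustering.labels_)

    new_q = "when do I get my shipment"
    predicted = clf.predict(q_vec.transform([new_q]))[0]
    print(templates[predicted])   # proposed answer template

In a deployed system, the SVM decision value would additionally be compared against the confidence threshold θ before the template is proposed or the message is answered automatically.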
Table 1. Algorithms for learning from message pairs and answering new questions

Learning from message pairs. Input: message pairs, variance threshold σ², pruning parameter π.
1. Recursively cluster the answers of the message pairs with a bisecting partitioning clustering algorithm; end the recursion when the cluster variance lies below σ².
2. Prune all clusters with fewer than π elements. Combine all pruned clusters into one "miscellaneous" cluster. Let n be the number of resulting clusters.
3. For all n clusters:
   a. Construct an answer template by choosing the answer that is most similar to the centroid of this cluster in vector space representation and remove the salutation line.
   b. Let the inbound mails that have been answered by a mail in the current cluster be the positive training examples for this answer class.
4. Train an SVM classifier that classifies an inbound message into one of the n answer classes or the "miscellaneous" class from these training examples. Return this classifier.

Answering new questions. Input: new question message, message answering hypothesis, confidence threshold θ.
1. Classify the new message into one of the n answer classes and remember the SVM decision function value.
2. If the confidence exceeds the confidence threshold θ, propose the answer template that corresponds to the classification result. Perform instantiation operations that typically include formulating a salutation line.

Mining Sentences The message creation process can also be supported on a sentence level. Given an incomplete sentence, the task of sentence completion is to propose parts or the total rest of the current sentence, based on an application-specific document collection. A sentence completion user interface can, for instance, display a proposed completion in a "micro window" and insert the proposed text when the user presses the "tab" key. The sentence completion problem poses new challenges for data mining and information retrieval, including the problem of finding sentences whose initial fragment is similar to a given fragment in a very large text corpus. To this end, Grabski and Scheffer (2004) provide a retrieval algorithm that uses a special inverted indexing structure to find the sentence whose initial fragment is most similar to a given fragment, where similarity is
defined in terms of the greatest cosine similarity of the TFIDF vectors. In addition, they study an approach that compresses the data further by identifying clusters of the most frequently used similar sets of sentences. In order to evaluate the accuracy of sentence completion algorithms, Grabski and Scheffer (2004) measure how frequently the algorithm, when given a sentence fragment drawn from a corpus, provides a prediction with confidence above θ, and how frequently this prediction is semantically equivalent to the actual sentence in the corpus. They find that for the sentence mining problem higher precision and recall values can be obtained than for the problem of mining question answer pairs; depending on the threshold θ and the fragment length, precision values of between 80% and 100% and recall values of about 40% can be observed.
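A brute-force, hypothetical stand-in for such fragment-based retrieval is sketched below (scikit-learn assumed; the cited system uses a dedicated inverted index over sentence prefixes and cluster-based compression, which this toy version does not attempt):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "please find the requested documents attached to this message",
        "please do not hesitate to contact us for further questions",
        "thank you for your interest in our products",
    ]
    fragment = "please do not hesitate"

    # Compare the typed fragment against equally long initial fragments of all sentences.
    prefix_len = len(fragment.split())
    prefixes = [" ".join(s.split()[:prefix_len]) for s in corpus]

    vec = TfidfVectorizer()
    X = vec.fit_transform(prefixes + [fragment])
    sims = cosine_similarity(X[-1], X[:-1]).ravel()

    best = int(sims.argmax())
    confidence = sims[best]
    if confidence > 0.5:                        # confidence threshold theta
        completion = corpus[best][len(" ".join(corpus[best].split()[:prefix_len])):]
        print(completion.strip())               # proposed completion of the sentence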
FUTURE TRENDS Spam filtering and e-mail filing based on message text can be reduced to the well studied problem of text classification. The challenges that e-mail classification faces today concern technical aspects, the extraction of spam-specific features from e-mails, and an arms race between spam filters and spam senders adapting to known filters. By comparison, research in the area of automatic e-mail answering and sentence completion is in an earlier stage; we see a substantial potential for algorithmic improvements to the existing methods. The technical integration of these approaches into existing e-mail clients or call-center automation software provides an additional challenge. Some of these technical challenges have to be addressed before mining algorithms that aim at supporting communication can be evaluated under realistic conditions. Construction of social network graphs from e-mail logs is much easier than by surveys and there is a huge interest in mining social networks – see, for instance, the DARPA program on Evidence Extraction and Link Discovery (EELD). While social networks have been studied intensely in the social sciences and in physics, we see a considerable potential for new and better mining algorithms for social networks that computer scientists can contribute.
CONCLUSION Some methods that can form the basis for effective spam filtering have reached maturity (text classification), additional foundations are being worked on (social network analysis). Today, technical challenges dominate the development of spam filters. The development of methods that support and automate communication processes is
a research topic and first solutions to some of the problems involved have been studied. Mining social networks from e-mail logs is a new challenge; research on this topic in computer science is in an early stage.
ACKNOWLEDGMENT The authors are supported by the German Science Foundation DFG under grant SCHE540/10-1. We would like to thank the anonymous reviewers.
REFERENCES

Bickel, S., & Scheffer, T. (2004). Learning from message pairs for automatic email answering. Proceedings of the European Conference on Machine Learning.

Boykin, P., & Roychowdhury, V. (2004). Personal e-mail networks: An effective anti-spam tool. Preprint, arXiv id 0402143.

Cohen, W. (1996). Learning rules that classify e-mail. Proceedings of the IEEE Spring Symposium on Machine Learning for Information Access, Palo Alto, California, USA.

Drucker, H., Wu, D., & Vapnik, V. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048-1055.

Ebel, H., Mielsch, L., & Bornholdt, S. (2002). Scale-free topology of e-mail networks. Physical Review, E 66.

Fruchterman, T. M., & Rheingold, E. M. (1991). Force-directed placement. Software Experience and Practice, 21(11).

Garton, L., Haythornthwaite, C., & Wellman, B. (1997). Studying online social networks. Journal of Computer-Mediated Communication, 3(1).

Grabski, K., & Scheffer, T. (2004). Sentence completion. Proceedings of the SIGIR International Conference on Information Retrieval, Sheffield, UK.

Graham, P. (2003). Better Bayesian filtering. Proceedings of the First Annual Spam Conference, MIT. Retrieved from http://www.paulgraham.com/better.html

Green, C., & Edwards, P. (1996). Using machine learning to enhance software tools for internet information management. Proceedings of the AAAI Workshop on Internet Information Management.

Hinds, P., & Kiesler, S. (1995). Communication across boundaries: Work, structure, and use of communication technologies in a large organization. Organization Science, 6(4), 373-393.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Proceedings of the International Conference on Machine Learning.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning.

McDowell, L., Etzioni, O., Halevy, A., & Levy, H. (2004). Semantic e-mail. Proceedings of the WWW Conference.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3).

Pantel, P., & Lin, D. (1998). Spamcop: A spam classification and organization program. Proceedings of the AAAI Workshop on Learning for Text Categorization.

Provost, J. (1999). Naïve Bayes vs. rule-learning in classification of e-mail. Technical Report AI-TR-99-284, University of Texas at Austin.

Rennie, J. (2000). iFILE: An application of machine learning to e-mail filtering. Proceedings of the SIGKDD Text Mining Workshop.

Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail. Proceedings of the AAAI Workshop on Learning for Text Categorization.

Scheffer, T. (2004). E-mail answering assistance by semi-supervised text classification. Intelligent Data Analysis, 8(5).

Tyler, J. R., Wilkinson, D. M., & Huberman, B. A. (2003). E-mail as spectroscopy: Automated discovery of community structure within organizations. Proceedings of the International Conference on Communities and Technologies (pp. 81-95). Kluwer Academic Publishers.

KEY TERMS

Community: A group of people having mutual relationships among themselves or having common interests. Clusters in social network graphs are interpreted as communities.
Mining E-Mails: The application of analytical methods and tools to e-mail data for (a) support of communication by filing e-mails into folders, filtering spam, answering e-mails automatically, or proposing completions to sentence fragments, and (b) discovery of hidden properties of communication networks by e-mail graph analysis.

Mining Question-Answer Pairs: Analytical method for automatically answering question e-mails using knowledge that is discovered in question-answer pairs of past e-mail communication.

Mining Sentences: Analytical method for interactively completing incomplete sentences using knowledge that is discovered in a document collection.

Semantic E-Mail: E-mail framework in which the semantics of e-mails is understandable by both humans and machines. A standardized definition of semantic e-mail processes is required.

Spam E-Mail: Unsolicited and unwanted bulk e-mail. Identifying spam e-mail is a text classification task.

Text Classification: The task of assigning documents expressed in natural language to one or more categories (classes) of a predefined set.

TFIDF: Weighting scheme for document and query representation in the vector space model. Each dimension represents a term; its value is the product of the frequency of the term in the document (TF) and the inverse document frequency (IDF) of the term. The inverse document frequency of a term is the logarithm of the inverse of the proportion of documents in which the term occurs. The TFIDF scheme assigns a high weight to terms that occur frequently in the focused document but are infrequent in average documents.
Mining for Image Classification Based on Feature Elements

Yu-Jin Zhang
Tsinghua University, Beijing, China
INTRODUCTION Motivation: Image Classification in Web Search The growth of the Internet and of storage capability not only makes images an increasingly widespread information format on the World Wide Web (WWW), but also dramatically expands the number of images on the Web and makes the search for required images more complex and time-consuming. To search images on the WWW efficiently, effective image search engines need to be developed. The classification of images plays an important role both for Web image searching and for retrieving, as it is time-consuming for users to browse through the huge amount of data on the Web. Classification has been used to provide access to large image collections in a more efficient manner, because classification can reduce the search space by filtering out images in unrelated categories (Hirata, 2000). The heterogeneous nature of Web images makes their classification a challenging task; a functional classification scheme should take the contents of images into consideration. Association rule mining, first proposed by Agrawal (1993), is an appropriate tool for pattern detection in knowledge discovery and data mining. Its objective is to extract useful information from very large databases (Renato, 2002). By using rules extracted from images, the content of images can be suitably analyzed, and the information required for image classification can be obtained.
Highlights of the Article A novel method for image classification based on feature elements through association rule mining is presented. The feature elements can capture well the visual meanings of images according to the subjective perception of human beings. In addition, feature elements are discrete entities and are suitable for working with rule-based classification models. Different from traditional image classification methods, the proposed classification approach, based on feature elements, does not
compute the distance between two vectors in the feature space. This approach just tries to find associations between the feature elements and class attributes of the image. Techniques for mining the association rules are adapted, and the mined rules are applied to image classifications. Experiments with real images show that the new approach not only reduces the classification errors but also diminishes the time complexity. The remaining parts of this article are structured as follows:
• Background: (1) Feature Elements vs. Feature Vectors; (2) Association Rules and Rule Mining; (3) Classification Based on Association
• Main Thrust: (1) Extracting Various Types of Feature Elements; (2) Feature Element Based Image Classification; (3) Database Used in Test; (4) Classification and Comparison Results
• Direction of Future Research
• Conclusion
BACKGROUND Feature Elements vs. Feature Vectors Traditionally, feature vectors are used for object identification and classification as well as for content-based image retrieval (CBIR). In object identification and classification, different features representing the characteristics of objects are extracted first. These features mark out an object to a point in the feature space. By detecting this point in the space, the object can be identified or classified. In CBIR, the procedure is similar. Features such as color, texture, and shape are extracted from images and grouped into feature vectors (Zhang, 2003). The similarity among images is measured by distances between corresponding vectors. However, these feature vectors often are different from the representation and description adapted by human beings. For example, when people look at a colorful image, they hardly figure out its color histogram but rather are concerned about what particular colors are contained in certain components of the image. In fact,
these color components play a great role in perception and represent useful visual meanings of images. The pixels belonging to these visual components can be taken to form perceptual primitive units, by which human beings could identify the content of images (Xu, 2001). The feature elements are defined on the basis of these primitive units. They are discrete quantities, relatively independent of each other, and have obvious intuitive visual senses. In addition, they can be considered as sets of items. Based on feature elements, image classification becomes a process of counting the existence of representative components in images. For this purpose, it is required to find some association rules between the feature elements and the class attributes of image.
Association Rules and Rule Mining The association rule can be represented by an expression X ⇒ Y, where X and Y can be any discrete entities. As we discuss image databases, X and Y can be some feature elements extracted from images. The meaning of X ⇒ Y is: Given an image database D, for each image I ∈ D, X ⇒ Y expresses that whenever an image I contains X, then I probably will also contain Y. The support of the association rule is defined as the probability p(X ⊆ I, Y ⊆ I), and the confidence of the association rule is defined as the conditional probability p(Y ⊆ I | X ⊆ I). A rule with support greater than a specified minimum support and with confidence greater than a specified minimum confidence is considered a significant association rule. Since the introduction of association rule mining by Agrawal (1993), much research has been conducted to enhance its performance. Most works can be grouped into the following categories:

1. Works on mining different types of rules, such as multidimensional rules (Yang, 2001).
2. Works taking advantage of particular techniques, such as tree projection (Guralnik, 2004), multiple minimum supports (Tseng, 2001), constraint-based clustering (Tung, 2001), and association (Cohen, 2001).
3. Works developing fast algorithms, such as an algorithm based on anti-skew partitioning (Lin, 1998).
4. Works on discovering rules in temporal databases, such as discovering temporal association rules (Guimaraes, 2000; Li, 2003).
Currently, the association rule mining (Lee, 2003; Harms, 2004) is one of the most popular pattern discovery methods in knowledge discovery and data mining. In contrast to the classification rule mining (Pal, 2003), the purpose of association rule mining is to find all significant rules in the database that satisfy some mini774
mum support and minimum confidence constraints (Hipp, 2000). It is known that rule-based classification models often have difficulty dealing with continuous variables. However, as a feature element is just a discrete entity, association rules can easily be used for treating images represented and described by feature elements. In fact, a decision about whether an image I contains feature element X and/or feature element Y can be properly defined and detected.
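Before turning to CBA, a small, hypothetical sketch (plain Python with an invented toy database; the names of the feature elements are made up for illustration) shows how the support and confidence of a rule X ⇒ Y, as defined above, can be computed over images represented as sets of discrete feature elements:

    def support(db, x, y):
        return sum(1 for items in db if x <= items and y <= items) / len(db)

    def confidence(db, x, y):
        containing_x = [items for items in db if x <= items]
        if not containing_x:
            return 0.0
        return sum(1 for items in containing_x if y <= items) / len(containing_x)

    # Toy database: each image is a set of feature elements (and possibly a class item).
    images = [
        {"red_cluster", "round_shape", "class:flower"},
        {"red_cluster", "round_shape"},
        {"blue_cluster", "elongated_shape", "class:auto"},
        {"red_cluster", "class:flower"},
    ]

    x, y = {"red_cluster"}, {"class:flower"}
    print(support(images, x, y), confidence(images, x, y))   # 0.5 and about 0.67

Only rules whose support and confidence exceed the chosen minimum thresholds are retained as significant.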
Classification Based on Association Classification based on associations (CBA) is an algorithm for integrating classification and association rule mining (Liu, 1998). Assume that the data set is a normal relational table that consists of N cases described by distinct attributes and classified into several known classes. All the attributes are treated uniformly. For a categorical attribute, all the possible values are mapped to a set of consecutive positive integers. With these mappings, a data case can be treated as a set of (attribute, integer value) pairs plus a class label. Each (attribute, integer value) pair is called an item. Let D be the data set, I the set of all items in D, and Y the set of class labels. A class association rule (CAR) is an implication of the form X ⇒ y, where X ⊆ I and y ∈ Y. A data case d ∈ D is said to contain an itemset X ⊆ I if X ⊆ d. A rule X ⇒ y holds in D with confidence C if C percent of the cases in D that contain X are labeled with class y. The rule X ⇒ y has support S in D if S percent of the cases in D contain X and are labeled with class y. The objective of CBA is to generate the complete set of CARs that satisfy the specified minimum support and minimum confidence constraints, and to build a classifier from the CARs. It is easy to see that if the right-hand side of the association rules is restricted to the (classification) class attributes, then such rules can be regarded as classification rules to build classifiers.
MAIN THRUST Extracting Various Types of Feature Elements Various types of feature elements that put emphasis on different properties will be employed in different applications. The extractions of feature elements can be carried out first by locating the perceptual elements and then by determining their main properties and giving them suitable descriptions. Three typical examples are described in the following.
One process for obtaining feature elements primarily based on color properties can be described by the following steps (Xu, 2001); a small illustrative sketch follows the list:

1. Images are divided into several clusters with a perceptual grouping based on the hue histogram.
2. For each cluster, the central hue value is taken as its color cardinality, named Androutsos-cardinality (AC). In addition, the color-coherence-vector (CCV) and color-auto-correlogram (CAC) are also calculated.
3. Additional attributes such as the center coordinates and area of each cluster are recorded to represent the position and size information of the clusters.
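The following hypothetical sketch (OpenCV assumed; "example.jpg" is a placeholder path) uses plain k-means on the hue channel as a simple stand-in for the perceptual grouping described above, and records a central hue, a center position, and an area per cluster; the CCV and CAC descriptors are omitted:

    import cv2
    import numpy as np

    bgr = cv2.imread("example.jpg")                      # placeholder input image
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0].astype(np.float32).reshape(-1, 1)

    k = 4                                                # number of hue clusters
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(hue, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)

    h, w = hsv.shape[:2]
    ys, xs = np.indices((h, w))
    labels = labels.reshape(h, w)
    for c in range(k):
        mask = labels == c
        feature_element = {
            "central_hue": float(centers[c][0]),         # color cardinality of the cluster
            "center": (float(xs[mask].mean()), float(ys[mask].mean())),
            "area": int(mask.sum()),
        }
        print(feature_element)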
One type of feature element highlighting the form property of clusters is obtained with the help of Zernike moments (Xu, 2003). They are invariant to similarity transformations, such as translation, rotation, and scaling of the planar shape (Wee, 2003). Based on Zernike moments of clusters, different descriptors for expressing circularity, directionality, eccentricity, roundness, symmetry, and so forth, can be directly obtained, which provides useful semantic meanings of clusters with respect to human perception. The wavelet feature element is based on wavelet modulus maxima and invariant moments (Zhang, 2003). Wavelet modulus maxima can indicate the location of edges in images. A set of seven invariant moments (Gonzalez, 2002) is used to represent the multi-scale edges in wavelet-transformed images. Three steps are taken first (a brief sketch follows the list):

1. Images are decomposed, using a dyadic wavelet, into a multi-scale modulus image.
2. Pixels in the wavelet domain whose moduli are locally maxima are used to form multi-scale edges.
3. The seven invariant moments at each scale are computed and combined to form the feature vector of images.
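A rough, hypothetical sketch of these steps is given below; it assumes the PyWavelets and OpenCV packages, uses "example.jpg" as a placeholder path, approximates the modulus maxima by simply thresholding the detail coefficients, and uses the seven Hu moments as the invariant moments:

    import cv2
    import numpy as np
    import pywt

    image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

    levels = 6
    coeffs = pywt.wavedec2(image, "haar", level=levels)

    rows = []
    for ch, cvert, cd in coeffs[1:]:               # one detail tuple per decomposition level
        modulus = np.sqrt(ch ** 2 + cvert ** 2)    # modulus of the wavelet transform
        edges = (modulus > modulus.mean() + modulus.std()).astype(np.float32)
        hu = cv2.HuMoments(cv2.moments(edges)).ravel()   # seven invariant moments
        rows.append(hu)

    feature = np.array(rows)                       # 6 x 7 matrix of moments, cf. Figure 1
    level_elements = feature                       # six 7-D vectors: all moments of one level
    moment_elements = feature.T                    # seven 6-D vectors: one moment across levels
    print(feature.shape)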
Figure 1. Splitting and grouping feature vectors to construct feature elements (the 42 moments m11 through m67 form a 6 × 7 grid, with one row of seven moments per decomposition level; the grid can be split by rows into six 7-D vectors or by columns into seven 6-D vectors)
Then, a process of discretization follows (Li, 2002). Suppose the wavelet decomposition is performed in six levels; for each level, seven moments are computed. This gives a 42-D vector. It can be split into six groups, each of them being a 7-D vector that represents the seven moments on one level. On the other hand, the whole vector can be split into seven groups, each of them being a 6-D vector that represents one moment on all six levels. This process can be described with the help of Figure 1. In all these examples, the feature elements have properties represented by numeric values. As not all of the feature elements have the same status in the visual sense, an evaluation of feature elements is required to select suitable feature elements according to the subjective perception of human beings (Xu, 2002).
Feature Element Based Image Classification
Feature Element Based Image Classification (FEBIC) uses CBA to find association rules between feature elements and the class attributes of the images; the class attributes of unlabeled images can then be predicted with such rules. In case an unlabeled image satisfies several rules that would classify it into different classes, the support and confidence values of the rules are used to make the final decision. In accordance with the assumption in CBA, each image is considered a data case described by a number of attributes, and the components of the feature elements are taken as the attributes. The labeled image set can thus be treated as a normal relational table that is used to mine association rules for classification. In the same way, feature elements are extracted from unlabeled images to form another relational table without class attributes, to which the classification rules are applied to predict the class attribute of each unlabeled image. The whole procedure can be summarized as follows (a sketch of the rule-based prediction step is given after the list):
1. Extract feature elements from images.
2. Form a relational table for mining association rules.
3. Use the mined rules to predict the class attributes of unlabeled images.
4. Classify images using the association of feature elements.
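The rule-selection step can be sketched as follows. This is a minimal illustration under our own assumptions: the rule format and the tie-breaking by confidence and then support follow the description above, but the data structures and feature-element names are hypothetical, not from the original FEBIC implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Rule:
    items: frozenset        # feature elements required on the rule's left-hand side
    label: str              # predicted class attribute
    support: float
    confidence: float

def predict(image_elements: Set[str], rules: List[Rule]) -> Optional[str]:
    """Pick the class of the best matching rule: highest confidence, then highest support."""
    matching = [r for r in rules if r.items <= image_elements]
    if not matching:
        return None                          # no rule fires; a default class could be returned
    best = max(matching, key=lambda r: (r.confidence, r.support))
    return best.label

# Toy usage with made-up feature-element names: two rules disagree, confidence breaks the tie.
rules = [
    Rule(frozenset({"AC_high", "CCV_3"}), "flower", support=0.10, confidence=0.80),
    Rule(frozenset({"AC_high"}), "scenery", support=0.15, confidence=0.60),
]
print(predict({"AC_high", "CCV_3", "Zernike_round"}, rules))    # -> flower
```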
Database Used in Test
The image database for testing consists of 2,558 real-color images that can be grouped into five different classes: (1) 485 images with (big) flowers; (2) 565 images with person pictures; (3) 505 images with autos; (4) 500 images with different sceneries (e.g., sunset, sunrise, beach, mountain, forest, etc.); and (5) 503 images with flower clusters. Among these classes, the first three have prominent objects, while the other two normally have no dominant items. Two typical examples from each class are shown in Figure 2. Among these images, one-third have been used in the test set and the rest in the training set. The images in the training set are labeled manually and then used in the mining of association rules, while the images in the testing set will be labeled automatically by these mined rules.
Figure 2. Typical image examples from different classes: (a) flower image, (b) person picture, (c) auto image, (d) scenery, (e) flower cluster
Classification and Comparison Results
Classification experiments using the two methods on the previously mentioned database were carried out. The proposed method, FEBIC, is compared to another state-of-the-art method, nearest feature line (NFL) (Li, 2000), which is a classification method based on feature vectors. In the comparison, the color features (i.e., AC, CCV, CAC) and the wavelet feature based on wavelet modulus maxima and invariant moments are used. Two tests are performed; for each test, both methods use the same training set and testing set. The results of these experiments are summarized in Table 1, where the classification error rates for each class and the average over the five classes are listed. The results in Table 1 show that the classification error rate of NFL is about 34.5%, while the classification error rate of FEBIC is about 25%. The difference is evident.
Besides the classification error, the time complexity is another important factor to be considered in Web applications, as the number of images on the WWW is huge. The computation times of the two methods were compared during the test experiments: the time needed for FEBIC is only about 1/100 of the time needed for NFL. Since NFL requires many arithmetic operations to compute distance functions, while FEBIC needs only a few operations to judge the existence of feature elements, such a big difference in computation is to be expected.

Table 1. Comparison of classification errors
Error rate        Test set 1            Test set 2
                  FEBIC     NFL         FEBIC     NFL
Flower            32.1%     48.8%       36.4%     46.9%
Person            22.9%     25.6%       20.7%     26.1%
Auto              21.3%     23.1%       18.3%     23.1%
Scenery           30.7%     38.0%       32.5%     34.3%
Flower cluster    26.8%     45.8%       20.2%     37.0%
Average           26.6%     35.8%       25.4%     33.2%

FUTURE TRENDS
The detection and description of feature elements play an important role in providing suitable information and a basis for association rule mining. How to adaptively design feature elements that can capture the user's intention based on perception and interpretation needs further research. The proposed techniques can also be extended to the content-based retrieval of images over the Internet. As feature elements are discrete entities, the similarity between images described by feature elements can be computed according to the number of common elements.
CONCLUSION
A new approach for image classification that uses feature elements and employs association rule mining is proposed. It provides a lower classification error and higher computational efficiency. These advantages make it quite suitable for inclusion in a Web search engine for images over the Internet.
ACKNOWLEDGMENTS This work has been supported by the Grants NNSF60172025 and TH-EE9906.
REFERENCES
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD.
Cohen, E. et al. (2001). Finding interesting associations without support pruning. IEEE Trans. Knowledge and Data Engineering, 13(1), 64-78.
Gonzalez, R.C., & Woods, R.E. (2002). Digital image processing. Prentice Hall.
Guimaraes, G. (2000). Temporal knowledge discovery for multivariate time series with enhanced self-organizing maps. Proceedings of the International Joint Conference on Neural Networks.
Guralnik, V., & Karypis, G. (2004). Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4), 443-472.
Harms, S.K., & Deogun, J.S. (2004). Sequential association rule mining with time lags. Journal of Intelligent Information Systems, 22(1), 7-22.
Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule mining—A general survey and comparison. ACM SIGKDD Explorations, 2(1), 58-64.
Hirata, K. et al. (2000). Integration of image matching and classification for multimedia navigation. Multimedia Tools and Applications, 11, 295-309.
Lee, C.H., Chen, M.S., & Lin, C.R. (2003). Progressive partition miner: An efficient algorithm for mining general temporal association rules. IEEE Trans. Knowledge and Data Engineering, 15(4), 1004-1017.
Li, Q., Zhang, Y.J., & Dai, S.Y. (2002). Image search engine with selective filtering and feature element based classification. Proceedings of the SPIE Internet Imaging III.
Li, S.Z., Chan, K.L., & Wang, C.L. (2000). Performance evaluation of the nearest feature line method in image classification and retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11), 1335-1339.
Li, Y.J. et al. (2003). Discovering calendar-based temporal association rules. Data and Knowledge Engineering, 44(2), 193-218.
Lin, J.L., & Dunham, M.H. (1998). Mining association rules: Anti-skew algorithms. Proceedings of the International Conference on Data Engineering.
Liu, B., Hsu, W., & Ma, Y.M. (1998). Integrating classification and association rule mining. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
Pal, S.K. (2003). Soft computing pattern recognition, case generation and data mining. Proceedings of the International Conference on Active Media Technology.
Renato, C. (2002). A theoretical framework for data mining: The "informational paradigm." Computational Statistics and Data Analysis, 38(4), 501-515.
Tseng, M.C., Lin, W., & Chien, B.C. (2001). Maintenance of generalized association rules with multiple minimum supports. Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society.
Tung, A.K.H. et al. (2001). Constraint-based clustering in large databases. Proceedings of the International Conference on Database Theory.
Wee, C.Y. (2003). New computational methods for full and subset Zernike moments. Information Sciences, 159(3-4), 203-220.
Xu, Y., & Zhang, Y.J. (2001). Image retrieval framework driven by association feedback with feature element evaluation built in. Proceedings of the SPIE Storage and Retrieval for Media Databases.
Xu, Y., & Zhang, Y.J. (2002). Feature element theory for image recognition and retrieval. Proceedings of the SPIE Storage and Retrieval for Media Databases.
Xu, Y., & Zhang, Y.J. (2003). Semantic retrieval based on feature element constructional model and bias competition mechanism. Proceedings of the SPIE Storage and Retrieval for Media Databases.
Yang, C., Fayyad, U., & Bradley, P.S. (2001). Efficient discovery of error-tolerant frequent itemsets in high dimensions. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
Zhang, Y.J. (2003). Content-based visual information retrieval. Science Publisher.
KEY TERMS
Classification Error: Error produced by incorrect classifications, which consists of two types: false negative (wrongly classifying an item belonging to one class into another class) and false positive (wrongly classifying an item from other classes into the current class).
Classification Rule Mining: A technique/procedure aiming to discover a small set of rules in the database to form an accurate classifier for classification.
Content-Based Image Retrieval (CBIR): A process framework for efficiently retrieving images from a collection by similarity. The retrieval relies on extracting the appropriate characteristic quantities describing the desired contents of images. In addition, suitable querying, matching, indexing, and searching techniques are required.
Multi-Resolution Analysis: A process to treat a function (i.e., an image) at various levels of resolutions and/or approximations. In such a way, a complicated function can be divided into several simpler ones that can be studied separately.
Pattern Detection: Concerned with locating patterns in the database to maximize/minimize a response variable or minimize some classification error (i.e., supervised pattern detection), or with not only locating occurrences of the patterns in the database but also deciding whether such an occurrence is a pattern (i.e., unsupervised pattern detection).
Pattern Recognition: Concerned with the classification of individual patterns into pre-specified classes (i.e., supervised pattern recognition), or with the identification and characterization of pattern classes (i.e., unsupervised pattern recognition).
Similarity Transformation: A group of transformations that preserve the angles between any two curves at their intersecting points. It is also called an equiform transformation, because it preserves the form of curves. A planar similarity transformation has four degrees of freedom and can be computed from a two-point correspondence.
Web Image Search Engine: A kind of search engine that starts from several initially given URLs and extends from complex hyperlinks to collect images on the WWW. A Web search engine is also known as a Web crawler.
Web Mining: Concerned with the mechanism for discovering the correlations among the references to various files that are available on the server by a given client visit to the server.
Mining for Profitable Patterns in the Stock Market
Yihua Philip Sheng
Southern Illinois University, USA
Wen-Chi Hou
Southern Illinois University, USA
Zhong Chen
Shanghai JiaoTong University, PR China
INTRODUCTION The stock market, like other economic phenomena, is a very complex system. Many factors, such as company news, interest rates, macro economic data, and investors’ hopes and fears, all affect its behavior (Pring, 1991; Sharpe, Alexander, & Bailey, 1999). Investors have longed for tools and algorithms to analyze and predict stock market movement. In this study, we combine a financial theory, the market efficiency theory, and a data mining technique to explore profitable trading patterns in the stock market. To observe the price oscillation of several consecutive trading days, we examine the K-lines, each of which represents a stock’s one-day movement. We will use a data mining technique with a heuristic rating algorithm to mine for reliable patterns indicating price rise or fall in the near future.
BACKGROUND
Methods of Stock Technical Analysis
Conventional stock market technical analysis is often done by visually identifying patterns or indicators on stock price and volume charts. Indicators like moving averages and support and resistance levels are easy to implement algorithmically. Patterns like the head-and-shoulder, inverse head-and-shoulder, and broadening tops and bottoms are easy for humans to identify visually but difficult for computers to recognize. For such patterns, methods like smoothing estimators and kernel regression can be applied to increase their machine-readability (Dawson & Steeley, 2003; Lo, Mamaysky, & Wang, 2000). The advances in data mining technology have pushed the technical analysis of the stock market from simple indicators, visually recognizable patterns, and linear
statistical models to more complicated nonlinear models. A great deal of research has focused on the application of artificial intelligence (AI) algorithms, such as artificial neural networks (ANNs) and genetic algorithms (e.g., Allen & Karjalainen, 1999; Chenoweth, Obradovic, & Stephenlee, 1996; Thawornwong, Enke, & Dagli, 2003). ANNs equipped with effective learning algorithms can use different kinds of inputs, handle noisy data, and identify highly nonlinear models. Genetic algorithms constitute a class of search, adaptation, and optimization techniques that emulate the principles of natural evolution. More recent studies tend to embrace multiple AI techniques in one approach. Tsaih, Hsu, and Lai (1998) integrated the rule-based systems technique and the neural networks technique to predict the direction of daily price changes in S&P 500 stock index futures. Armano, Murru, and Roli (2002) employed ANNs with a genetic algorithm to predict the Italian stock market. Fuzzy logic, a relatively new AI technique, has also been used in the stock market prediction literature (e.g., Dourra & Siy, 2001). The increasing popularity of fuzzy logic is due to its simplicity in constructing models and its lower computational load. Fuzzy algorithms provide a fairly straightforward translation of qualitative/linguistic rule statements.
Market Efficiency Theory According to the market efficiency theory, a stock’s price is a full reflection of market information about that stock (Fama, 1991; Malkiel, 1996). Therefore, if there is information out on the market about a stock, the stock’s price will adjust accordingly. Interestingly, evidence shows that price adjustment in response to news usually does not settle down in one day; it actually takes some time for the whole market to digest the news. If the stock’s price really adjusted to relevant events in a timely manner, the stock price chart would have looked more like what Figure 1 shows.
Figure 1. Ideal stock price movement curve under the market efficiency theory
The flat periods indicate that no events occurred during those periods, while the sharp edges indicate sudden stock price movements in response to event announcements. In reality, however, most stocks' daily prices resemble the curve shown in Figure 2. As the figure shows, there is no obvious flat period for the stock, and the price seems to keep changing. In some cases, the stock price continuously moves down or up for a relatively long period, for example, from May 17, 2002 to July 2, 2002, and from October 16, 2002 to November 6, 2002. This could be because there were negative (or positive) events for the company every day over a long period, or because the stock price adjustment to events actually spans a period of time rather than occurring instantly. The latter would mean that the price adjustment to event announcements is not efficient and that the semi-strong form of the market efficiency theory does not hold. Furthermore, we think the first few days' price adjustments are crucial, and the price movements in these early days might contain enough information to predict whether the rest of the price adjustment in the near future is upwards or downwards.
Knowledge Representation
Knowledge representation holds the key to the success of data mining. A good knowledge representation should be able to cover all possible phenomena of a problem domain without complicating it (Liu, 1998). Here, we use K-Lines, a representation of daily stock prices widely used in Asian stock markets, to describe the daily price change of a stock. Figure 3 shows examples of K-Lines.
Figure 3(a) is a price-up K-Line, denoted by an empty rectangle, indicating that the closing price is higher than the opening price. Figure 3(b) is a price-down K-Line, denoted by a solid rectangle, indicating that the closing price is lower than the opening price. Figures 3(c) and 3(d) are 3-day K-Lines. Figure 3(c) shows that the price was up for two consecutive days and that the second day's opening price continued from the first day's closing price, indicating that the news was very positive. The price came down a little on the third day, which might be a correction to the over-valuation of the good news in the prior two days; in fact, the long shadow above the closing price of the second day already shows some degree of price correction. Figure 3(d) is the opposite of Figure 3(c). When an event about a stock happens, such as rumors of a merger/acquisition or a change of dividend policy, the price adjustments might last for several days until the price finally settles down. As a result, the stock's price might keep rising or falling, or stay the same, during the price adjustment period. A stock has a K-Line for every trading day, but not every K-Line is of interest to us. Our goal is to identify a stock's K-Line patterns that reflect investors' reactions to market events, such as the release of good or bad corporate news, major stock analysts' upgrades of the stock, and so on. Such market events usually cause the stock's price to oscillate for a period of time. Certainly, a stock's price sometimes might change with large magnitude for just a day or two due to transient market rumors; these types of price oscillations are regarded as market noise and are therefore ignored. Whether a stock's daily price oscillates is determined by examining whether the price change on that day is greater than the average price change of the year. If a stock's price oscillates for at least three consecutive days, we regard it as a signal of the occurrence of a market event. The market's response to the event is recorded in a 3-day K-Line pattern. Then, we examine whether this pattern is followed by an up or down trend of the stock's price a few days later.
Figure 2. The daily stock price curve of Intel Corporation (NasdaqNM Symbol: INTC)
Figure 3. K-Line examples
The relative positions of the K-Lines, such as one day's opening/closing prices relative to the prior day's closing/opening prices, the length of the price body, and so on, reveal market reactions to the events. The following bit-representation method, called Relative Price Movement (RPM) for simplicity, is used to describe the positional relationship of the K-Lines over three days (a small illustrative encoder is sketched after the list).

Day 1
• bit 0: 1 if the day's price is up, 0 otherwise
• bit 1: 1 if the up shadow is longer than the price body, 0 otherwise
• bit 2: 1 if the down shadow is longer than the price body, 0 otherwise

Day 2
• bits 0-2: the same as Day 1's representation
• bits 3-5:
  001, if the price body covers the day 1's price body
  010, if the price body is covered by the day 1's price body
  011, if the whole price body is higher than the day 1's price body
  100, if the whole price body is lower than the day 1's price body
  101, if the price body is partially higher than the day 1's price body
  110, if the price body is partially lower than the day 1's price body

Day 3
• bits 0-2: the same as Day 1's representation
• bits 3-7:
  00001-00111, reserved
  01000, if the price body covers the day 1 and day 2's price bodies
  01001, if the price body covers the day 1's price body only
  01010, if the price body covers the day 2's price body only
  01011, if the price body is covered by the day 1 and day 2's price bodies
  01100, if the price body is covered by the day 2's price body only
  01101, if the price body is covered by the day 1's price body only
  01110, if the whole price body is higher than the day 1 and day 2's price bodies
  01111, if the whole price body is higher than the day 1's price body only
  10000, if the whole price body is higher than the day 2's price body only
  10001, if the whole price body is lower than the day 1 and day 2's price bodies
  10010, if the whole price body is lower than the day 1's price body only
  10011, if the whole price body is lower than the day 2's price body only
  10100, if the price body is partially higher than the day 1 and day 2's price bodies
  10101, if the price body is partially lower than the day 1 and day 2's price bodies
  10110, if the price body is partially higher than the day 2's price body only
  10111, if the price body is partially higher than the day 1's price body only
  11000, if the price body is partially lower than the day 2's price body only
  11001, if the price body is partially lower than the day 1's price body only
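To make the bit layout concrete, here is a small illustrative encoder for the Day 1 bits and the Day 2 body-relation code. It is a sketch under our own assumptions (an OHLC record with open/high/low/close fields; the Day 3 codes would follow the same idea with more cases), not the authors' implementation.

```python
from collections import namedtuple

Day = namedtuple("Day", "open high low close")

def day1_bits(d: Day) -> int:
    """Bits 0-2 of the RPM code: price up, long upper shadow, long lower shadow."""
    body = abs(d.close - d.open)
    top, bottom = max(d.open, d.close), min(d.open, d.close)
    bits = 0
    if d.close > d.open:
        bits |= 1                     # bit 0: price is up
    if (d.high - top) > body:
        bits |= 1 << 1                # bit 1: upper shadow longer than the price body
    if (bottom - d.low) > body:
        bits |= 1 << 2                # bit 2: lower shadow longer than the price body
    return bits

def body_relation(day2: Day, day1: Day) -> int:
    """Bits 3-5 for Day 2: position of day 2's price body relative to day 1's."""
    lo2, hi2 = sorted((day2.open, day2.close))
    lo1, hi1 = sorted((day1.open, day1.close))
    if lo2 <= lo1 and hi2 >= hi1: return 0b001   # covers day 1's body
    if lo2 >= lo1 and hi2 <= hi1: return 0b010   # covered by day 1's body
    if lo2 > hi1:                 return 0b011   # wholly higher
    if hi2 < lo1:                 return 0b100   # wholly lower
    if hi2 > hi1:                 return 0b101   # partially higher
    return 0b110                                 # partially lower

day1 = Day(10.0, 10.8, 9.9, 10.5)
day2 = Day(10.6, 11.2, 10.5, 11.0)
code_day2 = day1_bits(day2) | (body_relation(day2, day1) << 3)
print(f"day-2 code: {code_day2:#04x}")
```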
Mining for Rules
The rules we mine for are similar to those by Liu (1998), Siberschatz & Tuzhilin (1996), and Zaki, Parthasatathy, Ogihara, & Li (1997). They have the following format:
• Rule type (1): a 3-day K-Line pattern → the stock's price rises 10% in 10 days
• Rule type (2): a 3-day K-Line pattern → the stock's price falls 10% in 10 days
The search algorithm for finding 3-day K-Line patterns that lead to stock price rises or falls is as follows (a compact version of this counting pass is sketched after the list):
1. For every 3-day K-Line pattern in the database:
2. Encode it by using the RPM method to get every day's bit representation, c1, c2, c3;
3. Increase pattern_occurrence[c1][c2][c3] by 1;
4. base_price = the 3rd day's closing price;
5. If the stock's price rises 10% or more, as compared to the base_price, in 10 days after the occurrence of this pattern, increase Pup[c1][c2][c3] by 1;
6. If the stock's price falls 10% or more, as compared to the base_price, in 10 days after the occurrence of this pattern, increase Pdown[c1][c2][c3] by 1.
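A compact version of this counting pass might look as follows. This is our own sketch: encode_rpm and prices_after are hypothetical helpers standing in for the RPM encoding and the price lookup described above.

```python
from collections import defaultdict

def count_patterns(oscillation_windows, prices_after, encode_rpm,
                   horizon=10, threshold=0.10):
    """Count 3-day K-Line patterns that precede a >=10% rise or fall within 10 days."""
    occurrences = defaultdict(int)   # pattern_occurrence[(c1, c2, c3)]
    p_up = defaultdict(int)          # Pup[(c1, c2, c3)]
    p_down = defaultdict(int)        # Pdown[(c1, c2, c3)]

    for window in oscillation_windows:           # each window = three consecutive days
        code = encode_rpm(window)                # (c1, c2, c3) bit codes for the 3 days
        occurrences[code] += 1
        base_price = window[-1].close            # the 3rd day's closing price
        future = prices_after(window, horizon)   # closing prices of the next 10 days
        if any(p >= base_price * (1 + threshold) for p in future):
            p_up[code] += 1
        if any(p <= base_price * (1 - threshold) for p in future):
            p_down[code] += 1
    return occurrences, p_up, p_down
```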
We used the daily trading data from January 1, 1994, through December 31, 1998, of the 82 stocks shown in Table 1 as the base data set to mine for the price-up and price-down patterns. After applying the above search algorithm on the base data set, the Pup and Pdown arrays contained the counts of all the patterns that led the price to rise or fall by 10% in 10 days. In total, the up-patterns occurred 1,377 times, among which there were 870 different types of up-patterns; the down-patterns occurred 1,001 times, among which there were 698 different types of down-patterns.

Table 1. 82 selected stocks
ADBE  BA    CDN   F     KO    MWY   S     WAG
ADSK  BAANF CEA   FON   LGTO  NETG  SAPE  WCOM
ADVS  BEAS  CHKP  GATE  LU    NKE   SCOC  WMT
AGE   BEL   CLGY  GE    MACR  NOVL  SNPS  XOM
AIT   BTY   CNET  GM    MERQ  ORCL  SUNW  YHOO
AMZN  BVEW  CSCO  HYSL  MO    PRGN  SYBS
AOL   CA    DD    IBM   MOB   PSDI  SYMC
ARDT  CAL   DELL  IDXC  MOT   PSFT  T
AVNT  CBS   DIS   IFMX  MRK   RATL  TSFW
AVTC  CBTSY EIDSY INTU  MSFT  RMDY  TWX
AWRE  CCRD  ERTS  ITWO  MUSE  RNWK  VRSN

A heuristic, stated below, was applied to all found
patterns to reduce the ambiguity of the patterns. Using the price-up patterns as an example, for a pattern to be labeled as a price-up pattern, we require the number of times it appears in Pup to be at least twice the number of times it appears in Pdown. All patterns labeled as price-up patterns were then sorted by the ratio of the square root of a pattern's total occurrences plus its occurrences as a price-up pattern over its occurrences as a price-down pattern:

For a price-up pattern:
Preference = (√PO + Pup) · Pup / Pdown,   if Pup / Pdown > 2
Preference = −(√PO + Pup) · Pup / Pdown,  if Pup / Pdown ≤ 2

For a price-down pattern:
Preference = (√PO + Pdown) · Pdown / Pup,   if Pdown / Pup > 2
Preference = −(√PO + Pdown) · Pdown / Pup,  if Pdown / Pup ≤ 2

where PO denotes the pattern's total number of occurrences, Pup its occurrences as a price-up pattern, and Pdown its occurrences as a price-down pattern.
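Read this way, the heuristic can be computed directly. The sketch below assumes the counts PO, Pup, and Pdown gathered by the search algorithm and simply evaluates the piecewise formula; the zero-denominator guard is our own addition, not part of the original description.

```python
import math

def preference(po: int, p_up: int, p_down: int, up_pattern: bool = True) -> float:
    """Preference score of a candidate pattern; positive scores mark winning patterns."""
    hits, misses = (p_up, p_down) if up_pattern else (p_down, p_up)
    if misses == 0:
        misses = 1                      # guard against division by zero (our assumption)
    score = (math.sqrt(po) + hits) * hits / misses
    return score if hits / misses > 2 else -score

# Example from Table 2: Up[00][20][91] with PO=46, Pup=15, Pdown=4.
print(round(preference(46, 15, 4), 2))   # -> 81.68
```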
The final winning patterns with positive Preference score are listed in Table 2.
Table 2. Final winning patterns sorted by preference
Pattern Code        PO   Pup   Pdown   Preference
Up[00][20][91]      46   15     4        81.68
Up[01][28][68]      17    7     1        77.86
Up[07][08][88]      11    7     1        72.22
Up[00][24][88]      10    7     1        71.14
Up[00][30][8E]       9    7     1        70.00
Up[01][19][50]      28   12     3        69.17
Up[00][30][90]      39   21     9        63.57
Up[00][31][81]      26    8     2        52.40
Up[00][20][51]      18    8     2        48.97
Up[01][19][60]      24    9     3        41.70
Down[01][1D][71]    10    0     6        66.00
Down[00][11][71]    17    1     6        60.74
Down[01][19][79]    35    3    10        53.05
Down[00][20][67]    18    2     5        23.11

Performance Evaluation
To evaluate the performance of the found winning patterns listed in Table 2, we applied them to the prices of the same 82 stocks for the period from January 1, 1999,
through December 31, 1999. A stop-loss of 5% was set to reduce the risk imposed by a wrong signal; this is a common practice in the investment industry. If a "buy" signal is generated, we buy that stock and hold it. The stock is sold when it reaches the 10% profit target, when the 10-day holding period expires, or when its price goes down 5%. The same rules were applied to the "sell" signals, but in the opposite direction. Table 3 shows the number of "buy" and "short sell" signals generated by these patterns. As seen from Table 3, the price-up winning patterns worked very well: 42.86% of the predictions were perfectly correct, and a further 20 of the 84 "buy" signals secured a 6.7% gain after the signal. If we regard a 5% increase also as making money, then in total there was a 70.24% chance of winning money and an 85.71% chance of not losing money. The price-down patterns did not work as well as the price-up patterns, probably because there were not as many down trends as up trends in the U.S. stock market in 1999. Still, by following the "sell" signals, there was a 43.75% chance of gaining money and an 87.5% chance of not losing money in 1999. The final return for the year 1999 was 153.8%, which was superior compared to the 84% return of the Nasdaq Composite and the 25% return of the Dow Jones Industrial Average.
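The exit rule just described (10% profit target, 10-day holding period, 5% stop-loss) can be written as a small helper. This is our own sketch with hypothetical function and parameter names, not the authors' back-testing code.

```python
def exit_long(entry_price: float, next_closes: list,
              target: float = 0.10, stop: float = 0.05, max_days: int = 10) -> float:
    """Realized return of a 'buy' signal under the 10%-target / 5%-stop / 10-day rule."""
    window = next_closes[:max_days]
    for price in window:
        ret = price / entry_price - 1.0
        if ret >= target:          # 10% profit target reached: take the profit
            return ret
        if ret <= -stop:           # 5% stop-loss triggered: cut the loss
            return ret
    return window[-1] / entry_price - 1.0   # otherwise exit at the end of the holding period

print(round(exit_long(100.0, [101, 103, 96, 94, 98]), 3))   # -> -0.06 (stop-loss on day 4)
```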
Table 3. The performance of the chosen winning patterns

Total "buy" signals: 84                                   Times   Accumulated percentage
Price is up at least 10% after the signal                   36         42.86%
Price is up 2/3 of 10% (i.e., 6.7%) after the signal        20         66.67%
Price is up 1/2 of 10% (i.e., 5%) after the signal           3         70.24%
Price is up only 1/10 of 10% after the signal               13         85.71%
Price drops after the signal                                12        100.00%

Total "sell" signals: 16                                  Times   Accumulated percentage
Price is down at least 10% after the signal                  4         25.00%
Price is down 2/3 of 10% (i.e., 6.7%) after the signal       2         37.50%
Price is down 1/2 of 10% (i.e., 5%) after the signal         1         43.75%
Price is down only 1/10 of 10% after the signal              7         87.50%
Price rises after the signal                                 2        100.00%

FUTURE TRENDS
Being able to identify price-rise or price-drop patterns can be exciting for frequent stock traders. By following the "buy" or "sell" signals generated by these patterns, frequent stock traders can earn excess returns over the simple "buy-and-hold" strategy (Allen & Karjalainen, 1999; Lo & MacKinlay, 1999). Data mining techniques combined with financial theories can be a powerful approach for discovering price movement patterns in the financial market. Unfortunately, researchers in the data mining field often focus exclusively on the computational part of market analysis, paying little attention to the theories of the target area. In addition, the knowledge representation methods and variables chosen are often based on common sense rather than on theory. This article borrows the market efficiency theory to model the problem, and the out-of-sample performance was quite pleasing. We believe there will be more studies integrating theories from multiple disciplines to achieve better results in the near future.

CONCLUSION
This paper combines a knowledge discovery technique with a financial theory, the market efficiency theory, to solve a classic problem in stock market analysis: finding stock trading patterns that lead to superior financial gains. This study is one of the few efforts that span multiple disciplines to study the stock market, and the results were quite good. There are also some future research opportunities in this direction. For example, the trading volume is not considered in this research, although it is an important factor in the stock market, and we believe it is worth further investigation. Using K-Line patterns of four or more days, instead of just 3-day K-Line patterns, is also worth exploring.
REFERENCES
Allen, F., & Karjalainen, R. (1999). Using genetic algorithms to find technical trading rules. Journal of Financial Economics, 51, 245-271.
Armano, G., Murru, A., & Roli, F. (2002). Stock market prediction by a mixture of genetic-neural experts. International Journal of Pattern Recognition and Artificial Intelligence, 16, 501-526.
Chenoweth, T., Obradovic, Z., & Stephenlee, S. (1996). Embedding technical analysis into neural network based trading systems. Applied Artificial Intelligence, 10, 523-541.
Dawson, E.R., & Steeley, J.M. (2003). On the existence of visual technical patterns in the UK stock market. Journal of Business Finance and Accounting, 20, 263-293.
Dourra, H., & Siy, P. (2001). Stock evaluation using fuzzy logic. International Journal of Theoretical and Applied Finance, 4, 585-602.
Fama, E.F. (1991). Efficient capital markets: II. Journal of Finance, 46, 1575-1617.
Liu, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers.
Lo, A.W., & MacKinlay, A.C. (1999). A non-random walk down Wall Street. Princeton, NJ: Princeton University Press.
Lo, A.W., Mamaysky, H., & Wang, J. (2000). Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. The Journal of Finance, 55, 1705-1765.
Pring, M.J. (1991). Technical analysis explained: The successful investor's guide to spotting investment trends and turning points (3rd ed.). McGraw-Hill Inc.
Sharpe, W.F., Alexander, G.J., & Bailey, J.V. (1999). Investments (6th ed.). Prentice-Hall.
Siberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8, 970-974.
Thawornwong, S., Enke, D., & Dagli, C. (2003). Neural networks as a decision maker for stock trading: A technical analysis approach. International Journal of Smart Engineering System Design, 5, 313-325.
Tsaih, R., Hsu, Y., & Lai, C.C. (1998). Forecasting S&P 500 stock index futures with a hybrid AI system. Decision Support Systems, 23, 161-174.
Zaki, M.J., Parthasatathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (pp. 283-286).

KEY TERMS
Buy-And-Hold Strategy: An investment strategy for buying portfolios of stocks or mutual funds with solid, long-term growth potential. The underlying value and stability of the investments are important, rather than the short- or medium-term volatility of the market.
Fuzzy Logic: Fuzzy logic provides an approach to approximate reasoning in which the rules of inference are approximate rather than exact. Fuzzy logic is useful in manipulating information that is incomplete, imprecise, or unreliable.
Genetic Algorithm: A genetic algorithm is an optimization algorithm based on the mechanisms of Darwinian evolution, which uses random mutation, crossover, and selection procedures to breed better models or solutions from an originally random starting population or sample.
K-Line: An Asian version of the stock price bar chart, in which a period that closes lower than it opens is shaded dark and a period that closes higher than it opens is shaded light.
Market Efficiency Theory: A financial theory that states that stock market prices reflect all available, relevant information.
Neural Networks: Neural networks are algorithms simulating the functioning of human neurons and may be used for pattern recognition problems, for example, to establish a quantitative structure-activity relationship.
Mining for Web-Enabled E-Business Applications
Richi Nayak
Queensland University of Technology, Australia
INTRODUCTION
A small shop owner builds a relationship with his or her customers by observing their needs, preferences, and buying behaviour. A Web-enabled e-business would like to accomplish something similar. It is easy for the small shop owner to serve his customers better in the future by learning from past interactions, but this may not be easy for Web-enabled e-businesses, where most customers may never interact personally and the number of customers is much higher than that of the small shop owner. Data mining techniques can be applied to understand and analyse e-business data and turn it into actionable information that can support a Web-enabled e-business in improving its marketing, sales, and customer support operations. This is all the more appealing when data is produced and stored with advanced electronic data interchange methods, computing power is affordable, the competitive pressure among businesses is strong, and efficient, commercial data mining tools are available for data analysis.
BACKGROUND
Data mining is the process of searching for trends, clusters, valuable links, and anomalies in the entire data. The process benefits from the availability of large amounts of data with rich descriptions. Rich descriptions of data, such as wide customer records with many potentially useful fields, allow data mining algorithms to search beyond obvious correlations. Examples of data mining in Web-enabled e-business applications are the generation of user profiles, enabling customer relationship management, and targeting Web advertising based on user access patterns extracted from Web data. With the use of data mining techniques, e-business companies can improve the sales and quality of their products by anticipating problems before they occur. When dealing with Web-enabled e-business data, a data mining task is decomposed into many subtasks (Figure 1). The discovered knowledge is presented to the user in an understandable and useable form. The analysis may reveal how a Web site is useful in making decisions for a user, resulting in improvements to the Web site. The analysis may also lead to business strategies for acquiring new customers and retaining existing ones.
DATA MINING OPPORTUNITIES
Data obtained from Web-enabled e-business transactions can be categorised into (1) primary data, which includes the actual Web contents, and (2) secondary data, which includes Web server access logs, proxy server logs, browser logs, registration data if any, user sessions and queries, cookies, etc. (Cooley, 2003; Kosala & Blockeel, 2000). The goal of mining the primary Web data is to effectively interpret the searched Web documents. Web search engines discover resources on the Web but have many problems, such as (1) the abundance problem, where hundreds of irrelevant results are returned in response to a search query; (2) the limited coverage problem, where only a few sites are searched for the query instead of the entire Web; (3) the limited query interface, where the user can interact only by providing a few keywords; and (4) limited customization to individual users (Garofalakis, Rastogi, Seshadri, & Hyuseok, 1999). Mining of Web contents can assist e-businesses in improving the orga-
Figure 1. A mining process for Web-enabled e-business data: data gathering (locating and retrieving Web documents and Web access logs), data processing (data selection, data quality check, data transformation, and data distribution), data modelling (data model learning and best model selection), information retrieval (information extraction), information analysis and knowledge assimilation, and the user interface
M
nization of retrieved result and increasing the precision of information retrieval. Some of the data mining applications appropriate for such type of data are: •
•
•
•
•
Trend prediction within the retrieved information to indicate future values. For example, an e-auction company provides information about items to auction, previous auction details, etc. Predictive modelling can analyse the existing information, and as a result estimate the values for auctioneer items or number of people participating in future auctions. Text clustering within the retrieved information. For example structured relations can be extracted from unstructured text collections by finding the structure of Web documents, and present a hierarchical structure to represent the relation among text data in Web documents (Wong & Fu, 2000). Monitoring a competitor’s Web site to find unexpected information e.g. offering unexpected services and products. Because of the large number of competitor’s Web sites and huge information in them, automatic discovery is required. For instance, association rule mining can discover frequent word combination in a page that will lead a company to learn about competitors (Liu, Ma, & Yu, 2001). Categorization of Web pages by discovering similarity and relationships among various Web sites using clustering or classification techniques. This will lead into effectively searching the Web for the requested Web documents within the categories rather than the entire Web. Cluster hierarchies of hypertext documents can be created by analysing semantic information embedded in link structures and document contents (Kosala & Blockeel, 2000). Documents can also be given classification codes according to keywords present in them. Providing a higher level of organization for semistructured or unstructured data available on the Web. Users do not scan the entire Web site to find the required information, instead they use Web query languages to search within the document or to obtain structural information about Web documents. A Web query language restructures extracted information from Web information sources that are heterogenous and semi-structured (Abiteboul, Buneman, & Suciu, 2000). An agent based approach involving artificial intelligent systems can also organize Web based information (Dignum & Cortes, 2001).
The goal of mining the secondary Web data is to capture the buying and traversing habits of customers in an e-business environment. Secondary Web data includes Web transaction data extracted from Web logs.
Some of the data mining applications appropriate for such type of data are: •
·
•
Promoting cross-marketing strategies across products. Data mining techniques can analyse logs of different sales indicating customer’s buying patterns (Cooley, 2003). Classification and clustering of Web access log can help a company to target their marketing (advertising) strategies to a certain group of customers. For example, classification rule mining is able to discover that a certain age group of people from a certain locality are likely to buy a certain group of products. Web enabled e-business can also be benefited with link analysis for repeat buying recommendations. Schulz, Hahsler, & Jahn (1999) applied link analysis in traditional retail chains, and found that 70% cross-selling potential exists. Associative rule mining can find frequent products bought together. For example, association rule mining can discover rules such as “75% customers who place an order for product1 from the /company/ product1/ page also place the order for product2 from the /company/product2/ page”. Maintaining or restructuring Web sites to better serve the needs of customers. Data mining techniques can assist in Web navigation by discovering authority sites of a user’s interest, and overview sites for those authority sites. For instance, association rule mining can discover correlation between documents in a Web site and thus estimate the probability of documents being requested together (Lan, Bressan, & Ooi, 1999). An example association rule resulting from the analysis of a travelling e-business company Web data is: “79% of visitors who browsed pages about Hotel also browsed pages on visitor information: places to visit”. This rule can be used in redesigning the Web site by directly linking the authority and overview Web sites. Personalization of Web sites according to each individual’s taste. Data mining techniques can assist in facilitating the development and execution of marketing strategies such as dynamically changing a particular Web site for a visitor (Mobasher, Cooley, & Srivastave, 1999). This is achieved by building a model representing correlation of Web pages and users. The goal is to find groups of users performing similar activities. The built model is capable of categorizing Web pages and users, and matching between and across Web pages and/or users (Mobasher, et al, 1999). According to the clusters of user profiles, recommendations can be made to a visitor on return visit or to new visitors (Spiliopoulou,
Mining for Web-Enabled E-Business Applications
Pohle, & Faulstich, 1999). For example, people accessing educational products in a company Web site between 6-8 p.m. on Friday can be considered as academics and can be focused accordingly.
DIFFICULTIES IN APPLYING DATA MINING The idea of discovering knowledge in large amounts of data with rich description is both appealing and intuitive, but technically it is challenging. There should be strategies implemented for better analysis of data collected from Web-enabled e-business sources. •
•
•
Data Format: Data collected from Web-enabled ebusiness sources is semi-structured and hierarchical. Data has no absolute schema fixed in advance and the extracted structure may be irregular or incomplete. This type of data requires an additional processing before applying to traditional mining algorithms whose source is mostly confined to structured data. This pre-processing includes transforming unstructured data to a format suitable for traditional mining methods. Web query languages can be used to obtain structural information from semi-structured data. Based on this structural information, data appropriate to mining techniques are generated. Web query languages that combine path expressions with an SQL-style syntax such as Lorel or UnQL (Abiteboul, et al, 2000) are a good choice for extracting structural information. Data Volume: Collected e-business data sets are large in volume. The mining techniques should be able to handle such large data sets. Enumeration of all patterns may be expensive and unnecessary. In spite, selection of representative patterns that capture the essence of the entire data set and their use for mining may prove a more effective approach. But then selection of such data set becomes a problem. A more efficient approach would be to use an iterative and interactive technique that takes account into real time responses and feedback into calculation. An interactive process involves human analyst in the process, so an instant feedback can be included in the process. An iterative process first considers a selected number of attributes chosen by the user for analysis, and then keeps adding other attributes for analysis until the user is satisfied. This iterative method reduces the search space significantly. Data Quality: Web server logs may not contain all the data needed. Also, noisy and corrupt data can hide
•
•
patterns and make predictions harder (Kohavi, 2001). Nevertheless, the quality of data is increased with the use of electronic interchange. There is less noise present in the data due to electronic storage and processing in comparison to manual processing of data. Data warehousing provides a capability for the good quality data storage. A warehouse integrates data from operational systems, e-business applications, and demographic data providers, and handles issues such as data inconsistency, missing values, etc. A Web warehouse may be used as data source. There has been some initiative to warehouse the Web data generated from e-business applications, but still long way to go in terms of data mining (Bhowmick, Madria, & Ng, 2003). Another solution of collecting the good quality Web data is the use of (1) a dedicated server recording all activities of each user individually, or (2) cookies or scripts in the absence of such server (Chan, 1999; Kohavi, 2001). The agent based approaches that involve artificial intelligence systems can also be used to discover such Web based information. Data Adaptability: Data on the Web is ever changing. Data mining models and algorithms should be adapted to deal with real-time data such that the new data is incorporated for analysis. The constructed data model should be updated as the new data approaches. User-interface agents can be used to maximize the productivity of current users’ interactions with the system by adapting behaviours. Another solution can be to dynamically modifying mined information as the database changes (Cheung & Lee, 2000) or to incorporate user feedback to modify the actions performed by the system. XML Data: It is assumed that in few years XML will be the most highly used language of Internet in exchanging information. Assuming the metadata stored in XML, the integration of the two disparate data sources becomes much more transparent, field names are matched more easily and semantic conflicts are described explicitly (Abiteboul et al., 2000). As a result, the types of data input to and output from the learned models and the detailed form of the models are determined. XML documents may not completely be in the same format thus resulting in missing values when integrated. Various techniques e.g., tag recognition can be used to fill missing information created from the mismatch in attributes or tags (Abiteboul et al., 2000). Moreover, many query languages such as XML-QL, XSL and XML-GL 787
M
Mining for Web-Enabled E-Business Applications
•
(Abiteboul et al., 2000) are designed specifically for querying XML and getting structured information from these documents. Privacy Issues: There are always some concerns of proper balancing between company’s desire to use personal information versus individual’s desire to protect it (Piastesky-Shapiro, 2000). The possible solution is to (1) ensure users of secure and reliable data transfer by using high speed high-valued data encryption procedures, and/ or (2) give a choice to user to reveal the information that he wants to and give some benefit in exchange of revealing their information such as discount on certain shopping product etc.
FUTURE TRENDS
Earlier data mining tools such as C5 (http://www.rulequest.com) and several neural network software packages (QuickLearn, Sompack, etc.) were limited to individual researchers, and each of these algorithms is capable of solving only a single data mining task. The second-generation data mining systems produced by commercial companies, such as Clementine (http://www.spss.com/clementine/), AnswerTree (http://www.spss.com/answertree/), SAS (http://www.sas.com/), IBM Intelligent Miner (http://www.ibm.com/software/data/iminer/), and DBMiner (http://db.cs.sfu.ca/DBMiner), incorporate multiple discovery tasks (classification, clustering, etc.), preprocessing (data cleaning, transformation, etc.), and postprocessing (visualization) tasks, and are becoming known to the public and successful. Moreover, tools that combine ad hoc query or OLAP (online analytical processing) with data mining have also been developed (Wu, 2000). Faster CPUs, bigger disks, and easy network connectivity make these tools able to analyse large volumes of data.
CONCLUSION
It is easy to collect data from Web-enabled e-business sources, as visitors to a Web site leave a trail that is automatically stored in log files by the Web server. Data mining tools can process and analyse such Web server log files or Web contents to discover meaningful information. This analysis uncovers the previously unknown buying habits of a company's online customers. More importantly, the fast feedback that companies obtain using data mining is very helpful in increasing their benefit.
REFERENCES
Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. California: Morgan Kaufmann.
Bhowmick, S.S., Madria, S.K., & Ng, W.K. (2003). Web data management: A warehouse approach. Springer Computing & Information Science.
Chan, P.K. (1999). A non-invasive learning approach to building Web user profiles. In Masand & Spiliopoulou (Eds.), WEBKDD'99.
Cheung, D.W., & Lee, S.D. (2000). Maintenance of discovered association rules. In Knowledge discovery for business information systems (The Kluwer International Series in Engineering and Computer Science, 600). Boston: Kluwer Academic Publishers.
Cooley, R. (2003, May). The use of Web structure and content to identify subjectively interesting Web usage patterns. ACM Transactions on Internet Technology, 3(2).
Dignum, F., & Cortes, U. (Eds.). (2001). Agent-mediated electronic commerce III: Current issues in agent-based electronic commerce systems. Lecture Notes in Artificial Intelligence, Springer Verlag.
Garofalakis, M.N., Rastogi, R., Seshadri, S., & Hyuseok, S. (1999). Data mining and the Web: Past, present and future. In Proceedings of the Second International Workshop on Web Information and Data Management (pp. 43-47).
Kohavi, R. (2001). Mining e-commerce data: The good, the bad and the ugly. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001).
Kosala, R., & Blockeel, H. (2000, July). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15.
Lan, B., Bressan, S., & Ooi, B.C. (1999). Making Web servers pushier. In Masand & Spiliopoulou (Eds.), WEBKDD'99.
Liu, B., Ma, Y., & Yu, P.H. (2001, August). Discovering unexpected information from your competitor's Web sites. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2001), San Francisco, USA.
Masand, B., & Spiliopoulou, M. (1999, August). KDD'99 workshop on Web usage analysis and user profiling (WEBKDD'99), San Diego, CA. ACM.
Mobasher, B., Cooley, R., & Srivastave, J. (1999). Automatic personalization based on Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD'99.
Piastesky-Shapiro, G. (2000, January). Knowledge discovery in databases: 10 years after. SIGKDD Explorations, 1(2), 59-61, ACM SIGKDD.
Schulz, A.G., Hahsler, M., & Jahn, M. (1999). A customer purchase incidence model applied to recommendation service. In Masand & Spiliopoulou (Eds.), WEBKDD'99.
Spiliopoulou, M., Pohle, C., & Faulstich, L.C. (1999). Improving the effectiveness of a Web site with Web usage mining. In Masand & Spiliopoulou (Eds.), WEBKDD'99.
Wong, W.C., & Fu, A.W. (2000, July). Finding structure and characteristics of Web documents for classification. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, ACM.
Wu, J. (2000, August). Business intelligence: What is data mining? In Data Mining Review Online.
KEY TERMS
Clustering Data Mining Task: To identify items with similar characteristics, thus creating a hierarchy of classes from the existing set of events. A data set is partitioned into segments of (homogeneous) elements that share a number of properties.
Data Mining (DM) or Knowledge Discovery in Databases: The extraction of interesting, meaningful, implicit, previously unknown, valid, and actionable information from a pool of data sources.
Link Analysis Data Mining Task: Establishes internal relationships to reveal hidden affinity among items in a given data set. Link analysis exposes samples and trends by predicting correlations of items that are otherwise not obvious.
Mining of Primary Web Data: Assists in effectively interpreting the searched Web documents. The output of this mining process can help e-business customers improve the organization of retrieved results and increase the precision of information retrieval.
Mining of Secondary Web Data: Assists in capturing the buying and traversing habits of customers in an e-business environment. The output of this mining process can help an e-business predict future customer behaviour, personalize Web sites, and promote campaigns through cross-marketing strategies across products.
Predictive Modelling Data Mining Task: Makes predictions based on essential characteristics of the data. The classification task of data mining builds a model to map (or classify) a data item into one of several predefined classes. The regression task of data mining builds a model to map a data item to a real-valued prediction variable.
Primary Web Data: Includes actual Web contents.
Secondary Web Data: Includes Web transaction data extracted from Web logs. Examples are Web server access logs, proxy server logs, browser logs, registration data if any, user sessions, user queries, cookies, product correlation, and feedback from the customer companies.
Web-Enabled E-Business: A business transaction or interaction in which participants operate or transact business or conduct their trade electronically on the Web.
Mining Frequent Patterns via Pattern Decomposition
Qinghua Zou
University of California - Los Angeles, USA
Wesley Chu
University of California - Los Angeles, USA
INTRODUCTION Pattern decomposition is a data-mining technology that uses known frequent or infrequent patterns to decompose a long itemset into many short ones. It finds frequent patterns in a dataset in a bottom-up fashion and reduces the size of the dataset in each step. The algorithm avoids the process of candidate set generation and decreases the time for counting supports due to the reduced dataset.
BACKGROUND
A fundamental problem in data mining is the process of finding frequent itemsets (FI) in a large dataset. Frequent itemsets enable essential data-mining tasks, such as discovering association rules, mining data correlations, and mining sequential patterns. Three main classes of algorithms have been proposed:
• Candidate Generation and Test (Agrawal & Srikant, 1994; Heikki, Toivonen & Verkamo, 1994; Zaki et al., 1997): Starting at k=0, it first generates candidate k+1 itemsets from known frequent k itemsets and then counts the supports of the candidates to determine the frequent k+1 itemsets that meet a minimum support requirement (a brief illustrative sketch is given below).
• Sampling Technique (Toivonen, 1996): Uses a sampling method to select a random subset of a dataset for generating candidate itemsets and then tests these candidates to identify frequent patterns. In general, the accuracy of this approach is highly dependent on the characteristics of the dataset and the sampling technique that has been used.
• Data Transformation: Transforms an original dataset to a new one that contains a smaller search space than the original dataset. FP-tree-based mining (Han, Pei & Yin, 2000) first builds a compressed data representation from a dataset, and mining tasks are then performed on the FP-tree rather than on the dataset. It has performance improvements over Apriori (Agrawal & Srikant, 1994), since infrequent items do not appear on the FP-tree and, thus, the FP-tree has a smaller search space than the original dataset. However, the FP-tree cannot reduce the search space further by using infrequent 2-item or longer itemsets.
What distinguishes pattern decomposition (Zou et al., 2002) from most previous works is that it reduces the search space of a dataset in each step of its mining process.
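As a brief illustration of the candidate generation and test idea named in the first class above, the following sketch joins frequent k-itemsets into (k+1)-candidates, prunes them, and keeps those meeting a minimum support. It is a simplified, Apriori-style fragment written for this article, not code from the cited systems.

```python
from itertools import combinations

def generate_and_test(frequent_k, transactions, min_support):
    """One level of candidate generation and test over sorted-tuple itemsets."""
    # Join step: two frequent k-itemsets sharing their first k-1 items form a candidate.
    candidates = {
        a[:-1] + tuple(sorted((a[-1], b[-1])))
        for a in frequent_k for b in frequent_k
        if a[:-1] == b[:-1] and a[-1] < b[-1]
    }
    # Prune step: every k-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(s in frequent_k for s in combinations(c, len(c) - 1))}
    # Test step: count supports against the transactions.
    counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
    return {c for c, n in counts.items() if n >= min_support}

transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"b", "c"}]
frequent_2 = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")}
print(generate_and_test(frequent_2, transactions, min_support=1))   # -> {('a', 'b', 'c')}
```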
MAIN THRUST Both the technology and application will be discussed to help clarify the meaning of pattern decomposition.
Search Space Definition Let N=X:Y be a transaction where X, called the head of N, is the set of required items, and Y, called the tail of N, is the set of optional items. The set of possible subsets of Y is called the power set of Y, denoted by P(Y).
Definition 1 For N=X:Y, the set of all the itemsets obtained by concatenating X with the itemsets in P(Y) is called the search space of N, denoted as {X:Y}. That is, { X : Y } = { X ∪ V | V ∈ P(Y )}.
For example, the search space {b:cd} includes four itemsets b, bc, bd, and bcd. The search space {:abcde} includes all subsets of abcde.
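A tiny enumeration makes the definition concrete. The helper below is our own illustration, not from the article; it lists the itemsets of a search space {X:Y} by concatenating the head with every subset of the tail.

```python
from itertools import chain, combinations

def search_space(head: str, tail: str):
    """All itemsets X ∪ V with V ranging over the power set of the tail Y."""
    subsets = chain.from_iterable(combinations(tail, r) for r in range(len(tail) + 1))
    return ["".join(sorted(set(head) | set(v))) for v in subsets]

print(search_space("b", "cd"))    # -> ['b', 'bc', 'bd', 'bcd']
```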
By Definition 1, we have {X:Y}={X:Z}, where Z=Y−X refers to the set of items contained in Y but not in X. Thus, we will assume that Y does not contain any item in X when {X:Y} is mentioned in this article.
Definition 2. Let S, S1, and S2 be search spaces. The set {S1, S2} is a partition of S if and only if S = S1 ∪ S2 and S1 ∩ S2 = φ. The relationship is denoted by S = S1 + S2, or S1 = S - S2, or S2 = S - S1. We say S is partitioned into S1 and S2. Similarly, a set {S1, S2, …, Sk} is a partition of S if and only if S = S1 ∪ S2 ∪ … ∪ Sk and Si ∩ Sj = φ for i,j ∈ [1..k] and i ≠ j. We denote it as S = S1 + S2 + … + Sk. Let a be an item, where aX is the itemset obtained by concatenating a with X.

Theorem 1. For a ∉ X,Y, the search space {X:aY} can be partitioned into {Xa:Y} and {X:Y} by item a (i.e., {X:aY} = {Xa:Y} + {X:Y}).

Proof. It follows from the fact that each itemset of {X:aY} either contains a (i.e., {Xa:Y}) or does not contain a (i.e., {X:Y}). For example, we have {b:cd} = {bc:d} + {b:d}.

Theorem 2 (Partition Search Space). Let a_1, a_2, …, a_k be distinct items and a_1a_2…a_kY be an itemset; the search space {X:a_1a_2…a_kY} can be partitioned into

{X:a_1a_2…a_kY} = ∑_{i=1..k} {Xa_i : a_{i+1}…a_kY} + {X:Y}, where a_i ∉ X,Y.

Proof. It follows by partitioning the search space via items a_1, a_2, …, a_k sequentially as in Theorem 1. For example, we have {b:cd} = {bc:d} + {bd:} + {b:} and {a:bcde} = {ab:cde} + {ac:de} + {a:de}.

Let {X:Y} be a search space and Z be a known frequent itemset. Since Z is frequent, all subsets of Z will be frequent (i.e., every itemset of {:Z} is frequent). Theorem 3 shows how to prune the space {X:Y} by Z.

Theorem 3 (Pruning Search Space). If Z does not contain the head X, the space {X:Y} cannot be pruned by Z (i.e., {X:Y} - {:Z} = {X:Y}). Otherwise, the space can be pruned as

{X:Y} - {:Z} = ∑_{i=1..k} {Xa_i : a_{i+1}…a_k(Y ∩ Z)}, where a_1a_2…a_k = Y - Z.

Proof. If Z does not contain X, no itemset in {X:Y} is subsumed by Z. Therefore, knowing that Z is frequent, we cannot prune any part of the search space {X:Y}. Otherwise, when X is a subset of Z, we have

{X:Y} = ∑_{i=1..k} {Xa_i : a_{i+1}…a_kV} + {X:V}, where V = Y ∩ Z.

The head in the first part is Xa_i, where a_i is a member of Y - Z. Since Z does not contain a_i, the first part cannot be pruned by Z. For the second part, we have {X:V} - {:Z} = {X:V} - {X:(Z-X)}. Since X ∩ Y = φ, we have V ⊆ Z - X. Therefore, {X:V} can be pruned away entirely. For example, we have {:bcde} - {:abcd} = {:bcde} - {:bcd} = {e:bcd}. Here, a is irrelevant and is removed in the first step. Another example is {e:bcd} - {:abe} = {e:bcd} - {:be} = {e:bcd} - {e:b} = {ec:bd} + {ed:b}.

Pattern Decomposition

•	Given a known frequent itemset Z, we are able to decompose the search space of a transaction N=X:Y to N'=Z:Y', if X is a subset of Z, where Y' is the set of items that appear in Y but not in Z, denoted by PD(N=X:Y|Z) = Z:Y'. For example, if we know that an itemset abc is frequent, we can decompose a transaction N=a:bcd into N'=abc:d; that is, PD(a:bcd|abc) = abc:d.
•	Given a known infrequent itemset Z, we also can decompose the search space of a transaction N=X:Y. For simplicity, we use three examples to show the decomposition by known infrequent itemsets and leave out the formal mathematical formula for general cases. Interested readers can refer to Zou, Chu, and Lu (2002) for details. For example, if N=d:abcef and Z is a known infrequent itemset, then we have:
•	For infrequent 1-itemset ~a, PD(d:abcef|~a) = d:bcef, by dropping a from its tail.
•	For infrequent 2-itemset ~ab, PD(d:abcef|~ab) = d:bcef + da:cef, by excluding ab.
•	For infrequent 3-itemset ~abc, PD(d:abcef|~abc) = d:bcef + da:cef + dab:ef, by excluding abc.
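These decomposition rules translate directly into set manipulations. The Python sketch below is a rough illustration only: pd_frequent follows the PD(X:Y|Z) definition for a frequent Z, while pd_infrequent generalizes the three d:abcef examples; since the article omits the formal rule for infrequent itemsets, that generalization is my assumption.

```python
# A rough sketch (my own generalisation of the examples above, not the authors'
# formal definition).  A transaction N = X:Y is modelled as a (head, tail) pair
# of frozensets.
def pd_frequent(head, tail, z):
    """PD(X:Y | Z) for a known frequent itemset Z: if X is a subset of Z, grow
    the head to Z and drop Z's items from the tail; otherwise leave N unchanged."""
    if head <= z:
        return [(frozenset(z), tail - z)]
    return [(head, tail)]

def pd_infrequent(head, tail, w):
    """Decompose X:Y by a known infrequent itemset W contained in the tail
    (pattern inferred from the d:abcef examples): the i-th part keeps the first
    i-1 items of W in the head and removes the first i items of W from the tail."""
    w = sorted(w)
    parts = []
    for i in range(len(w)):
        new_head = head | frozenset(w[:i])
        new_tail = tail - frozenset(w[:i + 1])
        parts.append((new_head, new_tail))
    return parts

print(pd_frequent(frozenset("a"), frozenset("bcd"), frozenset("abc")))   # abc:d
print(pd_infrequent(frozenset("d"), frozenset("abcef"), frozenset("abc")))
# d:bcef, da:cef, dab:ef -- matching PD(d:abcef|~abc)
```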
By decomposing a transaction t, we reduce the number of items in its tail and thus reduce its search space. For example, the search space of a:bcd contains the following eight itemsets {a, ab, ac, ad, abc, abd, acd, abcd}. Its decomposition result, abc:d, contains only two itemsets {abc, abcd}, which is only 25% of its original search space. When using pattern decomposition, we find frequent patterns in a step-wise fashion starting at step 1 for 1-item itemsets. At step k, it first counts the support of every possible k-item itemset contained in the dataset Dk to find the frequent k-item itemsets Lk and the infrequent k-item itemsets ~Lk. Then, using Lk and ~Lk, Dk is decomposed into Dk+1, which has a smaller search space than Dk. These steps continue until the search space Dk becomes empty.
An Application

The motivation of our work originates from the problem of finding multi-word combinations in a group of medical report documents, where sentences can be viewed as transactions and words can be viewed as items. The problem is to find all multi-word combinations that occur in at least two sentences of a document. As a simple example, consider the following text:

Aspirin greatly underused in people with heart disease. DALLAS (AP) – Too few heart patients are taking aspirin, despite its widely known ability to prevent heart attacks, according to a study released Monday. The study, published in the American Heart Association’s journal Circulation, found that only 26% of patients who had heart disease and could have benefited from aspirin took the pain reliever. “This suggests that there’s a substantial number of patients who are at higher risk of more problems because they’re not taking aspirin,” said Dr. Randall Stafford, an internist at Harvard’s Massachusetts General Hospital, who led the study. “As we all know, this is a very inexpensive medication – very affordable.” The regular use of aspirin has been shown to reduce the risk of blood clots that can block an artery and trigger a heart attack. Experts say aspirin also can reduce the risk of a stroke and angina, or severe chest pain. Because regular aspirin use can cause some side effects, such as stomach ulcers, internal bleeding, and allergic reactions, doctors too often are reluctant to prescribe it for heart patients, Stafford said. “There’s a bias in medicine toward treatment, and within that bias, we tend to underutilize preventative services, even if they’ve been clearly proven,” said Marty Sullivan,
a professor of cardiology at Duke University in Durham, North Carolina. Stafford’s findings were based on 1996 data from 10,942 doctor visits by people with heart disease. The study may underestimate aspirin use; some doctors may not have reported instances in which they recommended patients take over-the-counter medications, he said. He called the data “a wake-up call” to doctors who focus too much on acute medical problems and ignore general prevention.

We can find frequent one-word, two-word, three-word, four-word, and five-word combinations. For instance, we found 14 four-word combinations: heart aspirin use regul, aspirin they take not, aspirin patient take not, patient doct use some, aspirin patient study take, patient they take not, aspirin patient use some, aspirin doct use some, aspirin patient they not, aspirin patient they take, aspirin patient doct some, heart aspirin patient too, aspirin patient doct use, heart aspirin patient study.

Multi-word combinations are effective for document indexing and summarization. The work in Johnson et al. (2002) shows that multi-word combinations can index documents more accurately than single-word indexing terms. Multi-word combinations can delineate the concepts or content of a domain-specific document collection more precisely than single words. For example, from the frequent one-word table, we may infer that heart, aspirin, and patient are the most important concepts in the text, since they occur more often than others. From the frequent two-word table, we see a large number of two-word combinations with aspirin (i.e., aspirin patient, heart aspirin, aspirin use, aspirin take, etc.). This suggests that the document emphasizes aspirin and aspirin-related topics more than any other words.
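As a toy illustration of the sentence-as-transaction view (my sketch, not the authors' code), the snippet below treats three abridged sentences from the quoted text as word itemsets and keeps the two-word combinations that appear in at least two of them.

```python
# A small illustration of frequent word combinations: each sentence is an
# itemset of words, and a combination is kept if it occurs in >= 2 sentences.
from itertools import combinations
from collections import Counter

text = ("Too few heart patients are taking aspirin. "
        "The regular use of aspirin has been shown to reduce the risk of blood clots. "
        "Experts say aspirin also can reduce the risk of a stroke.")

sentences = [set(s.lower().rstrip(".").split()) for s in text.split(". ")]

pair_counts = Counter()
for words in sentences:
    for pair in combinations(sorted(words), 2):
        pair_counts[pair] += 1

# two-word combinations supported by at least two sentences
print([p for p, c in pair_counts.items() if c >= 2])
```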
FUTURE TRENDS

There is a growing need for mining frequent sequence patterns from human genome datasets. There are 23 pairs of human chromosomes, approximately 30,000 genes, and more than 1,000,000 proteins. The previously discussed pattern decomposition method can be used to capture sequential patterns with some small modifications.

When the frequent patterns are long, mining frequent itemsets (FI) is infeasible because of the exponential number of frequent itemsets. Thus, algorithms mining frequent closed itemsets (FCI) (Pasquier, Bastide, Taouil & Lakhal, 1999; Pei, Han & Mao, 2000; Zaki & Hsiao, 1999) have been proposed, since the FCI are enough to generate association rules. However, the FCI also could be as exponentially large as the FI. As a result, many algorithms for mining maximal frequent itemsets (MFI) have been proposed, such as Mafia (Burdick, Calimlim & Gehrke, 2001), GenMax (Gouda & Zaki, 2001), and SmartMiner (Zou, Chu & Lu, 2002). The main idea of pattern decomposition also is used in SmartMiner, except that SmartMiner uses tail information (frequent itemsets) to decompose the search space of a dataset rather than the dataset itself. While pattern decomposition avoids candidate set generation, SmartMiner avoids superset checking, which is a time-consuming process.
CONCLUSION

We propose to use pattern decomposition to find frequent patterns in large datasets. The PD algorithm shrinks the dataset in each pass so that the search space of the dataset is reduced. Pattern decomposition avoids the costly candidate set generation procedure, and using reduced datasets greatly decreases the time for support counting.
ACKNOWLEDGMENT

This research is supported by NSF IIS ITR Grant # 6300555.
REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 1994 International Conference on Very Large Data Bases.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. Proceedings of the International Conference on Data Engineering.

Gouda, K., & Zaki, M.J. (2001). Efficiently mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM International Conference on Management of Data, Dallas, Texas.

Heikki, M., Toivonen, H., & Verkamo, A.I. (1994). Efficient algorithms for discovering association rules. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington.

Johnson, D., Zou, Q., Dionisio, J.D., Liu, Z., & Chu, W.W. (2002). Modeling medical content for automated summarization. Annals of the New York Academy of Sciences.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999). Discovering frequent closed itemsets for association rules. Proceedings of the 7th International Conference on Database Theory.

Pei, J., Han, J., & Mao, R. (2000). Closet: An efficient algorithm for mining frequent closed itemsets. Proceedings of the SIGMOD International Workshop on Data Mining and Knowledge Discovery.

Toivonen, H. (1996). Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Data Bases, Bombay, India.

Zaki, M.J., & Hsiao, C. (1999). Charm: An efficient algorithm for closed association rule mining. Technical Report 99-10, Rensselaer Polytechnic Institute.

Zaki, M.J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. Proceedings of the Third International Conference on Knowledge Discovery in Databases and Data Mining.

Zou, Q., Chu, W., Johnson, D., & Chiu, H. (2002). Using pattern decomposition (PD) methods for finding all frequent patterns in large datasets. Journal Knowledge and Information Systems (KAIS).

Zou, Q., Chu, W., & Lu, B. (2002). SmartMiner: A depth first algorithm guided by tail information for mining maximal frequent itemsets. Proceedings of the IEEE International Conference on Data Mining, Japan.

KEY TERMS

Frequent Itemset (FI): An itemset whose support is greater than or equal to the minimal support.

Infrequent Pattern: An itemset that is not a frequent pattern.

Minimal Support (minSup): A user-given number that specifies the minimal number of transactions in which an interested pattern should be contained.

Pattern Decomposition: A technique that uses known frequent or infrequent patterns to reduce the search space of a dataset.

Search Space: The union of the search space of every transaction in a dataset.
Search Space of a Transaction N=X:Y: The set of unknown frequent itemsets contained by N. Its size is decided by the number of items in the tail of N, i.e., Y.

Support of an Itemset x: The number of transactions that contain x.
Transaction: An instance that usually contains a set of items. In this article, we extend a transaction to a composition of a head and a tail (i.e., N=X:Y), where the head represents a known frequent itemset, and the tail is the set of items for extending the head for new frequent patterns.
Mining Group Differences
Shane M. Butler, Monash University, Australia
Geoffrey I. Webb, Monash University, Australia
INTRODUCTION

Finding differences among two or more groups is an important data-mining task. For example, a retailer might want to know what the difference is in customer purchasing behaviors during a sale compared to a normal trading day. With this information, the retailer may gain insight into the effects of holding a sale and may factor that into future campaigns. Another possibility would be to investigate what is different about customers who have a loyalty card compared to those who don’t. This could allow the retailer to better understand loyalty cardholders, to increase loyalty revenue, or to attempt to make the loyalty program more appealing to noncardholders. This article gives an overview of such group mining techniques. First, we discuss two data-mining methods designed specifically for this purpose—Emerging Patterns and Contrast Sets. We will discuss how these two methods relate and how other methods, such as exploratory rule discovery, can also be applied to this task. Exploratory data-mining techniques, such as the techniques used to find group differences, potentially can result in a large number of models being presented to the user. As a result, filter mechanisms can be a useful way to automatically remove models that are unlikely to be of interest to the user. In this article, we will examine a number of such filter mechanisms that can be used to reduce the number of models with which the user is confronted.
BACKGROUND

There have been two main approaches to the group discovery problem from two different schools of thought. The first, Emerging Patterns, evolved as a classification method, while the second, Contrast Sets, grew as an exploratory method. The algorithms of both approaches are based on the Max-Miner rule discovery system (Bayardo Jr., 1998). Therefore, we will briefly describe rule discovery.
Rule discovery is the process of finding rules that best describe a dataset. A dataset is a collection of records in which each record contains one or more discrete attribute-value pairs (or items). A rule is simply a combination of conditions that, if true, can be used to predict an outcome. A hypothetical rule about consumer purchasing behaviors, for example, might be IF buys_milk AND buys_cookies THEN buys_cream. Association rule discovery (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994) is a popular rule-discovery approach. In association rule mining, rules are sought specifically in the form where the antecedent group of items (or itemset), A, implies the consequent itemset, C. An association rule is written as A → C. Of particular interest are the rules where the probability of C is increased when the items in A also occur. Often, association rule-mining systems restrict the consequent itemset to hold only one item, as this reduces the complexity of finding the rules. In association rule mining, we often are searching for rules that fulfill the requirement of a minimum support criterion, minsup, and a minimum confidence criterion, minconf, where support is defined as the frequency with which A and C co-occur:

support(A → C) = frequency(A ∪ C)

and confidence is defined as the frequency with which A and C co-occur, divided by the frequency with which A occurs throughout all the data:

confidence(A → C) = support(A → C) / frequency(A)
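These two definitions are straightforward to compute directly. The sketch below is a minimal illustration of mine, not code from the article; it measures frequency as a fraction of transactions, which leaves the confidence ratio unchanged compared with raw counts.

```python
# A hedged sketch of the support and confidence definitions above, with
# transactions represented as sets of items.
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A -> C) divided by the frequency of the antecedent A."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

baskets = [{"milk", "cookies", "cream"}, {"milk", "cookies"},
           {"milk", "bread"}, {"cookies", "cream"}]
rule = ({"milk", "cookies"}, {"cream"})
print(support(baskets, rule[0] | rule[1]), confidence(baskets, *rule))  # 0.25 0.5
```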
The association rules discovered through this process then are sorted according to some user-specified interestingness measure before they are displayed to the user. Another type of rule discovery is k-most interesting rule discovery (Webb, 2000). In contrast to the support-confidence framework, there is no minimum support or
confidence requirement. Instead, k-most interesting rule discovery focuses on the discovery of up to k rules that maximize some user-specified interestingness measure.
MAIN THRUST

Emerging Patterns

Emerging Pattern analysis is applied to two or more datasets, where each dataset contains data relating to a different group. An Emerging Pattern is defined as an itemset whose support increases significantly from one group to another (Dong & Li, 1999). This support increase is represented by the growth rate: the ratio of an itemset's support in one group over its support in the other. The support of an itemset X in a group G is given by:

supp_G(X) = count_G(X) / |G|

GrowthRate(X) is defined as 0 if supp_1(X) = 0 and supp_2(X) = 0; as ∞ if supp_1(X) = 0 and supp_2(X) ≠ 0; and otherwise as supp_2(X) / supp_1(X). The special case where GrowthRate(X) = ∞ is called a Jumping Emerging Pattern, as it is said to have jumped from not occurring in one group to occurring in another group. This also can be thought of as an association rule having a confidence equaling 1.0.

Emerging Patterns are not presented to the user, as models are in the exploratory discovery framework. Rather, the Emerging Pattern discovery research has focused on using the mined Emerging Patterns for classification, similar to the goals of Liu et al. (1998, 2001). Emerging Pattern mining-based classification systems include CAEP (Dong, Zhang, Wong & Li, 1999), JEP-C (Li, Dong & Ramamohanarao, 2001), BCEP (Fan & Ramamohanarao, 2003), and DeEP (Li, Dong, Ramamohanarao & Wong, 2004). Since the Emerging Patterns are classification based, the focus is on classification accuracy. This means no filtering method is used, other than the infinite growth rate constraint used during discovery by some of the classifiers (e.g., JEP-C and DeEP). This constraint discards any Emerging Pattern X for which GrowthRate(X) ≠ ∞.

Contrast Sets

Contrast Sets (Bay & Pazzani, 1999, 2001) are similar to Emerging Patterns, in that they are also itemsets whose support differs significantly across datasets. However, the focus of Contrast Set research has been to develop an exploratory method for finding differences between one group and another that the user can utilize, rather than as a classification system focusing on prediction accuracy. To this end, they present filtering and pruning methods to ensure that only the most interesting rules, in an optimal number, are shown to the user, from what is potentially a large space of possible rules. Contrast Sets are discovered using STUCCO, an algorithm that is based on the Max-Miner search algorithm (Bayardo Jr., 1998). Initially, only Contrast Sets are sought whose supports are both significant and whose difference is large (i.e., the difference is greater than a user-defined parameter, mindev). Significant Contrast Sets (cset), therefore, are defined as those that meet the criterion:

P(cset | G_i) ≠ P(cset | G_j)

Large Contrast Sets are those for which:

|support(cset, G_i) − support(cset, G_j)| ≥ mindev
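A rough sketch of these quantities follows, assuming each group is simply a list of transactions represented as item sets; it is my illustration, not the EP or STUCCO implementations, and the statistical test behind the "significant" criterion is discussed separately below.

```python
# Per-group support, growth rate (including the jumping case), and the
# mindev-based "large" test for a candidate contrast set.
import math

def supp(group, itemset):
    itemset = set(itemset)
    return sum(1 for t in group if itemset <= t) / len(group)

def growth_rate(x, g1, g2):
    s1, s2 = supp(g1, x), supp(g2, x)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return math.inf          # a Jumping Emerging Pattern
    return s2 / s1

def is_large_contrast_set(x, g1, g2, mindev=0.05):
    return abs(supp(g1, x) - supp(g2, x)) >= mindev

sale = [{"tv", "cable"}, {"tv"}, {"bread"}]
normal = [{"bread"}, {"milk", "bread"}, {"milk"}]
print(growth_rate({"tv"}, normal, sale))            # inf -> jumping EP
print(is_large_contrast_set({"bread"}, sale, normal))
```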
As Bay and Pazzani have noted, the user is likely to be overwhelmed by the number of results. Therefore, a filter method is applied to reduce the number of Contrast Sets presented to the user and to control the risk of type-1 error (i.e., the risk of reporting a Contrast Set when no difference exists). The filter method employed involves a chi-square test of statistical significance between the itemset on one group to that Contrast Set on the other group(s). A correction for multiple comparisons is applied that lowers the value of α as the size of the Contrast Set (number of attribute value pairs) increases. Further pruning mechanisms also are used to filter Contrast Sets that are purely specializations of other more general Contrast Sets. This is done using another chi-square test of significance to test the difference between the parent Contrast Set and its specialization Contrast Set.
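As a hedged illustration of this kind of filter (the parameter names and the plain Bonferroni-style correction are my simplification, not STUCCO's exact procedure), the sketch below tests the 2x2 contingency table of one candidate contrast set against an alpha level divided by the number of tests.

```python
# Chi-square filter for a single contrast set, with a simple correction for
# multiple comparisons.  For a 2x2 table the p-value with 1 d.o.f. can be
# computed from the complementary error function, avoiding extra dependencies.
import math

def chi2_2x2_pvalue(a, b, c, d):
    """Pearson chi-square p-value (1 d.o.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_filter(count_g1, size_g1, count_g2, size_g2, num_tests, alpha=0.05):
    p = chi2_2x2_pvalue(count_g1, size_g1 - count_g1,
                        count_g2, size_g2 - count_g2)
    return p < alpha / num_tests   # Bonferroni-style correction (a simplification)

# e.g. itemset seen in 120 of 1000 sale transactions vs 60 of 1000 normal ones,
# with 50 candidate contrast sets tested
print(passes_filter(120, 1000, 60, 1000, num_tests=50))
```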
Mining Group Differences Using Rule Discovery

Webb, Butler, and Newlands (2003) studied how Contrast Sets relate to generic rule discovery approaches. They used the OPUS_AR algorithm-based Magnum Opus software to discover rules and to compare them to those discovered by the STUCCO algorithm. OPUS_AR (Webb, 2000) is a rule-discovery algorithm based on the OPUS (Webb, 1995) efficient search technique, to which the Max-Miner algorithm is closely related. By limiting the consequent to a group variable, this rule discovery framework is able to be adapted for group discovery.
While STUCCO and Magnum Opus specify different support conditions in the discovery phase, their conditions were proven to be equivalent (Webb et al., 2003). Further investigation found that the key difference between the two techniques was the filtering technique. Magnum Opus uses a binomial sign test to filter spurious rules, while STUCCO uses a chi-square test. STUCCO attempts to control the risk of type-1 error by applying a correction for multiple comparisons. However, such a correction, when given a large number of tests, will reduce the α value to an extremely low number, meaning that the risk of type-2 error (i.e., the risk of not accepting a nonspurious rule) is substantially increased. Magnum Opus does not apply such corrections so as not to increase the risk of type-2 error. While a chi-square approach is likely to be better suited to Contrast Set discovery, the correction for multiple comparisons, combined with STUCCO’s minimum difference, is a much stricter filter than that employed by Magnum Opus. As a result of Magnum Opus’ much more lenient filter mechanisms, many more rules are being presented to the end user. After finding that the main difference between the systems was their control of type-1 and type-2 errors via differing statistical test methods, Webb, et al. (2003) concluded that Contrast Set mining is, in fact, a special case of the rule discovery task. Experience has shown that filters are important for removing spurious rules, but it is not obvious which of the filtering methods used by systems like Magnum Opus and STUCCO is better suited to the group discovery task. Given the apparent tradeoff between type-1 and type-2 error in these data-mining systems, recent developments (Webb, 2003) have focused on a new filter method to avoid introducing type-1 and type-2 errors. This approach divides the dataset into exploratory and holdout sets. Like the training and test set method of statistically evaluating a model within the classification framework, one set is used for learning (the exploratory set) and the other is used for evaluating the models (the holdout set). A statistical test then is used for the filtering of spurious rules, and it is statistically sound, since the statistical tests are applied using a different set. A key difference between the traditional training and test set methodology of the classification framework and the new holdout technique is that many models are being evaluated in the exploratory framework rather than only one model in the classification framework. We envisage the holdout technique will be one area of future research, as it is adapted by exploratory data-mining techniques as a statistically sound filter method.
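A speculative sketch of the holdout idea follows; it is not Webb's implementation. Candidate rules would be discovered on the exploratory half of the data and are then kept only if a sign-test-style check on the untouched holdout half remains significant.

```python
# Holdout evaluation of exploratory rules: rules are (antecedent, consequent)
# pairs of frozensets; transactions are sets of items.
import math, random

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def rule_pvalue(antecedent, consequent, transactions):
    """One-sided test of whether the consequent is over-represented among
    transactions that contain the antecedent."""
    covered = [t for t in transactions if antecedent <= t]
    hits = sum(1 for t in covered if consequent <= t)
    base_rate = sum(1 for t in transactions if consequent <= t) / len(transactions)
    return binom_tail(hits, len(covered), base_rate) if covered else 1.0

def holdout_filter(candidate_rules, transactions, alpha=0.05, seed=0):
    data = list(transactions)
    random.Random(seed).shuffle(data)
    exploratory, holdout = data[:len(data) // 2], data[len(data) // 2:]
    # candidate_rules would normally be discovered on `exploratory`;
    # here they are assumed to be given, and only the holdout test is shown.
    return [r for r in candidate_rules
            if rule_pvalue(r[0], r[1], holdout) < alpha]
```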
Case Study

In order to evaluate STUCCO and the more lenient Magnum Opus filter mechanisms, Webb, Butler, and Newlands (2003)
conducted a study with a retailer to find interesting patterns between transactions from two different days. This data was traditional market-basket transactional data, containing the purchasing behaviors of customers across the many departments. Magnum Opus was used with the group encoded as a variable and the consequent restricted to that variable only. In this experiment, Magnum Opus discovered all of the Contrast Sets that STUCCO found, and more. This is indicative of the more lenient filtering method of Magnum Opus. It was also interesting that, while all of the Contrast Sets discovered by STUCCO were only of size 1, Magnum Opus discovered conjunctions of sizes up to three department codes. This information was presented to the retail marketing manager in the form of a survey. For each rule, the manager was asked if the rule was surprising and if it was potentially useful to the organization. For ease of understanding, the information was transformed into a plain text statement. The domain expert judged a greater percentage of the Magnum Opus rules than the STUCCO contrasts to be surprising; however, the result was not statistically significant. The percentages of rules found to be potentially useful were similar for both systems. In this case, Magnum Opus probably found some rules that were spurious, and STUCCO probably failed to discover some rules that were potentially interesting.
FUTURE TRENDS

Mining differences among groups will continue to grow as an important research area. One area likely to be of future interest is improving filter mechanisms. Experience has shown that the use of filters is important, as it reduces the number of rules, thus avoiding overwhelming the user. There is a need to develop alternative filters as well as to determine which filters are best suited to different types of problems. An interestingness measure is a user-generated specification of what makes a rule potentially interesting. Interestingness measures are another important issue, because they attempt to reflect the user’s interest in a model during the discovery phase. Therefore, the development of new interestingness measures and determination of their appropriateness for different tasks are both expected to be areas of future study. Finally, while the methods discussed in this article focus on discrete attribute-value data, it is likely that there will be future research on how group mining can utilize quantitative, structural, and sequence data. For example, group mining of sequence data could be used to investigate what is different about the sequence of
events between fraudulent and non-fraudulent credit card transactions.
CONCLUSION

We have presented an overview of techniques for mining differences among groups, discussing Emerging Pattern discovery, Contrast Set discovery, and association rule discovery approaches. Emerging Patterns are useful in a classification system where prediction accuracy is the focus but are not designed for presenting the group differences to the user and thus don’t have any filters. Exploratory data mining can result in a large number of rules. Contrast Set discovery is an exploratory technique that includes mechanisms to filter spurious rules, thus reducing the number of rules presented to the user. By forcing the consequent to be the group variable during rule discovery, generic rule discovery software like Magnum Opus can be used to discover group differences. The number of differences reported to the user by STUCCO and Magnum Opus is related to the different filter mechanisms for controlling the output of potentially spurious rules. Magnum Opus uses a more lenient filter than STUCCO and thus presents more rules to the user. A new method, the holdout technique, will be an improvement over other filter methods, since the technique is statistically sound.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile. Bay, S.D., & Pazzani, M.J. (1999). Detecting change in categorical data: Mining contrast sets. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA. Bay, S.D., & Pazzani, M.J. (2001). Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3), 213-246.
Bayardo, Jr., R.J. (1998). Efficiently mining long patterns from databases. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 85-93, Seattle, Washington, USA. Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego, California, USA. Dong, G., Zhang, X., Wong, L., & Li, J. (1999). CAEP: Classification by aggregating emerging patterns. Proceedings of the Second International Conference on Discovery Science, Tokyo, Japan. Fan, H., & Ramamohanarao, K. (2003). A Bayesian approach to use emerging patterns for classification. Proceedings of the 14th Australasian Database Conference, Adelaide, Australia. Li, J., Dong, G., & Ramamohanarao, K. (2001). Making use of the most expressive jumping emerging patterns for classification. Knowledge and Information Systems, 3(2), 131-145. Li, J., Dong, G., Ramamohanarao, K., & Wong, L. (2004). DeEPs: A new instance-based lazy discovery and classification system. Machine Learning, 54(2), 99-124. Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, New York. Liu, B., Ma, Y., & Wong, C.K. (2001). Classification using association rules: Weaknesses and enhancements. In V. Kumar et al. (Eds.), Data mining for scientific and engineering applications (pp. 506-605). Boston: Kluwer Academic Publishing. Webb, G.I. (1995). An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3, 431-465. Webb, G.I. (2000). Efficient search for association rules. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachesetts, USA. Webb, G.I. (2003). Preliminary investigations into statistically valid exploratory rule discovery. Proceedings of the Australasian Data Mining Workshop, Canberra, Australia. Webb, G.I., Butler, S.M., & Newlands, D. (2003). On detecting differences between groups. Proceedings of the Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, Washington, D.C., USA.
KEY TERMS

Association Rule: A rule relating two itemsets: the antecedent and the consequent. The rule indicates that the presence of the antecedent implies that the consequent is more probable in the data. Written as A→C.

Contrast Set: Similar to an Emerging Pattern, it is also an itemset whose support differs across groups. The main difference is the method’s application as an exploratory technique rather than as a classification one.

Emerging Pattern: An itemset that occurs significantly more frequently in one group than another. Utilized as a classification method by several algorithms.

Filter Technique: Any technique for reducing the number of models with the aim of avoiding overwhelming the user.

Growth Rate: The ratio of the proportion of data covered by the Emerging Pattern in one group over the proportion of the data it covers in another group.

Holdout Technique: A filter technique that splits the data into exploratory and holdout sets. Rules discovered from the exploratory set then can be evaluated against the holdout set using statistical tests.

Itemset: A conjunction of items (attribute-value pairs) (e.g., age = teen ∧ hair = brown).

k-Most Interesting Rule Discovery: The process of finding k rules that optimize some interestingness measure. Minimum support and/or confidence constraints are not used.

Market Basket: An itemset; this term is sometimes used in the retail data-mining context, where the itemsets are collections of products that are purchased in a single transaction.

Rule Discovery: The process of finding rules that then can be used to predict some outcome (e.g., IF 13 <= age <= 19 THEN teenager).
Mining Historical XML

Qiankun Zhao, Nanyang Technological University, Singapore
Sourav Saha Bhowmick, Nanyang Technological University, Singapore
INTRODUCTION
Nowadays the Web poses itself as the largest data repository ever available in the history of humankind (Reis et al., 2004). However, the availability of a huge amount of Web data does not imply that users can get whatever they want more easily. On the contrary, the massive amount of data on the Web has overwhelmed their abilities to find the desired information. It has been claimed that 99% of the data reachable on the Web is useless to 99% of the users (Han & Kamber, 2000, pp. 436). That is, an individual may be interested in only a tiny fragment of the Web data. However, the huge and diverse properties of Web data do imply that Web data provides a rich and unprecedented data mining source. Web mining was introduced to discover hidden knowledge from Web data and services automatically (Etzioni, 1996). According to the type of Web data, Web mining can be classified into three categories: Web content mining, Web structure mining, and Web usage mining (Madria et al., 1999). Web content mining is to extract patterns from online information such as HTML files, e-mails, or images (Dumais & Chen, 2000; Ester et al., 2002). Web structure mining is to analyze the link structures of Web data, which can be inter-links among different Web documents (Kleinberg, 1998) or intra-links within individual Web documents (Arasu & Hector, 2003; Lerman et al., 2004). Web usage mining is defined as discovering interesting usage patterns from the secondary data derived from the interaction of users while surfing the Web (Srivastava et al., 2000; Cooley, 2003). Recently, XML is widely used as a standard for data exchange on the Internet. Existing work on XML data mining includes frequent substructure mining (Inokuchi et al., 2000; Kuramochi & Karypis, 2001; Zaki, 2002; Yan & Han, 2003; Huan et al., 2003), classification (Zaki & Aggarwal, 2003; Huan et al., 2004), and association rule mining (Braga et al., 2002). As data in different domains can be represented as XML documents, XML data mining can be useful in many applications such as bioinformatics, chemistry, network analysis (Deshpande et al., 2003; Huan et al., 2004), and so on.

BACKGROUND

The historical XML mining research is largely inspired by two research communities: XML data mining and XML data change detection. The XML data mining community has looked at developing novel algorithms to mine snapshots of XML data. The database community has focused on detecting, representing, and querying changes to XML data. Some of the initial work for XML data mining is based on the use of the XPath language as the main component to query XML documents (Braga et al., 2002; Braga et al., 2003). In Braga et al. (2002, 2003), the authors presented the XMINE operator, which is a tool developed to extract XML association rules from XML documents. The operator is based on XPath and inspired by the syntax of XQuery. It allows us to express complex mining tasks compactly and intuitively. XMINE can be used to specify indifferently (and simultaneously) mining tasks both on the content and on the structure of the data, since the distinction in XML is slight. Other works for XML data mining focus on extracting frequent tree patterns from the structure of XML data, such as TreeFinder (Termier et al., 2002) and TreeMiner (Zaki, 2002). TreeFinder uses an Inductive Logic Programming approach. Notice that TreeFinder cannot produce complete results. It may miss many frequent subtrees, especially when the support threshold is small or trees in the database have common node labels. TreeMiner can produce the complete results by using a novel vertical representation for fast subtree support counting. Different from the above techniques, which focus on designing ad-hoc algorithms to extract structures that occur frequently in the snapshot data collections, historical XML mining focuses on the sequence of changes among XML versions. Considering the dynamic nature of XML data, many efforts have been directed into the research of change detection for XML data. XML TreeDiff (Curbera & Epstein, 1999) computes the difference between two XML documents using hash values and a simple tree
comparison algorithm. XyDiff (Cobena et al., 2002) is proposed to detect changes of ordered XML documents. Besides insertion, deletion, and updating, XyDiff also supports a move operation. X-Diff (Wang et al., 2003) is designed to detect changes of unordered XML documents. In our historical XML mining, we extend the XML change detection techniques to discover hidden knowledge from the history of changes to XML data with data mining techniques.
MAIN THRUST

Overview

Considering the dynamic property of XML data and existing XML data mining research, it can be observed that the dynamic nature of XML leads to two challenging problems in XML data mining. The first is the maintenance of the mining results for existing mining techniques. As the data source changes, new knowledge may be found and some old knowledge may no longer be valid. The second is the discovery of novel knowledge hidden behind the historical changes, some of which is difficult or impossible to discover from snapshot data. In this paper, we focus on the second issue: discovering novel hidden knowledge from the historical changes to XML data. Suppose there is a sequence of XML documents, which are different versions of the same document. Then the following novel knowledge can be discovered. Note that by no means do we claim that the list is exhaustive. We use them as representatives for the various types of knowledge behind the history of changes.

•	Frequently changing/Frozen structures/contents: Some parts of the structure or content change more frequently and significantly compared to other structures. Such structures and contents reflect the relatively more dynamic parts of the XML document. Frozen structures/contents represent the most stable part of the XML document. Identifying such structure is useful for various applications such as trend monitoring and change detection of very large XML documents.
•	Association rules: Some structures/contents are associated in terms of their changes. The association rules imply the concurrence of changes among different parts of the XML document. Such knowledge can be used for XML change prediction, XML index maintenance, and XML-based multimedia annotation.
•	Change patterns: From the historical changes, one may observe that more and more nodes are inserted under certain substructures, while nodes are inserted and deleted frequently under others. Such change patterns can be critical for monitoring and predicting trends in e-commerce Web sites. They may imply certain underlying semantic meanings and can be exploited by strategy makers.
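To make the distinction between frequently changing and frozen parts concrete, the rough Python sketch below (an illustration of mine, not the authors' algorithm) scores each element path of a small XML document by how often its set of children changes across a sequence of versions.

```python
# Score "frequently changing" vs. "frozen" element paths across XML versions.
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def child_signature(version_xml):
    """Map each element path to the multiset of its children's tags."""
    sig = defaultdict(Counter)
    def walk(elem, path):
        for child in elem:
            sig[path][child.tag] += 1
            walk(child, path + "/" + child.tag)
    root = ET.fromstring(version_xml)
    walk(root, root.tag)
    return sig

def change_counts(versions):
    """Number of consecutive version pairs in which each path's children changed."""
    counts = Counter()
    sigs = [child_signature(v) for v in versions]
    for old, new in zip(sigs, sigs[1:]):
        for path in set(old) | set(new):
            if old.get(path) != new.get(path):
                counts[path] += 1
    return counts

v1 = "<site><items><item/></items><about/></site>"
v2 = "<site><items><item/><item/></items><about/></site>"
v3 = "<site><items><item/><item/><item/></items><about/></site>"
print(change_counts([v1, v2, v3]))   # only 'site/items' registers changes
```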
Applications

Such novel knowledge can be useful in different applications, such as intelligent change detection for very large XML documents, Web usage mining, dynamic XML indexing, association rule mining, evolutionary pattern based classification, and so on. We elaborate only on the first two applications due to space limitations. Suppose one can discover substructures that change frequently and those that do not (frozen structures); then he/she can use this knowledge to detect changes to relevant portions of the documents at different frequencies based on their change patterns. That is, one can detect changes to frequently changing content and structure at a different frequency compared to structures that do not change frequently. Moreover, one may ignore frozen substructures during change detection, as most likely they are not going to change. One of the major limitations of existing XML change detection systems (Cobena et al., 2002; Wang et al., 2003) is that they are not scalable for very large XML documents. Knowledge extracted from historical changes can be used to improve the scalability of XML change detection systems.

Recently, a lot of work has been done in Web usage mining. However, most of the existing works focus on snapshot Web usage data, while usage data is dynamic in real life. Knowledge hidden behind historical changes of Web usage data, which reflects how Web access patterns (WAP) change, is critical to the adaptive Web, Web site maintenance, business intelligence, and so on. The Web usage data can be considered as a set of trees, which have structures similar to XML documents. By partitioning Web usage data according to a user-defined calendar pattern, we can obtain a sequence of changes from the historical Web access patterns. From the changes, useful knowledge, such as how certain Web access patterns changed, which parts change more frequently and which parts do not, can be extracted. Some preliminary results of mining the changes to historical Web access patterns have been shown in Zhao and Bhowmick (2004).
Research Issues

To the best of our knowledge, existing state-of-the-art XML (structure related) data mining techniques (Yan & Han, 2002; Yan & Han, 2003) cannot extract such novel knowledge. Even if we apply such techniques repeatedly
to a sequence of snapshots of XML structure data, they cannot discover such knowledge efficiently and completely. This is because historical XML mining focuses on the structural changes between versions of XML documents, which are generated by XML data change detection tools (Zhao & Bhowmick, 2004). Given a sequence of XML documents (which are versions of the same XML document), the objective of historical XML data mining is to extract hidden and useful knowledge from the sequence of changes to the XML versions. In this article, we focus on the structural changes of XML documents. We propose three major issues for historical XML mining: identifying interesting structures, mining XML delta association rules, and classifying/clustering evolutionary XML data. Based on the historical change behaviors of the XML data, the first issue is to discover the interesting structures. We elaborate on some of the representatives.
•	Frozen structure: Refers to structure that does not change frequently and significantly in the history. There are different possible reasons that a structure does not change in the sequence of XML document versions. First, the structure is so well designed that it does not change even if the content has changed. Second, some parts of the XML document may be ignored or redundant so that they never change. Also, some data in the XML document may never change by nature.
•	Frequently changing structure: Refers to substructures in the XML document that may change more frequently and significantly compared to others. They may reflect the relatively more dynamic parts of the XML document.
•	Periodic dynamic structure: Among the history of changes for some structures, there may be certain fixed patterns. Such change patterns may occur repeatedly in the history. Those structures, where there exist certain patterns in their change histories, are called periodic dynamic structures.
•	Increasing/Decreasing dynamic structure: Among those frequently changing structures, some of them change according to certain patterns. For instance, the changes of some structures may become more significant and more frequent. Such structures are defined as increasing dynamic structures. Similarly, decreasing dynamic structures denote structures whose frequency and significance of changes are decreasing.
•	Outlier structures: From the historical change behavior of the structures, one may observe that some of them may not comply with their change patterns as predicted based on the history. For instance, some of the frozen structures that are not supposed to change may change, while some of the frequently changing structures that are supposed to change may not change. Such changes may happen for various reasons. For example, some of the structures may be modified by mistake; some of them may be the result of some intrusion or fraud actions; and others may be caused by intentional or highly frequent underlying events.

Besides the interesting structures, association rules between structures can be extracted as well. There are two types of association rules. One is the structural delta association rule and another is the semantic delta association rule.
•	Structural delta association rule: Among those interesting structures, we may observe that some of the structures change together frequently with certain confidence. For example, whenever structure A changes, structure B also changes with a probability of 80%. The structural delta association rule is used to represent such kind of knowledge. It can be used for different applications such as version and concurrency control systems.
•	Semantic delta association rule: By incorporating some metadata, such as types of changes, ontology, and content summaries of leaf nodes, semantic delta association rules can be extracted. For example, with the history of changes in an e-commerce Web site, we may discover that when product A becomes more popular, product B will become less popular. Such semantic delta association rules can be useful for competitor monitoring and strategy analysis in e-commerce.
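The structural delta association idea can be illustrated with a few lines of Python; the representation below, in which each version transition is summarised as the set of changed structures and rules are scored by confidence, is my own assumption rather than the authors' mining algorithm.

```python
# Mine simple pairwise "A changes => B changes" rules from a sequence of
# change sets (one set of changed structures per version transition).
from itertools import permutations
from collections import Counter

change_sets = [
    {"catalog/items", "catalog/prices"},
    {"catalog/items", "catalog/prices", "news"},
    {"catalog/items"},
    {"news"},
]

single = Counter()
pair = Counter()
for cs in change_sets:
    single.update(cs)
    pair.update(permutations(sorted(cs), 2))   # ordered pairs, both directions

min_conf = 0.6
rules = [(a, b, pair[(a, b)] / single[a])
         for (a, b) in pair if pair[(a, b)] / single[a] >= min_conf]
print(rules)   # e.g. ('catalog/prices', 'catalog/items', 1.0)
```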
FUTURE TRENDS

Considering the different types of mining we proposed and the types of data we are going to mine, there are many challenges ahead. We elaborate on some of the representatives here.
Real Data Collection

To collect real data of historical XML structural versions is a challenge. To get the structural data from XML documents, a parser is needed. Currently, there are many XML parsers available. However, it has been verified that the parsing process is the most expensive
process of XML data management (Nicola & John, 2003). Moreover, to get the historical structural information, every time the XML document changes, the entire document has to be parsed again. Consequently, extracting historical structural information is a very expensive task, especially for very large XML documents, where usually only some parts of the document change frequently. In order to get knowledge that is more useful, a longer sequence of historical structural data and larger datasets are desirable. However, to get longer and larger historical XML structural datasets, a longer time is needed to collect the corresponding XML documents.
Frequency of Data Gathering

With the dynamic nature of XML data, the process of gathering all versions of the historical XML structural data becomes a challenging task. The most naive method is to keep checking the corresponding XML documents continuously, but this approach is very expensive and may overwhelm the network bandwidth. An alternative method is to determine the frequency of checking by analyzing the historical changes of similar XML documents. However, this method does not guarantee that all versions of the historical data can be detected. In our research, we propose to extract an appropriate frequency for data gathering based on the changes of the historical data.

Incremental Mining

The high cost of some data mining processes and the dynamic nature of XML data make it desirable to mine the data incrementally, based on the part of the data that changed, rather than mining the entire data again from scratch. Such algorithms are based on the previous mining results. Most recently, there is a survey of incremental mining of sequential patterns in large databases (Parthasarathy et al., 1999). Integrated with a change detection system, our incremental mining of historical XML structural data can keep the discovered knowledge up to date and valid. However, the challenge is that existing incremental mining techniques are for relational and transactional data, while XML structural data is semi-structured. How to modify those approaches so that they can be used for incremental mining of historical XML will be one of our research issues.

CONCLUSION

In this article, we present a novel research direction: mining historical XML documents. Different from existing research on XML data mining and XML change detection, the temporal and dynamic property of XML data is incorporated with the semi-structured property in historical XML mining. Examples show that the results of historical XML mining can be useful in a variety of applications, such as intelligent change detection for very large XML documents, Web usage mining, dynamic XML indexing, association rule mining, evolutionary pattern based classification, and so on. By exploring the characteristics of XML changes, we present a framework of historical XML mining with a list of major research issues and challenges.

REFERENCES

Arasu, A., & Hector, G.-M. (2003). Extracting structured data from Web pages. In Proceedings of ACM SIGMOD (pp. 337-348).
Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P.L. (2002). A tool for extracting XML association rules. In Proceedings of IEEE ICTAI (pp. 57-65). Braga, D., Campi, A., Ceri, S., Klemettinen, M., & Lanzi, P.L. (2003). Discovering interesting information in XML data with association rules. In Proceedings of ACM SAC (pp. 450-454). Cobena, G., Abiteboul, S., & Marian, A. (2002). Detecting changes in XML documents. In Proceedings of IEEE ICDE (pp. 41-52). Cooley, R. (2003). The use of Web structure and content to identify subjectively interesting Web usage patterns. In ACM Transactions on Internet Technology (pp. 93-116). Curbera, & Epstein, D.A. (1999). Fast difference and update of XML documents. In Proceedings of XTech. Deshpande, M., Kuramochi, M., & Karypis, G. (2003). Frequent sub-structure-based approaches for classifying chemical compounds. In Proceedings of IEEE ICDM (pp. 35-42). Dumais, S.T., & Chen, H. (2000). Hierarchical classification of Web content. In Proceedings of Annual International ACM SIGIR (pp. 256-263). Ester, M., Kriegel, H.-P., & Schubert, M. (2002). Web site mining: A new way to spot competitors, customers and suppliers in the World Wide Web. In Proceedings of the eighth ACM SIGKDD (pp. 249-258). Etzioni, O. (1996). The World-Wide Web: Quagmire or gold mine? Communications of the ACM, 39(11), 65-68.
Han, J., & Kamber, M. (2000). Data mining concepts and techniques. Morgan Kaufmann. Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraph in the presence of isomorphism. In Proceedings of IEEE ICDM (pp. 549-552). Huan, J., Wang, W., Washington, A., Prins, J., Shah, R., & Tropshas, A. (2004). Accurate classification of protein structural families using coherent subgraph analysis. In Proceedings of PSB (pp. 411-422). Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632. Lerman, K., Getoor, L., Minton, S., & Knoblock, C. (2004). Using the structure of Web sites for automatic segmentation of tables. In Proceedings of ACM SIGMOD (pp. 119-130). Madria, S.K., Bhowmick, S.S., Ng, W.K., & Lim, E.-P. (1999). Research issues in Web data mining. In DMKD (pp. 303-312). Matthias, N., & Jasmi, J. (2003). XML parsing: A threat to database performance. In Proceedings of CIKM (pp. 175178). McHugh, J., Abiteboul, S., Goldman, R., Quass, D., & Widom, J. (1997). Lore: A database management system for semistructured data. ACM SIGMOD Record, 26(3), 54-66. Reis, D. de C., Golgher, P.B, da Silva Altigran, S., & Laender, A.H.F. (2004). Automatic Web news extraction using tree edit distance. In Proceedings of WWW (pp. 502511). Srinivasan, P., Zaki, M.J., Ogihara, M. & Dwarkadas, S.. (1999). Incremental and interactive sequence mining. In Proceedings of CIKM (pp. 251-258). Shearer, K., Dorai, C., & Venkatesh, S. (2000). Incorporating domain knowledge with video and voice data analysis in news broadcasts. In Proceedings of ACM SIGKDD (pp. 46-53). Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. ACM SIGKDD Explorations, 1(2), 12-23.
Wang, Y., DeWitt, D.J., & Cai, J.-Y. (2003). X-Diff: An effective change detection algorithm for XML documents. In Proceedings of IEEE ICDE (pp. 519-530). Yang, L.H., Lee, M.L., & Hsu, W. (2003). Efficient mining of XML query patterns for caching. In Proceedings of VLDB (pp. 69-80). Zaki, M.J. (2002). Efficiently mining frequent trees in a forest. In Proceedings of ACM SIGKDD (pp. 71-80). Zaki, M.J., & Aggarwal, C.C. (2003). XRules: An effective structural classifier for XML data. In Proceedings of ACM SIGKDD (pp. 316-325). Zhao, Q., & Bhowmick, S.S. (2004). Mining history of changes to Web access patterns. In Proceedings of PKDD.
KEY TERMS

Changes to XML: Given two XML documents, the set of edit operations that transform one document into the other is called changes to XML.

Historical XML: A sequence of XML documents, which are different versions of the same XML document. It records the change history of the XML document.

Mining Historical XML: The process of knowledge discovery from the historical changes to versions of XML documents. It is the integration of XML change detection systems and XML data mining techniques.

Semi-Structured Data Mining: A sub-field of data mining where the data collections are semi-structured, such as Web data, chemical data, biological data, network data, and so on.

Web Mining: The use of data mining techniques to automatically discover and extract information from Web data and services.

XML: Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.

XML Structural Changes: Among the changes to XML, not all changes cause structural changes when the XML document is represented as a tree. Only insertions and deletions are called XML structural changes.
Mining Images for Structure

Terry Caelli, Australian National University, Australia
INTRODUCTION

Most data warehousing and mining involves storing and retrieving data either in numerical or symbolic form, varying from tables of numbers to text. However, when it comes to everyday images, sounds, and music, the problem turns out to be far more complex. The major problem with image data mining is not so much image storage, per se, but rather how to automatically index, extract, and retrieve image content (content-based retrieval [CBR]). Most current image data-mining technologies encode image content by means of image feature statistics such as color histograms, edge, texture, or shape densities. Two well-known examples of CBR are IBM’s QBIC system used in the State Hermitage Museum and PICASSO (Corridoni, Del Bimbo & Pala, 1999) used for the retrieval of paintings. More recently, there have been some developments in indexing and retrieving images based on the semantics, particularly in the context of multimedia, where, typically, there is a need to index voice and video (semantic-based retrieval [SBR]). Recent examples include the study by Lay and Guan (2004) on artistry-based retrieval of artworks and that of Benitez and Chang (2002) on combining semantic and perceptual information in multimedia retrieval for sporting events. However, this type of concept or semantics-based image indexing and retrieval requires new methods for encoding and matching images, based on how content is structured, and here we briefly review two approaches to this.
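As a toy illustration of feature-statistics CBR (a numpy sketch of mine, not the QBIC or PICASSO implementations), the snippet below reduces each image to a normalised colour histogram and ranks a collection by histogram intersection with a query image.

```python
# Colour-histogram content-based retrieval on synthetic images.
import numpy as np

def colour_histogram(image, bins=8):
    """image: H x W x 3 uint8 array; returns a normalised joint RGB histogram."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def retrieve(query, collection, bins=8):
    q = colour_histogram(query, bins)
    # histogram intersection as the similarity score
    scores = [np.minimum(q, colour_histogram(img, bins)).sum() for img in collection]
    return np.argsort(scores)[::-1]          # best matches first

rng = np.random.default_rng(0)
images = [rng.integers(0, 256, (32, 32, 3), dtype=np.uint8) for _ in range(5)]
print(retrieve(images[0], images))           # image 0 should rank itself first
```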
BACKGROUND

Generally speaking, image structure is defined in terms of image features and their relations. For SBR, such features and relations reference scene information. These features typically are multi-scaled, varying from pixel attributes derived from localized image windows to edges, regions, and even larger image area properties.
MAIN THRUST

In recent years there has been an increasing interest in SBR. However, this requires the development of methods for binding image content with semantics. In turn, this reduces to the need for models and algorithms that are capable of efficiently encoding and matching relational properties of images and associating these relational properties with semantic descriptions of what is being sensed. To illustrate this approach, we briefly discuss two representative examples of such methods: (1) Bayesian Networks (Bayesian Nets) for SBR, based first on multi-scaled image models and then on image feature models; (2) principal components analysis (also termed latent semantic indexing or spectral methods).
Bayesian Network Approaches Bayesian Nets have recently proved to be a powerful method for SBR, since semantics are defined in terms of the dependencies between image features (nodes), their labels, and known states of what is being sensed. Inference is performed by propagation probabilities through the network. For example, Benitez et al. (2003) have developed MediaNet, a knowledge representation network and inference model for the retrieval of conceptually defined scene properties integrated with natural language processing. In a similar way, Hidden Markov Random Fields (HMRFs) have become a common class of image models for binding images with symbolic descriptions. In particular, Hierarchical Hidden Markov Random Fields (HHMRF) provide a powerful SBR representation. HHMRFs are defined over multi-scaled image pixel or features defined by Gaussian or Laplacian pyramids (Bouman & Shapiro, 1994). Each feature, or pixel, x , at a given scale is measured (observed) to evidence scene properties, states, s, corresponding to semantic entities such as ground, buildings, and so forth, as schematically illustrated in Figure 1. The relationships between states serves to define the grammar. The link between observations and states defines, in this approach, the image semantics. Accordingly, at each scale, l, we have a set of observations and states, where p(ol (x) /sl (x)) defines the dependency of the observation at scale, l, on the state of the world (scene). Specifically, the HHMRF assumes that the state at a pixel, x , is dependent on the states of its neighboring pixels at the same or neighboring levels of the pyramid. A simple example of the expressiveness of this model is a forestry scene. This could be an image region (label:
forest) dependent on a set of regions labeled tree at the next scale, which, in turn, are dependent on trunk, branches, and leaves labels at the next scale, and so forth. These labels have specific positional relations over scales, and the observations for each label are supported by their compatibilities and observations. Consequently, the SBR query for finding forestry images is translated into finding the posterior maximum likelihood (MAP) of labeling the image regions, given the forestry model. Using Bayes' rule, this reduces to the following optimization problem:

s*_l(x) ∝ argmax_S { p(s_l(x) / o_l(x)) ∏_{u,v} p(s_{l±v}(x ± u)) },

where l ± v corresponds to the states above and below level l of the hierarchy. In other words, the derived labeling MAP probability is equivalent to a probabilistic answer to the query if this image is a forestry scene. There are many approaches to approximate solutions to this problem, including relaxation labeling, expectation maximization (EM), loopy belief propagation, and the junction tree algorithm (see definitions in Terms and Definitions). All of these methods are concerned with optimal propagation of evidence over different layers of the graphical representation of the image, given the model and the observations, and all have their limitations. When the HHMRF model is approximated by a triangulated image state model, the junction tree algorithm is optimal. However, triangulating such hierarchical meshes is computationally expensive. On the other hand, the other approaches mentioned previously are not optimal, converging to local minima (Caetano & Caelli, 2004).

Figure 1. The hierarchical hidden Markov random field (HHMRF) model for image understanding. Here, the hidden state variables, X, at each scale are evidenced by observations, Y, at the same scale and by the state dependencies within and between levels of the hierarchy. The HHMRF is defined over pixels and/or feature graphs.

When the structural information in the query and image is defined in terms of features (i.e., regions or edge segments) and their relational attributes, again, HMRFs can be applied to image feature matching. In this case, the HMRF is defined over graphs that depict features and their relations. That is, consider two attributed graphs, G_s and G_x, representing the image and the query, respectively. We want to determine just how, if at all, the query (graph) structure is embedded somewhere in the image (graph). We define the HMRF over the query graph, G_x. A single node in G_x is defined by x_i, and one in the graph G_s by s_α. Each node in each graph has vertex and edge attributes, and the query corresponds to solving a subgraph isomorphism problem that involves the assignment of each x_i to a unique s_α, assuming that there is only one instance of the query structure embedded in the image, although this can be generalized. In this formulation, the HMRF model considers each node x_i in G_x as a random variable that can assume any of S possible values corresponding to the nodes of G_s.

• The Observation Component: Using HMRF formalities, the similarity (distance: dist) between vertex attributes of both graphs is consequently defined as the observation matrix model B_{iα} = p(y_{x_i} / x_i = s_α) = dist(y_{x_i}, y_{s_α}).

• The Markov Component: Here, we use the binary (relational) attributes to construct the compatibility functions between states of neighboring nodes. Assume that x_i and x_j are neighbors in the HMRF (being connected in G_x). Similar to the previously described unary attributes, we have A_{ji;βα} = p(x_j = s_β / x_i = s_α) = dist(y_{x_ij}, y_{s_αβ}).

• Optimization Problem and Solutions: Given this general HMRF formulation for graph matching, the optimal solution reduces to that of deriving a state vector s* = (s_1, .., s_T), where s_i ∈ G_s for each vertex x_i ∈ G_x, such that the MAP criterion is satisfied, given the model λ = (A, B) and the data: s* = argmax_{s_α .. s_ς} { p(x_1 = s_α, …, x_T = s_ς / λ) }.

Specifically, we introduce two sets of HMRF models for solving this problem.
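To make this HMRF graph-matching formulation easier to follow, here is a minimal sketch of how the observation matrix B and the compatibility terms A could be assembled for two small attributed graphs, together with a crude greedy assignment. Everything in it — the toy graphs, the exponential similarity standing in for the unspecified dist function, and the update loop — is an illustrative assumption, not the authors' implementation; the actual inference schemes they use are the ones discussed next.

```python
import numpy as np

# Hypothetical attributed graphs: vertex attribute vectors and edge attribute matrices.
# G_x is the query graph (T nodes), G_s is the image graph (S nodes).
y_x = np.array([[0.2], [0.8], [0.5]])            # query vertex attributes (T = 3)
y_s = np.array([[0.25], [0.75], [0.55], [0.1]])  # image vertex attributes (S = 4)
e_x = np.array([[0.0, 1.0, 0.0],                 # query edge attributes (e.g., distances)
                [1.0, 0.0, 2.0],
                [0.0, 2.0, 0.0]])
e_s = np.array([[0.0, 1.1, 0.0, 0.4],
                [1.1, 0.0, 1.9, 0.0],
                [0.0, 1.9, 0.0, 0.7],
                [0.4, 0.0, 0.7, 0.0]])

def sim(a, b, scale=1.0):
    # One possible choice of "dist": a similarity decaying with attribute difference.
    return float(np.exp(-np.sum(np.abs(np.asarray(a) - np.asarray(b))) / scale))

T, S = len(y_x), len(y_s)

# Observation model B[i, a] ~ p(y_xi / x_i = s_a): vertex-attribute similarity.
B = np.array([[sim(y_x[i], y_s[a]) for a in range(S)] for i in range(T)])

# Compatibility A[j, i, b, a] ~ p(x_j = s_b / x_i = s_a): edge-attribute similarity.
A = np.zeros((T, T, S, S))
for i in range(T):
    for j in range(T):
        for a in range(S):
            for b in range(S):
                A[j, i, b, a] = sim(e_x[j, i], e_s[b, a])

# A crude MAP heuristic: for each query node, pick the image node that maximizes
# observation evidence times compatibility with the current picks of its neighbors.
assign = list(B.argmax(axis=1))
for _ in range(10):
    for i in range(T):
        score = B[i].copy()
        for j in range(T):
            if j != i:
                score *= A[j, i, assign[j], :]
        assign[i] = int(score.argmax())
print("query node -> image node:", assign)
```

In a full HMRF treatment the same B and A matrices feed the relaxation-labeling or junction-tree inference described below, rather than this greedy update.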
First, probabilistic relaxation labeling (PRL) is a parallel iterative update method for deriving the most consistent labels, having the form

p^{t+1}(d_i) ∝ p(y_i / d_i) Σ_{d_∂i} p(d_i / d_∂i) p^t(d_∂i),   with d_∂i = d_{1,..,i−1,i+1,..,n},
where pt (d ∂i ) is typically factored into pair-wise conditionals. This algorithm is known to converge to local minima and is particularly sensitive to initial values. Albeit, this is an example of an inexact (suboptimal) model applied to the complete data model. The following models are the opposite—exact inference models over approximate data models—and so their optimality is critically dependent on how well the data model approximation is. Single Path Dynamic Programming (SPDP) consists of approximating the fully connected graph into a single chain, which traverses each vertex exactly once. This model allows us to use dynamic programming in order to find the optimal match between the graphs. In the second type of model, we improve over this simple chain model by proposing HMRFs with maximal cliques of size 3 and 4, which still retain the model feasibility for accomplishing optimal inference via junction tree (JT) methods (Lauritzen, 1996). These cliques encode more information than the single path models insofar as vertices are connected to 2 (JT3) and 3 (JT4) other neighboring vertices, as compared to only 1 (the previous) in the SPDP case. Such larger cliques, their associated vertex labels, then are used to derive the optimal labels for a given vertex. The JT algorithm is similar to dynamic programming involving a forward and backward pass for determining the most likely labels, given the evidence and neighborhood consistencies (Caetano & Caelli, 2004). Another recent approach to encoding and retrieving image structures in ways that can index their semantics is the method of Latent Semantic Indexing (LSI), as used in many other areas of data mining (Berry et al., 1995). Here, the basic idea is to use Principle Components Analysis (PCA) to determine the basic dimensionality of a relational data structure (i.e., items by attribute matrix), project, and cluster items, and to retrieve examples via proximity in the projection subspaces. For SBR and for images, what is needed is a form of LSI that functions on the relational attributes of image features, and recently, the LSI method has been extended to incorporate such structures. Here, relationships between features are encoded as a graph, and the LSI approach offers a different method for solving graph-matching problems. In this way, SBR consists of matching the query (a graph) with observed image data encoded also as a graph. The LSI approach, when applied to relational features, uses standard matrix models for graphs, such as the
adjacency matrix where relations are defined in terms of the matrix off-diagonal elements. These matrices are decomposed into their eigenvalues and eigenvectors, as G = PΛP ' ,
G = ΧT Χ
where Χ T Χ corresponds to the covariance matrix of the vertex and edge attributes. In the case where only adjacencies (1,0) are used, then G is simply the adjacency matrix. The core idea is that vertices are then projected into a selected eigen-subspace, and correspondences are determined by proximities in such spaces. The method, in different variations, has been explored by Scott and Longuet-Higgins (1991), Shapiro and Brady (1992), and Caelli and Kosinov (2004). Normalization is required when graphs are not of the same size. Sometimes, only the eigenspectra (i.e., list of eigenvalues) are used to compare graphs (Siddiqi et al., 1999; Luo & Hancock, 2001). Such spectral methods offer insightful ways to define image structures, which often can correspond to semantically corresponding properties between the query and the image. For example, comparing two different views of the same animal often can result in the predicted correspondences having semantically correct labels, as shown in Figure 2 and Table 1, even when only basic adjacencies (simply whether two features are connected or not, without attributes such as distances, angles, etc.) are used to encode the relational structures. Figure 2. Two sample shapes—boxer-10 and boxer24—and their shock graph representations (Siddiqi et al., 1999)
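A bare-bones version of this eigen-projection idea can be sketched as follows (illustrative only; the published methods add the normalization, sign handling, and clustering needed to make it reliable for graphs of different sizes). Vertices are embedded using the dominant eigenvectors of the adjacency matrix, weighted by their eigenvalues, and correspondences are read off as nearest neighbors in that shared subspace.

```python
import numpy as np

def eigen_embed(adj, k=2):
    """Project vertices into the subspace of the k dominant adjacency eigenvectors."""
    vals, vecs = np.linalg.eigh(adj)              # symmetric 0/1 adjacency -> real spectrum
    order = np.argsort(-np.abs(vals))[:k]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))   # weight axes by eigenvalue

def correspondences(adj_query, adj_image, k=2):
    """Pair each query vertex with the nearest image vertex in eigen-space."""
    eq, ei = eigen_embed(adj_query, k), eigen_embed(adj_image, k)
    # NOTE: eigenvector sign/order ambiguities and unequal graph sizes are ignored here;
    # published variants add renormalization and clustering to cope with them.
    d = np.linalg.norm(eq[:, None, :] - ei[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy example: a small tree-like "skeleton" graph and the same graph with its
# vertices listed in a different order.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 0, 1, 0]], float)
perm = [2, 0, 4, 1, 3]
B = A[np.ix_(perm, perm)]
# Proposed correspondence for each vertex of A; vertices related by a graph symmetry
# (such as the two leaves attached to the same node) may be interchanged.
print(correspondences(A, B))
```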
Table 1. Cluster decomposition obtained from correspondence clustering of the two shock graphs for shapes BOXER-10 and BOXER-24. Each cluster enumerates the vertices from the two graphs that are grouped together and lists in brackets the part of the body from which the vertex approximately comes. The percentage of variance explained by the first three dimensions for shapes Boxer-10 and Boxer-24 was 32% and 33%, respectively (Caelli & Kosinov, 2004).
FUTURE TRENDS As can be seen from the previous discussion, SBR requires the development of new methods for relational queries, ideally to pose such querying as an optimization problem that results in a degree of match that, in some sense, is derived optimally rather than by using heuristic methods. Here, we have discussed briefly two such methods: one based on Bayesian methods and the other based on matrix methods. What is clearly needed is an integration of both approaches in order to capitalize on the benefits of both views. Recent developments in both deterministic and stochastic kernel methods (Schölkopf & Smola, 2002) may provide the basis for such an integration.

CONCLUSION Image data mining is still an emerging area where ultimate success and use can only come about if we can solve the difficult problem of querying images and searching for content as humans do. This must involve the creation of new SBR algorithms that clearly are based on the ability to support relational queries in an optimal and robust way. Machine learning, Bayesian networks, and even new uses of matrix algebra all open up new possibilities. Unless computers can store, evaluate, and interpret images the way we do, image data warehousing and mining will remain a doubtful and certainly labor-intensive technology with limited use without significant human intervention.
REFERENCES Benitez, A., & Chang, S. (2002). Multimedia knowledge interrelations: Summarization and evaluation. Proceedings of the MDM/KDD-2002, Edmonton, Alberta, Canada. Berry, M., Dumais, S., & O’Bien, G. (1994). SIAM Review, 37(4), 573-595. Bouman, C., & Shapiro, M. (1994). A multiscale random field model for Bayesian image segmentation. IEEE: Transactions On Image Processing, 3(2), 162-177. Caelli, T., Cheng, L., & Feng, Q. (2003). A Bayesian approach to image understanding: From images to virtual forests. Proceedings of the 16 th International Conference on Vision Interface (pp. 1-12), Halifax, Nova Scotia, Canada.
Caelli, T., & Kosinov, S. (2004). Inexact graph matching using eigenspace projection clustering. International Journal of Pattern Recognition and Artificial Intelligence, 18(3), 329-354. Caetano, T., & Caelli, T. (2004). Graphical models for graph matching. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition: CVPR (pp. 466-473). Corridoni, J., Del Bimbo, A., & Pala, P. (1999). Retrieval of paintings using effects induced by colour features. IEEE Multimedia, 6(3), 38-53. Lauritzen, S. (1996). Graphical models. Oxford: Clarendon Press. Lay, J., & Guan, L. (2004). Retrieval for color artistry concepts. IEEE Transactions on Image Processing, 13(3), 326-339. Luo, B., & Hancock, E. (2001). Structural graph matching using the EM algorithm and singular value decomposition. IEEE Transaction of Pattern Analysis and Machine Intelligence, 23(10), 1120-1136. Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Scott, G., & Longuet-Higgins, H. (1991). An algorithm for associating the features of two patterns. Proceedings of the Royal Society of London. Shapiro, L., & Brady, J. (1992). Feature-based correspondence—An eigenvector approach. Image and Vision Computing, 10, 268-281. Siddiqi, K., Shokoufandeh, P., Dickinson, S., & Zucker, S. (1999). Shock graphs and shape matching. International Journal of Computer Vision, 30, 1-24.
KEY TERMS Attributed Graph: Graphs whose vertices and edges have attributes typically defined by vectors and matrices, respectively. Bayesian Networks: A graphical model defining the dependencies between random variables. Dynamic Programming: A method for deriving the optimal path through a mesh. For hidden Markov models (HMMs), it is also termed the Viterbi algorithm and involves a method for deriving the optimal state sequence, given a model and an observation sequence. Expectation Maximization: Process of updating model parameters from new data where the new parameter values constitute maximum posterior probability estimates. Typically used for mixture models. Grammars: A set of formal rules that define how to perform inference over a dictionary of terms. Graph: A set of vertices (nodes) connected in various ways by edges. Graph Spectra: The plot of the eigenvalues of the graph adjacency matrix.
Hidden Markov Random Field (HMRF): A Markov Random Field with additional observation variables at each node, whose values are dependent on the node states. Additional pyramids of MRFs defined over the HMRF give rise to hierarchical HMRFs (HHMRFs). Image Features: Discrete properties of images that can be local or global. Examples of local features include edges, contours, textures, and regions. Examples of global features include color histograms and Fourier components. Image Understanding: The process of interpreting images in terms of what is being sensed. Junction Tree Algorithm: A two-pass method for updating probabilities in Bayesian networks. For triangulated networks the inference procedure is optimal. Loopy Belief Propagation: A parallel method for update beliefs or probabilities of states of random variables in a Bayesian network. It is a second-order extension of probabilistic relaxation labeling. Markov Random Field (MRF): A set of random variables defined over a graph, where dependencies between variables (nodes) are defined by local cliques. Photogrammetry: The science or art of obtaining reliable measurements or information from images. Relaxation Labeling: A parallel method for updating beliefs or probabilities of states of random variables in a Bayesian network. Node probabilities are updated in terms of their consistencies with neighboring nodes and the current evidence. Shape-From-X: The process of inferring surface depth information from image features such as stereo, motion, shading, and perspective.
Mining Microarray Data Nanxiang Ge Aventis, USA Li Liu Aventis, USA
INTRODUCTION During the last 10 years and in particularly within the last few years, there has been a data explosion associated with the completion of the human genome project (HGP) (IHGMC and Venter et al., 2001) in 2001 and the many sophisticated genomics technologies. The human genome (and genome from other species) now provides an enormous amount of data waiting to be transformed into useful information and scientific knowledge. The availability of genome sequence data also sparks the development of many new technology platforms. Among the available different technology platforms, microarray is one of the technologies that is becoming more and more mature and has been widely used as a tool for scientific discovery. The major application of microarray is for simultaneously measuring the expression level of thousands of genes in the cell. It has been widely used in drug discovery and starts to impact the drug development process. The mining microarray database is one of the many challenges facing the field of bioinformatics, computational biology and biostatistics. We review the issues in microarray data mining.
BACKGROUND The quantity of data generated from a microarray experiment is usually large. Besides the actual gene expression measurement, there are many other data available such as the corresponding sequence information, gene’s chromosome location information, gene ontology classification and gene pathway information. Management of such data is a very challenging issue for bioinformatics. The following is a list of requirements for data management. •
Data Organization: different microarray platforms generate data with different formats, different gene identifiers, etc. Sequence data, gene ontology data, and gene pathway information all with diverse for-
mats. Sensible organization of such diverse data types will ease the data mining process.
• Data Standards: the microarray data analysis community has developed a standard for microarray data: the Minimum Information About a Microarray Experiment (MIAME: http://www.mged.org). The MIAME standard is needed to enable the consistent interpretation of experiment results and potentially to reproduce the experiment.
MAIN THRUST Mining microarray data is a very challenging task. The raw data from microarray experiments usually comes in image format. Images are then quantified using image analysis software. Such quantified data are then subject to three steps of analysis:
• Pre-processing microarray data
• Mining microarray data
• Joint mining of microarray data and sequence databases
We review the data analysis methods in these three aspects. We focus our discussion on Affymetrix (http:/ /www.affymetrix.com) technology, but many of the methods are applicable to data from other platforms.
Pre-Processing Microarray Data While effectively managing microarray and related data is an important first step, microarray data have to be preprocessed so that downstream analysis can be performed. Array-to-array and sample-to-sample variations are the main reason for the requirement of pre-processing. Typically, pre-processing involves the following four steps: •
Image Analysis: image analysis in a microarray experiment involves gridding of the image, signal extraction, and background adjustment. Affymetrix's
microarray analysis suite (MAS) provides the quantification software.
• Data Normalization: normalization is a step necessary to remove systematic array-to-array variations. Different normalization methods have been proposed. The cyclic loess method (Dudoit et al., 2002), the quantile normalization method, and the contrast-based method normalize the probe-level data; the scaling method and the non-linear method normalize expression-intensity-level data. Bolstad, Irizarry, Astrand, and Speed (2003) provide a comparison of all these methods and suggest that simple quantile normalization performs relatively stably.
• Estimation of Expression Intensities: different methods to summarize expression intensity from probe-level data have appeared in the literature. Among them are the model-based expression index (MBEI) proposed by Li and Wong (2001), Irizarry's (2003) robust multi-array (RMA) method, and the method provided in the Affymetrix MAS5 software.
• Data Transformation: data transformation is a very important step, and it enables the data to fit many of the assumptions behind statistical methods. The log transformation and the glog transformation (Durbin, Hardin, Hawkins, & Rocke, 2002) are the two commonly used methods.
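As a concrete illustration of the normalization step above, the following NumPy sketch performs a simple quantile normalization on a toy genes-by-arrays matrix (a schematic reading of the idea, not the Bolstad et al. code): every array is forced to share the same empirical intensity distribution by replacing each value with the mean, across arrays, of the values having the same rank.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize a genes-by-arrays matrix x (rows: probes/genes, cols: arrays)."""
    ranks = x.argsort(axis=0).argsort(axis=0)    # rank of each value within its array
    sorted_cols = np.sort(x, axis=0)             # each array sorted independently
    reference = sorted_cols.mean(axis=1)         # mean distribution across arrays
    return reference[ranks]                      # map each rank back to the reference

# Toy expression matrix: 6 genes on 3 arrays with different overall intensity scales.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=5.0, sigma=1.0, size=(6, 3)) * np.array([1.0, 2.0, 0.5])
norm = quantile_normalize(np.log2(raw))          # log2 transform, then normalize
print(np.round(norm, 2))
# After normalization every column has identical sorted values (identical distributions).
```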
Mining Microarray Data In principal, the current data mining activities in microarray data can be grouped into two types of studies: unsupervised and supervised. Unsupervised analysis has been used widely for mining microarray experiment. Cluster analysis has been the dominant method for unsupervised mining. Examples of unsupervised data mining: •
Eisen, Spellman, Brown, and Botstein (1998) studied the gene expression of budding yeast Saccharomyces cerevisiae spotted on cDNA microarrays during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks. Hierarchical clustering was applied to this gene expression data, and the result was represented by a tree whose branch lengths reflected the degree of similarity between genes, which was assessed by a pair-wise similarity function. The computed tree was then used to order genes in the original data table, and genes with similar expression pattern were grouped together. The ordered gene expression table can be displayed graphically in a colored image, where cells with log ratios of 0’s were colored black, cells with positive log
ratios were colored red, cells with negative log ratios were colored green, and the intensities of reds or greens were proportional to the absolute values of the log ratios. The clustering analysis efficiently grouped genes with similar functions together, and the colored image provided an overall pattern in the data. Clustering analysis can also help us understand the novel genes if they were coexpressed with genes with known functions. Standard clustering analysis, such as hierarchical clustering, k-means clustering, self-organizing maps, are very useful in mining the microarray data. However, these data tables are often corrupted with extreme values (outliers), missing values, and non-normal distributions that preclude standard analysis. Liu, Hawkins, Ghosh, and Young (2003) proposed a robust analysis method, called rSVD (Robust Singular Value Decomposition), to address these problems. The method applies a combination of mathematical and statistical methods to progressively take the data set apart so that different aspects can be examined for both general patterns and for very specific effects. The benefits of this robust analysis will be both the understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws. The method requires a single pass, and does not resort to complex “cleaning” or imputation of the data table before analysis. The method rSVD was applied to a micro array data, revealed different aspects of the data, and gave some interesting findings. Examples for supervised data mining:
•
Golub et al. (1999) studied the gene expression of two types of acute leukemias, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), and demonstrated the feasibility of cancer prediction based on gene expression data. The data comes from Affymetrix arrays with 6817 genes, and consists of 47 cases of ALL and 25 cases of AML. 38 samples (27 ALL, 11 AML) were used as training data. A set of 50 genes with the highest correlations with an “idealized expression pattern” vector, where the expression level is uniformly high for AML and uniformly low for ALL, were selected. The prediction of a new sample was based on “weighted votes” of these 50 genes. The method made strong prediction for 29 of the 34 test samples, and the accuracy was 100%. Golub’s method is actually a minor variant of the maximum likelihood linear discriminate analysis for two 811
classes. Instead of using variances in computing weights, Golub's method uses standard deviations.
• Dudoit et al. (2002) compared the prediction performance of a variety of classification methods, such as linear/quadratic discriminant analysis, the nearest neighbor method, classification trees, and aggregated classifiers (bagging and boosting), on three microarray datasets: lymphoma data (Alizadeh et al.), leukemia data (Golub et al.), and NCI60 data (Ross et al.). Based on their comparisons, the rankings of the classifiers were similar across datasets, and the main conclusion, for the three datasets, is that simple classifiers such as diagonal linear discriminant analysis and nearest neighbors perform remarkably well compared to more sophisticated methods such as aggregated classification trees.
• Dimension reduction techniques, such as principal component analysis (PCA) and partial least squares (PLS), can be used to reduce the dimension of the microarray data before a certain classifier is used. For example, Nguyen et al. (2002) proposed an analysis procedure to predict tumor samples. The procedure first reduces the dimension of the microarray data using PLS and then applies logistic discrimination (LD) or quadratic discriminant analysis (QDA). See also West et al. (2001) and Huang et al. (2003).
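The kind of simple classifier that fares well in these comparisons is easy to sketch. The snippet below — toy data, invented labels, and a plain nearest-centroid rule standing in for diagonal linear discriminant analysis — standardizes each gene and assigns a held-out sample to the class whose mean expression profile is closest.

```python
import numpy as np

def train_centroids(x, y):
    """x: samples-by-genes matrix, y: class labels. Returns per-class mean profiles."""
    return {label: x[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x_new):
    labels = list(centroids)
    dists = [np.linalg.norm(x_new - centroids[k], axis=1) for k in labels]
    return np.array(labels)[np.argmin(np.vstack(dists), axis=0)]

# Toy two-class expression data: 20 "ALL-like" and 20 "AML-like" samples, 50 genes,
# with a handful of genes shifted between classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 50))
y = np.array(["ALL"] * 20 + ["AML"] * 20)
x[y == "AML", :5] += 2.0                          # 5 informative genes

# Standardize each gene (toy shortcut: statistics computed on all samples).
x = (x - x.mean(axis=0)) / x.std(axis=0)

centroids = train_centroids(x[::2], y[::2])       # train on even-indexed samples
print((predict(centroids, x[1::2]) == y[1::2]).mean())   # hold-out accuracy
```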
Joint Data Mining of Microarray, Gene Ontology and TFBS Data Combining microarray data and other available data such as DNA sequence data, gene ontology data has attracted lots of attention in recent years. From a system biology view, combining different data types provides added power than looking microarray data alone. This is still a very active research area for data mining and we review some of the current work done in two aspects: combining with gene ontology database and combining with (TFBS) database. •
812
Combining microarray data with gene ontology data is important in interpreting the microarray data. Expression Analysis Systematic Explorer (EASE, http://david.niaid.nih.gov/david/ease.htm) is developed by Hosack, Dennis, Sherman, Lane, and Lempicki (2003), and is a customizable, standalone software application that facilitates the biological interpretation of gene lists derived from the results of microarray, proteomics, and SAGE experiments. EASE can generate the annotations of a list of genes in one shot, automatically link to online analysis tools, and provide statistical methods for
•
discovering enriched biological themes within gene lists. Blalock et al. (2004) studied the gene expression of Alzheimer’s disease (AD), and found some interesting over-represented gene categories among the regulated genes using EASE. The findings suggest a new model of AD pathogenesis in which a genomically orchestrated up-regulation of tumor suppressor-mediated differentiation and involution processes induces the spread of pathology along myelinated axons. Transcriptional factor binding factors are essential elements in the gene regulatory networks. Directly experimental identification of such transcriptional factors is not practical or efficient in many situations. Conlon, Liu, Lieb, and Liu (2003) provided one method based on least regression to integrating microarray data and transcriptional factor binding sites (TFBS) patterns. Curran, Liu, Long, and Ge (2004) provided a logistic regression approach to joint mining an internal microarray database and a corresponding TFBS database (http://transfac.gbf.de/).
FUTURE TRENDS As more and more new technologies platform are introduced for medical research, it is imaginable that data will continue to grow. For example, Affymetrix recently introduced its SNP chip which contains 100,000 human SNPs. If such a technology were applied to a clinical trial with 10000 subjects, the SNP data alone will be a 10000 by 100000 table in addition to data from potential other technologies such as proteomics, metabonomics and bio-imaging. Associated with the growth of data will be the increasing need for effective data management and data integration. More efficient data retrieval system will be needed as well as system that can accommodate large scale and diversified data. Besides the growth of data management system, it is foreseeable that integrated data analysis will become more and more routine. At this point, most of the data analysis software is only capable of analyzing small scale, isolated data sets. So, the challenges to the informatics field and statistics field will continue to grow dramatically.
CONCLUSION Microarray technology has generated vast amount of data for very interesting data mining research. How-
ever, as the central dogma indicates, microarray only provides one snapshot of the biological system. Sequencing, proteomics technology, metabolite profiling provide different view about the biological system and it will remain a challenge to deal with the joint data mining of data from such diverse technology platform. There are significant challenges with respect to development of data mining technology to deal with data generated from different technology platform. The challenges to deal with the combined data will be even more difficult. Besides methodological challenges, how to organize the data, how to develop standards for data generated from different platforms and how to develop common terminologies, all remain to be answered.
REFERENCES Alizadeh, A.A. et al. (2000). Different types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403(6769), 503-511. Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R., & Landfield, P.W. (2004). Incipient Alzheimer’s disease: Microarray correlation analyses reveal major transcriptional and tumor suppressor responses. PNAS, 101(7), 2173-8. Bolstad, B.M., Irizarry, R.A., Astrand, M., & Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics, 19(2), 185-193.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., & Rocke, D.M. (2002). A variance-stabilizing transformation of gene expression microarray data. Bioinformatics, 18(Supplement 1), S105-S110. Eisen, M.B., Spellman, P.T., Brown, P.O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95, 14863-14868. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene Expression monitoring. Science, 286(5439), 531-537. Hosack, D.A., Dennis, G. Jr., Sherman, B.T., Lane, H.C., & Lempicki, R.A. (2003). Identifying biological themes within lists of genes with EASE. Genome Biology, 4(9), R60. Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou, M.H., Horng, C.F. et al. (2003). Gene expression predictors of breast cancer outcomes. Lancet, 361, 1590-1596. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4(2), 249-264. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., & Pavlidis, P. (2004). Coexpression analysis of human genes across many microarray data sets. Genome Res, 14(6), 1085-1094. Li, C., & Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. PNAS, 98(1), 31-36.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O. et al. (1998). The transcriptional program of sporulation in budding yeast. Science, 282(5389), 699705.
Liu, L., Hawkins, D.M., Ghosh, S., & Young, S.S. (2003). Robust singular value decomposition analysis of microarray data. PNAS, 100(23), 13167-13172.
Conlon, E.M., Liu, X.S., Lieb, J.D., & Liu, J.S. (2003). Integrating regulatory motif discovery and genomewide expression analysis. PNAS, 100(6), 3339-3344.
The International Human Genome Mapping Consortium (IHGMC). (2001). A physical map of the human genome. Nature, 409(6822), 934-941.
Curran, M., Liu, H., Long, F., & Ge, N. (2003). Statistical methods for joint data mining of gene expression and DNA sequence database. SIGKDD Explorations Special Issue on Microarray Data Mining, 5, 122-129.
Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24, 227-234.
Dudoit, S., Fridlyand, J., & Speed, T.P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77-87.
Venter, J. et al. (2001). The human genome. Science, 291(5507), 1304-51.
Dudoit, S., Yang, Y.H., Callow, M.J., & Speed, T.P. (2002). Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Statistical Sinica, 12(1), 111-139.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., et al. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. PNAS, 98, 11462-11467.
KEY TERMS Bioinformatics: The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research Gene Ontology (GO): GO is three structured, controlled vocabularies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner (http://www.geneontology.org ). Metabonomics: The evaluation of tissues and biological fluids for changes in metabolite levels that result from toxicant-induced exposure.
MIAME: Minimum Information About a Microarray Experiment. It is a standard for data from a microarray experiment. Normalization: A process to remove systematic variation in microarray experiment. Examples of such systematic variation includes between array variation, dye bias, etc. Proteomics: is the large-scale study of proteins, particularly their structures and functions. SNP: Single Nucleotide Polymorphism. The same sequence from two individuals can often be single base pair changes. Such change can be useful genetic markers and might explain why certain individual respond to certain drug while others don’t.
Mining Quantitative and Fuzzy Association Rules Hong Shen Japan Advanced Institute of Science and Technology, Japan Susumu Horiguchi Tohoku University, Japan
INTRODUCTION The problem of mining association rules from databases was introduced by Agrawal, Imielinski, & Swami (1993). In this problem, we give a set of items and a large collection of transactions, which are subsets (baskets) of these items. The task is to find relationships between the occurrences of various items within those baskets. Mining association rules has been a central task of data mining, which is a recent research focus in database systems and machine learning and shows interesting applications in various fields, including information management, query processing, and process control. When items contain quantitative and categorical values, association rules are called quantitative association rules. For example, a quantitative association rule derived from a regional household living standard investigation database has the following form:
age ∈ [50,55] ∧ married → house. Here, the first item, age, is numeric, and the second item is categorical. Categorical attributes can be converted to numerical attributes in a straightforward way by enumerating all categorical values and mapping them to numerical values. An association rule becomes a fuzzy association rule if its items contain probabilistic values that are defined by fuzzy sets.
BACKGROUND Many results on mining quantitative association rules can be found in Li, Shen, and Topor (1999) and Miller and Yang (1997). A common method for quantitative association rule mining is to map numerical attributes to binary attributes, then use algorithms of binary association rule mining. A popular technique for mapping numerical attributes is to attribute discretization that converts a con-
tinuous attribute value range to a set of discrete intervals and then map all the values in each interval to an item of binary values (Dougherty, Kohavi, & Sahami, 1995). Two classical discretization methods are equal-width discretization, which divides the attribute value range into N intervals of equal width without considering the population (number of instances) within each interval, and equal-depth (or equal-cardinality) discretization, which divides the attribute value range into N intervals of equal populations without considering the similarity of instances within each interval. Examples of using these methods are given in Fukuda, Morimoto, Morishita, and Tokuyama (1996); Miller and Yang (1997); and Srikant and Agrawal (1996). To overcome the problems of sharp boundaries (Gyenesei, 2001) and expressiveness (Kuok, Fu, & Wong, 1998) in traditional discretization methods, methods for mining fuzzy association rules were suggested. Many traditional algorithms, such as Apriori, MAFIA, CLOSET, and CHARM, were employed to discover fuzzy association rules. However, the number of fuzzy attributes is usually at least double the number of attributes; therefore, these algorithms require huge computational times. To reduce the computation cost of association mining, various parallel algorithms based on count, data, and candidate distribution, along with other suitable strategies, were suggested (Han, Karypis, & Kumar, 1997; Shen, Liang, & Ng, 1999; Shen, 1999a; Zaki, Parthasarathy, & Ogihara, 2001). Recently, a parallel algorithm for mining fuzzy association rules, which divides the set of fuzzy attributes into independent partitions based on the natural independence among fuzzy sets defined by the same attribute, was proposed (Phan & Horiguchi, 2004b).
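The two classical discretization schemes are easy to state in code. The following sketch (illustrative, in NumPy) contrasts them on the same skewed attribute: equal-width ignores how densely each interval is populated, while equal-depth ignores how far apart the values inside an interval are.

```python
import numpy as np

def equal_width_edges(values, n_intervals):
    """Split the value range into n_intervals of equal length."""
    lo, hi = values.min(), values.max()
    return np.linspace(lo, hi, n_intervals + 1)

def equal_depth_edges(values, n_intervals):
    """Split so that each interval holds (roughly) the same number of instances."""
    qs = np.linspace(0, 100, n_intervals + 1)
    return np.percentile(values, qs)

ages = np.array([18, 19, 20, 21, 22, 23, 24, 25, 40, 65])   # skewed attribute
for name, edges in [("equal-width", equal_width_edges(ages, 3)),
                    ("equal-depth", equal_depth_edges(ages, 3))]:
    # np.digitize maps each value to the interval (item) it falls into.
    items = np.digitize(ages, edges[1:-1], right=True)
    print(name, np.round(edges, 1), items)
```

The adaptive and linear-scan merging methods described in the next section aim at a compromise, bounding how far the values inside an interval may stray from its center while keeping the interval populations balanced.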
MAIN THRUST This article introduces recent results of our research on this topic in two prospects: mining quantitative association rules and mining fuzzy association rules.
Mining Quantitative Association Rules One method that was proposed is an adaptive numerical value discretization method that considers both value density and value distance of numerical attributes and produces better quality intervals than the two classical methods (Li et al., 1999). This method repeatedly selects a pair of adjacent intervals that have the minimum difference to merge until a given criterion is met. It requires quadratic time on the number of attribute values, because each interval initially contains only a single value, and all the intervals may be merged into one large interval containing all the values in the worst case. A method of linearscan merging for quantizing numeric attribute values that can be implemented in linear time sequentially and linear cost in parallel was also proposed (Shen, 2001). This method takes into consideration the maximal intrainterval distance and load balancing simultaneously to improve the quality of merging. In comparison with the existing results in the same category, the algorithm achieves a linear time speedup. Suppose that a numerical attribute has m distinct values, I = {x0 , x1 ,...x m−1 } , where attribute value xi has ni
occurrences in the database (weight). Let N = Σ_{i=0}^{m−1} n_i be the total number of attribute value occurrences, called instances. Without loss of generality, we further assume that x_i < x_{i+1} for all 0 ≤ i < m−1. An interval I_u = [X_u : c_u] in a partition P of I contains the attribute values X_u = {x_u, x_{u+1}, …, x_{v−1}} and N_u = Σ_{i=u}^{v−1} n_i instances, where v is the index of the next interval after I_u in P and 0 ≤ u < v ≤ m, and c_u is its representative center. The maximal intrainterval distance that would result from merging I_u with the next interval (with centers c_u and c_v) is D*(c_u, c_v) = max_i |x_i − c_u|, the maximum being taken over the attribute values of the two intervals.

Assume that two adjacent intervals, I_u = {x_u, …, x_{v−1}} and I_v = {x_v, …, x_{w−1}}, contain N_u = Σ_{i=u}^{v−1} n_i and N_v = Σ_{i=v}^{w−1} n_i attribute value occurrences and have representative centers c_u = Σ_{i=u}^{v−1} x_i n_i / N_u and c_v = Σ_{i=v}^{w−1} x_i n_i / N_v, respectively, with 0 ≤ u < v < w ≤ m. The merged interval I'_u = I_u ∪ I_v of the two intervals, containing (v−u) + (w−v) = w−u attribute values and N_u + N_v = Σ_{i=u}^{w−1} n_i instances in total, thus has its representative center given by the weighted average of (c_u, N_u) and (c_v, N_v):

c'_u = (c_u Σ_{i=u}^{v−1} n_i + c_v Σ_{i=v}^{w−1} n_i) / Σ_{i=u}^{w−1} n_i = (c_u N_u + c_v N_v) / (N_u + N_v).

An optimal interval merge scheme produces a minimum number of intervals whose maximal intrainterval distances are each within a given threshold and whose populations are as equal as possible. Assume that the threshold for the maximal intrainterval distance is d, which can be the average interinterval difference of all the adjacent interval pairs or can be given by the system. For k intervals, let the average population (support) of each interval be N̄_k = N/k and the population deviation of interval I_u be Δ_u = |N_u − N̄_k|, where N_u is the actual population of I_u. Initially, I_u = {x_u} for 0 ≤ u ≤ m−1, and the merge proceeds in two steps:

1. Partition {I_0, I_1, …, I_{m−1}} into a minimum number of intervals such that each interval has a maximal intrainterval distance not greater than d.
2. Assume that Step 1 produces k intervals {I_{u_0}, I_{u_1}, …, I_{u_{k−1}}} with 0 = u_0 < u_1 < … < u_{k−1} < m−1. For 0 ≤ j < k−1, where I_{u_j} = [X_{u_j} : c_{u_j}], X_{u_j} = {x_{u_j}, x_{u_j+1}, …, x_{u_{j+1}−1}}, and c_{u_j} is the representative center of I_{u_j}, check to see if moving x_{u_{j+1}−1} to I_{u_{j+1}} will result in a better load balance while preserving the maximal intrainterval distance property, and do so if it will.

Noticing that x_0 < x_1 < … < x_{m−1}, we can implement Step 1 simply by using a linear scan to form appropriate segments of intervals after a single pass. Starting from I_0, merge I_u with I_{u+j} for j = 1, 2, …, until the next merge would result in I_u's maximal intrainterval distance being greater than the threshold; continue this process until no interval to be merged remains. This process requires time O(m). Step 2 examines every adjacent pair of intervals after the merge, requiring, at most, m−1 steps. Each step checks the change of population deviation caused by moving the n_{u_{j+1}−1} instances of x_{u_{j+1}−1} from I_{u_j} to I_{u_{j+1}}. That is, it considers whether the following condition holds:

|Δ_{u_j} − Δ⁻_{u_j}| < |Δ_{u_{j+1}} − Δ⁺_{u_{j+1}}|,

where Δ⁻_{u_j} = |N_{u_j} − n_{u_{j+1}−1} − N̄_k| and Δ⁺_{u_{j+1}} = |N_{u_{j+1}} + n_{u_{j+1}−1} − N̄_k|.
Do the move only when the condition holds. Clearly, this step requires O(m) time as well. It was shown that the above linear-scan implementation indeed produces the minimum number of intervals satisfying the intrainterval distance property.
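The two steps can be read together in the following Python sketch, which is an illustrative rendering rather than the published implementation: Step 1 grows each interval left to right until adding the next value would push the maximal distance from the running weighted center beyond the threshold d, and Step 2 then tries to shift a single boundary value into the next interval when that evens out the populations without violating the distance bound. The balance test used here simply asks whether the total deviation from the average population decreases, a simplification of the condition stated above.

```python
def linear_scan_merge(values, weights, d):
    """values: sorted distinct attribute values; weights: their occurrence counts."""
    # Step 1: single left-to-right pass, bounding the maximal intra-interval distance.
    intervals = []                       # each interval is a list of indices into values
    start = 0
    for i in range(1, len(values) + 1):
        done = i == len(values)
        if not done:
            block = list(range(start, i + 1))
            w = sum(weights[j] for j in block)
            center = sum(values[j] * weights[j] for j in block) / w
            too_wide = max(abs(values[j] - center) for j in block) > d
        if done or too_wide:
            intervals.append(list(range(start, i)))
            start = i

    # Step 2: try moving the last value of each interval into the next one whenever
    # that improves load balance and keeps the distance bound.
    total = sum(weights)
    target = total / len(intervals)

    def pop(iv):
        return sum(weights[t] for t in iv)

    def width_ok(iv):
        c = sum(values[t] * weights[t] for t in iv) / pop(iv)
        return max(abs(values[t] - c) for t in iv) <= d

    for j in range(len(intervals) - 1):
        left, right = intervals[j], intervals[j + 1]
        if len(left) < 2:
            continue
        new_left, new_right = left[:-1], [left[-1]] + right
        old_dev = abs(pop(left) - target) + abs(pop(right) - target)
        new_dev = abs(pop(new_left) - target) + abs(pop(new_right) - target)
        if new_dev < old_dev and width_ok(new_left) and width_ok(new_right):
            intervals[j], intervals[j + 1] = new_left, new_right

    return [[values[t] for t in iv] for iv in intervals]

print(linear_scan_merge([1, 2, 3, 10, 11, 12, 30], [5, 1, 1, 1, 1, 5, 2], d=2.0))
```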
Mining Fuzzy Association Rules Let D be a relational database with I ={i1, …, in} as the set of n attributes and T={t1, …,tm} as the set of m records. Using the fuzzification method, we associate each attribute, iu, with a set of fuzzy sets: Fiu = { f iu1 ,..., f iuk } . A fuzzy association rule (Gyenesei, 2001; Kuok et al., 1998) is an implication as follows:
X is A ⇒ Y is B, where X, Y ⊆ I are disjoint frequent itemsets, and A and B are sets of disjoint fuzzy sets corresponding to the attributes in X and Y: A = {f_{x_1}, …, f_{x_p}}, B = {f_{y_1}, …, f_{y_q}}, with f_{x_i} ∈ F_{x_i} and f_{y_j} ∈ F_{y_j}. A fuzzy itemset is now defined as a pair <X, A> of an itemset X and a set A of fuzzy sets associated with the attributes in X. The support factor, denoted fs(<X, A>), is determined by the formula

fs(<X, A>) = Σ_{v=1}^{m} { α_{x_1}(t_v x_1) ⊗ … ⊗ α_{x_p}(t_v x_p) } / |T|,
where X ={x1,…, xp} and tv is the vth record in T; ⊗ is the Tnorm operator in fuzzy logic theory; α x (tv xu ) is calculated as u
α_{x_u}(t_v x_u) = m_{x_u}(t_v x_u) if m_{x_u}(t_v x_u) ≥ w_{x_u}, and 0 otherwise,
where m_{x_u} is the membership function of the fuzzy set f_{x_u}, and w_{x_u} is a user-specified threshold on the membership function m_{x_u}. Each fuzzy attribute is a pair of the original attribute name accompanied by the fuzzy set name. We perceive that any fuzzy association rule never contains two fuzzy attributes sharing a common original attribute in I. For example, consider the rule “Age_Old ⇒ Cholesterol_High”.
There are two main reasons for this assumption. First, fuzzy attributes sharing a common original attribute are usually mutually exclusive in meaning, so they would largely reduce the support of rules in which they are contained together. Second, such a rule would not be worthwhile and would carry little meaning. Hence, we can conclude that all fuzzy attributes in the same rule are independent with respect to the fact that no pair of fuzzy attributes whose original attributes are identical exists. This observation is the foundation of our new parallel algorithm. The idea of partitioning algorithm is to divide the original set of fuzzy attributes into separated parts (each for a processor), so that every part retains at least one fuzzy attribute for each original attribute. To do so, we divide according to several original attributes, so that the number of fuzzy attributes reduced at each processor is maximized. After dividing the set of all the fuzzy attributes for parallel processors, we can use any traditional algorithms, such as Apriori, CHARM, and so forth, to mine local association rules. Finally, the local results at processors are gathered to constitute the overall result. Space limitations prevent a detailed discussion of the partitioning algorithm (FDivision) and the parallel algorithm (PFAR) (Phan & Horiguchi, 2004b) The PFAR algorithm was implemented using MPI standard on the SP2 parallel system, which has a total of 64 nodes. Each node consists of four 322MHz PowerPC 604e processors, 512 MB local memory, and 9.1GB local disks. In the experiments, we used only 24 processors (PEs) on 24 nodes. The testing data included synthetical data and real world databases of heart disease (created by George John, 1994, [email protected], [email protected]). The experimental results showed that the performance of PFAR is satisfactory (Phan & Horiguchi, 2004a).
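To ground the support definition given earlier, here is a small, self-contained sketch that computes the fuzzy support of a two-attribute fuzzy itemset. The toy records, the triangular membership functions, and the choice of min as the T-norm ⊗ are all illustrative assumptions.

```python
def triangular(a, b, c):
    """Return a triangular membership function peaking at b on [a, c]."""
    def m(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return m

def alpha(m, value, w):
    """Thresholded membership: keep m(value) only if it reaches the threshold w."""
    v = m(value)
    return v if v >= w else 0.0

def fuzzy_support(records, fuzzy_items, w=0.1):
    """fuzzy_items: list of (attribute_name, membership_function) pairs.
    Uses min as the T-norm, so a record supports the itemset only as much as
    its weakest membership degree."""
    total = 0.0
    for rec in records:
        degrees = [alpha(m, rec[attr], w) for attr, m in fuzzy_items]
        total += min(degrees)
    return total / len(records)

# Toy relational table and fuzzy sets ("Age is Old", "Cholesterol is High").
records = [
    {"age": 62, "chol": 250},
    {"age": 35, "chol": 180},
    {"age": 71, "chol": 300},
    {"age": 55, "chol": 230},
]
age_old = triangular(50, 75, 100)
chol_high = triangular(200, 280, 360)
print(fuzzy_support(records, [("age", age_old), ("chol", chol_high)]))
```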
FUTURE TRENDS Quantitative and fuzzy association rules are two types of important association rules that exist in many real-life applications. Future research trends include mining them among data that have structural properties such as sequence, time series, and spatiality. Mining these types of rules in multidimensional databases is also a challenging problem. These complex data mining tasks require techniques not only for data mining but also for data representation, reduction, transformation, and visualization. Completion of these tasks depends on successful integration of all the relevant techniques and effective application of those techniques.
CONCLUSION We have summarized research activities in mining quantitative and fuzzy association rules and some recent results of our research on these topics. These results, together with those given in Shen (1999b, 2005), show a complete picture of the major achievements made within our data mining research projects. We hope that this article serves as a useful reference to the researchers working in this area and that its contents provide interesting techniques and tools to solve these two challenging problems, which are significant in various applications.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining associations between sets of items in massive databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216). Dehaspe, L., & Raedt, L. D. (1997). Mining association rules in multiple relations. Proceedings of the Seventh International Workshop on Inductive Logic Programming, 1297 (pp. 125-132).
Phan, X. H., & Horiguchi, S. (2004a). A parallel algorithm for mining fuzzy association rules (Tech. Rep. No. IS-RR2004-014, JAIST). Phan, X. H., & Horiguchi, S. (2004b). Parallel data mining for fuzzy association rules (Tech. Rep). Shen, H. (1999a). Fast subset statistics and its applications in data mining. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (pp. 1476-1480). Shen, H. (1999b). New achivements in data mining. In J. Sun (Ed.), Science: Advancing into the new millenium (pp. 162-178). Beijing: People’s Education. Shen, H. (2001). Optimal attribute quantization for mining quantitative association rules. Proceedings of the 2001 Asian High-Performance Computing Conference, Australia. Shen, H. (2005). Advances in mining association rules. In John Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference. Shen, H., Liang, W., & Ng, J. K-W. (1999). Efficient computation of frequent itemsets in a subcollection of multiple set families. Informatica, 23(4), 543-548.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning (pp. 194-202).
Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. ACM SIGMOD Record, 25(2), 1-8.
Fukuda, T., Morimoto, Y., Morishita, S., & Tokuyama, T. (1996). Mining optimized association rules for numeric attributes. Proceedings of the 15th ACM SIGACTSIGMOD (pp. 182-191).
Zaki, M. J., Parthasarathy, S., & Ogihara, M. (2001). Parallel data mining for association rules on sharedmemory systems. Knowledge and Information Systems, 3(1), 1-29.
Gyenesei, A. (2001). A fuzzy approach for mining quantitative association rules. Acta Cybernetica, 15(2), 305320.
KEY TERMS
Han, E-W., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. ACM SIGMOD Record, 26(2), 277-288. Kuok, C. M., Fu, A., & Wong, M. H. (1998). Mining fuzzy association rules in databases. ACM SIGMOD Record, 27(1), 41-46. Li, J., Shen, H., & Topor, R. (1999). An adaptive method of numerical attribute merging for quantitative association rule mining. Proceedings of the 1999 International Computer Science Conference (pp. 31-40), Hong Kong. Miller, R. J., & Yang, Y. (1997). Association rules over interval data. ACM SIGMOD Record, 26(2), 452-460.
Attribute Discretization: The process of converting a (continuous) attribute value range to a set of discrete intervals and representing all the values in each interval with a single (interval) label. Attribute Fuzzification: A process that converts numeric attribute values into membership values of fuzzy sets. Equal-Depth Discretization: A technique that divides the attribute value range into intervals of equal population (value density). Equal-Width Discretization: A technique that divides the attribute value range into intervals of equal width (length).
Fuzzy Association Rule: An implication rule showing the conditions of co-occurrence of itemsets that are defined by fuzzy sets whose elements have probabilistic values.
Quantitative Association Rule: An implication rule containing items of quantitative and/or categorical values that shows in a given database.
Model Identification through Data Mining Diego Liberati Consiglio Nazionale delle Ricerche, Italy
INTRODUCTION In many fields of research as in everyday life, one has to face a huge amount of data, often not completely homogeneous and many times without an immediate grasp of an underlying simple structure. A typical example is the growing field of bio-informatics, where new technologies, like the so-called micro-arrays, provide thousands of gene expression data on a single cell in a simple and quickly integrated way. Further, the everyday consumer is involved in a process not so different from a logical point of view, when the data associated with the consumer’s identity contributes to a large database of many customers whose underlying consumer trends are of interest to the distribution market. The large number of variables (i.e., gene expressions, goods) for so many records (i.e., patients, customers) usually are collected with the help of wrapping or warehousing approaches in order to mediate among different repositories, as described elsewhere in the encyclopedia. Then, the problem arises of reconstructing a synthetic mathematical model, capturing the most important relations between variables. This will be the focus of the present contribution. A possible approach to blindly building a simple linear approximating model is to resort to piece-wise affine (PWA) identification (Ferrari-Trecate et al., 2003). The rationale for this model will be explored in the second part of the methodological section of this article. In order to reduce the dimensionality of the problem, thus simplifying both the computation and the subsequent understanding of the solution, the critical problems of selecting the most salient variables must be solved. A possible approach is to resort to a rule induction method, like the one described in Muselli and Liberati (2002) and recalled in the first methodological part of this contribution. Such a strategy offers the advantage of extracting underlying rules and implying conjunctions and/or disjunctions between such salient variables. The first idea of their non-linear relationships is provided as a first step to designing a representative model,using the selected variables. The joint use of the two approaches recalled in this article, starting from data without known background information about their relationships, first allows a reduction in dimensionality without significant loss in informa-
tion, and later to infer logical relationships. Finally, it allows the identification of a simple input-output model of the involved process that also could be used for controlling purposes.
BACKGROUND The two tasks of selecting salient variables and identifying their relationships from data may be sequentially accomplished with various degrees of success in a variety of ways. Principal components order the variables from the most salient to the least, but only under a linear framework. Partial least squares allow an extension to non-linear models, provided that one has prior information on the structure of the involved non-linearity; in fact, the regression equation needs to be written before identifying its parameters. Clustering may operate in an unsupervised way without the a priori correct classification of a training set (Booley, 1998). Neural networks are known to use embedded rules with the indirect possibility (Taha & Ghosh, 1999)of making rules explicit or to underline the salient variables. Decision trees (Quinlan, 1994), a popular framework, can provide a satisfactory answer to both questions.
MAIN THRUST Recently, a different approach has been suggested—Hamming clustering. This approach is related to the classical theory exploited in minimizing the size of electronic circuits, with the additional ability to obtain a final function able to generalize everything from the training dataset to the most likely framework describing the actual properties of the data. In fact, the Hamming metric tends to cluster samples with code that is less distant. This is likely to be natural if variables are redundantly coded via thermometer (for numeric variables) or used for only-one (for logical variables) code (Muselli & Liberati, 2000). The Hamming clustering approach reflects the following remarkable properties: •
It is fast, exploiting (after the mentioned binary coding) just logical operations instead of floating point multiplications.
•
It directly provides a logical understandable expression (Muselli & Liberati, 2002), which is the final synthesized function directly expressed as the OR of ANDs of the salient variables, possibly negated.
Once the variables are selected, a mathematical model of the underlying generating framework still must be produced. At this point, a first hypothesis of linearity may be investigated, although it is usually only a very rough approximation when the values of the variables are not close to the operating point around which the linear approximation is computed. Building a non-linear model is far from easy; the structure of the non-linearity would need to be known a priori, which is usually not the case. A typical approach consists of exploiting a priori knowledge, when available, to define a tentative structure, then refining and modifying it on the training subset of the data, and finally retaining the structure that best fits a cross-validation on the testing subset of the data. The problem is even more complex when the collected data exhibit hybrid dynamics (i.e., their evolution in time is a sequence of smooth behaviors and abrupt changes). An alternative approach is to infer the model directly from the data, without a priori knowledge, via an identification algorithm capable of reconstructing a very general class of piece-wise affine models (Ferrari-Trecate et al., 2003). This method can also be exploited for the data-driven modeling of hybrid dynamic systems, where logic phenomena interact with the evolution of continuous-valued variables. Such an approach will be described concisely later, after a more detailed description of the rules-oriented mining procedure, and some applications will be discussed briefly.
Binary Rule Generation and Variable Selection While Mining Data

The approach followed by Hamming clustering in mining the available data, in order to select the salient variables and to build the desired set of rules, consists of the three steps summarized in Table 1.
Table 1. The three steps executed by Hamming clustering to build the set of rules embedded in the mined data
1. The input variables are converted into binary strings via a coding designed to preserve distance and, if relevant, ordering.
2. The 'OR of ANDs' expression of a logical function is derived from the training examples coded in the binary form of step 1.
3. In the final OR expression, each logical AND provides intelligible conjunctions or disjunctions of the involved variables, ruling the analyzed problem.

• Step 1: A critical issue is the partition of a possibly continuous range into intervals, whose number and limits may affect the final result. The thermometer code may then be used to preserve ordering and distance (in the case of nominal input variables, for which a natural ordering cannot be defined, the only-one code is adopted instead). The simple metric used is the Hamming distance, computed as the number of different bits between binary strings. In this way, the training process does not require floating-point computation but only basic logic operations; this is one reason for the speed of the algorithm and for its insensitivity to precision.
• Step 2: Classical techniques of logical synthesis are specifically designed to obtain the simplest AND-OR expression able to satisfy all the available input-output pairs, without an explicit attitude toward generalization. To generalize and infer the underlying rules, at every iteration Hamming clustering groups together, in a competitive way, binary strings that have the same output and are close to each other. A final pruning phase simplifies the resulting expression, further improving its generalization ability. Moreover, the minimization of the involved variables intrinsically excludes the redundant ones, thus highlighting the variables that are truly salient for the investigated problem. The low (quadratic) computational cost allows quite large datasets to be managed.
• Step 3: Each logical product directly provides an intelligible rule, synthesizing a relevant aspect of the underlying system that is believed to generate the available samples.
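To make the coding and the metric concrete, here is a minimal Python sketch (an illustration, not the published implementation of Muselli and Liberati): it shows the thermometer and only-one codings and the Hamming distance used to compare the resulting binary strings. The threshold values, category labels, and sample values are assumptions introduced only for the example.

def thermometer_code(value, thresholds):
    # One bit per interval boundary, set when the value exceeds it;
    # this preserves both ordering and distance for numeric variables.
    return [1 if value > t else 0 for t in thresholds]

def only_one_code(category, categories):
    # A single 1 in the position of the category, for nominal variables.
    return [1 if category == c else 0 for c in categories]

def hamming_distance(a, b):
    # Number of differing bits between two equal-length binary strings.
    return sum(x != y for x, y in zip(a, b))

# Two samples with one numeric and one nominal variable (values assumed):
x = thermometer_code(0.7, [0.2, 0.5, 0.8]) + only_one_code("B", ["A", "B", "C"])
y = thermometer_code(0.9, [0.2, 0.5, 0.8]) + only_one_code("B", ["A", "B", "C"])
print(hamming_distance(x, y))  # small distance: candidates for the same cluster

Hamming clustering would competitively group such close strings that share the same output, which is what keeps the resulting AND terms both compact and interpretable.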
Identification of Piece-wise Affine Systems Through a Clustering Technique

Once the salient variables have been selected, it may be of interest to capture a model of their dynamical interaction. Piece-wise affine identification exploits a K-means-like clustering of data points in multivariable space that jointly determines a set of linear submodels and their respective regions of operation, without even imposing continuity at each change in the derivative. In order to obtain such a result, the five steps reported in Table 2 are executed.
• Step 1: The model is locally linear; small sets of data points close to each other are likely to belong to the same submodel. For each data point, a local set is built, collecting the selected point together with a given number of its neighbors (whose cardinality is one of the parameters of the algorithm). A local set is pure if it is made of points really belonging to the same single linear subsystem; otherwise, it is mixed.
• Step 2: For each local dataset, a linear model is identified through a usual least-squares procedure. Pure sets belonging to the same submodel give similar parameter vectors, while mixed sets yield isolated vectors of coefficients that look like outliers in the parameter space. If the signal-to-noise ratio is good enough, and if there are not too many mixed sets (i.e., the number of data points is larger than the number of submodels to be identified, and the sampling is fair in every region), then the vectors will cluster in the parameter space around the values pertaining to each submodel, apart from a few outliers.
• Step 3: A modified version of classical K-means, whose convergence is guaranteed in a finite number of steps (Ferrari-Trecate et al., 2003), takes into account the confidence in pure and mixed local sets in order to cluster the parameter vectors.
• Step 4: The data points are then classified: each local dataset is one-to-one related to its generating data point, which is classified according to the cluster to which the parameter vector of its local dataset belongs.
• Step 5: Both the linear submodels and their regions are estimated from the data in each subset. The coefficients are estimated via weighted least squares, taking into account the confidence measures. The shape of the polyhedral region characterizing the domain of each submodel may be obtained via linear support vector machines (Vapnik, 1998), easily solved via linear/quadratic programming.

Table 2. The five steps for piece-wise affine identification
1. The local datasets neighboring each sample are built.
2. The local linear models are identified through least squares.
3. The parameter vectors are clustered through a modified K-means.
4. The data points are classified according to the clustered parameter vectors.
5. The submodels are estimated together with their domains.
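The following compact numerical sketch illustrates these steps under simplifying assumptions: a scalar output, no confidence weighting, a plain K-means in place of the modified K-means of Ferrari-Trecate et al. (2003), and no support-vector estimation of the polyhedral regions. The function names and the toy data are assumptions made only for the example.

import numpy as np

def pwa_identify(X, y, n_neighbors=8, n_models=2, iters=50, seed=0):
    # 1) build a local dataset around each sample; 2) fit a local affine
    # model by least squares; 3) cluster the local parameter vectors;
    # 4) classify each point by the cluster of its local model;
    # 5) re-estimate one affine submodel per cluster.
    rng = np.random.default_rng(seed)
    n = len(y)
    Xa = np.hstack([X, np.ones((n, 1))])                 # affine term
    params = np.empty((n, Xa.shape[1]))
    for i in range(n):
        idx = np.argsort(np.linalg.norm(X - X[i], axis=1))[:n_neighbors]
        params[i] = np.linalg.lstsq(Xa[idx], y[idx], rcond=None)[0]
    centers = params[rng.choice(n, n_models, replace=False)]
    for _ in range(iters):                               # plain K-means
        labels = np.argmin(((params[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([params[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in range(n_models)])
    submodels = [np.linalg.lstsq(Xa[labels == k], y[labels == k], rcond=None)[0]
                 for k in range(n_models)]
    return labels, submodels

# Toy usage: two affine regimes in one dimension (synthetic data).
X = np.linspace(-1.0, 1.0, 60).reshape(-1, 1)
y = np.where(X[:, 0] < 0, 2.0 * X[:, 0] + 1.0, 1.0 - X[:, 0]) + 0.01 * np.random.randn(60)
labels, submodels = pwa_identify(X, y)

On such data, the two recovered coefficient vectors should approximate the slopes and intercepts of the two regimes used to generate it.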
A Few Applications

The field of application of the proposed approaches is intrinsically wide, since both tools are very general and quite powerful, especially when combined. Here, only a few suggestions will be drawn, with reference to results already obtained or to applications under consideration for ongoing research projects. The field of life science plays a central role because of its relevance to science and to society. A growing area of international relevance, to which the described approaches are contributing, is the so-called field of systems biology: a feedback model of how proteins interact with each other and with nucleic acids within the cell, both of which are needed to better understand the control mechanisms of the cellular cycle. This is especially relevant to duplication, as in cancer, when such mechanisms get out of control. Such an understanding will hopefully encourage a drive toward personalized therapy, in which each person's gene expression will be correlated with the amount of the corresponding proteins involved in the cellular cycle. Moreover, a new computational paradigm could arise by exploiting biological components like cells instead of the usual silicon hardware, thus overcoming some technological issues and possibly facilitating neuroinformatics. The study of systems biology, as a leading edge of the larger field of bioinformatics, begins by analyzing data from so-called micro-arrays. These are small standard chips on which thousands of gene expressions may be obtained from the same cell material, thus providing a large amount of data that cannot conceivably be handled with the usual deterministic approaches, whose weakness is their inability to obtain significant synthetic information. Thus, matrices of as many subjects as available, possibly grouped into homogeneous categories for supervised training, each carrying thousands of gene expressions, are the natural input to algorithms like Hamming clustering. The desired output is a set of rules able to classify, for instance, patients affected by different tumors versus healthy subjects on the basis of a few identified genes, whose set is the candidate basis for the piece-wise linear model describing their complex interaction in that particular class of subjects. Even without deeper insight into the cell, the identification of prognostic factors in oncology is already being carried out with Hamming clustering (Paoli et al., 2000), which also provides a hint about their interaction that is not explicit in the outcome of a simple neural network (Drago et al., 2002).
FUTURE TRENDS

Besides improvements in the algorithmic approaches (including the possibility of taking potential a priori knowledge into account), a wide range of applications will benefit from the proposed ideas in the near future; some of these are outlined here and partially recalled in Table 3.
Drug design would benefit from the a priori forecasting, provided by Hamming clustering, of the hydrophilic behavior of a pharmacological molecule that has not yet been synthesized, on the basis of the known properties of some possible radicals, along the lines of Manga et al. (2003) and within the frame of computational methods for the prediction of drug-likeness (Clark & Pickett, 2000). In principle, such in silico predictions, not so different from the systems biology expectation, are of paramount relevance to pharmaceutical companies seeking to save money in designing minimally invasive drugs controlling kidney metabolism, which affect the liver less the more hydrophilic they are. The compartmental behavior of drugs may then be analyzed via piece-wise identification, by collecting in vivo data samples and clustering them within the compartment that is more active at each stage, instead of via classical linear system identification, which requires non-linear algorithms. The same is true for metabolism, such as glucose in diabetes or urea in renal failure. Dialysis can be modeled as a compartmental process in a sufficiently accurate way: a piece-wise linear approach is able to simplify the identification even on a single patient, when population data are not available to allow a simple linear deconvolution approach (Liberati & Turkheimer, 1999), whose result is only an average picture of the overall process that does not take the specific subject under analysis into account. Moreover, compartmental models are pervasive, as in ecology, wildlife, and population studies; potential applications in that direction are almost never ending. Many physiological processes switch, naturally or intentionally, from an active to a quiescent state, like hormone pulses, whose identification (Sartorio et al., 2002) is important in growth and fertility diseases as well as in doping assessment. In that respect, the most fascinating
human organ is probably the brain, whose study may be undertaken today either in the sophisticated frame of functional nuclear magnetic resonance imaging or in the simpler setting of EEG recording. Multidimensional images (three space variables plus time) or multivariate time series provide an abundance of raw data to mine in order to understand which kind of activation is produced in correspondence with an event (Maieron et al., 2002) or a decision. A brain-computer interfacing device may then be built that is able to recognize one's intention to perform a movement (not directly possible for the subject under some impaired physical conditions) and command a proper actuator. Both Hamming clustering and piece-wise affine identification would improve the only partially successful approaches based on artificial neural networks (Babiloni et al., 2000). A simple drowsiness detector based on the EEG may be designed, as well as a flexible anaesthesia/hypotension level detector, without needing a more precise, but more costly, time-varying identification. A psychological stress indicator may be inferred, outperforming the approach of Pagani et al. (1991). Multivariate analysis (Liberati et al., 1997), possibly taking into account the input stimulation that is so useful in difficult neurological tasks, such as modeling electroencephalographic coherence in Alzheimer's disease patients (Locatelli et al., 1998) or non-linear effects in muscle contraction (Orizio et al., 1996), would be outperformed by piece-wise linear identification, even in time-varying contexts like epilepsy. Industrial applications, of course, are not excluded from the field of possible interest. In Ferrari-Trecate et al. (2003), for instance, the classification and identification of the dynamics of an industrial transformer are performed via the piece-wise approach, at a much lower cost and with no really significant reduction in performance with respect to the non-linear modeling described in Bittanti et al. (2001).
Table 3. A summary of selected applications of the proposed approaches
• Systems Biology and Bioinformatics: To identify the interaction of the main proteins involved in the cell cycle.
• Drug Design: To forecast desired and undesired behaviors of the final molecule from its components.
• Compartmental Modeling: To identify the number, dimensions, and exchange rates of communicating reservoirs.
• Hormone Pulse Detection: To detect true pulses among the stochastic oscillations of the baseline.
• Sleep Detector: To forecast drowsiness as well as to identify sleep stages and transitions.
• Stress Detector: To identify the psychological state of a subject from multivariate analysis of biological signals.
• Prognostic Factors: To identify the interaction between selected features (e.g., in oncology).
• Pharmacokinetics and Metabolism: To identify diffusion and metabolic time constants from time series sampled in blood.
• Switching Processes: To identify abrupt or possibly smoother commutations within the process duty cycle.
• Brain-Computer Interfacing: To identify a decision taken within the brain and propagate it directly from brain waves.
• Anesthesia Detector: To monitor the level of anesthesia and provide closed-loop control.
• Industrial Applications: A wide spectrum of logical or dynamic-logical hybrid problems may be faced (e.g., the tracking of a transformer).
CONCLUSION

The proposed approaches are very powerful tools for quite a wide spectrum of applications in and beyond data mining, providing an up-to-date answer to the quest for formally extracting knowledge from data and sketching a model of the underlying process.
ACKNOWLEDGMENTS

Marco Muselli had the key role in conceiving and developing the recalled clustering algorithms, while Giancarlo Ferrari-Trecate was instrumental in extending them to model identification, as well as in kindly revising the present contribution. Their stimulating and friendly interaction over the past few years is warmly acknowledged. An anonymous reviewer is also gratefully acknowledged for indicating how to improve the writing at several critical points.
REFERENCES

Babiloni, F. et al. (2000). Comparison between human and ANN detection of laplacian-derived electroencephalographic activity related to unilateral voluntary movements. Comput Biomed Res, 33, 59-74.

Bittanti, S., Cuzzola, F.A., Lorito, F., & Poncia, G. (2001). Compensation of nonlinearities in a current transformer for the reconstruction of the primary current. IEEE T Control System Techn, 9(4), 565-573.

Boley, D.L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325-344.

Clark, D.E., & Pickett, S.D. (2000). Computational methods for the prediction of "drug-likeness." Drug Discovery Today, 5(2), 49-58.

Drago, G.P., Setti, E., Licitra, L., & Liberati, D. (2002). Forecasting the performance status of head and neck cancer patient treatment by an interval arithmetic pruned perceptron. IEEE T Bio-Med Eng, 49(8), 782-787.

Ferrari-Trecate, G., Muselli, M., Liberati, D., & Morari, M. (2003). A clustering technique for the identification of piecewise affine systems. Automatica, 39, 205-217.

Liberati, D., Cursi, M., Locatelli, T., Comi, G., & Cerutti, S. (1997). Total and partial coherence of spontaneous and evoked EEG by means of multi-variable autoregressive processing. Med Biol Eng Comput, 35(2), 124-130.
Liberati, D., & Turkheimer, F. (1999). Linear spectral deconvolution of catabolic plasma concentration decay in dialysis. Med Biol Eng Comput, 37, 391-395.

Locatelli, T., Cursi, M., Liberati, D., Franceschi, M., & Comi, G. (1998). EEG coherence in Alzheimer's disease. Electroenceph Clin Neurophys, 106(3), 229-237.

Maieron, M. et al. (2002). An ARX model-based approach to the analysis of single-trial event-related fMRI data. Proceedings of the 8th International Conference on Functional Mapping of the Human Brain, Sendai, Japan.

Manga, N., Duffy, J.C., Rowe, P.H., & Cronin, M.T.D. (2003). A hierarchical quantitative structure-activity relationship model for urinary excretion of drugs in humans as predictive tool for biotrasformation. Quantitative Structure-Activity Relationship Comb Sci, 22, 263-273.

Muselli, M., & Liberati, D. (2000). Training digital circuits with Hamming clustering. IEEE T Circuits and Systems I: Fundamental Theory and Applications, 47, 513-527.

Muselli, M., & Liberati, D. (2002). Binary rule generation via Hamming clustering. IEEE T Knowledge and Data Eng, 14, 1258-1268.

Orizio, C., Liberati, D., Locatelli, C., DeGrandis, D., & Veicsteinas, A. (1996). Surface mechanomyogram reflects muscle fibres twitches summation. J Biomech, 29(4), 475-481.

Pagani, M. et al. (1991). Sympatho-vagal interaction during mental stress: A study employing spectral analysis of heart rate variability in healthy controls and in patients with a prior myocardial infarction. Circulation, 83(4), 43-51.

Paoli, G. et al. (2000). Hamming clustering techniques for the identification of prognostic indices in patients with advanced head and neck cancer treated with radiation therapy. Med Biol Eng Comput, 38, 483-486.

Quinlan, J.R. (1994). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.

Sartorio, A., De Nicolao, G., & Liberati, D. (2002). An improved computational method to assess pituitary responsiveness to secretagogue stimuli. Eur J Endocrinol, 147(3), 323-332.

Taha, I., & Ghosh, J. (1999). Symbolic interpretation of artificial neural networks. IEEE T Knowledge and Data Eng., 11, 448-463.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
KEY TERMS

Bioinformatics: The processing of the huge amount of information pertaining to biology.

Hamming Clustering: A fast binary rule generator and variable selector able to build understandable logical expressions by analyzing the Hamming distance between samples.

Hybrid Systems: Systems whose evolution in time is composed of both smooth dynamics and sudden jumps.

Micro-Arrays: Chips on which thousands of gene expressions may be obtained from the same biological cell material.

Model Identification: Definition of the structure of a model, and computation of its parameters, best suited to mathematically describe the process underlying the data.

Rule Generation: The extraction from the data of the embedded synthetic logical description of their relationships.

Salient Variables: The real players among the many variables apparently involved in the true core of a complex problem.

Systems Biology: The quest for a mathematical model of the feedback regulation of proteins and nucleic acids within the cell.
Modeling Web-Based Data in a Data Warehouse Hadrian Peter University of the West Indies, Barbados Charles Greenidge University of the West Indies, Barbados
INTRODUCTION
Good database design generates effective operational databases through which we can track customers, sales, inventories, and other variables of interest. The main reason for generating, storing, and managing good data is to enhance the decision-making process. The tool used during this process is the decision support system (DSS). The information requirements of the DSS have become so complex, that it is difficult for it to extract all the necessary information from the data structures typically found in operational databases. For this reason, a new storage facility called a data warehouse has been developed. Data in the data warehouse have characteristics that are quite distinct from those in the operational database (Rob & Coronel, 2002). The data warehouse extracts or obtains its data from operational databases as well as from other sources, thus providing a comprehensive data pool from which useful information can be extracted. The other sources represent the external data component of the data warehouse. External information in corporate data warehouses has increased during the past few years because of the wealth of information available, the desire to work more closely with third parties and business partners, and the Internet. In this paper, we focus on the Internet as the source of the external data and propose a model for representing such data. The model represents a timely intervention by two important technologies (i.e., data warehousing and search engine technology) and is a departure from current thinking of major software vendors (e.g., IBM, Oracle, MicroSoft) that seeks to integrate multiple tools into one environment with one administration. The paper is organized as follows: background on the topic is discussed in the following section, including a literature review on the main areas of the paper; the main thrust section discusses the detailed model, including a brief analysis of the model; finally, we identify and explain the future trends of the topic.
BACKGROUND

Data Warehousing

Data warehousing is a process concerned with the collection, organization, and analysis of data, typically from several heterogeneous sources, with an aim to augment end-user business functions (Hackathorn, 1998; Strehlo, 1996). It is intended to provide a working model for the easy flow of data from operational systems to decision support systems. The data warehouse structure includes three major levels of information (granular data, archival data, and summary data) and the metadata to support them (Inmon, Zachman & Geiger, 1997). Data warehousing:

1. Is centered on ad hoc end-user queries posed by business users rather than by information system experts.
2. Is concerned with off-line, historical data rather than online, volatile, operational type data.
3. Must efficiently handle larger volumes of data than those handled by normal operational systems.
4. Must present data in a form that coincides with the expectations and understanding of the business users of the system rather than that of the information system architects.
5. Must consolidate data elements. Diverse operational systems will refer to the same data in different ways.
6. Relies heavily on meta-data. The role of meta-data is particularly vital, as data must remain in its proper context over time.

The main issues in data warehousing design are:

1. performance versus flexibility;
2. cost;
3. maintenance.
Details of data warehousing and data warehouse issues are provided in Berson and Smith (1997), Bischoff and Yevich (1996), Devlin (1998), Hackathorn (1995, 1998), Inmon (2002), Mattison (1996), Rob and Coronel (2002), Becker (2002), Kimball and Ross (2002), and Lujan-Mora and Trujillo (2003).
External Data and Search Engines

External data represent an essential ingredient in the production of decision-support analysis in the data warehouse environment (Inmon, 2002). The ability to make effective decisions often requires the use of information that is generated outside the domain of the person making the decision (Parsaye, 1996). External data sources provide data that cannot be found within the operational database but are relevant to the business (e.g., stock prices, market indicators, and competitors' data). Once in the warehouse, the external data go through processes that enable them to be interrogated by the end user and business analysts seeking answers to strategic questions. This external data may augment, revise, or even challenge the data residing in the operational systems (Barquin & Edelstein, 1997). Our model provides a cooperative nexus between the data warehouse and search engine, and is aimed at enhancing the collection of external data for end-user queries originating in the data warehouse (Bischoff & Yevich, 1996). The model grapples with the complex design differences between data warehouse and search engine technologies by introducing an independent, intermediate, data-staging layer. Search engines have been used since the 1990s to grant millions access to global external data via the Internet (Goodman, 2000; Sullivan, 2000). In spite of their shortcomings (Chakrabarti et al., 2001; Soudah, 2000), search engines still represent the best approach to the retrieval of Internet-based external data (Goodman, 2000; Sullivan, 2000). Importantly, under our model, queries originating with the business expert on the warehouse side can be modified at an intermediate stage before being acted on by the search engine. Results coming from the search engine also can be processed prior to being relayed to the warehouse. This intermediate, independent meta-data bridge is an important concept in this model.

THE DETAILED MODEL
Motivation/Justification

To better understand the importance of the new model, we must examine the search engine and meta-data engine components. The meta-data engine cooperates closely with both the data warehouse and search engine architectures to facilitate queries and efficiently handle results. In this paper, we propose a special-purpose search engine, which is a hybrid between the traditional, general-purpose search engine and the meta-search engine. The hybrid approach we have taken is an effort to maximize the efficiency of the search process. Although the engine does not need to produce results in a real-time mode, its throughput may be increased by using multi-threading. The polling of the major commercial engines by our engine ensures the widest possible access to Internet information (Bauer et al., 2002; Pfaffenberger, 1996; Ray et al., 1998; Sullivan, 2000). The special-purpose hybrid search engine in Figure 1 goes through the following steps each time a query is processed:

1. Retrieve the major search engines' home page HTML documents;
2. Parse the files and submit GET-type queries to the major engines;
3. Retrieve the results and use the links found in them to form starting seed links;
4. Perform a depth-first/breadth-first traversal using the seed links; and
5. Capture the results from step 4 to local disk.
Figure 1. Query and retrieval in hybrid search engine (schematic: the user query is filtered into a final query and submitted to the major engines; the downloaded HTML result files are parsed, the stored links become seed links that are followed and downloaded in turn, and the meta-data engine is invoked on the results)
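As a rough illustration of this query-and-retrieval loop (and of Figure 1), the Python sketch below polls a set of engine URLs, harvests seed links from the result pages, and follows them breadth-first. The engine URL templates are placeholders, not real endpoints: in practice the major commercial engines must be accessed through their published interfaces and terms of use, and the link extraction here is deliberately crude.

import re
import urllib.request

ENGINE_URL_TEMPLATES = [                      # placeholders, not real endpoints
    "https://engine-one.example/search?q={q}",
    "https://engine-two.example/search?q={q}",
]

def fetch(url):
    # Download one page as text ("download results" in Figure 1).
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def extract_links(html):
    # Parse the HTML file and store the outgoing links ("parse files / store links").
    return re.findall(r'href="(http[^"]+)"', html)

def hybrid_search(query, depth=1):
    seeds, captured = [], {}
    for template in ENGINE_URL_TEMPLATES:     # steps 1-3: query the engines
        try:
            seeds += extract_links(fetch(template.format(q=query)))
        except OSError:
            continue                          # engine unreachable: skip it
    frontier = seeds
    for _ in range(depth):                    # step 4: breadth-first traversal
        next_frontier = []
        for url in frontier:
            if url in captured:
                continue
            try:
                captured[url] = fetch(url)    # step 5: capture to local storage
            except OSError:
                continue
            next_frontier += extract_links(captured[url])
        frontier = next_frontier
    return captured                           # later handed to the meta-data engine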
Architecture

It is evident that our proposed system uses the Internet to supplement the external data requirements of the data warehouse. To function effectively, the system must address the issues surrounding the architectures of the data warehouse and search engine. The differences are in the following areas:

1. Subject Area Specificity
2. Granularity
3. Time Basis (Time Stamping)
4. Frequency of Updates
5. Purity of Data
To address these areas, we need a flexible architecture that can provide a bridge over which relevant data can be transmitted. This bridge needs to be a communication channel, as well, so that an effective relationship between the data warehouse and search engine domains is maintained. One approach is to integrate the search engine into the warehouse. This approach, though possible, is not expedient, because the warehouse environment is not an operational type environment but one optimized for the analysis of enterprise-wide historical data. Typical commercial search engines are composed of a crawler (spider) and an indexer (mite). The indexer is used to codify results into a database for easy querying. Our approach reduces the complexity of the search engine by moving the indexing tasks to the meta-data engine. The purpose of the meta-data engine is to facilitate an automatic information retrieval mode. Figure 2 illustrates the concept of the bridge between data warehouse and search engine environments. The meta-data engine provides a useful transaction point for both queries and results. It is useful architecturally and practically to keep the search engine and data warehouse separate. The separation of the three architectures means that they can be optimized independently. The meta-data engine performs a specialized task of interfacing
between two distinct environments and, as such, we believe that it should be developed separately. We now briefly examine each of the four components of the meta-data engine operation shown in Figure 2. First, queries originating from the warehouse have to be filtered to ensure they are actionable; that is, non-redundant and meeting the timeliness and other processing requirements of the search engine component. Second, incoming queries may have to be modified in order to meet the goal of being actionable or to meet specific efficiency/success criteria for the search engine. For example, the recall of results may be enhanced if certain terms are included in, or excluded from, the original query coming from the warehouse. Third, after results are retrieved, some minimal analysis of those results has to occur to determine either the success or failure of the retrieval process or to verify the content type or structure (e.g., HTML, XML, .pdf) of the retrieved results. Because search engines can return large volumes of irrelevant results, some preliminary analysis is warranted; more advanced versions of the engine would tackle semantic issues. The fourth component (labeled format in the diagram) is invoked to produce plain text from the recognizable document formats or, alternatively, to convert from one known document format to another. Off-the-shelf tools exist to parse HTML, XML, and other Internet document formats, as well as language toolkits with which these four areas of meta-data engine construction can be implemented. The Web provides a sprawling proving ground of semi-structured, structured, and unstructured data that must be handled using multiple syntactic, structural, and semantic techniques. The meta-data engine seeks to architecturally isolate these techniques from the data warehouse and search engine domains. Also, in our proposed model, the issue of relevance must be addressed. Therefore, a way must be found to sort out relevant from non-relevant documents. For example, we are interested in questions such as the following: Does every page containing the word Java specifically address Java programming issues or tools?
Figure 2. Meta-data engine operation (schematic: a query from the data warehouse is filtered and modified before the modified query is submitted to the search engine; the results handled from the search engine are analyzed and formatted before being provided back to the data warehouse)
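A skeletal view of these four stages, again as an illustration rather than the authors' implementation, is sketched below; the filtering and rewriting rules are assumptions, and search_engine stands for any retrieval function such as the hybrid_search sketch given earlier.

import re

def filter_query(query, already_answered):
    # Discard queries that are redundant and therefore not actionable.
    return None if query in already_answered else query

def modify_query(query, include=(), exclude=()):
    # Rewrite the warehouse query to improve recall/precision at the engine.
    return " ".join([query, *include] + [f"-{term}" for term in exclude])

def analyze_results(pages):
    # Keep only results whose content type we can recognize (a rough check).
    recognized = {}
    for url, text in pages.items():
        if "<html" in text.lower() or "<?xml" in text:
            recognized[url] = text
    return recognized

def format_results(results):
    # Reduce recognized documents to plain text for the staging area.
    return {url: re.sub(r"<[^>]+>", " ", text) for url, text in results.items()}

def metadata_engine(query, already_answered, search_engine):
    q = filter_query(query, already_answered)
    if q is None:
        return {}
    pages = search_engine(modify_query(q, exclude=("advertisement",)))
    return format_results(analyze_results(pages))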
The Need for Data Synchronization Data synchronization is an important issue in data warehousing. Data warehouses are designed to provide a consistent view of data emanating from several sources. The data from these sources must be combined, in a staging area outside of the warehouse, through a process of transformation and integration. To achieve this goal requires that the data drawn from the multiple sources be synchronized. Data synchronization can only be guaranteed by safeguarding the integrity of the data and ensuring that the timeliness of the data is preserved. We propose that the warehouse transmit a request for external data to the search engine. Once the eventual results are obtained, they are left for the warehouse to gather as it completes a new refresh cycle. This process is not intended to provide an immediate response (as required by a Web surfer) but to lodge a request that will be answered at some convenient time in the future. At their simplest, data originating externally may be synchronized by their date of retrieval as well as by system statistics such as last time of update, date of posting, and page expiration date. More sophisticated schemes could include verification of dates of origin, use of Internet archives, and automated analysis of content to establish dates. For example, online newspaper articles often carry the date of publication close to the headlines. An analysis of such a document’s content would reveal a date field in a position adjacent to the heading for the story.
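As a simple sketch of this kind of synchronization tagging (an assumption-laden illustration, not part of the original model's specification), each retrieved document can be stamped with its retrieval date, and a publication date can be looked for near the headline:

import re
from datetime import date

DATE_PATTERN = re.compile(
    r"\b(\d{1,2}\s+(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}|\d{4}-\d{2}-\d{2})\b")

def synchronization_tags(url, plain_text, headline_window=400):
    # Metadata used to align an external document with a warehouse refresh
    # cycle: the retrieval date plus any date found close to the headline.
    match = DATE_PATTERN.search(plain_text[:headline_window])
    return {"url": url,
            "retrieved": date.today().isoformat(),
            "published": match.group(1) if match else None}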
Analysis of the Model

The major strengths of our model are its:

• Independence;
• Value-added approach; and
• Flexibility and security.

This model exhibits (a) logical independence (ensures that each component of the model retains its integrity); (b) physical independence (allows summarized results of external data to be transferred via network connections to any remote physical machines); and (c) administrative independence (administration of each of the three components in the model is a specialized activity; hence, the need for three separate administrators). In terms of its value-added approach, the model maximizes the utility of acquired data by adding value to this data long after their day-to-day usefulness has expired. The ability to add pertinent external data to volumes of aging internal data extends the usefulness of the internal data. With careful selection of external data, organizations with data warehouses can reap disproportionate benefits over competitors who do not use their historical or external data to maximum effect. Our model also allows for flexibility. By allowing development to take place in three languages (the query language SQL for the data warehouse, Perl for the meta-data engine, and Java for the search engine), the strengths of each can be utilized effectively. By allowing flexibility in the actual implementation of the model, we can increase efficiency and decrease development time. Our model allows for security and integrity to be maintained. The warehouse need not expose itself to dangerous integrity questions or to increasingly malicious threats on the Internet. By isolating the module that interfaces with the Internet from the module that carries vital internal data, the prospects for good security are improved. Our model also relieves information overload. By using the meta-data engine to manipulate external data, we are indirectly removing time-consuming filtering activities from users of the data warehouse.

FUTURE TRENDS

It is our intention to incorporate portals (Hall, 1999; Raab, 2000) and vortals (Chakrabarti et al., 2001; Rupley, 2000) in our continued research in data warehousing. A goal of our model is to provide a subject-oriented bias of the warehouse when satisfying queries involving the identification and extraction of external data from the Internet. The following three potential problems are still to be addressed in our model:

1. It relies on fast-changing search engine technologies;
2. It introduces more system overhead and the need for more administrators;
3. It results in the storage of large volumes of irrelevant/worthless information.

Other issues to be addressed by our model are:

1. The need to investigate how the search engine component can make smarter choices of which links to follow;
2. The need to test the full prototype in a live data warehouse environment;
3. Providing a more detailed comparison between our model and the traditional external data approaches; and
4. Determining how the traditional Online Transaction Processing (OLTP) systems will funnel data into this model.
We are confident that further research with our model will provide the techniques for handling the aforementioned issues. One possible approach is to incorporate languages such as WebSQL (Mihaila, 1996) and Squeal (Spertus, 1999) in our model.
Bischoff, J., & Yevich, R. (1996). The superstore: Building more than a data warehouse. Database Programming and Design, 9(9), 220-229.
CONCLUSION
Chakrabarti, S., van den Berg, M., & Dom, B. (2001). Focused crawling: A new approach to topic-specific Web resource discovery. Retrieved January, 2004, from http://www8.org/w8-papers/5a-search-query/crawling/
The model presented in this paper is an attempt to bring the data warehouse and search engine into cooperation to satisfy the external data requirements for the warehouse. However, the new model anticipates the difficulty of simply merging two conflicting architectures. The approach we have taken is the introduction of a third intermediate layer called the meta-data engine. The role of this engine is to coordinate data flows to and from the respective environments. The ability of this layer to filter extraneous data retrieved from the search engine, as well as compose domain-specific queries, will determine its success. In our model, the data warehouse is seen as the agent that initiates the search for external data. Once a search has been started, the meta-data engine component takes over and eventually returns results. The model provides for querying of general-purpose search engines, but a simple configuration allows for subject-specific engines (vortals) to be used in the future. When specific engines designed for specific niches become widely available, the model stands ready to accommodate this change. We believe that WebSQL and Squeal can be used in our model to perform the structured queries that will facilitate speedy identification of relevant documents. In so doing, we think that these tools will allow for faster development and testing of a prototype and, ultimately, a full-fledged system
REFERENCES Barquin, R., & Edelstein, H. (Eds.). (1997). Planning and designing the data warehouse. Upper Saddle River, NJ: Prentice-Hall. Bauer, A., Hümmer, W., Lehner, W., & Schlesinger, L. (2002). A decathlon in multidimensional modeling: Open issues and some solutions. DaWaK, 274-285. Becker, S.A. (Ed.). (2002). Data warehousing and Web engineering. Hershey, PA: IRM Press. Berson, A., & Smith, S.J. (1997). Data warehousing, data mining and olap. New York: McGraw-Hill.
Chakrabarti, S. (2002). Mining the Web: Analysis of hypertext and semi-structured data. New York: Morgan Kaufmann.
Day, A. (2004). Data warehouses. American City and County, 119(1), 18. Devlin, B. (1998). Meta-data: The warehouse atlas. DB2 Magazine, 3(1), 8-9. Goodman, A. (2000). Searching for a better way. Retrieved July 25, 2002, from http://www.traffick.com Hackathorn, R. (1995). Data warehousing energizes your enterprise. Datamation, 38-45. Hackathorn, R. (1998). Web farming for the data warehouse. New York: Morgan Kaufmann. Hall, C. (1999). Enterprise information portals: Hot air or hot technology [Electronic version]. Business Intelligence Advisor, 111(11). Imhoff, C., Galemmo, N., & Geiger, J.G. (2003). Mastering data warehouse design: Relational and dimensional techniques. New York: John Wiley and Sons. Inmon, W.H. (2002). Building the data warehouse. New York: John Wiley and Sons. Inmon, W.H., Zachman, J.A., & Geiger, J.G. (1997). Data stores, data warehousing, and the Zachman framework: Managing enterprise knowledge. New York: McGraw-Hill. Kimball, R. (1996). Dangerous preconceptions. Retrieved June 13, 2002, from http://pwp.starnetinc.com/larryg/ index.html Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling. New York: John Wiley and Sons. Lujan-Mora, S., & Trujillo, J. (2003). A comprehensive method for data warehouse design. DMDW. Marakas, G.M. (2003). Modern data warehousing, mining, and visualization: Core concepts. Upper Saddle River, NJ: Prentice-Hall College Division.
Mattison, R. (1996). Data warehousing: Strategies, technologies and techniques. New York: McGraw-Hill.
Strehlo, K. (1996). Data warehousing: Avoid planned obsolescence. Datamation, 38-45.
McElreath, J. (1995). Data warehouses: An architectural perspective. CA: Computer Sciences Corp.
Sullivan, D. (2000). Search engines review chart. Retrieved June 10, 2002, from http://searchenginewatch.com
Mihaila, G.A. (1996). WebSQL—An SQL-like query language for the World Wide Web [master’s thesis]. University of Toronto.
Zghal, H.B., Faiz, S., & Ghezala, H.B (2003). Casme: A case tool for spatial data marts design and generation, design and management of data warehouses 2003. Proceedings of the 5th International Workshop DMDW’2003, Berlin, Germany.
Parsaye, K. (1996). Data mines for data warehouses. Database Programming and Design, 9(9). Peralta, V., & Ruggia, R. (2003). Using design guidelines to improve data warehouse logical design. DMDW. Pfaffenberger, B. (1996). Web search strategies. MIS Press. Raab, D.M. (1999). Enterprise information portals [Electronic version]. Relationship Marketing Report. Ray, E.J., Ray, D.S., & Seltzer, R. (1998). The Alta Vista search revolution. CA: Osborne/McGraw-Hill. Rob, P., & Coronel, C. (2002). Database systems: Design, implementation, and management. New York: Thomson Learning. Rupley, S. (2000). From portals to vortals. PC Magazine. Sander-Beuermann, F., & Schomburg, M. (1998). Internet information retrieval—The further development of metasearch engine technology. Proceedings of the Internet Summit, Internet Society. Schneider, M. (2003). Well-formed data warehouse structures, design and management of data warehouses 2003. Proceedings of the 5th International Workshop DMDW’2003, Berlin, Germany. Soudah, T. (2000). Search, and you shall find. Retrieved July 17, 2002, from http://searchenginesguides.com Spertus, E., & Stein, L.A. (1999). Squeal: A structured query language for the Web. Retrieved August 10, 2002, from http://www9.org/w9cdrom/222/222.html
WEBSITES OF INTEREST

www.perl.com
www.cpan.com
www.xml.com
http://java.sun.com
http://semanticweb.org
KEY TERMS

Decision Support System (DSS): An interactive arrangement of computerized tools tailored to retrieve and display data regarding business problems and queries.

External Data: Data originating from other than the operational systems of a corporation.

Granular Data: Data representing the lowest level of detail that resides in the data warehouse.

Metadata: Data about data; in the data warehouse, it describes the contents of the data warehouse.

Operational Data: Data used to support the daily processing a company does.

Refresh Cycle: The frequency with which the data warehouse is updated (e.g., once a week).

Transformation: The conversion of incoming data into the desired form.
Moral Foundations of Data Mining Kenneth W. Goodman University of Miami, USA
INTRODUCTION It has become a commonplace observation that scientific progress often, if not usually, outstrips or precedes the ethical analyses and tools that society increasingly relies on and even demands. In the case of data mining and knowledge discovery in databases, such an observation would be mistaken. There are, in fact, a number of useful ethical precedents, strategies, and principles available to guide those who request, pay for, design, maintain, use, share, and sell databases used for mining and knowledge discovery. These conceptual tools — and the need for them — will vary as one is using a database to, say, analyze cosmological data, identify potential customers, or run a hospital. But these differences should not be allowed to mask the ability of applied ethics to provide practical guidance to those who work in an exciting and rapidly growing new field.
BACKGROUND Data mining is itself a hybrid discipline, embodying aspects of computer science, artificial intelligence, cryptography, statistics, and logic. In greater or lesser degree, each of these disciplines has noted and addressed the ethical issues that arise in their practice. In statistics, for instance, leading professional organizations have ratified codes of ethics that address issues ranging from safeguarding privileged information and avoiding conflicts of interest or sponsor bias (International Statistical Institute, 1985) to “the avoidance of any tendency to slant statistical work toward predetermined outcomes” (American Statistical Association, 1999). It is in computer ethics, however, that one finds the earliest, sustained, and most thoughtful literature (Bynum, 1985; Johnson & Snapper, 1985; Ermann, Williams & Gutierrez, 1990; Forester & Morrison, 1994; Johnson, 1994) in addition to ethics codes by professional societies (Association for Computing Machinery, 1992; IEEE, 1990). Traditional issues in computer ethics include privacy and workplace monitoring, hacking, intellectual property, and appropriate uses and users. The intersection of computing and medicine has also begun to attract interest (Goodman, 1998a).
What is clear about this landscape is that terms we attach to issues — “privacy,” for instance — can mask significant differences according as one uses a computer to keep track of warehouse stock or arrest records or real estate transactions or sexually transmitted diseases. Moreover, ethical issues take on somewhat different aspects depending on whether a computer and its data storage media are used by an individual, a business, a university, or a government. Atop this is the general purpose to which the machine is put: science, business, law enforcement, public health, or national security. This triad — content, user, and purpose — frames the space in which ethical issues arise.
MAIN THRUST One can identify a suite of ethical issues that arise in data mining. All are tethered in one way or another to issues encountered in computing, statistics, and kindred fields. The question of whether data mining, or any discipline for that matter, presents unique or unprecedented issues is open to dispute. Issues between or among disciplines often vary by degree more than by kind. If it is learned or inferred from a database that Ms. Garcia prefers blue frocks, it might be the case that her privacy has been violated. But if it is learned or inferred that she has HIV, the stakes are altogether different. The ability to make ever-more-fine-grained inferences from very large databases increases the importance of ethics in data mining.
Privacy, Confidentiality, and Consent It is common to distinguish between privacy and confidentiality by applying the former to humans’ claim or desire to control access to themselves or information about them, and the latter, more narrowly, to specific units of that information. Privacy, if you will, is about people; confidentiality is about information. Privacy is broader, and it includes interest in information protection and control. An important correlate of privacy is consent. One cannot control information without being asked for permission, or at least informed of information use. A business that is creating a customer database might collect data surreptitiously, arguably infringing on pri-
vacy. Or it might publicly (though not necessarily individually) disclose the collection. Such disclosures are often key components of privacy policies. Privacy policies sometimes seek permission in advance to obtain and archive personal data or, more frequently, disclose that data are being collected and then provide a mechanism for individuals to opt out. The question whether such opt-out policies are adequate to give individuals opportunities to control use of their data is subject to widespread debate. In another context, a government will collect data for vital statistics or public health databases. Such uses, at least in democratic societies, may be justified on grounds of the implied consent of those to whom the information applies and who would benefit from its collection. It is not clear how much or what kind of consent would be necessary to provide ethical warrant for data mining of personal information. The problem of adequate consent is complicated by what may be hypothesized to be widespread ignorance about data mining and its capabilities. As elsewhere, some solutions to this ethical problem might be identified or clarified by empirical research related to public understanding of data-mining technology, individuals' preference for (levels of) control over use of their information, and similar considerations. The U.S. Department of Health and Human Services has, for instance, supported research on the process of informed consent for biomedical research. Data mining warrants a kindred research program. Among the issues to be clarified by such research are the following:

• To what extent do individuals want to control access to their information?
• What are the differences between consent to acquire data for one purpose and consent for secondary uses of that data?
• How do individual preferences or inclinations to consent vary along with the data miner? (That is, it may be hypothesized that some or many individuals will be sanguine about data mining by trusted public health authorities, but opposed to data mining by [certain] business entities or governments.)
It should be noted that many information exchanges, especially those involving the most sensitive or personal data, are at least partly governed by professionalism standards. Thus, doctor-patient and lawyer- or accountant-client relationships traditionally, if not legally, impose high standards for the protection of information acquired during the course of a professional relationship.
Appropriate Uses and Users of Data Mining Technology
It should be uncontroversial to point out that not all data mining or knowledge discovery is done by appropriate users, and not all uses enjoy equal moral warrant. A datamining police state may not be said to operate with the same moral traction as a government public health service in a democracy. (We may one day need to inquire whether use of data-mining technology by a government is itself grounds for identifying it as repressive.) Similarly, given two businesses (insurance companies, say), it is straightforward to report that the one using datamining technology to identify trends in accidents to better offer advice about preventing accidents is on firm moral footing, as opposed to one that identifies trends in accidents to discriminate against minorities. One way to carve the world of data mining is at the public/private joint. Public uses are generally by governments or their proxies, which can include universities and corporate contractors, and can employ data from private sources (such as credit card information). Public data mining can, at least in principle, claim to be in the service of some collective good. The validity of such a claim must be assessed and then weighed against damage or threats to other public goods or values. In the United States, the General Accounting Office, a research and investigative branch of Congress, identified 199 federal data-mining projects and found that of these, 54 mined private sector data, with 36 involving personal information. There were 77 projects using data from other federal agencies, and, of these, 46 involve personal information from the private sector. The personal information, apparently used in these projects without explicit consent, is said to include “student loan application data, bank account numbers, credit card information, and taxpayer identification numbers.” The projects served a number of purposes, the top six of which are given as “improving service or performance” (65 projects), “detecting fraud, waste and abuse” (24), “analyzing scientific and research information” (23), “managing human resources” (17), “detecting criminal activities or patterns” (15) and “analyzing intelligence and detecting terrorist activities” (14) (General Accounting Office, 2004). Is losing confidentiality in credit card transactions a fair exchange for improved government service? Research? National security? These questions are the focus of sustained debate. In the private sphere, data miners enjoy fewer opportunities to claim that their work will result in collective benefit. The strongest warrant for private or for-profit
data mining is that corporations have fiduciary duties to shareholders and investors, and these duties can only or are best discharged by using data-mining technology. In many jurisdictions, the use of personal information, including health and credit data, is governed by laws that include requirements for privacy policies and notices. Because these often include opt-out provisions, it is clear is that individuals must — as a matter of strategy, if not moral self-defense — take some responsibility for safeguarding their own information. A further consideration is that for the foreseeable future, most people will continue to have no idea of what data mining is or what it is capable of. For this reason, the standard bar for consent to use data must be set very high indeed. It ought to include definitions of data mining and must explicitly disclose that the technology is able to infer or discover things about individuals that are not obvious from the kinds of data that are willingly shared: that databases can be linked; that data can be inaccurate; that decisions based on this technology can affect credit eligibility and employment; and so on (O’Leary, 1995). Indeed, the very point of data mining is the discovery of trends, patterns and (putative) facts that are not obvious or knowable on the surface. This must become a central feature of privacy notice disclosure. Further, all data-mining entities ought to develop or adopt and then adhere to policies and guidelines governing their practice (O’Leary, 1995).
Error and Uncertainty

Data analysis is probabilistic. Very large datasets are buggy and contain inaccuracies. It follows that patterns, trends, and other inferences based on algorithmic analysis of databases are themselves open to challenge on grounds of accuracy and reliability. Moreover, when databases are proprietary, there are disincentives to operate with the standards for openness and transparency that are morally required, and increasingly demanded, in government, business, and science. What emerges is an imperative to identify and follow best practices for data acquisition, transmission, and analysis. Although some errors are forgivable, those caused by sloppiness, low standards, haste, or inattention can be blameworthy. When the stakes are high, errors can cause extensive harm. It would indeed be a shame to solve problems of privacy, consent, and appropriate use only to fail to do the job one set out to do in the first place. Perhaps the most interesting kind of data-mining error is that which identifies or relies on illusory patterns or flawed inferences. A business data-mining operation that errs in predicting the blue frock market may disappoint, but an error by data miners for a manned space program, biomedical research project, public health
surveillance system, or national security agency could be catastrophic. This means that good ethics requires good science. That is, it is not adequate to collect and analyze data for worthy purposes — one must also seek and stand by standards for doing so well. One feature of data mining that makes this exceptionally difficult is its very status as a science. It is customary in most empirical inquiries to frame and then test hypotheses. Although philosophers of science disagree about the best or most productive methods for conducting such inquiries, there is broad agreement that experiments and tests generally offer more rigor and produce more reliable results and insights than (mere) observation of patterns between or among variables. This is a well-known problem in domains in which experiments are impossible or unethical. In epidemiology and public health, for instance, it is often not possible to test a hypothesis by using the experimental tools of control groups, double blinding, and placebos (one could not intentionally attempt to give people a disease by different means in order to identify the most dangerous routes of transmission). For this reason, epidemiologists and public health scientists are usually mindful of the fact that their studies might fail to reveal a causal relation, but rather, no more than a statistical correlation. Put differently, when there is no hypothesis to test, it is “not possible to know what will be found until it is discovered” (Fule & Roddick, 2004, p. 160). There is a precedent for addressing this kind of challenge. In meta-analysis, or the concatenation and reanalysis of others’ results, statisticians endured withering criticism of questionable methods, bias introduced by initial data selection and over-broad conclusions. They have responded by refining their techniques and improving their processes. Meta-analysis remains imperfect, but scientists in numerous disciplines (perhaps most noteworthily in the biomedical sciences, where in some instances meta-analysis has altered standards of patient care) have come to rely on it (Goodman, 1998b).
FUTURE TRENDS
Data mining has, in a short time, become ubiquitous. Like many hybrid sciences, it was never clear who “owned” it and therefore who should assume responsibility for its practice. Attention to data mining has increased in part because of concern about its use by governments and the extent to which such use will infringe on civil liberties. This has, in turn, led to renewed scrutiny, a positive development. It must, however, be emphasized that while issues related to privacy and confidentiality are of fundamental concern, they are by no means the only ethical issues raised by data mining and, indeed, attention to them at the expense of the other issues identified here would be a dangerous oversight. The task of identifying ethical issues and proposing solutions, approaches, and best (ethical) practices itself requires additional research. It is even possible to evaluate data mining’s “ethical sensitivity,” that is, the extent to which rule generation itself takes ethical issues into account (Fule & Roddick, 2004). A broad-based research program will use the “content, user, purpose” triad to frame and test hypotheses about the best ways to protect widely shared values; it might be domain specific, emphasizing those areas that raise the most interesting issues: business and economics, public health (including bioterrorism preparedness), scientific (including biomedical) research, government operations (including national security), and so on. Two domains raise especially pressing issues and concerns: genetics and national security surveillance.
Genetics and Genomics
The completion of the Human Genome Project in 2001 is correctly seen as a scientific watershed. Biology and medicine will never be the same. Although it has been recognized for some time that the genome sciences are dependent on information technology, the ethical issues and challenges raised by this symbiosis have only been sketched (Goodman, 1996; Goodman, 1999). The fact that vast amounts of human and other genetic information can be digitized and analyzed for clinical, research, and public health purposes has led to the creation of a variety of databases and, consequently, to the establishment of the era of data mining in genomics. It could not be otherwise: “[T]he amount of new information is so great that data mining techniques are essential in order to obtain knowledge from the experimental data” (Larrañaga, Menasalvas, & Peña, 2004). The ethical issues that arise at the intersection of genetics and data mining should be anticipated: privacy and confidentiality, appropriate uses and users, and accuracy and uncertainty. Each of these embodies special features. Confidentiality of genetic information applies not only to the source of genetic material but also, in one degree or another, to his or her relatives. The concepts of appropriate use and user are strained given extensive collaborations between and among corporate, academic, and government entities. And challenges of accuracy and uncertainty are magnified when the results of data mining are applied in medical practice.
National Security Surveillance
Although it would be blameworthy for vulnerable societies to fail to use all appropriate tools at their disposal to prevent terrorist attacks, data-mining technology poses new and interesting challenges to the concept of “appropriate tool.” There are concerns at the outset that data acquisition — for whatever purpose — may be inappropriate. This is true whether police, military, or other government officials are collecting information on notecards or in huge, machine-tractable databases. Knowledge discovery ups the ante considerably. The privacy rights of citizens in a democracy are not customarily bartered for other benefits, such as security. The rights are what constitute the democracy in the first place. On the other hand, rights are not absolute, and there might be occasions on which morality permits or even requires infringements. Moreover, when a database is created, appropriate use and error/uncertainty become linked: appropriate use includes (at least tacitly) the notion that such use will achieve its specified ends. To the degree that error propagation or uncertainty can impeach data-mining results, the use itself is correspondingly impeached.
CONCLUSION
The intersection of computing and ethics has been a fertile ground for scholars and policymakers. The rapid growth and extraordinary power of data mining and knowledge discovery in databases bids fair to be a source of new and interesting ethical challenges. Precedents in kindred domains, such as health informatics, offer a platform on which to begin to prepare the kind of thoughtful and useful conceptual tools that will be required to meet those challenges. By identifying the key nodes of content, user, and purpose, and the leading ethical issues of privacy/confidentiality, appropriate use(r), and error/uncertainty, we can enjoy some optimism that the future of data mining will be guided by thoughtful rules and policies and not the narrower (but not invalid) interests of entrepreneurs, police, or scientists who hesitate or fail to weigh the goals of their inquiries against the needs and expectations of the societies that support their work.
REFERENCES
American Statistical Association. (1999). Ethical guidelines for statistical practice. Retrieved from http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics
Association for Computing Machinery. (1992). Code of ethics and professional conduct. Retrieved from http://www.acm.org/constitution/code.html
Bynum, T. W. (Ed.). (1985). Computers and ethics. New York: Blackwell.
Ermann, M. D., Williams, M. B., & Gutierrez, C. (Eds.). (1990). Computers, ethics, and society. New York: Oxford University Press.
Forester, T., & Morrison, P. (1994). Computer ethics: Cautionary tales and ethical dilemmas in computing (2nd ed.). Cambridge, MA: MIT Press.
Fule, P., & Roddick, J. F. (2004). Detecting privacy and ethical sensitivity in data mining results. In V. Estivill-Castro (Ed.), Proceedings of the 27th Australasian Computer Science Conference (pp. 159-166).
General Accounting Office. (2004). Data mining: Federal efforts cover a wide range of uses (Rep. No. GAO-04-548). Washington, DC: U.S. General Accounting Office. Retrieved from http://www.gao.gov/new.items/d04548.pdf
Goodman, K. W. (1996). Ethics, genomics, and information retrieval. Computers in Biology and Medicine, 26, 223-229.
Goodman, K. W. (Ed.). (1998a). Ethics, computing and medicine: Informatics and the transformation of health care. Cambridge, MA: Cambridge University Press.
Goodman, K. W. (1998b). Meta-analysis: Conceptual, ethical and policy issues. In K. W. Goodman (Ed.), Ethics, computing and medicine: Informatics and the transformation of health care (pp. 139-167). Cambridge, MA: Cambridge University Press.
Goodman, K. W. (1999). Bioinformatics: Challenges revisited. MD Computing, 16(3), 17-20.
IEEE. (1990). IEEE code of ethics. Retrieved from http://www.ieee.org/portal/index.jsp?pageID=corp_level1&path=about/whatis&file=code.xml&xsl=generic.xsl
International Statistical Institute. (1985). Declaration on professional ethics. Retrieved from http://www.cbs.nl/isi/ethics.htm
Johnson, D. O. (1994). Computer ethics. Englewood Cliffs, NJ: Prentice Hall.
Johnson, D. O., & Snapper, J. W. (Eds.). (1985). Ethical issues in the use of computers. Belmont, CA: Wadsworth.
Larrañaga, P., Menasalvas, E., & Peña, J. M. (2004). Data mining in genomics and proteomics. Artificial Intelligence in Medicine, 31, iii-iv.
O’Leary, D. E. (1995, April). Some privacy issues in knowledge discovery: The OECD personal privacy guidelines. IEEE Expert, 48-52.
KEY TERMS
Applied Ethics: The branch of ethics that emphasizes not theories of morality but ways of analyzing and resolving issues and conflicts in daily life, the professions and public affairs.
Codes of Ethics: More or less detailed sets of principles, standards, and rules aimed at guiding the behavior of groups, usually of professionals in business, government, and the sciences.
Confidentiality: The claim, right, or desire that personal information about individuals should be kept secret or not disclosed without permission or informed consent.
Ethics: The branch of philosophy that concerns itself with the study of morality.
Informed (or Valid) Consent: The process by which people make decisions based on adequate information, voluntariness, and capacity to understand and appreciate the consequences of whatever is being consented to.
Morality: A common or collective set of rules, standards, and principles to guide behavior and by which individuals and society can gauge an action as being blameworthy or praiseworthy.
Privacy: Humans’ claim or desire to control access to themselves or information about them; privacy is broader than confidentiality, and it includes interest in information protection and control.
Mosaic-Based Relevance Feedback for Image Retrieval
Odej Kao, University of Paderborn, Germany
Ingo la Tendresse, Technical University of Clausthal, Germany
INTRODUCTION
A standard approach for content-based image retrieval (CBIR) is based on the extraction and comparison of features usually related to dominant colours, shapes, textures, and layout (Del Bimbo, 1999). These features are defined a priori and extracted when the image is inserted into the database. At query time the user submits a similar sample image (query-by-sample-image) or draws a sketch (query-by-sketch) of the sought archived image. The degree of similarity between the current query image and the target images is determined by calculating a multidimensional distance between the corresponding features. The computed similarity values allow the creation of an image ranking, where the first k images, usually k=32 or k=64, are considered retrieval hits. These are chained in a list called a ranking and then presented to the user. Each of these images can be used as a starting point for a refined search in order to improve the obtained results. The assessment of the retrieval result is based on a subjective evaluation of whole images and their position in the ranking. An important disadvantage of retrieval with content-based features and of presenting the resulting images as a ranking is that the user is usually not aware why certain images are shown on the top positions and why certain images are ranked low or not presented at all. Furthermore, users are also interested in which sketch properties are decisive for the consideration or rejection of the images, respectively. In the case of primitive features like colour, these questions can often be answered intuitively. Retrieval with complex features considering, for example, texture and layout creates rankings where the similarity between the query and the target images is not always obvious. Thus, the user is not satisfied with the displayed results and would like to improve the query, but it is not clear to him/her which parts of the querying sketch or the sample image should be modified and improved according to the desired targets. Therefore, a suitable feedback mechanism is necessary.
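To make the generic query-by-example flow above concrete, the following minimal Python sketch illustrates the ranking step: features are extracted offline, a multidimensional distance is computed between the query features and every archived feature vector, and the k closest images form the ranking. The three-component "colour histogram" features, the Euclidean distance, and the value of k are illustrative assumptions, not details prescribed by the article.

    import numpy as np

    def rank_images(query_features, archive, k=3):
        """Return the k archived images closest to the query in feature space.
        `archive` maps an image id to its pre-extracted feature vector."""
        q = np.asarray(query_features, dtype=float)
        distances = {img_id: float(np.linalg.norm(q - np.asarray(f, dtype=float)))
                     for img_id, f in archive.items()}
        return sorted(distances, key=distances.get)[:k]

    # Toy archive of pre-extracted feature vectors (e.g., coarse colour histograms)
    archive = {
        "sunset.jpg": [0.8, 0.1, 0.1],
        "forest.jpg": [0.1, 0.7, 0.2],
        "beach.jpg":  [0.6, 0.2, 0.2],
    }
    print(rank_images([0.75, 0.15, 0.1], archive, k=2))   # -> ['sunset.jpg', 'beach.jpg']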
BACKGROUND
Relevance feedback techniques are often used in image databases in order to gain additional information about the sought image set (Rui, Huang, & Mehrotra, 1998). The user evaluates the retrieval results and selects, for example, positive and negative instances, so that in the subsequent retrieval steps the search parameters are optimised and the corresponding images/features are supplied with higher weights (Müller & Pun, 2000; Rui & Huang, 2000). Moreover, additional query images can be considered and allow a more detailed specification of the target image (Baudisch, 2001). However, the more complex the learning model, the more difficult the analysis and evaluation of the retrieval results. Users – in particular those with limited expertise in image processing and retrieval – are not able to detect misleading areas in the image/sketch with respect to the applied retrieval algorithms and to modify the sample image/sketch appropriately. Consequently, an iterative search is often reduced to a random process of parameter optimisation. The consideration of existing user knowledge is the main objective of many feedback techniques. In the case of user profiling (Cox, Miller, Minka, Papathomas, & Yianilos, 2000), the previously completed retrieval sessions are analysed in order to obtain additional information about the user's preferences. Furthermore, the selection actions during the current session are monitored and images similar to the selected ones are given higher weights. A hierarchical approach named multi-feedback ranking separates the querying image into several regions and allows a more detailed search (Mirmehdi & Perissamy, 2001). Techniques for permanent feedback guide the user through the entire retrieval process. An example of this approach is implemented by the system Image Retro: based on a selected sample image, a number of images are removed from the possible image set. By analysing these images, the user develops an intuition about promising starting points (Vendrig, Worring, & Smeulders, 2001). In the case of fast feedback, the user receives the current ranking after every modification of the query sketch. The correspondence between the presented ranking and the last modification helps the user to identify sketch improvements that lead in the desired retrieval direction and to undo inappropriate sketch drawings immediately (Veltkamp, Tanase, & Sent, 2001).
Figure 1. Compilation of a mosaic-based ranking
FEEDBACK METHOD FOR QUERY-BY-SKETCH
This section describes a feedback method which helps the user to improve the query sketch and to obtain the desired results. The query sketch is separated into regions, and each region is compared with the corresponding subsection of all images in the database. The most similar sections are grouped into a mosaic, so that the user can quickly detect well-defined and misleading areas of his/her query sketch. The creation of a mosaic consisting of k areas filled with the most similar sections is shown in Figure 1. The corresponding pseudo-code algorithm for mosaic-based retrieval and ranking presentation is given in Algorithm 1. The criterion c_m for the similarity computation of the corresponding sections can be identical with the main retrieval criterion (c_m = c) or be adapted to specific images or users (c_m ≠ c). The size of the mosaic sections can also be selected by the user; here, grids consisting of 16×16 and 32×32 pixel blocks are evaluated.

Algorithm 1. Pseudo code of the mosaic build-up

    procedure mosaic(sketch s, ranking R, criterion c_m, grid g)
    begin
        divide s using a uniform grid g into k fields s_1 … s_k
        initialise similarity[1 … k] = 0, image[1 … k] = none
        for each image i ∈ R
        begin
            scale i to i* with dim[i*] = dim[s]
            divide i* using grid g into k fields i_1 … i_k
            for j = 1 … k do
                if similarity[j] < c_m(s_j, i_j) then
                    similarity[j] = c_m(s_j, i_j), image[j] = i_j
            od
        end
        mosaic M = ∪_{j=1..k} image[j]
    end
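As a concrete illustration of Algorithm 1, the following Python sketch builds a mosaic from a query sketch and a set of ranked images. It is a minimal reconstruction under stated assumptions: images are NumPy arrays, the section criterion c_m is approximated here by a negative mean colour difference, and the 16×16 grid is only an example — none of these choices are prescribed by the article.

    import numpy as np

    def section_similarity(a, b):
        # Illustrative stand-in for the criterion c_m: larger means more similar.
        return -float(np.mean(np.abs(a.astype(float) - b.astype(float))))

    def build_mosaic(sketch, ranked_images, grid=(16, 16)):
        """Assemble a mosaic whose cells hold the most similar sections
        found anywhere in the ranked images (cf. Algorithm 1)."""
        h, w = sketch.shape[:2]
        gy, gx = grid
        ch, cw = h // gy, w // gx                      # cell height/width
        best_score = np.full((gy, gx), -np.inf)
        mosaic = np.zeros_like(sketch)
        for img in ranked_images:
            # scale i to i* with dim[i*] = dim[s]; nearest-neighbour resize for brevity
            ys = np.arange(h) * img.shape[0] // h
            xs = np.arange(w) * img.shape[1] // w
            scaled = img[ys][:, xs]
            for r in range(gy):
                for c in range(gx):
                    sl = (slice(r * ch, (r + 1) * ch), slice(c * cw, (c + 1) * cw))
                    score = section_similarity(sketch[sl], scaled[sl])
                    if score > best_score[r, c]:
                        best_score[r, c] = score
                        mosaic[sl] = scaled[sl]
        return mosaic

    # Usage with random stand-in data (a real system would use archived images):
    sketch = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
    ranking = [np.random.randint(0, 256, (200, 150, 3), dtype=np.uint8) for _ in range(5)]
    mosaic = build_mosaic(sketch, ranking)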
Figure 2. An example for a mosaic feedback
An example of the resulting mosaics is given in Figure 2. On the left-hand side the sought image is shown, which was approximated by the user-defined sketch in the middle. On the right-hand side the resulting 16×16 mosaic containing the best section hits of all images in the database is presented. A manual post-processing of the mosaic clarifies the significance of individual mosaic blocks: those sections which are parts of the sought image are framed with a white box. Furthermore, sections from images of the same category – in this case other pictures with animals – are marked with white circles. An extension of the fixed grids is realised by adaptive mosaics, where neighbouring sections – depending on a given similarity threshold – are merged into larger, homogeneous areas, so that the user can evaluate individual sections more easily. Figure 3 shows a sample image and the corresponding mosaic-based ranking, which is obtained using an adaptive grid.
Figure 3. Mosaic based on an adaptive grid
In these manually post-processed cases additional information is provided to the user: which sketch regions are already sufficiently prepared (boxes) and which regions point in the right direction but still need minor corrections (circles). All other regions do not satisfy the similarity criteria and have to be re-drawn. With this information the user can focus on the most promising areas and modify them in a suitable manner. For the usage of this feedback method in real-world applications it is necessary to evaluate whether typical users are able to detect suitable and misleading regions intuitively and thus to improve the original sketch.
For the performance evaluation of the developed mosaic technique and for measuring the retrieval quality, we created a test image database with pictures of various categories and selected a wavelet-based feature for the retrieval (Kao & Joubert, 2001). Wavelet-based features have in common that the semantic interpretation of the extracted coefficients, and thus of the produced rankings, is rather difficult, in particular for users without a significant background in image processing. Nevertheless, such features perform well on general image sets and are applied in a variety of systems (Jacobs, Finkelstein, & Salesin, 1995). The application of the mosaic-based feedback could improve the usability of these powerful features by helping the user to find and to modify the decisive image regions. The experiments considered fine-grained mosaics with a 16×16 grid, a 32×32 grid, and adaptive grids. Simultaneously, a traditional ranking for the current query was created, which contained the best k images for the given sketch and the selected similarity criterion. Both rankings were subsequently distributed to 16 users, who analysed them and – based on the derived knowledge – improved the sketch appropriately. The obtained results were evaluated manually in order to resolve the following basic questions related to possible applications of the mosaic feedback:
• Can a mosaic be used as a sole presentation form for image ranking and thus replace the traditional list/table-based ranking?
• Are users able to recognise which parts of the mosaic represent sections of relevant images, and which parts should be modified?
• Can we expect a significant improvement of the sketch quality if the relevant sections are correctly marked/recognised?
A query of an image database is considered successful if the sought image is returned within the first k retrieval hits. Analogously, in the case of mosaic-based result presentation, image retrieval is evaluated as successful if at least one of the mosaic blocks is a part of the sought image. An analysis of the produced rankings shows that the recall rate improved – depending on the used colour space – by up to 32.75% (la Tendresse & Kao, 2003; Skubowius, 2002). However, it is not clear whether the user is able to establish a relation between a presented section and the image he/she is looking for. Therefore, we asked 16 test persons to study a number of images and the corresponding mosaics. The similarity of the images to the given sketch was not obvious, as the images were found during the mosaic-based retrieval but not during the traditional retrieval. The test persons had 180 seconds to mark all mosaic sections which – in their opinion – match an original image section. They found approximately 65% of the relevant sections per mosaic and needed on average 63 seconds of study time.
In order to evaluate the mosaic as a feedback technique, we selected 26 sketches which did not lead to the desired images during either the traditional or the mosaic-based retrieval and asked 16 test persons to improve the quality of the sketches. Eight test persons received the target image, the old sketch, as well as the corresponding mosaic, and had 300 seconds to correct the colours, shapes, layout, etc. The other eight persons had to improve the sketch without the mosaic feedback. All modified sketches were submitted to the CBIR system and produced the following results: 46.59% of the sketches changed according to the mosaic feedback led to a successful traditional retrieval, i.e., the sought image was returned within the 16 most similar images. In contrast, only 34.24% of the sketches modified without mosaic feedback enabled a successful retrieval. If a mosaic-based ranking was presented, the recall rate increased to up to 68.75% (with mosaic feedback) and up to 54.35% otherwise. Moreover, the desired image was placed on average two positions above its ranking position prior to the modification.
FUTURE TRENDS
The introduction of novel and powerful methods for relevance feedback in image, video, and audio databases is an emerging and fast-moving process. Improvements related to the integration of semantic information and the combination of multiple attributes in the query (Heesch & Rüger, 2003; Santini, Gupta, & Jain, 2001), automatic detection/modification of misleading areas, 3D result visualisation techniques beyond the simple ranking list (Cinque, Levialdi, Malizia, & Olsen, 1998), queries considering multiple instances (Santini & Jain, 2000), and many more are investigated in a number of research groups. The presented mosaic-based feedback technique eases the modification process by showing significant sketch regions to the user. A re-definition of the entire sketch is, however, time-consuming and usually avoided by the users. Instead, minor modifications are applied to the sketch and another query run is started. By taking this into account, the feedback process – and, more decisively, the next ranking – can be improved if the user is guided to the modification of those regions which exhibit the largest chances of success. Thus, the autonomous detection of misleading sketch sections and system-based suggestions for modification are the most desirable items in the short-term development focus. For this purpose the convergence behaviour of individual sketch sections towards the desired target is analysed. All results can be combined in a so-called convergence map, which is subsequently processed by a number of rules. Based on the output of the analysis process, several actions can be proposed. For example, if the given sketch is completely on the wrong route, the user is asked to re-draw the sketch entirely. If only parts of the sketch are misleading, the user should be asked to pay more attention to those. Finally, some of the necessary modifications can also be performed or proposed automatically, for example if intuitive features such as colours are used for the retrieval. In this case the user is advised to use more red or green colour in some sections or to change the average illumination of the sketch.
An additional visual help for the user during creation of the query sketch is given by a permanent evaluation of the archived images and presentation of the most promising images in a three-dimensional space. The current sketch builds the centre of this 3D space and the archived images are arranged around the sketch according to their similarity distance. The nearest images are the most similar images, so the user can copy some aspects of those images in order to move the retrieval in the desired direction. After each new line the distance to all images is re-computed and the images in the 3D space are re-arranged. Thus, the user can see the effects of the performed modification immediately and proceed in the same direction, or remove the latest modifications and try another path.
CONCLUSIONS
This article described a mosaic-based feedback method which helps image database users to recognise and to modify misleading areas in the querying sketch and to improve the retrieval results. Performance measurements showed that the proposed method eases the querying process and allows the users to modify their sketches in a suitable manner.
REFERENCES
Baudisch, P. (2001). Dynamic information filtering. PhD Thesis, GMD Research Series 2001, No. 16. GMD Forschungszentrum.
Cinque, L., Levialdi, S., Malizia, A., & Olsen, K. A. (1998). A multi-dimensional image browser. Journal of Visual Languages and Computing, 9(1), 103-117.
Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T., & Yianilos, P. N. (2000). The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.
Del Bimbo, A. (1999). Visual information retrieval. Morgan Kaufmann Publishers.
Heesch, D., & Rüger, S. M. (2003). Relevance feedback for content-based image retrieval: What can three mouse clicks achieve? In Advances in Information Retrieval, LNCS 2633 (pp. 363-376).
Jacobs, C. E., Finkelstein, A., & Salesin, D. H. (1995). Fast multiresolution image querying. In Proceedings of ACM Siggraph (pp. 277-286).
Kao, O., & Joubert, G. R. (2001). Efficient dynamic image retrieval using the à trous wavelet transformation. In Advances in Multimedia Information Processing, LNCS 2195 (pp. 343-350).
La Tendresse, I., & Kao, O. (2003). Mosaic-based sketching interface for image databases. Journal of Visual Languages and Computing, 14(3), 275-293.
La Tendresse, I., Kao, O., & Skubowius, M. (2002). Mosaic feedback for sketch training and retrieval improvement. In Proceedings of the IEEE Conference on Multimedia and Expo (ICME 2002) (pp. 437-440).
Mirmehdi, M., & Perissamy, R. (2001). CBIR with perceptual region features. In Proceedings of the 12th British Machine Vision Conference (pp. 51-520).
Müller, H., Müller, W., Squire, D., Marchand-Maillet, S., & Pun, T. (2000). Learning feature weights from user behavior in content-based image retrieval. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Workshop on Multimedia Data Mining MDM/KDD2000).
Rui, Y., & Huang, T. S. (2000). Optimizing learning in image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (pp. 236-245).
Rui, Y., Huang, T. S., & Mehrotra, S. (1998). Relevance feedback techniques in interactive content-based image retrieval. In Proceedings of SPIE 3312 (pp. 25-36).
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (2000). Integrated browsing and querying for image databases. IEEE MultiMedia, 7(3), 26-39.
Veltkamp, R. C., Tanase, M., & Sent, D. (2001). Features in content-based image retrieval systems: A survey. Kluwer Academic Publishers.
Vendrig, J., Worring, M., & Smeulders, A. W. M. (2001). Filter image browsing: Interactive image retrieval by using database overviews. Multimedia Tools and Applications, 15(1), 83-103.
KEY TERMS
Content-Based Image Retrieval: Search for suitable images in a database by comparing extracted features related to color, shape, layout, and other specific image characteristics.
Feature Vector: Data that describes the content of the corresponding image. The elements of the feature vector represent the extracted descriptive information with respect to the utilised analysis.
Query By Sketch, Query by Painting: A widespread query type used where the user is not able to present a similar sample image. The user instead sketches the looked-for image with a few drawing tools. It is not necessary to do this correctly in all aspects.
Query-By-Pictorial-Example: The query is formulated by using a user-provided example image for the desired retrieval. Both the query and the stored images are analysed in the same way.
Ranking: List of the most similar images extracted from the database according to the querying sketch/image or other user-defined criteria. The ranking displays the retrieval results to the user.
Relevance Feedback: The user evaluates the quality of the individual items in the ranking based on the subjective expectations/specification of the sought item. Subsequently, the user supplies the system with the evaluation results (positive/negative instances, weights, …) and re-starts the query. The system considers this user knowledge while computing the similarity.
Similarity: Correspondence of two images. The similarity is determined by comparing the extracted feature vectors, for example by a metric or distance function.
Multimodal Analysis in Multimedia Using Symbolic Kernels
Hrishikesh B. Aradhye, SRI International, USA
Chitra Dorai, IBM T. J. Watson Research Center, USA
INTRODUCTION
The rapid adoption of broadband communications technology, coupled with ever-increasing capacity-to-price ratios for data storage, has made multimedia information increasingly pervasive and accessible for consumers. As a result, the sheer volume of multimedia data available has exploded on the Internet in the past decade in the form of Web casts, broadcast programs, and streaming audio and video. However, indexing, search, and retrieval of this multimedia data is still dependent on manual, text-based tagging (e.g., in the form of a file name of a video clip). Manual tagging of media content is often bedeviled by an inadequate choice of keywords, incomplete and inconsistent terms, and the subjective biases of the annotator introduced in his or her descriptions of content, all of which adversely affect accuracy in the search and retrieval phase. Moreover, manual annotation is extremely time-consuming, expensive, and unscalable in the face of ever-growing digital video collections. Therefore, as multimedia get richer in content, become more complex in format and resolution, and grow in volume, the urgency of developing automated content analysis tools for indexing and retrieval of multimedia becomes easily apparent.
Recent research towards content annotation, structuring, and search of digital media has led to a large collection of low-level feature extractors, such as face detectors and recognizers, videotext extractors, speech and speaker identifiers, people/vehicle trackers, and event locators. Such analyses increasingly process both visual and aural elements, resulting in large sets of multimodal features. For example, the results of these multimedia feature extractors can be:
• Real-valued, such as shot motion magnitude, audio signal energy, trajectories of tracked entities, and scene tempo
• Discrete or integer-valued, such as the number of faces detected in a video frame and existence of a scene boundary (yes/no)
• Ordinal, such as shot rhythm, which exhibits partial neighborhood properties (e.g., metric, accelerated, decelerated)
• Nominal, such as identity of a recognized face in a frame and text recognized from a superimposed caption1
Multimedia metadata based on such a multimodal collection of features pose significant difficulties to subsequent tasks such as classification, clustering, visualization, and dimensionality reduction — all of which traditionally deal with only continuous-valued data. Common data-mining algorithms employed for these tasks, such as Neural Networks and Principal Component Analysis (PCA), often assume a Euclidean distance metric, which is appropriate only for real-valued data. In the past, these algorithms could be applied to symbolic domains only after representing the symbolic labels as integers or real values, or after a feature-space transformation that maps each symbolic feature to multiple binary features. These data transformations are artificial. Moreover, the original feature space may not reflect the continuity and neighborhood imposed by the integer/real representation.
This paper discusses mechanisms that extend tasks traditionally limited to continuous-valued feature spaces, such as (a) dimensionality reduction, (b) de-noising, (c) visualization, and (d) clustering, to multimodal multimedia domains with symbolic and continuous-valued features. To this end, we present four kernel functions based on well-known distance metrics that are applicable to each of the four feature types. These functions effectively define a linear or nonlinear dot product of real or symbolic feature vectors and therefore fit within the generic framework of kernel space machines. The framework of kernel functions and kernel space machines provides classification techniques that are less susceptible to overfitting when compared with several data-driven learning-based classifiers. We illustrate the usefulness of such symbolic kernels within the context of Kernel PCA and Support Vector Machines (SVMs), particularly in temporal clustering and tracking of videotext in multimedia. We show that such analyses help capture information from symbolic feature spaces, visualize symbolic data, and aid tasks such as classification and clustering, and therefore are eminently useful in multimodal analysis of multimedia.
BACKGROUND
Early approaches to multimedia content analysis dealt with multimodal feature data in two primary ways. Either a learning technique such as Neural Nets was used to find patterns in the multimodal data after mapping symbolic values into integers, or the multimodal features were segregated into different groups according to their modes of origin (e.g., into audio and video features), processed separately, and the results from the separate processes were merged by using some probabilistic mechanism or evidence combination method. The first set of methods implicitly assumed the Euclidean distance as an underlying metric between feature vectors. Although this may be appropriate for real-valued data, it imposes a neighborhood property on symbolic data that is artificial and is often inappropriate. The second set of methods essentially dealt with each category of multimodal data separately and fused the results. They were thus incapable of leading to novel patterns that can arise if the data were treated together as a whole. As audiovisual collections of today provide multimodal information, they need to be examined and interpreted together, not separately, to make sense of the composite message (Bradley, Fayyad, & Mangasarian, 1998).
Recent advances in machine-learning and data analysis techniques, however, have enabled more sophisticated means of data analyses. Several researchers have attempted to generalize the existing PCA-based framework. For instance, Tipping (1999) presented a probabilistic latent-variable framework for data visualization of binary and discrete data types. Collins and co-workers (Collins, Dasgupta, & Schapire, 2001) generalized the basic PCA framework, which inherently assumes Gaussian features and noise, to other members of the exponential family of functions. In addition to these research efforts, Kernel PCA (KPCA) has emerged as a new data representation and analysis method that extends the capabilities of the classical PCA — which is traditionally restricted to linear feature spaces — to feature spaces that may be nonlinearly correlated (Scholkopf, Smola, & Muller, 1999). In this method, the input vectors are implicitly projected on a high-dimensional space by using a nonlinear mapping. Standard PCA is then applied to this high-dimensional space. KPCA avoids explicit calculation of high-dimensional projections with the use of kernel functions, such as radial basis functions (RBF), high-degree polynomials, or the sigmoid function. KPCA has been successfully used to capture important information from large, nonlinear feature spaces into a smaller set of principal components (Scholkopf et al., 1999). Operations such as clustering or classification can then be carried out in this reduced dimensional space. Because noise is eliminated as projections on eigenvectors with low eigenvalues, the final reduced space of larger principal components contains less noise and yields better results with further data analysis tasks such as classification.
Although many conventional methods have been previously developed for extraction of principal components from nonlinearly correlated data, none allowed for generalization of the concepts to dimensionality reduction of symbolic spaces. The kernel-space representation of KPCA presents such an opportunity. However, since its inception, applications of KPCA have been primarily limited to domains with real-valued, nonlinearly correlated features, despite the recent literature on defining kernels over several discrete objects such as sequences, trees, and graphs, as well as many other types of objects. Moreover, recent techniques like the Fisher kernel approach by Jaakkola and Haussler (1999) can be used to systematically derive kernels from generative models, which have been demonstrated quite successfully in the rich symbolic feature domain of bioinformatics.
Against the backdrop of these emerging collections of research, the work presented in this paper uses the ideas of Kernel PCA and symbolic kernel functions to investigate the yet unexplored problem of symbolic domain principal component extraction in the context of multimedia. The kernels used here are designed based on well-known distance metrics, namely Hamming distance, Cityblock distance, and the Edit distance metric, and have been previously used for string comparisons in several domains, including gene sequencing. With these and other symbolic kernels, multimodal data from multimedia analysis containing real and symbolic values can be handled in a uniform fashion by using, say, an SVM classifier employing a kernel function that is a combination of Euclidean, Hamming, and Edit Distance kernels. Applications of the proposed kernel functions to temporal analysis of videotext data demonstrate the utility of this approach.
MAIN THRUST
Distance Kernels for Multimodal Data
Kernel-based classifiers such as SVMs and Neural Networks use linear, Radial Basis Function (RBF), or polynomial functions as kernels that first (implicitly) transform input data into a higher dimensional feature space and then process them in this space. Many of the common kernels assume a Euclidean distance to compare feature vectors. Symbolic kernels, in contrast, have been less commonly used in the published literature. We use the following distance-based kernel functions in our analysis.
1. Linear (Euclidean) Kernel Function for Real-valued Features: This is the most commonly used kernel function for linear SVMs and other kernel-based algorithms. Let x and z be two feature vectors. Then the function
K_l(x, z) = x^T z    (1)
defines a Euclidean distance-based kernel function. KPCA with the linear kernel function reduces to standard PCA. The linear kernel trivially follows Mercer's condition for kernel validity, that is, the matrix comprising pairwise kernel function values for any finite subset of feature vectors selected from the feature space is guaranteed to be positive semidefinite.
2. Hamming Kernel Function for Nominal Features: Let the number of features be N. Let x and z be two feature vectors, that is, N-dimensional symbolic vectors from a finite, symbolic feature space X^N. Then the function
K_h(x, z) = N − Σ_{i=1}^{N} δ(x_i, z_i)    (2)
where δ(x_i, z_i) = 0 if x_i = z_i and 1 otherwise, defines a Hamming distance-based kernel function and follows Mercer's condition for kernel validity (Aradhye & Dorai, 2002). The equality x_i = z_i refers to a symbol/label match.
3. Cityblock Kernel Function for Discrete Features: Using the preceding notation, the function
K_c(x, z) = Σ_{i=1}^{N} (M_i − |x_i − z_i|)    (3)
where the ith feature is M_i-ary, defines a Cityblock distance-based kernel function and follows Mercer's condition for kernel validity (Aradhye & Dorai, 2002).
4. Edit Kernel Function for Stringlike Features: We define an edit kernel between two strings as
K_e(x, z) = max(len(x), len(z)) − E(x, z)    (4)
where E(x, z) is the edit distance between the two strings, defined conventionally as the minimum number of change, delete, and insert operations required to convert one string into another, and len(x) is the length of string x. In theory, Edit distance does not obey Mercer validity, as has been recently proved by Cortes and coworkers (Cortes, Haffner, & Mohri, 2002, 2003). However, empirically, the kernel matrices generated by the edit kernel are often positive definite, justifying the practical use of Edit distance-based kernels.
Hybrid Multimodal Kernel
Having defined these four basic kernel functions for different modalities of features, we are now in a position to define a multimodal kernel function that encompasses all types of common multimedia features. Let any given feature vector x be comprised of a real-valued feature set x_r, a nominal feature set x_h, a discrete-valued feature set x_c, and a string-style feature x_e, such that x = [x_r x_h x_c x_e]. Then, because a linear combination of valid kernel functions is a valid kernel function, we define
K_m(x, z) = α K_l(x_r, z_r) + β K_h(x_h, z_h) + γ K_c(x_c, z_c) + δ K_e(x_e, z_e)    (5)
where K_m(x, z) is our multimodal kernel and α, β, γ, and δ are constants. Such a hybrid kernel can now be seamlessly used to analyze multimodal feature vectors that have real and symbolic values, without imposing any artificial integer mapping of symbolic labels and further obtaining the benefits of analyzing disparate data together as one. The constants α, β, γ, and δ can be determined in practice either by a knowledge-based analysis of the relative importance of the different types of features or by empirical optimization.
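The following Python sketch shows one way the four kernels and the hybrid combination of Equation (5) could be realized. It is an illustrative reconstruction from the definitions above, not the authors' code; the per-feature arities, the example weights α = β = γ = δ = 1, and the simple dynamic-programming edit distance are assumptions made for the example.

    import numpy as np

    def k_linear(xr, zr):
        return float(np.dot(xr, zr))                                    # Eq. (1)

    def k_hamming(xh, zh):
        return len(xh) - sum(a != b for a, b in zip(xh, zh))            # Eq. (2)

    def k_cityblock(xc, zc, arity):
        # arity[i] is M_i, the number of values the ith discrete feature can take
        return sum(m - abs(a - b) for a, b, m in zip(xc, zc, arity))    # Eq. (3)

    def edit_distance(s, t):
        # classic dynamic-programming edit distance E(x, z)
        d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
        d[:, 0] = np.arange(len(s) + 1)
        d[0, :] = np.arange(len(t) + 1)
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                              d[i - 1, j - 1] + (s[i - 1] != t[j - 1]))
        return int(d[len(s), len(t)])

    def k_edit(xe, ze):
        return max(len(xe), len(ze)) - edit_distance(xe, ze)            # Eq. (4)

    def k_multimodal(x, z, arity, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
        # x and z are dicts holding the four feature groups; the weights are assumptions
        return (alpha * k_linear(x["real"], z["real"])
                + beta * k_hamming(x["nominal"], z["nominal"])
                + gamma * k_cityblock(x["discrete"], z["discrete"], arity)
                + delta * k_edit(x["string"], z["string"]))             # Eq. (5)

    # Illustrative feature vectors for two video frames
    x = {"real": [0.4, 1.2], "nominal": ["faceA", "indoor"], "discrete": [2, 1], "string": "SCORE 3-1"}
    z = {"real": [0.5, 1.0], "nominal": ["faceB", "indoor"], "discrete": [3, 1], "string": "SCORE 3-2"}
    print(k_multimodal(x, z, arity=[5, 3]))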
Example Multimedia Application: Videotext Postprocessing
Video sequences contain a rich combination of images, sound, motion, and text. Videotext, which refers to text superimposed on images and video frames, serves as an important source of semantic information in video streams, besides speech, closed captions, and visual content in video. Recognizing text superimposed on video frames yields important information such as the identity of the speaker, his/her location, the topic under discussion, sports scores, product names, associated shopping data, and so forth, allowing for automated content description, search, event monitoring, and video program categorization. Therefore, a videotext-based Multimedia Description Scheme has recently been adopted into the MPEG-7 ISO/IEC standard to facilitate media content description (Dimitrova, Agnihotri, Dorai, & Bolle, 2000). To this end, videotext extraction and recognition is an important task of an automated video content analysis system. Figure 1 shows three illustrative frames containing videotext taken from MPEG videos of different genres.
Figure 1. Illustrative frames with videotext
However, unlike scanned paper documents, videotext is superimposed on often changing backgrounds comprising moving objects with a rich variety of color and texture. In addition, videotext is often of low resolution and suffers from compression artifacts. Due to these difficulties, existing OCR algorithms result in low accuracy when applied to the problem of videotext extraction and recognition. On the other hand, we observed that temporal videotext often persists on the screen over a span of time (approximately 30 seconds) to ensure readability, resulting in many available samples of the same body of text over multiple frames of video. This redundancy can be exploited to improve recognition accuracy, although erroneous text extraction and/or incorrect character recognition by a classifier may make the strings dissimilar from frame to frame. Often no single instance of the text may lead to perfect recognition, underscoring the need for intelligent postprocessing.
In the existing literature, unfortunately, temporal contiguity analysis of videotext is implemented by using ad hoc thresholds and heuristics, for the following reasons. First of all, due to missed or merged characters, the same string may be perceived to be of different lengths on different frames; we thus have feature vectors of varying lengths. Secondly, the exact duration of the persistence of videotext is unknown a priori; two consecutive frames can have completely different strings. Thirdly, videotext can be in scrolling motion. Because multiple moving text blocks can be present in the same video frame, it is nontrivial to recognize which videotext objects from consecutive frames are instances of the same text. In light of these difficulties, we present brief illustrative examples of dimensionality reduction and visualization of feature vectors comprising strings of recognized videotext. Experiments with our Edit distance kernel investigated the use of KPCA for analyzing the temporal contiguity of videotext using these feature vectors.
•
Videotext Clustering and Change Detection: Figure 2 shows the first two components obtained by applying KPCA with the Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of two distinct strings. Without any assumed knowledge, KPCA’s use of the Edit distance kernel clearly shows two distinct clusters corresponding to these two strings. Videotext Tracking and Outlier Detection: Figure 3 shows the first three principal components obtained by applying KPCA with our Edit distance kernel to a set of strings recognized from 20 consecutive frames. These frames contain instances of videotext scrolling across the screen. In this threedimensional plot, we can see a visual representation of the changing content of videotext as a trajectory in the principal component space, and locating outliers from the trajectory indicates the appearance of other strings in the video frames.
These results show that the symbolic kernels can assist significantly in automated agglomeration and tracking of recognized text as well as effective data visualization. In addition, multimodal feature vectors constructed from recognized text strings and frame motion estimates can now be analyzed jointly by using our hybrid kernel for media content characterization.
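A minimal sketch of how the clustering example above could be reproduced is given below. It assumes recognized strings from consecutive frames, builds the edit-kernel Gram matrix of Equation (4), centers it, and extracts kernel principal components by eigendecomposition; the example strings and the use of plain NumPy instead of a dedicated KPCA library are assumptions made for illustration.

    import numpy as np

    def edit_distance(s, t):
        d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
        d[:, 0], d[0, :] = np.arange(len(s) + 1), np.arange(len(t) + 1)
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                              d[i - 1, j - 1] + (s[i - 1] != t[j - 1]))
        return int(d[-1, -1])

    def edit_kernel(strings):
        n = len(strings)
        K = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = max(len(strings[i]), len(strings[j])) - edit_distance(strings[i], strings[j])
        return K

    def kernel_pca(K, n_components=2):
        # double-center the Gram matrix, then take the leading eigenvectors
        n = K.shape[0]
        one = np.ones((n, n)) / n
        Kc = K - one @ K - K @ one + one @ K @ one
        vals, vecs = np.linalg.eigh(Kc)
        idx = np.argsort(vals)[::-1][:n_components]
        vals, vecs = np.clip(vals[idx], 1e-12, None), vecs[:, idx]
        return vecs * np.sqrt(vals)          # projections of the input strings

    # OCR output for the same caption over consecutive frames, plus a second caption
    frames = ["NEWS AT 10", "NEVS AT 10", "NEWS AT 1O", "BREAKING STORY", "BREAK1NG STORY"]
    print(kernel_pca(edit_kernel(frames)))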
FUTURE TRENDS
One of the big hurdles facing media management systems is the semantic gap between the high-level meaning sought by user queries in search for media and the low-level features that we actually compute today for media indexing and description. Computational Media Aesthetics, a promising approach to bridging the gap and building high-level semantic descriptions for media search and navigation services, is founded upon an understanding of media elements and their individual and joint roles in synthesizing meaning and manipulating perceptions, with a systematic study of media productions (Dorai & Venkatesh, 2001). The core trait of this approach is that in order to create effective tools for automatically understanding video, we need to be able to interpret the data with its maker's eye. In order to realize the potential of this approach, it becomes imperative that all sources of descriptive information (audio, video, text, and so forth) be considered as a whole and analyzed together to derive inferences with a certain level of integrity. With the ability to treat multimodal features as an integrated feature set to describe media content during classification and visualization, new higher-level semantic mappings from low-level features can be achieved to describe media content. The symbolic kernels are a promising initial step in that direction to facilitate rigorous joint feature analysis in various media domains.
CONCLUSION
Traditional integer representation of symbolic multimedia feature data for classification and other data-mining tasks is artificial, as the symbolic space may not reflect the continuity and neighborhood relations imposed by integer representations. In this paper, we use distance-based kernels in conjunction with kernel space methods such as KPCA to handle multimodal data, including symbolic features. These symbolic kernels, as shown in this paper, help apply traditionally numeric methods to symbolic spaces without any forced integer mapping for important tasks such as data visualization, principal component extraction, and clustering in multimedia and other domains.
REFERENCES Aradhye, H., & Dorai, C. (2002). New kernels for analyzing multimodal data in multimedia using kernel machines. Proceedings of the IEEE International Conference on Multimedia and Expo, Switzerland, 2 (pp. 37-40). Bradley, P. S., Fayyad, U. M., & Mangasarian, O. (1998). Data mining: Overview and optimization opportunities (Tech. Rep. No. 98-01). Madison: University of Wisconsin, Computer Sciences Department. Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. In T.G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 617-624). Cambridge, MA: MIT Press. Cortes, C., Haffner, P., & Mohri, M. (2002). Rational kernels. In S. Becker, S. Thrun, & K. Obermayer (Ed.), Advances in neural information processing systems 15 (pp. 41-56). Cambridge, MA: MIT Press. Cortes, C., Haffner, P., & Mohri, M. (2003). Positive definite rational kernels. Proceedings of the 16th Annual Conference on Computational Learning Theory (pp. 4156), USA.
Dimitrova, N., Agnihotri, L., Dorai, C., & Bolle, R. (2000, October). MPEG-7 videotext descriptor for superimposed text in images and video. Signal Processing: Image Communication, 16, 137-155. Dorai, C., & Venkatesh, S. (2001, October). Computational media aesthetics: Finding meaning beautiful. IEEE Multimedia, 8(4), 10-12. Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 487-493). Cambridge, MA: MIT Press. Scholkopf, B., Smola, A., & Muller, K. R. (1999). Kernel principal component analysis. In B. Scholkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: SV learning (pp. 327-352). Cambridge, MA: MIT Press. Tipping, M. E. (1999). Probabilistic visualisation of highdimensional binary data. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 592-598). Cambridge, MA: MIT Press.
KEY TERMS
Dimensionality Reduction: The process of transformation of a large dimensional feature space into a space comprising a small number of (uncorrelated) components. Dimensionality reduction allows us to visualize, categorize, or simplify large datasets.
Kernel Function: A function that intrinsically defines the projection of two feature vectors (function arguments) onto a high-dimensional space and a dot product therein.
Mercer's Condition: A kernel function is said to obey Mercer's condition for kernel validity iff the kernel matrix comprising pairwise kernel evaluations over any given subset of the feature space is guaranteed to be positive semidefinite.
MPEG Compression: Video/audio compression standard established by the Motion Picture Experts Group. MPEG compression algorithms use psychoacoustic modeling of audio and motion analysis as well as DCT of video data for efficient multimedia compression.
Multimodality of Feature Data: Feature data is said to be multimodal if the features can be characterized as a mixture of real-valued, discrete, ordinal, or nominal values.
Principal Component Analysis (PCA): One of the oldest modeling and dimensionality reduction techniques. PCA models observed feature data as a linear combination of a few uncorrelated, Gaussian principal components and additive Gaussian noise.
Videotext: Text graphically superimposed on video imagery, such as caption text, headline news, speaker identity, location, and so on.
ENDNOTE
1. The term symbolic is loosely used in this paper to mean discrete, ordinal, and nominal features.
Multiple Hypothesis Testing for Data Mining
Sach Mukherjee, University of Oxford, UK
INTRODUCTION A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This Chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
BACKGROUND Many problems in commercial and scientific data mining involve selecting objects of interest from large datasets on the basis of numerical relevance scores (“object selection”). This Section looks briefly at the role played by hypothesis tests in problems of this kind. We start by examining the relationship between relevance scores, statistical errors and the testing of hypotheses in the context of two illustrative data mining tasks. Readers familiar with conventional hypothesis testing may wish to progress directly to the main part of the Chapter. As a topical example, consider the differential analysis of gene microarray data (Piatetsky-Shapiro & Tamayo, 2004; Cui & Churchill, 2003). The data consist of expression levels (roughly speaking, levels of activity) for each of thousands of genes across two or more conditions (such as healthy and diseased). The data mining task is to find a set of genes which are differentially expressed between the conditions, and therefore likely to be relevant to the disease or biological process under investigation. A suitably defined mathematical function (the t-statistic is a canonical choice) is used to assign a “relevance score” to each gene and a subset of genes selected on the basis of the scores. Here, the objects being selected are genes. As a second example, consider the mining of sales records. The aim might be, for instance, to focus marketing efforts on a subset of customers, based on some property of their buying behavior. A suitably defined function would be used to score each customer by rel-
evance, on the basis of his or her records. A set of customers with high relevance scores would then be selected as targets for marketing activity. In this example, the objects are customers. Clearly, both tasks are similar; each can be thought of as comprising the assignment of a suitably defined relevance score to each object and the subsequent selection of a set of objects on the basis of the scores. The selection of objects thus requires the imposition of a threshold or cut-off on the relevance score, such that objects scoring higher than the threshold are returned as relevant. Consider the microarray example described above. Suppose the function used to rank genes is simply the difference between mean expression levels in the two classes. Then the question of setting a threshold amounts to asking how large a difference is sufficient to consider a gene relevant. Suppose we decide that a difference in means exceeding x is ‘large enough’: we would then consider each gene in turn, and select it as “relevant” if its relevance score equals or exceeds x. Now, an important point is that the data are random variables, so that if measurements were collected again from the same biological system, the actual values obtained for each gene might differ from those in the particular dataset being analyzed. As a consequence of this variability, there will be a real possibility of obtaining scores in excess of x from genes which are in fact not relevant. In general terms, high scores which are simply due to chance (rather than the underlying relevance of the object) lead to the selection of irrelevant objects; errors of this kind are called false positives (or Type I errors). Conversely, a truly relevant object may have an unusually low score, leading to its omission from the final set of results. Errors of this kind are called false negatives (or Type II errors). Both types of error are associated with identifiable costs: false positives lead to wasted resources, and false negatives to missed opportunities. For example, in the market research context, false positives may lead to marketing material being targeted at the wrong customers; false negatives may lead to the omission of the “right” customers from the marketing campaign. Clearly, the rates of each kind of error are related to the threshold imposed on the relevance score: an excessively strict threshold will minimize false positives but produce many false negatives, while an overly lenient threshold will have the
opposite effect. Setting an appropriate threshold is therefore vital to controlling errors and associated costs. Statistical hypothesis testing can be thought of as a framework within which the setting of thresholds can be addressed in a principled manner. The basic idea is to specify an acceptable false positive rate (i.e. an acceptable probability of Type I error) and then use probability theory to determine the precise threshold which corresponds to that specified error rate. A general discussion of hypothesis tests at an introductory level can be found in textbooks of statistics such as DeGroot and Schervish (2002), or Moore and McCabe (2002); the standard advanced reference on the topic is Lehmann (1997). Now, let us assume for the moment that we have only one object to consider. The hypothesis that the object is irrelevant is called the null hypothesis (and denoted by H0), and the hypothesis that it is relevant is called the alternative hypothesis (H1). The aim of the hypothesis test is to make a decision regarding the relevance of the object, that is, a decision as to which hypothesis should be accepted. Suppose the relevance score for the object under consideration is t. A decision regarding the relevance of the object is then made as follows:

(1) Specify an acceptable level of Type I error p*.
(2) Use the sampling distribution of the relevance score under the null hypothesis to compute a threshold score corresponding to p*. Let this threshold score be denoted by c.
(3) If t ≥ c, reject the null hypothesis and regard the object as relevant. If t < c, regard the object as irrelevant.
The specified error level p* is called the significance level of the test and the corresponding threshold c the critical value. Hypothesis testing can alternatively be thought of as a procedure by which relevance scores are converted into corresponding error probabilities. The null sampling distribution can be used to compute the probability p of making a Type I error if the threshold is set at exactly t, i.e. just low enough to select the given object. This then allows us to assert that the probability of obtaining a false positive if the given object is to be selected is at least p. This latter probability of Type I error is called a P-value. In contrast to relevance scores, P-values, being probabilities, have a clear interpretation. For instance, if we found that an object had a t-statistic value of 3 (say), it would be hard to tell whether the object should be regarded as relevant or not. However, if we found the corresponding P-value was 0.001, we would know that if the threshold were set just low enough to include the object, the false positive rate would be 1 in 1000, a fact that is far easier to interpret.
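To make the score-to-probability conversion concrete, here is a minimal sketch of a single-object test. It is an illustration only, not taken from the chapter: the two-sample t-statistic, the scipy library, and the expression values are all assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical expression levels for one gene under two conditions.
healthy = np.array([5.1, 4.8, 5.3, 5.0, 4.9])
diseased = np.array([6.2, 5.9, 6.4, 6.1, 6.0])

# Relevance score: the classical two-sample t-statistic.
t_score, _ = stats.ttest_ind(healthy, diseased)
t_score = abs(t_score)

# Under the null hypothesis the score follows Student's t distribution with
# n1 + n2 - 2 degrees of freedom (assuming equal variances); the survival
# function gives the probability of a score at least this large by chance.
df = len(healthy) + len(diseased) - 2
p_value = 2 * stats.t.sf(t_score, df)   # two-sided P-value

print(f"t = {t_score:.2f}, P-value = {p_value:.4g}")
```

The same relevance score is far easier to act on once it has been expressed as a probability of Type I error.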
MAIN THRUST We have seen that in the case of a single variable, relevance scores obtained from test statistics can be easily converted into error probabilities called P-values. However, practical data mining tasks, such as mining microarrays or consumer records, tend to be on a very large scale, with thousands, even millions of objects under consideration. Under these conditions of multiplicity, the conventional P-value described above no longer corresponds to the probability of obtaining a false positive. An example will clarify this point. Consider once again the microarray analysis scenario, and assume that a suitable relevance scoring function has been chosen. Now, suppose we wish to set a threshold corresponding to a false positive rate of 0.05. Let the relevance score whose P-value is 0.05 be denoted by t 05. Then, in the case of a single variable/gene, if we were to set the threshold at t 05, the probability of obtaining a false positive would be 0.05. However, in the multi-gene setting, it is each of the thousands of genes under study that is effectively subjected to a hypothesis test with the specified error probability of 0.05. Thus, the chance of obtaining a false positive is no longer 0.05, but much higher. For instance, if each of 10000 genes were statistically independent, (0.05 × 10000) = 500 genes would be mistakenly selected on average! In effect, the very threshold which implied a false positive rate of 0.05 for a single gene now leaves us with hundreds of false positives. Multiple hypothesis testing procedures address the issue of multiplicity in hypothesis tests and provide a way of setting appropriate thresholds in multi-variable problems. The remainder of this Section describes two well-known multiple testing methods (the Bonferroni and False Discovery Rate methods), and discusses their advantages and disadvantages. Table 1 summarizes the numbers of objects in various categories, and will prove useful in clarifying some of the concepts presented below. The total number of objects under consideration is m, of which m1 are relevant and m0 are irrelevant. A total of S objects are selected, of which S1 are true positives and S0 are false positives. We follow the convention that variables relating to irrelevant objects have the subscript “0” (to signify the null hypothesis) and those relating to relevant objects the subscript “1” (for the alternative hypothesis). Note also that fixed quantities (e.g. the total number of objects) are denoted by lower-case letters, while variable quantities (e.g. the number of objects selected) are denoted by upper-case letters. The initial stage of a multi-variable analysis follows from our discussion of basic hypothesis testing and is
Table 1. Summary table for multiple testing, following Benjamini and Hochberg (1995)

                      Selected    Not selected    Total
Irrelevant objects    S0          m0 - S0         m0
Relevant objects      S1          m1 - S1         m1
Total                 S           m - S           m
common to the methods discussed below. The data is processed as follows:

(1) Score each of the m objects under consideration using a suitably chosen test statistic f. If the data corresponding to object j is denoted by Dj, and the corresponding relevance score by Tj:

Tj = f(Dj)

Let the scores so obtained be denoted by T1, T2, …, Tm.

(2) Convert each score Tj to a corresponding P-value Pj by making use of the relevant null sampling distribution:

Tj → (null sampling distribution) → Pj

These P-values are called 'nominal P-values' and represent per-test error probabilities. The procedure by which a P-value corresponding to an observed test statistic is computed is not described here, but can be found in any textbook of statistics, such as those mentioned previously. Essentially, the null cumulative distribution function of the test statistic is used to compute the probability of Type I error at threshold Tj.

(3) Arrange the P-values obtained in the previous step in ascending order (smaller P-values correspond to objects more likely to be relevant). Let the ordered P-values be denoted by P(1) ≤ P(2) ≤ … ≤ P(m).

Thus, each object has a corresponding P-value. In order to select a subset of objects, it will therefore be sufficient to determine a threshold in terms of nominal P-value.

The Bonferroni Method

This is perhaps the simplest method for correcting P-values. We first specify an acceptable probability of committing at least one Type I error, and then calculate a corresponding threshold in terms of nominal P-value.

Procedure

(1) Specify an acceptable probability p* of at least one Type I error being committed.
(2) Recall that P(1) ≤ P(2) ≤ … ≤ P(m) represent the ordered P-values. Find the largest i which satisfies the following inequality:

P(i) ≤ p* / m

Let the largest i found be denoted k.
(3) Select the k objects corresponding to P-values P(1), P(2), …, P(k) as relevant.

If we assume that all m tests are statistically independent, it is easy to show that this simple procedure guarantees that the probability of obtaining at least one false positive is no greater than p*.

Advantages

• The basic Bonferroni procedure is extremely simple to understand and use.

Disadvantages

• The Bonferroni procedure sets the threshold to meet a specified probability of making at least one Type I error. This notion of error is called "Family Wise Error Rate", or FWER, in the statistical literature. FWER is a very strict control of error, and is far too conservative for most practical applications. Using FWER on datasets with large numbers of variables often results in the selection of a very small number of objects, or even none at all.
• The assumption of statistical independence between tests is a strong one, and almost never holds in practice.
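The Bonferroni rule above is easy to express in code. The short sketch below is illustrative only; the array of nominal P-values is invented.

```python
import numpy as np

def bonferroni_select(p_values, p_star=0.05):
    """Return indices of objects selected under the Bonferroni (FWER) rule."""
    p = np.asarray(p_values)
    m = p.size
    # For the ordered P-values, the largest i with P(i) <= p*/m defines the
    # cut-off; since the bound does not depend on i, this is equivalent to
    # selecting every object whose nominal P-value is at most p*/m.
    threshold = p_star / m
    return np.where(p <= threshold)[0]

# Hypothetical nominal P-values for m = 8 objects.
p_vals = [0.0004, 0.0021, 0.009, 0.012, 0.041, 0.06, 0.27, 0.55]
print(bonferroni_select(p_vals))   # only the very smallest P-values survive
```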
The FDR Method of Benjamini and Hochberg An alternative way of thinking about errors in multiple testing is the false discovery rate (FDR) (Benjamini & Hochberg, 1995). Looking at Table 1 we can see that of S objects selected, S0 are false positives. FDR is simply
the average proportion of false positives among the objects selected:
FDR ≡ E[S0 / S]

where E[·] denotes expectation. The Benjamini and Hochberg method allows us to compute a threshold corresponding to a specified FDR.
Procedure

(1) Specify an acceptable FDR q*.
(2) Find the largest i which satisfies the following inequality:

P(i) ≤ (i × q*) / m

Let the largest i found be denoted k.
(3) Select the k objects corresponding to P-values P(1), P(2), …, P(k) as relevant.
Under the assumption that the m tests are statistically independent, it can be shown that this procedure leads to a selection of objects such that the FDR is no greater than q*.
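A minimal sketch of the step-up rule just described follows; the P-values are again invented for illustration.

```python
import numpy as np

def benjamini_hochberg_select(p_values, q_star=0.05):
    """Return indices of objects selected at FDR level q_star (BH procedure)."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                     # indices giving P(1) <= ... <= P(m)
    ranks = np.arange(1, m + 1)
    passed = p[order] <= ranks * q_star / m   # test P(i) <= i * q* / m
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.where(passed)[0]) + 1       # largest i satisfying the inequality
    return order[:k]                          # objects with the k smallest P-values

p_vals = [0.0004, 0.0021, 0.009, 0.012, 0.041, 0.06, 0.27, 0.55]
print(benjamini_hochberg_select(p_vals))      # selects more objects than Bonferroni
```

On the same eight P-values used earlier, the FDR rule keeps four objects where the Bonferroni rule kept two, which illustrates its less conservative behaviour.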
Advantages

• The FDR concept is far less conservative than the Bonferroni method described above.
• Specifying an acceptable proportion q* of false positives is an intuitively appealing way of controlling the error rate in object selection.
• The FDR method tells us in advance that a certain proportion of results are likely to be false positives. As a consequence, it becomes possible to plan for the associated costs.
Disadvantages

• Again, the assumption of statistically independent tests rarely holds.
FUTURE TRENDS A great deal of recent work in statistics, machine learning and data mining has focused on various aspects of multiple testing in the context of object selection. Some important areas of active research which were not discussed in detail are briefly described below.
Robust FDR Methods

The FDR procedure outlined above, while simple and computationally efficient, makes several strong assumptions, and while better than Bonferroni is still often too conservative for practical problems (Storey & Tibshirani, 2003). The recently introduced "q-value" (Storey, 2003) is a more sophisticated approach to FDR correction and provides a very robust methodology for multiple testing. The q-value method makes use of the fact that P-values are uniformly distributed under the null hypothesis to accurately estimate the FDR associated with a particular threshold. Estimated FDR is then used to set an appropriate threshold. The q-value approach is an excellent choice in many multi-variable settings.
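The core idea can be illustrated in a few lines. The sketch below is a simplified caricature of this kind of estimator, not the published q-value algorithm: the tuning constant lambda, the simulated P-values, and the clipping to 1 are all assumptions made for the example.

```python
import numpy as np

def estimated_fdr(p_values, threshold, lam=0.5):
    """Rough FDR estimate at a given P-value threshold (Storey-style idea)."""
    p = np.asarray(p_values)
    m = p.size
    # Null P-values are uniform on [0, 1], so the proportion of P-values above
    # lambda estimates the fraction of truly irrelevant objects (pi0).
    pi0_hat = np.mean(p > lam) / (1.0 - lam)
    n_selected = max(1, np.sum(p <= threshold))
    # Expected number of null P-values below the threshold, divided by the
    # number actually selected, approximates the false discovery rate.
    return min(1.0, pi0_hat * m * threshold / n_selected)

rng = np.random.default_rng(0)
p_vals = np.concatenate([rng.uniform(0, 0.01, 50),    # "relevant" objects
                         rng.uniform(0, 1, 950)])     # "irrelevant" objects
print(estimated_fdr(p_vals, threshold=0.01))
```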
Resampling Based Methods In recent years a number of resampling based methods have been proposed for multiple testing (see e.g. Westfall & Young, 1993). These methods make use of computationally intensive procedures such as the bootstrap (Efron, 1982; Davison & Hinckley, 1997) to perform non-parametric P-value corrections. Resampling methods are extremely powerful and make fewer strong assumptions than methods based on classical statistics, but in many data mining applications the computational burden of resampling may be prohibitive.
Machine Learning of Relevance Functions The methods described in this chapter allowed us to determine threshold relevance scores, but very little attention was paid to the important issue of choosing an appropriate relevance scoring function. Recent research in bioinformatics (Broberg, 2003; Mukherjee, 2004a) has shown that the effectiveness of a scoring function can be very sensitive to the statistical model underlying the data, in ways which can be difficult to address by conventional means. When fully labeled data (i.e. datasets with objects flagged as relevant/irrelevant) are available, canonical supervised algorithms (see e.g. Hastie, Tibshirani, & Friedman , 2001) can be used to learn effective relevance functions. However, in many cases - microarray data being one example - fully labeled data is hard to obtain. Recent work in machine learning (Mukherjee, 2004b) has addressed the problem of learning relevance functions in an unsupervised setting by exploiting a probabilistic notion of stability; this approach turns out to be remarkably effective in settings where underlying statistical models are poorly understood and labeled data unavailable. 851
CONCLUSION An increasing number of problems in industrial and scientific data analysis involve multiple testing, and there has consequently been an explosion of interest in the topic in recent years. This Chapter has discussed a selection of important concepts and methods in multiple testing; for further reading we recommend Benjamini and Hochberg (1995) and Dudoit, Shaffer, & Boldrick, (2003).
REFERENCES Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289-300. Broberg, P. (2003). Statistical methods for ranking differentially expressed genes. Genome Biology, 4(6). Cui, X., & Churchill, G. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(210). Davison, A. C., & Hinckley, D. V. (1997). Bootstrap methods and their applications. Cambridge University Press. DeGroot, M. H., & Schervish, M. J. (2002). Probability and statistics (3rd ed.). Addison-Wesley. Dudoit, S., Shaffer, J. P., & Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1), 71-103. Efron, B. (1982). The jacknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer-Verlag. Lehmann, E. L. (1997). Testing statistical hypotheses (2nd ed.). Springer-Verlag. Moore, D. S., & McCabe, G. P. (2002). Introduction to the practice of statistics (4 th ed.). W. H. Freeman. Mukherjee, S. (2004a). A theoretical analysis of gene selection. Proceedings of the IEEE Computer Society Bioinformatics Conference 2004 (CSB 2004). IEEE press. Mukherjee, S. (2004b). Unsupervised learning of ranking functions for high-dimensional data (Tech. Rep. No. PARG-04-02). University of Oxford, Department of Engineering Science. 852
Piatetsky-Shapiro, G., & Tamayo, P. (2003). Microarray Data mining: Facing the Challenges. SIGKDD Explorations, 5(2). Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035. Storey, J. D., & Tibshirani, R. (2003). Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences, 100 (pp. 9440-9445). Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.
KEY TERMS Alternative Hypothesis: The hypothesis that an object is relevant. In general terms, the alternative hypothesis refers to the set of events that is the complement of the null hypothesis. False Discovery Rate (FDR): The expected proportion of false positives among the objects selected as relevant. False Negative: The error committed when a truly relevant object is not selected. More generally, a false negative occurs when the null hypothesis is erroneously accepted. Also called Type II error. False Positive: The error committed when an object is selected as relevant when it is in fact irrelevant. More generally, a false positive occurs when the null hypothesis is erroneously rejected. Also called Type I error. Gene Microarray Data: Measurements of mRNA abundances derived from biochemical devices called microarrays. These are essentially measures of gene activity. Hypothesis Test: A formal statistical procedure by which an interesting hypothesis (the alternative hypothesis) is accepted or rejected on the basis of data. Multiple Hypothesis Test: A formal statistical procedure used to account for the effects of multiplicity in a hypothesis test. Null Hypothesis: The hypothesis that an object is irrelevant. In general terms, it is the hypothesis we wish to falsify on the basis of data. P-Value: The P-value for an object is the probability of obtaining a false positive if the threshold is set just high enough to include the object among the set selected
as relevant. More generally, it is the false positive rate corresponding to an observed test statistic. Random Variable: A variable characterized by random behavior in assuming its different possible values. Sampling Distribution: The distribution of values obtained by applying a function to random data.
Test Statistic: A relevance scoring function used in a hypothesis test. Classical test statistics (such as the t-statistic) have null sampling distributions which are known a priori. However, under certain circumstances null sampling distributions for arbitrary functions can be obtained by computational means.
Music Information Retrieval Alicja A. Wieczorkowska Polish-Japanese Institute of Information Technology, Poland
INTRODUCTION Music information retrieval is a multi-disciplinary research on retrieving information from music. This research involves scientists from traditional, music, and digital libraries; information science; computer science; law; business; engineering; musicology; cognitive psychology; and education (Downie, 2001).
BACKGROUND

A huge amount of audio resources, including music data, is becoming available in various forms, both analog and digital. Notes, CDs, and digital resources of the World Wide Web are growing constantly in amount, but the value of music information depends on how easily it can be found, retrieved, accessed, filtered, and managed (Fingerhut, 1997; International Organization for Standardization, 2003). Music information retrieval consists of quick and efficient searching for various types of audio data of interest to the user, filtering them in order to receive only the data items that satisfy the user's preferences (International Organization for Standardization, 2003). This broad domain of research includes:

• Audio retrieval by content;
• Auditory scene analysis and recognition (Rosenthal & Okuno, 1998);
• Music transcription;
• Denoising of old analog recordings; and
• Other topics discussed further in this article (Wieczorkowska & Ras, 2003).
The topics are interrelated, since the same or similar techniques can be applied for various purposes. For instance, source separation, usually applied in auditory scene analysis, is used also for music transcription and even restoring (denoising) of old recordings. The research on music information retrieval has many applications. The most important include automatic production of music score on the basis of the presented input, retrieval of music pieces from huge audio databases, and restoration of old recordings. Generally, the research within the music information retrieval domain is fo-
cused on harmonic structure analysis, note extraction, melody and rhythm tracking, timbre and instrument recognition, classification of type of the signal (speech, music, pitched vs. non-pitched), and so forth. The research basically uses digital audio recordings, where sound waveform is digitally stored as a sequence of discrete samples representing the sound intensity at a given time instant, and MIDI files, storing information on parameters of electronically synthesized sounds (voice, note on, note off, pitch bend, etc.). Sound analysis and data mining tools are used to extract information from music files in order to provide the data that meet users’ needs (Wieczorkowska & Ras, 2001).
MAIN THRUST

The music information retrieval domain covers a broad range of topics of interest, and various types of music data are investigated in this research. Basic techniques of digital sound analysis for these purposes come from speech processing focused on automatic speech recognition and speaker identification (Foote, 1999). Sound descriptors calculated this way can be added to the audio files in order to facilitate content-based searching of music databases. The issue of representation of music and multimedia information in a form that allows interpretation of the information's meaning is addressed by the MPEG-7 standard, named Multimedia Content Description Interface. MPEG-7 provides a rich set of standardized tools to describe multimedia content through metadata (i.e., data about data), and music information description also has been taken into account in this standard (International Organization for Standardization, 2003). The following topics are investigated within the music information retrieval domain:

• Auditory scene analysis, which focuses on various aspects of music like timbre description, sound harmonicity, spatial origin, source separation, and so forth (Bregman, 1990). Timbre is defined subjectively as this feature of sound that distinguishes two sounds of the same pitch, loudness, and duration. Therefore, subjective listening tests are often performed in this research, but also signal-processing techniques are broadly applied here. One of the main topics of computational auditory scene analysis is automatic separation of individual sound sources from a mixture. It is difficult with mixtures of harmonic instrument sounds, where spectra overlap. However, assuming time-frequency smoothness of the signal, sound separation can be performed, and when sound changes in time are observed, onset, offset, amplitude, and frequency modulation have similar shapes for all frequencies in the spectrum; thus, a demixing matrix can be estimated for them (Virtanen, 2003; Viste & Evangelista, 2003). Audio source separation techniques also can be used for source localization for auditory scene analysis. These techniques, like independent component analysis, originate from speech recognition in the cocktail party environment, where many sound sources are present. Independent component analysis is used for finding underlying components from multidimensional statistical data, and it looks for components that are statistically independent (Vincent et al., 2003).

• Computational auditory scene recognition, which aims at classifying auditory scenes into predefined classes, using audio information only. Examples of auditory scenes are various outside and inside environments, like streets, restaurants, offices, homes, cars, and so forth. Statistical and nearest neighbor algorithms can be applied for this purpose. In the nearest neighbor algorithm, the class (type of auditory scene, in this case) is assigned on the basis of the distance of the investigated sample to the nearest sample, for which the class membership is known. Various acoustic features, based on Fourier spectral analysis (i.e., a mathematical transform, decomposing the signal into frequency components), can be applied to parameterize the auditory scene for classification purposes. Effectiveness of this research approaches 70% correctness for about 20 auditory scenes (Peltonen et al., 2002).

• Query-by-humming systems, which search melodic databases using sung queries (Adams et al., 2003). This topic represents audio retrieval by content. Melody usually is quantized coarsely with respect to pitch and duration, assuming moderate singing abilities of users. A music retrieval system takes such an aural query (i.e., a motif or a theme) as input, and searches the database for the piece from which this query comes. Markov models, based on Markov chains, can be used for modeling musical performances. A Markov chain is a stochastic process for which the parameter is discrete time values. In a Markov sequence of events, the probability of future states depends on the present state; in this case, states represent pitch (or set of pitches) and duration (Birmingham et al., 2001). Query-by-humming is one of the more popular topics within the music information retrieval domain.

• Audio retrieval-by-example for orchestral music, which aims at searching for acoustic similarity in an audio collection, based on analysis of the audio signal. Given an example audio document, other documents in a collection can be ranked by similarity on the basis of long-term structure; specifically, the variation of soft and louder passages, determined from the envelope of audio energy vs. time in one or more frequency bands (Foote, 2000). This research is a branch of audio retrieval by content. Audio query-by-example search also can be performed within a single document when searching for sounds similar to the selected sound event. Such a system for content-based audio retrieval can be based on a self-organizing feature map (i.e., a special kind of neural network designed by analogy with a simplified model of the neural connections in the brain and trained to find relationships in the data). Perceptual similarity can be assessed on the basis of spectral evolution in order to find sounds of similar timbre (Spevak & Polfreman, 2001). Neural networks also are used in other forms in audio information retrieval systems. For instance, time-delayed neural networks (i.e., neural nets with time delay inputs) are applied, since they perform well in speech recognition applications (Meier et al., 2000). One of the applications of audio retrieval-by-example is searching for a piece in a huge database of music pieces with the use of so-called audio fingerprinting, a technology that allows piece identification. Given a short passage transmitted, for instance, via car phone, the piece is extracted, and, most important, information also is extracted on the performer and title linked to this piece in the database. In this way, the user may identify the piece of music with very high accuracy (95%) only on the basis of a small recorded (possibly noisy) passage.

• Transcription of music, defined as writing down the musical notation for the sounds that constitute the investigated piece of music. Onset detection based on incoming energy in frequency bands and multi-pitch estimation based on spectral analysis may be used as the main elements of an automatic music transcription system. The errors in such a system may contain additional inserted notes, omissions, or erroneous transcriptions (Klapuri et al., 2001). Pitch tracking (i.e., estimation of pitch of note events in a melody or a piece of music) is often performed in many music information retrieval systems. For polyphonic music, polyphonic pitch-tracking and timbre separation in digital audio is performed with such applications as score-following and denoising of old analog recordings, which is also a topic of interest within music information retrieval. Wavelet analysis can be applied for this purpose, since it decomposes the signal in time-frequency space, and then musical notes can be extracted from the result of this decomposition (Popovic et al., 1995). Simultaneous polyphonic pitch and tempo tracking, aiming at automatically inferring a musical notation that lists the pitch and the time limits of each note, is a basis of the automatic music transcription. A musical performance can be modeled for these purposes using dynamic Bayesian networks (i.e., directed graphical models of stochastic processes) (Cemgil et al., 2003). It is assumed that the observations may be generated by a hidden process that cannot be directly experimentally observed, and dynamic Bayesian networks represent the hidden and observed states in terms of state variables, which can have complex interdependencies. Dynamic Bayesian networks generalize hidden Markov models, which have one hidden node and one observed node per observation time.

• Automatic characterizing of the rhythm and tempo of music and audio, revealing tempo and the relative strength of particular beats, which is a branch of research on automatic music transcription. Since highly structured or repetitive music has strong beat spectrum peaks at the repetition times, it allows tempo estimation and distinguishing between different kinds of rhythms at the same tempo. The tempo can be estimated using the beat spectral peak criterion (the lag of the highest peak exceeding an assumed time threshold) accurately to within 1% in the analysis window (Foote & Uchihashi, 2001).

• Automatic classification of musical instrument sounds, aiming at accurate identification of musical instruments playing in a given recording, based on various sound analysis and data mining techniques (Herrera et al., 2000; Wieczorkowska, 2001). This research is focused mainly on monophonic sounds, and sound mixes are usually addressed in the research on separation of sound sources. In most cases, sounds of instruments of definite pitch have been investigated, but recently, research on percussion also has been undertaken. Various analysis methods are used to parameterize sounds for instrument classification purposes, including time-domain description, Fourier, and wavelet analysis (a minimal sketch of such Fourier-based parameterization is given after this list). Classifiers range from statistic and probabilistic methods, through learning by example, to artificial intelligence methods. Effectiveness of this research ranges from about 70% accuracy for instrument identification to more than 90% for instrument family (i.e., strings, winds, etc.), approaching 100% for discriminating impulsive and sustained sounds, thus even exceeding human performance. Such instrument sound classification can be included in automatic music transcription systems.

• Sonification, in which utilities for intuitive auditory display (i.e., in audible form) are provided through a graphical user interface (Ben-Tal et al., 2002).

• Generating human-like expressive musical performances with appropriately adjusted dynamics (i.e., loudness), rubato (variation in time limits of notes), vibrato (changes of pitch) (Mantaras & Arcos, 2002), and identification of musical pieces representing different types of emotions (tenderness, sadness, joy, calmness, etc.), which music evokes. Emotions in music are gaining the interest of researchers recently, also including recognition of emotions in the recordings.
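As a small illustration of the Fourier-based parameterization mentioned in the list above, the sketch below frames a digital audio signal and computes one spectral descriptor per frame. It is a generic example, not any of the cited systems: the frame length, hop size, and the choice of the spectral centroid as the single timbre feature are all assumptions.

```python
import numpy as np

def spectral_centroid_frames(samples, sample_rate, frame_len=2048, hop=1024):
    """Frame a mono signal and return the spectral centroid (Hz) of each frame."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    centroids = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        magnitude = np.abs(np.fft.rfft(frame))      # Fourier spectrum of the frame
        if magnitude.sum() > 0:
            centroids.append(np.sum(freqs * magnitude) / np.sum(magnitude))
    return np.array(centroids)

# Synthetic example: one second of a 440 Hz tone with a weaker 880 Hz partial.
sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
print(spectral_centroid_frames(signal, sr)[:3])     # centroids lie between the partials
```

Feature vectors of this kind, computed frame by frame, are what classifiers for auditory scenes or instrument sounds typically operate on.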
The topics mentioned previously interrelate and sometimes partially overlap. For instance, both auditory scene analysis and recognition may take into account a very broad range of recordings containing numerous acoustic elements to identify and analyze. Query by humming requires automatic transcription of music, since the input audio samples first must be transformed into the form based on musical notation, describing basic melodic features of the query. Audio retrieval by example and automatic classification of musical instrument sounds are both branches of retrieval by content. Transcription of music requires not only pitch tracking, but also automatic characterizing the rhythm and tempo, so these topics overlap. Pitch tracking is needed in many research topics, including music transcription, query by humming, and even automatic classification of musical instrument sounds, since pitch is one of the features characterizing instrumental sound. Sonification and generating human-like expressive musical performances both are related to sound synthesis, which are needed to create auditory display or emotional performance. All these topics are focused on a broad domain of music and its various aspects. Results of this research are not always easily measurable, especially in the case of synthesis-based topics, since they usually are validated via subjective tests. Other topics, like transcription of music, may produce errors of various importance (i.e., wrong pitch, length, omission, etc.), and comparison of the obtained transcript with the original score can be measured in many ways, depending on the considered criteria. The easiest estimation and comparison of results can be performed in case of recognition of singular sound events or files. In
case of query by example, very high recognition rate has been reached already (95% of correct piece identification via audio fingerprinting), reaching commercial level. The research on music information retrieval is gaining an increasing interest from the scientific community, and investigation of further issues in this domain can be expected.
FUTURE TRENDS Multimedia databases and library collections, expanding tremendously nowadays, need efficient tools for content-based search. Therefore, we can expect an intensification of research effort on music information retrieval, which may aid searching music data. Especially, tools for query-by-example and query-by-humming are needed, as well as tools for automatic music transcription, so these areas should be investigated broadly in the near future.
CONCLUSION Music information retrieval is a broad range of research, focusing on various aspects of possible applications. The main domains include audio retrieval by content, automatic music transcription, denoising of old recordings, generating human-like performances, and so forth. The results of this research help users find the audio data they need, even if the users are not experienced musicians. Constantly growing audio resources evoke a demand for efficient tools to deal with this enormous amount of data; therefore, music information retrieval becomes a dynamically developing field of research.
REFERENCES
Bregman, A.S. (1990). Auditory scene analysis, the perceptual organization of sound. Cambridge, MS: MIT Press.
Cemgil, A.T., Kappen, B., & Barber, D. (2003). Generative model based polyphonic music transcription. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA'03, New Paltz, New York.
de Mantaras, R.L., & Arcos, J.L. (2002, Fall). AI and music: From composition to expressive performance. AI Magazine, 43-58.
Downie, J.S. (2001). Wither music information retrieval: Ten suggestions to strengthen the MIR research community. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana.
Fingerhut, M. (1997). Le multimédia dans la bibliothèque. Culture et recherche, 61. Retrieved 2004 from http://catalogue.ircam.fr/articles/textes/Fingerhut 97a/.
Foote, J. (1999). An overview of audio information retrieval. Multimedia Systems, 7(1), 2-11.
Foote, J. (2000). ARTHUR: Retrieving orchestral music by long-term structure. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.
Foote, J., & Uchihashi, S. (2001). The beat spectrum: A new approach to rhythm analysis. Proceedings of the International Conference on Multimedia and Expo ICME 2001, Tokyo, Japan.
Adams, N.H., Bartsch, M.A., & Wakefield, G.H. (2003). Coding of sung queries for music information retrieval. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA’03, New Paltz, New York.
Herrera, P., Amatriain, X., Batlle, E., & Serra X. (2000). Towards instrument segmentation for music content description: A critical review of instrument classification techniques. Proceedings of the International Symposium on Music Information Retrieval ISMIR 2000, Plymouth, Massachusetts.
Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G. & Cook, P. (2002). SONART: The sonification application research toolbox. Proceedings of the 2002 International Conference on Auditory Display, Kyoto, Japan.
International Organization for Standardization ISO/IEC JTC1/SC29/WG11. (2003). MPEG-7 Overview. Retrieved 2004 from http://www.chiariglione.org/mpeg/ standards/mpeg-7/mpeg-7.htm
Birmingham, W.P., Dannenberg, R.D., Wakefield, G.H., Bartsch, M.A., Bykowski, D., Mazzoni, D., Meek, C., Mellody, M., & Rand, B. (2001). MUSART: Music retrieval via aural queries. Proceedings of ISMIR 2001, 2nd Annual International Symposium on Music Information Retrieval, Bloomington, Indiana.
Klapuri, A., Virtanen, T., Eronen, A., & Seppänen, J. (2001). Automatic transcription of musical recordings. Proceedings of the Consistent & Reliable Acoustic Cues for Sound Analysis CRAC Workshop, Aalborg, Denmark.
Meier, U., Stiefelhagen, R., Yang, J., & Waibel, A. (2000). Towards unrestricted lip reading. International Journal of Pattern Recognition and Artificial Intelligence, 14(5), 571-586. Peltonen, V., Tuomi, J., Klapuri, A., Huopaniemi, J., & Sorsa, T. (2002). Computational auditory scene recognition. Proceedings of the International Conference on Acoustics Speech and Signal Processing ICASSP, Orlando, Florida. Popovic, I., Coifman, R., & Berger, J. (1995). Aspects of pitch-tracking and timbre separation: Feature detection in digital audio using adapted local trigonometric bases and wavelet packets. Center for Studies in Music Technology. Retrieved 2004 from http://wwwccrma.stanford.edu/~brg/research/pc/pitchtrack.html Rosenthal, D., & Okuno, H.G. (Eds.). (1998). Computational auditory scene analysis. Proceedings of the IJCAI95 Workshop, Mahwah, New Jersey. Spevak, C., & Polfreman, R. (2001). Sound spotting—A frame-based approach. Proceedings of the Second Annual International Symposium on Music Information Retrieval: ISMIR 2001, Bloomington, Indiana. Vincent, E., Rodet, X., Röbel, A., Févotte, C., & Carpentier, É.L. (2003). A tentative typology of audio source separation tasks. Proceedings of the 4th Symposium on Independent Component Analysis and Blind Source Separation, Nara, Japan. Virtanen, T. (2003). Algorithm for the separation of harmonic sounds with time-frequency smoothness constraint. Proceedings of the 6th International Conference on Digital Audio Effects DAFX-03, London, UK.
KEY TERMS Digital Audio: Digital representation of sound waveform, recorded as a sequence of discrete samples, representing the intensity of the sound pressure wave at a given time instant. Sampling frequency describes the number of samples recorded in each second, and bit resolution describes the number of bits used to represent the quantized (i.e., integer) value of each sample. Fourier Analysis: Mathematical procedure for spectral analysis, based on Fourier transform that decomposes a signal into sine waves, representing frequencies present in the spectrum. Information Retrieval: The actions, methods, and procedures for recovering stored data to provide information on a given subject. Metadata: Data about data (i.e., information about the data). MIDI: Musical Instrument Digital Interface. MIDI is a common set of hardware connectors and digital codes used to interface electronic musical instruments and other electronic devices. MIDI controls actions such as note events, pitch bends, and the like, while the sound is generated by the instrument itself. Music Information Retrieval: Multi-disciplinary research on retrieving information from music. Pitch Tracking: Estimation of pitch of note events in a melody or a piece of music.
Wieczorkowska, A. (2001). Musical sound classification based on wavelet analysis. Fundamenta Informaticae Journal, 47(1/2), 175-188.
Sound: A physical disturbance in the medium through which it is propagated. Fluctuation may change routinely, and such a periodic sound is perceived as having pitch. The audible frequency range is from about 20 Hz (hertz, or cycles per second) to about 20 kHz. Harmonic sound wave consists of frequencies being integer multiples of the first component (fundamental frequency) corresponding to the pitch. The distribution of frequency components is called spectrum. Spectrum and its changes in time can be analyzed using mathematical transforms, such as Fourier or wavelet transform.
Wieczorkowska, A., & Ras, Z. (2001). Audio content description in sound databases. In Proceedings of the First Asia-Pacific Conference on Web Intelligence, WI 2001, Maebashi City, Japan.
Wavelet Analysis: Mathematical procedure for time-frequency analysis, based on wavelet transform that decomposes a signal into shifted and scaled versions of the original function called wavelet.
Viste, H., & Evangelista, G. (2003). Separation of harmonic instruments with overlapping partials in multichannel mixtures. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA’03, New Paltz, New York.
Wieczorkowska, A., & Ras, Z.W. (Eds.). (2003). Music information retrieval. Journal of Intelligent Information Systems, 21, 1.
Negative Association Rules in Data Mining Olena Daly Monash University, Australia David Taniar Monash University, Australia
INTRODUCTION Data Mining is a process of discovering new, unexpected, valuable patterns from existing databases (Chen, Han & Yu, 1996; Fayyad et. al., 1996; Frawley, Piatetsky-Shapiro & Matheus, 1991; Savasere, Omiecinski & Navathe, 1995). Though data mining is the evolution of a field with a long history, the term itself was introduced only relatively recently in the 1990s. Data mining is best described as the union of historical and recent developments in statistics, artificial intelligence, and machine learning. These techniques then are used together to study data and find previously hidden trends or patterns within. Data mining is finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends that they could not otherwise find. Different applications may require different data mining techniques. The kinds of knowledge that could be discovered from a database are categorized into association rules mining, sequential patterns mining, classification, and clustering (Chen, Han & Yu, 1996). In this article, we concentrate on association rules mining and, particularly, on negative association rules.
BACKGROUND Association rules discover database entries associated with each other in some way (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994; Agrawal, et. al., 1996; Mannila, Toivonen & Verkamo, 1994; Piatetsky-Shapiro, 1991; Srikant & Agrawal, 1995). The example could be supermarket items purchased in the same transaction, weather conditions occurring on the same day, stock market movements within the same trade day, words that tend to appear in the same sentence. In association rules mining, we search for sequences of associated entities, where each subsequence forms association rules as well. Association rule is an implication of the form A⇒B, where A and B are database itemsets. A and B belong to the same database transaction. The discovered sequences and later on association rules allow to study customers’ purchase patterns, stock market movements, and so forth.
There are two measures to evaluate association rules: support and confidence (Agrawal, Imielinski & Swami, 1993). The rule A⇒B has support s, if s% of all transactions contains both A and B. The rule A⇒B has confidence c, if c% of transactions that contains A also contains B. Association rules have to satisfy the user-specified minimum support (minsup) and minimum confidence (minconf). Such rules with high support and confidence are referred to as strong rules (Agrawal, Imielinski & Swami, 1993). The generation of the strong association rules is decomposed into the following two steps (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994): (i) discover the frequent itemsets; and (ii) use the frequent itemsets to generate the association rules. Itemset is a set of database items. Itemsets that have support at least equal to minsup are called frequent itemsets. 1-itemset is called an itemset with one item (A), 2-itemset is called an itemset with two items (AB), kitemset is called an itemset with k items. The output of the first step is all itemsets in the database that have their support at least equal to the minimum support. In the second step, all possible association rules will be generated from each frequent itemset, and the confidence of the possible rules will be calculated. If the confidence is at least equal to minimum confidence, the discovered rule will be listed in the output of the algorithm. Consider an example database in Figure 1. Let the minimum support be 40%, the minimum confidence 70%; the database contains 5 records and 4 different items A, B, C, D (see Figure 1). In a database with five records, support 40% means two records. For an item (itemset) to be frequent in the sample database, it has to occur in two or more records.
Figure 1. A sample database

1: A, D
2: B, C, D
3: A, B, C
4: A, B, C
5: A, B
Discover the Frequent Itemsets
1-itemsets: The support of each database item is calculated
Support(A) = 4 (records)
Support(B) = 4 (records)
Support(C) = 3 (records)
Support(D) = 2 (records)

All items occur in two or more records, so all of them are frequent. Frequent 1-itemsets: A, B, C, and D.
2-itemsets: From frequent 1-itemsets, candidate 2-itemsets are generated: AB, AC, AD, BC, BD, CD. To calculate the support of the candidate 2-itemsets, the database is scanned again.
Support(AB) = 3 (records)
Support(AC) = 2 (records)
Support(AD) = 1 (record)
Support(BC) = 3 (records)
Support(BD) = 1 (record)
Support(CD) = 1 (record)

Only itemsets occurring in two or more records are frequent. Frequent 2-itemsets: AB, AC, and BC.
3-itemsets: From frequent 2-itemsets, candidate 3-itemsets are generated: ABC. In step 3, for candidate 3-itemsets, it is allowed to join frequent 2-itemsets that only differ in the last item. AB and AC from step 2 differ only in the last item, so they are joined and the candidate 3-itemset ABC is obtained. AB and BC differ in the first item, so a candidate itemset cannot be produced here; AC and BC also differ in the first item (not only in the last), so no candidate itemset is produced. So there is only one candidate 3-itemset, ABC. To calculate the support of the candidate 3-itemsets, the database is scanned again.
Support(ABC) = 2 (records)

Frequent 3-itemsets: ABC.
4-itemsets: In step 3, there is only one frequent itemset, so it is impossible to produce any candidate 4-itemsets. The frequent itemsets generation stops here. If there was more than one frequent 3-itemset, then step 4 (and onwards) would be similar to step 3.
The list of frequent itemsets: A, B, C, D, AB, AC, BC, and ABC.
Use the Frequent Itemsets to Generate the Association Rules

To produce an association rule, at least a frequent 2-itemset is required. So the generation process starts with 2-itemsets and goes on with longer itemsets. Every possible rule is produced from each frequent itemset:
A⇒B B⇒ A A⇒C C⇒ A B⇒ C C⇒ B A ⇒ BC B ⇒AC C ⇒AB BC ⇒ A AC ⇒ B AB ⇒ C
To distinguish the strong association rules among all possible rules, the confidence of each possible rule will be calculated. All support values have been obtained in the first step of the algorithm. Confidence(A⇒B)=Support(AB)/Support(A) Confidence(AB⇒C)=Support(ABC)/Support(AB) Association rules have been one of the most developed areas in data mining. Most of research has been done in positive implications, which is when occurrence of an item implies the occurrence of another item. Negative rules also consider negative implications, when occurrence of an item implies absence of another item.
MAIN THRUST Negative itemsets are itemsets that contain both items and their negations (e.g., AB~C~DE~F). ~C means negation of the item C (absence of the item C in the database record). Negative association rules are rules of a kind A⇒B, where A and B are frequent negative itemsets. Negative association rules consider both presence and absence of items in the database record and mine for negative implications between database items. Examples of negative association rules could be Meat⇒~Fish, which implies that when customers purchase meat at the supermarket, they do not buy fish at the same time, or ~Sunny⇒Windy, which means no sunshine
Negative Association Rules in Data Mining
implies wind, or ~OilStock⇒~PetrolStock, which says if the price of oil shares does not go up, petrol shares wouldn't go up either. In the negative association rules mining, the number of possible generated negative rules is vast and numerous in comparison to the positive association rules mining. It is a critical issue to distinguish only the most interesting ones among the candidates. A simple and straightforward method to search negative association would be to add the negations of absent items to the database transaction and treat them as additional items. Then, the traditional positive association mining algorithms could be run on the extended database. This approach works in domains with a very limited number of database attributes. A good example is weather prediction. The database only contains 10-15 different factors like fog, rain, sunshine, and so forth. A database record would look like sunshine, ~fog, ~rain, windy. Thus, the number of database items will be doubled, and the number of itemsets will increase significantly. Besides, the support of negations of items can be calculated, based on the support of positive items. Then, specially created algorithms are required. The approach works in tasks like weather prediction or limited stock market analyses but is hardly applicable to domains with thousands of different database items (sales/healthcare/marketing). There could be numerous and senseless negative rules generated then. That is why special approaches are required to distinguish interesting negative association rules. There are two categories in mining negative association rules. In category I, a special measure for negative association rules is created. In category II, the data/first-generated rules are analyzed to further produce only the most interesting rules with no computational costs. First, we explain category I. A measure to distinguish the most valuable negative rules will be employed. The interest measure may be applied to negative rules (Wu, Zhang & Zhang, 2002). The interest measure was first offered for positive rules. The measure is defined by a formula: Interest(A⇒B) = support(AB) - support(A) × support(B). The negative rules are generated from infrequent itemsets. The mining process for negative rules starts with 1-frequent itemsets. A more complicated way to search negative association rules is to employ statistical theories (Brin, Motwani & Silverstein, 1997). Correlations among database items are calculated. If the correlation is positive, a positive association rule will be obtained. If the correlation is negative, a negative association rule will be obtained. The stronger the correlation, the stronger the generated rule. In this approach, Chi-squared statistics is employed to calculate the correlations among database items. Each different database transaction (market basket) denotes a
cell in a contingency table. The chi-squared statistic is calculated, which is, in short, a normalized deviation from expectation for each cell. From the obtained value, the strength of the correlation between database items is estimated. Now, we explain category II. The data or negative rules generated in the initial step are analyzed first, and then the obtained information is employed on the next steps of the generation process. Mining frequent negative itemsets starts with itemsets that contain n items and only one negation of an item; for example, {ABC~D, AB~CD}, where ~C, ~D means absence of the item C or D. Then, the produced rules with n items and one negation are analyzed and some information is obtained that allows the next step, when producing rules with n items and two negations (A~BC~D), to disregard some itemsets without even considering them (Fortes, Balcazar & Morales, 2001). n takes on values from 1 to the overall number of different items in the database. The negative rules mining stops when the maximum number of negations m in rules has been reached, m provided by the user, or when all negative rules have been generated, if m has not been provided. Another approach in category II is to consider the hierarchy of items. An example of database hierarchy is shown in Figure 2. It is supposed that database items are organized into the hierarchy ancestor-to-descendant. First, the negative rules on the very top level are generated and then on the lower levels. When proceeding to the lower level of hierarchy, the information obtained from rules on the higher level is utilized (Daly & Taniar, 2003, 2004). The hierarchy approach makes the rules more general, which is crucial for negative association. Any further exploration can be ceased when no additional knowledge will be extracted on the lower levels. For instance, if a rule {Beef ⇒ Not Juice} is generated, it is not interesting what kind of juice (because "Not Juice" is negation of any kind of juice). In contrast, in the positive rule {Beef ⇒ Juice}, it is required to discover a specific kind of juice and go deeper into the levels.

Figure 2. Hierarchy of items

Items
  Stationary
    Writing Instruments
      Pen
      Pencil
    Paper
    Ink
  Food & Drinks
    Drinks
      Juice
        Apple Juice
        Orange Juice
      Coke
    Food
      Fish
        Trout
      Meat
        Beef
      Pastry
        Bread
      Dairy
        Milk
        Butter
The hierarchy approach does, however, require additional information about the database, namely the way items are organized into a hierarchy, and an expert's opinion may be needed. After the negative association rules have been generated, some reduction techniques may be applied to refine the rules. For instance, the reduction techniques could verify the true support of items (not ancestors) or generalize the sets of negative rules.
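To make the category I measure concrete, the sketch below applies the interest formula quoted above to candidate rules that involve negated items, using the sample database of Figure 1. The identities used to derive supports of negations from positive supports (support(~B) = 1 - support(B), and support(A and ~B) = support(A) - support(AB)) are standard, but the specific candidate rules evaluated here are invented for illustration.

```python
transactions = [{'A', 'D'}, {'B', 'C', 'D'}, {'A', 'B', 'C'},
                {'A', 'B', 'C'}, {'A', 'B'}]
n = len(transactions)

def sup(items, absent=()):
    """Fraction of records containing all of `items` and none of `absent`."""
    return sum(1 for t in transactions
               if set(items) <= t and not (set(absent) & t)) / n

def interest_negative(a, b):
    """Interest(a => ~b) = support(a, ~b) - support(a) * support(~b)."""
    return sup([a], absent=[b]) - sup([a]) * (1 - sup([b]))

# Candidate negative rules of the form X => ~D on the sample database.
for a, b in [('A', 'D'), ('C', 'D'), ('B', 'D')]:
    print(f"{a} => ~{b}: interest = {interest_negative(a, b):+.3f}")
```

Candidates with interest close to zero add little beyond what the marginal supports already imply, which is exactly why such a measure helps prune the very large space of possible negative rules.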
FUTURE TRENDS There has been some research in the area of negative association mining. The negative association obviously has provoked more complicated issues than the positive association mining. In order to overcome the difficulties of negative association mining, the research community endeavors to invent new methods and algorithms to achieve the desirable efficiency. Following are the limitations and trends in the negative association mining.
Issues/Open Research Areas •
Some research has been done in the negative association rules mining, but there are many unexplored subareas, and advanced research is essential for the area’s development. •
A need for special interest measures and algorithms for negative association vs. positive association.
New interest measures should be invented, new approaches for negative rules discovered, the ones that would take advantage of the special properties of the negative rules. •
A need for parallel algorithms for negative association rules mining.
Parallel algorithms have taken data mining to the new extent of mining abilities. As the negative association is a relatively new research area, the parallel data mining algorithms have not been developed yet, and they are absolutely necessary. •
A need for new sophisticated reduction techniques.
A vast number of candidate rules for negative association rules in comparison with the positive association rules.
After the set of negative rules has been generated, one may apply some reduction and refining techniques to the set to make sure the rules are not redundant.
In comparison with the positive association mining, in the negative association rules, a greater number of candidate itemsets is obtained, which would make it technically impossible to evaluate all and distinguish the rules with high support/confidence/interest. Besides, the generated rules may not provide efficient information. Special approaches, measures, and algorithms should be developed to solve the issue.
The issues require additional research in the area to be conducted and more scientists to contribute their knowledge, time, and inspiration.
•
Hardware limitations.
Hardware limitations are crucial for negative association rules mining for the following reasons: a huge number of calculations and assessments required from the CPU; a storage for vast frequent itemsets trees in the sufficient main memory; and rapid data interchange with databases required. •
Vast data sets.
With the development of the hardware components, vast databases could be handled in sufficient time for the companies. •
A need for advanced research in negative association mining.
Current Trends

• Use of various approaches in negative association mining. Researchers are currently trying to devise distinctive approaches to generate negative rules and to make sure they provide valuable information. A straightforward approach is not acceptable for negative rules: support/confidence measures alone are not enough to distinguish them, so novel or adapted measures, particularly interest measures, need to be utilized. Data structure analysis also forms an important part of the negative rule generation process.
• Specialized mining algorithms development. The algorithms for positive association rules mining are not suitable for negative rules, so new, specialized algorithms are often developed by the researchers currently working in negative rules mining.
• Rules reduction development. Rules reduction techniques have been developed to refine the final set of negative rules; they verify rule quality or generalize the rules. Researchers are attempting to overcome the vast search space in negative association mining and are looking at the issue from different points of view to discover what makes negative rules more interesting and what measures can distinguish the highly valuable rules from the rest.
CONCLUSION
This article is a review of the research that has been done in negative association rules. Negative association rules have brought interesting challenges and research opportunities, and this remains an open area for future research. The article describes the main issues and approaches in negative association rules mining derived from the literature. One of the main issues is the vast number of candidate rules for negative association in comparison with positive association. There are two categories of approaches for mining negative association rules: in category I, a special measure for negative association rules is created; in category II, the data or the first-generated rules are analyzed so that only the most interesting rules are produced, avoiding unnecessary computational cost.

REFERENCES

Agrawal, R., et al. (1996). Fast discovery of association rules. In U. Fayyad et al. (Eds.), Advances in knowledge discovery and data mining. American Association for Artificial Intelligence Press.

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the Association for Computing Machinery, Special Interest Group on Management of Data, International Conference on Management of Data.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Data Bases.

Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market basket: Generalizing association rules to correlations. Proceedings of the Association for Computing Machinery, Special Interest Group on Management of Data, International Conference on Management of Data.

Chen, M., Han, J., & Yu, P. (1996). Data mining: An overview from a database perspective. Institute of Electrical and Electronics Engineers, Transactions on Knowledge and Data Engineering, 8(6), 866-883.

Daly, O., & Taniar, D. (2003). Mining multiple-level negative association rules. Proceedings of the International Conference on Intelligent Technologies (InTech'03).

Daly, O., & Taniar, D. (2004). Exception rules mining based on negative association rules. Computational Science and Its Applications, 3046, 543-552.

Fayyad, U., et al. (1996). Advances in knowledge discovery and data mining. American Association for Artificial Intelligence Press.

Fortes, I., Balcázar, J., & Morales, R. (2001). Bounding negative information in frequent sets algorithms. Proceedings of the 4th International Conference on Discovery Science.

Frawley, W., Piatetsky-Shapiro, G., & Matheus, C. (1991). Knowledge discovery in databases: An overview. American Association for Artificial Intelligence Press.

Mannila, H., Toivonen, H., & Verkamo, A. (1994). Efficient algorithms for discovering association rules. Proceedings of the American Association for Artificial Intelligence Workshop on Knowledge Discovery in Databases.

Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. American Association for Artificial Intelligence Press.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of the 21st International Conference on Very Large Data Bases.

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. Proceedings of the 21st Very Large Data Bases Conference.

Wu, X., Zhang, C., & Zhang, S. (2002). Mining both positive and negative association rules. Proceedings of the 19th International Conference on Machine Learning.
KEY TERMS

Association Rules: An implication of the form A⇒B, where A and B are database itemsets. Association rules have to satisfy the pre-set minimum support (minsup) and minimum confidence (minconf) constraints.
Confidence: The rule A⇒B has confidence c if c% of the transactions that contain A also contain B.

Database Item: An item/entity occurring in the database.

Frequent Itemsets: Itemsets that have support at least equal to minsup.

Itemset: A set of database items.

Negative Association Rules: Rules of the form A⇒B, where A and B are frequent negative itemsets.

Negative Itemsets: Itemsets that contain both items and negations of items.

Support: The rule A⇒B has support s if s% of all transactions contain both A and B.
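As a quick illustration of the support and confidence definitions above, and of how the support of a negative itemset can be derived from positive supports, here is a minimal Python sketch; the baskets and item names are made up.

```python
# Minimal sketch of the support/confidence definitions (fractions rather than percentages).
def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

baskets = [{"beef", "bread"}, {"beef", "coke"}, {"juice", "bread"}, {"beef", "bread"}]
print(support(baskets, {"beef", "bread"}))        # s for Beef => Bread
print(confidence(baskets, {"beef"}, {"bread"}))   # c for Beef => Bread
# Support of a negative itemset such as {Beef, ~Juice}: supp(A, ~B) = supp(A) - supp(A, B)
print(support(baskets, {"beef"}) - support(baskets, {"beef", "juice"}))
```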
Neural Networks for Prediction and Classification
Kate A. Smith, Monash University, Australia
INTRODUCTION

Neural networks are simple computational tools for examining data and developing models that help to identify interesting patterns or structures. The data used to develop these models is known as training data. Once a neural network has been exposed to the training data, and has learnt the patterns that exist in that data, it can be applied to new data thereby achieving a variety of outcomes. Neural networks can be used to:

• learn to predict future events based on the patterns that have been observed in the historical training data;
• learn to classify unseen data into pre-defined groups based on characteristics observed in the training data;
• learn to cluster the training data into natural groups based on the similarity of characteristics in the training data.
BACKGROUND

There are many different neural network models that have been developed over the last fifty years or so to achieve these tasks of prediction, classification, and clustering. Broadly speaking, these models can be grouped into supervised learning algorithms (for prediction and classification) and unsupervised learning algorithms (for clustering). This paper focuses on the former paradigm. We refer the interested reader to Haykin (1994) for a detailed account of many neural network models.
According to a recent study (Wong, Jiang, & Lam, 2000), over fifty percent of reported neural network business application studies utilise multilayered feedforward neural networks (MFNNs) with the backpropagation learning rule (Werbos, 1974; Rumelhart & McClelland, 1986). This type of neural network is popular because of its broad applicability to many problem domains of relevance to business and industry: principally prediction, classification, and modeling. MFNNs are appropriate for solving problems that involve learning the relationships between a set of inputs and known outputs. They are a supervised learning technique in the sense that they require a set of training data in order to learn the relationships. With supervised learning models, the training data contains matched information about input variables and observable outcomes, and models can be developed that learn the relationship between these data characteristics (inputs) and outcomes (outputs).
Figure 1 shows the architecture of a 3-layer MFNN. The weights are updated using the backpropagation learning algorithm, as summarized in Table 1, where f(.) is a non-linear activation function such as 1/(1 + exp(−λ(.))), c is the learning rate, λ is the gradient (steepness) of the activation function, and d_k is the desired output of the kth neuron. The successful application of MFNNs to a wide range of areas of business, industry and society has been reported widely. We refer the interested reader to the excellent survey articles covering the application domains of medicine (Weinstein, Myers, Casciari, Buolamwini, & Raghavan, 1994), communications (Ibnkahla, 2000), business (Smith & Gupta, 2002), finance (Refenes, 1995), and a variety of industrial applications (Fogelman, Soulié, & Gallinari, 1998).
MAIN THRUST

A number of significant issues will be discussed, and some guidelines for successful training of neural networks will be presented in this section.

Figure 1. Architecture of MFNN (note: not all weights are shown)
(The figure depicts inputs x1, …, xN plus a bias input of −1 feeding a hidden layer of J neurons with weights w_{ji} and outputs y1, …, yJ, which in turn feed an output layer of K neurons with weights v_{kj} producing outputs z1, …, zK.)
Table 1. Backpropagation learning algorithm for MFNNs

STEP 1: Randomly select an input pattern x to present to the MFNN through the input layer.

STEP 2: Calculate the net inputs and outputs of the hidden layer neurons:
net_j^h = ∑_{i=1}^{N+1} w_{ji} x_i,   y_j = f(net_j^h)

STEP 3: Calculate the net inputs and outputs of the K output layer neurons:
net_k^o = ∑_{j=1}^{J+1} v_{kj} y_j,   z_k = f(net_k^o)

STEP 4: Update the weights in the output layer (for all k, j pairs):
v_{kj} ← v_{kj} + c λ (d_k − z_k) z_k (1 − z_k) y_j

STEP 5: Update the weights in the hidden layer (for all i, j pairs):
w_{ji} ← w_{ji} + c λ² y_j (1 − y_j) x_i ∑_{k=1}^{K} (d_k − z_k) z_k (1 − z_k) v_{kj}

STEP 6: Update the error term:
E ← E + ∑_{k=1}^{K} (d_k − z_k)²
and repeat from STEP 1 until all input patterns have been presented (one epoch).

STEP 7: If E is below some predefined tolerance level (say 0.000001), then STOP. Otherwise, reset E = 0 and repeat from STEP 1 for another epoch.
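The following Python sketch (illustrative, using numpy; the network size, learning rate, gain and stopping values are arbitrary choices, not prescribed by the article) follows the steps of Table 1 fairly literally, with c as the learning rate and lam as the gain λ of the sigmoid activation.

```python
# Compact sketch of the backpropagation procedure in Table 1 (illustrative only).
import numpy as np

def f(net, lam=1.0):                           # activation 1 / (1 + exp(-lam * net))
    return 1.0 / (1.0 + np.exp(-lam * net))

def train_mfnn(X, D, J=4, c=0.5, lam=1.0, tol=1e-6, max_epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    N, K = X.shape[1], D.shape[1]
    W = rng.uniform(-0.5, 0.5, (J, N + 1))     # hidden weights, last column = bias weight
    V = rng.uniform(-0.5, 0.5, (K, J + 1))     # output weights, last column = bias weight
    for epoch in range(max_epochs):
        E = 0.0
        for x, d in zip(X, D):                 # STEP 1: present each pattern
            xb = np.append(x, -1.0)            # bias input of -1, as in Figure 1
            y = f(W @ xb, lam)                 # STEP 2: hidden layer outputs
            yb = np.append(y, -1.0)
            z = f(V @ yb, lam)                 # STEP 3: output layer outputs
            delta_out = (d - z) * z * (1 - z)  # STEP 4: output-layer weight update
            V += c * lam * np.outer(delta_out, yb)
            delta_hid = y * (1 - y) * (V[:, :J].T @ delta_out)  # STEP 5: hidden update
            W += c * lam**2 * np.outer(delta_hid, xb)
            E += np.sum((d - z) ** 2)          # STEP 6: accumulate the error term
        if E < tol:                            # STEP 7: stopping test, else next epoch
            break
    return W, V

# Example usage: learn the XOR mapping
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W, V = train_mfnn(X, D)
```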
Critical Issues for Neural Networks

Neural networks have not been without criticism, and there are a number of important limitations that need careful attention. At this stage it is still a trial-and-error process to obtain the optimal model to represent the relationships in a dataset. This requires appropriate selection of a number of parameters, including the learning rate, the momentum rate (if an additional momentum term is added to steps 4 and 5 of the algorithm), the choice of activation function, as well as the optimal neural architecture. If the architecture is too large, with too many hidden neurons, the MFNN will find it very easy to memorize the training data, but will not have generalized its knowledge to other datasets (out-of-sample or test sets). This problem is known as overtraining. Another limitation is the gradient descent nature of the backpropagation learning algorithm, which can cause the network weights to become trapped in local minima of the error function being minimized. Finally, neural networks are considered inappropriate for certain problem domains where insight into and explanation of the model is required. Some work has been done on extracting rules from trained neural networks (Andrews, Diederich, & Tickle, 1995), but other data mining techniques like rule induction may be better suited to such situations.
Guidelines for Successful Training

Successful prediction and classification with MFNNs requires careful attention to two main stages:

1. Learning: The developed model must represent an adequate fit of the training data. The data itself must contain the relationships that the neural network is trying to learn, and the neural network model must be able to derive the appropriate weights to represent these relationships.
2. Generalization: The developed model must also perform well when tested on new data to ensure that it has not simply memorized the training data characteristics. It is very easy for a neural network model to "overfit" the training data, especially for small data sets. The architecture needs to be kept small, and key validation procedures need to be adopted to ensure that the learning can be generalized.
Within each of these stages there are a number of important guidelines that can be adopted to ensure effective learning and successful generalization for prediction and classification problems (Remus & O'Connor, 2001).

To ensure successful learning:

• Prepare the Data Prior to Learning the Neural Network Model: A number of pre-processing steps may be necessary, including cleansing the data, removing outliers, determining the correct level of summarisation, and converting non-numeric data (Pyle, 1999).
• Normalise, Scale, Deseasonalise and Detrend the Data Prior to Learning: Time series data often needs to be deseasonalised and detrended to enable the neural network to learn the true patterns in the data (Zhang, Patuwo, & Hu, 1998).
• Ensure that the MFNN Architecture is Appropriate to Learn the Data: If there are not enough hidden neurons, the MFNN will be unable to represent the relationships in the data. Most commercial software packages extend the standard 3-layer MFNN architecture shown in Figure 1 to consider additional layers of neurons, and sometimes include feedback loops to provide enhanced learning capabilities. The number of input dimensions can also be reduced to improve the efficiency of the architecture if inclusion of some variables does not improve the learning.
• Experiment with the Learning Parameters to Ensure that the Best Learning is Produced: There are several parameters in the backpropagation learning equations that require selection and experimentation. These include the learning rate c, the equation of the function f() and its gradient λ, and the values of the initial weights.
• Consider Alternative Learning Algorithms to Backpropagation: Backpropagation is a gradient descent technique that guarantees convergence only to a local minimum of the error function. Other local optimization techniques, such as the Levenberg-Marquardt algorithm, have gained popularity due to their increased speed and reduced memory requirements (Kinsella, 1992). Recently, researchers have used more sophisticated search strategies such as genetic algorithms and simulated annealing in an effort to find globally optimal weight values (Sexton & Dorsey, 2000).

To ensure successful generalization:

• Extract a Test Set from the Training Data: Commonly 20% of the training data is reserved as a test set. The neural network is trained only on the remaining 80% of the data, and the degree to which it has learnt or memorized the data is gauged by the measured performance on the test set. When ample additional data is available, a third group of data known as the validation set is used to evaluate the generalization capabilities of the learnt model. For time series prediction problems, the validation set is usually taken as the most recently available data, and provides the best indication of how the developed model will perform on future data. When there is insufficient data to extract a test set and leave enough training data for learning, cross-validation is used. This involves randomly extracting a test set, developing a neural network model based on the remaining training data, and repeating the process with several random divisions of the data. The reported results are based on the average performance across all randomly extracted test sets. The most popular method, known as ten-fold cross-validation, involves repeating the approach for ten distinct subsets of the data (a brief sketch of these procedures follows this list).
• Avoid Unnecessarily Large and Complex Architectures: An architecture containing a large number of hidden layers and hidden neurons results in more weights than a smaller architecture. Since the weights correspond to the degrees of freedom or number of parameters the model has to fit the data, it is very easy for such large architectures to overfit the training data. For the sake of future generalization of the model, the architecture should therefore be only as large as is required to learn the data and achieve an acceptable performance on all data sets (training, test, and validation where available).
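A minimal, model-agnostic sketch of the hold-out and ten-fold cross-validation procedures described above; the train and error callables are placeholders for whatever training and error-measurement routines are being evaluated, and the 20%/ten-fold figures simply follow the guidelines in the text.

```python
# Illustrative evaluation helpers (numpy only).
import numpy as np

def holdout_split(X, y, test_fraction=0.2, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_fraction)       # e.g. reserve 20% as a test set
    test, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test], y[test]

def k_fold_error(X, y, train, error, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[tr], y[tr])             # fit on the remaining training data
        errors.append(error(model, X[test], y[test]))
    return np.mean(errors)                      # report the average over all folds
```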
When these guidelines are observed, the chances of developing a MFNN model that learns the training data effectively and generalizes its learning on new data are greatly improved. Most commercially available neural network software packages include features to facilitate adherence to these guidelines.
FUTURE TRENDS

Despite the successful application of neural networks to a wide range of application areas, there is still much research that continues to improve their functionality. Specifically, research continues in the development of hardware models (chips or specialized analog devices) that enable neural networks to be implemented rapidly in industrial contexts. Other research attempts to connect neural networks back to their roots in neurophysiology, and seeks to improve the biological plausibility of the models. On-line learning of neural network models that are more effective in situations when the data is dynamically changing will also become increasingly important. A useful discussion of the future trends of neural networks can be found at a virtual workshop discussion: http://www.ai.univie.ac.at/neuronet/workshop/.
CONCLUSION

Over the last decade or so, we have witnessed neural networks come of age. The idea of learning to solve complex pattern recognition problems using an intelligent data-driven approach is no longer simply an interesting challenge for academic researchers. Neural networks have proven themselves to be a valuable tool across a wide range of application areas. As a critical component of most data mining systems, they are also changing the way organizations view the relationship between their data and their business strategy. The multilayered feedforward neural network (MFNN) has been presented as the most common neural network employing supervised learning to model the relationships between inputs and outputs. This dominant neural network model finds application across a broad range of prediction and classification problems. A series of critical guidelines have been provided to facilitate the successful application of these neural network models.

REFERENCES

Andrews, R., Diederich, J., & Tickle, A. (1995). A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8, 373-389.

Fogelman, S. F., & Gallinari, P. (Eds.). (1998). Industrial applications of neural networks. Singapore: World Scientific Press.

Haykin, S. (1994). Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Macmillan Publishing Company.

Ibnkahla, M. (2000). Applications of neural networks to digital communications: A survey. Signal Processing, 80(7), 1185-1215.

Kinsella, J. A. (1992). Comparison and evaluation of variants of the conjugate gradient methods for efficient training in feed-forward neural networks with backward error propagation. Neural Networks, 3(27).

Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann Publishers.

Refenes, A. P. (Ed.). (1995). Neural networks in the capital markets. Chichester: John Wiley & Sons Ltd.

Remus, W., & O'Connor, M. (2001). Neural networks for time series forecasting. In J. S. Armstrong (Ed.), Principles of forecasting: A handbook for researchers and practitioners. Norwell, MA: Kluwer Academic Publishers.

Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press.

Sexton, R. S., & Dorsey, R. E. (2000). Reliable classification using neural networks: A genetic algorithm and backpropagation comparison. Decision Support Systems, 30, 11-22.

Smith, K. A., & Gupta, J. N. D. (Eds.). (2002). Neural networks in business: Techniques and applications. Hershey, PA: Idea Group Publishing.

Weinstein, J. N., Myers, T., Casciari, J. J., Buolamwini, J., & Raghavan, K. (1994). Neural networks in the biomedical sciences: A survey of 386 publications since the beginning of 1991. Proceedings of the World Congress on Neural Networks, 1 (pp. 121-126).

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, Cambridge, MA.

Wong, B. K., Jiang, L., & Lam, J. (2000). A bibliography of neural network business application research: 1994-1998. Computers and Operations Research, 27(11), 1045-1076.

Zhang, Q., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14, 35-62.
KEY TERMS

Architecture: The configuration of the network of neurons into layers of neurons is referred to as the neural architecture. To specify the architecture involves declaring the number of input variables, hidden layers, hidden neurons, and output neurons.

Backpropagation: The name of the most common learning algorithm for MFNNs. It involves modifying the weights of the MFNN in such a way that the error (the difference between the MFNN output and the desired output in the training data) is minimized over time. The error at the hidden neurons is approximated by propagating the output error backwards, hence the name backpropagation.

Epoch: A complete presentation of all training data to the MFNN is called one epoch. Typically a MFNN will need to be trained for many thousands of epochs before the error is reduced to an acceptable level.

Generalization: The goal of neural network training is to develop a model that generalizes its knowledge to unseen data. Overtraining must be avoided for generalization to occur.

Hidden Neurons: The name given to the layer of neurons between the input variables and the output neurons. If too many hidden neurons are used, the neural network may be easily overtrained. If too few hidden neurons are used, the neural network may be unable to learn the training data.
Learning Algorithm: The method used to change the weights so that the error is minimized. Training data is repeatedly presented to the MFNN through the input layer, the output of the MFNN is calculated and compared to the desired output. Error information is used to determine which weights need to be modified, and by how much. There are several parameters involved in the learning algorithm including learning rate, momentum factor, initial weights, etc.
Neural Model: Specifying a neural network model involves declaring the architecture and activation function types.

Overtraining: When the MFNN performs significantly better on the training data than on out-of-sample test data, it is considered to have memorized the training data and to be overtrained. This can be avoided by following the guidelines presented above.
Off-Line Signature Recognition

Indrani Chakravarty, Indian Institute of Technology, India
Nilesh Mishra, Indian Institute of Technology, India
Mayank Vatsa, Indian Institute of Technology, India
Richa Singh, Indian Institute of Technology, India
P. Gupta, Indian Institute of Technology, India
INTRODUCTION

The most commonly used protection mechanisms today are based on either what a person possesses (e.g., an ID card) or what the person remembers (like passwords and PIN numbers). However, there is always a risk of passwords being cracked by unauthenticated users and ID cards being stolen, in addition to shortcomings like forgotten passwords and lost ID cards (Huang & Yan, 1997). To avoid such inconveniences, one may opt for the newer methodology of biometrics, which, though expensive, is almost infallible, as it uses unique physiological and/or behavioral (Huang & Yan, 1997) characteristics possessed by an individual for identity verification. Examples include signature, iris, face, and fingerprint recognition based systems. The most widespread and legally accepted biometric among the ones mentioned, especially in identity verification for monetary transactions, is the handwritten signature, which belongs to behavioral biometrics (Huang & Yan, 1997). This technique, referred to as signature verification, can be classified into two broad categories: online and off-line. While online verification deals with both static features (for example, number of black pixels, length and height of the signature) and dynamic features (such as acceleration and velocity of signing, pen tilt, pressure applied), off-line verification extracts and utilizes only the static features (Ramesh and Murty, 1999). Consequently, online verification is much more efficient than off-line verification in terms of both accuracy of detection and time. But, since online methods are quite expensive to implement, and also because many applications still require the use of off-line verification, the latter, though less effective, is still used in many institutions.

BACKGROUND

Starting from banks, signature verification is used in many other financial exchanges, where an organization's main concern is not only to give quality services to its customers, but also to protect their accounts from being illegally manipulated by forgers. Forgeries can be classified into four types: random, simple, skilled, and traced (Ammar, Fukumura & Yoshida, 1988; Drouhard, Sabourin, & Godbout, 1996). Generally, online signature verification methods display a higher accuracy rate (closer to 99%) than off-line methods (90-95%) for all types of forgeries. This is because, in off-line verification methods, the forger has to copy only the shape (Jain & Griess, 2000) of the signature. In online verification methods, on the other hand, since the hardware used captures the dynamic features of the signature as well, the forger has to copy not only the shape of the signature but also the temporal characteristics (pen tilt, pressure applied, velocity of signing, etc.) of the person whose signature is to be forged. In addition, he has to simultaneously hide his own inherent style of writing the signature, making it extremely difficult to deceive an online signature verification device. Despite its greater accuracy, online signature recognition is encountered less often than off-line signature recognition in many parts of the world, because it cannot be used everywhere, especially where signatures have to be written in ink, e.g., on cheques, where only off-line methods will work. Moreover, it requires extra and special hardware (e.g., pressure-sensitive signature pads in online methods vs. optical scanners in off-line methods), which is not only expensive but also has a fixed and short life span.
MAIN THRUST

In general, all current off-line signature verification systems can be divided into the following sub-modules:

• Data Acquisition
• Preprocessing and Noise Removal
• Feature Extraction and Parameter Calculations
• Learning and Verification (or Identification)
Data Acquisition

Off-line signatures do not consider the time-related aspects of the signature, such as velocity, acceleration and pressure. Therefore, they are often termed "static" signatures, and are captured from the source (i.e., paper) using a camera or a high-resolution scanner, in contrast to online signatures (in which data is captured using a digitizer or an instrumented pen generating signals) (Tappert, Suen, & Wakahara, 1990; Wessels & Omlin, 2000), which do consider the time-related or dynamic aspects besides the static features.

Preprocessing

The preprocessing techniques that are generally performed in off-line signature verification methods comprise noise removal, smoothening, space standardization and normalization, thinning or skeletonization, converting a gray scale image to a binary image, extraction of the high pressure region image, etc.

Figure 1. Modular structure of an off-line verification system (the cropped scanned signature image passes through preprocessing and feature extraction modules to the learning and verification modules, coordinated by an analyzer/controller module with a database, to produce the output)
Figure 2. Noise removal using a median filter: (a) gray scale image; (b) noise-free image
• Noise Removal: Signature images, like any other images, may contain noise such as extra dots or pixels (Ismail & Gad, 2000) that do not originally belong to the signature but get included in the image because of possible hardware problems or the presence of background noise like dirt. To recognize the signature correctly, these noise elements have to be removed from the background in order to obtain accurate feature matrices in the feature extraction phase. A number of filters have been used as preprocessors (Ismail & Gad, 2000), including the mean filter, the median filter, and a filter based on merging overlapped run lengths into one rectangle (Ismail & Gad, 2000). Among these, mean and median filtering are considered standard noise reduction and isolated peak noise removal techniques (Huang & Yan, 1997). The median filter is preferred, however, because, unlike the mean filter, it removes noise without blurring the edges of the signature instance.
• Space Standardization and Normalization: In space standardization, the distance between the horizontal components of the same signature is standardized by removing blank columns, so that it does not interfere with the calculation of the global and local features of the signature image (Baltzakis & Papamarkos, 2001; Qi & Hunt, 1994). In normalization, the signature image is scaled to a standard size, namely the average size of all training samples, keeping the width-to-height ratio constant (Baltzakis & Papamarkos, 2001; Ismail & Gad, 2000; Ramesh & Murty, 1999).
• Extracting the Binary Image from the Grayscale Image: Using Otsu's method, a threshold is calculated to obtain a binary version of the grayscale image (Ammar, Fukumura, & Yoshida, 1988; Ismail & Gad, 2000; Qi & Hunt, 1994). The binarization is defined as:

S(x, y) = 1 if R(x, y) > threshold, and S(x, y) = 0 otherwise,

where S(x, y) is the binary image and R(x, y) is the grayscale image (a small code sketch of this step is given after this list).

Figure 3. Converting a grayscale image into a binary image: (a) gray scale; (b) binary

• Smoothening: The noise-removed image may have small connected components or small gaps, which may need to be filled. This is done using a binary mask obtained by thresholding, followed by morphological operations (which include both erosion and dilation) (Huang & Yan, 1997; Ismail & Gad, 2000; Ramesh & Murty, 1999).
• Extracting the High Pressure Region Image: Ammar, Fukumura, and Yoshida (1986) have used the high pressure region as one of the prominent features for detecting skilled forgeries. It is the area of the image where the writer gives special emphasis, reflected in terms of higher ink density (more specifically, gray level intensities higher than the chosen threshold). The threshold is obtained as follows:

Th_HPR = I_min + 0.75 (I_max − I_min),

where I_min and I_max are the minimum and maximum grayscale intensity values.
• Thinning or Skeletonization: This process is carried out to obtain a single-pixel-thick image from the binary signature image. Different researchers have used various algorithms for this purpose (Ammar, Fukumura, & Yoshida, 1988; Baltzakis & Papamarkos, 2001; Huang & Yan, 1997; Ismail & Gad, 2000).
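The thresholding steps above can be sketched in a few lines of Python with numpy. The example below is illustrative only: the fixed threshold of 128 merely stands in for a value that would normally come from Otsu's method, and the random array stands in for a scanned signature image.

```python
# Sketch of the binarization and high-pressure-region thresholds described above.
import numpy as np

def binarize(gray, threshold):
    # S(x, y) = 1 if R(x, y) > threshold, else 0 (as in the formula above)
    return (gray > threshold).astype(np.uint8)

def high_pressure_region(gray):
    i_min, i_max = float(gray.min()), float(gray.max())
    th_hpr = i_min + 0.75 * (i_max - i_min)   # Th_HPR = Imin + 0.75 (Imax - Imin)
    return (gray > th_hpr).astype(np.uint8)

gray = np.random.default_rng(0).integers(0, 256, (64, 128))  # stand-in grayscale image
binary = binarize(gray, threshold=128)        # 128 stands in for an Otsu threshold
hpr = high_pressure_region(gray)
```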
Figure 4. Extracting the high pressure region image: (a) gray scale; (b) high pressure region

Figure 5. Extraction of thinned image: (a) binary image; (b) thinned image

Feature Extraction

Most of the features can be classified into two categories: global and local features.

• Global Features: Ismail and Gad (2000) have described global features as characteristics which identify or describe the signature as a whole. They are less responsive to small distortions and hence are less sensitive to noise as well. Examples include width and height of individual signature components, width-to-height ratio, total area of black pixels in the binary and high pressure region (HPR) images, horizontal and vertical projections of signature images, baseline, baseline shift, relative position of the global baseline and centre of gravity with respect to the width of the signature, number of cross and edge points, circularity, central line, corner curve and corner line features, slant, run lengths of each scan of the components of the signature, kurtosis (horizontal and vertical), skewness, relative kurtosis, relative skewness, relative horizontal and vertical projection measures, envelopes, and individual stroke segments (Ammar, Fukumura, & Yoshida, 1990; Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Fang et al., 2003; Huang & Yan, 1997; Ismail & Gad, 2000; Qi & Hunt, 1994; Ramesh & Murty, 1999; Yacoubi, Bortolozzi, Justino, & Sabourin, 2000; Xiao & Leedham, 2002).
• Local Features: Local features are confined to a limited portion of the signature, which is obtained by dividing the signature image into grids or by treating each individual signature component as a separate entity (Ismail & Gad, 2000). In contrast to global features, they are responsive to small distortions like dirt, but are not influenced by other regions of the signature. Hence, though extraction of local features requires more computation, they are much more precise. Many of the global features have their local counterparts as well. Examples of local features include width and height of individual signature components, local gradients such as area of black pixels in the high pressure region and binary images, horizontal and vertical projections, number of cross and edge points, slant, relative position of the baseline, pixel distribution, and envelopes of individual grids or components (Ammar, Fukumura, & Yoshida, 1990; Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Huang & Yan, 1997; Ismail & Gad, 2000; Qi & Hunt, 1994; Ramesh & Murty, 1999; Yacoubi, Bortolozzi, Justino, & Sabourin, 2000).
Learning and Verification

The next stage after feature extraction involves learning. Though it is not listed as a separate sub-module, learning is the most vital part of the verification system for verifying the authenticity of a signature. Usually five to forty features are passed in as a feature matrix/vector to train the system. In the training phase, 3-3000 signature instances are used.
Table 1. Summary of some prominent off-line papers

Ammar, Fukumura and Yoshida (1988). Mode of verification: statistical method (Euclidean distance and threshold). Database: 10 genuine signatures per person from 20 people. Features: global baseline, upper and lower extensions, slant features, local features (e.g., local slants) and pressure feature. Results: 90% verification rate.

Huang and Yan (1997). Mode of verification: neural network based (multilayer perceptron based neural networks trained and used). Database: a total of 3528 signatures (including genuine and forged). Features: signature outline, core feature, ink distribution, high pressure region, directional frontiers feature, area of feature pixels in core, outline, high pressure region, directional frontiers, coarse and fine ink. Results: 99.5% for random forgeries and 90% for targeted forgeries.

Ramesh and Murty (1999). Mode of verification: genetic algorithms used for obtaining genetically optimized weights for the weighted feature vector. Database: 650 signatures; 20 genuine and 23 forged signatures from 15 people. Features: global geometric features (aspect ratio, width without blanks, slant angle, vertical centre of gravity (COG), etc.); moment-based features (horizontal and vertical projection images, kurtosis measures (horizontal and vertical)); envelope-based features (lower and upper envelopes); and wavelet-based features. Results: 90% for genuine signatures, 98% under random forgery and 70-80% under skilled forgery.

Ismail and Gad (2000). Mode of verification: fuzzy concepts. Database: 220 genuine and 110 forged samples. Features: central line features, corner line features, central circle features, corner curve features, critical points features. Results: 95% recognition rate and 98% verification rate.

Yacoubi, Bortolozzi, Justino and Sabourin (2000). Mode of verification: Hidden Markov Model. Database: 4000 signatures from 100 people (40 people for the 1st database and 60 for the 2nd). Features: caliber, proportion, behavior guideline, base behavior, spacing. Results: average error of 0.48% on the 1st database and 0.88% on the 2nd database.

Fang, Leung, Tang, Tse, Kwok, and Wong (2003). Mode of verification: non-linear dynamic time warping applied to horizontal and vertical projections; elastic bunch graph matching applied to individual stroke segments. Database: 1320 genuine signatures from 55 authors and 1320 forgeries from 12 authors. Features: horizontal and vertical projections for the non-linear dynamic time warping method; skeletonized image with approximation of the skeleton by short lines for the elastic bunch graph matching algorithm. Results: best result of 18.1% average error rate (average of FAR and FRR) for the first method; average error rate of 23.4% for the second method.
These signature instances include either genuine signatures only, or both genuine and forged signatures, depending on the method. All the methods extract features in such a manner that the signature cannot be reconstructed from them, yet they retain sufficient data to capture the characteristics required for verification. These features are stored in various formats depending upon the system: either in the form of feature values, weights of a neural network, conditional probability values of a belief (Bayesian) network, or as a covariance matrix (Fang et al., 2003). Later, when a sample signature is to be tested, its feature matrix/vector is calculated and passed into the verification sub-module of the system, which identifies the signature as either authentic or unauthentic.
Unless the system is trained properly, there is a chance, though small, that it may recognize an authentic signature as unauthentic (false rejection) and, in other cases, recognize an unauthentic signature as authentic (false acceptance). One therefore has to be extremely cautious at this stage of the system. There are various methods for off-line verification systems in use today. Some of them include:

• Statistical methods (Ammar, Fukumura, & Yoshida, 1988; Ismail & Gad, 2000)
• Neural network based approaches (Bajaj & Chaudhury, 1997; Baltzakis & Papamarkos, 2001; Drouhard, Sabourin, & Godbout, 1996; Huang & Yan, 1997)
• Genetic algorithms for calculating weights (Ramesh & Murty, 1999)
• Hidden Markov Model (HMM) based methods (Yacoubi, Bortolozzi, Justino, & Sabourin, 2000)
• Bayesian networks (Xiao & Leedham, 2002)
• Non-linear dynamic time warping (in the spatial domain) and elastic bunch graph matching (Fang et al., 2003)
The problem of signature verification is that of dividing a space into two different sets of genuine and forged signatures. Both online and off-line approaches use features to do this, but the difficulty with this approach is that even two signatures by the same person may not be identical. The feature set must therefore have sufficient interpersonal variability so that the input signature can be classified as genuine or forged, and it must also have low intrapersonal variability so that an authentic signature is accepted. Solving this problem using fuzzy sets and neural networks has also been tried. Another problem is that an increase in dimensionality, i.e., using more features, does not necessarily reduce the error rate. Thus, one has to be cautious when choosing the appropriate or optimal feature set.
FUTURE TRENDS

Most of the presently used off-line verification methods claim a success rate of more than 95% for random forgeries and above 90% for skilled forgeries. Although a 95% verification rate seems high enough, when scaled to a million signatures even a 1% error rate turns out to be a significant number. It is therefore necessary to increase this accuracy rate as much as possible.
CONCLUSION

The performance of signature verification systems is measured from their false rejection rate (FRR or type I error) (Huang & Yan, 1997) and false acceptance rate (FAR or type II error) (Huang & Yan, 1997) curves. The average error rate, which is the mean of the FRR and FAR values, is also used at times. Instead of using the FAR and FRR values, many researchers quote the "100 - average error rate" values as the performance result. Values for the various approaches have been given in Table 1. It is very difficult to compare these error rates, as there is no standard database either for off-line or for online signature verification methods. Although online verification methods are gaining popularity day by day because of their higher accuracy rates, off-line signature verification methods are still considered indispensable, since they are easy to use and have a wide range of applicability. Efforts must therefore be made to improve their efficiency to as close as possible to that of online verification methods.
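For concreteness, here is a small sketch of how FRR, FAR and the average error rate are computed from verification decisions; the counts used are invented and serve only to illustrate the definitions.

```python
# Illustrative computation of the performance measures mentioned above.
def error_rates(genuine_accepted, forgeries_rejected, n_genuine, n_forgeries):
    frr = 100.0 * (n_genuine - genuine_accepted) / n_genuine        # type I error
    far = 100.0 * (n_forgeries - forgeries_rejected) / n_forgeries  # type II error
    return frr, far, (frr + far) / 2.0                              # average error rate

frr, far, avg = error_rates(genuine_accepted=188, forgeries_rejected=105,
                            n_genuine=200, n_forgeries=110)
print(f"FRR={frr:.1f}%  FAR={far:.1f}%  average error={avg:.1f}%")
```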
REFERENCES

Ammar, M., Fukumura, T., & Yoshida, Y. (1986). A new effective approach for off-line verification of signature by using pressure features. Proceedings of the 8th International Conference on Pattern Recognition, ICPR'86 (pp. 566-569), Paris.

Ammar, M., Fukumura, T., & Yoshida, Y. (1988). Off-line preprocessing and verification of signatures. International Journal of Pattern Recognition and Artificial Intelligence, 2(4), 589-602.

Ammar, M., Fukumura, T., & Yoshida, Y. (1990). Structural description and classification of signature images. Pattern Recognition, 23(7), 697-710.

Bajaj, R., & Chaudhury, S. (1997). Signature verification using multiple neural classifiers. Pattern Recognition, 30(1), 1-7.

Baltzakis, H., & Papamarkos, N. (2001). A new signature verification technique based on a two-stage neural network classifier. Engineering Applications of Artificial Intelligence, 14, 95-103.

Drouhard, J. P., Sabourin, R., & Godbout, M. (1996). A neural network approach to off-line signature verification using directional PDF. Pattern Recognition, 29(3), 415-424.

Fang, B., Leung, C. H., Tang, Y. Y., Tse, K. W., Kwok, P. C. K., & Wong, Y. K. (2003). Off-line signature verification by tracking of feature and stroke positions. Pattern Recognition, 36, 91-101.

Huang, K., & Yan, H. (1997). Off-line signature verification based on geometric feature extraction and neural network classification. Pattern Recognition, 30(1), 9-17.

Ismail, M. A., & Gad, S. (2000). Off-line Arabic signature recognition and verification. Pattern Recognition, 33, 1727-1740.

Jain, A. K., & Griess, F. D. (2000). Online signature verification. Project report, Department of Computer Science and Engineering, Michigan State University, USA.

Qi, Y., & Hunt, B. R. (1994). Signature verification using global and grid features. Pattern Recognition, 27(12), 1621-1629.

Ramesh, V. E., & Murty, M. N. (1999). Off-line signature verification using genetically optimized weighted features. Pattern Recognition, 32(7), 217-233.

Tappert, C. C., Suen, C. Y., & Wakahara, T. (1990). The state of the art in online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(8).

Wessels, T., & Omlin, C. W. (2000). A hybrid approach for signature verification. International Joint Conference on Neural Networks.

Xiao, X., & Leedham, G. (2002). Signature verification using a modified Bayesian network. Pattern Recognition, 35, 983-995.

Yacoubi, A. El., Bortolozzi, F., Justino, E. J. R., & Sabourin, R. (2000). An off-line signature verification system using HMM and graphometric features. 4th IAPR International Workshop on Document Analysis Systems (pp. 211-222).
KEY TERMS

Area of Black Pixels: The total number of black pixels in the binary image.

Biometric Authentication: The identification of individuals using their physiological and behavioral characteristics.

Centre of Gravity: The centre of gravity of the image is calculated as:

X = ∑_{y=1}^{m} (y · Ph[y]) / ∑_{y=1}^{m} Ph[y],   Y = ∑_{x=1}^{n} (x · Pv[x]) / ∑_{x=1}^{n} Pv[x],

where X and Y are the coordinates of the centre of gravity of the image, and Ph and Pv are the horizontal and vertical projections, respectively.

Cross and Edge Points: A cross point is a pixel in the thinned image having at least three eight-neighbors, while an edge point has just one eight-neighbor in the thinned image.

Envelopes: Envelopes are calculated as upper and lower envelopes above and below the global baseline. These are the connected pixels which form the external points of the signature, obtained by sampling the signature at regular intervals.

Global Baseline: The median value of the pixel distribution along the vertical projection.

Horizontal and Vertical Projections: The horizontal projection is the projection of the binary image along the horizontal axis; similarly, the vertical projection is the projection of the binary image along the vertical axis. They can be calculated as:

Ph[y] = ∑_{x=1}^{n} black_pixel(x, y),   Pv[x] = ∑_{y=1}^{m} black_pixel(x, y),

where m is the width of the image and n is the height of the image.

Moment Measures: Skewness and kurtosis are moment measures calculated using the horizontal and vertical projections and the coordinates of the centre of gravity of the signature.

Slant: Either defined as the angle at which the image has the maximum horizontal projection value on rotation, or calculated using the total number of positive, negative, horizontally or vertically slanted pixels.
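The projection and centre-of-gravity formulas above translate directly into numpy; the sketch below mirrors those definitions, and the random binary image is only a stand-in for a real binarized or thinned signature.

```python
# Sketch of the projection and centre-of-gravity computations (image as a 0/1 array).
import numpy as np

def projections(binary):
    ph = binary.sum(axis=1)          # Ph[y]: black pixels in each row
    pv = binary.sum(axis=0)          # Pv[x]: black pixels in each column
    return ph, pv

def centre_of_gravity(binary):
    ph, pv = projections(binary)
    ys = np.arange(1, len(ph) + 1)
    xs = np.arange(1, len(pv) + 1)
    X = (ys * ph).sum() / ph.sum()   # X = sum(y * Ph[y]) / sum(Ph[y])
    Y = (xs * pv).sum() / pv.sum()   # Y = sum(x * Pv[x]) / sum(Pv[x])
    return X, Y

binary = (np.random.default_rng(1).random((40, 120)) > 0.9).astype(np.uint8)
print(centre_of_gravity(binary))
```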
Online Analytical Processing Systems

Rebecca Boon-Noi Tan, Monash University, Australia
INTRODUCTION

Since its origin in the 1970s, research and development in database systems has evolved from simple file storage and processing systems to complex relational database systems, which have made a remarkable contribution to current trends and environments. Databases are now such an integral part of day-to-day life that people are often unaware of their use. For example, purchasing goods from the local supermarket is likely to involve access to a database: in order to retrieve the price of an item, the application program accesses the product database. A database is a collection of related data, and the database management system (DBMS) is the software that manages and controls access to the database (Elmasri & Navathe, 2004).

BACKGROUND

Data Warehouse

A data warehouse is a specialized type of database. More specifically, a data warehouse is a "repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site" (Silberschatz, Korth, & Sudarshan, 2002, p. 843). Chaudhuri and Dayal (1997) consider that a data warehouse should be maintained separately from the organization's operational database, since the functional and performance requirements of online analytical processing (OLAP) supported by data warehouses are quite different from those of the online transaction processing (OLTP) traditionally supported by the operational database.

Traditional Online Transaction Processing (OLTP)

Traditional relational databases have been used primarily to support OLTP systems. The transactions in an OLTP system usually retrieve and update a small number of records, accessed typically on their primary keys. Operational databases tend to be hundreds of megabytes to gigabytes in size and store only current data (Ramakrishnan & Gehrke, 2003). Figure 1 shows a simple overview of the OLTP system. The operational database is managed by a conventional relational DBMS. OLTP is designed for day-to-day operations and provides a real-time response. Examples include Internet banking and online shopping.

OLAP Versus OLTP

Two reasons why traditional OLTP is not suitable for data warehousing are: (a) given that operational databases are finely tuned to support known OLTP workloads, trying to execute complex OLAP queries against the operational databases would result in unacceptable performance; furthermore, decision support requires data that might be missing from the operational databases, for instance, understanding trends or making predictions requires historical data, whereas operational databases store only current data; and (b) decision support usually requires consolidating data from many heterogeneous sources, which might include external sources such as stock market feeds in addition to several operational databases; the different sources might contain data of varying quality, or use inconsistent representations, codes and formats, which have to be reconciled.
Figure 1. OLTP system
Online Analytical Processing (OLAP)

OLAP is a term that describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis (Ramakrishnan & Gehrke, 2003). OLAP supports queries and data analysis on aggregated databases built in data warehouses. It is a system for collecting, managing, processing and presenting multidimensional data for analysis and management purposes (Figure 2). There are two main implementation methods to support OLAP applications: relational OLAP (ROLAP) and multidimensional OLAP (MOLAP).
ROLAP

Relational online analytical processing (ROLAP) provides OLAP functionality by using relational databases and familiar relational query tools to store and analyse multidimensional data (Ramakrishnan & Gehrke, 2003). Entity Relationship diagrams and normalization techniques are popularly used for database design in OLTP environments. However, the database designs recommended by ER diagrams are inappropriate for decision support systems, where efficiency in querying and in loading data (including incremental loads) is crucial. A special schema known as a star schema is used in an OLAP environment for performance reasons (Martyn, 2004). This star schema usually consists of a single fact table and a dimension table for each dimension (Figure 3).
MOLAP

Multidimensional online analytical processing (MOLAP) extends OLAP functionality to multidimensional database management systems (MDBMSs). An MDBMS uses special proprietary techniques to store data in matrix-like n-dimensional arrays (Ramakrishnan & Gehrke, 2003). The multi-dimensional data cube is implemented by the arrays, with the dimensions forming the axes of the cube (Sarawagi, 1997); only the data value corresponding to a data cell is stored, as a direct mapping. MOLAP servers have excellent indexing properties because looking up a cell requires simple array lookups rather than associative lookups in tables. Unfortunately, MOLAP provides poor storage utilization, especially when the data set is sparse. In a multi-dimensional data model, the focal point is a collection of numeric measures, and each measure depends on a set of dimensions. For instance, the measure attribute is amt, as shown in Figure 4: sales information is arranged in a three-dimensional array of amt. Figure 4 shows the array values for a single L# value, L# = L001, which is presented as a slice orthogonal to the L# axis.
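A tiny numpy sketch of the MOLAP storage idea described above: the measure amt held in a three-dimensional array whose axes are the dimensions, with the L# = L001 slice extracted as in Figure 4. The dimension values and amounts below are invented for illustration.

```python
# Illustrative MOLAP-style storage: a dense 3-D array indexed by (product, location, date).
import numpy as np

products = ["P001", "P002", "P003"]
locations = ["L001", "L002"]
dates = ["2004-01", "2004-02", "2004-03", "2004-04"]

amt = np.zeros((len(products), len(locations), len(dates)))
amt[0, 0, 1] = 120.0              # each cell maps directly to one (P#, L#, date) combination
amt[2, 1, 3] = 75.5

# The slice orthogonal to the L# axis for L# = L001, as in Figure 4:
slice_l001 = amt[:, locations.index("L001"), :]
print(slice_l001)
```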
Cube-By Operator

In decision support database systems, aggregation is a commonly used operation. As previously mentioned, current SQL can be very inefficient for this purpose. Thus, to effectively support decision support queries in an OLAP environment, a new operator, Cube-by, was proposed by Gray, Bosworth, Layman, and Pirahesh (1996). It is an extension of the relational Group-by operator: the Cube-by operator computes the Group-bys corresponding to all possible combinations of attributes in the Cube-by clause. To see how a data cube is formed, an example is provided in Figure 5, which shows data cube formation through executing the cube statement at the top left of the figure. Figure 5 presents two ways of presenting the aggregated data: (a) a data cube, and (b) a 3D data cube in table form.
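Since Figure 5 itself is not reproduced here, the following small Python sketch shows what the Cube-by operator computes: one Group-by for every subset of the cube-by attributes, each aggregating the measure (here a SUM over amt). The toy SALES rows and attribute names are illustrative only.

```python
# Illustrative computation of a data cube over a toy relation.
from itertools import combinations
from collections import defaultdict

rows = [                              # (P#, L#, amt)
    ("P001", "L001", 100), ("P001", "L002", 50),
    ("P002", "L001", 80),  ("P002", "L001", 20),
]
attrs = ("P#", "L#")

def cube(rows, attrs):
    result = {}
    for k in range(len(attrs) + 1):
        for combo in combinations(range(len(attrs)), k):   # every subset of attributes
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[i] if i in combo else "ALL" for i in range(len(attrs)))
                groups[key] += row[-1]                      # aggregate SUM(amt)
            result[tuple(attrs[i] for i in combo)] = dict(groups)
    return result

for group_by, agg in cube(rows, attrs).items():
    print(group_by or ("ALL",), agg)
```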
Figure 2. OLAP system
Figure 3. Star schema showing that location, product, date, and sales are represented as relations
Figure 4. SALES presented as a multidimensional dataset
MAIN THRUST

Processing Level in OLAP Systems

The processing level, where the execution of the raw data or aggregated data takes place before the result is presented to the user, is shown in Figure 6. The user is only interested in receiving the data in the report with a fast response. However, the processing level working in the background is needed in order to provide the user with the aggregated result in an effective and efficient way. The processing level has been broken down into two distinct areas: back-end processing and front-end pro-
cessing. Back-end processing basically deals with the raw data, which is stored in either tables (ROLAP) or arrays (MOLAP) and then process it into aggregated data, which is presented to the user. Front-end processing is computing the raw data and managing the precomputed aggregated data in either in 3D data cube or ndimensional table. It is important to mention that front-end processing is similar to back-end processing as it also deals with raw data. It needs to construct or execute the raw data into the data cube and then store the pre-computed aggregated data in either the memory or disk. It is convenient for the user if the pre-computed aggregated data is ready and is
Online Analytical Processing Systems
Figure 5. An example of data cube formation through executing the cube statement at the top left of the figure
O
Figure 6. Two interesting areas of the processing level in OLAP environment
Online Analytical Processing (OLAP) System
879
Online Analytical Processing Systems
stored in the memory whenever the user requests for it. The difference between the two processing levels is that the front-end has the pre-computed aggregated data in the memory which is ready for the user to use or analyze it at any given time. On the other hand the back-end processing computes the raw data directly whenever there is a request from the user. This is why the front-end processing is considered to present the aggregated data faster or more efficiently than back-end processing. However, it is important to note that back-end processing and front-end processing are basically one whole processing. The reason behind this break down of the two processing levels is because the problem can be clearly seen in backend processing and front-end processing. In the next section, the problems associated with each areas and the related work in improving the problems will be considered.
Back-End Processing Back-end processing involves basically dealing with the raw data, which is stored in either tables (ROLAP) or arrays (MOLAP) as shown in Figure 7. The user queries the raw data for decision-making purpose. The raw data is then processed and computed into aggregated data, which will be presented to the user for analyzing. Generally the basic stage: extracting, is followed by two sub-stages: indexing, and partitioning in back-end processing. Extracting is the process of querying the raw data either from tables (ROLAP) or arrays (MOLAP) and computing it. The process of extracting is usually time-consuming. Firstly, the database (data warehouse) size is extremely large and secondly the computation time is equally high. However, the user is only interested in a fast
Figure 7. Stages in back-end processing
response time for providing the resulted data. Another consideration is that the analyst or manager using the data warehouse may have time constraints. There are two sub-stages or fundamental methods of handling the data: (a) indexing and (b) partitioning in order to provide the user with the resulted data in a reasonable time frame. Indexing has existed in databases for many decades. Its access structures have provided faster access to the base data. For retrieval efficiency, index structures would typically be defined especially in data warehouses or ROLAP where the fact table is very large (Datta, VanderMeer & Ramamritham, 2002). O’Neil and Quass (1997) have suggested a number of important indexing schemes for data warehousing including bitmap index, value-list index, projection index, data index. Data index is similar to projection index but it exploits a positional indexing strategy (Datta, VanderMeer, & Ramamritham, 2002). Interestingly MOLAP servers have better indexing properties than ROLAP servers since they look for a cell using simple array lookups rather than associative lookups in tables. Partitioning of raw data is more complex and challenging in data warehousing as compared to that of relational and object databases. This is due to the several choices of partitioning of a star schema (Datta, VanderMeer, & Ramamritham, 2002). The data fragmentation concept in the context of distributed databases aims to reduce query execution time and facilitate the parallel execution of queries (Bellatreche, Karlapalem, Mohania, & Schneide, 2000). In a data warehouse or ROLAP, either the dimension tables or the fact table or even both can be fragmented. Bellatreche et al. (2000) have proposed a methodology for applying the fragmentation techniques in a data warehouse star schema to reduce the total query execution cost. The data fragmentation concept in the context of distributed databases aims to reduce query execution time and facilitates the parallel execution of queries.
Front-End Processing
Front-end processing computes the raw data and manages the pre-computed aggregated data in either a 3D data cube or an n-dimensional table. The three basic stages involved in front-end processing are shown in Figure 8: (i) constructing, (ii) storing, and (iii) querying.
Figure 8. Stages in front-end processing
Constructing is basically the computation of a data cube by using the cube-by operator. Cube-by is an expensive approach, especially when the number of cube-by attributes and the database size are large. The difference between constructing and extracting needs to be clearly defined. Constructing in front-end processing is similar to extracting in back-end
processing, as both of them basically query the raw data. However, extracting concentrates on how the fundamental methods can help in handling the raw data in order to provide efficient retrieval, whereas constructing in front-end processing concentrates on the cube-by operator, which involves the computation of the raw data. Storing is the process of putting the aggregated data into either an n-dimensional table of rows or n-dimensional arrays (a data cube). There are two parts within the storing process: one part is to store temporary raw data in memory for execution purposes, and the other part is to store the pre-computed aggregated data. Hence, there are two problems related to storage: (a) insufficient memory space, due to the loaded raw data in addition to the incremental loads; and (b) poor storage utilization, because the array may not fit into memory, especially when the data set is sparse. Querying is the process of extracting useful information from the pre-computed data cube or n-dimensional table for decision makers. It is important to note that querying is also part of the constructing process, since querying also makes use of the cube-by operator, which involves the computation of raw data. ROLAP is able to support ad hoc requests and allows unlimited access to dimensions, unlike MOLAP, which allows only limited access to predefined dimensions. Despite this ability to support ad hoc requests and access to dimensions, certain queries might find it difficult to fulfill the needs of decision makers and to keep the query execution time low for n-dimensional queries, as the time factor is important to decision makers. To conclude, the three stages and their associated problems in the OLAP environment have been outlined. First, cube-by is an expensive approach, especially when
the number of cube-by attributes and the database size are large. Second, storage has insufficient memory space, due to the loaded data in addition to the incremental loads. Third, storage is not properly utilized, because the array may not fit into memory, especially when the data set is sparse. Fourth, certain queries might find it difficult to fulfill the needs of decision makers. Fifth, the query execution time has to be reduced for n-dimensional queries, as the time factor is important to decision makers. However, it is important to consider that there is scope for other possible problems to be identified.
FUTURE TRENDS
Problems have been identified in each of the three stages, and these have generated considerable attention from researchers seeking solutions. Several researchers have proposed a number of algorithms to solve these cube-by problems. Examples include:
Constructing
The cube-by operator is an expensive approach, especially when the number of cube-by attributes and the database size are large.
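As an illustration of why the cube-by operator is expensive, the sketch below computes an aggregate for every subset of the cube-by attributes, which means 2^d group-bys over the data for d attributes. The relation, attribute names, and measure are assumptions made for the example; this is not any particular published cube algorithm.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical relation: each tuple is a dict of dimension values plus a measure.
facts = [
    {"product": "pc", "region": "east", "year": 2004, "sales": 10},
    {"product": "pc", "region": "west", "year": 2004, "sales": 7},
    {"product": "tv", "region": "east", "year": 2003, "sales": 4},
]
dimensions = ("product", "region", "year")

def cube(facts, dimensions, measure="sales"):
    """Aggregate the measure for every combination of cube-by attributes."""
    result = {}
    for k in range(len(dimensions) + 1):          # 2^d group-bys in total
        for attrs in combinations(dimensions, k):
            groups = defaultdict(int)
            for t in facts:
                key = tuple(t[a] for a in attrs)  # the empty key is the grand total
                groups[key] += t[measure]
            result[attrs] = dict(groups)
    return result

data_cube = cube(facts, dimensions)
print(data_cube[("region",)])   # {('east',): 14, ('west',): 7}
print(data_cube[()])            # {(): 21} -- grand total
```

The exponential number of group-bys is exactly what the fast computation and parallelization work surveyed below tries to tame.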
Fast Computation Algorithms
There are algorithms aimed at the fast computation of large sparse data cubes (Ross & Srivastava, 1997; Beyer & Ramakrishnan, 1999). Ross and Srivastava (1997) have taken into consideration the fact that real data is frequently sparse; they partitioned large relations into small fragments so that there was always enough memory to fit the fragments of a large relation. Beyer and Ramakrishnan (1999), in turn, proposed a bottom-up method to help reduce the penalty associated with sorting many large views.
Parallel Processing System
Most of the fast computation algorithms assume that they can be applied in parallel processing systems. Dehne, Eavis, Hambrusch, and Rau-Chaplin (2002) presented a general methodology for the efficient parallelization of existing data cube construction algorithms. Their paper described two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. They also provide a good summary and comparison with other parallel data cube computations (Ng, Wagner, & Yin, 2001; Yu & Lu, 2001). Ng et al. (2001) consider the parallelization of sort-based data cube construction, whereas Yu and Lu (2001) focus on the overlap between multiple data cube computations in a sequential setting.
Storing
Two problems are related to storage: (a) insufficient memory space and (b) poor storage utilization.
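A small sketch makes the storage trade-off visible: a dense n-dimensional array reserves a cell for every coordinate combination even when most cells are empty, whereas a sparse, dictionary-of-keys layout stores only the populated cells. The dimension cardinalities and fact values below are invented for illustration.

```python
# Three dimensions with hypothetical cardinalities.
card = {"product": 1000, "store": 500, "day": 365}
dense_cells = card["product"] * card["store"] * card["day"]   # 182,500,000 reserved cells

# Only the populated cells are kept in a sparse (dictionary-of-keys) cube.
sparse_cube = {}

def add_fact(product_id, store_id, day_id, amount):
    """Accumulate a measure into the cell addressed by the dimension coordinates."""
    key = (product_id, store_id, day_id)
    sparse_cube[key] = sparse_cube.get(key, 0) + amount

add_fact(42, 7, 200, 99.5)
add_fact(42, 7, 201, 12.0)

print(dense_cells)        # cells a dense array would reserve
print(len(sparse_cube))   # cells actually stored: 2
```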
Data Compression
Another technique is related to data compression. Lakshmanan, Pei, and Zhao (2003) have developed a systematic approach to achieve efficacious data cube construction and exploration by semantic summarization and compression.
Figure 9. Working areas in the OLAP system
Condensed Data Cube
Wang, Feng, Lu, and Yu (2002) have proposed a new concept called a condensed data cube. This approach reduces the size of the data cube and hence its computation time. They make use of "single base tuple" compression to generate a "condensed cube" that is smaller in size.
Querying
It might be difficult for certain queries both to fulfill the needs of decision makers and to keep the query execution time low for n-dimensional queries, as the time factor is important to decision makers.
Types of Queries
Other work has focused on specialized data structures for fast processing of special types of queries. Lee, Ling and Li (2000) have focused on range-max queries and have proposed a hierarchical compact cube to support them. The hierarchical structure stores not only the maximum value of all the children sub-cubes, but also one of the locations of the maximum values among the children sub-cubes. On the other hand, Tan, Taniar, and Lu (2004) focus on data cube queries in terms of SQL and have presented a taxonomy of data cube queries. They also provide a comparison of the different query types.
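The following is a simplified, one-dimensional sketch of the idea behind such hierarchical structures for range-max queries: each node keeps the maximum of its children together with the position where that maximum occurs, so a query can prune whole sub-ranges. It is only an illustration of the principle, not the compact cube structure proposed by Lee, Ling and Li (2000).

```python
class RangeMaxTree:
    """Simplified 1-D illustration: each node stores the max of its children
    and the position where that max occurs, so queries can skip sub-ranges."""

    def __init__(self, values):
        self.values = values
        self.root = self._build(0, len(values) - 1)

    def _build(self, lo, hi):
        if lo == hi:
            return {"lo": lo, "hi": hi, "max": self.values[lo], "argmax": lo}
        mid = (lo + hi) // 2
        left, right = self._build(lo, mid), self._build(mid + 1, hi)
        best = left if left["max"] >= right["max"] else right
        return {"lo": lo, "hi": hi, "max": best["max"],
                "argmax": best["argmax"], "left": left, "right": right}

    def range_max(self, lo, hi, node=None):
        node = node or self.root
        if hi < node["lo"] or node["hi"] < lo:
            return None                        # disjoint range: prune this sub-cube
        if lo <= node["lo"] and node["hi"] <= hi:
            return (node["max"], node["argmax"])
        candidates = [self.range_max(lo, hi, node["left"]),
                      self.range_max(lo, hi, node["right"])]
        return max((c for c in candidates if c is not None), key=lambda c: c[0])

tree = RangeMaxTree([3, 8, 1, 9, 4, 7])
print(tree.range_max(1, 4))   # (9, 3): maximum value 9 occurs at position 3
```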
Parallel Processing Techniques
Taniar and Tan (2002) have proposed three parallel techniques to improve the performance of data cube queries. However, the challenges in this area continue to grow with the cube-by problems.
CONCLUSION
In this overview, OLAP systems in general have been considered, followed by the processing levels in OLAP systems and, lastly, the related and future work on OLAP systems. A number of problems and solutions in the OLAP environment have been presented. However, consideration needs to be given to the possibility that other problems may be identified, which in turn will present new challenges for researchers to address.
REFERENCES
Bellatreche, L., Karlapalem, K., Mohania, M., & Schneider, M. (2000, September). What can partitioning do for your data warehouses and data marts? International IDEAS Conference (pp. 437-445), Yokohama, Japan.
Beyer, K.S., & Ramakrishnan, R. (1999, June). Bottom-up computation of sparse and iceberg cubes. International ACM SIGMOD Conference (pp. 359-370), Philadelphia, PA.
Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26, 65-74.
Datta, A., VanderMeer, D., & Ramamritham, K. (2002). Parallel star join + DataIndexes: Efficient query processing in data warehouses and OLAP. IEEE Transactions on Knowledge & Data Engineering, 14(6), 1299-1316.
Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A. (2002). Parallelizing the data cube. International Journal of Distributed & Parallel Databases, 11, 181-201.
Elmasri, R., & Navathe, S.B. (2004). Fundamentals of database systems. Boston, MA: Addison Wesley.
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996, June). Data cube: A relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. International ICDE Conference (pp. 152-159), New Orleans, Louisiana.
Lakshmanan, L.V.S., Pei, J., & Zhao, Y. (2003, September). Efficacious data cube exploration by semantic summarization and compression. International VLDB Conference (pp. 1125-1128), Berlin, Germany.
Lee, S.Y., Ling, T.W., & Li, H.G. (2000, September). Hierarchical compact cube for range-max queries. International VLDB Conference (pp. 232-241), Cairo, Egypt.
Martyn, T. (2004). Reconsidering multi-dimensional schemas. ACM SIGMOD Record, 83-88.
Ng, R.T., Wagner, A., & Yin, Y. (2001, May). Iceberg-cube computation with PC clusters. International ACM SIGMOD Conference (pp. 25-36), Santa Barbara, California.
O'Neil, P., & Graefe, G. (1995). Multi-table joins through bit-mapped join indices. SIGMOD Record, 24(3), 8-11.
Ramakrishnan, R., & Gehrke, J. (2003). Database management systems. NY: McGraw-Hill.
Ross, K.A., & Srivastava, D. (1997, August). Fast computation of sparse datacubes. International VLDB Conference (pp. 116-185), Athens, Greece.
Sarawagi, S. (1997). Indexing OLAP data. IEEE Data Engineering Bulletin, 20(1), 36-43.
Silberschatz, A., Korth, H., & Sudarshan, S. (2002). Database system concepts. NY: McGraw-Hill.
Tan, R.B.N., Taniar, D., & Lu, G.J. (2004). A taxonomy for data cube queries. International Journal for Computers and Their Applications, 11(3), 171-185.
Taniar, D., & Tan, R.B.N. (2002, May). Parallel processing of multi-join expansion-aggregate data cube query in high performance database systems. International I-SPAN Conference (pp. 51-58), Manila, Philippines.
Wang, W., Feng, J.L., Lu, H.J., & Yu, J.X. (2002, February). Condensed cube: An effective approach to reducing data cube size. International Data Engineering Conference (pp. 155-165), San Jose, California.
Yu, J.X., & Lu, H.J. (2001, April). Multi-cube computation. International DASFAA Conference (pp. 126-133), Hong Kong, China.
KEY TERMS
Back-End Processing: Dealing with the raw data, which is stored in either tables (ROLAP) or arrays (MOLAP).
Data Cube Operator: Computes the group-bys corresponding to all possible combinations of attributes in the cube-by clause.
Data Warehouse: A type of database; a "subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making" (Elmasri & Navathe, 2004, p. 900).
Front-End Processing: Computing the raw data and managing the pre-computed aggregated data in either a 3D data cube or an n-dimensional table.
Multidimensional OLAP (MOLAP): Extends OLAP functionality to multidimensional database management systems (MDBMSs).
Online Analytical Processing (OLAP): A term used to describe the analysis of complex data from a data warehouse (Elmasri & Navathe, 2004, p. 900).
Relational OLAP (ROLAP): Provides OLAP functionality by using relational databases and familiar relational query tools to store and analyze multidimensional data.
Online Signature Recognition
Indrani Chakravarty, Indian Institute of Technology, India
Nilesh Mishra, Indian Institute of Technology, India
Mayank Vatsa, Indian Institute of Technology, India
Richa Singh, Indian Institute of Technology, India
P. Gupta, Indian Institute of Technology, India
INTRODUCTION
Security is one of the major issues in today's world, and most of us have to deal with some sort of password in our daily lives; but these passwords have some problems of their own. If one picks an easy-to-remember password, then it is most likely that somebody else may guess it. On the other hand, if one chooses too difficult a password, then he or she may have to write it down somewhere (to avoid the inconvenience of forgotten passwords), which may again lead to security breaches. To prevent passwords from being hacked, users are usually advised to change their passwords frequently and are also asked not to keep them too trivial. All these inconveniences led to the birth of the biometric field. The verification of handwritten signatures, which is a behavioral biometric, can be classified into off-line and online signature verification methods. Online signature verification, in general, gives a higher verification rate than off-line verification methods because of its use of both static and dynamic features of the problem space, in contrast to off-line verification, which uses only the static features. Despite its greater accuracy, online signature recognition is not that prevalent in comparison to other biometrics. The primary reasons are:
• It cannot be used everywhere, especially where signatures have to be written in ink; e.g., on cheques, only off-line methods will work.
• Unlike off-line verification methods, online methods require some extra and special hardware, e.g., electronic tablets, pressure-sensitive signature pads, etc. For off-line verification methods, on the other hand, data acquisition can be done with optical scanners.
• The hardware for online methods is expensive and has a fixed and short life cycle.
In spite of all these inconveniences, the use of online methods is on the rise, and in the near future, unless a process particularly requires an off-line method, online methods will tend to become more and more popular.
BACKGROUND
Online verification methods can have an accuracy rate as high as 99%. The reason behind this is their use of both static and dynamic (or temporal) features, in comparison to off-line methods, which use only the static features (Ramesh & Murty, 1999). The major differences between off-line and online verification methods lie not only in the feature extraction phases and accuracy rates, but also in the modes of data acquisition, preprocessing, and verification/recognition, though the basic sequence of tasks in an online verification (or recognition) procedure is exactly the same as that of the off-line one. The phases that are involved comprise:
• Data Acquisition
• Preprocessing and Noise Removal
• Feature Extraction
• Verification (or Identification)
However, online signatures are much more difficult to forge than off-line signatures (reflected in the higher accuracy rates of online verification methods), since online methods involve the dynamics of the signature, such as the pressure applied while writing, the pen tilt, the velocity with which the signature is written, etc. In
the case of off-line verification, the forger has to copy only the shape (Jain & Griess, 2000) of the signature. In the case of online verification, on the other hand, the hardware captures the dynamic features of the signature as well. It is extremely difficult to deceive the device with respect to the dynamic features, since the forger not only has to copy the characteristics of the person whose signature is to be forged, but at the same time has to hide his or her own inherent style of writing the signature. There are four types of forgeries: random, simple, skilled, and traced forgeries (Ammar, Fukumura, & Yoshida, 1988; Drouhard, Sabourin, & Godbout, 1996). For online signatures, the system shows almost 100% accuracy for the first two classes of forgeries and 99% for the latter two. But, again, a forger can also use a compromised signature-capturing device to replay a previously recorded signature signal. In such extreme cases, even online verification methods may suffer from repetition attacks when the signature-capturing device is not physically secure.
MAIN THRUST
Although the basic sequence of tasks in online signature verification is almost the same as that of off-line methods, the modes differ from each other, especially in the ways the data acquisition, preprocessing, and feature extraction are carried out. More specifically, the sub-modules of online verification are much more difficult than those of off-line verification (Jain & Griess, 2000). Figure 1 gives a generic structure of an online signature verification system.
Figure 1. Modular structure of a generic online verification system
The online verification system can be classified into the following modules:
• Data Acquisition
• Preprocessing
• Feature Extraction
• Learning and Verification
Data Acquisition
Data acquisition (of the dynamic features) in online verification methods is generally carried out using special devices called transducers or digitizers (Tappert, Suen, & Wakahara, 1990; Wessels & Omlin, 2000), in contrast to the use of high-resolution scanners in the off-line case. The commonly used instruments include electronic tablets (which consist of a grid to capture the x and y coordinates of the pen tip movements), pressure-sensitive tablets, and digitizers involving technologies such as acoustic sensing in an air medium, surface acoustic waves, triangularization of reflected laser beams, and optical sensing of a light pen, used to extract information about the number of strokes, the velocity of signing, the direction of writing, the pen tilt, the pressure with which the signature is written, etc.
Preprocessing
Preprocessing in online verification is much more difficult than in off-line verification, because it involves both noise removal (which can be done using hardware or software) (Plamondon & Lorette, 1989) and, in most cases, segmentation. The other preprocessing steps that can be performed are signal amplifying, filtering, conditioning, digitizing, resampling, signal truncation, normalization, etc. However, the most commonly used include:
• External Segmentation: Tappert, Suen and Wakahara (1990) define external segmentation as the process by which the characters or words of a signature are isolated before the recognition is carried out.
• Resampling: This process is basically done to ensure uniform smoothing to get rid of redundant information, as well as to preserve the information required for verification by comparing
the spatial data of two signatures. According to Jain and Griess (2000), the distance between two critical points is measured, and if the total distance exceeds a threshold called the resampling length (which is calculated by dividing the distance by the number of sample points for that segment), then a new point is created by using the gradient between the two points.
• Noise Reduction: Noise is irrelevant data, usually in the form of extra dots or pixels in images (in the case of off-line verification methods) (Ismail & Gad, 2000), which do not belong to the signature but are included in the image (in the case of off-line) or in the signal (in the case of online) because of possible hardware problems (Tappert, Suen, & Wakahara, 1990), the presence of background noise such as dirt, or faulty hand movements (Tappert, Suen, & Wakahara, 1990) while signing.
The general techniques used are:
• Smoothening: A point is averaged with its neighbors (Tappert, Suen, & Wakahara, 1990).
• Filtering: Online thinning (Tappert, Suen, & Wakahara, 1990) is not the same as off-line thinning (Baltzakis & Papamarkos, 2001), although both decrease the number of irrelevant dots or points. According to Tappert, Suen and Wakahara (1990), it can be carried out either by forcing a minimum distance between adjacent points or by forcing a minimum change in the direction of the tangent to the drawing for consecutive points.
• Wild Point Correction: This removes noise (points) caused by hardware problems (Tappert, Suen, & Wakahara, 1990).
• Dehooking: Dehooking, on the other hand, removes hook-like features from the signature, which usually occur at the start or end of the signature (Tappert, Suen, & Wakahara, 1990).
• Dot Reduction: The dot size is reduced to single points (Tappert, Suen, & Wakahara, 1990).
• Normalization: Off-line normalization often limits itself to scaling the image to a standard size and/or removing blank columns and rows. In the online case, normalization includes deskewing, baseline drift correction (orienting the signature to the horizontal), size normalization, and stroke length normalization (Ramesh & Murty, 1999; Tappert, Suen, & Wakahara, 1990; Wessels & Omlin, 2000).
Feature Extraction
Online signature verification extracts both static and dynamic features. Some of the static and dynamic features are listed below.
• Static Features: Although both static and dynamic information are available to the online verification system, in most cases the static information is discarded, as the dynamic features are quite rich and alone give high accuracy rates. Some of the static features used by online methods are:
• Width and height of the signature parts (Ismail & Gad, 2000)
• Width-to-height ratio (Ismail & Gad, 2000)
• Number of cross and edge points (Baltzakis & Papamarkos, 2001)
• Run lengths of each scan of the components of the signature (Xiao & Leedham, 2002)
• Kurtosis (horizontal and vertical), skewness, relative kurtosis, relative skewness, and relative horizontal and vertical projection measures (Bajaj & Chaudhury, 1997; Ramesh & Murty, 1999)
• Envelopes: upper and lower envelope features (Bajaj & Chaudhury, 1997; Ramesh & Murty, 1999)
• Alphabet-Specific Features: the presence of ascenders, descenders, cusps, closures, dots, etc. (Tappert, Suen, & Wakahara, 1990)
• Dynamic Features: Though online methods utilize some of the static features, they give more emphasis to the dynamic features, since these features are more difficult to imitate.
The most widely used dynamic features include:
• Position (ux(t) and uy(t)): e.g., the pen tip position when the tip is in the air and when it is in the vicinity of the writing surface (Jain & Griess, 2000; Plamondon & Lorette, 1989)
• Pressure (up(t)) (Jain & Griess, 2000; Plamondon & Lorette, 1989)
• Forces (uf(t), ufx(t), ufy(t)): Forces are computed from the position and pressure coordinates (Plamondon & Lorette, 1989)
• Velocity (uv(t), uvx(t), uvy(t)): Velocity can be derived from the position coordinates (Jain & Griess, 2000; Plamondon & Lorette, 1989)
•
Absolute and relative speed between two critical points (Jain, Griess, & Connell, 2002) Acceleration (ua(t), u ax(t), uay(t)): Acceleration can be derived from velocity or position coordinates. It can also be computed using an accelerometric pen (Jain & Griess, 2000; Plamondon & Lorette, 1989). Parameters: A number of parameters like number of peaks, starting direction of the signature, number of pen lifts, means and standard deviations, number of maxima and minima for each segment, proportions, signature path length, path tangent angles etc. are also calculated apart from the above mentioned functions to increase the dimensionality (Gupta, 1997; Plamondon & Lorette, 1989; Wessels & Omlin, 2000). Moreover, all these features can be of both global and local in nature.
Verification and Learning
For online verification, examples of comparison methods include the use of:
• Corner point and point-to-point matching algorithms (Zhang, Pratikakis, Cornelis, & Nyssen, 2000)
• Similarity measurement on the logarithmic spectrum (Lee, Wu, & Jou, 1998)
• Extreme Points Warping (EPW) (Feng & Wah, 2003)
• String matching and a common threshold (Jain, Griess, & Connell, 2002)
• Split and merging (Lee, Wu, & Jou, 1997)
• Histogram classifier with global and local likeliness coefficients (Plamondon & Lorette, 1989)
• Clustering analysis (Lorette, 1984)
• Dynamic programming based methods; matching with the Mahalanobis pseudo-distance (Sato & Kogure, 1982)
• Hidden Markov Model based methods (McCabe, 2000; Kosmala & Rigoll, 1998)
Table 1 gives a summary of some prominent works in the online signature verification field.
Performance Evaluation
Performance evaluation of the output (i.e., of the decision to accept or reject the signature) is done using the false rejection rate (FRR, or type I error) and the false acceptance rate (FAR, or type II error) (Huang & Yan, 1997). The values
for the error rates of the different approaches are included in the comparison table (Table 1). The equal error rate (EER), which is calculated using the FRR and FAR, is also used for measuring the accuracy of the systems.
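A minimal sketch of how the FAR, FRR, and EER can be computed from similarity scores is shown below. The acceptance rule (accept when the score is at least the threshold), the threshold sweep, and the score values are illustrative assumptions, not the evaluation protocol of any system cited above.

```python
def far_frr(genuine_scores, forgery_scores, threshold):
    """A signature is accepted when its similarity score >= threshold."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in forgery_scores) / len(forgery_scores)
    return far, frr

def equal_error_rate(genuine_scores, forgery_scores, steps=1000):
    """Sweep thresholds and return the point where FAR and FRR are closest."""
    lo = min(genuine_scores + forgery_scores)
    hi = max(genuine_scores + forgery_scores)
    best = None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        far, frr = far_frr(genuine_scores, forgery_scores, t)
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    return best   # (threshold, FAR, FRR) at the crossover point

# Made-up similarity scores for genuine signatures and forgeries.
genuine = [0.91, 0.85, 0.88, 0.95, 0.80]
forgeries = [0.40, 0.62, 0.55, 0.83, 0.30]
print(equal_error_rate(genuine, forgeries))
```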
GENERAL PROBLEMS
The feature set in online verification has to be chosen very carefully, since it must have sufficient interpersonal variability so that an input signature can be classified as genuine or forged. In addition, it must also have low intrapersonal variability so that an authentic signature is accepted. Therefore, one has to be extremely cautious while choosing the feature set, as an increase in dimensionality does not necessarily mean an increase in the efficiency of a system.
FUTURE TRENDS
Biometrics is gradually replacing conventional password and ID based devices, since it is both more convenient and safer than the earlier methods. Today, it is not difficult at all to come across a fingerprint scanner or an online signature pad. Nevertheless, a lot of research is still required to make such systems infallible, because even an accuracy rate of 99% can cause failure of the system when scaled to the size of a million users. So, it will not be strange if, in the near future, we have ATMs granting access only after face recognition, a fingerprint scan, and verification of the signature via embedded devices, since a multimodal system will have a lesser chance of failure than a system using a single biometric or a password/ID based device. Currently, online and off-line signature verification systems are two disjoint approaches. Efforts must also be made to integrate the two approaches, enabling us to exploit the higher accuracy of online verification methods and the greater applicability of off-line methods.
CONCLUSION
Most of the online methods claim roughly 99% accuracy for signature verification. These systems are gaining popularity day by day, and a number of products are currently available in the market. However, online verification is facing stiff competition from fingerprint verification systems, which are both more portable and more accurate. In addition, a balance between the FAR and FRR has to be maintained. Theoretically, the FAR and FRR are inversely related to each other. That is, if we keep tighter thresholds to
Table 1. Summary of some prominent online papers
Author: Lee, Wu and Jou (1997). Mode of verification: split and merging. Database: 200 genuine and 246 forged signatures. Feature extraction: coordinates to represent the signature, and the velocity. Results: 86.5% accuracy rate for genuine and 97.2% for forged signatures.
Author: Lee, Wu and Jou (1998). Mode of verification: similarity measurement on the logarithmic spectrum. Database: 27 people with 10 signatures each; 560 genuine and 650 forged signatures for testing. Feature extraction: features based on coefficients of the logarithmic spectrum. Results: FRR = 1.4% and FAR = 2.8%.
Author: Zhang, Pratikakis, Cornelis and Nyssen (2000). Mode of verification: corner point matching algorithm and point-to-point matching algorithm. Database: 188 signatures from 19 people. Feature extraction: corner points extracted based on velocity information. Results: 0.1% mismatch in segments for the corner point algorithm and 0.4% mismatch for the point-to-point matching algorithm.
Author: Jain, Griess and Connell (2002). Mode of verification: string matching and a common threshold. Database: 1232 signatures of 102 individuals. Feature extraction: number of strokes, coordinate distance between two points, angle with respect to the x and y axes, curvature, distance from the centre of gravity, grey value in a 9x9 neighborhood, and velocity features. Results: Type I error 2.8%, Type II error 1.6%.
Author: Feng and Wah (2003). Mode of verification: Extreme Points Warping (with both Euclidean distance and correlation coefficients used). Database: 25 users contributed 30 genuine signatures and 10 forged signatures. Feature extraction: x and y trajectories used, apart from torque and centre of mass. Results: EER (Euclidean) = 25.4%; EER (correlation) = 27.7%.
decrease FAR, inevitably we increase the FRR by rejecting some genuine signatures. Further research is thus still required to overcome this barrier.
REFERENCES Ammar, M., Fukumura, T., & Yoshida Y. (1988). Offline preprocessing and verification of signatures. International Journal of Pattern Recognition and Artificial Intelligence, 2(4), 589-602. Ammar, M., Fukumura, T., & Yoshida Y. (1990). Structural description and classification of signature images. Pattern Recognition, 23(7), 697-710. Bajaj, R., & Chaudhury, S. (1997). Signature verification using multiple neural classifiers. Pattern Recognition, 30(1), l-7. Baltzakis, H., & Papamarkos, N. (2001). A new signature verification technique based on a two-stage neural network classifier. Engineering Applications of Artificial Intelligence, 14, 95-103. Drouhard, J. P., Sabourin, R., & Godbout, M. (1996). A neural network approach to off-line signature verification using directional pdf. Pattern Recognition, 29(3), 415-424. Feng, H., & Wah, C. C. (2003). Online signature verification using a new extreme points warping technique. Pattern Recognition Letters 24(16), 2943-2951.
Gupta, J., & McCabe, A. (1997). A review of dynamic handwritten signature verification. Technical Article, James Cook University, Australia. Huang, K., & Yan, H. (1997). Off-line signature verification based on geometric feature extraction and neural network classification. Pattern Recognition, 30(1), 9-17. Ismail, M.A., & Gad, S. (2000). Off-line Arabic signature recognition and verification. Pattern Recognition, 33, 1727-1740. Jain, A. K., & Griess, F. D. (2000). Online signature verification. Project Report, Department of Computer Science and Engineering, Michigan State University, USA. Jain, A. K., Griess, F. D., & Connell, S. D. (2002). Online signature verification. Pattern Recognition, 35(12), 29632972. Kosmala, A., & Rigoll, G. (1998). A systematic comparison between online and off-line methods for signature verification using hidden markov models. 14th International Conference on Pattern Recognition (pp. 1755-1757). Lee, S. Y., Wu, Q. Z., & Jou, I. C. (1997). Online signature verification based on split and merge matching mechanism. Pattern Recognition Letters, 18, 665-673 Lee, S. Y., Wu, Q. Z., & Jou, I. C. (1998). Online signature verification based on logarithmic spectrum. Pattern Recognition, 31(12), 1865-1871 Lorette, G. (1984). Online handwritten signature recognition based on data analysis and clustering. Proceedings of 7th International Conference on Pattern Recognition, Vol. 2 (pp. 1284-1287). McCabe, A. (2000). Hidden markov modeling with simple directional features for effective and efficient handwriting verification. Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence. Plamondon R., & Lorette, G. (1989). Automatic signature verification and writer identification – the state of the art. Pattern Recognition, 22(2), 107-131. Ramesh, V.E., & Murty, M. N. (1999). Off-line signature verification using genetically optimized weighted features. Pattern Recognition, 32(7), 217-233. Sato, Y., & Kogure, K. (1982). Online signature verification based on shape, motion and handwriting pressure. Proceedings of 6th International Conference on Pattern Recognition, Vol. 2 (pp. 823-826). Tappert, C. C., Suen, C. Y., & Wakahara, T. (1990). The state of the art in on-line handwriting recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 12(8).
Global Features: The features are extracted using the complete signature image or signal as a single entity.
Wessels, T., & Omlin, C. W. (2000). A hybrid approach for signature verification. International Joint Conference on Neural Networks.
Local Features: The geometric information of the signature is extracted in terms of features after dividing the signature image or signal into grids and sections.
Xiao, X., & Leedham, G. (2002). Signature verification using a modified bayesian network. Pattern Recognition, 35, 983-995.
Online Signature Recognition: The signature is captured through a digitizer or an instrumented pen and both geometric and temporal information are recorded and later used in the recognition process.
Zhang, K., Pratikakis, I., Cornelis, J., & Nyssen, E. (2000). Using landmarks in establishing a point to point correspondence between signatures. Pattern Analysis and Applications, 3, 69-75.
KEY TERMS
Equal Error Rate: The error rate when the proportions of FAR and FRR are equal. The accuracy of the biometric system is inversely proportional to the value of the EER.
False Acceptance Rate: The rate of acceptance of a forged signature as a genuine signature by a handwritten signature verification system.
False Rejection Rate: The rate of rejection of a genuine signature as a forged signature by a handwritten signature verification system.
Random Forgery: A random forgery is one in which the forged signature has a totally different semantic meaning and overall shape in comparison to the genuine signature.
Simple Forgery: A simple forgery is one in which the semantics of the signature are the same as those of the genuine signature, but the overall shape differs to a great extent, since the forger has no idea about how the signature is done.
Skilled Forgery: In a skilled forgery, the forger has prior knowledge about how the signature is written and practices it well before the final attempt at duplicating it.
Traced Forgery: For a traced forgery, a signature instance or its photocopy is used as a reference and an attempt is made to forge it.
Organizational Data Mining
Hamid R. Nemati, The University of North Carolina at Greensboro, USA
Christopher D. Barko, The University of North Carolina at Greensboro, USA
INTRODUCTION Data mining is now largely recognized as a business imperative and considered essential for enabling the execution of successful organizational strategies. The adoption rate of data mining by enterprises is growing quickly, due to noteworthy industry results in applications such as credit assessment, risk management, market segmentation, and the ever-increasing volumes of corporate data available for analysis. The quantity of data being captured is staggering—data experts estimate that in 2002, the world generated five exabytes of information. This amount of data is more than all the words ever spoken by human beings (Hardy, 2004). The rate of growth is just as astounding—the amount of data produced in 2002 was up 68% from just two years earlier. The size of the typical business database has grown a hundred-fold during the past five years as a result of Internet commerce, ever-expanding computer systems, and mandated recordkeeping by government regulations (Hardy, 2004). Following this trend, a recent survey of corporations across 23 countries revealed that the largest transactional database almost doubled in size to 18 TB (terabytes), while the largest decision-support database grew to almost 30 TB (Reddy, 2004). However, in spite of this enormous growth in enterprise databases, research from IBM reveals that organizations use less than 1% of their data for analysis (Brown, 2002). In a similar study, a leading business intelligence firm surveyed executives at 450 companies and discovered that 90% of these organizations rely on gut instinct rather than hard facts for most of their decisions, because they lack the necessary information when they need it (Brown, 2002). In cases where sufficient business information is available, those organizations only are able to utilize less than 7% of it (The Economist, 2001). This is the fundamental irony of the Information Age we live in—organizations possess enormous amounts of business information yet have so little real business knowledge. In the past, companies have struggled to make decisions because of lack of data. But in the current environment, more and more organizations are struggling to
overcome information paralysis—there is so much data available that it is difficult to determine what is relevant and how to extract meaningful knowledge. Organizations today routinely collect and manage terabytes of data in their databases, thereby making information paralysis a key challenge in enterprise decision making. The generation and management of business data lose much of their potential organizational value unless important conclusions can be extracted from them quickly enough to influence decision making, while the business opportunity is still present. Managers must understand rapidly and thoroughly the factors driving their business in order to sustain a competitive advantage. Organizational speed and agility, supported by fact-based decision making, are critical to ensure an organization remains at least one step ahead of its competitors.
BACKGROUND The manner in which organizations execute this intricate decision-making process is critical to their wellbeing and industry competitiveness. Those organizations making swift, fact-based decisions by optimally leveraging their data resources will outperform those organizations that do not. A robust technology that facilitates this process of optimal decision making is known as Organizational Data Mining (ODM). ODM is defined as leveraging data mining tools and technologies in order to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage (Nemati & Barko, 2001). ODM eliminates the guesswork that permeates so much of corporate decision making. By adopting ODM, an organization’s managers and employees are able to act sooner rather than later, be proactive rather than reactive, and know rather than guess. ODM technology has helped many organizations to optimize internal resource allocations while better understanding and responding to the needs of their customers. ODM spans a wide array of technologies, including, but not limited to, e-business intelligence, On-Line Analytical Processing (OLAP), optimization, Customer
Relationship Management (CRM), electronic CRM (eCRM), Executive Information Systems (EIS), digital dashboards, and enterprise information portals. ODM enables organizations to answer questions about the past (what has happened?), the present (what is happening?), and the future (what might happen?). Armed with this capability, organizations can generate valuable knowledge from their data, which, in turn, enhances enterprise decisions. This decision-enhancing technology offers many advantages in operations (faster product development, optimal supply chain management), marketing (higher profitability and increased customer loyalty through more effective marketing campaigns), finance (optimal portfolio management, financial analytics), and strategy implementation (Business Performance Management [BPM] and the Balanced Scorecard). Over the last three decades, the organizational role of information technology has evolved from efficiently processing large amounts of batch transactions to providing information in support of tactical and strategic decision-making activities. This evolution, from automating expensive manual systems to providing strategic organizational value, led to the birth of Decision Support Systems (DSS), such as data warehousing and data mining. The organizational need to combine data from multiple stand-alone systems (e.g., financial, manufacturing, and distribution) grew as corporations began to acknowledge the power of combining these data sources for reporting. This spurred the growth of data warehousing, where multiple data sources were stored in a format that supported advanced data analysis. The slowness in adoption of ODM techniques in the 1990s was partly due to an organizational and cultural resistance. Business management always has been reluctant to trust something it does not fully understand. Until recently, most businesses were managed by instinct, intuition, and gut feeling. The transition over the past 20 years to a method of managing by the numbers is both the result of technology advances as well as a generational shift in the business world, as younger managers arrive with information technology training and experience.
ODM Research Given the scarcity of past research in ODM along with its growing acceptance and importance in organizations, we conducted empirical research during the past several years that explored the utilization of ODM in organizations along with project implementation factors critical for success. We surveyed ODM professionals from multiple industries in both domestic and international organizations. Our initial research examined the ODM industry status and best practices, identified both tech892
nical and business issues related to ODM projects, and elaborated on how organizations are benefiting through enhanced enterprise decision making (Nemati & Barko, 2001). The results of our research suggest that ODM can improve the quality and accuracy of decisions for any organization that is willing to make the investment. After exploring the status and utilization of ODM in organizations, we decided to focus subsequent research on how organizations implement ODM projects and on the factors critical to its success. To that end, we developed a new ODM Implementation Framework based on data, technology, organizations, and the Iron Triangle (Nemati & Barko, 2003). Our research demonstrated that selected organizational data mining project factors, when modeled under this new framework, have a significant influence on the successful implementation of ODM projects. Given the promise of strengthening customer relationships and enhancing profits, CRM technology and associated research are gaining greater acceptance within organizations. However, findings from recent studies suggest that organizations generally fail to support their CRM efforts with complete data (Brohman et al., 2003). As further investigation, our latest research has focused on a specific ODM technology known as Electronic Customer Relationship Management (e-CRM) and its data integration role within organizations. Consequently, we developed a new e-CRM Value Framework to better examine the significance of integrating data from all customer touch-points with the goal of improving customer relationships and creating additional value for the firm. Our research findings suggest that, despite the cost and complexity, data integration for e-CRM projects contributes to a better understanding of the customer and leads to higher return on investment (ROI), a greater number of benefits, improved user satisfaction, and a higher probability of attaining a competitive advantage (Nemati, Barko & Moosa, 2003).
MAIN THRUST Data mining is the process of discovering and interpreting previously unknown patterns in databases. It is a powerful technology that converts data into information and potentially actionable knowledge. However, there are many obstacles to the broad inclusion of data mining in organizations. Obtaining new knowledge in an organizational vacuum does not facilitate optimal decision making in a business setting. Simply incorporating data mining into the enterprise mix without considering nontechnical issues is usually a recipe for failure. Businesses must give careful thought when weaving data mining into their organization’s fabric. The unique orga-
nizational challenge of understanding and leveraging ODM to engineer actionable knowledge requires assimilating insights from a variety of organizational and technical fields and developing a comprehensive framework that supports an organization’s quest for a sustainable competitive advantage. These multi-disciplinary fields include data mining, business strategy, organizational learning and behavior, organizational culture, organizational politics, business ethics and privacy, knowledge management, information sciences, and decision support systems. These fundamental elements of ODM can be categorized into three main groups: Artificial Intelligence (AI), Information Technology (IT), and Organizational Theory (OT). Our research and industry experience suggest that successfully leveraging ODM requires integrating insights from all three categories in an organizational setting typically characterized by complexity and uncertainty. This is the essence and uniqueness of ODM. Obtaining maximum value from ODM involves a cross-department team effort that includes statisticians/ data miners, software engineers, business analysts, lineof-business managers, subject-matter experts, and upper management support.
Organizational Theory and ODM
Organizations are concerned primarily with studying how operating efficiencies and profitability can be achieved through the effective management of customers, suppliers, partners, and employees. To achieve these goals, research in Organizational Theory (OT) suggests that organizations use data in three vital knowledge creation activities. This organizational knowledge creation and management is a learned ability that can only be achieved via an organized and deliberate methodology. This methodology is a foundation for successfully leveraging ODM within the organization. The three knowledge creation activities (Choo, 1997) are:
• Sense Making: The ability to interpret and understand information about the environment and events happening both inside and outside the organization.
• Knowledge Making: The ability to create new knowledge by combining the expertise of members to learn and innovate.
• Decision Making: The ability to process and analyze information and knowledge in order to select and implement the appropriate course of action.
First, organizations use data to make sense of changes and developments in the external environments—a process called sense making. This is a vital activity wherein managers discern the most significant changes, interpret their meaning, and develop appropriate responses. Sec-
ond, organizations create, organize, and process data to generate new knowledge through organizational learning. This knowledge creation activity enables the organization to develop new capabilities, design new products and services, enhance existing offerings, and improve organizational processes. Third, organizations search for and evaluate data in order to make decisions. This data is critical, since all organizational actions are initiated by decisions and all decisions are commitments to actions, the consequences of which will, in turn, lead to the creation of new data. Adopting an OT methodology enables an enterprise to enhance the knowledge engineering and management process. In another OT study, researchers and academic scholars have observed that there is no direct correlation between information technology (IT) investments and organizational performance. Research has confirmed that identical IT investments in two different companies may give a competitive advantage to one company but not the other. Therefore, a key factor for the competitive advantage in an organization is not the IT investment but the effective utilization of information as it relates to organizational performance (Brynjolfsson & Hitt, 1996). This finding emphasizes the necessity of integrating OT practices with robust information technology and artificial intelligence techniques in successfully leveraging ODM.
ODM Practices at Leading Companies A 2002 Strategic Decision-Making study conducted by Hackett Best Practices determined that world-class companies have adopted ODM technologies at more than twice the rate of average companies (Hoblitzell, 2002). ODM technologies provide these world-class organizations greater opportunities to understand their business and make informed decisions. ODM also enables world-class organizations to leverage their internal resources more efficiently and more effectively than their average counterparts, who have not fully embraced ODM. Many of today’s leading organizations credit their success to the development of an integrated, enterprise-level ODM system. As part of an effective CRM strategy, customer retention is now widely viewed by organizations as a significant marketing strategy in creating a competitive advantage. Research suggests that as little as a 5% increase in retention can mean as much as a 95% boost in profit, and repeat customers generate over twice as much gross income as new customers (Winer, 2001). In addition, many business executives today have replaced their cost reduction strategies with a customer retention strategy—it costs approximately five to 10 times more to acquire new 893
O
customers than to retain established customers (Pan & Lee, 2003). An excellent example of a successful CRM strategy is Harrah’s Entertainment, which has saved over $20 million per year since implementing its Total Rewards CRM program. This ODM system has given Harrah’s a better understanding of its customers and has enabled the company to create targeted marketing campaigns that almost doubled the profit per customer and delivered same-store sales growth of 14% after only the first year. In another notable case, Travelocity.com, an Internet-based travel agency, implemented an ODM system and improved total bookings and earnings by 100% in 2000. Gross profit margins improved 150%, and booker conversion rates rose 8.9%, the highest in the online travel services industry. In another significant study, executives from 24 leading companies in customer-knowledge management, including FedEx, Frito-Lay, Harley-Davidson, Procter & Gamble, and 3M, all realized that in order to succeed, they must go beyond simply collecting customer data and must translate it into meaningful knowledge about existing and potential customers (Davenport, Harris & Kohli, 2001). This study revealed that several objectives were common to all of the leading companies, and these objectives can be facilitated by ODM. A few of these objectives are segmenting the customer base, prioritizing customers, understanding online customer behavior, engendering customer loyalty, and increasing cross-selling opportunities.
FUTURE TRENDS The number of ODM projects is projected to grow more than 300% in the next decade (Linden, 1999). As the collection, organization, and storage of data rapidly increase, ODM will be the only means of extracting timely and relevant knowledge from large corporate databases. The growing mountains of business data, coupled with recent advances in Organizational Theory and technological innovations, provide organizations with a framework to effectively use their data to gain a competitive advantage. An organization’s future success will depend largely on whether or not they adopt and leverage this ODM framework. ODM will continue to expand and mature as the corporate demand for oneto-one marketing, CRM, e-CRM, Web personalization, and related interactive media increases. We believe that organizations are slowly moving from the Information Age to the Knowledge Age, where decision makers will leverage ODM for optimal business performance. Organizations are complex entities comprised of employees, managers, politics, culture, hierarchies, teams, processes, customers, partners, sup894
pliers, and shareholders. The never-ending challenge is to successfully integrate data-mining technologies within organizations in order to enhance decision making with the objective of optimally allocating scarce enterprise resources. This is not an easy task, as many information technology professionals, consultants, and managers can attest. The media can oversimplify the effort, but successfully implementing ODM is not accomplished without political battles, project management struggles, cultural shocks, business process reengineering, personnel changes, short-term financial and budgetary shortages, and overall disarray. Recent ODM research has revealed a number of industry predictions that are expected to be key ODM issues in the coming years. Nemati & Barko (2001) found that almost 80% of survey respondents expect Web farming/mining and consumer privacy to be significant issues. We also foresee the development of widely accepted standards for ODM processes and techniques to be an influential factor for knowledge seekers in the 21st century. One attempt at ODM standardization is the creation of the Cross Industry Standard Process for Data Mining (CRISP-DM) project that developed an industry and tool-neutral, data-mining process model to solve business problems. Another attempt at industry standardization is the work of the Data Mining Group in developing and advocating the Predictive Model Markup Language (PMML), which is an XML-based language that provides a quick and easy way for companies to define predictive models and share models between compliant vendors’ applications. Last, Microsoft’s OLE DB for Data Mining is a further attempt at industry standardization and integration. This specification offers a common interface for data mining that will enable developers to embed data-mining capabilities into their existing applications. One only has to consider Microsoft’s industry-wide dominance of the office productivity (Microsoft Office), software development (Visual Basic and .Net), and database (SQL Server) markets to envision the potential impact this could have on the ODM market and its future direction.
CONCLUSION Although many improvements have materialized over the last decade, the knowledge gap in many organizations is still prevalent. Industry professionals have suggested that many corporations could maintain current revenues at half the current costs, if they optimized their use of corporate data (Saarenvirta, 2001). Whether this finding is true or not, it sheds light on an important issue. Leading corporations in the next decade will adopt and weave these ODM technologies into the fabric of their organizations at the strategic, tactical, and
Organizational Data Mining
operational levels. Those enterprises that see the strategic value of evolving into knowledge organizations by leveraging ODM will benefit directly in the form of improved profitability, increased efficiency, and sustainable competitive advantage. Once the first organization within an industry realizes a competitive advantage through ODM, it is only a matter of time before one of three events transpires: its industry competitors adopt ODM, change industries, or vanish. By adopting ODM, an organization’s managers and employees are able to act sooner rather than later, anticipate rather than react, know rather than guess, and, ultimately, succeed rather than fail.
REFERENCES Brohman, M. et al. (2003). Data completeness: A key to effective net-based customer service systems. Communications of the ACM, 46(6), 47-51. Brown, E. (2002). Analyze this. Forbes, 169(8), 96-98. Brynjolfsson, E., & Hitt, L. (1996, September 9). The customer counts. InformationWeek. Retrieved from www.informationweek.com/596/96mit.htm Choo, C.W. (1997). The knowing organization: How organizations use information to construct meaning, create knowledge, and make decisions. Retrieved from www.choo.fis.utoronto.ca/fis/ko/default.html Davenport, T.H., Harris, J.G., & Kohli, A.K. (2001). How do they know their customers so well? Sloan Management Review, 42(2), 63-73. Hardy, Q. (2004). Data of reckoning. Forbes, 173(10), 151-154. Hoblitzell, T. (2002). Disconnects in today’s BI systems. DM Review, 12(6), 56-59. Linden, A. (1999). CIO update: Data mining applications of the next decade. Inside Gartner Group. Gartner Inc. Nemati, H.R., & Barko, C.D. (2001). Issues in organizational data mining: A survey of current practices. Journal of Data Warehousing, 6(1), 25-36. Nemati, H.R., & Barko, C.D. (2003). Key factors for achieving organizational data mining success. Industrial Management & Data Systems, 103(4), 282-292. Nemati, H.R., Barko, C.D., & Moosa, A. (2003). ECRM analytics: The role of data integration. Journal of Electronic Commerce in Organizations, 1(3), 73-89.
Pan, S.L., & Lee, J.-N. (2003). Using E-CRM for a unified view of the customer. Communications of the ACM, 46(4), 95-99. Reddy, R. (2004). The reality of real time. Intelligent Enterprise, 7(10), 40-41. Saarenvirta, G. (2001). Operation data mining. DB2 Magazine. Retrieved from http://www.db3mag.com/db_area/ archives/2001/q2/saarenvirta.html The slow progress of fast wires. (2001). The Economist, 358(8209), 57-59. Winer, R.S. (2001). A framework for customer relationship management. California Management Review, 43(4), 89-106.
TERMS CRISP-DM (Cross Industry Standard Process for Data Mining): An industry and tool-neutral datamining process model developed by members from diverse industry sectors and research institutes. CRM (Customer Relationship Management): The methodologies, software, and Internet capabilities that help a company manage customer relationships in an efficient and organized manner. Information Paralysis: A condition where too much data causes difficulty in determining relevancy and extracting meaningful information and knowledge. ODM (Organizational Data Mining): The process of leveraging data-mining tools and technologies in an organizational setting in order to enhance the decisionmaking process by transforming data into valuable and actionable knowledge in order to gain a competitive advantage. OLAP (Online Analytical Processing): A data mining technology that uses software tools to interactively and simultaneously analyze different dimensions of multi-dimensional data. OT (Organizational Theory): A body of research that focuses on studying how organizations create and use information in three strategic arenas—sense making, knowledge making, and decision making. PMML (Predictive Model Markup Language): An XML-based language that provides a method for companies to define predictive models and share models between compliant vendors’ applications.
Path Mining in Web Processes Using Profiles Jorge Cardoso University of Madeira, Portugal
INTRODUCTION Business process management systems (BPMSs) (Smith & Fingar, 2003) provide a fundamental infrastructure to define and manage business processes, Web processes, and workflows. When Web processes and workflows are installed and executed, the management system generates data describing the activities being carried out and is stored in a log. This log of data can be used to discover and extract knowledge about the execution of processes. One piece of important and useful information that can be discovered is related to the prediction of the path that will be followed during the execution of a process. I call this type of discovery path mining. Path mining is vital to algorithms that estimate the quality of service of a process, because they require the prediction of paths. In this work, I present and describe how process path mining can be achieved by using datamining techniques.
BACKGROUND BPMSs, such as workflow management systems (WfMS) (Cardoso, Bostrom, & Sheth, 2004) are systems capable of both generating and collecting considerable amounts of data describing the execution of business processes, such as Web processes. This data is stored in a process log systems, which are vast data archives that are seldom visited. Yet, the data generated from the execution of processes are rich with concealed information that can be used for making intelligent business decisions. One important and useful piece of knowledge to discover and extract from process logs is the implicit rules that govern path mining. In Web processes for e-commerce, suppliers and customers define a contract between the two parties, specifying quality of service (QoS) items, such as products or services to be delivered, deadlines, quality of products, and cost of services. The management of QoS metrics directly impacts the success of organizations participating in e-commerce. A Web process, which typically can have a graphlike representation, includes a number of linearly independent control paths. Depending on the path followed during the execution of a Web
process, the QoS may be substantially different. If you can predict with a certain degree of confidence the path that will be followed at run time, you can significantly increase the precision of QoS estimation algorithms for Web processes. Because the large amounts of data stored in process logs exceeds understanding, I describe the use of datamining techniques to carry out path mining from the data stored in log systems. This approach uses classification algorithms to conveniently extract patterns representing knowledge related to paths. My work is novel because no previous work has targeted the path mining of Web processes and workflows. The literature includes only work on process and workflow mining (Agrawal, Gunopulos, & Leymann, 1998; Herbst & Karagiannis, 1998; Weijters & van der Aalst, 2001). Process mining allows the discovery of workflow models from a workflow log containing information about workflow processes executed. Luo, Sheth, Kochut, and Arpinar (2003) present an architecture and the implementation of a sophisticated exception-handling mechanism supported by a case-based reasoning (CBR) engine.
MAIN THRUST The material presented in this section emphasizes the use of data-mining techniques for uncovering interesting process patterns hidden in large process logs. The method contained in the next section is more suitable for administrative and production processes compared to Ad-hoc and collaborative processes, because they are more repetitive and predictable.
Web Process Scenario A major bank has realized that to be competitive and efficient it must adopt a new and modern information system infrastructure. Therefore, a first step was taken in that direction with the adoption of a workflow management system to support its business processes. All the services available to customers are stored and executed under the supervision of the workflow system. One of the services supplied by the bank is the loan process depicted in Figure 1.
A Web process is composed of Web services and transitions. Web services are represented by circles, and transitions are represented by arrows. Transitions express dependencies between Web services. A Web service with more than one outgoing transition can be classified as an and-split or xor-split. And-split Web services enable all their outgoing transitions after completing their execution. Xor-split Web services enable only one outgoing transition after completing their execution. And-split Web services are represented with ‘•’, and xor-split Web services are represented with ‘⊕’. A Web service with more than one incoming transition can be classified as an and-join or xor-join. And-join Web services start their execution when all their incoming transitions are enabled. Xor-join Web services are executed as soon as one of the incoming transitions is enabled. As with and-split and xor-split Web services, and-join and xor-join Web services are represented with the symbols ‘•’ and ‘⊕’, respectively. The Web process of this scenario is composed of 14 Web services. The Fill Loan Request Web service allows clients to request a loan from the bank. In this step, the client is asked to fill out an electronic form with personal information and data describing the condition of the loan being requested. The second Web service, Check Loan Type, determines the type of loan a client has requested and, based on the type, forwards the request to one of three Web services: Check Home Loan, Check Educational Loan, or Check Car Loan. Educational loans are not handled and managed automatically. After an educational loan application is submitted and checked, a notification is immediately sent informing the client that he or she has to contact the bank personally. A loan request can be either accepted (Approve Home Loan and Approve Car Loan) or rejected (Reject Home Loan and Reject Car Loan). In the case of a home loan, however, the loan can also be approved conditionally.
Figure 1. The loan process
The Web service Approve Home Loan Conditionally, as the name suggests, approves a home loan under a set of conditions. The following formula is used to determine if a loan is approved or rejected:
MP = (L * R * (1 + R/12)^(12*NY)) / (-12 + 12 * (1 + R/12)^(12*NY))   (1)
where MP = monthly payment, L = loan amount, R = interest rate, and NY = number of years. When the result of a loan application is known, it is e-mailed to the client. Three Web services are responsible for notifying the client: Notify Home Loan Client, Notify Education Loan Client, and Notify Car Loan Client. Finally, the Archive Application Web service creates a report and stores the loan application data in a database record.
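To make formula (1) concrete, here is a minimal sketch that evaluates it for a hypothetical loan; the loan amount, rate, and term are invented for illustration and are not taken from the scenario.

```python
def monthly_payment(loan_amount: float, annual_rate: float, num_years: int) -> float:
    """Monthly payment MP from formula (1), with L, R, and NY as defined above."""
    growth = (1 + annual_rate / 12) ** (12 * num_years)
    return (loan_amount * annual_rate * growth) / (-12 + 12 * growth)

# Hypothetical example: a 15,000 car loan at a 6% annual rate over 9 years.
if __name__ == "__main__":
    mp = monthly_payment(15_000, 0.06, 9)
    print(f"Monthly payment: {mp:.2f}")   # roughly 180 per month
```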
Web Process Log During the execution of Web processes (such as the one presented in Figure 1), events and messages generated by the enactment system are stored in a Web process log. These data stores provide an adequate format on which path mining can be performed. The data includes real-time information describing the execution and behavior of Web processes, Web services, instances, transitions, and other elements such as runtime QoS metrics. Table 1 illustrates an example of a modern Web process log. To perform path mining, current Web process logs need to be extended to store information indicating the values and the type of the input parameters passed to Web services and the output parameters received from Web services. Table 2 shows an extended Web process log that accommodates input/output values of Web services parameters generated at run time. Each Parameter/value entry has a type, parameter name, and value (e.g., string loan-type=”car-loan”). Additionally, the Web process log needs to include path information describing the Web services that have been executed during the enactment of a Web process. This information can be easily stored in the log. For example, an extra field can be added to the log system to contain the information indicating the path followed. The path needs only to be associated to the entry corresponding to the last service of a process to be executed. For example, in the Web process log illustrated in Table 2, the service NotifyUser is the last service of a Web process. The log has been extended in such a way that the NotifyUser record contains information about the path that was followed during the Web process execution.
Table 1. Web process log

Date | Web process | Process instance | Web service | Service instance | Cost | Duration | …
6:45 03-03-04 | LoanApplication | LA04 | RejectCarLoan | RCL03 | $1.2 | 13 min | …
6:51 03-03-04 | TravelRequest | TR08 | FillRequestTravel | FRT03 | $1.1 | 14 min | …
6:59 03-03-04 | TravelRequest | TR09 | NotifyUser | NU07 | $1.4 | 24 hrs | …
7:01 03-03-04 | InsuranceClaim | IC02 | SubmitClaim | SC06 | $1.2 | 05 min | …
Table 2. Extended Web process log

Process instance | Web service | Service instance | Parameter/value | Path | …
LA04 | RejectCarLoan | RCL03 | int LoanNum=14357; string loan-type=”car-loan” | - | …
LA04 | NotifyCLoanClient | NLC07 | string e-mail=”[email protected]” | - | …
LA05 | CheckLoanRequest | CLR05 | double income=12000; string Name=”Eibe Frank” | - | …
TR09 | NotifyUser | NU07 | String [email protected]; String tel=”35129170023” | FillForm->CheckForm->Approve->Sign->Report | …
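As an illustration of how entries of the extended log in Table 2 might be handled programmatically, the sketch below models a log record with a parameter list and an optional path field attached to the last service of an instance. The class and field names are hypothetical, not part of any particular BPMS.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogEntry:
    """One record of the extended Web process log (cf. Table 2)."""
    process_instance: str
    web_service: str
    service_instance: str
    parameters: dict = field(default_factory=dict)   # parameter name -> value
    path: Optional[str] = None                       # set only on the last service

entries = [
    LogEntry("LA05", "CheckLoanRequest", "CLR05",
             {"income": 12000.0, "Name": "Eibe Frank"}),
    LogEntry("TR09", "NotifyUser", "NU07",
             {"tel": "35129170023"},
             path="FillForm->CheckForm->Approve->Sign->Report"),
]

# Collect, per process instance, all logged parameters plus the recorded path.
instances: dict = {}
for e in entries:
    rec = instances.setdefault(e.process_instance, {"parameters": {}, "path": None})
    rec["parameters"].update(e.parameters)
    if e.path is not None:
        rec["path"] = e.path

print(instances["TR09"]["path"])
```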
Web Process Profile When beginning work on path mining, it is necessary to elaborate a profile for each Web process. A profile provides the input to machine learning and is characterized by its values on a fixed, predefined set of attributes. The attributes correspond to the Web service input/ output parameters that have been stored previously in the Web process log. Path mining will be performed on these attributes. A profile contains two types of attributes, numeric and nominal. Numeric attributes measure numbers, either real or integer-valued. For example, Web services inputs or outputs parameters that are of type byte, decimal, int, short, or double will be placed in the profile and classified as numeric. In Table 2, the parameters LoanNum, income, BudgetCode, income, and tel will be classified as numeric in the profile. Nominal attributes take on values within a finite set of possibilities. Nominal quantities have values that are distinct symbols. For example, the parameter loan-type from the loan application and present in Table 2 is nominal because it can take the finite set of values: home-loan, education-loan, and car-loan. In my approach, string and Boolean data type manipulated by Web services are considered to be nominal attributes.
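A minimal sketch of the attribute-typing rule just described, assuming the parameter types are available as strings in the log: the listed primitive types map to numeric attributes, while string and Boolean parameters map to nominal ones.

```python
NUMERIC_TYPES = {"byte", "decimal", "int", "short", "double"}
NOMINAL_TYPES = {"string", "boolean"}

def attribute_kind(declared_type: str) -> str:
    """Classify a logged Web service parameter type as 'numeric' or 'nominal'."""
    t = declared_type.lower()
    if t in NUMERIC_TYPES:
        return "numeric"
    if t in NOMINAL_TYPES:
        return "nominal"
    raise ValueError(f"unhandled parameter type: {declared_type}")

# Attributes of the loan profile, typed from the parameter declarations in the log.
profile_schema = {
    "LoanNum": attribute_kind("int"),        # numeric
    "income": attribute_kind("double"),      # numeric
    "loan-type": attribute_kind("string"),   # nominal: home-loan, education-loan, car-loan
}
print(profile_schema)
```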
Profile Classification The attributes present in a profile trigger the execution of a specific set of Web services. Therefore, for each profile previously constructed, I associate an additional attribute, the path attribute, indicating the path followed
when the attributes of the profile have been assigned to specific values. The path attribute is a target class. Classification algorithms classify samples or instances into target classes. After the profiles and a path attribute value for each profile have been determined, I can use data-mining methods to establish a relationship between the profiles and the paths followed at run time. One method appropriate to deal with my problem is the use of classification. In classification, a learning schema takes a set of classified profiles, from which it is expected to learn a way of classifying unseen profiles. Because the path of each training profile is provided, my methodology uses supervised learning.
EXPERIMENTS In this section, I present the results of applying my algorithm to a synthetic loan dataset. To generate a synthetic dataset, I start with the process presented in the introductory scenario and, using this as a process model graph, log a set of process instance executions. The data are lists of event records stored in a Web process log consisting of process names, instance identification, Web services names, variable names, and so forth. Table 3 shows the additional data that have been stored in the Web process log. The information includes the Web service variable values that are logged by the system and the path that has been followed during the execution of instances. Each entry corresponds to an instance execution.
Table 3. Additional data stored in the Web process log

Income | Loan_Type | Loan_amount | Loan_years | Name | SSN | Path
1361.0 | Home-Loan | 129982.0 | 33 | Bernard-Boar | 10015415 | FR>CLT>CHL>AHL>NHC>CA
Unknown | Education-Loan | Unknown | Unknown | John-Miller | 15572979 | FR>CLT>CEL>CA
1475.0 | Car-Loan | 15002.0 | 9 | Eibe-Frank | 10169316 | FR>CLT>CCL>ACL>NCC>CA
… | … | … | … | … | … | …
Web process profiles provide the input to machine learning and are characterized by a set of six attributes: income, loan_type, loan_amount, loan_years, name, and SSN. The profiles for the loan process contain two types of attributes: numeric and nominal. The attributes income, loan_amount, loan_years, and SSN are numeric, whereas the attributes loan_type and name are nominal. As an example of a nominal attribute, loan_type can take the finite set of values home-loan, education-loan, and car-loan. These attributes correspond to the Web service input/output parameters that have been stored previously in the Web process log presented in Table 3. Each profile is associated with a class indicating the path that has been followed during the execution of a process when the attributes of the profile have been assigned specific values. The last column of Table 3 shows the class named path. The profiles and path attributes will be used to establish a relationship between the profiles and the paths followed at runtime. The profiles and the class path have been extracted from the Web process log. After profiles are constructed and associated with paths, these data are combined and formatted to be analyzed using Weka (2004), a collection of machine learning and data-mining software. The data is automatically formatted using the ARFF format. I have used the J48 algorithm, which is Weka’s implementation of the C4.5 (Hand, Mannila, & Smyth, 2001) decision tree learner, to classify profiles. The C4.5 decision tree learner is one of the best-known decision tree algorithms in the data-mining community, and the Weka system and its ARFF data format are among the most widely used data-mining tools in academia.
Figure 2. Experimental results: percentage of correctly predicted paths versus the number of attributes used (two to six).
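Weka's J48 is a Java implementation of C4.5. As a rough analogue in Python, the sketch below trains scikit-learn's CART-style decision tree (not C4.5) on a few made-up profiles shaped like Table 3; it is meant only to show the shape of the classification step, not to reproduce the reported accuracies, and it assumes scikit-learn is installed.

```python
# Sketch: classify process profiles into paths with a decision tree.
# scikit-learn's DecisionTreeClassifier is CART-based, so this only
# approximates the J48/C4.5 setup described in the text.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

profiles = [  # invented examples in the spirit of Table 3
    {"income": 1361.0, "loan_type": "home-loan", "loan_amount": 129982.0, "loan_years": 33},
    {"income": 1475.0, "loan_type": "car-loan", "loan_amount": 15002.0, "loan_years": 9},
    {"income": 900.0,  "loan_type": "education-loan", "loan_amount": 8000.0, "loan_years": 4},
]
paths = ["FR>CLT>CHL>AHL>NHC>CA", "FR>CLT>CCL>ACL>NCC>CA", "FR>CLT>CEL>CA"]

vec = DictVectorizer(sparse=False)        # one-hot encodes the nominal attributes
X = vec.fit_transform(profiles)
clf = DecisionTreeClassifier().fit(X, paths)

new_profile = {"income": 1400.0, "loan_type": "car-loan",
               "loan_amount": 16000.0, "loan_years": 8}
print(clf.predict(vec.transform([new_profile]))[0])
```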
Each experiment has involved data from 1,000 Web process executions and a variable number of attributes (ranging from two to six). I have conducted 34 experiments, analyzing a total of 34,000 records containing data from Web process instance executions. Figure 2 shows the results that I have obtained. The path-mining technique developed has achieved encouraging results. When three or more attributes are involved in the prediction, the system is able to predict correctly the path followed for more than 75% of the process instances. This accuracy improves when four attributes are involved in the prediction; in this case, more than 82% of the paths are correctly predicted. When five attributes are involved, I obtain a level of prediction that reaches a high of 93.4%. Involving all six attributes in the prediction gives excellent results: 88.9% of the paths are correctly predicted. When a small number of attributes are involved in the prediction, the results are not as good. For example, when only two attributes are selected, I obtain predictions that range from 25.9% to 86.7%.
FUTURE TRENDS Currently, organizations use BPMSs, such as WfMS, to define, enact, and manage a wide range of distinct applications (Q-Link Technologies, 2002), such as insurance claims, bank loans, bioinformatic experiments (Hall, Miller, Arnold, Kochut, Sheth, & Weise, 2003), health-care procedures (Anyanwu, Sheth, Cardoso, Miller, & Kochut, 2003), and telecommunication services (Luo, Sheth, Kochut, & Arpinar, 2003). In the future, I expect to see a wider spectrum of applications managing processes in organizations. According to the Aberdeen Group’s estimates, spending in the business process management software sector (which includes workflow systems) reached $2.26 billion in 2001 (Cowley, 2002). The concept of path mining can be used effectively in many business applications — for example, to estimate the QoS of Web processes and workflows (Cardoso, Miller, Sheth, Arnold, & Kochut, 2004) — because the estimation requires the prediction of paths. Organizations operating in modern markets, such as e-commerce activi-
ties and distributed Web services interactions, require QoS management. Appropriate quality control leads to the creation of quality products and services; these, in turn, fulfill customer expectations and achieve customer satisfaction (Cardoso, Sheth, & Miller, 2002).
CONCLUSION BPMSs, Web processes, workflows, and workflow systems represent fundamental technological infrastructures that efficiently define, manage, and support business processes. The data generated from the execution and management of Web processes can be used to discover and extract knowledge about the process executions and structure. I have shown that one important area of Web processes to analyze is path mining. I have demonstrated how path mining can be achieved by using data-mining techniques, namely classification, to extract path knowledge from Web process logs. From my experiments, I can conclude that classification methods are a good solution to perform path mining on administrative and production Web processes.
REFERENCES
Cowley, S. (2002, September 23). Study: BPM market primed for growth. Available from the InfoWorld Web site, http://www.infoworld.com
Hall, R. D., Miller, J. A., Arnold, J., Kochut, K. J., Sheth, A. P., & Weise, M. J. (2003). Using workflow to build an information management system for a geographically distributed genome sequence initiative. In R. A. Prade & H. J. Bohnert (Eds.), Genomics of plants and fungi (pp. 359-371). New York: Marcel Dekker.
Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Bradford Book.
Herbst, J., & Karagiannis, D. (1998). Integrating machine learning and workflow management to support acquisition and adaption of workflow models. Proceedings of the Ninth International Workshop on Database and Expert Systems Applications.
Luo, Z., Sheth, A., Kochut, K., & Arpinar, B. (2003). Exception handling for conflict resolution in cross-organizational workflows. Distributed and Parallel Databases, 12(3), 271-306.
Q-Link Technologies. (2002). BPM2002: Market milestone report. Retrieved from http://www.qlinktech.com
Smith, H., & Fingar, P. (2003). Business process management (BPM): The third wave. Meghan-Kiffer Press.
Agrawal, R., Gunopulos, D., & Leymann, F. (1998). Mining process models from workflow logs. Proceedings of the Sixth International Conference on Extending Database Technology, Spain.
Weijters, T., & van der Aalst, W. M. P. (2001). Process mining: Discovering workflow models from event-based data. Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence.
Anyanwu, K., Sheth, A., Cardoso, J., Miller, J. A., & Kochut, K. J. (2003). Healthcare enterprise process development and integration. Journal of Research and Practice in Information Technology, 35(2), 83-98.
Weka. (2004). Weka [Computer software.] Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
Cardoso, J., Bostrom, R. P., & Sheth, A. (2004). Workflow management systems and ERP systems: Differences, commonalities, and applications. Information Technology and Management Journal, 5(3-4), 319-338.
Cardoso, J., Miller, J., Sheth, A., Arnold, J., & Kochut, K. (2004). Quality of service for workflows and Web service processes. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 1(3), 281-308.
Cardoso, J., Sheth, A., & Miller, J. (2002). Workflow quality of service. Proceedings of the International Conference on Enterprise Integration and Modeling Technology and International Enterprise Modeling Conference, Spain.
KEY TERMS
Business Process: A set of one or more linked activities that collectively realize a business objective or goal, normally within the context of an organizational structure. Business Process Management System (BPMS): Provides an organization with the ability to collectively define and model its business processes, deploy these processes as applications that are integrated with its existing software systems, and then provide managers with the visibility to monitor, analyze, control, and improve the execution of those processes.
Process Definition: The representation of a business process in a form that supports automated manipulation or enactment by a workflow management system. Web Process: A set of Web services that carry out a specific goal. Web Process Data Log: Records and stores events and messages generated by the enactment system during the execution of Web processes.
Workflow: The automation of a business process, in whole or part, during which documents, information, or tasks are passed from one participant to another for action, according to a set of procedural rules. Workflow Management System: A system that defines, creates, and manages the execution of workflows through the use of software, which is able to interpret the process definition, interact with participants, and, where required, invoke the use of tools and applications.
Web Service: Describes a standardized way of integrating Web-based applications by using open standards over an Internet protocol.
Pattern Synthesis for Large-Scale Pattern Recognition P. Viswanath Indian Institute of Science, India M. Narasimha Murty Indian Institute of Science, India Shalabh Bhatnagar Indian Institute of Science, India
INTRODUCTION Two major problems in applying any pattern recognition technique for large and high-dimensional data are (a) high computational requirements and (b) curse of dimensionality (Duda, Hart, & Stork, 2000). Algorithmic improvements and approximate methods can solve the first problem, whereas feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe, Washio, Motoda, Katai, & Sawaragi, 2002), and bootstrapping techniques (Efron, 1979; Hamamoto, Uchimura, & Tomita, 1997) can tackle the second problem. We propose a novel and unified solution for these problems by deriving a compact and generalized abstraction of the data. By this term, we mean a compact representation of the given patterns from which one can retrieve not only the original patterns but also some artificial patterns. The compactness of the abstraction reduces the computational requirements, and its generalization reduces the curse of dimensionality effect. Pattern synthesis techniques accompanied with compact representations attempt to derive compact and generalized abstractions of the data. These techniques are applied with nearest neighbor classifier (NNC), which is a popular nonparametric classifier used in many fields, including data mining, since its conception in the early 1950s (Dasarathy, 2002).
BACKGROUND Pattern synthesis techniques, compact representations, and their application with NNC are based on more established fields:
• Pattern Recognition: Statistical techniques, parametric and nonparametric methods, classifier design, nearest neighbor classification, curse of dimensionality, similarity measures, feature selection, feature extraction, prototype selection, and clustering techniques.
• Data Structures and Algorithms: Computational requirements, compact storage structures, efficient nearest neighbor search techniques, approximate search methods, algorithmic paradigms, and divide-and-conquer approaches.
• Database Management: Relational operators, projection, cartesian product, data structures, data management, queries, and indexing techniques.
MAIN THRUST Pattern synthesis, compact representations followed by its application with NNC, are described in this section.
Pattern Synthesis Generation of artificial new patterns by using the given set of patterns is called pattern synthesis. There are two broad ways of doing pattern synthesis: modelbased pattern synthesis and instance-based pattern synthesis. In model-based pattern synthesis, a model (such as the Hidden Markov model) or description (such as probability distribution) of the data is derived first and is then used to generate new patterns. This method can be used to generate as many patterns as needed, but it has two drawbacks. First, any model depends on the underlying assumptions; hence, the synthetic patterns generated can be erroneous. Second, deriving the model might be computationally expensive. Another argument against this method is that if pattern classification is the purpose, then the model itself can be used without generating any patterns at all! Instance-based pattern synthesis, on the other hand, uses the given training patterns and some of the properties of the data. It can generate only a finite number of
new patterns. Computationally, this method can be less expensive than deriving a model. It is especially useful for nonparametric methods, such as NNC- and Parzenwindow-based density estimation (Duda et al., 2000), which directly use the training instances. Further, this method can also result in reduction of the computational requirements. This article presents two instance-based pattern synthesis techniques called overlap-based pattern synthesis and partition-based pattern synthesis and their corresponding compact representations.
Overlap-based Pattern Synthesis Let F be the set of features (or attributes). There may exist a three-block partition of F, say, {A, B, C}, with the following properties. For a given class, there is a dependency (probabilistic) among features in A U B. Similarly, features in B U C have a dependency. However, features in A (or C ) can affect those in C (or A ) only through features in B. That is, to state it more formally, A and C are statistically independent, given B. Suppose that this is the case and you are given two patterns, X = (a1, b, c1) and Y = (a2, b, c2), such that a1 is a featurevector that can be assigned to the features in A, b to the features in B, and c1 to the features in C. Similarly, a2, b, and c2 are feature-vectors that can be assigned to features in A, B, and C, respectively. Our argument, then, is that the two patterns, (a 1, b, c 2) and (a2, b, c 1), are also valid patterns in the same class or category as X and Y. If these two new patterns are not already in the class of patterns, it is only because of the finite nature of the set. We call this generation of additional patterns an overlap-based pattern synthesis, because this kind of synthesis is possible only if the two given patterns have the same feature-values for features in B. In the given example, feature-vector b is common between X and Y and therefore is called the overlap. This method is suitable only with discrete valued features (can also be of symbolic or categorical types). If more than one such partition exists, then the synthesis technique is applied sequentially with respect to the partitions in some order. One simple example to illustrate this concept is as follows. Consider a supermarket sales database where two records, (bread, milk, sugar) and (coffee, milk, biscuits), are given. Assume that a known dependency exists between (a) bread and milk, (b) milk and sugar, (c) coffee and milk, and (d) milk and biscuits. The two new
records that can then be synthesized are (bread, milk, biscuits) and (coffee, milk, sugar). Here, milk is the overlap. A compact representation in this case is shown in Figure 1, where a path from left to right denotes a data item or pattern. So you get four patterns total (two original and two synthetic patterns) from the graph shown in Figure 1. Association rules derived from association rule mining (Han & Kamber, 2000) can be used to find these kinds of dependencies. Generalization of this concept and its compact representation for large datasets are described in the paragraphs that follow. If the set of features, F, can be arranged in an order such that F = {f1, f 2 , ..., fd } is an ordered set, with fk being the k th feature and all possible three-block partitions can be represented as Pi = {A i, B i, Ci } such that Ai = (f 1, ..., fa ), B i = (f a+1, ..., fb ), and Ci = (fb+1, ..., fd ), then the compact representation called overlap pattern graph is described with the help of an example.
Overlap Pattern Graph (OLP-graph) Let F = (f1, f2, f3, f4, f5). Let two partitions satisfying the conditional independence requirement be P1 = {{f1}, {f2, f3}, {f4, f5}} and P2 = {{f1, f2}, {f3, f4}, {f5}}. Let three given patterns be (a,b,c,d,e), (p,b,c,q,r), and (u,v,c,q,w), respectively. Because (b,c) is common between the 1st and 2nd patterns, two synthetic patterns that can be generated are (a,b,c,q,r) and (p,b,c,d,e). Likewise, three other synthetic patterns that can be generated are (u,v,c,q,r), (p,b,c,q,w), and (a,b,c,q,w). (Note that the last synthetic pattern is derived from two earlier synthetic patterns.) A compact representation called overlap pattern graph (OLP-graph) for the entire set (including both given and synthetic patterns) is shown in Figure 2, where a path from left to right represents a pattern. The graph is constructed by inserting the given patterns, whereas the patterns that can be extracted out of the graph form the entire synthetic set consisting of both original and synthetic patterns. Thus, from the graph in Figure 2, a total of eight patterns can be extracted, five of which are new synthetic patterns. OLP-graph can be constructed by scanning the given dataset only once and is independent of the order in which the given patterns are considered. An approximate method for finding partitions, a method for con-
Figure 2. OLP-graph
Figure 1. A compact representation
struction of OLP-graph, and its application to NNC are described in Viswanath, Murty, and Bhatnagar (2003). For large datasets, this representation drastically reduces the space requirement.
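A small sketch of the overlap-based synthesis step for a single three-block partition {A, B, C}: whenever two patterns agree on the B (overlap) features, their remaining parts are exchanged. It reproduces the supermarket example from the text; the OLP-graph itself is only a compact way of storing the same synthetic set and is not implemented here.

```python
from itertools import combinations

def overlap_synthesis(patterns, b_idx, c_idx):
    """One pass of overlap-based synthesis for a partition {A, B, C}.

    b_idx: indices of the overlap features B; c_idx: indices of the C block.
    Patterns that agree on B exchange their C parts (equivalently, their A parts).
    """
    result = set(patterns)
    for x, y in combinations(patterns, 2):
        if all(x[i] == y[i] for i in b_idx):            # same overlap on B
            n = len(x)
            result.add(tuple(y[i] if i in c_idx else x[i] for i in range(n)))
            result.add(tuple(x[i] if i in c_idx else y[i] for i in range(n)))
    return result

# Supermarket example from the text: overlap B = {1} (milk), C = {2}.
given = {("bread", "milk", "sugar"), ("coffee", "milk", "biscuits")}
for pattern in sorted(overlap_synthesis(given, b_idx=[1], c_idx=[2])):
    print(pattern)
# The two synthetic records (bread, milk, biscuits) and (coffee, milk, sugar)
# are generated alongside the two given ones.
```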
Partition-based Pattern Synthesis Let P = {A1, A2, ..., Ak} be a k-block partition of F such that A1, A2, ..., and Ak are statistically independent subsets. Let Y be the given dataset and ProjA1(Y) be the projection of Y onto the features in the subset A1. Then the cartesian product ProjA1(Y) X ProjA2(Y) X … X ProjAk(Y) is the synthetic set generated using this method. The partition P satisfying the above requirement can be obtained from domain knowledge or from association rule mining. Approximate partitioning methods using mutual information or pair-wise correlation between the features can also be used to get suitable partitions and are experimentally demonstrated to work well with NNC in Viswanath, Murty, and Bhatnagar (2004).
Partitioned Pattern Count Tree (PPC-tree) This compact representation, suitable for storing the partition-based synthetic patterns, is based on the Pattern Count Tree (PC-tree) (Ananthanarayana, Murty, & Subramanian, 2001), which is a prefix-tree-like structure. PPC-tree is a generalization of PC-tree such that a PC-tree is built for each of the projected patterns, that is, ProjAi(Y) for i = 1 to k. PPC-tree is a more compact structure than PC-tree; it can be built by scanning the dataset only once and is independent of the order in which the given patterns are considered. The construction method and properties of PPC-tree can be found in Viswanath et al. (2004). As an example, consider four given patterns, (a,b,c,d,e,f), (a,u,c,d,v,w), (a,b,p,d,e,f), and (a,b,c,d,v,w), and a two-block partition, where the first block consists of the first three features and the second block consists of the remaining three features. The corresponding PPC-tree is shown in Figure 3, where each node contains a feature-value and an integer called count, which gives the number of patterns sharing that node. The figure shows two PC-trees, one corresponding to each block. A path in the PC-tree for block 1 (from root to leaf) concatenated with a path in the PC-tree for block 2 gives a pattern that can be extracted. So a total of six patterns can be extracted from the structures shown in Figure 3.
Figure 3. PPC-tree
An Application of Synthetic Patterns with Nearest Neighbor Classifier
Classification and prediction methods are among various elements of data mining and knowledge discovery, including association rule mining, clustering, link analysis, rule induction, and so forth (Wang, 2003). The nearest neighbor classifier (NNC) is a very popular nonparametric classifer (Dasarathy, 2002). It is widely used in various fields because of its simplicity and good performance. Its conceptual equivalents in other fields (such as artificial intelligence) are instancebased learning, lazy learning, memory-based reasoning, case-based reasoning, and so forth. Theoretically, with an infinite number of samples, its error rate is upper-bounded by twice the error of Bayes classifier (Duda et al., 2000). NNC is a lazy classifier in the sense that no general model (like a decision tree) is built until a new sample needs to be classified. So NNC can easily adapt to situations where the dataset changes frequently, but computational requirements and curse of dimensionality are the two major issues that need to be addressed in order to make the classifier suitable for data mining applications (Dasarathy, 2002). Prototype selection (Susheela Devi & Murty, 2002), feature selection (Guyon & Elisseeff, 2003), feature extraction (Terabe et al., 2002), compact representations of the datasets (Khan, Ding, & Perrizo, 2002), and bootstrapping (Efron, 1979; Hamamoto et al., 1997) are some of the remedies to tackle these problems. But all these techniques need to be followed one after the other; that is, they cannot be combined together. Pattern synthesis techniques and compact representations described in this article provide a unified solution. Efficient implementations of NNC that can directly work with OLP-graph are given in (Viswanath et al., 2003). These implementations use dynamic programming techniques to reuse the partial distance computations and thus reduce the classification time requirement. Partition-based pattern synthesis is presented in (Viswanath et al., 2004), where an efficient NNC implementation with constant classification time is presented. These methods are, in general, based on the divide-andconquer approach and are efficient to work directly with the compact representation. Thus, the computational requirements are reduced. Because the total number of patterns considered (i.e., the size of the synthetic set) is much larger than those in the given training set, the effect of curse of dimensionality is reduced.
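The sketch below illustrates, under simplifying assumptions, how the partition-based synthetic set can serve a nearest neighbor search without being materialized: each block keeps its distinct projected sub-patterns (a flattened stand-in for that block's PC-tree), and for a distance that decomposes over blocks (Hamming distance here) the per-block minima can simply be added. This is an illustration of the divide-and-conquer idea, not the authors' exact implementation.

```python
def build_blocks(patterns, blocks):
    """For each block (a list of feature indices), keep the distinct projections.

    The set of projections per block stands in for that block's PC-tree; the
    implicit synthetic set is the cartesian product of the blocks.
    """
    return [{tuple(p[i] for i in blk) for p in patterns} for blk in blocks]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def nearest_synthetic_distance(query, blocks, block_store):
    """Distance from `query` to its nearest neighbor in the synthetic set.

    Because Hamming distance decomposes over blocks, the per-block minima
    can be combined without enumerating the cartesian product.
    """
    total = 0
    for blk, projections in zip(blocks, block_store):
        q_proj = tuple(query[i] for i in blk)
        total += min(hamming(q_proj, p) for p in projections)
    return total

# Example from the text: 4 patterns, blocks = first three and last three features.
patterns = [tuple("abcdef"), tuple("aucdvw"), tuple("abpdef"), tuple("abcdvw")]
blocks = [[0, 1, 2], [3, 4, 5]]
store = build_blocks(patterns, blocks)          # 3 x 2 projections -> 6 synthetic patterns
print(nearest_synthetic_distance(tuple("abcdew"), blocks, store))   # -> 1
```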
FUTURE TRENDS Integration of various data mining tasks, such as association rule mining, feature selection, feature extraction, classification, and so forth, by using compact and generalized abstractions is a very promising direction. Apart from NNC, other applications using synthetic patterns and corresponding compact representations, including decision tree induction, rule deduction, clustering, and so forth, are also interesting areas that can give valuable results.
CONCLUSION Pattern synthesis is a novel technique that can enhance the generalization ability of lazy learning methods such as NNC by reducing the curse of dimensionality effect; also, the corresponding compact representations can reduce the computational requirements. In essence, it derives a compact and generalized abstraction of the given dataset. Overlap-based pattern synthesis and partition-based pattern synthesis are two instance-based pattern synthesis methods where OLP-graph and PPCtree are respective compact storage structures.
REFERENCES Ananthanarayana, V. S., Murty, M. N., & Subramanian, D. K. (2001). An incremental data mining algorithm for compact realization of prototypes. Pattern Recognition, 34, 2249-2251. Dasarathy, B. V. (2002). Data mining tasks and methods: Classification: Nearest-neighbor approaches. In Handbook of data mining and knowledge discovery (pp. 288298). New York: Oxford University Press. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.) Wiley. Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annual Statistics, 7, 1-26. Hamamoto, Y., Uchimura, S., & Tomita, S. (1997). A bootstrap technique for nearest neighbor classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 73-79. Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.
Guyon, I., & Elisseeff, A.(2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182. Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest neighbor classification on spacial data streams using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 517-528), Taiwan. Susheela Devi, V., & Murty, M. N. (2002). An incremental prototype set building technique. Pattern Recognition Journal, 35, 505-513. Terabe, M., Washio, T., Motoda, H., Katai, O., & Sawaragi, T. (2002). Attribute generation based on association rules. Knowledge and Information Systems, 4(3), 329-349. Viswanath, P., Murty, M. N., & Bhatnagar, S. (2003). Overlap pattern synthesis with an efficient nearest neighbor classifier (Indian Institute of Science Tech. Rep. 1/ 2003). Retrieved from http://keshava.csa.iisc.ernet.in/ techreports/2003/overlap.pdf Viswanath, P., Murty, M. N., & Bhatnagar, S. (2004). Fusion of multiple approximate nearest neighbor classifiers for fast and efficient classification [Electronic version]. Information Fusion, 5(4), 239-250. Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.
KEY TERMS Bootstrapping: Generating artificial patterns from the given original patterns. (This does not mean that the artificial set is larger in size than the original set; also, artificial patterns need not be distinct from the original patterns.) Curse of Dimensionality Effect: When the number of samples needed to estimate a function (or model) grows exponentially with the dimensionality of the data. Compact and Generalized Abstraction of the Training Set: A compact representation built by using the training set from which not only the original patterns but also some new synthetic patterns can be derived. Prototype Selection: The process of selecting a few representative samples from the given training set suitable for the given task. Training Set: The set of patterns whose class labels are known and that are used by the classifier in classifying a given pattern.
Physical Data Warehousing Design Ladjel Bellatreche LISI/ENSMA, France Mukesh Mohania IBM India Research Lab, India
INTRODUCTION Recently, organizations have increasingly emphasized applications in which current and historical data are analyzed and explored comprehensively, identifying useful trends and creating summaries of the data in order to support high-level decision making. Every organization keeps accumulating data from different functional units, so that they can be analyzed (after integration), and important decisions can be made from the analytical results. Conceptually, a data warehouse is extremely simple. As popularized by Inmon (1992), it is a “subjectoriented, integrated, time-invariant, non-updatable collection of data used to support management decisionmaking processes and business intelligence”. A data warehouse is a repository into which are placed all data relevant to the management of an organization and from which emerge the information and knowledge needed to effectively manage the organization. This management can be done using data-mining techniques, comparisons of historical data, and trend analysis. For such analysis, it is vital that (1) data should be accurate, complete, consistent, well defined, and time-stamped for informational purposes; and (2) data should follow business rules and satisfy integrity constraints. Designing a data warehouse is a lengthy, time-consuming, and iterative process. Due to the interactive nature of a data warehouse application, having fast query response time is a critical performance goal. Therefore, the physical design of a warehouse gets the lion’s part of research done in the data warehousing area. Several techniques have been developed to meet the performance requirement of such an application, including materialized views, indexing techniques, partitioning and parallel processing, and so forth. Next, we briefly outline the architecture of a data warehousing system.
BACKGROUND The conceptual architecture of a data warehousing system is shown in Figure 1. Data of a warehouse is extracted from operational databases (relational, object-oriented,
or relational-object) and external sources (legacy data, other files formats) that may be distributed, autonomous, and heterogeneous. Before integrating this data into a warehouse, it should be cleaned to minimize errors and to fill in missing information, when possible, and transformed to reconcile semantic conflicts that can be found in the various sources. The cleaned and transformed data are integrated finally into a warehouse. Since the sources are updated periodically, it is necessary to refresh the warehouse. This component also is responsible for managing the warehouse data, creating indices on data tables, partitioning data, and updating meta-data. The warehouse data contain the detail data, summary data, consolidated data, and/or multi-dimensional data. The data typically are accessed and analyzed using tools, including OLAP query engines, data mining algorithms, information, visualization tools, statistical packages, and report generators. The meta-data generally is held in a separate repository. The meta-data contain the informational data about the creation, management, and usage of tools (e.g., analytical tools, report writers, spreadsheets and data-mining tools) for analysis and informational purposes. Basically, the OLAP server interprets client queries (the client interacts with the front-end tools and passes these queries to the OLAP server) and converts them into complex SQL queries required to access the warehouse data. It also might access the data warehouse. It serves as a bridge between the users of the warehouse and the data contained in it. The warehouse data also are accessed by the OLAP server to present the data in a multi-dimensional way to the front-end tools. Finally, the OLAP server passes the multi-dimensional views of data to the frontend tools, which format the data according to the client’s requirements. The warehouse data are typically modeled multidimensionally. The multi-dimensional data model has been proved to be the most suitable for OLAP applications. OLAP tools provide an environment for decision making and business modeling activities by supporting ad-hoc queries. There are two ways to implement a multidimensional data model: (1) by using the underlying relational architecture (star schemas, snowflake schemas)
Figure 1. A data warehousing architecture: information sources (relational, legacy, network, and other sources) feed a warehouse creation and management component (select, clean, transform, integrate, refresh) that maintains the warehouse data and meta-data; an OLAP server (ROLAP/MOLAP) sits between the warehouse and the front-end tools (analysis, report generators, data mining, and others) used by clients.
to project a pseudo-multi-dimensional model (example includes Informix Red Brick Warehouse); and (2) by using true multi-dimensional data structures such as, arrays (example includes Hyperion Essbase OLAP Server Hyperion). The advantage of MOLAP architecture is that it provides a direct multi-dimensional view of the data whereas the ROLAP architecture is just a multi-dimensional interface to relational data. On the other hand, the ROLAP architecture has two major advantages: (i) it can be used and easily integrated into other existing relational database systems; and (ii) relational data can be stored more efficiently than multi-dimensional data. Data warehousing query operations include standard SQL operations, such as selection, projection, and join. In addition, it supports various extensions to aggregate functions, such as percentile functions (e.g., top 20th percentile of all products), rank functions (e.g., top 10 products), mean, mode, and median. One of the important extensions to the existing query language is to support multiple group-by, by defining roll-up, drill-down, and cube operators. Roll-up corresponds to doing further group-by on the same data object. Note that roll-up operator is order sensitive; that is, when it is defined in the extended SQL, the order of columns (attributes) matters. The function of a drill-down operation is the opposite of roll-up.
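As a small illustration of roll-up over ROLAP-style data, the pandas sketch below aggregates a toy sales table by (region, product) and then rolls up to region alone; the table contents are invented and pandas is assumed to be available.

```python
import pandas as pd

# Toy fact table: one row per sale, with two dimensions and one measure.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["ABC",   "XYZ",   "ABC",   "ABC",   "XYZ"],
    "amount":  [100.0,   250.0,   80.0,    120.0,   60.0],
})

# Fine-grained group-by: sum of sales per (region, product).
by_region_product = sales.groupby(["region", "product"])["amount"].sum()

# Roll-up: drop the product dimension and re-aggregate per region.
by_region = by_region_product.groupby(level="region").sum()

print(by_region_product)
print(by_region)
```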
OLTP vs. OLAP
Relational database systems (RDBMS) are designed to record, retrieve, and manage large amounts of real-time transaction data and to keep the organization running by supporting daily business transactions (e.g., update transactions). These systems generally are tuned for a large community of users, and the user typically knows what is needed and generally accesses a small number of rows in a single transaction. Indeed, relational database systems are suited for robust and efficient Online Transaction Processing (OLTP) on operational data. Such OLTP applications have driven the growth of the DBMS industry in the past three decades and will doubtless continue to be important. One of the main objectives of relational systems is to maximize transaction throughput and minimize concurrency conflicts. However, these systems generally have limited decision support functions and do not extract all the necessary information required for faster, better, and intelligent decision making for the growth of an organization. For example, it is hard for an RDBMS to answer the following query: What are the supply patterns for product ABC in New Delhi in 2003, and how were they different from the year 2002? Therefore, it has become important to support analytical processing capabilities in organizations for (1) the efficient management of organizations, (2) effective marketing strategies, and (3) efficient and intelligent decision making. OLAP tools are well suited for complex data analysis, such as multi-dimensional data analysis, and for decision support activities that access data from a separate repository called a data warehouse, which collects data from many operational, legacy, and possibly heterogeneous data sources. The following table summarizes the differences between OLTP and OLAP.
MAIN THRUST
Decision-support systems demand speedy access to data, no matter how complex the query. To satisfy this objective, many optimization techniques exist in the literature. Most of these techniques are inherited from traditional relational database systems. Among them are materialized views (Bellatreche et al., 2000; Jixue et al., 2003; Mohania et al., 2000; Sanjay et al., 2000, 2001), indexing methods (Chaudhuri et al., 1999; Jürgens et al., 2001; Stockinger et al., 2002), data partitioning (Bellatreche et al., 2000, 2002, 2004; Gopalkrishnan et al., 2000; Kalnis et al., 2002; Oracle, 2000; Sanjay et al., 2004), and parallel processing (Datta et al., 1998).
Table 1. Comparison between OLTP and OLAP applications

Criteria | OLTP | OLAP
User | Clerk, IT professional | Decision maker
Function | Day-to-day operations | Decision support
DB design | Application-oriented | Subject-oriented
Data | Current | Historical, consolidated
View | Detailed, flat relation | Summarized, multi-dimensional
Usage | Structured, repetitive | Ad hoc
Unit of work | Short, simple transaction | Complex query
Access | Read/Write | Append mostly
Records accessed | Tens | Millions
Users | Thousands | Tens
Size | MB-GB | GB-TB
Metric | Transaction throughput | Query throughput

Materialized Views Materialized views are used to precompute and store aggregated data, such as sum of sales. They can also be used to precompute joins, with or without aggregations. So, materialized views are used to reduce the overhead associated with expensive joins or aggregations for a large or important class of queries. Two major problems related to materializing the views are (1) the view-maintenance problem and (2) the view-selection problem. Data in the warehouse can be seen as materialized views generated from the underlying multiple data sources. Materialized views are used to speed up query processing on large amounts of data. These views need to be maintained in response to updates in the source data. This often is done using incremental techniques that access data from underlying sources. In a data-warehousing scenario, accessing base relations can be difficult; sometimes data sources may be unavailable, since these relations are distributed across different sources. For these reasons, self-maintainability of the view is an important issue in data warehousing. The warehouse views can be made self-maintainable by materializing some additional information, called auxiliary relations, derived from the intermediate results of the view computation. Several algorithms, such as the counting algorithm and the exact-change algorithm, have been proposed in the literature for maintaining materialized views. To answer the queries efficiently, a set of views that are closely related to the queries is materialized at the data warehouse. Note that not all possible views are materialized, as we are constrained by some resource like disk space, computation time, or maintenance cost. Hence, we need to select an appropriate set of views to materialize under some resource constraint. The view selection problem (VSP) consists of selecting a set of materialized views that satisfies the query response time under some resource constraints. All studies showed that this problem is NP-hard. Most of the proposed algorithms for the VSP are
static. This is because each algorithm starts with a set of frequently asked queries (a priori known) and then selects a set of materialized views that minimize the query response time under some constraint. The selected materialized views will be a benefit only for a query belonging to the set of a priori known queries. The disadvantage of this kind of algorithm is that it contradicts the dynamic nature of decision support analysis. Especially for adhoc queries, where the expert user is looking for interesting trends in the data repository, the query pattern is difficult to predict.
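A minimal sketch of a static view-selection heuristic of the kind discussed above: given candidate views with estimated sizes and estimated query-cost savings, greedily pick views by benefit per unit of storage until the space budget is exhausted. The candidates and numbers are invented, and real VSP algorithms use much richer cost and maintenance models.

```python
def greedy_view_selection(candidates, space_budget):
    """Pick materialized views under a storage constraint.

    candidates: dict view_name -> (size, benefit), where benefit is the
    estimated saving in query cost if the view is materialized.
    Greedy by benefit per unit of size.
    """
    chosen, used = [], 0.0
    ranked = sorted(candidates.items(),
                    key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (size, benefit) in ranked:
        if used + size <= space_budget:
            chosen.append(name)
            used += size
    return chosen, used

# Hypothetical candidate views for a sales warehouse (sizes and benefits invented).
candidates = {
    "sales_by_day_product":  (400.0, 900.0),
    "sales_by_month_region": (50.0,  500.0),
    "sales_by_year":         (5.0,   120.0),
    "sales_by_customer":     (600.0, 650.0),
}
print(greedy_view_selection(candidates, space_budget=500.0))
```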
Indexing Techniques Indexing has been the foundation of performance tuning for databases for many years. It creates access structures that provide faster access to the base data relevant to the restriction criteria of queries. The size of the index structure should be manageable, so that benefits can be accrued by traversing such a structure. The traditional indexing strategies used in database systems do not work well in data warehousing environments. Most OLTP transactions typically access a small number of rows; most OLTP queries are point queries. B-trees, which are used in the most common relational database systems, are geared toward such point queries. They are well suited for accessing a small number of rows. An OLAP query typically accesses a large number of records for summarizing information. For example, an OLTP transaction would typically query for a customer who booked a flight on TWA 1234 on April 25, for instance; on the other hand, an OLAP query would be more like give me the number of customers who booked a flight on TWA 1234 in one month, for example. The second query would access more records that are of a type of range queries. B-tree indexing scheme is not suited to answer OLAP queries efficiently. An index can be a single-column or multi-column table (or view). An index either can be clustered or non-clustered. An index can be defined on
one table (or view) or many tables using a join index. In the data warehouse context, when we talk about index, we refer to two different things: (1) indexing techniques and (2) the index selection problem. A number of indexing strategies have been suggested for data warehouses: value-list index, projection index, bitmap index, bit-sliced index, data index, join index, and star join index.
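The bitmap index mentioned above can be sketched in a few lines: for a low-cardinality column, one bit vector per distinct value marks the rows holding that value, and a conjunctive restriction becomes a bitwise AND of bit vectors. This toy version uses Python integers as bit vectors and invented rows.

```python
def build_bitmap_index(rows, column):
    """One bit vector (stored as an int) per distinct value of `column`."""
    index = {}
    for pos, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << pos)
    return index

rows = [
    {"loan_type": "car-loan",  "status": "approved"},
    {"loan_type": "home-loan", "status": "rejected"},
    {"loan_type": "car-loan",  "status": "rejected"},
    {"loan_type": "car-loan",  "status": "approved"},
]
by_type = build_bitmap_index(rows, "loan_type")
by_status = build_bitmap_index(rows, "status")

# Rows with loan_type = 'car-loan' AND status = 'approved': bitwise AND of bitmaps.
hits = by_type["car-loan"] & by_status["approved"]
print([pos for pos in range(len(rows)) if hits >> pos & 1])   # -> [0, 3]
```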
Data Partitioning and Parallel Processing The data partitioning process decomposes large tables (fact tables, materialized views, indexes) into multiple (relatively) small tables by applying the selection operators. Consequently, the partitioning offers significant improvements in availability, administration, and table scan performance Oracle9i. Two types of partitioning are possible to decompose a table: vertical and horizontal. In the vertical fragmentation, each partition consists of a set of columns of the original table. In the horizontal fragmentation, each partition consists of a set of rows of the original table. Two versions of horizontal fragmentation are available: primary horizontal fragmentation and derived horizontal fragmentation. The primary horizontal partitioning (HP) of a relation is performed using predicates that are defined on that table. On the other hand, the derived partitioning of a table results from predicates defined on another relation. In a context of ROLAP, the data partitioning is applied as follows (Bellatreche et al., 2002): it starts by fragmenting dimension tables, and then, by using the derived horizontal partitioning, it decomposes the fact table into several fact fragments. Moreover, by partitioning data of ROLAP schema (star schema or snowflake schema) among a set of processors, OLAP queries can be executed in a parallel, potentially achieving a linear speedup and thus significantly improving query response time (Datta et al., 1998). Therefore, the data partitioning and the parallel processing are two complementary techniques to achieve the reduction of query processing cost in data warehousing environments.
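To illustrate derived horizontal partitioning, the sketch below fragments a toy fact table according to a partition of its dimension table and answers a restricted query by scanning a single fragment. Table layouts and predicates are invented; in a real warehouse the fragmentation would be defined declaratively in the DBMS.

```python
# Primary horizontal partitioning of the dimension table Customer by region,
# then derived partitioning of the fact table Sales via the foreign key.
customers = [
    {"cust_id": 1, "region": "Europe"},
    {"cust_id": 2, "region": "Asia"},
    {"cust_id": 3, "region": "Europe"},
]
sales = [
    {"cust_id": 1, "amount": 100.0},
    {"cust_id": 2, "amount": 40.0},
    {"cust_id": 3, "amount": 75.0},
    {"cust_id": 2, "amount": 10.0},
]

# Primary fragments of Customer, one per region.
customer_fragments = {}
for c in customers:
    customer_fragments.setdefault(c["region"], set()).add(c["cust_id"])

# Derived fragments of Sales: each fact row follows its customer's fragment.
sales_fragments = {
    region: [s for s in sales if s["cust_id"] in ids]
    for region, ids in customer_fragments.items()
}

# A query restricted to European customers scans a single fragment,
# avoiding both a full scan and the join with Customer.
print(sum(s["amount"] for s in sales_fragments["Europe"]))   # -> 175.0
```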
FUTURE TRENDS
It has been seen that many enterprises are moving toward building Operational Data Store (ODS) solutions for real-time business analysis. The ODS gets data from one or more Enterprise Resource Planning (ERP) systems and keeps the most recent version of information for analysis rather than the history of data. Since Customer Relationship Management (CRM) offerings have evolved, there is a need for active integration of CRM with the ODS for real-time consulting and marketing (i.e., how to integrate the ODS with CRM via a messaging system for real-time business analysis). Another recent trend is that many enterprises are moving from data warehousing solutions to information integration (II). II refers to a category of middleware that lets applications access data as though they were in a single database, whether or not they are. It enables the integration of data and content sources to provide real-time read and write access, to transform data for business analysis and data warehousing, and to support data placement for performance, currency, and availability. That is, we envisage that there will be more focus on integrating data and content rather than only integrating structured data, as is done in data warehousing.
CONCLUSION
Data warehousing design is quite different from that of transactional database systems, commonly referred to as Online Transaction Processing (OLTP) systems. A data warehouse tends to be extremely large, and the information in a warehouse usually is analyzed in a multi-dimensional way. The main objective of data warehousing design is to facilitate efficient query processing and maintenance of materialized views. To achieve this objective, it is important that the relevant data is materialized in the warehouse. Therefore, the problems of selecting materialized views and maintaining them are very important and have been addressed in this article. To further reduce the query processing cost, the data can be partitioned: partitioning helps in reducing irrelevant data access and eliminates costly joins, although partitioning at too fine a granularity can instead increase data access and processing cost. The third problem is index selection. We found that judicious index selection does reduce the cost of query processing, but we also showed that indices on materialized views improve the performance of queries even more. Since indices and materialized views compete for the same resource (storage), we found that it is possible to apply heuristics to distribute the storage space among materialized views and indices so as to efficiently execute queries and maintain the materialized views and indexes. It has been seen that enterprises are moving toward building data warehousing and operational data store solutions.
REFERENCES
Bellatreche, L. et al. (2000). What can partitioning do for your data warehouses and data marts? Proceedings of the International Database Engineering and Applications Symposium (IDEAS), Yokohama, Japan.

Bellatreche, L. et al. (2002). PartJoin: An efficient storage and query execution for data warehouses. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DAWAK), Aix-en-Provence, France.

Bellatreche, L. et al. (2004). Bringing together partitioning, materialized views and indexes to optimize performance of relational data warehouses. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DAWAK), Zaragoza, Spain.

Bellatreche, L., Karlapalem, K., & Schneider, M. (2000). On efficient storage space distribution among materialized views and indices in data warehousing environments. Proceedings of the International Conference on Information and Knowledge Management (ACM CIKM), McLean, VA, USA.

Chaudhuri, S., & Narasayya, V. (1999). Index merging. Proceedings of the International Conference on Data Engineering (ICDE), Sydney, Australia.

Datta, A., Moon, B., & Thomas, H.M. (2000). A case for parallelism in data warehousing and OLAP. Proceedings of the DEXA Workshop, London, UK.

Gopalkrishnan, V., Li, Q., & Karlapalem, K. (2000). Efficient query processing with associated horizontal class partitioning in an object relational data warehousing environment. Proceedings of the International Workshop on Design and Management of Data Warehouses, Stockholm, Sweden.

Hyperion. (2000). Hyperion Essbase OLAP Server. http://www.hyperion.com/

Inmon, W.H. (1992). Building the data warehouse. John Wiley.

Jixue, L., Millist, W., Vincent, M., & Mohania, K. (2003). Maintaining views in object-relational databases. Knowledge and Information Systems, 5(1), 50-82.

Jürgens, M., & Lenz, H. (2001). Tree based indexes versus bitmap indexes: A performance study. International Journal of Cooperative Information Systems (IJCIS), 10(3), 355-379.

Kalnis, P. et al. (2002). An adaptive peer-to-peer network for distributed caching of OLAP results. Proceedings of the International Conference on Management of Data (ACM SIGMOD), Wisconsin, USA.

Mohania, M., & Kambayashi, Y. (2000). Making aggregate views self-maintainable. Data & Knowledge Engineering, 32(1), 87-109.

Oracle Corp. (2004). Oracle9i enterprise edition partitioning option. Retrieved from http://otn.oracle.com/products/oracle9i/datasheets/partitioning.html

Sanjay, A., Chaudhuri, S., & Narasayya, V.R. (2000). Automated selection of materialized views and indexes in Microsoft SQL Server. Proceedings of the 26th International Conference on Very Large Data Bases (VLDB'2000), Cairo, Egypt.

Sanjay, A., Chaudhuri, S., & Narasayya, V. (2001). Materialized view and index selection tool for Microsoft SQL Server 2000. Proceedings of the International Conference on Management of Data (ACM SIGMOD), CA.

Sanjay, A., Narasayya, V., & Yang, B. (2004). Integrating vertical and horizontal partitioning into automated physical database design. Proceedings of the International Conference on Management of Data (ACM SIGMOD), Paris, France.

Stockinger, K. (2000). Bitmap indices for speeding up high-dimensional data analysis. Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), London, UK.
KEY TERMS

Bitmap Index: Consists of a collection of bitmap vectors, each of which is created to represent each distinct value of the indexed column. A bit i in a bitmap vector, representing value x, is set to 1 if the record i in the indexed table contains x.

Cube: A multi-dimensional representation of data that can be viewed from different perspectives.

Data Warehouse: An integrated decision support database whose content is derived from the various operational databases. A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.

Dimension: A business perspective useful for analyzing data. A dimension usually contains one or more hierarchies that can be used to drill up or down to different levels of detail.

Dimension Table: A table containing the data for one dimension within a star schema. The primary key is used to link to the fact table, and each level in the dimension has a corresponding field in the dimension table.

Fact Table: The central table in a star schema, containing the basic facts or measures of interest. Dimension fields are also included (as foreign keys) to link to each dimension table.

Horizontal Partitioning: Distributing the rows of a table into several separate tables.

Join Index: Built by translating restrictions on the column value of a dimension table to restrictions on a large fact table. The index is implemented using one of two representations: row id or bitmap, depending on the cardinality of the indexed column.

Legacy Data: Data that you already have and use. Most often, this takes the form of records in an existing database on a system in current use.

Measure: A numeric value stored in a fact table or cube. Typical examples include sales value, sales volume, price, stock, and headcount.

Star Schema: A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for star schema.
Predicting Resource Usage for Capital Efficient Marketing
D.R. Mani, Massachusetts Institute of Technology, USA, and Harvard University, USA
Andrew L. Betz, Progressive Insurance, USA
James H. Drew, Verizon Laboratories, USA

INTRODUCTION
A structural conflict exists in businesses that sell services whose production costs are discontinuous and whose consumption is continuous but variable. A classic example is in businesses where capital-intensive infrastructure is necessary for provisioning service, but the capacity resulting from capital outlay is not always fully and efficiently utilized. Marketing departments focus on initiatives that increase infrastructure usage to improve both customer retention and ongoing revenue. Engineering and operations departments focus on the cost of service provision to improve the capital efficiency of revenue dollars received. Consequently, a marketing initiative to increase infrastructure usage may be resisted by engineering, if its introduction would require great capital expense to accommodate that increased usage. This conflict is exacerbated when a usage-enhancing initiative tends to increase usage variability so that capital expenditures are triggered with only small increases in total usage. A data warehouse whose contents encompass both these organizational functions has the potential to mediate this conflict, and data mining can be the tool for this mediation. Marketing databases typically have customer data on rate plans, usage, and past responses to marketing promotions. Engineering databases generally record infrastructure locations, usages, and capacities. Other information often is available from both general domains to allow for the aggregation, or clustering, of customer types, rate plans, and marketing promotions, so that marketing proposals and their consequences can be evaluated systematically to aid in decision making. These databases generally contain such voluminous or complicated data that classical data analysis tools are inadequate. In this article, we look at a case study where data mining is applied to predicting capital-intensive resource or infrastructure usage, with the goal of guiding marketing decisions to enable capital-efficient marketing. Although the data mining models developed in this article do not
provide conclusive positions on specific marketing initiatives, and their engineering consequences, the usage revenues, and infrastructure performance predicted by these models provide systematic, sound, and quantitative input for making balanced and cost-effective business decisions.

Figure 1. Marketing initiatives to boost average demand can indirectly increase peak demand to beyond capacity. (The plot relates capital outlay to resource capacity, showing the capacity, peak demand, and average demand curves.)
BACKGROUND In this business context, applying data mining (Abramowicz & Zurada, 2000; Berry & Linoff, 2004; Han & Kamber, 2000) to capital efficient marketing is illustrated here by a study from wireless telephony 1 (Green, 2000), where marketing plans2 introduced to utilize excess off-peak network capacity3 (see Figure 1) potentially could result in requiring fresh capital outlays by indirectly driving peak demand to levels beyond current capacity. We specifically consider marketing initiatives (e.g., rate plans with free nights and weekends) that are aimed at stimulating off-peak usage. Given a rate plan with a fixed peak minute allowance, availability of extra off-peak min-
utes could potentially increase peak usage. The quantification of this effect is complicated by the corporate reality of myriad rate plans and geographically extensive and complicated peak usage patterns. In this study, we use data mining methods to analyze customer, call detail, rate plan, and cell-site location data to predict the effect of marketing initiatives on busy-hour4 network utilization. This will enable forecasting network cost of service for marketing initiatives, thereby leading to optimization of capital outlay.

Figure 2. Flowchart describing data sources and data mining operations used in predicting busy-hour impact of marketing initiatives
MAIN THRUST Ideally, the capital cost of a marketing initiative is obtained by determining the existing capacity, the increased capacity required under the new initiative, and then factoring the cost of the additional capital; data for a study like this would come from a corporate data warehouse (Berson & Smith, 1997) that integrates data from relevant sources. Unfortunately, such detailed cost data are not available in most corporations and businesses. In fact, in many situations, the connection between promotional marketing initiatives and capital cost is not even recognized. In this case study, we therefore need to assemble relevant data from different and disparate sources in order to predict the busy-hour impact of marketing initiatives.
Data The parallelograms in the flowchart in Figure 2 indicate essential data sources for linking marketing initiatives to busy-hour usage. Customer data characterize the customer by indicating a customer’s mobile phone number(s),
lifetime value5, access charge, subscribed rate plan, and peak and off-peak minutes used. Rate plan data provide details for a given rate plan, including monthly charges, allowed peak, off-peak, weekend minutes of use, perminute charges for excess use, long distance, roaming charges, and so forth. Call detail data, for every call placed in a given time period, provide the originating and terminating phone numbers (and, hence, originating and terminating customers), cell sites used in handling the call, call duration, and other call details. Cell site location data indicate the geographic location of cell sites, capacity of each cell site, and details about the physical and electromagnetic configuration of the radio towers.
Data Mining Process
Figure 2 provides an overview of the analysis and data-mining process. The numbered processes are described in more detail to illustrate how the various components are integrated into an exploratory tool that allows marketers and network engineers to evaluate the effect of proposed initiatives.

1. Cell-Site Clustering: Clustering cell sites using the geographic location (latitude, longitude) results in cell site clusters that capture the underlying population density, with cluster area generally inversely proportional to population. This is a natural consequence of the fact that heavily populated urban areas tend to have more cell towers to cover the large call volumes and provide good signal coverage. The flowchart for cell site clustering is included in Figure 3, with results of k-means clustering (Hastie, Tibshirani & Friedman, 2001) for the San Francisco area shown in Figure 4.

2. Predictive Modeling: The predictive modeling stage, shown in Figure 3, merges customer and rate plan data to provide the data attributes (features) used for modeling. The target for modeling is obtained by combining cell site clusters with network usage information. Note that the sheer volume of call detail data makes its summarization and merging with other customer and operational data a daunting task; see Berry and Linoff (2000) for a discussion. The resulting dataset relates customer characteristics and rate plan details for every customer, matched with that customer's actual network usage and the cell sites providing service for that customer. This data can then be used to build a predictive model. Feature selection (Liu & Motoda, 1998) is performed based on correlation to the target, with grouped interactions taken into account. The actual predictive model is based on linear regression (Hand, Mannila & Smyth, 2001; Hastie, Tibshirani & Friedman, 2001), which, for this application, performs similarly to neural network models (Haykin, 1994). Note that the sheer number of customer and rate plan characteristics requires the variable reduction capabilities of a data-mining solution.

Figure 3. Steps 1 and 2 in the data mining process outlined in Figure 2

Figure 4. Clustering cell sites in the San Francisco area, based on geographic position. Each spot represents a cell site, with the ovals showing the approximate location of cell site clusters.

For each of the four cell site clusters shown in Figure 4, Figure 5 shows the actual vs. the predicted busy-hour usage for calls initiated during a specific month in the San Francisco region. With statistically significant R2 correlation values ranging from about 0.31 to 0.67, the models, while not perfect, are quite good at predicting cluster-level usage.

Figure 5. Regression modeling results showing predicted vs. actual busy-hour usage for the cell site clusters shown in Figure 4. (Each panel plots predicted against actual busy-hour usage in minutes for one cell cluster; the fitted regressions are y = 0.6615x + 0.3532 with R2 = 0.672 for cluster 1, y = 0.3562x + 0.5269 with R2 = 0.3776 for cluster 2, y = 0.3066x + 0.6334 with R2 = 0.3148 for cluster 3, and y = 0.3457x + 0.4318 with R2 = 0.3591 for cluster 4.)

3. and 4. Customer and Rate Plan Clustering: Using k-means clustering to cluster customers, based on their attributes, and rate plans, based on their salient features, results in categorizing customers and rate plans into a small number of groups. Without such clustering, the complex data combinations and analyses needed to predict busy-hour usage will result in very little predictive power. Clusters ameliorate the situation by providing a small set of representative cases to use in the prediction. Similar to cell site clustering, and based on empirical exploration, we decide on using four clusters for both customers and rate plans. Results of the clustering for customers and applicable rate plans in the San Francisco area are shown in Figures 6 and 7.

Figure 6. Customer clustering flowchart and results of customer clustering. The key attributes identifying the four clusters are access charge (ACC_CHRG), lifetime value (LTV_PCT), peak minutes of use (PEAK_MOU), and total calls (TOT_CALL).

Figure 7. Rate plan clustering flowchart and clustering results. The driving factors that distinguish rate plan clusters include per-minute charges (intersystem, peak, and off-peak) and minutes of use.

5. Integration and Application of the Predictive Model: Stringing all the pieces together, we can start with a proposed rate plan and an inwards projection6 of the expected number of customers who would subscribe to the plan, and predict the busy-hour usage on targeted cell sites. The process is outlined in Figure 8. Inputs labeled A, B, and C come from the respective flowcharts in Figures 2, 6, and 7.

Figure 8. Predicting busy-hour usage at a cell site for a given marketing initiative

Validation
Directly evaluating and validating the analytic model developed would require data summarizing the capital cost of a marketing initiative. This requires determining the existing capacity, the increased capacity required under the new initiative, and then factoring in the cost of the additional capital. Unfortunately, as mentioned earlier, such detailed cost data are generally unavailable. Therefore, we validate the analytic model by estimating its impact and comparing the estimate with known parameters. We began this estimation by identifying the 10 most frequent rate plans in the data warehouse. From this list, we selected one rate plan for validation. Based on marketing area and geographic factors, we assume an equal inwards projection for each customer cluster. Of course, our approach allows for, but does not require, different inward projections for the customer clusters. This flexibility could be useful, for example, when price elasticity data suggest inward projection differences by cluster. Starting at the cluster level, we applied the predictive model to estimate the average busy-hour usage for each customer on each cell tower. These cluster-level predictions were disaggregated to the cellular-tower level by assigning busy-hour minutes proportionate to total minutes for each cell tower within the respective cellular-tower cluster. Following this, we then calculated the actual busy-hour usage per customer of that rate plan across the millions of call records. In Figure 9, a scatter plot of actual busy-hour usage against predicted busy-hour usage, with an individual cell tower now the unit of analysis, reveals an R2 correlation measure of 0.13. The estimated model accuracy dropped from R2s in the mid 0.30s for cluster-level data (Figure 5) to about 0.13 when the data were disaggregated to the cellular tower level (Figure 9). In spite of the relatively low R2 value, the correlation is statistically significant, indicating that this approach can make contributions to the capital estimates of marketing initiatives. However, the model accuracy on disaggregated data was certainly lower than the effects observed at the cluster level. The reasons for this loss in accuracy probably could be attributed to the fine-grained disaggregation, the large variability among the cell sites in terms of busy-hour usage, the proportionality assumption made in disaggregating, and model sensitivity to inward projections. These data point out both an opportunity (that rate plans can be intelligently targeted to specific locations with excess capacity) and a threat (that high busy-hour volatility would lead engineering to be cautious in allowing usage-increasing plans).

Figure 9. Scatter plot showing actual vs. predicted busy-hour usage for each cell site, for a specific rate plan (rate plan RDCDO; the fitted regression is y = 1.31x - 0.1175 with R2 = 0.1257).

Figure 10. Lifetime value (LTV) distributions for customers with below and above average usage and BH usage, indicating that high-LTV customers also impact the network with high BH usage. (Left panel: LTV distribution for customers with less than average usage and BH usage; right panel: LTV distribution for customers with more than average usage and BH usage.)
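The clustering, regression, and proportional-disaggregation steps described above can be outlined in a few lines of code. The sketch below is purely illustrative and is not the authors' implementation: it uses synthetic data, invented feature meanings, and an assumed four-cluster setup to show how cell sites might be clustered on latitude/longitude, a per-cluster linear regression fitted, and cluster-level busy-hour predictions spread over towers in proportion to their total minutes.

```python
# Illustrative outline only: synthetic data, invented columns, assumed 4 clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 1. Cluster cell sites on geographic position (latitude, longitude).
sites = rng.uniform([37.2, -122.6], [38.0, -121.8], size=(200, 2))
site_cluster = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(sites)

# 2. Per-cluster regression: busy-hour minutes ~ customer / rate-plan features.
features = rng.normal(size=(5000, 3))            # e.g., peak MOU, access charge, LTV
bh_usage = features @ [0.5, 0.2, 0.1] + rng.normal(0.0, 0.5, 5000)
cust_cluster = rng.integers(0, 4, size=5000)      # site cluster serving each customer

models = {}
for c in range(4):
    mask = cust_cluster == c
    models[c] = LinearRegression().fit(features[mask], bh_usage[mask])

# 3. Disaggregate each cluster-level busy-hour prediction to individual towers,
#    proportionally to each tower's share of total minutes within its cluster.
total_minutes = rng.uniform(100, 1000, size=len(sites))
for c in range(4):
    towers = np.where(site_cluster == c)[0]
    cluster_bh = models[c].predict(features[cust_cluster == c]).sum()
    share = total_minutes[towers] / total_minutes[towers].sum()
    tower_bh = cluster_bh * share                 # predicted BH minutes per tower
    print(f"cluster {c}: {len(towers)} towers, "
          f"predicted BH total = {cluster_bh:.0f}, first tower = {tower_bh[0]:.1f}")
```

The proportionality assumption in step 3 mirrors the disaggregation described above and is one of the likely sources of the accuracy loss observed at the tower level.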
Business Insights
In the context of the analytic model, the data also can reveal some interesting business insights. First, observe the customer lifetime value density plots as a function of the strain that customers place on the network. The left panel in Figure 10 shows LTV density for customers with below-average total usage and below-average busy-hour usage. The panel on the right shows LTV density for customers with above-average total usage and above-average busy-hour usage. Although the predominant thinking in marketing circles is that higher LTV is always better, the data suggest this reasoning should be tempered by whether the added value in revenue offsets the disproportionate strain on network resources. This is the basis for a fundamental tension between marketing and engineering functions in large businesses. Looking at busy-hour impact by customer cluster and rate plan cluster is also informative, as shown in Figure 11. For example, if we define heavy BH users as customers who are above average in total minutes as well as busy-hour minutes, we can see main effect differences across customer clusters (Figure 11a). This is not entirely surprising, as we have already seen that LTV was a major determinant of customer cluster, and heavy BH customers also skewed towards having higher LTV. However, there was an unexpected crossover interaction of rate plan cluster by customer cluster when heavy BH users were the target (Figure 11b). The implication is that, controlling for revenue, certain rate plan types are more network-friendly, depending on the customer cluster under study. Capital-conscious marketers in theory could tailor their rate plans to minimize network impact by tuning rate plans more precisely to customer segments.

Figure 11. Exploring customer clusters and rate plan clusters in the context of BH usage shows that customers in a cluster who subscribe to a class of rate plans impact BH usage differently from customers in a different cluster. (Panel A plots the proportion of heavy BH users by customer cluster; panel B plots the proportion of heavy BH users by customer cluster and rate plan cluster.)
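A cross-tabulation like the one behind Figure 11 could be computed as in the hedged sketch below. The data here are made up: a "heavy BH user" is flagged as above average in both total minutes and busy-hour minutes, and the proportion of such users is reported per customer cluster and per (customer cluster, rate plan cluster) cell.

```python
# Hedged sketch with synthetic data: proportion of heavy busy-hour users
# by customer cluster (Figure 11a analogue) and by customer x rate plan cluster (11b analogue).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cust_cluster": rng.integers(1, 5, size=10_000),
    "plan_cluster": rng.integers(1, 5, size=10_000),
    "total_min": rng.gamma(2.0, 100.0, size=10_000),
    "bh_min": rng.gamma(1.5, 5.0, size=10_000),
})

# Heavy BH user: above average in total minutes AND in busy-hour minutes.
df["heavy_bh"] = (df["total_min"] > df["total_min"].mean()) & \
                 (df["bh_min"] > df["bh_min"].mean())

print(df.groupby("cust_cluster")["heavy_bh"].mean())
print(df.pivot_table(index="cust_cluster", columns="plan_cluster",
                     values="heavy_bh", aggfunc="mean"))
```

On real data, a crossover pattern in the pivot table is what would suggest matching particular rate plan types to particular customer segments.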
FUTURE TRENDS The proof of concept case study sketched here describes an analytical effort to optimize capital costs of marketing programs, while promoting customer satisfaction and retention by utilizing the business insights gained to tailor marketing programs to better match the needs and requirements of customers (Lenskold, 2003). As database systems incorporate data-mining engines (e.g., Oracle data mining [Oracle, 2004]), future software and customer relationship management applications will automatically incorporate such an analysis to extract optimal recommendations and business insights for maximizing business return on investment and customer satisfaction, leading to effective and profitable one-on-one customer relationship management (Brown, 2000; Gilmore & Pine, 2000; Greenberg, 2001).
CONCLUSION We have made the general argument that a comprehensive company data warehouse and broad sets of models analyzed with modern data mining techniques can resolve tactical differences between different organizations within the company and provide a systematic and sound basis for business decision making. The role of data mining in so doing has been illustrated here in mediating between the marketing and engineering functions in a wireless telephony company, where statistical models are used to
target specific customer segments with rate plan promotions in order to increase overall usage (and, hence, retention), while more efficiently using, but not exceeding network capacity. A further practical advantage of this data-mining approach is that all of the customer groups, cell site locations, and rate plan promotions are simplified through clustering in order to simplify their characteristic representation facilitating productive discussion among upper-management strategists. We have sketched several important practical business insights that come from this methodology. A basic concept to be demonstrated to upper management is that busy-hour usage, the main driver of equipment capacity needs, varies greatly by customer segment and by general cell site grouping (see Figure 9) and that usage can be differently volatile over site groupings (see Figure 5). These differences point out the potential need for targeting specific customer segments with specific rate plan promotions in specific locations. One interesting example is illustrated in Figure 11, where two customer segments respond differently to each of two candidate rate plans, one group decreasing its BH usage under one plan and the other decreasing it under the other plan. This leads to the possibility that certain rate plans should be offered to specific customer groups in locations where there is little excess equipment capacity, but others can be offered where there is more slack in capacity. Of course, there are many organizational, implementation, and deployment issues associated with this integrated approach. All involved business functions must accept the validity of the modeling and its results, and this concurrence requires the active support of upper management overseeing each function. Second, these models should be repeatedly run to assure their relevance in a dynamic business environment, as wireless telephony is in our illustration. Third, provision should be made for capturing the sales and engineering consequences of the decisions made from this approach, to be built into future models. Despite these organizational challenges, business decision makers informed by these models hopefully may resolve a fundamental business rivalry.
REFERENCES

Abramowicz, W., & Zurada, J. (2000). Knowledge discovery for business information systems. Kluwer.

Berry, M.A., & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. Wiley.

Berry, M.A., & Linoff, G.S. (2004). Data mining techniques: For marketing, sales, and customer relationship management. Wiley.

Berson, A., & Smith, S. (1997). Data warehousing, data mining and OLAP. McGraw-Hill.

Brown, S.A. (2000). Customer relationship management: Linking people, process, and technology. John Wiley.

Drew, J., Mani, D.R., Betz, A., & Datta, P. (2001). Targeting customers with statistical and data-mining techniques. Journal of Services Research, 3(3), 205-219.

Gilmore, J., & Pine, J. (2000). Markets of one. Harvard Business School Press.

Green, J.H. (2000). The Irwin handbook of telecommunications. McGraw-Hill.

Greenberg, P. (2001). CRM at the speed of light: Capturing and keeping customers in Internet real time. McGraw-Hill.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. Bradford Books.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Haykin, S. (1994). Neural networks: A comprehensive foundation. Prentice-Hall.

Lenskold, J.D. (2003). Marketing ROI: The path to campaign, customer, and corporate profitability. McGraw-Hill.

Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer.

Mani, D.R., Drew, J., Betz, A., & Datta, P. (1999). Statistics and data mining techniques for lifetime value modeling. Proceedings of the Fifth ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Oracle. (2004). Oracle Corporation. Retrieved from http://otn.oracle.com/products/bi/odm/index.html

KEY TERMS

Busy Hour: The hour at which a mobile telephone network handles the maximum call traffic in a 24-hour period. It is that hour during the day or night when the product of the average number of incoming calls and average call duration is at its maximum.

Capital-Efficient Marketing: Marketing initiatives that explicitly take into account and optimize, if possible, the capital cost of provisioning service for the introduced initiative or promotion.

Feature Selection: The process of identifying those input attributes that contribute significantly to building a predictive model for a specified output or target.

Inwards Projection: Estimating the expected number of customers that would sign on to a marketing promotion, based on customer characteristics and profiles.

k-Means Clustering: A clustering method that groups items that are close together, based on a distance metric like Euclidean distance, to form clusters. The members in each of the clusters can be described succinctly using the mean (or centroid) of the respective cluster.

Lifetime Value: A measure of the profit-generating potential, or value, of a customer; a composite of expected tenure (how long the customer stays with the business provider) and expected revenue (how much a customer spends with the business provider).

Predictive Modeling: Use of statistical or data-mining methods to relate attributes (input features) to targets (outputs) using previously existing data for training, in such a manner that the target can be predicted for new data based on the input attributes alone.

ENDNOTES

1. Wireless telephony here refers to both the engineering and commercial aspects of mobile telecommunications by radio signal, also known as cellular telephony.
2. In this article, the terms marketing plan and marketing initiative refer to promotions and incentives, generally mediated through a rate plan—a mobile calling plan that specifies minute usage allowances at peak and off-peak times, monthly charge, cost of extra minutes, roaming and long-distance charges, and so forth.
3. The capacity of a cellular telephone network is the maximum number of simultaneous calls that can be handled by the system at any given time. Times during the day when call volumes are higher than average are referred to as peak hours; off-peak hours entail minimal call traffic—like nights and weekends.
4. Busy hour (BH) is that hour during which the network utilization is at its peak. It is that hour during the day or night when the product of the average number of incoming calls and average call duration is at its maximum.
5. The well-established and widely used concept of customer lifetime value (LTV) captures expected revenue and expected tenure, preferably personalized for each individual customer (Drew et al., 2001; Mani et al., 1999).
6. Inwards projection is a term used to denote an estimate of the number of customers that would subscribe to a new (or enhanced) rate plan, with details reasonably summarizing characteristics of customers expected to sign on to such a plan.
Privacy and Confidentiality Issues in Data Mining Yücel Saygin Sabanci University, Turkey
INTRODUCTION Data regarding people and their activities have been collected over the years, which has become more pervasive with widespread usage of the Internet. Collected data usually are stored in data warehouses, and powerful data mining tools are used to turn it into competitive advantage. Besides businesses, government agencies are among the most ambitious data collectors, especially in regard to the increase of safety threats coming from global terrorist organizations. For example, CAPPS (Computer Assisted Passenger Prescreening System) collects flight reservation information as well as commercial information about passengers. This data, in turn, can be utilized by government security agencies. Although CAPPS represents US national data collection efforts, it also has an effect on other countries. The following sign at the KLM ticket desk in Amsterdam International Airport illustrates the international level of data collection efforts: “Please note that KLM Royal Dutch Airlines and other airlines are required by new security laws in the US and several other countries to give security customs and immigration authorities access to passenger data. Accordingly, any information we hold about you and your travel arrangements may be disclosed to the concerning authorities of these countries in your itinerary.” This is a very striking example of how the confidential data belonging to citizens of one country could be handed over to authorities of some other country via newly enforced security laws. In fact, some of the largest airline companies in the US, including American, United, and Northwest, turned over millions of passenger records to the FBI, according to the New York Times (Schwartz & Maynard, 2004). Aggressive data collection efforts by government agencies and enterprises have raised a lot of concerns among people about their privacy. In fact, the Total Information Awareness (TIA) project, which aims to build a centralized database that will store the credit card transactions, e-mails, Web site visits, and flight details of Americans was not funded by Congress due to privacy concerns. Although the privacy of individuals is protected by regulations, such regulations may not be enough to ensure privacy against new technologies such as data mining. Therefore, important issues need to be considered by the data collectors and the data owners. First of all, the data itself may contain confidential information
that needs to be secured. More importantly, the privacy of the data owners needs to be taken into consideration, depending on how the data will be used. Privacy risks increase even more when we consider powerful datamining tools that can be used to extract confidential information. We will try to address the privacy and confidentiality issues that relate to data-mining techniques throughout this article.
BACKGROUND Historically, researchers have been more interested in the security aspect of data storage and transfer. Data security is still an active area of research in relation to new data storage and transfer models, such as XML. Also, new media of data transfer, such as wireless, introduce new challenges. The most important work on data security that relates to the privacy and confidentiality issues in data analysis is statistical disclosure control and data security in statistical databases (Adam & Wortmann, 1989). The basic idea of a statistical database is to let users obtain statistical or aggregate information from a database, such as the average income and maximum income in a given department. This is done while limiting the disclosure of confidential information, such as the salary of a particular person. The concept of k-anonymity, which ensures that the obtained values refer to at least k individuals (Sweeney, 2002), was proposed as a measure for the degree of confidentiality in aggregate data. In this way, identification of individuals can be blocked up to a certain degree. However, successive query results over the database may pose a risk to security, since a combination of intelligently issued queries may infer confidential data. The general problem of adversaries being able to obtain sensitive data from nonsensitive data is called the inference problem (Farkas & Jajodia, 2003).
MAIN THRUST
Data mining is considered to be a tool for analyzing large collections of data. As a result, government and industry seek to exploit its potential by collecting extensive information about people and their activities. Coupled with its
power to extract previously unknown knowledge, data mining also has become one of the main targets of privacy advocates. In reference to data mining, there are two main concerns regarding privacy and confidentiality:
(1) Protecting the confidentiality of data and the privacy of individuals against data mining methods.
(2) Enabling the data mining algorithms to be used over a database without revealing private and confidential information.
We should note that, although both of the tracks seem to be similar to each other, their purposes are different. In the first case, the data may not be confidential, but datamining tools could be used to infer confidential information. Therefore, some sanitization techniques are needed to modify the database, so that privacy concerns are alleviated. In the second track, the data are considered confidential and are perturbed before they are given to a third party. Both of the approaches are necessary in order to achieve full privacy of individuals and will be discussed in detail in the following subsections.
Privacy Preserving Data Mining In the year 2000, Agrawal and Srikant, in their seminal paper published in ACM SIGMOD proceedings, coined the term privacy preserving data mining. Based on this work, privacy-preserving data mining can be defined as a technology that enables data-mining algorithms to work on encrypted or perturbed data (Agrawal & Srikant, 2000). We think of it as mining without seeing the actual data. The authors have pointed out that the collected data may be outsourced to third parties to perform the actual data mining. However, before the data could be handed over to third parties, the confidential values in the database, such as the salary of employees, needs to be perturbed. The research results are for specific data-mining techniques; namely, decision tree construction for classification. The main idea is to perturb the confidential data values in a way that the original data distribution could be reconstructed but not the original data values. In this way, the classification techniques that rely on the distribution of the data to construct a model can still work within a certain error margin that the authors approve. Another interesting work on privacy-preserving data mining was conducted in the US at Purdue University (Kantarcioglu & Clifton, 2002). This work was in the context of association rules in a distributed environment. The authors considered a case in which there were multiple companies that had their own confidential local databases that they did not want to share with others. When the datasets of the individual companies have the same schema, then this means that the data are distributed 922
horizontally. If the schemas of the local databases are complementary to each other, then the data are vertically distributed. The Purdue research group considered both horizontal and vertical data distribution for privacy-preserving association rule mining. In both cases, the individual association rules, together with their statistical properties, were assumed to be confidential. Also, secure multi-party computation techniques were employed, based on specialized encryption practices, to make sure that the confidential association rules were circulated among the participating companies in encrypted form. The resulting global association rules can be obtained in a private manner, without each company knowing which rule belongs to which local database. Both of the seminal works on privacy-preserving data mining (Agrawal & Srikant, 2000; Kantarcioglu & Clifton, 2002) have shaped the research in this area. They show how data mining can be done on private data in centralized and distributed environments. Although a specific data-mining model was the target in both of these papers, these authors initiated ideas that could be applied to other data-mining models.
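The perturbation idea behind the Agrawal and Srikant approach described above can be illustrated with a hedged sketch. The snippet below is not their reconstruction procedure; it simply adds zero-mean Gaussian noise to a confidential attribute and shows that distribution-level summaries remain usable even though individual values are masked. The data and parameters are invented.

```python
# Illustration only: additive zero-mean Gaussian perturbation of a confidential attribute.
import numpy as np

rng = np.random.default_rng(42)
salaries = rng.normal(60_000, 15_000, size=10_000)     # confidential true values
noise = rng.normal(0, 25_000, size=salaries.size)       # zero-mean perturbation
released = salaries + noise                              # what the data miner receives

print("true mean vs released mean:", salaries.mean().round(0), released.mean().round(0))
print("first record, true vs released:", round(salaries[0]), round(released[0]))
# A miner that knows the noise distribution can estimate the original data
# distribution (e.g., via a Bayes-rule-based reconstruction), but not the row values.
```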
Privacy and Confidentiality Against Data Mining Data mining and data warehousing technology have a great impact on data analysis and business decision making, which motivated the collection of data on practically everything, ranging from customer purchases to navigation patterns of Web site visitors. As a result of data collection efforts, privacy advocates in the US are now raising their voices against the misuse of data, even if it was collected for security purposes. The following quote from the New York Times article demonstrates the point of privacy against data mining (Markoff, 2002): Pentagon has released a study that recommends the government to pursue specific technologies as potential safeguards against the misuse of data-mining systems similar to those now being considered by the government to track civilian activities electronically in the United States and abroad. This shows us that even the most aggressive data collectors in the US are aware of the fact that the datamining tools could be misused, and we need a mechanism to protect the confidentiality and privacy of people. The initial work by Chang and Moskovitz (2000) points out the possible threats imposed by data mining tools. In this work, the authors argue that even if some of the data values are deleted (or hidden) in a database, there is still a threat of the hidden data being recovered. They show that a classification model could be constructed using
Privacy and Confidentiality Issues in Data Mining
part of the released data as a training set, which can be used further to predict the hidden data values. Chang and Moskovitz (2000) propose a feedback mechanism that will try to construct a prediction model and go back to the database to update it with the aim of blocking the prediction of confidential data values. Another aspect of privacy and confidentiality against data mining is that data-mining results, such as the patterns in a database, could be confidential. For example, a database could be released with the thought that the confidential values are hidden and cannot be queried. However, there may be patterns in the database that are confidential. This issue was first pointed out by O’Leary (1991). Clifton and Marks (1996) further elaborate on the issue of patterns being confidential by specific examples of data-mining models, such as association rules. The association rules, which are very popular in retail marketing, may contain confidential information that should not be disclosed. In Verykios, et al. (2004), a way of identifying confidential association rules and sanitizing the database to limit the disclosure of the association rules are discussed. Furthermore, the work illustrates ways in which heuristics for updating the database can reduce the significance of the association rules in terms of their support and confidence. The main idea is to change the transactions in the database that contribute to the support and confidence of the sensitive (or private) association rules in a way that the support and/or confidence of the rules decrease with a limited effect on the non-sensitive rules.
FUTURE TRENDS Privacy and confidentiality issues in data mining will be more and more crucial as data collection efforts increase and the type of data collected becomes more diverse. A typical example of this is the usage of RFID tags and mobile phones, which reveal our sensitive location information. As more data become available and more tools are implemented to search this data, the privacy risks will increase even more. Consider the World Wide Web, which is a huge data repository with very powerful search engines working on top of it. An adversary can find the phone number of a person from some source and use the search engine Google to obtain the address of that person. Then, by applying the address to a tool such as Mapquest, door-todoor driving directions to the corresponding person’s home could be found easily. This is a very simple example of how the privacy of a person could be in danger by integrating data from a couple of sources, together with a search engine over a data repository. Data integration over multiple data repositories will be one of the main challenges in privacy and confidentiality of data against data mining.
The New York Times article cited previously highlights another important aspect of privacy issues in data mining by saying, “Perhaps the strongest protection against abuse of information systems is Strong Audit mechanisms… we need to watch the watchers” (Markoff, 2002). This confirms that data-mining tools also should be monitored, and users’ access to data via data-mining tools should be controlled. Although there is some initial work on this issue (Oliveira, Zaiane & Saygin, 2004), how this could be achieved is still an open problem waiting to be addressed, since there is no available access control mechanism specifically developed for data-mining tools similar to the one employed in database systems. Privacy-preserving data-mining techniques have been proposed, but they may not fully preserve the privacy, as pointed out in Agrawal and Aggarwal (2001) and Evfimievski, Gehrke, and Srikant (2003). Therefore, privacy metrics and benchmarks are needed to assess the privacy threats and the effectiveness of the proposed privacy-preserving data-mining techniques or the privacy breaches introduced by data-mining techniques.
CONCLUSION Data mining has found a lot of applications in the industry and government, due to its success in combining machine learning, statistics, and database fields with the aim of turning heaps of data into valuable knowledge. Widespread usage of the Internet, especially for e-commerce and other services, has led to the collection and storage of more data at a lower cost. Also, large data repositories about individuals and their activities, coupled with powerful data mining tools, have increased fears about privacy. Therefore, data mining researchers have felt the urge to address the privacy and confidentiality issues. Although privacy-preserving data mining is still in its infancy, some promising results have been achieved as the outcomes of initial studies. For example, data perturbation techniques enable data-mining models to be built on private data, and encryption techniques allow multiple parties to mine their databases as if their data were stored in a central database. The threat of powerful data-mining tools revealing confidential information also has been addressed. The initial results in this area have shown that confidential patterns in a database can be concealed by specific hiding techniques. For the advance of the data-mining technology, privacy issues need to be investigated more, and the current problems, such as privacy leaks in privacy-preserving data-mining algorithms, and the scalability issues need to be resolved. In summary, data-mining technology is needed to make our lives easier and to increase our safety standards, but, at the same time, privacy standards should 923
P
Privacy and Confidentiality Issues in Data Mining
be established and enforced on data collectors in order to protect the privacy of data owners against the misuse of the collected data.
REFERENCES Adam, N.R., & Wortmann, J.C. (1989). Security-control methods for statistical databases: A comparative study. ACM Computing Surveys, 21(4), 515-556. Agrawal, D., & Aggarwal, C. (2001). On the design and quantification of privacy preserving data mining algorithms. Proceedings of ACM Symposium on Principles of Database Systems, Santa Barbara, California. Agrawal, R., & Srikant, R. (2000). Privacy preserving data mining. Proceedings of SIGMOD Conference, Dallas, Texas. Clifton, C., & Marks, D. (1996). Security and privacy implications of data mining. Proceedings of ACM Workshop on Data Mining and Knowledge Discovery, Montreal Canada. Evfimievski, A.V., Gehrke, J., & Srikant, R. (2003). Limiting privacy breaches in privacy preserving data mining. Proceedings of ACM Symposium on Principles of Database Systems, San Diego, California.
Conference on Knowledge Discovery and Data Mining, Sydney, Australia. Saygin, Y., Verykios, V.S., & Clifton, C. (2001). Using unknowns to prevent the discovery of association rules. SIGMOD Record, 30(4), 45-54. Schwartz, J., & Micheline, M. (2004, May 1). Airlines gave F.B.I. millions of records on travelers after 9/11. New York Times. Sweeney, L. (2003). k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570. Verykios, V.S., Elmagarmid, A., Bertino, E., Saygin, Y., & Dasseni, E. (2004). Association rule hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 434-447.
KEY TERMS Data Perturbation: Modifying the data so that original confidential data values cannot be recovered. Distributed Data Mining: Performing the data-mining task on data sources distributed in different sites.
Farkas, C., & Jajodia, S. (2003). The inference problem: A survey. ACM SIGKDD Explorations, 4(2), 6-12.
K-Anonymity: A privacy metric, which ensures an individual’s information cannot be distinguished from at least k-1 other people when a data source is disclosed.
Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. Proceedings of The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Madison, Wisconsin.
Privacy Against Data Mining: Preserving the privacy of individuals against data-mining tools when disclosed data contain private information that could be extracted by data-mining tools.
Markoff, J. (2002, December 19). Study seeks technology safeguards for privacy. New York Times (p. 18). O’Leary, D.E. (1991). Knowledge discovery as a threat to database security. In G. Piatetsky-Shapiro, & W.J. Frawley (Eds.), Knowledge discovery in databases (pp. 507-516). AAI/MIT Press. Oliveira, S., Zaiane, O., & Saygin, Y. (2004). Secure association rule sharing. Proceedings of the 8th Pacific-Asia
924
Privacy Preserving Data Mining: Performing the datamining task on private data sources (centralized or distributed). Secure Multiparty Computation: Computing the result of an operation (i.e., sum, min, max) on private data (e.g., finding the richest person among a group of people without revealing the wealth of the individuals). Statistical Database: Type of database system that is designed to support statistical operations while preventing operations that could lead to the association of individuals with confidential data.
925
Privacy Protection in Association Rule Mining Neha Jha Indian Institute of Technology, Kharagpur, India Shamik Sural Indian Institute of Technology, Kharagpur, India
INTRODUCTION Data mining technology has emerged as a means for identifying patterns and trends from large sets of data. Mining encompasses various algorithms, such as discovery of association rules, clustering, classification and prediction. While classification and prediction techniques are used to extract models describing important data classes or to predict future data trends, clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. Alternatively, association rule mining searches for interesting relationships among items in a given data set. The discovery of association rules from a huge volume of data is useful in selective marketing, decision analysis and business management. A popular area of application is the “market basket analysis,” which studies the buying habits of customers by searching for sets of items that are frequently purchased together (Han & Kamber, 2003). Let I = {i1, i 2, …, in} be a set of items and D be a set of transactions, where each transaction T belonging to D is an itemset such that T is a subset of I. A transaction T contains an itemset A if A is a subset of T. An itemset A with k items is called a k-itemset. An association rule is an implication of the form A => B where A and B are subsets of I and A ∩B is null. The rule A => B is said to have a support s in the database D if s% of the transactions in D contain AUB. The rule is said to have a confidence c if c% of the transactions in D that contain A also contain B. The problem of mining association rules is to find all rules whose support and confidence are higher than a specified minimum support and confidence. An example association rule could be: Driving at night and speed > 90 mph results in a road accident, where at least 70% of drivers meet the three criteria and at least 80% of those meeting the driving time and speed criteria actually have an accident. While data mining has its roots in the traditional fields of machine learning and statistics, the sheer volume of data today poses a serious problem. Traditional methods typically make the assumption that the data is memory resident. This assumption is no longer tenable and implementation of data mining ideas in high-performance par-
allel and distributed computing environments has become crucial for successful business operations. The problem is not simply that the data is distributed, but that it must remain distributed. For instance, transmitting large quantities of data to a central site may not be feasible or it may be easier to combine results than to combine the sources. This has led to the development of a number of distributed data mining algorithms (Ashrafi et al., 2002). However, most of the distributed algorithms were initially developed from the point of view of efficiency, not security. Privacy concerns arise when organizations are willing to share data mining results but not the data. The data could be distributed among several custodians, none of who are allowed to transfer their data to another site. The transaction information may also be specific to individual users who do not appreciate the disclosure of individual record values. Consider another situation. A multinational company wants to mine its own data spread over many countries for globally valid results while national laws prevent trans-border data sharing. Thus, a need for developing privacy preserving distributed data mining algorithms has been felt in recent years. This keeps the information about each site secure, and at the same time determines globally valid results for taking effective business decisions. There are many variants of these algorithms depending on how the data is distributed, what type of data mining we wish to do, and what restrictions are placed on sharing of information. This paper gives an overview of the different approaches to privacy protection and information security in distributed association rule mining. The following two sections describe the traditional privacy preserving methods along with the pioneering work done in recent years. Finally, we discuss the open research issues and future directions of work.
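The support and confidence definitions given above can be made concrete with a small worked example. The sketch below uses a toy transaction database invented for illustration; the numbers it prints refer only to this toy data, not to the accident example in the text.

```python
# Worked example of the support/confidence definitions, on a toy database D.

def support(D, itemset):
    """Percentage of transactions in D that contain the itemset."""
    return 100.0 * sum(itemset <= t for t in D) / len(D)

def confidence(D, A, B):
    """Percentage of transactions containing A that also contain B."""
    containing_A = [t for t in D if A <= t]
    return 100.0 * sum(B <= t for t in containing_A) / len(containing_A)

D = [{"night", "speed>90", "accident"},
     {"night", "speed>90", "accident"},
     {"night", "speed>90"},
     {"day", "speed>90"}]

A, B = {"night", "speed>90"}, {"accident"}
print(support(D, A | B))      # 50.0  -> support of the rule A => B
print(confidence(D, A, B))    # ~66.7 -> confidence of the rule A => B
```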
BACKGROUND A direct approach to data mining over multiple sources would be to run existing data mining tools at each site independently and then combine the final results. While it is simple to implement, this approach often fails to give
globally valid results, since a rule that is valid in one or more of the individual locations need not be valid over the entire data set. Efforts have been made to develop methods that perform local operations at each site to produce intermediate results, which can then be used to obtain the final result in a secure manner. For example, it can be easily shown that if a rule has support > m% globally, it must have support > m% on at least one of the individual sites. This result can be applied to the distributed case with horizontally partitioned data (all sites have the same schema but each site has information on different entities). A distributed algorithm for this would work by requesting each site to send all rules with support at least m%. For each rule returned, the sites are then asked to send the count of their transactions that support the rule, and the total count of all transactions at the site. Using these values, the global support of each rule can be computed with the assurance that all rules with support at least m% have been found. This method provides a certain level of information security, since the basic data is not shared. However, the problem becomes more difficult if we want to protect not only the individual items at each site, but also how much each site supports a given rule. The above method reveals this information, which may be considered a breach of security depending on the sensitivity of the application. Theoretical studies in the field of secure computation started in the late 1980s. In recent years, the focus has shifted more to the application field (Maurer, 2003). The challenge is to apply the available theoretical results in solving intricate real-world problems. Du & Atallah (2001) review and suggest a number of open secure computation problems, including applications in the fields of computational geometry, statistical analysis and data mining. They also suggest a method for solving secure multi-party computational geometry problems and secure computation of dot products in separate pieces of work (Atallah & Du, 2001; Ioannidis et al., 2002). In all the above-mentioned work, the secure computation problem has been treated with the goal of providing absolutely zero knowledge leakage, whereas corporations may not always be willing to bear the cost of zero information leakage as long as they can keep the information shared within known bounds. In the next section, we discuss some of the important approaches to privacy preserving data mining with an emphasis on the algorithms developed for association rule mining.
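As a concrete (and deliberately non-secure) illustration of the distributed counting step described above, the sketch below combines per-site counts into a global support for horizontally partitioned data; the site counts and threshold are invented, and each site's count is revealed in the clear, which is exactly the leakage the secure protocols discussed later try to avoid:

# Combining local counts into a global support for horizontally partitioned data.
# Each site reports (transactions supporting the rule, total transactions at the site).
site_counts = [(40, 1000), (15, 200), (70, 800)]   # invented numbers

supporting = sum(s for s, _ in site_counts)
total = sum(n for _, n in site_counts)
global_support = 100.0 * supporting / total

min_support = 5.0   # threshold in percent, also invented
verdict = "frequent" if global_support >= min_support else "infrequent"
print(f"global support = {global_support:.2f}% ({verdict})")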
MAIN THRUST In this section, we describe the various levels of privacy protection possible while mining data and the corresponding algorithms for achieving them. Identification
of privacy concerns in data mining has led to a wide range of proposals in the past few years. The solutions can be broadly categorized as belonging to the classes of data obfuscation, data summarization and data separation. The goal of data obfuscation is to hide the data to be protected. This is achieved by perturbing the data before delivering it to the data miner: randomly modifying the data, swapping values between records, or performing controlled modification of the data to hide the secrets. Cryptographic techniques are often employed to encrypt the source data, perform intermediate operations on the encrypted data, and then decrypt the values to obtain the final result, with each site learning nothing but the global rules. Summarization, on the other hand, attempts to make available only innocuous summaries of the data, so that only the needed facts are exposed. Data separation ensures that only trusted parties can see the data by requiring all operations and analysis to be performed either by the owner/creator of the data or by trusted third parties. One application of data perturbation is privacy-preserving decision-tree classification, in which random values drawn from a normal (Gaussian) distribution with mean 0 are added to the actual data values (Agrawal & Srikant, 2000). Bayes' rule for density functions is then used to reconstruct the distribution. The approach is quite elegant, since it approximates the original data distribution, not the original data values, using the distorted data and knowledge of the random-noise distribution. Similar data perturbation techniques can also be applied to the mining of Boolean association rules (Rizvi & Haritsa, 2002). It is assumed that the tuples in the database are fixed-length sequences of 0's and 1's. A typical example is the market basket application, where the columns represent the items sold by a supermarket and each row describes, through a sequence of 1's and 0's, the purchases made by a particular customer (1 indicates a purchase and 0 indicates no purchase). One interesting feature of this work is a flexible definition of privacy; for example, the ability to correctly guess a value of "1" from the perturbed data can be considered a greater threat to privacy than correctly learning a "0." For many applications such as market basket analysis, it is reasonable to expect that customers would want more privacy for the items they buy than for the items they do not. There are primarily two ways of providing such protection. In one method, the data is changed or perturbed to a certain extent to hide the exact information that can be extracted from the original data. In another approach, the data is encrypted before running the data mining algorithms on it.
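The following sketch illustrates the bit-flipping flavor of randomization in the spirit of Rizvi and Haritsa's scheme; the retention probability p and the single-item reconstruction formula used here are simplified assumptions for illustration only, not the published algorithm:

import random

# Randomized bit flipping: each Boolean entry is kept with probability p
# and flipped with probability 1 - p (a simplified stand-in for the
# distortion step of (Rizvi & Haritsa, 2002)).
def perturb_row(row, p=0.9, rng=random):
    return [bit if rng.random() < p else 1 - bit for bit in row]

def estimate_true_ones(observed_ones, n_rows, p=0.9):
    # For a single item under the flip model above:
    # E[observed 1s] = t*(2p - 1) + n*(1 - p), so invert for the true count t.
    return (observed_ones - n_rows * (1 - p)) / (2 * p - 1)

rows = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0]]   # invented purchase rows
distorted = [perturb_row(r) for r in rows]
print(distorted)
print(estimate_true_ones(observed_ones=2, n_rows=3, p=0.9))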
While data perturbation techniques usually result in a transformation that loses information, so that the exact result cannot be determined, cryptographic protocols try to achieve zero-knowledge transfer using a lossless transformation. Cryptographic techniques have been developed for the ID3 classification algorithm with two parties having horizontally partitioned data (Lindell & Pinkas, 2000). While the approach is interesting, it is not very efficient for mining rules from very large databases. Also, completely secure multi-party computation may not actually be required in practical applications. Although there may be many different data mining techniques, they often perform similar computations at various stages. For example, counting the number of items in a subset of the data shows up in both association rule mining and learning decision trees. There exist four popular methods for privacy-preserving computations that can be used to support different forms of data mining: secure sum, set union, set intersection size and scalar product. Though not all of these methods are truly secure (in some, information other than the results is revealed), they do have provable bounds on the information released. In addition, they are efficient, as the communication and computation cost is not significantly increased by the privacy-preserving component. At this stage, it is important to understand that not all association rules have an equal need for protection. One has to analyze the sensitivity of the various association rules mined from a database (Saygin et al., 2002). From the large amount of data made available to the public through various sources, a malicious user may be able to extract association rules that were meant to be protected. Reducing the support or the confidence can hide the sensitive rules. Evfimievski et al. (2002) investigate the problem of applying randomization techniques in association rule mining. One of their most important contributions is the application of privacy-preserving techniques to rule mining from categorical data. They also provide a formal definition of privacy breaches. It is quite challenging to extend this work by combining randomization techniques with cryptographic techniques to make the scheme more robust. Kantarcioglu & Clifton (2002) propose a cryptographic technique for mining association rules in a horizontally partitioned database. It assumes that the transactions are distributed among n sites. The global support count of an itemset is the sum of all the local support counts. An itemset A is globally supported if the global support count of A is greater than s% of the total transaction database size. The global confidence of a rule A => B can be given as (A ∪ B).support / (A).support, and a k-itemset is called a globally large k-itemset if it is globally supported. Quasi-commutative hash functions are used for secure computation of set unions that determine globally frequent candidate itemsets from locally frequent candidate itemsets. The globally frequent k-itemsets are then determined from the candidate itemsets with a secure protocol.
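The secure sum primitive mentioned above can be sketched as follows; this is the textbook ring-based version under the semi-honest model, with invented local counts, and is not necessarily the exact variant used in the cited protocol:

import random

# Ring-based secure sum (semi-honest model): the initiating site masks its
# value with a random offset modulo m, each site adds its own value as the
# running total is passed around the ring, and the initiator removes the
# mask at the end.  The local counts below are invented.
def secure_sum(local_values, m):
    mask = random.randrange(m)
    running = (local_values[0] + mask) % m        # site 1 starts
    for v in local_values[1:]:                    # remaining sites add in turn
        running = (running + v) % m
    return (running - mask) % m                   # site 1 unmasks the total

local_support_counts = [40, 15, 70]               # one count per site
m = 10**6                                         # must exceed the true sum
print(secure_sum(local_support_counts, m))        # 125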
Since the goal is to determine whether the support exceeds a threshold rather than to learn the exact support of a rule, the secure sum computation is slightly altered: instead of sending the computed values to each other, the sites perform a secure comparison among themselves. If the goal were to have a totally secure method, the union step would have to be eliminated. However, using the secure union method gives higher efficiency with provably controlled disclosure of some minor information (e.g., the number of duplicate items and the candidate sets). The validity of even this disclosed information can be reduced by noise addition, as each site can add some fake large itemsets to its actual locally large itemsets. In the pruning phase, the fake items can be eliminated. In contrast to the above method, computing association rules over vertically partitioned data is even more challenging (Vaidya & Clifton, 2002). Here the items are partitioned, and each itemset is split between sites. Most steps of the traditional Apriori algorithm (Agrawal & Srikant, 1994) can be done locally at each of the sites. The crucial step involves finding the support count of an itemset. If the support count of an itemset is securely computed, it can be checked whether the support is greater than the threshold to determine whether the itemset is frequent. Consider the entire transaction database to be a Boolean matrix where 1 represents the presence of an item (column) in a transaction (row), while 0 correspondingly represents an absence. Then the support count of an itemset is the scalar product of the vectors representing the sub-itemsets held by the two parties. An algorithm to compute the scalar product securely is therefore sufficient for secure computation of the support count. Most of the above protocols assume a semi-honest model, where the parties involved will honestly follow the protocol but may later try to infer additional information from whatever data they receive through the protocol. One consequence of this assumption is that parties are not allowed to give spurious input to the protocol. If a party is allowed to give spurious input, it can probe to determine the value of a specific item at other parties. For example, if a party gives the input (0, ..., 0, 1, 0, …, 0), the result of the scalar product (1 or 0) tells the malicious party whether the other party has the transaction corresponding to the 1. Attacks of this type can be termed probing attacks. All of the protocols currently suggested in the literature are susceptible to such probing attacks. Better techniques, which work even in the malicious model, are needed to guard against this.
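To see why the support count reduces to a scalar product for vertically partitioned data, and why spurious input enables probing, consider the following plain (unprotected) sketch with invented Boolean columns; in a real protocol the dot product itself would be computed by a secure scalar-product algorithm rather than in the clear:

# Vertically partitioned support counting: party 1 holds the Boolean column
# vector for its sub-itemset, party 2 holds the vector for its sub-itemset.
# The support count of the combined itemset is the dot product of the vectors.
a = [1, 0, 1, 1, 0, 1]   # party 1: rows containing its sub-itemset (invented)
b = [1, 1, 0, 1, 0, 1]   # party 2: rows containing its sub-itemset (invented)

support_count = sum(x * y for x, y in zip(a, b))
print(support_count)      # 3

# Probing attack: a party that may choose spurious input can submit an
# indicator vector and learn one entry of the other party's column.
probe = [0, 0, 0, 1, 0, 0]
print(sum(x * y for x, y in zip(probe, b)))   # reveals b[3]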
FUTURE DIRECTIONS The ongoing work in the field has suggested several interesting directions for future research.
Existing approaches should be extended to cover new classes of privacy breaches and to ascertain the theoretical limits on discoverability for a given level of privacy (Agrawal & Aggarwal, 2001). Potential research needs to concentrate on combining randomization and cryptographic protocols to get the strengths of both without the weaknesses of either. Privacy estimation formulas used in data perturbation techniques can be refined to include the effects of using the mining output to re-interrogate the distorted database. Extending association rule mining of vertically partitioned data to multiple parties is an important research topic in itself, especially considering collusion between the parties. Much of the current work on vertically partitioned data is limited to Boolean association rule mining. Categorical attributes and quantitative association rule mining are significantly more complex problems. Most importantly, a wide range of data mining algorithms exists, such as classification, clustering and sequence detection, and the effect of privacy constraints on these algorithms is also an interesting area of future research (Verykios et al., 2004). The final goal of researchers is to develop methods enabling any data mining operation that can be done at a single site to be done across various sources, while respecting the privacy policies of the sources.
CONCLUSION Several algorithms have been proposed to address the conflicting goals of supporting privacy and accuracy while mining association rules on large databases. Once this field of research matures, a toolkit of privacy-preserving distributed computation techniques needs to be built that can be assembled to solve specific real-world problems (Clifton et al., 2003). Current techniques address the problem of performing one secure computation; however, using that result to perform the next computation reveals intermediate information that may not be part of the final results. Controlled disclosure is guaranteed by evaluating whether the real results together with the extra information violate privacy constraints. This approach, however, becomes more difficult with iterative techniques, as intermediate results from several iterations may reveal a lot of information. Proving that this does not violate privacy is a difficult problem.
REFERENCES Agrawal, D., & Aggarwal, C.C. (2001). On the design and quantification of privacy preserving data mining algorithms. In Twentieth ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems (pp. 247-255). Santa Barbara, California. Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In ACM SIGMOD Conference on Management of Data (pp. 439-450). Dallas, TX, USA. Ashrafi, M.Z., Taniar, D., & Smith, K.A. (2002). A data mining architecture for distributed environments. Innovative Internet Computing Systems, 27-38. Atallah, M.J., & Du, W. (2001). Secure multi-party computational geometry. In Seventh International Workshop on Algorithms and Data Structures (pp. 165-179). Providence, Rhode Island, USA. Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. (2003). Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations, 4(2), 28-34. Du, W., & Atallah, M.J. (2001). Secure multi-party computation problems and their applications: A review and open problems. In New Security Paradigms Workshop (pp. 11-20). Cloudcroft, New Mexico, USA. Evfimievski, A., Srikant, R., Agrawal, R., & Gehrke, J. (2002). Privacy preserving mining of association rules. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-228). Edmonton, Canada. Han, J., & Kamber, M. (2003). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers. Ioannidis, I., Grama, A., & Atallah, M. (2002). A secure protocol for computing dot-products in clustered and distributed environments. In International Conference on Parallel Processing (pp. 279-285). Kantarcioglu, M., & Clifton, C. (2002). Privacy-preserving distributed mining of association rules on horizontally partitioned data. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. Advances in Cryptology (pp. 36-54). Maurer, U. (2003). Secure multi-party computation made simple. In Third Conference on Security in Communication Networks (SCN'02) (pp. 14-28). Lecture Notes in Computer Science (Vol. 2576). Berlin: Springer-Verlag. Rizvi, S.J., & Haritsa, J.R. (2002). Maintaining data privacy in association rule mining. In Twenty-Eighth International Conference on Very Large Data Bases (pp. 682-689).
Saygin, Y., Verykios, V.S., & Elmagarmid, A.K. (2002). Privacy preserving association rule mining. In Twelfth International Workshop on Research Issues in Data Engineering (pp. 151-158). Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 639-644). Edmonton, Alberta, Canada. Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, 33(1), 50-57.
KEY TERMS Association Rule: A relation between the occurrences of a set of items and another set of items in a large data set. Cryptographic Data Mining Techniques: Methods that encrypt individual data before running data mining algorithms so that the final result is also available in an encrypted form.
Distributed Data Mining: Mining information from a very large set of data spread across multiple locations without transferring the data to a central location. Horizontally Partitioned Data: A distributed architecture in which all the sites share the same database schema but have information about different entities. The union of all the rows across all the sites forms the complete database. Itemset: A set of one or more items that are purchased together in a single transaction. Privacy Preserving Data Mining: Algorithms and methods that are used to mine rules from a distributed set of data while the sites reveal as little detailed information as possible. Secure Multi-Party Computation: Computation of an overall result based on the data from a number of users, in which each individual user learns only the final result at the end of the computation and nothing else. Vertically Partitioned Data: A distributed architecture in which the different sites store different attributes of the data. The union of all these attributes or columns together forms the complete database.
Profit Mining
Senqiang Zhou, Simon Fraser University, Canada
Ke Wang, Simon Fraser University, Canada
INTRODUCTION
A major obstacle in data mining applications is the gap between statistic-based pattern extraction and value-based decision-making. "Profit mining" aims to reduce this gap. In profit mining, given a set of past transactions and pre-determined target items, we would like to build a model for recommending target items and promotion strategies to new customers, with the goal of maximizing profit. Though this problem is studied in the context of the retailing environment, the concept and techniques are applicable to other applications under a general notion of "utility". In this short article, we review existing techniques and briefly describe the profit mining approach recently proposed by the authors. The reader is referred to (Wang, Zhou & Han, 2002) for the details.
BACKGROUND
It is a very complicated issue whether a customer buys a recommended item. Considerations include items stocked, prices or promotions, competitors' offers, recommendations by friends or other customers, psychological issues, convenience, and so forth. For on-line retailing, it also depends on security considerations. It is unrealistic to model all such factors in a single system. In this article, we focus on one type of information available in most retailing applications, namely past transactions. The belief is that shopping behaviors in the past may shed some light on what customers like. We try to use patterns of such behaviors to recommend items and prices. Consider an on-line store that is promoting a set of target items. At the cashier counter, the store would like to recommend one target item and a promotion strategy (such as a price) to the customer based on the non-target items purchased. The challenge is determining an item interesting to the customer, at a price affordable to the customer and profitable to the store. We call this problem profit mining (Wang, Zhou & Han, 2002). Most statistics-based rule mining, such as association rule mining (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994), considers a rule "interesting" if it passes certain statistical tests such as support/confidence. To an enterprise, however, it remains unclear how such rules can be used to maximize a given business objective. For example, knowing "Perfume → Lipstick" and "Perfume → Diamond", a store manager still cannot tell which of Lipstick and Diamond, and at what price, should be recommended to a customer who buys Perfume. Simply recommending the most profitable item, say Diamond, or the most likely item, say Lipstick, does not maximize the profit, because there is often an inverse correlation between the likelihood to buy and the dollar amount to spend. This inverse correlation reflects the general trend that the more money is involved, the more cautious the buyer is when making a purchase decision.
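To make the inverse-correlation point concrete, the following toy calculation compares recommending by likelihood, by per-sale profit, and by expected profit; all purchase probabilities and profit figures below are invented for illustration:

# Choosing what to recommend to a Perfume buyer.  The probabilities and
# per-sale profits are invented numbers, used only to illustrate why neither
# "most likely" nor "most profitable" maximizes expected profit on its own.
candidates = {
    "Lipstick": {"p_buy": 0.40, "profit": 5.0},
    "Diamond":  {"p_buy": 0.01, "profit": 800.0},
    "Handbag":  {"p_buy": 0.15, "profit": 60.0},
}

def expected_profit(item):
    c = candidates[item]
    return c["p_buy"] * c["profit"]

best = max(candidates, key=expected_profit)
for item in candidates:
    print(f"{item:9s} expected profit = {expected_profit(item):6.2f}")
print("recommend:", best)   # Handbag in this invented example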
MAIN THRUST OF THE CHAPTER
Related Work
Profit maximization is different from the "hit" maximization in classic classification, because each hit may generate a different profit. Several approaches exist to make classification cost-sensitive. Domingos (1999) proposed a general method that can serve as a wrapper to make a traditional classifier cost-sensitive. Zadrozny and Elkan (2001) extended the error metric by allowing the cost to be example dependent. Margineantu and Dietterich (2000) gave two bootstrap methods to estimate the average cost of a classifier. Pednault, Abe and Zadrozny (2002) introduced a method for making sequential cost-sensitive decisions, where the goal is to maximize the total benefit over a period of time. These approaches assume a given error metric for each type of misclassification, which is not available in profit mining. Profit mining is related in motivation to the actionability (or utility) of patterns: a pattern is interesting in the sense that the user can act upon it to her advantage (Silberschatz & Tuzhilin, 1996). Kleinberg, Papadimitriou and Raghavan (1998) gave a framework for evaluating data mining operations in terms of utility in
decision-making. These works, however, did not propose concrete solutions to the actionability problem. Recently, there have been several works applying association rules to business-related problems. Brijs, Swinnen, Vanhoof and Wets (1999), Wong, Fu and Wang (2003), and Wang and Su (2002) studied the problem of selecting a given number of items for stocking. The goal is to maximize the profit generated by the selected items or customers. These works represent one important step beyond association rule mining, i.e., addressing the issue of converting a set of individual rules into a single actionable model for recommending actions in a given scenario. There have been several attempts to generalize association rules to capture more semantics, e.g., (Lin, Yao & Louie, 2002; Yao, Hamilton & Butz, 2004; Chan, Yang & Shen, 2003). Instead of a uniform weight associated with each occurrence of an item, these works associate a general weight with an item and mine all itemsets that pass some threshold on the aggregated weight of the items in an itemset. Like association rule mining, these works did not address the issue of converting a set of rules or itemsets into a model for recommending actions. Collaborative filtering (Resnick & Varian, 1997) makes recommendations by aggregating the "opinions" (such as ratings of movies) of several "advisors" who share a taste with the customer. Built on this technology, many large commerce web sites help their customers find products. For example, Amazon.com uses "Book Matcher" to recommend books to customers; Moviefinder.com recommends movies to customers using the "We Predict" recommender system. For more examples, please refer to (Schafer, Konstan & Riedl, 1999). The goal is to maximize the hit rate of recommendation. For items of varied profit, maximizing profit is quite different from maximizing hit rate. Also, collaborative filtering relies on carefully selected "item endorsements" for similarity computation, and a good set of "advisors" to offer opinions. Such data are not easy to obtain. The ability to recommend prices, in addition to items, is another major difference between profit mining and other recommender systems. Another application where data mining is heavily used for business targets is direct marketing; see (Ling & Li, 1998; Masand & Shapiro, 1996; Wang, Zhou, Yeung & Yang, 2003), for example. The problem is to identify buyers using data collected from previous campaigns, where the product to be promoted is usually fixed and the best guess is about who is likely to buy. Profit mining, on the other hand, guesses the best item and price for a given customer. Interestingly, these two problems are closely related to each other. We can model the direct marketing problem as a profit mining problem by including customer demographic data as
part of her transactions and including a special target item NULL representing no recommendation. Now, each recommendation of a non-NULL item (and price) corresponds to identifying a buyer of the item. This modeling is more general than the traditional direct marketing in that it can identify buyers for more than one type of item and promotion strategies.
Profit Mining
We solve the profit mining problem by extracting patterns from a set of past transactions. A transaction consists of a collection of sales of the form (item, price). A simple price can be substituted by a "promotion strategy", such as "buy one get one free" or "X quantity for Y dollars", that provides sufficient information to derive the price. The transactions were collected over some period of time, and there could be several prices even for the same item if sales occurred at different times. Given a collection of transactions, we find recommendation rules of the form {s1, …, sk} → (I, P), where I is a target item and P is a price of I, and each si is a pair of a non-target item and a price. An example is (Perfume, price=$20) → (Lipstick, price=$10). This recommendation rule can be used to recommend Lipstick at the price of $10 to a customer who bought Perfume at the price of $20. If the recommendation leads to a sale of Lipstick of quantity Q, it generates (10 − C) * Q profit, where C is the cost of Lipstick. Several practical considerations would make recommendation rules more useful. First, the items on the left-hand side in si can instead be item categories, to capture category-related patterns. Second, a customer may have paid a higher price if a lower price was not available at the shopping time. We can incorporate the domain knowledge that paying a higher price implies the willingness to pay a lower price (for exactly the same item) to search for stronger rules at lower prices. This can be done through multi-level association mining (Srikant & Agrawal, 1995; Han & Fu, 1995), by modeling a lower price as a more general category than a higher price. For example, the sale {
average profit implies that both confidence and profit are high. Given a new customer, we pick the highest-ranked matching rule to make a recommendation. Before making a recommendation, however, "over-fitting" rules that work only for observed transactions, but not for new customers, should be pruned, because our goal is to maximize profit on new customers. The idea is as follows. Instead of ranking rules by observed profit, we rank rules by projected profit, which is based on the estimated error of a rule, adapted from the estimate used for pruning classifiers (Quinlan, 1993). Intuitively, the estimated error will increase for a rule that matches a small number of transactions. Therefore, over-fitting rules tend to have a larger estimated error, which translates into a lower projected profit and a lower rank. For a detailed exposition and experiments on real-life and synthetic data sets, the reader is referred to (Wang, Zhou & Han, 2002).
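A rough sketch of the ranking idea is given below; the pessimistic correction used here (a simple penalty that shrinks as the number of matching transactions grows) is an assumption chosen for illustration and is not the exact error estimate of (Quinlan, 1993) adapted in the cited work:

import math

# Rank recommendation rules by a pessimistically adjusted ("projected") profit.
# observed_avg_profit: average profit per transaction matching the rule;
# n_matches: how many past transactions the rule matched.
# The penalty below is an illustrative stand-in: rules supported by few
# transactions have their projected profit discounted more heavily.
def projected_profit(observed_avg_profit, n_matches, z=1.15):
    penalty = z / math.sqrt(n_matches)           # shrinks as evidence grows
    return observed_avg_profit * max(0.0, 1.0 - penalty)

rules = [
    {"rule": "{Perfume=$20} -> (Lipstick, $10)", "avg_profit": 4.0, "n": 120},
    {"rule": "{Perfume=$20} -> (Diamond, $900)", "avg_profit": 9.0, "n": 3},
]
rules.sort(key=lambda r: projected_profit(r["avg_profit"], r["n"]), reverse=True)
for r in rules:
    print(r["rule"], round(projected_profit(r["avg_profit"], r["n"]), 2))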
FUTURE TRENDS The profit mining proposed is only the first, but important, step toward addressing the ultimate goal of data mining. To make profit mining more practical, several issues need further study. First, it is quite likely that the recommended item tends to be an item that the customer would buy independently of the recommendation. Obviously, such items need not be recommended; recommendation should focus on those items that the customer may buy if informed, but may not buy otherwise. Recommending such items is more likely to bring in additional profit. Second, the current model maximizes only the profit of a "one-shot" sale; therefore, a sale in a large quantity is favored. In reality, a customer may shop at the same store regularly over a period of time, in which case a sale in a large quantity will affect the shopping frequency of the customer and, thus, the profit. In this case, the goal is maximizing the profit from recurring customers over a period of time. Another interesting direction is to incorporate feedback on whether a certain recommendation is rejected or accepted to improve future recommendations. The current work has focused on the information captured in past transactions. As pointed out in the Introduction, other factors such as competitors' offers, recommendations by friends or other customers, consumer fashion, psychological issues, and convenience can affect the customer's decision. Addressing these issues requires additional knowledge, such as competitors' offers, and computers may not be the most suitable tool. One solution could be suggesting several best recommendations to the domain expert, the store manager or sales person in this case, who makes the final recommendation to the customer after factoring in the other considerations.
CONCLUSION Profit mining is a promising data mining approach because it addresses the ultimate goal of data mining. In this article, we study profit mining in the context of the retailing business, but the principles and techniques illustrated should be applicable to other applications. For example, "items" can be general actions and "prices" can be a notion of utility resulting from actions. In addition, "items" can be used to model customer demographic information such as Gender, in which case the price component is unused.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993, May). Mining association rules between sets of items in large databases. ACM Special Interest Group on Management of Data (SIGMOD) (pp. 207-216), Washington D.C., USA. Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. International Conference on Very Large Data Bases (VLDB) (pp. 487-499), Santiago de Chile. Brijs, T., Swinnen, G., Vanhoof, K., & Wets, G. (1999, August). Using association rules for product assortment decisions: A case study. International Conference on Knowledge Discovery and Data Mining (KDD) (pp. 254-260), San Diego, USA. Chan, R., Yang, Q., & Shen, Y. (2003, November). Mining high utility itemsets. IEEE International Conference on Data Mining (ICDM) (pp. 19-26), Melbourne, USA. Domingos, P. (1999, August). MetaCost: A general method for making classifiers cost-sensitive. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 155-164), San Diego, USA. Han, J., & Fu, Y. (1995, September). Discovery of multiple-level association rules from large databases. International Conference on Very Large Data Bases (VLDB) (pp. 420-431), Zurich, Switzerland. Kleinberg, J., Papadimitriou, C. & Raghavan, P. (1998, December). A microeconomic view of data mining. Data Mining and Knowledge Discovery Journal, 2(4), 311-324. Lin, T. Y., Yao, Y.Y., & Louie, E. (2002, May). Value added association rules. Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference PAKDD (pp. 328-333), Taipei, Taiwan.
Ling, C., & Li, C. (1998, August) Data mining for direct marketing: problems and solutions. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 73-79), New York, USA.
Wong, R. C. W., Fu, A. W. C., & Wang, K. (2003, November). MPIS: Maximal-profit item selection with crossselling considerations. IEEE International Conference on Data Mining (ICDM) (pp. 371-378), Melbourne, USA.
Margineantu, D. D., & Dietterich, G. T. (2000, June-July). Bootstrap methods for the cost-sensitive evaluation of classifiers. International Conference on Machine Learning (ICML) (pp. 583-590), San Francisco, USA.
Yao, H., Hamilton, H. J., & Butz, C. J. (2004, April). A foundational approach for mining itemset utilities from databases. SIAM International Conference on Data Mining (SIAMDM) (pp. 482-486), Florida, USA.
Masand, B., & Shapiro, G. P. (1996, August) A comparison of approaches for maximizing business payoff of prediction models. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 195-201), Portland, USA.
Zadrozny, B., & Elkan, C. (2001, August). Learning and making decisions when costs and probabilities are both unknown. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 204-213), San Francisco, USA.
Pednault, E., Abe, N., & Zadrozny, B. (2002, July). Sequential cost-sensitive decision making with reinforcement learning. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (p. 259-268), Edmonton, Canada.
KEY TERMS
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann. Resnick, P., & Varian, H.R. (1997). CACM special issue on recommender systems. Communications of the ACM, 40(3), 56-58. Schafer, J. B., Konstan, J. A., & Riedl, J. (1999, November). Recommender systems in E-commerce. ACM Conference on Electronic Commerce (pp. 158-166), Denver, USA. Silberschatz, A., & Tuzhilin, A. (1996). What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), 970-974. Srikant, R., & Agrawal, R. (1995, September). Mining generalized association rules. International Conference on Very Large Data Bases (VLDB) (pp. 407-419), Zurich, Switzerland. Wang, K., & Su, M. Y. (2002, July). Item selection by hubauthority profit ranking. ACM SIG International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 652-657), Edmonton, Canada. Wang, K., Zhou, S., & Han, J. (2002, March). Profit mining: From patterns to actions. International Conference on Extending Database Technology (EDBT) (pp. 70-87), Prague, Czech Republic. Wang, K., Zhou, S., Yeung, J. M. S., & Yang, Q. (2003, March). Mining customer value: From association rules to direct marketing. International Conference on Data Engineering (ICDE) (pp. 738-740), Bangalore, India.
Association Rule: An association rule has the form I1 → I2, where I1 and I2 are two itemsets. The support of an association rule is the support of the itemset I1 ∪ I2, and the confidence of a rule is the ratio of the support of I1 ∪ I2 to the support of I1. Classification: Given a set of training examples in which each example is labeled by a class, build a model, called a classifier, to predict the class label of new examples that follow the same class distribution as the training examples. A classifier is accurate if the predicted class label is the same as the actual class label. Cost Sensitive Classification: The error of a misclassification depends on the type of the misclassification. For example, the error of misclassifying Class 1 as Class 2 may not be the same as the error of misclassifying Class 1 as Class 3. Frequent Itemset: The support of an itemset refers to the percentage of transactions that contain all the items in the itemset. A frequent itemset is an itemset with support above a pre-specified threshold. Over-fitting Rule: A rule that has high performance (e.g., high classification accuracy) on observed transactions but performs poorly on future transactions. Hence, such rules should be excluded from decision-making systems (e.g., recommenders). In many cases over-fitting rules are generated due to noise in the data set. Profit Mining: In a general sense, profit mining refers to data mining aimed at maximizing a given objective function over decision making for a targeted population (Wang, Zhou & Han, 2002). Finding a set of rules that pass a given threshold on some interestingness measure (such as association rule mining or its variations) is not profit
mining because of the lack of a specific objective function to be maximized. Classification is a special case of profit mining where the objective function is the accuracy and the targeted population consists of future cases. This paper examines a specific problem of profit mining, i.e.,
building a model for recommending target products and prices with the objective of maximizing net profit. Transaction: A transaction is some set of items chosen from a fixed alphabet.
Pseudo Independent Models
Yang Xiang, University of Guelph, Canada
INTRODUCTION
Graphical models such as Bayesian networks (BNs) (Pearl, 1988) and decomposable Markov networks (DMNs) (Xiang, Wong & Cercone, 1997) have been applied widely to probabilistic reasoning in intelligent systems. Figure 1 illustrates a BN and a DMN on a trivial uncertain domain: A virus can damage computer files, and so can a power glitch. A power glitch also causes a VCR to reset. The BN in (a) has four nodes, corresponding to four binary variables taking values from {true, false}. The graph structure encodes a set of dependence and independence assumptions (e.g., that f is directly dependent on v and p but is independent of r, once the value of p is known). Each node is associated with a conditional probability distribution conditioned on its parent nodes (e.g., P(f | v, p)). The joint probability distribution is the product P(v, p, f, r) = P(f | v, p) P(r | p) P(v) P(p). The DMN in (b) has two groups of nodes that are maximally pair-wise connected, called cliques. Each clique is associated with a probability distribution (e.g., clique {v, p, f} is assigned P(v, p, f)). The joint probability distribution is P(v, p, f, r) = P(v, p, f) P(r, p) / P(p), where P(p) can be derived from one of the clique distributions. The networks, for instance, can be used to reason about whether there are viruses in the computer system, after observations on f and r are made. Construction of such networks by elicitation from domain experts can be very time-consuming. Automatic discovery (Neapolitan, 2004) by exhaustively testing all possible network structures is intractable. Hence, heuristic search must be used. This article examines a class of graphical models that cannot be discovered using the common heuristics.
BACKGROUND
Let V be a set of n discrete variables x1, …, xn (in what follows, we will focus on finite, discrete variables). Each variable xi has a finite space Si = {xi,1, xi,2, …, xi,Di} of cardinality Di. When there is no confusion, we write xi,j as xij for simplicity. The space of a set V of variables is defined by the Cartesian product of the spaces of all variables in V, that is, SV = S1 × … × Sn (i.e., ∏i Si). Thus, SV contains the tuples made of all possible combinations of values of the variables in V. Each tuple is called a configuration of V, denoted by v = (x1, …, xn). Let P(xi) denote the probability function over xi and P(xij) denote the probability value P(xi = xij).
Figure 1. (a) a trivial example BN; (b) a corresponding DMN
A probabilistic domain model (PDM) M over V defines the probability values of every configuration for every subset A ⊆ V. Let P(V) or P(x1, …, xn) denote the joint probability distribution (JPD) function over x1, …, xn and P(x1j1, …, xnjn) denote the probability value of a configuration (x1j1, …, xnjn). We refer to the function P(A) over A ⊂ V as the marginal distribution over A and P(xi) as the marginal distribution of xi. We refer to P(x1j1, …, xnjn) as a joint parameter and P(xij) as a marginal parameter of the corresponding PDM over V. For any three disjoint subsets of variables W, U and Z in V, subsets W and U are called conditionally independent given Z if P(W | U, Z) = P(W | Z) for all possible values in W, U and Z such that P(U, Z) > 0. Conditional independence signifies dependence mediated by Z. It allows the dependence among W ∪ U ∪ Z to be modeled over the subsets W ∪ Z and U ∪ Z separately. Conditional independence is the key property exploited through graphical models. Subsets W and U are said to be marginally independent (sometimes referred to as unconditionally independent) if P(W | U) = P(W) for all possible values of W and U such that P(U) > 0. When two subsets of variables are marginally independent, there is no dependence between them. Hence, each subset
can be modeled independently without losing information. If each variable xi in a subset A is marginally independent of A \ {xi}, the variables in A are said to be marginally independent. The following proposition reveals a useful property, called factorization, when this is the case. •
Proposition 1: If each variable xi in a subset A is marginally independent of A \ {xi}, then P(A) = ∏_{xi ∈ A} P(xi).
Variables in a subset A are called generally dependent if P(B | A \ B) ≠ P(B) for every proper subset B ⊂ A. If a subset of variables is generally dependent, its proper subsets cannot be modeled independently without losing information. A generally dependent subset of variables, however, may display conditional independence within the subset. For example, consider A = {x1, x2, x3}. If P(x1, x2 | x3) = P(x1, x2), i.e., {x1, x2} and x3 are marginally independent, then A is not generally dependent. On the other hand, if P(x1, x2 | x3) ≠ P(x1, x2), P(x2, x3 | x1) ≠ P(x2, x3), and P(x3, x1 | x2) ≠ P(x3, x1), then A is generally dependent. Variables in A are collectively dependent if, for each proper subset B ⊂ A, there exists no proper subset C ⊂ A \ B that satisfies P(B | A \ B) = P(B | C). Collective dependence prevents conditional independence and modeling through proper subsets of variables. Table 1 shows the JPD over a set of variables V = {x1, x2, x3, x4}. The four variables are collectively dependent; for example, P(x1,1 | x2,0, x3,1, x4,0) = 0.257 and P(x1,1 | x2,0, x3,1) = P(x1,1 | x2,0, x4,0) = P(x1,1 | x3,0, x4,0) = 0.3.
MAIN THRUST
Pseudo-Independent (PI) Models
A pseudo-independent (PI) model is a PDM where proper subsets of a set of collectively dependent variables display marginal independence (Xiang, Wong & Cercone, 1997). The basic PI model is a full PI model: •
Definition 2 (Full PI Model): A PDM over a set V (|V| ≥ 3) of variables is a full PI model, if the following properties (called axioms of full PI models) hold:
(SI) Variables in each proper subset of V are marginally independent.
(SII) Variables in V are collectively dependent.
Table 1 shows the JPD of a binary full PI model, where V = {x1, x2, x3, x4}. Its marginal parameters are P(x1,0) = 0.7, P(x2,0) = 0.6, P(x3,0) = 0.35, P(x4,0) = 0.45. Any subset of three variables is marginally independent; for example, P(x1,1, x2,0, x3,1) = P(x1,1) P(x2,0) P(x3,1) = 0.117. The four variables are collectively dependent, as explained previously. Table 2 is the JPD of the color model given earlier, where V = {x1, x2, x3}. The marginal independence can be verified by P(x1=red) = P(x2=red) = P(x3=red) = 0.5 and P(x1=red | x2) = P(x1=red | x3) = P(x2=red | x3) = 0.5, and the collective dependence can be seen from P(x1=red | x2=red, x3=red) = 1. By relaxing condition (SI) on marginal independence, full PI models are generalized into partial PI models, which are defined through the marginally independent partition (Xiang, Hu, Cercone & Hamilton, 2000) introduced in the following: •
Definition 3 (Marginally Independent Partition): Let V (|V| ≥ 3) be a set of variables, and B = {B1, …, Bm} (m ≥ 2) be a partition of V. B is a marginally independent partition if, for every subset A = {xik | xik ∈ Bk, k = 1, ..., m}, the variables in A are marginally independent. Each block Bi is called a marginally independent block.
Intuitively, a marginally independent partition groups the variables in V into m blocks. If one forms a subset A by taking one element from each block, then the variables in A are marginally independent. Unlike in full PI models, in a partial PI model it is not necessary that every proper subset is marginally independent. Instead, that requirement is replaced with the marginally independent partition. •
Definition 4 (Partial PI Model): A PDM over a set V (|V| ≥ 3) of variables is a partial PI model, if the following properties (called axioms of partial PI models) hold:
(SI') V can be partitioned into two or more marginally independent blocks.
(SII) Variables in V are collectively dependent.
Table 3 shows the JPD of a partial PI model over two ternary variables and one binary variable, where V = {x1, x2, x3}. Its marginal parameters are P(x1,0) = 0.3, P(x1,1) = 0.2, P(x1,2) = 0.5, P(x2,0) = 0.3, P(x2,1) = 0.4, P(x2,2) = 0.3, P(x3,0) = 0.4, P(x3,1) = 0.6. The marginally independent partition is {{x1}, {x2, x3}}. Variable x1 is marginally independent of each variable in the other block; for example, P(x1,1, x2,0) = P(x1,1) P(x2,0) = 0.06. However, variables within the block {x2, x3} are dependent; for example, P(x2,0, x3,1) = 0.1 ≠ P(x2,0) P(x3,1) = 0.18. The three variables are collectively dependent; for example, P(x1,1 | x2,0, x3,1) = 0.1 while P(x1,1 | x2,0) = P(x1,1 | x3,1) = 0.2. Similarly, P(x2,1 | x1,0, x3,1) = 0.61, while P(x2,1 | x1,0) = 0.4 and P(x2,1 | x3,1) = 0.5.
Table 1. A PDM where V = {x1, x2, x3, x4}
v         P(v)     v         P(v)     v         P(v)     v         P(v)
(0,0,0,0) 0.0586   (0,1,0,0) 0.0517   (1,0,0,0) 0.0359   (1,1,0,0) 0.0113
(0,0,0,1) 0.0884   (0,1,0,1) 0.0463   (1,0,0,1) 0.0271   (1,1,0,1) 0.0307
(0,0,1,0) 0.1304   (0,1,1,0) 0.0743   (1,0,1,0) 0.0451   (1,1,1,0) 0.0427
(0,0,1,1) 0.1426   (0,1,1,1) 0.1077   (1,0,1,1) 0.0719   (1,1,1,1) 0.0353
Table 2. The color model where V = {x1, x2, x3}
v                     P(v)    v                       P(v)
(red, red, red)       0.25    (green, red, red)       0
(red, red, green)     0       (green, red, green)     0.25
(red, green, red)     0       (green, green, red)     0.25
(red, green, green)   0.25    (green, green, green)   0
Table 3. A partial PI model where V = {x1, x2, x3}
v       P(v)   v       P(v)   v       P(v)   v       P(v)   v       P(v)   v       P(v)
(0,0,0) 0.05   (0,1,1) 0.11   (1,0,0) 0.05   (1,1,1) 0.08   (2,0,0) 0.10   (2,1,1) 0.11
(0,0,1) 0.04   (0,2,0) 0.06   (1,0,1) 0.01   (1,2,0) 0.03   (2,0,1) 0.05   (2,2,0) 0.01
(0,1,0) 0.01   (0,2,1) 0.03   (1,1,0) 0      (1,2,1) 0.03   (2,1,0) 0.09   (2,2,1) 0.14
Variables that form either a full or a partial PI model may be a proper subset of V, where the remaining variables display normal dependence and independence relations. In such a case, the subset is called an embedded PI submodel. A PDM can contain one or more embedded PI submodels. •
Definition 5 (Embedded PI Submodel): Let a PDM be over a set V of generally dependent variables. A proper subset V' ⊂ V (|V'| ≥ 3) of variables forms an embedded PI submodel if the following properties (axioms of embedded PI models) hold:
(SIII) V' forms a partial PI model.
(SIV) The marginally independent partition B = {B1, …, Bm} of V' extends into V. That is, there is a partition {A1, …, Am} of V such that Bi ⊆ Ai (i = 1, ..., m), and for each x ∈ Ai and each y ∈ Aj (i ≠ j), x and y are marginally independent.
Definition 5 requires that the variables in V are generally dependent. This eliminates the possibility that a proper subset is marginally independent of the rest of V. Table 4 shows the JPD of a PDM with an embedded PI model over variables x1, x2 and x3, where the marginals are
P(x1,0) = 0.3, P(x2,0) = 0.6, P(x3,0) = 0.3, P(x4,0) = 0.34, P(x5,0) = 0.59. The marginally independent partition of the embedded PI model is {B1 = {x1}, B2 = {x2, x3}}. Outside the PI submodel, B1 extends to include x4 and B2 extends to include x5. Each variable in one block is marginally independent of each variable in the other block; for example, P(x1,1, x5,0) = P(x1,1) P(x5,0) = 0.413. Variables in the same block are pair-wise dependent; for example, P(x2,1, x3,0) = 0.1 ≠ P(x2,1) P(x3,0) = 0.12. The three variables in the submodel are collectively dependent; for example, P(x1,1 | x2,0, x3,1) = 0.55, while P(x1,1 | x2,0) = P(x1,1 | x3,1) = 0.7. However, x4 is independent of the other variables given x1, and x5 is independent of the other variables given x3, displaying the normal conditional independence relation; for example, P(x5,1 | x2,0, x3,0, x4,0) = P(x5,1 | x3,0) = 0.9. PDMs with embedded PI submodels are the most general type of PI models.
Table 4. A PDM containing an embedded PI model where V = {x1, x2, x3, x4, x5}
v           P(v)    v           P(v)    v           P(v)    v           P(v)
(0,0,0,0,0) 0       (0,1,0,0,0) .0018   (1,0,0,0,0) .0080   (1,1,0,0,0) .0004
(0,0,0,0,1) 0       (0,1,0,0,1) .0162   (1,0,0,0,1) .0720   (1,1,0,0,1) .0036
(0,0,0,1,0) 0       (0,1,0,1,0) .0072   (1,0,0,1,0) .0120   (1,1,0,1,0) .0006
(0,0,0,1,1) 0       (0,1,0,1,1) .0648   (1,0,0,1,1) .1080   (1,1,0,1,1) .0054
(0,0,1,0,0) .0288   (0,1,1,0,0) .0048   (1,0,1,0,0) .0704   (1,1,1,0,0) .0864
(0,0,1,0,1) .0072   (0,1,1,0,1) .0012   (1,0,1,0,1) .0176   (1,1,1,0,1) .0216
(0,0,1,1,0) .1152   (0,1,1,1,0) .0192   (1,0,1,1,0) .1056   (1,1,1,1,0) .1296
(0,0,1,1,1) .0288   (0,1,1,1,1) .0048   (1,0,1,1,1) .0264   (1,1,1,1,1) .0324
Discovery of PI Models
Given a data set over n variables, the number of possible network structures is super-exponential. To make the discovery tractable, a common heuristic method is the single-link lookahead search (Cooper & Herskovits, 1992; Heckerman, Geiger & Chickering, 1995; Herskovits & Cooper, 1990; Lam & Bacchus, 1994). Learning starts with
some initial graphical structure. Successive graphical structures, representing different sets of conditional independence assumptions, are adopted. Each adopted structure differs from its predecessor by a single link and improves a score metric optimally. PI models pose a challenge to such algorithms. It has been shown (Xiang, Wong & Cercone, 1996) that when the underlying PDM of the given data is PI, the graph structure returned by such algorithms misrepresents the actual dependence relations of the PDM. Intuitively, these algorithms update the current graph structure based on tests for local dependence (see the next paragraph for justification). The marginal independence of a PI model misleads these algorithms into ignoring the collective dependence. Consider a full PI model over n binary variables x1, ..., xn, where n ≥ 3. Each xi (1 ≤ i < n) takes the value red or green with equal chance. Variable xn takes red if an even number of the other variables take red, and takes green otherwise. If the search starts with an empty graph, then single-link lookahead will return an empty graph, because every proper subset of variables is marginally independent. From the values of any n−1 variables, this learned model will predict the n'th variable as equally likely to be red or green. In fact, when the values of any n−1 variables are known, the value of the n'th variable can be determined with certainty! When one has a life-or-death decision to make, one certainly does not want to use such an incorrectly learned model. The key to correctly discovering a PI model from data is to identify collective dependence. In particular, given a large problem domain that contains embedded PI submodels, the key to discovery is to identify the collective dependence among the variables in each submodel. This requires multi-link lookahead search, during which candidate graph structures with k > 1 additional links are examined before the best candidate is adopted. The multiple additional links define their endpoints as a subset of variables, whose potential collective dependence is tested explicitly. Once such collective dependence is confirmed, the subset is identified as a PI submodel. Clearly, if improperly organized, multi-link lookahead search can become intractable.
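The parity construction just described can be checked directly; the short script below (written for this article, not code from the cited work) builds such a distribution for n = 4 and verifies that proper subsets look independent while the full set of variables is collectively dependent:

from itertools import product

# Full PI model by parity: x1..x(n-1) are fair coins (0 = green, 1 = red)
# and xn = 1 iff an even number of the other variables equal 1.
n = 4
dist = {}
for bits in product([0, 1], repeat=n - 1):
    xn = 1 if sum(bits) % 2 == 0 else 0
    dist[bits + (xn,)] = 1.0 / 2 ** (n - 1)

def prob(assignment):  # assignment: dict {variable index: value}
    return sum(p for cfg, p in dist.items()
               if all(cfg[i] == v for i, v in assignment.items()))

# Three variables, e.g. x1, x2, x4, are marginally independent:
print(prob({0: 1, 1: 1, 3: 1}),
      prob({0: 1}) * prob({1: 1}) * prob({3: 1}))        # 0.125 vs 0.125
# ...but all four variables together are collectively dependent:
print(prob({0: 1, 1: 1, 2: 1, 3: 1}),
      prob({0: 1}) * prob({1: 1}) * prob({2: 1}) * prob({3: 1}))  # 0.0 vs 0.0625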
Table 5. Variables in social survey data on harmful drinking
i   Variable          Question
0   HarmSocial        Did alcohol harm friendships/social life?
1   HarmHealth        Did alcohol harm your physical health?
2   HrmLifOutlk       Did alcohol harm your outlook on life?
3   HarmLifMrig       Did alcohol harm your life or marriage?
4   HarmWorkSty       Did alcohol harm your work, studies, and so forth?
5   HarmFinance       Did alcohol harm your financial position?
6   NumDrivrDrink     How many drinks should a designated driver have?
7   NmNonDrvrDrink    How many drinks should non-designated driver have?
Hu and Xiang (1997) presented an algorithm which applies single-link lookahead search and low-order (small k value) multi-link lookahead search as much as possible, and uses high-order (large k value) multi-link lookahead search only when necessary. An experiment using data from a social survey was reported in Xiang et al. (2000). A PI model was discovered from the data on harmful drinking (see Table 5). The discovered DMN graphical structure is shown in Figure 2. The discovered PI model performed 10% better in prediction than the model discovered using single-link lookahead search.
FUTURE TRENDS A number of issues are still open for research. A PI submodel is highly constrained by its collective dependence. Therefore, a PI submodel over k binary variables is specified by fewer than 2^k − 1 probability parameters. This means that a PI submodel, though collectively dependent, is simpler than a conventional complete graphical submodel. Research is needed to quantify this difference. Recent progress on this is reported in Xiang, Lee, and Cercone (2003). Collective dependence in PI models does not allow the conventional factorization, which is a powerful tool in both knowledge representation and probabilistic inference with graphical models. On the other hand, PI submodels are simple submodels, as argued previously. Research into formalisms and techniques that can exploit this simplicity is needed.
Figure 2. DMN learned from data on harmful drinking
Causal models are stronger than dependence models, as they provide a basis for manipulation and control. What is the relation between PI models and their causal counterparts? How can one discover the causal structure within a PI model? Answers to these questions will make useful contributions to knowledge discovery, both theoretically and practically.
CONCLUSION Research in the last decade indicated that PI models exist in practice. This fact complements the theoretical analysis that, for any given set of n ≥ 3 variables, there exist infinitely many PI models, each of which is characterized by a distinct JPD. Knowledge discovery by definition is an open-minded process. The newer generation of discovery algorithms, equipped with the theoretical understanding of PI models, is more open-minded. They admit PI models when the data say so, thus improving the quality of knowledge discovery and allowing more accurate predictions from more accurately discovered models. The first generation of algorithms capable of discovering PI models demonstrates that, with a reasonable amount of extra computation (relative to single-link lookahead search), many PI models can be discovered and used effectively in inference.
REFERENCES Cooper, G.F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347. Heckerman, D., Geiger, D., & Chickering, D.M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243. Herskovits, E.H., & Cooper, G.F. (1990). Kutato: An entropy-driven system for construction of probabilistic expert systems from databases. Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence, Cambridge, Massachusetts.
Hu, J., & Xiang, Y. (1997). Learning belief networks in domains with recursively embedded pseudo independent submodels. Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, Providence, Rhode Island. Lam, W., & Bacchus, F. (1994). Learning Bayesian networks: An approach based on the MDL principle. Computational Intelligence, 10(3), 269-293. Neapolitan, R.E. (2004). Learning Bayesian networks. Prentice Hall. Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann. Wong, S.K.M., & Xiang, Y. (1994). Construction of a Markov network from data for probabilistic inference. Proceedings of the 3rd International Workshop on Rough Sets and Soft Computing, San Jose, California. Xiang, Y., Hu, J., Cercone, N., & Hamilton, H. (2000). Learning pseudo-independent models: Analytical and experimental results. In H. Hamilton (Ed.), Advances in artificial intelligence (pp. 227-239). Springer. Xiang, Y., Lee, J., & Cercone, N. (2003). Parameterization of pseudo-independent models. Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference. Xiang, Y., Wong, S.K.M., & Cercone, N. (1996). Critical remarks on single link search in learning belief networks. Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, Portland, Oregon. Xiang, Y., Wong, S.K.M., & Cercone, N. (1997). A "microscopic" study of minimum entropy search in learning decomposable Markov networks. Machine Learning, 26(1), 65-92.
KEY TERMS

Collective Dependence: A set V of variables is collectively dependent if V cannot be split into non-empty subsets X and Y such that X and Y are marginally independent, nor can V be partitioned into non-empty subsets X, Y, and Z such that X and Y are conditionally independent given Z.
Conditional Independence: Two sets X and Y of variables are conditionally independent given a third set Z if knowledge of Z (what value Z takes) makes knowledge of Y irrelevant to guessing the value of X.
Embedded PI Submodel: A full or partial PI model over a proper subset of the domain variables. The most general PI models are those over large problem domains that contain embedded PI submodels.
Full PI Model: A PI model in which every proper subset of variables is marginally independent. Full PI models are the most basic PI models. (A small numerical illustration follows these definitions.)
Marginal Independence: Two sets X and Y of variables are marginally independent if knowledge of Y is irrelevant to guessing the value of X.
Partial PI Model: A PI model in which not every proper subset of variables is required to be marginally independent. Every full PI model is also a partial PI model, but the converse is not true. Hence, partial PI models are more general than full PI models.
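As a small numerical illustration of the definitions above (a standard textbook-style construction, not taken from this article's figures), the parity distribution over three binary variables is a full PI model: every pair of variables is marginally independent, yet the three variables are collectively dependent. The sketch below verifies the pairwise factorization and shows that the full joint does not factorize into the product of the single-variable marginals.

from itertools import product

# x1, x2 uniform and independent; x3 = x1 XOR x2. Four equally likely configurations.
jpd = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

def marginal(idx):
    m = {}
    for cfg, p in jpd.items():
        key = tuple(cfg[i] for i in idx)
        m[key] = m.get(key, 0.0) + p
    return m

# Every proper subset is marginally independent: each pairwise joint factorizes.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij, pi, pj = marginal((i, j)), marginal((i,)), marginal((j,))
    assert all(abs(pij[(a, b)] - pi[(a,)] * pj[(b,)]) < 1e-9
               for a in (0, 1) for b in (0, 1))

# The three variables are nevertheless collectively dependent; in particular the
# full joint differs from the product of the three marginals (half of the eight
# configurations have probability 0 instead of 1/8).
p1, p2, p3 = marginal((0,)), marginal((1,)), marginal((2,))
print(jpd.get((1, 1, 1), 0.0), p1[(1,)] * p2[(1,)] * p3[(1,)])   # 0.0 versus 0.125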
Reasoning about Frequent Patterns with Negation
Marzena Kryszkiewicz, Warsaw University of Technology, Poland
INTRODUCTION

Discovering frequent patterns in large databases is an important data mining problem. The problem was introduced by Agrawal, Imielinski, and Swami (1993) for a sales transaction database. Frequent patterns were defined there as sets of items that are purchased together frequently. Frequent patterns are commonly used for building association rules. For example, an association rule may state that 80% of customers who buy fish also buy white wine. This rule is derivable from the fact that fish occurs in 5% of sales transactions and the set {fish, white wine} occurs in 4% of transactions. Patterns and association rules can be generalized by admitting negation. A sample association rule with negation could state that 75% of customers who buy coke also buy chips and neither beer nor milk. Knowledge of this kind is important not only for sales managers, but also in medical areas (Tsumoto, 2002). Admitting negation in patterns usually results in an abundance of mined patterns, which makes analysis of the discovered knowledge infeasible. It is thus preferable to discover and store a possibly small fraction of patterns from which one can derive all other significant patterns when required. In this chapter, we introduce the first lossless representations of frequent patterns with negation.
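For concreteness, the confidence arithmetic behind the two example rules can be checked in a few lines. The sketch below is an illustration only; the support values for the rule with negation are assumed for the example and are not given in the text.

sup_fish, sup_fish_wine = 0.05, 0.04            # supports quoted above
print(sup_fish_wine / sup_fish)                  # 0.8 -> 80% of fish buyers also buy white wine

# The same calculation applies to rules with negation once the support of the
# negated pattern is known, e.g. for assumed values of sup({coke}) and
# sup({coke, chips, -beer, -milk}):
sup_coke, sup_coke_chips_no_beer_milk = 0.08, 0.06
print(sup_coke_chips_no_beer_milk / sup_coke)    # 0.75 -> the 75% rule above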
BACKGROUND

Let us analyze the sample transactional database D presented in Table 1, which we will use throughout the chapter. Each row in this database reports items that were purchased by a customer during a single visit to a supermarket. As follows from Table 1, items a and b were purchased together in four transactions. The number of transactions in which a set of items {x1, ..., xn} occurs is called its support and is denoted by sup({x1, ..., xn}). A set of items is called a frequent pattern if its support exceeds a user-specified threshold (minSup); otherwise, it is called an infrequent pattern. In the remainder of the chapter, we assume minSup = 1. One can discover 27 frequent patterns from D, which we list in Figure 1.
Table 1. Sample database D

Id | Transaction
T1 | {abce}
T2 | {abcef}
T3 | {abch}
T4 | {abe}
T5 | {acfh}
T6 | {bef}
T7 | {h}
T8 | {af}
One can easily note that the support of a pattern never exceeds the supports of its subsets. Hence, subsets of a frequent pattern are also frequent, and supersets of an infrequent pattern are infrequent. Aside from searching for only statistically significant sets of items, one may be interested in identifying frequent cases when the purchase of some items (the presence of some symptoms) excludes the purchase of other items (the presence of other symptoms). A pattern consisting of items x1, …, xm and negations of items xm+1, …, xn will be denoted by {x1, …, xm, -xm+1, …, -xn}. The support of pattern {x1, …, xm, -xm+1, …, -xn} is defined as the number of transactions in which all items in set {x1, …, xm} occur and no item in set {xm+1, …, xn} occurs. In particular, {a(–b)} is supported by two transactions in D, while {a(–b)(–c)} is supported by one transaction. Hence, {a(–b)} is frequent, while {a(–b)(–c)} is infrequent. From now on, we will say that X is a positive pattern if X does not contain any negated item; otherwise, X is called a pattern with negation. A pattern obtained from pattern X by negating an arbitrary number of items in X is called a variation of X. For example, {ab} has four distinct variations (including itself): {ab}, {a(–b)}, {(–a)b}, {(–a)(–b)}. One can discover 109 frequent patterns in D, 27 of which are positive, and 82 of which have negated items.
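The supports quoted above can be reproduced by a direct scan of the transactions of Table 1. The following minimal Python sketch (an illustration only, not part of the original chapter) counts supports for patterns with and without negated items.

D = [set("abce"), set("abcef"), set("abch"), set("abe"),
     set("acfh"), set("bef"), set("h"), set("af")]

def sup(positive, negated=""):
    """Number of transactions containing every positive item and no negated item."""
    return sum(1 for t in D if set(positive) <= t and not (set(negated) & t))

print(sup("ab"))          # 4 -> {ab} is frequent for minSup = 1
print(sup("a", "b"))      # 2 -> {a(-b)} is frequent
print(sup("a", "bc"))     # 1 -> {a(-b)(-c)} is infrequent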
Figure 1. Frequent positive patterns discovered from database D. Values provided in square brackets in the subscript denote supports of patterns.
{abce}[2] {abc}[3] {abe}[3] {ace}[2] {acf}[2] {ach}[2] {bce}[2] {bef}[2] {ab}[4] {ac}[4] {ae}[3] {af}[3] {ah}[2] {bc}[3] {be}[4] {bf}[2] {ce}[2] {cf}[2] {ch}[2] {ef}[2] {a}[6] {b}[5] {c}[4] {e}[4] {f}[4] {h}[3] ∅[8]
In practice, the number of frequent patterns with negation is orders of magnitude greater than the number of frequent positive patterns. A first attempt to address the problem of the large number of frequent patterns with negation was undertaken by Toivonen (1996), who proposed a method for using the supports of positive patterns to derive the supports of patterns with negation. The method is based on the observation that for any pattern X and any item x, the number of transactions in which X occurs is the sum of the number of transactions in which X occurs with x and the number of transactions in which X occurs without x. In other words, sup(X) = sup(X∪{x}) + sup(X∪{(–x)}), or sup(X∪{(–x)}) = sup(X) – sup(X∪{x}) (Mannila & Toivonen, 1996). Repeated application of this property enables determination of the supports of patterns with an arbitrary number of negated items from the supports of positive patterns. For example, the support of pattern {a(–b)(–c)}, which has two negated items, can be calculated as follows: sup({a(–b)(–c)}) = sup({a(–b)}) – sup({a(–b)c}). Thus, the task of calculating the support of {a(–b)(–c)}, which has two negated items, becomes a task of calculating the supports of patterns {a(–b)} and {a(–b)c}, each of which contains only one negated item. We note that sup({a(–b)}) = sup({a}) – sup({ab}), and sup({a(–b)c}) = sup({ac}) – sup({abc}). Eventually, we obtain: sup({a(–b)(–c)}) = sup({a}) – sup({ab}) – sup({ac}) + sup({abc}). The support of {a(–b)(–c)} is hence determinable from the supports of {abc} and its proper subsets. It was proved in Toivonen (1996) that for any pattern with negation, its support is determinable from the supports of positive patterns. Nevertheless, knowledge of the supports of only the frequent positive patterns may be insufficient to derive the supports of all frequent patterns with negation (Boulicaut, Bykowski, & Jeudy, 2000), which we show below. Let us try to calculate the support of pattern {bef(–h)}: sup({bef(–h)}) = sup({bef}) – sup({befh}). Pattern {bef} is frequent and its support equals 2 (see Figure 1). To the
contrary, {befh} is not frequent, so its support does not exceed minSup, which equals 1. Hence, 1 ≤ sup({bef(–h)}) ≤ 2. The obtained result is not sufficient to determine whether {bef(–h)} is frequent. The problem of the large number of mined frequent patterns is widely recognized. Within the last five years, a number of lossless representations of frequent positive patterns have been proposed. Frequent closed itemsets were introduced in Pasquier et al. (1999); the generators representation was introduced in Kryszkiewicz (2001). Other lossless representations are based on disjunction-free sets (Bykowski & Rigotti, 2001), disjunction-free generators (Kryszkiewicz, 2001), generalized disjunction-free generators (Kryszkiewicz & Gajek, 2002), generalized disjunction-free sets (Kryszkiewicz, 2003), non-derivable itemsets (Calders & Goethals, 2002), and k-free sets (Calders & Goethals, 2003). All these models allow distinguishing between frequent and infrequent positive patterns and enable determination of the supports of all frequent positive patterns. Although the research on concise representations of frequent positive patterns is advanced, no model had been offered in the literature to represent all frequent patterns with negation.
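The derivation mechanism just described can be sketched as a small recursive procedure over a table of positive supports. The code below is an illustration only; the positive supports needed for the two examples are filled in by hand from database D.

POSITIVE_SUP = {"": 8, "a": 6, "ab": 4, "ac": 4, "abc": 3,
                "bef": 2, "befh": 0}       # positive supports taken from D

def sup_with_negation(positive, negated):
    """Derive the support of positive ∪ {negated items} from positive supports only,
    by repeatedly applying sup(X ∪ {(-x)}) = sup(X) - sup(X ∪ {x})."""
    if not negated:
        return POSITIVE_SUP["".join(sorted(positive))]
    x, rest = negated[0], negated[1:]
    return sup_with_negation(positive, rest) - sup_with_negation(positive + x, rest)

print(sup_with_negation("a", "bc"))    # 1 = sup({a}) - sup({ab}) - sup({ac}) + sup({abc})
print(sup_with_negation("bef", "h"))   # 2; note that sup({befh}) had to be supplied even
                                       # though {befh} is infrequent, which is exactly the
                                       # difficulty discussed above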
MAIN THRUST

We offer a generalized disjunction-free literal set model (GDFLR) as a concise lossless representation of all frequent positive patterns and all frequent patterns with negation. Without the need to access the database, GDFLR enables distinguishing between all frequent and infrequent patterns, and it enables calculation of the supports of all frequent patterns. GDFLR uses the mechanism of deriving supports of positive patterns that was proposed in Kryszkiewicz and Gajek (2002). Hence, we first recall this mechanism. Then we examine how to use it to derive the supports of patterns with negation and propose a corresponding naive representation of frequent patterns. Next, we examine relationships
between specific patterns and the supports of their variations. Finally, we use the obtained results to offer GDFLR as a refined version of the naive model.
Reasoning about Positive Patterns Based on Generalized Disjunctive Patterns
Let us observe that whenever item a occurs in a transaction in database D, then item b, or f, or both also occur in the transaction. This fact, related to pattern {abf}, can be expressed in the form of the implication a ⇒ b ∨ f. Now, without accessing the database, we can derive additional implications, such as ac ⇒ b ∨ f and a ⇒ b ∨ f ∨ c, which are related to supersets of {abf}. The knowledge of such implications can be used for calculating the supports of the patterns they relate to. For example, ac ⇒ b ∨ f implies that the number of transactions in which {ac} occurs equals the number of transactions in which {ac} occurs with b, plus the number of transactions in which {ac} occurs with f, minus the number of transactions in which {ac} occurs with both b and f. In other words, sup({ac}) = sup({acb}) + sup({acf}) – sup({acbf}). Hence, sup({abcf}) = sup({abc}) + sup({acf}) – sup({ac}), which means that the support of pattern {abcf} is determinable from the supports of its proper subsets. In general, if there is an implication related to a positive pattern, then the support of this pattern is derivable from the supports of its proper subsets [see Kryszkiewicz and Gajek (2002) for the proof]. If there is such an implication for a pattern, the pattern is called a generalized disjunctive set. Otherwise, it is called a generalized disjunction-free set. We now present a lossless generalized disjunction-free set representation (GDFSR) of all frequent positive patterns, which uses the discussed mechanism of deriving supports. The GDFSR representation is defined as consisting of the following components (Kryszkiewicz, 2003):
• the main component, containing all frequent generalized disjunction-free positive patterns stored together with their supports;
• the infrequent border, consisting of all infrequent positive patterns all of whose proper subsets belong to the main component;
• the generalized disjunctive border, consisting of all minimal frequent generalized disjunctive positive patterns stored together with their supports and/or respective implications.
Figure 2 depicts the GDFSR representation found in D. The main component consists of 17 elements, the infrequent border of 7 elements, and the generalized disjunctive border of 2 elements. We now demonstrate how to use this representation for evaluating unknown positive patterns:
• Let us consider pattern {abcf}. We note that {abcf} has a subset, for example {abf}, in the infrequent border. This means that all supersets of {abf}, in particular {abcf}, are infrequent.
• Let us consider pattern {abce}. It does not have any subset in the infrequent border, but it has a subset, for example {ac}, in the generalized disjunctive border. The property c ⇒ a associated with {ac} implies the property bce ⇒ a related to {abce}. Hence, sup({abce}) = sup({bce}). Now, we need to determine the support of {bce}. We observe that {bce} has subset {be} in the generalized disjunctive border. The property e ⇒ b associated with {be} implies the property ce ⇒ b related to {bce}. Hence, sup({bce}) = sup({ce}). Pattern {ce} belongs to the main component, so its support is known (here: equal to 2). Summarizing, sup({abce}) = sup({bce}) = sup({ce}) = 2. (A small code sketch of this derivation follows Figure 2.)
Figure 2. The GDFSR representation found in D

Main component: {ab}[4] {ae}[3] {af}[3] {ah}[2] {bc}[3] {bf}[2] {ce}[2] {cf}[2] {ch}[2] {ef}[2] {a}[6] {b}[5] {c}[4] {e}[4] {f}[4] {h}[3] ∅[8]
Infrequent border: {abf} {aef} {bcf} {cef} {bh} {eh} {fh}
Generalized disjunctive border: {ac}[4, c ⇒ a] {be}[4, e ⇒ b]
Naive Approach to Reasoning about Patterns with Negation Based on Generalized Disjunctive Patterns

One can easily note that implications of the kind we were looking for in the case of positive patterns may also exist for patterns with negation. For instance, looking at Table 1, we observe that whenever item a occurs in a transaction, then item b occurs in the transaction and/or item e is missing from the transaction. This fact is related to pattern {ab(–e)} and can be expressed as the implication a ⇒ b ∨ (–e). Hence, sup({a}) = sup({ab}) + sup({a(–e)}) – sup({ab(–e)}), or sup({ab(–e)}) = sup({ab}) + sup({a(–e)}) – sup({a}). Thus, the support of pattern {ab(–e)} is determinable from the supports of its proper subsets. In general, the support of a generalized disjunctive pattern with any number of negated items is determinable from the supports of its proper subsets. Having this in mind, we conclude that the GDFSR model can easily be adapted to represent all frequent patterns. We define a generalized disjunction-free set representation of frequent patterns admitting negation (GDFSRN) as holding all conditions that are held by GDFSR except for the condition restricting the representation's elements to positive patterns. GDFSRN discovered from database D consists of 113 elements. It contains both positive patterns and patterns with negation. For instance, {bc}[3], {b(–c)}[2], and {(–b)(–c)}[2], which are frequent and generalized disjunction-free, are sample elements of the main component of this representation, whereas {a(–c)}[2, ∅ ⇒ a ∨ (–c)], which is a minimal frequent generalized disjunctive pattern, is a sample element of the generalized disjunctive border. Although conceptually straightforward, the representation is not concise, since its cardinality (113) is comparable with the cardinality of all frequent patterns (109).
Generalized Disjunctive Patterns versus Supports of Variations

Let us consider the implication a ⇒ b ∨ f, which holds in our database. The statement that whenever item a occurs in a transaction, then item b and/or item f also occurs in the transaction is equivalent to the statement that there is no transaction in which a occurs while both b and f are absent. Therefore, we conclude that the implication a ⇒ b ∨ f is equivalent to the statement sup({a(–b)(–f)}) = 0. We generalize this observation as follows: x1 … xm ⇒ xm+1 ∨ … ∨ xn is equivalent to sup({x1, …, xm} ∪ {-xm+1, …, -xn}) = 0.
Let us recall that x1 … xm ⇒ xm+1 ∨ … ∨ xn implies that pattern {x1, …, xn} is generalized disjunctive, and sup({x1, …, xm} ∪ {-xm+1, …, -xn}) = 0 implies that pattern {x1, …, xn} has a variation different from itself that does not occur in any transaction. Hence, we infer that a positive pattern is generalized disjunctive if and only if it has a variation with negation the support of which equals 0.
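This equivalence is easy to verify directly on the sample database; the short sketch below (an illustration only) checks two of the implications used earlier.

D = [set("abce"), set("abcef"), set("abch"), set("abe"),
     set("acfh"), set("bef"), set("h"), set("af")]

def sup(positive, negated=""):
    """Support of a pattern with the given positive and negated items."""
    return sum(1 for t in D if set(positive) <= t and not (set(negated) & t))

# a => b v f holds  <=>  sup({a(-b)(-f)}) = 0
print(sup("a", "bf"))    # 0, so the implication holds
# c => a holds       <=>  sup({(-a)c}) = 0
print(sup("c", "a"))     # 0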
Effective Approach to Reasoning about Patterns with Negation Based on Generalized Disjunctive Patterns

In order to overcome the possibly poor conciseness of the GDFSRN model, we offer a new representation of frequent patterns with negation. Our intention is to store in the new representation at most one pattern for each group of patterns occurring in GDFSRN that have the same positive variation. We define a generalized disjunction-free literal representation (GDFLR) as consisting of the following components (a brute-force construction over the sample database is sketched after the list):
• the main component, containing each positive pattern (stored with its support) that has at least one frequent variation and all of whose variations have non-zero supports;
• the infrequent border, containing each positive pattern all variations of which are infrequent and all proper subsets of which belong to the main component;
• the generalized disjunctive border, containing each minimal positive pattern (stored with its support and, possibly, an implication) that has at least one frequent variation and at least one variation with zero support.
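The following brute-force sketch makes the three definitions concrete by computing the components directly from the sample database D. It is an illustration only, not the discovery algorithm of Kryszkiewicz (2004b), and it omits the implications stored with generalized-disjunctive-border elements.

from itertools import combinations

D = [set("abce"), set("abcef"), set("abch"), set("abe"),
     set("acfh"), set("bef"), set("h"), set("af")]
ITEMS, MIN_SUP = sorted(set().union(*D)), 1

def sup(pos, neg):
    # transactions containing every item of pos and no item of neg
    return sum(1 for t in D if pos <= t and not (neg & t))

def variation_sups(X):
    # supports of every variation of positive pattern X (each item kept or negated)
    return [sup(X - set(neg), set(neg))
            for r in range(len(X) + 1) for neg in combinations(sorted(X), r)]

def proper_subsets(X):
    return [frozenset(s) for r in range(len(X)) for s in combinations(sorted(X), r)]

def in_main(X):            # at least one frequent variation, no zero-support variation
    s = variation_sups(X)
    return max(s) > MIN_SUP and min(s) > 0

def gd_candidate(X):       # at least one frequent variation and one zero-support variation
    s = variation_sups(X)
    return max(s) > MIN_SUP and min(s) == 0

positives = [frozenset(c) for r in range(len(ITEMS) + 1) for c in combinations(ITEMS, r)]
main = [X for X in positives if in_main(X)]
gd_border = [X for X in positives
             if gd_candidate(X) and not any(gd_candidate(S) for S in proper_subsets(X))]
infreq_border = [X for X in positives
                 if max(variation_sups(X)) <= MIN_SUP
                 and all(S in main for S in proper_subsets(X))]

# Figure 3 reports 19, 1 and 11 elements for these three components, respectively.
print(len(main), len(infreq_border), len(gd_border))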
Please note that each element in the main component is generalized disjunction-free, since all its variations have non-zero supports. On the other hand, each element in the generalized disjunctive border is generalized disjunctive or has support equal to zero. Figure 3 depicts the GDFLR discovered in D. The main component consists of 19 elements, the infrequent border of 1 element, and the generalized disjunctive border of 11 elements.

Figure 3. The GDFLR representation found in D

Main component: {ab}[4] {ae}[3] {af}[3] {ah}[2] {bc}[3] {bf}[2] {bh}[1] {ce}[2] {cf}[2] {ch}[2] {ef}[2] {fh}[1] {a}[6] {b}[5] {c}[4] {e}[4] {f}[4] {h}[3] ∅[8]
Infrequent border: {cef}
Generalized disjunctive border: {bcf}[1, c ⇒ b ∨ f] {bch}[1, bh ⇒ c] {bfh}[0] {cfh}[1, fh ⇒ c] {abf}[1, f ⇒ a ∨ b] {abh}[1, bh ⇒ a] {aef}[1, f ⇒ a ∨ e] {afh}[1, fh ⇒ a] {ac}[4, c ⇒ a] {be}[4, e ⇒ b] {eh}[0]

We now illustrate how to use this representation for evaluating unknown patterns:
• Let us consider pattern {a(–c)(–e)f}. We note that {acef}, which is the positive variation of the evaluated pattern, has subset {cef} in the infrequent border. This means that all supersets of {cef} and all their variations, including {acef} and {a(–c)(–e)f}, are infrequent.
• Let us consider pattern {bef(–h)}. The positive variation {befh} of {bef(–h)} does not have any subset in the infrequent border, so {bef(–h)} has a chance to be frequent. Since sup({bef(–h)}) = sup({bef}) – sup({befh}), we need to determine the supports of the two positive patterns {bef} and {befh}. {bef} has subset {be} in the generalized disjunctive border, whose implication is e ⇒ b. Hence, ef ⇒ b is an implication for {bef}. Thus, sup({bef}) = sup({ef}) = 2 (see the main component for pattern {ef}). Pattern {befh} also has a subset, for example {eh}, in the generalized disjunctive border. Since sup({eh}) = 0, sup({befh}) equals 0 too. Summarizing, sup({bef(–h)}) = 2 – 0 = 2, and thus {bef(–h)} is a frequent pattern.
GDFLR is a lossless representation of all frequent patterns. A formal presentation of this model and its properties, as well as an algorithm for its discovery and experimental results can be found in our recent work (Kryszkiewicz, 2004b). The experiments carried out on real large data sets show that GDFLR is by several orders of magnitude more concise than all frequent patterns. Further reduction of GDFLR (and GDFSRN) can be achieved by applying techniques for reducing borders (Calders & Goethals, 2003; Kryszkiewicz, 2003; Kryszkiewicz, 2004a) or a main component (Kryszkiewicz, 2004c).
FUTURE TRENDS

Development of different representations of frequent patterns with negation, and of algorithms for their discovery, can be considered a short-term trend. As a long-term trend, we envisage the development of representations of other kinds of knowledge admitting negation, such as association rules, episodes, sequential patterns, and classifiers. Such research should positively stimulate the development of inductive databases, where queries including negation are common.
CONCLUSION

The set of all positive patterns can be treated as a lossless representation of all frequent patterns; nevertheless, it is not concise. On the other hand, the set of all frequent positive patterns neither guarantees derivation of all frequent patterns with negation, nor is it concise in practice. The GDFSRN and GDFLR representations we have proposed are the first lossless representations of both all frequent positive patterns and all frequent patterns with negation. GDFLR consists of a subset of only positive patterns and hence is more concise than the analogous GDFSRN, which admits the storage of many patterns having the same positive variation.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. N. (1993, May). Mining association rules between sets of items in large databases. In ACM SIGMOD International Conference on Management of Data (pp. 207-216), Washington, USA.
Boulicaut, J. F., Bykowski, A., & Jeudy, B. (2000, October). Towards the tractable discovery of association rules with negations. In International Conference on Flexible Query Answering Systems (FQAS'00) (pp. 425-434), Warsaw, Poland.
Bykowski, A., & Rigotti, C. (2001, May). A condensed representation to find frequent patterns. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'01) (pp. 267-273), Santa Barbara, USA.
Mannila, H., & Toivonen, H. (1996, August). Multiple uses of frequent sets and condensed representations. In International Conference on Knowledge Discovery and Data Mining (KDD’96) (pp. 189-194), Portland, USA.
Calders, T., & Goethals, B. (2002, August). Mining all non-derivable frequent itemsets. In European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02) (pp. 74-85), Helsinki, Finland.
Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999, January). Discovering frequent closed itemsets for association rules. Database Theory, International Conference (ICDT’99) (pp. 398-416), Jerusalem, Israel.
Calders, T., & Goethals, B. (2003, September). Minimal k-free representations. In European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'03) (pp. 71-82), Cavtat-Dubrovnik, Croatia.
Toivonen, H. (1996). Discovery of frequent patterns in large data collections. Ph.D. Thesis, Report A-1996-5. University of Helsinki.
Kryszkiewicz, M. (2001, November-December). Concise representation of frequent patterns based on disjunction-free generators. In IEEE International Conference on Data Mining (ICDM'01) (pp. 305-312), San Jose, USA.
Kryszkiewicz, M. (2003, July). Reducing infrequent borders of downward complete representations of frequent patterns. In Symposium on Databases, Data Warehousing and Knowledge Discovery (DDWKD'03) (pp. 29-42), Baden-Baden, Germany.
Kryszkiewicz, M. (2004a, March). Reducing borders of k-disjunction free representations of frequent patterns. In ACM Symposium on Applied Computing (SAC'04) (pp. 559-563), Nicosia, Cyprus.
Kryszkiewicz, M. (2004b, May). Generalized disjunction-free representation of frequent patterns with negation. ICS Research Report 9. Warsaw University of Technology. Extended version accepted to Journal of Experimental and Theoretical Artificial Intelligence.
Kryszkiewicz, M. (2004c, July). Reducing main components of k-disjunction free representations of frequent patterns. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'04) (pp. 1751-1758), Perugia, Italy.
Kryszkiewicz, M., & Gajek, M. (2002, May). Concise representation of frequent patterns based on generalized disjunction-free generators. In Advances in Knowledge Discovery and Data Mining, Pacific-Asia Conference (PAKDD'02) (pp. 159-171), Taipei, Taiwan.
Tsumoto, S. (2002). Discovery of positive and negative knowledge in medical databases using rough sets. In S. Arikawa & A. Shinohara (Eds.), Progress in Discovery Science (pp. 543-552). Heidelberg: Springer.
KEY TERMS

Frequent Pattern: Pattern the support of which exceeds a user-specified threshold.
Generalized Disjunction-Free Pattern: Pattern the support of which is not determinable from the supports of its proper subsets.
Generalized Disjunctive Pattern: Pattern the support of which is determinable from the supports of its proper subsets.
Item: 1) sales product; 2) feature, attribute.
Literal: An item or negated item.
Lossless Representation of Frequent Patterns: Fraction of patterns sufficient to distinguish between frequent and infrequent patterns and to determine the supports of frequent patterns.
Pattern with Negation: Pattern containing at least one negated item.
Positive Pattern: Pattern with no negated item.
Reasoning about Patterns: Deriving supports of patterns without accessing a database.
Support of a Pattern: The number of database transactions in which the pattern occurs.
Recovery of Data Dependencies
Hee Beng Kuan Tan, Nanyang Technological University, Singapore
Yuan Zhao, Nanyang Technological University, Singapore
INTRODUCTION

Today, many companies have to deal with problems in maintaining legacy database applications, which were developed on old database technology. These applications are getting harder and harder to maintain. Reengineering is an important means to address these problems and to upgrade the applications to newer technology (Hainaut, Englebert, Henrard, Hick, & Roland, 1995). However, much of the design of legacy databases, including data dependencies, is buried in the transactions that update the databases; it is not explicitly stated anywhere else. The recovery of the data dependencies designed in a database from its transactions is essential to both the reengineering of database applications and frequently encountered maintenance tasks. Without an automated approach, the recovery is difficult and time-consuming. The issue is also relevant to data mining, as it entails mining the relationships between data from program source code. However, until recently, no such approach was proposed in the literature. Recently, Hee Beng Kuan Tan proposed an approach based on program path patterns, identified in transactions, that implement the most commonly used methods of enforcing each common data dependency. The approach is amenable to automation: the data dependencies designed can be inferred by identifying these patterns through program analysis (Muchnick & Jones, 1981; Wilhelm & Maurer, 1995).
BACKGROUND

Data dependencies play an important role in database design (Maier, 1982; Piatetsky-Shapiro & Frawley, 1991). Many legacy database applications were developed on old-generation database management systems and conventional file systems. As a result, most of the data dependencies in legacy databases are not enforced in the database management systems. As such, they are not explicitly defined in database schemas and are enforced in the transactions that update the databases. Manually finding out the data dependencies designed during the
maintenance and reengineering of database applications is very difficult and time-consuming. In software engineering, program analysis has long been developed and proven as a useful aid in many areas. This article reports the research on the use of program analysis for the recovery of common data dependencies, that is, functional dependencies, key constraints, inclusion dependencies, referential constraints, and sum dependencies, designed in a database from the behavior of transactions.
RECOVERY OF DATA DEPENDENCIES FROM PROGRAM SOURCE CODES

Tan and Zhao (2004) have presented a novel approach for the inference of functional dependencies, key constraints, inclusion dependencies, referential constraints, and sum dependencies designed in a database from the analysis of the source codes of the transactions that update the database. The approach is based on the program path patterns that implement the most commonly used methods for enforcing data dependencies. We believe that the approach should be able to recover the majority of data dependencies designed in database applications. A prototype system for the proposed approach has been implemented in UNIX using Lex and Yacc. Many of the world's database applications are built on old-generation DBMSs. Due to the nature of system development, many data dependencies are not discovered in the initial system development; they are only discovered during the system maintenance stage. Although keys can be used to implement functional dependencies in old-generation DBMSs, due to the effort involved in restructuring databases during the system maintenance stage, many of these dependencies are not defined explicitly as keys in the databases; they are enforced in transactions. Most conventional files and relational databases allow the definition of only one key. As such, most candidate keys are enforced in transactions. The feature for implementing inclusion dependencies and referential constraints in a database is only available in some of the latest generations of DBMSs.
As a result, most of the inclusion dependencies and referential constraints in legacy databases are also not defined explicitly in the databases and are enforced in transactions. To avoid repeated retrieval of related records for the computation of a total in query and reporting programs, the required total is usually maintained and stored by the transactions that update the database, so that other programs can retrieve it directly from the database. As such, many sum dependencies are maintained by transactions in database applications. In summary, many of the functional dependencies, key constraints, inclusion dependencies, referential constraints, and sum dependencies in existing database applications are enforced in transactions. Therefore, transactions are the only source that can accurately reflect them. The proposed approach can be used to automatically recover these data dependencies designed in database applications during the reverse engineering and system maintenance stages. These dependencies constitute the majority of data dependencies in database applications. In the case that data dependencies are jointly enforced through the schema, the transactions, and their GUI (graphical user interface) definitions, the approach is still applicable. The data dependencies defined explicitly in the database schema can be found from the schema without much effort. The GUI definition for a transaction can be interpreted as part of the transaction and analysed to recover the data dependencies designed. Together, all the recovered data dependencies form the design of data dependencies for the whole database application. Extensive work has been carried out on database integrity constraints, which include data dependencies. However, this work mainly concerns enforcing integrity constraints separately in a data management system (Blakeley, Coburn, & Larson, 1989; Orman, 1998; Sheard & Stemple, 1989) and discovering the data dependencies that hold in the current database (Agrawal, Imielinski, & Swami, 1993; Andersson, 1994; Anwar, Beck, & Navathe, 1992; Kantola, Mannila, Raiha, & Siirtola, 1992; Petit, Kouloumdjian, Boulicaut, & Toumani, 1994; Piatetsky-Shapiro & Frawley, 1991; Signore, Loffredo, Gregori, & Cima, 1994; Tsur, 1990). No direct relationship exists between the former work and the proposed approach. The distinct difference between Tan's work and the latter work is that the proposed approach recovers the data dependencies designed in a database, whereas the latter work discovers the data dependencies that hold in the current database. A data dependency that is designed in a database may not hold in the current database, due to updates by transactions that were developed incorrectly at an earlier stage, or to updates made through a query utility without any validation.
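As a purely hypothetical illustration of the kind of transaction logic whose program path pattern reveals a designed referential constraint (here, that every ORDER cust_id must appear in CUSTOMER), consider the sketch below. The in-memory database class, record-type names, and fields are invented for the example and are not part of the approach or its notation.

class InMemoryDb:
    def __init__(self):
        self.tables = {"CUSTOMER": [], "ORDER": []}
    def lookup(self, table, key):
        return next((r for r in self.tables[table] if r.get("cust_id") == key), None)
    def insert(self, table, record):
        self.tables[table].append(record)

def insert_order(db, order):
    # Typical enforcement pattern: the new record is inserted only on the path where
    # the referenced CUSTOMER record exists, which enforces the referential constraint
    # ORDER[cust_id] ⊆ CUSTOMER[cust_id] inside the transaction rather than in the schema.
    parent = db.lookup("CUSTOMER", key=order["cust_id"])
    if parent is None:
        return "rejected: unknown customer"     # error path: no update performed
    db.insert("ORDER", order)                   # update path
    return "ok"

db = InMemoryDb()
db.insert("CUSTOMER", {"cust_id": "C1"})
print(insert_order(db, {"order_id": 1, "cust_id": "C1"}))   # ok
print(insert_order(db, {"order_id": 2, "cust_id": "C9"}))   # rejected: unknown customer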
FUTURE TRENDS

We believe that the integrated analysis of information in databases and programs will be a fruitful direction for establishing useful techniques in order to verify the quality and accuracy of database applications. The information can comprise both formally and empirically based characteristics of programs and databases.
CONCLUSION

We have presented a novel approach for the recovery of data dependencies in databases from program source codes. The proposed approach establishes a bridge for integrating information in databases and the source codes of the programs that update the databases. As a final remark, we would like to highlight that, as long as common methods for enforcing an integrity constraint and the resulting program path patterns for these methods can be identified, a similar approach can be developed to recover that integrity constraint as designed in a database. This could be interesting to explore in further research.
REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914-925.
Andersson, M. (1994). Extracting an entity relationship schema from a relational database through reverse engineering. Proceedings of the 13th International Conference on ERA (pp. 403-419).
Anwar, T. M., Beck, H. W., & Navathe, S. B. (1992). Knowledge mining by imprecise querying: A classification-based approach. Proceedings of the IEEE Eighth International Conference on Data Engineering, USA.
Blakeley, J. A., Coburn, N., & Larson, P. (1989). Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Transactions on Database Systems, 14(3), 369-400.
Hainaut, J.-L., Englebert, V., Henrard, J., Hick, J.-M., & Roland, D. (1995). Requirements for information system reverse engineering support. Proceedings of the IEEE Working Conference on Reverse Engineering (pp. 136-145).
Kantola, M., Mannila, H., Raiha, K., & Siirtola, H. (1992). Discovering functional and inclusion dependencies in relational databases. International Journal of Intelligent Systems, 7, 591-607.
Maier, D. (1982). The theory of relational databases. Computer Science Press.
Muchnick, S. S., & Jones, N. D. (Eds.). (1981). Program flow analysis: Theory and applications. Prentice-Hall.
Orman, L. V. (1998). Differential relational calculus for integrity maintenance. IEEE Transactions on Knowledge and Data Engineering, 10(2), 328-341.
Petit, J.-M., Kouloumdjian, J., Boulicaut, J.-H., & Toumani, F. (1994). Using queries to improve database reverse engineering. Proceedings of the 13th International Conference on ERA (pp. 369-386).
Piatetsky-Shapiro, G., & Frawley, W. J. (Eds.). (1991). Knowledge discovery in databases. Cambridge, MA: AAAI/MIT.
Sheard, T., & Stemple, D. (1989). Automatic verification of database transaction safety. ACM Transactions on Database Systems, 14(3), 322-368.
Signore, O., Loffredo, M., Gregori, M., & Cima, M. (1994). Reconstruction of ER schema from database applications: A cognitive approach. Proceedings of the 13th International Conference on ERA (pp. 387-402).
Tan, H. B. K., Ling, T. W., & Goh, C. H. (2002). Exploring into programs for the recovery of data dependencies designed. IEEE Transactions on Knowledge and Data Engineering, 14(4), 825-835.
Tan, H. B. K., & Zhao, Y. (2004). Automated elicitation of functional dependencies from source codes of database transactions. Information & Software Technology, 46(2), 109-117.
Tsur, S. (1990). Data dredging. IEEE Data Engineering, 13(4), 58-63.
Wilhelm, R., & Maurer, D. (1995). Compiler design. Addison-Wesley.

KEY TERMS

Control Flow Graph: An abstract representation of a procedure or program, maintained internally by a compiler. Each node in the graph represents a basic block; directed edges represent jumps in the control flow.
Data Dependencies: The various ways that data attributes are related, for example, functional dependencies, inclusion dependencies, and so forth.
Design Recovery: Recreates design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about problem and application domains.
Functional Dependency: For any record r in a record type, its sequence of values of the attributes in X is referred to as the X-value of r. Let R be a record type and X and Y be sequences of attributes of R. The functional dependency X → Y of R holds at time t if, at time t, for any two R records r and s whose X-values are identical, the Y-values of r and s are also identical.
Inclusion Dependency: Let R and S be two record types (not necessarily distinct) and X and Y be sequences of attributes of R and S, respectively, such that the numbers of attributes in X and Y are identical. The inclusion dependency (IND) R[X] ⊆ S[Y] holds at time t if, at time t, for each R record r, an S record s exists such that r[X] = s[Y]. (A small checking sketch follows these definitions.)
Program Analysis: Program analysis offers static compile-time techniques for predicting safe and computable approximations to the set of values or behaviours arising dynamically at run-time when executing a computer program.
Reverse Engineering: The extraction of higher-level software artifacts, including design, documentation, and so forth, from the source or binary code of a software system.
Reinforcing CRM with Data Mining
Dan Zhu, Iowa State University, USA
INTRODUCTION
With the explosive growth of information available on the World Wide Web, users must increasingly use automated tools to find, extract, filter, and evaluate desired information and resources. Companies are investing significant amounts of time and money on creating, developing, and enhancing individualized customer relationships, a process called customer relationship management, or CRM (Berry & Linoff, 1999; Buttle, 2003; Rud, 2000). Based on a report by the Aberdeen Group, worldwide CRM spending reached $13.7 billion in 2002 and should be close to $20 billion by 2006. Data mining is a powerful technology that can help companies focus on crucial information that may be hiding in their data warehouses (Fayyad, Grinstein, & Wierse, 2001; Wang, 2003). The process involves extracting predictive information from large databases. Data-mining tools can predict future trends and behaviors that enable businesses to make proactive, knowledge-based decisions. By scouring databases for hidden patterns and finding prognostic information that lies outside expectations, these tools can also answer business questions that previously were too time-consuming to tackle. Web mining is the discovery and analysis of useful information by using the World Wide Web. This broad definition encompasses Web content mining, the automated search for resources and retrieval of information from millions of Web sites and online databases, as well as Web usage mining, the discovery and analysis of users’ Web site navigation and online service access patterns (Berry & Linoff, 2002; Marshall, McDonald, Chen, & Chung, 2004). Today, most companies collect and refine massive amounts of data. To increase the value of current information resources, data-mining techniques can be rapidly implemented on existing software and hardware platforms and integrated with new products and systems. If implemented on high-performance client/server or parallel processing computers, data-mining tools can analyze enormous databases to answer customer-centric questions such as, “Which clients have the highest likelihood of responding to my next promotional mailing, and why?” This article provides a basic introduction to data-mining and Web-mining technologies and their applications in CRM.
BACKGROUND

CRM
CRM is an enterprise approach to customer service that uses meaningful communication to understand and influence consumer behavior. The purpose of the process is twofold: a) to impact all aspects of the consumer relationship (e.g., improve customer satisfaction, enhance customer loyalty, and increase profitability) and b) to ensure that employees within an organization are using CRM tools. The need for greater profitability requires an organization to proactively pursue its relationships with customers (Fleisher & Blenkhorn, 2003). In the corporate world, acquiring, building, and retaining customers are becoming top priorities. For many firms, the quality of their customer relationships provides their competitive edge over other businesses. In addition, the definition of customer has been expanded to include immediate consumers, partners, and resellers — in other words, virtually everyone who participates in, provides information to, or requires services from the firm. Companies worldwide are beginning to realize that surviving an intensively competitive and global marketplace requires closer relationships with customers. In turn, enhanced customer relationships can boost profitability in three ways: a) by reducing costs through attracting more suitable customers, b) by generating profits through cross-selling and up-selling activities, and c) by extending profits through customer retention. Slightly expanded explanations of these activities follow.
• Attracting more suitable customers: Data mining can help firms understand which customers are most likely to purchase specific products and services, thus enabling businesses to develop targeted marketing programs for higher response rates and better returns on investment.
• Better cross-selling and up-selling: Businesses can increase their value proposition by offering additional products and services that are actually desired by customers, thereby raising satisfaction levels and reinforcing purchasing habits.
• Better retention: Data-mining techniques can identify which customers are more likely to defect and why. A company can use this information to generate ideas that allow it to retain these customers.
In general, CRM promises higher returns on investments for businesses by enhancing customer-oriented processes such as sales, marketing, and customer service. Data mining helps companies build personal and profitable customer relationships by identifying and anticipating customers' needs throughout the customer lifecycle.
Data Mining: An Overview

Data mining can help reduce information overload and improve decision making. This is achieved by extracting and refining useful knowledge through a process of searching for relationships and patterns from the extensive data collected by organizations. The extracted information is used to predict, classify, model, and summarize the data being mined. Data-mining technologies, such as rule induction, neural networks, genetic algorithms, fuzzy logic, and rough sets, are used for classification and pattern recognition in many industries (Zhao & Zhu, 2003; Zhong, Dong, & Ohsuga, 2001; Zhu, Premkumar, Zhang, & Chu, 2001). Table 1 gives a few of the many ways that data mining can be used. Data mining builds models of customer behavior by using established statistical and machine-learning techniques. The basic objective is to construct a model for one situation in which the answer or output is known and then apply that model to another situation in which the answer or output is sought. The best applications of the above techniques are integrated with data warehouses and other interactive, flexible business analysis tools. The analytic data warehouse can thus improve business processes across the organization in areas such as campaign management, new product rollout, and fraud detection. Data mining integrates different technologies to populate, organize, and manage the data store. Because
quality data is crucial to accurate results, data-mining tools must be able to clean the data, making it consistent, uniform, and compatible with the data store. Data mining employs several techniques to extract important information. Operations are the actions that can be performed on accumulated data, including predictive modeling, database segmentation, link analysis, and deviation detection. Statistical procedures can be used to apply advanced data-mining techniques to modeling (Giudici, 2003; Yang & Zhu, 2002). Improvements in user interfaces and automation techniques make advanced analysis more feasible. There are two groups of modeling and associated tools: theory driven and data driven. The purpose of theory-driven modeling, also called hypothesis testing, is to substantiate or disprove a priori notions. Thus, theory-driven modeling tools ask the user to specify the model and then test its validity. On the other hand, datadriven modeling tools generate the model automatically based on discovered patterns in the data. The resulting model must be tested and validated prior to acceptance. Because modeling is an evolving and complex process, the final model might require a combination of prior knowledge and new information, yielding a competitive advantage.
MAIN THRUST

Modern data mining can take advantage of increasing computing power and high-powered analytical techniques to reveal useful relationships in large databases (Han & Kamber, 2001; Wang, 2003). For example, in a database containing hundreds of thousands of customers, a data-mining process can combine separate pieces of information and uncover that 73% of all people who purchased sport utility vehicles (SUVs) also bought outdoor recreation equipment, such as boats and snowmobiles, within three years of purchasing their SUVs. This kind of information is invaluable to recreation equipment manufacturers.
Table 1. Some uses of data mining

• A supermarket organizes its merchandise stock based on shoppers' purchase patterns.
• An airline reservation system uses customers' travel patterns and trends to increase seat utilization.
• Web pages alter their organizational structure or visual appearance based on information about the person who is requesting the pages.
• Individuals perform a Web-based query to find the median income of households in Iowa.
Furthermore, data mining can identify potential customers and facilitate targeted marketing. CRM software applications can help database marketers automate the process of interacting with their customers (Kracklauer, Mills, & Seifert, 2004). First, database marketers identify market segments containing customers or prospects with high profit potential. This activity requires the processing of massive amounts of data about people and their purchasing behaviors. Data-mining applications can help marketers streamline the process by searching for patterns among the different variables that serve as effective predictors of purchasing behaviors. Marketers can then design and implement campaigns that will enhance the buying decisions of a targeted segment, in this case, customers with high income potential. To facilitate this activity, marketers feed the data-mining outputs into campaign management software that focuses on the defined market segments. Here are three additional ways in which data mining supports CRM initiatives.
• Database marketing: Data mining helps database marketers develop campaigns that are closer to the targeted needs, desires, and attitudes of their customers. If the necessary information resides in a database, data mining can model a wide range of customer activities. The key objective is to identify patterns that are relevant to current business problems. For example, data mining can help answer questions such as "Which customers are most likely to cancel their cable TV service?" and "What is the probability that a customer will spend over $120 at a given store?" Answering these types of questions can boost customer retention and campaign response rates, which ultimately increases sales and returns on investment. (A minimal propensity-scoring sketch follows this list.)
• Customer acquisition: The growth strategy of businesses depends heavily on acquiring new customers, which may require finding people who have been unaware of various products and services, who have just entered specific product categories (for example, new parents and the diaper category), or who have purchased from competitors. Although experienced marketers often can select the right set of demographic criteria, the process increases in difficulty with the volume, pattern complexity, and granularity of customer data. The challenges of customer segmentation have contributed to an explosive growth in consumer databases. Data mining offers multiple segmentation solutions that could increase the response rate for a customer acquisition campaign. Marketers need to use creativity and experience to tailor new and interesting offers for customers identified through data-mining initiatives.
• Campaign optimization: Many marketing organizations have a variety of methods to interact with current and prospective customers. The process of optimizing a marketing campaign establishes a mapping between the organization's set of offers and a given set of customers that satisfies the campaign's characteristics and constraints, defines the marketing channels to be used, and specifies the relevant time parameters. Data mining can elevate the effectiveness of campaign optimization processes by modeling customers' channel-specific responses to marketing offers.
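As a concrete, purely hypothetical illustration of the churn question in the first item above, the following sketch scores customers by their estimated likelihood of cancelling, using scikit-learn. The file name, column names, and model choice are assumptions made for the example and are not part of this article.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Assumed input: "customers.csv" with one row per customer, a 0/1 column "cancelled",
# and a few numeric predictor columns. All names are hypothetical.
customers = pd.read_csv("customers.csv")
features = ["tenure_months", "monthly_spend", "support_calls"]
X, y = customers[features], customers["cancelled"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Scores like these can be fed to campaign-management software so that retention
# offers are targeted at the highest-risk customers first.
customers["churn_score"] = model.predict_proba(customers[features])[:, 1]
print(customers.sort_values("churn_score", ascending=False).head(10))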
Database marketing software enables companies to send customers and prospective customers timely and relevant messages and value propositions. Modern campaign management software also monitors and manages customer communications on multiple channels including direct mail, telemarketing, e-mail, the Internet, point of sale, and customer service. Furthermore, this software can be used to automate and unify diverse marketing campaigns at their various stages of planning, execution, assessment, and refinement. The software can also launch campaigns in response to specific customer behaviors, such as the opening of a new account. Generally, better business results are obtained when data mining and campaign management work closely together. For example, campaign management software can apply the data-mining model’s scores to sharpen the definition of targeted customers, thereby raising response rates and campaign effectiveness. Furthermore, data mining may help to resolve the problems that traditional campaign management processes and software typically do not adequately address, such as scheduling, resource assignment, and so forth. Although finding patterns in data is useful, data mining’s main contribution is providing relevant information that enables better decision making. In other words, it is a tool that can be used along with other tools (e.g., knowledge, experience, creativity, judgment, etc.) to obtain better results. A data-mining system manages the technical details, thus enabling decision makers to focus on critical business questions such as “Which current customers are likely to be interested in our new product?” and “Which market segment is best for the launch of our new product?”
FUTURE TRENDS

Data mining is a modern technology that offers competitive firms a method to manage customer information, retain customers, and pursue new and hopefully profitable customer relationships. Data mining and Web mining employ many techniques to extract relevant information from massive data sources so that companies can make better business decisions with regard to their customer relationships. Hence, data mining and Web mining promote the goals of customer relationship management, which are to initiate, develop, and personalize customer relationships by profiling customers and highlighting segments. However, data mining presents a number of issues that must be addressed. Data privacy is a hot-button issue (Rees, Koehler, & Ozcelik, 2002). Recently, privacy violations, complaints, and concerns have grown in significance as merchants, companies, and governments continue to accumulate and store large amounts of personal data. There are concerns not only about the collection of personal data, but also about the analyses and uses of the data. For example, transactional information is collected from the customer for the processing of a credit card payment and then, without prior notification, the information is used for other purposes (e.g., data mining). This action would violate principles of data privacy. Fueled by the public's concerns about the rising volume of collected data and potent technologies, clashes between data privacy and data mining will likely cause higher levels of scrutiny in the coming years. Legal challenges are quite possible in this regard. There are other issues facing data mining as well. Data inaccuracies can cause analyses, results, and recommendations to veer off-track. Customers' submission of erroneous or false information and data-type incompatibilities during the data importation process pose real hazards to data mining's effectiveness. Another risk is that data mining might be easily confused with data warehousing. Companies that build data warehouses without implementing data-mining software likely will neither reach top productivity nor receive the full benefits. Likewise, cross-selling can be a problem if it violates customers' privacy, breaches their trust, or annoys them with unwanted solicitations. Data mining can help to alleviate the latter issue by aligning marketing programs with targeted customers' interests and needs.
CONCLUSION

Despite the potential issues and impediments, the market for data mining is projected to grow by several billion dollars. Database marketers should understand that some customers are significantly more profitable than others. Data mining can help to identify and target these customers, whose data is buried in massive databases, thereby helping to redefine and reinforce customer relationships.
ACKNOWLEDGMENTS

This research is partially supported under a grant from Iowa State University. I would like to thank the reviewers and the editor for their helpful comments and suggestions for improving the presentation of this paper.
REFERENCES

Berry, M. J. A., & Linoff, G. S. (1999). Mastering data mining: The art and science of customer relationship management. New York: John Wiley & Sons.
Berry, M. J. A., & Linoff, G. S. (2002). Mining the Web: Transforming customer data. New York: John Wiley & Sons.
Buttle, F. (2003). Customer relationship management: Concepts and tools. Oxford, England: Butterworth-Heinemann.
Fayyad, U., Grinstein, G., & Wierse, A. (2001). Information visualization in data mining and knowledge discovery. San Francisco: Morgan Kaufmann.
Fleisher, C. S., & Blenkhorn, D. (2003). Controversies in competitive intelligence: The enduring issues. Westport, CT: Praeger.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. Wiley.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Kracklauer, D., Mills, Q., & Seifert, D. (Eds.). (2004). Collaborative customer relationship management: Taking CRM to the next level. New York: Springer-Verlag.
Marshall, B., McDonald, D., Chen, H., & Chung, W. (2004). EBizPort: Collecting and analyzing business intelligence information. Journal of the American Society for Information Science and Technology, 55(10), 873-891.
Rees, J., Koehler, G. J., & Ozcelik, Y. (2002). Information privacy and e-business activities: Key issues for managers. In S. K. Sharma & J. N. D. Gupta (Eds.), Managing e-businesses of the 21st century.
Rud, O. P. (2000). Data mining cookbook: Modeling data for marketing, risk and customer relationship management. New York: Wiley.
Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group Publishing.
Yang, Y., & Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, 30, 100-121.
Zhao, L. J., & Zhu, D. (2003). Workflow resource selection from UDDI repositories with mobile agents. Proceedings of Web2003, USA.
Zhong, N., Dong, J., & Ohsuga, S. (2001). Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems, 16(3), 199-214.
Zhu, D., Premkumar, G., Zhang, X., & Chu, C. (2001). Data mining for network intrusion detection: A comparison of alternative methods. Decision Sciences, 32(4), 635-660.
KEY TERMS

Application Service Providers: Offer outsourcing solutions that supply, develop, and manage application-specific software and hardware so that customers' internal information technology resources can be freed up.
Classification: The distribution of things into classes or categories of the same type, or the prediction of the category of data by building a model based on some predictor variables.

Clustering: The grouping of similar items, as identified by algorithms. For example, an insurance company can use clustering to group customers by income, age, policy types, and prior claims. The goal is to divide a data set into groups such that records within a group are as homogeneous as possible and groups are as heterogeneous as possible. When the categories are unspecified, this may be called unsupervised learning.

Genetic Algorithm: An optimization technique based on evolutionary concepts that employs processes such as genetic combination, mutation, and natural selection in a design.

Online Profiling: The process of collecting and analyzing data from Web site visits, which can be used to personalize a customer's subsequent experiences on the Web site. Network advertisers, for example, can use online profiles to track a user's visits and activities across multiple Web sites, although such a practice is controversial and may be subject to various forms of regulation.

Rough Sets: A mathematical approach to extracting knowledge from imprecise and uncertain data.
Business Intelligence: The type of detailed information that business managers need for analyzing sales trends, customers’ purchasing habits, and other key performance metrics in the company.
Rule Induction: The extraction of valid and useful if-then-else rules from data based on their statistical significance; rule-induction tools are often integrated with commercial data warehouse and OLAP platforms.
Categorical Data: Data that fit into a small number of distinct categories of a discrete nature, in contrast to continuous data; they can be ordered (ordinal), for example, high, medium, or low temperatures, or nonordered (nominal), for example, gender or city.
Visualization: The graphical display of data, from simple scatter plots to complex multidimensional representations, to facilitate better understanding.
Resource Allocation in Wireless Networks

Dimitrios Katsaros, Aristotle University, Greece
Gökhan Yavas, Bilkent University, Turkey
Alexandros Nanopoulos, Aristotle University, Greece
Murat Karakaya, Bilkent University, Turkey
Özgür Ulusoy, Bilkent University, Turkey
Yannis Manolopoulos, Aristotle University, Greece
INTRODUCTION

During the past years, we have witnessed an explosive growth in our capabilities to both generate and collect data. Advances in scientific data collection, the computerization of many businesses, and the recording (logging) of clients' accesses to networked resources have generated a vast amount of data. Various data mining techniques have been proposed and widely employed to discover valid, novel, and potentially useful patterns in these data. Traditionally, the two primary goals of data mining tend to be description and prediction, although description is considered to be more important in practice. Recently, though, it was realized that the prediction capabilities of the models constructed by the data mining process can be effectively used to address many problems related to the allocation of resources in networks. For instance, such models have been used to drive prefetching decisions in the World Wide Web (Nanopoulos, Katsaros, & Manolopoulos, 2003) or to schedule data broadcasts in wireless mobile networks (Saygin & Ulusoy, 2002). The intrinsic attribute of these environments is that the network records the characteristics, for example, movements and data preferences, of its clients. Thus, it is possible to infer future client behaviors by mining the historical information, which has been recorded by the network. The present article will highlight the data mining techniques that have been developed to achieve efficient allocation of resources, for example bandwidth, in wireless mobile networks, as well as the data mining methods that
have been used in order to reduce the latency associated with the access of data by wireless mobile clients.
BACKGROUND

We consider a typical wireless Personal Communications System (PCS) (see Figure 1) with an architecture similar to those used in the EIA/TIA IS-41 and GSM standards. The PCS serves a geographical area, called the coverage area, where mobile users (MU) can freely roam. The coverage area served by the PCS is partitioned into a number of non-overlapping regions, called cells. At the heart of the PCS lies a fixed backbone (wireline) network. A number of fixed hosts are connected to this network. Each cell is usually served by one base station (BS), which is connected to the fixed network and is equipped with wireless transmission and receiving capability. We assume that each base station serves exactly one cell. MUs use radio channels to communicate with BSs and gain access to the fixed or wireless network. The BS is responsible for converting the network signaling traffic and data traffic to the radio interface for communication with the MU and also for transmitting paging messages to the MU. Finally, a cell site switch (CSS) governs one or more base stations; it provides access to the serving mobile network, manages the radio resources, and provides mobility management control functions (for example, location update). The coverage area consists of a number of location areas (LA). Each location area consists of one or more cells. The MU can freely roam inside a location area
Figure 1. Architecture of a wireless PCS
without notifying the system about its position. Whenever it moves to a new location area, it must update its position, reporting the location area it entered. This procedure is called location update. This is done as follows: each mobile user is assigned to one database, the home location register (HLR) (one for each PCS network), which maintains the profile information regarding the user, such as authentication, access rights, billing, position, etc. Each location area is associated with one visitor location register (VLR), which stores the profiles of the MUs currently residing in its respective location area. We assume that each VLR is associated with one location area and vice-versa. The search for mobile clients is performed by broadcasting paging messages to the cells where the clients might have moved, until the client is located or the whole coverage area is searched. The identity of the cell is continuously being broadcast by the cell's BS, thus the terminal is aware of the cell in which it resides. If each mobile terminal records the sequence of cells it visits and communicates them back to the network every time it reports a location area crossing, then the network can have accurate information about the mobile user trajectories inside the coverage region. The concept of resource allocation in wireless mobile environments covers aspects of both network and data management issues. With respect to network management, the issue of dynamic bandwidth allocation is of particular importance. Instead of granting a fixed frequency spectrum to each cell, irrespective of the number and needs of the clients residing therein, the allocated spectrum varies according to the clients' demands. This necessitates prediction of both future clients' movements and future data needs. The issue of client movement prediction is also related to the order according to which the paging messages should be broadcast to the cells, so as to guarantee minimum energy consumption and, at the same time, fast client location determination.
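To make this bookkeeping concrete, the following minimal Python sketch models location areas, an HLR, and per-LA VLRs; the class layout, identifiers, and the cell_of callback (standing in for the terminal answering a page) are invented for illustration, as real HLR/VLR signalling is considerably richer:

    class Network:
        def __init__(self, location_areas):
            self.location_areas = location_areas             # LA id -> list of cell ids
            self.hlr = {}                                    # user id -> current LA
            self.vlr = {la: set() for la in location_areas}  # LA -> users currently there

        def location_update(self, user, new_la):
            """Called when a terminal reports a location area crossing."""
            old_la = self.hlr.get(user)
            if old_la is not None:
                self.vlr[old_la].discard(user)
            self.hlr[user] = new_la
            self.vlr[new_la].add(user)

        def page(self, user, cell_of):
            """Broadcast paging messages over the cells of the user's LA until found."""
            la = self.hlr.get(user)
            for cell in self.location_areas.get(la, []):
                if cell_of(user) == cell:   # the terminal answers the page
                    return cell
            return None

    # Hypothetical usage: two location areas, one roaming user.
    net = Network({"LA1": ["c1", "c2"], "LA2": ["c3", "c4"]})
    net.location_update("alice", "LA2")
    print(net.page("alice", cell_of=lambda u: "c4"))   # -> 'c4'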
With respect to the data management issues, a prominent problem is that of reducing the average latency experienced by the clients while retrieving data from the underlying infrastructure network or the wireless network. This problem is tightly related to caching data at various places of the network. The caches can be located "on the air," that is, in data broadcasts (Karakaya & Ulusoy, 2001), or placed at specific base stations. Thus, we need to identify data that will be requested together by the same client or group of clients, so as to broadcast them during short time intervals. In addition, we need to deduce future client movements to cells so as to "push-cache" data to the base stations serving these cells (Hadjiefthymiades & Merakos, 2003; Kyriakakos et al., 2003).
DATA MINING AT THE SERVICE OF THE NETWORK

We will present the most important research areas where data mining methodologies are employed to improve the efficiency of wireless mobile networks. The first area is mobile user location prediction, which aims at deducing future client movements. Location prediction is important for both bandwidth allocation and data placement. The second area is data broadcast schedule creation, which aims at recognizing groups of data items that are likely to be requested together or during small time intervals, so as to place them "closely" in the broadcast program.
LOCATION PREDICTION

The issue of predicting future mobile client positions has received considerable attention (e.g., Aljadhai & Znati, 2001; Liang & Haas, 2003) in the wireless mobile networks research community. The focus of these efforts is the determination of the position of a mobile, given some information about its velocity and direction. However, most (if not all) of these works make unrealistic assumptions about the distribution of the velocity and direction of the mobile terminals. Only recently have data mining techniques been employed in order to predict future trajectories of the mobiles. Data mining techniques capitalize on the simple observation that the movement of people consists of random movements and regular movements, and that the majority of mobile users have some regular daily (hourly, weekly, …) movement patterns and follow these patterns more or less every day. Several efforts targeting location prediction exploited this regularity. The purpose of all these efforts
is to discover movement regularities and code them into some form of "knowledge," say, sequences of cell visits. Thus, for a considered mobile user, the system tries to match the user's current trajectory against some of the already discovered sequences and provide appropriate predictions. Yavas et al. (2004) proposed a method to derive such sequences in the form of association rules, which describe the paths frequently followed by mobile users. This method is a level-wise algorithm, like Apriori (Agrawal & Srikant, 1994), but takes into consideration the cellular structure of the PCS system. In each iteration, instead of generating all possible candidate paths by combining the frequent paths discovered in the previous iteration, it generates only the candidates that comprise legal paths over the coverage area. Thus, it achieves a significant reduction in the processing time. Similar reasoning was followed by Lee and Wang (2003) and Peng and Chen (2003), though the latter method is based on the application of the sequential patterns paradigm (Srikant & Agrawal, 1996). Unlike the aforementioned works, which treated the location prediction problem as an association rule generation problem, the works by Katsaros et al. (2003) and by Wu et al. (2001) investigated solutions for it based on the clustering paradigm. They treated the trajectories of the mobile users as points in a metric space with an associated distance function. The first work treated the trajectories as sequences of symbols (each cell corresponds to a symbol), utilized the string-edit distance as the distance function, and applied hierarchical agglomerative clustering in order to form clusters of trajectories. Each cluster is represented by one or more cluster representatives, and each representative is a sequence of cells. A similar methodology was followed by Wu et al. (2001), but they used the standard Euclidean space and Euclidean distance function, that is, the L2 norm.
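As a concrete illustration of the clustering-based approach, the following sketch computes the string-edit distance between two cell-visit sequences; the cell IDs and trajectories are hypothetical, and a full system would feed such pairwise distances into a hierarchical agglomerative clustering routine to obtain the cluster representatives mentioned above:

    def edit_distance(traj_a, traj_b):
        """String-edit (Levenshtein) distance between two cell-ID sequences."""
        m, n = len(traj_a), len(traj_b)
        # dp[i][j] = cost of transforming the first i cells of traj_a
        # into the first j cells of traj_b
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                subst = 0 if traj_a[i - 1] == traj_b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + subst)  # substitution or match
        return dp[m][n]

    # Hypothetical trajectories expressed as sequences of cell IDs.
    t1 = ["c3", "c7", "c8", "c12"]
    t2 = ["c3", "c8", "c12", "c13"]
    print(edit_distance(t1, t2))   # small distance -> likely the same movement cluster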
Aiming to reduce the mobile client latency associated with data retrieval, the work of Song and Cao (2004) designed a prefetching scheme for mobile clients. Since prefetching also consumes system resources such as bandwidth and power, they considered the system overhead when designing the prefetching scheme and proposed the cache-miss-initiated prefetch (CMIP) scheme to address this issue. The CMIP scheme relies on two prefetch sets: the always-prefetch set and the miss-prefetch set. The always-prefetch set consists of data that should always be prefetched if possible. The miss-prefetch set consists of data that are closely related to the cache-missed data item. When a cache miss happens, instead of sending an uplink request to ask for the cache-missed data item only, the client also requests the data items that are within the miss-prefetch set. This reduces not only future cache misses but also the number of uplink requests. Song and Cao proposed novel algorithms to mine the association rules and used them to construct the two prefetch sets.
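The following sketch conveys the general idea of cache-miss-initiated prefetching under simplifying assumptions; the class layout and the simulated fetch interface are ours, not the actual CMIP algorithms of Song and Cao (2004):

    class CMIPCache:
        """Sketch of a cache-miss-initiated prefetch (CMIP) style client cache."""

        def __init__(self, always_prefetch, miss_prefetch, fetch):
            self.cache = {}
            self.always_prefetch = set(always_prefetch)  # items to prefetch whenever possible
            self.miss_prefetch = miss_prefetch           # item id -> set of related item ids
            self.fetch = fetch                           # uplink request: set of ids -> {id: value}
            self.cache.update(self.fetch(self.always_prefetch))

        def get(self, item_id):
            if item_id in self.cache:
                return self.cache[item_id]
            # Cache miss: piggyback the related items on the single uplink request,
            # instead of asking for the missed item alone.
            wanted = {item_id} | self.miss_prefetch.get(item_id, set())
            self.cache.update(self.fetch(wanted))
            return self.cache[item_id]

    # Hypothetical usage: the server side is simulated with a dictionary.
    server = {"d1": "stock quote", "d2": "related news", "d3": "weather"}
    cache = CMIPCache(always_prefetch={"d3"},
                      miss_prefetch={"d1": {"d2"}},
                      fetch=lambda ids: {i: server[i] for i in ids})
    print(cache.get("d1"))   # one uplink request brings both d1 and d2 into the cache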
FUTURE TRENDS

The application of data mining techniques to the improvement of wireless network performance has proved to be an effective tool, although the techniques proposed to date are relatively straightforward. More sophisticated methods are needed to support, for example, data allocation schemes that utilize the knowledge of user moving patterns for proper allocation of shared data in a mobile computing system. In addition, the knowledge discovered from telecommunication alarm data can be used in finding problems in networks and possibly in predicting severe faults or detecting intrusion attempts. For these application areas, new mining procedures are needed.
SCHEDULING BROADCASTS

For the purpose of discovering data dependencies and subsequently scheduling the broadcast of these items closely in time, Saygin and Ulusoy (2002) proposed applying association rule mining to the log files of the base servers, which record the data requests. Having discovered these dependencies, a correlation graph is constructed, which depicts the correlated data requests. Applying a topological sorting over this graph, the authors derive the broadcast schedule. The main characteristic of this schedule is that the items that are frequently requested together by the clients are broadcast either consecutively or within a very small distance in time. In this way, the average latency of the clients' requests is significantly reduced.
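A rough sketch of this pipeline is given below; the correlated item pairs are assumed to come from previously mined association rules, and the graph construction is a simplification of the approach of Saygin and Ulusoy (2002):

    from collections import defaultdict, deque

    def broadcast_schedule(correlated_pairs):
        """correlated_pairs: iterable of (antecedent, consequent) item pairs derived
        from association rules; returns one broadcast order via topological sort.
        Assumes the correlation graph is acyclic; cycles would need breaking first."""
        graph = defaultdict(set)
        indegree = defaultdict(int)
        items = set()
        for a, b in correlated_pairs:
            items.update((a, b))
            if b not in graph[a]:
                graph[a].add(b)
                indegree[b] += 1
        queue = deque(sorted(i for i in items if indegree[i] == 0))
        order = []
        while queue:
            item = queue.popleft()
            order.append(item)
            for nxt in graph[item]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    queue.append(nxt)
        return order   # items requested together end up close in the schedule

    # Hypothetical correlated requests mined from the base-server logs.
    print(broadcast_schedule([("news", "weather"), ("news", "traffic"), ("weather", "traffic")]))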
CONCLUSION

Traditionally, the data mining process has been used to develop models that describe the data. Recently, though, it was realized that these models can be effectively used to predict characteristics of the data. This observation has led to a number of data mining methods used to improve the performance of wireless mobile networks. The aim of the present article is to present the fields where these methods can be applied and also to provide an overview of the particular data mining techniques that have been developed in these fields.
REFERENCES

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Data Bases (VLDB'94) (pp. 487-499).

Aljadhai, A., & Znati, T. (2001). Predictive mobility support for QoS provisioning in mobile wireless environments. IEEE Journal on Selected Areas in Communications, 19(10), 1915-1930.

Hadjiefthymiades, S., & Merakos, L. (2003). Proxies + path prediction: Improving Web service provisioning in wireless mobile communication. ACM/Kluwer Mobile Networks and Applications, 8(4), 389-399.

Karakaya, M., & Ulusoy, Ö. (2001). Evaluation of a broadcast scheduling algorithm. In Proceedings of the Conference on Advances in Databases and Information Systems (ADBIS'01) (pp. 182-195).

Katsaros, D., et al. (2003). Clustering mobile trajectories for resource allocation in mobile environments. In Proceedings of the 5th Intelligent Data Analysis Symposium (IDA), Lecture Notes in Computer Science (2810) (pp. 319-329).

Kyriakakos, M., Frangiadakis, N., Merakos, L., & Hadjiefthymiades, S. (2003). Enhanced path prediction for network resource management in wireless LANs. IEEE Wireless Communications, 10(6), 62-69.

Lee, A. J. T., & Wang, Y.-T. (2003). Efficient data mining for calling path patterns in GSM networks. Information Systems, 28, 929-948.

Liang, B., & Haas, Z. (2003). Predictive distance-based mobility management for multidimensional PCS networks. IEEE/ACM Transactions on Networking, 11(5), 718-732.

Nanopoulos, A., Katsaros, D., & Manolopoulos, Y. (2003). A data mining algorithm for generalized Web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1155-1169.

Peng, W.-C., & Chen, M. S. (2003). Developing data allocation schemes by incremental mining of user moving patterns in a mobile computing system. IEEE Transactions on Knowledge and Data Engineering, 15(1), 70-85.

Saygin, Y., & Ulusoy, Ö. (2002). Exploiting data mining techniques for broadcasting data in mobile computing environments. IEEE Transactions on Knowledge and Data Engineering, 14(6), 1387-1399.

Song, H., & Cao, G. (2004). Cache-miss-initiated prefetch in mobile environments. In Proceedings of the International Conference on Mobile Data Management (MDM'04) (pp. 370-381).

Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the International Conference on Extending Database Technology (EDBT'96) (pp. 3-17).

Wu, H.-K., Jin, M.-H., Horng, J.-T., & Ke, C.-Y. (2001). Personal paging area design based on mobile's moving behaviors. In Proceedings of the IEEE Conference on Computer and Communications (IEEE INFOCOM'01) (pp. 21-30).

Yavas, G., Katsaros, D., Ulusoy, O., & Manolopoulos, Y. (2005). A data mining approach for location prediction in mobile environments. Data and Knowledge Engineering, to appear.

KEY TERMS

Calling Path: A calling path 〈c1, c2, …, cn〉, n ≥ 2, is a sequence of visited cells during a mobile phone call, where c1, c2, …, cn are cell IDs.

Handoff or Handover: The process of changing some of the parameters of a channel (frequency, time slot, spreading code, or a combination of them) associated with the current connection in progress. Handoffs are initiated by a client's movement, by crossing a cell boundary, or by a deteriorated quality of signal received on a currently employed channel.

Hierarchical Agglomerative Clustering (HAC): A family of clustering algorithms that start with each individual item in its own cluster and iteratively merge clusters until all items belong to one cluster.

MANET: A Mobile Ad Hoc Network (MANET) is a local network with wireless or temporary plug-in connections, in which mobile or portable devices are part of the network only while they are in close proximity.

Prefetching: The technique of deducing future client requests for objects based on the current request, and bringing those objects into the cache in the background before an explicit request is made for them.

Push-Caching: The technique of pushing data closer to consumers by making an informed guess as to what the clients may access in the near future. The concept of push-caching is closely related to prefetching, but prefetches are always initiated in response to an on-demand request.

String Edit Distance: The edit distance between two strings is defined as the minimum number of edit operations – insertions, deletions, and substitutions – needed to transform the first string into the second (matches are not counted).
Retrieving Medical Records Using Bayesian Networks

Luis M. de Campos, Universidad de Granada, Spain
Juan M. Fernández-Luna, Universidad de Granada, Spain
Juan F. Huete, Universidad de Granada, Spain
INTRODUCTION
Bayesian networks (Jensen, 2001) are powerful tools for dealing with uncertainty. They have been successfully applied in a wide range of domains where this property is an important feature, as in the case of information retrieval (IR) (Turtle & Croft, 1991). This field (Baeza-Yates & Ribeiro-Neto, 1999) is concerned with the representation, storage, organization, and accessing of information items (the textual representation of any kind of object). Uncertainty is also present in this field, and, consequently, several approaches based on these probabilistic graphical models have been designed in an attempt to represent documents and their contents (expressed by means of indexed terms), and the relationships between them, so as to retrieve as many relevant documents as possible, given a query submitted by a user. Classic IR has evolved from flat documents (i.e., texts that do not have any kind of structure relating their contents), with all the indexing terms directly assigned to the document itself, toward structured information retrieval (SIR) (Chiaramella, 2001), where the structure or the hierarchy of contents of a document is taken into account. For instance, a book can be divided into chapters, each chapter into sections, each section into paragraphs, and so on. Terms could be assigned to any of the parts where they occur. New standards, such as SGML or XML, have been developed to represent this type of document. Bayesian network models also have been extended to deal with this new kind of document. In this article, a structured information retrieval application in the domain of a pathological anatomy service is presented. All the medical records that this service stores are represented in XML, and our contribution involves retrieving records that are relevant for a given query, which could be formulated by a Boolean expression on some fields, as well as by a free-text query on other fields. The search engine that answers this second type of query is based on Bayesian networks.
BACKGROUND

Probabilistic retrieval models (Crestani et al., 1998) were designed in the early stages of this discipline to retrieve those documents relevant to a given query by computing the probability of relevance. The development of Bayesian networks and their successful application to real problems has caused several researchers in the field of IR to focus their attention on them as an evolution of probabilistic models. They realized that this kind of network model could be suitable for use in IR, since these models are specially designed to perform extremely well in environments where uncertainty is a very important feature, as is the case in IR, and also because they can properly represent the relationships between variables. Bayesian networks are graphical models that are capable of representing and efficiently manipulating n-dimensional probability distributions. They use two components to codify qualitative and quantitative knowledge, respectively: first, a directed acyclic graph (DAG), G = (V, E), where the nodes in V represent the random variables from the problem we want to solve, and the set E contains the arcs that join the nodes. The topology of the graph (the arcs in E) encodes conditional (in)dependence relationships between the variables (by means of the presence or absence of direct connections between pairs of variables); and second, a set of conditional distributions drawn from the graph structure. For each variable X_i ∈ V, we therefore have a family of conditional probability distributions P(X_i | pa(X_i)), where pa(X_i) represents any combination of the values of the variables in Pa(X_i), and Pa(X_i) is the parent set of X_i in G. From these conditional distributions, we can recover the joint distribution over V. This decomposition of the joint distribution gives rise to important savings in storage requirements. In many cases, it also enables probabilistic inference (propagation) to be performed efficiently (i.e., to compute the posterior probability for any variable, given
some evidence about the values of other variables in the graph).

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i))
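As a minimal illustration of this factorization (the three-node network and the probability values below are invented purely for the example), the probability of any complete configuration is the product of the local conditional probabilities:

    # Hypothetical network: two term nodes T1, T2 are parents of a document node D.
    p_t = {"T1": 0.5, "T2": 0.2}                          # P(Ti = relevant)
    p_d_given = {(True, True): 0.9, (True, False): 0.6,
                 (False, True): 0.4, (False, False): 0.1}  # P(d | t1, t2)

    def joint(t1, t2, d):
        """P(T1=t1, T2=t2, D=d) = P(t1) * P(t2) * P(d | t1, t2)."""
        p = (p_t["T1"] if t1 else 1 - p_t["T1"]) * \
            (p_t["T2"] if t2 else 1 - p_t["T2"])
        p_d = p_d_given[(t1, t2)]
        return p * (p_d if d else 1 - p_d)

    print(joint(True, False, True))   # 0.5 * 0.8 * 0.6 = 0.24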
The first complete IR model based on Bayesian networks was the Inference Network Model (Turtle & Croft, 1991). Subsequently, two new models were developed: the Belief Network Model (Calado et al., 2001; Reis, 2000) and the Bayesian Network Retrieval Model (de Campos et al., 2003a, 2003b, 2003c, 2003d). Of course, not only have complete models been developed in the IR context, but also solutions to specific problems (Dumais et al., 1998; Tsikrika & Lalmas, 2002; Wong & Butz, 2000). Structured document representation requires IR to design and implement new models and tools to index, retrieve, and present documents according to the given document structure. Models such as the previously mentioned Bayesian Network Retrieval Model have been adapted to cope with this new context (Crestani et al., 2003a, 2003b), and others have been developed from scratch (Graves & Lalmas, 2002; Ludovic & Gallinari, 2003; Myaeng et al., 1998).
MAIN THRUST

The main purpose of this article is to present the guidelines for construction and use of a Bayesian-network-based information retrieval system. The source document collection is a set of medical records about patients and their medical tests stored in an XML database from a pathological anatomy service. By using XML tags, the information can be organized around a well-defined structure. Our hypothesis is that by using this structure, we will obtain retrieval results that better match the physicians' needs. Focusing on the structure of the documents, data are distributed between two different types of tags: on the one hand, we could consider fixed domain tags (i.e., those attributes from the medical record with a set of well-defined values, such as sex, birthdate, address, etc.); and on the other hand, free text passages are used by the physicians to write comments and descriptions about their particular perceptions of the tests that have been performed on the patients, as well as any conclusions that can be drawn from the results. In this case, there is no restriction on the information that can be stored. Three different free-text passages are considered, representing a description of the microscopic analysis, the macroscopic analysis, and the final diagnostic, respectively.
Physicians must be able to use queries that combine both fixed and free-text elements. For example, they might be interested in all documents concerning males who are suspected of having a malignant tumor. In order to tackle this problem, we propose a two-step process. First, a Boolean retrieval task is carried out in order to identify those records in the dataset matching the requirements on the fixed domain elements. The query is formulated by means of the XPath language. These records are then the inputs of a Bayesian retrieval process in the second stage, where they are sorted in decreasing order of their posterior probability of relevance to the query as the final output of the process.
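A minimal sketch of this two-step pipeline in Python follows; the XML tag names and the tiny record collection are hypothetical, since the actual schema is not given here, and the Bayesian ranking step is only indicated:

    import xml.etree.ElementTree as ET

    # Hypothetical, minimal record collection; the real schema is not shown in the text.
    xml = """<records>
      <record id="1"><sex>male</sex><diagnostic>benign lesion</diagnostic></record>
      <record id="2"><sex>female</sex><diagnostic>malignant tumour</diagnostic></record>
      <record id="3"><sex>male</sex><diagnostic>malignant tumour</diagnostic></record>
    </records>"""
    root = ET.fromstring(xml)

    # Step 1: Boolean filter on a fixed-domain field (an XPath-style predicate).
    candidates = root.findall(".//record[sex='male']")

    # Step 2 (sketched): rank only the surviving records with the Bayesian network,
    # e.g. sorted(candidates, key=lambda r: p_relevance(r, query), reverse=True),
    # where p_relevance would implement p(d | Q) from the model described below.
    print([r.get("id") for r in candidates])   # -> ['1', '3']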
The Bayesian Network Model

Since, for those attributes related to fixed domains, it is sufficient to consider a Boolean retrieval, the Bayesian model will be used to represent both the structural and the content information related to free-text passages. In order to specify the topology of the model (a directed acyclic graph, representing dependence relationships), we need to determine which information components (variables) will be considered as relevant. In our case, we can distinguish between two different types of variables: the set T that contains those terms used to index the free-text passages, T = {T_1, ..., T_M}, with M being the total number of index terms used; and the set D, representing the documents (medical records) in the collection. In this case, we consider as relevant variables the whole document D_k and also the three subordinate documents that comprise it: the macroscopic description, D_mk; the microscopic description, D_µk; and the final diagnostic, D_fk (generically, any of these will be represented by D_•k). Therefore, D = {D_1, D_m1, D_µ1, D_f1, ..., D_N, D_mN, D_µN, D_fN}, with N being the number of documents that comprise the collection.¹ Each term variable, T_i, is a binary random variable taking values in the set {t_i, \bar{t}_i}, where \bar{t}_i stands for "the term T_i is not relevant" and t_i represents "the term T_i is relevant." The domain of each document variable, D_j, is the set {d_j, \bar{d}_j}, where, in this case, \bar{d}_j and d_j mean "the document D_j is not relevant for a given query" and "the document D_j is relevant for the given query," respectively. A similar reasoning can be stated for any subordinate document, D_•j. In order to specify completely the model topology, we need to include those links representing the dependence
relationships between variables. We can distinguish two types of links. The first type links each term node T_i ∈ T to each subordinate document node D_•j ∈ D whenever T_i belongs to D_•j. These links reflect the dependence between the (ir)relevance values of this document and the terms used to index it, and they are directed from terms to documents. Therefore, the parent set of a document node D_•j is the set of term nodes that belong to D_•j (i.e., Pa(D_•j) = {T_i ∈ T | T_i ∈ D_•j}). The second type links each subordinate document D_•j to the document node D_j to which it belongs, reflecting the fact that the relevance of a document to a query will depend only on the relevance values of its subordinate documents. These links are directed from subordinate to document nodes. It should be noted that we do not use links between terms and documents, because we consider these to be independent given that we know the relevance value of the subordinate documents. Consequently, we have designed a layered topology for our model that also represents the structural information of the medical records. Figure 1 displays the graph associated with the Bayesian network model.

Figure 1. Topology of the Bayesian information retrieval model

Probability Distributions

The following step to complete the design of the model is the estimation of the quantitative components of the Bayesian network (i.e., the probability distributions stored in each node). For term nodes, and taking into account that all terms are root nodes, marginal distributions need to be stored. The following estimator is used for every term T_i: p(t_i) = 1/M and p(\bar{t}_i) = (M − 1)/M. Therefore, the prior probability of relevance of any term is very small and inversely proportional to the size of the index. Considering now document and subordinate document nodes, it is necessary to assess a set of conditional probability distributions, the size of which grows exponentially with the number of parents. We therefore propose using a canonical model that represents the particular influence of each parent on the relevance of the node. In particular, given a variable X_j (representing a document or a subordinate document node), the probability of relevance given a particular configuration pa(X_j) of the parent set is computed by means of

p(x_j \mid pa(X_j)) = \sum_{i \in pa(X_j)} w_{ij} ,

where the expression i ∈ pa(X_j) means that only those weights where the value assigned to the ith parent of X_j in the configuration pa(X_j) is relevant will be included in the sum. Therefore, the greater the number of relevant variables in pa(X_j), the greater the probability of relevance of X_j. The particular values of the weights w_ij are: first, for a subordinate document D_•j,

w_{ij} = \frac{tf_{ij} \cdot idf_i^2}{\sum_{T_k \in Pa(D_{\bullet j})} tf_{kj} \cdot idf_k^2} ,

with tf_ij being the frequency of the ith term in the subordinate document and idf_i the inverse document frequency of the term T_i in the whole collection; and second, for a document node D_j, we use three factors, α = w_{mj,j}, β = w_{µj,j}, and δ = w_{fj,j}, representing the influence of the macroscopic description, the microscopic description, and the final diagnosis, respectively. These values can be assigned by the physicians with the restriction that the sum α + β + δ must be 1. This means, for example, that we can choose α = β = δ = 1/3, so that every subordinate document has the same influence when calculating the probability of relevance of a document. Another example is to choose α = β = 1/4 and δ = 1/2 if we want the final diagnosis to have a higher influence on the calculation of the probability of relevance of a document.
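A small sketch of how such passage weights might be computed from a tokenized collection (the toy passages, function names, and normalization details below are illustrative assumptions, not the authors' implementation):

    import math
    from collections import Counter

    def inverse_document_frequency(term, passages):
        """passages: list of term lists, one per subordinate document in the collection."""
        n = sum(1 for p in passages if term in p)
        return math.log(len(passages) / n) if n else 0.0

    def passage_weights(passage, passages):
        """w_ij = tf_ij * idf_i^2, normalized so the weights of one passage sum to 1."""
        tf = Counter(passage)
        raw = {t: tf[t] * inverse_document_frequency(t, passages) ** 2 for t in tf}
        total = sum(raw.values()) or 1.0
        return {t: v / total for t, v in raw.items()}

    # Hypothetical micro-collection of free-text passages (already tokenized).
    collection = [["tumour", "benign", "cells"],
                  ["tumour", "malignant", "cells", "necrosis"],
                  ["inflammation", "cells"]]
    print(passage_weights(collection[1], collection))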
Inference and Retrieval

Given a query Q submitted to our system, the retrieval process starts by placing the evidences in the term subnetwork—the state of each term T_{iQ} belonging to Q is fixed to t_{iQ} (relevant). The inference process is then run, obtaining, for each document D_j, its probability of relevance given that the terms in the query are also relevant, p(d_j | Q). Finally, documents are sorted in decreasing order of probability and returned to the user. We should mention the fact that the Bayesian network contains thousands of nodes, many of which have a
great number of parents. In addition, although the network topology is relatively simple, it contains cycles. Therefore, general-purpose propagation algorithms cannot be applied for reasons of efficiency. We therefore propose the use of a specific inference process (de Campos et al., 2003), which is designed to take advantage of both the topology of the network and the kind of probability function used at document nodes, but ensuring that the results are the same as those obtained using exact propagation in the entire network. The final probability of relevance for a document, therefore, is computed using the following equations:

p(d_k \mid Q) = \alpha \cdot p(d_{mk} \mid Q) + \beta \cdot p(d_{\mu k} \mid Q) + \delta \cdot p(d_{fk} \mid Q)

where p(d_{\bullet k} \mid Q) can be computed as follows:

p(d_{\bullet j} \mid Q) = \frac{1}{M} \sum_{T_i \in D_{\bullet j}} w_{ij} + \frac{M-1}{M} \sum_{T_i \in D_{\bullet j} \cap Q} w_{ij} = \sum_{T_i \in D_{\bullet j} \cap Q} w_{ij} + \frac{1}{M} \sum_{T_i \in D_{\bullet j} \setminus Q} w_{ij}
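Under the same assumptions as the previous sketch, the query-time computation of these equations reduces to a few weighted sums per record (the α, β, δ defaults and the dictionary layout are placeholders):

    def passage_score(weights, query_terms, M):
        """p(d_•j | Q) for one passage, given its normalized term weights."""
        in_q = sum(w for t, w in weights.items() if t in query_terms)
        out_q = sum(w for t, w in weights.items() if t not in query_terms)
        return in_q + out_q / M

    def record_score(record_weights, query_terms, M, alpha=1/3, beta=1/3, delta=1/3):
        """record_weights: dict with 'macroscopic', 'microscopic', 'diagnostic' weight maps."""
        return (alpha * passage_score(record_weights["macroscopic"], query_terms, M)
                + beta * passage_score(record_weights["microscopic"], query_terms, M)
                + delta * passage_score(record_weights["diagnostic"], query_terms, M))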
FUTURE TRENDS

Because of the excellent features offered by Bayesian networks for representing relationships between variables and their strengths, as well as their efficient inference mechanisms, these probabilistic models will be used in many different areas of IR. Following the subject of this article (i.e., dealing with structured documents), one interesting line of research would be the introduction of decisions in the inference process. Instead of returning a ranking of documents, it might be very useful to give the user only those parts of the document that might be relevant, instead of the whole document. A first attempt has been made by de Campos et al. (2004) using influence diagrams. This field is relatively new and is an open and promising research line. On the grounds of the basic methodology proposed in this article, an intuitive step in this line of work would be to open the field of research to the area of recommendation systems, where Bayesian networks also can perform well. The Web is also a challenging context. As well as the large number of existing Web pages, we must consider the hyperlinks between these. A good treatment of these links, by means of a suitable representation through arcs in a Bayesian network and by means of the conditioned probability distributions (which should include the positive or negative influences regarding the relevance of the Web page that is pointed to), should help improve retrieval
effectiveness. Finally, another interesting point would be not to consider index terms as independent of one another, but to take into account the relationships among them, captured by means of data mining techniques with Bayesian networks.
CONCLUSION

In this article, we have presented a retrieval model to deal with medical records from a pathological anatomy service represented in XML. Given a query, the retrieval model operates in two stages: the first employs an XPath query to retrieve XML documents, and the second, using Bayesian networks, computes a probability of relevance using IR techniques on the free-text tags from the records obtained in the previous step. This model ensures not only an accurate representation of the structure of the record collection, but also a fast mechanism to retrieve relevant records given a query.
ACKNOWLEDGMENTS

(a) This work has been jointly supported by the Spanish Fondo de Investigación Sanitaria and Consejería de Salud de la Junta de Andalucía, under Projects PI021147 and 177/02, respectively; (b) we would like to thank Armin Stoll for his collaboration with the development of the software implementing the model presented in this article.
REFERENCES

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley.

Calado, P., Ribeiro, B., Ziviani, N., Moura, E., & Silva, I. (2001). Local versus global link information in the Web. ACM Transactions on Information Systems, 21(1), 42-63.

Chiaramella, Y. (2001). Information retrieval and structured documents. Lecture Notes in Computer Science, 1980, 291-314.

Crestani, F., de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003a). A multi-layered Bayesian network model for structured document retrieval. Lecture Notes in Artificial Intelligence, 2711, 74-86.

Crestani, F., de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003b). Ranking structured documents using utility theory in the Bayesian network retrieval model. Lecture Notes in Computer Science, 2857, 168-182.
Crestani, F., Lalmas, M., van Rijsbergen, C. J., & Campbell, L. (1998). Is this document relevant? Probably: A survey of probabilistic models in information retrieval. Computing Surveys, 30(4), 528-552.

de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003a). An information retrieval model based on simple Bayesian networks. International Journal of Intelligent Systems, 18, 251-265.

de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003b). Implementing relevance feedback in the Bayesian network retrieval model. Journal of the American Society for Information Science and Technology, 54(4), 302-313.

de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003c). The BNR model: Foundations and performance of a Bayesian network-based retrieval model. International Journal of Approximate Reasoning, 34, 265-285.

de Campos, L. M., Fernández-Luna, J., & Huete, J. F. (2003d). Improving the efficiency of the Bayesian network retrieval model by reducing relationships between terms. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11, 101-116.

Dumais, S. T., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the ACM International Conference on Information and Knowledge Management.

Graves, A., & Lalmas, M. (2002). Video retrieval using an MPEG-7 based inference network. In Proceedings of the 25th ACM-SIGIR Conference.

Jensen, F. V. (2001). Bayesian networks and decision graphs. Springer Verlag.

Ludovic, D., & Gallinari, P. (2003). A belief network-based generative model for structured documents. An application to the XML categorization. Lecture Notes in Computer Science, 2734, 328-342.

Myaeng, S. H., Jang, D. H., Kim, M. S., & Zhoo, Z. C. (1998). A flexible model for retrieval of SGML documents. In Proceedings of the 21st ACM-SIGIR Conference.

Piwowarski, B., & Gallinari, P. (2002). A Bayesian network model for page retrieval in a hierarchically structured collection. In Proceedings of the XML Workshop, 25th ACM-SIGIR Conference.

Reis, I. (2000). Bayesian networks for information retrieval [doctoral thesis]. Universidad Federal de Minas Gerais.

Tsikrika, T., & Lalmas, M. (2002). Combining Web document representations in a Bayesian inference network model using link and content-based evidence. In Proceedings of the 24th European Colloquium on Information Retrieval Research.

Turtle, H. R., & Croft, W. B. (1991). Evaluation of an inference network-based retrieval model. Information Systems, 9(3), 187-222.

Turtle, H. R., & Croft, W. B. (1997). Uncertainty in information systems. In Uncertainty management in information systems: From needs to solutions (pp. 189-224). Kluwer.

Wong, S. K. M., & Butz, C. J. (2000). A Bayesian approach to user profiling in information retrieval. Technology Letters, 4(1), 50-56.

KEY TERMS

Bayesian Network: A directed acyclic graph where the nodes represent random variables and the arcs represent the relationships between them. Their strength is represented by means of conditional probability distributions stored in the nodes.

Information Retrieval: A research field that deals with the representation, storage, organization, and accessing of information items.

Probability Distribution: A function that assigns a probability to each value that a random variable can take, fulfilling Kolmogorov's axioms.

Recommendation System: Software that, given preferences expressed by a user, selects those choices, from a range of options, that best satisfy the user's preferences.

Structured Document: A textual representation of any object whose content is organized around a well-defined structure.

XML: Acronym for Extensible Markup Language. A meta-language directly derived from SGML but designed for Web documents. It allows the structuring of information and its transmission between applications and between organizations.

XPath: A language designed to access the different elements of an XML document.

ENDNOTE

1. The notation T_i (D_j, respectively) refers to both the term (document, respectively) and its associated variable and node.
Robust Face Recognition for Data Mining

Brian C. Lovell, The University of Queensland, Australia
Shaokang Chen, The University of Queensland, Australia
INTRODUCTION

While the technology for mining text documents in large databases could be said to be relatively mature, the same cannot be said for mining other important data types such as speech, music, images and video. Yet these forms of multimedia data are becoming increasingly prevalent on the Internet and intranets as bandwidth rapidly increases due to continuing advances in computing hardware and consumer demand. An emerging major problem is the lack of accurate and efficient tools to query these multimedia data directly, so we are usually forced to rely on available metadata, such as manual labeling. Currently the most effective way to label data to allow for searching of multimedia archives is for humans to physically review the material. This is already uneconomic or, in an increasing number of application areas, quite impossible because these data are being collected much faster than any group of humans could meaningfully label them — and the pace is accelerating, forming a veritable explosion of non-text data. Some driver applications are emerging from heightened security demands in the 21st century, post-production of digital interactive television, and the recent deployment of a planetary sensor network overlaid on the Internet backbone.
BACKGROUND

Although they say a picture is worth a thousand words, computer scientists know that the ratio of information contained in images compared to text documents is often much greater than this. Providing text labels for image data is problematic because appropriate labeling is very dependent on the typical queries users will wish to perform, and the queries are difficult to anticipate at the time of labeling. For example, a simple image of a red ball would be best labeled as sports equipment, a toy, a red object, a round object, or even a sphere, depending on the nature of the query. Difficulties with text metadata have led to researchers concentrating on techniques
from the fields of Pattern Recognition and Computer Vision that work on the image content itself. A motivating application and development testbed is the emerging experimental planetary scale sensor Web, IrisNet (Gibbons, Karp, Ke, Nath, & Sehan, 2003). IrisNet uses Internet connected desktop PCs and inexpensive, off-the-shelf sensors such as Webcams, microphones, temperature, and motion sensors deployed globally to provide a wide-area sensor network. IrisNet is deployed as a service on PlanetLab (www.planetlab.org), a worldwide collaborative network environment for prototyping next generation Internet services initiated by Intel Research and Princeton University that has 177 nodes as of August, 2004. Gibbons, Karp, Ke, Nath, & Sehan envisage a worldwide sensor Web in which many users can query, as a single unit, vast quantities of data from thousands or even millions of planetary sensors. IrisNet stores its sensor-derived data in a distributed XML schema, which is well-suited to describing such hierarchical data as it employs self-describing tags. Indeed the robust distributed nature of the database can be most readily compared to the structure of the Internet DNS naming service. The authors give an example of IrisNet usage where an ecologist wishes to assess the environmental damage after an oil spill by locating beaches where oil has affected the habitat. The query would be directed toward a coastal monitoring service that collects images from video cameras directed at the coastline. The ecologist would then receive images of the contaminated sites as well at their geographic coordinates. Yet the same coastal monitoring service could be used simultaneously to locate the best beaches for surfing. Moreover, via stored trigger queries, the sensor network could automatically notify the appropriate lifeguard in the event of detecting dangerous rips or the presence of sharks. A valuable prototype application that could be deployed on IrisNet is wide area person recognition and location services. Such services have existed since the emergence of human society to locate specific persons when they are not in immediate view. For example, in a crowded shopping mall, a mother may ask her child, “Have you seen your sister?” If there were a positive
response, this may then be followed by a request to know the time and place of the last sighting, or perhaps by a request to go look for her. Here the mother is using the eyes, face recognition ability, memory persistence, and mobility of the child to perform the search. If the search fails, the mother may then ask the mall manager to give a “lost child” announcement over the public address system. Eventually the police may be asked to employ these human search services on a much wider scale by showing a photograph of the missing child on the television to ask the wider community for assistance in the search. On the IrisNet the mother could simply upload a photograph of her child from the image store in her mobile phone and the system would efficiently look for the child in an ever-widening geographic search space until contact was made. Clearly in the case of IrisNet, there is no possibility of humans being employed to identify all the faces captured by the planetary sensor Web to support the search, so the task must be automated. Such a service raises inevitable privacy concerns, which must be addressed, but the service also has the potential for great public good as in this example of reuniting a worried mother with her lost child. In addition to person recognition and location services on a planetary sensor Web, another interesting commercial application of face recognition is a system to semi-automatically annotate video streams to provide content for digital interactive television. A similar idea was behind the MIT MediaLab Hypersoap project (Agamanolis & Bove, 1997). In this system, users touch images of objects and people on a television screen to bring up information and advertising material related to the object. For example, a user might select a famous actor and then a page would appear describing the actor, films in which they have appeared, and the viewer might be offered the opportunity to purchase copies of their other films. Automatic face recognition and tracking would greatly simplify the task of labeling the video in post-production — the major cost component of producing such interactive video. Now we will focus on the crucial technology underpinning such data mining services — automatically recognizing faces in image and video databases.
MAIN THRUST

Robust Face Recognition

Robust face recognition is a challenging goal because of the gross similarity of all human faces compared to large differences between face images of the same
person due to variations in lighting conditions, view point, pose, age, health, and facial expression. An ideal face recognition system should recognize new images of a known face and be insensitive to nuisance variations in image acquisition. Yet, differences between images of the same face (intraclass variation) due to these nuisance variations in image capture are often greater than those between different faces (interclass variation) (Adinj, Moses, & Ulman, 1997), making the task extremely challenging. Most systems work well only with images taken under constrained or laboratory conditions where lighting, pose, and camera parameters are strictly controlled. This requirement is much too strict to be useful in many data mining situations when only a few sample images are available, such as in recognizing people from surveillance videos from a planetary sensor Web or searching historic film archives. Recent research has been focused on diminishing the impact of nuisance factors on face recognition. Two main approaches have been proposed for illumination invariant recognition. The first is to represent images with features that are less sensitive to illumination change (Yilmaz & Gokmen, 2000; Gao & Leung, 2002), such as using the edge maps of an image. These methods suffer from robustness problems because shifts in edge locations resulting from small rotation or location errors significantly degrade recognition performance. Yilmaz and Gokmen (2000) proposed using “hills” for face representation; others use derivatives of the intensity (Edelman, Reisfeld, & Yeshurun, 1994; Belhumeur & Kriegman, 1998). No matter what kind of representation is used, these methods assume that features do not change dramatically with variable lighting conditions. Yet this is patently false as edge features generated from shadows may have a significant impact on recognition. The second main approach is to construct a low dimensional linear subspace for the images of faces taken under different lighting conditions. This method is based on the assumption that images of a convex Lambertian object under variable illumination form a convex cone in the space of all possible images (Belhumeur & Kriegman, 1998). Once again, it is hard for these systems to deal with cast shadows. Furthermore, such systems need several images of the same face taken under different lighting source directions to construct a model of a given face — in data mining applications it is often impossible to obtain the required number of images. Experiments performed by Adinj, Moses, and Ulman (1997) show that even with the best image representations using illumination insensitive features and the best distance measurement, the misclassification rate is often more than 20%. As for expression invariant face recognition, this is still an open problem for machine recognition and is
Robust Face Recognition for Data Mining
also quite a difficult task for humans. The approach adopted in Beymer and Poggio (1996) and Black, Fleet, and Yacoob (2000) is to morph images to be the same expression as the one used for training. A problem is that not all images can be morphed correctly. For example, an image with closed eyes cannot be morphed to a standard image because of the lack of texture inside the eyes. Liu, Chen, and Kumar (2001) proposed using optical flow for face recognition with facial expression variations. However, it is hard to learn the motions within the feature space to determine the expression changes, since the way one person expresses a certain emotion is normally somewhat different from others. These methods also suffer from the need to have large numbers of example images for training.

Table 1. Problems with existing face recognition technology
• Overall accuracy, particularly on large databases
• Sensitivity to changes in lighting, camera angle, pose
• Computational load of searches
Mathematical Basis for Face Recognition Technologies Most face recognition systems are based on one of the following methods: 1. 2.
Direct Measurement of Facial Features Principal Components Analysis or “Eigenfaces” (Turk & Pentland, 1991) Fisher Linear Discriminant Function (Liu & Wechsler, 1998)
3.
Table 2. Data mining applications for face recognition
• Person recognition and location services on a planetary wide sensor net
• Recognizing faces in a crowd from video surveillance
• Searching for video or images of selected persons in multimedia databases
• Forensic examination of multiple video streams to detect movements of certain persons
• Automatic annotation and labeling of video streams to provide added value for digital interactive television
Early forms of face recognition were based on Method 1, with direct measurement of features such as the width of the nose, the spacing between the eyes, etc. These measurements were frequently performed by hand using calipers. Many modern systems are based on either of Methods 2 or 3, which are better suited to computer automation. Here we briefly describe the principles behind one of the most popular methods — Principal Components Analysis (PCA), also known as "eigenfaces," as originally popularized by Turk and Pentland (1991). The development assumes a basic background in linear algebra.
Principal Components Analysis

PCA is a second-order method for finding a linear representation of faces using only the covariance of the data. It determines the set of orthogonal components (feature vectors) which minimizes the reconstruction error for a given number of feature vectors. Consider the face image set I = [I_1, I_2, \ldots, I_n], where I_i is a p × q pixel image, i ∈ [1 \ldots n], p, q, n ∈ Z^+; the average face of the image set is defined by the matrix:
\Psi = \frac{1}{n} \sum_{i=1}^{n} I_i .   (1)
Note that face recognition is normally performed on grayscale (i.e., black and white) face images rather than color. Colors, and skin color tones in particular, are frequently used to aid face detection and location within the image stream (Rein-Lien, Abdel-Mottaleb, & Jain, 2002). We assume additionally that the face images are pre-processed by scaling, rotation, eye centre alignment, and background suppression so that averaging is meaningful. Now normalizing each image by subtracting the average face, we have the normalized difference image matrix:
\tilde{D}_i = I_i - \Psi .   (2)

Unpacking \tilde{D}_i row-wise, we form the N (N = p × q) dimensional column vector d_i. We define the covariance matrix C of the normalized image set D = [d_1, d_2, \ldots, d_n], corresponding to the original face image set I, by:
n
C = ∑ d i d = DD . i =1
T i
T
(3)
An eigen decomposition of C yields eigenvalues λi
λi = σ i2 ) and the columns of U are the eigenvectors. Now
consider a similar derivation for C ' . n
C ' = DT D = VS TU TUSV T = VS 2V T = ∑σ i2vi viT i =1
and eigenvectors u i , which satisfy:
Cu i = λi u i , and
(4)
n
C = DD T = ∑ λi u i u iT ,
(5)
i =1
where i ∈ [1L N ] . In practice, N is so huge that eigenvector decomposition is computationally impossible. Indeed for even a small image of 100 × 100 pixels, C is a 10,000 × 10,000 matrix. Fortunately, the following shortcut lets us bypass direct decomposition of C . We consider decompositions of C ' = D T D instead of C = DD T . Singular value decomposition of D gives us
D = USV T
where S 2 [ n×n] = diag (σ 12 ,σ 22 ,L , σ n2 ) . Comparing (7) and (8) we see that the singular values are identical, so the squares of the singular values yield the eigenvalues of C . The eigenvectors of C can be obtained from the eigenvectors of C ' , which are the columns of V , by rearranging (6) as follows:
U = DVS −1
(9)
which can be expressed alternatively by
ui =
1
λi
Dvi ,
(10)
where i = [1L n] . Thus by performing an eigenvector
(6)
where U [ N × N ] and V [ n×n] are unitary and S [ Nxn ] is diagonal. Without loss of generality, assume the diagonal elements of S = diag (σ 1 , σ 2 , L , σ n ) are sorted such that σ 1 > σ 2 > L > σ n where the σ i are known as the singular
values of D . Then n
C = DD T = USV TVS TU T = US 2U T = ∑ σ i2ui uiT i =1
(7)
where S 2 [ N × N ] = diag (σ 12 ,σ 22 , L , σ n2 ,0, L 0) . Thus, only the first n singular values are non-zero. Comparing (7) with (5), we see that the squares of the singular values give us the eigenvalues of C (i.e.,
decomposition on the small matrix C ' [ n×n ] , we efficiently obtain both the eigenvalues and eigenvectors of the very large matrix C [ N × N ] . In the case of a database of 100 × 100 pixel face images of size 30, by using this shortcut, we need only decompose a 30 × 30 matrix instead of a 10,000 × 10,000 matrix! The eigenvectors of C are often called the eigenfaces and are shown as images in Figure 1. Being the columns of a unitary matrix, the eigenfaces are orthogonal and efficiently describe (span) the space of variation in faces. Generally, we select a small subset of m < n eigenfaces to define a reduced dimensionality facespace that yields highest recognition performance on unseen examples of faces: for good recognition performance the required number of eigenfaces, m , is typically chosen to be of the order of 6 to 10.
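To make the computational shortcut concrete, the following NumPy sketch (an illustration written for this summary, not the authors' implementation) computes the average face and the leading eigenfaces by decomposing the small n × n matrix C' = DᵀD, following Equations (6) to (10); the function name and the default value of m are our own choices.

```python
import numpy as np

def compute_eigenfaces(images, m=8):
    """Compute the average face and the top-m eigenfaces.

    images: array-like of n aligned grayscale face images, each p x q.
    Returns (psi, eigenfaces), where eigenfaces has shape (m, p*q).
    """
    I = np.asarray(images, dtype=float)      # shape (n, p, q)
    n = I.shape[0]
    psi = I.mean(axis=0)                     # average face, Eq. (1)
    D = (I - psi).reshape(n, -1).T           # N x n matrix of difference vectors, Eq. (2)

    C_small = D.T @ D                        # the small n x n matrix C' = D^T D, Eq. (8)
    eigvals, V = np.linalg.eigh(C_small)     # eigenvalues/eigenvectors of C'
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    eigvals, V = eigvals[order], V[:, order]

    # Eigenvectors of the big matrix C = D D^T: u_i = D v_i / sqrt(lambda_i), Eq. (10).
    # Only the top m components are kept, where the eigenvalues are strictly positive.
    U = D @ V[:, :m] / np.sqrt(eigvals[:m])
    return psi, U.T                          # each row of U.T is one eigenface
```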
Figure 1. Typical set of eigenfaces as used for face recognition. Leftmost image is average face.
Figure 2. Contours of 95% recognition performance for the original PCA and the proposed APCA method against lighting elevation and azimuth
Thus, in PCA recognition, each face can be represented by just a few components: the average face is subtracted out, and the principal components are calculated by projecting the remaining difference image onto the m eigenfaces. Simple methods such as nearest neighbors are normally used to determine which stored face best matches a given face.
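Recognition then reduces to projecting a probe face onto the eigenfaces and comparing the resulting feature vector with those of enrolled faces. The snippet below is a minimal sketch of that nearest-neighbour step; `gallery` is an assumed dictionary mapping person identifiers to precomputed feature vectors, not a structure defined in the article.

```python
import numpy as np

def project(face, psi, eigenfaces):
    """Represent a face by its principal components (projection onto the eigenfaces)."""
    d = (np.asarray(face, dtype=float) - psi).reshape(-1)
    return eigenfaces @ d                    # m-dimensional feature vector

def recognize(face, psi, eigenfaces, gallery):
    """Nearest-neighbour matching in face space.

    gallery: dict mapping person id -> feature vector obtained with project().
    """
    w = project(face, psi, eigenfaces)
    return min(gallery, key=lambda pid: np.linalg.norm(gallery[pid] - w))
```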
Robust PCA Recognition The authors have developed Adaptive Principal Component Analysis (APCA) to improve the robustness of PCA to nuisance factors such as lighting and expression (Chen & Lovell, 2003, 2004). In the APCA method, we first apply PCA. Then we rotate and warp the facespace by whitening and filtering the eigenfaces according to the overall, between-class, and within-class covariance to find an improved set of eigenfeatures. Figure 2 shows the large improvement in robustness to lighting angle. The proposed APCA method allows us to recognize faces with high confidence even if they are half in shadow. Figure 3 shows significant recognition performance gains over standard PCA when both changes in lighting and expression are present.

Figure 3. Recognition rates for APCA and PCA versus number of eigenfaces with variations in lighting and expression from Chen & Lovell (2003)
Critical Issues of Face Recognition Technology Despite the huge number of potential applications for reliable face recognition, the need for such search capabilities in multimedia data mining, and the great strides made in recent decades, there is still much work to do before these applications become routine.
FUTURE TRENDS Face recognition and other biometric technologies are coming of age due to the need to address heightened security concerns in the 21st century. Privacy concerns that have hindered public acceptance of these technologies in the past are now yielding to society’s need for increased security while maintaining a free society. Apart from the demands from the security sector, there are many applications for the technology in other areas of data mining. The performance and robustness of systems will increase significantly as more researcher effort is brought to bear. In recent real-time systems there is much interest in 3-D reconstruction of the head from multiple camera angles, but in data mining the focus must remain on reliable recognition from single photos.
Table 3. A Summary of Critical Issues of Face Recognition Technologies

Privacy Concerns: It is clear that personal privacy may be reduced with the widespread adoption of face recognition technology. However, since September 11, 2001, concerns about privacy have taken a back seat to concerns about personal security. Governments are under intense pressure to introduce stronger security measures. Unfortunately, government's current need for biometric technology does nothing to improve performance in the short term and may actually damage uptake in the medium term due to unrealistic expectations.

Computational Efficiency: Face recognition can be computationally very intensive for large databases. This is a serious impediment for multimedia data mining.

Accuracy on Large Databases: Studies indicate that recognition error rates of the order of 10% are the best that can be obtained on large databases. This error rate sounds rather high, but trained humans do no better and are much slower at searching.

Sensitivity to Illumination and Other Changes: Changes in lighting, camera angle, and facial expression can greatly affect recognition performance.

Inability to Cope with Multiple Head Poses: Very few systems can cope with non-frontal views of the face. Some researchers propose 3-D recognition systems using stereo cameras for real-time applications, but these are not suitable for data mining.

Ability to Scale: While a laboratory system may work quite well on 20 or 30 faces, it is not clear that these systems will scale to huge face databases as required for many security applications such as detecting faces of known criminals in a crowd or the person locator service on the planetary sensor Web.
CONCLUSION It has been argued that by the end of the 20th century computers were very capable of handling text and numbers, and that in the 21st century computers will have to be able to cope with raw data such as images and speech with much the same facility. The explosion of multimedia data on the Internet and the conversion of all information to digital formats (music, speech, television) is driving the demand for advanced multimedia search capabilities, but the pattern recognition technology is mostly unreliable and slow. Yet, the emergence of handheld computers with built-in speech and handwriting recognition ability, however primitive, is a sign of the changing times. The challenge for researchers is to produce pattern recognition algorithms, such as face recognition, reliable and fast enough for deployment on data spaces of a planetary scale.
REFERENCES

Adini, Y., Moses, Y., & Ullman, S. (1997). Face recognition: The problem of compensating for changes in illumination direction. IEEE PAMI, 19(4), 721-732.
Agamanolis, S., & Bove, Jr., V.M. (1997). Multi-level scripting for responsive multimedia. IEEE Multimedia, 4(4), 40-50. Belhumeur, P., & Kriegman, D. (1998). What is the set of images of an object under all possible illumination conditions? International Journal of Computer Vision, 28(3), 245-260. Beymer, D., & Poggio, T. (1995). Face recognition from one example view. In Proceedings of the International Conference of Computer Vision (pp. 500-507). Black, M.J., Fleet, D.J., & Yacoob, Y. (2000). Robustly estimating changes in image appearance. Computer Vision and Image Understanding, 78(1), 8-31. Chen, S., & Lovell, B.C. (2003). Face recognition with one sample image per class. In Proceedings of ANZIIS2003 (pp. 83-88), December 10-12, Sydney, Australia. Chen, S., & Lovell, B.C. (2004). Illumination and expression invariant face recognition with one sample image. In Proceedings of the International Conference on Pattern Recognition, August 23-26, Cambridge, UK. Chen, S., Lovell, B.C., & Sun, S. (2002). Face recognition with APCA in variant illuminations. In Proceedings of
WOSPA2002 (pp. 9-12), December 17-18, Brisbane, Australia.
Edelman, S., Reisfeld, D., & Yeshurun, Y. (1994). A system for face recognition that learns from examples. In Proceedings of the European Conference on Computer Vision (pp. 787-791). Berlin: Springer-Verlag.

Feraud, R., Bernier, O., Viallet, J.E., & Collobert, M. (2000). A fast and accurate face detector for indexation of face images. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (pp. 77-82), March 28-30.

Gao, Y., & Leung, M.K.H. (2002). Face recognition using line edge map. IEEE PAMI, 24(6), 764-779.

Georghiades, A.S., Belhumeur, P.N., & Kriegman, D.J. (2001). From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 643-660.

Gibbons, P.B., Karp, B., Ke, Y., Nath, S., & Sehan, S. (2003). IrisNet: An architecture for a worldwide sensor Web. Pervasive Computing, 2(4), 22-33.

Li, Y., Goshtasby, A., & Garcia, O. (2000). Detecting and tracking human faces in videos. In Proceedings of the 15th International Conference on Pattern Recognition (Vol. 1, pp. 807-810), September 3-8.

Liu, C., & Wechsler, H. (1998). Evolution of optimal projection axes (OPA) for face recognition. In Third IEEE International Conference on Automatic Face and Gesture Recognition, FG'98 (pp. 282-287), Nara, Japan, April 14-16.

Liu, X.M., Chen, T., & Kumar, B.V.K.V. (2003). Face authentication for multiple subjects using eigenflow. Pattern Recognition, Special Issue on Biometrics, 36(2), 313-328.

Ming-Hsuan, Y., Kriegman, D.J., & Ahuja, N. (2002). Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1), 34-58.

Rein-Lien, H., Abdel-Mottaleb, M., & Jain, A.K. (2002). Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 696-706.

Swets, D.L., & Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 831-836.

The Hypersoap Project. (n.d.). Retrieved February 6, 2004, from http://www.media.mit.edu/hypersoap/

Turk, M.A., & Pentland, A.P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71-86.

Yang, J., Zhang, D., Frangi, A.F., & Jing-Yu, Y. (2004). Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1), 131-137.

Yilmaz, A., & Gokmen, M. (2000). Eigenhill vs. eigenface and eigenedge. In Proceedings of the International Conference on Pattern Recognition (pp. 827-830). Barcelona, Spain.

Zhao, L., & Yang, Y.H. (1999). Theoretical analysis of illumination in PCA-based vision systems. Pattern Recognition, 32, 547-564.

KEY TERMS

Biometric: A measurable, physical characteristic or personal behavioral trait used to recognize the identity, or verify the claimed identity, of an enrollee. A biometric identification system identifies a human from a measurement of a physical feature or repeatable action of the individual (for example, hand geometry, retinal scan, iris scan, fingerprint patterns, facial characteristics, DNA sequence characteristics, voice prints, and handwritten signature).

Computer Vision: Using computers to analyze images and video streams and extract meaningful information from them in a similar way to the human vision system. It is related to artificial intelligence and image processing and is concerned with computer processing of images from the real world to recognize features present in the image.

Eigenfaces: Another name for face recognition via principal components analysis.

Face Space: The vector space spanned by the eigenfaces.

Head Pose: Position of the head in 3-D space including head tilt and rotation.

Metadata: Labeling, information describing other information.

Pattern Recognition: Pattern recognition is the ability to take in raw data, such as images, and take action based on the category of the data.

Principal Components Analysis: Principal components analysis (PCA) is a method that can be used to simplify a dataset. It is a transform that chooses a new coordinate system for the data set, such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. PCA can be used for reducing dimensionality. PCA is also called the Karhunen-Loève transform or the Hotelling transform.

Robust: The opposite of brittle; this can be said of a system that has the ability to recover gracefully from the whole range of exceptional inputs and situations in a given environment. Also has the connotation of elegance in addition to careful attention to detail.
Rough Sets and Data Mining Jerzy W. Grzymala-Busse University of Kansas, USA Wojciech Ziarko University of Regina, Canada
INTRODUCTION Discovering useful models capturing regularities of natural phenomena or complex systems until recently was almost entirely limited to finding formulae fitting empirical data. This worked relatively well in physics, theoretical mechanics, and other areas of science and engineering. However, in social sciences, market research, medicine, pharmacy, molecular biology, learning and perception, and in many other areas, the complexity of the natural processes and their common lack of analytical smoothness almost totally exclude the use of standard tools of mathematics for the purpose of databased modeling. A fundamentally different approach is needed in those areas. The availability of fast data processors creates new possibilities in that respect. This need for alternative approaches to modeling from data was recognized some time ago by researchers working in the areas of neural nets, inductive learning, rough sets, and, more recently, data mining. The empirical models in the form of data-based structures of decision tables or rules play similar roles to formulas in classical analytical modeling. Such models can be analyzed, interpreted, and optimized using methods of rough set theory.
BACKGROUND The theory of rough sets was originated by Pawlak (1982) as a formal mathematical theory, modeling knowledge about a universe of interest in terms of a collection of equivalence relations. Its main application areas are acquisition, analysis, and optimization of computerprocessible models from data. The models can represent functional, partially functional, and probabilistic relations existing in data in the extended rough set approaches (Grzymala-Busse, 1998; Katzberg & Ziarko, 1996; Slezak & Ziarko, 2003; Ziarko, 1993). When deriving the models in the context of the rough set theory, there is no need for any additional information about data, such as, for example, probability distribution function in statistical theory, grade of membership in fuzzy set theory, and so forth (Grzymala-Busse, 1988).
The original rough set approach is concerned with investigating properties and limitations of knowledge. The main goal is forming discriminative descriptions of subsets of a universe of interest. The approach is also used to investigate and prove numerous useful algebraic and logical properties of knowledge and of approximately defined sets, called rough sets. The knowledge is modeled by an equivalence relation representing the ability to partition the universe into classes of indiscernible objects, referred to as elementary sets. The presence of the idea of approximately defined sets is a natural consequence of imperfections of existing knowledge, which may be incomplete, imprecise, or uncertain. Only an approximate description, in general, of a set (target set) can be formed. The approximate description consists of specification of lower and upper set approximations. The approximations are definable sets. The lower approximation is a union of all elementary sets contained in the target set. The upper approximation is a union of all elementary sets overlapping the target set. This ability to create approximations of nondefinable, or rough, sets allows for development of approximate classification algorithms for prediction, machine learning, pattern recognition, data mining, and so forth. In these algorithms, the problem of classifying an observation into an undefinable category, which is not tractable, in the sense that the discriminating description of the category does not exist, is substituted by the problem of classifying the observation into an approximation of the category.
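The following sketch illustrates these definitions on a small, explicitly represented data set; it is an illustration of the standard construction, not code from any of the systems cited here. Objects are grouped into elementary sets by their values on the chosen attributes, and the lower and upper approximations of a target set are assembled from those classes.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Compute rough-set lower and upper approximations of `target`.

    objects:    dict mapping object id -> dict of attribute values
    attributes: attributes defining the indiscernibility relation
    target:     set of object ids (the target concept)
    """
    # Partition the universe into elementary sets (indiscernibility classes).
    classes = defaultdict(set)
    for obj, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(obj)

    lower, upper = set(), set()
    for elementary in classes.values():
        if elementary <= target:      # elementary set contained in the target set
            lower |= elementary
        if elementary & target:       # elementary set overlapping the target set
            upper |= elementary
    return lower, upper
```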
MAIN THRUST The article is focused on data-mining-related extensions of the original rough set model. Based on the representative extensions, data mining techniques and applications are reviewed.
Extensions of Rough Set Theory Developing practical applications of rough set theory revealed the limitations of this approach. For example,
when dealing with market survey data, it was not possible to identify non-empty lower approximation of the target category of buyers of a product. Similarly, it often was not possible to identify non-trivial upper approximation of the target category, such as would not extend over the whole universe. These limitations follow from the fact that practical classification problems are often non-deterministic. When dealing with such problems, perfect prediction accuracy is not possible and not expected. The need to make rough set theory applicable to a more comprehensive class of practical problems inspired the development of extensions of the original approach to rough sets. One such extension is the variable precision rough set model (VPRSM) (Ziarko, 1993). As in the original rough set theory, set approximations also are formed in VPRSM. The VPRSM criteria for forming the lower and upper approximations are relaxed, in particular by allowing a controlled degree of misclassification in the lower approximation of a target set. The resulting lower approximation represents an area of the universe where the correct classification can be made with desired probability of success, rather than deterministically. In this way, the VPRSM approach can handle a comprehensive class of problems requiring developing non-deterministic models from data. The VPRSM preserves all basic properties and algorithms of the Pawlak approach to rough sets. The algorithms are enhanced additionally with probabilistic information acquired from data (Katzberg & Ziarko, 1996; Ziarko, 1998, 2003, Ziarko & Xiao, 2004). The structures of decision tables and rules derived from data within the framework of VPRSM have probabilistic confidence factors to reflect the degree of uncertainty in classificatory decision making. The objective of such classifiers is to improve the probability of success rather than trying to guarantee 100% correct classification. Another extension of rough set theory is implemented in the data mining system LERS (GrzymalaBusse, 1992, 1994), in which rules are equipped with three coefficients characterizing rule quality: specificity (i.e., the total number of attribute-value pairs on the left-hand side of the rule); strength (i.e., the total number of cases correctly classified by the rule during training; and the total number of training cases matching the left-hand side of the rule. For classification of unseen cases, the LERS incorporates the ideas of genetic learning, extended to use partial matching of rules and cases. The decision to which a case belongs is made on the basis of support, defined as the sum of scores of all matching rules from the class, where a score of the rule is the product of the first two coefficients associated with the rule. As indicated by experiments, partial matching is a valuable mechanism when complete matching fails (Grzymala-Busse, 1994). In the LERS classification system, the user may use 16
strategies for classification. In some of these strategies, the final decision is based on probabilities acquired from raw data (Grzymala-Busse & Zou, 1998). Other extensions of rough set theory include generalizations of the basic concept of rough set theory—the indiscernibility relation. A survey of such methods was presented in Yao (2003).
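A simplified sketch of the support-based classification described above is given below. It is not the LERS code: the rule representation, the field names, and the weighting used for partial matching (fraction of matched conditions) are illustrative assumptions; only the scoring by specificity times strength and the summation of scores per class follow the description in the text.

```python
def classify(case, rules):
    """LERS-style classification sketch (not the actual LERS implementation).

    case:  dict mapping attribute -> value
    rules: list of dicts with keys 'conditions' (dict), 'decision',
           'specificity' and 'strength' (as defined in the text).
    """
    support = {}
    for rule in rules:
        conds = rule['conditions']
        if all(case.get(a) == v for a, v in conds.items()):     # complete matching
            score = rule['specificity'] * rule['strength']
            support[rule['decision']] = support.get(rule['decision'], 0) + score
    if support:
        return max(support, key=support.get)

    # Fall back to partial matching when no rule matches completely.
    for rule in rules:
        conds = rule['conditions']
        matched = sum(case.get(a) == v for a, v in conds.items())
        if matched:
            score = (matched / len(conds)) * rule['specificity'] * rule['strength']
            support[rule['decision']] = support.get(rule['decision'], 0) + score
    return max(support, key=support.get) if support else None
```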
From Data to Rough Decision Tables When deriving models from data within the rough set framework, one of the primary constructs is a decision table derived from data referred to as rough decision table (Pawlak, 1991; Ziarko, 1999, 2002a). The rough decision table represents knowledge about the universe of interest and the relation between the knowledge and the target set or sets. The idea of the rough decision table was formulated in both the original framework of rough sets and in the extended VPRSM. In the latter case, the table is called probabilistic decision table (Ziarko, 2002a). In the table, some columns correspond to descriptive attributes used to classify objects of the domain of interest, while other columns represent target sets or rough approximations of the sets. The rows of the table represent the classes of the classification of the domain in terms of the descriptive attributes. If the decision table contains representatives of all or almost all classes of the domain, and if the relation with the prediction targets is completely or almost completely specified, then the table can be treated as a model of the domain. Such a model represents descriptions of all or almost all objects of the domain and their relationship to the prediction target. The specification of the relationship may include empirical assessments of conditional probabilities, if the VPRSM approach is used in model derivation. If the model is complete enough, and if the data-based estimates of probabilities are relatively close to real values, then the decision table can be used as a basis of a classifier system. To ensure relative completeness and generality of the decision table model, the values of the attributes used to construct the classification of the domain need to be sufficiently general. For example, in many practical problems, rather than using precise numeric measurements, value ranges often are used after preliminary discretization of original precise values. This conversion of original data values into secondary, less precise representation is one of the major pre-processing steps in rough set-based methodology. The acquired decision table can be further analyzed and optimized using classical algorithms for interattribute dependency computation and minimal nonredundant subset of attributes (attribute reduct) identification (Pawlak, 1991; Ziarko 2002b).
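The construction of such a table can be sketched as follows. This is only an outline of the general idea, not the cited algorithms: rows are the classes induced by the (already discretized) attributes, and each row stores its support and the empirical conditional probability of the target concept.

```python
from collections import defaultdict

def probabilistic_decision_table(records, attributes, target_attr, target_value):
    """Build a simple probabilistic decision table from discretized data.

    records: list of dicts of attribute values
    Returns a dict: attribute-value tuple -> (support count, P(target | row)).
    """
    counts = defaultdict(lambda: [0, 0])          # row -> [total, positives]
    for r in records:
        row = tuple(r[a] for a in attributes)
        counts[row][0] += 1
        counts[row][1] += int(r[target_attr] == target_value)
    return {row: (total, positives / total)
            for row, (total, positives) in counts.items()}
```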
From Data to Rule Sets A number of systems for machine learning and data mining have been developed in the course of research on theory and applications of rough sets (Grzymala-Busse, 1992; Ohrn & Komorowski, 1997; Ziarko, 1998b; Ziarko et al., 1993). The representative example of such developments is the data mining system LERS, whose first version was developed at the University of Kansas in 1988. The current version of LERS is essentially a family of data mining systems. The main objective of LERS is computation of decision rules from data. Computed rule sets may be used for classification of new cases or for interpretation of knowledge. The LERS system may compute rules from imperfect data (Grzymala-Busse, 1992) (e.g., data with missing attribute values or inconsistent cases). LERS is also equipped with a set of discretization schemas to deal with numerical attributes. In addition, a variety of LERS methods may help to handle missing attribute values. LERS accepts inconsistent input data (i.e., characterized by the same values of all attributes, but belonging to two different target sets). For inconsistent data, LERS computes lower and upper approximations of all sets involved. The system is also assisted with tools for rule validation, such as leaving-one-out, 10-fold cross validation, and holdout. LERS has proven its applicability having been used for two years by NASA Johnson Space Center (Automation and Robotics Division) as a tool to develop expert systems of the type most likely to be used in medical decision making on the board of the International Space Station. LERS also was used to enhance facility compliance under Sections 311, 312, and 313 of Title III of the Emergency Planning and Community Right to Know (Grzymala-Busse, 1993). The LERS system was used in other areas, as well (e.g., in the medical field to compare the effects of warming devices for postoperative patients, to assess preterm birth) (Woolery & GrzymalaBusse, 1994) and for diagnosis of melanoma (GrzymalaBusse et al., 2001).
FUTURE TRENDS The literature related to the subject of rough sets exceeds well over 1,000 publications. By necessity, in what follows, we cite only some representative examples of the research works on the subject. A comprehensive up-to-date collection of references can be found online at http://rsds.wsiz.rzeszow.pl (Suraj, 2004). Following Pawlak’s original publication (Pawlak, 1991), the mathematical fundamentals of the original
rough set model were published in Polkowski (2002). There exists an extensive body of literature on rough set theory applications to knowledge discovery and data mining. In particular, a comprehensive review is available in Polkowski and Skowron (1998). The basic algorithms for data mining applications using the original rough set theory were summarized in Ziarko (2002b). Since the introduction of the original RST, several extensions of the original model were proposed (Greco et al., 2000; Slezak & Ziarko, 2003; Yao & Wong, 1992; Ziarko, 1993). In particular, VPRSM was published for the first time in Ziarko (1993) and was further investigated in Kryszkiewicz (1994), Beynon (2000), Slezak and Ziarko (2003), and others, and served as a basis of a novel approach to inductive logic programming (Maheswari et al., 2001). The probabilistic decision tables were introduced in Ziarko (1998b). The LERS system was first described in Grzymala-Busse (1992). Its most important algorithm, LEM2, was presented in Chan and Grzymala-Busse (1994). Some applications of LERS were published in Freeman et al. (2001), Gunn and Grzymala-Busse (1994), Grzymala-Busse et al. (2001), Grzymala-Busse and Gunn (1995), Grzymala-Busse and Woolery (1994), Loupe et al. (2001), Moradi et al. (1995), and Woolery et al. (1991). It appears that utilizing extensions of the original rough set theory is the main trend in data mining applications of this approach. In particular, a number of sources reported experiments using rough set theory for medical diagnosis, control, and pattern recognition, including speech recognition, handwriting recognition, and music fragment classification (Brindle & Ziarko, 1999; Kostek, 1998; Mrozek, 1986; Peters et al., 1999; Plonka & Mrozek, 1995; Shang & Ziarko, 2003). These technologies are far from maturity, which indicates that the trend toward developing applications based on extensions of rough set theory will continue.
CONCLUSION Data mining and machine learning applications based on the original approach to rough set theory and, more recently, on extensions and generalizations of rough set theory, have been attempted for about 20 years now. Due to space limits, this article mentions only example experimental and real-life application projects. The projects confirm the viability of rough set theory as a fundamental framework for data mining, machine learning, pattern recognition, and related application areas, and provide inspiring feedback toward continuing growth of the rough set approach to better suit the needs of real-life application problems.
REFERENCES

Beynon, M. (2000). An investigation of beta-reduct selection within variable precision rough sets model. Proceedings of the 2nd International Conference on Rough Sets and Current Trends in Computing, Banff, Canada.

Brindle, D., & Ziarko, W. (1999). Experiments with rough set approach to speech recognition. Proceedings of the International Conference on Methodologies for Intelligent Systems, Warsaw, Poland.

Chan, C.C., & Grzymala-Busse, J.W. (1994). On the two local inductive algorithms: PRISM and LEM2. Foundations of Computing and Decision Sciences, 19, 185-203.

Freeman, R.L., Grzymala-Busse, J.W., Laura, A., Riffel, L.A., & Schroeder, S.R. (2001). Analysis of self-injurious behavior by the LERS data mining system. Proceedings of the Japanese Society for AI, International Workshop on Rough Set Theory and Granular Computing, RSTGC-2001, Shimane, Japan.

Greco, S., Matarazzo, B., Slowinski, R., & Stefanowski, J. (2000). Variable consistency model of dominance-based rough sets approach. Proceedings of the 2nd International Conference on Rough Sets, Banff, Canada.

Grzymala-Busse, J.P., Grzymala-Busse, J.W., & Hippe, Z.S. (2001). Melanoma prediction using data mining system LERS. Proceedings of the 25th Anniversary Annual International Computer Software and Applications Conference COMPSAC 2001, Chicago, Illinois.

Grzymala-Busse, J.W. (1992). LERS—A system for learning from examples based on rough sets. In R. Slowinski (Ed.), Intelligent decision support: Handbook of applications and advances of the rough sets theory. Kluwer.

Grzymala-Busse, J.W. (1993). ESEP: An expert system for environmental protection. Proceedings of the RSKD-93, International Workshop on Rough Sets and Knowledge Discovery, Banff, Canada.

Grzymala-Busse, J.W. (1994). Managing uncertainty in machine learning from examples. Proceedings of the Third Intelligent Information Systems Workshop, Wigry, Poland.

Grzymala-Busse, J.W., & Werbrouck, P. (1998). On the best search method in the LEM1 and LEM2 algorithms. In E. Orlowska (Ed.), Incomplete information: Rough set analysis. Physica-Verlag.

Grzymala-Busse, J.W., & Zou, X. (1998). Classification strategies using certain and possible rules. Proceedings of the First International Conference on Rough Sets and Current Trends in Computing, Warsaw, Poland.

Gunn, J.D., & Grzymala-Busse, J.W. (1994). Global temperature stability by rule induction: An interdisciplinary bridge. Human Ecology, 22, 59-81.
Katzberg, J., & Ziarko, W. (1996). Variable precision extension of rough sets. Fundamenta Informaticae, Special Issue on Rough Sets, 27, 155-168. Kostek, B. (1998). Computer-based recognition of musical phrases using the rough set approach. Journal of Information Sciences, 104, 15-30. Kryszkiewicz, M. (1994). Knowledge reduction algorithms in information systems [doctoral thesis]. Warsaw, Poland: Warsaw University of Technology. Loupe, P.S., Freeman, R.L., Grzymala-Busse, J.W., & Schroeder, S.R. (2001). Using rule induction for prediction of self-injuring behavior in animal models of development disabilities. Proceedings of the 14th IEEE Symposium on Computer-Based Medical Systems, Bethesda, Maryland. Maheswari, U., Siromoney, A., Mehata, K., & Inoue, K. (2001). The variable precision rough set inductive logic programming model and strings. Computational Intelligence, 17, 460-471. Moradi, H. et al. (1995). Entropy of English text: Experiments with humans and a machine learning system based on rough sets. Proceedings of the 2nd Annual Joint Conference on Information Sciences, Wrightsville Beach, North Carolina. Mrozek, A. (1986). Use of rough sets and decision tables for implementing rule-based control of industrial processes. Bulletin of the Polish Academy of Sciences, 34, 332-356. Ohrn, A., & Komorowski, J. (1997). ROSETTA: A rough set toolkit for analysis of data. Proceedings of the Third International Joint Conference on Information Sciences, Fifth International Workshop on Rough Sets and Soft Computing, Durham, North Carolina. Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341-356. Pawlak, Z. (1984). International Journal Man-Machine Studies, 20, 469. Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer. Pawlak, Z., Grzymala-Busse, J.W., Slowinski, R., & Ziarko, W. (1995). Rough sets. Communications of the ACM, 38, 89-95.
Peters, J., Skowron, A., & Suraj, Z. (1999). An application of rough set methods in control design. Proceedings of the Workshop on Concurrency, Warsaw, Poland. Plonka, L., & Mrozek, A. (1995). Rule-based stabilization of the inverted pendulum. Computational Intelligence, 11, 348-356. Polkowski, L. (2002). Rough sets: Mathematical foundations. Springer Verlag. Polkowski, L., & Skowron, A. (Eds.). (1998). Rough sets in knowledge discovery, 2, applications, case studies and software systems. Heidelberg: Physica Verlag. Shang, F., & Ziarko, W. (2003). Acquisition of control algorithms. Proceedings of the International Conference on New Trends in Intelligent Information Processing and Web Mining, Zakopane, Poland. Slezak, D., & Ziarko, W. (2003). Variable precision Bayesian rough set model. Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Chongqing, China. Slowinski, R. (Ed.). (1992). Decision support by experience: Rough sets approach. Kluver. Suraj, Z., & Grochowalski, P. (2004). The rough set data base system: An overview. Proceedings of the International Conference on Rough Sets and Current Trends in Computing, Uppsala, Sweden.
Tsumoto, S. (2003). Extracting structure of medical diagnosis: Rough set approach. Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Chongqing, China.

Woolery, L., Grzymala-Busse, J., Summers, S., & Budihardjo, A. (1991). The use of machine learning program LERS_LB 2.5 in knowledge acquisition for expert system development in nursing. Computers in Nursing, 9, 227-234.

Yao, Y.Y. (2003). On generalizing rough set theory. Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Chongqing, China.

Yao, Y.Y., & Wong, S.K.M. (1992). A decision theoretic framework for approximating concepts. International Journal of Man-Machine Studies.

Ziarko, W. (1993). Variable precision rough sets model. Journal of Computer and Systems Sciences, 46, 39-59.

Ziarko, W. (1998a). Approximation region-based decision tables. Proceedings of the International Conference on Rough Sets and Current Trends in Computing, Warsaw, Poland.

Ziarko, W. (1998b). KDD-R: Rough sets-based data mining system. In L. Polkowski & A. Skowron (Eds.), Rough sets in knowledge discovery, Part II (pp. 598-601). Springer Verlag.

Ziarko, W. (2002a). Acquisition of hierarchy-structured probabilistic decision tables and rules from data. Proceedings of the IEEE International Conference on Fuzzy Systems, Honolulu, Hawaii.

Ziarko, W. (2002b). Rough set approaches for discovery of rules and attribute dependencies. In W. Kloesgen & J. Zytkow (Eds.), Handbook of data mining and knowledge discovery (pp. 328-339). Oxford University Press.

Ziarko, W., Golan, R., & Edwards, D. (1993). An application of datalogic/R knowledge discovery tool to identify strong predictive rules in stock market data. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, Washington, D.C.

Ziarko, W., & Xiao, X. (2004). Computing minimal probabilistic rules from probabilistic decision tables: Decision matrix approach. Proceedings of the Atlantic Web Intelligence Conference, Cancun, Mexico.

KEY TERMS

Definable Set: A set that has a description precisely discriminating elements of the set from among all elements of the universe of interest.
Decision Rule: Specification of the relationship between collection of observations (conditions) and an outcome (a decision).
LERS: A comprehensive system for data mining based on rough sets.

Lower Approximation of a Rough Set: Maximum definable set contained in the rough set.

Rough Decision Table: Collection of disjoint decision rules of identical format.

Rough Set: An undefinable set.

Upper Approximation of a Rough Set: Minimum definable set containing the rough set.

Variable Precision Rough Set Model: An approach to forming lower and upper approximations of a rough set via generalized parametric definitions.
Rule Generation Methods Based on Logic Synthesis Marco Muselli Italian National Research Council, Italy
INTRODUCTION One of the most relevant problems in artificial intelligence is allowing a synthetic device to perform inductive reasoning, i.e. to infer a set of rules consistent with a collection of data pertaining to a given real world problem. A variety of approaches, arising in different research areas such as statistics, machine learning, neural networks, etc., have been proposed during the last 50 years to deal with the problem of realizing inductive reasoning. Most of the developed techniques build a black-box device, which has the aim of solving efficiently a specific problem generalizing the information contained in the sample of data at hand without caring about the intelligibility of the solution obtained. This is the case of connectionist models, where the internal parameters of a nonlinear device are adapted by an optimization algorithm to improve its consistency on available examples while increasing prediction accuracy on unknown data. The internal structure of the nonlinear device and the training method employed to optimize the parameters determine different classes of connectionist models: for instance, multilayer perceptron neural networks (Haykin, 1999) consider a combination of sigmoidal basis functions, whose parameters are adjusted by a local optimization algorithm, known as back-propagation. Another example of connectionist model is given by support vector machines (Vapnik, 1998), where replicas of the kernel of a reproducing kernel Hilbert space are properly adapted and combined through a quadratic programming method to realize the desired nonlinear device. Although these models provide a satisfactory way of approaching a general class of problems, the behavior of synthetic devices realized cannot be directly understood, since they generally involve the application of nonlinear operators, whose meaning is not directly comprehensible. Discriminant analysis techniques as well as statistical nonparametric methods (Duda, Hart, & Stork., 2001), like k-nearest-neighbor or projection pursuit, also belong to the class of black-box approaches, since the reasoning followed by probabilistic models to perform a prediction cannot generally be expressed in an intelligible form.
However, in many real world applications the comprehension of this predicting task is crucial, since it provides a direct way to analyze the behavior of the artificial device outside the collection of data at our disposal. In these situations the adoption of black-box techniques is not acceptable and a more convenient approach is offered by rule generation methods (Duch, Setiono, & Zurada, 2004), a particular class of machine learning techniques that are able to produce a set of intelligible rules, in the if-then form, underlying the real world problem at hand. Several different rule generation methods have been proposed in the literature: some of them reconstruct the collection of rules by analyzing a connectionist model trained with a specific optimization algorithm (Setiono, 2000; Setnes, 2000); others generate the desired set of rules directly from the given sample of data. This last approach is followed by algorithms that construct decision trees (Hastie, Tibshirani, & Friedman, 2001; Quinlan, 1993) and by techniques in the area of Inductive Logic Programming (Boytcheva, 2002; Quinlan & Cameron-Jones, 1995). A novel methodology, adopting proper algorithms for logic synthesis to generate the set of rules pertaining to a given collection of data (Boros, Hammer, Ibaraki, & Kogan, 1997; Boros et al., 2000; Hong, 1997; Sanchez, Triantaphyllou, Chen, & Liao, 2002; Muselli & Liberati, 2000), has been recently proposed and forms the subject of the present chapter. In particular, the general procedure followed by this class of methods will be outlined in the following sections, analyzing in detail the specific implementation followed by one of these techniques, Hamming Clustering (Muselli & Liberati, 2002), to better comprehend the peculiarities of the rule generation process.
BACKGROUND Any logical combination of simple conditions can always be written as a Disjunctive Normal Form (DNF) of binary variables, each of which takes into account the fulfillment of a particular condition. Thus, if the inductive reasoning to be performed amounts to making a binary decision, the optimal set of if-then rules can be
associated with a Boolean function f that assigns the most probable output to every case. Since the goal of methods for logic synthesis is exactly the determination of the DNF for a Boolean function f, starting from a portion of its truth table, they can be directly used to generate a set of rules for any pattern recognition problem by examining a finite collection of examples, usually called training set. To allow the generalization of the information contained in the sample at hand, a proper logic synthesis technique, called Hamming Clustering (HC) (Muselli & Liberati, 2000; Muselli & Liberati, 2002), has been developed. It proceeds by grouping together binary strings with the same output value, which are close among them according to the Hamming distance. Theoretical results (Muselli & Liberati, 2000) ensure that HC has a polynomial computational cost $O(n^2cs + nc^2)$, where n is the number of input variables, s is the size of the given training set, and c is the total number of AND ports in the resulting digital circuit. A similar, more computationally intensive, methodology has been proposed by Boros et al. (2000). Every method based on logic synthesis shows the two following advantages:

• It generates artificial devices that can be directly implemented on a physical support, since they are not affected by problems connected with the precision used when numbers are stored.
• It determines automatically the significant inputs for the problem at hand (feature selection).
MAIN THRUST A typical situation, where inductive reasoning has to be performed, is given by pattern recognition problems. Here, vectors x ∈ℜn, called patterns, have to be assigned to one of two possible classes, associated with the values of a binary output y, coded by the integers 0 and 1. This assignment task must be consistent with a collection of m examples (xi,yi), i = 1, …, m, called training set, obtained by previous observations for the problem at hand. The target is to retrieve a proper binary function g(x) that provides the correct answer y = g(x) for most input patterns x.
Solving Pattern Recognition Problems Through Logic Synthesis

Inductive reasoning occurs if the form of the target function g is directly understandable; a possible way of achieving this result is to write g as a collection of intelligible rules in the if-then form. The conditions included in the if part of each rule act on the input variables contained in the vector x; consequently, they have a different form depending on the range of values assumed by the component xj of the vector x to which they refer. Three situations can be devised:

1. Continuous Variables: xj varies within an interval [a,b] of the real axis; no upper bound on the number of different values assumed by xj is given.
2. Discrete (Ordered) Variables: xj can assume only the values contained in a finite set; typically, the first positive integers are considered with their natural ordering.
3. Nominal Variables: as for discrete variables, xj can assume only the values contained in a finite set, but there is no ordering relationship among them; again, the first positive integers are usually employed for the values of xj.

Binary variables are considered as a particular case of nominal variables, but the values 0 and 1 are used instead of 1 and 2. Henceforth, only threshold conditions of the kind xj < c, being c a real number, will be considered for inclusion in the if part of a rule, when xj is a continuous or a discrete variable. On the other hand, a nominal variable xj ∈ {1,2,…,k} will participate in g only through membership conditions, like xj ∈ {1,3,4}. Separate conditions are composed only by means of AND operations, whereas different rules are applied as if they were linked by an OR operator. As an example, consider the problem of analyzing the characteristics of clients buying a certain product: the average weekly expense x1 is a continuous variable assuming values in the interval [0,10000], whereas the age of the client x2 is better described by a discrete variable in the range [0,120]. His/her activity x3 gives an example of a nominal variable; suppose we consider only four categories: farmer, worker, employee, and manager, coded by the integers 1, 2, 3, and 4, respectively. A final binary variable x4 is associated with the gender of the client (0 = male, 1 = female). With this notation, two rules for the problem at hand can assume the following form:

if x1 > 300 AND x3 ∈ {1,2} then y = 1 (he/she buys the product)
if x2 < 20 AND x4 = 0 then y = 0 (he/she does not buy the product)

Note that x3 ∈ {1,2} refers to the possibility that the client is a farmer or a worker, whereas x4 = 0 (equivalent to x4 ∈ {0}) means that the client must be a male to verify the conditions of the second rule.
Figure 1. General procedure followed by logic synthesis techniques to perform inductive reasoning in a pattern recognition problem

1. The input vector x is mapped into a binary string z by using a proper coding β(x) that preserves the basic properties (ordering and distance) of every component xj.
2. The AND-OR expression of a Boolean function f(z) is retrieved starting from the available examples (xi,yi) (coded in binary form as (zi,yi), being zi = β(xi)).
3. Each logical product in the AND-OR expression of f(z) is directly translated into an intelligible rule underlying the problem at hand. This amounts to writing the target function g(x) as f(β(x)).
The general approach followed by techniques relying on logic synthesis is sketched in Fig. 1.
Coding the Training Set in Binary Format At the first step the entire pattern recognition problem has to be rewritten in terms of Boolean variables; since the output is already binary, we have to translate only the input vector x in the desired form. To this aim, we consider for every component xj a proper binary coding that preserves the basic properties of ordering and distance. However, as the number of bits employed in the coding depends on the range of values assumed by the component xj, it is often necessary to perform a preliminary discretization step to reduce the length of the resulting binary string. Given a specific training set, several techniques (Boros et al., 2000; Dougherty, Kohavi, & Sahami, 1995; Liu & Setiono, 1997) are available in the literature to perform the discretization step while minimizing the loss of information involved in this task. Suppose that in our example, concerning a marketing problem, the range for input x 1 has been split into five intervals [0,100], (100,300], (300,500], (500,1000], and (1000,10000]. We can think that the component x1 has now become a discrete variable assuming integer values in the range [1,5]: value 1 means that the average weekly expense lies within [0,100], value 2 is associated with the interval (100,300] and so on. Discretization can also be used for the discrete input x2 to reduce the number of values it may assume. For example, the four intervals [0,12], (12,20], (20,60], and (60,120] could have been determined as an optimal subdivision, thus resulting in a new discrete component x2 assuming integer values in [1,4]. Note that after discretization any input variable can be either discrete (ordered) or nominal; continuous variables
no longer occur. Thus, the mapping required at Step 1 of Fig. 1 can be realized by considering a component at a time and by employing the following codings:

1. Thermometer Code (for discrete variables): it adopts a number of bits equal to the number of values assumed by the variable minus one and sets to 1 the leftmost k–1 bits to code the value k. For example, the component x1, which can assume five different values, will be mapped into a string of four bits; in particular, the value x1 = 3 is associated with the binary string 1100.
2. Only-One Code (for nominal variables): it adopts a number of bits equal to the number of values assumed by the variable and sets to 1 only the kth bit to code the value k. For example, the component x3, which can assume four different values, will be mapped into a string of four bits; in particular, the value x3 = 3 is associated with the binary string 0010.

Binary variables do not need any code, but are left unchanged by the mapping process. It can be shown that these codings maintain the properties of ordering and distance, if the Hamming distance (given by the number of different bits) is employed in the set of binary strings. Then, given any input vector x, the binary string z = β(x), required at Step 1 of Fig. 1, can be obtained by applying the proper coding to each of its components and by taking the concatenation of the binary strings obtained. As an example, a 28-year-old female employee with an average weekly expense of 350 dollars is described (after discretization) by the vector x = (3,2,3,1) and coded by the binary string z = 1100|100|0010|1 (the symbol '|' has only the aim of subdividing the contribution of different components). In fact, x1 = 3 gives 1100, x2 = 2 yields 100, x3 = 3 maps into 0010 and, finally, x4 = 1 is left unchanged.
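The two codings are easy to reproduce in code. The sketch below is an illustration written for this summary (not the article's software) and rebuilds the example string 1100|100|0010|1 for the vector x = (3,2,3,1).

```python
def thermometer(value, n_values):
    """Thermometer code: n_values - 1 bits, the leftmost value - 1 bits set to 1."""
    return [1 if i < value - 1 else 0 for i in range(n_values - 1)]

def only_one(value, n_values):
    """Only-one code: n_values bits, only the value-th bit set to 1."""
    return [1 if i == value - 1 else 0 for i in range(n_values)]

# x = (3, 2, 3, 1): expense class, age class, activity, gender (binary, kept as-is)
z = thermometer(3, 5) + thermometer(2, 4) + only_one(3, 4) + [1]
print(z)   # [1,1,0,0, 1,0,0, 0,0,1,0, 1]  ->  1100|100|0010|1
```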
Hamming Clustering Through the adoption of the above mapping, the m examples (xi,yi) of the training set are transformed into m pairs (zi,yi) = (β (xi),yi), which can be considered as a portion of the truth table of a Boolean function to be reconstructed. Here, the procedure for rule generation in Fig. 1 continues at Step 2, where a suitable method for logic synthesis, like HC, has to be employed to retrieve a Boolean function f(z) that generalizes the information contained in the training set. A basic concept in the procedure followed by HC is the notion of cluster, which is the collection of all the binary strings having the same values in a fixed subset of components; for instance, the four binary strings 01001, 01101, 11001, 11101 form a cluster since all of them only have the values 1, 0, and 1 in the second, the fourth and the fifth component, respectively. This cluster is usually written as *1*01, by placing a don’t care symbol ‘*’ in the positions that are not fixed, and it is said that the cluster *1*01 covers the four binary strings above. Every cluster can be associated with a logical product among the bits of z, which gives output 1 for all and only the binary strings covered by that cluster. For example, the cluster *1*01 corresponds to the logical product z 2 z 4 z 5 , being z4 the complement of the fourth bit z 4. The desired Boolean function f(z) can then be constructed by generating a valid collection of clusters for the binary strings belonging to a selected class. The procedure employed by HC consists of the four steps shown in Fig. 2. Once the example (z i,yi) in the training set has been randomly chosen at Step 1, a cluster of points including z i is to be generated and associated with the class y i. Since each cluster is uniquely associated with an AND operation among the bits of the input string z, it is straightforward to build at Step 4 the AND -OR expression for the reconstructed Boolean function f(z). However, every cluster can also be directly translated into an intelligible rule having in its if part conditions on the components of the original input vector x. To this aim, it is sufficient to analyze the patterns covered by that
cluster to produce proper consistent threshold or membership conditions; this is the usual way to perform Step 3 of Fig. 1. An example may help understanding this passage: suppose the application of a logic synthesis method, like HC, for the marketing problem has produced the cluster 11**|***|**00|* for the output y = 1. Then, the associate rule can be generated by examining the parts of the cluster that do not contain only don’t care symbols. These parts allow to obtain as many conditions on the corresponding components of vector x. In the case above the first four positions of the cluster contain the sequence 11**, which covers the admissible binary strings 1100, 1110, 1111 (according to the thermometer code), associated with the intervals (300,500], (500,1000], and (1000,10000] for the first input x1. Thus, the sequence 11** can be translated into the threshold condition x 1 > 300. Similarly, the sequence **00 covers the admissible binary strings 1000 and 0100 (according to the only-one code) and corresponds therefore to the membership condition x3∈{1,2}. Hence, the resulting rule is if x 1 > 300 AND x3∈{1,2} then y = 1
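The translation from a cluster back to an intelligible rule can be illustrated with a short sketch. This is not the Hamming Clustering implementation: it only shows how a cluster written with don't-care symbols covers coded examples, using the codings introduced earlier; the test strings below are constructed for this illustration.

```python
def covers(cluster, string):
    """True if a cluster with don't-care symbols '*' covers a binary string."""
    return all(c == '*' or c == b for c, b in zip(cluster, string))

# Cluster 11**|***|**00|* associated with y = 1, checked against two coded clients.
print(covers('11**|***|**00|*', '1100|100|0010|1'))   # False: x3 = 3 violates x3 in {1,2}
print(covers('11**|***|**00|*', '1110|100|1000|1'))   # True:  x1 > 300 and x3 = 1 (farmer)
```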
FUTURE TRENDS Note that in the approach followed by HC several clusters can lead to the same rule; for instance, both the implicants 11**|***|**00|* and *1**|***|**00|* yield the condition x1 > 300 AND x3∈{1,2}. On the other side, there are clusters that do not correspond to any rule, such as 01**|***|11**|*. Even if these last implicants were not generated by the logic synthesis technique, they would increase the complexity of reconstructing a Boolean function that generalizes well. To overcome this drawback, a new approach is currently under examination: it considers the possibility of removing the NOT operator from the resulting digital circuits, which is equivalent to employing the class of monotone Boolean functions for the construction of the desired set of rules. In fact, it can be shown that such an
Figure 2. Procedure employed by Hamming Clustering to reconstruct a Boolean function from examples

1. Choose at random an example (zi,yi) in the training set.
2. Build a cluster of points including zi and associate that cluster with the class yi.
3. Remove the example (zi,yi) from the training set. If the construction is not complete, go to Step 1.
4. Simplify the set of clusters generated and build the AND-OR expression of the corresponding Boolean function f(z).
approach leads to a unique bi-directional correspondence between clusters and rules, thus reducing the computational cost needed to perform inductive reasoning, while maintaining a good generalization ability.
Haykin, S. (1999). Neural network: A comprehensive foundation. London: Prentice Hall.
CONCLUSION
Liu, H., & Setiono, R. (1997). Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9, 642-645.
Inductive reasoning is a crucial task when exploring a collection of data to retrieve some kind of intelligible information. The realization of an artificial device or of an automatic procedure performing inductive reasoning is a basic challenge that involves researchers working in different scientific areas, such as statistics, machine learning, data mining, fuzzy logic, etc. A possible way of attaining this target is offered by the employment of a particular kind of techniques for logic synthesis. They are able to generate a set of understandable rules underlying a real-world problem starting from a finite collection of examples. In most cases the accuracy achieved is comparable or superior to that of best machine learning methods, which are however unable to produce intelligible information.
KEY TERMS

Boolean Function: A binary function that maps binary strings (with fixed length) into a binary value. Every Boolean function can be written as an expression containing only AND, OR, and NOT operations.

Hamming Distance: The distance between two binary strings (with the same length) given by the number of different bits.
Inductive Reasoning: The task of extracting intelligible information from a collection of examples pertaining to a physical system.

Logic Synthesis: The process of reconstructing an unknown Boolean function from (a portion of) its truth table.

Pattern Recognition Problem: A decision problem where the state of a system (described by a vector of inputs) has to be assigned to one of two possible output classes, generalizing the information contained in a set of examples. The same term is also used to denote classification problems, where the number of output classes is greater than two.

Rule Generation: An automatic way of performing inductive reasoning through the generation of understandable rules underlying the physical system at hand.

Truth Table: The collection of all the input-output pairs for a Boolean function.
Rule Qualities and Knowledge Combination for Decision-Making
Ivan Bruha
McMaster University, Canada
INTRODUCTION
Within the past several years, research in decision-supporting systems has been investigating the possibilities of enhancing their overall performance, particularly their prediction (classification) accuracy, or performance quality, and their time complexity. One such discipline, data mining (DM), processes usually very large databases in a profound and robust way. Since data are collected and stored at an ever-increasing rate these days, there has been an urgent need for a new generation of robust software packages to extract useful information or knowledge from large volumes of data. Research is expected to develop methods and techniques to process large data sets in order to obtain the knowledge hidden in these databases: knowledge that is compact, more or less abstract, but understandable and useful for further applications. DM usually is defined as a nontrivial process of identifying valid, novel, and ultimately understandable knowledge in data. It is understood that DM points to the overall process of deriving useful knowledge from databases (i.e., extracting high-level knowledge from low-level data in the context of large databases). It can be viewed as a multi-disciplinary activity, because it exploits several research disciplines of artificial intelligence (AI), such as machine learning, pattern recognition, expert systems, and knowledge acquisition, as well as mathematical disciplines such as statistics, theory of information, and uncertainty processing. This article discusses two enhancements in DM: rule quality and knowledge integration/combination, in the section Main Thrust of the Article. Possible future directions in these two fields are briefly discussed in the Future Trends section. The last section then analyzes the enhancements achieved by embedding the measures into rule-based classifiers and the multi-strategy approach in decision-supporting systems. It also should be noted that there is no uniform terminology in knowledge-intensive systems (including DM and machine learning, of course); therefore, here, we usually use not a single term but several of the most common terms that can be found in the literature. Also, some definitions are not uniform but overlap (see the section Terms and Definitions).

BACKGROUND
Data Mining (DM) or Knowledge Discovery in Databases (KDD) utilizes several paradigms for extracting knowledge that then can be exploited as a decision scenario (architecture) within expert or classification (prediction) systems. One commonly used paradigm in Machine Learning (ML) is called divide and conquer, which induces decision trees. Another widely used covering paradigm generates sets of decision rules (e.g., the CNx family (Clark & Niblett, 1989), C4.5Rules, Ripper, etc.). However, rule-based classification systems face an important deficiency that needs to be solved in order to improve their predictive power. Traditional decision-making systems have depended on a single technique, strategy, or architecture; therefore, their accuracy and success have not been so high. New sophisticated decision-supporting systems utilize results obtained from several lower-level systems, each usually (but not required to be) based on a different paradigm, or combine or refine them within a dynamic process. Thus, such a multi-strategy (hybrid) system consists of two or more individual agents that interchange information and cooperate together. It should be noted that there are, in fact, two fundamental approaches for combining the information from multi-data tasks:
1. In data combination, the datasets are merged into a single set before the actual knowledge acquisition.
2. In knowledge (theory) combination, or sensor fusion, several agents (base classifiers, sensors) process each input dataset separately, and the induced models (knowledge bases) then are combined at the higher level.
The next section discusses the latter approach, including the more general aspect of knowledge integration. There are various knowledge combination schemes (e.g., selecting the best model, weighted voting, sensitive voting, Bayesian combination, etc.). The next section focuses on relatively new trends in knowledge combination. Furthermore, there are two types of agents in the multi-strategy (knowledge combination) decision-supporting
architecture. The simpler one yields a single decision; the more sophisticated one induces a list of several decisions. In both types, each decision should be accompanied by the agent's confidence (belief) in it. These functional measurements are supported mostly by statistical analysis that is based on both the certainty (accuracy, predictability) of the agent as well as the consistency of its decision. There have been quite a few research inquiries to define such statistics formally; some, however, have yielded quite complex and hard-to-evaluate formulas, so that they have never been used. The following section presents a simpler but more understandable approach to defining these measurements.
MAIN THRUST AND BACKGROUND

(a) Rule Quality

A rule-inducing algorithm may yield either an ordered or unordered set of decision rules. The latter seems to be more understandable by humans and directly applicable in most decision-supporting systems. However, the classification utilizing an unordered set of decision rules exhibits a significant deficiency, not immediately apparent. Three cases are possible:
1. If an input unseen (to-be-classified) object satisfies (matches, fires for) one or more rules of the same class, then the object is categorized to the class assigned to the rule(s).
2. If the unseen object is not covered by any rule, then either the classifier informs the user about its inability to decide ('I do not know'), or the object is assigned by default to the majority class in the training set, or some similar technique is invoked.
3. Difficulty arises if the input object satisfies more rules assigned to different classes. Then, some scheme has to be applied to assign the unseen input object to the most appropriate class.
One possibility to clarify the conflict situation (case 3) of multiple-rule systems is to associate each rule in the decision scheme (knowledge base, model) of a classifier with a numerical factor that can express its properties and characterize a measure of belief in the rule, its power, predictability, reliability, likelihood, and so forth. A collection of these properties is symbolized by a function commonly called the rule quality. After choosing a formula for the rule quality, we also have to select a scheme for combining these qualities (quality combination). Quality of rules, its methodology, as well as appropriate formulas have been discussed for many years. Bergadano, et al. (1992) is one of the first papers that
introduces various definitions and formulas for the rule quality; besides the rule's power and predictability, it measures its size, understandability, and the like. Formulas for the rule quality have been studied and tested further in several other papers (An & Cercone, 2001; Hipp et al., 2002). A survey of the rule combinations can be found in Kohavi and Kunz (1997). Comprehensive analysis and empirical expertise of formulas of rule qualities and their combining schemes have been published in Bruha and Tkadlec (2003) and their theoretical methodology in Tkadlec and Bruha (2003). The first one introduces quite a few statistical and empirical formulas for the rule quality, including the quality combinations, and compares them. A rule quality, in most cases, is a function of its consistency (sensitivity), completeness (coverage, positive predictive value), and other statistics, such as a rule's matching rates. Because we deal with real-world noisy data, any decision set induced must be not only reliable but also powerful. Its reliability is characterized by a consistency factor and its power by a completeness. These and other statistical factors usually are defined by means of the so-called 2×2 contingency table. The latter paper introduces theoretical formalism and methodological tools for building multiple-rule systems. It focuses on four agents that cooperate with each other: designer, learner, classifier, and predictor. The paper offers to a designer of a new multiple-rule system the minimum requirements for the previously discussed concepts and (mostly statistical) characteristics that the designer can start with. It also exhibits a general flow chart for a decision-system builder. In addition to the rule quality discussed previously, there are other rule measures, such as its size (i.e., the number of attribute pairs involved), computational complexity, comprehensibility ('Is the rule telling humans something interesting about the application domain?'), understandability, redundancy (measured within the entire decision set of rules), and the like (Tan, Kumar & Srivastava, 2004). However, some of these characteristics are subjective; on the contrary, formulas for rule quality are supported by theoretical sources or profound empirical expertise. In most decision-supporting systems, the rule qualities are static, constant, and calculated a priori before the actual classification or prediction. Their predictability can be improved by a dynamic change of their values during the classification process. One possible scheme implants a feedback loop from the classifier to the learner (Bruha, 2000); it refines (modifies) the rule qualities according to the correct/false predictions made by the classifier by changing the qualities of the rules that were involved in the current classification. The entire refinement method thus may be viewed as a (semi-) meta-learning, because a portion of the model induced by learning is modified within classification (see the next section).
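As an illustration of how a rule quality can be derived from the consistency and completeness statistics mentioned above, the following sketch computes both from the counts of a 2×2 contingency table and blends them with a weight. The weighted-sum combination and the function name are assumptions made for illustration, not the exact formulas studied in the cited papers.

def rule_quality(covered_correct, covered_wrong, uncovered_correct, uncovered_wrong, w=0.5):
    # Counts come from the 2x2 contingency table of one rule versus its class.
    covered = covered_correct + covered_wrong              # objects the rule fires for
    class_size = covered_correct + uncovered_correct       # objects of the rule's class
    consistency = covered_correct / covered if covered else 0.0        # reliability
    completeness = covered_correct / class_size if class_size else 0.0  # power/coverage
    return w * consistency + (1.0 - w) * completeness      # assumed weighted blend

# Example: a rule covering 40 of the 100 objects of its class plus 10 others
# (out of 1,000 training objects) gets consistency 0.8 and completeness 0.4.
quality = rule_quality(40, 10, 60, 890)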
(b) Knowledge Combination and Meta-Learning

Researchers of empirical ML and DM are concerned with such issues as computational cost of learning, search techniques, and predictive accuracy. A great deal of research in ML focuses on improving the topology of classifiers. There are approaches to combine various paradigms into one robust (hybrid, multi-strategy) system that utilizes the advantages of each subsystem and tries to eliminate their drawbacks. There is a general belief that integrating results obtained from multiple lower-level classifiers, each usually (but not required to be) based on a different paradigm, produces better performance. We can consider the boosting and bagging algorithms (Bauer & Kohavi, 1999) as already traditional topologies of this approach. Generally speaking, the main advantages of such hybrid systems are (i) better performance than that of the individual lower-level agents included; (ii) the ability to process multivariate data from different information sources; and (iii) better understanding of internal data processing when a complex task is solved. There are more or less three general techniques or topologies of multi-level knowledge-based systems (called knowledge integration):
1. Knowledge Combination/Selection: The input to such a system is usually formed by several knowledge bases (models) that are generated by various DM algorithms (learners). Each model (knowledge base) independently produces its decision about the prediction; these results then are combined into a final decision, or the best decision is selected according to a (usually statistical) criterion. In this architecture, the mechanism of quality of knowledge bases (model qualities) is usually put to use.
2. Knowledge Merging: Several models (knowledge bases) are merged into one robust, usually redundant model by utilizing statistics that accompany these models (i.e., model and rule qualities).
3. Knowledge Modification (Revision, Refining): The input is an existing (old) knowledge base and a new database. A DM algorithm revises (modifies, refines) the current knowledge base according to the knowledge that is hidden in the new database. The new knowledge base thus supersedes the old one by being updated with knowledge extracted from the new database.
We should state here that there is no uniform terminology in multiple hybrid systems. Therefore, we introduce a
couple of synonyms for each item. Also, there are quite a few various definitions, methodologies, topologies, and applications in this AI topic. We mention just a few of them. The first project in this field is evidently Brazdil and Torgo (1990); their system merges several decision trees generated by ID3 into a robust one. The already mentioned bagging and boosting algorithms can be viewed as representatives of multi-models. Another direction is formed by the system XCS, which is a mixture of genetic algorithms (GAs) and neural nets (Wilson, 1999). There are several extensions of this system (e.g., NXCS) (Armano et al., 2002). Another hybrid multi-system combines GAs with decision trees (Carvalho & Freitas, 2000). Many other enhancements and applications of meta-learning can be found in Druzdzel and Diez (2003), Brazdil, Soares, and da Costa (2003), and Todovski and Dzeroski (2003). All these research projects have revealed that metalearning improves the performance of the base classifiers. Knowledge modification is utilized quite often in Inductive Logic Programming (ILP); they usually use the term theory refinement (Haddawy et al., 2003; Wrobel, 1996). Fan, Chan, and Stolfo (1996) introduce the methodology of combiner, stack-generalizer, and one of the commonly used concepts of meta-combiners (meta-learners). Meta-learning can be viewed as learning from information generated by a set of base learners or, in other words, as learning of meta-knowledge on the learned information. The base learners, each usually utilizing a different inductive strategy, induce base classifiers; the base classifiers applied to a training set of examples form a so-called meta-database; it is then used by the metalearner to derive a meta-classifier. The two-level structure of classifiers then is used for making decisions about the input objects. Hence, this meta-classifier does not exploit the traditional select-best or by-vote strategy but rather combines the decisions of all the base classifiers. Bruha (2004) applies this scenario to the processing of unknown attribute values in multi-attribute rule-based algorithms. A wide survey of meta-learning can be found in Vilalta and Drissi (2000). It uses a slightly different taxonomy; besides the above meta-learner of base-learners, it also distinguishes dynamic selection of classifiers, building meta-rules matching problem with algorithm performance, inductive transfer and learning to learn, and learning classifier systems. As we stated in the introduction, we claim here that the term meta-learning (and knowledge integration) means different things to different researchers. There are many interesting issues in meta-combining, for instance, combining statistical/fuzzy data, (probability distribution of classes, quality of decision/perfor-
mance, reliability of each base classifier, cascade classifier) (Gama & Brazdil, 2000), and the like. Knowledge combination can be extended to higher-level combining methods. For instance, Fan, et al. (2002) investigates classifier ensembles and multi-level tree-structured combining methods. Bruha (2004b) explores three-level combiners; the first two are formed by the meta-combiner discussed previously, and the third level utilizes the decision of this meta-combiner and other classification systems, such as averaging, regression, best decision scenario, voting scenario, and naive Bayesian combination.
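The two-level meta-combiner topology described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: base classifiers are plain callables, the meta-learner is a trivial table of observed decision vectors, and unseen decision vectors fall back to majority voting; it is not the design of any particular system cited here, and all names are hypothetical.

from collections import Counter, defaultdict

def build_meta_database(base_classifiers, examples, labels):
    # Each meta-example holds the vector of base classifiers' decisions for
    # one training example, together with its true class.
    return [([clf(x) for clf in base_classifiers], y) for x, y in zip(examples, labels)]

def train_meta_classifier(meta_db):
    # A deliberately simple meta-learner: remember, for every observed vector
    # of base decisions, the most frequent true class.
    votes = defaultdict(Counter)
    for base_decisions, true_class in meta_db:
        votes[tuple(base_decisions)][true_class] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    def meta_classify(x, base_classifiers):
        key = tuple(clf(x) for clf in base_classifiers)
        # Unseen decision vectors fall back to plain majority voting.
        return table.get(key, Counter(key).most_common(1)[0][0])
    return meta_classify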
FUTURE TRENDS

The methodology and theoretical formalism for single decision rules can be extended to the entire decision set of rules (i.e., knowledge base, model). The model qualities consequently can be utilized in a multi-level decision-supporting system. As we already mentioned, its second level combines the decisions of the base models (utilizing their model qualities) in order to make the final decision. (We can view a base model as a physician who has to find a diagnosis for a patient, and the meta-level as a council of physicians that combines the decisions of all of the members according to the physicians' qualities and makes up the final verdict of the patient's diagnosis.) Also, knowledge combination and meta-learning can be extended in various ways:
• Other knowledge combination techniques, such as various types of voting, including dynamic selective voting.
• Higher-level knowledge combination systems, as discussed at the end of the previous section.
• Redundancy of lower-level knowledge bases and that of combined (multi-agent) systems.
• Embedding genetic algorithms and other optimization tools for generating more robust decision-making and knowledge-combining systems; particularly, genetic algorithms seem to be a very powerful and robust technique for inducing reliable and consistent knowledge bases (models).
• More intelligent cascade algorithms.
• Other knowledge integration techniques.
The area of research in this field is very open and promising. New ways of knowledge representation are also worth mentioning. Any knowledge combiner or meta-learner has to cooperate with the base agents, which could utilize various techniques. A uniform representation of these base agents (knowledge bases) will help to investigate and
support more sophisticated research in the field of knowledge combination topologies.
CONCLUSION

The concept of rule quality evidently solves some conflicts in multiple-rule systems. This direction is being studied further (e.g., more sophisticated quality combination scenarios, exploiting other statistics beyond contingency tables, etc.). We can observe that both research fields discussed in this article cooperate together as two sides of one coin. The research in knowledge combination and meta-learning continues in many directions. For instance, an interesting and fruitful issue of knowledge combination employs genetic algorithms: an original (old) knowledge base can be refined by an evolutionary process that utilizes new information (a new database) of a given task. A large source of papers on meta-learning and knowledge combination can be found on the Internet (e.g., http://www.metal-kdd.org).
REFERENCES

An, A., & Cercone, N. (2001). Rule quality measures for rule induction systems: Description and evaluation. Computational Intelligence, 17(3), 409-424.

Armano, G. et al. (2002). Stock market prediction by a mixture of genetic-neural experts. International Journal of Pattern Recognition and Artificial Intelligence, 16(5), 501-526.

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105-142.

Bergadano, F. et al. (1992). Learning two-tiered descriptions of flexible concepts: The Poseidon system. Machine Learning, 8, 5-43.

Brazdil, P., Soares, C., & da Costa, J.P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251-277.

Brazdil, P., & Torgo, L. (1990). Knowledge acquisition via knowledge integration. In B. Wielinga et al. (Eds.), Current trends in knowledge acquisition. Amsterdam: IOS Press.

Bruha, I. (2000). A feedback loop for refining rule qualities in a classifier: A reward-penalty strategy. Proceedings of the European Conference on Machine Learning (ECML-2000), Workshop Meta Learning, Spain.
Bruha, I. (2004a). Meta-learner for unknown attribute values processing: Dealing with inconsistency of meta-databases. Journal of Intelligent Information Systems, 22(1), 71-84.

Bruha, I. (2004b). Three-level tree-structured meta-combiner: A case study [submitted].

Bruha, I., & Tkadlec, J. (2003). Rule quality for multiple-rule classifier: Empirical expertise and theoretical methodology. Intelligent Data Analysis, 7, 99-124.

Carvalho, D.R., & Freitas, A.A. (2000). A hybrid decision tree/genetic algorithm for coping with the problem of small disjuncts in data mining. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000).

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261-283.

Druzdzel, M.J., & Diez, F.J. (2003). Combining knowledge from different sources in causal probabilistic models. Journal of Machine Learning Research, 4, 295-316.

Fan, D.W., Chan, P.K., & Stolfo, S.J. (1996). A comparative evaluation of combiner and stacked generalization. Proceedings of the AAAI-96, Workshop Integrating Multiple Learning Models.

Fan, W. et al. (2002). Progressive modelling. Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM-2002).

Gama, J., & Brazdil, P. (2000). Cascade generalization. Machine Learning, 41(3), 315-343.

Haddawy, P. et al. (2003). Preference elicitation via theory refinement. Journal of Machine Learning Research, 4, 317-337.

Hipp, J. et al. (2002). Data mining of association rules and the process of knowledge discovery in databases. Proceedings of the International Conference on Data Mining.

Kohavi, R., & Kunz, C. (1997). Option decision trees with majority votes. In D. Fisher (Ed.), Machine learning: Proceedings of the 14th International Conference (pp. 161-169). Morgan Kaufmann.

Tan, P.-N., Kumar, V., & Srivastava, J. (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4), 293-313.

Tkadlec, J., & Bruha, I. (2003). Formal aspects of a multiple-rule classifier. International Journal of Pattern Recognition and Artificial Intelligence, 17(4), 581-600.

Todorovski, L., & Dzeroski, S. (2003). Combining classifiers with meta decision trees. Machine Learning, 50(3), 223-249.

Vilalta, R., & Drissi, Y. (2002). A perspective view and survey of meta-learning. Journal of Artificial Intelligence Review, 18(2), 77-95.

Wilson, S.W. (1999). Get real: XCS with continuous-valued inputs. In L. Booker (Ed.), Festschrift in honor of J.H. Holland (pp. 111-121). University of Michigan.

Wrobel, S. (1996). First order theory refinement. In L. De Raedt (Ed.), Advances in inductive logic programming (pp. 14-33). IOS Press.

KEY TERMS

Classifier: A decision-supporting system that, given an unseen input object, yields a prediction (e.g., it classifies the given object to a certain class).

Decision Rule: An element (piece) of knowledge, usually in the form of an if-then statement: if <condition> then <action>. If its condition is satisfied (i.e., matches a fact in the corresponding database of a given problem), then its action (e.g., decision making) is performed.

Decision Set: Ordered or unordered set of decision rules; a common knowledge representation tool (utilized in most expert systems).

Knowledge Integration: Methodology of combining, modifying, refining, and merging usually several models (knowledge bases) into one robust, more predictable, and usually redundant model, or that of combining decisions of single (base) models.

Learner: Given a training set of (representative) examples (accompanied usually by their desired classes/concepts), a learner induces a concept description (model, knowledge base) for a given task that then is usually utilized in the corresponding decision-supporting system.

Meta-Combiner (Meta-Learner): Multi-level structure improving the learning process by dynamic accumulation of knowledge about learning. Its common topology involves base learners and classifiers at the first level and the meta-learner and meta-classifier at the second level; the meta-classifier combines the decisions of all the base classifiers.

Model (Knowledge Base): Formally described concept of a certain problem; usually represented by a set of production rules, semantic nets, frames, and the like.
Model Quality: Similar to rule quality, but it characterizes the decision power, predictability, and reliability of the entire model (knowledge base) as a unit.
Rule Quality: A numerical factor that characterizes a measure of belief in the given decision rule, its power, predictability, reliability, and likelihood.
Sampling Methods in Approximate Query Answering Systems
Gautam Das
The University of Texas at Arlington, USA
INTRODUCTION In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories typically are created in the hope that, through analysis such as data mining and decision support, they will yield new insights into the data and the real-world processes that created them. In practice, however, while the collection and storage of massive datasets has become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason that data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity. While keeping query response times short is very important in many data mining and decision support applications, exactness in query results is frequently less important. In many cases, ballpark estimates are adequate to provide the desired insights about the data, at least in preliminary phases of analysis. For example, knowing the marginal data distributions for each attribute up to 10% error often will be enough to identify top-selling products in a sales database or to determine the best attribute to use at the root of a decision tree. For example, consider the following SQL query:
SELECT State, COUNT(*) AS ItemCount
FROM SalesData
WHERE ProductName = 'Lawn Mower'
GROUP BY State
ORDER BY ItemCount DESC
pletely accurate answers, in some circumstances, it may be suitable to produce ballpark estimates (e.g., counts to the nearest thousands). The acceptability of inexact query answers, coupled with the necessity for fast query response times, has led researchers to investigate approximate query answering (AQA) techniques that sacrifice accuracy to improve running time, typically through some sort of lossy data compression. The general rubric in which most approximate query processing systems operate is as follows: first, during the preprocessing phase, some auxiliary data structures, or data synopses, are built over the database; then, during the runtime phase, queries are issued to the system and approximate query answers quickly are returned, using the data synopses built during the preprocessing phase. The quality of an approximate query processing system often is determined by how accurately the synopsis represents the original data distribution, how practical it is to modify existing database systems to incorporate approximate query answering, and whether error estimates can be returned in addition to ballpark estimates.
BACKGROUND Figure 1 describes a general architecture for most AQA systems. There are two components in the architecture: (1) a component for building the synopses from database relations, and (2) a component that rewrites an incoming query in order to use the synopses to answer the query approximately and report the answer with an estimate of the error in the answer. The different approximate query answering systems that have been proposed differ in various ways: in the types of synopses proposed; whether the synopses building component is executed in a preprocessing phase or whether it executes at runtime; the ability of the AQA system also to provide error guarantees in addition to the approximate answers; and, finally (from a practical point of view and perhaps the most important), the amount of changes necessary to query processing engines of commercial database management systems to incorporate approximate query answering.
Figure 1. Architecture for approximate query answering (database tables feed a Build Synopses component that produces synopses; at runtime, an incoming query is rewritten and executed against the synopses, returning an answer set with an error estimate)

The types of synopses developed for AQA systems can be divided into two broad groups: sampling-based approaches and non-sampling-based approaches. In sampling-based approaches, a small random sample of the rows of the original database table is prepared, and queries are directed against this small sample table. The non-sampling-based approaches encompass a wide variety of techniques; for example, sophisticated data structures such as wavelets (Chakrabarti et al., 2001; Matias, Vitter & Wang, 1998) and histograms (Ioannidis & Poosala, 1999) have been proposed as useful tools for AQA. Work in non-sampling-based AQA techniques is of great theoretical interest, but its practical impact often is limited by the extensive modifications to query processors and query optimizers that often are needed to make use of these technologies. On the other hand, sampling-based systems have the advantage that they can be implemented as a thin layer of middleware that rewrites queries to run against sample tables stored as ordinary relations in a standard, off-the-shelf database server. Partly for these reasons, sampling-based systems have in recent years been the most heavily studied type of AQA system. In the rest of this article, our focus is on presenting an overview of the latest developments in sampling-based AQA techniques.

MAIN THRUST

In the following section, we summarize the various sampling-based AQA technologies that have been proposed in recent years by the research community. The focus of this article is on approximately answering standard SQL queries on relational databases; other exciting work done on approximate query processing in other scenarios, such as streaming and time series data, is beyond the scope of this article. We assume a standard data warehouse schema, consisting of a few fact tables containing the measure columns, connected to several dimension tables via foreign key relationships. Furthermore, we assume that our queries are aggregation queries with SUM, COUNT, and GROUP BY operators, either over a single fact table or over a fact table joined to several dimension tables.

Uniform Random Sampling
The essential idea is that a small precomputed uniform random sample of rows S of the database R often represents the entire database well. For a fast approximate answer at runtime, one simply has to execute the query on S and scale the result. Thus, if S is a 1% sample of the database, the scaling factor is 100. The main advantages of uniform random sampling are simplicity and efficiency of preprocessing. However, there are several critical disadvantages that have not allowed this approach to be considered seriously for AQA systems. One disadvantage is the well-known statistical problem of large data variance. For example, suppose we wish to estimate the average salaries of a particular corporation. Uniform random sampling does badly if the salary distribution is highly skewed. The other disadvantage is specific to database systems, and is the low selectivity problem. For example, suppose a query wishes to find the average salary of a small department of a large corporation. If we only had a uniform random sample of the entire database, then it is quite likely that this small department may not be adequately represented, leading to large errors in the estimated average salary. To mitigate these problems, much research has been attempted using so-called biased sampling techniques, where a non-uniform random sample is precomputed, such that parts of the database deemed more important than the rest are better represented in the sample. We discuss such techniques later in the article.
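The scaling step can be illustrated with a small sketch. It is only an illustration: the 1% sampling rate is the example rate mentioned above, and the sample table name SalesData_sample is a hypothetical name, not part of the cited systems.

SAMPLING_RATE = 0.01   # assumed 1% uniform sample

def approximate_count(matching_rows_in_sample):
    # Scale the count observed on the sample up to the full table.
    return matching_rows_in_sample / SAMPLING_RATE

def rewrite_for_sample(sql_text):
    # A thin middleware layer can redirect the query to the sample relation;
    # the scaling above is then applied to the returned aggregate.
    return sql_text.replace("FROM SalesData", "FROM SalesData_sample")

# Example: 37 matching rows in the 1% sample suggest roughly 3,700 rows overall.
estimate = approximate_count(37)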
Online Aggregation

Hellerstein, Haas, and Wang (1997) describe techniques for online aggregation in which approximate answers for queries are produced during early stages of query processing and gradually refined until all the data have been processed. This framework is extended in Raman and Hellerstein (2002) to have the query processor give precedence to tuples that contribute to higher-priority parts of the query result, where priority is defined using a user-specified function. The online aggregation approach has some compelling advantages (e.g., it does not require preprocessing, it allows progressive refinement of approximate answers at runtime until the user is satisfied or the exact answer is supplied, and it can provide confidence intervals that indicate the uncertainty present in the answer).
However, there are two important systems considerations that represent practical obstacles to the integration of online aggregation into conventional database systems. First, stored relations are frequently clustered by some attribute, so accessing tuples in a random order, as required for online aggregation, requires (slow) random disk accesses. Second, online aggregation necessitates significant changes to the query processor of the database system. This is impractical, as it is desirable for an AQA system to leverage today’s commercial query processing systems with minimal changes to the greatest degree possible. Next, we consider several biased-sampling AQA methods that are based on precomputing the samples. Toward the end, we also discuss a method that attempts to strike a balance between online and precomputed sampling.
Icicles

Recognizing the low selectivity problem, Ganti, Lee, and Ramakrishnan (2000) attempted to design a biased sample based on known workload information. In this paper, the assumption was that a workload of queries (i.e., a log of all recent queries executing against the database) is a good predictor of the queries that are yet to execute on the database in the future. Thus, for example, if a query requests the average salary of a small department in a large corporation, it is assumed that such (or similar) queries will repeat in the future. A heuristic precomputation procedure called Icicles was developed, in which tuples that have been accessed by many queries in the workload were assigned greater probabilities of being selected into the sample. While this was an interesting idea based on biased sampling that leverages workload information, a disadvantage was that it focuses only on the low selectivity problem, and, furthermore, the suggested solution is rather heuristic.
Outlier Indexing

The first paper that attempted to address the problem of large data variance was by Chaudhuri, Das, Datar, Motwani, and Narasayya (2001). It proposes a technique called Outlier Indexing for improving sampling-based approximations for aggregate queries when the attribute being aggregated has a skewed distribution. The basic idea is that outliers of the data (i.e., the records that contribute to high variance in the aggregate column) are collected into a separate index, while the remaining data is sampled using a biased sampling technique. Queries are answered by running them against both the outlier index as well as the biased sample, and an estimated answer is composed out of both results. A disadvantage of this approach was that the primary emphasis was on the data variance problem, and while the authors did propose a hybrid solution for both the data variance as well as the low selectivity problem, the proposed solution was heuristic and, therefore, suboptimal.
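The composition of the two partial results can be sketched as follows. This is an illustration of the general idea only: the magnitude threshold used to flag outliers and the uniform sampling of the non-outliers are simplifying assumptions, not the exact algorithm of the cited paper.

import random

def build_outlier_index(values, threshold):
    # Values whose magnitude exceeds the threshold are treated as outliers.
    outliers = [v for v in values if abs(v) > threshold]
    remaining = [v for v in values if abs(v) <= threshold]
    return outliers, remaining

def approximate_sum(values, threshold, sample_fraction=0.01, seed=0):
    outliers, remaining = build_outlier_index(values, threshold)
    if not remaining:
        return sum(outliers)
    random.seed(seed)
    k = max(1, int(len(remaining) * sample_fraction))
    sample = random.sample(remaining, k)
    scaled = sum(sample) * (len(remaining) / len(sample))  # estimate for non-outliers
    return sum(outliers) + scaled                          # exact part + estimated part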
Congressional Sampling

The AQUA project at Bell Labs (Acharya, Gibbons & Poosala, 1999) developed a sampling-based system for approximate query answering. Techniques used in AQUA included congressional sampling (Acharya, Gibbons & Poosala, 2000), which is targeted toward answering a class of common and useful analysis queries (group-by queries with aggregation). Their approach stratifies the database by considering the set of queries involving all possible combinations of grouping columns and produces a weighted sample that balances the approximation errors of these queries. However, their approach is still ad hoc in the sense that even though they try to reduce the error, their scheme does not minimize the error for any of the well-known error metrics.
Join Synopses

The AQUA project at Bell Labs also developed the join synopses technique (Acharya et al., 1999), which allows approximate answers to be provided for certain types of join queries; in particular, foreign-key joins. The technique involved precomputing the join of samples of fact tables with dimension tables, so that at runtime, queries only need to be executed against single (widened) sample tables. This is an alternative to the approach of only precomputing samples of fact tables and having to join these sample tables with dimension tables at runtime. We mention that the problem of sampling over joins that are not foreign-key joins is a difficult problem and, under certain conditions, is essentially not possible (Chaudhuri, Motwani & Narasayya, 1999). Thus, approximate query answering does not extend to queries that involve non-foreign key joins.
Stratified Sampling (STRAT)

The paper by Chaudhuri, Das, and Narasayya (2001) sought to overcome many of the limitations of the previous works on precomputed sampling for approximate query answering and proposed a method called STRAT for approximate query answering. Unlike most previous sampling-based studies that used ad-hoc randomization methods, the authors here formulated the problem of precomputing a sample as an optimization problem, whose goal is to minimize the error
for the given workload. They also introduced a generalized model of the workload (lifted workload) that makes it possible to tune the selection of the sample, so that approximate query processing using the sample is effective, not only for workloads that are exactly identical to the given workload, but also for workloads that are similar to the given workload (i.e., queries that select regions of the data that overlap significantly with the data accessed by the queries in the given workload)—a more realistic scenario. The degree of similarity can be specified as part of the user/database administrator preference. They formulate selection of the sample for such a lifted workload as a stratified sampling task with the goal to minimize error in estimation of aggregates. The benefits of this systematic approach are demonstrated by theoretical results (where it is shown to subsume much of the previous work on precomputed sampling methods for AQA) and experimental results on synthetic data as well as real-enterprise data warehouses.
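The per-stratum scaling that stratified sampling relies on can be illustrated with a generic sketch. This shows only the idea that STRAT builds upon (different strata sampled at different rates, each scaled by its own factor), not the STRAT optimization itself; the strata, rates, and example values are arbitrary assumptions.

import random

def stratified_sum_estimate(strata, rates, seed=0):
    # strata: list of lists of attribute values, one list per stratum.
    # rates: per-stratum sampling rates (e.g., higher for skewed strata).
    random.seed(seed)
    total = 0.0
    for values, rate in zip(strata, rates):
        if not values:
            continue
        k = max(1, int(len(values) * rate))
        sample = random.sample(values, min(k, len(values)))
        total += sum(sample) * (len(values) / len(sample))  # per-stratum scaling
    return total

# Example: a small stratum of large (outlier-like) values sampled at 50%,
# and a large stratum of ordinary values sampled at 1%.
estimate = stratified_sum_estimate([[9500, 12000, 15000], list(range(10000))], [0.5, 0.01])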
Dynamic Sample Selection A sampling technique that attempts to strike a middle ground between precomputed and online sampling is dynamic sample selection (Babcock, Chaudhuri & Das, 2003). The requirement for fast answers during the runtime phase means that scanning a large amount of data to answer a query is not possible, or else the running time would be unacceptably large. Thus, most sampling-based approximate query answering schemes have restricted themselves to building only a small sample of the data. However, because relatively large running times and space usage during the preprocessing phase are generally acceptable, as long as the time and space consumed are not exorbitant, nothing prevents us from scanning or storing significantly larger amounts of data during preprocessing than we are able to access at runtime. Of course, because we only are able to access a small amount of stored data at runtime, there is no gain to be had from building large auxiliary data structures, unless they are accompanied by some indexing technique that allows us to decide, for a given query, which (small) portion of the data structures should be accessed to produce the most accurate approximate query answer. In Babcock, Chaudhuri, and Das (2003), the authors describe a general system architecture for approximate query processing that is based on the dynamic sample selection technique. The basic idea is to construct during the preprocessing phase a random sample containing a large number of differently biased subsamples, and then, for each query that arrives during the runtime phase, to select an appropriate small subset from the sample that can be used to give a highly accurate approximate answer
to the query. The philosophy behind dynamic sample selection is to accept greater disk usage for summary structures than other sampling-based AQA methods in order to increase accuracy in query responses while holding query response time constant (or, alternatively, to reduce query response time while holding accuracy constant). The belief is that for many AQA applications, response time and accuracy are more important considerations than disk usage.
FUTURE TRENDS

In one sense, AQA systems are not new. These methods have been used internally for a long time by query optimizers of database systems for selectivity estimation. However, approximate query answering has not been externalized yet to the end user by major vendors, though sampling operators are appearing in commercial database management systems. Research prototypes exist in the industry (e.g., AQP from Microsoft Research and the AQUA system from Bell Labs). From a research potential viewpoint, approximate query answering promises to be a very fertile area with several deep and unresolved problems. Currently, there is a big gap between the development of algorithms and their adaptability in real systems. This gap needs to be addressed before AQA techniques can be embraced by the industry. Second, the research has to broaden beyond the narrow confines of aggregation queries over single table databases or multi-tables involving only foreign-key joins. It is important to investigate how to return approximations to set-valued results, AQA over multi-table databases with more general types of SQL queries, AQA over data streams, and investigations into the practicality of other non-sampling-based approaches to approximate query answering. As data repositories get larger and larger, effective data analysis will prove increasingly more difficult to accomplish.
CONCLUSION

In this article, we discussed the problem of approximate query answering in database systems, especially in decision support applications. We described various approaches taken to design approximate query answering systems, especially focusing on sampling-based approaches. We believe that approximate query answering is an extremely important problem for the future, and much work needs to be done before practical systems can be built that leverage the substantial theoretical developments already accomplished in the field.
REFERENCES Acharya, S. et al. (1999). Join synopses for approximate query answering. Proceedings of the Special Interest Group on Management of Data. Acharya, S., Gibbons, P.B., & Poosala, V. (1999). Aqua: A fast decision support system using approximate query answers. Proceedings of the International Conference on Very Large Databases. Acharya, S., Gibbons, P.B., & Poosala, V. (2000). Congressional samples for approximate answering of group-by queries. Proceedings of the Special Interest Group on Management of Data. Babcock, B., Chaudhuri, S., & Das, G. (2003). Dynamic sample selection for approximate query processing. Proceedings of the Special Interest Group on Management of Data. Chakrabarti, K., Garofalakis, M.N., Rastogi, R., & Shim, K. (2001). Approximate query processing using wavelets. Proceedings of the International Conference on Very Large Databases. Chaudhuri,S., Das, G., Datar, M., Motwani, R., & Narasayya, V. (2001). Overcoming limitations of sampling for aggregation queries. Proceedings of the International Conference on Data Engineering. Chaudhuri, S., Das, G., & Narasayya, V. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. Proceedings of the Special Interest Group on Management of Data. Chaudhuri, S., Motwani, R., & Narasayya, V. (1999). On random sampling over joins. Proceedings of the Special Interest Group on Management of Data. Ganti, V., Lee, M., & Ramakrishnan, R. (2000). ICICLES: Self-tuning samples for approximate query answering. Proceedings of the International Conference on Very Large Databases. Hellerstein, J.M., Haas, P.J., & Wang, H. (1997). Online aggregation. Proceedings of the Special Interest Group on Management of Data. Ioannidis, Y.E., & Poosala, V. (1999). Histogram-based approximation of set-valued query-answers. Proceedings of the International Conference on Very Large Databases. Matias, Y., Vitter, J.S., & Wang, M. (1998). Wavelet-based histograms for selectivity estimation. Proceedings of the Special Interest Group on Management of Data.
Raman, V., & Hellerstein, J.M. (2002). Partial results for online query processing. Proceedings of the Special Interest Group on Management of Data.
KEY TERMS

Aggregation Queries: Common queries executed by decision support systems that aggregate and group large amounts of data, where aggregation operators are typically SUM, COUNT, AVG, and so forth.

Biased Sampling: A random sample of k tuples of a database, where the probability of a tuple belonging to the sample varies across tuples.

Decision Support Systems: Typically, business applications that analyze large amounts of data in warehouses, often for the purpose of strategic decision making.

Histograms: Typically used for representing one-dimensional data, though multi-dimensional histograms are being researched in the database field. A histogram is a division of the domain of a one-dimensional ordered attribute into buckets, where each bucket is represented by a contiguous interval along the domain, along with the count of the number of tuples contained within this interval and other statistics.

Standard Error: The standard deviation of the sampling distribution of a statistic. In the case of approximate query answering, it measures the expected value of the error in the approximation of aggregation queries.

Stratified Sampling: A specific procedure for biased sampling, where the database is partitioned into different strata, and each stratum is uniformly sampled at different sampling rates. Tuples that are more important for aggregation purposes, such as outliers, are put into strata that are then sampled at a higher rate.

Uniform Sampling: A random sample of k tuples of a database, where each subset of k tuples is equally likely to be the sample.

Workload: The log of all queries that execute on a database system. Workloads often are used by database administrators as well as by automated systems (such as AQA systems) to tune various parameters of database systems for optimal performance, such as indexes and physical design, and, in the case of AQA, the set of sample tables.
Scientific Web Intelligence
Mike Thelwall
University of Wolverhampton, UK
INTRODUCTION

Scientific Web Intelligence (SWI) is a research field that combines techniques from data mining, Web intelligence, and scientometrics to extract useful information from the links and text of academic-related Web pages using various clustering, visualization, and counting techniques. Its origins lie in previous scientometric research into mining off-line academic data sources such as journal citation databases. Typical scientometric objectives are either evaluative (assessing the impact of research) or relational (identifying patterns of communication within and among research fields). From scientometrics, SWI also inherits a need to validate its methods and results so that the methods can be justified to end users, and the causes of the results can be found and explained.
BACKGROUND The term scientific in SWI has a dual meaning. The first meaning refers to the scope of the data—it must be academic-related. For example, the data may be extracted from university Web sites, electronic journal sites, or just pages that mention or link to academic pages. The second meaning of scientific alludes to the need for SWI research to use scientifically defensible techniques to obtain its results. This is particularly important when results are used for any kind of evaluation. SWI is young enough that its basic techniques are not yet established (Thelwall, 2004a). The current emphasis is on methods rather than outputs and objectives. Methods are discussed in the next section. The ultimate objectives of typical developed SWI studies of the future can be predicted, however, from research fields that have used offline academic document databases for data mining purposes. These fields include bibliometrics, the study of academic documents, and scientometrics, the measurement of aspects of science, including through its documents (Borgman & Furner, 2002). Evaluative scientometrics develops and applies quantitative techniques to assess aspects of the value of academic research or researchers. An example is the Journal Impact Factors (JIF) of the Institute for Scientific Information (ISI) that are reported in the ISI’s journal citation reports. JIFs are calculated for journals by count-
ing citations to articles in the journal over a fixed period of time and dividing by the number of articles published in that time. Assuming that a citation to an article is an indicator of impact (because other published research has used the article in order to cite it), the JIF assesses the average impact of articles in the journal. By extension, good journals should have a higher impact (Garfield, 1979), so JIFs could be used to rank or compare journals. In fact, this argument is highly simplistic. Scientometricians, while accepting the principle of citations as a useful impact proxy, will argue for more careful counting methods (e.g., not comparing citation counts between disciplines) and a much lower level of confidence in the results (e.g., taking them as indicative rather than definitive) (van Raan, 2000). Evaluative techniques also are commonly used for academic departments. For example, a government may use citation-based statistics in combination with peer review to conduct a comparative evaluation of all of the nation’s departments within a given discipline (van Raan, 2000). SWI also may be used in an evaluative role, but since its data source is only Web pages, which are not the primary outputs of most scientific research, it is unlikely to ever be used to evaluate academics’ Web publishing impact. Given the importance of the Web in disseminating research (Lawrence, 2001), it is reasonable, however, to measure Web publishing. Relational scientometrics seeks to identify patterns in research communication. Depending on the scale of the study, this could mean patterns of interconnections of researchers within a single field, of fields or journals within a discipline, or of disciplines within the whole of science. Typical outputs are graphs of the relationships, although dimension-reducing statistics, such as factor analysis, also are used. For example, an investigation into how authors within a field cite each other may yield an author-based picture of the field that usefully identifies sub-specialisms, their main actors, and interrelationships (Lin, White & Buzydlowski, 2003). Knowledge domain visualization (Börner, Chen & Boyack, 2003) is a closely related research area but one that focuses on the design of visualizations to display relationships in knowledge domains. Relationship identification is likely to be a common outcome for future SWI applications. An advantage of the Web over academic journal databases is that it can contain more up-to-date information, which could help produce more current domain visualizations. The disad-
vantage, however, is that the Web contains a wide variety of information that is loosely related to scholarly activity, if at all, even in university Web sites. The challenge of SWI and the rationale for the adoption of Web intelligence and data mining is to extract useful patterns from this mass of mainly useless data. Successful SWI will be able to provide an early warning of new research trends within and among disciplines.
MAIN THRUST

SWI uses methods based upon Web links (Web structure mining) and text (Web content mining). A range of relevant mining and structure mining techniques is described in the following section.
Academic Web Structure Mining Modeling Early academic Web structure mining sought to assess whether counts of links to university or department Web sites could be used to measure their online impact. This originated in the work of Ingwersen (1998). In brief, the results of this line of research indicated that links between university Web sites, unlike citations, almost never represented knowledge transfer within the context of research. For example, few of these links point to online journal or conference articles. Nevertheless, it seems that about 90% of links are related in some way to academic activities (Wilkinson et al., 2003), and counts of links to universities correlate significantly with measures of research productivity for universities (Thelwall & Harries, 2004) and departments in some disciplines (Li et al., 2003; Tang & Thelwall, 2003). These results are consistent with Web publishing being a natural by-product of research activity (people who do more research tend to create more Web pages), but the chances of any given Web page being linked to does not depend upon the research capabilities of its author, on average. In other words, more productive researchers tend to attract more links, but they also tend to produce more content, and so the two factors cancel out. A little more basic information is known about academic Web linking. Links are related to geography (closer universities tend to interlink more) (Thelwall, 2002). Links are related to language (universities in countries sharing a common language tend to interlink more, at least in Europe, and English accounts for at least half of international linking pages in European universities in all countries except Greece) (Thelwall, Tang & Price, 2003).
Data Cleansing An important but unexpected outcome of the research previously described was the need for extensive data cleansing in order to get better results from link-counting exercises. This is because, on a theoretical level, link counting works best when each link is created independently by human experts exercising care and judgement. In practice, however, many links are created casually or by automated processes. For example, links within a Web site are often for navigational purposes and do not represent a judgment of target-page quality. Automatically-generated links vary from the credit links inserted by Web authoring software to links in navigation bars on Web sites. The following types of link normally are excluded from academic link studies.
• All links between pages in the same site.
• All links originating in pages not created by the hosting organization (e.g., mirror sites).
Note that the second type requires human judgments about ownership and that these two options do not address the problem of automatically-generated links. Some research has excluded a portion of such links (Thelwall & Aguillo, 2003), but an alternative, more automated, approach devised to solve this problem is changing the method of counting. Several new methods of counting links have been devised. These are deployed under the umbrella term of Alternative Document Models (ADMs) and are, in effect, data cleansing techniques (Thelwall & Wilkinson, 2003). The ADMs were inspired by the realization that automated links tended to originate in pages within the same directory. For example, a mini Web site of 40 pages may have a Web authoring software credit link on each page but with all site pages residing in the same directory. The effect of these links can be reduced if links are counted between directories instead of between pages. In the example given, the 40 links from 40 pages would be counted as one link from a directory, discarding the other 39 links, which are now duplicates. The ADMs deployed so far include the page ADM (standard link counting), the directory ADM, the domain ADM, and the whole site ADM. The choice of ADM depends partly on the research question and partly on the data. A purely data-driven selection method has been developed (Thelwall, 2005a), designed to be part of a much more automated approach to data cleansing; namely, Multiple Site Link Structure Analysis (MSLSA).
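To illustrate the Alternative Document Model idea just described, the hedged sketch below collapses page-level links to a chosen ADM unit (page, directory, domain or site) and discards same-unit and duplicate unit-to-unit links, so the 40 identical credit links in the example above count only once at the directory level. The URL-trimming rules are deliberately simplified assumptions, not the published MSLSA implementation.

```python
from urllib.parse import urlparse

def adm_unit(url: str, level: str) -> str:
    """Map a page URL to its Alternative Document Model unit (simplified rules)."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    if level == "page":
        return host + parts.path
    if level == "directory":
        return host + parts.path.rsplit("/", 1)[0] + "/"
    if level == "domain":
        return host
    if level == "site":
        # Crude whole-site heuristic: keep the last three host labels (e.g. wlv.ac.uk).
        return ".".join(host.split(".")[-3:])
    raise ValueError(f"unknown ADM level: {level}")

def count_links(page_links, level):
    """Count links between ADM units, dropping same-unit links and duplicates."""
    seen = set()
    for src, tgt in page_links:
        s, t = adm_unit(src, level), adm_unit(tgt, level)
        if s != t:
            seen.add((s, t))
    return len(seen)

# 40 pages in one directory all carrying the same credit link: 40 page-level
# links collapse to a single directory-level link.
links = [(f"http://www.example-uni.ac.uk/minisite/page{i}.html",
          "http://www.authoringtool.com/credit.html") for i in range(40)]
print(count_links(links, "page"))       # 40
print(count_links(links, "directory"))  # 1
```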
Subject Similarity and Clustering A key SWI goal is to be able to automatically cluster academic Web pages by academic subject. The ability to cluster Web pages by the more general concept of topic has been investigated in the past, employing both textbased and link-based approaches. For example, the research of Chakrabarti, et al. (2002) and Menczer (2005) shows that pages about the same topic tend to interlink more than with pages on different topics. It is logical to conclude that links will be helpful for subject clustering in academic Webs. A pair of Web pages can be directly linked or may be indirectly connected by links, if another page is joined to both by links. Direct links are not more reliable as indicators of subject than indirect connections, but indirect connections are far more numerous (Thelwall & Wilkinson, 2004). Hence, academic subject clustering should use both types. There are many link-based clustering algorithms, but one that is fast and scalable is the Community Identification Algorithm (Flake et al., 2002). This accepts any number of interlinked pages as input and returns their community, based solely upon link structures. Loosely speaking, this community is a collection of pages that tend to link to each other more than they link to pages outside of the community. Research with this algorithm on academic Webs has shown that it is capable of identifying communities for the page, directory, and domain ADM, but heavily linked pages negatively affect its results (Thelwall, 2003). Data cleansing to remove these pages is recommended.
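One simple way to see how direct links and indirect connections can be combined for subject clustering is to score each pair of documents on both kinds of evidence and then hand the resulting similarities to any standard clustering routine. The sketch below is illustrative only; the toy graph and weights are assumptions, and it is not the Community Identification Algorithm itself.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical directed link graph over documents A..F (already cleansed / ADM-level).
links = {
    "A": {"B", "C"}, "B": {"C"}, "C": set(),
    "D": {"E"}, "E": {"F"}, "F": {"D"},
}

def similarity(links, direct_weight=1.0, indirect_weight=0.5):
    """Score each pair by direct links plus co-inlinks (both linked from a third page)."""
    sim = defaultdict(float)
    docs = sorted(links)
    for a, b in combinations(docs, 2):
        if b in links[a] or a in links[b]:
            sim[(a, b)] += direct_weight
        # Indirect connection: some other page links to both a and b.
        co = sum(1 for c in docs if c not in (a, b) and a in links[c] and b in links[c])
        sim[(a, b)] += indirect_weight * co
    return dict(sim)

for pair, score in sorted(similarity(links).items(), key=lambda x: -x[1]):
    if score:
        print(pair, score)
```

The pair scores could then be assembled into a similarity matrix for hierarchical clustering, or used to seed a community detection step.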
Academic Web Content Mining Academic Web content mining is less developed than academic Web structure mining, but it is beginning to evolve. As with structure mining, a key goal is to be able to cluster academic Web spaces by subject. There is some overlap between the two, for example, in the need for ADMs and similar data cleansing. Exploratory analysis of the text in university Web sites has revealed the existence of many non-subject-specific, high-frequency words, such as computer and Internet-related terms. Low-frequency words were found to be predominantly not errors. The lesson for text mining is that low-frequency words could not be ignored but that a strategy must be developed to filter out unwanted high-frequency words (Thelwall, 2005b). Such a strategy, Vocabulary Spectral Analysis (VSA), has been developed (Thelwall, 2004b). VSA is a technique based upon the standard vector space model and k-means clustering that identifies words that are highly influential in clustering document sets and also words that are helpful for clustering document sets in ways that they do not follow naturally. This latter capability was developed in response to the realization that academic Web sites did not naturally cluster by subject, but in other ways, including university affiliation. Further research with low-frequency words (Price & Thelwall, 2005) confirmed them to be helpful for subject clustering (i.e., removing them from the documents reduced their subject clustering tendency).
Knowledge Domain Visualization The field of information visualization has been able to develop rapidly in recent years with the improved speed and graphical power of PCs. Its newer subfield, Knowledge Domain Visualization (KDViz), uses scientometric data and develops special-purpose visualizations. These visualizations are for use by researchers to orient themselves within their own discipline or to see how other fields or disciplines fit together or relate to each other. Although the typical data sources have been journal citation databases or journal text collections, these have similarities to Web links and Web content that make KDViz tools a logical starting point for SWI visualizations. A discussion of some KDViz research serves to highlight the visualization capabilities already present.
• PNASLINK is a system that creates visualizations from articles published in the Proceedings of the National Academy of Sciences (White et al., 2004). It uses pathfinder networks (a technique for selecting the most important connections to draw for large network visualizations) and self-organizing maps (a clustering technique that can plot documents on a two-dimensional map) to display information to users in order to help them select terms with which to search the digital library. Both text and citations are used by the algorithms.
• Cross maps is a technique for visualizing overlapping relationships in journal article collections (Morris & Yen, 2004). It produces two-dimensional graphs cross mapping authors and research fronts, more of a mainstream scientometrics application than PNASLINK.
• CITESPACE implements features that are designed to help users identify key moments in the evolution of research fields (Chen, 2004). It works by tracking the evolution of collections of papers in a field through citation relationships. Particularly important nodes in the generated network can be identified through the visualizations, and also key moments in time (turning points) for the evolution of the network.
To apply all of the above visualization techniques to SWI data is a future task. The main current challenge is to process Web data in ways that make it possible to get useful results from visualizations.
FUTURE TRENDS The immediate goal of SWI research is effective subject clustering of collections of academic Web sites. This is likely to involve a fusion of link-based and text-based clustering approaches. Success will be dependent upon developing more effective data cleansing techniques. Perhaps initially, these techniques will be only semiautomated and quite labor-intensive, but a longer-term goal will be to make them increasingly more automated. This prediction for a focus on data cleansing does not rule out the possibility that advanced Web intelligence techniques could be developed that bypass the need for data cleansing. The medium-term SWI goal is to harness academic Web data to visualizations in order to give Web information to users in a practical and effective way. The long-term SWI goals are to develop applications that extend those of scientometrics and KDViz in order to branch out into different Web data sets, to incorporate more Web intelligence techniques (Zhong, Liu & Yao, 2003), and to extract new types of useful information from the data.
CONCLUSION SWI has taken the first steps toward maturity as an independent field through the harnessing of techniques from scientometrics, Web structure mining, and Web content mining. To these have been added additional techniques and knowledge specific to academic Web spaces. Many of the new discoveries relate to data cleansing, recognition that Web data is far noisier than any data set previously used for similar purposes. The future is promising, however, particularly in the longer term, if the techniques developed can be applied to new areas of Web information—perhaps even to some that do not yet exist.
REFERENCES Borgman, C., & Furner, J. (2002). Scholarly communication and bibliometrics. In B. Cronin (Ed.), Annual review of information science and technology (pp. 3-72). Medford, NJ: Information Today Inc.
Börner, K., Chen, C., & Boyack, K. (2003). Visualizing knowledge domains. Annual Review of Information Science & Technology, 37, 179-255. Chakrabarti, S., Joshi, M.M., Punera, K., & Pennock, D.M. (2002). The structure of broad topics on the Web. Proceedings of the WWW2002 Conference, Honolulu, Hawaii. Chen, C. (2004). Searching for intellectual turning points: Progressive knowledge domain visualization. National Academy of Sciences, 101, 5303-5310. Chen, C., Newman, J., Newman, R., & Rada, R. (1998). How did university departments interweave the Web: A study of connectivity and underlying factors. Interacting with Computers, 10(4), 353-373. Flake, G.W., Lawrence, S., Giles, C.L., & Coetzee, F.M. (2002). Self-organization and identification of Web communities. IEEE Computer, 35, 66-71. Garfield, E. (1979). Citation indexing: Its theory and applications in science, technology and the humanities. New York: Wiley Interscience. Ingwersen, P. (1998). The calculation of Web impact factors. Journal of Documentation, 54(2), 236-243. Lawrence, S. (2001). Free online availability substantially increases a paper’s impact. Nature, 411(6837), 521. Li, X., Thelwall, M., Musgrove, P., & Wilkinson, D. (2003). The relationship between the links/Web impact factors of computer science departments in UK and their RAE (Research Assessment Exercise) ranking in 2001. Scientometrics, 57(2), 239-255. Lin, X., White, H.D., & Buzydlowski, J. (2003). Real-time author co-citation mapping for online searching. Information Processing & Management, 39(5), 689-706. Menczer, F. (2005). Lexical and semantic clustering by Web links. Journal of the American Society for Information Science and Technology (to be published). Morris, S., & Yen, G. (2004). Crossmaps: Visualization of overlapping relationships in collections of journal papers. National Academy of Sciences, 101, 5291-5296. Price, E.L., & Thelwall, M. (2005). The clustering power of low frequency words in academic Webs. Journal of the American Society for Information Science and Technology (to be published). Tang, R., & Thelwall, M. (2003). Disciplinary differences in US academic departmental Web site interlinking. Library & Information Science Research, 25(4), 437-458.
Thelwall, M. (2002). Evidence for the existence of geographic trends in university Web site interlinking. Journal of Documentation, 58(5), 563-574.
White, H., Lin, X., Buzydlowski, J., & Chen, C. (2004). User-controlled mapping of significant literatures. National Academy of Sciences, 101, 5297-5302.
Thelwall, M. (2003). A layered approach for investigating the topological structure of communities in the Web. Journal of Documentation, 59(4), 410-429.
Wilkinson, D., Harries, G., Thelwall, M., & Price, E. (2003). Motivations for academic Web site interlinking: Evidence for the Web as a novel source of information on informal scholarly communication. Journal of Information Science, 29(1), 59-66.
Thelwall, M. (2004a). Scientific Web intelligence: Finding relationships in university Webs. Communications of the ACM. Thelwall, M. (2004b). Vocabulary spectral analysis as an exploratory tool for scientific Web intelligence. In Information Visualization (IV04), Los Alamitos, CA: IEEE, (pp. 501-506). Thelwall, M. (2005a). Data cleansing and validation for multiple site link structure analysis. In A. Scime (Ed.), Web mining: Applications and techniques (pp. 208-227). Hershey, PA: Idea Group Inc. Thelwall, M. (2005b). Text characteristics of English language university Web sites. Journal of the American Society for Information Science and Technology (to be published).
Zhong, N., Liu, J., & Yao, Y. (2003). Web Intelligence. Berlin: Springer-Verlag.
Thelwall, M., & Aguillo, I. (2003). La salud de las Web universitarias españolas. Revista Española de Documentación Científica, 26(3), 291-305.
Thelwall, M., & Harries, G. (2004). Do better scholars' Web publications have significantly higher online impact? Journal of the American Society for Information Science and Technology, 55(2), 149-159.
Thelwall, M., Tang, R., & Price, E. (2003). Linguistic patterns of academic Web use in Western Europe. Scientometrics, 56(3), 417-432.
Thelwall, M., & Wilkinson, D. (2003). Three target document range metrics for university Web sites. Journal of the American Society for Information Science and Technology, 54(6), 489-496.
Thelwall, M., & Wilkinson, D. (2004). Finding similar academic Web sites with links, bibliometric couplings and colinks. Information Processing & Management, 40(1), 515-526.
van Raan, A.F.J. (2000). The Pandora's box of citation analysis: Measuring scientific excellence—The last evil? In B. Cronin, & H.B. Atkins (Eds.), The web of knowledge: A festschrift in honor of Eugene Garfield (pp. 301-319). Medford, NJ: Information Today Inc.
KEY TERMS
Alternative Document Model: A conceptual rule for grouping together Web pages into larger units, such as sites and domains, for more effective data mining, particularly useful in Web-structure mining.
Knowledge Domain Visualization: A subfield of information visualization that is concerned with creating effective visualizations for specific knowledge domains.
Multiple Site Link Structure Analysis: A technique for identifying the alternative document model that best fits a collection of Web pages.
Scientific Web Intelligence: A research field that combines techniques from data mining, Web intelligence and Webometrics to extract useful information from the links and text of academic-related Web pages, principally concerning the impact of information and the relationships among different kinds of information.
Scientometrics: The quantitative study of science and scientists, particularly the documentary outputs of science.
Vocabulary Spectral Analysis: A technique using the vector space model and k-means clustering to identify words that are highly influential in clustering document sets.
Web Content Mining: Data mining the Web primarily through the contents of Web pages and ignoring interlinking between pages.
Web Structure Mining: Data mining the Web primarily through its link structure.
Search Situations and Transitions Nils Pharo Oslo University College, Norway Kalervo Järvelin University of Tampere, Finland
INTRODUCTION In order to understand the nature of Web information search processes it is necessary to identify the interplay of factors at the micro-level, that is, to understand how search process related factors such as the actions performed by the searcher on the system are influenced by various factors that might explain it, for example, those related to his work task, search task, knowledge about the work task or searching and etcetera. The Search Situation Transition (SST) method schema provides a framework for such analysis.
BACKGROUND Studies of information seeking and information retrieval (IS&R) have identified many factors that influence the selection and use of sources for information seeking and retrieval. What has been lacking is knowledge on whether, and how, these factors influence the actual search performance. Web information searching often seems to be a rather haphazard behaviour where searchers seem to behave irrationally, that is, they do not follow optimal textbook prescriptions (e.g., Ackermann & Hartman, 2003). In the research literature it is claimed that factors related to the searcher’s personal characteristics, search task, and social/organisational environment influence the searcher during his selection and use of information sources. These factors have been classified and discussed in great detail in the literature, and more recently the searcher’s work task has been focused on as playing a major role (e.g., Byström & Järvelin, 1995; Vakkari, 2001). The SST method schema focuses specifically on the search process and how it is affected by external factors. There are several studies that focus on search processes in other information systems (e.g., Marchionini et al., 1993). These studies have primarily been based on logged and/or video taped data of online bibliographic
searches. However, their scope has been search tasks and searcher characteristics, focusing on term selections and results evaluation. Similar examples can be found in the Web searching context (e.g., Wang & Tenopir, 1998; Fidel et al., 1999; Silverstein et al., 1999; Jansen, Spink, & Saracevic, 2000); these studies analyse characteristics of the Web Information Search (WIS) processes, such as term selection, search task strategies and searcher characteristics, but do not aim at explaining the process itself and the factors that guide it. The current method schema focuses on explaining the process at a micro-level. Early studies of Web searching have to a large degree used log analysis (see review in Jansen & Pooch [2001]) or surveys (e.g., GVU’s WWW user surveys [2001]) as their data collection methods. Log analysis can provide researchers with data on large numbers of user-system interactions focusing on users’ actions. Most often log analysis has been used to see how searchers formulate and reformulate queries (e.g., Spink et al., 2001). The user surveys have focused on demographics of Web users and collected information on the use of different kinds of Web resources, time spent on Web use, e-shopping, etcetera. Although both these kinds of methods may reveal important information about how and why people use the Web, they are unable to point out what causes the searcher to perform the actions he does. We cannot use these methods if we want to learn how work tasks, search tasks, and searcher’s personality directly affect Web information search processes. The SST method schema (Pharo, 2002; Pharo & Järvelin, 2004) was developed for such analyses.
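Log-analysis studies of the kind cited above usually reduce to grouping requests into sessions and comparing consecutive queries; a minimal, hypothetical sketch of that bookkeeping (the log format is an assumption) looks like this:

```python
from collections import defaultdict

# Hypothetical log rows: (session_id, timestamp, query string).
log = [
    ("s1", 1, "web mining"),
    ("s1", 2, "web mining tutorial"),
    ("s1", 3, "web usage mining"),
    ("s2", 1, "citation analysis"),
]

sessions = defaultdict(list)
for sid, ts, query in sorted(log):
    sessions[sid].append(query)

# Count reformulations: consecutive, non-identical queries within a session.
for sid, queries in sessions.items():
    reforms = sum(1 for prev, cur in zip(queries, queries[1:]) if prev != cur)
    print(sid, "queries:", len(queries), "reformulations:", reforms)
```

As the article argues, such counts say nothing about the work task or intentions behind the queries, which is the gap the SST method schema targets.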
MAIN THRUST To present a method (e.g., Bunge, 1967), as well as a method schema (Eloranta, 1979), one needs to define its domain, procedure and justifications (Newell, 1969; Pharo, 2002). Both the domain and procedure will be presented below in order to clarify the usability of the SST method schema.
The Method Schema’s Domain The problem statement, or domain, which is used in the following, states the properties of the problem the method is intended for and their relationships. This designates how general it is possible to make the procedure for handling the problem. Figure 1 is a representation of the framework’s five categories and the relationships existing between them. The search process category consists of two subcategories: search situation and search transition. The search process category will be emphasised here; the other categories and their attributes are well known from the IS&R literature (for details see Pharo, 2002).
Figure 1. The conceptual framework - the domain of the method schema. The figure shows the five categories and their attributes:
• Work task: goal, complexity, resources, size, stage
• Searcher: task knowledge, search knowledge, search system knowledge, education, motivation, tenacity, uncertainty, attention
• Social/organisational environment: actors, domain, goals/strategies
• Search task: goal, complexity, resources, size, stage, strategies
• Search process - search transition: actions, accumulated results, accumulated effort, information space, time, remaining needs, resource type, technical problems
• Search process - search situation: actions, accumulated results, accumulated effort, information space, time, relevance judgements, relevance level, remaining needs, resource type, technical problems
Search situations are the periods during a search process when the searcher examines a resource in order to find information that may be of help in executing his work task. Situations may take place in the same kind of resources as transitions depending on the search task; if the searcher wants to learn more about the structuring of subject indices it would be natural to examine such resource types for that purpose.
Search transitions are executed in order to find resources in which the searcher believes there may be information that can help execute his task. The transitions consist of source selection and inter-source navigation. An alternative way of explaining this is to say that while situations represent interaction with real information the transitions deal with meta-information.
Action is used to describe the moves made by the searcher during a situation/transition. In Web interaction this includes the following of links, entering of queries, and reading of pages. The actions may be influenced, for example, by a search task strategy. The accumulated results refer to the information already found. This includes information found in previous situations as well as information found in the current one. Accumulated results relate to the completion of the information need (or the futility of trying this). The accumulated efforts refer to how much work the searcher has had to invest from the start of the present session (or in prior sessions) up to the current position. In addition it can refer specifically to effort invested in the current situation. The information space refers to the part of the Web that the searcher has navigated, as well as the information space anticipated by the searcher. The searcher has developed a cognitive model of the information space based on his knowledge about the Web and the existing resources on the Web, but also on his knowledge about institutions and organisations that he expects to be represented on the Web. Time can be used to specify how the total amount of time spent during a search process influences the current situation, but it can also relate to the specific time used in that situation. The remaining needs refer to what the searcher has planned to search for in the continuation of the search process and possibly in subsequent search processes. Web resource types differ from each other with respect to content and format. Some are known from the world of paper-based publishing, such as newspapers, scientific journals, dissertations, novels, and collections of poems, but there are many new genres that have originated on the Web (home pages, various kinds of interactive resources, etc.) (Shepherd & Watters, 1998). "Technical problems" is used to describe problems caused by the software in use, both on the client and server sides of interaction. Lack of bandwidth may also cause problems, for example in accessing resources that heavily depend on transmission of large amounts of data. Web pages that have disappeared also cause this kind of problem.
Situations and transitions share many attributes. Two unique attributes are only present in situations: relevance judgement and relevance level. Relevance judgement relates to the searcher’s evaluation of the pages found, which may be of use to him in different degrees. We do not state any predefined categories for relevance judgements, whereas in other studies binary (relevant or not relevant) or ternary (adding “partially relevant” to the former two) relevance measures have been used. By relevance level we mean that the criteria used for evaluation may be related to the work task, which is what Saracevic (1996) calls situational relevance, but they can also be related to other levels, for example, when an intermediary judges a resource’s relevance for a (potential) user. Relevance judgements are also made in accordance with the organisational preferences, thus sociocognitive relevance (Cosijn & Ingwersen, 2000) may also affect the judgements.
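To make the attribute lists above easier to picture, they can be coded as simple record types when transcribing observed search processes. The sketch below is an illustrative data structure only, not part of the SST method schema itself, and the field types and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SearchTransition:
    """A period spent interacting with meta-information (source selection, navigation)."""
    actions: List[str] = field(default_factory=list)
    accumulated_results: List[str] = field(default_factory=list)
    accumulated_effort: float = 0.0          # e.g., minutes of work so far
    information_space: List[str] = field(default_factory=list)
    time_spent: float = 0.0
    remaining_needs: List[str] = field(default_factory=list)
    resource_type: str = ""
    technical_problems: List[str] = field(default_factory=list)

@dataclass
class SearchSituation(SearchTransition):
    """Adds the two attributes unique to situations: relevance judgement and level."""
    relevance_judgement: Optional[str] = None   # open-ended, no fixed relevance scale
    relevance_level: Optional[str] = None       # e.g., work-task or organisational level

episode = [
    SearchTransition(actions=["follow link"], resource_type="subject index"),
    SearchSituation(actions=["read page"], relevance_judgement="partially useful",
                    relevance_level="work task"),
]
print([type(step).__name__ for step in episode])
```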
The Method Schema’s Procedure Log analysis and surveys seem to be the most common data collection methods in Web IS&R. In general, the problem with:
• Survey-type of WIS analysis is that neither the specific work tasks/search tasks nor the specific processes are captured. Ex post facto findings in surveys provide only overviews of individuals’ conceptions of WIS in general;
• Log analysis-type of data on WIS analysis is that it is not informed by anything happening in front of the computer screen.
In fact, even if one combines these types of analyses, one cannot analyse the processes properly for the effects of characteristics of work tasks, search tasks, or specific processes because the primary determinants are missing from the study setting. The use of triangulation as a general approach for data collection is necessary to capture the interplay of the various factors. We suggest the use of the following data collection methods for the domain described above:
• The search process can be captured using a combination of video logs and observation. This kind of data will provide information on all of the proposed attributes of situations and transitions discussed above. It will also provide data on the other categories.
• The work task can be captured using a combination of interviews and output data, such as, for example, theses, articles, reports and other written material.
• The search task can be identified from the interviews as well as utterances made by the searcher during the process (observation and video logs).
• The searcher can provide information about him/herself in interviews and questionnaires/surveys.
• The social/organisational environment can be described through interviews, annual reports, and other written material documenting the organisation’s activities and policies.
The core data would be collected using some kind of screen capturing or video recording of the computer screen during the processes. This, however, should be combined with simultaneous recordings of the searcher’s utterances, and the searchers should be instructed to talk aloud (Ericsson & Simon, 1996) during searching. Alternatively Web transaction logs could be used, but then it would be difficult to capture nonaction-related features of the process, for example, to determine whether the searcher is really reading a page.
The Method Schema’s Justification A method based on this schema was used to analyse real WIS interactions (Pharo, 2002) with encouraging results.
FUTURE TRENDS The continuing growth, and hence importance, of the World Wide Web will make a better understanding of the complex interplay taking place during search processes even more important. The Web will doubtless be an important factor affecting interaction in various “environments” (business, education on all levels, research, public affairs, etc.). The need to analyse what takes place in different setting advocates the need for tools for holistic analysis, such as the SST method schema.
CONCLUSION There is a large body of research literature on Web information searching (WIS). One approach to WIS research is log analysis, which is based on log contents and furnishes researchers with easily available massive data sets. However, the logs do not represent the user’s intentions and interpretations. Another common approach to WIS is based on user surveys. Such surveys may cover issues like the demographics of users, frequencies of use, preferences, habits, hurdles to WIS,
etcetera. However, being ex post facto studies, they do not supply knowledge on how the searchers act in concrete WIS processes. To understand and explain WIS processes, one needs to closely look at concrete processes in context. The literature of IS&R suggests several factors or categories like work task, search task, the searcher him/herself, and organisational environment as affecting information searching. A promising way to understand/explain WIS is through these categories. The current approaches to WIS, however, cannot shed light on what the effects are, if any. The SST method schema is developed to address these issues.
REFERENCES
Ackermann, E., & Hartman, K. (2003). Searching and researching on the Internet and the World Wide Web (3rd ed.). Wilsonville, OR: Franklin, Beedle & Associates.
Bunge, M. (1967). Scientific research. Heidelberg: Springer-Verlag.
Byström, K., & Järvelin, K. (1995). Task complexity affects information seeking and use. Information Processing and Management, 31(2), 191-213.
Cosijn, E., & Ingwersen, P. (2000). Dimensions of relevance. Information Processing & Management, 36(4), 533-550.
Eloranta, K.T. (1979). Menetelmäeksperttiyden analyysi menetelmäkoulutuksen suunnittelun perustana [The analysis of method expertise as a basis for planning education]. Tampere: Tampereen yliopiston.
Fidel, R. et al. (1999). A visit to the information mall: Web searching behavior of high school students. Journal of the American Society for Information Science, 50(1), 24-37.
GVU’s WWW User Surveys. (2001). GVU Center’s WWW User Surveys. Retrieved March 25, 2005, from http://www.cc.gatech.edu/gvu/user_surveys/
Jansen, B.J., & Pooch, U. (2001). A review of Web searching studies and a framework for future research. Journal of the American Society for Information Science, 52(3), 235-246.
Jansen, B.J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing & Management, 36(2), 207-227.
Marchionini, G., Dwiggins, S., Katz, A., & Lin, X. (1993). Information seeking in full-text end-user-oriented search systems: The roles of domain and search expertise. Library & Information Science Research, 15(1), 35-69.
Newell, A. (1969). Heuristic programming: Ill-structured problems. In J. Aronofsky (Ed.), Progress in operations research III (pp. 360-414). New York: John Wiley & Sons.
Pharo, N. (2002). The SST method schema: A tool for analysing Web information search processes. Doctoral dissertation. Tampere: University of Tampere.
Pharo, N., & Järvelin, K. (2004). The SST method: A tool for analysing Web information search processes. Information Processing & Management, 40(4), 633-654.
Saracevic, T. (1996). Relevance reconsidered ’96. In P. Ingwersen & N.O. Pors (Eds.), Information science: Integration in perspective (pp. 201-218). Copenhagen: Royal School of Librarianship.
Shepherd, M., & Watters, C. (1998). The evolution of cybergenres. In Proceedings of the 32nd Hawaii International Conference on System Sciences (HICSS ’98) (pp. 97-109).
Silverstein, C., Henzinger, M., Marais, H., & Moricz, M. (1999). Analysis of a very large Web search engine query log. SIGIR Forum, 33(1), 6-12.
Spink, A., Wolfram, D., Jansen, B.J., & Saracevic, T. (2001). Searching the Web: The public and their queries. Journal of the American Society for Information Science, 52(3), 226-234.
Vakkari, P. (2001). A theory of the task-based information retrieval process: A summary and generalisation of a longitudinal study. Journal of Documentation, 57(1), 44-60.
Wang, P., & Tenopir, C. (1998). An exploratory study of users’ interaction with World Wide Web resources: Information skills, cognitive styles, affective states, and searching behaviors. In M.E. Williams (Ed.), Proceedings of the 19th National Online Meeting (pp. 445-454). Medford, NJ: Information Today.
KEY TERMS Information System: A collection of sources containing potential information. Information systems can be of variable structure and size, from small bibliographic catalogues to the Web itself.
Method: A procedure for handling a set of problems. Methods can be categorised as "quantitative," which is, for example, the case for various statistical ways of data handling, or "qualitative," which may be exemplified by grounded theory. A method (and thus a method schema) consists of the following three parts: (1) a problem statement or domain modelling the phenomenon under study, (2) a procedure for collecting and analysing data to understand the phenomenon, and (3) a justification, for example, by showing its ability to solve designated problems of the domain.
Method Schema: Any representation defined for one or more methods, where one or more aspects of the method have been left uninterpreted and represented only through their plain name, and where some aspects of the methods may have been left out (even lacking their naming). Method schemas take the format of a method but contain unspecified components that need to be specified if they are to reach the level of a method. In other words, a method schema is an abstract representation of one or more methods – a generic model. The difference between a method and a method schema can be said to be a continuum of generality.
Search Process: The period during which a searcher interacts with an information system. The structure of a search process is dialectic; it switches between search situations and search transitions.
Search Situations: The periods of a search process during which the searcher interacts with sources potentially containing information related to his/her search task.
Search Transitions: The periods of a search process during which the searcher interacts with sources containing meta-information.
Secure Multiparty Computation for Privacy Preserving Data Mining Yehida Lindell Bar-Ilan University, Israel
INTRODUCTION The increasing use of data-mining tools in both the public and private sectors raises concerns regarding the potentially sensitive nature of much of the data being mined. The utility to be gained from widespread data mining seems to come into direct conflict with an individual’s need and right to privacy. Privacy-preserving data-mining solutions achieve the somewhat paradoxical property of enabling a data-mining algorithm to use data without ever actually seeing it. Thus, the benefits of data mining can be enjoyed without compromising the privacy of concerned individuals.
BACKGROUND A classical example of a privacy-preserving data-mining problem is from the field of medical research. Consider the case that a number of different hospitals wish to jointly mine their patient data for the purpose of medical research. Furthermore, assume that privacy policy and law prevent these hospitals from ever pooling their data or revealing it to each other due to the confidentiality of patient records. In such a case, classical data-mining solutions cannot be used. Fortunately, privacy-preserving data-mining solutions enable the hospitals to compute the desired data-mining algorithm on the union of their databases without ever pooling or revealing their data. Indeed, the only information (provably) learned by the different hospitals is the output of the data-mining algorithm. This problem, whereby different organizations cannot directly share or pool their databases but must nevertheless carry out joint research via data mining, is quite common. For example, consider the interaction between different intelligence agencies in the USA. These agencies are suspicious of each other and do not freely share their data. Nevertheless, due to recent security needs, these agencies must run data-mining algorithms on their combined data. Another example relates to data that is held by governments. Until recently, the Canadian government held a vast federal database that pooled citizen data from a number of different government ministries (some called this database the “big brother” database). The Canadian government claimed that the data-
base was essential for research. However, due to privacy concerns and public outcry, the database was dismantled, thereby preventing that “essential research” from being carried out. This is another example of where privacypreserving data mining could be used to find a balance between real privacy concerns and the need of governments to carry out important research. Privacy-preserving data mining is actually a special case of a long-studied problem in cryptography: secure multiparty computation. This problem deals with a setting where parties with private inputs wish to jointly compute some function of their inputs. Loosely speaking, this joint computation should have the property that the parties learn the correct output and nothing else, even if some of the parties maliciously collude to obtain more information.
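A toy example helps to fix ideas before the formal treatment: for the specific functionality of a joint sum (such as hospitals pooling a patient count), additive secret sharing gives a simple protocol for semihonest parties in which no single party learns anything about another party's input beyond what the output itself reveals. The sketch below simulates all parties in one process for brevity and is an illustration only, not one of the general protocols discussed in this article; the modulus is an arbitrary assumption.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary public modulus, assumed larger than any possible sum

def share(value: int, n_parties: int):
    """Split a private input into n additive shares that sum to the value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_inputs):
    """Each party shares its input; combining all shares reveals only the total."""
    n = len(private_inputs)
    all_shares = [share(x, n) for x in private_inputs]
    # Party i receives the i-th share of every input and publishes their sum.
    partial_sums = [sum(s[i] for s in all_shares) % MODULUS for i in range(n)]
    return sum(partial_sums) % MODULUS

# Three hospitals jointly compute a patient count without revealing their own counts.
print(secure_sum([1200, 850, 2300]))  # 4350
```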
MAIN THRUST In this article, I provide a succinct overview of secure multiparty computation and how it can be applied to the problem of privacy-preserving data mining. The main focus is on how security is formally defined, why this definitional approach is adopted, and what issues should be considered when defining security for privacy-preserving data-mining problems. Due to space constraints, the treatment in this chapter is both brief and informal. For more details, see Goldreich (2003) for a survey on cryptography and cryptographic protocols.
Security Definitions for Secure Computation The aim of a secure multiparty computation task is for the participating parties to securely compute some function of their distributed and private inputs. But what does it mean for a computation to be secure? One way of approaching this question is to provide a list of security properties that should be preserved. The first such property that often comes to mind is that of privacy or confidentiality. A naïve attempt at formalizing privacy would be to require that each party learns nothing about the other parties’ inputs, even if it behaves maliciously. However, such a definition is usually unat-
tainable, because the defined output of the computation itself typically reveals some information about the other parties’ inputs. (For example, a decision tree computed on two distributed databases reveals some information about both databases.) Therefore, the privacy requirement is usually formalized by saying that the only information learned by the parties in the computation (again, even by those who behave maliciously) is that specified by the function output. Although privacy is a primary security property, it rarely suffices. Another important property is that of correctness; this states that the honest parties’ outputs are correctly distributed even in the face of adversarial attack. A central question that arises in this process of defining security properties is “When is the list of properties complete?” This question is, of course, application dependent, which essentially means that for every new problem, the process of deciding which security properties are required must be reevaluated. I must stress that coming up with the right list of properties is often very difficult and it can take many years until one is convinced that a definition truly captures the necessary security requirements. Furthermore, an incomplete list of properties may easily lead to real security failures.
The Ideal/Real Model Paradigm Due to these difficulties, the standard definitions of secure computation today follow an alternative approach called the ideal/real model paradigm. This has been the dominant paradigm in the investigation of secure computation in the last 15 years; see Canetti (2000) for the formal definition and references therein for related definitional work. Loosely speaking, this paradigm defines the security of a real protocol by comparing it to an ideal computing scenario, in which the parties interact with an external trusted and incorruptible party. In this ideal execution, the parties all send their inputs to the trusted party (via ideally secure communication lines). The trusted party then computes the function on these inputs and sends each party its specified output. Such a computation embodies the goal of secure computation, and it is easy to see that the properties of privacy and correctness hold in the ideal model. In addition to the fact that these and other security properties are preserved in an ideal execution, the simplicity
of the ideal model provides an intuitively convincing security guarantee. For example, notice that the only message a party sends in an ideal execution is its input. Therefore, the only power that a corrupted party has is to choose its input as it wishes (behavior that is typically legitimate anyway). So far, I have defined an ideal execution in an ideal world. However, in the real world, the parties run a protocol without any trusted help. Despite this fact, a secure real protocol should somehow emulate an ideal execution. That is, a real protocol that is run by the parties (in a world where no trusted party exists) is secure if no adversary can do more harm in a real execution than in an execution that takes place in the ideal world. Stated differently, for any adversary carrying out a successful attack on a real protocol, there exists an adversary that successfully carries out the same attack in the ideal world. This suffices because, as I have shown, no successful attacks can be carried out in an ideal execution. Thus, no successful attacks can be carried out on the real protocol, implying that it is secure. See Figure 1 for a diagram of the real and ideal models. Note that security is required to hold for every adversary carrying out any feasible attack (within the parameters defined for the adversary, as discussed next).
Figure 1. The real model, in which the parties run the protocol directly on their private inputs x1 and x2, and the ideal model, in which each party sends its input to a trusted party that returns the output f(x1,x2) to both parties.
Defining the Model The preceding informal description of the ideal/real model paradigm expresses the intuition that a real execution should behave just like an ideal execution. In order to obtain a complete and formal definition, it is crucial that both the ideal and real models are fully defined. Among other things, this involves defining the real network model and the adversary’s power, including any assumptions on its behavior. A secure protocol only provides real-world security guarantees if the mathematical definition of the real computation and adversarial models accurately reflects the real network and adversarial threats that exist. I now briefly discuss a number of parameters that are considered when defining the network model and the adversary; this list is far from comprehensive. Two central considerations that arise when defining the network model relate to the communication channels and whether or not any trusted setup phase is assumed. It is typically assumed that all parties are connected via point-to-point authenticated channels (meaning that the adversary cannot modify messages sent between honest parties). Note that this can be implemented by assuming a public-key infrastructure for digital signatures. Other parameters to consider are whether the communication over the network is synchronous or asynchronous, and whether messages that are sent between honest parties are guaranteed to arrive. Finally, the question of what (if any) other protocols are running simultaneously in the network must also be addressed. This issue is referred to as protocol composition and is currently a very active research subject in the cryptographic community. When defining the adversary, a number of possibilities arise, including the following:
1. Complexity: Given that the widely accepted notion of efficient or feasible computation is probabilistic polynomial-time, the natural choice is to limit an adversary to this complexity. However, there are also protocols that are secure against unbounded adversaries.
2. Number of corrupted parties: In a general multiparty setting, it is assumed that the adversary controls some subset of the participating parties; these parties are called corrupted. The allowed size of this subset must also be defined (typical choices are assuming that less than one third or one half are corrupted and not assuming any limitation on the number of corrupted parties).
3. Corruption strategy: This parameter relates to whether the adversary is static (meaning that the set of corrupted parties is fixed ahead of time) or adaptive (meaning that the adversary can break into parties during the protocol execution).
4. Allowed adversarial behavior: In the earlier discussion, we implicitly referred to malicious adversaries who are allowed to arbitrarily deviate from the protocol specification. However, the adversary’s behavior is sometimes restricted. For example, a semihonest adversary is assumed to follow the protocol but may attempt to learn secret information from the messages that it receives.
The above very partial list of parameters for defining the adversary begs the question: How does one decide which adversarial model to take? A conservative approach is to take the most powerful adversary possible. However, being overly conservative comes at a price. For example, it is impossible to obtain security for unbounded adversaries in the case that half or more of the parties are corrupted. Furthermore, it is often the case that
more efficient protocols can be constructed for weaker adversaries (specifically, highly efficient protocols for many tasks are known for the semihonest adversarial model, but this is not the case for the malicious model). In general, a good approach is to consider malicious polynomial-time adversaries who may adaptively corrupt any number of the participants. However, in some cases, the semihonest adversarial model is reasonable. For example, in the medical database example provided in the introduction, the hospitals are not believed to be malicious; rather, the law prevents them from revealing confidential patient data. In such a case, the protection provided by semihonest adversarial modeling is sufficient. I stress, however, that in many cases the semihonest model is not realistic, and malicious adversaries must be considered. In summary, two central guiding principles when defining security are (a) the definition must accurately and conservatively model the real-world network setting and adversarial threats and (b) all aspects of the model must be fully and explicitly defined. These conditions are necessary for obtaining a mathematical definition of security that truly implies that protocols executed in the real world will withstand all adversarial attacks.
The Feasibility of Secure Multiparty Computation The aforementioned security definition provides very strong guarantees. An adversary attacking a protocol that is secure is essentially limited to choosing its input (because this is all that it can do in the ideal model). However, can this definition actually be achieved, and if yes, under what conditions? A fundamental result of the theory of cryptography states that under certain parameters and assumptions, any efficient multiparty functionality can be securely computed. This result comprises a number of different theorems, depending on the model and the number of corrupted parties. In this article, I describe the basic results for the stand-alone model (where only a single protocol execution is considered) and the computational setting (where the adversary is limited to polynomial time). The basic results for the informationtheoretic setting can be found in Ben-Or, Goldwasser, and Wigderson (1988) and Chaum, Crepeau, and Damgard (1988). The first basic theorem states that when a majority of the parties are honest, any multiparty functionality can be securely computed in the presence of malicious, static adversaries (Yao, 1986; Goldreich, Micali, & Wigderson, 1986). Extensions to the case of adaptive
adversaries can be found in Beaver and Haber (1992) and Canetti, Feige, Goldreich, and Naor (1996). The second basic theorem relates to the case that any number of parties may be corrupted, so an honest majority does not necessarily exist. In this case, it is impossible to construct protocols that meet the definition as described previously. The reason is that the definition implies that all parties receive output together; however, this cannot be achieved without an honest majority (Cleve, 1986). The security definition is therefore explicitly relaxed to allow the adversary to prevent the honest parties from receiving their output, even in the ideal model; this relaxed definition is called security with abort. As before, it has been shown that even when any number of parties may be corrupted, any multiparty functionality can be securely computed with abort in the presence of malicious, static adversaries (Yao, 1986; Goldreich et al., 1986). As I have mentioned, the preceding results all refer to the stand-alone model of computation, where it is assumed that the secure protocol being analyzed is run once in isolation. Feasibility results have also been shown for the case of protocol composition where many different protocols run concurrently; for example, see Canetti (2001) and Canetti, Lindell, Ostrovsky, and Sahai (2002). A brief survey on known results for the setting of composition can be found in Lindell (2003). The importance of the above results is that they demonstrate that under an appropriate choice of parameters and assumptions, any privacy-preserving datamining problem can be solved, in principle. Therefore, the remaining challenge is to construct protocols that are efficient enough for practical use.
Secure Protocols for PrivacyPreserving Data Mining The first paper to take the classic cryptographic approach to privacy-preserving data mining was Lindell and Pinkas (2002). The paper presents an efficient protocol for the problem of distributed decision tree learning; specifically, how to securely compute an ID3 decision tree from two private databases. The paper considered semihonest adversaries only. This approach was adopted in a relatively large number of works that demonstrate semihonest protocols for a wide variety of data-mining algorithms; see, for example, Clifton, Kantarcioglu, Vaidya, Lin, and Zhu (2003). In my opinion, these results serve as a proof of concept that highly efficient protocols can be constructed even for seemingly complex functions. However, in many cases, the semihonest adversarial model does not suffice. Therefore, the malicious model must also be considered.
Other work on the problem of privacy-preserving data mining has followed what is often called the data perturbation approach, as introduced by Agrawal and Srikant (2000). The development of rigorous security definitions that appropriately model security in settings considered by this approach seems to be a very difficult task; naïve definitions of security have been shown to be completely insecure — see Dinur and Nissim (2003) for just one example.
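As a rough illustration of the perturbation idea (and emphatically not the specific scheme of Agrawal and Srikant, 2000, nor a scheme with proven privacy guarantees), a randomized-response style sketch flips each sensitive boolean value with a known probability before release and later corrects the aggregate count:

```python
import random

def perturb(values, flip_prob=0.3):
    """Flip each private 0/1 value with probability flip_prob before release."""
    return [v ^ (random.random() < flip_prob) for v in values]

def estimate_true_count(perturbed, flip_prob=0.3):
    """Invert the expected effect of flipping to estimate the true number of 1s."""
    n = len(perturbed)
    observed = sum(perturbed)
    return (observed - flip_prob * n) / (1 - 2 * flip_prob)

random.seed(0)
true_values = [1] * 400 + [0] * 600       # 40% of records carry the sensitive attribute
released = perturb(true_values)
print(round(estimate_true_count(released)))  # close to 400; individual values stay noisy
```

As the paragraph above notes, pinning down what such noise actually guarantees about individual privacy is exactly where naïve definitions have been shown to fail.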
FUTURE TRENDS As I have shown, secure solutions exist for all privacypreserving data-mining problems. However, these solutions are usually not efficient enough for use in practice. Thus, the main problem of privacy-preserving data mining is to find protocols that can realistically be run, even on very large databases. Until now, most work has focused on the semihonest adversarial model, which often does not provide a sufficient level of security. It is therefore of great importance to begin developing tools and techniques for constructing highly efficient protocols that are secure against malicious adversaries. Achieving this goal may involve finding new definitions that are more relaxed than those described here, yet still accurately model the real security concerns. This research task is very nontrivial due to the subtle nature of security and security definitions.
CONCLUSION The history of cryptography shows very clearly that when protocols are not proven secure, or when the adversarial models are not explicitly defined, real attacks are often discovered. Furthermore, the task of coming up with mathematical definitions that accurately model real adversarial threats is a very difficult task. Indeed, slight modifications to existing definitions can render them completely useless; see Canetti (2000) for some discussions on this issue. In this short article, I have described the real/ideal model paradigm for defining security. This definitional approach has been the fruit of many years of cryptographic research, and protocols that meet this definition provide very powerful security guarantees. In order to provide efficient solutions for privacy-preserving data-mining problems, it may be necessary to find new definitions that provide both rigorous security guarantees and can be met by highly efficient protocols. This is perhaps the ultimate challenge of this new and exciting field of research.
REFERENCES
Agrawal, R., & Srikant, R. (2000). Privacy preserving data mining. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 439-450), USA.
Beaver, D., & Haber, S. (1992). Cryptographic protocols provably secure against dynamic adversaries. Proceedings of the Eurocrypt Conference on Cryptologic Research (pp. 307-323), Hungary.
Ben-Or, M., Goldwasser, S., & Wigderson, A. (1988). Completeness theorems for non-cryptographic fault-tolerant distributed computation. Proceedings of the 20th Annual ACM Symposium on Theory of Computing (pp. 1-10), USA.
Canetti, R. (2000). Security and composition of multiparty cryptographic protocols. Journal of Cryptology, 13(1), 143-202.
Canetti, R. (2001). Universally composable security: A new paradigm for cryptographic protocols. Proceedings of the 42nd Annual IEEE Symposium on the Foundations of Computer Science (pp. 136-145), USA.
Canetti, R., Feige, U., Goldreich, O., & Naor, M. (1996). Adaptively secure multi-party computation. Proceedings of the 28th Annual ACM Symposium on Theory of Computing (pp. 639-648), USA.
Canetti, R., Lindell, Y., Ostrovsky, R., & Sahai, A. (2002). Universally composable two-party and multiparty computation. Proceedings of the 34th Annual ACM Symposium on Theory of Computing (pp. 494-503), Canada.
Chaum, D., Crepeau, C., & Damgard, I. (1988). Multiparty unconditionally secure protocols. Proceedings of the 20th Annual ACM Symposium on Theory of Computing (pp. 11-19), USA.
Cleve, R. (1986). Limits on the security of coin flips when half the processors are faulty. Proceedings of the 18th Annual ACM Symposium on Theory of Computing (pp. 364-369), USA.
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., & Zhu, M. Y. (2003). Tools for privacy preserving data mining. SIGKDD Explorations, 4(2), 28-34.
Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. Proceedings of the ACM Symposium on Principles of Database Systems (pp. 202-210), USA.
Goldreich, O. (2003). Cryptography and cryptographic protocols. Distributed Computing, 16(2), 177-199.
Goldreich, O., Micali, S., & Wigderson, A. (1987). How to play any mental game: A completeness theorem for protocols with honest majority. Proceedings of the 19th Annual ACM Symposium on Theory of Computing (pp. 218-229), USA.
Lindell, Y. (2003). Composition of secure multi-party protocols: A comprehensive study. Springer-Verlag.
Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15(3), 177-206.
Yao, A. (1986). How to generate and exchange secrets. Proceedings of the 27th Annual IEEE Symposium on the Foundations of Computer Science (pp. 162-167), Canada.
KEY TERMS

Corrupted Parties: Parties that participate in a protocol while under the control of the adversary.

Functionality: The task that the parties wish to jointly compute.

Ideal Model: A virtual setting where all parties interact with an incorruptible trusted party who carries out the joint computation for them.

Malicious Adversary: An adversary who may arbitrarily deviate from the protocol specification (and so is unlimited in its attack strategy).

Real Model: The setting where a real protocol is run (without any trusted help).

Secure Multiparty Computation: The problem of computing any distributed task so that security is preserved in the face of adversarial attacks.

Semihonest Adversary: An adversary who follows the protocol specification but may try to learn private information by analyzing the messages that it receives during the protocol execution. (This models the inadvertent leakage of information even during legitimate protocol executions.)
Semantic Data Mining

Protima Banerjee, Drexel University, USA
Xiaohua Hu, Drexel University, USA
Illhoi Yoo, Drexel University, USA
INTRODUCTION

Over the past few decades, data mining has emerged as a field of research critical to understanding and assimilating the large stores of data accumulated by corporations, government agencies, and laboratories. Early on, mining algorithms and techniques were limited to relational data sets coming directly from Online Transaction Processing (OLTP) systems, or from a consolidated enterprise data warehouse. However, recent work has begun to extend the limits of data mining strategies to include “semistructured data such as HTML and XML texts, symbolic sequences, ordered trees and relations represented by advanced logics” (Washio & Motoda, 2003). The goal of any data mining endeavor is to detect and extract patterns in the data sets being examined. Semantic data mining is a novel approach that makes use of graph topology, one of the most fundamental and generic mathematical constructs, and semantic meaning, to scan semistructured data for patterns. This technique has the potential to be especially powerful as graph data representation can capture so many types of semantic relationships. Current research efforts in this field are focused on utilizing graph-structured semantic information to derive complex and meaningful relationships in a wide variety of application areas — national security and Web mining being foremost among these. In this paper, we review significant segments of recent data mining research that feed into semantic data mining and describe some promising application areas.
BACKGROUND

In mathematics, a graph is viewed as a collection of vertices or nodes and a set of edges that connect pairs of those nodes; graphs may be partitioned into sub-graphs to expedite and/or simplify the mining process. A tree is defined as an acyclic sub-graph, and trees may be ordered or unordered, depending on whether or not the edges are
labeled to specify precedence. If a sub-graph does not include any branches, it is called a path. The two pioneering works in graph-based data mining, the algorithmic precursor to semantic data mining, take an approach based on greedy search. The first of these, SUBDUE, deals with conceptual graphs and is based on the Minimum Description Length (MDL) principle (Cook & Holder, 1994). SUBDUE is designed to discover individual concepts within the graph by starting with a single vertex, which represents a potential concept, and then incrementally adding nodes to it. At each iteration, a more “abstract” concept is evaluated against the structure of the original graph, until the algorithm reaches a stopping point, which is defined by the MDL heuristic (Cook & Holder, 2000). The second of the seminal graph mining works is called Graph Based Induction (GBI), and like SUBDUE, it is also designed to extract concepts from data sets (Yoshida, Motoda, & Indurkhya, 1994). The GBI algorithm repeatedly compresses a graph by replacing each found sub-graph or concept with a single vertex. To avoid compressing the graph down to a single vertex, an empirical graph size definition is set to establish the size of the extracted patterns, as well as the size of the compressed graph. Later researchers have applied several other approaches to the graph mining problem. Notable among these are the Apriori-based approach for finding frequent sub-graphs (Inokuchi, Washio, & Motoda, 2000; Kuramochi & Karypis, 2002); Inductive Logic Programming (ILP), which allows background knowledge to be incorporated into the mining process; Inductive Database approaches, which have the advantage of practical computational efficiency; and the Kernel Function approach, which uses the mathematical kernel function measure to compute similarity between two graphs (Washio & Motoda, 2003). Semantic data mining expands the scope of graph-based data mining from being primarily algorithmic to include ontologies and other types of semantic
information. These methods enhance the ability to systematically extract and/or construct domain-specific features in data.
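To give a rough feel for the compression idea behind GBI described above, the following Python sketch finds the most frequent labeled edge pattern in a tiny labeled graph and contracts its non-overlapping occurrences into new concept vertices. This is only a loose illustration under simplifying assumptions (single-edge patterns, invented node labels), not the actual GBI algorithm.

```python
from collections import Counter

def most_frequent_edge_pattern(nodes, edges):
    """Count labeled edge patterns (label_u, label_v); the most frequent is the candidate concept."""
    counts = Counter((nodes[u], nodes[v]) for u, v in edges)
    return counts.most_common(1)[0][0] if counts else None

def compress_once(nodes, edges):
    """Replace every non-overlapping occurrence of the best pattern with a single new vertex."""
    pattern = most_frequent_edge_pattern(nodes, edges)
    if pattern is None:
        return nodes, edges
    merged, next_id = {}, max(nodes) + 1
    for u, v in list(edges):
        if (nodes[u], nodes[v]) == pattern and u not in merged and v not in merged:
            nodes[next_id] = f"<{pattern[0]}-{pattern[1]}>"   # new concept vertex
            merged[u] = merged[v] = next_id
            next_id += 1
    new_nodes = {i: lab for i, lab in nodes.items() if i not in merged}
    new_edges = {(merged.get(u, u), merged.get(v, v)) for u, v in edges}
    new_edges = {(a, b) for a, b in new_edges if a != b}      # drop self-loops created by merging
    return new_nodes, new_edges

# Toy labeled graph: node id -> label, plus directed edges between ids (labels are made up).
nodes = {1: "car", 2: "bomb", 3: "station", 4: "car", 5: "bomb"}
edges = {(1, 2), (2, 3), (4, 5)}
print(compress_once(dict(nodes), set(edges)))
```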
MAIN THRUST

Defining Semantics

The effectiveness of semantic data mining is predicated on the definition of a domain-specific structure that captures semantic meaning. Recent research suggests three possible methods of capturing this type of domain knowledge:
• Ontologies
• Semantic Associations
• Semantic Metadata
In this section, we will explore each of these in depth. An ontology is a formal specification in a structured format, such as XML or RDF, of the concepts that exist within a given area of interest and the semantic relationships among those concepts. The most useful aspects of feature extraction and document classification, two fundamental data mining methods, are heavily dependent on semantic relationships (Phillips & Buchanan, 2003). For example, a news document that describes “a car that ran into a gasoline station and exploded like a bomb” might not be classified as a terrorist act, while “a car bomb that exploded in a gasoline station” probably should be (Gruenwald, McNutt, & Mercier, 2003). Relational databases and flat documents alone do not have the required semantic knowledge to intelligently guide mining processes. While databases may store constraints between attributes, this is not the same as describing relationships among the attributes themselves. Ontologies are uniquely suited to characterize this semantic meta-knowledge (Phillips & Buchanan, 2003). In the past, ontologies have proved to be valuable in enhancing the document clustering process (Hotho, Staab, & Strumme, 2003). While older methods of text clustering were only able to relate documents that used identical terminology, semantic clustering methods were able to take into account the conceptual similarity of terms such as might be defined in terminological resources or thesauri. Beneficial effects can be achieved for text document clustering by integrating an explicit conceptual account of terms found in ontologies such as WordNet. For example, documents containing the terms “beef” and “chicken” are found to be similar, because “beef” and “chicken” are both sub-concepts of “meat” and, at a higher level, “food.” However, at a more granular clustering level, “beef” may be more similar to “pork” than “chicken” because both can be grouped together under the sub-heading of “red meat” (Hotho, Staab, & Strumme, 2003).
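To make the concept-hierarchy intuition above concrete, the toy sketch below scores two terms by the depth of their deepest shared ancestor in a miniature is-a hierarchy; the hierarchy and the similarity measure are illustrative assumptions rather than the method of Hotho, Staab, and Stumme (2003).

```python
# Toy is-a hierarchy; in practice this would come from WordNet or a domain ontology.
PARENT = {
    "beef": "red meat", "pork": "red meat", "chicken": "meat",
    "red meat": "meat", "meat": "food", "food": None,
}

def ancestors(term):
    """Return the chain from the term itself up to the root of the hierarchy."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT.get(term)
    return chain

def concept_similarity(a, b):
    """Depth of the deepest concept shared by both terms (0 if nothing is shared)."""
    shared = set(ancestors(a)) & set(ancestors(b))
    return max((len(ancestors(c)) for c in shared), default=0)

print(concept_similarity("beef", "pork"))     # share "red meat" -> higher score
print(concept_similarity("beef", "chicken"))  # share only "meat"/"food" -> lower score
```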
Ontologies have also been used to augment the knowledge discovery and knowledge sharing processes (Phillips & Buchanan, 2003). While in the past prior knowledge had been specified separately for each new problem, with the use of an ontology, prior knowledge found to be useful for one problem area can be reused in another domain. Thus, shared knowledge can be stored even in a relatively simple ontology, and collections of ontologies can be consolidated together at later points in time to form a more comprehensive knowledge base. At this point it should be noted that the issues associated with ontology construction and maintenance are a research area in and of themselves. Some discussion of potential issues is presented in Gruenwald, McNutt, & Mercier (2003) and Phillips & Buchanan (2003), but an extensive examination of this topic is beyond the scope of the current paper.
In addition to ontologies, another important tool in extracting and understanding meaning is semantic associations. “Semantic associations lend meaning to information, making it understandable and actionable, and provide new and possibly unexpected insights” (Aleman-Meza et al., 2003). Looking at the Internet as a prime example, it becomes apparent that entities can be connected in multiple ways to other entities by types of relationships that cannot be known or established a priori. For example, a “student” can be related to a “university,” “professors,” “courses,” and “grades,” but she can also be related to other entities by different relations such as financial ties, familial ties, or neighborhood. “In the Semantic Web vision, the RDF data model provides a mechanism to capture the meaning of an entity or resource by specifying how it relates to other entities or classes of resources” (Aleman-Meza et al., 2003); each of these relationships between entities is a “semantic association,” and users can formulate queries against them. For example, semantic association queries in the port security domain may include the following (a toy sketch of answering such queries follows the list):
1. Are any passengers on a ship coming into dock in the United States known to be related by blood to one or more persons on the watch list?
2. Does the cargo on that ship contain any volatile or explosive materials, and are there any passengers on board who have specialized knowledge about the usage of those materials?
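A minimal sketch of answering such association queries over a toy triple store is shown below; the entities, relations, and watch-list scenario are invented for illustration and are not drawn from the cited systems. Semantic associations are modeled simply as relation-labeled paths between two entities.

```python
from collections import defaultdict

# Toy triple store; the entities, relations, and scenario are illustrative only.
TRIPLES = [
    ("ship_42", "docks_at", "port_of_miami"),
    ("passenger_7", "travels_on", "ship_42"),
    ("passenger_7", "sibling_of", "person_x"),
    ("person_x", "listed_on", "watch_list"),
    ("ship_42", "carries", "cargo_13"),
    ("cargo_13", "contains", "ammonium_nitrate"),
]

def build_graph(triples):
    """Index the triples as an undirected adjacency map labeled with the relation name."""
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
        graph[obj].append((pred, subj))
    return graph

def associations(graph, source, target, max_hops=3):
    """Enumerate relation paths (semantic associations) linking source to target."""
    found, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            found.append(path)
            continue
        if (len(path) - 1) // 2 >= max_hops:   # path stores node, relation, node, ...
            continue
        for relation, neighbor in graph[node]:
            if neighbor not in path:           # avoid revisiting entities
                stack.append((neighbor, path + [relation, neighbor]))
    return found

graph = build_graph(TRIPLES)
for path in associations(graph, "passenger_7", "watch_list"):
    print(" -> ".join(path))
```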
Semantic associations may span several entities, and these constructs are very important in domains such as national security because they may enable analysts to uncover non-obvious connections between disparate people, places, and events. In conjunction with semantic associations, semantic metadata is an important tool in understanding the
meaning of a document. Semantic metadata, in contrast to syntactic metadata, describes the content of a document within the context of a particular domain of knowledge. For example, documents relating to the homeland security domain may include semantic metadata describing terrorist names, bombing locations, and so on (Sheth et al., 2002).
Methods of Graph Traversal

Once the semantic structures for a given domain have been defined, an effective method for traversing those structures must be established. One such method that is coming into recent prominence, in addition to the algorithmic graph mining methods mentioned earlier in this chapter, is link mining. “Link mining is a newly emerging research area that is the intersection of the work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining” (Getoor, 2003). Link mining is an instance of multi-relational data mining, in its broadest sense, and a field that is coming into prominence as the issues around graph traversal become paramount. Link mining encompasses a range of tasks including both descriptive and predictive modeling. The field also introduces new algorithms for classification and clustering for the linked relational domain, and with the increasing prominence of links new mining tasks come to light as well (Getoor, 2003). Examples of such new tasks include predicting the number of links between two entities, predicting link types, inferring the existence of a link based on existing entities and links, inferring the identity of an entity, finding co-references, and discovering sub-graph patterns. Link mining areas currently being explored are: link-based classification, which predicts the category of an object; link-based cluster analysis, which clusters linked data based on the original work of SUBDUE; and several approaches to finding frequently occurring linking patterns (Getoor, 2003).
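As one concrete instance of the link-existence task listed above, the following toy sketch scores candidate links by the number of neighbors the two endpoints already share; this simple heuristic is illustrative only and is not taken from Getoor (2003).

```python
def common_neighbor_scores(adj, candidate_pairs):
    """Score each candidate link by the number of shared neighbors (higher = more likely)."""
    return {
        (u, v): len(set(adj.get(u, ())) & set(adj.get(v, ())))
        for u, v in candidate_pairs
    }

# Toy co-authorship-style graph; all names are made up for illustration.
adj = {
    "ann": ["bob", "carl"], "bob": ["ann", "carl", "dina"],
    "carl": ["ann", "bob"], "dina": ["bob"], "ed": ["dina"],
}
print(common_neighbor_scores(adj, [("ann", "dina"), ("ann", "ed")]))
```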
Relative Importance and Ranking

There is increasing interest in developing algorithms and software tools for visualization, exploratory, and interpretive analysis of graph-structured data, such as the results of the semantic mining process. “While visualization techniques such as graph-drawing can be very useful for gleaning qualitative information about the structure of small graphs, there is also a need for quantitative tools for characterizing graph properties beyond simple lists of links and connections, particularly as graphs become too large and complex for manual analysis” (White & Smyth, 2003). In the area of Web graphs, a number of ranking algorithms have been proposed, such as HITS (Kleinberg,
1999) and PageRank (Brin & Page, 1998) for automatically determining the “importance” of Web pages. One way of determining the relative importance of a result set might be to use a standard, global algorithm to rank all nodes in a sub-graph surrounding the root nodes of interest. The aforementioned PageRank algorithm is one such example. However, the problem with such an approach is that the root nodes are not given preferential treatment in the resulting ranking; in effect, the nodes are ranked within the local sub-graph rather than globally. Another approach is to apply semantic methods themselves to the relevance and ranking problem. In order to determine the relevance of semantic associations to user queries, it becomes critical to capture the semantic context within which those queries are going to be interpreted and used, or, more specifically, the domains of user interest. Aleman-Meza et al. (2003) propose that this can be accomplished “by allowing a user to browse an ontology and mark a region (sub-graph) of an RDF graph of nodes and/or properties of interest.” The associations passing through these regions that are considered relevant are ranked more highly in the returned result set than other associations, which may be ranked lower or discarded.
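One standard way to give the root nodes preferential treatment is to personalize the teleportation step of PageRank so that random restarts land only on those roots; this is related in spirit to, though not necessarily identical with, the algorithms of White and Smyth (2003). A minimal power-iteration sketch with an invented toy graph follows (dangling nodes and convergence tests are ignored for brevity):

```python
def personalized_pagerank(adj, root_nodes, damping=0.85, iters=100):
    """Power iteration with teleportation restricted to the root nodes of interest."""
    nodes = list(adj)
    restart = {n: (1.0 / len(root_nodes) if n in root_nodes else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue                      # dangling node: its mass is simply dropped here
            share = damping * rank[n] / len(out)
            for m in out:
                new_rank[m] += share
        rank = new_rank
    return rank

# Toy citation-style graph; edges and node names are made up for illustration.
adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": ["d"]}
scores = personalized_pagerank(adj, root_nodes={"a"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```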
FUTURE TRENDS

One of the most high-profile application areas for semantic data mining is in the building and mining of the Semantic Web, which associates the meaning of data with Web content. The SCORE system (Semantic Content Organization and Retrieval Engine), built at the University of Georgia, is one example that uses semantic techniques to traverse relationships between entities (Sheth et al., 2002). Designed as a semantic engine with main-memory-based indexing, SCORE provides support for context-sensitive search, browsing, correlation, normalization, and content analysis. Once the semantic search engine determines the context of information described in the document, it can explore related entities through associations. By navigating these associations or relationships, the engine can access content about these entities. Another critical domain for the application of semantic data mining, as mentioned previously, is national security. In this area, one of the most difficult aspects of the mining process is creating an ontology to be used for the duration of the task. “In classification of a terrorist incident we must identify violent acts, weapons, tactics, targets, groups and perhaps individuals” (Gruenwald, McNutt, & Mercier, 2003). While many domains are
topic-driven and focus on only a single classification area, the national security domain inherently requires a focused search across multiple topics in order to classify a document as terrorism-related. Specifically, a document must be identified as being semantically related to multiple branches in a terrorism hierarchy to be positively marked as relevant to the national security domain (Gruenwald, McNutt, & Mercier, 2003). The prototypical Passenger Identification, Screening and Threat Analysis Application (PISTA), developed at the University of Georgia, is an example of the application of the semantic mining approach to the national security domain. “PISTA extracts relevant metadata from different information resources, including government watch-lists, flight databases, and historical passenger data” (Sheth et al., 2003). Using a semantic association-based knowledge discovery engine, PISTA discovers suspicious patterns and classifies passengers into high-risk, low-risk, and no-risk groups, potentially minimizing the burden of an analyst who would have to perform further investigation. While PISTA restricts its focus to flight security, a similar approach might be applied to other aspects of national security and terrorism deterrence, such as port security and bomb threat prediction. One relatively novel area to which semantic mining techniques have recently been applied is money laundering crimes (Zhang, Salerno, & Yu, 2003). Money laundering is considered a major federal offense, and with the development of the global economy and Internet commerce, it is predicted that money laundering will become more prevalent and difficult to detect. The investigation of such crimes involves analyzing thousands of text documents in order to generate crime group models. These models group together a number of people or entities linked by certain attributes. These “attributes” typically are identified by the investigators based on their experiences and expertise, and consequently are very subjective and/or specific to a particular situation (Zhang, Salerno, & Yu, 2003). The resulting structure resembles a semantic graph, with the edges defined by the aforementioned attributes. Once this graphical model of the crime group has been generated, graph traversal and semantic query techniques may be used to automatically detect potential investigation scenarios.
CONCLUSION

Today, semantic data mining is a fast-growing field due to the increasing interest in understanding and exploiting the entities, attributes, and relationships in graph-structured data, which occurs naturally in many fields. In this review paper, we have presented a high-level overview of several important research areas that feed into semantic mining and have described some prominent applications of specific techniques. Many accomplishments have been made in this field to date; however, there is still much work to be done. As more and more domains begin to realize the predictive power that can be harnessed by using semantic search and association methods, it is expected that semantic mining will become of the utmost importance in our endeavor to assimilate and make effective use of the ever-increasing data stores that abound in today's world.
REFERENCES

Aleman-Meza, B., Halascheck, C., Ismailcem, B., & Sheth, A.P. (2003). Context-aware semantic association ranking. SWDB 2003 (pp. 33-50).

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference (pp. 107-117).

Cook, D., & Holder, L. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1, 231-255.

Cook, D., & Holder, L. (2000). Graph-based data mining. IEEE Intelligent Systems, 15(2), 32-41.

Getoor, L. (2003). Link mining: A new data mining challenge. ACM SIGKDD Explorations Newsletter, 5(1), 5-9.

Gruenwald, L., McNutt, G., & Mercier, A. (2003). Using an ontology to improve search in a terrorism database system. In DEXA Workshops 2003 (pp. 753-757).

Hotho, A., Staab, S., & Stumme, G. (2003). Ontologies improve text document clustering. In Third IEEE International Conference on Data Mining (pp. 541-544).

Inokuchi, A., Washio, T., & Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructure from graph data. In Proceedings of the 4th European Conference on Principles of Knowledge Discovery and Data Mining (pp. 13-23).

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.

Kuramochi, M., & Karypis, G. (2002). Mining scientific data sets using graphs. NSF Next Generation Data Mining Workshop (pp. 170-179).

Phillips, J., & Buchanan, B.G. (2001). Ontology-guided knowledge discovery in databases. International Conference on Knowledge Capture (pp. 123-130).

Sheth, A., Aleman-Meza, B., Arpinar, B., Bertram, C., Warke, Y., Ramakrishnan, C., Halaschek, C., Anyanwu, K.,
Avant, D., Arpinar, S., & Kochut, K. (2003). Semantic association identification and knowledge discovery for national security applications. Technical Memorandum #03-009 of the LSDIS, University of Georgia.

Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., & Warke, Y. (2002, July/August). Managing semantic content for the Web. IEEE Internet Computing, 80-87.

Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. SIGKDD Explorations Special Issue on Multi-Relational Data Mining, 5(1), 45-52.

White, S., & Smyth, P. (2003). Algorithms for estimating relative importance in networks. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 266-275).

Yoshida, K., Motoda, H., & Indurkhya, N. (1994). Graph based induction as a unified learning framework. Journal of Applied Intelligence, 4, 297-328.

Zhang, Z., Salerno, J., & Yu, P. (2003). Applying data mining in investigating money laundering crimes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 747-752).

KEY TERMS

Graph: In mathematics, a set of vertices or nodes which are connected by links or edges. A pair of vertices connected by multiple edges yields a multi-graph; vertices connected to themselves via a looping edge yield a pseudo-graph.

Graph-Based Data Mining: A method of data mining which is used to find novel, useful, and understandable patterns in graph representations of data.

Link Mining: A method of data mining that combines techniques from link analysis, hypertext and Web mining, relational learning and inductive logic programming, and graph mining. Link mining places primary emphasis on links, and is used in both predictive and descriptive modeling.

Ontology: A formal specification in a structured format, such as XML or RDF, of the concepts that exist within a given area of interest and the semantic relationships among those concepts.

Semantic Associations: “The associations that lend meaning to information, making it understandable and actionable, and providing new and possibly unexpected insights” (Aleman-Meza, 2003).

Semantic Context: The specification of the concepts particular to a domain that help to determine the interpretation of a document.

Semantic Data Mining: A method of data mining which is used to find novel, useful, and understandable patterns in data, and incorporates semantic information from a field into the mining process.

Semantic Metadata: Metadata that describes the content of a document, within the context of a particular domain of knowledge. For example, for documents relating to the homeland security domain, semantic metadata may include terrorist names, group affiliations, and so on.

Semantic Web: An extension of the current World Wide Web, proposed by Tim Berners-Lee, in which information is given a well-defined meaning. The Semantic Web would allow software agents, as well as humans, to access and process information content.

Syntactic Metadata: Metadata that describes a document's structure and/or format. For example, document language, document size, and MIME type might all be included as elements of syntactic metadata.
Semi-Structured Document Classification

Ludovic Denoyer, University of Paris VI, France
Patrick Gallinari, University of Paris VI, France
INTRODUCTION

Document classification has developed over the last 10 years, using techniques originating from the pattern recognition and machine-learning communities. All these methods operate on flat text representations, where word occurrences are considered independent. The recent paper by Sebastiani (2002) gives a very good survey on textual document classification. With the development of structured textual and multimedia documents and with the increasing importance of structured document formats like XML, the document nature is changing. Structured documents usually have a much richer representation than flat ones. They have a logical structure. They are often composed of heterogeneous information sources (e.g., text, image, video, metadata, etc.). Another major change with structured documents is the possibility to access document elements or fragments. The development of classifiers for structured content is a new challenge for the machine-learning and IR communities. A classifier for structured documents should be able to make use of the different content information sources present in an XML document and to classify both full documents and document parts. It should adapt easily to a variety of different sources (e.g., different document type definitions). It should be able to scale with large document collections.
BACKGROUND

Handling structured documents for different IR tasks is a new domain that recently has attracted increasing attention. Most of the work in this new area has concentrated on ad hoc retrieval. Recent SIGIR workshops (2000, 2002, 2004) and journal issues (Baeza-Yates et al., 2002; Campos et al., 2004) were dedicated to this subject. Most teams involved in this research gather around the recent initiative for the development and the evaluation of XML IR systems (INEX), which was launched in 2002. Besides this mainstream of research, some work is also developing around other generic IR problems like clustering and classification for structured documents. Clustering mainly
has been dealt with in the database community, focusing on structure clustering and ignoring the document content (Termier et al., 2002; Zaki & Aggarwal, 2003). Structured document classification, the focus of this article, is discussed in greater length below. Most papers dealing with structured documents classification propose to combine flat text classifiers operating on distinct document elements in order to classify the whole document. This has been developed mainly for the categorization of HTML pages. Yang, et al. (2002) combine three classifiers operating respectively on the textual information of a page and on titles and hyperlinks. Cline (1999) maps a structured document onto a fixed-size vector, where each structural entity (title, links, text, etc.) is encoded into a specific part of the vector. Dumais and Chen (2000) make use of the HTML tags information to select the most relevant part of each document. Chakrabarti, et al. (1998) use the information contained in neighboring documents of HTML pages. All these methods rely explicitly on the HTML tag semantic (i.e., they need to know whether tags correspond to a title, a link, a reference, etc.). They cannot adapt to more general structured categorization tasks. Most models rely on a vectorial description of the document and do not offer a natural way for dealing with document fragments. Our model is not dependent on the semantic of the tags and is able to learn which parts of a document are relevant for the classification task. A second family of models uses more principled approaches for structured documents. Yi and Sundaresan (2000) developed a probabilistic model for tree-like document classification. This model makes use of local word frequencies specific to each node, so that it faces a very severe estimation problem for these local probabilities. Diligenti, et al. (2001) proposed the Hidden Tree Markov Model (HTMM), which is an extension of HMMs, to treelike structures. They performed tests on the WebKB collection, showing a slight improvement over Naive Bayes (1%). Outside the field of information retrieval, some related models also have been proposed. The hierarchical HMM (Fine et al., 1998) (HHMM) is a generalization of HMMs, where hidden nodes emit sequences instead of symbols for classical HMMs. The HHMM is aimed at discovering substructures in sequences instead of processing structured data.
Generative models have been used for flat document classification and clustering for a long time. Naive Bayes (Lewis, 1998) is one of the most used text classifiers, and different extensions have been proposed (Koller & Sahami, 1997). Probabilistic models with latent variables have been used recently for text clustering, classification, or mapping by different authors. (Vinokourov & Girolami, 2001; Cai & Hofmann, 2003). Blei and Jordan (2003) describe similar models for learning the correspondence among images or image regions and image captions. All these models do not handle structured representations. Finally, Bayesian networks have been used for the task of ad hoc retrieval, both for flat documents (Callan et al., 1992) and for structured documents (Myaeng et al., 1998; Piwowarski et al., 2002). This is different from classification, since the information need is not specified in advance. The models and problems are, therefore, different from those discussed here.
MAIN THRUST

We describe a generative model for the classification of structured documents. Each document will be modeled by a Bayesian network. Classification then will amount to performing inference in this network. The model is able to take into account the structure of the document and different types of content information. It also allows one to perform inference either on whole documents or on document parts taken in their context, which goes beyond the capabilities of classical classifier schemes. The elements we consider are defined by the logical structure of the document. They typically correspond to the different components of an XML document. In this article, we introduce structured documents and the core Bayesian network model. We then briefly summarize some experimental results and describe possible extensions of the model.

Structured Document

We will consider that a document is a tree, where each node represents a structural entity. This corresponds to the usual representation of an XML document. A node will contain two types of information:
• A label information that represents the type of structural entity. A label could be, for example, paragraph, section, introduction, or title. Labels depend on the document's corpora; for XML documents, they are usually defined in the DTD.
• A content information. For a multimedia document, this could be text, image, or signal. For a textual document node with the label paragraph, the node content will be the paragraph text.

We will refer to structural and content nodes for these two types of information. Figure 1 gives an example for a simple textual document. We will consider only textual documents here. Extensions for multimedia documents are considered in Denoyer and Gallinari (2004a).
Figure 1. A tree representation for a structured document composed of an introduction and two sections. Circle and square nodes are respectively structural and content nodes
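As a concrete illustration of this representation, the following Python sketch builds such a labeled tree from a small XML document; the sample document and the nested-dictionary encoding are illustrative assumptions, not the authors' data structures.

```python
import xml.etree.ElementTree as ET

# A toy document loosely mirroring Figure 1: an introduction followed by two sections.
XML_DOC = """
<document>
  <introduction>The model handles structure and content.</introduction>
  <section><title>Views</title><paragraph>Forward and backward rules.</paragraph></section>
  <section><paragraph>Experiments on INEX and WebKB.</paragraph></section>
</document>
"""

def to_structured_doc(element):
    """Return the (label, text, children) tree used as input by the sketches below."""
    text = (element.text or "").strip()
    children = [to_structured_doc(child) for child in element]
    return {"label": element.tag, "text": text, "children": children}

doc_tree = to_structured_doc(ET.fromstring(XML_DOC))
print(doc_tree["label"], [child["label"] for child in doc_tree["children"]])
```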
Modeling Documents With Bayesian Networks

Let us first introduce some notations:
• Let $C$ be a discrete random variable that represents a class from the set of classes $\mathcal{C}$.
• Let $\Lambda$ be the set of all the possible labels for a structural node.
• Let $V$ be the set of all possible words. $V^*$ denotes the set of all possible word sequences, including the empty one.
• Let $d$ be a structured document consisting of a set of features $(s_d^1, \ldots, s_d^{|d|}, t_d^1, \ldots, t_d^{|d|})$, where $s_d^i$ is the label of the i-th structural node of $d$ ($s_d^i \in \Lambda$), $t_d^i$ is the textual content of this i-th node ($t_d^i \in V^*$), and $|d|$ is the number of structural nodes. $d$ is a realization of a random vector $D$.

In the following, all nodes are supposed to have a unique identifier, indicated as superscript $i$.
Bayesian networks offer a suitable framework for modeling the dependencies and relations among the different elements in a structured document. We will associate a network model to each document. Since we focus here on the logical document structure, each network will be de-
fined according to the corresponding document structure. For our classification task, the network parameters will be learned on all the documents from the same class in the training set. Documents from the same class will then share their parameters; there is one set of such parameters for each class. Different networks could be used for modeling a document, depending on which type of relation we want to take into account. We consider here the explicit document structure, and we will not try to uncover any hidden structure among the document elements. Some of the natural relations that could then be modeled are: “is a descendant of” in the document tree; “is a sibling of”; “is a successor of,” given a preorder visit of the document tree; and combinations of these different possibilities. Tests we performed using different types of relations and models of different complexity did not show a clear superiority of one model over the others with respect to classification performance. For simplifying the description, we will then consider tree-like Bayesian networks. The network structure is built from the document tree but need not be identical to this tree. Note that this is not restrictive, and all the derivations in the article can be extended easily to BNs with no cycles. Figure 2 shows a simple BN that encodes the “is a descendant of” relation and whose structure is similar to the document tree structure.
Figure 2. A final Bayesian network encoding “is a descendant of” relation
Figure 3. The final document subnet. In the full Bayesian network, all nodes also have node c for parent
A Tree-Like Model for Structured Document Classification

For this model, we make the following assumptions:
• There are two types of variables corresponding to structure and content nodes.
• Each structure node may have zero or many structure subnodes and zero or one content node.
• Each feature of the document depends on the class c we are interested in.
• Each structural variable $s_d^i$ depends on its parent $pa(s_d^i)$ in the document network.
• Each content variable $t_d^i$ depends only on its structural variable.
The generative process for the model corresponds to a recursive application of the following process: at each structural node s, one chooses a number of structural subnodes, which could be zero, and the length of the textual part, if any. Subnode labels and words are then sampled from their respective distributions, which depend on s and the document class. The document depth
could be another parameter of the model. Document length and depth distributions are omitted in our model, since the corresponding terms fall out for the classification problems considered here. Using such a network, we can write the joint content and structure probability:

$$P(d, c) = P(c)\,\underbrace{\prod_{i=1}^{|d|} P(s_d^i \mid pa(s_d^i), c)}_{(a)}\;\underbrace{\prod_{i=1}^{|d|} P(t_d^i \mid s_d^i, c)}_{(b)} \qquad (1)$$
where (a) and (b) respectively correspond to structural and textual probabilities. Structural probabilities $P(s_d^i \mid pa(s_d^i), c)$ can be estimated directly from data using some smooth estimator. Since $t_d^i$ is defined on the infinite set $V^*$, we shall make an additional hypothesis for estimating the textual probabilities $P(t_d^i \mid s_d^i, c)$. In the following, we use a Naive Bayes model for text fragments. This is not a major option, and other models could do as well. Let us define $t_d^i$ as the sequence of words $t_d^i = (w_d^{i,1}, \ldots, w_d^{i,|t_d^i|})$, where $w_d^{i,k} \in V$ and $|t_d^i|$ is the number of word occurrences (i.e., the length of $t_d^i$).
Using Naive Bayes for the textual probability, the joint probability for this model is then:

$$P(d, c) = P(c) \prod_{i=1}^{|d|} P(s_d^i \mid pa(s_d^i), c) \prod_{i=1}^{|d|} \prod_{k=1}^{|t_d^i|} P(w_d^{i,k} \mid s_d^i, c) \qquad (2)$$
Figure 3 shows the final belief network obtained for the document in Figure 1. For clarity, the class variable is omitted.
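To make Equation (2) concrete, the sketch below scores a document tree for each class and picks the most probable one. It reuses the nested-dictionary tree format from the earlier XML sketch and assumes smoothed parameter tables that cover every (label, parent) pair and every (word, label) pair, which is an illustrative simplification rather than the authors' implementation.

```python
import math

def log_p_doc_class(node, parent_label, class_params):
    """Log of Equation (2) for one class: a recursive product over structural nodes.

    `class_params` is assumed to hold smoothed tables p_label[(label, parent_label)]
    and p_word[(word, label)]; the root is scored with parent_label = None.
    """
    logp = math.log(class_params["p_label"][(node["label"], parent_label)])
    for word in node["text"].split():
        logp += math.log(class_params["p_word"][(word, node["label"])])
    for child in node["children"]:
        logp += log_p_doc_class(child, node["label"], class_params)
    return logp

def classify(doc_tree, params_by_class, priors):
    """Return the class maximizing log P(c) + log P(d | c)."""
    scores = {c: math.log(priors[c]) + log_p_doc_class(doc_tree, None, p)
              for c, p in params_by_class.items()}
    return max(scores, key=scores.get)
```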
Classifying Document Parts

Suppose now that d is a large heterogeneous document and that fragments of d correspond to different predefined classes. We could be interested in classifying any subpart d' of d into one of these classes. If d' corresponds to a subtree of d, and if we consider d' out of any context, we simply use Equation (2), replacing d with d'. We could also be interested in classifying d' within the context of document d. For this, we need to compute $P(d', c \mid d_{\setminus d'})$, where $d_{\setminus d'}$ represents d with d' removed. Let s' be the structural node that is the father of the root node of d'. We get $P(d', c \mid d_{\setminus d'}) = P(d', c \mid s')$, which can be estimated via:
$$P(d', c \mid s') = P(c) \prod_{i=k'}^{|d'|+k'} P(s_d^i \mid pa(s_d^i), c) \prod_{i=k'}^{|d'|+k'} \prod_{k=1}^{|t_d^i|} P(w_d^{i,k} \mid s_d^i, c) \qquad (3)$$
where k' is the index for the root of d', and structure nodes are supposed ordered according to a pre-order traversal of the tree. The interesting thing here is that by computing $P(d, c)$, one automatically gets $P(d', c \mid d_{\setminus d'})$, since both quantities make use of the same probabilities and probability estimates. If d' corresponds to a partial subtree of d instead of a full subtree, or to different subtrees in d, one gets a similar expression by limiting the structure and content terms in the products in Equation (3) to those in d'. Classifying d' fragments is easily performed with this generative classifier. This compositionality property (carrying out global computations by combining local ones) is achieved in this model via the probabilistic conditional independence assumptions. Compositionality is an important property for structured document classification. It usually is not shared by discriminant classifiers. Training discriminant classifiers on document fragments might be prohibitive when the number of fragments is large (e.g., the INEX corpus has about 16 K documents and 8 M fragments).
Learning

In order to estimate the joint probability of each document and each class, the parameters must be learned from a training set of documents. Let us define the $\theta$ parameters as:

$$\theta = \bigcup_{n \in \Lambda,\, m \in \Lambda} \theta_{n,m}^{c,s} \;\cup \bigcup_{n \in V,\, m \in \Lambda} \theta_{n,m}^{c,w} \qquad (4)$$
where $\theta_{n,m}^{c,s}$ is the estimate of $P(s_d^i = n \mid pa(s_d^i) = m, c)$ and $\theta_{n,m}^{c,w}$ is the estimate of $P(w_d^{i,k} = n \mid s_d^i = m, c)$; $s$ in $\theta^{c,s}$ indicates a structural parameter and $w$ in $\theta^{c,w}$ a textual parameter. There is one such set of parameters for each class. For learning the $\theta$s using the set of training documents $D_{TRAIN}$, we maximize the log-likelihood $L$ for $D_{TRAIN}$:

$$L = \sum_{d \in D_{TRAIN}} \left( \log P(c) + \sum_{i=1}^{|d|} \log \theta_{s_d^i,\, pa(s_d^i)}^{c,s} + \sum_{i=1}^{|d|} \sum_{k=1}^{|t_d^i|} \log \theta_{w_d^{i,k},\, s_d^i}^{c,w} \right) \qquad (5)$$
The learning algorithm solves, for each parameter $\theta_{n,m}^{c,\cdot}$ (where "$\cdot$" stands for $s$ or $w$), the following equation:

$$\frac{\partial L}{\partial \theta_{n,m}^{c,\cdot}} = 0 \quad \text{under the constraints} \quad \forall m \in \Lambda: \sum_{n \in \Lambda} \theta_{n,m}^{c,s} = 1 \quad \text{and} \quad \forall m \in \Lambda: \sum_{n \in V} \theta_{n,m}^{c,w} = 1 \qquad (6)$$
This equation has an analytical solution (Denoyer & Gallinari, 2004a). In summary, this generative classifier can cope with both content and structure information. It allows one to perform inference on the different nodes and subtrees of the network. Document parts then can be classified in the context of the whole document. More generally, decisions can be made by taking into account only a subpart of the document or when information is missing in the document. Denoyer and Gallinari (2004a) describe how this model can take into account multimedia documents (text and image) and show how to extend it into a discriminant classifier using the formalism of Fisher Kernels.
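The analytical solution amounts to normalized counts of label transitions and of words per label. The sketch below illustrates this for a single class, again using the nested-dictionary tree format from the earlier sketches; the Laplace smoothing and all identifiers are illustrative choices, not those of Denoyer and Gallinari (2004a).

```python
from collections import Counter

def estimate_parameters(docs_of_class, labels, vocabulary, alpha=1.0):
    """Closed-form (smoothed count) estimates of the theta parameters for one class.

    `labels` plays the role of the label set Lambda, `vocabulary` the role of V,
    and `docs_of_class` are trees in the {"label", "text", "children"} format.
    """
    label_counts, word_counts = Counter(), Counter()

    def visit(node, parent_label):
        label_counts[(node["label"], parent_label)] += 1
        for word in node["text"].split():
            word_counts[(word, node["label"])] += 1
        for child in node["children"]:
            visit(child, node["label"])

    for doc in docs_of_class:
        visit(doc, None)

    parents = set(labels) | {None}
    p_label = {(n, m): (label_counts[(n, m)] + alpha) /
               (sum(label_counts[(k, m)] for k in labels) + alpha * len(labels))
               for n in labels for m in parents}
    p_word = {(w, m): (word_counts[(w, m)] + alpha) /
              (sum(word_counts[(v, m)] for v in vocabulary) + alpha * len(vocabulary))
              for w in vocabulary for m in labels}
    return {"p_label": p_label, "p_word": p_word}
```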
EXPERIMENTS

Denoyer and Gallinari (2004a) describe experiments on three medium-sized corpora: INEX (about 15,000 scientific articles in XML, 18 classes corresponding to journals), WebKB (4,520 HTML pages, six classes), and NetProtect (19,652 HTML pages with text and images, two classes). The BN model scales well on these corpora and outperforms Naïve Bayes, with improvements ranging from 2% to 6% (macro-averaged and micro-averaged recall) for whole-document classification. These experiments validate the model experimentally and show the importance of taking into account both content and structure when classifying structured documents, even for basic whole-document classification tasks. The model also performs well for document fragment classification.
FUTURE TRENDS

We have presented a generative model for structured documents. It is based on Bayesian networks and allows one to model the structure and the content of documents. Tests show that the model behaves well in a variety of situations. Further investigations are needed for analyzing its behavior on document fragment classification. The model also could be modified for learning implicit relations among document elements besides using the explicit structure, so that the BN structure itself is learned. An interesting aspect of the generative model is that it could be used for other tasks relevant to IR. It could serve as a basis for clustering structured documents. The natural solution is to consider a mixture of Bayesian network models, where parameters depend on the mixture component instead of the class, as is the case here. Two other important problems are schema-mapping and automatic document structuring. These new tasks currently are being investigated in the database and IR communities. The potential of the model for performing inference on document parts when information is missing in the document will be helpful for this type of application. Preliminary experiments about automatic structuring of documents are described in Denoyer, et al. (2004b).
REFERENCES

Baeza-Yates, R., Carmel, D., Maarek, Y., & Soffer, A. (Eds.) (2002). Journal of the American Society for Information Science and Technology (JASIST).

Blei, D.M., & Jordan, M.I. (2003). Modeling annotated data. Proceedings of the SIGIR.
Cai, L., & Hofmann, T. (2003). Text categorization by boosting automatically extracted concepts. Proceedings of the SIGIR. Callan, J.P., Croft, W.B., & Harding, S.M. (1992). The INQUERY retrieval system. Proceedings of the DEXA. Campos, L.M., Fernandez-Luna, J.M., & Huete, J.F (Ed) (2004). Information Processing & Management, 40(5). Chakrabarti, S., Dom, B.E., & Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. Proceedings of the ACM-SIGMOD-98. Cline, M. (1999). Utilizing HTML structure and linked pages to improve learning for text categorization [undergraduate thesis]. University of Texas. Denoyer, L., & Gallinari, P. (2004a). Bayesian network model for semi-structured document classification. In L.M. Campos et al. Denoyer, L., Wisniewski, G., & Gallinari, P. (2004b). Document structure matching for heterogeneous corpora. Proceedings of SIGIR 2004, Workshop on Information Retrieval and XML, Sheffield, UK. Diligenti, M., Gori, M., Maggini, M., & Scarselli, F. (2001). Classification of HTML documents by hidden tree-markov models. Proceedings of ICDAR. Dumais, S.T., & Chen, H. (2000). Hierarchical classification of Web content. Proceedings of SIGIR-00. Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden markov model: Analysis and applications. Machine Learning, 32(1), 41-62. Fuhr, N., Govert, N., Kazai, G., & Lalmas, M. (2002). INEX: Initiative for the evaluation of XML retrieval. Proceedings of ACM SIGIR 2002 Workshop on XML and Information Retrieval. Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of the ICML. Lewis, D.D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. Proceedings of the ECML-98. Myaeng, S.H., Jang, D.-H., Kim, M.-S., & Zhoo, Z.-C. (1998). A flexible model for retrieval of SGML documents. Proceedings of the SIGIR. Piwowarski, B., Faure, G., & Gallinari, P. (2002). Bayesian networks and INEX. Proceedings of the First Annual Workshop of the Initiative for the Evaluation of XML retrieval (INEX), Dagstuhl, Germany.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. Termier, A., Rousset, M., & Sebag, M. (2002). Treefinder: A first step towards XML data mining. Proceedings of the ICDM. Vinokourov, A., & Girolami, M. (2001). Document classification employing the Fisher kernel derived from probabilistic hierarchic corpus representations. Proceedings of ECIR-01. Yang, Y., Slattery, S., & Ghani, R. (2002). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3), 219-241. Yi, J., & Sundaresan, N. (2000). A classifier for semistructured documents. Proceedings of the Sixth ACM SIGKDD. Zaki, M.J., & Aggarwal, C.C. (2003). Xrules: An effective structural classifier for XML data. Proceedings of the SIGKDD 03, Washington, D.C.
KEY TERMS

Bayesian Network: A directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.

Information Retrieval (IR): The art and science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases, whether relational stand-alone databases or hypertext-networked databases such as the Internet or intranets, for text, sound, images, or data.

Machine Learning: An area of artificial intelligence involving developing techniques to allow computers to learn. More specifically, machine learning is a method for creating computer programs by the analysis of data sets rather than the intuition of engineers.

Multimedia: Data combining several different media, such as text, images, sound, and video.

Probabilistic Model: A classic model of document retrieval based on a probabilistic interpretation of document relevance (to a given user query).

Semi-Structured Data: Data whose structure may not match, or may only partially match, the structure prescribed by the data schema.

XML (Extensible Markup Language): A W3C recommendation for creating special-purpose markup languages. It is a simplified subset of SGML, capable of describing many different kinds of data. Its primary purpose is to facilitate the sharing of structured text and information across the Internet.
Semi-Supervised Learning

Tobias Scheffer, Humboldt-Universität zu Berlin, Germany
INTRODUCTION

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data. Semi-supervised learning (for an example, see Seeger, 2001) has a long tradition in statistics (Cooper & Freeman, 1970); much early work has focused on Bayesian discrimination of Gaussians. The Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is the most popular method for learning generative models from labeled and unlabeled data. Model-based, generative learning algorithms find model parameters (e.g., the parameters of a Gaussian mixture model) that best explain the available labeled and unlabeled data, and they derive the discriminating classification hypothesis from this model. In discriminative learning, unlabeled data is typically incorporated via the integration of some model assumption into the discriminative framework (Miller & Uyar, 1997; Titterington, Smith, & Makov, 1985). The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This identification results in a bias toward placing the hyperplane in regions of low density $p(x)$. Recently, studies have covered graph-based approaches that rely on the assumption that neighboring instances are more likely to belong to the same class than remote instances (Blum & Chawla, 2001). A distinct approach to utilizing unlabeled data has been proposed by de Sa (1994), Yarowsky (1995), and Blum and Mitchell (1998). When the available attributes can be split into independent and compatible subsets, then multi-view learning algorithms can be employed. Multi-view algorithms, such as co-training (Blum & Mitchell, 1998) and co-EM (Nigam & Ghani, 2000), learn two independent hypotheses, which bootstrap by providing each other with labels for the unlabeled data. An analysis of why training two independent hypotheses that provide each other with conjectured class labels for unlabeled data might be better than EM-like self-training has been provided by Dasgupta, Littman,
and McAllester (2001) and has been simplified by Abney (2002). The disagreement rate of two independent hypotheses is an upper bound on the error rate of either hypothesis. Multi-view algorithms minimize the disagreement rate between the peer hypotheses (a situation that is most apparent for the algorithm of Collins & Singer, 1999) and thereby the error rate. Semi-supervised learning is related to active learning. Active learning algorithms are able to actively query the class labels of unlabeled data. By contrast, semi-supervised algorithms are bound to learn from the given data.
BACKGROUND

Semi-supervised classification algorithms receive both labeled data $D_l = (x_1, y_1), \ldots, (x_{m_l}, y_{m_l})$ and unlabeled data $D_u = x_1^u, \ldots, x_{m_u}^u$, and they return a classifier $f: x \mapsto y$; the
unlabeled data is generally assumed to be governed by an underlying distribution p( x) , and the labeled data by p ( x, y ) = p ( y | x) p ( x) . Typically, the goal is to find a classifier f that minimizes the error rate with respect to p( x) . In the following sections, we distinguish between model-based approaches, mixtures of model-based and discriminative techniques, and multi-view learning. Model-based approaches can directly utilize unlabeled data to estimate p ( x, y ) more accurately. Discriminative classification techniques need to be augmented with some model-based component to make effective use of unlabeled data. Multi-view learning can be applied when the attributes can be split into two independent and compatible subsets.
Model-Based Semi-Supervised Classification

Model-based classification algorithms assume that the data are generated by a parametric mixture model $p(x, y \mid \Theta)$ and that each mixture component contains only data belonging to a single class. Under this assumption, in principle, only one labeled example per
mixture component is required (in addition to unlabeled data) to learn an accurate classifier. Estimating the parameter vector $\Theta$ from the data leads to a generative model; that is, the model $p(x, y \mid \Theta)$ can be used to draw new labeled data. In the context of classification, the main purpose of the model is discrimination. Given the model parameter, the corresponding classifier is $f_\Theta(x) = \arg\max_y p(x, y \mid \Theta)$. For instance, when $p(x, y \mid \Theta)$ is a mixture of Gaussians with equal covariance matrices, then the discriminator $f_\Theta(x)$ is a linear function; in the general Gaussian case, $f_\Theta(x)$ is a second-order polynomial. The Expectation Maximization (EM) algorithm (Dempster et al., 1977) provides a general framework for semi-supervised model-based learning — that is, for finding model parameters $\Theta$. Semi-supervised learning with EM is sketched in Table 1; after initializing the model by learning from the labeled data, it iterates two steps. In the E-step, the algorithm calculates the class probabilities for the unlabeled data based on the current model. In the M-step, the algorithm estimates a new set of model parameters from the labeled and the originally unlabeled data for which probabilistic labels have been estimated in the E-step. The EM algorithm, which is a greedy method for maximizing the likelihood $p(D_l, D_u \mid \Theta) = p(D_l \mid \Theta)\, p(D_u \mid \Theta)$ of the data, has three caveats. The first is that no obvious connection exists between the maximum likelihood model parameters $\Theta$ and the Bayesian discriminator that minimizes the conditional risk given a new instance $x$. Practical semi-supervised learning algorithms apply some form of regularization to approximate the maximum a posteriori rather than the maximum likelihood parameters. The second caveat is that the resulting parameters are a local but not necessarily the global maximum. The third caveat of semi-supervised learning with EM is more subtle: When the assumed parametric model is correct — that is, the data has, in fact, been generated by $p(x, y \mid \Theta)$ for some $\Theta$ — then it can be argued that unlabeled data will improve the accuracy of the resulting classifier $f_\Theta(x)$ under fairly reasonable assumptions (Zhang & Oles, 2000; Cozman, Cohen, & Cirelo, 2003). However, as Cozman et al. have pointed out, the situation is different when the model assumption is incorrect — that is, no $\Theta$ exists such that $p(x, y \mid \Theta)$ equals the true probability $p(x, y)$, which governs the data. In this case, the best approximation to the labeled data — $\Theta_l = \arg\max_\Theta p(D_l \mid \Theta)$ — can be a much better classifier than $f_\Theta(x)$ with $\Theta = \arg\max_\Theta p(D_l, D_u \mid \Theta)$, which approximates the labeled and unlabeled data. In other words, when the model assumption is incorrect, then semi-supervised learning with EM can generally result in poorer classifiers than supervised learning from only the labeled data. Semi-supervised learning with EM has been employed with many underlying models and for many applications, including mixtures of Gaussians and naïve Bayesian text classification (Nigam, McCallum, Thrun, & Mitchell, 2000).
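The following toy sketch illustrates the E-step/M-step loop described above (and summarized in Table 1 below), assuming one spherical Gaussian per class with a shared variance; the model choice, the fixed iteration count, and all identifiers are illustrative assumptions rather than part of any specific published algorithm.

```python
import numpy as np

def semi_supervised_em(X_l, y_l, X_u, n_iter=50):
    """Toy EM over labeled + unlabeled data: one spherical Gaussian per class, shared variance."""
    classes = np.unique(y_l)
    one_hot = np.eye(len(classes))[np.searchsorted(classes, y_l)]
    # Initialization: learn from the labeled data only.
    means = np.array([X_l[y_l == c].mean(axis=0) for c in classes])
    priors = one_hot.mean(axis=0)
    var = X_l.var() + 1e-6
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # E-step: p(y | x, theta) for the unlabeled examples.
        log_post = (-((X_u[:, None, :] - means[None, :, :]) ** 2).sum(-1) / (2 * var)
                    + np.log(priors))
        log_post -= log_post.max(axis=1, keepdims=True)
        resp_u = np.exp(log_post)
        resp_u /= resp_u.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from labeled plus probabilistically labeled data.
        resp = np.vstack([one_hot, resp_u])
        weight = resp.sum(axis=0)
        means = (resp.T @ X_all) / weight[:, None]
        priors = weight / weight.sum()
        var = ((resp * ((X_all[:, None, :] - means[None, :, :]) ** 2).sum(-1)).sum()
               / (weight.sum() * X_all.shape[1]) + 1e-6)
    return classes, means, var, priors
```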
Mixtures of Discriminative and Model-Based Learning

The answer to the question of how to utilize unlabeled data in the context of discriminative learning is not obvious. Discriminative learners, such as decision trees, logistic regression, or the Support Vector Machine, directly learn a classifier $y = f(x)$ without taking the detour via a generative model $p(x, y \mid \Theta)$. This classifier contains some information about the posterior $p(y \mid x)$ but does not contain a model of $p(x)$ that could be refined by unlabeled data. Some approaches that mix generative and discriminative models have been studied
Table 1. Semi-supervised classification with EM

Input: labeled data $D_l = (x_1, y_1), \ldots, (x_{m_l}, y_{m_l})$; unlabeled data $D_u = x_1^u, \ldots, x_{m_u}^u$.
Initialize model parameters $\Theta$ by learning from the labeled data.
Repeat until a local optimum of the likelihood $p(x, y \mid \Theta)$ is reached:
  E-step: For all unlabeled data $x_i^u$ and class labels $y$, calculate $E(f(x_i^u) = y \mid \Theta)$, the expected probability that $y$ is the class of $x_i^u$ given $\Theta$; that is, use $p(y \mid x, \Theta)$ to probabilistically label the $x_i^u$.
  M-step: Calculate the maximum likelihood parameters $\Theta = \arg\max_\Theta p(D_l, D_u \mid \text{estimated class probabilities for } D_u)$; that is, learn from the labeled and probabilistically labeled unlabeled data.
Return classifier $p(y \mid x, \Theta)$.
and use the unlabeled data to tweak their models of the class-conditional likelihood or the mixing probabilities (Miller & Uyar, 1997). A special model assumption on the class-conditional likelihood p( x | y ) underlies graph-based algorithms. It is often reasonable to assume that similar instances are more likely to have identical class labels than remote instances; clearly, the concept of similarity is very domain specific. In this situation, instances and similarities can be encoded in a graph, and mincuts of that graph correspond to optimal labelings of the unlabeled data (Blum & Chawla, 2001; Zhu, Gharamani, & Lafferty, 2003; Joachims, 2003). The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This identification is achieved by a greedy procedure that conjectures class labels for the unlabeled data and then iteratively flips the pair of conjectured class labels that yields the greatest improvement of the optimization criterion. After each flip of class labels, the hyperplane has to be retrained. This procedure results in a bias toward placing the hyperplane in regions of low density p( x) , where few instances result in small sums of the slack terms ξi .
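As a small illustration of the graph-based idea, the sketch below iteratively propagates soft labels over a Gaussian similarity graph while clamping the labeled examples. It is a toy relative of the harmonic-function approach of Zhu et al. (2003); the kernel, its bandwidth, and the fixed iteration count are illustrative choices, and this is neither the mincut formulation nor the TSVM procedure described above.

```python
import numpy as np

def label_propagation(X, y, labeled_mask, sigma=1.0, iters=200):
    """Iteratively average neighbors' soft labels over a dense Gaussian similarity graph."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    classes = np.unique(y[labeled_mask])
    F = np.zeros((len(X), len(classes)))
    F[labeled_mask] = (y[labeled_mask, None] == classes[None, :]).astype(float)

    for _ in range(iters):
        F = W @ F / W.sum(axis=1, keepdims=True)                                  # propagate
        F[labeled_mask] = (y[labeled_mask, None] == classes[None, :]).astype(float)  # clamp labels
    return classes[F.argmax(axis=1)]
```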
Multi-View Learning

Multi-view learning is a semi-supervised learning paradigm that is fundamentally different from model-based semi-supervised learning. Multi-view algorithms require the available attributes to be split into two independent subsets, or views, and either view has to be sufficient for learning the target concept. I discuss multi-view learning in the following section. The multi-view approach has also been applied to active learning (Muslea, Knoblock, & Minton, 2002).
MAIN THRUST

In this section, I discuss the multi-view framework of semi-supervised learning. In particular, I show that multi-view learning can, in a fundamentally different way, improve classification results over supervised learning, and I review some multi-view algorithms. Multi-view learning applies when the available attributes can be decomposed into two views V_1 and V_2. For instance, V_1 can be the bag-of-words representation of a Web page, whereas V_2 might consist of the inbound hyperlinks referring to the page. Multi-view algorithms
require that either view be sufficient for learning — that is, functions f1 and f 2 exist such that for all x , f1 ( x1 ) = f 2 ( x2 ) = f ( x ) , where f is the true target function. This rule is also called the compatibility assumption. In addition, the views have to be conditionally independent given the class label — that is, P( x1 , x2 | y ) = P( x1 | y ) P( x2 | y ) . In these independent views, independent classifiers f1 and f 2 can be trained. Now Abney (2002) has observed the following: For an unlabeled instance x , you
cannot decide whether f_1(x) and f_2(x) are correct or incorrect, but you can decide whether they agree or disagree. You can reasonably assume that either hypothesis has an error probability of no more than ½. For any given instance x with true class y, the probability of a disagreement is then an upper bound on the probability that either hypothesis misclassifies x. This can be shown by the following equations, which first utilize the independence, followed by the assumption that the error is at most ½, and finally the independence again:

P(f_1(x) ≠ f_2(x)) = P(f_1(x) = y, f_2(x) ≠ y) + P(f_1(x) ≠ y, f_2(x) = y)
                   ≥ max_i [ P(f_i(x) ≠ y, f_{3−i}(x) = y) + P(f_i(x) ≠ y, f_{3−i}(x) ≠ y) ]
                   = max_i P(f_i(x) ≠ y)
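A quick numerical check of this bound is easy to run. The snippet below simulates two hypotheses that err independently with error rates below ½ and compares their disagreement rate with their individual error rates; it is only an illustration of the inequality, not part of any published algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, err1, err2 = 100_000, 0.2, 0.35          # both error rates are below 1/2

# 1 means "this hypothesis is wrong on this instance"; errors drawn independently
wrong1 = rng.random(n) < err1
wrong2 = rng.random(n) < err2
disagree = wrong1 != wrong2                  # they disagree iff exactly one is wrong

print(disagree.mean())                       # about 0.41
print(max(wrong1.mean(), wrong2.mean()))     # about 0.35, below the disagreement rate
```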
This observation motivates the strategy that multi-view algorithms follow: minimize the error on the labeled data, and minimize the disagreement of the two independent hypotheses on the unlabeled data. Even though the error itself cannot be minimized on unlabeled data due to the absence of labels, by minimizing the disagreement on the unlabeled data, multi-view algorithms minimize an upper bound on the error. The most prominent learning algorithms that utilize this principle are co-training and co-EM, displayed in Table 2. Co-training can be wrapped around any learning algorithm with the ability to provide a confidence score for the classification of an instance. Here, two hypotheses bootstrap by providing each other with labels for the unlabeled examples that they are most confident about. Co-EM, the multi-view counterpart of semi-supervised learning with EM, requires the base learner to be able to infer class label probabilities for the unlabeled data. Moreover, a model has to be learned from the conjectured class probabilities for the originally unlabeled data as well as from the labeled data. Because of these requirements, co-EM has frequently been applied with naïve Bayes as an underlying learner (for an example, see Nigam & Ghani, 2000).
Table 2. Semi-supervised classification with co-training and co-EM

Input: labeled data D_l = (x_1, y_1), ..., (x_{m_l}, y_{m_l}); unlabeled data D_u = x_1^u, ..., x_{m_u}^u; the attributes are split into two views V_1 and V_2.

Algorithm Co-Training:
For v = 1..2: Learn initial classifier f_v from labeled data D_l in view v.
Repeat until D_u is empty:
  For v = 1..2: find the examples that f_v most confidently rates positive and negative, remove them from D_u and add them to D_l, labeled positive and negative, respectively.
  For v = 1..2: learn new classifier f_v from labeled data D_l in view v.

Algorithm Co-EM:
Learn parameters Θ_2 of initial classifier f(Θ, 2) from labeled data D_l in view 2.
Repeat for T iterations:
  For v = 1..2:
    M-step: Estimate class probabilities p_w(y | x, Θ_w) of the unlabeled data using the model Θ_w in the complementary view w (w ≠ v).
    E-step: Learn parameters Θ_v in the current view from labeled data D_l and unlabeled data D_u with class probabilities p_w(y | x, Θ_w).

Both algorithms return the confidence-weighted hypothesis ½ (f(Θ, 1) + f(Θ, 2)).
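A compact co-training loop in the spirit of Table 2 can be written around any probabilistic base learner. The sketch below uses scikit-learn's logistic regression as the confidence-rated classifier; the function name and the choice of adding one positive and one negative example per view and round are illustrative assumptions, not a reproduction of the original Blum and Mitchell (1998) implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, n_rounds=30):
    """Co-training for binary labels {0, 1}; X1_* and X2_* hold the two views."""
    X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
    pool = list(range(len(X1_u)))                # indices of still-unlabeled examples

    def fit_views():
        f1 = LogisticRegression(max_iter=1000).fit(X1_l, y_l)
        f2 = LogisticRegression(max_iter=1000).fit(X2_l, y_l)
        return f1, f2

    f1, f2 = fit_views()
    for _ in range(n_rounds):
        if len(pool) < 2:
            break
        for f, X_view in ((f1, X1_u), (f2, X2_u)):
            if len(pool) < 2:
                break
            proba = f.predict_proba([X_view[i] for i in pool])[:, 1]
            picks = {int(np.argmax(proba)): 1,   # most confident positive
                     int(np.argmin(proba)): 0}   # most confident negative
            for j in sorted(picks, reverse=True):
                i = pool.pop(j)                  # move the example to the labeled set
                X1_l.append(X1_u[i]); X2_l.append(X2_u[i]); y_l.append(picks[j])
        f1, f2 = fit_views()                     # retrain both views on the grown set
    return f1, f2
```

In practice the number of examples labeled per round and the stopping criterion are tuning choices; co-EM replaces the hard, confidence-ranked selection by probabilistic labels for the entire unlabeled pool.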
Recently, Brefeld and Scheffer (2004) studied a co-EM version of the Support Vector Machine; it transforms the uncalibrated decision function values into normalized probabilities and maps these probabilities to example-specific costs, which are used as weights for the error terms ξ_i in the optimization criterion. Co-EM is more effective than co-training, provided that the independence of the views is not violated (Muslea et al., 2002; Brefeld & Scheffer, 2004). Multi-view algorithms require the views to be independent; this assumption will often be violated in practice. Muslea et al. (2002) have observed that co-EM, especially, is detrimental to performance when dependence between attributes is introduced. Brefeld and Scheffer (2004) have observed co-training to be more robust against violations of the independence assumption than co-EM; Krogel and Scheffer (2004) found that even co-training deteriorates the performance of SVMs for the prediction of gene functions and localizations. This discovery raises the questions of how dependence between views can be quantified and measured, and which degree of dependence is tolerable for multi-view algorithms. It is not possible to measure whether two large sets of continuous attributes are independent. The proof of Abney (2002) is based on the assumption that the two classifiers err independently; you can measure the violation of this assumption as follows: Let E_1 and E_2 be two random variables that indicate whether f_1 and f_2 make an error for a given instance. The correlation coefficient of these random variables is defined as
Φ² = Σ_{i=0}^{1} Σ_{j=0}^{1} (P(E_1 = i, E_2 = j) − P(E_1 = i) P(E_2 = j))² / (P(E_1 = i) P(E_2 = j))
which quantifies whether these events occur independently — in this case, Φ 2 = 0 — or are dependent. In the most extreme case, when the two hypotheses always err at the same time, then Φ 2 = 1 . In experiments with gene function prediction and text classification problems, Krogel & Scheffer (2004) have found a clearly negative relationship between the benefit of co-training for a given problem and the error correlation coefficient Φ 2 of the initial classifiers. When the initial classifiers are correlated, for example, with Φ 2 ≥ 0.3 , then co-training will often deteriorate the classification result instead of improving it.
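Estimating Φ² from held-out data only requires the binary error indicators of the two initial classifiers. A minimal helper (plain NumPy; the function name is invented here) could look as follows:

```python
import numpy as np

def error_correlation(errors1, errors2):
    """Phi^2 of two binary error-indicator vectors (1 = classifier is wrong)."""
    e1 = np.asarray(errors1, dtype=int)
    e2 = np.asarray(errors2, dtype=int)
    phi2 = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p_joint = np.mean((e1 == i) & (e2 == j))
            p_marg = np.mean(e1 == i) * np.mean(e2 == j)
            if p_marg > 0:
                phi2 += (p_joint - p_marg) ** 2 / p_marg
    return phi2

# errors that always coincide give Phi^2 = 1; independent errors give values near 0
print(error_correlation([1, 1, 0, 0], [1, 1, 0, 0]))   # 1.0
```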
FUTURE TRENDS

In knowledge discovery and machine learning, one is often interested in discriminative learning. Generative models allow you to easily incorporate unlabeled data into the learning process via the EM algorithm, but model-based learners optimize the wrong utility criterion when the goal is really discriminative learning (for example, see Cozman et al., 2003). Graph-based approaches (Blum & Chawla, 2001) allow the utilization of unlabeled data for discriminative learning, under the mild model assumption that instances with identical classes are more likely to be neighbors than instances with
distinct classes. This idea is currently being investigated for a range of applications (Joachims, 2003; Zhu et al., 2003; Blum, Lafferty, Reddy, & Rwebangira, 2004). The principle of minimizing the disagreement of independent hypotheses is a simple yet powerful mechanism that allows minimization of an upper bound on the error by using only unlabeled data. Exploiting this principle for additional learning tasks such as clustering (Bickel & Scheffer, 2004; Kailing, Kriegel, Pryakhin, & Schubert, 2004), and by more effective algorithms, is a principal challenge that will lead to more powerful and broadly applicable semi-supervised learning algorithms. Algorithms that automatically analyze attribute interactions (Jakulin & Bratko, 2004) will possibly extend the scope of multi-view learning to learning problems for which independent attribute sets are not available a priori.
CONCLUSION The Expectation Maximization algorithm provides a framework for incorporating unlabeled data into modelbased learning. However, the model that maximizes the joint likelihood of labeled and unlabeled data can, in principle, be a worse discriminator than a model that was trained only on labeled data. Mincut algorithms allow the utilization of unlabeled data in discriminative learning; similar to the transductive SVM, only mild assumptions on p( x | y ) are made. The multi-view framework provides a simple yet powerful mechanism for utilizing unlabeled data: The disagreement of two independent hypotheses upper bounds the error — it can be minimized with only unlabeled data. A prerequisite of multi-view learning is two independent views; the dependence of views can be quantified and measured by the error correlation coefficient. A small correlation coefficient corresponds to a great expected benefit of multi-view learning. The co-EM algorithm is the most effective multi-view algorithm when the views are independent; co-training is more robust against violations of this independence. Only when dependencies are strong is multi-view learning detrimental.
ACKNOWLEDGMENT

The author is supported by Grant SCHE540/10-1 of the German Science Foundation DFG.

REFERENCES

Abney, S. (2002). Bootstrapping. Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Bickel, S., & Scheffer, T. (2004). Multi-view clustering. Proceedings of the IEEE International Conference on Data Mining.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. Proceedings of the International Conference on Machine Learning.
Blum, A., Lafferty, J., Reddy, R., & Rwebangira, M. (2004). Semi-supervised learning using randomized mincuts. Proceedings of the International Conference on Machine Learning.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the Conference on Computational Learning Theory.
Brefeld, U., & Scheffer, T. (2004). Co-EM support vector learning. Proceedings of the International Conference on Machine Learning.
Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. Proceedings of the Conference on Empirical Methods for Natural Language Processing.
Cooper, D., & Freeman, J. (1970). On the asymptotic improvement in the outcome of supervised learning provided by additional nonsupervised learning. IEEE Transactions on Computers, C-19, 1055-1063.
Cozman, F., Cohen, I., & Cirelo, M. (2003). Semi-supervised learning of mixture models. Proceedings of the International Conference on Machine Learning.
Dasgupta, S., Littman, M., & McAllester, D. (2001). PAC generalization bounds for co-training. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14. Cambridge, MA: MIT Press.
de Sa (1994). Learning classification with unlabeled data. Advances of Neural Information Processing Systems.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39.
Jakulin, A., & Bratko, I. (2004). Testing the significance of attribute interactions. Proceedings of the International Conference on Machine Learning.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning.
Joachims, T. (2003). Transductive learning via spectral graph partitioning. Proceedings of the International Conference on Machine Learning.
Kailing, K., Kriegel, H., Pryakhin, A., & Schubert, M. (2004). Clustering multi-represented objects with noise. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Krogel, M.-A., & Scheffer, T. (2004). Multirelational learning, text mining, and semi-supervised learning for functional genomics. Machine Learning, 57(1/2), 61-81.
Miller, D., & Uyar, H. (1997). A mixture of experts classifier with learning based on both labelled and unlabelled data. In xxx (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.
Muslea, I., Knoblock, C., & Minton, S. (2002). Active + semi-supervised learning = robust multi-view learning. Proceedings of the International Conference on Machine Learning.
Nigam, K., & Ghani, R. (2000). Analyzing the effectiveness and applicability of co-training. Proceedings of the International Conference on Information and Knowledge Management.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3).
Seeger, M. (2001). Learning with labeled and unlabeled data. Technical report, University of Edinburgh.
Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. Wiley.
Vapnik, V. (1998). Statistical learning theory. Wiley.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Zhang, T., & Oles, F. (2000). A probability analysis on the value of unlabeled data for classification problems. Proceedings of the International Conference on Machine Learning.
Zhu, X., Gharamani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the International Conference on Machine Learning.

KEY TERMS

Compatibility: Views V_1 and V_2 are compatible if functions f_1 and f_2 exist such that for all x, f_1(x_1) = f_2(x_2) = f(x), where f is the true target function.

Independence: Views V_1 and V_2 are conditionally independent given the class if for all x = (x_1, x_2), P(x_1, x_2 | y) = P(x_1 | y) P(x_2 | y).

Labeled Data: A sequence of training instances with corresponding class labels, where the class label is the value to be predicted by the hypothesis.

Multi-View Learning: A family of semi-supervised or unsupervised learning algorithms that can be applied when instances are represented by two sets of features, provided that these sets are conditionally independent given the class and that either set suffices to learn the target concept. By minimizing the disagreement between two independent classifiers, multi-view algorithms minimize an upper bound on the error rate that can be determined without reference to labeled data.

Semi-Supervised Classification: The task of learning a mapping from instances to one of finitely many class labels, coming from labeled data consisting of a sequence of instance-class pairs and unlabeled data consisting of just a sequence of instances.

Supervised Learning: The task of learning a mapping from instances to function values (possibly class labels) from a sequence of pairs of instances and function values.

Unlabeled Data: A sequence of training instances without corresponding class labels.

Unsupervised Learning: The task of learning a model that describes a given data set where the attribute of interest is not available in the data. Often, the model is a mixture model, and the mixture component, from which each instance has been drawn, is not visible in the data.

View: In multi-view learning, the available attributes are partitioned into two disjoint subsets, or views, which are required to be independent and compatible.
Sequential Pattern Mining

Florent Masseglia, INRIA Sophia Antipolis, France
Maguelonne Teisseire, University of Montpellier II, France
Pascal Poncelet, Ecole des Mines d'Alès, France
INTRODUCTION

Sequential pattern mining deals with data represented as sequences (a sequence contains sorted sets of items). Compared to the association rule problem, a study of such data provides “inter-transaction” analysis (Agrawal & Srikant, 1995). Applications for sequential pattern extraction are numerous, and the problem definition has been slightly modified in different ways. Together with elegant solutions, these problem variants match real-life timestamped data (where association rules fail) and provide useful results.
BACKGROUND

In (Agrawal & Srikant, 1995) the authors assume that we are given a database of customer transactions, each of which has the following characteristics: a sequence-id or customer-id, a transaction time, and the items involved in the transaction. Such a database is called a base of data sequences. More precisely, each transaction is a set of items (itemset) and each sequence is a list of transactions ordered by transaction time. To aid decision-making efficiently, the aim is to obtain typical behaviors according to the user’s viewpoint. Performing such a task requires providing each data sequence in the database with a support value giving its number of actual occurrences in the database. A frequent sequential pattern is a sequence whose statistical significance in the database is above a user-specified threshold. Finding all the frequent patterns from huge data sets is a very time-consuming task. In the general case, the examination of all possible combinations is intractable, and new algorithms are required to focus on those sequences that are considered important to an organization.
develop marketing and product strategies. Through Web log analysis, such patterns are very useful for better structuring a company’s website and providing easier access to the most popular links (Kosala & Blockeel, 2000). Other application domains include telecommunication network alarm databases, intrusion detection (Hu & Panda, 2004), DNA sequences (Zaki, 2003), and so on.
MAIN THRUST

Definitions related to sequential pattern extraction are given first. They will help in understanding the various problems and methods presented hereafter.
Definitions The item is the basic value for numerous data mining problems. It can be considered as the object bought by a customer, or the page requested by the user of a website, etc. An itemset is the set of items that are grouped by timestamp (e.g. all the pages requested by the user on June 04, 2004). A data sequence is a sequence of itemsets associated to a customer. In table 1, the data sequence of C2 is the following: “(Camcorder, MiniDV) (DVD Rec, DVD-R) (Video Soft)” which means that the customer bought a camcorder and miniDV the same day, followed by a DVD recorder and DVD-R the day after, and finally a video software a few days later. A sequential pattern is included in a data sequence (for instance “(MiniDV) (Video Soft)” is included in the data sequence of C2, whereas “(DVD Rec) (Camcorder)” is not included according to the order of the timestamps). The minimum support is specified by the user and stands for the minimum number of occurrences of a sequential pattern to be considered as frequent. A maximal frequent sequential pattern is included in at least “minimum support” data sequences and is not included in any other frequent sequential pattern. Table 1 gives a simple example of 4 customers and their
Table 1. Data sequences of four customers over four days

Cust | June 04, 2004     | June 05, 2004  | June 06, 2004  | June 07, 2004
C1   | Camcorder, MiniDV | Digital Camera | MemCard        | USB Key
C2   | Camcorder, MiniDV | DVD Rec, DVD-R |                | Video Soft
C3   | DVD Rec, DVD-R    | MemCard        | Video Soft     | USB Key
C4   | Camcorder, MiniDV | Laptop         | DVD Rec, DVD-R |
activity over 4 days in a shop. With a minimum support of “50%” a sequential pattern can be considered as frequent if it occurs at least in the data sequences of 2 customers (2/4). In this case a maximal sequential pattern mining process will find three patterns:
• S1: “(Camcorder, MiniDV) (DVD Rec, DVD-R)”
• S2: “(DVD Rec, DVD-R) (Video Soft)”
• S3: “(Memory Card) (USB Key)”
One can observe that S1 is included in the data sequences of C2 and C4, S2 is included in those of C2 and C3, and S3 in those of C1 and C2. Furthermore the sequences do not have the same length (S1 has length 4, S2 has length 3 and S3 has length 2).
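The inclusion test and the support count behind these statements are straightforward to express in code. The following self-contained snippet is illustrative only; it mirrors the Table 1 data as reconstructed above and uses invented helper names.

```python
def contains(data_sequence, pattern):
    """True if `pattern` (a list of itemsets) is included in `data_sequence`,
    respecting the order of the itemsets."""
    pos = 0
    for itemset in pattern:
        while pos < len(data_sequence) and not set(itemset) <= set(data_sequence[pos]):
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1
    return True

def support(database, pattern):
    return sum(contains(seq, pattern) for seq in database) / len(database)

database = {
    "C1": [{"Camcorder", "MiniDV"}, {"Digital Camera"}, {"MemCard"}, {"USB Key"}],
    "C2": [{"Camcorder", "MiniDV"}, {"DVD Rec", "DVD-R"}, {"Video Soft"}],
    "C3": [{"DVD Rec", "DVD-R"}, {"MemCard"}, {"Video Soft"}, {"USB Key"}],
    "C4": [{"Camcorder", "MiniDV"}, {"Laptop"}, {"DVD Rec", "DVD-R"}],
}

S1 = [{"Camcorder", "MiniDV"}, {"DVD Rec", "DVD-R"}]
print(support(list(database.values()), S1))   # 0.5, i.e. frequent at a 50% threshold
```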
Methods for Mining Sequential Patterns

The problem of mining sequential patterns is stated in (Agrawal & Srikant, 1995) and improved, both for the problem and the method, in (Srikant & Agrawal, 1996). In the latter, the GSP algorithm is based on a breadth-first principle since it is an extension of the A-priori model to the sequential aspect of the data. GSP uses the “Generating-Pruning” method defined in (Agrawal, Imielinski, & Swami, 1993) and performs in the following way. A candidate sequence of length (k+1) is generated from two frequent sequences, s1 and s2, having length k, if the subsequence obtained by pruning the first item of s1 is the same as the subsequence obtained by pruning the last item of s2. With the example in Table 1, and k=2, let s1 be “(DVD Rec, DVD-R)” and s2 be “(DVD-R) (Video Soft)”; then the candidate sequence will be “(DVD Rec, DVD-R) (Video Soft)”, since the subsequence described above (common to s1 and s2) is “(DVD-R)”. Another method based on the Generating-Pruning principle is PSP (Masseglia, Cathala, & Poncelet, 1998). The main difference to GSP is that the candidates as well as the frequent sequences are managed in a more efficient structure. The methods presented so far are designed to depend as little as possible on main memory. The methods presented thereafter need to load the database (or a rewriting of the database) in main memory. This results in efficient methods when the database can fit into the memory. In (Zaki, 2001), the authors proposed the SPADE algorithm. The main idea in this method is a clustering
of the frequent sequences based on their common prefixes and the enumeration of the candidate sequences, thanks to a rewriting of the database (loaded in main memory). SPADE needs only three database scans in order to extract the sequential patterns. The first scan aims at finding the frequent items, the second at finding the frequent sequences of length 2, and the last one associates with each frequent sequence of length 2 a table of the corresponding sequence ids and itemset ids in the database (e.g., the data sequences containing the frequent sequence and the corresponding timestamps). Based on this representation in main memory, the support of the candidate sequences of length k is the result of join operations on the tables related to the frequent sequences of length (k-1) able to generate this candidate (so, every operation after the discovery of frequent sequences having length 2 is done in memory). SPAM (Ayres, Flannick, Gehrke, & Yiu, 2002) is another method which needs to represent the database in the main memory. The authors proposed a vertical bitmap representation of the database for both candidate representation and support counting. An original approach for mining sequential patterns aims at recursively projecting the data sequences into smaller databases. Proposed in (Han, et al., 2000), FreeSpan is the first algorithm considering the pattern-projection method for mining sequential patterns. This work has been continued with PrefixSpan (Pei, et al., 2001), based on a study about the number of candidates proposed by a Generating-Pruning method. Starting from the frequent items of the database, PrefixSpan generates projected databases with the remaining data sequences. The projected databases thus contain suffixes of the data sequences from the original database, grouped by prefixes. The process is recursively repeated until no frequent item is found in the projected database. At this level, the frequent sequential pattern is the path of frequent items leading to this projected database.
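The pattern-growth idea is easiest to see in code. The sketch below is a deliberately simplified, illustrative prefix-projection miner: it grows patterns only by appending a new single-item itemset (so it ignores the itemset-extension step of the full PrefixSpan algorithm) and is not tuned for performance.

```python
from collections import defaultdict

def prefixspan(database, min_support, prefix=None):
    """database: list of sequences, each a list of itemsets (sets of items).
    Returns (pattern, support) pairs; patterns are grown by sequence extension only."""
    prefix = prefix or []
    patterns = []
    # count, per item, in how many sequences of the (projected) database it occurs
    counts = defaultdict(int)
    for seq in database:
        for item in {i for itemset in seq for i in itemset}:
            counts[item] += 1
    for item, count in counts.items():
        if count < min_support:
            continue
        new_prefix = prefix + [{item}]
        patterns.append((new_prefix, count))
        # project: keep, for every sequence containing the item, the suffix
        # that starts right after its first occurrence
        projected = []
        for seq in database:
            for pos, itemset in enumerate(seq):
                if item in itemset:
                    if seq[pos + 1:]:
                        projected.append(seq[pos + 1:])
                    break
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

# Example: with min_support = 2 on the Table 1 database, the pattern
# [{'DVD Rec'}, {'Video Soft'}] is reported with support 2.
```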
Closed Sequential Patterns

A closed sequential pattern is a sequential pattern included in no other sequential pattern having exactly the same support. Let us consider the database illustrated in Table 1. The frequent sequential pattern “(DVD Rec) (Video Soft)” is not closed because it is included in the sequential pattern S2, which has the same support (50%).
On the other hand, the sequential pattern “(Camcorder, MiniDV)” (with a support of 75%) is closed because it is included in other sequential patterns but with a different support (for instance, S1, which has a support of 50%). The first algorithm designed to extract closed sequential patterns is CloSpan (Yan, Han, & Afshar, 2003) with a detection of non-closed sequential patterns avoiding a large number of recursive calls. CloSpan is based on the detection of frequent sequences of length 2 such that “A always occurs before/after B”. Let us consider the database given in Table 1. We know that “(DVD Rec) (Video Soft)” is a frequent pattern. The authors of CloSpan proposed relevant techniques to show that “(DVD-R)” always occurs before “(Video Soft)”. Based on this observation CloSpan is able to find that “(DVD Rec, DVDR) (Video Soft)” is frequent without anymore scans over the database. BIDE (Wang & Han, 2004) extends the previous algorithm in the following way. First, it adopts a novel sequence extension, called BI-Directional Extension, which is used both to grow the prefix pattern and to check the closure property. Second, in order to prune the search space more deeply than previous approaches, it proposes a BackScan pruning method. The main idea of this method is to avoid extending a sequence by detecting in advance that the extension is already included in a sequence.
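Once frequent patterns and their supports are known, filtering the closed ones is a simple post-processing step (the algorithms above are designed to avoid ever generating the redundant patterns instead of filtering them afterwards). A naive, illustrative filter with invented helper names:

```python
def is_included(p, q):
    """True if pattern p is included in pattern q (both are lists of itemsets)."""
    pos = 0
    for itemset in p:
        while pos < len(q) and not set(itemset) <= set(q[pos]):
            pos += 1
        if pos == len(q):
            return False
        pos += 1
    return True

def closed_patterns(patterns):
    """patterns: list of (pattern, support) pairs. A pattern is discarded when it
    is included in another pattern that has exactly the same support."""
    closed = []
    for p, sup in patterns:
        absorbed = any(sup == sup_q and p != q and is_included(p, q)
                       for q, sup_q in patterns)
        if not absorbed:
            closed.append((p, sup))
    return closed
```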
Incremental Mining of Sequential Patterns As databases evolve, the problem of maintaining sequential patterns over a significantly long period of time becomes essential since a large number of new records may be added to a database. To reflect the current state of the database, in which previous sequential patterns would become irrelevant and new sequential patterns might appear, new efficient approaches were proposed. (Masseglia, Poncelet, & Teisseire, 2003) proposes an efficient algorithm, called ISE, for computing the frequent sequences in the updated database. ISE minimizes computational costs by re-using the minimal information from the old frequent sequences, i.e. the support of frequent sequences. The main new feature of ISE is that the set of candidate sequences to be tested is substantially reduced. The SPADE algorithm was extended into the ISM algorithm (Parthasarathy, Zaki, Ogihara, & Dwarkadas., 1999). In order to update the supports and enumerate frequent sequences, ISM maintains “maximally frequent sequences” and “minimally infrequent sequences” (also known as negative border). KISP (Lin and Lee, 2003) also proposes to take advantage of the knowledge previously computed and generates a knowledge base for further queries about sequential patterns of various support values. 1030
Extended Problems Based on the Sequential Pattern Extraction Motivated by the potential applications for the sequential patterns, numerous extensions of the initial definition have been proposed which may be related to the addition of constraints or to the form of the patterns themselves. In (Pei, Han, & Wang, 2002) the authors enumerate some of the most useful constraints for extracting sequential patterns. These constraints can be considered as filters applied to the extracted patterns, but most methods generally take them into account during the mining process. These filters may concern the items (“extract patterns containing the item Camcorder only”) or the length of the pattern, regular expressions describing the pattern, and so on. The definition of the sequential patterns has also been adapted by some research work. For instance (Kum, Pei, Wang, & Duncan, 2003) proposed ApproxMap to mine approximate sequential patterns. ApproxMap first proposes to cluster the data sequences depending on their items. Then for each cluster ApproxMap allows extraction of the approximate sequential patterns related to this cluster. Let us consider the database in Table 1 as a cluster. The first step of the extraction process is to provide the data sequences of the cluster with an alignment similar to those of bioinformatics. Table 2 illustrates such an alignment. The last sequence in Table 2 represents the weighted sequence obtained by ApproxMap on the sequences of Table 1. With a support of 50%, the weighted sequence gives the following approximate pattern: “(Camcorder: 3, MiniDV: 3) (DVD Rec: 3, DVD-R: 3) (MemCard: 2) (Video Soft: 2) (USB Key: 2)”. It is interesting to observe that this sequential pattern does not correspond to any of the recorded behavior, whereas it represents a trend for this kind of customer.
FUTURE TRENDS Today several methods are available for efficiently discovering sequential patterns according to the initial definition. Such patterns are widely applicable for a large number of applications. Specific methods, widely inspired from previous algorithms, exist in a wide range of domains. Nevertheless, existing methods have to be reconsidered since handled data is much more complex. For example, existing algorithms consider that data is binary and static. Today, according to the huge volume of data available, stream data mining represents an emerging class of data-intensive applications where data flows in and out dynamically. Such
Table 2. Alignment proposed for the data sequences of Table 1

Camcorder, MiniDV | DigiCam |                | MemCard |            | USB Key
Camcorder, MiniDV |         | DVD Rec, DVD-R |         | Video Soft |
                  |         | DVD Rec, DVD-R | MemCard | Video Soft | USB Key
Camcorder, MiniDV | Laptop  | DVD Rec, DVD-R |         |            |
Camcorder: 3, MiniDV: 3 | DigiCam: 1, Laptop: 1 | DVD Rec: 3, DVD-R: 3 | MemCard: 2 | Video Soft: 2 | USB Key: 2
applications also need very fast or even real-time responses (Giannella, Han, Pei, Yan, & Yu, 2003; Cai et al., 2004). In order to increase the immediate usefulness of sequential rules, it is very important to consider much more information. Hence, by associating sequential patterns with a customer category or multi-dimensional information, the main objective of multi-dimensional sequential pattern mining is to provide the enduser with more useful classified patterns (Pinto et al., 2001). With such patterns, an auto-dealer would find, for example, an enriched sequential rule stating that “Customers who bought an SUV on monthly payment installments 2 years ago are likely to respond favorably to a trade-in now”.
CONCLUSION Since they have been defined in 1995, sequential patterns have received a great deal of attention. First work on this topic focused on improving the efficiency of the algorithms either with new structures, new representations or by managing the database in the main memory. More recently extensions were proposed by taking into account constraints associated with real life applications. In fact, the increasing contributions on sequential pattern mining are mainly due to their adaptability to such applications. The management of timestamp within the recorded data is a difficulty for designing algorithms; on the other hand this is the reason why sequential pattern mining is one of the most promising technologies for the next generation of knowledge discovery problems.
REFERENCES Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 207-216), Washington, D.C, USA. Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. Proceeding of the 11th International Conference on Data Engineering (pp. 3-14), Taipei, Taiwan.
Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using bitmap representation. Proceedings of the 8 th International Conference on Knowledge Discovery and Data Mining (pp. 429-435), Alberta, Canada. Cai, Y., Clutter, D., Pape, G., Han, J., Welge, M., & Auvil, L. (2004). MAIDS: Mining alarming incidents from data streams. Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 919-920), Paris, France. Giannella, G., Han, J., Pei, J., Yan, X., & Yu, P. (2003). Mining frequent patterns in data streams at multiple time granularities. In H. Kargupta, A. Joshi, K. Sivakumar & Y. Yesha (Eds.), Next generation data mining (chap. 3). MIT Press. Han, J., Pei, J., Mortazavi-asl, B., Chen, Q., Dayal, U., & Hsu, M. (2000). FreeSpan: Frequent pattern-projected sequential pattern mining. Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (pp. 355-359), Boston, USA. Hu, Y., & Panda, B. (2004). A Data mining approach for database intrusion detection. Proceedings of the 19 th ACM Symposium on Applied Computing (pp. 711-716), Nicosia, Cyprus. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. ACM SIGKDD Explorations, 2(1), 1-15. Kum, H.-C., Pei, J., Wang, W., & Duncan, D. (2003). ApproxMAP: Approximate mining of consensus sequential patterns. Proceedings of the 3 rd SIAM International Conference on Data Mining (pp. 311-315), San Francisco, CA. Lin, M., & Lee, S. (2003). Improving the efficiency of interactive sequential pattern mining by incremental pattern discovery. Proceedings of the 36th Annual Hawaii International Conference on System Sciences (p. 68), Big Island, USA, CDROM. Masseglia, F., Cathala, F., & Poncelet, P. (1998). The PSP approach for mining sequential patterns. Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (pp. 176-184), Nantes, France. 1031
Masseglia, F., Poncelet, P., & Teisseire, M. (2003). Incremental mining of sequential patterns in large databases. Data and Knowledge Engineering, 46(1), 97-121. Parthasarathy, S., Zaki, M., Ogihara, M., & Dwarkadas, S. (1999). Incremental and interactive sequence mining. Proceedings of the 8 th International Conference on Information and Knowledge Management (pp. 251-258), Kansas City, USA. Pei, J., Han, J., Mortazavi-asl, B., Pinto, H., Chen, Q., Dayal, U., et al. (2001). PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. Proceedings of 17th International Conference on Data Engineering (pp. 215-224), Heidelberg, Germany. Pei, J., Han, J., & Wang, W. (2002). Mining sequential patterns with constraints in large databases. Proceedings of the11th Conference on Information and Knowledge Management (pp. 18-25), McLean, USA. Pinto, H., Han, J., Pei, J., Wang, K., Chen, Q., & Dayal, U. (2001). Multi-dimensional sequential pattern mining. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 81-88), Atlanta, USA. Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. Proceeding of the 5th International Conference on Extending Database Technology (pp. 3-17), Avignon, France. Wang, J., & Han, J. (2004). BIDE: Efficient mining of frequent closed sequences. Proceedings of the 20th International Conference of Data Engineering (pp. 7990), Boston, USA. Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. Proceedings of the 3rd SIAM International Conference on Data Mining, San Francisco, CA. Zaki, M. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/ 2), 31-60. Zaki, M. (2003). Mining data in bioinformatics. In N. Ye (Ed.), Handbook of data mining (pp. 573-596), Lawrence Earlbaum Associates.
KEY TERMS Apriori: The method of generating candidates before testing them during a scan over the database, insuring that if a candidate may be frequent then it will be generated. See also Generating-Pruning. Breadth-First: The method of growing the intermediate result by adding items both at the beginning and the end of the sequences. See also Generating-Pruning Closed Sequential Pattern: A frequent sequential pattern that is not included in another frequent sequential pattern having exactly the same support. Data Sequence: The sequence of itemsets representing the behavior of a client over a specific period. The database involved in a sequential pattern mining process is a (usually large) set of data sequences. Depth-First: The method of generating candidates by adding specific items at the end of the sequences. See also Generating-Pruning. Generating-Pruning: The method of finding frequent sequential patterns by generating candidates sequences (from size 2 to the maximal size) step by step. At each step a new generation of candidates having the same length is generated and tested over the databases. Only frequent sequences are kept (pruning) and used in the next step to create a new generation of (longer) candidate sequences. Itemset: Set of items that occur together. Maximal Frequent Sequential Pattern: A sequential pattern included in at least n data sequences (with n the minimum support specified by the user). A sequential pattern is maximal when it is not included in another frequent sequential pattern. A frequent sequential pattern may represent, for instance, a frequent behavior of a set of customers, or a frequent navigation of the users of a Web site. Negative Border: The collection of all sequences that are not frequent but both of whose generating sub sequences are frequent. Sequential Pattern: A sequence included in a data sequence such that each item in the sequential pattern appears in this data sequence with respect to the order between the itemsets in both sequences.
Software Warehouse
Honghua Dai, Deakin University, Australia
INTRODUCTION A software warehouse is a facility providing an effective and yet efficient mechanism to store, manage, and utilize existing software resources (Dai, 2003, 2004a, 2004b; Dai & Li, 2004). It is designed for the automation of software analysis, testing, mining, reuse, evaluation, and system design decision making. It makes it easier to make use of existing software for solving new problems in order to increase software productivity and to reduce the cost of software development. By using a software warehouse, software assets are systematically accumulated, deposited, retrieved, packaged, managed, and utilized, driven by data mining and OLAP technologies. The design perspectives and the role of a software warehouse in modern software development are addressed in Dai (2003).
BACKGROUND

With the dramatic increase in the amount and size of available software, it is naturally important to consider an effective and yet efficient way to store, manage, and make best use of existing software. A software warehouse is proposed to meet such a demand. In many cases, software analysis is a necessary step to system development for new applications. Such analysis is required for the provision of answers or solutions to many challenging and practical questions, such as the following:

1. Is there any software available for solving a particular problem? What are the software products? Which one is better?
2. In history, how did people solve a similar problem?
3. What are the existing software components that can be used in developing a new system?
4. What is the best design for a set of given system requirements?
To provide a satisfactory answer to these questions, the following conditions need to be met:

1. A comprehensive collection of both historical and current software.
2. An architecture and organization of the collected software are needed for effective and yet efficient access to the target software.
3. A reliable and feasible management and access strategy is needed for management and for making use of the software.
In short, the significance of establishing a software warehouse includes:

1. Effective and efficient software storage and management.
2. Software design study.
3. Software development audit and management.
4. Software reuse.
5. Software analysis and software review.
6. Software reverse engineering using data mining.
7. Software development decision making.
8. Software design recovery to facilitate software design.
9. Support automatic software engineering.
10. Provide essential material to software factory in an organized, systematic, effective, and efficient way.
Since the invention of computers, almost all software analysis tasks have been completed by human experts. Such analysis normally was done based on a very small portion of the information, due to the limitation of available resources. Such resource limitation is not due to the lack of resources but to the lacking of a way to effectively manage the resources and make use of them. With regard to software development, in today’s software industry, the analysis, design, programming, and testing of software systems are done mostly by human experts, while automation tools are limited to the execution of preprogrammed action only. Evaluation of system performance is also associated with a considerable effort by human experts, who often have imperfect knowledge of the environment and the system as a whole.
SOFTWARE WAREHOUSE TECHNOLOGY A software warehouse is an extension of the data warehouse (Barquin & Edelstein, 1997; Dai, Dai & Li, 2004).
It is used to store and manage software systematically. Similar to a data warehouse, a software warehouse merges current software and historical software into a single software repository to facilitate decision making via software analysis and software mining. Similar to a data warehouse, a software warehouse is tuned for online software analysis and software mining, whereas operational software packages or libraries are tuned for a problem-solving process; and unlike software packages or libraries, a software warehouse stores transformed integrated software that, in many cases, are summarized. Generally speaking, a software warehouse is the repository of software that can be in different languages, collected from multiple sources, stored under a unified schema, managed using particular strategies, and can provide well-organized services for software reuse, thus facilitating automatic software engineering. Such organized services include queries about software availability, suitability, and constructability. They analyze the properties of the software or a set of software systems and discover knowledge structure out of the software warehouse.
Software Warehouse Architecture and Construction Process
The construction of a software warehouse involves the following activities, as shown in Figure 1:

1. Software Filtering (selection and quality control): To select and collect the software that meets the goals and objectives of the software warehouse.
2. Software Clean: Clean up possible problems and bugs.
3. Software Classification/Clustering: Classifying the software based on the subjects we have chosen for the software warehouse.
4. Software Transformation: To make the software in the unified form.
5. Software Integration: To integrate the software obtained from different sources as an integral part and to make it consistent.
6. Software Loading: Load the software to the warehouse.

Figure 1. Software warehouse construction process

The overall architecture of a software warehouse system is shown in Figure 2. It is composed of the following components:

• SWMS (Software Warehouse Management System) takes care of the following: software access control; software management; software application control; smart software application assistant; software operation (D/I/M/R/G/S, etc.); SW update and consistency management.
• SWAMS (Software Warehouse Application Management System) is designed to take care of the following activities: query understanding; query analysis and interaction process; software evaluation; software selection; software identification and accreditation.
• SW-Cube (control unit of the software warehouse) is used to control the access to/from a software warehouse. All the access to the software warehouse from either SWMS or SWAMS can be done via SW-Cube.
• Software Warehouse (main body of the software warehouse) is the place where the software components are stored in a pre-designed organization.

Figure 2. Software warehouse architecture
The organization of a software warehouse is divided into two major parts: the control part and the body. The control part is a software cube, as shown in Figure 3, that has three dimensions: (1) Product: The product dimension lists all the products collected in the software warehouse (e.g., Microsoft Windows XP, Linux, etc.); (2) Historical: This dimension lists the time when the product was developed; (3) Form/Language: This dimension specifies in which form or language the product was stored in the warehouse (e.g., in C/C++, C# or in executable code of some particular system such as UNIX, etc.). A software warehouse contains software building blocks, including classes, packages, libraries, subroutines, and so forth. These components of the software warehouse are organized around major subjects in a hierarchy that can be easily accessed, managed, and utilized.
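As a data structure, such a control cube can be approximated by a simple index keyed by the three dimensions. The toy sketch below only makes the organization concrete; the class and method names are invented for this illustration and do not come from the cited work.

```python
from collections import defaultdict

class SoftwareCube:
    """Toy index over the three dimensions: product, period, and form/language."""

    def __init__(self):
        self._cells = defaultdict(list)   # (product, period, form) -> components

    def deposit(self, product, period, form, component):
        self._cells[(product, period, form)].append(component)

    def slice(self, product=None, period=None, form=None):
        """Return all components matching the given (possibly partial) coordinates."""
        return [c
                for (p, t, f), comps in self._cells.items()
                if (product is None or p == product)
                and (period is None or t == period)
                and (form is None or f == form)
                for c in comps]

cube = SoftwareCube()
cube.deposit("Windows XP", "2001", "executable", "setup.exe")
cube.deposit("Linux", "2004", "C/C++", "scheduler.c")
print(cube.slice(form="C/C++"))           # all C/C++ components, any product/period
```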
FUTURE TRENDS

We expect that this research will lead to a streamlined, highly efficient software development process and enhance productivity in response to modern challenges of the design and development of software applications.
CONCLUSION

A software warehouse extends the idea of the data warehouse (Chaudhuri & Dayal, 1997). A software warehouse provides the mechanism to store, manage, and make efficient smart use of software systematically. Unlike software packages and libraries, a software warehouse stores historical software for various applications beyond the traditional use. The typical applications of a software warehouse include software reuse, design recovery, software analysis, strategy generation, and software reverse engineering. We expect that software mining and a software warehouse will play an important role in the research and development of new software systems.

REFERENCES

Barquin, R., & Edelstein, H. (1997). Building, using, and managing the data warehouse. Upper Saddle River, NJ: Prentice Hall.
Chaudhuri, S., & Dayal, U. (1997). An overview of data warehouse and OLAP. ACM SIGMOD Record, 26(1).
Dai, H., Dai, W., & Li, G. (2004). Software warehouse: Its design, management and applications. International Journal of Software Engineering and Knowledge Engineering (IJSEKE), 14(5), 38-49.
Dai, H. (2003). Software warehouse and software mining: The impact of data mining to software engineering. Proceedings of the Keynote Speech at the 4th International Conference on Intelligent Technologies, Thailand.
Dai, H., & Dai, W. (2003). Software warehouse and its management strategies. Proceedings of the 15th International Conference on Software Engineering and Knowledge Engineering.
Dai, H., & Webb, G. (2003). Session 11. Proceedings of the 15th International Conference on Software Engineering and Knowledge Engineering.
Figure 3. Software cube for a software warehouse (dimensions: product, historical, and form/language)

Figure 4. What can a software warehouse offer? (Software storage, software analysis, software decision making, software management, strategy generation, software control, software reuse, and software reverse engineering, built around the software warehouse.)

Dai, H., & Webb, G. (Eds.). (2004). Special issues on data mining for software engineering and knowledge engineering. International Journal on Software Engineering and Knowledge Engineering. World Scientific, 14(5).
Dai, H., & Webb, G. (2004a). Data mining for software engineering: Current status and perspectives. Special Issues on Data Mining for Software Engineering and Knowledge Engineering, IJSEKE. World Scientific, 14(5). Last, M., & Kandel, F. (2004). Using data mining for automated software testing. International Journal on Software Engineering and Knowledge Engineering, 14(5). Li, G., & Dai, H. (2004). What will affect software reuse: A causal model analysis. International Journal of Software Engineering and Knowledge Engineering (IJSEKE), 14(6). Lo, S., & Chang, J. (2004). Application of clustering techniques to component architecture design. International Journal on Software Engineering and Knowledge Engineering, 14(5). Lujan-Mora, S., & Trujillo, J. (2003). A comprehensive method for data warehouse design, DMDW. Nicola, M., & Rizvi, H. (2003). Storage layout and I/O performance in data warehouses. DMDW.
Peralta, V., & Ruggia, R. (2003). Using design guidelines to improve data warehouse logical design. DMDW.
Rizzi, S. (2003). Open problems in data warehousing: Eight years later. DMDW.
Schneider, M. (2003). Well-formed data warehouse structures. DMDW.
Weir, R., Peng, T., & Kerridge, J. (2003). Best practice for implementing a data warehouse: A review for strategic alignment. DMDW.

KEY TERMS

Software Analysis: An analysis of the software in a software warehouse for software decision making.

Software Component: A minimum manageable unit of a software warehouse that can be a class or a subroutine.

Software Cube: The control part of a software warehouse system.

Software Mining: A process to derive software patterns/regularities from a given software warehouse or a software set.

Software Warehouse: A software warehouse is the repository of software components that can be in different languages, collected from multiple sources, stored under a uniform schema, managed using particular strategies, and can provide well-organized services for software reuse, thus facilitating automatic software engineering.

Software Warehouse Application Management System: A component of a software warehouse system for the management of the application of the software in the software warehouse.

Software Warehouse Architecture: The organization and components of a software warehouse.

Software Warehouse Management System: A component of a software warehouse system for the management of software in the warehouse.
Spectral Methods for Data Clustering

Wenyuan Li, Nanyang Technological University, Singapore
INTRODUCTION With the rapid growth of the World Wide Web and the capacity of digital data storage, tremendous amount of data are generated daily from business and engineering to the Internet and science. The Internet, financial real-time data, hyperspectral imagery, and DNA microarrays are just a few of the common sources that feed torrential streams of data into scientific and business databases worldwide. Compared to statistical data sets with small size and low dimensionality, traditional clustering techniques are challenged by such unprecedented high volume, high dimensionality complex data. To meet these challenges, many new clustering algorithms have been proposed in the area of data mining (Han & Kambr, 2001). Spectral techniques have proven useful and effective in a variety of data mining and information retrieval applications where massive amount of real-life data is available (Deerwester et al., 1990; Kleinberg, 1998; Lawrence et al., 1999; Azar et al., 2001). In recent years, a class of promising and increasingly popular approaches — spectral methods — has been proposed in the context of clustering task (Shi & Malik, 2000; Kannan et al., 2000; Meila & Shi, 2001; Ng et al., 2001). Spectral methods have the following reasons to be an attractive approach to clustering problem: •
•
•
Spectral approaches to the clustering problem offer the potential for dramatic improvements in efficiency and accuracy relative to traditional iterative or greedy algorithms. They do not intrinsically suffer from the problem of local optima. Numerical methods for spectral computations are extremely mature and well understood, allowing clustering algorithms to benefit from a long history of implementation efficiencies in other fields (Golub & Loan, 1996). Components in spectral methods have the naturally close relationship with graphs (Chung, 1997). This characteristic provides an intuitive and semantic understanding of elements in spectral methods. It is important when the data is graph-based, such as links of WWW, or can be converted to graphs.
In this paper, we systematically discuss applications of spectral methods to data clustering.
BACKGROUND To begin with the introduction of spectral methods, we first present the basic foundations that are necessary to understand spectral methods.
Mathematical Foundations

Data is typically represented as a set of vectors in a high-dimensional space. It is often referred to as the matrix representation of the data. Two widely used spectral operations are defined on the matrix.
•
EIG(A) operation: Given a real symmetric matrix An×n,
if there is a vector x ∈ R n ≠ 0 such that Ax=ë x for some scalar ë , then ë is called the eigenvalue of A with corresponding (right) eigenvector x. EIG(A) is an operation to compute all eigenvalues and corresponding eigenvectors of A. All eigenvalues and eigenvectors are real, that is, guaranteed by Theorem of real schur decomposition (Golub & Loan, 1996). SVD(A) operation: Given a real matrix Am×n, similarly, there always exists two orthogonal matrices T T U ∈ R m×m and V ∈ R n×n ( U U= I and V V=I ) to T decompose A to the form A=USV , where r=rank(A) and S = diag( σ 1 ,L , σ r ) ∈ R r ×r , σ 1 ≥ σ 2 L ≥ σ r = L = σ n = 0 . Here, the σ i are the singular values of A and the first r columns of U and V are the left and right (respectively) singular vectors of A. SVD(A) is called Singular Value Decomposition of A (Golub & Loan, 1996).
Typically, the set of eigenvalues (or singular values) is called the spectrum of A. Besides, eigenvectors (or singular vectors) are the other important components of spectral methods. These two spectral components have
Copyright © 2006, Idea Group Inc., distributing in print or electronic forms without written permission of IGI is prohibited.
5
Spectral Methods for Data Clustering
been widely used in various disciplines and adopted to analyze the key encoding information of a complex system. Therefore, they are also the principle objects in spectral methods for data clustering.
conclusions and properties of LA are also applicable to TA. Moreover, L A and TA have the same eigenvectors.
Transformations
As a graph is represented by its adjacency matrix, there is a close relationship between the graph and the spectral components of its adjacency matrix. It is a long history to explore the fundamental properties of a graph from the view of the spectral components of this graph’s adjacency matrix in the area of mathematics. Especially, eigenvalues are closely related to almost all major invariants of graphs and thus, play a central role in the fundamental understanding of graphs. Based on this perspective, spectral graph theory has emerged and rapidly grown in recent years (Chung, 1997). Hence, many characteristics of spectral components of a matrix can be intuitively explained in terms of graphs and meanwhile graphs also can be analyzed from its spectral components. A notable case is the authority and hub vertices of the Web graph that is important to Web search as shown in HITS algorithm (Kleinberg, 1998). Another example is that the spectrum of the adjacency matrix of a graph can be analyzed to deduce its principal properties and structure, including the optimization information about cutting a graph. This view has been applied to discover and predict the clustering behavior of a similarity matrix before the actual clustering is performed (Li et al., 2004).
Transformations

As observed by researchers, the two key components of spectral methods, eigenvalues and eigenvectors, scale differently for different matrices. Therefore, before they are analyzed and applied, some transformations, or, more exactly, normalizations of the two spectral components are needed. Although this might look a little complicated at first, using the components in this way is more consistent with spectral geometry and stochastic processes. Moreover, another advantage of the normalized components is their closer relationship with graph invariants, which the raw components may lack. There are three typical transformations often used in spectral methods.

•	Laplacian: Given a symmetric matrix A = (a_ij)_{n×n} with a_ij ≥ 0, the Laplacian L_A = (l_ij)_{n×n} of A is defined by l_ij = 1 − a_ij/d_i if i = j; l_ij = −a_ij/√(d_i d_j) if i ≠ j and a_ij ≠ 0; and l_ij = 0 otherwise. Spectral graph theory is built on this transformation (Chung, 1997).
•	Variant of the Laplacian: Given a symmetric matrix A = (a_ij)_{n×n} with a_ij ≥ 0, the variant of the Laplacian T_A = (t_ij)_{n×n} of A is defined as T_A = D^{-1/2} A D^{-1/2}. It is easy to verify that L_A + T_A = I. This transformation of the matrix is often used (Li et al., 2004; Ng et al., 2001).
•	Transition (or Stochastic) Matrix: Given a symmetric matrix A = (a_ij)_{n×n} with a_ij ≥ 0, the transition matrix P_A = (p_ij)_{n×n} of A satisfies p_ij = a_ij/d_i, so that each row sums to 1. P_A is a stochastic matrix, in the sense that it describes the transition probabilities of a Markov chain in the natural way.
In the definitions of these three matrices, d_i = Σ_j a_ij is the sum of the i-th row of A and D = diag(d_1, …, d_n). All three matrices have real eigenvalues and eigenvectors. Moreover, the eigenvalues of the Laplacian and of the transition matrix lie in [0, 2] and [−1, 1], respectively. From the relationship between L_A and T_A we can deduce that SPECTRUM(T_A) = {1 − λ | λ ∈ SPECTRUM(L_A)}, where SPECTRUM(·) denotes the set of eigenvalues of a matrix. Hence, the eigenvalues of T_A also lie in [−1, 1], and all the conclusions and properties of L_A carry over to T_A. Moreover, L_A and T_A have the same eigenvectors.
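The three transformations above can be computed directly from a symmetric non-negative matrix. The sketch below follows the definitions as given here (with T_A = D^{-1/2} A D^{-1/2} and L_A = I − T_A) and assumes every row sum d_i is positive; the example matrix is invented.

```python
import numpy as np

def transformations(A):
    """Return the Laplacian L_A, its variant T_A, and the transition matrix P_A."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # d_i: row sums (assumed positive)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    T = D_inv_sqrt @ A @ D_inv_sqrt        # variant of the Laplacian
    L = np.eye(len(d)) - T                 # Laplacian: L_A = I - D^{-1/2} A D^{-1/2}
    P = A / d[:, None]                     # transition matrix: p_ij = a_ij / d_i
    return L, T, P

# Example: eigenvalues of T lie in [-1, 1]; those of L lie in [0, 2].
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
L, T, P = transformations(A)
print(np.linalg.eigvalsh(T), np.linalg.eigvalsh(L))
```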
MAIN THRUST

Spectral Analysis for Preprocessing of Data Clustering

In clustering, one common preprocessing step is to capture or predict the characteristics of the target data set before the clustering algorithm is run. Here, spectral analysis of a data set is introduced to predict its clustering behavior before the actual data clustering. Examining the clustering process as described in Jain et al. (1999), one concludes that the feature set and the similarity measure embody the intrinsic knowledge of the clustering domain. Data clustering algorithms depend greatly on the similarity matrix. Therefore, the similarity matrix is the principal object to consider when decoding the clustering information of a data set. Given the similarity matrix S = (s_ij)_{n×n}, we define G(S) to be the weighted undirected graph whose adjacency matrix is S; that is, the weight of the edge between vertices i and j is the off-diagonal entry s_ij. In G(S), a large weight value between two vertices represents high connectivity between them, and vice versa. In order to analyze the clustering behavior of G(S), we employ the variant of the Laplacian T_S as the transformation of S. T_S has n eigenvalues in decreasing order, λ_1 ≥ λ_2 ≥ … ≥ λ_n, which are also called the G(S) spectrum. Two observations relating the G(S) spectrum to the clustering behavior of S were made by Li et al. (2004). They indicate the relationships between the clustering behavior of S and the principal properties and structure of G(S).
1.	The higher λ_2 is, the better the bipartition that exists for S.
2.	For the sequence α_i = λ_i/λ_2 (i ≥ 2), if there exists k ≥ 2 such that α_i ≈ 1 for 2 ≤ i ≤ k and α_k − α_{k+1} > δ for some threshold δ (0 < δ < 1), then k indicates the cluster number of the data set.

These observations can be accounted for by spectral graph theory. The partition and connectivity of G(S) correspond in a natural way to the clustering behavior of S. Thus, through analysis of the G(S) spectrum, we can infer details about the partition and connectivity of the graph G(S) and then obtain the clustering behavior of S. Next, we introduce the Cheeger constant h(G), which is important for understanding the properties and structure of G(S) (Chung, 1997). It is a typical value for measuring the goodness of the optimal bipartition of a graph; therefore, h(G) is an appropriate measure of the clustering quality of the bipartition of G(S): the lower h(G) is, the better the clustering quality of the bipartition of G(S). Given the relationship between the bipartition and the Cheeger constant, we have the so-called Cheeger inequality (Chung, 1997):

(1 − λ_2)/2 ≤ h(G) ≤ √(2(1 − λ_2))
The inequality bounds h(G) in terms of λ_2. It shows that if λ_2 is high enough to approach 1, h(G) will be very low, which indicates that there exists a good bipartition of G(S). This yields Observation (1). Generally speaking, the above conclusion for G(S) also applies to its induced subgraphs G(S_i). Similarly, the λ_2 of G(S_i) shows the clustering quality of the bipartition of G(S_i). This yields Observation (2). Li et al. (2004) provide details of the theoretical and empirical results behind these observations.
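A rough sketch of how Observation (2) can be turned into a cluster-number estimate: compute the spectrum of T_S, form the ratios α_i = λ_i/λ_2, and look for the first large gap. The threshold delta and the example matrix are assumptions made for illustration, not values prescribed by Li et al. (2004).

```python
import numpy as np

def estimate_num_clusters(S, delta=0.3):
    """Estimate the cluster number of a similarity matrix S from the G(S) spectrum."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    T = D_inv_sqrt @ S @ D_inv_sqrt              # variant of the Laplacian T_S
    lam = np.sort(np.linalg.eigvalsh(T))[::-1]   # lambda_1 >= lambda_2 >= ...
    alpha = lam / lam[1]                         # alpha_i = lambda_i / lambda_2
    for k in range(1, len(lam) - 1):             # k is the 0-based index of lambda_{k+1}
        if alpha[k] - alpha[k + 1] > delta:      # first large gap after lambda_2
            return k + 1                         # estimated number of clusters
    return 1

# A block-structured similarity matrix with two obvious clusters:
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
print(estimate_num_clusters(S))   # expected: 2
```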
Spectral Clustering Algorithms

The information carried by the spectral components is highly indicative of the clustering results produced in the course of clustering algorithms. The intuition behind spectral methods can be shown with the following simple example, adapted from the text-data example with which Landauer et al. (1998, p. 10) introduced LSA; here, we add three more passages. It uses as text passages the titles of twelve technical memoranda: five about human-computer interaction (HCI), four about mathematical graph theory, and three about clustering techniques (Table 1). Their topics are conceptually disjoint. We manually selected the italicized terms as the feature set and used the cosine similarity measure to compute the similarity matrix, shown as a gray-scale image in Figure 1. The shade of each point in the image represents the value of the corresponding entry of the similarity matrix. In this figure, we can see that the first, second, and third diagonal blocks (the white ones) correspond to the topics "HCI," "graph," and "clustering," respectively, while the off-diagonal area reflects the disjointness of these topics. Based on the theoretical analysis behind Observation (1), the second eigenvector of the similarity matrix indicates its clustering behavior. Therefore, instead of investigating λ_2 itself, we examine the eigenvector x_2 corresponding to λ_2.
Table 1. Example of text data: Titles of some technical memos
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
d1: An investigation of linguistic features and clustering algorithms for topical document clustering
d2: A comparison of document clustering techniques
d3: Survey of clustering Data Mining Techniques
Figure 1. Gray-scale image of the similarity matrix (rows and columns ordered c1–c5, m1–m4, d1–d3)
Figure 2. Coordinates of the second eigenvector of the Laplacian
Figure 3. Coordinates of the second eigenvector of the transition matrix
Certainly, the similarity matrix must be transformed first. After the transformations, the second eigenvectors of the resulting Laplacian and transition matrix are plotted in Figures 2 and 3, respectively. In both figures, we can clearly see that the second eigenvector assigns large negative weights to the first five coordinates, large positive weights to the following four coordinates, and nearly zero weights to the last three coordinates. This result matches the class labels in Table 1 exactly. The example clearly shows how the information in the eigenvectors indicates the clustering behavior of a data set. In general, the eigenvectors corresponding to large eigenvalues tend to capture the global clustering characteristics of a data set. According to these intuitive observations, clustering algorithms based on spectral components can be classified into two types.
•	Recursive type: The algorithms divide the data points into two partitions based on a single eigenvector (for example, the second eigenvector) and then recursively divide each sub-partition in the same way to find the optimal partitions.
•	Multiway type: The algorithms use the information in multiple eigenvectors to partition the data directly.

Next, we review four typical spectral clustering algorithms.
•	The Shi & Malik algorithm (Shi & Malik, 2000): This algorithm was proposed as a heuristic for minimizing the normalized cut criterion for graph partitioning in the area of image segmentation. Given a graph G = (V, E), the normalized cut between two sets A ∪ B = V, A ∩ B = ∅ is defined as Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B), where the vertex set V is partitioned into the two clusters A and B so that Ncut(A, B) is minimized over all two-way partitions of V. This problem is proved to be NP-hard. However, Shi & Malik show that the spectral algorithm may find the optimum under some special conditions. Specifically, it uses the second eigenvector of the transition matrix P_S for the bipartition. This algorithm is of the recursive type.
•	The Kannan, Vempala, & Vetta algorithm (Kannan et al., 2000): This algorithm is similar to the Shi & Malik algorithm except for one key point: it uses the Cheeger constant, as defined in the previous section, as the bipartition criterion, together with the second eigenvector of the transition matrix P_S for the bipartition. Therefore, this algorithm is also a recursive spectral clustering algorithm.
•	The Meila & Shi algorithm (Meila & Shi, 2001): This algorithm is of the multiway type. It first transforms the similarity matrix S into the transition matrix P_S. Then it computes x_1, x_2, …, x_k, the eigenvectors of P_S corresponding to the k largest eigenvalues, and forms the matrix X = [x_1, x_2, …, x_k]. Finally, it applies any non-spectral clustering algorithm to cluster the rows of X as points in a k-dimensional space.
•	The Ng, Jordan, & Weiss algorithm (Ng et al., 2001): This algorithm is also of the multiway type. It uses the variant of the Laplacian T_S in the transformation step. It then computes x_1, x_2, …, x_k, the eigenvectors of T_S corresponding to the k largest eigenvalues, and forms the matrix X = [x_1, x_2, …, x_k]. Next, it obtains the matrix Y by normalizing each row of X to unit length (i.e., y_ij = x_ij / √(Σ_j x_ij²)). Finally, treating each row of Y as
a point in k dimensions, it clusters them with the k-means algorithm. Although there are various spectral clustering algorithms, they share largely common steps and theoretical foundations: (1) all require a transformation of the similarity matrix S before the actual clustering, using L_S, T_S, or P_S; (2) eigenvectors of the transformed matrix are used as the key data for clustering, which is also why these methods are called spectral clustering algorithms; and (3) the underlying theory of these algorithms is based on the graph-cut optimization problem, with bipartition criteria introduced to show that the eigenvectors can provide near-optimal solutions.
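As one concrete instance of these common steps, the sketch below follows the multiway procedure attributed above to Ng, Jordan, & Weiss: transform S into T_S, take the eigenvectors of the k largest eigenvalues, row-normalize, and run k-means. It uses NumPy and scikit-learn's KMeans; the similarity matrix is an invented two-block example.

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(S, k):
    """Multiway spectral clustering in the style of Ng, Jordan, & Weiss (2001)."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    T = D_inv_sqrt @ S @ D_inv_sqrt                   # variant of the Laplacian T_S
    eigvals, eigvecs = np.linalg.eigh(T)              # ascending eigenvalues
    X = eigvecs[:, -k:]                               # eigenvectors of the k largest eigenvalues
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize each row to unit length
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)

# Example: the two-block similarity matrix yields cluster labels such as [0, 0, 1, 1].
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
print(njw_spectral_clustering(S, 2))
```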
FUTURE TRENDS

There is a close relationship between spectral components and a notable phenomenon that frequently occurs in real-world data: the power law. Distributions in much real-life data from nature and human social behavior, including city sizes, incomes, earthquake magnitudes, and even the Internet topology, the WWW, and collaboration networks, are composed of a large number of common events and a small number of rarer events, and they often manifest a form of regularity in which the relationship of any event to any other in the distribution scales in a simple way (Zipf, 1949; Malamud et al., 1998; Faloutsos et al., 1999; Barabási et al., 2000; Albert, 2001; Kumar et al., 2000). In essence, such distributions can be generalized as the power law, which is often represented by log-linear relations. Considering the ubiquity of the power law in real-world data, there has been a recent surge of interest in graphs whose degrees follow a power-law distribution, because these graphs are often derived from real-life data. As we have discussed, spectral methods have close relationships with graphs, so an intriguing question naturally arises: does the power law in graphs affect spectral methods? Actually, the eigenvalues of such graphs also follow a power law, but with a slightly lower exponent than that of the degrees (Mihail & Papadimitriou, 2002). Meanwhile, the same authors pointed out that, "The effectiveness of several of the SVD-based algorithms requires that the underlying space has low rank, that is, a relatively small number of significant eigenvalues. Power laws on the statistics of these spaces have been observed and are quoted as evidence that the involved spaces are indeed low rank and hence spectral methods should be efficient" (Mihail & Papadimitriou, 2002, p. 8). However, this is only the beginning of research into how effective spectral methods are on real-world data. More experiments and theoretical analysis are needed.
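The low-rank argument quoted above can be checked empirically for a given matrix: compute its singular values, fit the slope of the log-log rank/size plot (a rough power-law exponent), and measure the energy captured by the leading components. This is only an illustrative diagnostic; the thresholds and the random test matrix are arbitrary assumptions.

```python
import numpy as np

def spectrum_diagnostics(A, top_k=10):
    """Report a log-log slope of the singular values and the energy in the top_k of them."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    s = s[s > 1e-12]
    ranks = np.arange(1, len(s) + 1)
    # Slope of log(sigma_i) versus log(i): roughly -alpha for a power-law spectrum.
    slope = np.polyfit(np.log(ranks), np.log(s), 1)[0]
    energy = np.sum(s[:top_k] ** 2) / np.sum(s ** 2)
    return slope, energy

A = np.random.rand(200, 200)
A = (A + A.T) / 2          # a symmetric test matrix (not expected to be power-law)
print(spectrum_diagnostics(A))
```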
CONCLUSION

Spectral techniques based on the analysis of the largest eigenvalues and eigenvectors have proven algorithmically successful in detecting semantics and clusters in graphs and in some real-world data. In the PageRank and HITS algorithms (Lawrence et al., 1999; Kleinberg, 1998), only the first eigenvector is considered and mined. In spectral clustering algorithms, the second eigenvector is shown to carry more information about the clustering behavior of the targeted data set. However, it is worth noting that the empirical results of most of the spectral clustering algorithms discussed above are limited to small or synthetic data sets, not to large-scale, real-life data sets. Spectral methods should be applied to more areas and compared with other existing methods.
REFERENCES

Azar, Y., Fiat, A., Karlin, A.R., McSherry, F., & Saia, J. (2001). Spectral analysis of data. In The 33rd Annual ACM Symposium on Theory of Computing (pp. 619-626). Heraklion, Crete, Greece.
Chung, F.R.K. (1997). Spectral graph theory. Number 92 in CBMS Regional Conference Series in Mathematics. Rhode Island: American Mathematical Society.
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., & Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6), 391-407.
Drineas, P., Frieze, A., Kannan, R., Vempala, S., & Vinay, V. (1999). Clustering in large graphs and matrices. In The 10th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 291-299). Baltimore, Maryland, USA.
Golub, G., & Loan, C.V. (1996). Matrix computations. Baltimore: The Johns Hopkins University Press.
Han, J., & Kamber, M. (2001). Data mining concepts and techniques. San Francisco: Morgan Kaufmann.
Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.
Kannan, R., Vempala, S., & Vetta, A. (2000). On clustering – good, bad and spectral. In The 41st Annual Symposium on Foundations of Computer Science (pp. 367-377). Redondo Beach, CA, USA.
Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment. In The 9th Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 668-677). New York.
Landauer, T.K., Foltz, P.W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259-284.
Lawrence, P., Sergey, B., Rajeev, M., & Terry, W. (1999). The PageRank citation ranking: Bringing order to the Web. Technical Report, Stanford Digital Library Technologies Project.
Li, W., Ng, W.-K., Ong, K.-L., & Lim, E.-P. (2004). A spectroscopy of texts for effective clustering. In The 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (in press).
Meila, M., & Shi, J. (2001). A random walks view of spectral segmentation. In International Workshop on AI and Statistics.
Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (pp. 849-856).
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905.

KEY TERMS

Adjacency Matrix: A matrix representing a graph with n vertices. It is an n-by-n array of Boolean values, with the entry in row u and column v defined to be 1 if there is an edge connecting vertices u and v in the graph, and 0 otherwise.
Graph Invariants: Quantities that characterize the topological structure of a graph. If two graphs are topologically identical, they have identical graph invariants.
HITS Algorithm (Hypertext Induced Topic Selection): A Web search technique for ranking Web pages according to relevance to a particular search term or search phrase. Two concepts, "authority" and "hub," are proposed to characterize the importance of each Web page.
Markov Chain: A finite state machine with probabilities for each transition; that is, a probability that the next state is s_j given that the current state is s_i.
PageRank Algorithm: A Web search technique for ranking Web pages according to relevance to a particular search term or search phrase. Based on the random surfer model and the Web graph, an index, PageRank, is proposed to rate the importance of each Web page to users.
Power Law Distribution: A probability distribution function P[X = x] ~ cx^(−α), where c > 0 and α > 0 are constants, and f(x) ~ g(x) means that the ratio of the two functions goes to 1 as x grows large.
Spectral Graph Theory: A theory on the study of the eigenvalue properties of the Laplacian matrix of a graph.
Statistical Data Editing
Claudio Conversano, University of Cassino, Italy
Roberta Siciliano, University of Naples Federico II, Italy
INTRODUCTION

Statistical Data Editing (SDE) is the process of checking data for errors and correcting them. Winkler (1999) defined it as the set of methods used to edit (i.e., clean up) and impute (fill in) missing or contradictory data. The result of SDE is data that can be used for analytic purposes. The editing literature goes back to the 1960s, with the contributions of Nordbotten (1965), Pritzker et al. (1965), and Freund and Hartley (1967). A first mathematical formalization of the editing process was given by Naus et al. (1972), who introduced a probabilistic criterion for the identification of records (or parts of them) that failed the editing process. A solid methodology for generalized editing and imputation systems was developed by Fellegi and Holt (1976). The great breakthrough in rationalizing the process came as a direct consequence of the PC evolution in the 1980s, when editing started to be performed online on personal computers, even during the interview and by the respondent in CASI models of data collection (Bethlehem et al., 1989). Nowadays, SDE is a research topic in both academia and statistical agencies. The European Economic Commission organizes a yearly workshop on the subject that reveals an increasing interest in both the scientific and managerial aspects of SDE.
BACKGROUND

Before the advent of computers, editing was performed by large groups of persons undertaking very simple checks. At that stage, only a small fraction of errors was detected. Survey designers and managers recognized the advent of computers as a means of reviewing all records by consistently applying even sophisticated checks requiring computational power, thereby detecting most of the errors in the data that could not be found by manual review. The focus of both the methodological work and, in particular, the applications was on the possibilities of
enhancing the checks and applying automated imputation rules in order to rationalize the process.
SDE Process

Statistical organizations periodically perform an SDE process. It begins with data collection. An interviewer can quickly examine the respondent's answers and highlight gross errors. Whenever data collection is performed using a computer, more complex edits can be stored in it and applied to the data just before they are transmitted to a central database. In all these cases, the core of the editing activity is performed after the data collection is complete. Nowadays, any modern editing process is based on the a priori specification of a set of edits. These are logical conditions or restrictions on the values of the data. A given set of edits is not necessarily correct: it may omit important edits or contain edits that are conceptually wrong, too restrictive, too lenient, or logically inconsistent. The extent of these problems is reduced by having subject-matter experts specify the edits. The problems are not eliminated, however, because many surveys involve large questionnaires and require hundreds of edits, which makes their specification a very demanding task. As a check, a proposed set of edits is applied to test data with known errors before application to real data. Missing edits or logically inconsistent edits, however, may not be detected. Problems in the edits, if discovered during the actual editing or even after it, cause editing to start anew after their correction, leading to delays and incurring larger costs than expected. Any method or procedure that assists in the efficient specification of edits is therefore welcome. The final result of an SDE process is the production of clean data as well as an indication of the underlying causes of errors in the data. Usually, editing software is able to produce reports indicating frequent errors in the data. The analysis of such reports allows the researcher to investigate the causes of data error generation and to improve the results of future surveys in terms of data quality. The elimination of sources of errors in a survey allows a data-collecting agency to save money.
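To make the notion of a set of edits concrete, the sketch below expresses each edit as a named predicate over a record and applies the whole set to test records with known errors. The field names and rules are hypothetical and not taken from any real survey.

```python
# Each edit is a (name, predicate) pair; a record FAILS an edit when the predicate is True.
EDITS = [
    ("age_range",       lambda r: not (0 <= r["age"] <= 120)),
    ("minor_married",   lambda r: r["age"] <= 15 and r["marital_status"] == "married"),
    ("income_negative", lambda r: r["income"] < 0),
]

def failed_edits(record):
    """Return the names of all edits that the record fails."""
    return [name for name, fails in EDITS if fails(record)]

# Applying the edit set to test records with known errors:
test_records = [
    {"age": 34, "marital_status": "married", "income": 30000},   # clean
    {"age": 14, "marital_status": "married", "income": -5},      # fails two edits
]
for r in test_records:
    print(r, "->", failed_edits(r) or "passes all edits")
```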
SDE Activities

SDE concerns two different aspects of data quality, namely data validation (the activity concerning the correction of logical errors in the data) and data imputation (the activity concerning the imputation of correct values once errors in the data have been localized). Whenever missing values appear in the data, missing data treatment is part of the data imputation process performed in the framework of SDE.
Types of Editing

It is possible to distinguish among different kinds of editing activities:

•	Micro Editing: concerns the separate examination of each single record, aimed at examining the logical consistency of the data it contains using a mathematical formalization of the automation of SDE.
•	Macro Editing: concerns the examination of the relationships between a given data record and the others, in order to account for the possible presence of errors. A classical example of macro editing is outlier detection. It consists of examining the proximity between a data value and some measure of location of the distribution to which it belongs (a minimal illustration is sketched after this list). The literature on outlier detection methods is vast, and the reader may refer to any of the classical texts on the subject (Barnett & Lewis, 1994). For compositional data, a common outlier detection approach is provided by the aggregate method, aimed at identifying suspicious values (i.e., possible errors) in the total figures and drilling down to their components to figure out the sources of errors. Other approaches are based on the use of data visualization tools (De Waal, 2000) as well as on statistical models describing the change of data values over time or across domains (Revilla & Rey, 2000).
•	Selective Editing: can be seen as a hybrid between micro and macro editing. Here, the most influential among the records that need imputation are identified, and the correction is made by human operators, whereas the remaining records are automatically imputed by the computer. Influential records often are identified by looking at the characteristics of the corresponding sample unit (e.g., large companies in an industry survey) or by applying the Hidiroglou-Berthelot score variable method (Hidiroglou & Berthelot, 1986), taking account of the influence of each subset of observations on the estimates produced for the whole data set.
•	Significance Editing: a variant of selective editing introduced by Lawrence and McKenzie (2000). Here, the influence of each record on the others is examined at the moment the record is being processed and not after all records have been processed.
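The simple macro-editing check referenced in the list above might look as follows: values are flagged when they lie far from a robust location measure (the median), with the spread measured by the median absolute deviation. The factor of 5 and the example data are illustrative assumptions, not part of the cited methods.

```python
def flag_outliers(values, factor=5.0):
    """Flag values whose distance from the median exceeds factor * MAD."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    mad = sorted(abs(v - median) for v in values)[len(values) // 2]  # median absolute deviation
    scale = mad if mad > 0 else 1.0
    return [v for v in values if abs(v - median) > factor * scale]

turnover = [120, 135, 128, 131, 9999, 126, 133]   # one suspicious reported value
print(flag_outliers(turnover))                     # -> [9999]
```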
MAIN THRUST

The editing literature does not contain many relevant suggestions. The Fellegi and Holt method is based on set-theoretic concepts that help to perform several steps of the process more efficiently. This method represents a milestone, since all the recent contributions aim at improving (even in a small part) the Fellegi-Holt method, with particular attention to its computational issues.
The Fellegi-Holt (FH) Method

Fellegi and Holt (1976) provided a solid mathematical model for SDE, in which all edits reside in easily maintained tables. In conventional editing, thousands of lines of if-then-else code need to be maintained and debugged. In the Fellegi-Holt (FH) model, an edit is a set of points determined by edit restraints, and an edit is failed if a record falls in that set of points. Generally, discrete restraints have been defined for discrete data and linear inequality restraints for continuous data. An example for continuous data is a linear restraint of the form Σ_i a_ij x_i ≤ C_j, for j = 1, 2, …, n, whereas for discrete data, edits can be specified in the form {Age ≤ 15, marital status = Married}. If a record r falls in the set of restraints defined by the edit, then the record fails the edit. It is intuitive that one field (variable) in a record r must be changed for each failing edit. There is a major difficulty: if the fields (variables) associated with failing edits are changed, then other edits that did not fail originally may fail. The code of the main mathematical routines in the FH model can be maintained easily. It is possible to check the logical validity of the system prior to the receipt of data. In one pass through the data of an edit-failing record, it is possible to fill in and change values of variables so that the record satisfies all edits. Checking the logical validity often is referred to as determining the consistency or logical consistency of a set of edits. The three goals of the FH method are as follows:

1.	The data in each record should be made to satisfy all edits by changing the fewest possible variables.
2.	Imputation rules should derive automatically from edit rules.
3.	When imputation is necessary, it should maintain the joint distribution of variables.
Goal 1 is referred to as the error localization problem. To solve it, FH require the generation of all implicit and explicit edits. Explicit edits are specified by the user (or subject-matter expert) according to the nature of the variables, whereas implicit edits are derived (or generated) from the set of explicit edits. If an implicit edit fails, then necessarily at least one of the explicit edits used in its generation fails. The main insight of the FH method is the proof of a theorem stating that it is always possible to find a set of fields to change in a record such that the changed record satisfies all edits. If a complete set of implicit edits can be logically derived prior to editing, then the integer programming routines that determine the minimal number of fields to change in a record are relatively fast. Generally, it is difficult to derive all implicit edits prior to editing (Garfinkel et al., 1986). When most of the implicit edits are available, an efficient way to determine the approximately minimal number of fields to change is described in Winkler and Chen (2002). Fellegi and Sunter (1969) showed that implicit edits provide information about edits that do not fail originally but may fail as a record is changed.
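A brute-force sketch of the error-localization idea behind Goal 1: search for the smallest set of fields that, when allowed to change, admits values satisfying all edits. Real FH systems rely on implicit edits and integer programming rather than this exhaustive search; the domains and edits below are hypothetical.

```python
from itertools import combinations, product

# Hypothetical categorical fields with small domains, and two explicit edits.
DOMAINS = {"age_group": ["child", "adult"],
           "marital_status": ["single", "married"],
           "employment": ["employed", "unemployed", "not_in_labor_force"]}
EDITS = [
    lambda r: r["age_group"] == "child" and r["marital_status"] == "married",
    lambda r: r["age_group"] == "child" and r["employment"] == "employed",
]

def satisfies_all(record):
    return not any(edit(record) for edit in EDITS)

def localize_errors(record):
    """Return a minimal set of fields whose values can be changed to satisfy all edits."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):                 # try the fewest fields first
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if satisfies_all(candidate):
                    return set(subset)
    return None

failing = {"age_group": "child", "marital_status": "married", "employment": "employed"}
print(localize_errors(failing))   # e.g., {'age_group'}: changing one field fixes both edits
```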
Improving the Speed of the Implicit Edit Generation

Systems employing the FH method have been developed mainly for categorical variables. The main problem connected with them is implicit edit generation, since the computational time is a steeply exponential function of the number of explicit edits. A common but not completely satisfactory solution is to split the set of explicit edits into certain subsets and generate implicit edits separately for each subset. The editing systems employing the FH method for categorical variables usually work by splitting the set of explicit edits; such systems are used in Italy (Barcaroli & Venturi, 1997), Spain, and Canada. Garfinkel et al. (1986) provided an algorithm for reducing the amount of computation required for implicit edit generation. The reduction is achieved by identifying in advance, for each candidate set of contributing edits and generating field, those subsets that have a possibility of producing the maximal number of new edits. These subsets, called prime covers, are groups of edits that do not have any subsets with the same properties. For each set of contributing edits, there may exist more than one prime cover. Nevertheless, these methods often fail to produce all implicit edits when dealing with survey questionnaires presenting complicated skip patterns. These algorithms have been implemented by the US Census Bureau.
Error Localization Using an Incomplete Set of Edits

Some approaches do not generate the complete set of edits but attempt to perform error localization on the basis of an incomplete set of edits. Winkler and Chen (2002) provide a heuristic for performing error localization iteratively when some implicit edits are missing. In practice, starting from a subset of implicit edits, it is possible to detect new ones on the basis of the explicit edits failed by a given data record. The error localization process stops as soon as a given data record does not fail any explicit edit. In the case of edits involving numerical variables, great difficulties arise in the generation of implicit edits. Efforts have concentrated only on linear and ratio edits, applying the Chernickova (1964) algorithm. Some slight modifications of this algorithm have been introduced and implemented in software used by major statistical agencies (Canada, the Netherlands, USA). Other approaches are based on statistical tools, in particular tree-based models (Conversano & Cappelli, 2002) and nearest-neighbor methods (Bankier et al., 2000).
Careful Design and Evaluation of the Set of Query Edits

The design of the entire set of edits is of particular importance in obtaining an acceptable cost/benefit outcome of editing. The edits have to be coordinated for related items and adapted to the data to be edited. Probable measures for improving current processes are relaxing bounds, by replacing subjectively set limits with bounds based on statistics from the data to be edited, and removing edits that, in general, only produce unnecessary flags. It should be noted that there is a substantial dependence between edits and errors for related items. Furthermore, the edits have to be targeted on the specific error types of the survey, not merely on possible errors.
SDE for Exceptionally Large Files

In the framework of Knowledge Discovery from Databases (KDD) and data mining, a model-based data editing procedure has been proposed by Petrakos et al. (2004). It uses recursive partitioning procedures to perform SDE. This particular approach to editing results in
a fully automated procedure called TreeVal, which is cast in the framework of the Total Quality Management (TQM) principles (plan, do, check, act) of the well-known Deming Cycle and can be used for the derivation of edits, which does not require, at least initially, the help of subject matter experts. Considering periodic surveys, the survey database is the database storing the data derived for previous similar surveys. It is assumed to contain clean data; namely, data that were validated in the past. Instead, incoming data contain cases that must be validated before being included into the survey database. In order to simplify the SDE process, the validation procedure is not applied on the data of the whole survey tout court, but on strata of cases/variables of the survey database that are selected according to a specific data selection and validation planning (plan). These strata concur to define the pilot dataset, which is used to derive edits using tree-based methods (do) with the FAST algorithm (Mola & Siciliano, 1997). The corresponding strata of the incoming data are selected (validation sample) and are validated using edits derived in the previous step (check). The clean validated data are stored in the survey database and can be used to edit subsequent strata (act).
FUTURE TRENDS

Nowadays, it is better to consider editing as part of the total quality improvement process, not the whole quality process. In fact, editing alone cannot detect all errors and definitely cannot correct all mistakes committed in survey design, data collection, and processing. For future applications of SDE, also in business intelligence organizations, it is possible to specify the following roles for editing, in priority order:

1.	Identify and collect data on problem areas and error causes in data collection and processing, producing the basis for the (future) improvement of the survey vehicle.
2.	Provide information about the quality of data.
3.	Identify and handle concrete important errors and outliers in individual data.
Besides its basic role to eliminate fatal errors in data, SDE should highlight, not conceal, serious problems in the survey vehicle. The focus should be on the cause of an error, not on the particular error, per se.
CONCLUSION

Since data need to be free of errors before they are analyzed statistically, SDE appears to be an indispensable activity wherever data are collected. But SDE should not be seen as a stand-alone activity, since it should be integrated with the collection, processing, and estimation of data. In addition to the goals of data correction and imputation, a main task of SDE is to provide a basis for designing measures to prevent errors. Focusing on the recent uses of SDE, it emerges that the paradigm "the more (and tighter) checks, the better the data quality" is not always valid, since, at the moment, no editing method exists that clearly outperforms the others. When performing SDE, the entire set of query edits should be designed meticulously, be focused on errors influencing the estimates, and be targeted at existing error types that can be identified by edits. The effects of the edits should be evaluated continuously by analysis of performance measures and other diagnostics, which the process can be designed to produce.
REFERENCES

Bankier, M., Lachance, M., & Poirier, P. (2000). 2001 Canadian Census minimum change donor imputation methodology [working paper n. 17]. UN/ECE Work Session on Statistical Data Editing. Retrieved from http://amrads.jrc.cec.eu.int/k-base/papers
Barcaroli, G., & Venturi, M. (1997). DAISY (design, analysis and imputation system): Structure, methodology and first applications. In J. Kovar & L. Granquist (Eds.), Statistical data editing (pp. 40-51). U.N. Economic Commission for Europe, Geneva, Switzerland.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data. New York: Wiley.
Bethlehem, J.G., Hundepool, A.J., Schuerhoff, M.H., & Vermeulen, L.F.M. (1989). BLAISE 2.0: An introduction. Voorburg, the Netherlands: Central Bureau of Statistics.
Chernickova, N.V. (1964). Algorithm for finding a general formula for the non-negative solutions of a system of linear inequalities. USSR Computational Mathematics and Mathematical Physics, 4, 151-158.
Conversano, C., & Cappelli, C. (2002). Missing data incremental imputation through tree-based methods. Proceedings of Computational Statistics, Physica-Verlag, Heidelberg.
De Waal, T. (2000). A brief overview of imputation methods applied at Statistics Netherlands. Netherlands Official Statistics, 3, 23-27.
Fellegi, I.P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71, 17-35.
Fellegi, I.P., & Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Freund, R.J., & Hartley, H.O. (1967). A procedure for automatic data editing. Journal of the American Statistical Association, 62, 341-352.
Garfinkel, R.S., Kunnathur, A.S., & Liepins, G.E. (1986). Optimal imputation for erroneous data. Operations Research, 34, 744-751.
Hidiroglou, M.A., & Berthelot, J.M. (1986). Statistical editing and imputation for periodic business surveys. Survey Methodology, 12, 73-84.
Lawrence, D., & McKenzie, R. (2000). The general application of significance editing. Journal of Official Statistics, 16, 943-950.
Mola, F., & Siciliano, R. (1997). A fast splitting algorithm for classification trees. Statistics and Computing, 7, 209-216.
Naus, J.I., Johnson, T.G., & Montalvo, R. (1972). A probabilistic model for identifying errors in data editing. Journal of the American Statistical Association, 67, 943-950.
Nordbotten, S. (1965). The efficiency of automatic detection and correction of errors in individual observations as compared with other means for improving the quality of statistics. Proceedings of the 35th Session of the International Statistical Institute, Belgrade.
Petrakos, G. et al. (2004). New ways of specifying data edits. Journal of the Royal Statistical Society, 167, 249-264.
Pritzker, L., Ogus, J., & Hansen, M.H. (1965). Computer editing methods: Some applications and results. Proceedings of the 35th Session of the International Statistical Institute, Belgrade.
Revilla, P., & Rey, P. (2000). Analysis and quality control for ARMA modeling [working paper]. UN/ECE Work Session on Statistical Data Editing. Retrieved from http://amrads.jrc.cec.eu.int/k-base/papers
Winkler, W.E. (1999). State of statistical data editing and current research problems [working paper]. UN/ECE Work Session on Statistical Data Editing. Retrieved from http://amrads.jrc.cec.eu.int/k-base/papers
Winkler, W.E., & Chen, B.C. (2002). Extending the Fellegi-Holt model of statistical data editing. Statistical Research Report 2002/02. Washington, D.C.: US Bureau of the Census.

KEY TERMS

Data Checking: Activity through which the correctness conditions of the data are verified. It also includes the specification of the type of error or condition not met, and the qualification of the data and its division into error-free and erroneous data. Data checking may be aimed at detecting error-free data or at detecting erroneous data.
Data Editing: The activity aimed at detecting and correcting errors (logical inconsistencies) in data.
Data Imputation: Substitution of estimated values for missing or inconsistent data items (fields). The substituted values are intended to create a data record that does not fail edits.
Data Validation: An activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values.
Editing Procedure: The process of detecting and handling errors in data. It usually includes three phases: the definition of a consistent system of requirements, their verification on given data, and the elimination or substitution of data that are in contradiction with the defined requirements.
Error Localization: The (automatic) identification of the fields to impute in an edit-failing record. In most cases, an optimization algorithm is used to determine the minimal set of fields to impute so that the final (corrected) record will not fail edits.
Explicit Edit: An edit explicitly written by a subject matter specialist.
Implicit Edit: An unstated edit derived logically from explicit edits that were written by a subject matter specialist.
Logical Consistency: Verifying whether a given logical condition is met. It usually is employed to check qualitative data.
Statistical Metadata in Data Processing and Interchange

Maria Vardaki, University of Athens, Greece
INTRODUCTION

The term metadata is frequently encountered in many different sciences. Statistical metadata is a term generally used to denote data about data. Modern statistical information systems (SIS) use metadata templates or complex object-oriented metadata models, making extensive and active use of metadata. Complex metadata structures cannot be stored efficiently using metadata templates. Furthermore, templates do not provide the necessary infrastructure to support metadata reuse. On the other hand, the benefits of metadata management also depend on a software infrastructure for extracting, integrating, storing, and delivering metadata. Organizational aspects, user requirements, and constraints created by existing data warehouse architectures lead to a conceptual architecture for metadata management, based on a common, semantically rich, object-oriented data/metadata model, integrating the main steps of data processing and covering all aspects of data warehousing (Poole et al., 2002).
BACKGROUND

Metadata and metainformation are two terms widely used interchangeably in many different sciences and contexts. In all those cases, these terms are defined as data about data; that is, metadata are every piece of information needed for someone to understand the meaning of data. Until recently, metainformation usually was held as table footnotes. This was due to the fact that the data producer and/or consumer had underestimated the importance of this kind of information. When the need to capture metadata in a prearranged format became evident, the use of metadata templates was proposed. This was the first true attempt to capture metadata in a structured way. The advantage of this approach was the reduced chance of having ambiguous metadata, as each field of the templates was well documented. Templates succeed in capturing metadata in a structured way. However, they have limited semantic power, as they cannot natively express the semantic links between the various pieces of metainformation.
To capture the semantics of metainformation, a metadata model must be used. In this case, metainformation is modeled as a set of entities, each having a set of attributes. The real advantage comes from the fact that these entities are interrelated. This enables the user to follow navigation-style browsing in addition to the traditionally used, label-based search. Froeschl (1997) created an object-oriented model for storing and manipulating metadata, and a number of European projects deal with the development of metadata models and their subsequent integration into statistical information systems. Currently, automated statistical information systems allow for complex data aggregations, yet they provide no assistance in metadata manipulation. To further increase the benefits of using metadata, attempts have been made to establish ways of automating the processing of statistical data. The main idea behind this task is to translate the meaning of data into a computer-understandable form. A way of achieving this goal is by using large, semantically rich statistical data/metadata models like the ones developed in Papageorgiou et al. (2001a, 2001b, 2002). However, in order to minimize compatibility problems between dispersed systems, the need that emerges is to build an integrated metadata model to manage data usage in all stages of information processing. The quantifiable benefits that have been proven through the integration of data mining with current information systems will be greatly increased if such an integrated model is implemented. This is reinforced by the fact that both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute-force navigation of data is not enough. Such an integrated model was developed in Vardaki & Papageorgiou (2004), where it was demonstrated that such a generally applicable model, keeping information about the storage and location of information as well as the data processing steps, is essential for data mining requirements. Other related existing work focuses either mainly on data operations (Denk et al., 2002) and OLAP databases (Scotney et al., 2002; Shoshani, 2003) or on semantically rich data models used mainly for data capturing purposes. In these cases, the authors focus their attention on data
manipulations and maximization of the performance of data aggregations.
MAIN THRUST

This paper aims to summarize some of the latest results of research in the area of metadata. Topics that are covered include a possible categorization of statistical metadata, the benefits of using structured metainformation, standardization, metadata databases, modeling of metainformation, and the integration of metadata in statistical information systems.
Types of Metadata

In the literature, a number of categories have been proposed to classify metainformation according to different criteria. The following division into four overlapping categories (Papageorgiou et al., 2000) is proposed, since the partitioning criterion is the role that metainformation plays during the life cycle of a survey.

•	Semantic Metadata: These are the metadata that give the meaning of the data. Examples of semantic metadata are the sampling population used, the variables measured, the nomenclatures used, and so forth.
•	Documentation Metadata: This is mainly text-based metainformation (e.g., labels), which is used in the presentation of the data. Documentation metadata are useful for creating user-friendly interfaces, since semantic metadata are usually too complex to be presented to the user. Usually, an overlap between the semantic and documentation metadata occurs.
•	Logistic Metadata: These are miscellaneous metadata used for manipulating the data sets. Examples of logistic metadata are the data's URL, the type of RDBMS used, the format and version of the files used, and so forth. Mismatches in logistic metadata are easily discovered, since the information tools in use immediately produce error messages. However, many times logistic metadata can be corrected only by specialized personnel.
•	Process Metadata: Process metadata are the metadata used by information systems to support metadata-guided statistical processing. These metadata are transparent to the data consumer and are used in data and metadata transformations.
Benefits of Using Metadata

Even though competition requires timely and sophisticated analysis on an integrated view of the data, there is
a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. The benefits of using metadata are several. Some of the most important can be summarized as follows. By capturing metadata in a structured way and providing a transformations framework, computers are enabled to process metadata and data at the same time. Thus, the possibility of human errors is minimized, since user intervention is generally not necessary. Furthermore, the possibility of errors is reduced by the fact that metadata can be used by computers for asserting data manipulations. For example, a metadata-enabled statistical software package can warn the user of a possible error when adding two columns that use different measure units. Finally, errors due to misunderstanding of footnotes are eliminated, since structured metadata are unambiguously defined (Froeschl, 1997). Hence, it is easy to show that metadata are important for assuring high levels of data quality at a low cost. However, it should be noted that the benefits of using metadata are subject to the quality of the metadata.
Metadata Standards Affecting Quality of Results

During the design of a survey, the statistician implicitly produces metainformation. Usually, for small non-periodic surveys, the statistician might choose to use an ad hoc solution. However, for large periodic surveys, the statistician will definitely follow a standard. Depending on the authority defining a standard, we can identify three types of metadata standards:

•	Ad Hoc (Internal) Standards: These are defined internally by each statistical office. Due to the versatility of a small statistical office, these standards are highly adaptive to the latest needs of the data consumers. However, the compatibility of an internal standard with the internal standard of a different office is not guaranteed.
•	National Standards: These are defined by the National Statistical Institutes of each country. Although they may not be as current as their respective internal statistical standards, they offer statistical data compatibility at the country level, which is the level that most interests data consumers.
•	International Standards: These might be nomenclatures or classifications that are defined by supranational organizations such as OECD and Eurostat. The usage of international standards provides the maximum inter-country compatibility for the captured data. However, the approval of an international standard is a time-consuming process. In any case, international standards have a high level of data comparability.
Metadata Modeling

The design of a data/metadata model is the most important step in the creation of a SIS. If the model is undersized, it will be incapable of holding important metadata, thus leading to problems due to missing metainformation. On the other hand, if it is oversized, it will keep information that is captured, rarely used, and never updated, thus leading to a severe waste of resources. Obviously, an oversized model is also difficult to implement or to be used by the Institute's personnel. However, it is difficult to predict the needs of data consumers, as the amount of required metainformation depends on the application under consideration. In any case, a metadata model should at least capture a minimum set of semantic, documentation, logistic, and process metadata (Papageorgiou et al., 2001b). Apart from choosing what metainformation is worth capturing, there is an additional difficulty in choosing the most appropriate modeling technique. For example, enhanced entity-relationship (E-R) models were developed some years ago but, in turn, proved to lack inheritance relationships. Then the object-oriented (O-O) paradigm (Papazoglou et al., 2000) started to be used, where the statistical metadata model is described in the Unified Modeling Language (UML) (OMG, 2002) to ensure flexibility and a better representation of two-level interdependencies (class-attribute). It is recommended that an integrated, semantically rich, platform-independent statistical metadata model be designed to cover the major stages of statistical information processing (data collection and analysis, including harmonization, processing of data and metadata, and the dissemination/output phases), which can minimize the complexity of data warehousing environments and compatibility problems between distributed databases and information systems. The main benefits of a metadata model designed in UML are its flexibility and the fact that the same conceptual model may be used to generate different XML schemas and XMI for the data level, as well as other representations. Even if the world embraces a new technological infrastructure tomorrow, the initial conceptual UML model remains the same or can easily be updated.
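As an illustration of what a small object-oriented statistical metadata model can look like at the implementation level, the sketch below defines a few interrelated classes covering semantic, documentation, and logistic metadata. The class names, attributes, and example values are invented for this sketch and do not reproduce the models of the cited papers.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Classification:
    """Semantic metadata: a nomenclature or classification (e.g., a national standard)."""
    name: str
    version: str
    categories: List[str]

@dataclass
class Variable:
    """A measured variable, linked to the classification and measure unit it uses."""
    name: str
    label: str                                 # documentation metadata
    measure_unit: str
    classification: Optional[Classification] = None

@dataclass
class DataSet:
    """A statistical data set with its variables and logistic metadata (location, format)."""
    title: str
    variables: List[Variable] = field(default_factory=list)
    url: str = ""                              # logistic metadata
    file_format: str = "CSV"

# Example instances (hypothetical values):
nace = Classification("NACE Rev. 1.1", "1.1", ["A", "B", "C"])
turnover = Variable("turnover", "Annual turnover", "EUR", nace)
survey = DataSet("Structural Business Survey", [turnover], url="http://example.org/sbs.csv")
```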
Metadata Databases and User Interfaces

Metadata is widely considered promising for improving the effectiveness and efficiency of data warehouse usage, development, maintenance, and administration. Data warehouse usage can be improved because metadata provides end users with the additional semantics necessary to
reconstruct the business context of data stored in a data warehouse (Agosta, 2000). Initially, the size of metadata did not justify the creation of specialized metadata databases. However, the current size and complexity of metainformation has created a need for building specialized metadata databases. Although metadata are data themselves, there are a number of key differences between a database and a metadata database. Existing databases (relational and object-oriented) are good at providing fast access over indexed sets of data. Computer experts designed existing commercial products such as ORACLE and ObjectStore aiming at performance and fault tolerance. A good database management system (DBMS) is capable of hundreds of transactions per second. However, even an excellent database cannot easily be used for storing metadata. Metainformation such as classifications, nomenclatures, and definitions is constantly revised. Consequently, a metadata database should integrate a strong revision control system. Furthermore, a metadata database should be optimized for small transactions. Most of the time, metadata queries are restricted to keyword searches. Internet full-text search engines quickly answer these kinds of queries, yet existing DBMSs fail to give a prompt answer, as databases are not optimized for random keyword searches. It is obvious that the best solution is to create a DBMS able to handle both data and metadata simultaneously. Last but not least, metadata databases suffer from the lack of a broadly accepted metadata query language. For example, every computer scientist knows how to retrieve data from a database using a language known as structured query language (SQL). However, this language cannot be used in metadata databases. Regarding the user interfaces, a recommended approach would be to use Internet Explorer as an interface. Using this interface, a navigation structure is built within the metadata model, allowing the user to easily retrieve related metadata entities simply by following a hyperlink. Apart from the hyperlink-based interfaces, it is important that the metadata database provides a way of random-access retrieval of metainformation. This can be achieved using a keyword-based full-text search engine. In this case, the user can start browsing the metadata via a keyword search and then locate the specific metainformation that the user is looking for, using the hyperlink-based navigation structure.
Metadata Guided Statistical Processing

Metadata-guided statistical processing is based on a kind of metadata algebra. This algebra consists of a set
of operators (transformations) operating on both data and metadata simultaneously. For each transformation, one must specify the preconditions that must be satisfied for the transformation to be valid, as well as a detailed description of the output of the transformation (data and metadata) (Papageorgiou et al., 2000). However, it is really difficult to find a set of transformations that can describe every possible manipulation of statistical data. Research in the area of transformations is now focusing on two problems. The first is that many transformations cannot be fully automated, as they require human intervention while testing for the specific transformation preconditions. The second is that transformations are highly sensitive to the quality of the metadata. Nevertheless, the use of transformations provides so many benefits that it is most likely to be an indispensable feature of the new generation of statistical information systems.
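A minimal sketch of a transformation in this spirit: the operator checks a precondition on the metadata (compatible measurement units, echoing the earlier example) before operating on the data, and returns the metadata of its output together with the result. The structure is an assumption made for illustration, not the algebra of the cited work.

```python
def add_columns(col_a, col_b, meta_a, meta_b):
    """Add two data columns only if their metadata say the measure units match."""
    # Precondition on metadata: the transformation is invalid across different units.
    if meta_a["measure_unit"] != meta_b["measure_unit"]:
        raise ValueError(f"precondition failed: {meta_a['measure_unit']} vs {meta_b['measure_unit']}")
    data = [a + b for a, b in zip(col_a, col_b)]
    # Output description: the derived data plus the metadata of the derived column.
    meta = {"label": f"{meta_a['label']} + {meta_b['label']}",
            "measure_unit": meta_a["measure_unit"]}
    return data, meta

domestic = [10.0, 12.5]
exports = [3.0, 4.5]
total, total_meta = add_columns(domestic, exports,
                                {"label": "Domestic sales", "measure_unit": "kEUR"},
                                {"label": "Exports", "measure_unit": "kEUR"})
print(total, total_meta)
```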
Figure 1. The architecture of a metadata-enabled Web site
Case Study

In this section, we describe the architecture of a metadata-enabled Web site/portal. We have selected this example since the Web is an efficient way of reducing the time and cost needed to capture, produce, and disseminate information. Users typically interact with Web sites either by following a predefined set of links (navigation) or by making keyword-driven searches (Papageorgiou et al., 2000). Both ways of browsing require the use of extensive amounts of metadata. Figure 1 shows the architecture of a Web site incorporating a statistical metadata model. The end user initially browses the site pages, which are created dynamically using documentation and semantic metadata. Consequently, it is important that these metadata are stored locally in XML/RDF files or in a metadata database, so that they can be retrieved promptly. This is in contrast to the actual statistical data, which do not need to be stored locally and may be distributed over two or more relational or OLAP (Online Analytical Processing) databases. A full-text search engine provides the GUI engine with the ability to support keyword searches, whereas the engine converts the relationships of the model into Web links to support a navigational style of browsing the metainformation. Apart from providing metadata to users, the site also allows for the retrieval of ad hoc statistical tables. In this case, the user submits a query, either by using a graphical interface or by using a proprietary query language. The query is input to a query rewrite system, which attempts to find a sequence of transformations (a plan) that can be used to construct the requested table. The rewrite engine uses process metadata and constructs the plan by satisfying any requested criterion and, subsequently, the time and space constraints.
The resulting transformations plan is forwarded to the transformations execution engine, which retrieves the needed data and evaluates the plan. This is the first point where physical access to the data is needed. Therefore, apart from the plan, the engine must be supplied with all the relevant logistic metadata that designate how the requested data can be retrieved from the appropriate database. Finally, the results are returned to the user as the contents of an HTML page, which is created by the table designer subsystem. This module collects the results of the query (data and metadata) and constructs the answer returned to the user, either as a pivot table or as a graph.
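The plan-finding step can be pictured with a small Python sketch. This is an illustration only, assuming hypothetical transformation descriptions (name, precondition, effect) and a toy registry; it is not the rewrite engine described above.

REGISTRY = {
    "monthly_sales": {"periodicity": "monthly"},
}

# Each transformation: (name, precondition on metadata, effect on metadata).
TRANSFORMATIONS = [
    ("aggregate_to_quarterly", {"periodicity": "monthly"}, {"periodicity": "quarterly"}),
    ("aggregate_to_yearly", {"periodicity": "quarterly"}, {"periodicity": "yearly"}),
]

def find_plan(start_meta, target_meta, depth=3):
    """Breadth-first search for a sequence of transformations whose
    preconditions hold step by step and whose result matches the target."""
    frontier = [(start_meta, [])]
    for _ in range(depth):
        next_frontier = []
        for meta, plan in frontier:
            if all(meta.get(k) == v for k, v in target_meta.items()):
                return plan
            for name, pre, effect in TRANSFORMATIONS:
                if all(meta.get(k) == v for k, v in pre.items()):
                    next_frontier.append((dict(meta, **effect), plan + [name]))
        frontier = next_frontier
    return None

print(find_plan(REGISTRY["monthly_sales"], {"periodicity": "yearly"}))
# -> ['aggregate_to_quarterly', 'aggregate_to_yearly']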
FUTURE TRENDS

Future work includes exploiting the potential of metadata to improve data quality as a consequence of transformations handling, as well as integrating the proposed model with relevant data warehousing and OLAP technologies. The automation of statistical processing therefore calls for the derivation of quality metrics that can subsequently be used during the plan optimization and selection stage. Other future research can concentrate on how to integrate metadata that cannot be extracted from data warehousing components but resides in various other sources. The harmonization of the various metadata terminologies, as well as semantics integration and mappings, is also very promising. Similarly, the implementation of an integrated, commonly shared metamodel is essential in order to bridge the gap between the syntax and semantics of metadata representation in the various software components of a data warehousing system. This would lead to a possible unified adoption of standards for data/metadata exchange, since nearly every software house uses a different metadata model that is most of the time compatible only with software provided by the same vendor.
CONCLUSION

Structured metadata are indispensable for the understanding of statistical results. The development of a flexible and integrated metadata model is essential for the improvement of a statistical information system (SIS) and is widely considered promising for improving the effectiveness and efficiency of managing data warehouse environments. The support of automated, metadata-guided processing is vital for the next generation of statistical Web sites, as well as for asserting data quality.
REFERENCES

Agosta, L. (2000). The essential guide to data warehousing. Upper Saddle River, NJ: Prentice Hall.

Denk, M., Froeschl, K.A., & Grossmann, W. (2002). Statistical composites: A transformation-bound representation of statistical datasets. Proceedings of the Fourteenth International Conference on Scientific and Statistical Database Management (SSDBM'02), Edinburgh, UK.

Froeschl, K.A. (1997). Metadata management in statistical information processing. Wien, Austria: Springer.

OMG. (2002). Unified modeling language. Retrieved from http://www.omg.org/uml/

Papageorgiou, H., Pentaris, F., Theodorou, E., Vardaki, M., & Petrakos, M. (2001a). Modelling statistical metadata. Proceedings of the Thirteenth International Conference on Scientific and Statistical Database Management (SSDBM), Fairfax, Virginia.

Papageorgiou, H., Pentaris, F., Theodorou, E., Vardaki, M., & Petrakos, M. (2001b). A statistical metadata model for simultaneous manipulation of data and metadata. Journal of Intelligent Information Systems (JIIS), 17(2/3), 169-192.

Papageorgiou, H., Vardaki, M., & Pentaris, F. (2000). Data and metadata transformations. Research in Official Statistics (ROS), 3(2), 27-43.

Papageorgiou, H., Vardaki, M., Theodorou, E., & Pentaris, F. (2002). The use of statistical metadata modelling and related transformations to assess the quality of statistical reports. Proceedings of the Joint UNECE/Eurostat Seminar on Integrated Statistical Information Systems and Related Matters (ISIS 2002), Geneva, Switzerland.

Papazoglou, M.P., Spaccapietra, S., & Tari, Z. (2000). Advances in object-oriented data modeling. Cambridge, MA: MIT Press.

Pedersen, D., Riis, K., & Pedersen, T.B. (2002). A powerful and SQL-compatible data model and query language for OLAP. Proceedings of the Thirteenth Australasian Conference on Database Technologies, Melbourne, Victoria, Australia.

Poole, J., Chang, D., Tolbert, D., & Mellor, D. (2002). Common warehouse metamodel: An introduction to the standard for data warehouse integration. New York: Wiley.

Scotney, B., Dunne, J., & McClean, S. (2002). Statistical database modeling and compatibility for processing and publication in a distributed environment. Research in Official Statistics (ROS), 5(1), 5-18.

Shoshani, A. (2003). Multidimensionality in statistical, OLAP, and scientific databases. In M. Rafanelli (Ed.), Multidimensional databases: Problems and solutions (pp. 46-68). Hershey, PA: Idea Group.

Vardaki, M., & Papageorgiou, H. (2004). An integrated metadata model for statistical data collection and processing. Proceedings of the Sixteenth International Conference on Scientific and Statistical Database Management (SSDBM), Santorini, Greece.

KEY TERMS

Data Interchange: The process of sending and receiving data in such a way that the information content or meaning assigned to the data is not altered during the transmission.
Data Processing: The operation performed on data in order to derive new information according to a given set of rules.

RDF: Stands for Resource Description Framework. RDF is a mechanism to describe data as a list of triples: an object (a resource), an attribute (a property), and a value (a resource or free text).

Standards (International): They refer to the international statistical guidelines and recommendations that have been developed by international organizations working with national agencies. The standards cover almost every field of statistical endeavor, from data collection to processing and dissemination, and almost every statistical subject.
Statistical Metadata: Data about statistical data. Metadata provide information on data and about processes of producing and using data. Metadata describe statistical data and, to some extent, processes and tools involved in the production and usage of statistical data.
Statistical Information System: It is the information system oriented toward the collection, storage, transformation, and distribution of statistical information.
XML: Stands for eXtensible Markup Language and is a specification for computer-readable documents. XML is actually a metalanguage, a mechanism for representing other languages in a standardized way; therefore, XML provides a syntax to encode data.
Storage Strategies in Data Warehouses

Xinjian Lu, California State University, Hayward, USA
INTRODUCTION

A data warehouse stores and manages historical data for on-line analytical processing, rather than for on-line transactional processing. Data warehouses with sizes ranging from gigabytes to terabytes are common, and they are much larger than operational databases. Data warehouse users tend to be more interested in identifying business trends than in individual values. Queries for identifying business trends are called analytical queries. These queries invariably require data aggregation, usually according to many different groupings. Analytical queries are thus much more complex than transactional ones. The complexity of analytical queries combined with the immense size of the data can easily result in unacceptably long response times. Effective approaches to improving query performance are crucial to a proper physical design of data warehouses. One of the factors that affect response time is whether or not the desired values have been pre-computed and stored on disk. If not, then the values have to be computed from base data, and the data has to be both retrieved and processed. Otherwise, only data retrieval is needed, resulting in better query performance. Storing pre-computed aggregations is a very valuable approach to reducing response time. With this approach, two physical design questions exist:

• Which aggregations to pre-compute and store?
• How to structure the aggregations when storing them?
When processing queries, the desired data is read from disk into memory. In most cases, the data is spread throughout different parts of the disk, thus requiring multiple disk accesses. Each disk access involves a seek time, a rotational latency, and a transfer time. Both seek time and rotational latency are setup times for an actual retrieval of data. The organization of data on the disk thus has a significant impact on query performance. The following is another important question:

• How to place data on the disk strategically so that queries can be answered efficiently?

This chapter presents notable answers from existing literature to these questions, and discusses challenges that remain.
BACKGROUND

In a data warehouse, data is perceived by users and often presented to users as multidimensional cubes (Chaudhuri & Dayal, 1997; Datta & Thomas, 1999; Vassiliadis & Sellis, 1999). Attributes are loosely classified as either independent or dependent. Together, the values of the independent attributes determine the values of the dependent ones (Date, 2000). The independent attributes form the dimensions of the cube by which data is organized. Each dependent attribute is known as a measure. These dimensions can be used as addresses for looking up dependent values, similar to the way coordinates describe objects (points, lines, and planes) in a Cartesian coordinate system. Values of a dependent attribute are also called fact values. As an example, we use sales (e.g., dollar amount in thousands) as a measure. Each fact value indicates the sales during a specific time period at a particular store location for a certain type of product. See Figure 1 for a multidimensional presentation of some sample base data.

Figure 1. An illustration of data sources, base data, and views (a variety of data sources feed, through data preparation, a base cube organized along the Product, Location, and Time dimensions, from which views are derived)

Base data is extracted from various transactional sources. In addition to the base data, users are also interested in aggregated sales. For example, what are the total (or average, minimum, and maximum) sales across all locations in each day, for every product name? With the SQL language, this question can be answered using "Group By Day, ProductName". A large number of groupings exist, and each of them may be requested. The result of a grouping is also known as a view, or summary table (Ramakrishnan & Gehrke, 2003, pp. 870-876). A view can be computed from the base cube when requested, which would involve both reading the base data and computing the aggregates. Performance can be improved through pre-computing and physically storing the view. When a view is pre-computed and stored, it is called a materialized view (Harinarayan, Rajaraman, & Ullman, 1996). A view is also perceived as a data cube. A view has a smaller size than that of the base cube, so less time is needed to read it from disk into memory. Further, if the query can be answered using values in the materialized view without any aggregating, the computing time can be avoided. However, in most cases, the number of possible groupings can be so large that materializing all views becomes infeasible due to limited disk space and/or difficulties in maintaining the materialized views.

It should be emphasized that although data is interpreted as multidimensional cubes, it is not necessarily stored using multidimensional data structures. How data is physically stored also affects the performance of a data warehouse. At one end of the spectrum, both the base cube and the materialized views are stored in relational databases, which organize data into tables with rows and columns. This is known as ROLAP (relational OLAP). At the other end of the spectrum, both the base cube and the materialized views are physically organized as multidimensional arrays, an approach known as MOLAP (multidimensional OLAP). These two approaches differ in many ways, with performance and scalability being the most important ones. Another approach, known as HOLAP (hybrid OLAP), tries to balance between ROLAP and MOLAP, and stores all or some of the base data in a relational database and the rest in a multidimensional database (Date, 2000).

Regardless of how data is physically structured (ROLAP, MOLAP, or HOLAP), when answering a query, data has to be read from disk. In most cases, the desired data for an analytical query is a large amount, and it is spread over noncontiguous spots, requiring many disk accesses for one query. Each disk access involves a significant setup time. Hence, data should be partitioned and placed strategically on the disk so that the expected number of disk accesses is reduced (Lu & Lowenthal, 2004).
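To make the notion of a view concrete, a hedged Python sketch using the pandas library is given below; the column names (day, product_name, sales) and the figures are invented for illustration and do not describe any particular product.

import pandas as pd

# Toy base data: one row per (day, store, product) combination.
base = pd.DataFrame({
    "day":          ["2004-01-01", "2004-01-01", "2004-01-02", "2004-01-02"],
    "store":        ["S1", "S2", "S1", "S2"],
    "product_name": ["soap", "soap", "soap", "tea"],
    "sales":        [110, 295, 918, 395],
})

# Materialize the view "Group By Day, ProductName" once and keep it.
view_day_product = (base.groupby(["day", "product_name"], as_index=False)["sales"]
                        .sum())

# A later query for total soap sales per day is answered from the (much smaller)
# materialized view instead of re-aggregating the base data.
answer = view_day_product[view_day_product["product_name"] == "soap"]
print(answer)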
MAIN THRUST
Which Views to Materialize?

Users may navigate along different dimensions, or explore at different levels in the same dimension (for example, date, month, quarter, or year in a Time dimension), trying to discover interesting information. It would be ideal to materialize all possible views. This approach would give the best query performance. For a data cube with N dimensions, if the attribute from each dimension is fixed, there are 2^N - 1 different views. Using [Date, City, Brand] from the dimensions shown in Figure 1, the possible groupings are the following (a small sketch after the list shows how they can be enumerated):

• Date, City, Brand
• Date, City
• Date, Brand
• City, Brand
• Date
• City
• Brand
• (none)
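The groupings above are simply the power set of the chosen dimension attributes, which a few lines of Python can enumerate (an illustration only):

from itertools import combinations

dimensions = ["Date", "City", "Brand"]

# All 2^N subsets of the dimensions, from the full grouping down to "(none)".
groupings = [subset
             for size in range(len(dimensions), -1, -1)
             for subset in combinations(dimensions, size)]

for g in groupings:
    print(", ".join(g) if g else "(none)")
# With one hierarchy level per dimension this prints the eight groupings listed above;
# replacing Date by Month or Quarter generates further sets of views.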
Moreover, a dimension may have one or more hierarchical structures. Replacing Date with Month, another set of views can be generated; and still another set can be generated when Quarter is picked from the Time dimension. It becomes evident that the number of all possible views can be very large, and so will be the disk space needed to store all of them. In a data warehouse environment, new data is loaded from data sources into the data warehouse periodically (daily, weekly, or monthly). With an updated base cube, the materialized views must also be maintained to keep them consistent with the new base data. Reloading the base cube and maintaining the views take a certain amount of time, during which the data warehouse becomes unavailable. This time period must be kept minimal (Mumick, Quass, & Mumick, 1997; Roussopoulos, Kotidis, & Roussopoulos, 1997). Due to the space limit and time constraint, materializing all possible views is not a feasible solution. To achieve a reasonable query performance, the set of views to materialize must be chosen carefully. As stated in Ramakrishnan and Gehrke (2003, p. 853), "In current OLAP systems, deciding which summary tables to materialize may well be the most important design decision."

Harinarayan, Rajaraman, and Ullman (1996) have examined how to choose a good set of views to materialize based on the relational storage scheme. It is assumed that the time to answer a query is proportional to the space occupied by the minimum view from which the query is answered. If the minimum view is not materialized, a materialized view that is a superset of the minimum view is used, resulting in a longer response time. Using this linear cost model, the problem is approached from several different angles (a simplified greedy selection in this spirit is sketched after the list):

• The number of views to materialize is fixed regardless of the space they take. The materialized views are chosen so that the total time to generate each of all possible views once (no matter whether they will be materialized or not) is minimized.
• The views are unlikely to be used with the same frequency. A probability value is associated with each view, and the total weighted amount of time is minimized.
• The amount of space is fixed. The materialized views are selected so that the total weighted time to process each of all possible views is minimized.
• Techniques for optimizing the space-time tradeoff are developed.
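The following Python sketch illustrates a simplified greedy selection in the spirit of this line of work: under the linear cost model, repeatedly materialize the view that yields the largest reduction in total query cost until a space budget is exhausted. The lattice, view sizes, and budget are toy values, and the sketch is not the exact algorithm of Harinarayan, Rajaraman, and Ullman (1996).

# Toy lattice: each view maps to its size (rows) and the views it can answer
# (its descendants in the lattice, including itself).
SIZES = {"DCB": 6000, "DC": 3000, "DB": 2500, "CB": 2000,
         "D": 400, "C": 300, "B": 100, "none": 1}
ANSWERS = {
    "DCB": set(SIZES), "DC": {"DC", "D", "C", "none"}, "DB": {"DB", "D", "B", "none"},
    "CB": {"CB", "C", "B", "none"}, "D": {"D", "none"}, "C": {"C", "none"},
    "B": {"B", "none"}, "none": {"none"},
}

def query_cost(view, materialized):
    # Linear cost model: cost of answering `view` = size of the smallest
    # materialized ancestor that can answer it.
    return min(SIZES[m] for m in materialized if view in ANSWERS[m])

def greedy_select(space_budget):
    materialized = {"DCB"}          # the base cube is always available
    space_used = 0
    candidates = set(SIZES) - materialized
    while True:
        best, best_gain = None, 0
        for v in candidates:
            if space_used + SIZES[v] > space_budget:
                continue
            gain = sum(query_cost(q, materialized) -
                       query_cost(q, materialized | {v}) for q in SIZES)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            return materialized
        materialized.add(best)
        space_used += SIZES[best]
        candidates.remove(best)

print(greedy_select(space_budget=5500))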
In Harinarayan, Rajaraman, and Ullman (1996), the time constraint on maintaining the base cube and the materialized views has not been considered. Mumick, Quass, and Mumick (1997) have proposed incremental maintenance techniques. When new data is loaded into a data warehouse, it will take a long time to re-compute the materialized views from scratch. Incremental maintenance would require a much shorter time (called the batch window), during which the data warehouse is not available to users. The constraints of both disk space and maintenance time should be incorporated into models for identifying views to materialize. Subject to these constraints, there can be many objectives that are relevant to performance improvement, such as the following:

• Minimize the average response time
• Minimize the longest response time
• Minimize the weighted average response time, with weights being the frequencies of queries
• Maximize the number of queries answered within a given amount of time
We believe these issues have yet to be addressed. Some of these objectives conflict with others. In general, a set of materialized views that optimizes one objective would not offer optimal performance for the others. Optimizing one objective subject to constraints on the others is likely to be an effective approach.
How to Store Materialized Views?

The base data and the materialized views are conceptually structured in the form of data cubes only to facilitate the identification and analysis of information in the data. How the base data and the views are physically stored and managed is a separate issue. Nevertheless, as advocated in Thomsen (2002, p. 262), "The problem is to provide fast access and flexible multidimensional manipulation." This problem has led to a strong emphasis on designing for performance.

The ROLAP approach uses relational database systems to store and manage the base data as well as the materialized views. In a relational system, data is physically organized into tables with rows and columns. Data is presented to the users as cubes. Query performance is handled with smart indexes, bitmap indexes being a typical example (O'Neil & Graefe, 1995), and other conventional optimization techniques (Graefe, 1995). The MOLAP approach uses multidimensional database systems to store and manage the base data as well as the materialized views. In a multidimensional system, data is physically organized into cubes, in the same way as it is perceived by the users. In general, ROLAP offers higher scalability (i.e., it is more efficient in managing very large amounts of data), while MOLAP provides better performance. The missing data problem is more serious in a data warehouse environment because the base data is collected from many data sources, which in turn may have values missing. In a ROLAP system, if a measure value is missing, there is no need to have a row that relates the dimensions to the missing value. In a MOLAP system, however, missing values must be dealt with in a different and more difficult fashion because of the pre-allocated cells in a multidimensional data structure. HOLAP tries to balance between performance and scalability (Date, 2000).

We can use a metaphor to illustrate why ROLAP systems offer higher scalability and MOLAP systems provide better query performance. Suppose there are two parking systems A and B. In system A, a car is always parked next to the prior car, and no parking lots are ever left empty between cars. In system B, every car has a pre-assigned lot, and no car is parked at a lot that is not assigned to it. Evidently, it takes longer to find a group of cars in system A than in system B; thus system B offers better "query" performance. However, when it is required to scale up from accommodating cars owned by staff/faculty only to cars owned by students and staff/faculty, system B might not have enough space, while system A may have no problem because not everyone comes to the campus all the time. The differences between ROLAP and MOLAP are similar to those between systems A and B.
It should be noted that the storage approach adopted will also affect the set of views to materialize because of differences between approaches in space and performance. The storage approach should be one of the constraints in the model for identifying which views to materialize. In a data warehouse, most of the data is not well understood. In fact, the purpose of building a data warehouse is for people to analyze and understand the data (Date, 2000, p. 721). This lack of understanding makes semantic modeling (i.e., capturing the meaning of data) a challenging task. With the ROLAP approach, dimension tables store values of independent attributes. Values of dependent attributes are stored in a fact table (Kimball, 1996, pp. 10-14), whose primary key is the composite of foreign keys from the dimension tables. The corresponding conceptual schema is known as a star schema because when drawn as an entity-relationship diagram, the fact table is surrounded by the dimension tables. With the MOLAP approach, data is conceptually modeled as cells of a multidimensional array.
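A toy Python contrast between the two storage styles (illustrative only; real ROLAP and MOLAP engines are far more sophisticated): a relational-style fact table keeps a row only for combinations that occur, while an array-style cube pre-allocates a cell for every possible combination, which gives direct addressing but wastes space on sparse data.

import numpy as np

days, stores, products = ["d1", "d2"], ["s1", "s2", "s3"], ["p1", "p2"]

# ROLAP-like: rows exist only for observed (day, store, product) combinations.
fact_rows = {("d1", "s1", "p1"): 110, ("d2", "s3", "p2"): 395}

# MOLAP-like: a dense array with one pre-allocated cell per possible combination.
cube = np.full((len(days), len(stores), len(products)), np.nan)
index = {name: i for axis in (days, stores, products) for i, name in enumerate(axis)}
for (d, s, p), sales in fact_rows.items():
    cube[index[d], index[s], index[p]] = sales

# Cell lookup is direct addressing in MOLAP, a key lookup (or scan/join) in ROLAP.
print(cube[index["d2"], index["s3"], index["p2"]])   # 395.0
print(fact_rows.get(("d2", "s3", "p2")))             # 395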
How to Partition and Place Data on the Disk Strategically?

The unit for data transfer between disk and main memory is a block; if a single item on a block is needed, the entire block is transferred (Aho & Ullman, 1992, pp. 155-158). First, time is spent moving the disk head to the track that the desired block is on, called seek time. Then, time is needed to wait for the desired block to rotate under the disk head, called rotational latency. Transfer time is the time for the disk to rotate over the block. Seek time and rotational latency comprise a significant portion of a disk access (King, 1990; Lu & Gerchak, 1998). Seek time, rotational latency, and transfer rate are all disk system parameters, which are manufacturer specific and thus cannot be changed after manufacture. However, the size of a disk block can be set when the disk is initialized. In current disk designs, contiguous blocks can be read through one seek and rotation. So if the desired data is stored on contiguous blocks, only one instance of seek time and rotational latency is involved, thus reducing the total time cost significantly. However, in most cases, the desired data for an analytical query is a large amount, and it is spread over many noncontiguous blocks. Placing data on the disk strategically will also help in reducing response time. Assuming the ROLAP approach, Lu and Lowenthal (2004) have formulated a cost model to express the expected time to read the desired data as a function of the disk system's parameters as well as the lengths of the foreign keys. For a predetermined page size, the solution to the model specifies a partition of the records in the fact table to be arranged on the disk, which minimizes the expected response time. An algorithm is then provided for identifying the most desirable disk page size. Partition and arrangement of data on the disk in MOLAP and HOLAP scenarios have yet to be addressed.
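As a back-of-the-envelope illustration of why placement matters (this is not the cost model of Lu and Lowenthal (2004); the timing parameters are invented), compare reading the same number of blocks in one contiguous run versus many scattered runs:

SEEK_MS, ROTATIONAL_LATENCY_MS, TRANSFER_MS_PER_BLOCK = 8.0, 4.0, 0.1

def read_time_ms(total_blocks, contiguous_runs):
    # Each contiguous run pays one setup (seek + rotational latency);
    # every block pays its transfer time.
    setup = contiguous_runs * (SEEK_MS + ROTATIONAL_LATENCY_MS)
    return setup + total_blocks * TRANSFER_MS_PER_BLOCK

print(read_time_ms(1000, contiguous_runs=1))     # one long run:   112.0 ms
print(read_time_ms(1000, contiguous_runs=200))   # scattered data: 2500.0 ms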
FUTURE TRENDS

As discussed in Date (2000), data warehouses are the use of database technology for the purpose of decision support. The large size of the base data and the existence of redundant data (materialized views) have led to a strong emphasis on designing for performance. However, physical designs should not substitute for logical designs. Because a proper logical design of a data warehouse ensures its correctness, maintainability, and usability, it should neither be ignored (which is currently often the case in practice) nor overlooked. There has been a debate on whether or not the relational model is adequate for, or capable of, representing data in data warehouses, but no consensus has been reached yet. Research is still underway to develop a calculus on which a standard multidimensional query language can be based (Thomsen, 2002, pp. 165-200).
CONCLUSION

The performance of commonly asked analytical queries is the ultimate measure of a data warehouse. An important step in achieving good performance is to make good physical storage choices, which has been the focus of this chapter. Organizations have increasingly used the historical data in data warehouses to identify business intelligence in order to support high-level decision making. Keeping the response time of ad hoc interactive queries in data warehouses at a minimum is both important and challenging. Materialized views can make queries run much faster. Although values in data warehouses are understood or modeled collectively as cubes, relational storage schemes and multidimensional arrays can both be used as physical storage methods, each with its own advantages and disadvantages in terms of performance and scalability. Finally, data values should be partitioned and arranged strategically on the disk to minimize the setup time involved in disk accesses.
REFERENCES

Aho, A.V., & Ullman, J.D. (1992). Foundations of computer science. New York: Computer Science Press.

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1), 65-74.

Date, C.J. (2000). An introduction to database systems (7th ed.). New York: Addison-Wesley.

Datta, A., & Thomas, H. (1999). The cube data model: A conceptual model and algebra for on-line analytical processing in data warehouses. Decision Support Systems, 27, 289-301.

Graefe, G. (1995). Query evaluation techniques for large databases. ACM Computing Surveys, 25(2), 73-171.

Harinarayan, V., Rajaraman, A., & Ullman, J.D. (1996). Implementing data cubes efficiently. ACM SIGMOD Conference on the Management of Data, SIGMOD'96 (pp. 205-216), Montreal, Canada.

Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons.

King, R.P. (1990). Disk arm movement in anticipation of future requests. ACM Transactions on Computer Systems, 8, 214-229.

Lu, X., & Gerchak, Y. (1998). Minimizing the expected response time of an idled server on a line. IIE Transactions, 30, 401-408.

Lu, X., & Lowenthal, F. (2004). Arranging fact table records in a data warehouse to improve query performance. Computers & Operations Research, 31, 2165-2182.

Mumick, I.S., Quass, D., & Mumick, B.S. (1997). Maintenance of data cubes and summary tables in a warehouse. ACM SIGMOD Conference on the Management of Data, SIGMOD'97 (pp. 100-111), AZ, USA.

O'Neil, P., & Graefe, G. (1995). Multi-table joins through bitmapped join indexes. SIGMOD Record, 24(3), 8-11.

Ramakrishnan, R., & Gehrke, J. (2003). Database management systems (3rd ed.). New York: McGraw-Hill.

Roussopoulos, N., Kotidis, Y., & Roussopoulos, M. (1997). Cubetree: Organization of and bulk incremental updates on the data cube. ACM SIGMOD Conference on the Management of Data, SIGMOD'97 (pp. 89-99), AZ, USA.

Thomsen, E. (2002). OLAP solutions: Building multidimensional information systems (2nd ed.). New York: John Wiley & Sons.

Vassiliadis, P., & Sellis, T. (1999). A survey of logical models for OLAP databases. SIGMOD Record, 28(4), 64-69.
KEY TERMS

Analytical Query: A query on a data warehouse for identifying business intelligence. These queries often have complex expressions, access many kinds of data, and involve statistical functions.

Materialized View: A view whose values have been pre-computed from certain base data and stored.

Partitioning: The technique of dividing a set of data into fragments for physical storage purposes. It is intended to improve the manageability and accessibility of large amounts of data.

Query Performance: A measure of how fast a system processes queries, which involves reading data from disk and, if necessary, computing the results in memory. The shorter it takes to process a query, the better the performance.

Rotational Latency: The waiting time for the beginning of the desired data block to rotate under the disk head before the actual data transfer can begin.

Seek Time: The data on a computer disk is arranged in concentric circles called tracks. Seek time is the time needed to move the disk head from its current position to the track that the desired data is on.

View: A collection of data whose values either have to be derived from other data (when the view is not materialized), or have been pre-computed from other data and stored (when it is materialized).

View Maintenance: When the base data of materialized views is updated, the values of the views need to be updated accordingly. This process of keeping materialized views consistent with their base data is known as view maintenance.
Subgraph Mining

Ingrid Fischer, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Thorsten Meinl, Friedrich-Alexander University Erlangen-Nürnberg, Germany
INTRODUCTION

The amount of available data is increasing very fast. With this data, the desire for data mining is also growing. More and larger databases have to be searched to find interesting (and frequent) elements and connections between them. Most often, the data of interest is very complex. It is common to model complex data with the help of graphs consisting of nodes and edges, which are often labeled to store additional information. Applications can be found in very different fields. For example, the two-dimensional structure of molecules is often modeled as a graph having the atoms as nodes and the bonds as edges. The same holds for DNA or proteins. Web pages and the links between them can also be represented as a graph. Other examples are social networks such as citation networks, and CAD circuits; graphs can be found in a lot of different application areas. Having a graph database, it is interesting to find common graphs in it, connections between different graphs, and graphs that are subgraphs of a certain number of other graphs. This graph-based data mining has become more and more popular in the last few years. When analyzing molecules, it is interesting to find patterns (called fragments in this context) that appear in at least a certain percentage of molecules. Another problem is finding fragments that are frequent in one part of the database but infrequent in the other. Such a substructure separates the database into active and inactive molecules (Borgelt & Berthold, 2002). Similar problems occur for protein databases. Here, graph data mining can be used to find structural patterns in the primary, secondary, and tertiary structure of protein categories (Cook & Holder, 2000). Another application area is Web searches (Cook, Manocha & Holder, 2003). Existing search engines use linear feature matches. Using graphs as the underlying data structure, nodes represent pages, documents, or document keywords, and edges represent links between them. Posing a query as a graph means that a smaller graph has to be embedded in the larger one. The graph modeling the data structure can be mined to find similar clusters.
There are a lot of application areas where graph data mining is helpful. Despite the need for graph data mining, the first published algorithm in this area appeared only in the mid-1990s. Subdue (Cook & Holder, 2000) is the oldest algorithm but is still used in various applications. Being the first, Subdue has an enormous number of extensions available. The algorithm can be combined with background knowledge and inexact graph matching, and there is also a parallelized variant available. Supervised and unsupervised mining is possible. It took a few more years before more and faster approaches appeared. In Helma, Kramer, and de Raedt (2002), graph databases are mined for simple paths; for a lot of other applications, only trees are of interest (El-Hajj & Zaïane, 2003; Rückert & Kramer, 2004). Inductive logic programming (Finn et al., 1998) has also been applied in this area. At the beginning of the new millennium, more and more, ever faster approaches for general mining of graph databases were finally developed (Borgelt & Berthold, 2002; Inokuchi, Washio & Motoda, 2003; Kuramochi & Karypis, 2001; Yan & Han, 2002). The latest development, a system named Gaston (Nijssen & Kok, 2004), combines mining for paths, trees, and graphs, leading to a fast and efficient algorithm.
BACKGROUND

Theoretically, mining in graph databases can be modeled as the search in the lattice of all possible subgraphs. In Figure 1, a small example is shown based on one graph with six nodes labeled A, B, and C, as shown at the bottom of the figure. All possible subgraphs of this small graph are listed in the figure. At the top of the figure, the empty graph, modeled with *, is shown. In the next row, all possible subgraphs containing just one node (or zero edges) are listed. The second row contains subgraphs with one edge. The parent-child relation between the subgraphs (indicated by lines) is the subgraph property. The empty graph can be embedded in every graph containing one node. The graph containing just one node labeled A can be embedded in a one-edge graph containing nodes A and C.

Figure 1. The lattice of all subgraphs in a graph

Please note that in Figure 1, no graph with one edge is given containing nodes labeled A and B. As there is no such subgraph in our running example, the lattice does not contain a graph like this. Only graphs that are real subgraphs are listed in the lattice. In the third row, graphs with two edges are shown, and so on. At the bottom of Figure 1, the complete graph with five edges is given. Each subgraph given in Figure 1 can be embedded in this graph. All graph mining algorithms have in common that they search this subgraph lattice. They are interested in finding a subgraph (or several subgraphs) that can be embedded as often as possible in the graph to be mined. In Figure 1, the circled graph can be embedded twice in the running example.

When mining real-life graph databases, the situation is, of course, much more complex. Not just one, but many graphs are analyzed, leading to a very large lattice. Searching this lattice can be done depth first or breadth first. When searching depth first in Figure 1, the first discovered subgraph will be A, followed by A-C, A-C-C, and so forth. So, first all subgraphs containing A are found, and in the next branch all containing B. If the lattice is traversed breadth first, all subgraphs in one level of the lattice (i.e., structures that have the same number of edges) are searched before the next level is started. The main disadvantage of breadth-first search is the larger memory consumption, because in the middle of the lattice, a large
amount of subgraphs has to be stored. With depth-first search, only a number of structures proportional to the size of the biggest graph in the database has to be recorded during the search. Building this lattice of frequent subgraphs involves two main steps: candidate generation, where new subgraphs are created out of smaller ones; and support computation, where the frequency or support of the new subgraphs in the database is determined. Both steps are highly complex, and, thus, various algorithms and techniques have been developed to find frequent subgraphs in finite time with reasonable resource consumption.
MAIN THRUST

We will now have a more detailed look at the two main steps of the search mentioned previously: candidate generation and support computation. There are two popular ways of creating new subgraphs: merging smaller subgraphs that share a common core (Inokuchi et al., 2002; Kuramochi & Karypis, 2001) or extending subgraphs edge by edge (Borgelt & Berthold, 2002; Yan & Han, 2002).

The merge process can be explained by looking at the subgraph lattice shown in Figure 1. The circled subgraph has two parents, A-C and C-C. Both share the same core, which is C. So the new fragment A-C-C is created by taking the core and adding the two additional edge-node pairs, one from each parent. There are two problems with this approach. First, the common core needs to be detected somehow, which can be very expensive. Second, a huge number of subgraphs generated in this way may not even exist in the database. Merging A-C and B-C in the example, for instance, will lead to A-C-B, which does not occur in the database. To overcome this problem, various modifications of the merge process have been developed (Huan, Wang & Prins, 2003; Inokuchi, Washio & Motoda, 2003).

Extending fragments has the advantage that no cores have to be detected. New edge-node pairs (or sometimes only edges, if cycles are closed) are just added to an existing subgraph. Here, too, non-existing candidates can be generated, but there is an easy way to combine the generation of new subgraphs with the support computation, so that only existing candidates are created. As shown later, during the support computation the candidate subgraph has to be embedded into all graphs of the database (which is essentially a subgraph isomorphism test). Once an embedding has been found, the surrounding of the subgraph is known, and in the following extension step only edge-node pairs (or single edges) that really exist in the database's graphs are added. The small drawback is that now all subgraph isomorphisms have to be computed, and not just one, as is normally required for the support computation. Nevertheless, this technique underlies the currently fastest subgraph mining algorithms relying on extending subgraphs.

Computing the support of new candidates can also be done in two different ways. The already mentioned simple approach is to test subgraph isomorphism against all graphs in the database. This is a computationally expensive task, because the subgraph isomorphism problem is NP-complete. However, there is a small improvement for this strategy, as it suffices to check for subgraph isomorphism only in the graphs where the parent graph(s) occur. Unfortunately, this requires keeping a list of the graphs in which a subgraph occurs, which can be quite memory-consuming if the database is large. The other way of calculating the support is the use of so-called embedding lists. An embedding can be thought of as a stored subgraph isomorphism (i.e., a map from the nodes and edges in the subgraph to the corresponding nodes and edges in the graph). Now, if the support of a new, greater subgraph has to be determined, the position in the graph where it can occur is already known, and only the additional nodes and edges have to be checked. This reduces the time to find the isomorphism but comes with the drawback of enormous memory requirements, as all embeddings of a subgraph in the database have to be stored, which can be millions for small subgraphs on even
medium-sized databases of about 25,000 items. Using embedding lists, the actual support for a structure can be determined by counting the number of different graphs that are referred to by the embeddings. This can be done in linear time. In general, it is not possible to traverse the complete lattice, because the number of subgraphs is too large to be handled efficiently. Most of them can be reached by following many different paths in the lattice (see Figure 1). A mechanism is needed to prune the search tree that is built during the discovery process. First, it is obvious that a supergraph of an infrequent graph must be infrequent, too: it cannot occur in more graphs than its parents in the lattice. This property is also known as the antimonotonicity constraint. Once the search reaches a point where a graph does not occur in enough items of the database any more, this branch can be pruned. This leads to a drastic reduction of the number of subgraphs to be searched. To speed up the search even further, various authors have proposed additional pruning strategies that rely on canonical representations of the subgraphs or on local orders on the nodes and edges. This information can be used to prune the search tree even further, while still finding all frequent fragments.
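The support-computation step can be sketched in Python with the networkx library, as below. This is an illustration only; real miners such as gSpan or Gaston rely on far more efficient specialized data structures, and the helper names (support, is_frequent, labeled_graph) are invented for the sketch.

import networkx as nx
from networkx.algorithms import isomorphism

def support(candidate, database):
    """Number of database graphs that contain the candidate as a
    (node-induced) subgraph with matching node labels."""
    match_labels = isomorphism.categorical_node_match("label", None)
    count = 0
    for g in database:
        gm = isomorphism.GraphMatcher(g, candidate, node_match=match_labels)
        if gm.subgraph_is_isomorphic():
            count += 1
    return count

def is_frequent(candidate, database, min_support):
    # Antimonotonicity: if this candidate is infrequent, every extension of it
    # is infrequent as well, so the search can prune the whole branch here.
    return support(candidate, database) >= min_support

def labeled_graph(edges):
    g = nx.Graph()
    for a, b in edges:
        g.add_node(a, label=a[0])
        g.add_node(b, label=b[0])
        g.add_edge(a, b)
    return g

database = [labeled_graph([("A1", "C1"), ("C1", "C2")]),
            labeled_graph([("A1", "C1"), ("C1", "B1")])]
candidate = labeled_graph([("A1", "C1")])       # the fragment A-C
print(support(candidate, database))             # 2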
FUTURE TRENDS

Despite the efforts of past years, several problems still have to be solved. Memory and runtime are a challenge for most of the algorithms. With real-world graph databases containing millions of different graphs, various new algorithms and extensions of the existing ones are necessary. First thoughts concerning this topic can be found in Wang, Wang, Pei, Zhu, and Shi (2004). Another promising research direction is parallel and distributed algorithms. Distributing the graphs and their subgraph lattice onto different machines can help in processing even larger databases than is possible with current algorithms. It is an open question how to realize the distribution without searching different branches of the lattice several times. Searching only one part of the smaller database might also lead to the rejection of frequent subgraphs, as they may be infrequent in this part but frequent in the whole database. If on another machine the support for such a subgraph is high enough to be reported, the total number of subgraphs is not correct.

In several application areas, it is not exact graph matching that is necessary. For example, when mining molecules, it is helpful to search for groups of molecules having the same effect but not the same underlying graph. Well-known examples are the number of carbon atoms in chains, or several carbon atoms in rings that have been replaced by nitrogen (Hofer, Borgelt & Berthold, 2003). Finally, visualization of the search and the results is difficult. A semi-automatic search can be helpful: a human expert decides whether the search in a subpart of the lattice is useful or if the search is more promising in another direction. To achieve this goal, a visualization component is necessary that allows browsing the graph database and showing the embeddings of subgraphs.
CONCLUSION

Graph data mining is currently a very active research field. At the main data mining conferences of the ACM and the IEEE, various new approaches appear every year. The application areas of graph data mining are widespread, ranging from biology and chemistry to Internet applications. Wherever graphs are used to model data, data mining in graph databases is useful.
REFERENCES

Borgelt, C., & Berthold, M. (2002). Mining molecular fragments: Finding relevant substructures of molecules. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.

Cook, D.J., & Holder, L.B. (2000). Graph-based data mining. IEEE Intelligent Systems, 15(2), 32-41.

Cook, D.J., Manocha, N., & Holder, L.B. (2003). Using a graph-based data mining system to perform Web search. International Journal of Pattern Recognition and Artificial Intelligence, 17(5), 705-720.

El-Hajj, M., & Zaïane, O. (2003). Non recursive generation of frequent K-itemsets from frequent pattern tree representations. Proceedings of the 5th International Conference on Data Warehousing and Knowledge Discovery, Prague, Czech Republic.

Finn, P., Muggleton, S., Page, D., & Srinivasan, A. (1998). Pharmacophore discovery using the inductive logic programming system PROGOL. Machine Learning, 30(2-3), 241-270.

Helma, C., Kramer, S., & De Raedt, L. (2002). The molecular feature miner MolFea. Proceedings of the Beilstein-Institut Workshop Molecular Informatics: Confronting Complexity, Bozen, Italy.

Hofer, H., Borgelt, C., & Berthold, M. (2003). Large scale mining of molecular fragments with wildcards. In M. Berthold, H.J. Lenz, E. Bradley, R. Kruse, & C. Borgelt (Eds.), Advances in Intelligent Data Analysis V (pp. 380-389). Springer-Verlag.

Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. Proceedings of the International Conference on Data Mining, Melbourne, Florida.

Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France.

Inokuchi, A., Washio, T., & Motoda, H. (2003). Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3), 321-354.

Inokuchi, A., Washio, T., Nishimura, K., & Motoda, H. (2002). A fast algorithm for mining frequent connected subgraphs. Tokyo: IBM Research.

King, R., Srinivasan, A., & Dehaspe, L. (2001). Warmr: A data mining tool for chemical data. Journal of Computer-Aided Molecular Design, 15, 173-181.

Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. Proceedings of the IEEE International Conference on Data Mining, San Jose, California.

Nijssen, S., & Kok, J. (2004). A quickstart in frequent structure mining can make a difference [technical report]. Leiden, Netherlands: Leiden Institute of Advanced Computer Science.

Rückert, U., & Kramer, S. (2004). Frequent free tree discovery in graph data. Proceedings of the ACM Symposium on Applied Computing, Nicosia, Cyprus.

Valiente, G. (2002). Algorithms on trees and graphs. Springer-Verlag.

Wang, C., Wang, W., Pei, P., Zhu, Y., & Shi, B. (2004). Scalable mining large disk-based graph databases. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington.

Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan.
KEY TERMS

Antimonotonicity Constraint: The antimonotonicity constraint states that any supergraph of an infrequent graph must be infrequent itself.

Candidate Generation: Creating new subgraphs out of smaller ones; it is then checked how often this new subgraph appears in the analyzed graph database.

Frequent Subgraph: A subgraph that occurs in a certain percentage of all graphs in the database.

Graph Isomorphism: Two graphs that contain the same number of graph vertices connected in the same way by edges are said to be isomorphic. Determining if two graphs are isomorphic is thought to be neither an NP-complete problem nor a P-problem, although this has not been proved (Valiente, 2002).

Search Tree Pruning: Cutting off certain branches of the (conceptual) search tree that is built during the mining process; pruning criteria may be the size of the graphs, the support of the graphs, or algorithm-specific constraints.

Subgraph: A graph G' whose vertices and edges form subsets of the vertices and edges of a given graph G. If G' is a subgraph of G, then G is said to be a supergraph of G'.

Subgraph Isomorphism: Decision whether a graph G' is isomorphic to a subgraph of another graph G. This problem is known to be NP-complete.

Support: The number of graphs in the analyzed database in which a subgraph occurs.
Support Vector Machines

Mamoun Awad, University of Texas at Dallas, USA
Latifur Khan, University of Texas at Dallas, USA
INTRODUCTION

The availability of reliable learning systems is of strategic importance, as many tasks cannot be solved by classical programming techniques, because no mathematical model of the problem is available. For example, no one knows how to write a computer program that performs handwritten character recognition, though plenty of examples are available. It is, therefore, natural to ask if a computer could be trained to recognize the letter A from examples; after all, humans learn to read this way. Given the increasing quantity of data for analysis and the variety and complexity of data analysis problems being encountered in business, industry, and research, demanding the best solution every time is impractical. The ultimate dream, of course, is to have some intelligent agent that can preprocess data, apply the appropriate mathematical, statistical, and artificial intelligence techniques, and then provide a solution and an explanation. In the meantime, we must be content with the pieces of this automatic problem solver. The data miner's purpose is to use the available tools to analyze data and provide a partial solution to a business problem.

Support vector machines (SVMs) have been developed as a robust tool for classification and regression in noisy and complex domains. SVMs can be used to extract valuable information from data sets and construct fast classification algorithms for massive data. The two key features of support vector machines are generalization theory, which leads to a principled way to choose a hypothesis, and kernel functions, which introduce nonlinearity in the hypothesis space without explicitly requiring a nonlinear algorithm. SVMs map data points to a high-dimensional feature space, where a separating hyperplane can be found. This mapping can be carried out by applying the kernel trick, which implicitly transforms the input space into a high-dimensional feature space. The separating hyperplane is computed by maximizing the distance to the closest patterns, that is, by margin maximization. SVMs can be defined as "a system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalisation theory and exploiting optimisation theory" (Cristianini & Shawe-Taylor, 2000, p. 93).

Support vector machines have been applied to many real-world problems and in several areas: pattern recognition, regression, multimedia, bio-informatics, artificial intelligence, and so forth. Many techniques, such as decision trees, neural networks, and genetic algorithms, have been used in these areas; however, what distinguishes SVMs is their solid mathematical foundation, which is based on statistical learning theory. Instead of minimizing the training error (empirical risk), SVMs minimize the structural risk, which expresses an upper bound on the generalization error, that is, the probability of an erroneous classification on yet-to-be-seen examples. This quality makes SVMs especially suited for many applications with sparse training data.
BACKGROUND

The general problem of machine learning is to search a (usually) very large space of potential hypotheses to determine the one that will best fit the data and any prior knowledge. "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (Mitchell, 1997, p. 2). Machine learning can be categorized into several categories based on the data set and the labels of the data set. The data used for learning may be labeled (for example, the data might be medical records, where each record reflects the history of a patient and has a label denoting whether that patient had heart disease or not) or unlabeled. If labels are given, then the problem is one of supervised learning, in that the true answer is known for a given set of data. If the labels are categorical, then the problem is one of classification, for example, predicting the species of a flower given petal and sepal measurements. If the labels are real-valued, then the problem is one of regression, for example, predicting property values from crime, pollution, and so forth. If labels are not given, then the problem is one of unsupervised learning, and the aim is to characterize the structure of the data, for example, by identifying groups of examples in the data that are collectively similar to each other and distinct from the other data.
Pattern Recognition

Formally, in pattern recognition, we want to estimate a function $f: \mathbb{R}^N \rightarrow \{\pm 1\}$ using input-output training data $(x_1, y_1), \ldots, (x_l, y_l) \in \mathbb{R}^N \times \{\pm 1\}$, such that $f$ will correctly classify unseen examples $(x, y)$; that is, $f(x) = y$ for examples $(x, y)$ that were generated from the same underlying probability distribution $P(x, y)$ as the training data. Each data point has numerical properties that might be useful to distinguish it, represented by $x$ in $(x, y)$. The $y$ is either $+1$ or $-1$, denoting the label or class to which this data point belongs. For example, in a medical record, $x$ might be the age, weight, allergy, blood pressure, blood type, disease, and so forth. The $y$ might represent whether the person is susceptible to a heart attack. Notice that some attributes, such as an allergy, might need to be encoded (for example, 1 if the person is allergic to a medicine, or 0 if not) in order to be represented as a numerical value.

If we put no restriction on the class of functions from which we choose our estimate $f$, even a function that does well on the training data, for example, by satisfying $f(x_i) = y_i$ for all $i = 1, \ldots, l$, need not generalize well to unseen examples. To see this, note that for each function $f$ and any test set $(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_{\bar{l}}, \bar{y}_{\bar{l}}) \in \mathbb{R}^N \times \{\pm 1\}$ satisfying $\{\bar{x}_1, \ldots, \bar{x}_{\bar{l}}\} \cap \{x_1, \ldots, x_l\} = \{\}$, there exists another function $f^*$ such that $f^*(x_i) = f(x_i)$ for all $i = 1, \ldots, l$, yet $f^*(\bar{x}_i) \neq f(\bar{x}_i)$ for all $i = 1, \ldots, \bar{l}$; that is, both functions return the same predictions on all training examples, yet they disagree in their predictions on all test examples. As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test outputs) is preferable. Hence, only minimizing the training error (or empirical risk),

$$R_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} \frac{1}{2} \left| f(x_i) - y_i \right|, \qquad (1)$$
does not imply a small test error (called the risk), averaged over test examples drawn from the underlying distribution $P(x, y)$,

$$R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y). \qquad (2)$$
In Equation 1, notice that the error term $\frac{1}{2} |f(x_i) - y_i|$ is equal to 0 if the data point $x_i$ is correctly classified, because then $f(x_i) = y_i$. Statistical learning theory (Vapnik & Chervonenkis, 1974; Vapnik, 1979), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that $f$ is chosen from to one that has a capacity suitable for the amount of available training data. VC theory provides
Figure 1A. The VC-dimension of H, the set of all linear decision surfaces in the plane. Figure 1B. Four points cannot be shattered by H.
bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization (Vapnik, 1979). The best known capacity concept of VC theory is the VC dimension, defined as the largest number, h, of points that can be separated in all possible ways by using functions of the given class. The definition of the VC dimension is based on the concept of shattering, defined as follows: "A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy" (Mitchell, 1997). "The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then $VC(H) \equiv \infty$" (Mitchell, 1997, p. 215). For example, Figure 1A illustrates the VC-dimension of H, where H is the set of all linear decision surfaces in the plane. Hypotheses are depicted by lines. The largest number of points that we can shatter is 3; therefore, the VC-dimension is greater than or equal to 3. Figure 1A also depicts the dichotomies that each linear hypothesis induces. Because there are 3 data points a, b, and c, we have $2^3 = 8$ dichotomies. The hypothesis at 1 separates the empty set from {a, b, c}, the hypothesis at 2 separates point a from {b, c}, and so forth. Figure 1B shows that 4 points cannot be shattered by a linear hypothesis, because no hypothesis can separate points 1 and 4 from points 2 and 3 (however, if circles are used as the hypothesis space, they can be shattered). Hence, the VC-dimension of H is 3.
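A short worked argument for the four-point case may help; the coordinates below are chosen only for illustration, and the reasoning is the standard one. Take the points $p_1 = (0,0)$ and $p_4 = (1,1)$ (to be labeled $+1$) and $p_2 = (0,1)$ and $p_3 = (1,0)$ (to be labeled $-1$), and suppose a linear hypothesis $f(x) = \mathrm{sign}(w \cdot x - b)$ realized this dichotomy. Then

$$-b > 0, \qquad w_1 + w_2 - b > 0, \qquad w_2 - b < 0, \qquad w_1 - b < 0.$$

Adding the first two inequalities gives $w_1 + w_2 - 2b > 0$, while adding the last two gives $w_1 + w_2 - 2b < 0$, a contradiction. Hence this dichotomy of four points is not realizable by a line, so the VC-dimension of linear decision surfaces in the plane is exactly 3.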
MAIN THRUST

In this section, we explore the key concepts and features of SVMs in more detail.
Problem Formalization

The classification problem can be restricted to the two-class problem without loss of generality. In this problem, the goal is to separate the two classes by a function (or hypothesis) induced from available examples. The goal is then to produce a classifier that will work well on unseen examples, that is, one that generalizes well. Consider the example shown in Figure 2. Here the data belong to two different classes, depicted as circles and squares. As you can see, in Part A, many possible linear classifiers (or separators) can separate the data, but in Part B, only one separator maximizes the margin (i.e., maximizes the distance between it and the nearest data points of each class). This linear classifier is termed the optimal separating classifier. Intuitively, we would expect this boundary to generalize well, as opposed to the other possible boundaries. In higher-dimensional spaces, a separating classifier is called a hyperplane classifier. For example, in two-dimensional space a hyperplane is simply a line, and in three-dimensional space a hyperplane is a plane.

Suppose we have N labeled training data points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. The optimal classifier can be computed by solving the quadratic programming problem

$$\text{Minimize}_{(w, b)} \quad \frac{1}{2} w^T w \qquad (3)$$

$$\text{subject to} \quad y_i (w \cdot x_i - b) \geq 1, \quad \forall\, 1 \leq i \leq N, \qquad (4)$$
where w is the optimal separating hyperplane vector, T denotes the vector transpose, and b is the bias. Intuitively, by solving this quadratic programming problem, we try to find the optimal hyperplane and two hyperplanes (H1 and H2) parallel to it and with equal distances to it, with the conditions that no data points are between
Figure 2. Separating classifiers (A) and the optimal separator (B)
H1 and H2 and that the distance between H1 and H2 is maximized. When the distance between H1 and H2 is maximized, some positive data points will lie on H1 and some negative data points will lie on H2. These data points are called support vectors, because only these points participate in the definition of the separating hyperplane; the other data points can be removed and/or moved around as long as they do not cross the planes H1 and H2. Figure 3A shows how the hyperplane is represented mathematically: any hyperplane can be represented by two parameters, w, the vector perpendicular to the hyperplane, and b, the bias term, which determines the hyperplane's offset from the origin. Figure 3B depicts the geometrical representation of solving the SVM's quadratic programming problem by showing H (the optimal separator) and the H1 and H2 hyperplanes. Because this quadratic programming problem is convex, any local minimum is also the global minimum. Hence, SVMs do not suffer from the problem of becoming trapped in one of many local minima, which is a typical problem in other techniques, such as neural networks. The hyperplane is found by computing the values of w and b in the formalization of Equations 3 and 4. A new data point x can then be classified by using

f(x) = sign(w · x − b)    (5)
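A minimal sketch of Equations 3 through 5 follows (assuming scikit-learn, whose SVC solves an equivalent soft-margin problem; a very large C approximates the hard-margin case, the library's decision function uses w · x + b rather than the w · x − b convention above, and the data are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable two-class data (3 features, as in R^d).
X = np.array([[1.0, 2.0, 0.5], [2.0, 3.0, 1.0], [3.0, 3.5, 0.0],   # y = +1
              [6.0, 1.0, 4.0], [7.0, 2.0, 5.0], [8.0, 1.5, 4.5]])  # y = -1
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin program of Equations 3-4.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the optimal separating hyperplane
b = clf.intercept_[0]   # note: scikit-learn's convention is f(x) = sign(w.x + b)

# Counterpart of Equation 5 (with the library's sign convention for b).
x_new = np.array([4.0, 2.0, 2.0])
print("w =", w, " b =", b, " class =", int(np.sign(np.dot(w, x_new) + b)))
```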
This formalization is normally converted to the Wolfe dual representation,

Maximize:  L(w, b, α) ≡ (1/2) wᵀw − Σ_i α_i y_i (w · x_i − b) + Σ_i α_i = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i · x_j    (6)

with respect to α, subject to the constraint that the gradient of L(w, b, α) with respect to the primal variables w and b vanishes. The constraints in the dual formalization, including the optimality Karush-Kuhn-Tucker (KKT) conditions, are

w = Σ_{i=1}^{N} α_i y_i x_i    (7)

Σ_{i=1}^{N} α_i y_i = 0    (8)

y_i (x_i · w + b) − 1 ≥ 0   ∀i    (9)

α_i ≥ 0   ∀i    (10)

α_i ( y_i (x_i · w + b) − 1 ) = 0   ∀i    (11)

A new data point x can be classified by using

f(x) = sign( Σ_{i=1}^{l} α_i y_i x · x_i + b ) = sign( Σ_{i ∈ SV} α_i y_i x · x_i + b )    (12)
Support vectors are those data points having α i > 0 (see Figure 4). An important consequence of the dual representation is that the dimension of the feature space need not affect the computation. And with the introduction of the kernel function, as you see in the next paragraph, the number of operations required to compute the inner product by evaluating the kernel function is not necessarily proportional to the number of features.
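The role of the support vectors in Equations 7 and 12 can be checked numerically. The sketch below (assuming scikit-learn; the data are hypothetical) recovers w from the dual coefficients α_i y_i of the support vectors and classifies a new point using those support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]     # alpha_i * y_i, stored for support vectors only
sv = clf.support_vectors_       # the support vectors themselves
b = clf.intercept_[0]

# Equation 7: w is a weighted sum of the support vectors alone.
w_from_dual = alpha_y @ sv
print("w matches primal coefficients:", np.allclose(w_from_dual, clf.coef_[0]))

# Equation 12: classify a new point using only the support vectors.
x_new = np.array([0.5, -0.3])
print("predicted class:", int(np.sign(alpha_y @ (sv @ x_new) + b)))
```

The reconstructed w matches the primal solution, illustrating that only the points with α_i > 0 contribute to the classifier.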
Figure 3A. The hyperplane representation Figure 3B. The geometrical representation of solving the SVM's quadratic programming problem
Figure 4. The values of α_i for the training data
Figure 5. Feature mapping
Feature Mapping and Kernel Functions

Linear machines have limited computational power, especially when dealing with complex real-world applications that require more expressive hypothesis spaces. Kernel representations offer an alternative solution by projecting the data into a high-dimensional feature space to increase the computational power of linear learning machines such as the SVM. Figure 5 illustrates the idea of mapping nonseparable data points to another space by using the mapping function Φ: the linearly nonseparable data points become linearly separable after mapping. The new quantities introduced to describe the original data are called features. The use of kernels makes it possible to map the data implicitly into feature space and to train a linear machine in such a space, potentially sidestepping the computational problems inherent in evaluating the feature map Φ. Examples of kernel functions include the radial basis function,

K(x, z) = exp( −‖x − z‖² / (2σ²) )    (13)

and the polynomial function,

K(x, z) = ( ⟨x, z⟩ + c )^d    (14)

In the Wolfe dual representation of the SVM problem (Equations 6 through 12), the training examples occur only in the form of a dot product, in both the objective function and the solution. This form permits a convenient transformation of the training examples to a higher dimension by simply applying a kernel function instead of the dot product.

Limitations

Perhaps the biggest limitation of the support vector approach lies in the choice of the kernel function. After the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels by using prior knowledge (Schölkopf, Simard, Smola, & Vapnik, 1998; Burges, 1998), but the best kernel choice for a given problem is still a research issue (Cristianini, Shawe-Taylor, Elisseeff, & Kandola, 2001; Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002). Other limitations are speed and size, both in training and testing. Although the speed problem in the test phase is largely solved in Burges, Knirsch, and Haratsh (1996), that approach still requires two training passes. Training for very large data sets (millions of support vectors) is an unsolved problem. Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997). Finally, although some work has been done on training multiclass SVMs in one step, the optimal design for multiclass SVM classifiers remains an open research area.
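As a concrete illustration of Equations 13 and 14 (a minimal NumPy sketch; the sample points and the values of σ, c, and d are hypothetical), either kernel can be computed directly and substituted wherever the dot product x_i · x_j appears:

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Equation 13: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def poly_kernel(X, Z, c=1.0, d=2):
    # Equation 14: K(x, z) = (<x, z> + c)^d
    return (X @ Z.T + c) ** d

# Hypothetical sample points; each row is one example.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

print(rbf_kernel(X, X, sigma=0.5))    # 3 x 3 Gram matrix, Equation 13
print(poly_kernel(X, X, c=1.0, d=2))  # 3 x 3 Gram matrix, Equation 14
```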
FUTURE TRENDS There are many indications that SVMs will be extended further to fit applications such as data mining that require large data sets. Other techniques, such as random selection, sample selection, clustering analysis (Ben-Hur, Horn, Siegelmann, & Vapnik, 2001), and so forth, have been used with SVMs to speed up the training process and to
reduce the training set. This new direction of research aims to approximate the support vectors in advance and reduce the training set. Other attempts emerged lately to use SVMs in clustering analysis. The use of kernel functions to map data points to higher dimensional space in SVMs led to generating new clusters. The idea of multiclass SVMs is an important research topic. There are many multiclass applications in many research areas. SVMs have been extended by applying the “one versus all” and the “one versus one” approaches; however, a tremendous amount of training and testing is involved, because in both approaches, several classifiers are generated (Rifkin & Klautau, 2004). A new data point is classified by testing it over all the generated classifiers.
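A minimal sketch of the two multiclass strategies follows (assuming scikit-learn; the three-class data are hypothetical). One-versus-all trains one classifier per class, whereas one-versus-one trains one classifier per pair of classes:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Hypothetical three-class data: 15 points around each of three centers.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(15, 2) + center
               for center in ([0, 0], [4, 4], [0, 4])])
y = np.repeat([0, 1, 2], 15)

# One versus all: one classifier per class; predict the class whose
# classifier gives the largest decision value.
ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

# One versus one: one classifier per pair of classes; predict the class
# that wins the most pairwise "votes".
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

x_new = np.array([[3.5, 3.0]])
print("one-vs-all:", ova.predict(x_new), " one-vs-one:", ovo.predict(x_new))
```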
CONCLUSION

SVMs provide a new approach to the problem of classification and pattern recognition, together with regression estimation and linear operator inversion, with clear connections to the underlying statistical learning theory. They differ radically from comparable approaches such as neural networks: SVM training always finds a global minimum, and their simple geometric interpretation provides fertile ground for further investigation. An SVM is largely characterized by the choice of its kernel, and SVMs thus link the problem for which they are designed with a large body of existing work on kernel-based methods. SVMs have some limitations, though. They are not well suited to applications involving very large data sets, because the problem formalization of SVMs is a quadratic programming problem that requires at least O(N²) computation for a data set of size N. Other open issues include multiclass SVMs, discrete data sets, and kernel choice.
REFERENCES
Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125-137.

Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.

Burges, C., Knirsch, P., & Haratsh, R. (1996). Support vector web page (Tech. Rep.). Retrieved August 2004, from http://svm.research.bell-labs.com

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines (and other kernel-based learning methods). Cambridge: Cambridge University Press.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001). On kernel-target alignment. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14. Cambridge, MA: MIT Press.

Hwanjo, Y., Jiong, Y., & Jiawei, H. (2003). Classifying large data sets using SVM with hierarchical clusters. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 306-315), Washington, D.C., USA.

Joachims, T. (1997). Text categorization with support vector machines (Tech. Rep. LS VIII No. 23). University of Dortmund, Germany.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419-444.

Mitchell, T. (1997). Machine learning. Columbus: McGraw-Hill.

Rifkin, R., & Klautau, A. (2004). In defense of one-vs.-all classification. Journal of Machine Learning Research, 5, 101-141.

Schölkopf, B., Simard, P., Smola, A., & Vapnik, V. (1998). Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems 10 (pp. 640-646). Cambridge, MA: MIT Press.

Smola, A., & Schölkopf, B. A tutorial on support vector regression (Tech. Rep.). Retrieved October 2004, from http://www.neurocolt.com

Vapnik, V. (1979). Estimation of dependences based on empirical data. Moscow: Nauka (English translation: Springer).

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka (German translation: Theorie der Zeichenerkennung. Berlin: Akademie-Verlag).

KEY TERMS

Feature Space: The higher dimensional space that results from mapping the input space, as opposed to the input space occupied by the training examples.

Functional Margin: Geometrically, the functional margin is the Euclidean distance of the closest point from the decision boundary to the input space.

Hypotheses: A particular set or class of candidate functions before you begin to learn the correct function.
Kernel Function: A kernel is a function K such that for all x, z ∈ X, K(x, z) = ⟨Φ(x), Φ(z)⟩, where Φ is a mapping from X to an (inner-product) feature space F.

One versus All: A multiclass approach that constructs, for each class, a classifier that separates that class from the rest of the classes of data. A test datum x will be classified in the class that maximizes (w · x − b).

One versus One: A multiclass approach that constructs, for each pair of classes, a classifier that separates those classes. A test datum is classified by all the classifiers and will belong to the class with the largest number of positive outputs from these pairwise classifiers.
Supervised Learning (Classification): In supervised learning, the learning machine is given a training set of examples (or inputs) with associated labels (or output values). Usually, the examples are in the form of attribute vectors, so the input space is a subset of R^n. When the attribute vectors are available, a number of sets of hypotheses could be chosen for the problem.

Support Vector Machines: A system for efficiently training linear learning machines in kernel-induced feature spaces while respecting the insights of generalization theory and exploiting optimization theory.

Unsupervised Learning (Clustering): Unsupervised learning occurs when you are given a set of observations with the aim of establishing the existence of classes or clusters in the data.
Support Vector Machines Illuminated
David R. Musicant Carleton College, USA
INTRODUCTION In recent years, massive quantities of business and research data have been collected and stored, partly due to the plummeting cost of data storage. Much interest has therefore arisen in how to mine this data to provide useful information. Data mining as a discipline shares much in common with machine learning and statistics, as all of these endeavors aim to make predictions about data as well as to better understand the patterns that can be found in a particular dataset. The support vector machine (SVM) is a current machine learning technique that performs quite well in solving common data mining problems.
BACKGROUND
The most common use of SVMs is in solving the classification problem, which we focus on in the following example. The dataset in Table 1 contains demographic information for four people. These people were surveyed to determine whether or not they purchased software on a regular basis. The dataset in Table 2 contains demographic information for two more people who may or may not be good targets for software advertising. We wish to determine which of the people in Table 2 purchase software on a regular basis. This classification problem is considered an example of supervised learning. In supervised learning, we start off with a training set (Table 1) of examples. We use this training set to find a rule to be used in making predictions on future data. The quality of a rule is typically determined through the use of a test set (Table 2). The test set is another set of data with the same attributes as the training set, but which is held out from the training process. The values of the output attributes, which are indicated by question marks in Table 2, are hidden and "pretended" not to be known. After training is complete, the rule is used to predict values for the output attribute for each point in the test set. These output predictions are then compared with the (now revealed) known values for these attributes, and the difference between them is measured. Success is typically measured by the fraction of points which are classified correctly. This measurement provides an estimate as to how well the algorithm will perform on data where the value of the output attribute is truly unknown.

One might ask: why bother with a test set? Why not measure the success of a training algorithm by comparing predicted output with actual output on the training set alone? The use of a test set is particularly important because it helps to estimate the true error of a classifier. Specifically, a test set determines if overfitting has occurred. A learning algorithm may learn the training data "too well," i.e. it may perform very well on the training data, but very poorly on unseen testing data. For example, consider the following classification rule as a solution to the above posed classification problem:

• Overfitted Solution: If our test data is actually present in Table 1, look it up in the table to find the class to which the point belongs. If the point is not in Table 1, classify it in the "No" category.

This solution is clearly problematic – it will yield 100% accuracy on the training set, but should do poorly on the test set since it assumes that all other points are automatically "No". An important aspect of developing supervised learning algorithms is ensuring that overfitting does not occur. In practice, training and test sets may be available a priori. Most of the time, however, only a single set of data is available. A random subset of the data is therefore held out from the training process in order to be used as a test set.

Table 1. Classification example training set

Age   Income         Years of Education   Software Purchaser?
30    $56,000 / yr   16                   Yes
50    $60,000 / yr   12                   No
16    $2,000 / yr    11                   Yes
35    $30,000 / yr   12                   No

Table 2. Classification example test set

Age   Income         Years of Education   Software Purchaser?
40    $48,000 / yr   17                   ?
29    $60,000 / yr   18                   ?
This can introduce widely varying success rates, depending on which data points are chosen. This is traditionally dealt with by using cross-validation. The available data is randomly broken up into k disjoint groups of approximately equal size. The training process is run k times, each time holding out a different group to use as a test set and using all remaining points as the training set. The success of the algorithm is then measured by an average of the success over all k test sets. Usually we take k=10, yielding the process referred to as tenfold cross-validation.

A plethora of methodologies can be found for solving classification and regression problems. These include the backpropagation algorithm for artificial neural networks, decision tree construction algorithms, spline methods for classification, probabilistic graphical dependency models, least squares techniques for general linear regression, and algorithms for robust regression. SVMs are another approach, rooted in mathematical programming.

Euler observed that "Nothing happens in the universe that does not have...a maximum or a minimum". Optimization techniques are used in a wide variety of fields, typically in order to maximize profits or minimize expenses under certain constraints. For example, airlines use optimization techniques in finding the best ways to route airplanes. The use of mathematical models in solving optimization problems is referred to as mathematical programming. The use of the word "programming" is now somewhat antiquated, and is used to mean "scheduling." Mathematical programming techniques have become attractive for solving machine learning problems, as they perform well while also providing a sound theoretical basis that some other popular techniques do not. Additionally, they offer some novel approaches in addressing the problem of overfitting (Herbrich, 2002; Vapnik, 1998).

It is not surprising that ideas from mathematical programming should find application in machine learning. After all, one can summarize classification and regression problems as "Find a rule that minimizes the errors made in predicting an output." One formalization of these ideas as an optimization problem is referred to as the SVM (Herbrich, 2002; Vapnik, 1998). It should be noted that the word "machine" is used here figuratively, referring to an algorithm that solves classification problems.
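The holdout and tenfold cross-validation procedure described earlier in this section can be sketched as follows (assuming scikit-learn; the synthetic dataset and the choice of a linear SVM as the learner are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# A hypothetical dataset standing in for records like those in Table 1.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# k = 10: train ten times, each time holding out a different tenth of the
# data as the test set, then average accuracy over the held-out folds.
tenfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=tenfold)

print("fold accuracies:", np.round(scores, 3))
print("tenfold cross-validation accuracy:", scores.mean())
```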
MAIN THRUST

Suppose that we wish to classify points that can be in one of two classes, which we will label as 1 and -1. In the previous examples, class 1 could correspond to those people that are software purchasers, and class -1 could correspond to those that are not. We will assume that the training data consists of m examples, each of which has n features. In Table 1, for example, m = 4 and n = 3. The "software purchaser" column is considered to be the classification for each row, and not one of the features. Each point can therefore be represented as a vector x_i of size n, where i ranges from 1 to m. To indicate class membership, we denote by y_i the class for point i, where y_i = +1 for points in class 1 and y_i = −1 for points in class -1. For example, if we consider class "Yes" as class 1 and "No" as class -1, we would represent the training set in our classification example as:

x_1 = [30 56000 16],  y_1 = +1
x_2 = [50 60000 12],  y_2 = −1
x_3 = [16 2000 11],   y_3 = +1
x_4 = [35 30000 12],  y_4 = −1    (1)

The goal is to find a hyperplane that will best separate the points into the two classes. To solve this problem, let us visualize a simple example in two dimensions which is completely linearly separable, i.e. a straight line can perfectly separate the two classes. Figure 1 shows a simple linearly separable classification problem, where the separating hyperplane, or separating surface,

w · x + b = 0    (2)

separates the points in class 1 from the points in class -1. The goal then becomes one of finding a vector w and scalar b such that the points in each class are correctly classified. In other words, we want to find w and b such that the following inequalities are satisfied:

w · x_i + b > 0,  for all i such that y_i = 1
w · x_i + b < 0,  for all i such that y_i = −1    (3)

Figure 1. Linearly separable classification problem

In practice, however, finding a solution to this problem is considerably easier if we express these as non-strict inequalities. To do this, we define δ > 0 as:
δ = min_{1 ≤ i ≤ m} y_i (w · x_i + b)    (4)
We then divide the above inequalities by δ, and redefine w → w/δ, b → b/δ to yield the constraints:

w · x_i + b ≥ 1,  for all i such that y_i = 1
w · x_i + b ≤ −1,  for all i such that y_i = −1    (5)
It turns out that we can write these two constraints as a single constraint:

y_i (w · x_i + b) ≥ 1,  for all i = 1, 2, ..., m    (6)
Figure 1 shows the geometric interpretation of these two constraints. We effectively construct two bounding planes, one with equation w · x + b = 1, and the other with equation w · x + b = −1. These planes are parallel to the separating plane, and lie closer to the separating plane than any point in the associated class. Any w and b which satisfy constraint (6) will appropriately separate the points in the two classes. The next task, then, is to determine how to find the best possible choices for w and b. We want to find a plane that not only classifies the training data correctly, but will also perform well when classifying test data. Intuitively, the best possible separating plane is therefore one where the bounding planes are as far apart as possible. Figure 2 shows a plane where the bounding planes are very close together, and thus is likely not a good choice. In order to find the best separating plane, one should spread the bounding planes as far as possible while retaining classification accuracy. This idea can be backed up quantitatively with concepts from statistical learning theory (Herbrich, 2002). It turns out that the distance between the bounding planes is given by 2/√(w · w). Therefore, in order to maximize the distance, we can formulate this as an optimization problem where we minimize the magnitude of w subject to constraint (6):

min_{w,b}  (1/2) w · w
such that  y_i (w · x_i + b) ≥ 1  for all i = 1, 2, ..., m    (7)

Figure 2. Alternative separating surface to the same data shown in Figure 1. The bounding planes are closer together, and thus this plane is not expected to generalize as well.

Note that the above problem minimizes w · w, as this yields an equivalent and more tractable optimization problem than if we minimized √(w · w). The factor of 1/2 that we have added in front of w · w (the objective that we are trying to minimize) does not change the answer to the minimization problem in any way, and is conventionally used. This type of optimization problem is known as a quadratic program (Fletcher, 1987; Gill, Murray, & Wright, 1981), and is characterized by the quadratic expression being minimized and the linear constraints. We next consider the case where the classes are not linearly separable, as shown in Figure 3. If the classes are not linearly separable, then we want to choose w and b which will work "as best as possible". Therefore, we introduce a vector of slack variables ξ into constraint (6), which will take on nonzero values only when points are misclassified, and we minimize the sum of these slack variables:

min_{w,b,ξ}  (1/2) w · w + C Σ_{i=1}^{m} ξ_i
such that  y_i (w · x_i + b) + ξ_i ≥ 1,  ξ_i ≥ 0,  for all i = 1, 2, ..., m    (8)
Figure 3. Linearly inseparable classification problem

Note that the objective of this quadratic program now has two terms. The w · w term attempts to maximize the distance between the bounding planes. The Σ_{i=1}^{m} ξ_i term
attempts to minimize classification errors. Therefore, the parameter C ≥ 0 is introduced to balance these goals. A large value of C indicates that most of the importance is placed on reducing classification error. A small value of C indicates that most of the importance is placed on separating the planes and thus attempting to avoid overfitting. Finding the correct value of C is typically an experimental task, accomplished via a tuning set and cross-validation. More sophisticated techniques for determining C have been proposed that seem to work well under certain circumstances (Cherkassky & Ma, 2002; Joachims, 2002a). Quadratic program (8) is referred to as an SVM. All points which lie on the "wrong" side of their corresponding bounding plane are called support vectors (see Figure 4), where the name "support vectors" comes from a mechanical analogy in which these vectors can be thought of as point forces keeping a stiff sheet in equilibrium. Support vectors play an important role. If all points which are not support vectors are removed from a dataset, the SVM optimization problem (8) yields the same solution as it would if all these points were included. SVMs classify datasets with numeric attributes, as is clear from the formulation shown in (8). In practice, many datasets have categorical attributes. SVMs can handle such datasets if the categorical attributes are transformed into numeric attributes. One common method for doing so is to create a set of artificial binary numeric features, where each feature corresponds to a different possible value of the categorical attribute. For each data point, the values of all these artificial features are set to 0 except for the one feature that corresponds to the actual categorical value for the point. This feature is assigned the value 1.

Figure 4. Sample two-dimensional dataset, with support vectors indicated by circles

SVMs can be used to find nonlinear separating surfaces as well, significantly expanding their applicability. To see how to do so, we first look at the equivalent dual problem to the SVM (8). Every solvable quadratic program has an equivalent dual problem that is sometimes more tractable than the original. The dual of our SVM (Herbrich, 2002) is expressed as:
min_α  (1/2) Σ_{i,j=1}^{m} y_i y_j α_i α_j (x_i · x_j) − Σ_{i=1}^{m} α_i

such that  Σ_{i=1}^{m} y_i α_i = 0
           0 ≤ α_i ≤ C,  i = 1, 2, ..., m    (9)

The vector α is referred to as a dual variable, and takes the place of w and b in the original primal formulation. This dual formulation (9) can be generalized to find nonlinear separating surfaces. To do this, we observe that problem (9) does not actually require the original data points x_i, but rather scalar products between different points, as indicated by the x_i · x_j term in the objective. We therefore use the so-called "kernel trick" to replace the term x_i · x_j with a kernel function, which is a nonlinear function that plays a similar role as the scalar product in the optimization problem. Two popular kernels are the polynomial kernel and the Gaussian kernel:

Example 1: Polynomial Kernel: K(x_i, x_j) = (x_i · x_j + 1)^d, where d is a fixed positive integer

Example 2: Gaussian (Radial Basis) Kernel: K(x_i, x_j) = exp( −(x_i − x_j) · (x_i − x_j) / (2σ²) ), where σ is a fixed positive real value

Using a polynomial kernel in the dual problem is actually equivalent to mapping the original data into a higher order polynomial vector space, and finding a linear separating surface in that space. In general, using any kernel that satisfies Mercer's condition (Herbrich, 2002; Vapnik, 1998) to find a separating hyperplane corresponds to finding a linear hyperplane in a higher order (possibly infinite) feature space. If Mercer's condition is not satisfied for a particular kernel, then it is not necessarily true that there is a higher dimensional feature space corresponding to that kernel. We can therefore express the dual SVM as a nonlinear classification problem:
min_α  (1/2) Σ_{i,j=1}^{m} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1}^{m} α_i

such that  Σ_{i=1}^{m} y_i α_i = 0
           0 ≤ α_i ≤ C,  i = 1, 2, ..., m    (10)

If we use the notation α* to indicate the solution to the above problem, the classification of a test point x can be determined by the sign of

f(x) = Σ_{i=1}^{m} y_i α_i* K(x, x_i) + b*    (11)

with b* chosen such that y_i f(x_i) = 1 for any i with 0 < α_i* < C. A value of 1 indicates class 1 and a value of −
1 indicates class -1. In the unlikely event that the decision function yields a 0, i.e. the case where the point is on the decision plane itself, an ad-hoc choice is usually made. Practitioners often assign such a point to the class with the majority of training points. Since an SVM is simply an optimization problem stated as a quadratic program, the most basic approach in solving it is to use a quadratic or nonlinear programming solver. This technique works reasonably for small problems, on the order of hundreds or thousands of data points. For larger problems, these tools can require exorbitant amounts of memory and time. A number of algorithms have thus been proposed that are more efficient, as they take advantage of the structure of the SVM problem. Osuna, Freund, and Girosi (1997) proposed a decomposition method. This algorithm repeatedly selects small working sets, or “chunks” of constraints from the original problem, and uses a standard quadratic programming solver on each chunk. The QP solver can find a solution quickly due to each chunk’s small size. Moreover, only a relatively small amount of memory is needed at a time, since optimization takes place over a small set of constraints. The speed at which such an algorithm converges depends largely on the strategy used to select the working sets. To that end, the SVM light algorithm (Joachims, 2002a, 2002b) uses the decomposition ideas mentioned above coupled with techniques for appropriately choosing the working set. The SMO algorithm (Platt, 1999; Schölkopf & Smola, 2002) can be considered to be an extreme version of decomposition where the working set always consists of only two constraints. This yields the advantage that the solution to each optimization problem can be found analytically and evaluated via a straightforward formula, i.e.
a quadratic programming solver is not necessary. SMO and its variants have become quite popular in the SVM community, due to their relatively quick convergence speeds. As a result, further optimizations to SMO have been made that result in even further improvements in its speed (Keerthi, Shevade, Bhattacharyya, & Murthy, 2001).
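To tie these pieces together, the following sketch (assuming scikit-learn; the dataset and parameter grid are illustrative, not part of the original text) trains a Gaussian-kernel SVM of the form (10)-(11) on linearly inseparable data and selects C, as suggested earlier, by cross-validation:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two concentric classes: not separable by any hyperplane in input space.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
y = np.where(y == 1, 1, -1)

# Gaussian-kernel SVM; gamma plays the role of 1 / (2 * sigma^2).
# C and gamma are chosen by 10-fold cross-validation, as described above.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.1, 1, 10]},
                    cv=10)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
print("number of support vectors:", int(grid.best_estimator_.n_support_.sum()))
```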
FUTURE TRENDS Support vector machines are being increasingly adopted by the mainstream data mining community, as can be seen by the growing numbers of software suites making SVMs available for use. Toolkits such as BSVM (Hsu & Lin, 2001), SVMTorch (Collobert & Bengio, 2001), LIBSVM (Chang & Lin, 2003), and Weka (Witten & Frank, 2004) may be freely downloaded. Data mining software systems such as SAS Enterprise Miner (SAS Enterprise Miner, 2004) and KXEN (KXEN Analytic Framework, 2004) are now introducing SVMs into the commercial marketplace, and this trend is likely to accelerate.
CONCLUSION Support vector machines perform remarkably well when compared to other machine learning and data mining algorithms due to their inherent resistance to overfitting. SVMs have often been shown in practice to work considerably better than other techniques when the number of features is quite high, such as when classifying text data (Joachims, 2002a). The last five years have seen a fairly dramatic increase in use of SVMs, and they are now making their way into general purpose data mining suites.
REFERENCES Chang, C.-C., & Lin, C.-J. (2003). LIBSVM - A library for support vector machines (Version 2.5). Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm Cherkassky, V., & Ma, Y. (2002). Selection of meta-parameters for support vector regression. In J.R. Dorronsoro (Ed.), International Conference on Artificial Neural Networks (Vol. 2415, pp. 687-693): Springer. Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1, 143-160. Fletcher, R. (1987). Practical methods of optimization (2nd ed.). John Wiley & Sons.
Gill, P.E., Murray, W., & Wright, M.H. (1981). Practical optimization. Academic Press. Herbrich, R. (2002). Learning kernel classifiers. Cambridge, MA: MIT Press. Hsu, C.-W., & Lin, C.-J. (2001). BSVM Software. Retrieved from http://www.csie.ntu.edu.tw/~cjlin/bsvm/ Joachims, T. (2002a). Learning to classify text using support vector machines. Kluwer. Joachims, T. (2002b). SVMlight (Version 5.0). Retrieved from http://svmlight.joachims.org Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., & Murthy, K.R.K. (2001). Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation, 13, 637-649. KXEN Analytic Framework. (2004). KXEN, Inc. Retrieved from http://www.kxen.com Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan & E. Wilson (Eds.). Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop (pp. 276-285). Los Alamitos, CA: IEEE Press. Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training support vector machines. In B. Schölkopf, C.J.C. Burges, & A.J. Smola (Eds.), Advances in kernel methods - Support vector learning (pp. 185208). MIT Press. SAS Enterprise Miner. (2004). Cary, NC: SAS Institute. Retrieved from http://www.sas.com Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press. Vapnik, V.N. (1998). Statistical learning theory. New York: Wiley.
Witten, I.H., & Frank, E. (2004). Weka: Data mining software in Java (Version 3). Retrieved from http:// www.cs.waikato.ac.nz/ml/weka/
KEY TERMS Kernel Function: A nonlinear function of two vectors, used with a support vector machine to generate nonlinear separating surfaces. Mathematical Program: An optimization problem characterized by an objective function to be maximized or minimized, and a set of constraints. Overfitting: Finding a solution to a supervised learning problem that performs extremely well on the training set but poorly on the test set. Such a solution is typically too specialized to the specific points in the training set, and thus misses general relationships. Quadratic Program: A mathematical program where the objective is quadratic and the constraints are linear equations or inequalities. Supervised Learning: Learning how to label each data point in a dataset through consideration of repeated examples (training set) where the labels are known. The goal is to induce a labeling for data points where the labels are unknown (test set). Support Vector: A data point in a support vector machine classification problem that lies on the “wrong” side of its corresponding bounding plane. Support Vector Machine: A particular optimization problem that determines the best surface for separating two classes of data. This separating surface is one that best satisfies two possibly contradictory goals: minimizing the number of misclassified points, but also reducing the effect of overfitting.
Survival Analysis and Data Mining

Qiyang Chen Montclair State University, USA
Alan Oppenheim Montclair State University, USA
Dajin Wang Montclair State University, USA
INTRODUCTION

Survival analysis (SA) consists of a variety of methods for analyzing the timing of events and/or the times of transition among several states or conditions. The event of interest can happen at most once to any individual or subject. Alternate terms to identify this process include Failure Analysis (FA), Reliability Analysis (RA), Lifetime Data Analysis (LDA), Time to Event Analysis (TEA), Event History Analysis (EHA), and Time Failure Analysis (TFA), depending on the type of application for which the method is used (Elashoff, 1997). Survival Data Mining (SDM) is a new term that was coined recently (SAS, 2004). There are many models and variations of SA. This article discusses some of the more common methods of SA with real-life applications. The calculations for the various models of SA are very complex. Currently, multiple software packages are available to assist in performing the necessary analyses much more quickly.
BACKGROUND The history of SA can be roughly divided into four periods: the Grauntian, Mantelian, Coxian, and Aalenian paradigms (Harrington, 2003). The first paradigm dates back to the 17th century with Graunt’s pioneering work (Holford, 2002), which attempted to understand the distribution for the length of human life through life tables. During World War II, early life tables’ analysis led to reliability studies of equipment and weapons and was called TFA. The Kaplan-Meier method, a main contribution during the second paradigm, is perhaps the most popular means of SA. In 1958, a paper by Kaplan and Meier in the Journal of the American Statistical Association “brought the analysis of right-censored data to the attention of mathematical statisticians” (Oakes, 2000, p. 282). The Kaplan-Meier product limit method is a tool
used in SA to plot survival data for a given sample of a survival study. Hypothesis testing continued on these missing data problems until about 1972. Following the introduction by Cox of the proportional hazards model, the focus of attention shifted during the third paradigm to examining the impact of survival variables (covariates) on the probability of survival. This survival probability is known within the field as the hazard function. The fourth and last period is the Aalenian paradigm, as Statsoft, Inc. (2003) claims. Aalen used a martingale approach (exponential rate for counting processes) and improved the statistical procedures for many problems arising in randomly censored data from biomedical studies in the late 1970s.
MAIN THRUST

The two biggest pitfalls in SA are (a) the considerable variation in the risk across the time interval, which demonstrates the need for shorter time intervals, and (b) censoring. Censored observations occur when information on a subject's survival time is incomplete; this most often arises when subjects withdraw or are lost from follow-up before the completion of the study. The effect of censoring often introduces bias into studies based upon incomplete data or partial information on survival or failure times. There are four basic approaches for the analysis of censored data: complete data analysis, the imputation approach, analysis with dichotomized data, and the likelihood-based approach (Leung, Elashoff, & Afifi, 1997). The most effective approach to censoring problems is to use methods of estimation that adjust for whether an individual observation is censored. These likelihood-based approaches include the Kaplan-Meier estimator and Cox regression, both popular methodologies. The Kaplan-Meier estimator allows for the estimation of survival over time, even for populations that include subjects who enter at different times or drop out.
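As a brief illustration (a minimal sketch assuming the lifelines package; the durations and censoring indicators are hypothetical), the Kaplan-Meier estimator can be fit to right-censored data as follows:

```python
from lifelines import KaplanMeierFitter

# Hypothetical follow-up times (e.g., months until the event of interest).
durations = [5, 8, 12, 12, 15, 20, 22, 30, 31, 40]
# 1 = event observed, 0 = censored (subject withdrew or was lost to follow-up).
event_observed = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)

# Estimated survival function S(t) and the median survival time.
print(kmf.survival_function_)
print("median survival time:", kmf.median_survival_time_)
```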
Having discovered the inapplicability of multiple regression techniques due to the distribution (exponential vs. normal) and censoring, Cox assumed “a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates” (Statsoft, Inc., 2003) and arrived at the assumption that the underlying hazard rate (rather than survival time) is a function of the independent variables (covariates) by way of a nonparametric model. As SA emerged and became refined through the periods, it is evident even from the general overview herein that increasingly more complex mathematical formulas were being applied. This was done in large measure to account for some of the initial flaws in the research population (i.e., censoring), to provide for the comparison of separate treatments, and to take entirely new approaches concerning the perceived distributions of the data. As such, the calculations and data collection for the various models of SA became very complex, requiring the use of equally sophisticated computer programs. In that vein, software packages capable of performing the necessary analyses have been developed and include but are not limited to SAS/STAT software (compares survival distributions for the event-time variables, fits accelerated failure time models, and performs regression analysis based on the proportional hazards model) (SAS, 2003). Also available is the computer software NCSS 2004 statistical analysis system (2003).
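A corresponding sketch of the Cox proportional hazards model (again assuming the lifelines package and pandas; the covariates and their values are hypothetical) relates the hazard to covariates while handling censored rows directly:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: follow-up time, event indicator, and two covariates.
df = pd.DataFrame({
    "duration": [5, 8, 12, 12, 15, 20, 22, 30, 31, 40, 6, 18, 25, 33, 9],
    "event":    [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "age":      [54, 61, 47, 66, 58, 43, 71, 50, 63, 69, 57, 45, 60, 66, 52],
    "treated":  [0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")

# Estimated coefficients and hazard ratios (exp(coef)) for each covariate.
print(cph.summary[["coef", "exp(coef)", "p"]])
```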
Multiple Area Applications

The typical objective of SA in demography and medical research centers on clinical trials designed to evaluate the effectiveness of experimental treatments, to model disease progression in an effort to take preemptive action, and to estimate disease prevalence within a population. The fields of engineering and biology found applicability for SA later. There is always a need for more data analysis. The information gained from a successful SA can be used to make estimates on treatment effects, employee longevity, or product life. As SA went through more advanced stages of development, business-related fields such as economics and the social sciences started to use it. With regard to business strategy, SA can be used to predict, and thereby improve upon, the life span of manufactured products or customer relations. For example, by identifying the timing of risky behavior patterns (Teradata, 2003) that lead to reduced survival probability in the future (ending the business relationship), a decision can be made to select the appropriate marketing action and its associated cost.
Lo, MacKinlay, and Zhang (2002) of MIT Sloan School of Management developed and estimated an econometric model of limit-order execution times. They estimated versions for time-to-first-fill and time-to-completion for both buy and sell limit orders and incorporated the effects of explanatory variables such as the limit price, limit size, bid/ offer spread, and market volatility. Through SA of actual limit-order data, they discovered that execution times are very sensitive to the limit price but are not sensitive to limit size. Hypothetical limit-order executions, constructed either theoretically from first-passage times or empirically from transaction data, are very poor proxies for actual limit-order executions. Blandón (2001) investigated the timing of foreign direct investment in the banking sector which, among other things, leads to differential benefits for the first entrants in a foreign location and to the problem of reversibility. When uncertainty is considered, the existence of some ownership-location-internalization advantages can make foreign investment less reversible and/or more delayable. Such advantages are examined, and a model of the timing of foreign direct investment specified. The model is then tested for a case using duration analysis. In many industries, alliances have become the organization model of choice. Having used data from the Airline Business annual surveys of airline alliances, Gudmundsson and Rhoades (2001) tested a proposed typology predicting survival and duration in airline alliances. They classified key activities of airline alliances by their level of complexity and resource commitment in order to suggest a series of propositions on alliance stability and duration. The results of their analysis indicate that alliances containing joint purchasing and marketing activities had lower risk of termination than alliances involving equity. Kimura and Fujii (2003) conducted a Cox-type SA of Japanese corporate firms using census-coverage data. A study of exiting firms confirmed several characteristics of Japanese firms in the 1990s. They found that in order to increase the probability of survival, an efficient concentration on core competencies, but not excessive internalization in the corporate structure and activities, is vital to a company. They also found that via carefully selected channels, a firm’s global commitment helps Japanese firms be more competitive and more likely to survive. SA concepts and calculations were applied by Hough, Garitta, and Sánchez (2004) to consumers’ acceptance/ rejection data of samples with different levels of sensory defects. The lognormal parametric model was found adequate for most defects and allowed prediction of concentration values corresponding to 10% probability of consumer rejection.
The state of the psychotherapy termination literature to date might best be characterized as inconclusive. Despite decades of studies, almost no predictors of premature termination have emerged consistently. An examination of this literature reveals a number of recurrent methodological-analytical problems that likely have contributed substantially to this state. SA, which was designed for longitudinal data on the occurrence of events, not only circumvents these problems but also capitalizes on the rich features of termination data and opens brand new avenues of investigation (Corning & Malofeeva, 2004). From the measurement of the relationship between income inequality and the time-dependent risk (hazard) of a subsequent pregnancy (Gold et al., 2004) to self-reported teenagers' crash involvements and citations (McCartt, Shabanova & Leaf, 2003) to investigation of the role of product features in preventing customer churn (Larivière & Poel, 2004) to the improvement of the operations management process in the provision of service (Pagell & Melnyk, 2004) and factors affecting corporate survival rates (Parker, Peters & Turetsky, 2002), additional applicability of SA in the business setting was significant and varied, from personnel management to accounting to equipment maintenance and repair.
Combination with Other Methods

SA can be combined with many other decision models in the real world. Each model has its share of advantages and shortcomings. The complementary effects, supporting arguments, and different viewpoints may strengthen the final results. Eleuteri, Tagliaferri, Milano, De Placido and De Laurentiis (2003) present a feedforward neural network architecture aimed at survival probability estimation, which generalizes the standard (usually linear) models described in the literature. The network builds an approximation to the survival probability of a system at a given time, conditional on the system features. The resulting model is described in a hierarchical Bayesian framework. Experiments with synthetic and real-world data compare the performance of this model to the commonly used standard ones. With the introduction of compulsory long-term care (LTC) insurance in Germany in 1995, a large claims portfolio with a significant proportion of censored observations became available. Czado and Rudolph (2002) presented an analysis of part of this portfolio by using the Cox proportional hazard model to estimate transition intensities. In contrast to the more commonly used Poisson regression with graduation approach, where censored observations and time-dependent risk factors are ignored, this approach allows the inclusion of both censored observations as well as time-dependent risk
factors, such as time spent in LTC. Furthermore, they calculated premiums for LTC insurance plans in a multiple state Markov process based on these estimated transition intensities. Vance and Geoghegan (2002) of the U.S. EPA National Center for Environmental Economics took as their point of departure a simple utility-maximizing model that suggests many possible determinants of deforestation in an economic environment characterized by missing or thin markets. Hypotheses from the model are tested on a data set that combines a time series of satellite imagery with data collected from a survey of farm households whose agricultural plots were georeferenced by using a global positioning system (GPS). Model results suggest that the deforestation process is characterized by nonlinear duration dependence, with the probability of forest clearance first decreasing and then increasing with the passage of time.
Theoretical Improvements

Molinaro, Dudoit and Laan (2004) proposed a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. A number of common estimation procedures follow this approach in the full data situation, but depart from it when faced with the obstacle of evaluating the loss function for censored observations. They argue that one can and should also adhere to this estimation road map in censored data situations. Although SA traditionally included all the information on a subject during a particular interval, period analyses look just at the survival experience in a recent time interval. Therefore, SA allows the researcher to limit or cut off the survival experience at the beginning and end of any chosen interval and allows this experience to be adapted to studies where short-term survival is common. The idea is therefore that the results are less biased, as Smith, Lambert, Botha and Jones (2004) proved. It is possible that this technique will be more widely used in the future, as it seems to be more practical. Multivariate survival data arise when subjects in the same group are related to each other or when there are multiple recurrences of the disease in the same subject. A common goal of SA is to relate the outcome (time to event) to a set of covariates. Gao, Manatunga, and Chen (2004) focused on prognostic classification for multivariate survival data where identifying subgroups of patients with similar prognosis is of interest. They proposed a computationally feasible method to identify prognostic groups with the widely used Classification and Regression Trees (CART) algorithm, a popular one in data mining.
Limitations of SA

Underlying assumptions in the models, dealing with censored data, and statistical power have been problem areas for SA. According to Fiddell and Tabachnick (2001, p. 805), the challenging issues in SA "include testing the assumption of proportionality of hazards, dealing with censored data, assessing strength of association of models and individual covariates, choosing among the variety of statistical tests for differences among treatment groups and contributions of covariates, and interpreting odds ratios." Missing data is a common problem with SA. Larger samples are required for testing with covariates. Normality of sampling distributions, linearity, and homoscedasticity lead to better predictability and less difficulty in dealing with outliers. Censored cases should be systematically similar to those remaining in the study; otherwise, the selection can no longer be considered randomly assigned. The conditions ought to remain constant throughout the experiment. Those assumptions are challengeable. As with any scientific method, an element of art needs to be added in order to make the theory more usable. Certainly, SA is subject to GIGO (Garbage In, Garbage Out), because the results of SA can be strongly influenced by the presence of error in the original data. As with any form of data gathering and analysis, it is important that researchers use only information that can be considered relevant to the subject at hand.

FUTURE TRENDS

The past centuries have shown great strides in the development of the field of SA, and there is no reason for their use to become anything but more important. As the computer age continues and advanced mathematical problems are solved with the stroke of a few keys, the use of SA will only become more important and play a greater role in our everyday lives. Certainly, using incorrect models will lead to erroneous results and conclusions. We imagine that in the future, a single and unified model may dramatically increase the power for all SA studies. Also, SDM as a new branch of data mining may integrate with other data-mining tools. SA is based on a foundation of common principles and a common goal: no end to transformations of SA methodologies is in sight, and new variations on the theme and new applications for those variations are constantly forming. The use of SA is a significant contribution to society and will increase the longevity of populations in the future.

CONCLUSION

Aside from mathematics and economics, SA is mostly used in the medical field. Gradually, SA has also been widely used in the social sciences, where interest is on analyzing time-to-events such as job changes, marriage, birth of children, and so forth. To the extent that the second paradigm of Mantel began a mere 50 years ago, the expansion and development of SA today is indeed remarkable. The progress of Kaplan-Meier, Mantel, Cox, and Aalen, as well as that of others not even mentioned in this article, has proven SA as a reliable scientific tool susceptible to the rigors of modern mathematics. In order to properly administer treatment, caregivers and pharmaceutical providers should incorporate SA into the decision-making process. The same holds true for the effective management of business operations. It demonstrates that SA is a dynamic field, with many advances since its inception as well as many opportunities for evolution in the future. Technology and SA must simply enjoy a symbiotic relationship for both to flourish. SA is a dynamic and developing science with no boundaries other than those that are imposed upon it by human limitations.

REFERENCES

Bland, M., & Douglas, A. (1998, December). Statistics notes: Survival probabilities — the Kaplan-Meier method. British Medical Journal.

Blandón, J. G. (2001). The timing of foreign direct investment under uncertainty: Evidence from the Spanish banking sector. Journal of Economic Behavior & Organization, 45(2), 213-224.
Corning, A. F., & Malofeeva, E. V. (2004). The application of survival analysis to the study of psychotherapy termination. Journal of Counseling Psychology, 51(3), 354367. Czado, C., & Rudolph, F. (2002). Application of survival analysis methods to long-term care insurance. Insurance: Mathematics and Economics, 31(3), 395-413. Eleuteri, A., Tagliaferri, R., Milano, L., De Placido, S., & De Laurentiis, M. (2003). A novel neural network-based survival analysis model. Neural Networks, 16(5-6), 855-864. Fiddell, L., & Tabachnick, B. (2001). Using multivariate statistics. Allyn & Bacon. Gao, F., Manatunga, A. K., & Chen, S. (2004). Identification of prognostic factors with multivariate survival data. Computational Statistics & Data Analysis, 45(4), 813-824.
Gold, R., Connell, F.A., Heagerty, P., Bezruchka, S., Davis, R., & Cawthon, M.L. (2004). Income inequality and pregnancy spacing. Social Science & Medicine, 59(6), 11171126.
Pagell, M., & Melnyk, S. (2004). Assessing the impact of alternative manufacturing layouts in a service setting. Journal of Operations Management, 22, 413-429.
Gudmundsson, S. V., & Rhoades, D. L. (2001). Airline alliance survival analysis: Typology, strategy, and duration. Transport Policy, 8(3), 209-218.
Parker, S., Peters, G.F., Turetsky, H.F. (2002). Corporate governance and corporate failure: A survival analysis. Corporate Governance, International Journal of Business Society, 2(2), 4-12.
Harrington, D. (2003). History of survival data analysis. Retrieved from http://filebox.vt.edu/org/stathouse/ Survival.html
SAS (2003). SAS/STAT [Computer software]. Retrieved from http://www.sas.com/technologies/analytics/statistics/stat/
Holford, T. (2002). Multivariate methods in epidemiology. New York: Oxford University Press.
SAS (2004). Survival data mining: Predictive hazard modeling for customer history data. Retrieved from http://support.sas.com/training/us/crs/bmce.html
Hough, G., Garitta, L., & Sánchez, R. (2004). Determination of consumer acceptance limits to sensory defects using survival analysis. Food Quality and Preference. Kimura, F., & Fujii, T. (2003). Globalizing activities and the rate of survival: Panel data analysis on Japanese firms. Journal of the Japanese and International Economies, 17(4), 538-560. Larivière, B., & Poel, D. V. (2004). Investigating the role of product features in preventing customer churn by using survival analysis and choice modeling: The case of financial services. Expert Systems with Applications, 27(2), 277-285.
Smith, L.K., Lambert, P.C., Botha, J.L. & Jones, D.R. (2004). Providing more up-to-date estimates of patient survival: A comparison of standard survival analysis with period analysis using life-table methods and proportional hazards models. Journal of Clinical Epidemiology, 57(1), 14-20. Statsoft, Inc. (2003). Survival/Failure Time Analysis. Retrieved from http://www.stasoftinc.com/textbook/ stsurvan.html Tableman, M. (2003). Survival analysis using S: Analysis of time-to-event data. Chapman & Hall/CRC.
Leung, K., Elashoff, R., & Afifi, A. (1997). Censoring issues in survival analysis. Annual Review of Public Health, 18, 83-104.
Teradata (2003). New customer survival analysis solution for telcos. Retrieved from http://www.business wire.com
Lo, A. W., MacKinlay, A. C., & Zhang, J. (2002). Econometric models of limit-order executions. Journal of Financial Economics, 65(1), 31-71.
Vance, C., & Geoghegan, J. (2002). Temporal and spatial modeling of tropical deforestation: A survival analysis linking satellite and household survey data. Agricultural Economics, 27(3), 317-332.
McCartt, A., Shabanova, V. & Leaf, W. (2003). Driving experience, crashes and traffic citations of teenage beginning drivers. Accident Analysis & Prevention, 35(3), 311320. Molinaro, A. M., Dudoit, S., & Laan, M.J. (2004). Treebased multivariate regression and density estimation with right-censored data. Journal of Multivariate Analysis, 90(1), 154-177. Morriso, J. (2004). Introduction to survival analysis in business. Journal of Business Forecasting. NCSS 2004 statistical analysis system [Computer software]. (2003). Retrieved from http://www.ncss.com/ ncsswin.html Oakes, D. (2000, March). Survival analysis. Journal of the American Statistical Association, 282-285.
KEY TERMS

Censored: Censored cases are those in which the survival times are unknown.

Cumulative Proportion Surviving: The cumulative proportion of cases surviving up to the respective interval. Because the probabilities of survival are assumed to be independent across the intervals, this probability is computed by multiplying out the probabilities of survival across all previous intervals (a worked sketch follows these terms). The resulting function is also called the survivorship or survival function.
Failure Analysis: Computing the time it takes for a manufactured component to fail.

Hazard Function: A time-to-failure function that gives the instantaneous probability of the event (failure), given that it has not yet occurred.

Life Tables: Tables that describe the survival rate as a function of time; the resulting function is referred to as the survivor function.
Lifetime (or Failure Time, Survival Data): Data that measure lifetime or the length of time until the occurrence of an event.

Proportion Failing: This proportion is computed as the ratio of the number of cases failing in the respective interval divided by the number of cases at risk in the interval.

Survival Time: The time to the occurrence of a given event.
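To make the life-table quantities defined above concrete, the following minimal Python sketch computes the proportion failing and the cumulative proportion surviving over a few intervals. The interval counts are made-up illustrative values, and the sketch ignores the within-interval censoring adjustment that a full life-table analysis would apply.

```python
# Illustrative life-table calculation; counts are made-up values.
intervals = [          # (number at risk, number failing) per interval
    (100, 5),
    (95, 10),
    (85, 17),
]

cumulative_surviving = 1.0
for at_risk, failing in intervals:
    proportion_failing = failing / at_risk
    cumulative_surviving *= 1.0 - proportion_failing
    print(f"failing: {proportion_failing:.3f}, "
          f"cumulative surviving: {cumulative_surviving:.3f}")
# failing: 0.050, cumulative surviving: 0.950
# failing: 0.105, cumulative surviving: 0.850
# failing: 0.200, cumulative surviving: 0.680
```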
Symbiotic Data Mining
Kuriakose Athappilly
Western Michigan University, USA

Alan Rea
Western Michigan University, USA
INTRODUCTION

Symbiotic data mining is an evolutionary approach to how organizations analyze, interpret, and create new knowledge from large pools of data. Symbiotic data miners are trained business and technical professionals skilled in applying complex data-mining techniques and business intelligence tools to challenges in a dynamic business environment.
BACKGROUND

Most experts agree (Anon, 2002; Thearling, 2002) that data mining began in the 1960s with the advent of computers that could store and process large amounts of data. In the 1980s, data mining became more common and widespread with the distribution of relational databases and SQL. In the 1990s, business saw a boom in data mining, as desktop computers and powerful server-class computers became affordable and powerful enough to process large amounts of data in data warehouses as well as real-time data via online analytical processing (OLAP). Today, we see an increasing use of advanced processing of data with the help of artificial intelligence technology tools, such as fuzzy logic, decision trees, neural networks, and genetic algorithms (Gargano & Raggad, 1999). Moreover, current trends are moving organizations to reclassify data mining as business intelligence, using such tools as Cognos (2004). We also see three distinct theoretical approaches to data mining: statistical (classical), artificial intelligence (heuristics), and machine learning (blended AI and statistics). The three approaches do not adhere to the historical boundaries applied to data mining; rather, they are embarkation points for data-mining practitioners (Kudyba & Hoptroff, 2001; Thuraisingham, 1999). It is not the intent of this discussion to argue which approach best informs data mining. Instead, we note that many software platforms adhere to one or more methods for solving problems via data-mining tools. Most organizations agree that sifting through data to create business intelligence, which they can use to gain a competitive edge, is an essential business component (Lee & Siau, 2001). Whether it is to gain customers,
increase productivity, or improve business processes, data mining can provide valuable information, if it is done correctly. In most cases, a triad of business manager, information technology technician, and statistician is needed even to begin the data-mining process. Although this combination can prove useful if a symbiotic relationship is fostered, typically, the participants cannot work effectively with one another, because they do not speak the same language. The manager is concerned with the business process, the technician with software and hardware performance, and the statistician with analyses of data and interpretations of newfound knowledge. While this may be an overgeneralization, it is not far from the truth. What is needed, then, is an individual who can pull all three components together—a symbiotic data miner trained in business, technology, and statistics.
MAIN THRUST

In this paper, we discuss how an individual, trained not only in business but also in technology and statistics, can add value to any data-mining and business-intelligence effort by assisting an organization to choose the right data-mining techniques and software as well as interpret the results within an informed business context.
Data Mining in Contemporary Organizations

Data mining is the “semi-automatic discovery of patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data” (Dhond et al., 2000, p. 480). Analyzed data is often worth many times more than the sum of its parts. In other words, data mining can find new knowledge by observing relationships among the attributes in the form of predictions, clustering, or associations that many experts might miss. This new knowledge in a continuously changing environment is the most potent weapon for organizations to become and remain competitive. In today’s business environment, organizational intelligence is necessary to anticipate economic trends, predict
potential revenue streams, and create processes to maximize profits and efficiency. This is especially true for strategic and other mid-level managers (Athappilly, 2003). In the past, many decisions were made using corporate experience and knowledge experts. This is still true today. However, with the increased influx of data (some experts argue that the amount of information in the world doubles every 20 months) (Dhond et al., 2000), many high-level managers now turn to data-mining software in order to more effectively interpret trends and relationships among variables of interest. To support data mining, an increasing amount of funds is invested in complex software to glean the data for patterns of information; hardware is purchased that can effectively run the software and distribute the results; and personnel are continually retrained or hired. The personnel include IT technicians, knowledge experts, statisticians, and various business managers. The mix of personnel needed to effectively collect, glean, analyze, interpret, and then apply data-mined knowledge ultimately can lead to one of the biggest data-mining challenges—communicating results to business managers so that they can make informed decisions. Although the managers are the ones who ultimately make the decisions, they do not have the necessary skills, knowledge base, and techniques to assess whether the heuristics, software, and interpreted results accurately inform their decisions. There is ultimately a disjunction between theoretical interpretation and pragmatic application (Athappilly, 2004).
The Challenge for Contemporary Organizations

The challenge is twofold: (1) a shortcoming of many data-mining tools is the inability of anyone except experts to interpret the results. Business managers must be able to analyze the results of a data-mining operation to “help them gain insights … to make critical business decisions” (Apte et al., 2002, p. 49); and (2) business managers must rely on IT technicians to apply rules and algorithms, and then rely on statisticians and other experts to develop models and to interpret the results before applying them to a business decision. This process adds at least two layers between the decision and the data. Moreover, there are numerous opportunities for miscommunication and misinterpretation among team members. In order to flatten the layers between the requisite gleaned knowledge and its interpretation and application, a new type of business IT professional is needed to create a symbiotic relationship that can sustain itself without the triadic team member requirements and the inherent polarities among them.
The Solution for Contemporary Organizations

The solution to this complex data-mining challenge is the symbiotic data miner. The symbiotic data miner is a trained business information system professional with a background in statistics and logic. A symbiotic data miner not only can choose the correct data-mining software packages and approaches, but also can analyze and glean knowledge from large data warehouses. Combined with today’s complex analysis and visualization software, such as Clementine (SPSS, 2004) and Enterprise Miner (SAS, 2004), the symbiotic data miner can create illustrative visual displays of data patterns and apply them to specific business challenges and predictions. Just as today’s business managers use spreadsheets to predict market trends, analyze business profits, or manage strategic planning, the symbiotic data miner can fulfill the same functions on a larger scale using complex data-mining software. Moreover, the miner also can directly apply these results to organizational missions and goals or advise management on how to apply the gleaned knowledge. Figure 1 demonstrates how the symbiotic data miner (Athappilly, 2002) is situated at the crux of data-mining technology, statistics, logic, and application. These components of business (corporate needs, actionable decisions, and environmental changes), technology (databases, AI, interactive, and visualization tools), and statistical and theoretical models (math/stat tools) all flow into the symbiotic data miner’s realm. The symbiotic data miner plays a crucial role in flattening the layers between data-mining theory and statistics, technical support, and business acumen. The miner can reduce miscommunication and bring applicable knowledge to a business challenge more quickly than a triadic team of business manager, technician, and statistician. While we do not recommend replacing all managers, technicians, and statisticians with miners, we do recommend that organizations infuse their data-mining decisions and business intelligence departments with symbiotic data miners.
FUTURE TRENDS

In the near future, organizations will have allocated positions for symbiotic data miners. These business-informed, technologically-adept individuals will play a crucial role in strategic management decisions and long-term mission planning initiatives.
Figure 1. Symbiotic data miner
The fledgling business intelligence departments of today will continue to grow and infuse themselves into every aspect of the organizational structure. Through sheer success, these departments will be subsumed into every department, with symbiotic data miners specializing in particular business aspects. Even further into the future, symbiotic data mining will simply become a way of doing business. As software infused with business protocols and data-mining technology increases (Rea & Athappilly, 2004), business will implement data-mining systems on the desktop. Business managers at all levels will use symbiotic data-mining software as easily as many use office suite software (e.g., Microsoft Office) today. The software that supports data mining will be user-friendly, transparent, and intuitive. Simultaneously, users will have experienced increased exposure to higher education, become more familiar with quantitative methods and technology tools, and be better informed of the business culture and environment. As a result, data mining will be an inevitable routine activity implemented to make more informed decisions. The catalyst for this movement will be a collective force comprised of educators, students, and business professionals. Through internships, continued research, and increased business and academic partnerships and collaborations, the integration of business, data-mining technology, statistics, and theory into practical business software will become a reality.
CONCLUSION

The symbiotic data miner will not come about without changes in how individuals are trained in higher education
and on the job. Without a combination of business, information technology, statistics, and logic, we cannot look for an infusion of symbiotic data miners anytime soon. As organizations move more toward business intelligence, we will see more symbiotic data miners, even though we may not identify them by this name.
REFERENCES

Anon. (2002). A brief history of data mining. Retrieved July 28, 2004, from http://www.data-miningsoftware.com/data_mining_history.htm

Apte, C., Liu, B., Pednault, E., & Smyth, P. (2002). Business applications of data mining. Communications of the ACM, 45(8), 49-53.

Athappilly, K. (2002). Symbiotic mining: An antidote for corporate insanity. Proceedings of High Performance Computing (HiPC), Bangalore, India.

Athappilly, K. (2003). Data mining coming of age and corporate insanity in diminishing returns. Proceedings of the 39th Annual Meeting of the Midwest Business Administration Association, Chicago, Illinois.

Athappilly, K. (2004). Data mining at crossroads: A retailer’s story. Proceedings of the 40th Annual Meeting of the Midwest Business Administration Association, Chicago, Illinois.

Cognos. (2004). Enterprise business intelligence. Retrieved August 3, 2004, from http://www.cognos.com/products/businessintelligence/

Dhond, A., Gupta, A., & Vadhavkar, S. (2000). Data mining techniques for optimizing inventories for electronic commerce. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts.

Gargano, M., & Raggad, B. (1999). Data mining—A powerful information creating tool. OCLC Systems and Services, 15(2), 81-90.

Kudyba, S., & Hoptroff, R. (2001). Data mining and business intelligence: A guide to productivity. Hershey, PA: Idea Group Publishing.

Lee, S., & Siau, K. (2001). A review of data mining techniques. Industrial Management & Data Systems, 101(1), 41-46.

Rea, A., & Athappilly, K. (2004). End-user data mining using the E2DM prototype: A discussion of prototype development, testing, and evaluation. Proceedings of the 2004 Midwest Decision Sciences Institute, Cleveland, Ohio.

SAS. (2004). Enterprise miner. Retrieved August 3, 2004, from http://www.sas.com/technologies/analytics/datamining/miner/

SPSS. (2004). Clementine. Retrieved August 3, 2004, from http://www.spss.com/clementine/

Thearling, K. (2002). An introduction to data mining. Retrieved July 28, 2004, from http://www.thearling.com/text/dmwhite/dmwhite.htm

Thuraisingham, B. (1999). Data mining: Technologies, techniques, tools, and trends. Boca Raton, FL: CRC Press.

KEY TERMS

Artificial Intelligence: A field of information technology that studies how to imbue computers with human characteristics and thought. Expert systems, natural language, and neural networks fall under the AI research area.

Business Intelligence: Information that enables high-level business managers and executives to make strategic and long-term business decisions.

Clementine: Data-mining software developed by SPSS Corporation that is used to create predictive models to solve business challenges.

Cognos: Business intelligence software that enables organizations to monitor performance and develop strategic business solutions based on collected data.

Decision Trees: Tree-shaped structures that represent sets of decisions. Different types of decision trees, such as Classification and Regression Trees (CART), allow experts to create validated decision models that can then be applied to new datasets.

Enterprise Miner: Data-mining software developed by SAS Corporation that is used to create predictive models to solve business challenges.

Fuzzy Logic: A type of logic that does not rely on a binary yes or no. Instead, computer systems are able to rank responses on a scale of 0.0 to 1.0, with 0.0 being false and 1.0 being true. This allows computer systems to deal with probabilities rather than absolutes.

Genetic Algorithms: A large collection of rules that represents all possible solutions to a problem. Inspired by Darwin’s theory of evolution, these rules are simultaneously applied to data using powerful software on high-speed computers. The best solutions are then used to solve the problem.

Heuristics: A set of rules derived from years of experience in solving problems. These rules can be drawn from previous examples of business successes and failures. Artificial intelligence models rely on these rules to find relationships, patterns, or associations among variables.

Machine Learning: This involves a combination of AI and statistics. Software programs are able to predict and learn approaches to solve problems after repeated attempts.

Neural Networks: An artificial intelligence program that attempts to learn and make decisions much like the human brain. Neural networks function best with a large pool of data and examples from which they can learn.

OLAP: An acronym for Online Analytical Processing. OLAP tools allow users to analyze different dimensions of multi-dimensional data.

SQL: Structured Query Language. This is a standardized query language used to pull information from a database.

Symbiotic Data Miner: An individual trained in business, information technology, and statistics. The symbiotic data miner is able to implement data-mining solutions, interpret the results, and then apply them to business challenges.
Symbolic Data Clustering
Edwin Diday
University of Dauphine, France

M. Narasimha Murty
Indian Institute of Science, India
INTRODUCTION

In data mining, we generate class/cluster models from large datasets. Symbolic Data Analysis (SDA) is a powerful tool that permits dealing with complex data (Diday, 1988), where a combination of variables and logical and hierarchical relationships among them are used. Such a view permits us to deal with data at a conceptual level, and as a consequence, SDA is ideally suited for data mining. Symbolic data have their own internal structure that necessitates new techniques, which generally differ from the ones used on conventional data (Billard & Diday, 2003). Clustering generates abstractions that can be used in a variety of decision-making applications (Jain, Murty, & Flynn, 1999). In this article, we deal with the application of clustering to SDA.
BACKGROUND

In SDA, we consider multivalued variables, products of interval variables, and products of multivalued variables with associated weights (Diday, 1995). Clustering of symbolic data (Gowda & Diday, 1991; De Souza & De Carvalho, 2004) generates a partition of the data and also descriptions of clusters in the partition using symbolic objects. It can have applications in several important areas coming under data mining:

• Pattern Classification: The abstractions generated can be used for efficient classification (Duda, Hart, & Stork, 2001).

• Database Management: SDA permits generation of symbolic objects from relational databases (Stéphan, Hébrail, & Lechevallier, 2000). Usage of data in aggregate form, where variables assume interval values, can be handy. This not only permits a brief description of a large dataset but also helps in dealing with privacy issues associated with information about an individual (Goupil, Touati, Diday, & Moult, 2000). An important source of symbolic data is provided by relational databases if we have an application that needs several relations merged (Bock & Diday, 2000).

• Knowledge Management: It is possible to extract meaningful conceptual knowledge from clustering symbolic data. It is also possible to use expert knowledge in symbolic clustering (Rossi & Vautrain, 2000).

• Biometrics: Clustering is used in a variety of biometric applications, including face recognition, fingerprint identification, and speech recognition. It is also used in protein sequence grouping (Zhong & Ghosh, 2003).

The SDA community enjoys a right mix of theory and practice. The Symbolic Official Data Analysis System (SODAS) software package developed over the past few years is available for free distribution (Morineau, 2000).
MAIN THRUST

We deal with various components of a symbolic data-clustering system in this section.
Symbolic Data Analysis (SDA)

In SDA, the input comes in the form of a table; columns of the table correspond to symbolic variables, which are used to describe a set of individual patterns. Rows of the table are symbolic descriptions of these individuals. They are different from the conventional descriptions that employ a vector of quantitative or categorical values to represent an individual (Jain & Dubes, 1988). The cells of this symbolic data table may contain data of the following types (a small illustrative sketch follows the list):

1. A single quantitative value: for example, height (John) = 6.2.

2. A single categorical value: for example, color_of_eyes (John) = blue.
3. A set of values or categories: for example, room_number (John) = {203, 213, 301}, which means that the number of John’s room is either 203, 213, or 301.

4. An interval: for example, height (John) = [6.0, 6.3], which means that John’s height is in the interval [6.0, 6.3]; note that the minimum and maximum values of an interval are instances of an interval variable, so the interval is an instance of an ordered pair of interval variables.

5. An ordered set of values with associated weights: here we have either a histogram or a membership function. In the case of the histogram, the weight is the normalized frequency of occurrence, and in the case of the membership function, the weight corresponds to the membership of the value in the concept. Note that this definition also permits us to deal with variables that have probability distributions, or even functions, as their values.
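As an illustration of the cell types listed above, the following minimal Python sketch represents one individual’s symbolic description with plain data structures. The variable names and values are hypothetical and are not tied to the SODAS software.

```python
# One individual's symbolic description; names and values are illustrative.
john = {
    "height": 6.2,                           # 1. single quantitative value
    "color_of_eyes": "blue",                 # 2. single categorical value
    "room_number": {203, 213, 301},          # 3. set of possible categories
    "height_interval": (6.0, 6.3),           # 4. interval as (min, max)
    "favorite_drink": {"coffee": 0.6,        # 5. weighted categories
                       "tea": 0.3,           #    (histogram or membership
                       "water": 0.1},        #     function)
}

# A symbolic data table is then simply a collection of such descriptions.
symbolic_table = [john]
```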
In addition, it is possible to have logical and structural relationships among these variables. For example, the statement “If the age of John is between one and two months, then the height of John is between 30 and 40 centimeters” is a logical implication. Two or more variables could be hierarchically related (Bock & Diday, 2000). For example, the variable color is considered to be light if it is yellow, white, or metallic. Similarly, we can describe the make and model of a car if one owns a car, which depicts dependency between variables (Bock & Diday, 2000). Symbolic objects are formal operational models of concepts. A concept in the real world is mathematically described by a symbolic object; it may use a formula in a classical logic or a multivalued logic to describe the concept. In addition, the symbolic object provides a way to calculate the extent of a concept, which is a set of individuals in the real world associated with the concept (Diday, 2002). The important step in symbolic clustering is to output symbolic objects corresponding to the clustering. These output symbolic descriptions are used in a variety of decision-making situations and can be used again as new units for a higher level analysis or clustering (Bock & Diday, 2000).
Dissimilarity Measures for Symbolic Objects

In conventional clustering, we put similar objects in the same group and dissimilar objects in different groups (Jain et al., 1999). So the notion of similarity/dissimilarity plays an important role in arriving at the partition of the dataset. A good collection of dissimilarity
measures is used in dealing with conventional data consisting of only numerical or categorical variables (Duda et al., 2001). The need for computing dissimilarities between symbolic objects is obvious because we would like to group, for reducing both time and space requirements, symbolic objects that are summaries of groups of objects. An excellent collection of dissimilarity measures between symbolic objects is given in Esposito, Malerba, and Lisi (2000). It is possible to use a distance function to capture the dissimilarity. A simple view is to accept that similarity can be obtained from dissimilarity between objects. However, it may be inspiring to view similarity and dissimilarity as complementing each other. A variety of dissimilarity functions are defined and used in symbolic clustering. One of the most popular dissimilarity measures is the one proposed by De Carvalho (1998). Dissimilarity measures for histograms and probability distributions are reported in Bock and Diday (2000).
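As a concrete, simplified example of an interval dissimilarity, the sketch below sums city-block distances between interval bounds, in the spirit of interval-based measures such as De Souza and De Carvalho (2004); it is not the exact formula from the measures cited above, and the sample descriptions are hypothetical.

```python
# Simplified interval dissimilarity: sum of city-block distances between
# the lower and upper bounds of each shared interval-valued variable.
def interval_dissimilarity(a, b):
    """a and b map variable names to (low, high) intervals."""
    total = 0.0
    for var in a.keys() & b.keys():
        (a_lo, a_hi), (b_lo, b_hi) = a[var], b[var]
        total += abs(a_lo - b_lo) + abs(a_hi - b_hi)
    return total

x = {"height": (6.0, 6.3), "weight": (140.0, 150.0)}
y = {"height": (5.8, 6.1), "weight": (150.0, 165.0)}
print(interval_dissimilarity(x, y))   # 0.4 + 25.0 = 25.4
```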
Grouping Algorithms

Traditionally, clustering algorithms are grouped into hierarchical and partitional categories (Jain et al., 1999). The hierarchical algorithms are computationally expensive, as they need to either compute and store a proximity matrix of size quadratic in the number of patterns or compute the proximity based on need using time that is cubic in the number of patterns. Even the incremental hierarchical algorithms need time that is quadratic in the number of objects. So even though hierarchical algorithms are versatile, they may not scale up well to handle large datasets. The partitional algorithms, such as the dynamic clustering algorithm (Diday & Simon, 1976), are better, as they take linear time in the number of inputs. So they have been successfully applied to moderately large datasets. The dynamic clustering algorithm may be viewed as a K-kernels algorithm, where a kernel could be the mean, a line, multiple points, a probability law, or other more general functions of the data. Such a general framework was proposed for the first time in the form of the dynamic clustering algorithm. The well-known k-means algorithm (Duda et al., 2001) is a special case of the dynamic clustering algorithm where the kernel of a cluster is the centroid. However, most of these partitional algorithms are iterative in nature and may require several scans of the data. It will be useful to explore schemes that can help in scaling up the existing symbolic clustering algorithms. The possible solutions are to:
• Use an incremental clustering algorithm (Jain et al., 1999). One of the simplest incremental algorithms is the leader algorithm (a minimal sketch follows this list). The advantage with incremental clustering algorithms is that they assign a new pattern to an existing cluster or to a new cluster based on the existing cluster representatives; they need not look at the patterns that are already processed. Such a framework can cluster the dataset using a single scan of the data.

• Read a part of the data to be clustered from the disk into the main memory and process the data block completely before transferring the next data block to the main memory. This idea was originally used in Diday (1975). Also, a single-pass k-means algorithm was designed based on considering the data blocks sequentially for clustering (Farnstrom, Lewis, & Elkan, 2000). A natural extension to these schemes is divide-and-conquer clustering (Jain et al., 1999). In the divide-and-conquer approach, the blocks of data can be clustered in parallel. However, clustering the resulting block-level representatives requires another step.
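The following minimal Python sketch of the leader algorithm mentioned in the first point above works on ordinary numeric vectors; the Euclidean distance and the threshold value are illustrative choices rather than anything prescribed by the article.

```python
# Leader (incremental) clustering: one scan, one representative per cluster.
def leader_clustering(points, threshold, distance):
    leaders = []        # one representative ("leader") per cluster
    labels = []         # cluster index assigned to each point
    for p in points:
        best, best_d = None, None
        for i, leader in enumerate(leaders):
            d = distance(p, leader)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= threshold:
            labels.append(best)                 # join an existing cluster
        else:
            leaders.append(p)                   # start a new cluster
            labels.append(len(leaders) - 1)
    return leaders, labels

euclidean = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
data = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (0.9, 1.1), (7.8, 8.3)]
leaders, labels = leader_clustering(data, threshold=1.0, distance=euclidean)
print(leaders)   # [(1.0, 1.0), (8.0, 8.1)]
print(labels)    # [0, 0, 1, 0, 1]
```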
Composite Symbolic Objects

It is important to represent a cluster of objects by using a compact abstraction for further decision making. A popular scheme is to use the symbolic centroid of the individuals in the cluster as the representative of the cluster. For example, for interval data, the interval made up of the mean of the minima and the mean of the maxima of the data in the cluster is prescribed as the representative. Conventional clustering schemes generate only a partition of the data. However, there are applications where a composite symbolic object (Gowda & Diday, 1991) represents the structure in the data better than a partition; for example, a symbolic object is related in different ways to some other symbolic objects. It is not possible to bring out the relation through partitions.
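A minimal sketch of the interval symbolic centroid described above: the representative interval pairs the mean of the minima with the mean of the maxima of the cluster members. The sample intervals are made-up values.

```python
# Symbolic centroid of interval data: (mean of minima, mean of maxima).
def interval_centroid(intervals):
    lows = [lo for lo, hi in intervals]
    highs = [hi for lo, hi in intervals]
    return sum(lows) / len(lows), sum(highs) / len(highs)

cluster = [(6.0, 6.3), (5.8, 6.1), (6.2, 6.6)]
print(interval_centroid(cluster))   # (6.0, 6.333...)
```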
SODAS Software

The aim of the SODAS software (Morineau, 2000) is to build symbolic descriptions of concepts and to analyse them by SDA. In addition to the generation of symbolic objects from databases (Stéphan et al., 2000), it permits clustering of individuals described by symbolic data tables to generate partitions, hierarchies, and pyramids where each cluster is a symbolic object. In the process, it employs the calculation of dissimilarities between symbolic objects. It provides a graphical representation as well (Noirhomme-Fraiture & Rouard, 2000). The software package offers many other operations on symbolic objects; refer to Bock and Diday (2000) for details.
FUTURE TRENDS

Clustering symbolic data is important because, in order to handle the large datasets that are routinely encountered in data mining, it is meaningful to deal with summaries or abstractions of groups of objects instead of dealing with individual objects. So clustering lies at the heart of data mining. Symbolic clustering permits us to deal with concepts that reveal the structure in the data and describe the data in a compact form. However, to deal with large datasets, some schemes for improving the scalability of the existing algorithms are required. One well-known algorithm design paradigm in this context is the divide-and-conquer strategy. It is to be explored in the context of symbolic data clustering. Another possible solution is to scan the database once, represent it by using a compact data structure, and use only this structure for further processing. There are several applications where symbolic data clustering is already used; for example, processing of census data from the Office for National Statistics (Goupil et al., 2000) and Web access mining (Arnoux, Lechevallier, Tanasa, Trousse, & Verde, 2003). It is important to consider other application areas.
CONCLUSION

Clustering symbolic objects generates symbolic objects; it is not possible to generate relations between objects by using a conventional clustering tool. Symbolic data clustering has an excellent application potential. Many researchers have contributed to some of the important components, such as representation of symbolic objects and dissimilarity computation. However, there is an important need to explore schemes for realizing scalable clustering algorithms. Also, the notion of “pyramid” (Diday, 1986) is general enough to handle details in a hierarchical manner and also can be useful to generate both hard and soft partitions of data. It is important to explore efficient schemes for building pyramids. Another important area where additional work is required is in representing clusters of symbolic objects, which is called composite symbolic object generation. Clustering symbolic objects is an important activity to deal with the ever-growing datasets that are routinely collected and processed in data mining.
REFERENCES

Arnoux, M., Lechevallier, Y., Tanasa, D., Trousse, B., & Verde, R. (2003). Automatic clustering for Web usage mining. In D. Petcu, D. Zaharie, V. Negru, & T. Jebeleanu (Eds.), Proceedings of the International Workshop on Symbolic and Numeric Algorithms for Scientific Computing (pp. 54-66). Mirton, Timisoara.

Billard, L., & Diday, E. (2003). From the statistics of data to the statistics of knowledge: Symbolic data analysis. Journal of the American Statistical Association, 98(462), 470-487.

Bock, H.-H., & Diday, E. (Eds.). (2000). Analysis of symbolic data. Berlin, Germany: Springer-Verlag.

De Carvalho, F. A. T. (1998). Extension based proximities between constrained Boolean symbolic objects. In C. Hayashi, N. Oshumi, K. Yajima, Y. Tanaka, H.-H. Bock, & Y. Baba (Eds.), Advances in data science, classification and related methods (pp. 370-378). Tokyo: Springer-Verlag.

De Souza, R. M. C. R., & De Carvalho, F. A. T. (2004). Clustering of interval data based on city-block distances. Pattern Recognition Letters, 25(3), 353-365.

Diday, E. (1975). Classification automatique séquentielle pour grands tableaux. RAIRO B-1, 29-61.

Diday, E. (1986). Orders and overlapping clusters by pyramids. In J. De Leeuw, W. J. Heiser, J. J. Meulman, & F. Critchley (Eds.), Multidimensional data analysis (pp. 201-234). Leiden, The Netherlands: DSWO Press.

Diday, E. (1988). The symbolic approach in clustering and related methods of data analysis: The basic choices. In H.-H. Bock (Ed.), IFCS-87, 673-684.

Diday, E. (1995). Probabilist, possibilist and belief objects for knowledge analysis. Annals of Operations Research, 55, 227-276.

Diday, E. (2002). An introduction to symbolic data analysis and Sodas software. Journal of Symbolic Data Analysis, 0(0), 1-25.

Diday, E., & Simon, J. C. (1976). Cluster analysis. In K. S. Fu (Ed.), Digital pattern recognition (pp. 47-94). Berlin, Germany: Springer-Verlag.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.

Esposito, F., Malerba, D., & Lisi, F. A. (2000). Dissimilarity measures for symbolic objects. In H.-H. Bock & E. Diday (Eds.), Analysis of symbolic data (pp. 165-186). Berlin, Germany: Springer-Verlag.

Farnstrom, F., Lewis, J., & Elkan, C. (2000). Scalability of clustering algorithms revisited. ACM SIGKDD Explorations, 2(1), 51-57.

Goupil, F., Touati, M., Diday, E., & Moult, R. (2000). Processing census data from ONS. In H.-H. Bock & E. Diday (Eds.), Analysis of symbolic data (pp. 382-385). Berlin, Germany: Springer-Verlag.

Gowda, K. C., & Diday, E. (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24(6), 567-578.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. NJ: Prentice Hall.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323.

Morineau, A. (2000). The SODAS software package. In H.-H. Bock & E. Diday (Eds.), Analysis of symbolic data (pp. 386-391). Berlin, Germany: Springer-Verlag.

Noirhomme-Fraiture, M., & Rouard, M. (2000). Visualizing and editing symbolic objects. In H.-H. Bock & E. Diday (Eds.), Analysis of symbolic data (pp. 125-138). Berlin, Germany: Springer-Verlag.

Rossi, F., & Vautrain, F. (2000). Expert constrained clustering: A symbolic approach. In D. A. Zighed, J. Komorowski, & J. Zytkow (Eds.), Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 605-612), Lyon, France.

Stéphan, V., Hébrail, G., & Lechevallier, Y. (2000). Generation of symbolic objects from relational databases. In H.-H. Bock & E. Diday (Eds.), Analysis of symbolic data (pp. 78-105). Berlin, Germany: Springer-Verlag.

Zhong, S., & Ghosh, J. (2003). A unified framework for model-based clustering. Journal of Machine Learning Research, 4(11), 1001-1037.
KEY TERMS

Concept: Each category value of a categorical variable or a logical association of variables. For example, a concept can be simply a town or a type of unemployment, or, in a more complex way, a socio-professional category (SPC), associated with an age category, A, and a region, R.
Divide-and-Conquer: A well-known algorithm design strategy where the dataset is partitioned into blocks and each block is processed independently. The resulting block-level (local) kernels are merged to realize the global output. It increases the efficiency of the algorithms in terms of both space and time requirements.

Dynamic Clustering: A scheme to discover simultaneously the clusters and their representations in such a way that they fit together optimally. The cluster representations are called kernels. (The mean is a special case of a kernel, as in k-means.)

Hierarchical Clustering: A hierarchy of partitions is generated as output; it may be depicted as a tree of partitions or a pyramid of overlapping clusters.
Kernel: A function of data points. A simple instantiation is the centroid.

Large Dataset: A dataset that does not fit in the main memory of a machine, so it is stored on a disk and is read into the memory based on need. Note that disk access is more time-consuming than memory access.

Partitional Clustering: A single partition of the data is iteratively obtained so that some criterion function is optimized.

Symbolic Object: A description of a concept that provides a way of obtaining the extent or the set of individuals associated with the concept.
Synthesis with Data Warehouse Applications and Utilities

Hakikur Rahman
SDNP, Bangladesh
INTRODUCTION

Today's fast-moving business world faces continuous challenges and abrupt changes in real-life situations in the context of data and information management. In the current trend of information explosion, businesses recognize the value of the information they can gather from various sources. The information that drives business decisions can have many forms, including archived data, transactional data, e-mail, Web input, surveyed data, data repositories, and data marts. The organization's business strategy should be to deliver high-quality information to the right people at the right time. Business analysis requires that some data be absolutely current. Other data may be comprised of historical or summary information and are less time sensitive. To overcome data loss, improve efficiency, make real-time updates, and maintain a well-marked path to other data, high-speed connectivity is always needed. An organization also needs to protect the information its systems gather, while ensuring that it is readily available, consistent, accurate, and reliable. It also must consider how the software environment has been designed and what impact that design has on the performance, availability, and maintainability of the system. Among all these parameters, defining the basic layout of a storage environment is critical for creating an effective storage system. With data residing on numerous platforms and servers in a multitude of formats, gaining efficient and complete access to all relevant organizational data is essential. While designing the data warehouse, the network topology, data consistency, data modeling, reporting tools, storage, and enactment of data need to be clearly understood. In recent years, data warehouse databases have grown at such a pace that the traditional concept of database management has needed to be revisited, redesigned, and refocused; increased demand, availability requirements, and frequent updates put pressure on data warehouse methodologies and application tools. Innovative applications and techniques have evolved to handle data warehousing more efficiently and to provide easier data access.

BACKGROUND

Information is all about integration and interaction of data sets. Inaccuracies in a single data column may affect the results and directly affect the cost of doing business and the quality of business decisions. Usually, preventive measures are more economical and less tormenting to ensure data quality. It has been found that delaying the inevitable data cleansing dramatically increases the cost of doing so, as well as increases the time delay for the cleansing process. Data warehousing was formally defined as a separate environment to support analytical processing that is subject-oriented, time-variant, and integrated. A data warehouse that provides accurate, consistent, and standardized data enables organizations to achieve better revenue generation and, at the same time, attain cost optimization. An effective data quality utility and methodology should address quality at the application and data entry levels, during application integration stages, and during the quality analysis level. Earlier data warehouses used to be mere replacements of MIS systems with limited service facilities. Due to simpler operating environments, they did not justify allocation of significant resources. With incremental demand, especially from the business community, progress of data warehousing concepts triggered tremendous development with sophisticated requirements, increases in database sizes, and complexity in the data warehouse environment. Nowadays, companies are spending thousands of dollars, and a significantly large portion of it goes to the information technology budget in the form of firmware to build sophisticated databases and data warehouses. In the quest for successful business intelligence, various applications and systems have been deployed, and manifold information retrieval processes have been developed. Traditional database system architectures face a rapidly evolving operating environment, where millions of users store and access terabytes of data (Harizopoulos & Ailamaki, 2003). Database applications that use multiterabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology (Papadomanolakis & Ailamaki, 2004).
Data warehousing ultimately reconciles the design conflicts best by integrating operational applications and analytical applications into a coherent information architecture (SAS, 2001). In this respect, fully integrated approaches need to be realized to improve data quality and its processes (extraction, transformation, and loading, or ETL), including data warehousing techniques, with the goal of transforming raw data into valuable strategic assets.
MAIN THRUST

Technological advancements use techniques like data pattern analysis, clustering, algorithms, and other sophisticated capabilities to ensure that data gathered throughout the organization is accurate, usable, and consistent. By intelligently identifying, standardizing, correcting, matching, and consolidating data, specially designed software can offer better solutions to data quality problems (SAS, 2002). The task of ensuring optimal query execution in database management systems is, indeed, daunting (Schindler et al., 2003). To meet the challenges of managing data scalability and handling large volumes of data, the strategic solution should provide a powerful foundation for building a robust and resilient data campus and should integrate the popular access characteristics of the modern information economy. The solutions should be able to run under a wide variety of hardware environments, enabling the choice of computing resources that match the particular needs of the enterprise. On the other hand, the computing environment creates a base for making better business decisions by hosting powerful analysis tools and organizing the information. Establishment and operation of an efficient data site is a critical component of successful solution implementation in order to deal with the ever-increasing volumes of data associated with customer relationship management (CRM), supplier relationship management (SRM), enterprise performance management (EPM), and hazard analysis. Similarly, resolving inconsistencies in data semantics is among the most difficult tasks in establishing large data warehouses. A few of the data warehouse applications and utilities are synthesized in the following sections:
Direct Data Storage Direct data storage is an acceptable method of modeling relay time-current characteristics for devices with fixed characteristics. Like the name implies, the direct data storage approach consists of storing data points over a
given curve into computer memory. The relay then monitors the line current and compares that value to the current values stored in memory. Forelle (2004) reports that the sales of hard-disk data-storage systems rose steadily in the fourth quarter of 2003, reversing previous slides and giving some hope for a recovery in big-ticket technology spending.
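A minimal sketch of the direct data storage idea described above, assuming a hypothetical sampled time-current curve: the relay looks up the monitored current against the stored points and interpolates between the two nearest samples. The sample values are illustrative and are not taken from any real device.

```python
# Direct data storage of a relay time-current curve as sampled points.
import bisect

CURVE = [(100.0, 10.0), (200.0, 4.0), (400.0, 1.5), (800.0, 0.5)]  # (amps, s)

def trip_time(current):
    currents = [c for c, _ in CURVE]
    if current <= currents[0]:
        return CURVE[0][1]
    if current >= currents[-1]:
        return CURVE[-1][1]
    i = bisect.bisect_left(currents, current)
    (c0, t0), (c1, t1) = CURVE[i - 1], CURVE[i]
    return t0 + (t1 - t0) * (current - c0) / (c1 - c0)   # linear interpolation

print(trip_time(300.0))   # 2.75 seconds with these sample points
```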
Data Mining

Data mining is the search for patterns and structure in large data sets, and the discovery of information that may not be present explicitly in the data. However, one of the most difficult problems in data mining is to concretely define the classes of patterns that may be of interest. Riedel et al. (2000, p. 3) stated that "the major obstacles to starting a data mining project within an organization is the high initial cost of purchasing the necessary hardware". Data mining also is employed widely in sales and marketing operations to calculate the profitability of customers or to find out which customers are most likely to leave for the competition. Forrester Research (1999) reported in a study of Fortune 1000 companies that the usage of data mining will grow rapidly. The report also suggested that marketing, customer service, and sales may remain the major business application areas for data mining.
EMC

EMC (www.emc.com) has a reputation for building a highly resilient information environment and protects valuable information by providing flexibility as the business requirements change. EMC's Symmetrix information storage systems can be integrated with other computer systems to manage, protect, and share the IT infrastructure. EMC Symmetrix storage systems implement a broad range of storage protection and acceleration techniques, including disk mirroring, RAID storage protection and redundancy, data caching, and hot spares and replacements of individual components.
RAID

RAID techniques can be implemented in hardware, software, or both. Most data sites with high data volumes choose to implement RAID storage options in hardware, using disk arrays. Disk arrays offer additional performance and availability options beyond basic RAID techniques. Mirroring and parity RAID techniques balance general performance and availability of data for all task-critical and business-critical applications; mirroring does so by maintaining a duplicate copy of volumes on two disk devices.
RAID-5

One popular combination includes local RAID-5, remote mirroring, snapshots, and backup to tape. RAID-5 provides protection against disk failures, remote mirroring guards against site failures, snapshots address user errors, and tape backup protects against software errors and provides archival storage (Keeton & Wilkes, 2003).
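The following minimal Python sketch illustrates the XOR parity idea underlying RAID-5 style protection: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the surviving blocks plus parity. The block contents are illustrative, and real RAID-5 additionally stripes data and rotates parity across disks.

```python
# XOR parity demonstration for equally sized blocks.
def xor_blocks(blocks):
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data_blocks = [b"DATA-A", b"DATA-B", b"DATA-C"]
parity = xor_blocks(data_blocks)

# Simulate losing the second block and rebuilding it from the survivors.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(rebuilt == data_blocks[1])   # True
```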
Webification

Webification of the data center has long been treated as the logical next step. It is recognized globally that the browser as the universal client can decrease costs, minimize user complexity, and increase efficiency. Today's enterprise application vendors offer Web-enabled versions of their products (L'Heureux, 2003).
SAN

In the last couple of years, a dramatic growth of enterprise data storage capacity has been observed. As a result, new strategies have been sought that allow servers and storage to be centralized to better manage the explosion of data and the overall cost of ownership. Nowadays, a common approach is to combine storage devices into a dedicated network that is connected to LANs and/or servers. Such networks are usually called storage area networks (SAN). A very important aspect for these networks is scalability. If a SAN undergoes changes (i.e., due to insertions or removals of disks), it may be necessary to replace data in order to allow efficient use of the system. To keep the influence of data replacements on the performance of the SAN small, this should be done as efficiently as possible (Brinkmann, Salzwedel & Scheideler, 2000).
MDDB

Multidimensional databases are another storage option, especially useful when providing business users with multiple views of their data. MDDBs provide specialized storage facilities where data is pulled from a data warehouse or other data source for storage in a matrix-like format for fast and easy access to multidimensional data views. In addition to these technologies, warehouse data can also be stored in third-party hierarchical and relational databases like DB2, ORACLE, SQL Server, and others.
SAS

Flexible and scalable storage options provided through SAS (www.sas.com) Intelligence Storage facilitate quick
and cost-effective information dissemination for business and analytic applications. With integrated data warehouse management, it provides a single point of control for managing processes across the entire enterprise. SAS data access patterns are sequential and benefit from OS read-ahead algorithms. Thus, the amount of memory dedicated to file caching dramatically affects read-ahead performance. On a system dedicated to SAS applications, the effectiveness of file caching is determined as a function of physical memory, the number of concurrently executing processes, and the memory size (memsize) configuration within the controlling process, including memory utilization as well as file cache settings.
ETL

Extraction, transformation, and loading (ETL) is the initial processing stream involved in populating a data warehouse. Most ETL tools generate code based on the validation and transformation rules tabulated in the tool. A stable ETL tool should be able to create multiple output tables from a single pass through the source tables, making the process both simpler and faster. Improved synergy between the ETL warehousing process and data quality offers the ability to manage complex data integration more easily. By applying data quality checks in the ETL process, data integrity and accuracy are also assured. Much of the data warehousing effort is concentrated in the ETL process, with the extraction of records and fields from various data sources, conversion of the data to new formats, and loading of the data to target destinations such as the warehouse or a data mart.
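As an illustration of the extract-transform-load flow described above, the following minimal Python sketch extracts rows from a source file, applies simple validation and transformation rules, and loads the surviving rows into a target file in a single pass. The file names, field names, and rules are hypothetical.

```python
# Minimal single-pass ETL sketch; file and field names are illustrative.
import csv

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    if not row.get("customer_id"):          # validation rule: reject bad rows
        return None
    row["customer_id"] = row["customer_id"].strip().upper()
    row["amount"] = f"{float(row['amount']):.2f}"   # standardize format
    return row

def load(rows, path):
    rows = list(rows)
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def run_etl(source, target):
    cleaned = (t for t in (transform(r) for r in extract(source)) if t)
    load(cleaned, target)

# run_etl("sales_source.csv", "warehouse_sales.csv")
```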
Data Morphing

Using data morphing, a cache-efficient attribute layout called a partition is first determined through an analysis of the query workload. This partition is then used as a template for storing data in a cache-efficient way. The data morphing technique provides a significant performance improvement over both the traditional N-ary storage model and the PAX model (Hankins & Patel, 2003). Data morphing consists of two phases: (a) calculating a cache-efficient storage template and (b) reorganizing the data into this cache-efficient organization.
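The sketch below gives a highly simplified flavor of the first phase: deriving an attribute grouping (a partition template) from a query workload by placing attributes that are usually accessed together into the same group. This greedy heuristic is only an illustration and is not the actual algorithm of Hankins and Patel (2003); the workload and attribute names are hypothetical.

```python
# Greedy grouping of attributes by co-access frequency in a query workload.
from collections import Counter
from itertools import combinations

workload = [                      # each query lists the attributes it touches
    {"id", "price", "qty"},
    {"id", "price"},
    {"name", "address"},
    {"id", "price", "qty"},
]

pair_counts = Counter()
for query in workload:
    for pair in combinations(sorted(query), 2):
        pair_counts[pair] += 1

groups = []                        # merge attributes along the strongest pairs
for (a, b), _ in pair_counts.most_common():
    ga = next((g for g in groups if a in g), None)
    gb = next((g for g in groups if b in g), None)
    if ga is None and gb is None:
        groups.append({a, b})
    elif ga is None:
        gb.add(a)
    elif gb is None:
        ga.add(b)

print(groups)   # e.g., [{'id', 'price', 'qty'}, {'name', 'address'}]
```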
SPDS

The Scalable Performance Data Server (SPDS), a multiuser server for data retrieval in data warehousing applications, is designed to scale well in performance to handle large tables. It offers a high-availability design using a centralized name server and a secure data server
with user validation, user identification, and password verification. SPDS utilizes symmetric multi-processing (SMP) capabilities by targeted use of threads to perform parallel and overlapped data processing.
FUTURE TRENDS

Constructing dependable storage systems is difficult, because there are many techniques to pick from, and often they interact in unforeseen ways. The resulting storage systems are often either over-provisioned, provide inadequate protection, or both. The result is a first step down the path of self-managing, dependability-aware storage systems, including a better understanding of the problem space and its tradeoffs and a number of insights that are believed to be helpful to others (Keeton & Wilkes, 2003). Many technological advances have made possible distributed multimedia servers that allow bringing online large amounts of information, including images, audio and video, and hypermedia databases. Increasingly, there are applications that demand high-bandwidth access, either in single user streams (e.g., large image browsing, uncompressible scientific and medical video, and coordinated multimedia streams) or in multiple user environments. Data storage requirements are increasing dramatically, and, therefore, much attention is being given to next-generation data storage media. However, in optical data storage devices the storage density is limited by the wavelength of light. This limitation can be avoided by an alternate data addressing mechanism using electric fields between nanoelectrodes with smaller dimensions than the wavelength of commercially available lasers (Germishuizen et al., 2002). Carino et al. (2001) described an active storage hierarchy, in which their StorHouse/Relational Manager executes SQL queries against data stored on all hierarchical storage (i.e., disk, optical, and tape) without post processing a file or a DBA having to manage a data set. There has been a tremendous amount of work on data mining during the past years (Traina et al., 2001). Many techniques have been developed that have allowed the discovery of various trends, relations, and characteristics with large amounts of data (Bayardo et al., 1999). In the field of spatial data mining, work has focused on clustering and discovery of local trends and characteristics (Ester et al., 1999). In high-energy physics experiments, large particle accelerators produce enormous quantities of data, measured in hundreds of terabytes or petabytes per year, which are deposited onto tertiary storage. The best retrieval performance can be achieved only if the data is
clustered on the tertiary storage by all searchable attributes of the events. Since the number of these attributes is high, the underlying data-management facility must be able to cope with extremely large volumes and very high dimensionalities of data at the same time (Orlandic, Lukaszuk & Swietlik, 2002).
CONCLUSION

System administrators can enhance performance across the computing environment by establishing an overall storage environment plan. Though tuning the storage and file system configuration cannot fix all performance issues, careful attention to the storage environment helps ensure that the available computing resources are used to full advantage. Webification may change the nature of the data center to run on a simple, elegant platform with fewer devices and more flexibility. Server load balancing, SSL terminators, application firewalls, authentication devices, and dynamic caching, along with the entire Web tier of networking point products residing between the firewall and the application servers, may diminish in the near future. Even dedicated Web servers may disappear. As the data sources expand and additional processing power is added to support the growing IT infrastructure, the ability to share data among diversified operating environments becomes more crucial. Despite the effort and speed of the data warehousing development processes, it always takes time to figure out the business practices within an organization in order to obtain a substantial return on a data warehouse investment. Rigorous analysis of the return on investment (ROI) is demanded for most major data warehouse implementations, and the average payback period can be long, which requires careful consideration from policy makers. Adding data without justifying its business value can lessen the worth of the data warehouse and may abruptly increase the maintenance cost.
REFERENCES

Bayardo, R.J., Traina, A., Wu, L., & Faloutsos, C. (1999). Constraint-based rule mining in large, dense databases. Proceedings of the IEEE International Conference on Data Engineering (pp. 188-197), Sydney, Australia.

Brinkmann, A., Salzwedel, K., & Scheideler, C. (2000). Efficient, distributed data placement strategies for storage area networks. Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures, Bar Harbor, Maine.

Carino Jr., F. et al. (2001). StorHouse metanoia—New applications for database, storage and data warehousing. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California.

Ester, M., Kriegel, H.P., & Sander, J. (1999). Spatial data mining: A database approach. Proceedings of the 5th International Symposium on Spatial Databases (pp. 47-66), Berlin.

Forelle, C. (2004, March 5). Staff Reporter. The Wall Street Journal.

Forrester Research. (1999, December). Net marketplaces grow up. Cambridge, MA: Forrester Research Inc.

Germishuizen, W.A. et al. (2002). Data storage using DNA. Proceedings of the 10th Foresight Conference on Molecular Nanotechnology, Bethesda, Maryland, USA.

Hankins, R.A., & Patel, J.M. (2003). Data morphing: An adaptive, cache-conscious storage technique. Proceedings of the 29th VLDB Conference, Berlin, Germany.

Harizopoulos, S., & Ailamaki, A. (2003). A case for staged database systems. Proceedings of the 1st International Conference on Innovative Data Systems Research, CIDR 2003, Asilomar, CA.

Keeton, K., & Wilkes, J. (2003). Automatic design of dependable data storage systems. Proceedings of the Workshop on Algorithms and Architectures for Self-Managing Systems (pp. 7-12), San Diego: John Wiley.

L'Heureux, I. (2003, September). The new data center: Toward a consolidated platform. Redline Networks.

Orlandic, R., Lukaszuk, J., & Swietlik, C. (2002). The design of a retrieval technique for high-dimensional data on tertiary storage. ACM SIGMOD Record, 31(2), 15-21.

Papadomanolakis, S., & Ailamaki, A. (2004). AutoPart: Automating schema design for large scientific databases using data partitioning. Proceedings of SSDBM 2004 (pp. 383-392), Santorini Island, Greece.

Riedel, E., Faloutsos, C., Ganger, G.R., & Nagle, D.F. (2000). Data mining on an OLTP system (nearly) for free. Proceedings of SIGMOD 2000 (pp. 13-21), Dallas.

SAS. (2001). The SAS information delivery architecture: An introduction for information technology managers. SAS.

SAS. (2002). Exponentially enhance the quality of your data with SAS ETL. A SAS White Paper.

Schindler, J., Ailamaki, A., & Ganger, G.R. (2003). Lachesis: Robust database storage management based on device-specific performance characteristics. Proceedings of the 29th VLDB Conference, Berlin, Germany.

Traina, A. et al. (2001). Tri-plots: Scalable tools for multidimensional data mining. Proceedings of KDD 2001, San Francisco.

KEY TERMS

Customer Relationship Management (CRM): An enterprise-wide strategy enabling organizations to optimize customer satisfaction, revenue, and profits, while increasing shareholder value through better understanding of customers' needs.
Data Mining: A form of information extraction activity whose goal is to discover hidden facts contained in databases; the process of using various techniques (i.e., a combination of machine learning, statistical analysis, modeling techniques, and database technology) to discover implicit relationships between data items and to construct predictive models based on them.

Enterprise Performance Management (EPM): A combination of planning, budgeting, financial consolidation, reporting, strategy planning, and scorecarding tools. Most vendors using the term do not offer the full set of components, so they adjust their version of the definition to suit their own product set.

Extract, Transform, and Load (ETL): A set of database utilities used to extract information from one database, transform it, and load it into a second database. This represents the processing overhead required to copy data from an external DBMS or file.

Management Information System (MIS): A form of software that provides the information needed to make informed decisions about an organization or entity; a formalized way of dealing with the information that is required in order to manage any organization.

Redundant Array of Inexpensive Disks (RAID): A technique that combines multiple disks for redundancy and performance. Software RAID uses the server processor to perform the RAID calculations, so host CPU cycles that read and write data from and to disk are taken away from applications. Software RAID is less costly than dedicated hardware RAID storage processors, but its data protection is less efficient and reliable.

Relational Data Base Management Systems (RDBMS): Database management systems that maintain data records and indices in tables; relationships may be created and maintained across and among the data and tables. An RDBMS is a software package that manages a relational database, optimized for rapid and flexible retrieval of data; also called a database engine.

Supplier Relationship Management (SRM): A higher-level view of how efficient and profitable a given supply chain is. SRM products help to highlight which parts are often in short supply or bought on spot markets at high prices, which suppliers are often late or have quality problems, and, conversely, which suppliers are reliable, flexible, comprehensive, and cost-effective. SRM products help management decide how to fine-tune the supply chain and recommend to engineering and manufacturing which vendors and parts to avoid when possible.
Temporal Association Rule Mining in Event Sequences

Sherri K. Harms, University of Nebraska at Kearney, USA
INTRODUCTION

The emergence of remote sensing, scientific simulation, and other survey technologies has dramatically enhanced our capabilities to collect temporal data. However, the explosive growth in data makes the management, analysis, and use of data both difficult and expensive. To meet these challenges, there is an increased use of data mining techniques to index, cluster, classify, and mine association rules from time series data (Roddick & Spiliopoulou, 2002; Han, 2001). A major focus of these algorithms is to characterize and predict complex, irregular, or suspicious activity (Han, 2001).
BACKGROUND

A time series database contains sequences of values typically measured at equal time intervals. There are two main categories of temporal sequences: transaction-based sequences and event sequences. A transaction-based sequence includes an identifier such as a customer ID, and data mining revolves around finding patterns within transactions that have matching identifiers. An example pattern is "A customer who bought Microsoft and Intel stock is likely to buy Google stock later." Thus, the transaction has a definite boundary around known items of interest. There are many techniques that address these problems (Roddick & Spiliopoulou, 2002; Han, 2001).

Data analysis on event sequences is enormously more complex than transactional data analysis. There are no inherently defined boundaries around factors that might be of interest. The factors of interest themselves may not be obvious to domain experts. Temporal event sequence mining algorithms must be able to compute inference from volumes of data, find the interesting events involved, and define the boundaries around them. An example pattern is "A La Niña weather pattern is likely to precede drought in the western United States." La Niña weather data is based on Pacific Ocean surface temperatures and atmospheric values, and drought data is based on precipitation data from several weather stations located in the western United States. As illustrated by this
example, sequential data analysis must be able to find relationships among multiple time series. The sheer number of possible combinations of interesting factors and relationships between them can easily overwhelm human analytical abilities. Often there is a delay between the occurrence of an event and its influence on the dependent variables. These factors make finding interesting patterns difficult.

One of the most common techniques to find interesting patterns is association rule mining. Association rules are implications between variables in the database. The problem was first defined in the context of market basket data to identify customers' buying habits (Agrawal et al., 1993), where the Apriori algorithm was introduced. Let I = {I1, I2, ..., Im} be a set of binary attributes, called items, and let T be a database of transactions. An association rule r is an implication of the form X ⇒ Y, where X and Y are sets of items in I and X ∩ Y = ∅. X is the rule antecedent, and Y is the rule consequent. The support of the rule X ⇒ Y in database T is the percentage of transactions in T that contain X ∪ Y. The rule holds in T with confidence c if c% of the transactions in T that contain X also contain Y. For example, it is of interest to a supermarket to find that 80% of the transactions that contain milk also contain eggs and that 5% of all transactions include both milk and eggs. Here the association rule is milk ⇒ eggs, where 80% is the confidence of the rule and 5% is its support.

This paper provides the status of current temporal association rule mining methods used to infer knowledge for a group of event sequences. The goal of these tools is to find periodic occurrences of factors of interest, rather than to calculate the global correlation between the sequences. Mining association rules is usually decomposed into three sub-problems: 1) prepare the data for analysis, 2) find frequent patterns, and 3) generate association rules from the sets representing those frequent patterns.
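To make these definitions concrete, here is a minimal Python sketch (illustrative only; the transactions and the milk ⇒ eggs rule are toy data, not taken from the cited work) that computes the support and confidence of a candidate rule over a small transaction database.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    covers_x = sum(1 for t in transactions if antecedent <= set(t))
    covers_xy = sum(1 for t in transactions if antecedent | consequent <= set(t))
    support = covers_xy / n                                # fraction of transactions containing X and Y
    confidence = covers_xy / covers_x if covers_x else 0.0 # of those containing X, how many also contain Y
    return support, confidence

# Toy example: the rule {milk} => {eggs}
baskets = [{"milk", "eggs", "bread"}, {"milk", "eggs"}, {"milk", "butter"}, {"beer", "chips"}]
print(support_and_confidence(baskets, {"milk"}, {"eggs"}))   # (0.5, 0.666...)
```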
MAIN THRUST

Events and Episodes

To prepare time series data for association rule mining, the data is discretized and partitioned into sequences of
events. Typically, the time series is normalized and segmented into partitions that have similar characteristics of data within a given interval. Each partition identifier is called an event type. Partitioning methods include symbolizing (Lin et al., 2003) and intervals (Hoppner, 2002). Different partitioning methods and interval sizes produce diverse discretized versions of the same dataset. This step relies on domain-expert involvement for proper discretization. When multivariate sequences are used, each variable is normalized and discretized independently. The time granularity (duration) is converted to a single (finest) granularity before the discovery algorithms are applied to the combined sequences (Bettini et al., 1998).

A discretized version of the time series is referred to as an event sequence. An event sequence Ŝ is a finite, time-ordered sequence of events (Mannila et al., 1995); that is, Ŝ = (e1, e2, ..., en). An event is an occurrence of an event type at a given timestamp. The time at which a given event ei occurs is denoted ti, and ti ≤ ti+1 for all i in the event sequence. A sequence includes events from a single finite set of event types, and an event type can be repeated multiple times in a sequence. For example, the event sequence Ŝ1 = AABCAB is a sequence of 6 events from a set of 3 event types {A, B, C}. In this event sequence, an A event occurs at time 1, followed by another A event, followed by a B event, and so on. The step size between events is constant for a given sequence.

An episode in an event sequence is a combination of events with a partially specified order (Mannila et al., 1997). It occurs in a sequence if there are occurrences of events in an order consistent with the given order, within a given time bound (window width). Formally, an episode α is a pair (V, ordering), where V is a collection of events and the ordering is parallel if no order is specified and serial if the events of the episode have a fixed order. The episode length is defined as the number of events in the episode.
Finding Frequent Episodes Based on Sliding Window Technologies

The founding work on finding frequent episodes in sequences is Mannila et al. (1995). Frequent episodes are discovered by using a sliding window approach, WINEPI. A window on an event sequence Ŝ is an event subsequence w = (ei, ei+1, ..., ei+d), where the width of window w, denoted d, is the time interval of interest. The set of all windows w on Ŝ with a width of d is denoted Ŵ(Ŝ, d). In this system, the value of the window width is user-specified, varying the closeness of event occurrences. To process data, the algorithm sequentially slides the window of width d one step at a time through the data. The frequency of an episode α is defined as the fraction of
windows in which the episode occurs. For example, in the sequence Ŝ1 above, if a sliding window of width 3 is used, the serial episode α = AB occurs in the first window (AAB), the second window (ABC), and the fourth window (CAB) (see Endnote 1). The guiding principle of the algorithm lies in the "downward-closed" property of frequency, which means that every subepisode is at least as frequent as its superepisode (Mannila et al., 1995). As with the Apriori method, candidate episodes with (k+1) events are generated by joining frequent episodes that have k events in common, and episodes that do not meet a user-specified frequency threshold are pruned.

The WINEPI algorithm was improved by Harms et al. (2001) to use only a subset of frequent episodes, called frequent closed episodes, based on closures and formal concept analysis (Wille, 1982). A frequent closed episode X is the intersection of all frequent episodes containing X. For example, in the Ŝ1 sequence, using a window width d = 3 and a minimum frequency of three windows, the serial episode α = AB is a frequent closed episode, since no larger frequent episode contains it (see Endnote 2) and it meets the minimum frequency threshold. Using closed episodes results in a reduced input size and in a faster generation of the episodal association rules, especially when events occur in clusters. Harms et al. (2001) use an inclusion constraint set to target specific subsets of episodes.

In Hoppner (2002), multivariate sequences are divided into small segments and discretized based on their qualitative description (such as increasing, high value, convexly decreasing, etc.). Patterns are discovered in the interval sequences based on Allen's temporal interval logic (Allen, 1983). For example, the pattern "A meets B" occurs if interval A terminates at the same point in time at which B starts. For any pair of intervals there is a set of 13 possible relationships, including after, before, meets, is-met-by, starts, is-started-by, finishes, is-finished-by, overlaps, is-overlapped-by, during, contains, and equals. As with WINEPI, this approach finds frequent patterns by using sliding windows and creating a set of candidate (k+1)-patterns from the set of frequent patterns of size k.

An approach to detecting suspicious subsequences in event sequences is presented in Gwadera et al. (2003). Using an approach based on WINEPI, they quantify: 1) the probability of a suspicious subsequence occurring in a sequence Ŝ of events within a window of size d, 2) the number of distinct windows containing it as a subsequence, 3) the expected number of such occurrences, and 4) the variance of the number of occurrences. They also establish its limiting distribution, which allows users to set an alarm threshold so that the probability of false alarms is small.

Ng & Fu (2003) presented a method to mine frequent episodes using a tree-based approach for event sequences. The process comprises two phases: 1) tree construction and 2) mining frequent episodes.
Each node in the tree is labeled by an event and also contains a count and a node-type bit. First, the frequencies of each event are gathered and sorted in descending order. The tree is built similarly to the FP-growth method (Han et al., 2000), but uses sliding windows rather than transactions.
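Returning to the WINEPI example above (Ŝ1 = AABCAB, serial episode AB, window width 3), the following sketch counts the sliding windows in which a serial episode occurs. It is an illustrative reconstruction, not the original WINEPI code, and for simplicity it considers only windows that lie entirely inside the sequence, whereas Mannila et al. (1995) also count the partial windows at both ends (see Endnote 1).

```python
def serial_occurs(episode, window):
    """True if the events of `episode` appear in `window` in the given order."""
    pos = 0
    for event in window:
        if pos < len(episode) and event == episode[pos]:
            pos += 1
    return pos == len(episode)

def winepi_frequency(sequence, episode, width):
    """Count and fraction of sliding windows of the given width that contain the serial episode."""
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    hits = sum(serial_occurs(episode, w) for w in windows)
    return hits, len(windows), hits / len(windows)

print(winepi_frequency("AABCAB", "AB", 3))   # (3, 4, 0.75): AAB, ABC and CAB contain A followed by B
```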
Generating Rules Based on Sliding Window Technologies

As introduced by Mannila et al. (1995), association rules are generated in a straightforward manner from the frequent episodes. An episodal association rule r is a rule of the form X ⇒ Y, where X is the antecedent episode, Y is the consequent episode, and X ∩ Y = ∅. Harms et al. (2001) used representative episodal association rules, based on representative association rules for transactional data (Saquer & Deogun, 2000), to reduce the number of rules while still maintaining rules of interest to the user. A set of representative episodal association rules is a minimal set of rules from which all rules can be generated. Usually, the number of representative episodal association rules is much smaller than the total number of rules, and no additional measures are needed.
Other Event Sequence Rule Mining Technologies

MINEPI, an approach that uses minimal occurrences of episodes rather than a sliding window, was developed in Mannila et al. (1997). A minimal occurrence of an episode α in an event sequence Ŝ is a window w = [ts, te] such that 1) α occurs in the window w, 2) α does not occur in any proper subwindow of w, and 3) the width of window w is less than the user-specified maximum window width parameter. In this definition, timestamp ts records the starting time of the occurrence of the episode, te records its ending time, and ts ≤ te. The width of window w equals te - ts + 1. The minimal occurrence window widths are not constant for a given episode; they are the minimal amount of elapsed time between the start of the episode occurrence and the end of the episode occurrence. The support of an episode α is the number of minimal occurrences of α in Ŝ. An episode α is considered frequent if its support conforms to the given minimum support threshold.

A technique designed to discover the period of sequential patterns was presented in Ma & Hellerstein (2000). They devised algorithms for mining periodic patterns while considering the presence of noise, phase shifts, the fact that periods may not be known in advance, and the need to have computationally efficient schemes for finding large patterns with low support. Often patterns with
low support are of great interest, such as suspicious security intrusion patterns. Ma & Hellerstein (2000) approach the problem by storing the occurrences of each event type as point sequences and iteratively building point sequences of size k+1 from point sequences of size k. To account for factors such as phase shifts and lack of clock synchronization, a point sequence has a period p with time tolerance t if it occurs every p ± t time units. They also consider point sequences consisting of on-off segments: during the on-segment, the point sequence is periodic with p; then there is a random gap, or off-segment, during which the point sequence is not periodic with p. They first find all possible periods by using a Chi-squared test approach and then find period patterns using an Apriori-based approach.

An approach that finds patterns related to a user-specified target event type is introduced in Sun et al. (2003). Because a sliding window approach may exclude useful patterns that lie across a window boundary, this approach moves the window to the next event of interest. That is, a window always either starts from or ends with a target event. Interesting patterns are those that frequently occur together with the target event and are relatively infrequent in the absence of the target event.

Harms & Deogun (2004) introduced MOWCATL, an algorithm based on MINEPI, which finds patterns in one or more sequences that precede the occurrence of patterns in other sequences, with respect to user-specified antecedent and consequent constraints. The MOWCATL approach has mechanisms for: 1) constraining the search space during the discovery process, 2) allowing a time lag between the antecedent and consequent of a discovered rule, and 3) working with episodes from across multiple sequences. The method's focus is on finding episodal rules of the form α[wina] ⇒lag β[winc], where the antecedent episode α occurs within a given maximum antecedent window width wina, the consequent episode β occurs within a given maximum consequent window width winc, and the start of the consequent follows the start of the antecedent within a given time lag. The confidence of the rule is the conditional probability that β occurs, given that α occurs, under the time constraints specified by the rule. The support of the rule is the number of times the rule holds in the dataset. The MOWCATL algorithm first stores the occurrences of the event types (single-event episodes) that meet the user-specified inclusion constraints. Larger episodes are built from smaller episodes by joining episodes with overlapping minimal occurrences, which occur within the maximum window width. After finding the supported episodes for the antecedent and the consequent independently, they are combined to form an episodal association rule, where the start of the consequent follows the start of the antecedent within a lag in time between the
occurrences of the antecedent and the respective occurrences of the consequent. The lag can be either a fixed or a maximum time lag constraint. When a maximal time lag is used, MOWCATL finds rules where the consequent follows shortly after the antecedent, whereas a fixed time lag finds rules where the consequent follows the antecedent at exactly the number of time steps specified by the lag.
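The following sketch illustrates the minimal-occurrence idea that MINEPI and MOWCATL build on (an illustrative reconstruction under the definitions above, not the authors' implementation): for each position where the last event of a serial episode occurs, it finds the tightest window ending there and then keeps only the windows that contain no smaller occurrence and that respect the maximum window width.

```python
def minimal_occurrences(events, episode, max_width):
    """Minimal occurrences [ts, te] of a serial `episode` (tuple of event types)
    in `events`, a time-ordered list of (timestamp, event_type) pairs."""
    tight = []
    for j, (t_end, e) in enumerate(events):
        if e != episode[-1]:
            continue
        # Match the remaining episode events backwards, as late as possible:
        # this yields the tightest window that ends at position j.
        pos, i = len(episode) - 2, j
        while pos >= 0 and i > 0:
            i -= 1
            if events[i][1] == episode[pos]:
                pos -= 1
        if pos < 0:
            start = events[i][0] if len(episode) > 1 else t_end
            tight.append((start, t_end))
    # A window is minimal if no other tight window is contained in it.
    def contains(w, v):
        return w[0] <= v[0] and v[1] <= w[1] and w != v
    minimal = [w for w in tight if not any(contains(w, v) for v in tight if v != w)]
    return [w for w in minimal if w[1] - w[0] + 1 <= max_width]

events = list(enumerate("AABCAB", start=1))        # [(1, 'A'), (2, 'A'), ..., (6, 'B')]
print(minimal_occurrences(events, ("A", "B"), 3))  # [(2, 3), (5, 6)]
```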
FUTURE TRENDS

The analysis techniques described in this work facilitate the evaluation of the temporal associations between episodes of events. Often, temporal sequences have a spatial component. For future work, these methods are being expanded to consider the spatial extent of the relationships; the rule discovery process will automatically generate rules for multiple locations and spatially interpolate areas that do not have observed data. Another problem with most temporal data is that it occurs in the form of data streams, which are potentially unbounded in size. Materializing all of the data is unrealistic, and storing it would be expensive even if it could be kept; techniques that retrieve approximate information are needed. Additionally, parallel algorithms will be needed to handle the volume of data.
CONCLUSION

The knowledge discovery methods presented here address sequential data mining problems that have groupings of events that occur close together, even if they occur relatively infrequently over the entire dataset. These methods automatically detect interesting relationships between events in multiple time-sequenced data sets, where time lags possibly exist between the related events. Knowledge of this type of relationship can enable proactive decision-making governing the inferred data. These methods have many applications, including stock market analysis, risk management, and pinpointing suspicious security intrusions.
REFERENCES

Agrawal, R., Faloutsos, C., & Swami, A. (1993). Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference on Foundations of Data Organizations and Algorithms (pp. 69-84), Chicago, IL.

Allen, J.F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832-843.
Bettini, C., Wang, X., & Jajodia, S. (1998). Mining temporal relations with multiple granularities in time sequences. Data Engineering Bulletin, 21(1), 32-38.

Gwadera, R., Atallah, M., & Szpankowski, W. (2003). Reliable detection of episodes in event sequences. In Proceedings of ICDM 2003 (pp. 67-74), Florida.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of 2000 SIGMOD, Dallas, TX, USA.

Harms, S.K., & Deogun, J. (2004). Sequential association rule mining with time lags. Journal of Intelligent Information Systems (JIIS), 22(1), 7-22.

Harms, S.K., Saquer, J., Deogun, J., & Tadesse, T. (2001). Discovering representative episodal association rules from event sequences using frequent closed episode sets and event constraints. In Proceedings of ICDM '01 (pp. 603-606), Silicon Valley, CA.

Hoppner, F., & Klawonn, F. (2002). Finding informative rules in interval sequences. Intelligent Data Analysis, 6, 237-256.

Lin, J., Keogh, E., Lonardi, S., & Chiu, B. (2003). A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in DMKD (pp. 2-11), San Diego, CA.

Ma, S., & Hellerstein, J.L. (2000). Mining partially periodic event patterns with unknown periods. In Proceedings of 2000 ICDE, San Diego, CA, USA.

Mannila, H., Toivonen, H., & Verkamo, A.I. (1995). Discovering frequent episodes in sequences. In U.M. Fayyad & R. Uthurusamy (Eds.), Proceedings of KDD-95 (pp. 210-215), Montreal, Quebec, Canada. Menlo Park, CA: AAAI Press.

Mannila, H., Toivonen, H., & Verkamo, A.I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3), 259-289.

Ng, A., & Fu, A.W. (2003). Mining frequent episodes for relating financial events and stock trends. In Proceedings of PAKDD 2003, Seoul, Korea.

Roddick, J.F., & Spiliopoulou, M. (2002). A survey of temporal knowledge discovery paradigms and methods. Transactions on Data Engineering, 14(4), 750-767.

Saquer, J., & Deogun, J.S. (2000). Using closed itemsets for discovering representative association rules. In Proceedings of ISMIS 2000, Charlotte, NC.
Sun, X., Orlowska, M.E., & Zhou, X. (2003). Finding event-oriented patterns in long temporal sequences. In Proceedings of PAKDD 2003, Seoul, Korea.

Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). Dordrecht-Boston: Reidel.

KEY TERMS

Episodal Association Rule: A rule of the form X ⇒ Y, where X is the antecedent episode, Y is the consequent episode, and X ∩ Y = ∅. The confidence of an episodal association rule is the conditional probability that the consequent episode occurs, given that the antecedent episode occurs, under the time constraints specified. The support of the rule is the number of times it holds in the database.

Episode: A combination of events with a partially specified order. The episode ordering is parallel if no order is specified, and serial if the events of the episode have a fixed order.

Event: An occurrence of an event type at a given timestamp.

Event Sequence: A finite, time-ordered sequence of events. A sequence of events Ŝ includes events from a single finite set of event types.

Event Type: A discretized partition identifier that indicates a unique item of interest in the database. The domain of event types is a finite set of discrete values.

Minimal Occurrence: A minimal occurrence of an episode α in an event sequence Ŝ is a window w = [ts, te] such that 1) α occurs in the window w, 2) α does not occur in any proper subwindow of w, and 3) the width of window w is less than the user-specified maximum window width parameter. Timestamps ts and te record the starting and ending time of the episode, respectively, and ts ≤ te.

Window: An event subsequence ei, ei+1, ..., ei+d in an event sequence, where the width of the window, denoted d, is the time interval of interest. In algorithms that use sliding windows, the frequency of an episode is defined as the fraction of windows in which the episode occurs.

ENDNOTES

1. In Mannila et al. (1995) the first window includes only the first event, and the last window includes only the last event. This ensures that each event occurs in exactly d windows.

2. Although episode AB is contained in episode ABC, episode ABC occurs in only one window, and is pruned.
Text Content Approaches in Web Content Mining

Víctor Fresno Fernández, Universidad Rey Juan Carlos, Spain
Luis Magdalena Layos, Universidad Politécnica de Madrid, Spain
INTRODUCTION

Since the creation of the Web, the Internet has become the greatest source of information available in the world. The Web is defined as a global information system that connects several sources of information by hyperlinks, providing a simple medium to publish electronic information and make it available to all connected people. In this context, data mining researchers have a fertile area in which to develop different systems, using the Internet as a knowledge base or personalizing Web information. The combination of the Internet and data mining has typically been referred to as Web mining, defined by Kosala and Blockeel (2000) as "a converging research area from several research communities, such as DataBase (DB), Information Retrieval (IR) and Artificial Intelligence (AI), especially from machine learning and Natural Language Processing (NLP)." Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services; it has traditionally been approached in three distinct ways, based on which part of the Web is mined: Web content, Web structure, and Web usage. Brief descriptions of these categories are summarized below.
• Web Content Mining: Web content consists of several types of data, such as textual, image, audio, video, and metadata, as well as hyperlinks. Web content mining describes the process of information discovery from millions of sources across the World Wide Web. From an IR point of view, Web sites consist of collections of hypertext documents for unstructured documents (Turney, 2002); from a DB point of view, Web sites consist of collections of semi-structured documents (Jeh & Widom, 2004).
• Web Structure Mining: This approach is interested in the structure of the hyperlinks within the Web itself (the interdocument structure). The Web structure is inspired by the study of social network and citation analysis (Chakrabarti, 2002). Some algorithms have been proposed to model the Web topology, such as PageRank (Brin & Page, 1998) from Google and other approaches that add content information to the link structure (Getoor, 2003).
• Web Usage Mining: Web usage mining focuses on techniques that could predict user behavior while the user interacts with the Web. A first approach maps the usage data of the Web server into relational tables for a later analysis. A second approach uses the log data directly by using special preprocessing techniques (Borges & Levene, 2004).
BACKGROUND

On the Web there are no standards or style rules; the contents are created by a very heterogeneous set of people in an autonomous way. In this sense, the Web can be seen as a huge amount of online unstructured information. Due to this inherent chaos, the need for systems that aid us in searching for and efficiently accessing information has emerged. When we want to find information on the Web, we usually access it through search services, such as Google (http://www.google.com) or AllTheWeb (http://www.alltheweb.com), which return a ranked list of Web pages in response to our request. A recent study (Gonzalo, 2004) showed that this method of finding information works well when we want to retrieve home pages, Web sites related to corporations, institutions, or specific events, or to find quality portals. However, when we want to explore several pages, relating information from several sources, this approach has some deficiencies: the ranked lists are not conceptually ordered, and information in different sources is not related. The Google model has the following features: crawling the Web, the application of a simple Boolean search, the PageRank algorithm, and an efficient implementation. This model directs us to a Web page, and once the page is reached, we are left with the local server search tools.
Nowadays, these tools are very simple, and the search results are poor. Another way to find information is to use Web directories organized by categories, such as Yahoo (http://www.yahoo.com) or the Open Directory Project (http://www.dmoz.org). However, the manual nature of this categorization makes the directories' maintenance too arduous if machine processes do not assist it. Present and future research tends toward the visualization and organization of results, information extraction over the retrieved pages, and the development of efficient local server search tools. Next, we summarize some of the technologies that can be explored in Web content mining and give a brief description of their main features.
Web Mining and Information Retrieval

These systems retrieve both text and multimedia content; their main feature is that access to information is accomplished in response to a user's request (Fan et al., 2004; Wang et al., 2003). Techniques inherited from NLP are added to these systems.
Text Categorization on the Web

The main goal of these methods is to find the nearest category, from a hierarchy of pre-classified categories, for the content of a specific Web page. Some relevant works in this approach can be found in Chakrabarti (2003) and Kwon and Lee (2003).
Web Document Clustering

Clustering involves dividing a set of n documents into a specific number of clusters k, so that some documents are similar to other documents in the same cluster and different from those in other clusters. Some examples in this context are Carey, et al. (2003) and Liu, et al. (2002).
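As a generic illustration of this idea (a minimal sketch, not the algorithm used in the systems cited above), the following code groups toy term-count vectors into k clusters with a basic K-means loop that assigns documents by cosine similarity.

```python
import math, random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means over word vectors, assigning each vector to the most similar centroid."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda j: cosine(v, centroids[j]))
            clusters[best].append(v)
        # Recompute each centroid as the mean of its cluster (keep old centroid if empty).
        centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

docs = [[3, 0, 1], [4, 0, 0], [0, 5, 1], [0, 4, 2]]   # toy term-count vectors
print([len(c) for c in kmeans(docs, 2)])              # e.g. [2, 2]
```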
MAIN THRUST

In general, Web mining systems can be decomposed into different stages that can be grouped in four main phases: resource access, the task of capturing intended Web documents; information preprocessing, the automatic selection of specific information from the captured resources; generalization, where machine learning or data-mining processes discover general patterns in individual Web pages or across multiple sites; and finally, the analysis phase, or validation and interpretation of the mined patterns. We think that by improving
each of the phases, the final system behavior also can be improved. In this work, we focus our efforts on Web page representation, which can be associated with the information-preprocessing phase in a general Web-mining system. Several hypertext representations have been introduced in the literature in different Web mining categories, and they depend on the later use and application. Here, we restrict our analysis to Web-content mining and, in addition, do not consider hyperlinks and multimedia data. The main reason to select only the tagged text is to look for special features emerging from the HTML tags, with the aim of developing Web-content mining systems with greater scope and better performance as local server search tools.

In this case, the representation of Web pages is similar to the representation of any text. A model of text must build a machine representation of the world knowledge and, therefore, must involve a natural language grammar. Since we restrict our scope to statistical analyses for Web-page classification, we need to find suitable representations for hypertext that will suffice for our learning applications. We carry out a comparison between different representations using the vector space model (Salton et al., 1975), where documents are tokenized using simple rules, such as whitespace delimiters in English, and tokens are stemmed to canonical form (e.g., reading to read). Each canonical token represents an axis in the Euclidean space. This representation ignores the sequence in which words occur and is based on statistics about single independent words. This independence principle between the words that co-appear in a text or appear as multiword terms introduces a certain error, but reduces the complexity of our problem without loss of efficiency. The different representations are obtained using different functions to assign the value of each component in the vector representation. We used a subset of the BankSearch Dataset as the Web document collection (Sinka & Corne, 2002).

First, we obtained five representations using well-known functions in the IR environment. All of these are based only on the term frequency in the Web page that we want to represent and on the term frequency in the pages of the collection. Below, we summarize the evaluated representations with a brief explanation of each.

1. Binary: This is the most straightforward model, also called the set-of-words model. The relevance or weight of a feature is a binary value {0,1}, depending on whether the feature appears in the document or not.
2. Term Frequency (TF): Each term is assumed to have an importance proportional to the number of times it occurs in the text (Luhn, 1957). The weight of a term t in a document d is given by W(d;t) = TF(d;t), where TF(d;t) is the term frequency of the term t in d.
3. Inverse Document Frequency (IDF): The importance of each term is assumed to be inversely proportional to the number of documents that contain the term. The IDF factor of a term t is given by IDF(t) = log(N/df(t)), where N is the number of documents in the collection and df(t) is the number of documents that contain the term t.
4. TF-IDF: Salton (1988) proposed to combine TF and IDF to weight terms. Then, the TF-IDF weight of a term t in a document d is given by W(d;t) = TF(d;t) × IDF(t).
5. WIDF: An extension of IDF that incorporates the term frequency over the collection of documents. The WIDF weight is given by W(d,t) = TF(d,t) / Σi TF(i,t), where the sum runs over the documents i in the collection.
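The short sketch below illustrates how these frequency-based weights can be computed from a tokenized document collection. It is illustrative code, not the implementation used in the reported experiments, and the toy documents are assumptions.

```python
import math
from collections import Counter

def weights(docs):
    """Binary, TF, TF-IDF and WIDF weights for every (document, term) pair."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency of each term
    cf = Counter(t for d in docs for t in d)        # term frequency over the whole collection
    idf = {t: math.log(n / df[t]) for t in df}      # IDF(t) = log(N / df(t))
    result = []
    for d in docs:
        tf = Counter(d)
        result.append({t: {"binary": 1,
                           "tf": tf[t],
                           "tfidf": tf[t] * idf[t],
                           "widf": tf[t] / cf[t]} for t in tf})
    return result

docs = [["web", "page", "mining"], ["web", "content"], ["fuzzy", "web", "page"]]
print(weights(docs)[0]["page"])   # {'binary': 1, 'tf': 1, 'tfidf': 0.405..., 'widf': 0.5}
```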
In addition to these five representations, we obtained two other representations that combine several criteria extracted from the tagged text, which can be treated differently from other parts of the Web page document. Both representations consider more elements than the term frequency to obtain the term relevance in the Web page content. These two representations are the Analytical Combination of Criteria (ACC) and the Fuzzy Combination of Criteria (FCC). The difference between them is the way they evaluate and combine the criteria. The first (Fresno & Ribeiro, 2004) uses a linear combination of the criteria, whereas the second (Ribeiro et al., 2002) combines them by using a fuzzy system. A fuzzy reasoning system is a suitable framework to capture qualitative human expert knowledge and to resolve the ambiguity inherent in the reasoning process, embodying knowledge and expertise in a set of linguistic expressions that manage words instead of numerical values. The fundamental cue is that a criterion often evaluates the importance of a word only when it appears combined with another criterion. Some Web page representation methods that use HTML tags in different ways can be found in Molinari et al. (2003), Pierre (2001), and Yang et al. (2002). The criteria combined in ACC and FCC are summarized below.

1. Word Frequency in the Text: Luhn (1957) showed that a statistical analysis of the words in the document provides some clues about its contents. This is the most widely used heuristic in the text representation field.
2. Word's Appearance in the Title: Whether the word appears in the title of the Web page, considering that in many cases the document title can be a summary of the content.
3. The Positions All Along the Text: In automatic text summarization, a well-known heuristic to extract sentences that contain important information for the summary is to select those that appear at the beginning and at the end of the document (Edmunson, 1969).
4. Word's Appearance in Emphasis Tags: Whether or not the word appears in emphasis tags. For this criterion, several HTML tags were selected because they capture the author's intention; the hypothesis is that if a word appears emphasized, it is because the author wants it to stand out.
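To give a feel for how an analytical combination such as ACC might work, the sketch below linearly combines per-term scores for the four criteria. The coefficients and the scoring of each criterion are illustrative assumptions, not the actual functions defined by Fresno and Ribeiro (2004).

```python
def acc_like_weight(term, page, coeffs=(0.4, 0.2, 0.2, 0.2)):
    """Toy linear combination of the four criteria for one term of one page.
    `page` is a dict with keys: tokens (list), title (set), emphasized (set).
    All coefficients and normalizations here are illustrative assumptions."""
    c_freq, c_title, c_emph, c_pos = coeffs
    tokens = page["tokens"]
    freq = tokens.count(term) / len(tokens)                     # normalized frequency in the text
    in_title = 1.0 if term in page["title"] else 0.0            # appearance in the title
    emphasized = 1.0 if term in page["emphasized"] else 0.0     # appearance in emphasis tags
    # Position criterion: reward terms appearing near the start or the end of the text.
    span = max(len(tokens) - 1, 1)
    positions = [i / span for i, t in enumerate(tokens) if t == term]
    position = max((max(p, 1 - p) for p in positions), default=0.0)
    return c_freq * freq + c_title * in_title + c_emph * emphasized + c_pos * position

page = {"tokens": ["web", "page", "represent", "fuzzy", "web"],
        "title": {"web"}, "emphasized": {"fuzzy"}}
print(round(acc_like_weight("web", page), 3))   # 0.56
```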
To compare the quality of the representations, a Web-page binary classification system was implemented in three stages: representation, learning, and classification. The selected classes are very different from each other, to provide favorable conditions for the learning and classification stages and to show clearly the achievements of the different representation methods. The representation stage was carried out as follows. The corpus, the set of documents that generates the vocabulary, was created from 700 pages for each selected class. All the different stemmed words found in these documents generated the vocabulary as axes in the Euclidean space. We fixed the maximum length of a stemmed word at 30 characters and the minimum length at three characters. In order to calculate the values of the vector components for each document, we followed these steps: (a) we eliminated all punctuation marks except some special marks used in URLs, e-mail addresses, and multi-word terms; (b) the words in a stoplist used in IR were eliminated from the Web pages; (c) we obtained the stem of each term by using the well-known Porter stemming algorithm; (d) we counted the number of times each term appeared on each Web page and the number of pages where the term was present; and (e) in order to calculate the ACC and FCC representations, we memorized the position of each term all along the Web page and whether or not the feature appears in emphasis and title tags. In addition, another 300 pages for each class were represented in the same vocabulary to evaluate the system.

In the learning stage, the class descriptors (i.e., information common to a particular class, but extrinsic to an instance of that class) were obtained from a supervised learning process. Considering the central limit theorem, the word relevance (i.e., the value of each component in the vector representation) in the text content for each class will be distributed as a Gaussian function with parameters µ and σ. Then, the density function:
fi(ri; µ, σ) = (1 / (σ√(2π))) e^(−(ri − µ)² / (2σ²))

gives the probability that a word i with relevance ri appears in a class (Fresno & Ribeiro, 2004). The mean and variance are obtained from the two selected sets of examples for each class by a maximum likelihood estimation method. Once the learning stage is completed, a Bayesian classification process is carried out to test the performance of the obtained class descriptors. The optimal classification of a new page d is the class cj ∈ C for which the probability P(cj | d) is maximum, where C is the set of considered classes. P(cj | d) reflects the confidence that cj holds given the page d. Then, Bayes' theorem states:

P(cj | d) = P(d | cj) P(cj) / P(d)

Considering the hypothesis of the independence principle, assuming that all pages and classes have the same prior probability, applying logarithms (a non-decreasing monotonic function), and shifting the argument one unit to avoid the discontinuity at x = 0, the maximum likelihood class is finally given by:

cML = argmax over cj ∈ C of Σ (i = 1 to N) ln( (1 / (σij√(2π))) e^(−(ri − µij)² / (2σij²)) + 1 )

where N is the vocabulary dimension. We accomplished an external evaluation by means of the F-measure, which combines the Precision and Recall measures:

F(i,j) = 2 × Recall(i,j) × Precision(i,j) / (Precision(i,j) + Recall(i,j))
Recall(i,j) = nij / ni
Precision(i,j) = nij / nj

where nij is the number of pages of class i classified as j, nj is the number of pages classified as j, and ni is the number of pages of class i. To obtain the different representation sizes, reductions were carried out by document frequency term selection (Sebastiani, 2002) for the binary, TF, binary-IDF, TF-IDF, and WIDF representations; for ACC and FCC, we used the weighting function of each one as the reduction function, selecting the n most relevant features in each Web page. In Figure 1, we show the experimental results obtained in a binary classification with each representation and with different representation sizes.

Figure 1. Comparison between representations in a binary classification
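A compact sketch of the learning and classification scheme just described (an illustrative reconstruction under the stated Gaussian assumption, not the authors' code): each class stores the mean and standard deviation of every feature's relevance, and a new page is assigned to the class that maximizes the shifted log-likelihood sum above.

```python
import math

def fit_class(vectors):
    """Maximum likelihood mean and standard deviation per feature for one class.
    `vectors` is a list of equally sized relevance vectors."""
    n, dim = len(vectors), len(vectors[0])
    mu = [sum(v[i] for v in vectors) / n for i in range(dim)]
    sigma = [max(math.sqrt(sum((v[i] - mu[i]) ** 2 for v in vectors) / n), 1e-6)
             for i in range(dim)]                      # floor avoids zero deviation
    return mu, sigma

def score(page, mu, sigma):
    """Sum over features of ln(Gaussian density + 1), as in the c_ML formula."""
    total = 0.0
    for r, m, s in zip(page, mu, sigma):
        density = math.exp(-(r - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))
        total += math.log(density + 1.0)
    return total

def classify(page, class_models):
    return max(class_models, key=lambda c: score(page, *class_models[c]))

# Tiny illustrative example with 2-dimensional relevance vectors.
models = {"sport":   fit_class([[0.9, 0.1], [0.8, 0.2]]),
          "finance": fit_class([[0.1, 0.9], [0.2, 0.7]])}
print(classify([0.85, 0.15], models))   # 'sport'
```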
FUTURE TRENDS

Nowadays, the main shortcoming of the systems that aid us in searching for and accessing information is revealed when we want to explore several pages, relating information from several sources. Future trends must find regularities in the HTML vocabulary to improve the response of local server search tools, combining them with other aspects such as hyperlink regularities.
CONCLUSION

The results of a Web page representation comparison depend strongly on the selected collection and the classification process. In a binary classification with the proposed learning and classification algorithms, the best representation was the binary one, because it obtained the best F-measures in all cases. We can expect that when the number of classes is increased, this F-measure value will decrease and the rest of the representations will improve their global results. Finally, apart from the binary function, the ACC and FCC representations have better F-measure values than the representations inherited from the IR field when the sizes are the smallest. This fact can be the result of considering the tagged text in a different way, depending on the tag semantics, and capturing more information than when only the frequency is considered. A deeper exploration must be carried out to find hidden information behind the Hypertext Markup Language vocabulary.
REFERENCES

Borges, J., & Levene, M. (2004). An average linear time algorithm for Web data mining (to be published). International Journal of Information Technology and Decision Making, 3, 307-329.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.

Carey, M., Heesch, D., & Rüger, S. (2003). Info navigator: A visualization tool for document searching and browsing. Proceedings of the International Conference on Distributed Multimedia Systems, Miami, FL, USA.
Liu, B., Zhao, K., & Yi, L. (2002). Visualizing Web site comparisons. Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA. Luhn, H.P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 4, 309-317. Molinari, A., Pasi, G., & Marques, R.A. (2003). An indexing model of HTML documents. Proceedings of the ACM Symposium on Applied Computing, Melbourne, Florida, USA. Pierre, J.M. (2001). On the automated classification of Web sites. Linköping Electronic Articles in Computer and Information Science, 6(1), 1-11.
Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. San Francisco, CA: Morgan-Kaufmann Publishers.
Ribeiro, A., Fresno, V., García-Alegre, M., & Guinea, D. (2003). A fuzzy system for the Web page representation. Intelligent Exploration of the Web, Series: Studies in Fuzziness and Soft Computing, 111, 19-38.
Chakrabarti, S., Roy, S., & Soundalgekar, M. (2003). Fast and accurate text classification via multiple linear discriminant projections. VLDB Journal, 12(2), 170-185.
Salton, G. (1988). Automatic text processing: The transformation, analysis and retrieval of information by computer. Boston: Addison-Wesley.
Edmunson, H. (1969). New methods in automatic extracting. Journal of the ACM, 16(2), 264-285.
Salton, G., Wong, A., & Yang, C.S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
Fan, W., Fox, E.A., Pathak, P., & Wu, H. (2004). The effects of fitness functions on genetic programming-based ranking discovery for Web search. Journal of the American Society for Information Science and Technology, 55(7), 628-636. Fresno, V., & Ribeiro, A. (2004). An analytical approach to concept extraction in HTML environments. Journal of Intelligent Information Systems, 22(3), 215-235. Getoor, L. (2003). Link mining: A new data mining challenge. ACM SIGKDD Explorations Newsletter, 5(1), 84-89. Gonzalo, J. (2004). Hay vida después de Google? Proceedings of the Software and Computing System Seminars, Móstoles, Spain. Jeh, G., & Widom, J. (2004). Mining the space of graph properties. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. ACM SIGKDD Explorations Newsletter, 2(1), 1-15. Kwon, O., & Lee, J. (2003). Text categorization based on k-nearest neighbor approach for Web site classification. Information Processing and Management: an International Journal, 39(1), 25-44.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47. Sinka, M.P., & Corne, D.W. (2002). A large benchmark dataset for Web document clustering. Proceedings of the 2nd Hybrid Intelligent Systems Conference, Santiago, Chile. Turney, P. (2002). Mining the Web for lexical knowledge to improve keyphrase extraction: Learning from labeled and unlabeled data. NRC Technical Report ERB-1096. Institute for Information Technology, National Research Council Canada, Ottawa, Ontario, Canada. Yang, Y., Slattery, S., & Ghani, R. (2002). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2), 1-25.
KEY TERMS

Central Limit Theorem: When an infinite number of successive random samples is taken from a population, the distribution of sample means calculated for each sample will become approximately normally distributed with mean µ and standard deviation σ/√N (~N(µ, σ/√N)).
Crawler: Program that downloads and stores Web pages. A crawler starts off with the Uniform Resource Locator (URL) for an initial page, extracts any URLs in it, and adds them to a queue to scan recursively.

Information Retrieval (IR): Interdisciplinary science of searching for information, given a user query, in document repositories. The emphasis is on the retrieval of information as opposed to the retrieval of data.

Machine Learning: The study of computer algorithms that improve automatically through experience.

Natural Language Processing (NLP): Computer understanding, analysis, manipulation, and/or generation of natural language. This can refer to simple string manipulation like stemming, or to higher-level tasks such as processing user queries in natural language.

Stoplist: Specific collection of so-called noise words, which tend to appear frequently in documents.

Supervised Learning: A machine learning technique for creating a function from training data. The training data consists of pairs of input objects and desired outputs.

Unsupervised Learning: A machine learning technique that typically treats input objects as a set of random variables. A joint density model is then built for the data set.
Text Mining-Machine Learning on Documents

Dunja Mladenić, Jozef Stefan Institute, Slovenia
INTRODUCTION

Intensive usage and growth of the World Wide Web and the daily increasing amount of text information in electronic form have resulted in a growing need for computer-supported ways of dealing with text data. One of the most popular problems addressed with text mining methods is document categorization. Document categorization aims to classify documents into pre-defined categories, based on their content. Other important problems addressed in text mining include content-based document search, automatic document summarization, automatic document clustering and construction of document hierarchies, document authorship detection, identification of plagiarism of documents, topic identification and tracking, information extraction, hypertext analysis, and user profiling. If we agree that text mining is a fairly broad area dealing with computer-supported analysis of text, then the list of problems that can be addressed is rather long and open. Here we adopt this fairly open view but concentrate on the parts related to automatic data analysis and data mining. This article tries to put text mining into a broader research context, with an emphasis on the machine learning perspective, and gives some ideas of possible future trends. We provide a brief description of the most popular methods only, avoiding technical details and concentrating on examples of problems that can be addressed using text-mining methods.
BACKGROUND

Text mining is an interdisciplinary area that involves at least the following key research fields:

• Machine Learning and Data Mining (Hand et al., 2001; Mitchell, 1997; Witten & Frank, 1999): Provide techniques for data analysis with varying knowledge representations and large amounts of data.
• Statistics and Statistical Learning (Hastie et al., 2001): Contributes data analysis in general in the context of text mining (Duda et al., 2000).
• Information Retrieval (van Rijsbergen, 1979): Provides techniques for text manipulation and retrieval mechanisms.
• Natural Language Processing (Manning & Schutze, 2001): Provides techniques for analyzing natural language. Some aspects of text mining involve the development of models for reasoning about new text documents based on words, phrases, linguistics, and grammatical properties of the text, as well as extracting information and knowledge from large amounts of text documents.
The rest of this article briefly describes the most popular methods used in text mining and provides some ideas for the future trends in the area.
MAIN THRUST

Text mining usually involves some preprocessing of the data, such as removing punctuation from text, identifying word and/or sentence boundaries, and removing words that are not very informative for the problem at hand. After preprocessing, the next step is to impose some representation on the text that will enable application of the desired text-mining methods. One of the simplest and most frequently used representations of text is the word-vector representation (also referred to as the bag-of-words representation). The idea is fairly simple: words from the text document are taken, ignoring their ordering and any structure of the text. For each word, the word vector contains some weight proportional to the number of its occurrences in the text. We all agree that there is additional information in the text that could be used (e.g., information about the structure of the sentences, word type and role, position of the words, or neighboring words). However, depending on the problem at hand, this additional information may or may not be helpful and definitely requires additional effort and more sophisticated methods. There is some evidence that, for retrieval of long documents, considering information beyond the bag-of-words is not worth the effort, and that for document categorization, using natural language information does not improve the categorization results (Dumais et al., 1998). There is also
some work on document categorization that extends the bag-of-words representation by using word sequences instead of single words (Mladenic & Grobelnik, 2003). This work suggests that using single words and word pairs in the bag-of-words representation improves the results of short-document categorization. The rest of this section gives a brief description of the most important problems addressed by text-mining methods.

Text Document Categorization is used when a set of pre-defined content categories, such as arts, business, computers, games, health, recreation, science, and sport, is provided, as well as a set of documents labeled with those categories. The task is to classify previously unseen text documents by assigning each document one or more of the predefined categories. This usually is performed by representing documents as word vectors and using documents that already have been assigned the categories to generate a model for assigning content categories to new documents (Jackson & Moulinier, 2002; Sebastiani, 2002). The categories can be organized into an ontology (e.g., the MeSH ontology for medical subject headings or the DMoz hierarchy of Web documents).

Document Clustering (Steinbach et al., 2000) is based on an arbitrary data clustering algorithm adapted for text data by representing each document as a word vector. The similarity of two documents is commonly measured by the cosine similarity between the word vectors representing the documents. The same similarity also is used in document categorization for finding a set of the most similar documents.

Visualization of text data is a method used to obtain early measures of data quality, content, and distribution (Fayyad et al., 2001). For instance, by applying document visualization, it is possible to get an overview of a Web site's content or a document collection. One form of text visualization is based on document clustering (Grobelnik & Mladenic, 2002): first the documents are represented as word vectors, and a K-means clustering algorithm is run on the set of word vectors. The obtained clusters then are represented as nodes in a graph, where each node in the graph is described by the set of most characteristic words in the cluster. Similar nodes, as measured by the cosine similarity of their word vectors, are connected by an edge in the graph. When such a graph is drawn, it provides a visual representation of the document set.

Text Summarization often is applied as a second stage of document retrieval in order to help the user get an idea about the content of the retrieved documents. Research in information retrieval has a long tradition of addressing the problem of text summarization, with the first reported attempts in the 1950s and 1960s exploiting properties such as the frequency of words in the text. When dealing with text, especially in different natural languages,
Text Summarization is often applied as a second stage of document retrieval in order to help the user get an idea about the content of the retrieved documents. Research in information retrieval has a long tradition of addressing the problem of text summarization, with the first reported attempts in the 1950s and 1960s exploiting properties such as the frequency of words in the text. When dealing with text, especially in different natural languages, properties of the language can be a valuable source of information; this brought methods from natural language processing research into text summarization in the late 1970s. Since humans are good at making summaries, we can use examples of human-generated summaries to learn something about the underlying process by applying machine learning and data-mining methods, a popular approach in the 1990s. There are several ways to provide a text summary (Mani & Maybury, 1999). The simplest but also very effective way is to provide keywords that help capture the main topics of the text, either for human understanding or for further processing, such as indexing and grouping of documents, books, pictures, and so forth. As text is usually composed of sentences, we can also talk about summarization by highlighting or extracting the most important sentences, a form of summarization that is frequently found in human-generated summaries. A more sophisticated form of summarization generates new sentences based on the whole text, as used, for instance, by humans when writing book reviews.

User Profiling is used to provide information that is potentially interesting for the user (e.g., in the context of personalized search engines, browsing the Web, or shopping on the Web). It can be based on the content (of the documents) that the user has visited or on the behavior of other users accessing the same data. In the context of text mining, when using the content, the system searches for text documents that are similar to those the user liked (e.g., observing the user browsing the Web and providing help by highlighting potentially interesting hyperlinks on the requested Web pages) (Mladenic, 2002). Content-based document filtering has its foundation in information retrieval research. One of the main problems with this approach is that it tends to narrow the search to documents similar to those the user has already seen.
FUTURE TRENDS

A number of researchers are working intensively in the area of text data mining, mainly guided by the need to develop new methods capable of handling interesting real-world problems. One such problem, recognized in the past few years, is reducing the amount of manual work needed for hand-labeling the data. Most of the approaches for automatic document filtering, categorization, user profiling, information extraction, and text tagging require a set of labeled (pre-categorized) data describing the addressed concepts. Using unlabeled data and bootstrap learning are two directions giving research results that enable an important reduction in the amount of hand-labeling needed.
In document categorization using unlabeled data, we need a small number of labeled documents and a large pool of unlabeled documents (e.g., to classify an article into one of several news groups, or to classify Web pages). The approach proposed by Nigam et al. (2001) can be described as follows. First, model the labeled documents and use the trained model to assign probabilistically weighted categories to all unlabeled documents. Then, train a new model using all the documents and iterate until the model remains unchanged. The final result depends heavily on the quality of the categories assigned to the small set of hand-labeled data, but it is much easier to hand-label a small set of examples with good quality than a large set of examples with medium quality.

Bootstrap learning for Web page categorization is based on the fact that most Web pages have some hyperlinks pointing to them. Using that, we can describe each Web page either by its content or by the content of the hyperlinks that point to it. First, a small number of documents is labeled, and each is described using the two descriptions. One model is constructed from each description independently and used to label a large set of unlabeled documents. A few of the documents for which the prediction was most confident are added to the set of labeled documents, and the whole loop is repeated. In this way, we start with a small set of labeled documents and enlarge it through the iterations, hoping that the initial labels provided good coverage of the problem space.

Some work was also done in the direction of mining the extracted data (Ghani et al., 2000), where information extraction is used to automatically collect information about different companies from the Web, and data-mining methods are then applied to the extracted data. As Web documents are naturally organized in a graph structure through hyperlinks, there are also research efforts on using that graph structure to improve document categorization (Craven & Slattery, 2001), Web search, and visualization of the Web.
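A minimal sketch of the iterative scheme just described: train on the labeled documents, label the unlabeled pool, retrain on everything, and repeat until the assignments stop changing. The use of scikit-learn's multinomial Naïve Bayes and of hard (rather than probabilistically weighted) labels for the pool are simplifying assumptions for illustration, not the exact procedure of Nigam et al. (2001).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def self_training(labeled_texts, labels, unlabeled_texts, max_iter=10):
    """Iteratively label the unlabeled pool and retrain until the labels stabilize."""
    labels = np.asarray(labels)
    vectorizer = CountVectorizer()
    X_all = vectorizer.fit_transform(labeled_texts + unlabeled_texts)  # shared vocabulary
    n = len(labeled_texts)
    X_labeled, X_pool = X_all[:n], X_all[n:]

    model = MultinomialNB().fit(X_labeled, labels)
    previous = None
    for _ in range(max_iter):
        guessed = model.predict(X_pool)                  # categories guessed for the pool
        if previous is not None and np.array_equal(guessed, previous):
            break                                        # the model no longer changes
        previous = guessed
        model = MultinomialNB().fit(X_all, np.concatenate([labels, guessed]))
    return model, vectorizer
```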
CONCLUSION

Mining of text data, as described in this article, is a fairly wide area, including different methods used to provide computer support for handling text data. Text mining evolved at the intersection of different research areas and exists in parallel with them; it gets its methodological inspiration from different fields, while its applications are closely connected to the areas of Web mining (Chakrabarti, 2002) and digital libraries. As many of the already developed approaches provide solutions of reasonable quality, text mining is gaining popularity in applications, and researchers are addressing more demanding problems and approaches that, for instance, go beyond the word-vector representation of the text and combine it with other areas, such as the Semantic Web and knowledge management.
REFERENCES

Chakrabarti, S. (2002). Mining the Web: Analysis of hypertext and semi structured data. San Francisco, CA: Morgan Kaufmann.

Craven, M., & Slattery, S. (2001). Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43(1/2), 97-119.

Duda, R.O., Hart, P.E., & Stork, D.G. (2000). Pattern classification. Wiley-Interscience.

Dumais, S.T., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. Proceedings of the Seventh International Conference on Information and Knowledge Management, Bethesda, Maryland.

Fayyad, U., Grinstein, G.G., & Wierse, A. (Eds.). (2001). Information visualization in data mining and knowledge discovery. San Francisco, CA: Morgan Kaufmann.

Ghani, R., Jones, R., Mladenic, D., Nigam, K., & Slattery, S. (2000). Data mining on symbolic knowledge extracted from the Web. Proceedings of the KDD-2000 Workshop on Text Mining, Boston, MA.

Grobelnik, M., & Mladenic, D. (2002). Efficient visualization of large text corpora. Proceedings of the Seventh TELRI Seminar, Dubrovnik, Croatia.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining: Adaptive computation and machine learning. Boston, MA: MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J.H. (2001). The elements of statistical learning: Data mining, inference, and prediction. Berlin, Germany: Springer Verlag.

Jackson, P., & Moulinier, I. (2002). Natural language processing for online applications: Text retrieval, extraction, and categorization. John Benjamins Publishing Co.

Mani, I., & Maybury, M.T. (Eds.). (1999). Advances in automatic text summarization. Boston, MA: MIT Press.

Manning, C.D., & Schutze, H. (2001). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Mitchell, T.M. (1997). Machine learning. The McGraw-Hill Companies, Inc.
Mladenic, D. (2002). Web browsing using machine learning on text data. In P.S. Szczepaniak (Ed.), Intelligent exploration of the Web (pp. 288-303). New York: Physica-Verlag.

Mladenic, D., & Grobelnik, M. (2003). Feature selection on hierarchy of Web documents. Journal of Decision Support Systems, 35, 45-87.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2001). Text classification from labeled and unlabeled documents using EM. Machine Learning Journal, 39(2/3), 103-134. Boston, MA: Kluwer Academic Publishers.

Sebastiani, F. (2002). Machine learning for automated text categorization. ACM Computing Surveys.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Proceedings of the KDD Workshop on Text Mining, Boston, Massachusetts.

van Rijsbergen, C.J. (1979). Information retrieval. USA: Butterworths.

Witten, I.H., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco, CA: Morgan Kaufmann.
KEY TERMS

Document Categorization: A process that assigns one or more of the predefined categories (labels) to a document.

Document Clustering: A process that groups documents, based on their content similarity, using some predefined similarity measure.

Information Extraction: A process of extracting data from text, commonly used to fill in the fields of a database based on text documents.

Text Summarization: A process of generating a summary from a longer text, usually based on extracting keywords or sentences or on generating new sentences.

Topic Identification and Tracking: A process of identifying the appearance of new topics in a stream of data, such as news messages, and tracking the reappearance of a single topic in the stream of text data.

User Profiling: A process for automatic modeling of the user. In the context of Web data, it can be content-based, using the content of the items that the user has accessed, or collaborative, using the ways other users access the same set of items. In the context of text mining, we talk about user profiling when using the content of text documents.

Visualization of Text Data: A process of visual representation of text data, where different methods for visualizing data can be used to place the data, usually in two or three dimensions, and draw a picture.
Text Mining Methods for Hierarchical Document Indexing

Han-Joon Kim
The University of Seoul, Korea
INTRODUCTION

We have recently seen a tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such huge amounts of documents is to hierarchically categorize documents according to topic (Agrawal, Bayardo, & Srikant, 2000; Kim & Lee, 2003). The documents indexed according to a hierarchical structure (termed a 'topic hierarchy' or 'taxonomy') are kept in internal categories as well as in leaf categories, in the sense that documents at a lower category have increasing specificity. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of Web directories such as Yahoo (http://www.yahoo.com/) and the Open Directory Project (http://dmoz.org/), topic hierarchies have increased in importance as a tool for organizing or browsing a large volume of electronic text documents. Currently, the topic hierarchies maintained by most information systems are manually constructed and maintained by human editors. The topic hierarchy should be continuously subdivided to cope with the high rate of increase in the number of electronic documents. For example, the topic hierarchy of the Open Directory Project has now reached about 590,000 categories. However, manually maintaining the hierarchical structure incurs several problems. First, such a manual task is prohibitively costly as well as time-consuming. Until now, large search portals such as Yahoo have invested significant time and money into maintaining their taxonomy, but obviously they will not be able to keep up with the pace of growth and change in electronic documents through such manual activity. Moreover, for a dynamic networked resource (e.g., the World Wide Web) that contains highly heterogeneous documents accompanied by frequent content changes, maintaining a 'good' hierarchy is fraught with difficulty, and oftentimes is beyond the human experts' capabilities. Lastly, since human editors' categorization decisions are not only highly subjective but also variable over time, it is difficult to maintain a reliable and consistent hierarchical structure. The above limitations require information systems that
can provide intelligent organization capabilities with topic hierarchies. Related commercial systems include the Northern Light Search Engine (http://www.northernlight.com/), the Inktomi Directory Engine (http://www.inktomi.com/), and Semio Taxonomy (http://www.semio.com/), which enable a browsable Web directory to be built automatically. However, these systems did not address the (semi-)automatic evolution of organizational schemes and classification models at all. This is one of the reasons why commercial taxonomy-based services do not tend to be as popular as their manually constructed counterparts, such as Yahoo.
BACKGROUND

In future systems, it will be necessary for users to be able to easily manipulate the hierarchical structure and the placement of documents within it (Aggarwal, Gates, & Yu, 1999; Agrawal, Bayardo, & Srikant, 2000). In this regard, this section presents three critical requirements for intelligent taxonomy construction, and a taxonomy construction process using text-mining techniques.
Requirements for Intelligent Taxonomy Construction

(1) Automated classification of text documents: In order to organize a huge number of documents, it is essential to automatically assign incoming documents to an appropriate location in a predefined taxonomy. Recent approaches to automated classification have used supervised machine-learning approaches to inductively build a classification model of pre-defined categories from a training set of labeled (pre-classified) data. Basically, such machine-learning based classification requires a sufficiently large number of labeled training examples to build an accurate classification model. Assigning class labels to unlabeled documents has to be performed by a human labeler, and the task is highly time-consuming and expensive.
Table 1. Procedure for hierarchically organizing text documents

Step 1. Initial construction of taxonomy
i. Define an initial (seed) taxonomy

Step 2. Category (re-)learning
i. Collect a set of controlled training data that fits the defined (or refined) taxonomy
ii. Generate (or update) the current classification model so as to enable classification for newly generated categories
iii. Periodically update the current classification model so as to constantly guarantee a high degree of classification accuracy while refining the training data

Step 3. Automatic classification
i. Retrieve documents of interest from various sources
ii. Assign each unknown document to one or more categories with its maximal membership value according to the established model

Step 4. Evolution of taxonomy
i. If concept drift or a change in viewpoint occurs within a sub-taxonomy, reorganize the specified sub-taxonomy
ii. If a new concept sprouts in the unclassified area, cluster the data within the unclassified area into new categories

Step 5. Sub-taxonomy construction and integration
i. Integrate the refined sub-taxonomy or new categories into the main taxonomy

Step 6. Go to Step 2
Furthermore, an online learning framework is necessary because it is impossible to distinguish training documents from unknown documents to be classified in the operational environment. In addition, classification models should be continuously updated so that their accuracy can be maintained at a high level. To resolve this problem, incremental learning methods are required, in which an established model can be updated incrementally without re-building it completely.

(2) Semi-automatic management of an evolving taxonomy: The taxonomy initially constructed should change and adapt as its document collection continuously grows or users' needs change. When concept drift (which means that the general subject matter of information within a category may no longer suit the subject that best explained that information when it was originally created) happens in particular categories, or when the established criterion for classification alters with time as the content of the document collection changes, it should be possible for part of the taxonomy to be reorganized; the system is expected to recommend different feasible sub-taxonomies for that part to the user.

(3) Making use of domain (or human) knowledge in cluster analysis for topic discovery: In order to refine the taxonomy, it is necessary to discover new topics (or categories) that can precisely describe the currently indexed document collection. In general, topic discovery is achieved by clustering techniques, since clusters, which are distinct groups of similar documents, can be regarded as representing topically coherent topics in the collection. Clustering for topic discovery is a challenging problem that requires sufficient domain knowledge, because the taxonomy should reflect the preferences of an individual user or the specific requirements of an application. However, clustering is inherently an unsupervised learning process that does not depend on external knowledge. Therefore, a new type of supervised clustering is required that reflects external knowledge provided by users.
Taxonomy Construction Process Using Text-Mining Techniques

Table 1 illustrates a procedure for hierarchically organizing text documents. The system begins with an initial topic hierarchy in which each document is assigned to its appropriate categories by automatic document classifiers. The topic hierarchy is then made to evolve so as to reflect the current contents and usage of the indexed documents. The classification process repeats based on the more refined hierarchy. In Table 1, steps 2 and 3 are related to machine-learning based text classification, step 4 to semi-supervised clustering for topic discovery, and step 5 to taxonomy building.
MAIN THRUST

This section discusses a series of text mining algorithms that can effectively support the taxonomy construction process. Recent text mining algorithms are driven by the machine learning paradigm; this is particularly true of classification and clustering algorithms. Another important issue is feature selection, because textual data includes a huge number of features such as words or phrases. A feature selection module in the system extracts plain text from each of the retrieved documents and automatically determines only the more significant features
to speed up the learning process and to improve the classification (or clustering) accuracy. However, this chapter does not present feature selection algorithms, because the related issues are not significantly dependent upon the system.
Figure 1. Architecture of an operational classification system, showing the learning flow and the classification flow among the Learner, Classifier, and Sampler modules
Operational Automated Classification: A Combination of Active Learning, Semi-Supervised Learning, Online Learning, and Incremental Learning

As mentioned before, machine-learning based classification methods require a large amount of good quality data for training. However, this requirement is not easily satisfied in real-world operational environments. Recently, many studies on text classification have focused on the effective selection of good quality training data that accurately reflect the concept in a given category, rather than on algorithm design. How to compose the training data has become a very important issue in developing operational classification systems. One good approach is a combination of "active learning" and "semi-supervised learning" (Kim & Chang, 2003; Muslea, Minton, & Knoblock, 2002). First, in the active learning approach, the learning module actively chooses the training data from a pool of unlabeled data for humans to give them the appropriate class labels (Argamon-Engelson & Dagan, 1999). Among the different types of active learning, the selective sampling method has been used frequently for learning with text data. It examines a pool of unlabeled data and selects only the most informative examples through a particular measure, such as the uncertainty of the classification. Second, semi-supervised learning is a variant of supervised learning in which classifiers can be learned more precisely by augmenting a few labeled training data with many unlabeled data (Demiriz & Bennett, 2000). For semi-supervised learning, the EM (Expectation-Maximization) algorithm can be used, an iterative method for finding maximum likelihood estimates in problems with unlabeled data (Dempster, Laird, & Rubin, 1977). For developing operational text classifiers, the EM algorithm has been evaluated to be a practical and excellent solution to the lack of training examples in developing classification systems (Nigam, McCallum, Thrun, & Mitchell, 2000). Figure 1 shows a classification system architecture that supports active learning and semi-supervised learning (Kim & Chang, 2003). The system consists of three modules: Learner, Classifier, and Sampler; in contrast, conventional systems do not include the Sampler module. The Learner module creates a classification model (or function) by examining and analyzing the contents of the training documents. The Classifier module uses
the classification model built by the Learner to determine the categories of each unknown document. In conventional systems, the Learner runs only once as an off-line process, but here it should update the current model continuously as an online process. To achieve incremental learning, the Naïve Bayes or support vector machine learning algorithms are preferable. This is because these algorithms can incrementally update the classification model simply by adding additional feature estimates to the currently learned model, instead of re-building the model completely (Yang & Liu, 1999). Moreover, these learning algorithms have been used successfully for textual data with high-dimensional feature spaces (Agrawal, Bayardo, & Srikant, 2000; Joachims, 2001). In particular, Naïve Bayes is applied straightforwardly to the EM algorithm due to its probabilistic learning framework (Nigam, McCallum, Thrun, & Mitchell, 2000). Lastly, the Sampler module isolates a subset of candidate examples (e.g., through uncertainty-based selective sampling) from the currently classified data and returns them to a human expert for class labeling. Both the selective sampling and EM algorithms assume that a stream of unlabeled documents is provided from some external source. Practically, rather than acquiring extra unlabeled data, it is more desirable to use the entire set of data indexed on the currently populated taxonomy as the pool of unlabeled documents. As shown in Figure 1, the classified documents are fed into the Sampler to augment the current training documents, and they are also used by the Learner as a pool of unlabeled documents for the EM process. Consequently, in the context of the Learner module, not only can we easily obtain the unlabeled data used for the EM process without extra effort, but some of the mistakenly classified data are also correctly classified.
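To illustrate the Sampler's uncertainty-based selective sampling, the sketch below ranks pool documents by the entropy of the classifier's predicted class distribution and returns the most uncertain ones for expert labeling. The entropy measure and the scikit-learn-style predict_proba interface are illustrative assumptions, not the exact implementation of Kim & Chang (2003).

```python
import numpy as np

def select_uncertain_examples(model, X_pool, k=10):
    """Return the indices of the k pool documents with the most uncertain predictions."""
    proba = model.predict_proba(X_pool)                       # shape: (n_documents, n_classes)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # high entropy = uncertain prediction
    return np.argsort(entropy)[::-1][:k]

# Usage sketch: `model` is any fitted classifier exposing predict_proba, and
# `X_pool` is the vectorized pool of classified-but-unverified documents.
# query_indices = select_uncertain_examples(model, X_pool, k=20)
# The selected documents are then passed to the human expert for class labeling.
```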
Semi-Supervised (User-Constrained) Clustering for Topic Discovery

Most clustering algorithms do not allow introducing external knowledge into the clustering process. However, to discover new categories for taxonomy reorganization, it is essential to incorporate external knowledge into the cluster analysis. Such a clustering algorithm is called "semi-supervised clustering," and it is very helpful in situations where we must continuously discover new categories from incoming documents. A few strategies for incorporating external human knowledge into cluster analysis have already been proposed by Talavera and Bejar (1999) and Xing, Ng, Jordan, and Russell (2003). One possible strategy is to vary the distance metric by weighting the dependencies between different components (or features) of the feature vectors, using the quadratic-form distance for similarity scoring. That is, the distance between two document vectors d_x and d_y is given by:
$$\mathrm{dist}_W(d_x, d_y) = (d_x - d_y)^T \cdot W \cdot (d_x - d_y) \qquad (1)$$
where each document is represented as a vector of the form d_x = (d_{x1}, d_{x2}, ..., d_{xn}), where n is the total number of index features in the system and d_{xi} (1 ≤ i ≤ n) denotes the weighted frequency with which feature t_i occurs in document d_x; T denotes the transpose of a vector; and W is an n × n symmetric weight matrix whose entry w_{ij} denotes the interrelationship between the components t_i and t_j of the vectors. Each entry w_{ij} in W reveals how closely feature t_i is associated with feature t_j. If the clustering algorithm uses this type of distance function, then clusters reflecting the users' viewpoints will be identified more precisely. To represent user knowledge for topic discovery, one can introduce one or more groups of relevant (or irrelevant) examples to the clustering system, depending on the user's judgment of the selected examples from a given document collection (Kim & Lee, 2002). Each of these document groups is referred to as a "document bundle," and bundles are divided into two types: positive and negative ones. Documents within positive bundles (i.e., documents judged jointly "relevant" by users) must be placed in the same cluster, while documents within negative bundles (i.e., documents judged "irrelevant" by users) must be located in different clusters. Then, when document bundles are given, the clustering process induces the distance metric parameters that satisfy the given bundle constraints. The problem is how to find the weights that best fit the human knowledge represented as document bundles.
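A minimal NumPy sketch of the quadratic-form distance in Equation (1); the identity initialization of W is an assumption for illustration (with W equal to the identity matrix, the measure reduces to the squared Euclidean distance).

```python
import numpy as np

def weighted_distance(d_x, d_y, W):
    """Quadratic-form distance of Equation (1): (d_x - d_y)^T W (d_x - d_y)."""
    diff = np.asarray(d_x, dtype=float) - np.asarray(d_y, dtype=float)
    return float(diff @ W @ diff)

n = 4                                    # toy vocabulary size
W = np.eye(n)                            # identity weights: plain squared Euclidean distance
d_x = np.array([2.0, 0.0, 1.0, 0.0])     # toy weighted term-frequency vectors
d_y = np.array([1.0, 1.0, 0.0, 0.0])
print(weighted_distance(d_x, d_y, W))    # 3.0 with the identity matrix
```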
The distance metric must be adjusted by minimizing the distance between documents within positive bundles, which belong to the same cluster, while maximizing the distance between documents within negative bundles. This dual optimization problem can be solved using the following objective function Q(W):

$$Q(W) = \sum_{(d_x, d_y)\, \in\, R_{B^+} \cup\, R_{B^-}} I(d_x, d_y) \cdot \mathrm{dist}_W(d_x, d_y) \qquad (2)$$

$$I(d_x, d_y) = \begin{cases} +1 & \text{if } \langle d_x, d_y \rangle \in R_{B^+} \\ -1 & \text{if } \langle d_x, d_y \rangle \in R_{B^-} \end{cases}$$

$$R_{B^+} = \{ \langle d_x, d_y \rangle \mid d_x \in B^+ \text{ and } d_y \in B^+ \text{ for any positive bundle set } B^+ \}$$

$$R_{B^-} = \{ \langle d_x, d_y \rangle \mid d_x \in B^- \text{ and } d_y \in B^- \text{ for any negative bundle set } B^- \}$$

where the document bundle set B+ (or B-) is defined to be a collection of positive (or negative) bundles, and ⟨d_x, d_y⟩ ∈ R_{B+} or ⟨d_x, d_y⟩ ∈ R_{B-} denotes that a pair of documents d_x and d_y is found in positive bundles or negative bundles, respectively. Each pair within the bundles is processed as a training example for learning the weighted distance measure, and we should find a weight matrix that minimizes the objective function. To search for an optimal matrix, we can use a gradient descent search method of the kind used for tuning the weights among neurons in artificial neural networks (Mitchell, 1997). When concept drift or a change in a user's viewpoint occurs within a particular area of the taxonomy, the user should prepare a set of document bundles as external knowledge reflecting the concept drift or the change in viewpoint. Then, based on the prepared user constraints, the clustering process discovers categories that resolve the concept drift or reflect the changes in the user's viewpoint, and the isolated categories are then incorporated into the main taxonomy.
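The bundle objective of Equation (2) and one gradient-descent step on it can be sketched as follows; the learning rate and the symmetrization of W after each step are illustrative assumptions rather than the authors' exact procedure (the gradient of a single pair term with respect to W is ±(d_x − d_y)(d_x − d_y)^T).

```python
import numpy as np

def weighted_distance(d_x, d_y, W):
    diff = d_x - d_y
    return float(diff @ W @ diff)

def bundle_objective(W, positive_pairs, negative_pairs):
    """Q(W) of Equation (2): +1 for pairs from positive bundles, -1 for pairs from negative ones."""
    q = sum(weighted_distance(d_x, d_y, W) for d_x, d_y in positive_pairs)
    q -= sum(weighted_distance(d_x, d_y, W) for d_x, d_y in negative_pairs)
    return q

def gradient_step(W, positive_pairs, negative_pairs, learning_rate=0.01):
    """One gradient-descent step: shrink positive-pair distances, grow negative-pair distances."""
    grad = np.zeros_like(W)
    for d_x, d_y in positive_pairs:          # pairs are NumPy document vectors
        diff = d_x - d_y
        grad += np.outer(diff, diff)         # reducing dist_W for positive pairs lowers Q
    for d_x, d_y in negative_pairs:
        diff = d_x - d_y
        grad -= np.outer(diff, diff)         # increasing dist_W for negative pairs lowers Q
    W_new = W - learning_rate * grad
    return (W_new + W_new.T) / 2.0           # keep the weight matrix symmetric
```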
Automatic Taxonomy Construction

For building hierarchical relationships among categories, we need to note that a category is represented by topical terms reflecting its concept (Sanderson & Croft, 1999). This suggests that the relations between categories can be determined by describing the relations between their significant terms. In this regard, it is difficult to dichotomize the relations between categories into groups representing the presence or absence of association, because term associations are generally represented not as crisp relations but in probabilistic terms. Thus, the degree of association between two categories can be represented by a membership grade in a fuzzy (binary) relation. That is, the generality and specificity of categories can be expressed by aggregating the relations among their terms. In Kim and Lee (2003), a hierarchical relationship between two categories is represented by a membership
grade in a fuzzy (binary) relation. The fuzzy relation CSR(c_i, c_j) (which represents the relational concept "c_i subsumes c_j"), called the "category subsumption relation" (CSR), between two categories c_i and c_j is defined as follows:

$$\mu_{CSR}(c_i, c_j) = \frac{\displaystyle\sum_{t_i \in V_{c_i},\; t_j \in V_{c_j},\; \Pr(t_i \mid t_j) > \Pr(t_j \mid t_i)} \tau_{c_i}(t_i) \times \tau_{c_j}(t_j) \times \Pr(t_i \mid t_j)}{\displaystyle\sum_{t_i \in V_{c_i},\; t_j \in V_{c_j}} \tau_{c_i}(t_i) \times \tau_{c_j}(t_j)} \qquad (3)$$
where τ_c(t) denotes the degree to which the term t represents the concept corresponding to the category c, which can be estimated by calculating the χ² statistic of term t in category c, since the χ² value represents the degree of term importance (Yang & Pedersen, 1997). Pr(t_i | t_j) should be weighted by the degree of significance of the terms t_i and t_j in their categories, and thus the membership function µ_CSR(⋅) for categories is calculated as the weighted average of the values of Pr(t_i | t_j) over the terms. The value of µ_CSR(⋅) is a real number in the closed interval [0,1] and indicates the strength of the relationship present between the two categories. By using the above fuzzy relation, we can build a sub-taxonomy of the categories discovered by cluster analysis.
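A small sketch of the category subsumption measure in Equation (3). The representative term sets, the term-importance weights τ (e.g., χ² values), and the conditional probabilities Pr(t_i | t_j) are taken here as precomputed inputs; how they are estimated from the document collection is outside this sketch.

```python
def csr_membership(terms_i, terms_j, tau_i, tau_j, cond_prob):
    """mu_CSR(c_i, c_j) of Equation (3).

    terms_i, terms_j : iterables of representative terms for categories c_i and c_j
    tau_i, tau_j     : dicts mapping a term to its importance weight in each category
    cond_prob        : function (t_a, t_b) -> Pr(t_a | t_b) estimated from the collection
    """
    numerator, denominator = 0.0, 0.0
    for t_i in terms_i:
        for t_j in terms_j:
            weight = tau_i[t_i] * tau_j[t_j]
            denominator += weight
            if cond_prob(t_i, t_j) > cond_prob(t_j, t_i):
                numerator += weight * cond_prob(t_i, t_j)
    return numerator / denominator if denominator else 0.0

# A value close to 1 suggests that c_i subsumes c_j, so c_i can be placed above c_j
# when the discovered categories are integrated into the topic hierarchy.
```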
FUTURE TRENDS

In applying text mining techniques to hierarchically organizing large textual data, a number of issues remain to be explored in the future. An important issue is how to seamlessly reflect human knowledge in text mining algorithms; specifically, in the algorithm for discovering new classes using semi-supervised clustering and in the one for learning classification models with active learning. Research on semi-supervised clustering and active learning cannot be brought to completion without considering user interaction. As an example, for semi-supervised clustering, Kim and Lee (2002) attempted to generate external human knowledge in the form of bundle constraints through user-relevance feedback. The problem will continue to be tackled intensively from other angles as well. Additionally, in terms of automatic taxonomy construction, semantically richer information needs to be built automatically, beyond the subsumption hierarchy of categories; for example, relevance relationships among categories need to be extracted for cross-referencing of categories. Another challenging issue is that, for extracting more precise document context, it is necessary to utilize structural and contextual features (e.g., tree-like structures and the diverse tag information of XML documents) of the original textual data, if any, as well as the simple features of a bag of words. Such feature engineering may be more profitable and effective than algorithm design, particularly in building commercial applications. On a practical level, an open challenge is automatic taxonomy engineering for manually constructed topic hierarchies such as the Yahoo directory (http://www.yahoo.com/), the ODP directory (http://dmoz.org/), and the UNSPSC classification system (http://www.unspsc.org/). Since these topic hierarchies are popular and huge in size, they are expected to be good exemplars for evaluating the practical value of text-mining techniques for taxonomy building.
CONCLUSION

Towards intelligent taxonomy engineering for large textual data, text mining techniques are of great importance. In this chapter, for developing operational classification systems, a combination of active learning and semi-supervised learning has been introduced, together with a classification system architecture that has an online and incremental learning framework. In terms of category discovery, a simple representation of human knowledge, called document bundles, has been discussed as a way of incorporating human knowledge into cluster analysis for semi-supervised clustering. As for taxonomy building, a simple fuzzy-relation based algorithm has been described that requires no complicated linguistic analysis. The current research on building hierarchical structures automatically is still at an early stage. Basically, current techniques consider only the subsumption hierarchy, but future studies should try to extract other useful semantics of the discovered categories. More importantly, how to incorporate human knowledge into text mining algorithms should be studied further, together with user interaction design. In this regard, semi-supervised clustering, semi-supervised learning, and active learning are challenging issues with both academic and practical value.
REFERENCES

Aggarwal, C.C., Gates, S.C., & Yu, P.S. (1999, August). On the merits of building categorization systems by supervised clustering. In International Conference on Knowledge Discovery and Data Mining, KDD'99 (pp. 352-356), San Diego, USA.

Agrawal, R., Bayardo, R., & Srikant, R. (2000, March). Athena: Mining-based interactive management of text databases. In International Conference on Extending Database Technology, EDBT-2000 (pp. 365-379), Konstanz, Germany.
Argamon-Engelson, S., & Dagan, I. (1999). Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11, 335-360.

Demiriz, A., & Bennett, K. (2000). Optimization approaches to semi-supervised learning. In M. Ferris, O. Mangasarian, & J. Pang (Eds.), Applications and algorithms of complementarity. Boston: Kluwer Academic Publishers.

Dempster, A.P., Laird, N., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1-38.

Joachims, T. (2001, September). A statistical learning model of text classification with support vector machines. In International Conference on Research and Development in Information Retrieval, SIGIR-2001 (pp. 128-136), New Orleans, USA.

Kim, H.J., & Chang, J.Y. (2003). Improving Naïve Bayes text classifier with modified EM algorithm. Lecture Notes in Artificial Intelligence, 2871, 326-333.

Kim, H.J., & Lee, S.G. (2002). User feedback-driven document clustering technique for information organization. IEICE Transactions on Information and Systems, E85-D(6), 1043-1048.

Kim, H.J., & Lee, S.G. (2003). Building topic hierarchy based on fuzzy relations. Neurocomputing, 51, 481-486.

Mitchell, T.M. (1997). Artificial neural networks: Machine learning. New York: McGraw-Hill.

Muslea, I., Minton, S., & Knoblock, C. (2002, July). Active + semi-supervised learning = robust multi-view learning. International Conference on Machine Learning, ICML-2002 (pp. 435-442), Sydney, Australia.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T.M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Sanderson, M., & Croft, B. (1999, August). Deriving concept hierarchies from text. In International Conference on Research and Development in Information Retrieval, SIGIR'99 (pp. 206-213), Berkeley, USA.

Talavera, L., & Bejar, J. (1999, August). Integrating declarative knowledge in hierarchical clustering tasks. In International Conference on Intelligent Data Analysis, IDA'99 (pp. 211-222), Amsterdam, The Netherlands.

Xing, E.P., Ng, A.Y., Jordan, M.I., & Russell, S. (2002). Distance metric learning with application to clustering with side-information. Neural Information Processing Systems, NIPS-2002 (pp. 505-512), Vancouver, Canada.
Yang, Y., & Liu, X. (1999, August). A re-examination of text categorization methods. In International Conference on Research and Development in Information Retrieval, SIGIR'99 (pp. 42-49), Berkeley, USA.

Yang, Y., & Pedersen, J.O. (1997, July). A comparative study on feature selection in text categorization. In International Conference on Machine Learning, ICML'97 (pp. 412-420), Nashville, USA.
KEY TERMS

Active Learning: Learning modules that support active learning select the best examples for class labeling and training without depending on a teacher's decision or random sampling.

Document Clustering: An unsupervised learning technique that partitions a given set of documents into distinct groups of similar documents based on similarity or distance measures.

EM Algorithm: An iterative method for estimating maximum likelihood in problems with incomplete (or unlabeled) data. The EM algorithm can be used for semi-supervised learning (see below), since it is a form of clustering algorithm that clusters the unlabeled data around the labeled data.

Fuzzy Relation: In fuzzy relations, degrees of association between objects are represented not as crisp relations but as membership grades, in the same manner as degrees of set membership are represented in a fuzzy set.

Semi-Supervised Clustering: A variant of unsupervised clustering that makes use of external knowledge. Semi-supervised clustering performs the clustering process under various kinds of user constraints or domain knowledge.

Semi-Supervised Learning: A variant of supervised learning that uses both labeled data and unlabeled data for training. Semi-supervised learning attempts to provide a more precisely learned classification model by augmenting labeled training examples with information exploited from unlabeled data.

Supervised Learning: A machine learning technique for inductively building a classification model (or function) of a given set of classes from a set of training (pre-labeled) examples.

Text Classification: The task of automatically assigning a set of text documents to a set of predefined classes. Recent text classification methods adopt supervised learning
algorithms such as Naïve Bayes and support vector machines.
Topic Hierarchy (Taxonomy): Topic hierarchy in this chapter is a formal hierarchical structure for orderly classification of textual information. It hierarchically categorizes incoming documents according to topic in the sense that documents at a lower category have increasing specificity.
Time Series Analysis and Mining Techniques

Mehmet Sayal
Hewlett-Packard Labs, USA
INTRODUCTION

A time series is a sequence of data values that are recorded with equal or varying time intervals. Time series data usually includes timestamps that indicate the time at which each individual value in the time series is recorded. Time series data is usually transmitted in the form of a data stream, i.e., a continuous flow of data values. The source of time series data can be any system that measures and records data values over the course of time. Examples of time series data include stock values, the blood pressure of a patient, the temperature of a room, the amount of a product in inventory, and the amount of precipitation in a region. Proper analysis and mining of time series data may yield valuable knowledge about the underlying characteristics of the data source. Time series analysis and mining has applications in many domains, such as financial, biomedical, and meteorological applications, because time series data may be generated by various sources in different domains.

BACKGROUND

Time series analysis and mining techniques differ in their goals and in the algorithms they use. Most of the existing techniques fall into one of the following categories:

• Trend Analysis and Prediction: The purpose is to predict future values in a time series through analysis of historic values (Han & Kamber, 2001; Han, Pei, & Yin, 2000; Han et al., 2000; Kim, Lam, & Han, 2000; Pei, Tung, & Han, 2001). For example, "How will the inventory amount change based on the historic data?" or "What will be the value of the inventory amount next week?"

• Similarity Search: The most common purpose is to satisfy user queries that search for whole-sequence or sub-sequence matching among multiple time series data streams (Agrawal, Faloutsos, & Swami, 1993; Faloutsos, Ranganathan, & Manolopoulos, 1994; Kahveci & Singh, 2001; Kahveci, Singh, & Gurel, 2002; Popivanov & Miller, 2002; Wu, Agrawal, & El Abbadi, 2000; Zhu & Shasha, 2002). For example, "Can you find time series data streams that are similar to each other?" or "Which time series data streams repeat similar patterns every 2 hours?"

• Relationship Analysis: The main purpose is to identify relationships among multiple time series. The causal relationship is the most popular type, which detects cause-effect relationships among multiple time series. For example, "Does an increase in price have any effect on profit?"

MAIN THRUST

The techniques for predicting the future trend and values of time series data try to identify the following types of movements:

• Long-term or trend movements (Han & Kamber, 2001)

• Seasonal and cyclic variations, e.g., similar patterns that a time series appears to follow during corresponding months of successive years, or other regular periods (Han & Kamber, 2001; Han, Pei, & Yin, 2000; Han, Pei, Mortazavi-Asl, et al., 2000; Pei et al., 2001; Kim et al., 2000)

• Random movements

Long-term trend analysis research is mostly dominated by the application of well-studied statistical techniques, such as regression. Various statistical methods have been used for detecting seasonal and cyclic variations. Sequence mining techniques have also been used to detect repeating patterns. However, statistical methods are more suitable for detecting additive and multiplicative seasonality patterns, in which the impact of seasonality adds to or multiplies the current values with each repetition. Random movements are usually considered noise and are ignored through the use of smoothing techniques.
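The long-term trend and smoothing ideas above can be illustrated with a few lines of NumPy; the window length and the use of a simple least-squares line are arbitrary choices for the example, not methods prescribed in this article.

```python
import numpy as np

def moving_average(values, window=3):
    """Smooth out random movements with a simple moving average over a fixed window."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def linear_trend(values):
    """Fit a least-squares line to estimate the long-term (trend) movement."""
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)
    return slope, intercept

series = np.array([10, 12, 11, 13, 15, 14, 16, 18, 17, 19], dtype=float)
print(moving_average(series))    # noise-reduced version of the series
print(linear_trend(series))      # positive slope -> upward long-term trend
```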
Similarity search techniques have been studied in detail in the last ten years. Those techniques usually reduce the search space by extracting a few identifying features from the time series data streams and comparing the extracted features with each other to determine which time series data streams exhibit similar patterns. Some approaches look for whole-pattern matching, whereas others break the time series into segments and try to evaluate the similarity by comparing segments from different time series data streams. Most similarity search techniques use an indexing method in order to efficiently store and retrieve the feature sets that are extracted from the time series data streams. The general problem of similarity-based search is well known in the field of information retrieval, and many indexing methods exist to process queries efficiently. However, certain properties of time sequences make the standard indexing methods unsuitable. The fact that the values in the sequences are usually continuous, and that the elements may not be equally spaced in the time dimension, makes it difficult to use standard text-indexing techniques like suffix trees. Faloutsos et al. introduced the most revolutionary ideas on similarity search (Agrawal et al., 1993; Faloutsos et al., 1994). Time series are converted into a few features using the Discrete Fourier Transformation (DFT) and indexed using R-Trees for fast retrieval. An important limitation of spatial indexing methods is that they work efficiently only when the number of dimensions is low. Therefore, the full feature sets extracted from time series data streams using DFT or any other method are not suitable for spatial indexing methods. The general solution to this problem is dimensionality reduction, i.e., to extract a signature of low dimensionality from the original feature set. The dimensionality reduction has to preserve the distances between the original data sets to some extent, so that indexing and searching in the signature space can be done without losing accuracy significantly. It was proven that false dismissals are avoided during such dimensionality reduction, but false alarms are not. Several research papers applied similar approaches for transforming the time series data from the time domain into the frequency domain using DFT, while preserving the Euclidean distance among the original data sets to avoid false dismissals (Kahveci et al., 2002; Zhu & Shasha, 2002). DFT provides a very efficient approximation for time series data streams, but it has its limitations too. For example, DFT preserves the Euclidean distance but loses phase information. Therefore, with DFT-based techniques it is only possible to find out whether a similarity exists between two or more time series data streams; it is not possible to tell anything about the time distance of the similarity. There are some heuristic approaches trying to overcome this limitation, such as storing additional time information during the transformation into the frequency domain, but none of them seem to be very successful, and they increase the complexity of the algorithms. Discrete Wavelet Transformation (DWT) has also been used in many research papers for feature extraction
(Kahveci & Singh, 2001; Popivanov & Miller, 2002; Wu et al., 2000). Those papers assumed that DWT was empirically superior to DFT, according to the results of a previous study. However, it was later claimed that such comparisons may be biased with regard to implementation details and the selected parameters (Keogh & Kasetty, 2002). Research on relationship analysis has recently started gaining momentum. The main purpose of a relationship analysis technique is to find out the relationships among multiple time series data streams. Causal relationships are the most common ones, because the discovery of causal relationships among time series data streams can be useful for many purposes, such as explaining why certain movements occur in the time series; finding out whether the data values of one time series have any effect on the near-future values of any other time series; and predicting the future values of a time series data stream based not only on its recent trend and fluctuations but also on the changes in data values of other time series data streams. Early research papers on relationship analysis tried to make use of existing techniques from prediction and similarity search. Those approaches have certain limitations, and new approaches are needed for relationship analysis. For example, prediction techniques consider the historic values of a time series and try to predict the future trend and fluctuations based on the historic trend and fluctuations. However, those techniques ignore the possibility that the values in one time series may be affected by the values in many other time series. As another example, similarity search techniques can only tell whether two or more time series (or their segments) are similar to each other. Those techniques cannot provide details regarding the time domain when the impact of a change in the values of one time series is observed after a time delay in the values of another time series. This limitation occurs because the similarity model is different from the original data model in those techniques, i.e., data is transformed from the time domain into the frequency domain to enable faster search, but certain features of the time series data, such as time-relevant information, are lost in most of those techniques. Recently introduced techniques can be applied in the time domain without having to transform the data into another domain (Perng, Wang, Zhang, & Parker, 2000; Sayal, 2004). The main idea is to identify important data points in a time series that can be used as the characteristic features of the time series in the time domain. The important points can be selected from the local extreme points (Perng et al., 2000) or from change points that correspond to the points in time where the trend of the data values in the time series has changed (Sayal, 2004).
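A minimal sketch of the DFT-based feature extraction described above: keep only the first few Fourier coefficients of each (normalized) series as a low-dimensional signature, and compare series by the Euclidean distance between these signatures. The number of coefficients kept and the z-normalization step are illustrative assumptions; in the indexing frameworks cited above, such signature distances are used to prune candidates before the full series are compared.

```python
import numpy as np

def dft_signature(series, n_coeffs=4):
    """Low-dimensional signature: real and imaginary parts of the first n_coeffs DFT coefficients."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)   # z-normalize so offset and scale do not dominate
    coeffs = np.fft.rfft(x)[:n_coeffs]
    return np.concatenate([coeffs.real, coeffs.imag])

def signature_distance(s1, s2, n_coeffs=4):
    """Euclidean distance between the truncated-DFT signatures of two series."""
    return float(np.linalg.norm(dft_signature(s1, n_coeffs) - dft_signature(s2, n_coeffs)))

t = np.linspace(0, 4 * np.pi, 64)
a = np.sin(t)
b = np.sin(t + 0.3)                           # phase-shifted copy of the same pattern
c = np.random.default_rng(0).normal(size=64)  # unrelated noise
print(signature_distance(a, b), signature_distance(a, c))  # the first distance is much smaller
```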
FUTURE TRENDS

The analysis and mining of time series involves several challenges that many researchers have tried to address. Future trends will also be determined by how successfully those challenges are addressed. The main challenges in time series analysis and mining are the unbounded memory requirements and the high data arrival rate. Those two issues make it very hard to generate exact results from the original data. Therefore, many approximation and data transformation techniques have been used. For example, sliding windows, batch processing, sampling, and synopsis data structures have been discussed for query result approximation (Babcock, Babu, Datar, Motwani, & Widom, 2002; Garofalakis, Gehrke, & Rastogi, 2002). Load shedding has been proposed for reducing the amount of data to be processed by dropping some data elements (Babcock, Datar, & Motwani, 2003; Tatbul, Cetintemel, Zdonik, Cherniack, & Stonebraker, 2003). Those two approaches reduce the amount of input data that is processed in order to generate approximate results quickly. Another approach, by Gaber, Krishnaswamy, and Zaslavsky (2004), suggests the use of output granularity, which considers the amount of generated results that can fit into memory before applying any incremental integration with the previous results during the execution of the algorithm. It is claimed that the output granularity approach can speed up the analysis without reducing the accuracy of the results. However, the speed of analysis and mining algorithms strongly depends on the amount of input data, as well as on how the applied algorithm generates its output. Research on Data Stream Management Systems (DSMS) aims at delivering more efficient ways of storing and querying continuous data flows, such as time series data streams (Babu & Widom, 2001; Motwani et al., 2003). Important issues in data stream analysis and mining are discussed in Golab and Ozsu (2003). All time series analysis and mining techniques that have been published so far have had to address the memory and arrival-rate challenges in one way or another. The most popular approach is to transform the input data into a more compact representation that is much easier to analyze, but such approaches also run the risk of reducing the accuracy of the results. Another important challenge is to determine which techniques are useful and make sense. Several techniques have been used for the analysis and mining of time series data. Some of those techniques tried to convert the original problem of extracting knowledge from time series data into a more familiar problem in the data mining domain, such as clustering, classification, and frequent itemset extraction. A recent research paper claimed that clustering time series sub-sequences is meaningless, i.e., the output of such clustering algorithms is independent
from the input data, which suggests that clustering of sub-sequences leads to a loss of important characteristic features of the original data (Keogh, Lin, & Truppel, 2003). The authors contradict the recent research on time series clustering and make the strong statement that clusters extracted from time series sub-sequences are forced to obey certain constraints that are unlikely to be satisfied by any realistic data set, so the clusters extracted from time series data using any clustering algorithm are essentially random. Finally, one important challenge that has been addressed by almost none of the existing research papers in time series analysis and mining is the proper explanation of the results to a general audience. Existing research papers are only concerned with the running-time performance and the accuracy of the results. However, it is very important to be able to explain the results properly to a general audience. For example, the algorithm introduced in Sayal (2004) for detecting time correlation rules among multiple time series data streams, i.e., detecting time-delayed causal relationships, can also easily generate plain-English descriptions of the time correlation rules. Various visualization techniques exist for the graphical representation of well-known data mining algorithms, such as the visualization of clustering results in multidimensional space and the visualization of classification results using decision trees (Fayyad, Grinstein, & Wierse, 2002). However, a proper graphical visualization of time series analysis and mining results is a difficult task.
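As a small illustration of the sliding-window style of approximation mentioned above, the sketch below maintains a running mean over the most recent values of a stream; the window size and the choice of the mean as the maintained statistic are arbitrary choices for the example.

```python
from collections import deque

class SlidingWindowMean:
    """Maintain the mean of the last `size` stream values with O(1) work per new value."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()   # discard the oldest value in the window
        return self.total / len(self.window)

stream = [10, 12, 11, 50, 49, 48]                 # toy data stream with a level shift
sw = SlidingWindowMean(size=3)
print([round(sw.update(v), 2) for v in stream])
```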
CONCLUSION

Speed and accuracy have always been important issues for time series analysis and mining techniques, and they will continue to be the major criteria for measuring the performance of such techniques. Another important issue in the near future will be the proper explanation of the generated results. Most of the existing algorithms aim at performing similar types of analysis in a more time-efficient or accurate way, and those algorithms generate results that are difficult to explain to a general audience and to the users of analysis tools.
REFERENCES

Agrawal, R., Faloutsos, C., & Swami, A.N. (1993). Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (FODO) (pp. 69-84). Chicago, Illinois: Springer Verlag.
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of Symposium on Principles of Database Systems (PODS) (pp. 1-16).

Babcock, B., Datar, M., & Motwani, R. (2003). Load shedding techniques for data stream systems (short paper). In Proceedings of the Workshop on Management and Processing of Data Streams (MPDS).

Babu, S., & Widom, J. (2001). Continuous queries over data streams. SIGMOD Record, 30(3), 109-120.

Faloutsos, C., Ranganathan, M., & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of ACM SIGMOD International Conference on Management of Data (pp. 419-429).

Fayyad, U., Grinstein, G.G., & Wierse, A. (2002). Information visualization in data mining and knowledge discovery. Morgan Kaufmann.

Gaber, M.M., Krishnaswamy, S., & Zaslavsky, A. (2004). Cost-efficient mining techniques for data streams. In Proceedings of Australasian Workshop on Data Mining and Web Intelligence (DMWI2004), 32, Dunedin, New Zealand.

Garofalakis, M., Gehrke, J., & Rastogi, R. (2002). Querying and mining data streams: You only get one look, a tutorial. In Proceedings of ACM SIGMOD International Conference on Management of Data, 635.

Golab, L., & Ozsu, M.T. (2003, June). Issues in data stream management. SIGMOD Record, 32(2), 5-14.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., & Hsu, M.-C. (2000, August). FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 355-359), Boston, MA.

Han, J., Pei, J., & Yin, Y. (2000, May). Mining frequent patterns without candidate generation. In Proceedings of ACM SIGMOD International Conference on Management of Data, Dallas, TX.

Kahveci, T., & Singh, A. (2001, April 2-6). Variable length queries for time series data. In Proceedings of the 17th International Conference on Data Engineering (pp. 273-282), Heidelberg, Germany.

Kahveci, T., Singh, A., & Gurel, A. (2002). An efficient index structure for shift and scale invariant search of multi-attribute time sequences. In Proceedings of the
18th International Conference on Data Engineering (ICDE) (p. 266), poster paper.

Keogh, E., & Kasetty, S. (2002, July 23-26). On the need for time series data mining benchmarks: A survey and empirical demonstration. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 102-111), Edmonton, Alberta, Canada.

Keogh, E., Lin, J., & Truppel, W. (2003, Nov 19-22). Clustering of time series subsequences is meaningless: Implications for past and future research. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003) (pp. 115-122), Melbourne, Florida.

Kim, J., Lam, M.W., & Han, J. (2000, September). AIM: Approximate intelligent matching for time series data. In Proceedings of the 2000 International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Greenwich, U.K.

Motwani, R. et al. (2003, January). Query processing, approximation, and resource management in a data stream management system. In Proceedings of First Biennial Conference on Innovative Data Systems Research (CIDR).

Pei, J., Tung, A., & Han, J. (2001, May). Fault-tolerant frequent pattern mining: Problems and challenges. In Proceedings of ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA.

Perng, C.-S., Wang, H., Zhang, S., & Parker, D.S. (2000, February). Landmarks: A new model for similarity-based pattern querying in time series databases. In Proceedings of the 16th International Conference on Data Engineering (ICDE) (pp. 33-42), San Diego, CA.

Popivanov, I., & Miller, R.J. (2002, February 26-March 1). Similarity search over time series data using wavelets. In Proceedings of the 18th International Conference on Data Engineering (pp. 212-221), San Jose, CA.

Sayal, M. (2004). Detecting time correlations in time-series data streams (Technical Report HPL-2004-103). Retrieved from http://lib.hpl.hp.com/techpubs

Tatbul, N., Cetintemel, U., Zdonik, S., Cherniack, M., & Stonebraker, M. (2003, September). Load shedding in a data stream manager. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB).

Wu, Y., Agrawal, D., & El Abbadi, A. (2000). A comparison of DFT and DWT based similarity search in time-series databases. In Proceedings of the 9th ACM Int'l Conference on Information and Knowledge Management (CIKM) (pp. 488-495).
Zhu, Y., & Shasha, D. (2002, August 20-23). Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings of 28th International Conference on Very Large Data Bases (VLDB) (pp. 358-369), Hong Kong, China.
KEY TERMS Data Stream: A continuous flow of data. The most common use of data streams is the transmission of digital data from one place to another. Data Stream Management System (DSMS): Management system for efficient storage and querying of data streams. DSMS can be considered as the Database Management System (DBMS) for data streams. The main difference of DSMS from DBMS is that it has to handle a higher volume of data that is continuously flowing in, and the characteristics of data content may change over time. Dimensionality Reduction: Process of extracting a signature of low dimensionality from the original data while preserving some attributes of the original data, such as the Euclidean distance. Discrete Fourier Transformation (DFT): A transformation from time domain into frequency domain that is widely used in signal processing related fields to analyze the frequencies contained in a sampled signal. Euclidean Distance: The straight-line distance between two points in a multidimensional space. It is calculated by summing up the squares of distances in individual dimensions and taking the square root of the sum. False Alarm: A case in which a candidate match is found during preprocessing step of a similarity analysis algorithm when a match does not really exist. Minimization of false alarms is important because extracting large amount of false candidates in early steps of an algorithm causes performance degradation that will not improve the accuracy of the result.
False Dismissal: A case in which a candidate match is eliminated during preprocessing step of a similarity analysis algorithm when a match does exist. Minimization of false dismissals is important because they reduce accuracy of the algorithm. Load Shedding: Process of dropping data elements from a data stream randomly or semantically. Load shedding is applied for reducing the amount of data that needs to be processed. R-Tree: A spatial indexing method for fast comparison and retrieval of spatial objects. It is a spatial object hierarchy that is formed by aggregating minimum bounding boxes of the spatial objects and storing the aggregates in a tree structure. Seasonal and Cyclic Variations: Similar patterns that are repeated in time series data throughout regular periods, or calendar time units. Sub-Sequence Matching: Identification of similar sub-sequences (segments) from multiple time series. Sub-sequence matching is used for satisfying queries that look for a particular pattern in one or more time series, or identifying the similarities among multiple time series by comparing pieces of those time series. Time Correlation: Time delayed relationship between two or more time series. Time correlation can be used for identifying causal relationships among multiple time series. Time Series: Sequence of data values that are recorded with equal or varying time intervals. Time series data usually includes timestamps that indicate the time at which each individual value in the times series is recorded. Whole Sequence Matching: Identification of similar time series data streams such that the complete patterns of data values in time series data streams are compared to determine similarity.
Time Series Data Forecasting
Vincent Cho The Hong Kong Polytechnic University, Hong Kong
INTRODUCTION

Businesses are recognizing the value of data as a strategic asset. This is reflected by the high degree of interest in new technologies such as data mining. Corporations in banking, insurance, retail, and healthcare are harnessing aggregated operational data to help understand and run their businesses (Brockett et al., 1997; Delmater & Hancock, 2001). Analysts use data-mining techniques to extract business information that enables better decision making (Cho et al., 1998; Cho & Wüthrich, 2002). In particular, time series forecasting is one of the major focuses in data mining. Time series forecasting is used in a variety of fields, such as agriculture, business, economics, engineering, geophysics, medical studies, meteorology, and social sciences. A time series is a sequence of data ordered in time, such as hourly temperature, daily stock prices, monthly sales, quarterly employment rates, yearly population changes, and so forth.

Figure 1. Visitors to China ('000), 1991-2004

BACKGROUND

The objective of studying time series is to identify the pattern of how a sequence of data changes over time, so that forecasts can be made to help in scientific decision making. Typical time series forecasting applications are related to economics, finance, and business operations. Economic time trends such as GDP and tourist arrivals (Cho, 2001, 2003); financial time trends such as stock indices (Cho et al., 1999; Cho & Wüthrich, 2002; Wüthrich et al., 1998) and exchange rates; and business operations such as inventory management, yield management (Choi & Cho, 2000), staff planning (Cho & Ngai, 2003), customer demand and spending patterns (Cho & Leung, 2002), telecommunication traffic (Layton et al., 1986), and marketing (Nijs et al., 2001; Dekimpe & Hanssens, 2000) are common forecasting domains. Figure 1 shows a typical time series with an obvious periodic pattern and some disturbances. The pattern can be captured by time series analysis techniques. In order to obtain a more reliable forecast, the time series usually needs to be generated under a stable environment, and the relevant underlying factors determining the series should be included in the analysis. Moreover, an adequate training data set should be captured for model building, and the model should be retrained with a moving window that covers the most recent cases.

MAIN THRUST
The common techniques for time series forecasting are exponential smoothing, ARIMA, transfer functions, Vector Auto-Regression (VAR), and Artificial Neural Network (ANN). The interrelationship among time series is usually described by the cross-correlation. In this article, ARIMA and ANN are presented for time series studies. These two techniques are selected because they are quite different in their natures. ARIMA was developed based on theories of mathematics and statistics, whereas ANN was developed based on the inspiration of nerve structure in human brains. Details are described as follows.
ARIMA

ARIMA models are flexible and widely used in time-series analysis. ARIMA (AutoRegressive Integrated Moving Average) combines three types of processes: autoregression (AR), differencing to strip off the integration (I) of the series, and moving averages (MA). Each of the three types of processes has its own characteristic way of responding to a random disturbance. Identification is a critical step in building an ARIMA(p, d, q)(sp, sd, sq)L model, where p is the AR order that
indicates the number of coefficients of the AR process, d is the number of times the data series must be differenced to induce a stationary series Z, q is the MA order that indicates the number of coefficients of the MA process, sp is the seasonal AR order that indicates the number of coefficients of the seasonal AR process, sq is the seasonal MA order that indicates the number of coefficients of the seasonal MA process, sd is the number of times the data series needs to be seasonally differenced to induce a seasonally stationary series, and L indicates the seasonal periodicity. These parameters usually are determined by inspecting the behavior of the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) (Box et al., 1994). The ACF and PACF of a stationary series should show either a cutoff or a rapidly dying pattern. In practice, the determination of d and sd requires trying different combinations of the possible values until the desired patterns of the ACF and PACF are achieved. Next comes the identification of the parameters p and q, which involves studying the behavior of the ACF and PACF. Based on these procedures, we can establish a tentative ARIMA model. However, all parameters are determined by observation and subjective guessing, which is rather unreliable and inaccurate. Traditionally, identification is a rough procedure applied to a set of data to indicate the kind of representational model that is worthy of further investigation. The specific aim here is to obtain some idea of the values of p, d, and q needed in the general linear ARIMA model, and to obtain estimates for the parameters.
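As an illustration, the inspection step might look as follows in Python. This is a minimal sketch assuming the statsmodels library and a hypothetical monthly series stored in visitors.csv; neither the library nor the file name is prescribed by the article.

```python
# Minimal sketch: inspecting ACF/PACF to guess ARIMA orders (assumes statsmodels).
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical monthly series, e.g. visitor arrivals; replace with real data.
visitors = pd.read_csv("visitors.csv", index_col=0, parse_dates=True).squeeze()

# Difference seasonally (sd = 1, L = 12) before inspecting the correlograms,
# mirroring the identification procedure described above.
stationary = visitors.diff(12).dropna()

plot_acf(stationary, lags=36)   # cutoff/decay pattern suggests the MA orders q and sq
plot_pacf(stationary, lags=36)  # cutoff/decay pattern suggests the AR orders p and sp
plt.show()
```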
Parameter Searching Algorithm

Given the drawback described above in estimating the parameters of the ARIMA model, an algorithm (Cho, 2003) to find the best combination of parameters is devised as shown in Figure 2. The algorithm tries all combinations of parameters, each limited to an integer between zero and two, and searches for the combination with the least Akaike AIC. With this range of parameter searching, the algorithm generates 3^6 = 729 combinations. The range limitations of the parameters are set to restrict the search to a reasonable scope. Parameters greater than two make a model too complicated, and the forecasting ability of a complicated model is seldom better than that of one with fewer coefficients. For example, for a model with p=6, q=5, sp=4, and sq=3, there would be 18 coefficients to estimate, which can hardly be interpreted. Even if a complicated model is slightly better than a simple one in terms of accuracy, the simple one often is chosen because of its simplicity. Therefore, parameters greater than two are rarely used in practice. For example, the series of visitors to China shown in Figure 1 was modeled with AR order p=1, MA order q=0, sp=0, and sq=1, with differencing d=0 and seasonal differencing sd=1. The corresponding AIC is the lowest among all combinations of parameters. Moreover, the solution space was restricted so that the estimated coefficients are all within a predetermined 95% confidence limit.

Figure 2. Algorithm for finding the ARIMA parameters

For each combination of p, d, q, sp, sd, and sq in {0, 1, 2}:
    Execute ARIMA with the set parameters.
    Record the parameters and the corresponding fitting error.
Repeat until all possible combinations are tried.
Report the parameters that produce the least AIC.
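The search in Figure 2 can be sketched as follows. This is a minimal illustration assuming the statsmodels library and monthly seasonality (L = 12), not the author's original program.

```python
# Minimal sketch of the Figure 2 search (assumes statsmodels, monthly seasonality L = 12).
import itertools
import warnings
from statsmodels.tsa.statespace.sarimax import SARIMAX

def search_arima(y, seasonal_period=12, max_order=2):
    """Try every (p, d, q)(sp, sd, sq) combination in {0,...,max_order} and keep the lowest AIC."""
    best = None
    for p, d, q, sp, sd, sq in itertools.product(range(max_order + 1), repeat=6):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = SARIMAX(y, order=(p, d, q),
                              seasonal_order=(sp, sd, sq, seasonal_period)).fit(disp=False)
            if best is None or fit.aic < best[1]:
                best = ((p, d, q, sp, sd, sq), fit.aic)
        except Exception:
            continue  # skip combinations that fail to converge
    return best  # ((p, d, q, sp, sd, sq), AIC)
```

With max_order=2 the loop visits exactly 3^6 = 729 combinations. Applied to a series like the one in Figure 1, such a search should recover low orders such as (p, d, q, sp, sd, sq) = (1, 0, 0, 0, 1, 1), in line with the model reported above.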
Artificial Neural Network (ANN)

Artificial Neural Networks are computing devices inspired by the function of nerve cells in the brain. They are composed of many parallel, interconnected computing units. Each of these performs a few simple operations and communicates results to its neighboring units. In contrast to conventional computer programs, where step-by-step instructions are provided to perform a particular task, neural networks can learn to perform tasks by a process of training on many different examples. Typically, the nodes of a neural network are organized into layers, with each node in one layer having a connection to each node in the next layer, as shown in Figure 3. Associated with each connection is a weight, and each node has an activation value. During pattern recognition, each node operates as a simple threshold device. A node sums all the weighted inputs by multiplying the connection weights with the states of the previous layer's nodes, and the sum is then passed through a typically non-linear activation function. If the result is greater than the threshold value, the node is activated. The results of the output nodes are compared with the known results in the training set, and the error terms are fed backward for weight adjustment in the hidden layers, so as to make the neural network fit the training set more closely.

Figure 3. Neural network: an input layer, a hidden layer, and an output layer, with each node in one layer connected to each node in the next

Neural networks provide a relatively easy way to model and forecast non-linear systems. This gives them an advantage over many current statistical methods used in business and finance, which are primarily linear. They also are very effective in learning cases that contain noisy, incomplete, or even contradictory data. The ability to learn and the capability to handle imprecise data make them very effective in handling financial and business information. A main limitation of neural networks is that they lack explanation capabilities. They do not provide users with details of how they reason with data to arrive at particular conclusions. Neural nets are black boxes: they are provided with input, and the user has to believe in the correctness of the output. Another limitation is the relative slowness of the training process. It typically takes orders of magnitude longer to train a neural net than to build a statistical model.

Although common feed-forward, back-propagation neural networks often are applied to time series applications, there are some ANN models designed specifically for time series forecasting. Here, we would like to introduce Elman's ANN model (Elman, 1990). Elman's network is a recurrent network; the output of the hidden layer is fed back to itself, and, thus, it is especially suitable for fitting time series (Cho, 2003; Jhee & Lee, 1996). The model is illustrated in Figure 4. The activations in the hidden layer at time t-1 are copied into the context vector, which is the input to the network for time t. This is equivalent to having the hidden layer completely and recurrently connected and back-propagating one step in time along the recurrent connections. Therefore, the reaction of the network to the new input is a function of both the new input and the preceding context. What is stored in the context vector at any given time is a compressed trace of all preceding inputs, and this compressed trace influences the manner in which the network reacts to each succeeding input.

Figure 4. Elman network: lagged inputs Yt-1, Yt-2, ..., Yt-L and a context vector feed the hidden layer, which produces the output Yt; the hidden-layer activations are copied back into the context vector at each step
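The recurrence just described can be sketched in a few lines of Python. This is a toy illustration of the forward pass only: the lag count L, the hidden size, and the random, untrained weights are assumptions, and in practice the weights would be learned by back-propagation as described above.

```python
# Minimal sketch of an Elman forward pass (numpy only); sizes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
L, H = 4, 8                        # lagged inputs Yt-1..Yt-L, hidden units
W_in = rng.normal(size=(H, L))     # input   -> hidden weights
W_ctx = rng.normal(size=(H, H))    # context -> hidden weights (the recurrent copy)
W_out = rng.normal(size=(1, H))    # hidden  -> output weights

def elman_one_step(series):
    """Run the series through the network once and return the next-value prediction.
    The context vector carries a compressed trace of all preceding inputs."""
    context = np.zeros(H)
    y_hat = 0.0
    for t in range(L, len(series) + 1):
        lags = np.asarray(series[t - L:t])            # Yt-L ... Yt-1
        hidden = np.tanh(W_in @ lags + W_ctx @ context)
        y_hat = (W_out @ hidden).item()               # prediction for Yt
        context = hidden                              # copied back for the next time step
    return y_hat

print(elman_one_step([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))  # untrained, so the value is arbitrary
```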
Cross Correlation

The relationship among different time series is studied through cross-correlation. The cross-correlation function (CCF) between two series x and y defines the degree of association between the values of x at time t and the values of y at time t+k (where k = 0, 1, 2, 3, etc.). The CCF can be used to check whether the two series are independent or not. If x is a leading indicator of y, then x at time t will be positively related to y at time t+k, where k is a positive integer. However, direct application of cross-correlation to the time series is not appropriate. The two series first should be transformed in such a way that they are jointly covariance stationary; then, their interrelationships can be described easily by the cross-correlation function. Also, Haugh (1976) pointed out that if series x and y are themselves autocorrelated, then the lagged cross-correlation estimates can be difficult to interpret. The autocorrelation, which appears in each of the series, can inflate the variance of the cross-correlation estimates. Moreover, the cross-correlation estimates at different lags will be correlated. This can happen even if the two series are, in fact, independent (so that the expected cross-correlation is zero). Thus, calculating the correlation between time series can lead to a spurious result. Haugh's (1976) approach involves first fitting ARIMA models to each of the series and then calculating the cross-correlation coefficients of the two residual series. Similarly, we also introduce another method, ANN, which is used to purify the time series so that the residuals are free of autocorrelation and are stationary. If the residual series has significant cross-correlation coefficients at positive lags, then one of the series is a leading indicator. For coincident and lagged indicators, we can expect significant correlation coefficients at zero and negative lags, respectively. With the above arguments, the stationary residual series are obtained by using two techniques, ARIMA and ANN. Similar to Haugh's (1976) approach, both ARIMA and ANN are used to fit the series, residuals are calculated for each method, and cross-correlations are then calculated on the residuals. A natural question is which method is better; there is no definite answer, and it depends on which gives a more realistic explanation.
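Haugh's prewhitening idea can be sketched as follows. This is a minimal illustration assuming the statsmodels library; the (1, 1, 1) ARIMA orders are placeholders rather than the models actually fitted in the article.

```python
# Minimal sketch: cross-correlate ARIMA residuals to look for lead/lag relationships
# (assumes statsmodels; the (1, 1, 1) orders are placeholders, not the article's models).
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import ccf

def residual_ccf(x, y, order=(1, 1, 1), max_lag=12):
    """Fit ARIMA to each series, then compute the CCF of the two residual series.
    Large values at non-zero lags suggest a lead/lag relationship between x and y
    (check the lag convention of the ccf function in your statsmodels version)."""
    rx = ARIMA(x, order=order).fit().resid
    ry = ARIMA(y, order=order).fit().resid
    return ccf(rx, ry)[: max_lag + 1]
```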
FUTURE TRENDS

More advanced techniques for nonlinear time series forecasting have been developed recently. Wavelet analysis attempts to decompose a time series into time-frequency space simultaneously, so that information can be obtained on both the amplitude of any periodic signals within the series and on how this amplitude varies with time. General Autoregressive (GAR), Threshold Autoregressive (TAR), Smooth Transition Autoregressive (STAR), and Markov Switching models have been developed based on theories of stochastic processes. Other specifically designed ANN models have also been developed for particular time series applications. These models can describe much more complex time series that cannot be handled by traditional linear ARIMA models.
CONCLUSION

This article elaborates two time series techniques, ARIMA and ANN, and proposes to find the interrelationships among time series with cross-correlation analysis. Non-stationarity and autocorrelation of the series are corrected by fitting ARIMA and ANN models, which enables us to investigate the interrelationships among the series. Knowledge of the interrelationships among different time series supports more strategic planning. For instance, if the interrelationships among different tourist arrivals are found, the planning of tourism strategy will be more thoughtful; if the interrelationships among various economic indicators are found, the mechanism of the economies will be much clearer. In the usual case, ANN will outperform ARIMA, as the fitting uses a nonlinear technique. However, ANN does not have much explanatory power, and a future direction would be further comparison of the performance of various advanced time series forecasting techniques.
ACKNOWLEDGMENT

This research was supported by the Hong Kong Polytechnic University under the Grant A628.
REFERENCES

Box, G.E.P., Jenkins, G.M., & Reinsel, G.C. (1994). Time series analysis: Forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall, Inc.
Brocket, P.L., Cooper, W.W., Golden, L.L., & Xia, X. (1997). A case study in applying neural networks to predicting insolvency for property and casualty insurers. Journal of the Operational Research Society, 48, 11531162. Choi, T.Y., & Cho, V. (2000). Towards a knowledge discovery framework for yield management in the Hong Kong hotel industry. Hospitality Management, 19, 17-31. Cho, V. (2001). Tourism forecasting and its relationship with leading economic indicators. Journal of Hospitality and Tourism Research, 25(4), 399-420. Cho, V. (2003). A comparison of three different approaches to tourist arrival forecasting. Tourism Management, 24, 323-330. Cho, V., & Leung, P. (2002). Towards using knowledge discovery techniques in database marketing for the tourism industry. Journal of Quality Assurance in Hospitality and Tourism, 3(4), 109-131. Cho, V., & Ngai, E. (2003). Data mining for selection of insurance sales agents. Expert Systems, 20(3), 123-132. Cho, V., & Wüthrich, B. (2002). Distributed mining of classification rules. Knowledge and Information Systems, 4, 1-30. Cho, V., Wuthrich, B., & Zhang, J. (1999). Text processing for classification. Journal of Computational Intelligence in Finance, 7(2), 2-6. Dekimpe, M.G., & Hanssens, D.M. (2000). Time-series models in marketing: Past, present and future. International Journal of Research in Marketing, 17, 183-193. Delmater, R., & Hancock, M. (2001). Data mining explained: A manager’s guide to customer-centric business intelligence. Boston: Digital Press. Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211. Haugh, I.D. (1976). Checking the independence of two covariance-stationary time series: A univariate residual cross-correlation approach. Journal of the American Statistical Association, 71, 378-485. Jhee, W.C., & Lee, J.K. (1996). Performance of neural networks in managerial forecasting. In R.R. Trippi, & E. Turban (Eds.), Neural networks in finance and investing (pp. 703-733). Chicago, IL: Irwin. Layton, A.P., Defris, L.V., & Zehnwirth, B. (1986). An international comparison of economic leading indicators of telecommunication traffic. International Journal of Forecasting, 2, 413-425.
Nijs, V.R., Dekimpe, M.G., Steenkamps, J.E.M., & Hanssens, D.M. (2001). The category-demand effects of price promotions. Marketing Science, 20(1), 1-22.

Wüthrich, B. et al. (1998). Daily prediction of major stock indices from textual WWW data. HKIE Transactions, 5(3), 151-156.

KEY TERMS

Akaike Information Criterion (AIC) and Schwartz Bayesian Criterion (SBC): The two most commonly used model selection criteria. They trade off the fitness of a model against its complexity. If the AIC (or SBC) of model A is smaller than that of model B, model A is said to be better than model B.

Autocorrelation: Measures the correlation between observations of a time series and the same values at a fixed time offset (lag).

Demand Forecasting: Projection of the estimated level of goods or service demand during the months or years covered by a marketing plan.

Differencing: Removes trend from a time series. This is an effective way to provide a clearer view of the true underlying behavior of the series.

Residual: The part of a variable that is not explained by the model. It can be defined as the difference between the actual and predicted values.

Stationary Time Series: A time series is called stationary if its mean, variance, and autocovariance (autocorrelation) are independent of time; that is, those values are constant over time.

Time Series: A sequence of observations or events that are ordered in time. The successive observations will be dependent on time or previous events.
Topic Maps Generation by Text Mining Hsin-Chang Yang Chang Jung University, Taiwan Chung-Hong Lee National Kaohsiung University of Applied Sciences, Taiwan
INTRODUCTION Topic maps provide a general, powerful, and user-oriented way to navigate the information resources under consideration in any specific domain. A topic map provides a uniform framework that not only identifies important subjects from an entity of information resources and specifies the resources that are semantically related to a subject, but also explores the relations among these subjects. When a user needs to find some specific information on a pool of information resources, he or she only needs to examine the topic maps of this pool, select the topic that seems interesting, and the topic maps will display the information resources that are related to this topic, as well as its related topics. The user will also recognize the relationships among these topics and the roles they play in such relationships. With the help of the topic maps, you no longer have to browse through a set of hyperlinked documents and hope that you may eventually reach the information you need in a finite amount of time, while knowing nothing about where to start. You also don’t have to gather some words and hope that they may perfectly symbolize the idea you’re interested in, and be well-conceived by a search engine to obtain reasonable result. Topic maps provide a way to navigate and organize information, as well as create and maintain knowledge in an infoglut. To construct a topic map for a set of information resources, human intervention is unavoidable at the present time. Human effort is needed in tasks such as selecting topics, identifying their occurrences, and revealing their associations. Such a need is acceptable only when the topic maps are used merely for navigation purposes and when the volume of the information resource is considerably small. However, a topic map should not only be a topic navigation map. The volume of the information resource under consideration is generally large enough to prevent the manual construction of topic maps. To expand the applicability of topic maps, some kind of automatic process should be involved during the construction of the maps. The degree of automation in such a construction process may vary for different users with different needs. One person may need only a friendly
interface to automate the topic map authoring process, while another may try to automatically identify every component of a topic map for a set of information resources from the ground up. In this article, we recognize the importance of topic maps not only as a navigation tool but also as a desirable scheme for knowledge acquisition and representation. According to such recognition, we try to develop a scheme based on a proposed text-mining approach to automatically construct topic maps for a set of information resources. Our approach is the opposite of the navigation task performed by a topic map to obtain information. We extract knowledge from a corpus of documents to construct a topic map. Although currently the proposed approach cannot fully construct the topic maps automatically, our approach still seems promising in developing a fully automatic scheme for topic map construction.
BACKGROUND Topic map standard (ISO, 2000) is an emerging standard, so few works are available about the subject. Most of the early works about topic maps focus on providing introductory materials (Ahmed, 2002; Pepper, 1999; Beird, 2000; Park & Hunting, 2002). Few of them are devoted to the automatic construction of topic maps. Two works that address this issue were reported in Rath (1999) and Moore (2000). Rath discussed a framework for automatic generation of topic maps according to a so-called topic map template and a set of generation rules. The structural information of topics is maintained in the template. To create the topic map, they used a generator to interpret the generation rules and extract necessary information that fulfills the template. However, both the rules and the template are to be constructed explicitly and probably manually. Moore discussed topic map authoring and how software may support it. He argued that the automatic generation of topic maps is a useful first step in the construction of a production topic map. However, the real value of such a map comes through the involvement of people in the process. This argument is true if the knowledge that contained in the topic maps can only be ob-
tained by human efforts. A fully automatic generation process is possible only when such knowledge may be discovered from the underlying set of information resources through an automated process, which is generally known as knowledge discovery from texts, or text mining (Hearst, 1999; Lee & Yang, 1999; Wang, 2003; Yang & Lee, 2000).
MAIN THRUST

We briefly describe the text-mining process and the generation process of topic maps in this section.

The Text-Mining Process

Before we can create topic maps, we first perform a text-mining process on the set of information resources to reveal the relationships among the information resources. Here, we only consider those information resources that can be represented in regular texts. Examples of such resources are Web pages, ordinary books, technical specifications, manuals, and so forth. The set of information resources is collectively known as the corpus, and an individual resource is referred to as a document in the following text. To reveal the relationships between documents, the popular self-organizing map (SOM) algorithm (Kohonen, Kaski, Lagus, Salojärvi, Honkela, Paatero, & Saarela, 2000) is applied to the corpus to cluster documents. We adopt the vector space model (Baeza-Yates & Ribiero-Neto, 1999) to transform each document in the corpus into a binary vector. These document vectors are used as input to train the map. We then apply two kinds of labeling processes to the trained map and obtain two feature maps, namely the document cluster map (DCM) and the word cluster map (WCM). In the document cluster map, each neuron represents a document cluster that contains several similar documents with high word co-occurrence. In the word cluster map, each neuron represents a cluster of words revealing the general concept of the corresponding document cluster that is associated with the same neuron in the document cluster map.

The text-mining process described in the preceding paragraph provides a way for us to reveal the relationships between the topics of the documents. Here, we introduce a method to identify topics and the relationships between them. The method also arranges these topics in a hierarchical manner according to their relationships. As we mention earlier in this article, a neuron in the DCM represents a cluster of documents containing words that often co-occurred in these documents. Besides, documents that associate with neighboring neurons contain similar sets of words. Thus, we may construct a supercluster by combining neighboring neurons. To form a supercluster, we first define the distance between two clusters:

D(i, j) = H(||Gi − Gj||),    (1)

where i and j are the neuron indices of the two clusters, and Gi is the two-dimensional grid location of neuron i. ||Gi − Gj|| measures the Euclidean distance between the two coordinates Gi and Gj. H(x) is a bell-shaped function that has a maximum value when x = 0. We also define the dissimilarity between two clusters as

δ(i, j) = ||wi − wj||,    (2)

where wi denotes the synaptic weight vector of neuron i. We may then compute the supporting cluster similarity Si for a neuron i from its neighboring neurons by the equations

s(i, j) = doc(i) doc(j) F(D(i, j) δ(i, j)) and Si = ∑_{j∈Bi} s(i, j),    (3)

where doc(i) is the number of documents associated with neuron i in the document cluster map, and Bi is the set of neuron indices in the neighborhood of neuron i. The function F: R+ → R+ is a monotonically increasing function. A dominating neuron is the neuron that has locally maximal supporting cluster similarity. We may select dominating neurons by the following algorithm:

•	Step 1: Find the neuron with the largest supporting cluster similarity. Select this neuron as the dominating neuron.
•	Step 2: Eliminate its neighbor neurons so they will not be considered as dominating neurons.
•	Step 3: If no neuron is left, or the number of dominating neurons exceeds a predetermined value, stop. Otherwise, go to Step 1.

A dominating neuron may be considered as the centroid of a supercluster, which contains several clusters. We assign every cluster to some supercluster by the following method. The ith cluster (neuron) is assigned to the kth supercluster if

δ(i, k) = min_{l} δ(i, l), where l ranges over the superclusters.    (4)

A supercluster may be thought of as a category that contains several subcategories. Let Ck denote the set of neurons that belong to the kth supercluster, or category. The category topics are selected from those words that associate with these neurons in the WCM. For all neurons j ∈ Ck, we select the n*th word as the category topic if

∑_{j∈Ck} wjn* = max_{1≤n≤N} ∑_{j∈Ck} wjn.    (5)
Equation 5 selects the word that is the most important to a supercluster, because the components of the synaptic weight vector of a neuron reflect the willingness that the neuron wants to learn the corresponding input data, that is, words. The topics that are selected by Equation 5 form the top layer of the category hierarchy. To find the descendants of these topics in the hierarchy, we may apply the above process to each supercluster and obtain a set of subcategories. These subcategories form the new superclusters that are on the second layer of the hierarchy. The category structure can then be revealed by recursively applying the same category generation process to each newfound supercluster. We decrease the size of the neighborhood in selecting dominating neurons when we try to find the subcategories.
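The procedure from Equation 1 to Equation 5 can be sketched as follows. This is an illustrative implementation under stated assumptions (a Gaussian kernel for H, the identity for F, Equation 3 read as a plain product, and an arbitrary neighborhood radius); it is not the authors' implementation.

```python
# Illustrative sketch of supercluster formation and topic selection (not the authors' code).
# Assumptions: H is Gaussian, F is the identity, Eq. 3 is used as printed, radius is arbitrary.
import numpy as np

def select_dominating_neurons(grid_xy, weights, doc_counts, neighborhood=2.0, max_dominating=10):
    grid = np.asarray(grid_xy, dtype=float)        # neuron grid locations Gi
    W = np.asarray(weights, dtype=float)           # synaptic weight vectors wi
    n = len(grid)
    D = np.exp(-np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1))   # Eq. 1, bell-shaped H
    delta = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)             # Eq. 2
    S = np.zeros(n)
    for i in range(n):                              # Eq. 3 with F = identity (assumption)
        Bi = [j for j in range(n)
              if j != i and np.linalg.norm(grid[i] - grid[j]) <= neighborhood]
        S[i] = sum(doc_counts[i] * doc_counts[j] * D[i, j] * delta[i, j] for j in Bi)
    dominating, remaining = [], set(range(n))
    while remaining and len(dominating) < max_dominating:                       # Steps 1-3
        i = max(remaining, key=lambda k: S[k])
        dominating.append(i)
        remaining -= {j for j in remaining if np.linalg.norm(grid[i] - grid[j]) <= neighborhood}
    # Eq. 4: each neuron joins the supercluster of the most similar dominating neuron
    assignment = {i: min(dominating, key=lambda k: delta[i, k]) for i in range(n)}
    return dominating, assignment

def category_topic(cluster_neurons, weights, vocabulary):
    """Eq. 5: the word with the largest summed weight over the supercluster's neurons."""
    W = np.asarray(weights, dtype=float)
    totals = W[list(cluster_neurons)].sum(axis=0)
    return vocabulary[int(np.argmax(totals))]
```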
Automatic Topic Maps Construction

The text-mining process described in the preceding section reveals the relationships between documents and words. Furthermore, it may identify the topics in a set of documents, reveal the relationships among the topics, and arrange the topics in a hierarchical manner. The result of such a text-mining process can be used to construct topic maps. We discuss the steps in topic map construction in the following subsections.
Identifying Topics and Topic Types

The topics in the constructed topic map can be selected as the topics identified by Equation 5. All the identified topics in every layer of the hierarchy can be used as topics. Because topics in different layers of the hierarchy represent different levels of significance, we may constrain the significance of topics in the map by limiting the depth of hierarchy from which we select topics. If we only used topics in higher layers, the number of topics is small, but those topics represent more important topics. The significance level can be set explicitly in the beginning of the construction process or determined dynamically during the construction process. One way to determine the number of topics is by considering the size of the self-organizing map. The topic types can also be determined by the constructed hierarchy. As we mention earlier in this article, a topic on higher layers of the hierarchy represents a more important concept than those on lower layers. For a parent-child relationship between two concepts on two adjacent layers, the parent topic should represent an important concept of its child topic. Therefore, we may use the parent topic as the type of its child topics. Such usage also fulfills the requirement of the topic map standard (that a topic type is also a topic).
Identifying Topic Occurrences

The occurrences of an identified topic are easy to obtain after the text-mining process. Because a topic is a word labeled to a neuron in the WCM, its occurrences can be assigned as the documents labeled to the same neuron in the DCM. That is, let a topic t be labeled to neuron A in the WCM; the occurrences of t should then be the documents labeled to the same neuron A in the DCM. For example, if the topic 'text mining' was labeled to the 20th neuron in the WCM, all the documents labeled to the 20th neuron in the DCM should be the occurrences of this topic. Furthermore, we may create more occurrences of this topic by allowing the documents labeled to lower levels of the hierarchy to also be included. For example, if neuron 20 in the preceding example were located on the second level of a topic hierarchy, we could also allow the clusters of documents associated with topics below this level to be occurrences of this topic. Another approach is to use the DCM directly, such that we also include the documents associated with the neighboring neurons as its occurrences.
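A minimal sketch of this lookup is given below; the label dictionaries are hypothetical stand-ins for the WCM and DCM labelings produced by the text-mining process.

```python
# Minimal sketch: occurrences of a topic are the documents labeled to the same neuron
# in the DCM as the topic's neuron in the WCM (the dictionaries below are hypothetical).
wcm_labels = {"text mining": 20, "neural network": 5}         # topic -> neuron index (WCM)
dcm_labels = {20: ["doc03", "doc17", "doc42"], 5: ["doc08"]}  # neuron index -> documents (DCM)

def occurrences(topic):
    return dcm_labels.get(wcm_labels[topic], [])

print(occurrences("text mining"))   # ['doc03', 'doc17', 'doc42']
```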
Identifying Topic Associations

The associations among topics can be identified in two ways with our method. The first is to use the developed hierarchy structure among topics. A topic is associated with another if a path exists between them. We should limit the lengths of such paths to avoid establishing associations between pairs of unrelated topics. For example, if we limited the length to 1, only topics that are direct parents and children are associated with the topic under consideration. The type of such associations is essentially an instance-class association. The second way to identify topic associations simply examines the WCM and finds the associations. To establish associations to a topic t, we first find the neuron A to which t is labeled. We then establish associations between t and every topic associated with some neighboring neuron of
A. The neighboring neurons are selected from a neighborhood of A that is arbitrarily set by the creator. Obviously, a large neighborhood will create many associations. We should at least create associations between t and other topics associated with the same neuron A, because they are considered closely related topics in the text-mining process. The association types are not easy to reveal by this method, because we do not fully reveal the semantic relations among neurons after the text-mining process. An alternative method to determine the association type between two topics is to use the semantic relation defined in a well-developed ontology, such as WordNet (Fellbaum, 1998).
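A minimal sketch of the neighborhood-based association rule follows; the grid coordinates, labels, and radius are illustrative assumptions rather than values from the article.

```python
# Minimal sketch: associate topic t with every topic labeled to a neuron within a chosen
# neighborhood of t's neuron on the SOM grid (coordinates and labels are hypothetical).
import numpy as np

topic_neuron = {"text mining": 20, "clustering": 21, "wavelets": 55}   # topic -> neuron
neuron_xy = {20: (2, 0), 21: (2, 1), 55: (5, 5)}                       # neuron -> grid location

def associations(topic, radius=1.5):
    x0 = np.asarray(neuron_xy[topic_neuron[topic]], dtype=float)
    return [t for t in topic_neuron
            if t != topic
            and np.linalg.norm(np.asarray(neuron_xy[topic_neuron[t]], dtype=float) - x0) <= radius]

print(associations("text mining"))   # ['clustering']
```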
FUTURE TRENDS

Topic maps will be an emergent standard for information navigation in the near future. Their topic-driven navigation scheme allows users to retrieve their documents without tedious browsing of the whole infoglut. However, the generation of topic maps still limits their spread. An editor will help, provided it can generate the necessary ingredients of a topic map automatically, or at least semi-automatically. However, such a generation process is difficult, because we need to reveal the semantics of the documents. In this respect, data-mining techniques will help. Therefore, the future trends of topic map generation should be as follows:

•	Applying knowledge discovery techniques to discover topics and their associations without the intervention of human beings
•	Incorporating the topic map standard and Web representation languages, such as XML, to promote the usage of topic maps
•	Developing a user interface that allows users to create and edit topic maps with the aid of automatically generated ingredients
•	Developing a tool to integrate or migrate existing topic maps
•	Constructing metadata in topic maps for applications on the semantic Web (Daconta, Obrst, & Smith, 2003)
•	Mining the existing topic maps from their structures and ingredients

CONCLUSION

In this article, we present a novel approach for semi-automatic topic map construction. The approach starts from applying a text-mining process to a set of information resources. Two feature maps, namely the document cluster map and the word cluster map, are created after the text-mining process. We then apply a category hierarchy development process to reveal the hierarchical structure of the document clusters. Some topics are also identified by such a process to indicate the general subjects of those clusters located in the hierarchy. We may then automatically create topic maps according to the two maps and the developed hierarchy. Although our method may not identify all the kinds of components that should construct a topic map, our approach seems promising because the text-mining process achieves satisfactory results in revealing implicit topics and their relationships.

REFERENCES
Ahmed, K. (2002). Introducing topic maps. XML Journal 3(10), 22-27. Baeza-Yates, R., & Ribiero-Neto, B. (1999). In Modern information retrieval (Chapter 2). Reading, MA: AddisonWesley. Beird, C. (2000). Topic map cartography. Proceedings of the XML Europe 2000 GCA Conference, Paris, June 1216. Daconta, M. C., Obrst, L. J., & Smith, K. T. (2003). The semantic Web: A guide to the future of XML, Web services, and knowledge management. Indianapolis: Wiley. Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press. Hearst, M. A. (1999). Untangling text data mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, USA, June 20-26. ISO (2000). ISO/IEC 13250, Information technology - SGML Applications - Topic Maps. Geneva, Switzerland: ISO. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574-585. Lee, C. H., & Yang, H. C. (1999). A web text mining approach based on a self-organizing map. Proceedings of the Second ACM Workshop on Web Information and Data Management (pp. 59-62), Kansas City, Missouri, USA, November 5-6. Moore, G. (2000). Topic map technology — the state of the art. In XML Europe 2000, Paris, France.
Park, J., & Hunting, S. (2002). XML topic maps: Creating and using topic maps for the Web, June 12-16. Reading, MA: Addison-Wesley. Pepper, S. (1999). Navigating haystacks, discovering needles. Markup Languages: Theory and Practice, 1(4), 41-68. Rath, H. H. (1999). Technical issues on topic maps. Proceedings of the Metastructures 1999 Conference, GCA, Montreal, Canada, August 16-18. Wang, J. (2003). Data mining: Opportunities and challenges. Hershey, PA: Idea Group. Yang, H. C., & Lee, C. H. (2000). Automatic category structure generation and categorization of Chinese text documents. Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 673-678), France, September 1316.
KEY TERMS Neural Networks: Learning systems, designed by analogy with a simplified model of the neural connections in the brain, that can be trained to find nonlinear relationships in data. Self-Organizing Maps: A neural network model developed by Teuvo Kohonen that has been recognized as one of the most successful models. The model uses an unsupervised learning process to cluster high-dimensional data and map them into a one- or two-dimensional feature map. The relationships among data can be reflected by the geometrical distance between their mapped neurons.
Text Mining: The application of analytical methods and tools to usually unstructured textual data for the purpose of identifying patterns and relationships such as classification, prediction, estimation, or affinity grouping. Topic Associations: The relationships between two or more topics in a topic map. Topic Maps: A navigation scheme for exploring information resources in a topic-driven manner. When a set of information resources are provided, their topics as well as the associations among topics are identified and are used to form a map that guides the user through the topics. Topic Occurrences: A topic may be linked to one or more information resources that are deemed to be relevant to the topic in some way. Such resources are called occurrences of the topic. Topics: The object or node in the topic map that represents the subject being referred to. However, the relationship between topics and subjects is (or should be) one to one, with every topic representing a single subject, and every subject being represented by just one topic. Topic Types: Topics can be categorized according to their kind. In a topic map, any given topic is an instance of zero or more topic types. This corresponds to the categorization inherent in the use of multiple indexes in a book (index of names, index of works, index of places, etc.), and to the use of typographic and other conventions to distinguish different types of topics.
Transferable Belief Model Philippe Smets Université Libre de Bruxelles, Belgium
INTRODUCTION This note is a very short presentation of the transferable belief model (TBM), a model for the representation of quantified beliefs based on belief functions. Details must be found in the recent literature. The TBM covers the same domain as the subjective probabilities except probability functions are replaced by belief functions which are much more general. The model is much more flexible than the Bayesian one and allows the representation of states of beliefs not adequately represented with probability functions. The theory of belief functions is often called the DempsterShafer’s theory, but this term is unfortunately confusing.
The Various Dempster-Shafer’s Theories Dempster-Shafer’s theory covers several models that use belief functions. Usually their aim is in the modeling of someone’s degrees of belief, where a degree of belief is understood as strength of opinion. They do not cover the problems of vagueness and ambiguity for which fuzzy sets theory and possibility theory are more appropriate. Beliefs result from uncertainty. Uncertainty can result from a random process (the objective probability case), or from a lack of information (the subjective case). These two forms of uncertainty are usually quantified by probability functions. Dempster-Shafer’s theory is an ambiguous term as it covers several models. One of them, the “transferable belief model” is a model for the representation of quantified beliefs developed independently of any underlying probability model. Based on Shafer’s initial work (Shafer, 1976) it has been largely extended since (Smets,1998; Smets & Kennes, 1994; Smets & Kruse, 1997).
The Representation of Quantified Beliefs Suppose a finite set of worlds Ω called the frame of discernment. The term “world” covers concepts like state of affairs, state of nature, situation, context, value of a variable... One world corresponds to the actual
world. An agent, denoted You (but it might be a sensor, a robot, a piece of software), does not know which world corresponds to the actual world because the available data are imperfect. Nevertheless, You have some idea, some opinion, about which world might be the actual one. So for every subset A of Ω, You can express Your beliefs, i.e., the strength of Your opinion that the actual world belongs to A. This strength is denoted bel(A). The larger bel(A), the stronger You believe that the actual world belongs to A.
Credal vs. Pignistic Levels Intrinsically beliefs are not directly observable properties. Once a decision must be made, their impact can be observed. In the TBM, we have described a two level mental model in order to distinguish between two aspects of beliefs, belief as weighted opinions, and belief for decision making (Smets, 2002a). The two levels are: the credal level, where beliefs are held, and the pignistic level, where beliefs are used to make decisions (credal and pignistic derive from the Latin words “credo”, I believe and “pignus”, a wage, a bet). Usually these two levels are not distinguished and probability functions are used to quantify beliefs at both levels. Once these two levels are distinguished, as done in the TBM, the classical arguments used to justify the use of probability functions do not apply anymore at the credal level, where beliefs will be represented by belief functions. At the pignistic level, the probability function needed to compute expected utilities are called pignistic probabilities to enhance they do not represent beliefs, but are just induced by them.
BACKGROUND

Belief Function Inequalities

The TBM is a model developed to represent quantified beliefs. The TBM departs from the Bayesian approach in that we do not assume that bel satisfies the additivity encountered in probability theory. Instead, we get inequalities like bel(A∪B) ≥ bel(A) + bel(B) − bel(A∩B).
Basic Belief Assignment

Definition: Let Ω be a frame of discernment. A basic belief assignment (bba) is a function m: 2^Ω → [0, 1] that satisfies ∑_{A⊆Ω} m(A) = 1. The term m(A) is called the basic belief mass (bbm) given to A. The bbm m(A) represents that part of Your belief that supports A, i.e., the fact that the actual world belongs to A, without supporting any more specific subset, by lack of adequate information. As an example, consider that You learn that the actual world belongs to A, and You know nothing else about its value. Then some part of Your beliefs will be given to A, but no subset of A will get any positive support. In that case, You would have m(A) > 0, m(B) = 0 for all B≠A, B≠Ω, and m(Ω) = 1 − m(A).
Belief Functions

The bba m does not in itself quantify Your belief that the actual world belongs to A. Indeed, the bbm m(B) given to any non-empty subset B of A also supports that the actual world belongs to A. Hence, the degree of belief bel(A) is obtained by summing all the bbms m(B) over the non-empty subsets B of A. The degree of belief bel(A) quantifies the total amount of justified specific support given to the fact that the actual world belongs to A. We say justified because we include in bel(A) only the bbms given to subsets of A. For instance, consider two distinct elements x and y of Ω. The bbm m({x, y}) given to {x, y} could support x if further information indicates this; however, given the available information, the bbm can only be given to {x, y}. We say specific because the bbm m(Ø) is not included in bel(A), as it is given to subsets that support not only A but also not-A. The originality of the TBM comes from the non-null masses that may be given to non-singletons of Ω. In the special case where only singletons get positive bbms, the function bel is a probability function. In that last case, the TBM reduces itself to the Bayesian theory. Shafer assumed m(Ø) = 0. In the TBM, such a requirement is not assumed. The mass m(Ø) reflects both the non-exhaustivity of the frame and the existence of some conflict between the beliefs produced by the various belief sources.
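A minimal sketch of this definition is given below; the frame and the masses are illustrative, not taken from the article.

```python
# Minimal sketch: computing bel(A) from a bba m by summing the masses of the
# non-empty subsets of A (the frame and the bba below are illustrative).
def bel(A, m):
    A = frozenset(A)
    return sum(mass for B, mass in m.items() if B and B <= A)

omega = frozenset({"x", "y", "z"})
m = {frozenset({"x", "y"}): 0.6, omega: 0.4}    # m(Ω) = 0.4 keeps the bba normalized

print(bel({"x", "y"}, m))   # 0.6
print(bel({"x"}, m))        # 0.0 -- the mass on {x, y} is not specific enough to support {x}
```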
Expressiveness of the TBM

The advantage of the TBM over the classical Bayesian approach resides in its large flexibility, its ability to represent every state of partial belief, up to the state of total ignorance. In the TBM, total ignorance is represented by the vacuous belief function, i.e., a belief function such that m(Ω) = 1 and m(A) = 0 for all A with A≠Ω. Hence bel(Ω) = 1 and bel(A) = 0 for every strict subset A of Ω. It expresses that all You know is that the actual world belongs to Ω. The representation of total ignorance in probability theory is hard to achieve adequately, most proposed solutions being doomed to contradictions. With the TBM, we can of course represent every state of belief: full ignorance, partial ignorance, probabilistic beliefs, or even certainty (m(A) = 1 corresponds to A being certain).
Example

Let us consider a somewhat reliable witness in a murder case who testifies to You that the killer is a male. Let 0.7 be the reliability You give to the testimony (0.7 is the probability, the belief, that the witness is reliable). Suppose furthermore that a priori You have an equal belief that the killer is a male or a female. A classical probability analysis would compute the probability P(M) of M = "the killer is a male" given the witness testimony as: P(M) = P(M|Reliable)P(Reliable) + P(M|Not Reliable)P(Not Reliable) = 1.0 × 0.7 + 0.5 × 0.3 = 0.85, where "Reliable" and "Not Reliable" refer to the witness's reliability. The value 0.85 is the sum of the probability of M given the witness is reliable (1.0) weighted by the probability that the witness is reliable (0.7), plus the probability of M given the witness is not reliable (0.5, the proportion of males among the killers) weighted by the probability that the witness is not reliable (0.3). The TBM analysis is different. You have some reason to believe that the killer is a male, as so said the witness. But this belief is not total (maximal), as the witness might be wrong. The 0.7 is the belief You give to the fact that the witness tells the truth (is reliable), in which case the killer is male. The remaining 0.3 mass is given to the fact that the witness is not really telling the truth (he lies, or he might have seen a male, but this was not the killer). In that last case, the testimony does not tell You anything about the killer's sex. So the TBM analysis will give a belief 0.7 to M: bel(M) = 0.7 (and bel(Not M) = 0). The information relative to the population of killers (the 0.5) is not relevant to Your problem. Similarly, the fact that almost all crimes are committed by the members of some particular group of individuals may not be used to prove your case.
Conditioning

Suppose You have some belief on Ω represented by the bba m. Then some further evidence becomes available to You and this piece of information implies that the actual
world cannot be one of the worlds in not-A. Then the mass m(B) that initially was supporting that the actual world is in B now supports that the actual world is in B∩A as every world in not-A must be “eliminated”. So m(B) is transferred to B∩A after conditioning on A. (The model gets its name from this transfer operation.) This operation leads to the conditional bba. This rule is called the Dempster’s rule of conditioning.
Example
Continuing with the murder case, suppose there are only two potential male suspects: Phil and Tom, so m({Phil, Tom}) = 0.7. Then You learn that Phil is not the killer. The initial testimony now supports that the killer is Tom. The reliability 0.7 You gave to the testimony initially supported “the killer is Phil or Tom”. The new information about Phil implies that the value 0.7 now supports “the killer is Tom”. After conditioning, a mass can be given to Ø. It represents the conflict between the previous beliefs given to not-A with the new conditioning piece of evidence that states that A holds. In probability theory and in the model initially developed by Shafer, this conflict is hidden. In the TBM, we keep it and use it to develop expert systems built for conflict resolutions. Note that some positive mass given to Ø may also result from the non exhaustivity of the frame of discernment.
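A minimal sketch of the transfer operation, reusing the numbers from this example, is given below. The element "other" is a stand-in for the rest of the frame (e.g., a female killer), and the helper itself is illustrative rather than any standard library function.

```python
# Minimal sketch of Dempster's rule of conditioning: every mass m(B) is transferred to B∩A.
from collections import defaultdict

def condition(m, A):
    A = frozenset(A)
    out = defaultdict(float)
    for B, mass in m.items():
        out[B & A] += mass          # transfer m(B) to B ∩ A; mass landing on Ø records the conflict
    return dict(out)

omega = frozenset({"Phil", "Tom", "other"})            # "other" stands in for the rest of the frame
m = {frozenset({"Phil", "Tom"}): 0.7, omega: 0.3}      # belief induced by the testimony

print(condition(m, omega - {"Phil"}))
# -> mass 0.7 on {Tom} and mass 0.3 on {Tom, other}, as in the example above
```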
Further Results Since Shafer seminal work, many new concepts have been developed. For lack of space, we cannot present them. Reader is referred to the author web site for downloadable papers (http://iridia.ulb.ac.be/ ~ psmets/). On the theoretical side, the next issues have been solved, among which: 1. 2. 3.
4. 5. 6.
The concept of open and close world assumptions, so non-exhaustive frames of discernment are allowed. The disjunctive rule of combination, the general combination rules, the belief function negation. The generalized Bayesian theorem to build a belief on space Y from the conditional belief on space X given each value of Y and an observation on X (Delmotte & Smets, 2004). The pignistic transformation to build the probability function needed for decision-making. The discounting of beliefs produced by partially reliable sources. The manipulation of the conflict (Lefevre, Colot, & Vannooremberghe, 2002).
8. 9. 10. 11.
13. 14. 15. 16.
The canonical decompositions of any belief functions in simple support functions. The specialization, cautious combinations, alpha-junctions. The belief functions defined on the reals. Belief ordering and least commitment principle. Doxastic independence that translates stochastic independence into belief function domain (Ben Yaghlane, Smets, & Mellouli, 2001). Evidential networks, directed or undirected, for the efficient propagation of beliefs in networks (Shenoy, 1997). Fast Mobius transforms to transform masses into belief and plausibility functions and vice versa. Approximation methods (Denoeux & Ben Yaghlane, 2002; Haenni & Lehmann, 2002). Matrix notation for manipulating belief functions (Smets, 2002b). Axiomatic justifications of most concepts.
The TBM has been applied to many problems among which: 1.
Kalman filters and joint tracking and classifications. 2. Data association and determination of the number of detected objects (Ayoun & Smets, 2001). 3. Data clustering (Denoeux & Masson, 2004). 4. Expert systems for conflict management (Milisavljevic, Bloch, & Acheroy, 2000). 5. Belief assessment (similarity measures, frequencies). 6. TBM classifiers: case base and model base. 7. Belief decision trees (Elouedi, Mellouli, & Smets, 2001). 8. Planning and pre-posterior analyses. 9. Sensors with limited knowledge, limited communication bandwidth, self-repeating, decaying memory, varying domain granularity. 10. Tuning reliability coefficients for partially reliable sensors (Elouedi, Mellouli, & Smets, 2004). The author has developed a computer program TBMLAB, which is a demonstrator for the TBM written in MATLAB. It is downloadable from the web site: http://iridia.ulb.ac.be/ ~ psmets/. Many tools, tutorials and applications dealing with the TBM can be found in this freeware.
FUTURE TRENDS The TBM is supposed to cover the same domain as probability theory, hence the task is enormous. Many 1137
6
problems have not yet been considered and are open for future work. One major problem close to be solved is the concept of credal inference, i.e., the equivalent of statistical inference (in its Bayesian form) but within the TBM realm. The advantage will be that inference can be done with an a priori that really represents ignorance. Real life successful applications start to show up, essentially in the military domain, for object recognitions issues.
CONCLUSIONS

We have very briefly presented the TBM, a model for the representation of quantified beliefs based on belief functions. The theory has grown enormously since Shafer's seminal work. We have only presented very general ideas and provided pointers to the papers where the whole theory is developed. Full details can be found in the recent, up-to-date literature.
REFERENCES

Ayoun, A., & Smets, P. (2001). Data association in multi-target detection using the transferable belief model. International Journal of Intelligent Systems, 16, 1167-1182.

Ben Yaghlane, B., Smets, P., & Mellouli, K. (2001). Belief function independence: I. The marginal case. International Journal of Approximate Reasoning, 29, 47-70.

Delmotte, F., & Smets, P. (2004). Target identification based on the transferable belief model interpretation of Dempster-Shafer model. IEEE Transactions on Systems, Man, and Cybernetics A, 34, 457-471.

Denoeux, T., & Ben Yaghlane, A. (2002). Approximating the combination of belief functions using the fast Mobius transform in a coarsened frame. International Journal of Approximate Reasoning, 31, 77-101.

Denoeux, T., & Masson, M.-H. (2004). EVCLUS: Evidential clustering of proximity data. IEEE Transactions on Systems, Man, and Cybernetics B, 34, 95-109.

Elouedi, Z., Mellouli, K., & Smets, P. (2001). Belief decision trees: Theoretical foundations. International Journal of Approximate Reasoning, 28, 91-124.

Elouedi, Z., Mellouli, K., & Smets, P. (2004). Assessing sensor reliability for multisensor data fusion with the transferable belief model. IEEE Transactions on Systems, Man, and Cybernetics B, 34, 782-787.

Haenni, R., & Lehmann, N. (2002). Resource-bounded and anytime approximation of belief function computations. International Journal of Approximate Reasoning, 32, 103-154.

Lefevre, E., Colot, O., & Vannoorenberghe, P. (2002). Belief functions combination and conflict management. Information Fusion, 3, 149-162.

Milisavljevic, N., Bloch, I., & Acheroy, M. (2000). Modeling, combining and discounting mine detection sensors within Dempster-Shafer framework. In Detection technologies for mines and minelike targets (Vol. 4038, pp. 1461-1472). Orlando, USA: SPIE Press.

Shafer, G. (1976). A mathematical theory of evidence. Princeton, NJ: Princeton University Press.

Shenoy, P. P. (1997). Binary join trees for computing marginals in the Shenoy-Shafer architecture. International Journal of Approximate Reasoning, 17, 239-263.

Smets, P. (1998). The transferable belief model for quantified belief representation. In D. M. Gabbay & P. Smets (Eds.), Handbook of defeasible reasoning and uncertainty management systems (Vol. 1, pp. 267-301). Dordrecht, The Netherlands: Kluwer.

Smets, P. (2002a). Decision making in a context where uncertainty is represented by belief functions. In R. P. Srivastava & T. J. Mock (Eds.), Belief functions in business decisions (pp. 17-61). Heidelberg, Germany: Physica-Verlag.

Smets, P. (2002b). The application of the matrix calculus to belief functions. International Journal of Approximate Reasoning, 31, 1-30.

Smets, P., & Kennes, R. (1994). The transferable belief model. Artificial Intelligence, 66, 191-234.

Smets, P., & Kruse, R. (1997). The transferable belief model for quantified belief representation. In A. Motro & P. Smets (Eds.), Uncertainty in information systems: From needs to solutions (pp. 343-368). Boston, MA: Kluwer.

KEY TERMS

Basic Belief Assignment: m(A) is the part of belief that supports that the actual world is in A without supporting any more specific subset of A.

Belief Function: Bel(A) is the total amount of belief that supports that the actual world is in A without supporting its complement.
Conditioning: Revision process of a belief by a fact accepted as true.

Conjunctive Combination: The combination of the beliefs induced by several sources into an aggregated belief.

Open World Assumption: The fact that the frame of discernment might not be exhaustive.

Pignistic Probability Function: BetP is the probability function used for decision making.

Plausibility Function: Pl(A) is the total amount of belief that might support that the actual world is in A.
Tree and Graph Mining

Dimitrios Katsaros
Aristotle University, Greece

Yannis Manolopoulos
Aristotle University, Greece
INTRODUCTION
During the past decade, we have witnessed an explosive growth in our capabilities to both generate and collect data. Various data mining techniques have been proposed and widely employed to discover valid, novel and potentially useful patterns in these data. Data mining involves the discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in huge collections of data. One of the key success stories of data mining research and practice has been the development of efficient algorithms for discovering frequent itemsets – both sequential (Srikant & Agrawal, 1996) and nonsequential (Agrawal & Srikant, 1994). Generally speaking, these algorithms can extract co-occurrences of items (taking or not taking into account the ordering of items) in an efficient manner. Although the use of sets (or sequences) has effectively modeled many application domains, such as market basket analysis and medical records, many applications have emerged whose data models do not fit in the traditional concept of a set (or sequence), but require the deployment of richer abstractions, like graphs or trees. Such graphs or trees arise naturally in a number of different application domains including network intrusion, semantic Web, behavioral modeling, VLSI reverse engineering, link analysis and chemical compound classification. Thus, the need to extract complex tree-like or graph-like patterns in massive data collections, for instance, in bioinformatics, semistructured or Web databases, became a necessity. The class of exploratory mining tasks, which deal with discovering patterns in massive databases representing complex interactions among entities, is called Frequent Structure Mining (FSM) (Zaki, 2002). In this article we will highlight some strategic application domains where FSM can help provide significant results, and subsequently we will survey the most important algorithms that have been proposed for mining graph-like and tree-like substructures in massive data collections.

BACKGROUND
As a motivating example for graph mining consider the problem of mining chemical compounds to discover recurrent (sub) structures. We can model this scenario using a graph for each compound. The vertices of the graphs correspond to different atoms and the graph edges correspond to bonds among the atoms. We can assign a label to each vertex, which corresponds to the atom involved (and maybe to its charge) and a label to each edge, which corresponds to the type of the bond (and maybe to information about the 3D orientation). Once these graphs have been generated, recurrent substructures become frequently occurring subgraphs. These graphs can be used in various tasks, for instance, in classifying chemical compounds (Deshpande, Kuramochi, & Karypis, 2003). Another application domain where graph mining is of particular interest arises in the field of Web usage analysis (Nanopoulos, Katsaros, & Manolopoulos, 2003). Although various types of usage (traversal) patterns have been proposed to analyze the behavior of a user (Chen, Park, & Yu, 1998), they all have one very significant shortcoming; they are one-dimensional patterns and practically ignore the link structure of the site. In order to perform finer usage analysis, it is possible to look at the entire forward accesses of a user and to mine frequently accessed subgraphs of that site. Looking for examples where tree mining has been successfully applied, we can find a wealth of them. A characteristic example is XML, which has been a very popular means for representing and storing information of various kinds, because of its modeling flexibility. Since tree-structured XML documents are the most widely occurring in real applications, one would like to discover the commonly occurring subtrees that appear in the collections. This task could benefit applications, like database caching (Yang, Lee, & Hsu, 2003), storage in relational databases (Deutsch, Fernandez, & Suciu, 1999), building indexes and/or wrappers (Wang & Liu, 2000) and many more.
Tree patterns also arise in bioinformatics. For instance, researchers have collected large amounts of RNA structures, which can be effectively represented using a computer data structure called a tree. In order to deduce some information about a newly sequenced RNA, they compare it with known RNA structures, looking for common topological patterns, which provide important insights into the function of the RNA (Shapiro & Zhang, 1990). Another application of tree mining in bioinformatics is found in the context of constructing phylogenetic trees (Shasha, Wang, & Zhang, 2004), where the task of phylogeny reconstruction algorithms is to use biological information about a set of, e.g., taxa, in order to reconstruct an ancestral history linking together all the taxa in the set.

There are two distinct formulations for the problem of mining frequent graph (tree) substructures; they are referred to as the graph-transaction (tree-transaction) setting and the single-graph (single-tree) setting. In the graph-transaction setting, the input to the pattern-mining algorithm is a set of relatively small graphs (called transactions), whereas in the single-graph setting the input data is a single large graph. The difference affects the way the frequency of the various patterns is determined. For the former, the frequency of a pattern is determined by the number of graph transactions that the pattern occurs in, irrespective of how many times the pattern occurs in a particular transaction, whereas in the latter, the frequency of a pattern is based on the number of its occurrences (i.e., embeddings) in the single graph. The algorithms developed for the graph-transaction setting can be modified to solve the single-graph setting, and vice versa. Depending also on the application domain, the considered graphs (trees) can be ordered or unordered, directed or undirected.

No matter what these characteristics are, the (sub)graph mining problem can be defined as follows. (A similar definition can be given for the tree mining problem.) Given as input a database of graphs and a user-defined real number 0 < σ ≤ 1, we need to find all frequent subgraphs, where the word "frequent" denotes those subgraphs with frequency larger than or equal to the threshold σ. (In the following, equivalently to the term frequency, we use the term support.) We illustrate this problem for the case of graph-transaction, labeled, undirected graphs, with σ = 2/3. The input and output of such an algorithm are given in Figure 1. Although the very first attempts to deal with the problem of discovering substructure patterns from massive graph or tree data date back to the early 90's (Cook & Holder, 1994), only recently has the field of mining for graph and tree patterns flourished. A wealth of algorithms has been proposed, most of which are based on the original level-wise Apriori algorithm for mining frequent itemsets (Agrawal & Srikant, 1994). Next, we will survey the most important of them.
Figure 1. Mining frequent labeled undirected graphs. Input: a set of three connected labeled undirected graphs (P), (Q), (S). Output: the set of seven connected frequent subgraphs.

Figure 2. Characteristic graph substructures: (a) graph, (b) generic subgraph, (c) induced subgraph, (d) connected subgraph, (e) tree, (f) path.
ALGORITHMS FOR GRAPH MINING

The graph is one of the most fundamental constructions studied in mathematics and thus, numerous classes of substructures are targeted by graph mining. These substructures include the generic subgraph, induced subgraph, connected subgraph, (ordered and unordered) tree and path (see Figure 2). We give the definitions of these substructures in the next paragraph and subsequently present the graph mining algorithms, able to discover all frequent substructures of any kind mentioned earlier.

Following mathematical terminology, a graph is represented as G(V, E, f), where V is a set of vertices, E is a set of edges connecting pairs of vertices and f is a function f: E → V×V. For instance, in Figure 2 we see that f(e1) = (v1, v2). We say that GS(Vs, Es, f) is a generic subgraph of G if Vs ⊂ V, Es ⊂ E and vi, vj ∈ Vs for all edges f(ek) = (vi, vj) ∈ Es. An induced subgraph ISG(Vs, Es, f) of G has a subset of vertices of G and the same edges between pairs of vertices as in G; in other words, Vs ⊂ V, Es ⊂ E and ∀ vi, vj ∈ Vs, ek = (vi, vj) ∈ Es ⇔ f(ek) = (vi, vj) ∈ E. We say that CSG(Vs, Es, f) is a connected subgraph of G if Vs ⊂ V, Es ⊂ E and all vertices in Vs are reachable through some edges in Es. An acyclic subgraph of G is called a tree T. Finally, a tree of G which does not include any branches is a path P in G.

The first algorithm for mining all frequent subgraph patterns is AGM (Inokuchi, Washio, & Motoda, 2000, 2003). AGM can mine various types of patterns, namely generic subgraphs, induced subgraphs, connected subgraphs, ordered and unordered trees and subpaths. AGM is based on a labeling of the graph's vertices and edges, a "canonical labeling" (Washio & Motoda, 2003) for the adjacency matrix of the graph, which allows for the unambiguous representation of the graph. The basic principle of AGM is similar to that of Apriori for basket analysis. Starting from frequent graphs, where each graph is a single vertex, the frequent graphs having larger sizes are searched in a bottom-up manner by generating candidates having an extra vertex. A similar canonical labeling and the Apriori style are also followed by FSG (Kuramochi & Karypis, 2004b), by its variations such as gFSG (2002), HSIGRAM and VSIGRAM (2004), and by the algorithm proposed by Vanetik, Gudes, and Shimony (2002). Finally, based on the same principles as above, Huan, Wang, and Prins (2003) proposed an interesting labeling for graphs based on the adjacency matrix representation, which can drastically reduce the computation complexity of the candidate generation phase.
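Whatever canonical labeling is used, these Apriori-style algorithms repeatedly count the support of candidate subgraphs over the graph transactions. The sketch below illustrates only that counting step for the graph-transaction frequency definition given earlier; it is not AGM, FSG, or gSpan, and the toy labeled graphs are assumptions of the sketch. It relies on networkx's GraphMatcher, which tests (node-induced) subgraph isomorphism, sufficient for this small example.

```python
# A minimal sketch of support counting in the graph-transaction setting:
# a candidate pattern is frequent if it occurs in at least sigma * |DB| transactions.
import networkx as nx
from networkx.algorithms import isomorphism

def support(pattern, transactions):
    """Number of graph transactions containing the labeled pattern."""
    node_match = isomorphism.categorical_node_match("label", None)
    edge_match = isomorphism.categorical_edge_match("label", None)
    count = 0
    for g in transactions:
        matcher = isomorphism.GraphMatcher(g, pattern,
                                           node_match=node_match,
                                           edge_match=edge_match)
        if matcher.subgraph_is_isomorphic():
            count += 1
    return count

def labeled_graph(edges):
    """Build a small undirected graph with vertex and edge labels."""
    g = nx.Graph()
    for (u, lu), (v, lv), le in edges:
        g.add_node(u, label=lu)
        g.add_node(v, label=lv)
        g.add_edge(u, v, label=le)
    return g

db = [labeled_graph([(("a", "A"), ("b", "B"), "x"), (("b", "B"), ("c", "C"), "y")]),
      labeled_graph([(("1", "A"), ("2", "B"), "x")]),
      labeled_graph([(("p", "A"), ("q", "B"), "x"), (("q", "B"), ("r", "C"), "y")])]
pattern = labeled_graph([(("u", "A"), ("v", "B"), "x")])

sigma = 2 / 3
print(support(pattern, db) / len(db) >= sigma)   # True: the pattern occurs in all three
```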
Recently, a depth-first-search (DFS) "canonical labeling" was introduced by gSpan (Yan & Han, 2002) and by its variation for mining closed frequent graph patterns, CloseGraph (Yan & Han, 2003). The main difference of their coding from other approaches is that they use a tree representation of each graph instead of the adjacency matrix to define the code of the graph.
ALGORITHMS FOR TREE MINING

There has been very little work in mining all frequent subtrees; only a handful of algorithms can be found in the literature. Although graph mining algorithms can be applied to this problem as well, they are likely to be too general, since they do not exploit the existence of the root and the lack of cycles. Thus, they are most probably very inefficient. The first effort towards tree mining was conducted by Ke Wang (Cong, Yi, Liu, & Wang, 2002; Wang & Liu, 2000). Their algorithm is an application of the original level-wise Apriori to the problem of mining frequently occurring collections of paths, where none of them is a prefix of another and thus they correspond to trees. Later, Katsaros, Nanopoulos, and Manolopoulos (2005) developed a numbering scheme for labeled, ordered trees in order to speed up the execution of Wang's algorithm, and Katsaros (2003) proposed the DeltaSSD algorithm to study the efficient maintenance of the discovered tree substructures under database updates. Departing from the Apriori paradigm, Zaki (2002) proposed the TREEMINER algorithm, while Abe, Kawasoe, Asai, Arimura, and Arikawa (2002), Asai et al. (2002) and Asai, Arimura, Uno, and Nakano (2003) proposed the FREQT algorithm. These tree mining algorithms are very similar; they are both based on an efficient enumeration technique that allows for the incremental construction of the set of frequent tree patterns and their occurrences level by level. TREEMINER performs a depth-first search for frequent subtrees and uses a vertical representation for the trees in the database for fast support counting. FREQT uses the notion of rightmost expansion to generate candidate trees by attaching nodes only to the rightmost branch of a frequent subtree. Similar in spirit are the algorithms proposed by Chi, Yang, and Muntz (2003, 2004). Finally, Xiao, Yao, Li, and Dunham (2003) proposed a method for discovering only the maximal frequent subtrees, using a highly condensed data structure for the representation and discovery of the maximal frequent trees of the database.
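The rightmost-expansion step can be illustrated with a short sketch. The encoding used here (a preorder list of (depth, label) pairs) and the toy labels are choices of this sketch rather than of FREQT or TREEMINER, and only candidate generation is shown; support counting and duplicate elimination are omitted.

```python
# A minimal sketch of rightmost expansion for labeled, ordered trees: a
# (k+1)-node candidate is built from a k-node tree only by attaching a new
# node to some node on the tree's rightmost path.
def rightmost_path_depths(tree):
    """Depths of the nodes on the rightmost path of a (depth, label) preorder tree."""
    if not tree:
        return []
    return list(range(tree[-1][0] + 1))   # 0 .. depth of the rightmost leaf

def rightmost_expansions(tree, labels):
    """All candidates obtained by one rightmost expansion of tree."""
    if not tree:                           # the empty tree expands to single-node trees
        return [[(0, label)] for label in labels]
    candidates = []
    for depth in rightmost_path_depths(tree):
        for label in labels:
            # attaching the new node as the rightmost child of the rightmost-path
            # node at this depth is simply an append in the preorder encoding
            candidates.append(tree + [(depth + 1, label)])
    return candidates

t = [(0, "A"), (1, "B")]                   # the 2-node tree with root A and child B
for candidate in rightmost_expansions(t, ["A", "B"]):
    print(candidate)
```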
The tree mining problem is also related to the tree isomorphism and tree pattern matching problems. However, the fundamental difference between tree mining and these two research problems is that tree mining algorithms focus on discovering all frequent tree substructures, and not only on deciding whether or not a particular instance of a tree is "contained" in another larger tree.
FUTURE TRENDS

The field of tree and graph mining is still in its infancy. To date, research and development have focused on the discovery of very simple structural patterns. Although it is difficult to foresee the future directions of the field, because they depend on the application requirements, we believe that the discovery of richer patterns requires new algorithms. For instance, algorithms are needed for the mining of weighted trees or graphs. Moreover, the embedding of tree and graph mining algorithms into clustering algorithms will be a very active area for both research and practice.
CONCLUSION

During the past decade, we have witnessed the emergence of the data mining field as a novel research area, investigating interesting problems and developing real-life applications. Initially, the targeted data formats were limited to relational tables, comprised of unordered collections of rows, where each row of the table is treated as a set. Due to the rapid proliferation of many applications from biology, chemistry, bioinformatics and communication networking, whose data model can only be described by a graph or a tree, the research field of graph and tree mining started to emerge. The field provides many attractive topics for both theoretical and engineering achievements and it is expected to be one of the key fields in data mining research for the years ahead.
REFERENCES Abe, K., Kawasoe, S., Asai, T., Arimura, H., & Arikawa, S. (2002). Optimized substructure discovery for semistructured data. Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Lecture Notes in Artificial Intelligence (LNAI), vol. 2431, 1-14. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. Proceedings
of the International Conference on Very Large Data Bases (VLDB) (pp. 487-499). Asai, T., Abe, K., Kawasoe, S., Arimura, H., Sakamoto, H., & Arikawa, S. (2002). Efficient substructure discovery for large semi-structured data. Proceedings of the SIAM Conference on Data Mining (SDM) (pp. 158-174). Asai, T., Arimura, H., Uno, T., & Nakano, S. (2003). Discovering frequent substructures in large unordered trees. Proceedings of the Conference on Discovery Sciences (DS), Lecture Notes in Artificial Intelligence (LNAI), vol. 2843, 47-61. Chen, M.S., Park, J.S., & Yu, P.S. (1998). Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2), 209-221. Chi, Y., Yang, Y., & Muntz, R.R. (2003). Indexing and mining free trees. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 509-512). Chi, Y., Yang, Y., & Muntz, R.R. (2004). HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. Proceedings of the IEEE Conference on Scientific and Statistical Data Base Management (SSDBM) (pp. 11-20). Cong, G., Yi, L., Liu, B., & Wang, K. (2002). Discovering frequent substructures from hierarchical semi-structured data. Proceedings of the SIAM Conference on Data Mining (SDM) (pp. 175-192). Cook, D.J., & Holder, L.B. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1, 231-255. Deshpande, M., Kuramochi, M., & Karypis, G. (2003). Frequent sub-structure-based approaches for classifying chemical compounds. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 35-42). Deutsch, A., Fernandez, M.F., & Suciu, D. (1999). Storing semistructured data with STORED. Proceedings of the ACM International Conference on Management of Data (SIGMOD) (pp. 431-442). Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 549-552). Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery
(PKDD), Lecture Notes in Artificial Intelligence (LNAI), vol. 1910, 13-23. Inokuchi, A., Washio, T., & Motoda, H. (2003). Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3), 321-354. Katsaros, D. (2003). Efficiently maintaining structural associations of semistructured data. Lecture Notes on Computer Science (LNCS), vol. 2563, 118-132. Katsaros, D., Nanopoulos, A., & Manolopoulos, Y. (2005). Fast mining of frequent tree structures by hashing and indexing. Information & Software Technology, 47(2), 129-140. Kuramochi, M., & Karypis, G. (2002). Discovering frequent geometric subgraphs. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 258-265). Kuramochi, M., & Karypis, G. (2004). Finding frequent patterns in a large sparse graph. Proceedings of the SIAM Conference on Data Mining (SDM). Kuramochi, M., & Karypis, G. (2004b). An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9), 1038-1051. Nanopoulos, A., Katsaros, D., & Manolopoulos, Y. (2003). A data mining algorithm for generalized web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1155-1169. Shapiro, B., & Zhang, K. (1990). Comparing multiple RNA secondary structures using tree comparisons. Computer Applications in Biosciences, 6(4), 309-318. Shasha, D., Wang, J.T.L., & Zhang, S. (2004). Unordered tree mining with applications to phylogeny. Proceedings of the IEEE International Conference on Data Engineering (ICDE) (pp. 708-719). Srikant, R., & Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. Proceedings of the International Conference on Extending Database Technology (EDBT'96) (pp. 3-17). Vanetik, N., Gudes, E., & Shimony, S.E. (2002). Computing frequent graph patterns from semistructured data. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 458-465). Wang, K., & Liu, H. (2000). Discovering structural association of semistructured data. IEEE Transactions on Knowledge and Data Engineering, 12(3), 353-371.
Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. ACM SIGKDD Explorations, 5(1), 59-68. Xiao, Y., Yao, J.-F., Li, Z., & Dunham, M.H. (2003). Efficient data mining for maximal frequent subtrees. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 379-386). Yan, X., & Han, J. (2002). gSpan: Graph-based substructure pattern mining. Proceedings of the IEEE International Conference on Data Mining (ICDM) (pp. 721-724). Yan, X., & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 286-295). Yang, L.H., Lee, M.L., & Hsu, W. (2003). Efficient mining of XML query patterns for caching. Proceedings of the International Conference on Very Large Data Bases (VLDB) (pp. 69-80). Zaki, M. (2002). Efficiently mining frequent trees in a forest. Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD) (pp. 71-80).
KEY TERMS Closed Frequent Graph: A frequent graph pattern G is closed if there exists no proper super-pattern of G with the same support in the dataset. Correlation: describes the strength or degree of linear relationship. That is, correlation lets us specify to what extent the two variables behave alike or vary together. Correlation analysis is used to assess the simultaneous variability of a collection of variables. For instance, suppose one wants to study the simultaneous changes with age of height and weight for a population. Correlation analysis describes how the change in height can influence the change in weight. Embedded Subtree: Let T(N,B) be a tree, where N represents the set of its nodes and B the set of its edges. We say that a tree S(Ns,Bs) is an embedded subtree of T provided that: i) Ns⊆N, ii) b=(nx,ny)∈Bs if and only if nx is an ancestor of ny in T. In other words, we require that a branch appear in S if and only if the two vertices are on the same path from the root to a leaf in T. Exploratory Data Analysis (EDA): comprises a set of techniques used to identify systematic relations between variables when there are no (or not complete) a priori
expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns. Free Tree: Let G be a connected acyclic labeled graph. If we label the leaves (because of its acyclicity, each connected acyclic labeled graph has at least one node which is connected to the rest of the graph by only one edge, that is, a leaf) with zero and the other nodes recursively with the minimal label of its neighbors plus one, then we get an unordered, unrooted tree-like structure, a so-called free tree. It is a well-known fact that every free tree has at most two nodes, which minimize the maximal distance to all other nodes in the tree, the so-called centers.
Induced Subtree: Let T(N,B) be a tree, where N represents the set of its nodes and B the set of its edges. We say that a tree S(Ns,Bs) is an induced subtree of T provided that: i) Ns⊆N, ii) b=(nx,ny)∈Bs if and only if nx is a parent of ny in T. Thus, induced subtrees are a specialization of embedded subtrees. Linear Regression: is used to make predictions about a single value. Simple linear regression involves discovering the equation for a line that most nearly fits the given data. That linear equation is then used to predict values for the data. For instance, if a cost modeler wants to know the prospective cost for a new contract based on the data collected from previous contracts, then s/he may apply linear regression.
Trends in Web Content and Structure Mining

Anita Lee-Post
University of Kentucky, USA

Haihao Jin
University of Kentucky, USA
INTRODUCTION

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. This area of research is fast-developing today, drawing attention and interest from both researchers and practitioners. The tremendous growth of information available on the Web and the recent interest in e-commerce have accounted for this phenomenon (Kosala & Blockeel, 2000).
BACKGROUND

Depending on the nature of the data to be mined, Web mining can be categorized into three areas: Web content, Web structure, and Web usage mining (Srivastava, Cooley, Deshpande, & Tan, 2000).

• Web content mining is the discovery or retrieval of useful information from the content of the Web including text, images, audio, video, and other forms of content that make up the Web pages such as symbolic, metadata, and hyperlink data. Text and hypertext content are the most common sources of data for content mining. The information extracted holds the key to search engine operations. The resulting information mined is represented as an index. A key word supplied by a user is matched against this index to retrieve relevant information for the user. An ideal index is one that links every string, word, phrase, tune and image on the Web to all the pages that contain them (Linoff & Berry, 2001).
• Web structure mining is the discovery of useful information from the underlying hyperlink structures of the Web. The structure is represented as a graph showing how pages or documents within a site and between sites are linked (Broder, Kumar, Maghoul, Raghavan, Rajagopalan & Stata, 2000). An ideal graph is one that maps all links connecting every document on the entire Web (Linoff & Berry, 2001). By analyzing the topology of the Web, information such as the popularity and richness of a document can be revealed. Links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or the variety of topics covered in the document. Such information enhances the usefulness of a search engine, adding popularity and richness to the relevancy of information retrieved.
• Web usage mining is the discovery of useful information from users' usage patterns. Usage data in the form of pages visited, duration of the visit, navigation paths, browse/click pattern, etc. is available from Web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse-clicks and scrolls, and any other data generated by the interaction of users and the Web. Usage data is represented as user profiles. An ideal user profile is generated from continually updating records of an individual user's interactions with the Web including, among other things, sites visited, paths taken, queries issued, documents read, and items purchased. The user profile is particularly useful for e-commerce companies to track and predict customer behavior on their Web sites.
We will discuss in detail past research contributions with respect to Web content and structure mining next. Research efforts relating to Web usage mining will be covered in a separate chapter.
MAIN THRUST

Web Content Mining

Web content mining focuses on the discovery or retrieval of useful information from Web content/data/documents. The contributions of Web content mining can be evaluated on two fronts: information retrieval or search result mining, and information extraction or Web page content mining (Pal, Talwar, & Mitra, 2002).
INFORMATION RETRIEVAL

Search result mining is about information retrieval on the Web, a task performed by search engines. The goal of information retrieval is to find what you are looking for precisely. The usefulness of a search engine hinges on its ability to retrieve a relevant subset of documents expediently from a large collection of Web pages based on a user query. The most commonly used technique for search result mining is Web document classification or text categorization. Classification is the process of assigning an item to a class with a certain degree of confidence. In search result mining, classification amounts to assigning keywords to Web documents with varying degrees of confidence. Classifying a Web document in this way makes possible its subsequent retrieval through keyword-based searches. According to Kosala and Blockeel (2000), text categorization can be represented in various forms, including the frequencies of specific words, phrases, concept categories, and named entities. The relevance of the page retrieved is measured by its rank: the most relevant pages appear at the top of the list of returned pages. Various rules are used by different search engines in ranking Web pages. For instance, some search for the frequency and location of keywords or phrases in the Web page document, while others scan the META tag, title field, headers and text near the top of the document. An alternative to the ranked list approach for information retrieval is document clustering. Rather than presenting users with a ranked list, document clustering partitions the retrieval results into sets or clusters based on the topical homogeneity of the Web documents retrieved. The topical homogeneity is expressed as a similarity metric constructed by analyzing textual contents, hyperlinks, and/or co-citation patterns of Web documents (He, Zha, Ding, & Simon, 2002). Recent developments in search result mining include image classification, multimedia information retrieval and cross-language information retrieval. Yanai (2003) used a large number of images from the Web as training data in generic image classification. Meghini, Sebastiani and Straccia (2001) combined similarity and semantic-based methods to retrieve text and images from forms, content, and structure of the Web. Kwok (2000) and Chen, Chau and Yeh (2004) proposed means for cross-language text retrieval and multilingual text mining.
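To make the indexing and ranking ideas concrete, the sketch below builds TF-IDF vectors for a few toy documents and ranks them by cosine similarity to a query. The documents, the query, and the choice of scikit-learn's TF-IDF weighting are illustrative assumptions, not a description of any particular search engine mentioned above.

```python
# A minimal sketch of keyword-based retrieval with ranking: documents are
# indexed as TF-IDF vectors and returned in decreasing order of similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "graph mining discovers frequent subgraphs in graph databases",
    "web usage mining analyzes server access logs",
    "search engines index web pages for keyword based retrieval",
]
query = ["keyword search over indexed web pages"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)      # the index
query_vector = vectorizer.transform(query)

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")                   # most relevant document first
```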
Information Extraction

Web page content mining is about information extraction, originally the task of locating specific information
from a natural language document. The goal of information extraction is to extract relevant facts from documents, as opposed to retrieving relevant documents to satisfy a user's information need as in information retrieval. With information extraction, information is pulled from texts of heterogeneous formats, including PDF files, emails, and Web pages, and organized into a single homogeneous form such as tables within a relational database. Essentially, the Web content is converted into a database that end-users can search or organize into taxonomies, allowing them to wade through and cope with the overwhelming amount of digital information by breaking up the Web into smaller, more manageable pieces (Adams, 2001). The most commonly used technique for Web page content mining is natural language processing, which recognizes words, attribute values and even concepts in a restricted domain (Adams, 2001). The process of pulling relevant information from Web documents and restructuring it as a database is known as feature extraction. For example, pages about automobile engine design can be scanned to extract features such as engine capacity and fuel consumption rate to forecast trends in these areas. Feature extraction on the Web has been used by e-commerce sites to compare prices for similar products, a service known as comparative shopping. Until more advanced technology that extracts information from unstructured text is available, the current approach to comparative shopping is restricted to analyzing structured contents in the form of XML tags on the Web (Linoff & Berry, 2001). Recent developments in Web page content mining include Web table mining, opinion or reputation extraction, information fusion, and concept mining. Yang and Luk (2002) proposed means to extract information from tables embedded inside Web pages. Dave, Lawrence, and Pennock (2003) extracted opinions of a product from the Web for product reviews. Morinaga, Yamanishi, Tateishi and Fukushima (2002) used Web content mining techniques to extract information about a target product's reputation from the Web. Etzioni, Cafarella and Downey (2004) extracted and fused information from multiple documents in a domain-independent and scalable manner. Liu, Chin and Ng (2003) and Loh, Wives, and Oliveira (2000) worked with topic-specific concepts and definitions to discover concept-based knowledge in text extracted from the Web.
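The feature-extraction step behind comparative shopping can be sketched as below. As the paragraph notes, the current approach assumes structured content; the tag names, pages, and regular expression here are illustrative only.

```python
# A minimal sketch of feature extraction for comparative shopping: pull
# (product, price) pairs out of simple structured markup and gather them
# into one table so offers for the same product can be compared.
import re

pages = [
    "<product><name>Widget A</name><price>19.99</price></product>",
    "<product><name>Widget A</name><price>17.49</price></product>",
    "<product><name>Widget B</name><price>5.00</price></product>",
]
pattern = re.compile(r"<name>(.*?)</name>\s*<price>([\d.]+)</price>")

offers = {}                                   # product name -> list of prices
for page in pages:
    for name, price in pattern.findall(page):
        offers.setdefault(name, []).append(float(price))

for name, prices in offers.items():
    print(f"{name}: lowest price {min(prices):.2f} out of {len(prices)} offers")
```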
Web Structure Mining

Web structure mining uses the hyperlink structure of the Web to yield useful information, including definitive page specification, hyperlinked community identification, Web page categorization, and Web site completeness evaluation.
The most definitive or authoritative pages add a second dimension to the notion of relevance in information retrieval, enhancing the quality of search results. Kleinberg (1998) suggested an iterative procedure called the Hyperlink-Induced Topic Search (HITS) algorithm to assign hub and authority scores to a page. The hub and authority scores computed for each Web page indicate the extent to which the Web page serves as a "hub" pointing to good "authority" pages or as an "authority" on a topic pointed to by good hubs. Brin and Page (1998) formulated the PageRank algorithm that ranks Web pages. The rank of a page does not simply depend on the number of inbound links but also on the rank of those pages that link to it. Web link topology has been exploited to identify hyperlinked communities (Kitsuregawa, Toyoda, & Pramudiono, 2002). The Web harbors a large number of communities that are groups of content creators sharing a common interest. Link analysis shows that communities can be viewed as containing a core of central authoritative pages linked together by hub pages. It reveals the ways in which independent users build connections to one another in hypermedia and provides a basis for predicting the way in which on-line communities develop. Identifying these communities helps not only in understanding the intellectual and sociological evolution of the Web but also in providing detailed information to a collection of people with certain focused interests. In addition to their use as a means for finding hubs, authorities, and communities, hyperlinks can be mined to categorize Web pages (Chakrabarti, Dom, Gibson, Kleinberg, Kumar & Raghavan, 1999). Hyperlinks contain high-quality semantic clues as to the topic of a page that are lost by a purely keyword-based categorizer. By exploiting link information in the neighborhood around a document, the relationship between Web pages can be deduced. Pages on related topics tend to be linked more frequently than those on unrelated topics. They may be related by synonyms or ontology, have similar content, sit in the same Web server or be created by the same person. This is very useful for a focused Web crawler, which is designed to search the Web for pages on a particular topic or set of topics. By categorizing pages as it crawls, the focused crawler is able not just to filter out the irrelevant pages, but also to return the most relevant pages. Finally, structure mining provides information to evaluate the completeness of a website (Madria, Bhowmick, Ng, & Lim, 1999). The completeness of a website is measured by the frequency of local links that reside in the same server, the existence of mirrored sites, and the hierarchical nature of hyperlinks of the site. The completeness evaluation is critical for improving website usability. Web structure mining techniques have been applied to benefit areas beyond Web mining. Agrawal, Rajagopalan, Srikant, and Xu (2003) applied link-based methods to assign
newsgroup users into opposite camps. Yin and Lee (2004) proposed a ranking algorithm similar to PageRank to improve object layouts in mobile devices.
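The PageRank idea mentioned above, namely that a page's rank depends on the ranks of the pages linking to it and not only on their number, can be sketched with a small power-iteration loop. The damping factor, the fixed iteration count, and the tiny link graph are assumptions of this sketch, not details of Brin and Page's formulation.

```python
# A minimal power-iteration sketch of PageRank: each page repeatedly
# redistributes its current rank to the pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if not targets:                      # dangling page: spread rank uniformly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))    # C ranks highest: it is linked to by high-rank pages
```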
FUTURE TRENDS

The future trends of Web content and structure mining should be directed to address the following needs:

• The Web is increasingly rich in multimedia content containing images, videos, audio, etc. Web page content is increasingly complex. Pages differ in structure, style and content presentation. The trend towards incorporating a greater variety of multimedia elements in a Web document, including text, image, audio, video, metadata and hyperlinks, is apparent. However, mining algorithms currently are text-centric, developed from a text mining framework. Web mining algorithms capable of analyzing multimedia content as well as the diverse and unstructured nature of Web documents need to be developed in the near future.
• The Web serves a wide spectrum of users with different backgrounds, interests, preferences, usage purposes and information needs. In addition, many of them are unaware of the Internet structure and do not know how to use the Web effectively. These users frequently get lost in Web surfing and frustrated at the poor quality of search results. Furthermore, not all Web pages contain truly relevant or useful information for a given user at certain times. To protect themselves from being overwhelmed by the ocean of information on the Web, a majority of users focus only on a small portion of the Web, dismissing the rest as useless data. More intelligent search engines or agents are needed to address these concerns.
• Currently, search engines are keyword-based. In the future, search engines should support visual queries. Content-based image retrieval systems, which use features extracted from the image files themselves to search a collection of images rather than relying on manual indexing or text descriptions by humans, are most promising. Work in this field holds tremendous potential in setting the course for future information retrieval research.
• Most search engines perform searches on English text only. However, multilingual search engines are becoming increasingly common. This, coupled with the development of information retrieval systems that can identify languages, translate, perform thematic classification, and provide summaries automatically, promises more powerful search engines in the near future.
CONCLUSION

The research contributions, recent developments, and future trends of Web content and structure mining are discussed in this chapter. Despite the significant progress made in these areas, various challenges of Web content and structure mining remain that await future research investigation.
REFERENCES Adams, K.C. (2001). The Web as database: New extraction technologies and content management. ONLINE, 25(2). Retrieved from http://www.onlinemag.net/OL2001/adams3_01.html Agrawal, R., Rajagopalan, S., Srikant, R., & Xu, Y. (2003, May). Mining newsgroups using networks arising from social behavior. Proceedings of the 12th International World Wide Web Conference, WWW'03 (pp. 529-535), Budapest, Hungary. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., & Stata, R. (2000, May). Graph structure in the Web. Computer networks. Proceedings of the 9th International World Wide Web Conference, WWW'00 (pp. 309-320), Amsterdam, the Netherlands. Brin, S., & Page, L. (1998, April). The anatomy of a large-scale hypertextual Web search engine. Proceedings of the 7th International World Wide Web Conference, WWW'98, Brisbane, Australia. Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Kumar, S., & Raghavan, P. (1999). Mining the link structure of the World Wide Web. IEEE Computer, 32(8), 60-67. Chen, J., Chau, R., & Yeh, C. (2004). Discovering parallel text from the World Wide Web. Proceedings of the 2nd workshop on Australasian information security, Data Mining and Web Intelligence, and Software International, DMWI'04 (pp. 157-161), Dunedin, New Zealand. Dave, K., Lawrence, S., & Pennock, D. (2003, May). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Proceedings of the 12th International World Wide Web Conference, WWW'03 (pp. 519-528), Budapest, Hungary. Etzioni, O., Cafarella, M., & Downey, D. (2004, May). Web-scale information extraction in KnowITALL. Proceedings
of the 13th Conference on World Wide Web, WWW'04 (pp. 100-110), New York City, New York, USA. He, X., Zha, H., Ding, C., & Simon, H. (2002). Web document clustering using hyperlink structures. Computational Statistics and Data Analysis, 41, 19-45. Kitsuregawa, M., Toyoda, M., & Pramudiono, I. (2002). WEB community mining and WEB log mining: Commodity cluster based execution. Proceedings of the 13th Australasian Conference on Database Technologies (pp. 3-10), Melbourne, Australia. Kleinberg, J.M. (1998). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15. Kwok, K.L. (2000, November). Exploiting a Chinese-English bilingual wordlist for English-Chinese cross-language information retrieval. Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages (pp. 173-179), Hong Kong, China. Linoff, G.S., & Berry, J.A. (2001). Mining the Web. New York: John Wiley & Sons, Inc. Liu, B., Chin, C., & Ng, H. (2003, May). Mining topic-specific concepts and definitions on the Web. Proceedings of the 12th World Wide Web Conference, WWW'03 (pp. 251-260), Budapest, Hungary. Loh, S., Wives, L., & Oliveira, J. (2000). Concept-based knowledge discovery in texts extracted from the Web. SIGKDD Explorations, 2(1), 29-39. Madria, S.K., Bhowmick, S.S., Ng, W.K., & Lim, E.P. (1999). Research issues in Web data mining. Proceedings of the 1st International Conference on Data Warehousing and Knowledge Discovery, DaWaK'99 (pp. 303-312). Meghini, C., Sebastiani, F., & Straccia, U. (2001). A model of multimedia information retrieval. Journal of the ACM, 48(5), 909-970. Morinaga, S., Yamanishi, K., Tateishi, K., & Fukushima, T. (2002, July). Mining product reputations on the Web. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD'02 (pp. 342-349), Edmonton, Alberta, Canada. Pal, S.K., Talwar, V., & Mitra, P. (2002). Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Transactions on Neural Networks, 13(5), 1163-1177.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23. Yanai, K. (2003, November). Generic image classification using visual knowledge on the Web. Proceedings of the 11th ACM International Conference on Multimedia, MM'03 (pp. 167-176), Berkeley, California, USA. Yang, Y., & Luk, W. (2002, November). A framework for Web table mining. Proceedings of the 4th International Workshop on Web Information and Data Management, WIDM'02 (pp. 36-42), McLean, Virginia, USA. Yin, X., & Lee, W. (2004, May). Using link analysis to improve layout on mobile devices. Proceedings of the 13th World Wide Web Conference, WWW'04 (pp. 338-344), New York City, New York, USA.
KEY TERMS Authority Pages: The pages that contain the most definitive, central, and useful information in the context of query topics.
Hub Pages: The pages that contain a large number of links to pages that contain information about the query topics. Hyperlink: A link in a document to information within the same document or another document. These links are usually represented by highlighted words or images. Hypertext: Any text that contains links to other documents. Information Extraction: The process of pulling out or extracting relevant or predefined types of information from a set of documents. The extracted information can range from a list of entity names to a database of event descriptions. Information Retrieval: The process of discovering and indexing the relevant documents from a collection of documents, based on a query presented by the user. Web Content Mining: The discovery or retrieval of useful information from the content of the Web. Web Mining: The use of data mining techniques to automatically discover and extract information from Web documents and services. Web Structure Mining: The discovery of useful information from the underlying hyperlink structures of the Web.
Trends in Web Usage Mining

Anita Lee-Post
University of Kentucky, USA

Haihao Jin
University of Kentucky, USA
INTRODUCTION In this paper, we will discuss research efforts devoted to the remaining area of Web mining, namely Web usage mining. Taken together, a complete picture of the trends in Web mining can be discerned.
BACKGROUND

Web mining is a fast-developing area using data mining techniques to discover useful knowledge from Web documents and services (Etzioni, 1996; Kosala & Blockeel, 2000). Based on the types of data available on the Web, Web mining is generally divided into three categories: Web content mining, Web structure mining, and Web usage mining (Srivastava, Cooley, Deshpande, & Tan, 2000). Both content mining and structure mining work with idealized static representations of the Web, i.e., the pages and links as they exist at a particular moment. The information discovered from content mining and structure mining is instrumental to the development of more powerful and intelligent search engines or agents (Glover, Tsioutsiouliklis, Lawrence, Pennock, & Flake, 2002; Leake & Scherle, 2001). Web usage mining, on the other hand, is the discovery of useful information from users' usage patterns. The data required to build a complete usage pattern is scattered across Web logs, application server logs, ad server logs, commerce server logs, product databases, and customer databases owned by a host of different organizations. Many of them have neither the ability nor the willingness to share the information they own. Furthermore, pages viewed by users through caching at client or proxy servers will not be recorded in the server logs, thus affecting the accuracy of server log data. In addition, users are reluctant to let their Web activities be monitored due to privacy, security, and profiling concerns. The dynamic, diverse and incomplete nature of the usage data presents a challenge to Web usage mining. However, as explained in the next section, a significant amount of Web usage mining research has been conducted despite the difficulty of working with an incomplete source of usage data.
MAIN THRUST Web usage mining is the application of data mining techniques to discover usage patterns from Web data in order to understand and better serve the needs of Webbased applications (Srivastava, Cooley, Deshpande, & Tan, 2000). Usage data can be collected from three sources: Web servers, proxy servers, and Web clients. Server access logs contain information about the name and IP address of the remote host, date and time of a user’s request, the URL of the Web page requested, size of the page requested, as well as status of the request that help characterize a user’s access to a specific Web server. Proxy server logs reveal actual requests from multiple clients to multiple Web servers served by that proxy server. The information is useful for learning the browsing behavior of a group of anonymous users sharing a common proxy server so that future page requests can be predicted to improve proxy caching services. Client side data provides detailed information about an actual user’s browsing activities. A Web client’s usage data is tracked by a remote agent deployed via JavaScript, Java applet, or modified browser. The data collected from these data sources is then used to construct data abstractions of users, user sessions, episodes, click-stream behaviors, and page views. Data abstractions are necessary for discovering usage patterns that range from single-user navigation patterns, single-site browsing patterns to multiuser, multi-site access patterns. Usage patterns discovered have been critical for applications such as personalization, Web server performance improvement, Web site modification, and customer relationship management (Facca & Lanzi, 2003). Web usage mining is performed in three phases, namely preprocessing, pattern discovery, and pattern analysis (Srivastava, Cooley, Deshpande, & Tan, 2000). The preprocessing phase converts the usage, content, and structure information contained in various data
sources into data abstractions necessary for pattern discovery. Pattern discovery draws upon methods and algorithms such as statistical analysis, association rules, clustering, classification, sequential pattern and dependency modeling to characterize usage patterns in the form of frequency of page views, Web pages that frequently appear together in users' sessions, usage clusters, page clusters, inter-session patterns, and user profiles. The usage patterns discovered can be analyzed to personalize the Web experience for a user, improve the performance of Web servers and Web-based applications, modify the design of a website, or manage customer relationships. Personalization or personalizing the Web experience for a user can be achieved by making dynamic recommendations to a Web user based on his/her profile or navigation pattern (Eirinaki & Vazirgiannis, 2003). The recommendations can be a site cluster that contains dynamically selected page links. They can also be an adaptive website that automatically improves its organization and presentation to suit a user's access pattern (Perkowitz & Etzioni, 2000). Pre-fetching and caching improve the performance of Web servers and Web-based applications by reducing server response time (Jespersen, Pedersen, & Thorhauge, 2003; Lan, Bressan, Ooi, & Tan, 2000). Frequently accessed pages are "pre-fetched" from the Web server and "cached" to anticipate and satisfy future user requests expediently. Website redesign or modifying the design of a website to improve its usability and quality can be guided by usage mining discoveries (Srikant & Yang, 2001). Detailed feedback on user behavior can be used by the Website designer to improve the content and structure of the website. Usage patterns discovered from server logs can also be clustered to generate index pages or home pages. E-commerce intelligence or mining business intelligence from Web usage data is important for e-commerce operations, in particular, customer relationship management. Data generated about customers from e-commerce transactions can be analyzed to improve marketing, sales, and customer services. Effective customer relationship management is made possible by tailoring these customer service activities to maximize customer satisfaction. Other recent applications of usage mining include search relevance ranking, adaptive Web site navigation, and social network mining. A ranking of search topic relevance was established as a result of mining a user's browsing records (Wang, Chen, Tao, Ma, & Liu, 2002) or a user's past search history (Mukhopadhyay, Giri, & Singh, 2003). Zhu, Hong, and Hughes (2004) paved the way for adaptive Web site navigation by analyzing Web log files to discover conceptual link hierarchies among Web pages. The intent was to allow for less specific search criteria involving more Web pages on multiple
conceptual levels. Domingos and Richardson (2001) proposed calculating an online customer’s network value based on correlations between online customers. The idea was to recognize the exponential growth potential of the expected profit from a customer in a social network as he/she influenced others to shop online. The integration of Web content, structure, and usage mining has shown promising results. Fang & Sheng (2004) proposed an approach to maximize the efficiency and effectiveness of a portal page to a Website. The hyperlinks in the portal page was built based on relationships among hyperlinks extracted from the Website as well as relationships among access patterns discovered from Web logs. Cooley (2003) was able to identify interesting Web usage patterns from Web content and structure data. In addition, combining information discovered from both content and structure mining was instrumental in enhancing Web personalization (Eirinaki & Vazirgiannis, 2003; Eirinaki, Vazirgiannis, & Varlamis, 2003).
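The preprocessing phase described above can be illustrated with a short sketch that turns raw server access-log lines into per-visitor sessions, the data abstraction on which pattern discovery (e.g., association rules over pages visited together) then operates. The Common Log Format, the 30-minute session timeout, and the sample lines are assumptions of this sketch; further cleaning and more careful user identification are omitted.

```python
# A minimal sketch of Web usage preprocessing: parse server access-log lines
# (Common Log Format assumed) and group requests from the same host into
# sessions using a 30-minute inactivity timeout.
import re
from datetime import datetime, timedelta

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[(.*?)\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')
TIMEOUT = timedelta(minutes=30)

def sessions(log_lines):
    """Return {host: [[page, ...], ...]} with one list of pages per session."""
    last_seen, result = {}, {}
    for line in log_lines:
        match = LOG_RE.match(line)
        if not match:
            continue                               # skip malformed lines
        host, timestamp, page, status = match.groups()
        when = datetime.strptime(timestamp.split()[0], "%d/%b/%Y:%H:%M:%S")
        if host not in result or when - last_seen[host] > TIMEOUT:
            result.setdefault(host, []).append([])  # start a new session
        result[host][-1].append(page)
        last_seen[host] = when
    return result

logs = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2005:13:57:01 -0700] "GET /products.html HTTP/1.0" 200 1045',
    '192.0.2.1 - - [10/Oct/2005:15:10:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
print(sessions(logs))   # two sessions for the same host: the third request exceeds the timeout
```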
FUTURE TRENDS The future trends of Web usage mining should be directed to address the following needs:
• Web usage mining is currently restricted to modeling the behavior of visitors to a particular site or a network of sites or to data provided by a small sample of users who have volunteered to have their movements tracked across unrelated sites. A broader and more complete source of data for Web usage mining is critically needed.
• Combining the tasks of content and structure mining has been shown to enhance the quality of search results, returning pages that are both relevant and authoritative. Information discovered from usage mining needs to work with content and structure data for effective personalization and website redesign. Therefore, there is a need to integrate the three areas of content, structure, and usage mining so that a more complete data source can be used and analyzed to yield more useful information and applications.
• The area of Web mining is increasingly multidisciplinary. More advanced extraction technology is needed for unstructured content mining. More intelligent analytical tools are needed for both structure and usage mining. There is a need to involve more academic fields, in particular machine learning and linguistic science, to join efforts in advancing the growth and development of the Web mining field.
CONCLUSION The research efforts and recent developments of Web usage mining and its future trends are discussed in this paper. Taken as a whole, Web mining research has contributed significantly to the ease with which a tremendous amount of information can be organized, structured, retrieved, extracted, and disseminated. This allows for the flourishing of Web mining applications that benefit areas beyond Web mining, impacting the business and social science communities in particular. The future of Web mining research, however, hinges on its ability to address the ongoing challenges faced by researchers in this dynamic field.
REFERENCES
Cooley, R. (2003). The use of Web structure and content to identify subjectively interesting Web usage patterns. ACM Transactions on Internet Technology, 3(2), 93-116.
Domingos, P., & Richardson, M. (2001). Mining the network value of customers. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’01 (pp. 57-66), San Francisco, California, USA.
Eirinaki, M., & Vazirgiannis, M. (2003). Web mining for personalization. ACM Transactions on Internet Technology, 3(1), 1-27.
Eirinaki, M., Vazirgiannis, M., & Varlamis, I. (2003, August). SEWeP: Using site semantics and a taxonomy to enhance the Web personalization processes. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, SIGKDD’03 (pp. 99-108), Washington, DC, USA.
Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine. Communications of the ACM, 39(11), 65-68.
Facca, F.M., & Lanzi, P.L. (2003). Recent developments in Web usage mining research. In Y. Kambayashi, M. Mohania, & W. Woss (Eds.), Data warehousing and knowledge discovery. Lecture Notes in Computer Science, Vol. 2737, 140-150. New York: Springer-Verlag.
Fang, X., & Sheng, O. (2004). LinkSelector: A Web mining approach to hyperlink selection for Web portals. ACM Transactions on Internet Technology, 4(2), 209-237.
Glover, E., Tsioutsiouliklis, K., Lawrence, S., Pennock, D. M., & Flake, G. W. (2002, May). Using Web structure for classifying and describing Web pages. Proceedings of the 11th International Conference on World Wide Web, WWW’02 (pp. 562-569), Honolulu, Hawaii, USA.
Jespersen, S., Pedersen, T. B., & Thorhauge, J. (2003, November). Evaluating the Markov assumption for Web usage mining. Proceedings of the 5th ACM International Workshop on Web Information and Data Management, WIDM’03 (pp. 82-89), New Orleans, Louisiana, USA.
Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15.
Lan, B., Bressan, S., Ooi, B., & Tan, K. (2000, November). Rule-assisted prefetching in Web-server caching. Proceedings of the 9th International Conference on Information and Knowledge Management, CIKM’00 (pp. 504-511), McLean, Virginia, USA.
Leake, D. B., & Scherle, R. (2001, January). Towards context-based search engine selection. Proceedings of the 6th International Conference on Intelligent User Interfaces, IUI’01 (pp. 109-112), Santa Fe, New Mexico, USA.
Mukhopadhyay, D., Giri, D., & Singh, S. (2003). An approach to confidence based page ranking for user oriented Web search. ACM SIGMOD Record, 32(2), 28-33.
Perkowitz, M., & Etzioni, O. (2000). Adaptive Web sites. Communications of the ACM, 43(8), 152-158.
Srikant, R., & Yang, Y. (2001, May). Mining Web logs to improve Website organization. Proceedings of the 10th International World Wide Web Conference, WWW’01 (pp. 430-437), Hong Kong, China.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.
Wang, J., Chen, Z., Tao, L., Ma, W., & Liu, W. (2002, November). Ranking user’s relevance to a topic through link analysis on Web logs. Proceedings of the 4th International Workshop on Web Information and Data Management, WIDM’02 (pp. 49-54), McLean, Virginia, USA.
Zhu, J., Hong, J., & Hughes, J. (2004). PageCluster: Mining conceptual link hierarchies from Web log files for adaptive Web site navigation. ACM Transactions on Internet Technology, 4(2), 185-208.
KEY TERMS
Click Stream: A sequential series of web page view requests from an individual user.
Data Preprocessing: The data mining phase that converts the usage, content, and structure information
contained in various data sources into data abstractions necessary for pattern discovery.
Page View: The visual rendering of a web page in a specific environment at a specific point of time.
Pattern Discovery: The data mining phase that draws upon methods and algorithms such as statistics and association rules to characterize usage patterns in desired forms.
Server Log: An audit file that provides a record of all requests for files on a particular web server; also known as an access log file.
User Profile Data: Data that provide information about the users of a Web site, such as demographic information and the interests or preferences of the users.
Web Personalization: The process of customizing the content and structure of a Web site to the individual needs of specific users, based on the analysis of the user’s navigation data, Web content, structure, and user profile data.
Web Usage Mining: The discovery of useful information from users’ usage patterns.
Unsupervised Mining of Genes Classifying Leukemia
Diego Liberati, Consiglio Nazionale delle Ricerche, Italy
Sergio Bittanti, Politecnico di Milano, Italy
Simone Garatti, Politecnico di Milano, Italy
INTRODUCTION Micro-array technology has marked a substantial improvement in making available a huge amount of data about gene expression in pathophysiological conditions; among the many papers and books recently devoted to the topic, see, for instance, Hardimann (2003) for a discussion of such a tool. The availability of so much data has focused the attention of the scientific community on how to extract significant and directly understandable information from such a large quantity of measurements in an easy, fast, and automatic way. Many papers and books have been devoted as well to various ways to process micro-array data; Knudsen (2004) is a recent re-edition of a book pointing to some of the approaches of interest to the topic. When such an opportunity to have many measurements on several subjects arises, one of the typical goals one has in mind is to classify subjects on the basis of a hopefully reduced meaningful subset of the measured variables. The complexity of the problem makes it worthwhile to resort to automatic classification procedures. A quite general data-mining approach that proved to be useful also in this context is described elsewhere in this encyclopedia (Liberati, 2004), where different techniques also are referenced, and where a clustering approach to piecewise affine model identification also is reported. In this contribution, we will resort to a different, recently developed unsupervised clustering approach, the PDDP algorithm, proposed in Boley (1998). According to the analysis provided in Savaresi & Boley (2004), PDDP is able to provide a significant improvement of the performances of a classical k-means approach (Hand et al., 2001; MacQueen, 1967), when PDDP is used to initialize the k-means clustering procedure. Such cascading of PDDP and k-means was, in fact, already successfully applied in a totally different context for analyzing the data regarding a large virtual community of Internet users (Garatti et al., 2004).
The approach taken herein may be summarized in the following four steps, the third of which is the core of the method, while the first two constitute a preprocessing phase useful to ease the following task, and the fourth one a post-processing designed to focus back on the original variables, found to be meaningful after the transforms operated in the previous steps:
1. A first pruning of genes not likely to be significant for the final classification is performed on the basis of their small intersubject variance, thus reducing the size of the subsequently faced problem.
2. A principal component analysis defines a hierarchy in the remaining transformed orthogonal variables (a minimal code sketch of these first two steps is given after this list).
3. Finally, the clustering is obtained by means of the subsequent application of the principal direction divisive partitioning and the bisecting K-means algorithms. The classification is achieved without using a priori information on the patient’s pathology (unsupervised learning). This approach presents the advantage that it automatically highlights the (possibly unknown) patient casuistry.
4. By analyzing the obtained results, the number of genes for the detection of pathologies is further reduced, so that the classification eventually is based on a few genes only.
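For readers who wish to experiment, the two preprocessing steps can be prototyped in a few lines of NumPy. The sketch below is only an illustration under stated assumptions: the function name, the variance threshold, and the number of retained components are hypothetical choices made here, not values prescribed by the article.

```python
import numpy as np

def preprocess(X, var_threshold=0.5, n_components=10):
    """Steps 1-2 of the outlined approach (illustrative sketch).

    X: (n_subjects, n_genes) matrix of gene-expression values.
    Returns the indices of the genes kept after variance pruning and
    the principal-component scores of the subjects.
    """
    # Step 1: prune genes whose inter-subject variance falls below the threshold.
    gene_var = X.var(axis=0)
    kept = np.where(gene_var >= var_threshold)[0]
    X_kept = X[:, kept]

    # Step 2: principal component analysis via SVD of the centred data;
    # the components are ordered by decreasing inter-subject variance.
    Xc = X_kept - X_kept.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s          # subjects projected on the principal directions
    return kept, scores[:, :n_components]

# Example with random data standing in for a subjects-by-genes matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 500))
kept, scores = preprocess(X)
print(len(kept), scores.shape)
```

In practice the variance threshold is a tuning parameter, as the article discusses later for the leukemia example.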
The application of such a classification procedure is quite general, even beyond micro-array data; many problems resemble this one in statistical structure, such as prognostic factors in oncology or drug discovery, as described in Liberati (2005), but also, for instance, risk management in finance, in an apparently totally different framework. Here, results will be shown in the paradigmatic case of automatically classifying two kinds of leukemia in a few patients whose thousands of gene expressions are publicly available on the Internet (Golub et al., 1999). Our approach seems to present some advantages with respect
to the one originally obtained by Golub, et al. (1999) with a different approach, in that our classification eventually is based on a very limited number of genes without any type of a priori information. This encouraging result, together with the ones in Garatti et al. (2004) and with the theoretical considerations in Savaresi and Boley (2004), suggests that the methodology proposed in the present contribution, besides providing significant results in the presented example, is likely to be of help in (and beyond) the bioinformatics context.
BACKGROUND Among the problems to which a bioinformatics approach to micro-array data is required, the classification problems are of paramount interest: as in almost every context in which one would resort to data mining, it is often needed to be able to discriminate between two (or more) classes of subjects on the basis of a small number of the many available measured variables. For classification, a basic tool is provided by clustering procedures, which are the subject of many papers (Jain et al., 1999) and books (Duda & Hart, 1973; Hand et al., 2001; Jain & Dubes, 1998; Kaufman & Rousseeuw, 1990). As is well known, one can distinguish unsupervised procedures and supervised procedures; the former perform the classification on the sole basis of the intrinsic characteristics of the data by means of a suitable notion of distance; the latter make use of additional information on the data classification available a priori. For applications illustrative of these two approaches, the interested reader is referred to Karayiannis and Bezdek (1997), Setnes (2000), Muselli and Liberati (2002), Ferrari-Trecate, et al. (2003), and Muselli and Liberati (2000).
The leukemia dataset, chosen as a paradigmatic example to illustrate the classification performances of the algorithm proposed here, is often used as a test bed in bioinformatics. For example, it was treated in Golub, et al. (1999) by resorting to a supervised approach and in De Moor, et al. (2003) by the k-means technique alone; in the latter paper, no final results are available in order to make a direct comparison; this may be due to the fact that k-means alone is sensitive to initialization, while our preprocessing via PDDP provides a unique initialization to k-means, as shown in Savaresi and Boley (2004), where it is also discussed that the cascade of the two algorithms outperforms each one alone.
MAIN THRUST Our four-step data analysis can be outlined as follows:
1. Variance Analysis: The variance of the expression value is computed for each gene across the patients in order to have a first indicator of the relative intersubject expression variability and to reject those genes whose variability is below a defined threshold. The idea behind this is that if the variability of a gene expression over the subjects is small, then that gene does not detect any variability and, hence, is not useful for classification.
2. Principal Component Analysis: Principal Component Analysis (O’Connel, 1974; Hand et al., 2001) is a multivariate analysis designed to select the linear combinations of variables with higher intersubject covariances; such combinations are the most useful for classification. More precisely, PCA returns a new set of orthogonal coordinates of the data space,
Table 1. PDDP clustering algorithm
Step 1. Compute the centroid w of S.
Step 2. Compute an auxiliary matrix S̃ as S̃ = S − e w^T, where e is the N-dimensional vector of ones (i.e., e = [1, 1, 1, …, 1]^T).
Step 3. Compute the Singular Value Decomposition (SVD) of S̃: S̃ = UΣV^T, where Σ is a diagonal N × p matrix, and U and V are orthonormal unitary square matrices whose dimensions are N × N and p × p, respectively (Golub & van Loan, 1996).
Step 4. Take the first column vector of V (i.e., v = V_1), and divide S = [x_1, x_2, …, x_N]^T into two subclusters, S_L and S_R, according to the following rule: x_i ∈ S_L if v^T(x_i − w) ≤ 0; x_i ∈ S_R if v^T(x_i − w) > 0.
Table 2. Bisecting K-means algorithm
Step 1. (Initialization). Select two points in the data domain space (i.e., c_L, c_R ∈ ℜ^p).
Step 2. Divide S = [x_1, x_2, …, x_N]^T into two subclusters, S_L and S_R, according to the following rule: x_i ∈ S_L if ‖x_i − c_L‖ ≤ ‖x_i − c_R‖; x_i ∈ S_R if ‖x_i − c_L‖ > ‖x_i − c_R‖.
Step 3. Compute the centroids w_L and w_R of S_L and S_R.
Step 4. If w_L = c_L and w_R = c_R, stop. Otherwise, let c_L := w_L, c_R := w_R and go back to Step 2.
where such coordinates are ordered in decreasing order of intersubject covariance.
3. Clustering: Unsupervised clustering is performed via the cascade of a non-iterative technique, the Principal Direction Divisive Partitioning (PDDP) (Boley, 1998), based upon singular value decomposition (Golub & van Loan, 1996), and the iterative centroid-based divisive algorithm K-means (MacQueen, 1967). Such a cascade, with the clusters obtained via PDDP used to initialize the K-means centroids, is shown to achieve the best performances in terms of both quality of the partition and computational effort (Savaresi & Boley, 2004). The whole dataset thus is bisected into two clusters, with the objective of maximizing the distance between the two clusters and, at the same time, minimizing the distance among the data points lying in the same clusters. These two algorithms are recalled in Tables 1 and 2 (a compact code sketch of the cascade is given after this list). In both tables, the input is an N × p matrix S, where the data for each subject are the rows of the matrix, and the outputs are the two matrices S_L and S_R, each one representing a cluster. Both algorithms are based on the quantity w = (1/N) ∑_{i=1}^{N} x_i, where the x_i’s are the rows of S, and where w is the average of the data samples and is called the centroid of S.
4. Gene Pruning: The previous procedure is complemented with an effective gene-pruning technique in order to detect a few genes responsible for each pathology. In fact, from each identified principal component, many genes may be involved. Only the one(s) influencing each selected principal component the most are kept.
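As a concrete, minimal sketch of step 3 (the PDDP/bisecting K-means cascade recalled in Tables 1 and 2), the following NumPy code performs one PDDP bisection and then refines it with the two-centroid K-means iteration. It is an illustration written for this discussion, not the authors’ original implementation, and it assumes that both sub-clusters remain non-empty.

```python
import numpy as np

def pddp_split(S):
    """One PDDP bisection (Table 1): split along the first principal direction."""
    w = S.mean(axis=0)                     # Step 1: centroid of S
    S_tilde = S - w                        # Step 2: auxiliary matrix S~ = S - e w^T
    _, _, Vt = np.linalg.svd(S_tilde, full_matrices=False)   # Step 3: SVD of S~
    v = Vt[0]                              # Step 4: first column of V
    return S_tilde @ v <= 0                # True where v^T (x_i - w) <= 0

def bisecting_kmeans(S, c_L, c_R, max_iter=100):
    """Two-centroid K-means (Table 2), started from the points c_L and c_R."""
    for _ in range(max_iter):
        d_L = np.linalg.norm(S - c_L, axis=1)   # Step 2: assign to the nearest point
        d_R = np.linalg.norm(S - c_R, axis=1)
        left = d_L <= d_R
        # Step 3: recompute the centroids (assumes neither cluster becomes empty).
        w_L, w_R = S[left].mean(axis=0), S[~left].mean(axis=0)
        if np.allclose(w_L, c_L) and np.allclose(w_R, c_R):   # Step 4: stopping test
            break
        c_L, c_R = w_L, w_R
    return left

def pddp_then_kmeans(S):
    """Cascade suggested by Savaresi & Boley (2004): PDDP initializes K-means."""
    left = pddp_split(S)
    c_L, c_R = S[left].mean(axis=0), S[~left].mean(axis=0)
    return bisecting_kmeans(S, c_L, c_R)

# Toy usage with two synthetic groups of "subjects" described by 20 variables.
rng = np.random.default_rng(0)
S = np.vstack([rng.normal(-2, 1, (23, 20)), rng.normal(2, 1, (49, 20))])
labels = pddp_then_kmeans(S)
print(labels.sum(), (~labels).sum())       # sizes of the two clusters
```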
A Paradigmatic Example The Leukemia Classification: Data were taken from a public repository often adopted as a reference benchmark (Golub et al., 1999) in order to test new classification techniques and compare the various methodologies with each other. Our database contained gene expression data over 72 subjects, relying on 7,129 genes. Of the 72 subjects, 47 are cases of acute lymphoblastic leukemia (ALL), while the remaining 25 are cases of acute myeloid leukemia (AML). An experimental bottleneck in this kind of experiment is the difficulty in collecting a high number of homogeneous subjects in each class of interest, making the classification problem even harder; not only is a big matrix involved, but such a matrix has a huge number of variables (7,129 genes) with only a very poor number of samples (72 subjects). The cutoff on lower inter-subject gene variance thus is implemented in order to limit the number of genes in the subsequent procedure. The result of the variance analysis for the 7,129 genes shows that the variance is small for thousands of genes. Having selected a suitable threshold, 6,591 genes were pruned from the very beginning. So, attention has been focused on 178 genes only. Of course, the choice of the cutoff level is a tuning parameter of the algorithm. The adopted level may be decided on the basis of a combination of biological considerations, if it is known under which level the variance should be considered of little significance; technological knowledge, when assessing how accurate the micro-array measurements can be; and empirical considerations, by imposing either a maximum number of residual variables or a minimum fraction of variance with respect to the maximum one. Then, the remaining phases of the outlined procedure have been applied. In this way, the set of 72 subjects has been subdivided into two subsets containing 23 and 49
Table 3. The seven genes discriminating AML from ALL
1. FTL Ferritin, light polypeptide (M11147_at)
2. MPO Myeloperoxidase (M19507_at)
3. CST3 Cystatin C (M27892_at)
4. Azurocidin gene (M96326_rna1_at)
5. GPX1 Glutathione peroxidase 1 (Y00433_at)
6. INTERLEUKIN-8 PRECURSOR (Y00787_s_at)
7. VIM Vimentin (Z19554_s_at)
patients, respectively. As already said, this partitioning has been obtained without exploiting a priori information on the pathology of the patients (i.e., ALL or AML). Interestingly, all 23 subjects of the smaller cluster turn out to be affected by the AML pathology. Thus, the only error of our unsupervised procedure consists in the misclassification of two AML patients, erroneously grouped in the bigger cluster, together with the remaining 47 subjects affected by the ALL pathology. Thus, the misclassification percentage is 2/72 < 3%. In addition, it should be pointed out that the final gene-pruning step leads to a very small number of significant genes; namely, only seven genes, as listed in Table 3. Our results outperform the original ones of Golub, et al. (1999): using a supervised tool, and thus splitting the 72 patients into 38 training samples and 34 testing samples, they correctly classified 29 (about 85%) of the 34 test subjects with as many as 50 genes, three of which (Cystatin C, Azurocidin, and Interleukin-8 precursor) are also among the seven sufficient in our approach. A possible interpretation is that the three genes within the intersection of the two subsets probably are really determinant, while the complementing four genes identified by the procedure proposed in this article discriminate better than the complementing 43 in the subset of Golub, et al. (1999).
FUTURE TRENDS The proposed approach is now under application in other similar contexts. The fact that a combination of different approaches, taken from partially complementary disciplines, proves to be effective may indicate a fruitful direction in combining in different ways classical and new approaches to improve classification.
CONCLUSION The proposed clustering algorithm is effective for the discrimination of the two kinds of leukemia of the consid-
ered dataset on the basis of an extremely limited number of patients. The unsupervised nature of the presented approach enables the classification without any knowledge on the pathologies of the patients. Also, it does not require the subdivision of the data into a training set and a testing set. The proposed approach is very general and is not limited to the bioinformatics field. For instance, it already was used successfully for analyzing the data regarding a large virtual community of Internet users (Garatti et al., 2004).
ACKNOWLEDGMENTS This article was supported by CNR-IEIIT. The authors would also like to thank Andrea Maffezzoli, who was in charge of the computation in fulfillment of his master’s thesis at Milan Institute of Technology. Three anonymous reviewers are also gratefully acknowledged for indicating how to improve the writing.
REFERENCES
Boley, D.L. (1998). Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 325-344.
De Moor, B., Marchal, K., Mathys, J., & Moreau, Y. (2003). Bioinformatics: Organism from Venus, technology from Jupiter, algorithms from Mars. European Journal of Control, 9(2-3), 237-278.
Duda, R.O., & Hart, P.E. (1973). Pattern classification and scene analysis. New York: Wiley.
Ferrari-Trecate, G., Muselli, M., Liberati, D., & Morari, M. (2003). A clustering technique for the identification of piecewise affine systems. Automatica, 39, 205-217.
Garatti, S., Savaresi, S., & Bittanti, S. (2004). On the relationships between user profiles and navigation sessions in virtual communities: A data-mining approach. Intelligent Data Analysis.
Golub, G.H., & van Loan, C.F. (1996). Matrix computations. Baltimore: Johns Hopkins University Press.
Golub, T.R. et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression. Science, 286, 531-537.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data-mining. Cambridge, MA: MIT Press.
Hardimann, G. (2003). Microarrays methods and applications: Nuts & bolts. Eagleville, PA: DNA Press.
Jain, A., & Dubes, R. (1998). Algorithms for clustering data. London: Sage Publications.
Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264-323.
Karayiannis, N.B., & Bezdek, J.C. (1997). An integrated approach to fuzzy learning vector quantization and fuzzy C-means clustering. IEEE Trans. Fuzzy Systems, 5, 622-628.
Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis. New York: Wiley.
Knudsen, S. (2004). Guide to analysis of DNA microarray data. New York, NY: John Wiley & Sons.
Liberati, D. (2005). Model identification through data mining. In J. Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California.
Muselli, M., & Liberati, D. (2000). Training digital circuits with Hamming clustering. IEEE Transactions on Circuits and Systems—I: Fundamental Theory and Applications, 47, 513-527.
Muselli, M., & Liberati, D. (2002). Binary rule generation via Hamming clustering. IEEE Transactions on Knowledge and Data Engineering, 14, 1258-1268.
O’Connel, M.J. (1974). Search program for significant variables. Comp. Phys. Comm., 8, 49.
Savaresi, S.M., & Boley, D.L. (2004). A comparative analysis on the bisecting K-means and the PDDP clustering algorithms. International Journal on Intelligent Data Analysis.
Setnes, M. (2000). Supervised fuzzy clustering for rule extraction. IEEE Trans. Fuzzy Systems, 8, 416-424.

KEY TERMS
Bioinformatics: The processing of the huge amount of information pertaining to biology.
Discriminant Variables: The information that really matters among the many apparently involved in the true core of a complex set of features.
DNA: Nucleic acid, constituting the genes, codifying proteins.
Gene Expression: The proteins actually produced in the specific cell by the individual.
K-Means: Iterative clustering technique subdividing the data in such a way as to maximize the distance among centroids of different clusters, while minimizing the distance among data within each cluster. It is sensitive to initialization.
Leukemia: Blood disease affected by genetic factors.
Micro-Arrays: Chips where thousands of gene expressions may be obtained from the same biological cell material.
PDDP (Principal Direction Divisive Partitioning): One-shot clustering technique based on principal component analysis and singular value decomposition of the data, thus partitioning the dataset according to the direction of maximum variance of the data. It is used here in order to initialize K-means.
Principal Component Analysis: Rearrangement of the data matrix in new orthogonal transformed variables ordered in decreasing order of variance.
Singular Value Decomposition: Algorithm able to compute the eigenvalues and eigenvectors of a matrix; also used to perform principal component analysis.
Unsupervised Clustering: Automatic classification of a dataset in two or more subsets on the basis of the intrinsic properties of the data without taking into account further contextual information.
Use of RFID in Supply Chain Data Processing
Jan Owens, University of Wisconsin-Parkside, USA
Suresh Chalasani, University of Wisconsin-Parkside, USA
Jayavel Sounderpandian, University of Wisconsin-Parkside, USA
INTRODUCTION The use of Radio Frequency Identification (RFID) is becoming prevalent in supply chains, with large corporations such as Wal-Mart, Tesco, and the Department of Defense phasing in RFID requirements on their suppliers. The implementation of RFID can necessitate changes in the existing data models and will add to the demand for processing and storage capacities. This article discusses the implications of the RFID technology on data processing in supply chains.
BACKGROUND RFID is defined as the use of radio frequencies to read information on a small device known as a tag (Rush, 2003). A tag is a radio frequency device that can be read by an RFID reader from a distance, when there is no obstruction or misorientation. A tag affixed to a product flowing through a supply chain will contain pertinent information about that product. There are two types of tags: passive and active. An active tag is powered by its own battery, and it can transmit its ID and related information continuously. If desired, an active tag can be programmed to be turned off after a predetermined period of inactivity. Passive tags receive energy from the RFID reader and use it to transmit their ID to the reader. The reader then may send
Figure 1. Reading ID information from an RFID tag [diagram: (1) the RFID tag receives energy from the RFID reader; (2) the tag transmits its ID information to the reader; (3) the reader sends the ID information to the host system]
the data to a host system for processing. Figure 1 depicts the activity of reading the ID from a passive tag by an RFID reader (Microlise, 2003). The ID in the above discussion is a unique ID that identifies the product, together with its manufacturer. MIT’s Auto-ID Center proposed the Electronic Product Code (EPC) that serves as the ID on these tags (Auto-ID Technology Guide, 2002). EPC can be 64 bits or 96 bits long. However, EPC formats allow the length of the EPC to be extended in future. Auto-ID center envisions RFID tags constituting an Internet of things. RFID tag information is generated based on events such as a product leaving a shelf or being checked out by a customer at a (perhaps automatic) checkout counter. Such events or activities generate data for the host system shown in Figure 1. The host system, when it processes these data, in turn may generate more data for other partners in the supply chain. Our focus in this article is to study the use of RFID in supply chains.
MAIN THRUST This article explores the data generated by RFID tags in a supply chain and where this data may be placed in the data warehouse. In addition, this article explores the acceptance issues of RFID tags to businesses along the supply chain and to consumers.
Types of Data Generated by RFID Tags The widespread use of the Internet has prompted companies to manage their supply chains using the Internet as the enabling technology (Gunasekaran, 2001). Internet-based supply chains can reduce the overall cost of managing the supply chains, thus allowing the partners to spend more money and effort on innovative research and product development (Grosvenor & Austin, 2001; Hewitt, 2001). Internet-based supply chains also allow smaller companies to thrive without massive physical
Figure 2. Interaction between a retailer and a supplier in a supply chain [diagram: retailer locations 1 through L are connected by an intranet to the retailer’s central server with inventory information, which communicates with the supplier’s information system over a VPN]
Figure 3. Processing information from operational data stores (ODS) to an enterprise data warehouse (EDWH) and to data marts (DM) [diagram: the location ODSs feed the retailer’s enterprise operational data store (EODS); ETL (extract, transform, load) populates the EDWH, which is loaded into the data marts]
infrastructures. The impact of RFID on a retailer-supplier interaction in the supply chain is discussed below. The information system model for communication between a retailer and a supplier is shown in Figure 2. The retailer is assumed to have several locations, each equipped with RFID readers and RFID-tagged items. Each location has its own computer system comprising a local database of its inventory and application programs that process data from the RFID readings. The complete inventory information for the retailer is maintained at a central location comprising high-end database and application servers (Chalasani & Sounderpandian, 2005). Computer systems at retail locations are interconnected to the central inventory server of the retailer by the company’s intranet. Reordering of inventory from the supplier, once inventory levels fall below the reorder points, takes place by communication of specific messages between the retailer’s information systems and the supplier’s information system (Ranganathan, 2003). Any such communication is facilitated by a virtual private network (VPN), to which the retailer and the supplier subscribe (Weisberg, 2002). For large retailers, such as Wal-Mart, each location communicating with one central server is impractical. In such cases, a hierarchical model of interconnected servers, where each server serves a regional group of locations, is more practical (Prasad & Sounderpandian, 2003). In this article, for the sake of simplicity, we assume a flat hierarchy. The ideas developed in this article can be extended and applied to hierarchical models as well. RFID readings and the transactions that may be triggered upon processing the readings are classified by Chalasani and Sounderpandian (2004).
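As a small illustration of the reordering logic described above, the sketch below shows how inventory counts derived from RFID readings at a location could trigger reorder messages destined for the supplier’s information system. The field names and message format are hypothetical choices made here; the article does not specify them.

```python
from dataclasses import dataclass

@dataclass
class InventoryRecord:
    product_epc: str      # EPC read from the RFID tags of this product
    on_hand: int          # current count derived from tag reads
    reorder_point: int
    reorder_quantity: int

def reorder_messages(records):
    """Build reorder messages for the supplier (hypothetical message format)."""
    return [
        {"product_epc": r.product_epc, "quantity": r.reorder_quantity}
        for r in records
        if r.on_hand < r.reorder_point
    ]

# Example: one product has fallen below its reorder point.
stock = [
    InventoryRecord("urn:epc:1", on_hand=12, reorder_point=20, reorder_quantity=100),
    InventoryRecord("urn:epc:2", on_hand=55, reorder_point=20, reorder_quantity=100),
]
print(reorder_messages(stock))   # -> [{'product_epc': 'urn:epc:1', 'quantity': 100}]
```

In the model of Figure 2, such messages would be exchanged over the VPN linking the retailer’s central server and the supplier’s information system.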
Placing RFID Data in an Enterprise Data Warehouse System Data warehouse systems at the retailer and the supplier should be able to handle the information generated by the transactions described previously. Figure 3 presents
a typical data warehouse system at the retailer’s central location. The operational data store (ODS) at each retailer’s location stores the data relevant to that location. The data from different retailer ODSs is combined together to obtain an enterprise operational data store (EODS). The process commonly referred to as ETL (extract → transform → load) is applied to the EODS data, and the resulting data is loaded into the enterprise data warehouse (EDWH). The data from the EDWH are then sliced along several dimensions to produce data for the data marts. The transactions described in the previous section are handled by several tables in the ODS and the EODS databases. These tables are depicted in Figure 4. The reader table contains the Reader_ID for each RFID reader. This reader ID is the primary key in this table. In addition, it contains the location of the reader. Reader_Location often is a composite attribute containing the aisle and shelf and other data that precisely identify the location of the reader. The product table has several attributes pertaining to the product, such as the product description. The primary key in the product table is the Product_EPC, which is the electronic product code (EPC) that uniquely identifies each product and is embedded in the RFID tag. The transaction type table is a lookup table that assigns transaction codes to each type of transaction (e.g., point of sale or shelf replenishment). Each of the tables—Reader, Product, Transaction Type—has a one-to-many relationship with the transactions table, with the many sides of the relationships ending on the transactions table. The amount of transaction data generated by RFID transactions can be estimated using a simple model. Let N be the total number of items and f be the average number of tag-reads per hour. The total number of RFID transactions is N * f per hour. If B is the number of bytes required for each entry in the transactions table, the total storage requirement in the transactions table per hour is given by N * f * B. For example, if there are
Figure 4. Tables in ODS and EODS that hold RFID transactions [entity-relationship diagram: Transaction Type (Tran_Type_Code PK, Tran_Type_Description), Product (Product_EPC PK, Product_Description, …) and Reader (Reader_ID PK, Reader_Location, …) each have a one-to-many relationship with Transactions (Tran_ID PK, Tran_Type_Code FK, Product_EPC FK, Reader_ID FK)]
100,000 items on the shelves at a retail store location, and the items are read every 15 minutes, there are 400,000 transactions every hour. In addition, if each transaction requires 256 bytes of storage, the total storage requirement per hour is 100 MB. If the store operates on average for 15 hours a day, the total storage per day is 1.5 GB. If the retailer has 1,000 locations, the total storage required at the EODS is 1.5 terabytes a day. To reduce the storage requirements, the following principles are adhered to: (1) archive the transactions data at the EODS on a daily basis; that is, move the data from the EODS transactions table to an archived table; (2) purge the transactions data from the ODS tables on a weekly basis; purging data does not cause loss of this data, since this data are already archived by the EODS; (3) calculate the summary data and write only the summary data to the EDWH.
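The N * f * B storage model and the arithmetic of the example above can be reproduced with a few lines of code. The function below is an illustrative sketch written for this discussion (the names and unit conversions are choices made here, not part of the article).

```python
def rfid_storage_estimate(n_items, reads_per_hour, bytes_per_txn,
                          hours_per_day, n_locations):
    """Storage generated under the N * f * B model described above (sketch)."""
    txns_per_hour = n_items * reads_per_hour          # N * f
    bytes_per_hour = txns_per_hour * bytes_per_txn    # N * f * B
    bytes_per_day_per_location = bytes_per_hour * hours_per_day
    eods_bytes_per_day = bytes_per_day_per_location * n_locations
    return txns_per_hour, bytes_per_hour, bytes_per_day_per_location, eods_bytes_per_day

# The example from the text: 100,000 items read every 15 minutes (4 reads/hour),
# 256 bytes per transaction, 15 operating hours per day, 1,000 retail locations.
txns, per_hour, per_day, eods = rfid_storage_estimate(100_000, 4, 256, 15, 1_000)
print(txns)               # 400,000 transactions per hour
print(per_hour / 1e6)     # ~102 MB per hour (the text rounds this to 100 MB)
print(per_day / 1e9)      # ~1.5 GB per day per location
print(eods / 1e12)        # ~1.5 TB per day at the EODS
```

Such estimates motivate the archiving, purging, and summarization principles listed above.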
Customer Acceptance of RFID Tags Customer acceptance of RFID technology and diffusion will depend on its perceived value in the supply chain. The rate of acceptance can be explained and predicted through five general product characteristics that impact this perception of value: complexity, compatibility, relative advantage, observability, and trialability (Rogers, 1983). Complexity describes the degree of difficulty involved in understanding and using a new product. The more complex the product, the more difficult it is to understand and use, and the slower will be its diffusion in the marketplace. Compatibility describes the degree to which the new product is consistent with the adopter’s existing values and product knowledge, past experiences, and current needs. Incompatible products diffuse more slowly than compatible products, as adopters must invest more time in understanding the new product and rationalizing or replacing existing operations, knowledge, and behaviors. Relative advantage describes the degree to which a product is perceived as superior to existing substitutes. A product that is notably easier to use, more
efficient, or accomplishes an objective more effectively than other alternatives provides a relative advantage. Related to this is the speed with which the anticipated benefits of the new product accrue to the adopter. Observability describes the degree to which the benefits or other results of using the product can be observed and communicated to target customers. This is the noticeable difference factor, or the ability to observe its features or performance. Trialability describes the degree to which a product can be tried on a limited basis with little or acceptable risk. Demonstrations, test-drives, limited implementation, and sampling are some of the ways that promote customer adoption and diffusion of a new product. By demonstrating the usefulness and value of the new product, potential adopters reduce the risk of making a poor adoption decision.
Producers, Manufacturers, Distributors, and Retailers The adoption of RFID technology presents some initial advantages and disadvantages for all business-to-business customers. One advantage is the potential supply chain savings, but the savings come with considerable upfront cost, even when issues of compatibility and reliability have been resolved (Mukhopadhyay & Kekre, 2002). Because buyers and sellers have somewhat different operational concerns specific to each business, the value estimation of these advantages and disadvantages may differ among supply chain partners and, thus, the relative advantage of adopting RFID. At a basic level, both retailers and suppliers will incur substantial upfront investment in readers and systems integration. However, this cost to retailers will be relatively fixed. In contrast, manufacturers have the additional unit cost of the tags affixed to pallets, cases, or items. This cost is currently prohibitive for many small-ticket, small-margin goods, particularly in high-volume manufacturing such as consumer packaged goods. Essentially, the compatibility issue will be dictated by the more powerful players in the supply chain. Even if the firm sees many obstacles in RFID adoption, powerful buyers will dictate a producer’s RFID adoption process, if the latter hopes to retain its key accounts. From a producer’s perspective, only compliance may keep its business with an important retailer or customer (i.e., there is a strong motivation to master the new technology and become compatible with new technological demands) (Pyke, 2001). Similarly, the regulatory environment is an important motivator in adoption of new technologies. Many ranchers and food producers have adopted the RFID tag to meet future tracking requirements in the European Union. Even so, the volume of new data, as well as system compatibility
issues, are not small considerations in effective implementation of RFID (Levinson, 2004.) The producers who see the relative advantage of RFID at current RFID unit prices sell higher-margin goods and containers, such as pharmaceuticals, car parts, boxcars, livestock, and apparel. Here, the high value of the item can accommodate the current cost of the RFID tags. Manufacturers that can use additional information about the product at different points in the supply chain would also understand a clear advantage of RFID over bar codes. For example, pharmaceuticals that must maintain constant temperature, time, and moisture ranges for efficacy can monitor such data through to the customer’s purchase. High-end apparel can code tags to indicate the availability of other colors or complementary products; RFID tags married to VIN numbers on automobiles can maintain a car’s manufacturing, use, and repair history. Indeed, some products such as higher-end consumer electronics see RFID as a potentially less expensive anti-theft device than the security systems they currently use (Stone, 2004). Clear identification of counterfeit goods is also a valued advantage of RFID adoption in defending proprietary designs and formulations across many product categories as well as providing a guarantee of product safety and efficacy. However, even high value-added products, such as pharmaceuticals, currently use only track-and-trace applications until EPC tag specifications can prevent one chip’s programming being copied to another, a key requirement in guaranteeing product authenticity (Whiting, 2004). In contrast, the tags are prohibitively expensive on low-ticket, low-margin items typical of consumer packaged goods. Compared to current, fairly reliable distribution systems, RFID may present less added value and efficiency to manufacturers compared to retailers. Producers are also concerned that their tags are not talking to their competitors’ readers, either in competitor stores or warehouses. Producers feel that they have more to lose in divulging inventory and marketing specifications than do retailers (Berkman, 2002). Loss of sensitive, proprietary information can be seen to increase the system’s vulnerability to competitors, a potentially prohibitive cost to its implementation. Software safeguards must be in place so that tags only can be read and changed by approved readers (Hesseldahl, 2004a.) Pilot projects will facilitate RFID trialability and observability, both within and between firms, and subsequently demonstrate comparative advantages over the prior inventory systems. Besides improved inventory monitoring and control, new RFID tags that are incorporated into the packaging also promise to reduce system costs. As major supply chain groups fine-tune the technology and share successes and improvements, and as unit costs of RFID decline, the more likely RFID will be
considered and adopted. In contrast, as glitches arise, only firms that envision the highest payoff from RFID are likely to engage in extensive pilot testing (Shi & Bennett, 2001). However, third-party mid-level firms are increasingly available to smooth the complexities of RFID implementation (Hesseldahl, 2004b.)
RFID and Consumer Concerns Consumers will not be concerned with the technical complexity and compatibility of RFID in the same way as producers and retailers, similar to their lack of concern about the technical issues of bar codes. Instead, they are more concerned about RFID compatibility with lifestyle issues and the comparative advantage of the tags in their everyday lives. Advantages can be seen in everything from assured efficacy of pharmaceuticals to quality parts assurance. New technologies may see RFID tags on shirts that program an appropriate wash cycle; that provide heating instructions from a frozen dinner to a microwave oven; that suggest complementary merchandise to an item of apparel or accessories based on past purchases; that warn of past-due shelf dates; and that can add a product to a shopping list when it is removed from the refrigerator. However, until consumers have homes with appropriately programmed appliances and other infrastructure, or retailers demonstrate value-added services based on RFID technology, observability, trialability, and compatibility will be issues in demonstrating the value of RFID to consumers. A great concern for consumers is privacy. Consumers worry about RFID’s ability in post-purchase surveillance, as it opens the door to corporate, government, and other sources of abuse. Groups such as Consumers Against Supermarket Privacy Invasion and Numbering (CASPIAN) already have organized protests against chains at the forefront of data gathering, such as information from customer loyalty cards. The clothing retailer Benetton considered putting RFID tags into some apparel, until Caspian threatened a boycott. Some consumer groups fear that a thief with an RFID reader easily could identify which homes had the more expensive TVs and other tempting targets, unless the tags are deactivated (Pruitt, 2004.)
FUTURE TRENDS Retailers have proposed various solutions to privacy concerns. A customer may deactivate the tags when exiting a store, but this gets cumbersome for large shopping trips. Furthermore, deactivation could block useful data in third-party situations, such as dietary information and allergy alerts. Other suggestions have
included special carry bags that block the RFID signal, but this is often impractical or ineffective in home storage for many items. However, it is more feasible for small, personally sensitive items such as prescription medications. Expiring product signals may defeat the purpose of accurately identifying product returns or recalls, unless the tag can be reactivated.
CONCLUSION To date, retailers have been the driving forces behind the adoption of RFID technology. Retailers who have a very wide variety of goods, such as large general merchandisers like Wal-Mart and grocery chains such as Metro, can observe much improvement in inventory systems using RFID compared to bar-code technology. Yet, in their current form, RFID systems are complex to install and incompatible with current systems, plant, and equipment. Furthermore, staff will have to be retrained in order to extract the full value from RFID technology, including the store clerk who can search for additional inventory in an off-site warehouse.
Hesseldahl, A. (2004b). Master of the RFID universe. Forbes.com. Retrieved from http://www.forbes.com/manufacturing/2004/06/29/cx_ah_0629rfid.html Hewitt, F. (2001). After supply chains, think demand pipelines. Supply Chain Management Review, 5(3), 28-41. Levinson, M. (2004). The RFID imperative. CIO.com. Retrieved from http://www.cio.com.au/pp.php?id= 557782928&fp=2&fpid=2 Microlise. (2003). White Paper on RFID Tagging Technology. Mukhopadhyay, T., & Kekre, S. (2002). Strategic and operational benefits of electronic integration. Management Science, 48(10), 1301-1313. Prasad, S., & Sounderpandian, J. (2003). Factors influencing global supply chain efficiency: Implications for information systems. Supply Chain Management: An International Journal, 8(3), 241-250. Pruitt, S. (2004). RFID: Is big brother watching? Infoworld. Retrieved from http://www.inforworld.com/ article/04/03/19/HNbigbrother_1.html
REFERENCES
Pyke, D. et al. (2001). e-Fulfillment: It’s harder than it looks. Supply Chain Management Review, 5(1), 26-33.
Auto-Id Technology Guide. (2002). MIT Auto ID Center, Cambridge, MA.
Ranganathan, C. (2003). Evaluating the options for business-to-business e-commerce. In C.V. Brown (Ed.), Information systems management handbook. New York, NY: Auerbach Publications.
Berkman, E. (2002). How to practice safe B2B: Before swapping information with multiple e-commerce partners, it pays to protect yourself by pushing partners to adopt better security practices. CIO, 15(17), 1-5. Chalasani, S., & Sounderpandian, J. (2004). RFID for retail store information systems. Proceedings of the Americas Conference on Information Systems (AMCIS 2004), New York. Chalasani, S., & Sounderpandian, J. (2005). Performance benchmarks and cost sharing models for B2B supply chain information systems [to appear in Benchmarking: An International Journal].
Rogers, E.M. (1983). Diffusion of innovations. New York: The Free Press. Rush, T. (2003). RFID in a nutshell—A primer on tracking technology. UsingRFID.com. Retrieved from http:// usingrfid.com/features/read.asp?id=2 Shi, N., & Bennett, D. (2001). Benchmarking for information systems management using issues framework studies: Content and methodology. Benchmarking: An International Journal, 8(5), 358-375.
Grosvenor, F., & Austin, T.A. (2001). Cisco’s eHub initiative. Supply Chain Management Review, 5(4), 28-35.
Stone, A. (2004). Stopping sticky fingers with tech. BusinessWeekOnline. Retrieved from http:// www.businessweek.com/technology/content/aug2004/ tc20040831_9087_tc172.htm
Gunasekaran, A. (2001). Editorial: Benchmarking tools and practices for twenty-first century competitiveness. Benchmarking: An International Journal, 8(2), 86-87.
Thomson, I. (2004). Privacy fears haunt RFID rollouts. CRM Daily. Retrieved from http://wireless.newsfactor.com/ story.xhtml?story_id=23471
Hesseldahl, A. (2004a). A hacker’s guide to RFID. Forbes.com. Retrieved from http://www.forbes.com/ commerce/2004/07/29/cx_ah_0729rfid.html
Weisberg, D. (2002). Virtual private exchanges change eprocurement. Information Executive, 6(1), 4-5.
Whiting, R. (2004). RFID to flourish in pharmaceutical industry. InformationWeek. Retrieved from http:// infomrationweek.com/story/showArticle.jhtml? article!ID=29116923
KEY TERMS
Active Tag: An active tag is powered by its own battery, and it can transmit its ID and related information continuously.
Auto Id Technology: A precursor to the RFID technology that led to the definitions of RFID technology, including EPC.
Compatibility: Describes the degree to which the new product is consistent with the adopter’s existing values and product knowledge, past experiences, and current needs.
Electronic Product Code (EPC): Uniquely identifies each product and is normally a 128-bit code. It is embedded in the RFID tag of the product.
Observability: The degree to which the benefits or other results of using the product can be observed and communicated to target customers.
Passive Tag: Receives energy from the RFID reader and then transmits its ID to the reader.
RFID: Radio Frequency Identification, defined as the use of radio frequencies to read information on a small device known as a tag.
Trialability: The degree to which a product can be tried on a limited basis.
Using Dempster-Shafer Theory in Data Mining
Malcolm J. Beynon, Cardiff University, UK
INTRODUCTION The origins of Dempster-Shafer theory (DST) go back to the work by Dempster (1967), who developed a system of upper and lower probabilities. Following this, his student Shafer (1976), in his book “A Mathematical Theory of Evidence”, added to Dempster’s work, including a more thorough explanation of belief functions. In summary, it is a methodology for evidential reasoning that manipulates uncertainty and is capable of representing partial knowledge (Haenni & Lehmann, 2002; Kulasekere, Premaratne, Dewasurendra, Shyu, & Bauer, 2004; Scotney & McClean, 2003). The perception of DST as a generalization of Bayesian theory (Shafer & Pearl, 1990) identifies its subjective view: simply, the probability of an event indicates the degree to which someone believes it. This is in contrast to the alternative frequentist view, understood through the “Principle of Insufficient Reason”, whereby in a situation of ignorance a Bayesian approach is forced to evenly allocate subjective (additive) probabilities over the frame of discernment. The development of DST includes analogies to rough set theory (Wu, Leung, & Zhang, 2002) and its operation within neural and fuzzy environments (Binaghi, Gallo, & Madella, 2000; Yang, Chen, & Wu, 2003). Techniques based around belief decision trees (Elouedi, Mellouli, & Smets, 2001), multi-criteria decision making (Beynon, 2002) and non-parametric regression (Petit-Renaud & Denœux, 2004) utilize DST to allow analysis in the presence of uncertainty and imprecision. This is demonstrated with the CaRBS (Classification and Ranking Belief Simplex) system for object classification; see Beynon (2005).
BACKGROUND The terminology inherent within DST starts with a finite set of hypotheses Θ (the frame of discernment). A basic probability assignment (bpa) or mass value is a function m: 2^Θ → [0, 1] such that m(∅) = 0 (∅ the empty set) and ∑_{A ∈ 2^Θ} m(A) = 1 (2^Θ being the power set of Θ). If the constraint m(∅) = 0 is not imposed, then the transferable belief model can be adopted (Elouedi, Mellouli, & Smets, 2001; Petit-Renaud & Denœux, 2004). Any A ∈ 2^Θ for which m(A) is non-zero is called a focal element and represents the exact belief in the proposition depicted by A. From one source of evidence, a set of focal elements and their mass values can be defined as a body of evidence (BOE). Based on a BOE, a belief measure is a function Bel: 2^Θ → [0, 1], defined by Bel(A) = ∑_{B ⊆ A} m(B), for all A ⊆ Θ. It represents the confidence that a specific proposition lies in A or any subset of A. A plausibility measure is a function Pls: 2^Θ → [0, 1], defined by Pls(A) = ∑_{A ∩ B ≠ ∅} m(B), for all A ⊆ Θ. Clearly Pls(A) represents the extent to which we fail to disbelieve A. These measures are clearly related to one another: Bel(A) = 1 − Pls(¬A) and Pls(A) = 1 − Bel(¬A), where ¬A refers to the complement ‘not A’. To collate two or more sources of evidence (e.g., m1(⋅) and m2(⋅)), DST provides a method to combine them, using Dempster’s rule of combination. If m1(⋅) and m2(⋅) are independent BOEs, then the function m1 ⊕ m2: 2^Θ → [0, 1] is defined by
[m1 ⊕ m2](y) = 0 if y = ∅, and [m1 ⊕ m2](y) = (1 − k)^(−1) ∑_{A ∩ B = y} m1(A) m2(B) if y ≠ ∅,
where k = ∑_{A ∩ B = ∅} m1(A) m2(B), and [m1 ⊕ m2](y) is a mass value with y ⊆ Θ. The term (1 − k) can be interpreted as a measure of conflict between the sources. It is important to take this value into account for evaluating the quality of combination: when it is high, the combination may not make sense and possibly lead to questionable decisions (Murphy, 2000). To demonstrate the utilization of DST, the example of the murder of Mr. Jones is considered, where the murderer was one of three assassins, Peter, Paul and Mary, giving the frame of discernment Θ = {Peter, Paul, Mary}. There are two witnesses. Witness 1 is 80% sure that it was a man; the concomitant BOE, defined m1(⋅), includes m1({Peter, Paul}) = 0.8. Since we know nothing about the remaining mass value, it is allocated to Θ, m1({Peter, Paul, Mary}) = 0.2. Witness 2 is 60% confident that Peter was leaving on a jet plane when the murder occurred; this BOE, defined m2(⋅), includes m2({Paul, Mary}) = 0.6 and m2({Peter, Paul, Mary}) = 0.4. The aggregation of these two sources of information, using Dempster’s combination rule, is based on the inter-
Defining this combined BOE m3(⋅), it is found that m3({Paul}) = 0.48, m3({Peter, Paul}) = 0.32, m3({Paul, Mary}) = 0.12 and m3({Peter, Paul, Mary}) = 0.08. This combined evidence has a more spread-out allocation of mass values to varying subsets of Θ, and there is a general reduction in the level of ignorance associated with the combined evidence. In the case of the belief (Bel) and plausibility (Pls) measures, for the subset {Peter, Paul}, Bel3({Peter, Paul}) = 0.8 and Pls3({Peter, Paul}) = 1.0.

A second, larger example supposes that the weather in New York at noon tomorrow is to be predicted from the weather today. We assume that it is in exactly one of three states: dry (D), raining (R) or snowing (S); hence the frame of discernment is Θ = {D, R, S}. Suppose two pieces of evidence have been gathered: i) the temperature today is below freezing, and ii) the barometric pressure is falling, i.e., a storm is likely. These pieces of evidence are represented by the two BOEs mfreeze(⋅) and mstorm(⋅), respectively, reported in Table 1. For each BOE in Table 1 the exact belief (mass) is distributed among the focal elements (excluding ∅). For mfreeze(⋅), greater mass is assigned to {S} and {R, S}; for mstorm(⋅), greater mass is assigned to {R} and {R, S}. Assuming that mfreeze(⋅) and mstorm(⋅) represent independent evidence, the BOE resulting from their combination, defined mboth(⋅), is made up of the mass values reported in Table 2. The BOE mboth(⋅) in Table 2 has a lower level of ignorance (mboth(Θ) = 0.0256) than both of the original BOEs mfreeze(⋅) and mstorm(⋅). Amongst the other focal elements, more mass is assigned to {R} and {S}, a consequence of the greater mass assigned to the associated focal elements in the two original BOEs; the other focal elements all exhibit net losses in their mass values. As with the assassin example, measures of belief (Bel) and plausibility (Pls) could be found to offer evidence (confidence) on combinations of states representing tomorrow's predicted weather.

This section is closed with some cautionary words, still true to this day (Pearl, 1990): 'Some people qualify any model that uses belief functions as Dempster-Shafer. This might be acceptable provided they did not blindly accept the applicability of Dempster's rule of combination (and others). Such critical - and in fact often inappropriate - use of these rules explain most of the errors encountered in the so-called Dempster-Shafer literature'.
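To make the mechanics of the combination rule concrete, the following short Python sketch (an addition, not part of the original article) implements Dempster's rule for BOEs represented as dictionaries of focal elements, and reproduces the assassin example above, including the belief and plausibility values for {Peter, Paul}.

from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two BOEs given as dicts
    mapping frozenset focal elements to mass values."""
    unnormalised = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            unnormalised[inter] = unnormalised.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb          # k, the mass on conflicting pairs
    return {y: mass / (1.0 - conflict) for y, mass in unnormalised.items()}, conflict

def belief(m, a):
    return sum(v for s, v in m.items() if s <= a)

def plausibility(m, a):
    return sum(v for s, v in m.items() if s & a)

theta = frozenset({"Peter", "Paul", "Mary"})
m1 = {frozenset({"Peter", "Paul"}): 0.8, theta: 0.2}   # witness 1
m2 = {frozenset({"Paul", "Mary"}): 0.6, theta: 0.4}    # witness 2
m3, k = combine(m1, m2)
for focal, mass in sorted(m3.items(), key=lambda kv: -kv[1]):
    print(sorted(focal), round(mass, 2))   # {Paul} 0.48, {Peter, Paul} 0.32, ...
print("conflict k =", k)                   # 0.0 here, so no normalisation is needed
pp = frozenset({"Peter", "Paul"})
print(round(belief(m3, pp), 2), round(plausibility(m3, pp), 2))   # 0.8 1.0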
MAIN THRUST

This section outlines one of the many different methods that utilize DST within a data mining environment. The CaRBS system is a data mining technique for the classification (and subsequent prediction) of objects to a given hypothesis (x) and its complement (¬x), using a series of characteristic values (Beynon, 2005). The rudiments of CaRBS are based on DST and, with two exhaustive outcomes, it works on a binary frame of discernment (BFOD). The aim of CaRBS is therefore to construct, for each characteristic value, a BOE expressing the evidential support for the classification of an object to {x}, {¬x} and the concomitant ignorance {x, ¬x} (considered as uncertainty in subjective judgements); see Figure 1. In Figure 1, stage a) shows the transformation of a characteristic value vj,i (jth object, ith characteristic) into a confidence value cfi(vj,i), using a sigmoid function with control variables ki and θi. Stage b) transforms cfi(vj,i) into a characteristic BOE mj,i(⋅), made up of the three mass values mj,i({x}), mj,i({¬x}) and mj,i({x, ¬x}), which, following Gerig, Welti, Guttman, Colchester, & Szekely (2000), are defined by:

mj,i({x}) = (Bi / (1 − Ai)) cfi(vj,i) − Ai Bi / (1 − Ai),

mj,i({¬x}) = (−Bi / (1 − Ai)) cfi(vj,i) + Bi,
and mj,i({x, ¬x}) = 1 − mj,i({x}) − mj,i({¬x}), where Ai and Bi are two further control variables. When either mj,i({x}) or mj,i({¬x}) is negative, it is set to zero before the calculation of mj,i({x, ¬x}).
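As a minimal illustration of stages a) and b), the following Python sketch (an addition, using the mass expressions as reconstructed above and purely illustrative values for the control variables ki, θi, Ai and Bi rather than values from the article) builds a single characteristic BOE from a characteristic value.

import math

def characteristic_boe(v, k, theta, A, B):
    """Stages a) and b) of CaRBS for one characteristic value v:
    a sigmoid confidence value cf(v), then the three mass values of the
    characteristic BOE (negative masses are truncated to zero before the
    ignorance mass is computed)."""
    cf = 1.0 / (1.0 + math.exp(-k * (v - theta)))
    m_x    = max(0.0, (B / (1.0 - A)) * cf - (A * B) / (1.0 - A))
    m_notx = max(0.0, (-B / (1.0 - A)) * cf + B)
    m_ign  = 1.0 - m_x - m_notx              # mass on {x, not x}
    return m_x, m_notx, m_ign

# Illustrative control-variable values (not taken from the article):
print(characteristic_boe(v=0.7, k=2.0, theta=0.0, A=0.4, B=0.5))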
Table 1. Mass values and focal elements for mfreeze(⋅) and mstorm(⋅)

BOE          ∅     {D}   {R}   {S}   {D, R}   {D, S}   {R, S}   Θ
mfreeze(⋅)   0.0   0.1   0.1   0.2   0.1      0.1      0.2      0.2
mstorm(⋅)    0.0   0.1   0.2   0.1   0.1      0.1      0.3      0.1
Table 2. Mass values and focal elements for mboth(⋅)

BOE         ∅     {D}      {R}      {S}      {D, R}   {D, S}   {R, S}   Θ
mboth(⋅)    0.0   0.1282   0.2820   0.2820   0.0513   0.0513   0.1795   0.0256
Figure 1. Stages within the CaRBS system for a characteristic value vj,i: a) the transformation of vj,i into a confidence value cfi(vj,i) via the sigmoid 1/(1 + e^(−ki(vj,i − θi))); b) the construction of the characteristic BOE mj,i(⋅) from cfi(vj,i) using the control variables Ai and Bi; c) the representation of mj,i(⋅) as a simplex coordinate pj,i,v in a simplex plot with vertices {x}, {¬x} and {x, ¬x} and edges ej,i,1, ej,i,2, ej,i,3.
The control variable Ai depicts the dependence of mj,i({x}) on cfi(vj,i), and Bi is the maximum value assigned to mj,i({x}) or mj,i({¬x}). Stage c) shows that a BOE mj,i(⋅) can be represented as a simplex coordinate in a simplex plot: the ratios of the distances of the simplex coordinate pj,i,v to the edges of the simplex plot (pj,i,v to ej,i,k, k = 1, …, 3) are the same as the ratios of the respective mass values. When characteristics ci, i = 1, …, nC, describe an object oj, j = 1, …, nO, individual characteristic BOEs are constructed. Dempster's rule of combination is used to combine these (independent) characteristic BOEs to produce an object BOE mj(⋅), associated with the object oj and its level of classification to x or ¬x. To illustrate the relationship between the combination of characteristic BOEs and the subsequent final object BOE, a small example is considered, represented visually in a simplex plot (see Figure 2). Two example characteristic BOEs mj,1(⋅) and mj,2(⋅) are considered, with their mass values (mj,i({x}), mj,i({¬x}) and mj,i({x, ¬x})) also shown in Figure 2. The combination of mj,1(⋅) and mj,2(⋅), defined mj(⋅) = [mj,1 ⊕ mj,2](⋅), is evaluated to be [0.4672, 0.2238, 0.3090]. For each BOE, the smaller a mass value, the further the simplex coordinate lies from the respective vertex. Also illustrated is a general consequence of the combination of two or more BOEs, namely the reduction in the concomitant ignorance of the combined object BOE.

Figure 2. Simplex coordinates of example characteristic and object BOEs: mj,1({x}) = 0.5644, mj,1({¬x}) = 0.0, mj,1({x, ¬x}) = 0.4356; mj,2({x}) = 0.0517, mj,2({¬x}) = 0.3983, mj,2({x, ¬x}) = 0.5500; mj({x}) = 0.4672, mj({¬x}) = 0.2238, mj({x, ¬x}) = 0.3090.

The effectiveness of CaRBS is governed by the control variables ki, θi, Ai and Bi; when the classification of each object is known, assigning them is a constrained optimization problem. It is solved here using the evolutionary algorithm trigonometric differential evolution (TDE), described in Fan and Lampinen (2003), with the quality of a solution quantified by an objective function (OB). The equivalence classes E({x}) and E({¬x}) are the sets of objects known to be classified to {x} and {¬x} respectively. The OB defined here directly attempts to reduce ambiguity, but not the inherent ignorance. For objects in E({x}) and E({¬x}) the optimum solution is to maximize the difference values (mj({x}) − mj({¬x})) and (mj({¬x}) − mj({x})) respectively. The subsequent OB is given by:
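On the binary frame of discernment the combination itself reduces to a few products; the following Python sketch (an addition) combines the two example characteristic BOEs shown in Figure 2 and recovers the object BOE mj(⋅) to within rounding of the stated mass values.

def combine_binary(boe1, boe2):
    """Dempster's rule on the binary frame {x, not x}. Each BOE is a
    triple (m({x}), m({not x}), m({x, not x}))."""
    x1, n1, i1 = boe1
    x2, n2, i2 = boe2
    k = x1 * n2 + n1 * x2                          # conflicting mass
    m_x   = (x1 * x2 + x1 * i2 + i1 * x2) / (1 - k)
    m_not = (n1 * n2 + n1 * i2 + i1 * n2) / (1 - k)
    m_ign = (i1 * i2) / (1 - k)
    return m_x, m_not, m_ign

m_j1 = (0.5644, 0.0, 0.4356)        # characteristic BOE m_j,1 from Figure 2
m_j2 = (0.0517, 0.3983, 0.5500)     # characteristic BOE m_j,2 from Figure 2
print([round(v, 4) for v in combine_binary(m_j1, m_j2)])
# -> approximately [0.4671, 0.2238, 0.3091], matching Figure 2 to rounding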
OB = (1/4) [ (1 / |E({x})|) Σ_{oj ∈ E({x})} (1 − mj({x}) + mj({¬x})) + (1 / |E({¬x})|) Σ_{oj ∈ E({¬x})} (1 + mj({x}) − mj({¬x})) ].
In the limit each difference value can attain −1 and 1, so 0 ≤ OB ≤ 1. It is noted that maximizing a difference value such as (mj({x}) − mj({¬x})) only indirectly affects the associated ignorance, rather than making it a direct issue (since the OB does not incorporate the respective mj({x, ¬x}) mass values).

As a study to demonstrate the CaRBS system, data for 20 failed (¬x, or 0) and 20 non-failed (x, or 1) medium-sized companies were considered. Four financial characteristics describe each company: SALES: annual sales; ROCE: ratio of profit to capital employed; GEAR: ratio of current liabilities plus long-term debt to total assets; and QACL: ratio of current assets minus stock to current liabilities (see Beynon, 2005). Using the CaRBS system, to assimilate ignorance in the characteristics (evidence), an upper bound of 0.5 is placed on each of the Bi control variables. The data set was also standardized (zero mean and unit standard deviation), allowing bounds on the ki and θi control variables of [−2, 2] and [−1, 1] respectively (with limited marginal effect on the results). To utilize TDE, parameters are required for its configuration. Following Fan and Lampinen (2003), these were set to amplification control F = 0.99, crossover constant CR = 0.85, trigonometric mutation probability Mt = 0.05 and number of parameter vectors NP = 10 × number of control variables = 160.
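As a small illustration, the following Python sketch (an addition, using hypothetical object BOEs rather than data from the study) evaluates this objective function for a toy set of objects with known classes.

def objective(boes, classes):
    """The OB described above: boes is a list of object BOEs
    (m({x}), m({not x}), m({x, not x})) and classes a parallel list with
    1 for objects in E({x}) and 0 for objects in E({not x})."""
    in_x   = [(mx, mn) for (mx, mn, _), c in zip(boes, classes) if c == 1]
    in_not = [(mx, mn) for (mx, mn, _), c in zip(boes, classes) if c == 0]
    term_x   = sum(1 - mx + mn for mx, mn in in_x) / len(in_x)
    term_not = sum(1 + mx - mn for mx, mn in in_not) / len(in_not)
    return 0.25 * (term_x + term_not)    # 0 <= OB <= 1; TDE minimises this

# Hypothetical object BOEs, two objects of each class:
boes    = [(0.6, 0.1, 0.3), (0.4, 0.2, 0.4), (0.1, 0.7, 0.2), (0.2, 0.5, 0.3)]
classes = [1, 1, 0, 0]
print(round(objective(boes, classes), 4))   # 0.3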
Figure 3. Simplex coordinates of characteristic and company BOEs: a) company o26 (characteristic BOEs for SALES, QACL, ROCE and GEAR, and the combined BOE m26(⋅)); b) company o13 (characteristic BOEs and the combined BOE m13(⋅)); c) the company BOEs for all 40 companies.
The TDE was then applied, and the resulting control variables were used to construct the characteristic BOEs and, after their combination, a company BOE for each company (see Figure 3). The first two simplex plots in Figure 3 represent the evidence and final classification of the two companies o26 and o13. For a characteristic BOE, the further down the simplex plot its simplex coordinate lies, the less ignorance is associated with its evidence. Their positions either side of the vertical dashed line identify whether their evidence supports a company's classification as failed (¬x) or non-failed (x). The simplex coordinates of the final company BOEs are labeled m26(⋅) and m13(⋅); their respective positions indicate that the companies would be classified as non-failed and failed respectively (correct in both cases). The third simplex plot reports the company BOEs for each of the 40 companies considered (with 0 and 1 labels for failed (¬x) and non-failed (x) companies respectively). Their level of classification as a non-failed company increases from left to right (in a rough arc). For this single analysis, using the vertical dashed line to partition classification to {¬x} or {x} indicates 80% classification accuracy.

One novel advantage of this technique arises in the presence of missing characteristic values. Using CaRBS, they are treated as ignorant values and their complete mass is assigned to {x, ¬x} in each case; the CaRBS system can then be processed without any manipulation of the data. This is in contrast to more conventional techniques, where missing values may need to be imputed or other action undertaken, including the removal of objects with missing values (see Schafer & Graham, 2002).
FUTURE TRENDS

The practical utilization and understanding of DST remain a continuing debate, owing to its inherent generality, including the notion of ignorance (partial knowledge).
This includes the practical construction of the required BOEs without necessitating the constraining influence of an expert, a realistic possibility given the more subjective nature of DST relative to the frequentist basis of Bayesian theory. The CaRBS system discussed here is a data mining technique particular to object classification (and ranking). The trigonometric differential evolution method of constrained optimization mitigates the influence of expert opinion in assigning values to the incumbent control variables. However, the effect of the limiting bounds on these control variables still needs formal understanding.
CONCLUSION

The utilization of DST highlights the attempt to incorporate partial knowledge (ignorance/uncertainty) in functional data mining. The understanding and formulization of what ignorance is, and how to quantify it in general and application-specific studies, is an underlying problem. The future will undoubtedly produce varying directions of appropriate (and inappropriate) derivations for its quantification.
REFERENCES

Beynon, M. J. (2002). DS/AHP method: A mathematical analysis, including an understanding of uncertainty. European Journal of Operational Research, 140(1), 148-164.

Beynon, M. J. (2005). A novel technique of object ranking and classification under ignorance: An application to the corporate failure risk problem. European Journal of Operational Research, 167, 497-513.

Binaghi, E., Gallo, I., & Madella, P. (2000). A neural model for fuzzy Dempster-Shafer classifiers. International Journal of Approximate Reasoning, 25, 89-121.

Dempster, A. P. (1967). Upper and lower probabilities induced by a multi-valued mapping. Annals of Mathematical Statistics, 38, 325-339.

Elouedi, Z., Mellouli, K., & Smets, P. (2001). Belief decision trees: Theoretical foundations. International Journal of Approximate Reasoning, 28, 91-124.

Fan, H-Y., & Lampinen, J. (2003). A trigonometric mutation operation to differential evolution. Journal of Global Optimization, 27, 105-129.

Gerig, G., Welti, D., Guttman, C. R. G., Colchester, A. C. F., & Szekely, G. (2000). Exploring the discrimination power of the time domain for segmentation and characterisation of active lesions in serial MR data. Medical Image Analysis, 4, 31-42.

Haenni, R., & Lehmann, N. (2002). Resource bounded and anytime approximation of belief function computations. International Journal of Approximate Reasoning, 31, 103-154.

Kulasekere, E. C., Premaratne, K., Dewasurendra, D. A., Shyu, M-L., & Bauer, P. H. (2004). Conditioning and updating evidence. International Journal of Approximate Reasoning, 36, 75-108.

Murphy, C. K. (2000). Combining belief functions when evidence conflicts. Decision Support Systems, 29, 1-9.

Pearl, J. (1990). Reasoning with belief functions: An analysis of comparability. International Journal of Approximate Reasoning, 4, 363-390.

Petit-Renaud, S., & Denœux, T. (2004). Nonparametric regression analysis of uncertain and imprecise data using belief functions. International Journal of Approximate Reasoning, 35, 1-28.

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.

Scotney, B., & McClean, S. (2003). Database aggregation of imprecise and uncertain evidence. Information Sciences, 155, 245-263.

Shafer, G. (1976). A mathematical theory of evidence. Princeton: Princeton University Press.

Shafer, G., & Pearl, J. (1990). Readings in uncertain reasoning. San Mateo, CA: Morgan Kaufmann Publishers.

Wu, W-Z., Leung, Y., & Zhang, W-X. (2002). Connections between rough set theory and Dempster-Shafer theory of evidence. International Journal of General Systems, 31(4), 405-430.

Yang, M-S., Chen, T-C., & Wu, K-L. (2003). Generalized belief function, plausibility function, and Dempster's combinational rule to fuzzy sets. International Journal of Intelligent Systems, 18, 925-937.

KEY TERMS

Belief: A positive function that represents the confidence that a proposition lies in a focal element or any subset of it.

Evolutionary Algorithm: An algorithm that incorporates aspects of natural selection or survival of the fittest.

Focal Element: A subset of the frame of discernment with a positive mass value associated with it.

Frame of Discernment: A finite non-empty set of hypotheses.

Mass Value: A positive function of the level of exact belief in the associated proposition (focal element).

Objective Function: A positive function of the difference between predictions and data estimates that is chosen so as to optimize the function or criterion.

Plausibility: A positive function that represents the extent to which we fail to disbelieve the proposition described by a focal element.

Standardized Data: A collection of numeric data is standardized by subtracting a measure of central location (mean) and dividing by some measure of spread (standard deviation).
Using Standard APIs for Data Mining in Prediction
Jaroslav Zendulka
Brno University of Technology, Czech Republic
INTRODUCTION

There are three standardization initiatives concerning application programming interfaces (API) for data mining — OLE DB for Data Mining (OLE DB for DM), SQL/MM Data Mining (SQL/MM DM), and Java Data Mining (JDM) (Schwenkreis, 2001; Grossman et al., 2002; Grossman, 2004). Their goal is to make it possible for different data mining algorithm providers from various software vendors to be easily plugged into applications. Although the goal is the same for all the APIs, the approach applied in each of them is different. OLE DB for DM is a language-based interface developed by Microsoft, SQL/MM DM is an ISO/IEC standard based on user-defined data types of SQL:1999, and JDM, which is being developed under SUN's Java Community Process, contains packages of data mining oriented Java interfaces and classes. This short paper presents a simple example that shows how the APIs can be used in an application for prediction based on classification. The objective is to demonstrate the basic steps that the application must include if we decided to use a given API. The example also helps to better understand the different approaches the APIs are based on.
BACKGROUND

A brief characterization of all three APIs is presented in another article in this book (Zendulka, 2005). There are several introductory publications that deal with OLE DB for DM (Han & Kamber, 2001; Netz et al., 2001) and its referential implementation in Microsoft SQL Server 2000 (de Ville, 2001; Whitney, 2000; Tang & Kim, 2001). Keeping (2002) presented a scenario for using OLE DB for DM. The full specification was published by Microsoft in 2000 (OLE DB, 2000). Melton and Eisenberg (2001) presented an introductory paper on the SQL Multimedia and Application Package (SQL/MM), of which the specification of data mining support known as SQL/MM DM is a part. The standard was accepted as an ISO/IEC standard in 2002 (SQL, 2002). JDM is developed as a Java Specification Request, JSR-73. At the time of writing this paper, it was in the stage of final draft public review (Hornick et al., 2004).
Classification is a two-step process. In the first step, a model is built describing training data. The model can take the form of, for example, a decision tree. In the second step, the model is used for prediction. First, the predictive accuracy of the model is estimated using testing data. If the accuracy is considered acceptable, the model can be used to classify future (previously unseen) data (Han & Kamber, 2001). These two steps are refined in the application according to the API used for implementation.
MAIN THRUST

Consider that, based on debt level, income level and employment type, we want to predict the credit risk of a customer. A set of data is stored in a Customers table with columns CustomerID, DebtLevel, Income, EmploymentType, and CreditRisk. The CreditRisk column will be the target for prediction.
OLE DB for Data Mining

OLE DB for DM uses SQL CREATE, INSERT and SELECT statements, with extended syntax and semantics in some cases, to provide a language-based API for data mining services provided by a data mining provider that implements the required data mining technique. There are four basic steps that must be performed by our prediction application:

1. Define a data mining model.
2. Populate the data mining model.
3. Test the data mining model.
4. Apply the data mining model.
First, it is necessary to define a data mining model. OLE DB for DM provides a CREATE statement for this (the square brackets are name delimiters by convention for Microsoft SQL Server):

    CREATE MINING MODEL [RiskPrediction]        %Model name
    (
        [CustomerID]     LONG KEY,              %Source columns
        [DebtLevel]      TEXT DISCRETE,
        [Income]         TEXT DISCRETE,
        [EmploymentType] TEXT DISCRETE,
        [CreditRisk]     TEXT DISCRETE PREDICT  %Target
    )
    USING [Decision_Tree]                       %Algorithm
The statement specifies the name of the model, columns used for prediction, and a mining algorithm. Each column definition can contain a lot of specialized information. In our example, the CustomerID column is specified as the identifier of cases, the other columns are attributes of cases with discrete values, and the CreditRisk is a target for prediction. OLE DB for DM also supports cases that represent hierarchical data (nested tables). In addition, it is possible to introduce attributes with a predefined meaning, for example, the probability of an associated value, support, etc. Finally, a data mining algorithm is specified. Once a data mining model is defined, it can be populated with training data stored in the Customers table: INSERT INTO [RiskPrediction] ([CustomerId], [DebtLevel], [Income], [EmploymentType], [CreditRisk]) SELECT [CustomerID], [DebtLevel], [Income], [EmploymentType], [CreditRisk]
standard specifies a set of UDTs supporting data mining. If the UDTs are implemented, any SQL-based application can employ them. Consider our example and assume that we want to make it possible to re-compute the classification model from time to time. The following steps must be performed to specify a data mining task that defines the model (all names starting with the DM_ prefix are UDTs or methods specified by the standard): 1.
Create a DM_MiningData value using a static method DM_defMiningData. The value will contain metadata describing physical data - the Customers table. Create a DM_LogicalDataSpec value using a DM_genDataSpec method of the DM_MiningData value. The value will represent the input of a data mining task as a set of data mining fields - logical data. A DM_genDataSpec method can generate this representation from the DM_MiningData value. Create a DM_ClasSettings value using a default constructor, assign the input of the model (the DM_LogDataSpec value) using a method DM_useClasDataSpec, and declare the CreditRisk column as a target field using a DM_setClasTarget method. The DM_ClasSettings value will contain settings that are used to generate the model, for example, cost rate, an input of the model, and a target field. Create a DM_ClasBldTask value using a static method DM_defClasBldTask. The method takes two DM_MiningData values for training and testing data, and a DM_ClasSetting value as arguments. In our example, we assume that the model will only be trained during building. The DM_ClasBldTask value represents the task that builds the model. Store the newly created DM_ClasBldTask value in a table; assume MyTasks, with Id and Task columns so that the classification model can be re-computed. Let us insert the value to the table with ID = 1.
2.
3.
FROM Customers
The data mining algorithm analyzes the input cases and populates the patterns it has discovered to the data mining model. That is, the data mining model is built in this step. The model can be tested or applied using a SELECT statement with a PREDICTION JOIN operator:
4.
SELECT Customers.[CustomerID], [RiskPrediction].[CreditRisk], FROM [RiskPrediction] PREDICTION JOIN Customers ON [RiskPrediction].[DebtLevel] = Customers.[DebtLevel] AND [RiskPrediction].[Income] = Customers.[Income] AND [RiskPrediction].[EmploymentType] = Customers.[EmploymentType]
For each case from the input (a row of the Customers table with new data), PREDICTION JOIN, using the conditions in the ON clause, will find the best prediction. To validate the accuracy of the trained model, we can use a similar statement for testing data with both predicted and known values, and to process these values. For descriptive mining techniques, a populated data mining model is only browsed using the ordinary SQL SELECT statement.
SQL/MM Data Mining SQL/MM DM is based on SQL:1999 and its structured user-defined data types (UDT). The structured UDT is the fundamental facility in SQL:1999 (Melton & Simon, 2001). It allows specifying types of objects that encapsulate both attributes and methods. The SQL/MM DM 1172
5.
All the steps can be expressed as a single SQL statement: WITH MyData AS (—Physical data—(Step 1) DM_MiningData::DM_defMiningData(‘Customers’) ) INSERT INTO MyTasks (Id, Task) (Step 5) VALUES (1,—ID of the model DM_ClasBldTask::DM_defClasBldTask((Step 4) MyData, NULL,—Training and testing data ( (NEW DM_ClasSettings()). (Step 3) DM_useClasDataSpec(MyData.DM_genDataSpec())
) )
(Step 2) ).DM_setclasSetTarget(‘CreditRisk’) —Target column
Using Standard APIs for Data Mining in Prediction
Once a classification task is defined, the classification training can be initiated and a classification model built using a DM_buildClasModel method. Assume that the model will be stored in a MyModels table with Id and Model columns: WITH MyTask AS ( SELECT Task FROM MyTasks WHERE Id = 1
) INSERT INTO MyModels VALUES ( 1, MyTask.DM_buildClasModel()
—Task
Assume Id = 1 —Classification model
6.
7.
specific parameters for building a data mining model to be set. For example, for classification, the object contains references to logical data and algorithm settings objects, and information about a target attribute for prediction. Create an object representing the build task. In this step, an object that provides an abstraction of the metadata needed to build a classification model is created. Execute the build task. As a result, an object representing the classification model is created.
)
The DM_ClasModel type provides DM_testClasModel and DM_applyClasModel methods. The former returns a DM_ClasTestResult value, which holds information on a classification error, predictions for a given class, and etcetera. The later returns a DM_ClasResult value, which makes possible to get a predicted class label and the confidence for the prediction. For example, the following statement returns a classification error for our model and testing data in the Customers table: SELECT (Model.DM_testClasModel( DM_MiningData::DM_defMiningData (‘Customers’))). DM_getClasError() FROM MyModels WHERE Id = 1
Java Data Mining The JDM standard is based on a generalized, objectoriented data mining conceptual model that supports common data mining operations. Compared with the previous two APIs, it is more complex because it does not rely on any other built-in support, such as OLE DB or SQL. Therefore, only basic steps of using it are described here without any fragment of code: 1. 2.
3.
4. 5.
Create a client application’s connection to a data mining engine. Create an object representing physical data. In our example, the Customers table is identified as a data source. Metadata describing columns of the table (referred to as physical attributes) can be imported from the database. Create an object representing logical data. Logical data is a set of logical attributes that describe the logical nature of the data used as input for model building, training, and applying. By default, a logical attribute is defined for each physical one. Create an object representing a mining algorithm and its settings. For example, a decision tree algorithm is selected and its maximum depth specified. Create an object representing mining function settings. This step allows general and mining technique
To be able to test and apply the model, physical data must be specified, an object representing the task created and executed. Finally, the resulting object can be examined.
FUTURE TRENDS

Currently, OLE DB for DM has a referential implementation in Microsoft SQL Server 2000. SQL/MM DM and JDM have not been implemented yet. The Oracle9i Data Mining API (Oracle9i, 2002) provides an early look at concepts and approaches proposed for JDM. Since all three standard APIs are based on other, more general and widely accepted standards (OLE DB, SQL, Java), we can expect more implementations in the future. On the other hand, the fact that several standard APIs based on different approaches exist can cause problems with integration. This is a challenge for future research aiming at a common unifying framework.
CONCLUSION

The use of three existing or emerging standard APIs for data mining was demonstrated on a simple classification example. OLE DB for DM is probably the simplest one; on the other hand, it relies on Microsoft's OLE DB standard for accessing record-oriented data stores. The most general API is JDM, which does not rely on any other built-in support. SQL/MM DM allows data mining support to be integrated into object-relational database servers that comply with SQL:1999 in a standard way.
REFERENCES

de Ville, B. (2001, January). Data mining in SQL Server 2000. SQL Server Magazine. Retrieved May 20, 2004, from http://www.winnetmag.com/Article/ArticleID/16175/16175.html
Grossman, R. (2004). KDD2003 workshop on data mining standards, services and platforms (DMSSP 03). SIGKDD Explorations, 5(2), 197.

Grossman, R. L., Hornick, M. F., & Meyer, G. (2002). Data mining standards initiatives. Communications of the ACM, 45(8), 59-61.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.

Hornick, M. et al. (2004). Java™ Specification Request 73: Java™ Data Mining (JDM). Version 0.96. Retrieved May 20, 2004, from http://jcp.org/aboutJava/communityprocess/first/jsr073/

Keeping, F. (2002, June). Data mining on a shoestring. SQL Server Magazine. Retrieved May 20, 2004, from http://www.winnetmag.com/Article/ArticleID/24856/24856.html

Melton, J., & Eisenberg, A. (2001). SQL multimedia and application packages (SQL/MM). SIGMOD Record, 30(4), 97-102.

Melton, J., & Simon, A. (2001). SQL:1999. Understanding relational language components. Morgan Kaufmann Publishers.

Microsoft Corporation. (2000). OLE DB for Data Mining Specification Version 1.0.

Netz, A. et al. (2001, April). Integrating data mining with SQL databases: OLE DB for data mining. In Proceedings of the 17th International Conference on Data Engineering (ICDE '01) (pp. 379-387), Heidelberg, Germany.

Oracle9i Data Mining. Concepts. Release 9.2.0.2. (2002). Viewable CD Release 2 (9.2.0.2.0).

Schwenkreis, F. (2001). Data mining: Technology driven by standards? In Ninth International Workshop on High Performance Transaction Systems. Retrieved February 1, 2004, from http://www.research.microsoft.com/~jamesrh/hpts2001/submissions/FriedemannSchwenkreis.htm

SQL Multimedia and Application Packages – Part 6: Data Mining. ISO/IEC 13249-6 (2002).

Whitney, R. (2000, September). Analysis services data mining. SQL Server Magazine. Retrieved May 20, 2004, from http://www.winnetmag.com/Article/ArticleID/9160/9160.html
Tang, Z., & Kim, P. (2001). Building data mining solutions with SQL Server 2000. Retrieved May 20, 2004, from http://www.dmreview.com/whitepaper/wp.cfm?topicId=230001

Zendulka, J. (2005). API standardization efforts for data mining. In J. Wang (Ed.), Encyclopedia of data warehousing and mining. Hershey, PA: Idea Group Reference.
KEY TERMS

JDM: Java Data Mining (JDM) is an emerging standard API for the programming language Java. It is an object-oriented interface that specifies a set of Java classes and interfaces supporting data mining operations for building, testing, and applying a data mining model.

Logical Data: Input of model building, training and applying tasks in SQL/MM DM and JDM. It describes the logical nature of the data used as input of the task (referred to as physical data).

OLE DB for DM: OLE DB for Data Mining (OLE DB for DM) is a Microsoft language-based standard API that introduces several SQL-like statements supporting data mining operations for building, testing, and applying a data mining model.

Physical Data: Data source for data mining in SQL/MM DM and JDM. It is mapped to inputs of a data mining task (referred to as logical data).

Prediction Join: An operation in OLE DB for DM that finds the best prediction for a given input case and a given data mining model.

SQL/MM DM: SQL Multimedia and Application Packages – Part 6: Data Mining (SQL/MM DM) is an international standard whose purpose is to define data mining user-defined types and associated routines for building, testing, and applying data mining models. It is based on structured user-defined types of SQL:1999.

Structured UDT: Structured User-Defined Type (UDT) is a database schema object in SQL:1999 that allows types of objects, which encapsulate both attributes and methods, to be specified. SQL/MM DM is based on this feature of SQL:1999.
Utilizing Fuzzy Decision Trees in Decision Making

Malcolm J. Beynon
Cardiff University, UK
INTRODUCTION

Since the seminal work of Zadeh (1965), fuzzy set theory (FST) has developed into a methodology fundamental to analysis that incorporates vagueness and ambiguity. With respect to the area of data mining, it endeavours to find potentially meaningful patterns from data (Hu & Tzeng, 2003). This includes the construction of if-then decision rule systems, which attempt a level of inherent interpretability in the antecedents and consequents identified for object classification (and prediction) (see Breiman, 2001). Within a fuzzy environment this is extended to allow a linguistic facet to the possible interpretation; examples include mining time series data (Chiang, Chow, & Wang, 2000) and multi-objective optimisation (Ishibuchi & Yamamoto, 2004). One approach to if-then rule construction has been through the use of decision trees (Quinlan, 1986), where the path down a branch of a decision tree (through a series of nodes) is associated with a single if-then rule. A key characteristic of traditional decision tree analysis is that the antecedents described in the nodes are crisp. This chapter investigates the use of fuzzy decision trees as an effective tool for data mining.
BACKGROUND

The development of fuzzy decision trees brings a linguistic form to the if-then rules constructed, offering a concise readability in their findings (see Olaru & Wehenkel, 2003). Examples of their successful application include optimising economic dispatch (Roa-Sepulveda, Herrera, Pavez-Lazo, Knight, & Coonick, 2003) and the antecedents of company audit fees (Beynon, Peel, & Tang, 2004). Even in application-based studies, the linguistic formulisation of decision making is continually investigated (Chakraborty, 2001; Herrera, Herrera-Viedma, & Martinez, 2000).

Appropriate for a wide range of problems, fuzzy decision trees (with linguistic variables) allow a representation of information in a more direct and adequate form. A linguistic variable is described in Herrera, Herrera-Viedma, and Martinez (2000), highlighting that it differs from a numerical one in using words or sentences in a natural or artificial language instead of numbers. It further serves the purpose of providing a means of approximate characterization of phenomena that are too complex, or too ill-defined, to be amenable to description in conventional quantitative terms. The number of elements (words) in the linguistic term set which defines a linguistic variable determines the granularity of the characterization. The semantics of these elements is given by fuzzy numbers defined in the [0, 1] interval, which are described by their membership functions (MFs). Indeed, it is the role played by, and the structure of, the MFs that is fundamental to the utilisation of FST-related methodologies (Medaglia, Fang, Nuttle, & Wilson, 2002; Reventos, 1999). In this context, DeOliveria (1999) noted that fuzzy systems have the important advantage of providing an insight into the linguistic relationship between the variables of a system.

With an inductive fuzzy decision tree, the underlying knowledge related to a decision outcome can be represented as a set of fuzzy if-then decision rules, each of the form:

If (A1 is T^1_{i1}) and (A2 is T^2_{i2}) … and (Ak is T^k_{ik}) then C is Cj,

where A = {A1, A2, …, Ak} and C are linguistic variables in the multiple antecedent (Ai) and consequent (C) statements, respectively, and T(Ak) = {T^k_1, T^k_2, …, T^k_{Sk}} and {C1, C2, …, CL} are their linguistic terms. Each linguistic term T^k_j is defined by a membership function μ_{T^k_j}(x): Ak → [0, 1] (similarly for a Cj). The MFs represent the grade of membership of an object's antecedent Aj being T^k_j and of its consequent C being Cj, respectively (Wang, Chen, Qian, & Ye, 2000; Yuan & Shaw, 1995).

Different types of MFs have been proposed to describe fuzzy numbers, including triangular and trapezoidal functions (Lin & Chen, 2002; Medaglia, Fang, Nuttle, & Wilson, 2002). Yu and Li (2001) highlight that MFs may be (advantageously) constructed from mixed shapes, supporting the use of piecewise linear MFs. A general functional form of a piecewise linear MF (in the context of a linguistic term) is given by:
μ^k_j(x) =
  0,                                               x ≤ α_{j,1}
  0.5 (x − α_{j,1}) / (α_{j,2} − α_{j,1}),         α_{j,1} < x ≤ α_{j,2}
  0.5 + 0.5 (x − α_{j,2}) / (α_{j,3} − α_{j,2}),   α_{j,2} < x ≤ α_{j,3}
  1,                                               x = α_{j,3}
  1 − 0.5 (x − α_{j,3}) / (α_{j,4} − α_{j,3}),     α_{j,3} < x ≤ α_{j,4}
  0.5 − 0.5 (x − α_{j,4}) / (α_{j,5} − α_{j,4}),   α_{j,4} < x ≤ α_{j,5}
  0,                                               α_{j,5} < x

where [α_{j,1}, α_{j,2}, α_{j,3}, α_{j,4}, α_{j,5}] are the defining values for the MF associated with the jth linguistic term of an antecedent Ak. A visual representation of this MF is presented in Figure 1, which elucidates its general structure along with the role played by the defining values. Also included in Figure 1, using dotted lines, are neighbouring MFs (linguistic terms), which collectively would define a linguistic variable. To circumvent the influence of expert opinion in an analysis, the construction of the MFs should be automated. On this matter, DeOliveria (1999) considers the implication of Zadeh's principle of incompatibility: as the number of MFs increases, so the precision of the system increases, but at the expense of decreasing relevance.
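The following Python sketch (an addition) implements the piecewise linear MF in the form given above, with purely illustrative defining values.

def piecewise_mf(x, a):
    """Piecewise linear MF with defining values a = [a1, a2, a3, a4, a5],
    following the functional form given above (membership 0.5 at a2 and a4,
    and 1 at a3)."""
    a1, a2, a3, a4, a5 = a
    if x <= a1 or x > a5:
        return 0.0
    if x <= a2:
        return 0.5 * (x - a1) / (a2 - a1)
    if x <= a3:
        return 0.5 + 0.5 * (x - a2) / (a3 - a2)
    if x <= a4:
        return 1.0 - 0.5 * (x - a3) / (a4 - a3)
    return 0.5 - 0.5 * (x - a4) / (a5 - a4)

# Illustrative defining values (not taken from the audit fees study):
low = [0.0, 1.0, 2.0, 3.0, 4.0]
print([round(piecewise_mf(v, low), 2) for v in (0.5, 1.0, 2.0, 3.0, 3.5)])
# -> [0.25, 0.5, 1.0, 0.5, 0.25]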
MAIN THRUST

Formulization of Fuzzy Decision Tree

The first fuzzy decision tree reference is attributed to Chang and Pavlidis (1977). A detailed description of the subsequent work on fuzzy decision trees is presented in Olaru and Wehenkel (2003). It highlights how methodologies include the fuzzification of a crisp decision tree after its construction, or approaches that directly integrate fuzzy techniques during the tree-growing phase. The latter formulisation is described here, through the inductive method proposed by Yuan and Shaw (1995), based on measures of cognitive uncertainty. This method focuses on the minimisation of classification ambiguity in the presence of fuzzy evidence.

Figure 1. General definition of a MF (including the defining values α_{j,1}, α_{j,2}, α_{j,3}, α_{j,4}, α_{j,5}), with neighbouring MFs (linguistic terms) shown as dotted lines.

A membership function μ(x) from the set describing a fuzzy variable Y defined on X can be viewed as a possibility distribution of Y on X, that is, π(x) = μ(x) for all x ∈ X. The possibility measure of ambiguity is defined by

E_a(Y) = g(π) = Σ_{i=1}^{n} (π*_i − π*_{i+1}) ln[i],

where π* = {π*_1, π*_2, …, π*_n} is the permutation of the possibility distribution π = {π(x1), π(x2), …, π(xn)} sorted so that π*_i ≥ π*_{i+1} for i = 1, …, n, and π*_{n+1} = 0.

The ambiguity of an attribute A (over objects u1, …, um) is given by E_a(A) = (1/m) Σ_{i=1}^{m} E_a(A(ui)), where E_a(A(ui)) = g(μ_{Ts}(ui) / max_{1≤j≤s} μ_{Tj}(ui)), with T1, …, Ts the linguistic terms of the attribute (antecedent). When there is overlapping between the linguistic terms (MFs) of an attribute, or between consequents, ambiguity exists.

The fuzzy subsethood S(A, B) measures the degree to which A is a subset of B and is given by S(A, B) = Σ_{u∈U} min(μ_A(u), μ_B(u)) / Σ_{u∈U} μ_A(u). Given fuzzy evidence E, the possibility of classifying an object to the consequent Ci can be defined as π(Ci | E) = S(E, Ci) / max_j S(E, Cj), where S(E, Ci) represents the degree of truth for the classification rule "if E then Ci". With a single piece of evidence (a fuzzy number for an attribute), the classification ambiguity based on this fuzzy evidence is defined as G(E) = g(π(C | E)).

The classification ambiguity with fuzzy partitioning P = {E1, …, Ek} on the fuzzy evidence F, denoted G(P | F), is the weighted average of the classification ambiguity over each subset of the partition: G(P | F) = Σ_{i=1}^{k} w(Ei | F) G(Ei ∩ F), where G(Ei ∩ F) is the classification ambiguity with fuzzy evidence Ei ∩ F, and w(Ei | F) is the weight representing the relative size of the subset Ei ∩ F in F: w(Ei | F) = Σ_{u∈U} min(μ_{Ei}(u), μ_F(u)) / Σ_{j=1}^{k} ( Σ_{u∈U} min(μ_{Ej}(u), μ_F(u)) ).
In summary, attributes are assigned to nodes based on the lowest level of ambiguity. A node becomes a leaf node if the level of subsethood (based on the intersection of the nodes from the root) is higher than some truth value β.
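The following Python sketch (an addition, with hypothetical membership values) illustrates the two quantities at the heart of the induction process: the ambiguity measure g(π) and the fuzzy subsethood S(A, B), combined to give the classification ambiguity G(E) of a single piece of fuzzy evidence.

import math

def ambiguity(pi):
    """g(pi): the possibilistic measure of ambiguity of a (normalised)
    possibility distribution, as defined above."""
    p = sorted(pi, reverse=True) + [0.0]
    return sum((p[i] - p[i + 1]) * math.log(i + 1) for i in range(len(pi)))

def subsethood(mu_a, mu_b):
    """Fuzzy subsethood S(A, B) = sum(min(mu_A, mu_B)) / sum(mu_A)."""
    return sum(min(a, b) for a, b in zip(mu_a, mu_b)) / sum(mu_a)

# Hypothetical memberships of four objects in fuzzy evidence E and in two
# consequent classes C1, C2:
E  = [0.9, 0.7, 0.3, 0.1]
C1 = [0.8, 0.6, 0.2, 0.1]
C2 = [0.2, 0.3, 0.7, 0.9]
s  = [subsethood(E, C1), subsethood(E, C2)]
pi = [v / max(s) for v in s]          # normalised possibilities pi(Ci | E)
print(round(ambiguity(pi), 3))        # classification ambiguity G(E)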
Illustrative Application of Fuzzy Decision Tree Analysis to Audit Fees Problem

The data mining of an audit fees model may be useful to companies in assessing whether the audit fee they are paying is reasonable. In this analysis, a sample of 120 UK companies is used for training (growing) a fuzzy decision tree (Beynon, Peel, & Tang, 2004). The variables used in the study are the decision attribute AFEE: audit fee (£000s), and the condition attributes SIZE: sales (£000s); SUBS: number of subsidiaries; FORS: ratio of foreign to total subsidiaries; GEAR: ratio of debt to total assets; CFEE: consultancy fees (£000s); TOTSH: proportion of shares held by directors and substantial shareholders; BIG6: 1 = Big 6 auditor, zero otherwise; and LOND: 1 = audit office in London, zero otherwise.
Figure 2. Estimated distributions and MFs (including defining values) for AFEE: the left panel shows the intervalised AFEE domain with three estimated distributions (I1, I2, I3); the right panel shows the resulting piecewise linear MFs for the linguistic terms L, M and H.
The construction of the MFs (their defining values), which define the linguistic terms for a continuous attribute (linguistic variable), is based on an initial intervalisation of the attribute's domain and the subsequent construction of estimated distributions of the values in each defined interval (see Figure 2). In Figure 2 (left), the decision attribute AFEE is intervalised and three estimated distributions constructed. Using the centre of area of a distribution to denote an α_{j,3} defining value, and the interval boundary values as the respective α_{j,2} and α_{j,4}, the concomitant piecewise linear MFs are reported in Figure 2 (right). Their labels L, M and H denote the linguistic terms low, medium and high for AFEE. Similar sets of MFs (linguistic variables) can be defined for the six continuous condition attributes (see Figure 3). This series of linguistic variables (Figures 2 and 3) and the binary variables BIG6 and LOND are fundamental to the construction of a fuzzy decision tree for the audit fee model. The inductive fuzzy decision tree method is applied to the data on audit fees previously described, with a minimum
"truth level" of 80% required for a node to be a leaf (end) node; the complete tree is reported in Figure 4. The fuzzy decision tree in Figure 4 indicates that a total of 21 leaf nodes (fuzzy if-then rules) were established. To illustrate, the rule labelled R1 reads: "If SALES = L, CFEE = L and SUBS = M then AFEE = M, with truth level 92.73%", or, equivalently: for a company with low sales, low consultancy fees and a medium number of subsidiaries, a medium level of audit fees is expected, with truth level 92.73%. As well as elucidating the relationship between audit fees (AFEE) and company characteristics, the fuzzy decision tree reported in Figure 4 can also be utilised to make predictions (matching) on the level of AFEE for future companies not previously considered.

Figure 3. Sets of MFs for the six continuous condition attributes (SIZE, SUBS, FORS, GEAR, CFEE and TOTSH), each described by the linguistic terms L, M and H.

Figure 4. Complete fuzzy decision tree for the audit fees problem, comprising 21 leaf nodes (fuzzy if-then rules) built from the root attribute SALES and the attributes CFEE, SUBS, FORS, GEAR, TOTSH, BIG6 and LOND.

The procedure (Wang et al., 2000) is as follows: i) Matching starts from the root node and ends at a leaf node along the branch
of the maximum membership; ii) if the maximum membership at the node is not unique, matching proceeds along several branches; and iii) the decision class with the maximum degree of truth from the leaf nodes is then assigned as the classification (e.g., L, M or H) for the associated rule.
REFERENCES
FUTURE TRENDS
Chakraborty, D. (2001) Structural quantization of vagueness in linguistic expert opinion in an evaluation programme. Fuzzy Sets and Systems, 119, 171-186.
The ongoing characterization of data mining techniques in a fuzzy environment identifies the influence of its incorporation. The theoretical development of fuzzy set theory will, as well as explicit fuzzy techniques, continue to bring the findings from analysis nearer to the implicit human decision making. Central to this is the linguistic interpretation associated with the fuzzy related membership functions (MFs). There is no doubt their development will continue, including their structure and number of MFs to describe a linguistic variable. This is certainly true for the specific area of fuzzy decision trees, where the fuzzy if-then rules constructed offer a succinct elucidation to the data mining of large (and small) data sets. The development of fuzzy decision trees will also include the alternative approaches to pruning, already considered in the non-fuzzy environment, whereby the branches of a decision tree may be reduced in size to offer even more general results. Recent attention has been towards the building of ensemble of classifiers. Here this relates to random forests (Breiman, 2001), whereby a number of different decision trees are constructed, subject to a random process. This can also include the use of techniques such as neural networks (Zhou & Jiang, 2003).
CONCLUSION Within the area of data mining the increased computational power (speed) available has reduced the perceived distance (relationship) between the original data and the communication of the antecedents and consequent to the relevant decision problem. As a tool for data mining, fuzzy decision trees exhibit the facets necessary for meaningful analysis. These include the construction of if-then decision rules to elucidate the warranted relationship. The fuzzy environment offers a linguistic interpretation to the derived relationship.
Beynon, M. J., Peel, M. J., & Tang, Y.-C. (2004) The application of fuzzy decision tree analysis in an exposition of the antecedents of audit fees. OMEGA, 32, 231-244. Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199-231.
Chang, R. L. P., & Pavlidis, T. (1977) Fuzzy decision tree algorithms. IEEE Transactions Systems Man and Cyrbernetics, SMC-7(1), 28-35. Chiang, D.-A., Chow, L. R., & Wang, Y.-F. (2000) Mining time series data by a linguistic summary system. Fuzzy Sets and Systems, 112, 419-432. DeOliveria, J. V. (1999). Semantic constraints for membership function optimization. IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans, 29(1), 128-138. Herrera, F., Herrera-Viedma, E., & Martinez, L. (2000) A fusion approach for managing multi-granularity linguistic term sets in decision making. Fuzzy Sets and Systems, 114, 43-58. Hu, Y.-C., & Tzeng, G-H. (2003) Elicitation of classification rules by fuzzy data mining. Engineering Applications of Artificial Intelligence, 16, 709-716. Ishibuchi, H., & Yamamoto, T. (2004) Fuzzy rule selection by multi-objective genetic local search algorithms and rule evaluation measures in data mining. Fuzzy Sets and Systems, 141, 59-88. Lin, C.-C., & Chen, A.-P. (2002) Generalisation of Yang et al.’s method of fuzzy programming with piecewise linear membership functions. Fuzzy Sets and Systems, 132, 346352. Medaglia, A. L., Fang, S.-C., Nuttle, H. L. W., & Wilson, J. R. (2002) An efficient and flexible mechanism for constructing membership functions. European Journal of Operational Research, 139, 84-95. Olaru, C., & Wehenkel, L. (2003) A complete fuzzy decision tree technique. Fuzzy Sets and Systems, 138, 221-254. Quinlan, R. (1986) Induction of decision trees. Machine Learning, 1, 86-106.
Reventos, V. T. (1999) Interpreting membership functions: a constructive approach. International Journal of Approximate Reasoning, 191-207. Roa-Sepulveda, C. A., Herrera, M., Pavez-Lazo, B., Knight, U. G., & Coonick, A. H. (2003) Economic dispatch using fuzzy decision trees. Electric Power Systems Research, 66, 115-122. Wang, X., Chen, B., Qian, G., & Ye, F. (2000). On the optimization of fuzzy decision trees. Fuzzy Sets and Systems, 112, 117-125. Yu, C-S., & Li, H-L. (2001). Method for solving quasiconcave and non-cave fuzzy multi-objective programming problems. Fuzzy Sets and Systems, 122, 205-227.
KEY TERMS Antecedent: An antecedent is a driving factor in an event. For example, in the relationship “When it is hot, Mary buys an ice-scream”, “it is hot” is the antecedent. Branch: A single path down a decision tree, from root to a leaf node, denoting a single if-then rule. Consequent: A consequent follows as a logical conclusion to an event. For example, in the relationship “When it is hot, Mary buys an ice-scream”, “buys an icescream” is the consequent. Decision Tree: A tree-like way of representing a collection of hierarchical rules that lead to a class or value.
Yuan, Y., & Shaw, M. J. (1995). Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69, 125-139.
Induction: A technique that infers generalizations from the information in the data.
Zadeh, L. A. (1965). Fuzzy sets, information and control, 8(3), 338-353.
Leaf: A node not further split - the terminal grouping - in a classification or decision tree.
Zhou, Z.-H., & Jiang, Y. (2003) Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Transactions on Information Technology in Biomedicine, 7, 37-42.
Linguistic Term: One of a set of linguistic terms, which are subjective categories for the linguistic variable, each described by a membership function. Linguistic Variable: A variable made up of a number of words (linguistic terms) with associated degrees of membership. Node: A junction point in a decision tree, which describes a condition in an if-then rule.
Vertical Data Mining
William Perrizo
North Dakota State University, USA

Qiang Ding
Concordia College, USA

Qin Ding
Pennsylvania State University, USA

Taufik Abidin
North Dakota State University, USA
INTRODUCTION

The volume of data keeps increasing; many data sets have become extremely large. It is important, and a challenge, to develop scalable methodologies that can be used to perform efficient and effective data mining on large data sets. The vertical data mining strategy aims to address scalability issues by organizing data in vertical layouts and conducting logical operations on vertically partitioned data instead of scanning the entire database horizontally.
BACKGROUND

The traditional horizontal database structure (files of horizontally structured records) and traditional scan-based data processing approaches (scanning files of horizontal records) are known to be inadequate for knowledge discovery in very large data repositories due to the problem of scalability. For this reason, much effort has been put into sub-sampling and indexing as ways to address and solve the problem of scalability. However, sub-sampling requires that the sub-sampler know enough about the large dataset in the first place in order to sub-sample "representatively." That is, sub-sampling requires considerable knowledge about the data, which, for many large datasets, may be inadequate or non-existent. Index files are vertical structures; that is, they are vertical access paths to sets of horizontal records. Indexing files of horizontal data records does address the scalability problem in many cases, but it does so at the cost of creating and maintaining the index files separately from the data files themselves. A new way to organize data is to organize them vertically, instead of horizontally. Data miners are typi-
cally interested in collective properties or predictions that can be expressed very briefly (e.g., a yes/no answer). Therefore, the result of a data mining query can be represented by a bitmap vector. This important property makes it possible to do data mining directly on vertical data structures.
MAIN THRUST

Vertical data structures, vertical mining approaches and multi-relational vertical mining will be explored in detail to show how vertical data mining works.
Vertical Data Structures

The concept of vertical partitioning has been studied within the context of both centralized and distributed database systems for a long time, yet much remains to be done (Winslett, 2002). There are great advantages to using vertical partitioning; for example, it makes hardware caching work really well, it makes compression easy to do, and it may greatly increase the effectiveness of the I/O device since only participating fields are retrieved each time. The vertical decomposition of a relation also permits a number of transactions to be executed concurrently. Copeland and Khoshafian (1985) presented an attribute-level Decomposition Storage Model called DSM, similar to the Attribute Transposed File model (ATF) (Batory, 1979), which stores each column of a relational table in a separate table. DSM was shown to perform well. It utilizes surrogate keys to map individual attributes together, hence requiring a surrogate key to be associated with each attribute of each record in the database. Attribute-level vertical decomposition is also used in Remotely Sensed Imagery (e.g., Landsat Thematic Mapper
Imagery), where it is called Band Sequential (BSQ) format. Beyond attribute-level decomposition, Wong et al. (1985) presented the Bit Transposed File model (BTF), which took advantage of encoded attribute values using a small number of bits to reduce the storage space. In addition to ATF, BTF, and DSM models, there has been other work on vertical data structuring, such as BitSliced Indexes (BSI) (Chan & Ioannidis, 1998; O’Neil & Quass, 1997; Rinfret et al., 2001), Encoded Bitmap Indexes (EBI) (Wu & Buchmann, 1998; Wu, 1998), and Domain Vector Accelerator (DVA) (Perrizo et al., 1991). A Bit-Sliced Index (BSI) is an ordered list of bitmaps used to represent values of a column or attribute. These bitmaps are called bit-slices, which provide binary representations of attribute’s values for all the rows. In the EBI approach, an encoding function on the attribute domain is applied and a binary-based bit-sliced index on the encoded domain is built. EBIs minimize the space requirement and show more potential optimization than binary bit-slices. Both BSIs and EBIs are auxiliary index structures that need to be stored twice for particular data columns. As we know, even the simplest index structure used today incurs substantial increase in total storage requirements. The increased database size, in turn, translates into higher media and maintenance costs, and results in lower performance. Domain Vector Accelerator (DVA) is a method to perform relational operations based on vertical bit-vectors. The DVA method performs particularly well for joins involving a primary key attribute and an associated foreign key attribute. Vertical mining requires data to be organized vertically and be processed horizontally through fast, multi-operand logical operations, such as AND, OR, XOR, and complement. Predicate tree (P-tree1) is one form of lossless vertical structure that meets this requirement. P-tree is suitable to represent numerical and categorical data and has been successfully used in various data mining applications, including classification (Khan et al., 2002), clustering (Denton et al., 2002), and association rule mining (Ding et al., 2002). P-trees can be 1-dimensional, 2-dimensional, and multidimensional. If the data has a natural dimension (e.g., spatial data), the P-tree dimension is matched to the data dimension. Otherwise, the dimension can be chosen to optimize the compression ratio. To convert a relational table of horizontal records to a set of vertical P-trees, the table has to be projected into columns, one for each attribute, retaining the original record order in each. Then each attribute column is further decomposed into separate bit vectors, one for each bit position of the values in that attribute. Each bit vector is
then compressed into a tree structure by recording the truth of the predicate “purely 1-bits” recursively on halves until purity is reached.
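As a minimal sketch of the idea (an addition; real P-tree implementations additionally compress each bit vector into a quadrant tree and operate on the compressed form), the following Python fragment decomposes an integer attribute column into vertical bit vectors and answers a simple count query with a bit-wise AND.

def bit_slices(values, bits):
    """Decompose an integer attribute column into vertical bit vectors,
    one per bit position (most significant first), in the spirit of the
    P-tree / BSQ decomposition described above."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]

def bitwise_and(slice_a, slice_b):
    """Horizontal processing of vertical data: AND of two bit vectors."""
    return [a & b for a, b in zip(slice_a, slice_b)]

attribute = [5, 3, 7, 2, 6]            # a toy column of 3-bit values
slices = bit_slices(attribute, 3)
print(slices)                          # [[1,0,1,0,1], [0,1,1,1,1], [1,1,1,0,0]]
# Count tuples whose two highest-order bits are both 1 (i.e., values >= 6):
mask = bitwise_and(slices[0], slices[1])
print(sum(mask))                       # 2  (the values 7 and 6)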
Vertical Mining Approaches

A number of vertical data mining algorithms have been proposed, especially in the area of association rule mining, and mining algorithms using the vertical format have been shown to outperform horizontal approaches in many cases. One example is the Frequent Pattern Growth algorithm using Frequent Pattern Trees introduced by Han et al. (2001). The advantages come from the fact that frequent patterns can be counted via transaction_id_set intersections, instead of using complex internal data structures; the horizontal approach, on the other hand, requires complex hash/search trees. Zaki and Hsiao (2002) introduced a vertical representation called diffset, which keeps track of the differences in the tidset of a candidate pattern from its generated frequent patterns. Diffsets drastically reduce the memory required to store intermediate results; therefore, even in dense domains, the entire working set of patterns of several vertical mining algorithms can fit entirely in main memory, facilitating the mining of very large databases. Shenoy et al. (2000) propose a vertical approach, called VIPER, for association rule mining of large databases. VIPER stores data in compressed bit-vectors and integrates a number of optimizations for efficient generation, intersection, counting, and storage of bit-vectors, which provides significant performance gains for large databases, with close to linear scale-up with database size. P-trees have been applied to a wide variety of data mining areas. The efficient P-tree storage structure and the P-tree algebra provide a fast way to calculate various measurements for data mining tasks, such as support and confidence in association rule mining, information gain in decision tree classification, and Bayesian probability values in Bayesian classification. P-trees have also been successfully used in many kinds of distance-based classification and clustering techniques. A computationally efficient distance metric called the Higher Order Basic Bit distance (HOBBit) (Khan et al., 2002) has been proposed based on P-trees. For one dimension, the HOBBit distance is defined as the number of digits by which the binary representation of an integer has to be right-shifted to make two numbers equal. For more than one dimension, the HOBBit distance is defined as the maximum of the HOBBit distances in the individual dimensions. Since computers use binary systems to represent numbers in memory, bit-wise logical operations are much faster than ordinary arithmetic operations such as addition.
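The HOBBit metric itself is simple to state in code; the following Python sketch (an addition) computes it for single values and for multi-dimensional points.

def hobbit_1d(a, b):
    """HOBBit distance in one dimension: the number of right-shifts needed
    to make the binary representations of a and b equal."""
    shifts = 0
    while a != b:
        a >>= 1
        b >>= 1
        shifts += 1
    return shifts

def hobbit(p, q):
    """Multi-dimensional HOBBit distance: the maximum of the per-dimension
    distances, as described above."""
    return max(hobbit_1d(a, b) for a, b in zip(p, q))

print(hobbit_1d(12, 15))          # 12 = 1100, 15 = 1111 -> equal after 2 shifts
print(hobbit((12, 7), (15, 4)))   # max(2, 2) = 2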
Multi-Relational Vertical Mining
Multi-Relational Data Mining (MRDM) is the process of knowledge discovery from relational databases consisting of multiple tables. The rise of several application areas in Knowledge Discovery in Databases (KDD) that are intrinsically relational has provided, and continues to provide, a strong motivation for the development of MRDM approaches. Scalability has always been an important concern in the field of data mining, and it is even more important in the multi-relational context, which is inherently more complex. From a database perspective, multi-relational data mining usually involves one or more joins between tables, which is not the case for classical data mining methods. Until now, there is still a lack of accurate, efficient, and scalable multi-relational data mining methods to handle large databases with complex schemas.
Databases are usually normalized for implementation reasons. However, for data mining workloads, denormalizing the relations into a view may better represent the real world. In addition, if a view is materialized by storing it on disk, it can be accessed much faster without being computed on the fly. Two alternative materialized view approaches, the relational materialized view model and the multidimensional materialized view model, can be utilized to model relational data (Ding, 2004). In the relational model, a relational or extended-relational DBMS is used to store and manage data. A set of vertical materialized views can be created to encompass all the information necessary for data mining. Vertical materialized views can be generated directly from the vertical representation of the original data, and the transformation can be done in parallel by Boolean operations. In the multidimensional materialized view model, multidimensional data are mapped directly to a data cube structure. The advantage of using a data cube is that it allows fast indexing of precomputed data by offset calculation. The vertical materialized views of the data cube will require larger storage than those of the relational model. However, with the employment of the P-tree technology, there would not be much difference, due to the compression inside the P-tree structure. All the vertical materialized views can be easily stored in P-tree format. For a given data mining task, relevance analysis and feature selection can then be used to retrieve all the relevant materialized view P-trees for mining.
FUTURE TRENDS
Vertical data structures and vertical data mining will become more and more important. New vertical data structures will be needed for various types of data. There is great potential to combine vertical data mining with parallel mining as
well as hardware. Scalability will become a very important issue in the area of data mining. The challenge of scalability is not just dealing with a large number of tuples but also handling high dimensionality (Fayyad, 1999; Han & Kamber, 2001).
CONCLUSION
Horizontal data structures have been shown to be inefficient for data mining on very large data sets due to the large cost of scanning. It is important to develop vertical data structures and algorithms to address this scalability issue. Various structures have been proposed, among which the P-tree is a very promising vertical structure. P-trees have shown great performance in processing data containing large numbers of tuples, due to the fast logical AND operation performed without scanning (Ding et al., 2002). Vertical structures, such as P-trees, also provide an efficient way to do multi-relational data mining. In general, horizontal data structures are preferable for transactional data whose intended output is a relation, whereas vertical data mining is more appropriate for knowledge discovery on very large data sets.
REFERENCES
Batory, D.S. (1979). On searching transposed files. ACM Transactions on Database Systems, 4(4), 531-544.
Chan, C.Y., & Ioannidis, Y. (1998). Bitmap index design and evaluation. In Proceedings of the ACM SIGMOD (pp. 355-366).
Copeland, G., & Khoshafian, S. (1985). Decomposition storage model. In Proceedings of the ACM SIGMOD (pp. 268-279).
Denton, A., Ding, Q., Perrizo, W., & Ding, Q. (2002). Efficient hierarchical clustering of large data sets using P-trees. In Proceedings of the International Conference on Computer Applications in Industry and Engineering (pp. 138-141).
Ding, Q. (2004). Multi-relational data mining using vertical database technology. Ph.D. Thesis. North Dakota State University.
Ding, Q., Ding, Q., & Perrizo, W. (2002). Association rule mining on remotely sensed images using P-trees. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 66-79).
Fayyad, U. (1999). Editorial. SIGKDD Explorations, 1(1), 1-3.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco, CA: Morgan Kaufmann. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD (pp. 1-12).
Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest neighbor classification on spatial data streams using P-trees. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 517-528).
O'Neil, P., & Quass, D. (1997). Improved query performance with variant indexes. In Proceedings of the ACM SIGMOD (pp. 38-49).
Perrizo, W., Gustafson, J., Thureen, D., & Wenberg, D. (1991). Domain vector accelerator for relational operations. In Proceedings of the IEEE International Conference on Data Engineering (pp. 491-498).
Rinfret, D., O'Neil, P., & O'Neil, E. (2001). Bit-sliced index arithmetic. In Proceedings of the ACM SIGMOD (pp. 47-57).
Shenoy et al. (2000). Turbo-charging vertical mining of large databases. In Proceedings of the ACM SIGMOD (pp. 22-33).
Winslett, M. (2002). David DeWitt speaks out. ACM SIGMOD Record, 31(2), 50-62.
Wong, H.K.T., Liu, H.-F., Olken, F., Rotem, D., & Wong, L. (1985). Bit transposed files. In Proceedings of VLDB (pp. 448-457).
Wu, M.-C. (1998). Query optimization for selections using bitmaps. Technical Report DVS98-2, DVS1, Computer Science Department, Technische Universität Darmstadt.
Wu, M.-C., & Buchmann, A. (1998). Encoded bitmap indexing for data warehouses. In Proceedings of the IEEE International Conference on Data Engineering (pp. 220-230).
Zaki, M.J., & Hsiao, C.-J. (2002). CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the SIAM International Conference on Data Mining.

KEY TERMS

Association Rule Mining: The process of finding interesting association or correlation relationships among a large set of data items.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, such as classification, prediction, estimation, or affinity grouping.
HOBBit Distance: A computationally efficient distance metric. In one dimension, it is the number of digits by which the binary representation of an integer has to be right-shifted to make two numbers equal. In more than one dimension, it is the maximum of the HOBBit distances in the individual dimensions.

Multi-Relational Data Mining: The process of knowledge discovery from relational databases consisting of multiple tables.

Multi-Relational Vertical Mining: The process of knowledge discovery from relational databases consisting of multiple tables using vertical data mining approaches.

Predicate Tree (P-Tree): A lossless tree that is vertically structured and horizontally processed through fast multi-operand logical operations.

Vertical Data Mining (Vertical Mining): The process of finding patterns and knowledge from data organized in vertical formats, which aims to address scalability issues.
ENDNOTE 1
P-tree is a patent-pending technology developed by Dr. William Perrizo’s DataSURG research group at North Dakota State University.
Video Data Mining
JungHwan Oh, University of Texas at Arlington, USA
JeongKyu Lee, University of Texas at Arlington, USA
Sae Hwang, University of Texas at Arlington, USA
INTRODUCTION
Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data, typically stored in alpha-numeric databases, and relatively less work has been pursued for the mining of multimedia data (Zaïane, Han, & Zhu, 2000). Digital multimedia differs from previous forms of combined media in that the bits representing texts, images, audios, and videos can be treated as data by computer programs (Simoff, Djeraba, & Zaïane, 2002). One facet of these data, diverse as they are in terms of underlying models and formats, is that they are synchronized and integrated and hence can be treated as integrated data records. The collection of such integral data records constitutes a multimedia data set. The challenge of extracting meaningful patterns from such data sets has led to research and development in the area of multimedia data mining. This is a challenging field due to the non-structured nature of multimedia data. Such ubiquitous data are required in many applications, such as financial, medical, advertising, and Command, Control, Communications and Intelligence (C3I) applications (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). Multimedia databases are widespread and multimedia data sets are extremely large. There are tools for managing and searching within such collections, but the need for tools to extract hidden and useful knowledge embedded within multimedia data is becoming critical for many decision-making applications.
BACKGROUND Multimedia data mining has been performed for different types of multimedia data: image, audio and video. Let us first consider image processing before discuss-
ing image and video data mining since they are related. Image processing has been around for some time. Images include maps, geological structures, biological structures, and many other entities. We have image processing applications in various domains, including medical imaging for cancer detection and processing satellite images for space and intelligence applications. Image processing has dealt with areas such as detecting abnormal patterns that deviate from the norm and retrieving images by content (Thuraisingham, Clifton, Maurer, & Ceruti, 2001). The questions here are: what is image data mining and how does it differ from image processing? We can say that while image processing focuses on manipulating and analyzing images, image data mining is about finding useful patterns. Therefore, image data mining deals with making associations between different images from large image databases. One area of research in image data mining is the detection of unusual features. The approach is to develop templates that generate several rules about the images and then apply data mining tools to see whether unusual patterns can be obtained. Note that detecting unusual patterns is not the only outcome of image mining; that is just the beginning. Since image data mining is an immature technology, researchers are continuing to develop techniques to classify, cluster, and associate images (Goh, Chang, & Cheng, 2001; Li, Li, Zhu, & Ogihara, 2002; Hsu, Dai, & Lee, 2003; Yanai, 2003; Müller & Pun, 2004). Image data mining is an area with applications in numerous domains, including space, medicine, intelligence, and geoscience. Mining video data is even more complicated than mining still image data. One can regard a video as a collection of related still images, but a video is a lot more than just an image collection. Video data management has been the subject of many studies. The important areas include developing query and retrieval techniques for video databases (Aref, Hammad, Catlin, Ilyas, Ghanem, Elmagarmid, & Marzouk, 2003). The question we ask yet again is: what is the difference between video information retrieval and video mining? There is no clear-cut answer for this question yet. To be consistent
with our terminology, we can say that finding correlations and patterns previously unknown from large video databases is video data mining.
MAIN THRUST
Even though we define video data mining as finding previously unknown correlations and patterns, the current status of video data mining remains mainly at the pre-processing stage, in which preliminary issues such as video clustering and video classification are being examined and studied in preparation for the actual mining. Only a very limited number of papers about finding patterns in videos can be found. We discuss video clustering, video classification, and pattern finding in what follows.
Video Clustering
Clustering is a useful technique for the discovery of some knowledge from a dataset. It maps a data item into one of several clusters, which are natural groupings for data items based on similarity metrics or probability density models (Mitra & Acharya, 2003). Clustering pertains to unsupervised learning, when data with class labels are not available. Clustering consists of partitioning data into homogeneous granules or groups, based on some objective function that maximizes the inter-cluster distances while simultaneously minimizing the intra-cluster distances. Video clustering differs in some respects from conventional clustering. As mentioned earlier, due to the unstructured nature of video data, preprocessing of the video data using image processing or computer vision techniques is required to obtain structured features. Another difference in video clustering is that the time factor should be considered while the video data is processed. Since video consists of audio and visual data synchronized in time, it is very important to consider the time factor. Traditional clustering algorithms can be categorized into two main types: partitional and hierarchical clustering (Mitra & Acharya, 2003). Partitional clustering algorithms (e.g., K-means and EM) divide the patterns into a set of spherical clusters while minimizing the objective function; here the number of clusters is predefined. Hierarchical algorithms, on the other hand, can again be grouped as agglomerative and divisive; here no assumption is made about the shape or number of clusters, and a validity index is used to determine termination. Two of the most popular partitional clustering algorithms are K-means and Expectation Maximization (EM). In K-means, the initial centroids are selected, and each data item is assigned to the cluster with the smallest distance. Based on the previous results, the cluster centroids are updated, and all corresponding data items
are re-clustered until there is no centroid change. It is easily implemented and provides a firm foundation of variances across the clusters. Papers using the K-means algorithm for video clustering can be found in the literature (Ngo, Pong, & Zhang, 2001). EM is a popular iterative refinement algorithm that belongs to model-based clustering. It differs from the conventional K-means clustering algorithm in that each data point belongs to a cluster according to some weight or probability of membership. In other words, there are no strict boundaries between clusters. New means are computed based on weighted measures. It provides a statistical model for the data and is capable of handling the associated uncertainties. Papers using the EM algorithm for video clustering can be found in the literature (Lu & Tan, 2002; Frey & Jojic, 2003). Hierarchical clustering methods create hierarchical nested partitions of the dataset, using a tree-structured dendrogram and some termination criterion. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data on different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges two or more of the most appropriate clusters. Divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster. The process continues until a stopping criterion is achieved. The advantages of hierarchical clustering include embedded flexibility regarding the level of granularity, ease of handling any form of similarity or distance, and applicability to any attribute types. The disadvantages of hierarchical clustering are the vagueness of termination criteria and the fact that most hierarchical algorithms do not revisit constructed (intermediate) clusters for the purpose of their improvement. Hierarchical clustering is used in video clustering because it is easy to handle the similarity of features extracted from video, and it can represent the depth and granularity by the level of the tree (Okamoto, Yasugi, Babaguchi, & Kitahashi, 2002).
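The K-means procedure described above can be summarized in a few lines. The sketch below is a generic illustration (not tied to any of the cited video-clustering systems) and assumes the video has already been reduced to structured feature vectors, since raw video must first be pre-processed as discussed above; the sample feature values are hypothetical.

```python
import random

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, k, max_iter=100):
    """Basic K-means: assign each point to the nearest centroid,
    then recompute centroids until they stop changing."""
    centroids = random.sample(points, k)              # initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assignment step
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                # no centroid change: stop
            break
        centroids = new_centroids
    return centroids, clusters

# Toy per-shot feature vectors (e.g., motion and color features).
features = [(1.0, 2.0), (1.2, 1.8), (8.0, 8.5), (7.8, 8.2), (0.9, 2.1)]
centroids, clusters = kmeans(features, k=2)
```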
Video Classification
While clustering is an unsupervised learning method, classification is a way to categorize or assign class labels to a pattern set under supervision. Decision boundaries are generated to discriminate between patterns belonging to different classes. The data set is initially partitioned into training and test sets, and the classifier is trained on the former. A framework to enable semantic video classification and indexing in a specific video domain (medical video) was proposed (Fan,
Luo, & Lin, 2003). VideoGraph, a tool for finding similar scenes in a video, was studied (Pan & Faloutsos, 2001). A method for classification of different kinds of videos that uses the output of a concise video summarization technique that forms a list of key frames was presented (Lu, Drew, & Au, 2001).
Pattern Finding
A general framework for video data mining was proposed to address the issue of how to extract previously unknown knowledge and detect interesting patterns (Oh, Lee, Kote, & Bandi, 2003). In that work, the authors develop methods to segment the incoming raw video stream into meaningful pieces and to extract and represent features (e.g., motion) for characterizing the segmented pieces. The motion in a video sequence is expressed as an accumulation of quantized pixel differences among all frames in the video segment. As a result, the accumulated motion of the segment is represented as a two-dimensional matrix. Further, a method to capture the location of motions occurring in a segment is developed using the same matrix. The work also studies how to cluster the segmented pieces using the features (the amount and location of motion) extracted from this matrix. In addition, an algorithm is investigated to determine whether a segment contains normal or abnormal events by clustering and modeling normal events, which occur most frequently. Beyond deciding normal or abnormal, the algorithm computes a Degree of Abnormality of a segment, which represents the distance of a segment from the existing segments in relation to normal events. A fast dominant motion extraction scheme called Integral Template Match (ITM) and a set of qualitative and quantitative description schemes were proposed (Lan, Ma, & Zhang, 2003). A video database management framework and its strategies for mining video content structure and events were introduced (Zhu, Aref, Fan, Catlin, & Elmagarmid, 2003). Methods for extracting editing rules from video streams were proposed by introducing a new data mining technique (Matsuo, Amano, & Uehara, 2002).
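A minimal sketch of the accumulated-motion idea described above (our own simplification, not the code of Oh et al., 2003): pixel differences between consecutive frames are quantized and summed into a two-dimensional matrix whose cells indicate where, and how much, motion occurred in a segment. The frame sizes and threshold below are hypothetical.

```python
import numpy as np

def accumulated_motion(frames, threshold=20):
    """frames: list of equally sized 2-D grayscale arrays for one segment.
    Returns a matrix accumulating quantized pixel differences."""
    acc = np.zeros_like(frames[0], dtype=np.int32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(np.int32) - prev.astype(np.int32))
        acc += (diff > threshold).astype(np.int32)   # quantize: moved / not moved
    return acc

# Toy segment of three random 64x48 grayscale frames.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(48, 64)) for _ in range(3)]
motion = accumulated_motion(frames)
total_motion = motion.sum()        # amount of motion in the segment
row_profile = motion.sum(axis=1)   # coarse location of motion (by row)
```

Segments can then be clustered on such features, and a segment far from the clusters of frequent (normal) behavior receives a high Degree of Abnormality.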
FUTURE TRENDS
As mentioned above, there have been a number of attempts to apply clustering methods to video data. However, these classical clustering techniques only create clusters but do not explain why a cluster has been established (Perner, 2002). The conceptual clustering method builds clusters and explains why a set of objects forms a cluster. Thus, conceptual clustering is a type of learning by observations, and it is a way of summarizing data in an understandable manner. In contrast to hierarchi-
cal clustering methods, conceptual clustering methods build the classification hierarchy based on merging two groups. The algorithm properties are flexible in order to dynamically fit the hierarchy to the data. This allows incremental incorporation of new instances into the existing hierarchy and updating of this hierarchy according to the new instance. A concept hierarchy is a directed graph in which the root node represents the set of all input instances and the terminal nodes represent individual instances. Internal nodes stand for the sets of instances attached to them and represent a super-concept. The super-concept can be represented by a generalized representation of this set of instances, such as the prototype, the medoid, or a user-selected instance. A work applying this conceptual clustering approach to the image domain can be found in Perner (1998). However, we cannot find any papers related to conceptual clustering for video data. Since it is important to understand what a created cluster means semantically, we need to study how to apply conceptual clustering to video data. In fact, one of the most important techniques for data mining is association-rule mining, since it is one of the most efficient ways to find unknown patterns and knowledge. Therefore, we need to investigate how to apply association-rule mining to video data. For association-rule mining, we need a set of transactions D, a set of literals (or items) I, and an itemset X (Zhang & Zhang, 2002). After video is segmented into basic units such as shots, scenes, and events, each segmented unit can be modeled as a transaction, and the features from a unit can be considered as the items contained in the transaction. In this way, video association mining can be transformed into the problem of association mining in traditional transactional databases. A work using some associations among video shots to create a video summary was proposed (Zhu & Wu, 2003), but it did not formalize the concepts of transaction and itemset. We need to further investigate the optimal unit for the concept of transaction and the possible items in a video transaction to characterize it. Also, we need to study whether video can be considered as time-series data. This looks promising, since the behavior of a time-series data item is very similar to that of video data: a time-series data item has a value at any given time, and the value is changing over time; similarly, a feature of video has a value at any given time, and the value is changing over time. If video can be considered as time-series data, we can get the advantages of the techniques already developed for time-series data mining. When the similarity between data items is computed, ordinary distance metrics, such as Euclidean distance, may not be suitable because of the high dimensionality and the time factor. In order to address this problem, alternative ways are used to get a more
accurate measure of similarity; for example, Dynamic Time Warping and Longest Common Subsequences. Although most of data mining techniques are in batch processing, including video data mining as well as conventional data mining, some applications need to be processed in real time or near real time. For example, the anomaly detection system in a surveillance video using data mining should be processed in real time. Therefore, we also need to examine online video data mining.
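To illustrate why Dynamic Time Warping is attractive for such feature sequences, the following sketch uses the textbook dynamic-programming formulation (not any of the cited systems) to compare two hypothetical one-dimensional motion-intensity curves that describe the same event at different speeds.

```python
def dtw(seq_a, seq_b):
    """Classic O(n*m) dynamic-programming DTW with absolute-difference cost."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_a
                                 cost[i][j - 1],      # stretch seq_b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Hypothetical per-frame motion intensities for the same event at two speeds.
a = [0, 0, 1, 3, 5, 3, 1, 0]
b = [0, 1, 3, 3, 5, 5, 3, 1, 0, 0]
print(dtw(a, b))   # small warped distance despite the different lengths
```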
CONCLUSION
Data mining describes a class of applications that look for hidden knowledge or patterns in large amounts of data. Most data mining research has been dedicated to alpha-numeric databases, and relatively less work has been done on multimedia data mining. The current status and the challenges of video data mining, which is still a very immature field of multimedia data mining, are discussed in this paper. The issues discussed should be dealt with in order to obtain valuable information from vast amounts of video data.
REFERENCES
Aref, W., Hammad, M., Catlin, A.C., Ilyas, I., Ghanem, T., Elmagarmid, A., & Marzouk, M. (2003). Video query processing in the VDBMS testbed for video database research. Proceedings of the 1st ACM International Workshop on Multimedia Databases (pp. 25-32).
Fan, J., Luo, H., & Lin, X. (2003). Semantic video classification by integrating flexible mixture model with adaptive EM algorithm. Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval (pp. 9-16).
Li, T., Li, Q., Zhu, S., & Ogihara, M. (2002). A survey on wavelet applications in data mining. ACM SIGKDD Explorations Newsletter, 4(2), 49-68.
Lu, C., Drew, M.S., & Au, J. (2001). Classification of summarized videos using hidden Markov models on compressed chromaticity signatures. Proceedings of the 9th ACM International Conference on Multimedia (pp. 479-482).
Lu, H., & Tan, Y.P. (2002). On model-based clustering of video scenes using scenelets. Proceedings of 2002 IEEE International Conference on Multimedia and Expo, Vol. 1 (pp. 301-304).
Matsuo, Y., Amano, M., & Uehara, K. (2002). Mining video editing rules in video streams. Proceedings of the 10th ACM International Conference on Multimedia (pp. 255-258).
Mitra, S., & Acharya, T. (2003). Data mining: Multimedia, soft computing, and bioinformatics. John Wiley & Sons, Inc.
Müller, H., & Pun, T. (2004). Learning from user behavior in image retrieval: Application of market basket analysis. International Journal of Computer Vision, 56(1/2), 65-77.
Ngo, C.W., Pong, T.C., & Zhang, H.J. (2001). On clustering and retrieval of video shots. Proceedings of the 9th ACM International Conference on Multimedia (pp. 51-60).
Oh, J., Lee, J., Kote, S., & Bandi, B. (2003). Multimedia data mining framework for raw video sequences. Mining Multimedia and Complex Data, Lecture Notes in Artificial Intelligence, Vol. 2797 (pp. 18-35). Springer Verlag.
Frey, B.J., & Jojic, N. (2003). Transformation-invariant clustering using the EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1), 1-17.
Okamoto, H., Yasugi, Y., Babaguchi, N., & Kitahashi, T. (2002). Video clustering using spatio-temporal image with fixed length. Proceedings of 2002 IEEE International Conference on Multimedia and Expo, Vol. 1 (pp. 53-56).
Goh, K.S., Chang, E., & Cheng, K.T. (2001). SVM binary classifier ensembles for image classification. Proceedings of the 10th International Conference on Information and Knowledge Management (pp. 395-402).
Pan, J.U., & Faloutsos, C. (2001). VideoGraph: A new tool for video mining and classification. Proceedings of the 1st ACM/IEEE-CS Joint Conference in Digital Libraries (pp. 116-117).
Hsu, W., Dai, J., & Lee, M.L. (2003). Mining viewpoint patterns in image databases. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 553-558).
Perner, P. (1998). Using CBR learning for the low-level and high-level unit of an image interpretation system. In S. Singh (Ed.), International Conference on Advances in Pattern Recognition (ICAPR98) (pp. 45-54). London: Springer Verlag.
Lan, D.J., Ma, Y.F., & Zhang, H.J. (2003). A novel motion-based representation for video mining. Proceedings of 2003 IEEE International Conference on Multimedia and Expo (ICME '03), Vol. 3 (pp. 469-472).
Perner, P. (2002). Data mining on multimedia data. Springer.
Simoff, S.J., Djeraba, C., & Zaïane, O.R. (2002). MDM/KDD2002: Multimedia data mining between promises and problems. ACM SIGKDD Explorations Newsletter, 4(2), 118-121.
Thuraisingham, B., Clifton, C., Maurer, J., & Ceruti, M.G. (2001). Real-time data mining of multimedia objects. Proceedings of the 4th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC-2001) (pp. 360-365).
Yanai, K. (2003). Managing images: Generic image classification using visual knowledge on the Web. Proceedings of the 11th ACM International Conference on Multimedia (pp. 167-176).
Zaïane, O.R., Han, J., & Zhu, H. (2000). Mining recurrent items in multimedia with progressive resolution refinement. Proceedings of the 16th International Conference on Data Engineering (pp. 461-470).
Zhang, C., & Zhang, S. (2002). Association rule mining. Springer.
Zhu, X., Aref, W.G., Fan, J., Catlin, A.C., & Elmagarmid, A.K. (2003). Medical video mining for efficient database indexing, management and access. Proceedings of the 19th International Conference on Data Engineering (ICDE '03) (pp. 569-580).
Zhu, X., & Wu, X. (2003). Sequential association mining for video summarization. Proceedings of 2003 IEEE International Conference on Multimedia and Expo (ICME '03), Vol. 3 (pp. 333-336).

KEY TERMS

Classification: A method of categorizing or assigning class labels to a pattern set under supervision.
Clustering: A process of mapping a data item into one of several clusters, where clusters are natural groupings for data items based on similarity metrics or probability density models.

Concept Hierarchy: A directed graph in which the root node represents the set of all input instances and the terminal nodes represent individual instances.

Conceptual Clustering: A type of learning by observations and a way of summarizing data in an understandable manner.

Degree of Abnormality: A probability that represents to what extent a segment is distant from the existing segments in relation to normal events.

Dendrogram: A 'tree-like' diagram that summarizes the process of clustering. Similar cases are joined by links whose position in the diagram is determined by the level of similarity between the cases.

Digital Multimedia: The bits that represent texts, images, audios, and videos, and are treated as data by computer programs.

Image Data Mining: A process of finding unusual patterns and making associations between different images from large image databases. One could mine for associations between images, cluster images, classify images, as well as detect unusual patterns.

Image Processing: A research area analyzing and manipulating digital images for image enhancement, data compression, or pattern discovery.

Video Data Mining: A process of finding correlations and patterns previously unknown from large video databases.
Visualization Techniques for Data Mining
Herna L. Viktor, University of Ottawa, Canada
Eric Paquet, National Research Council of Canada, Canada
INTRODUCTION
The current explosion of data and information, mainly caused by data warehousing technologies as well as the extensive use of the Internet and its related technologies, has increased the urgent need for the development of techniques for intelligent data analysis. Data mining, which concerns the discovery and extraction of knowledge chunks from large data repositories, is aimed at addressing this need. Data mining automates the discovery of hidden patterns and relationships that may not always be obvious. Data mining tools include classification techniques (such as decision trees, rule induction programs and neural networks) (Han & Kamber, 2001), clustering algorithms and association rule approaches, amongst others. Data mining has been fruitfully used in many domains, including marketing, medicine, finance, engineering and bioinformatics. There still are, however, a number of factors that militate against the widespread adoption and use of this new technology. This is mainly due to the fact that the results of many data mining techniques are often difficult to understand. For example, the results of a data mining effort producing 300 pages of rules will be difficult to analyze. The visual representation of the knowledge embedded in such rules will help to heighten the comprehensibility of the results. The visualization of the data itself, as well as the data mining process, should go a long way towards increasing the user's understanding of and faith in the data mining process. That is, data and information visualization provide users with the ability to obtain new insights into the knowledge, as discovered from large repositories. This paper describes a number of important visual data mining issues and introduces techniques employed to improve the understandability of the results of data mining. Firstly, the visualization of data prior to and during data mining is addressed. Through data visualization the quality of the data can be assessed throughout the knowledge discovery process, which includes data preprocessing, data mining and reporting. We also discuss information visualization, that is, how the knowl-
edge, as discovered by a data mining tool, may be visualized throughout the data mining process. This aspect includes visualization of the results of data mining as well as the learning process. In addition, the paper shows how virtual reality and collaborative virtual environments may be used to obtain an immersive perspective of the data and the data mining process.
BACKGROUND
Human beings intuitively search for novel features, patterns, trends, outliers and relationships in data (Han & Kamber, 2001). Through visualizing the data and the concept descriptions obtained (e.g., in the form of rules), a qualitative overview of large and complex data sets can be obtained. In addition, data and rule visualization can assist in identifying regions of interest and appropriate parameters for more focused quantitative analysis (Grinstein & Ward, 2001). The user can thus get a "rough feeling" of the quality of the data, in terms of its correctness, adequacy, completeness, and relevance. The use of data and rule visualization thus greatly expands the range of models that can be understood by the user, thereby easing the so-called "accuracy versus understandability" tradeoff (Thearling et al., 2001). Data mining techniques construct a model of the data through repetitive calculation to find statistically significant relationships within the data. However, the human visual perception system can detect patterns within the data that are unknown to a data mining tool. This combination of the various strengths of the human visual system and data mining tools may subsequently lead to the discovery of novel insights and the improvement of the human's perspective of the problem at hand. Visual data mining harnesses the power of the human vision system, making it an effective tool to comprehend data distribution, patterns, clusters and outliers in data (Han & Kamber, 2001). Visual data mining is currently an active area of research. Examples of related commercial data mining packages include the DBMiner data mining system, See5
which forms part of the RuleQuest suite of data mining tools, Clementine developed by Integral Solutions Ltd (ISL), Enterprise Miner developed by SAS Institute, Intelligent Miner produced by IBM, and various other tools (Han & Kamber, 2001). Neural network tools such as NeuroSolutions and SNNS and Bayesian network tools including Hugin, TETRAD, and Bayesware Discoverer, also incorporate extensive visualization facilities. Examples of related research projects and visualization approaches include MLC++, WEKA, AlgorithmMatrix and C4.5/See5, amongst others (Han & Kamber, 2001; Fayyad et al., 2001). Visual data mining integrates data visualization and data mining and is closely related to computer graphics, multimedia systems, human computer interfaces, pattern recognition and high performance computing.
MAIN THRUST
Data and information visualization will be further explored in terms of their benefits for data mining.
Data Visualization Data visualization provides a powerful mechanism to aid the user during both data preprocessing and the actual data mining (Foong, 2001). Through the visualization of the original data, the user can browse to get a “feel” for the properties of that data. For example, large samples can
be visualized and analyzed (Grinstein & Ward, 2001). In particular, visualization may be used for outlier detection, which highlights surprises in the data, that is, data instances that do not comply with the general behavior or model of the data (Han & Kamber, 2001; Pyle, 1999). In addition, the user is aided in selecting the appropriate data through a visual interface. Data transformation is an important data preprocessing step. During data transformation, visualizing the data can help the user to ensure the correctness of the transformation. That is, the user may determine whether the two views (original vs. transformed) of the data are equivalent. Visualization may also be used to assist users when integrating data sources, assisting them to see relationships within the different formats. Data visualization techniques are classified with respect to three aspects (Grinstein & Ward, 2001): firstly, their focus, that is, symbolic versus geometric; secondly, their stimulus (2-D vs. 3-D); and lastly, their display (static or dynamic) (Fayyad et al., 2001). In addition, data in a data repository can be viewed at different levels of granularity or abstraction, or as different combinations of attributes or dimensions. The data can be presented in various visual formats, including box plots, scatter plots, 3-D cubes, data distribution charts, curves, volume visualization, surfaces or link graphs, amongst others (Grinstein & Ward, 2001). Scatter plots refer to the visualization of data items according to two axes, namely X and Y values.
Figure 1. The Vizimine data and information tool (Viktor et al., 2003)
According to Hoffman & Grinstein (2001), the scatter plot is the most popular visualization tool, since it can help find clusters, outliers, trends and correlations. Figure 1 shows an example of a scatter plot used in the ViziMine system, which visualizes both the data set and the subset of the data covered by a particular rule as discovered by a data mining tool (Viktor et al., 2003). The figure shows the scatter plot for the rule IF (petal-width > 17.50) then Iris-type = Virginica. This data set involves the classification of Irises into one of three types. The figure shows the Virginica data instances covered by the rule and the data instances from the other two types of Irises not covered by the rule. Importantly, it also indicates those Virginicas that were not covered by the rule. Other data visualization techniques, including 3-D cubes, are used in relationship diagrams, where the data are compared as totals of different categories. In surface charts, the data points are visualized by drawing a line between them. The area defined by the line, together with the lower portion of the chart, is subsequently filled. Link or line graphs display the relationships between data points through fitting a connecting line (Paquet et al., 2000). They are normally used for 2-D data where the X value is not repeated (Hoffman & Grinstein, 2001). Advanced visualization techniques may greatly expand the range of models that can be understood by domain experts, thereby easing the so-called accuracy-versus-understandability trade-off (Singhal et al., 1999). However, due to the so-called "curse of dimensionality," which refers to the problems associated with working with numerous dimensions, highly accurate models are usually less understandable, and vice versa. In a data mining system, the aim of data visualization is to obtain an initial understanding of the data and the quality thereof. The actual accurate assessment of the data and the discovery of new knowledge are the tasks of the data mining tools. Therefore, the visual display should preferably be highly understandable, possibly at the cost of accuracy. The use of one or more of the above-mentioned data visualization techniques thus helps the user to obtain an initial model of the data, in order to detect possible outliers and to obtain an intuitive assessment of the quality of the data used for data mining. The visualization of the data mining process and results is discussed next.
Information Visualization
According to Foster and Gee (2001), it is crucial to be aware of what users require for exploring data sets, small and large. The driving force behind visualizing data mining models can be broken down into two key areas, namely understanding and trust (Singhal et al., 1999; Thearling et al., 2001). Understanding means more than just comprehension; it also involves context.
If the user can understand what has been discovered in the context of the business issue, he will trust the data and the underlying model and thus put it to use. Visualizing a model also allows a user to discuss and explain the logic behind the model to others. In this way, the overall trust in the model increases and subsequent actions taken as a result are justifiable (Thearling et al., 2001). The art of information visualization can be seen as the combination of three well-defined and understood disciplines, namely cognitive science, graphic arts and information graphics. A number of important factors have to be kept in mind when visualizing both the execution of the data mining algorithm (process visualization), for example, the construction of a decision tree, and displaying the results thereof (result visualization). The visualization approach should provide an easy understanding of the domain knowledge, explore visual parameters and produce useful outputs. Salient features should be encoded graphically and the interactive process should prove useful to the user. The format of knowledge extracted during the mining process depends on the type of data mining task and its complexity. Examples include classification rules, association rules, temporal sequences and causal graphs (Singhal, 1999). Visualization of these data mining results involves the presentation of the results or knowledge obtained from data mining in visual forms, such as decision trees, association rules, clusters, outliers and generalized rules. For example, the Silicon Graphics (SGI) MineSet 3.0 toolset uses connectivity diagrams to visualize decision trees, and simple Bayesian and decision table classifiers (Han & Kamber, 2001; Thearling et al., 2001). Other examples include the Evidence Visualizer, which is used to visualize Bayesian classifiers (Becker et al., 2001); the DB-Discover system that uses multi-attribute generalization to summarize data (Hilderman, 2001); and the NASD Regulation Advanced Detection System, which employs decision trees and association rule visualization for surveillance of the NASDAQ stock market (Senator et al., 2001). Alternatively, visualization of the constructs created by a data mining tool (e.g., rules, decision tree branches, etc.) and the data covered by them may be accomplished through the use of scatter plots and box plots. For example, scatter plots may be used to indicate the points of data covered by a rule in one color and the points not covered in another color. The ViziMine tool uses this method, as depicted in Figure 1 (Viktor et al., 2003). This visualization method allows users to ask simple, intuitive questions interactively (Thearling et al., 2001). That is, the user is able to complete some form of "what if" analysis. For example, consider a rule IF (petal-width > 17.50) then Iris-type = Virginica from the Iris data repository. The user is subsequently able to see the
effect on the data points covered when the rule's conditions are changed slightly, for example, to IF (petal-width > 16.50) then Iris-type = Virginica.
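The kind of rule-coverage view described above can be produced by partitioning the data according to the rule's condition before plotting; the sketch below uses a handful of made-up Iris-like records (hypothetical values, widths in mm) rather than the actual ViziMine code, and the resulting groups can be fed to any scatter-plot facility.

```python
# Hypothetical (petal_width, petal_length, class) records.
data = [
    (24.0, 56.0, "Virginica"), (19.0, 51.0, "Virginica"),
    (16.0, 45.0, "Versicolor"), (13.0, 40.0, "Versicolor"),
    (2.0, 14.0, "Setosa"), (4.0, 15.0, "Setosa"),
    (17.0, 48.0, "Virginica"),          # a Virginica missed by the rule
]

def partition_by_rule(records, threshold=17.5):
    """Split records into the three groups a rule-coverage scatter plot shows:
    covered by the rule, uncovered Virginicas, and the remaining instances."""
    covered = [r for r in records if r[0] > threshold]
    missed_virginica = [r for r in records
                        if r[0] <= threshold and r[2] == "Virginica"]
    others = [r for r in records if r[0] <= threshold and r[2] != "Virginica"]
    return covered, missed_virginica, others

# "What if" analysis: relax the rule threshold and see how coverage changes.
for t in (17.5, 16.5):
    covered, missed, _ = partition_by_rule(data, threshold=t)
    print(t, len(covered), "covered,", len(missed), "Virginicas missed")
```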
FUTURE TRENDS
Three-dimensional visualization has the potential to show far more information than two-dimensional visualization, while retaining its simplicity. This visualization technique quickly reveals the quantity and relative strength of relationships between elements, helping to focus attention on important data entities and rules. It therefore aids both the data preprocessing and data mining processes. In two dimensions, data representation is limited to bidimensional graphical elements. In three dimensions, both two- and three-dimensional graphical elements can be utilized. These elements are much more numerous and diversified in three dimensions than in two. Furthermore, three-dimensional representations (or descriptors) can be either volumetric or surface-based, depending on whether the internal structure is of interest or not. A surface-based representation only takes into account the outer appearance or the shell of the object, while a volumetric approach assigns a value to each volume element. The latter approach is quite common in biomedical imagery such as CAT scanning. Many techniques are available to visualize data in three dimensions (Harris, 2000). For example, it is very common to represent data by glyphs (Hoffman & Grinstein, 2001; Fayyad et al., 2001). A glyph can be defined as a three-dimensional object suitable for representing data or subsets of data. The object is chosen in order to facilitate both the visualization and the data mining process. The glyph must be self-explanatory and unambiguous. Glyphs can have various attributes such as their color and scale. When using these attributes to describe a glyph, a so-called content-based descriptor is constructed. Even though most glyphs are rigid objects, non-rigid and articulated objects can be used as well. It is then possible to use the deformation and the pose of the glyph in order to represent some specific behavior of the data set. Furthermore, glyphs can be animated in order to model some dynamic process. Three-dimensional visualization can be made more efficient by the use of virtual reality (VR). A virtual environment (VE) is a three-dimensional environment characterized by the fact that it is immersive, interactive, illustrative and intuitive. The fact that the environment is immersive is of great importance in data mining. In traditional visualization, the human subject looks at the data from outside, while in a VR environment the user is part of the data world. This means that the user can utilize all his
senses in order to navigate and understand the data. This also implies that the representation is more intuitive. VR is particularly well adapted to representing the scale and the topology of various sets of data. That becomes even more evident when stereo visualization is utilized, since stereo vision allows the analyst to have a real depth perception. This depth perception is important in order to estimate the relative distances and scales between the glyphs. Such estimation can be difficult without stereo vision if the scene does not correspond to the paradigms our brain is used to processing. In certain cases, the depth perception can be enhanced by the use of metaphors. Collaborative virtual environments (CVEs) can be considered as a major breakthrough in data mining (Singhal et al., 1999). By analogy, they can be considered as the equivalent of collaborative agents in visualization. Traditionally, one or more analysts perform visualization at a unique site. This operational model does not reflect the fact that many enterprises are distributed worldwide and so are their operations, data and specialists. It is consequently impossible for those enterprises to centralize all their data mining operations in a single center. Not only must they collaborate on the data mining process, which can be carried out automatically to a certain extent by distributed and collaborative agents, but they must also collaborate on the visualization and the visual data mining aspects.
CONCLUSION
The ability to visualize the results of a data mining effort helps the user to understand and trust the knowledge embedded in it. Data and information visualization provide the user with the ability to get an intuitive "feel" for the data and the results, for example in the form of rules, that are being created. This ability can be fruitfully used in many business areas, for example for fraud detection, diagnosis in medical domains and credit screening, amongst others. Virtual reality and collaborative virtual environments are opening up challenging new avenues for data mining. VR is perfectly adapted to analyze alpha-numerical data and to map them to a virtually infinite number of representations. Collaborative virtual environments provide a framework for collaborative and distributed data mining by making an immersive and synergic analysis of data and related patterns possible. In addition, there is a wealth of multimedia information waiting to be data mined. With the recent advent of a wide variety of content-based descriptors and the MPEG-7 standard to handle them, the fundamental framework is now in place to undertake this task (MPEG-7, 2004). The use of virtual reality to effectively
manipulate and visualize both the multimedia data and descriptors opens up exciting new research avenues.
REFERENCES Becker, B., Kohavi, R., & Sommerfield, D. (2001). Visualizing the simple Bayesian classifier. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 237-250). San Francisco: Morgan Kaufmann. Fayyad, U., Grinstein, G.G., & Wierse, A. (2001). Information visualization in data mining and knowledge discovery. San Francisco: Morgan Kaufmann. Foong, D.L.W. (2001). A visualization-driven approach to strategic knowledge discovery. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 181-190). San Francisco: Morgan Kaufmann. Foster, M., & Gee, A.G. (2001). The data visualization environment. In U. Fayyad, G.G. Grinstein, & A. Wiese (eds.), Information visualization in data mining and knowledge discovery (pp. 83-94). San Francisco: Morgan Kaufmann. Grinstein, G.G., & Ward, M.O. (2001). Introduction to data visualization. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 21-26). San Francisco: Morgan Kaufmann. Han, J., & Kamber, M. (2001). Data mining concepts and techniques. San Francisco: Morgan Kaufmann. Harris, R.L. (2000). Information graphics: A comprehensive illustrated reference. Oxford: Oxford University Press. Hilderman, R.J., Li, L., & Hamilton, H.J. (2001). Visualizing data mining results with domain generalization graphs. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 251-269). San Francisco: Morgan Kaufmann. Hoffman, P.E., & Grinstein, G.G. (2001). A survey of visualization for high-dimensional data mining. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 47-82). San Francisco: Morgan Kaufmann.
MPEG-4 and MPEG-7. (n.d.). Retrieved from http://mpeg.telecomitalialab.com
Paquet, E., Robinette, K.M., & Rioux, M. (2000). Management of three-dimensional and anthropometric databases: Alexandria and Cleopatra. Journal of Electronic Imaging, 9, 421-431.
Pyle, D. (1999). Data preparation for data mining. San Francisco: Morgan Kaufmann.
Senator, T.E., Goldberg, H.G., & Shyr, P. (2001). The NASD regulation advanced detection system. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 363-371). San Francisco: Morgan Kaufmann.
Singhal, S. et al. (1999). Networked virtual environments: Design and implementation. Reading, MA: Addison-Wesley.
Thearling, K. et al. (2001). Visualizing data mining models. In U. Fayyad, G.G. Grinstein, & A. Wiese (Eds.), Information visualization in data mining and knowledge discovery (pp. 205-222). San Francisco: Morgan Kaufmann.
Viktor, H.L., Paquet, E., & Le Roux, J.G. (2003). Cooperative learning and virtual reality-based visualization for data mining. In J. Wang (Ed.), Data mining: Opportunities and challenges (pp. 55-79). Hershey, PA: IRM Publishers.
KEY TERMS Collaborative Virtual Environment: An environment that actively supports human-human communication in addition to human-machine communication and which uses a virtual environment as the user interface. Curse of Dimensionality: The problems associated with information overload, when the number of dimensions is too high to visualize. Data Visualization: The visualization of the data set through the use of techniques such as scatter plots, 3-D cubes, link graphs and surface charts. Dimensionality Reduction: The removal of irrelevant, weakly relevant, or redundant attributes or dimensions through the use of techniques such as principle component analysis or sensitivity analysis.
Information Visualization: The visualization of data mining models, focusing on the results of data mining and the data mining process itself. Techniques include rulebased scatter plots, connectivity diagrams, multi-attribute generalization and decision tree and association rule visualization.
Multimedia Data Mining: The application of data mining to data sets consisting of multimedia data, such as
2-D images, 3-D objects, video and audio. Multimedia data can be viewed as integral data records, which consist of relational data together with diverse multimedia content.

Virtual Reality: Immersive, interactive, illustrative and intuitive representation of the real world based on visualization and computer graphics.
Visual Data Mining: The integration of data visualization and data mining. Visual data mining is closely related to computer graphics, multimedia systems, human computer interfaces, pattern recognition and high performance computing.
Visualization: The graphical expression of data or information.
Wavelets for Querying Multidimensional Datasets
Cyrus Shahabi, University of Southern California, USA
Dimitris Sacharidis, University of Southern California, USA
Mehrdad Jahangiri, University of Southern California, USA
INTRODUCTION
Following the constant technological advancements that provide more processing power and storage capacity, scientific applications have emerged as a new field of interest for the database community. Such applications, termed Online Science Applications (OSA), require continuous interaction with datasets of a multidimensional nature, mainly for performing statistical analysis. OSA can seriously benefit from the ongoing research on OLAP systems and the pre-calculation of aggregate functions for multidimensional datasets. One of the tools that we see fit for the task at hand is the wavelet transformation. Due to its inherent multi-resolution properties, wavelets can be utilized to provide progressively approximate and eventually fast exact answers to complex queries in the context of Online Science Applications.
BACKGROUND
OLAP systems emerged from the need to deal efficiently with large multidimensional datasets in support of complex analytical and exploratory queries. Gray et al. (Gray, Bosworth, Layman, & Pirahesh, 1996) demonstrated that analysis of multidimensional data was inadequately supported by traditional relational databases. They proposed a new relational aggregation operator, the Data Cube, which accommodates aggregation of multidimensional data. The relational model, however, is inadequate to describe such data, and an inherently multidimensional approach using sparse arrays was suggested in Zhao, Deshpande & Naughton (1997) to compute the data cube. Since the main use of a data cube is to support aggregate queries over ranges on the domains of the dimensions, a large amount of work has
been focused on providing faster answers to such queries at the expense of higher update and maintenance costs. Pre-aggregation is the key term here, as it results in performance benefits. Ho et al. (1997) proposed a data cube (Prefix Sum) in which each cell stores the summation of the values in all previous cells, so that the cube can answer range-aggregate queries in constant time. The maintenance cost of this technique, however, can be as large as the size of the cube. A number of subsequent publications focused on balancing the trade-off between pre-aggregation benefits and maintenance costs. It was not until recent years that the Wavelet Transformation was proposed as a means to do pre-aggregation on a multidimensional dataset. However, most of these approaches share the disadvantage of providing only approximate answers by compressing the data. Vitter, Wang, and Iyer (1998) used the wavelet transformation to compress a pre-processed version of the data cube, and Vitter and Wang (1999) the original data cube, constructing Compact Data Cubes. Lemire (2002) transforms a pre-aggregated version of the data cube to support progressive answering, whereas in Wu, Agrawal & Abbadi (2000) and Chakrabarti, Garofalakis, Rastogi, & Shim (2000) the data cube is directly transformed and compressed into the wavelet domain, in a way similar to image compression. A totally different perspective on using wavelets for scientific queries is proposed in Schmidt & Shahabi (2002). Here, the answer to queries posed in scientific applications is represented as the dot-product of a query vector with a data vector. It has been shown (Schmidt & Shahabi, 2002) that for a particular class of queries, wavelets can compress the query vector, making fast progressive evaluation of these queries a reality. This technique, as it is based on compressing the query and not the data, can accommodate exact, approximate or progressive query evaluation.
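For intuition on the Prefix Sum idea, the one-dimensional sketch below (our own illustration, not Ho et al.'s multidimensional construction) precomputes cumulative sums so that any range-sum query is answered with two lookups; the same principle extends to d dimensions using inclusion-exclusion over 2^d corner cells per query. The measure values are hypothetical.

```python
def prefix_sums(values):
    """P[i] = sum of values[0..i-1]; a leading zero simplifies queries."""
    sums = [0]
    for v in values:
        sums.append(sums[-1] + v)
    return sums

def range_sum(sums, lo, hi):
    """Sum of values[lo..hi] (inclusive) in constant time."""
    return sums[hi + 1] - sums[lo]

measure = [3, 5, 7, 5, 8, 12, 9, 1]      # one measure along one dimension
P = prefix_sums(measure)
print(range_sum(P, 2, 5))                 # 7 + 5 + 8 + 12 = 32
```

The downside noted above is visible here as well: updating a single cell forces recomputation of every prefix sum that follows it.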
MAIN THRUST What is the Wavelet Transformation? We will start our discussion by attempting to provide a crude definition of the wavelet transformation, in particular the Discrete Wavelet Transformation (DWT). As the name suggests, it is a transformation of some signal, not too different from other well-known transformations such as Fourier, Laplace, and so on. In the context of database applications, the signal is, in general, a multivariate discrete signal that represents a dataset. As a transformation, the DWT is essentially another way to view a signal. The expectation, of course, is that such a view will be more useful and provide more information to the applications at hand. Like the most common transformations, the DWT is lossless (an orthonormal transformation, in signal processing terms). This implies that its effects can be reversed and thus the original signal can be reconstructed in its entirety; a highly desirable property. DWT achieves (lossless) compression by separating the “smooth” part of a signal from the “rough” and iterating on the “smooth” part to further analyze the signal. This works well provided that the signal is relatively smooth, which is the case with real-life datasets and especially with query signals, as we will see. We can now give the crude definition we promised at the beginning. The Discrete Wavelet Transformation is a lossless transformation that provides a multi-resolution view of the “smooth” and “rough” parts of a signal.
An Example with Haar Wavelets Haar wavelets are the simplest and were the first to be discovered. The “smooth” version of the signal is produced by pairwise averaging, whereas the “rough” version is produced by pairwise differencing. This is why the Haar wavelet coefficients are called averages and differences (or details). Using signal processing terminology, the “smooth” version of the signal is produced by a low-pass filter, which filters out the rough elements. On the other hand, the “rough” version of the signal is produced by a high-pass filter, which filters out the smooth elements. Together, these filters are called a filterbank, and they produce the smooth and rough views of the signal. DWT is performed by chaining a filterbank on the output of the low-pass filter; doing so iteratively leads to the multi-resolution view of the signal. A digital filter is simply a set of coefficients that multiply the input to produce the output. As an example, the low-pass Haar filter consists of the coefficients {1/√2, 1/√2}, which multiply the input {a, b} to produce the output (a + b)/√2. Similarly, the high-pass Haar filter consists of the coefficients {1/√2, −1/√2}, which multiply the input {a, b} to produce the output (a − b)/√2. We say that the length of the Haar filter is 2, as both the low-pass and high-pass filters have 2 coefficients and thus require an input of 2 values to produce an output. Other wavelets, generated by longer filters, exhibit better performance in terms of separating the smooth and rough elements. In the example that follows, we will use the filters {1/2, 1/2} and {1/2, −1/2} to avoid the ugly square roots, for illustration purposes. Let us consider a signal of 8 samples (a vector of 8 values), {3,5,7,5,8,12,9,1}, and let us apply the DWT. We start by taking pairwise averages: {4,6,10,5}. We also get the following pairwise differences: {-1,1,-2,4}. For any two consecutive, non-overlapping data values a and b, we get their average (a + b)/2 and half their difference (a − b)/2. The result is two vectors, each of half the size, containing a smoother version of the signal (the averages) and a rougher version (the differences); these coefficients form the first level of decomposition. We continue by constructing the averages and differences from the smooth version of the signal, {4,6,10,5}. The new averages are {5,7.5} and the new differences are {-1,2.5}, forming the second level of decomposition. Continuing the process, we get the average {6.25} and the difference {-1.25} of the new smooth signal; these form the third and last level of decomposition. Note that 6.25 is the average of the entire signal, as it is produced by iteratively averaging pairwise averages. Similarly, -1.25 represents half the difference between the average of the first half of the signal and the average of the second half. The final average {6.25} and the differences produced at all levels of decomposition, {-1.25}, {-1,2.5}, {-1,1,-2,4}, can perfectly reconstruct the original signal. Together they form the Haar DWT of the original signal: {6.25,-1.25,-1,2.5,-1,1,-2,4}. The key is that, at each level of decomposition, the averages and differences can be used to reconstruct the averages of the previous level. Lossy compression in the DWT is achieved by thresholding: only the coefficients whose energy is above the threshold are preserved, whereas the rest are implicitly set to 0. If we decide to keep half as many coefficients, the resulting wavelet vector contains the 4 largest coefficients (normalized by a factor of √2 per level of decomposition): {6.25,-1.25,0,2.5,0,0,0,4}. The compressed decomposed signal then yields an approximation of the original: {5,5,5,5,10,10,9,1}.
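As a concrete illustration, the following is a minimal sketch (in Python, with invented function names) of the averaging/differencing decomposition just described; it uses the illustrative {1/2, 1/2} and {1/2, −1/2} filters rather than the orthonormal 1/√2 versions, and assumes signal lengths that are powers of 2.

```python
# A minimal sketch of the averaging/differencing Haar decomposition described
# above, using the illustrative {1/2, 1/2} and {1/2, -1/2} filters.

def haar_decompose(signal):
    """Return the final average followed by the details of every level,
    ordered from the coarsest to the finest level."""
    details, smooth = [], list(signal)
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        details.insert(0, [(a - b) / 2 for a, b in pairs])  # coarser levels first
        smooth = [(a + b) / 2 for a, b in pairs]
    return smooth + [c for level in details for c in level]

def haar_reconstruct(coeffs):
    """Invert the decomposition: at each level the averages and differences
    rebuild the averages of the previous (finer) level."""
    smooth, pos = coeffs[:1], 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(smooth)]
        pos += len(smooth)
        smooth = [v for a, d in zip(smooth, diffs) for v in (a + d, a - d)]
    return smooth

print(haar_decompose([3, 5, 7, 5, 8, 12, 9, 1]))
# -> [6.25, -1.25, -1.0, 2.5, -1.0, 1.0, -2.0, 4.0]
print(haar_reconstruct([6.25, -1.25, 0, 2.5, 0, 0, 0, 4]))
# -> [5.0, 5.0, 5.0, 5.0, 10.0, 10.0, 9.0, 1.0]
```

Running it on the example signal reproduces the transform {6.25,-1.25,-1,2.5,-1,1,-2,4}, and reconstructing from the thresholded vector yields the approximation {5,5,5,5,10,10,9,1}.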
Standard and Non-Standard Decomposition There are two ways to perform a multidimensional wavelet decomposition: the standard and the non-standard. In short, the standard form is performed by applying a series of one-dimensional decompositions along each dimension, whereas the non-standard form does not decompose each dimension separately. In the non-standard form, after each level of decomposition only the averages corresponding to the same level are further decomposed. The non-standard form of decomposition involves fewer operations and thus is faster to compute, but it does not compress as efficiently as the standard form, especially in the case of range-aggregate query processing.
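As a sketch of the standard form (with numpy as a convenience, and assuming power-of-2 side lengths), a full 1-D decomposition is applied along every row and then along every column; the helper function and the toy 4x4 array are illustrative only.

```python
import numpy as np

def haar_1d(x):
    # full 1-D averaging/differencing decomposition (see the earlier sketch)
    details, smooth = [], list(x)
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        smooth = [(a + b) / 2 for a, b in pairs]
    return smooth + details

def standard_decompose_2d(data):
    """Standard form: decompose every row fully, then every column fully.
    (The non-standard form would instead alternate dimensions level by level.)"""
    data = np.asarray(data, dtype=float)
    rows = np.apply_along_axis(haar_1d, 1, data)
    return np.apply_along_axis(haar_1d, 0, rows)

print(standard_decompose_2d(np.arange(16).reshape(4, 4)))  # a toy 4x4 "data cube"
```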
Processing Polynomial Range-Aggregate Queries The fact that DWT is an orthonormal transformation has the convenient consequence that it preserves energy. As a result, any query that can be seen as a dot-product query, involving a query vector and a data vector, can be cast as a dot-product query in the wavelet domain. The answer to the transformed query will be the same because of the orthonormality, but it may require fewer retrievals from the database. Polynomial range-aggregate queries are a very interesting class of dot-product queries that covers queries usually found in a diverse range of applications, such as scientific, decision support, and statistical applications, to name a few. We can understand the type of queries by looking at the name of the class. The term Range-Aggregate implies that the queries span very large subsets of the multidimensional dataset, perhaps the entire hypercube, and involve the calculation of some aggregation function. The term Polynomial specifies these aggregation functions as belonging to the space of polynomial functions. Such a space is large enough to contain many complex functions; for example, in the context of statistical queries, the class includes second-order functions like COVARIANCE, the third-order SKEW, the fourth-order KURTOSIS, and so on, besides the typical AVERAGE and VARIANCE functions. For example, assume that the data vector contains the frequency distribution of the attribute Salary of a dataset, categorized in 8 ranges of $10K each: a 0 in the vector implies that no tuple exists for the particular salary range, whereas non-zero values count the number of tuples with that salary. A COUNT query, which returns the number of salaries in the database, is formed as a query vector of all 1s. The dot product of the query with the data vector returns the count. A SUM query is formed as a
query vector containing the salary ranges: {10,20,30,40,50,60,70,80}. Again, the dot product of the query vector with the data vector is the sum of all salaries. An AVERAGE query is calculated using the two previous queries; in general, more complex queries can be calculated with simple dot-product queries of the form {10^d, 20^d, 30^d, 40^d, 50^d, 60^d, 70^d, 80^d} for d = 0, 1, 2, ..., as demonstrated. A more detailed description of polynomial range-aggregate queries can be found in Schmidt and Shahabi (2002). The signal that corresponds to these queries is very smooth, as shown, and thus is highly compressed by the wavelet transformation, provided that a certain condition regarding the wavelet filter length and the highest order (d) of polynomial necessary is met. In the wavelet domain the transformed query is extremely sparse and becomes independent of the range size, meaning that very large queries, which usually hinder OLAP systems, cost as much as smaller range queries. The cost of a polynomial range-aggregate query, when a wavelet filter of length l is used to transform a d-dimensional dataset of domain size n^d, is O(l^d (log n)^d). The wavelet transformation preserves the energy of the signal by re-distributing it across wavelet coefficients in such a way that most of the energy is contained in a small subset of the coefficients. Therefore, some coefficients become more significant than others (Garofalakis & Gibbons, 2002). By ordering, and thus retrieving, coefficients according to their significance, we achieve optimal progressive query evaluation. On top of that, we can provide very fast, fairly accurate progressive answers long before the retrieval process is complete. Experiments have verified this: even when only a small percentage of the required coefficients (which is already significantly less than the entire range size) is retrieved, the answers are highly accurate.
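The salary example translates directly into dot products. The sketch below uses made-up frequency counts, builds the query vectors {10^d, ..., 80^d} for d = 0, 1, 2, and checks, with a random orthonormal matrix standing in for the DWT, that an orthonormal transform leaves the dot product, and hence the query answer, unchanged.

```python
import numpy as np

data = np.array([3, 0, 5, 2, 0, 1, 4, 1], dtype=float)        # made-up tuple counts per range
ranges = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)

count_q, sum_q, sum_sq_q = ranges ** 0, ranges ** 1, ranges ** 2   # d = 0, 1, 2

count = data @ count_q                                # COUNT
total = data @ sum_q                                  # SUM
average = total / count                               # AVERAGE from two dot products
variance = data @ sum_sq_q / count - average ** 2     # second-order statistic

# Any orthonormal matrix W (here a random one, standing in for the DWT)
# preserves dot products: (W q) . (W x) = q . x.
W, _ = np.linalg.qr(np.random.randn(8, 8))
assert np.isclose((W @ sum_q) @ (W @ data), total)

print(count, total, average, variance)
```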
FUTURE TRENDS Wavelets are now beginning to emerge as a tool for processing queries over data streams, where resources are limited. Gilbert, Kotidis, Muthukrishnan, and Strauss (2001) described techniques for estimating the wavelet transformation under different data stream models, and thus introduced well-established tools into data stream query processing. Following this work, a number of publications on data stream mining based on wavelet coefficient prediction are beginning to appear (Papadimitriou, Brockwell & Faloutsos, 2003; Bulut & Singh, 2003). Wavelets and their compression properties are also being utilized as a tool for handling sensor data (Ganesan, Greenstein, Perelyubskiy, Estrin, &
Heidemann, 2003). We expect to see more publications in the future utilizing and introducing the wavelet transformation in applications related to data streams and remote sensor data, where wavelets have not been extensively explored.
CONCLUSION We have discussed the use of the wavelet transformation in applications dealing with massive multidimensional datasets that require fast approximate and, eventually, exact query processing. We also investigated some of the properties of this transformation that justify the use of wavelets and make them attractive for the scenarios mentioned. Wavelets have reached a level of maturity and acceptance in the database community and are now considered an irreplaceable tool for query processing in a broad range of applications. However, as a relatively new tool, wavelets have yet to reveal their true potential. Taking ideas from signal processing applications, where they originated, can help in this direction. As technology advances and more demanding data processing applications appear, wavelets are bound to be considered and investigated even further.
REFERENCES Bulut, A., & Singh, A.K. (2003). Swat: Hierarchical stream summarization in large networks. In Proceedings of the 19th International Conference on Data Engineering (pp. 303-314), March 5-8, 2003, Bangalore, India. Chakrabarti, K., Garofalakis, M.N., Rastogi, R., & Shim, K. (2000). Approximate query processing using wavelets. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases (pp. 111-122). Ganesan, D., Greenstein, B., Perelyubskiy, D., Estrin, D., & Heidemann, J. (2003). An evaluation of multi-resolution storage for sensor networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems. ACM Press. Garofalakis, M., & Gibbons, P.B. (2002). Wavelet synopses with error guarantees. In Sigmod 2002. ACM Press. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., & Strauss, M.J. (2001). Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB
2001, Proceedings of 27th International Conference on Very Large Data Bases. Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proceedings of the 12th International Conference on Data Engineering (pp. 152-159). Ho, C., Agrawal, R., Megiddo, N., & Srikant, R. (1997). Range queries in OLAP data cubes. In Sigmod 1997, Proceedings of the ACM Sigmod International Conference on Management of Data (pp. 73-88). ACM Press. Lemire, D. (2002, October). Wavelet-based relative prefix sum methods for range sum queries in data cubes. In Proceedings of Center of Advanced Studies Conference (CASCON 2002). Muthukrishnan, S. (2004). Data streams: Applications and algorithms. Papadimitriou, S., Brockwell, A., & Faloutsos, C. (2003). AWSOM: Adaptive, hands-off stream mining. In VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany. San Francisco: Morgan Kaufmann. Schmidt, R., & Shahabi, C. (2002). Propolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. In Conference on Extending Database Technology (EDBT’02), Lecture Notes in Computer Science. Berlin: Springer. Vitter, J.S., & Wang, M. (1999). Approximate computation of multidimensional aggregates of sparse data using wavelets. In Sigmod 1999, Proceedings of the ACM Sigmod International Conference on the Management of Data (pp. 193-204). ACM Press. Vitter, J.S., Wang, M., & Iyer, B.R. (1998). Data cube approximation and histograms via wavelets. In CIKM 1998, Proceedings of the 7th International Conference on Information and Knowledge Management (pp. 96-104). ACM. Wu, Y.L., Agrawal, D., & Abbadi, A.E. (2000). Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In CIKM 2000, Proceedings of the 9th International Conference on Information and Knowledge Management (pp. 414-421). ACM. Zhao, Y., Deshpande, P.M., & Naughton, J.F. (1997). An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD’97 (pp. 159-170).
KEY TERMS Data Streams: Data Streams, according to S. Muthukrishnan (2004), “represent input data that comes at high rate. High rate means it stresses communication and computing infrastructure, so it may be hard to transmit, compute and store the entire input.” Discrete Wavelet Transformation (DWT): An orthonormal decomposition that provides a multi-resolution (multi-scale) view of the “smooth” and “rough” elements of a signal. Dot Product Queries: A class of queries, where the query answer can be seen as the inner product between a vector dependent only on the query and a vector dependent on the data.
Online Analytical Processing (OLAP): Applications that provide Fast Analysis of Shared Multidimensional Information (FASMI), according to The OLAP Report. Online Science Applications (OSA): Real-time applications oriented towards exploratory scientific analysis of large multivariate datasets. Polynomial Range-Aggregate Queries: A subclass of dot-product queries, where the query is a polynomial aggregate query defined over a contiguous range of data values. Sensor Networks: A large network of devices measuring (remote) sensor data, with frequently changing topology where broadcast is usually the means of communication. The sensors typically have limited processing ability and restricted power.
Web Mining in Thematic Search Engines Massimiliano Caramia Istituto per le Applicazioni del Calcolo IAC-CNR, Italy Giovanni Felici Istituto di Analisi dei Sistemi ed Informatica (IASI-CNR), Italy
INTRODUCTION
Recent improvements in search engine technologies have made available to Internet users an enormous amount of knowledge that can be accessed in many different ways. The most popular search engines now provide search facilities for databases containing billions of Web pages, where queries are executed instantly. The focus is switching from quantity (maintaining and indexing large databases of Web pages and quickly selecting pages matching some criterion) to quality (identifying pages with a high quality for the user). Such a trend is motivated by the natural evolution of Internet users, who are now more selective in their choice of the search tool and may be willing to pay the price of providing extra feedback to the system and of waiting longer for their queries to be better matched. In this framework, several researchers have considered the use of data-mining and optimization techniques, which are often referred to as Web mining (for a recent bibliography on this topic, see, e.g., Getoor, Senator, Domingos, & Faloutsos, 2003, and Zaïane, Srivastava, Spiliopoulou, & Masand, 2002). Here, we describe a method for improving standard search results in a thematic search engine, where the documents and the pages made available are restricted to a finite number of topics, and the users are considered to belong to a finite number of user profiles. The method uses clustering techniques to identify, in the set of pages resulting from a simple query, subsets that are homogeneous with respect to a vectorization based on context or profile; then we construct a number of small and potentially good subsets of pages, extracting from each cluster the pages with higher scores. Operating on these subsets with a genetic algorithm, we identify the subset with a good overall score and a high internal dissimilarity. This provides the user with a few non-duplicated pages that represent more correctly the structure of the initial set of pages. Because pages are seen by the algorithms as vectors of fixed dimension, the role of the context- or profile-based vectorization is central and specific to the thematic approach of this method.
BACKGROUND
Let P be a set of Web pages, with p∈P indicating a page in that set. Now assume that P is the result of a standard query to a database of pages, and thus represents a set of pages that satisfy some conditions expressed by the user. Each page p∈P is associated with a score based on the query that generated P, which determines the order in which the pages are presented to the person submitting the query. The role of this ordering is crucial for the quality of the search: In fact, if the dimension of P is large, the probability that the user considers a page p strongly decreases as the position of p in the order increases. This may lead to two major drawbacks: The pages in the first positions may be very similar (or even equal) to each other; pages that do not have a very high score but are representative of some aspect of set P may appear in a very low position in the ordering, with a negligible chance of being seen by the user. Our method tries to overcome both drawbacks, focusing on the selection from the initial set P of a small set of pages with a high score and sufficiently different from each other. A condition needed to apply our approach is the availability of additional information from the user, who indicates a search context (a general topic to which the search refers, not necessarily linked with the search keywords that generated the set P), and a user profile (a subjective identification of the user, which may either be provided directly by choosing amongst a set of predefined profiles or extracted from the pages that have been visited more recently by that user).
MAIN THRUST The basic idea of the method is to use the information conveyed by the search context or the user profile to analyze the structure of P and determine in it an optimal small subset that better represents all the information available. This is done in three steps. First, the search context and the user profile are used to extract a finite set
of significant words or page characteristics that is then used to create, from all pages in P, a vector of characteristics (page vectorization). Such vectorization represents a particular way of looking at the page, specific to each context/profile, and constitutes the ground on which the following steps are based. Second, the vectorized pages are analyzed by a clustering algorithm that partitions them into subsets of similar pages. This induces a two-dimensional ordering on the pages, as each page p can now be ordered according to the original score within its cluster. At this point the objective is to provide the user with a reduced list that takes into account the structure identified by the clusters and the original score function. This is done in the third step, where a genetic algorithm works on the pages that have a higher score in each cluster to produce a subset of those pages that are sufficiently heterogeneous and of good values for the original score. In the following sections, we describe the three steps in detail.
Page Vectorization The first step of the method is the representation of each acquired page by a vector of finite dimension m, where each component represents a measure of some characteristic of the page (page vectorization). Clearly, such a representation is crucial for the success of the method; all the information of a page that is not maintained in this step will be lost for further treatment. For this reason we must stress the thematic nature of the vectorization process, where only the information that appears to be relevant for a context or a profile is effectively kept for future use. In the plainest setting, each component of the vector is the number of occurrences of a particular word; you may also consider other measurable characteristics that are not specifically linked with the words that are contained in the page, such as the presence of pictures, tables, banners, and so on. As mentioned previously, the vectorization is based on one context, or one profile, chosen by the user. You may then assume that for each of the contexts/profiles that have been implemented in the search engine, a list of words that are relevant to that context/profile is available, and a related vectorization of the page is stored. Many refinements to this simple approach may and should be considered. The dimension m of the vector (i.e., the number of relevant words associated with a context) is not theoretically limited to be particularly small, but you must keep in mind that in order to apply this method over a significant number of pages, it is reasonable to consider m≤100. We propose two methods to determine such a list of words (a small sketch of the vectorization step itself follows the list):
•	The words are determined in a setup phase, when the search engine managers decide which contexts/profiles are supported and what words are representative of that context/profile. This operation may be accomplished together with the users of a thematic engine devoted to a specific environment (such as an association of companies, a large corporation, or a community of users).
•	The words are identified starting from an initial set of pages that are used as a training sample for a context/profile. When user profiles are used, you may consider as a training sample for a profile the pages that have been visited more recently by the user(s) that belong to that profile, so that the words associated with the profile evolve with the behavior of the users in a smooth way.
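As an illustration of the vectorization described above, the sketch below counts occurrences of context words in a page's text; the context word list and the sample page are invented for illustration, and a real system would add further components for non-textual characteristics (pictures, tables, banners).

```python
import re

# Hypothetical context word list for some thematic context (m = 5).
CONTEXT_WORDS = ["wavelet", "query", "cube", "aggregate", "stream"]

def vectorize(page_text, context_words=CONTEXT_WORDS):
    """Map a page to a vector of context-word occurrence counts."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return [tokens.count(word) for word in context_words]

page = "Wavelet synopses support approximate query answering over the data cube."
print(vectorize(page))   # [1, 1, 1, 0, 0]
```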
Page Clustering Extensive research has been done on how to improve retrieval results by employing clustering techniques. In several studies the strategy was to build a clustering of the entire document collection and then match the query to the cluster centroids (see, e.g., Willet, 1988). More recently, clustering has been used for helping the user in browsing a collection of documents and in organizing the results returned by a search engine (Leuski, 2001; Zamir, Etzioni, Madani, & Karp, 1997) or by a metasearch engine (Zamir & Etzioni, 1999) in response to a user query. In Koller and Sahami (1997) document clustering has also been used to automatically generate hierarchical clusters of documents. Document clustering in information retrieval usually deals with agglomerative hierarchical clustering algorithms (see, e.g., Jain, Murty & Flynn, 1999) or the k-means algorithm (see Dubes & Jain, 1988). Although agglomerative hierarchical clustering algorithms are very slow when applied to large document databases (Zamir & Etzioni, 1998) (single link and group average methods take O(|P|^2) time, complete link methods take O(|P|^3) time), k-means is much faster (its execution time is O(k⋅|P|)). Measuring clustering effectiveness and comparing performance of different algorithms is a complex task, and there are no completely satisfactory methods for comparing the quality of the results of a clustering algorithm. A largely used measure of clustering quality that behaves satisfactorily is the Calinski-Harabasz (C-H) pseudo-F statistic; the higher the index value, the better the cluster quality. For a given clustering, the mathematical expression of the pseudo-F statistic is
C-H = [R^2 / (k − 1)] / [(1 − R^2) / (n − k)],
where R^2 = (SST − SSE) / SST, k is the number of clusters, n is the number of objects being clustered,
SST is the sum of the squared distances of each object
from the overall centroid, and SSE is the sum of the squared distances of each object from the centroid of its own group. From experiments conducted on real and simulated data using the pseudo-F as a cluster quality measure, we confirm that k-means clustering performs well in limited computing times, a must for this type of application, where both the number of pages and the dimension of the vectors may be large.
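Putting the clustering step together, the sketch below runs k-means (via scikit-learn, an assumed dependency) on randomly generated stand-in page vectors and scores the partition with the pseudo-F statistic defined above.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_f(X, labels):
    """Calinski-Harabasz pseudo-F for a clustering given by integer labels."""
    n, k = len(X), len(set(labels))
    overall = X.mean(axis=0)
    sst = ((X - overall) ** 2).sum()                       # total squared distance
    sse = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in set(labels))                        # within-cluster part
    r2 = (sst - sse) / sst
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

X = np.random.rand(200, 20)                 # 200 stand-in page vectors, m = 20
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(pseudo_f(X, labels))
```

For reference, scikit-learn's metrics module also provides a calinski_harabasz_score function that computes the same quantity directly.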
The Genetic Algorithm
Genetic algorithms have been implemented efficiently in information retrieval by several researchers. Chen (1995) used genetic algorithms to optimize keywords that were used to suggest relevant documents. Amongst others, Kraft, Petry, Buckles, and Sadavisan (1997) and Sanchez and Pierre (1994) presented several approaches to enhance the query description based on genetic algorithms. In Boughanem, Chrisment, and Tamine (1999), a genetic algorithm was deployed to find an optimal set of documents that best match the user’s need. In Horng and Yeh (2000), a method for extracting keywords from documents and assigning them weights was proposed. Our aim is to select a small subset P’ of the original set of pages P for which the sum of the scores is large, but also where the similarity amongst the selected pages is restrained. We select such a subset by using a genetic algorithm (GA). Several reasons motivate this choice. First, the use of metaheuristic techniques is well established in optimization problems where the objective function and the constraints do not have a simple mathematical formulation. Second, we have to determine a good solution in a small computing time, where the dimension of the problem may be significantly large. Third, the structure of our problem is straightforward, representable by the data structure commonly used by a GA. GAs (see Goldberg, 1999) are local search algorithms that start from an initial collection of strings (a population) representing possible solutions to the problem. Each string is called a chromosome and has an associated value, the fitness function (ff), that contributes to the generation of new populations by means of genetic operators. Every position in a chromosome is called a gene, and its value is called the allelic value. This value may vary on an assigned allelic alphabet; often, the allelic alphabet is {0,1}. At each generation, the algorithm uses the fitness function values to evaluate the survival capacity of each string and applies simple operators to create a new set of artificial creatures (a new population) that tries to improve on the current ff values by using pieces of the old ones. Evolution is interrupted when no significant improvement of the fitness function can be obtained. The genetic operators work iteratively and are:
•	Reproduction, where individual strings are copied according to their fitness function values (the higher the value of a string, the higher the probability of contributing to one or more offspring in the next generation)
•	Simple crossover, in which the members reproduced in the new mating pool are mated randomly and, afterward, each pair of strings undergoes a cross change
•	Mutation, which is an occasional random alteration of the allelic value of a chromosome that occurs with small probability
Starting from the clusters obtained, we define the chromosomes of the initial population as subsets of pages of bounded cardinality (in the GA terminology, a page is a gene). The genetic algorithm works on the initial population, ending up with a representative subset of pages to present to the user. The idea is to start the genetic evolution with a population that is already smaller than the initial set of pages P. Each chromosome is created by picking a page from each cluster, starting with the ones having a higher score. Thus, the first chromosome created will contain the pages with the highest score in each cluster, the second chromosome will contain the second best, and so on. If the cardinality of a cluster is smaller than the number of chromosomes to be created, then that cluster will not be represented in each chromosome, while other clusters with higher cardinality may have more than one page representing them in some chromosome. We indicate with dc the number of pages included in each chromosome in the initial population and with nc the number of chromosomes. The population will thus contain np = dc ⋅ nc pages. The fitness function computed for each chromosome is expressed as a positive value that is higher for “better” chromosomes and is thus to be maximized. It is composed of three terms. The first is the sum of the scores of the pages in chromosome C, i.e., t1(C) = ∑pi∈C score(pi),
where score(pi) is the original score given to page pi, as previously described. This term considers the positive effect of having as many pages with as high a score as possible in a chromosome, but it also rewards chromosomes with many pages regardless of their scores. This drawback is balanced by the second term of the fitness function, which favors subsets whose size is close to an ideal dimension for the list of pages presented to the user. Let ID be such an ideal dimension; the ratio t2(C) = np / (abs(|C| − ID) + 1) constitutes the second term of the fitness function. It reaches its maximum np when the dimension of C is exactly equal to the ideal dimension ID and rapidly decreases when the number of pages contained in chromosome C is less than or greater than ID.
The chromosomes that are present in the initial population are characterized by the highest possible variability as far as the clusters to which the pages belong are concerned. The evolution of the population may alter this characteristic, creating chromosomes with high fitness where the pages belong to the same cluster and are very similar to each other. Moreover, the fact that pages belonging to different clusters are different in the vectorized space may not be guaranteed, as it depends both on the nature of the data and on the quality of the initial clustering process. For this reason, we introduce in the fitness function a third term, which measures directly the overall dissimilarity of the pages in the chromosome. Let D(pi, pj) be the Euclidean distance between the vectors representing pages pi and pj. Then t3(C) = ∑pi,pj∈C, pi≠pj D(pi, pj) is the sum of the distances between the pairs of pages in chromosome C and measures the total variability expressed by C. The final form of the fitness function for chromosome C is then ff(C) = α⋅t1(C) + β⋅t2(C) + γ⋅t3(C), where α, β, and γ are parameters that depend on the magnitude of the initial score and of the vectors that represent the pages. In particular, α, β, and γ are chosen so that the contributions given by t1(C), t2(C), and t3(C) are balanced. Additionally, they may be tuned to express the relevance attributed to the different aspects represented by the three terms. The goal of the GA is to find, by means of the genetic operators, a chromosome C* such that ff(C*) = maxC ff(C).
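The following is a minimal sketch of the fitness function ff(C) = α⋅t1(C) + β⋅t2(C) + γ⋅t3(C) defined above; the chromosome, scores, vectors, and weights are all stand-ins, and each unordered pair of pages is counted once in t3 (any constant factor can be absorbed into γ).

```python
import numpy as np
from itertools import combinations

def fitness(chromosome, scores, vectors, ID, np_pages,
            alpha=1.0, beta=1.0, gamma=0.1):
    """ff(C) = alpha*t1 + beta*t2 + gamma*t3 for a chromosome given as page indices."""
    t1 = sum(scores[p] for p in chromosome)                          # total score
    t2 = np_pages / (abs(len(chromosome) - ID) + 1)                  # size close to ID
    t3 = sum(np.linalg.norm(vectors[i] - vectors[j])                 # pairwise dissimilarity
             for i, j in combinations(chromosome, 2))
    return alpha * t1 + beta * t2 + gamma * t3

scores = np.random.rand(100)            # stand-in query scores for 100 pages
vectors = np.random.rand(100, 20)       # stand-in vectorized pages (m = 20)
print(fitness([3, 17, 42, 58, 90], scores, vectors, ID=6, np_pages=40))
```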
FUTURE TRENDS The application of sophisticated data analysis and datamining techniques to the search of information on the Web is a field that receives increasing interest from both research and industry. The strategic importance of such tools should not be underestimated, as the amount of information keeps increasing while the user time available for searching is not increasing. Such a trend motivates the research effort to produce tools that help in improving Web search results. One may question whether this method can be run online in a search engine as the standard execution of a user’s query. We believe that with a proper tuning of the parameters and a proper engineering of the algorithms, the overall search process can be dealt with satisfactorily. Future work will cover the extension of the page vectorization technique and the definition and test of automatic procedures for parameter tuning in the genetic algorithm.
CONCLUSION Experimental results conducted with the method described in this article have shown its effectiveness in the selection of small subsets of pages of good quality, where quality is not considered as a simple sum of the scores of each page but as a global characteristic of the subset. The current implementations of the GA and of the clustering algorithm converge fast to good solutions for data sets of realistic dimensions, and future work covering extensions of the page vectorization technique and the definition of automatic procedures for parameter tuning will surely lead to better results.
REFERENCES Boughanem, M., Chrisment, C., & Tamine, L. (1999). Genetic approach to query space exploration. Information Retrieval, 1, 175-192. Chen, H. (1995). Machine learning for information retrieval: Neural networks, symbolic learning, and genetic algorithms. Journal of the American Society for Information Science, 46(3), 194-216. Dubes, R. C., & Jain, A. K. (1988). Algorithms for clustering data. Prentice Hall. Getoor, L., Senator, T.E., Domingos, P., & Faloutsos, C. (Eds.). (2003). Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, USA. Goldberg, D. E. (1999). Genetic algorithms in search optimization & machine learning. Addison-Wesley. Horng, J. T., & Yeh, C. C. (2000). Applying genetic algorithms to query optimization in document retrieval. Information Processing and Management, 36, 737-759. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning (pp. 170-178), USA. Kraft, D. H., Petry, F. E., Buckles, B. P., & Sadavisan, T. (1997). Genetic algorithms for query optimization in information retrieval: Relevance feedback. In E. Sanchez, L. A. Zadeh, & T. Shibata (Eds.), Genetic algorithms and fuzzy logic systems: Soft computing perspectives (pp. 155-173). World Scientific.
Leuski, A. (2001). Evaluating document clustering for interactive information retrieval. Proceedings of the ACM International Conference on Information and Knowledge Management (pp. 33-44), USA. Sanchez, E., & Pierre, P. (1994). Fuzzy logic and genetic algorithms in information retrieval. Proceedings of the Third International Conference on Fuzzy Logic, Neural Net, and Soft Computing (pp. 29-35), Japan. Zaïane, O. R., Srivastava, J., Spiliopoulou, M., & Masand, B. M. (Eds.). (2002). International workshop in mining Web data for discovering usage patterns and profiles. Edmonton, Canada. Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval(pp. 46-54), Australia. Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to Web search results. Proceedings of the Eighth International Conference on World Wide Web (pp. 1361-1374), Canada.
KEY TERMS Clustering: Partitioning a data set into subsets (clusters) so the data in each subset share some common trait. Genetic Algorithm: A heuristic optimization algorithm based on the concept of biological evolution. Page Score: The numeric value that measures how well a single page matches a given query. A higher score would imply a better matching. Search Engine: Software that builds a database of Web pages, applies queries to it, and returns results. Thematic Search Engine: A search engine devoted to the construction and management of a database of Web pages that pertain to a limited subset of the knowledge or of the Web users. Vectorization: The representation of objects in a class by a finite set of measures defined on the objects. Web Page: The basic unit of information visualized on the Web.
Zamir, O., Etzioni, O., Madani, O., & Karp, R. M. (1997). Fast and intuitive clustering of Web documents. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 287-290), USA.
Web Mining Overview Bamshad Mobasher DePaul University, USA
INTRODUCTION In the span of a decade, the World Wide Web has been transformed from a tool for information sharing among researchers into an indispensable part of everyday activities. This transformation has been characterized by an explosion of heterogeneous data and information available electronically, as well as increasingly complex applications driving a variety of systems for content management, e-commerce, e-learning, collaboration, and other Web services. This tremendous growth, in turn, has necessitated the development of more intelligent tools for end users as well as information providers in order to more effectively extract relevant information or to discover actionable knowledge. From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining (i.e. the application of data mining techniques to extract knowledge from Web content, structure, and usage) is the collection of technologies to fulfill this potential. In this article, we will summarize briefly each of the three primary areas of Web mining—Web usage mining, Web content mining, and Web structure mining—and discuss some of the primary applications in each area.
BACKGROUND Knowledge discovery on and from the Web has been characterized by four different but related types of activities (Kosala & Blockeel, 2000):
1.	Resource Discovery: Locating unfamiliar documents and services on the Web.
2.	Information Extraction: Automatically extracting specific information from newly discovered Web resources.
3.	Generalization: Uncovering general patterns at individual Web sites or across multiple sites.
4.	Personalization: Presentation of the information requested by an end user of the Web.
The goal of Web mining is to discover global as well as local structures, models, patterns, or relations within and between Web pages. The research and practice in
Web mining has evolved over the years from a processcentric view, which defined Web mining as a sequence of tasks (Etzioni, 1996), to a data-centric view, which defined Web mining in terms of the types of Web data that were being used in the mining process (Cooley et al., 1997).
MAIN THRUST The evolution of Web mining as a discipline has been characterized by a number of efforts to define and expand its underlying components and processes (Cooley et al., 1997; Kosala & Blockeel, 2000; Madria et al., 1999; Srivastava et al., 2002). These efforts have led to three commonly distinguished areas of Web mining: Web usage mining, Web content mining, and Web structure mining.
Web Content Mining Web content mining is the process of extracting useful information from the content of Web documents. Content data correspond to the collection of facts a Web page was designed to convey to the users. Web content mining can take advantage of the semi-structured nature of Web page text. The HTML tags or XML markup within Web pages bear information that concerns not only layout but also the logical structure and semantic content of documents. Text mining and its application to Web content have been widely researched (Berry, 2003; Chakrabarti, 2000). Some of the research issues addressed in text mining are topic discovery, extracting association patterns, clustering of Web documents, and classification of Web pages. Research activities in this field generally involve using techniques from other disciplines, such as information retrieval (IR), information extraction (IE), and natural language processing (NLP). Web content mining can be used to detect co-occurrences of terms in texts (Chang et al., 2000). For example, co-occurrences of terms in newswire articles may show that gold frequently is mentioned together with copper when articles concern Canada, but together with silver when articles concern the US. Trends over time also may be discovered, indicating a surge or decline in interest in certain topics, such as programming languages like Java. Another application area is event detection, the identifi-
cation of stories in continuous news streams that correspond to new or previously unidentified events. A growing application of Web content mining is the automatic extraction of semantic relations and structures from the Web. This application is closely related to information extraction and ontology learning. Efforts in this area have included the use of hierarchical clustering algorithms on terms in order to create concept hierarchies (Clerkin et al., 2001), the use of formal concept analysis and association rule mining to learn generalized conceptual relations (Maedche & Staab, 2000; Stumme et al., 2000), and the automatic extraction of structured data records from semistructured HTML pages (Liu, Chin & Ng, 2003). Often, the primary goal of such algorithms is to create a set of formally defined domain ontologies that represent precisely the Web site content and to allow for further reasoning. Common representation approaches are vector-space model (Loh et al., 2000), descriptive logics (i.e., DAML+OIL) (Horrocks, 2002), first order logic (Craven et al., 2000), relational models (Dai & Mobasher, 2002), and probabilistic relational models (Getoor et al., 2001).
Web Structure Mining The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges connecting two related pages. Web structure mining can be regarded as the process of discovering structure information from the Web. This type of mining can be divided further into two kinds, based on the kind of structural data used (Srivastava et al., 2002); namely, hyperlinks or document structure. There has been a significant body of work on hyperlink analysis, of which Desikan et al. (2002) provide an up-to-date survey. The content within a Web page also can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents (Moh et al., 2000) or on using the document structure to extract data records or semantic relations and concepts (Liu, Chin & Ng, 2003; Liu, Grossman & Zhai, 2003). By far, the most prominent and widely accepted application of Web structure mining has been in Web information retrieval. For example, the Hyperlink Induced Topic Search (HITS) algorithm (Kleinberg, 1998) analyzes the hyperlink topology of the Web in order to discover authoritative information sources for a broad search topic. This information is found in authority pages, which are defined in relation to hubs as their counterparts: Hubs are Web pages that link to many related authorities; authorities are those pages that are linked by many good hubs. The hub and authority scores computed for each Web page indicate the extent to which the Web page serves as
a hub pointing to good authority pages or as an authority on a topic pointed to by good hubs. The search engine Google also owes its success to the PageRank algorithm, which is predicated on the assumption that the relevance of a page increases with the number of hyperlinks pointing to it from other pages and, in particular, of other relevant pages (Brin & Page, 1998). The key idea is that a page has a high rank, if it is pointed to by many highly ranked pages. So, the rank of a page depends upon the ranks of the pages pointing to it. This process is performed iteratively until the rank of all the pages is determined. The hyperlink structure of the Web also has been used to automatically identify Web communities (Flake et al., 2000; Gibson et al., 1998). A Web community can be described as a collection of Web pages, such that each member node has more hyperlinks (in either direction) within the community than outside of the community. An excellent overview of techniques, issues, and applications related to Web mining, in general, and to Web structure mining, in particular, is provided in Chakrabarti (2003).
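A minimal sketch of the iterative rank computation described above, on a made-up 4-page link graph with the commonly used damping factor of 0.85; dangling pages are ignored for simplicity, so this is an illustration of the idea rather than a full PageRank implementation.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to (made up)
n, d = len(links), 0.85                          # number of pages, damping factor

rank = np.full(n, 1.0 / n)
for _ in range(100):                             # iterate until (practically) converged
    new = np.full(n, (1 - d) / n)
    for page, outs in links.items():
        for target in outs:                      # a page distributes its rank to its targets
            new[target] += d * rank[page] / len(outs)
    rank = new

print(rank)   # pages pointed to by highly ranked pages end up with high rank
```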
Web Usage Mining Web usage mining (Cooley et al., 1999; Srivastava et al., 2000) refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests. The primary data sources used in Web usage mining are log files automatically generated by Web and application servers. Additional data sources that also are essential for both data preparation and pattern discovery include the site files and meta-data, operational databases, application templates, and domain knowledge. The overall Web usage mining process can be divided into three interdependent tasks: data preprocessing, pattern discovery, and pattern analysis or application. In the preprocessing stage, the clickstream data is cleaned and partitioned into a set of user transactions representing the activities of each user during different visits to the site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to obtain possibly hidden patterns reflecting the typical behavior of users, as well as summary statistics on Web resources, sessions, and users. In the final stage of
the process, the discovered patterns and statistics are further processed, filtered, and used as input to applications such as recommendation engines, visualization tools, and Web analytics and report generation tools. For a full discussion of Web usage mining and its applications, see the article “Web Usage Mining” in the current volume (Mobasher, 2005).
FUTURE TRENDS
An important emerging area that holds particular promise is Semantic Web Mining (Berendt et al., 2002). Semantic Web mining aims at combining the two research areas: semantic Web and Web mining. The primary goal is to improve the results of Web mining by exploiting the new semantic structures on the Web. Furthermore, Web mining techniques can help to automatically build essential components of the Semantic Web by extracting useful patterns, structures, and semantic relations from existing Web resources (Berendt et al., 2004). Other areas in which Web mining research and practice is likely to make substantial gains are Web information extraction, question-answering systems, and personalized search. Progress in the applications of natural language processing as well as increasing sophistication of machine learning and data mining techniques applied to Web content are likely to lead to the development of more effective tools for information foraging on the Web. Some recent advances in these areas have been highlighted in recent research activities (Mobasher et al. 2004; Muslea et al. 2004).
CONCLUSION
Web mining is the application of data mining techniques to extract knowledge from the content, structure, and usage of Web resources. With the continued growth of the Web as an information sources and as a medium for providing Web services, Web mining will continue to play an ever expanding and important role. The development and application of Web mining techniques in the context of Web content, Web usage, and Web structure data already have resulted in dramatic improvements in a variety of Web applications, from search engines, Web agents, and content managements systems to Web analytics and personalization services. A focus on techniques and architectures for more effective integration and mining of content, usage, and structure data from different sources is likely to lead to the next generation of more useful and more intelligent applications.
REFERENCES
Berendt, B., Hotho, A., Mladenic, D., van Someren, M., & Spiliopoulou, M. (2004). Web mining: From Web to semantic Web. Lecture Notes in Computer Science, Vol. 3209. Heidelberg, Germany: Springer-Verlag.
Berendt, B., Hotho, A., & Stumme, G. (2002). Towards semantic Web mining. Proceedings of the First International Semantic Web Conference (ISWC02), Sardinia, Italy.
Berry, M. (2003). Survey of text mining: Clustering, classification, and retrieval. Heidelberg, Germany: Springer-Verlag.
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the World Wide Web. Proceedings of the 9th IEEE International Conference on Tools With Artificial Intelligence (ICTAI ’97), Newport Beach, California.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hyper-textual Web search engine. Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia. Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2), 1-11. Chakrabarti, S. (2003). Mining the Web: Discovering knowledge from hypertext data. San Francisco, CA: Morgan Kaufmann. Chang, G., Healey, M.J., McHugh, J.A.M., & Wang, J.T.L. (2001). Mining the World Wide Web: An information search approach. Boston: Kluwer Academic Publishers. Clerkin, P., Cunningham, P., & Hayes, C. (2001). Ontology discovery for the semantic Web using hierarchical clustering. Proceedings of the Semantic Web Mining Workshop at ECML/PKDD-2001, Freiburg, Germany.
Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5-32. Craven, M. et al. (2000). Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1-2), 69-113. Dai, H., & Mobasher, B. (2002). Using ontologies to discover domain-level Web usage profiles. Proceedings of the 2nd Semantic Web Mining Workshop at ECML/ PKDD 2002, Helsinki, Finland. Desikan, P., Srivastava, J., Kumar, V., & Tan, P.-N. (2002). Hyperlink analysis: Techniques and applications [tech-
nical report]. Minneapolis, MN: Army High Performance Computing Center. Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine. Communications of the ACM, 39(11), 65-68.
Mobasher, B., Liu, B., Masand, B., & Nasraoui, O. (2004). Web mining and Web usage analysis. Proceedings of the 6th WebKDD workshop at the 2004 ACM SIGKKDD Conference, Seattle, Washington.
Flake, G.W., Lawrence, S., & Giles, C.L. (2000). Efficient identification of Web communities. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000), Boston.
Moh, C-H., Lim, E-P., & Ng, W.K. (2000). DTD-miner: A tool for mining DTD from XML documents. Proceedings of Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, San Jose, California.
Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2001). Learning probabilistic models of relational structure. Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA.
Muslea, I. et al. (2004). Proceedings of the AAAI 2004 Workshop on Adaptive Text Extraction and Mining, ATEM-2004, San Jose, California.
Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link topology. Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia, Pittsburgh, Pennsylvania. Horrocks, I. (2002). DAML+OIL: A description logic for the semantic Web. IEEE Data Engineering Bulletin, 25(1), 4-9. Kleinberg, M. (1998). Authoritative sources in hyperlinked environment. Proceedings of the Ninth Annual ACMSIAM Symposium on Discrete Algorithms, San Francisco, California. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15. Liu, B., Chin, C.W., & Ng, H.T. (2003). Mining topicspecific concepts and definitions on the Web. Proceedings of the Twelfth International World Wide Web Conference (WWW-2003), Budapest, Hungary. Liu, B., Grossman, R., & Zhai, Y. (2003). Mining data records in Web pages. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2003), Washington, D.C. Loh, S., Wives, L.K., & de Oliveira, J.P. (2000). Conceptbased knowledge discovery in texts extracted from the Web. SIGKDD Explorations, 2(1), 29-39. Madria, S., Bhowmick, S., Ng, W.K., & Lim, E.-P. (1999). Research issues in Web data mining. Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, Florence, Italy. Maedche, A., & Staab, S. (2000). Discovering conceptual relations from text. Proceedings of the European Conference on Artificial Intelligence (ECAI00), Berlin, Germany. Mobasher, B. (2005). Web usage mining. In J. Wang (Ed.), Web usage mining data preparation. Hershey, PA: Idea Group Publishing.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23. Srivastava, J., Desikan, P., & Kumar, V. (2002). Web mining—Accomplishments and future directions. Proceedings of the National Science Foundation Workshop on Next Generation DataMining (NGDM’02), Baltimore, Maryland. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., & Lakhal, L. (2000). Fast computation of concept lattices using data mining techniques. Proceedings of the Knowledge Representation Meets Databases Conference (KRDB00), Berlin, Germany.
KEY TERMS Hubs and Authorities: Hubs and authorities are Web pages defined by a mutually reinforcing relationship with respect to their hyperlink structure. Hubs are Web pages that link to many related authorities; authorities are those pages that are linked to by many good hubs. Hyperlink: A hyperlink is a structural unit that connects a Web page to a different location, either within the same Web page or to a different Web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink. Web Community: A Web community can be described as a collection of Web pages, such that each member node has more hyperlinks (in either direction) within the community than outside of the community. Web Content Mining: The process of extracting useful information from the contents of Web documents.
Content data corresponds to the collection of facts that a Web page was designed to convey to users. It may consist of unstructured or semi-structured text, images, audio, video, or structured records, such as lists and tables. Web Mining: The application of data-mining techniques to extract knowledge from the content, structure, and usage of Web resources. It is generally subdivided into three independent but related areas: Web usage mining, Web content mining, and Web structure mining.
1210
Web Structure Mining: Web structure mining can be regarded as the process of discovering structure information from the Web. This type of mining can be divided further into two kinds, based on the kind of structural data used: hyperlinks connecting Web pages and the document structure in semi-structured Web pages. Web Usage Mining: The automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites.
1211
Web Page Extension of Data Warehouses
Anthony Scime, State University of New York College at Brockport, USA
INTRODUCTION
Data warehouses are constructed to provide valuable and current information for decision-making. Typically this information is derived from the organization's functional databases. The data warehouse then provides a consolidated, convenient source of data for the decision maker. However, the available organizational information may not be sufficient to come to a decision. Information external to the organization is also often necessary for management to arrive at strategic decisions. Such external information may be available on the World Wide Web; when added to the data warehouse, it extends decision-making power.
The Web can be considered as a large repository of data. This data is on the whole unstructured and must be gathered and extracted to be made into something valuable for the organizational decision maker. To gather this data and place it into the organization's data warehouse requires an understanding of the data warehouse metadata and the use of Web mining techniques (Laware, 2005).
Typically, when conducting a search on the Web, a user initiates the search by using a search engine to find documents that refer to the desired subject. This requires the user to define the domain of interest as a keyword or a collection of keywords that can be processed by the search engine. The searcher may not know how to break the domain down, thus limiting the search to the domain name. However, even given the ability to break down the domain and conduct a search, the search results have two significant problems. One, Web searches return information about a very large number of documents. Two, much of the returned information may be marginally relevant or completely irrelevant to the domain. The decision maker may not have time to sift through results to find the meaningful information.
A data warehouse that has already found domain-relevant Web pages can relieve the decision maker from having to decide on search keywords and having to determine the relevant documents from those found in a search. Such a data warehouse requires previously conducted searches to add Web information.

BACKGROUND
To provide an information source within an organization’s knowledge management system, database structure has been overlaid on documents (Liongosari, Dempski, & Swaminathan, 1999). This knowledge base provides a source for obtaining organizational knowledge. Data warehouses also can be populated in Web-based interoperational environments created between companies (Triantafillakis, Kanellis & Martakos, 2004). This extends knowledge between cooperating businesses. However, these systems do not explore the public documents available on the Web. Systems have been designed to extract relevant information from unstructured sources such as the Web. The Topicshop system allows users to gather, evaluate, and organize collections of Web sites (Amento, Terveen, Hill, Hix, & Schulman, 2003). Using topic discovery techniques Usenet news searching can be personalized to categorize contents and optimise delivery contents for review (Manco, Ortale & Tagarelli, 2005). Specialized search engines and indexes have been developed for many domains (Leake & Scherle, 2001). Search engines have been developed to combine the efforts of other engines and select the best search engine for a domain (Meng, Wu, Yu, & Li, 2001). However, these approaches do not organize the search results into accessible, meaningful, searchable data. Web search queries can be related to each other by the results returned (Wu & Crestani, 2004; Glance, 2000). This knowledge of common results to different queries can assist a new searcher in finding desired information. However, it assumes domain knowledge sufficient to develop a query with keywords, and does not provide corresponding organizational knowledge. Some Web search engines find information by categorizing the pages in their indexes. One of the first to create a structure as part of their Web index was Yahoo! (http:/ /www.yahoo.com). Yahoo! has developed a hierarchy of documents, which is designed to help users find information faster. This hierarchy acts as a taxonomy of the domain. Yahoo! helps by directing the searcher through
the domain. Again, there is no organizational knowledge to put the Web pages into a local context, so the documents must be accessed and assimilated by the searcher. DynaCat provides knowledge-based, dynamic categorization of search results in the medical domain (Pratt, 1999). The domain of medical topics is established and matched to predefined query types. Retrieved documents from a medical database are then categorized according to the topics. Such systems use the domain as a starting point, but do not catalog the information and add it to an existing organized body of domain knowledge such as a data warehouse. Web pages that contain multiple semi-structured records can be parsed and used to populate a relational database. Multiple semi-structured records are data about a subject that is typically composed of separate information instances organized individually, but generally in the same format. For example, a Web page of want ads or obituaries. The first step is to create an ontology of the general structure of the semi-structured data. The ontology is expressed as an Object-Relationship Model. This ontology is then used to define the parsing of the Web page. Parsing into records uses the HTML tags to determine the structure of the Web page, determining when a record starts and ends. The relational database structure is derived from the ontology. The system requires multiple records in the domain, with the Web page having a defined structure to delimit records. However, the Web pages must be given to the system, it cannot find Web pages, or determine if they belong to the domain (Embley et al., 1999). The Web Ontology Extraction (WebOntEx) project semi-automatically determines ontologies that exist on the Web. These ontologies are domain specific and placed in a relational database schema. Using the belief that HTML tags typically highlight a Web page’s concepts, concepts are extracted, by selecting some number of words after the tag as concepts. They are reviewed and may be selected to become entity sets, attributes or relationships in a domain relational database. The determination is based on the idea that nouns are possible entity and attribute types and verbs are possible relationship types. By analyzing a number of pages in a domain an ontology is developed within the relational database structure (Han & Elmasri, 2004). This system creates the database from Web page input, whereas an existing data warehouse needs only to be extended with Web available knowledge. Web based catalogs are typically taxonomy-directed. A taxonomy-directed Web site has its contents organized in a searchable taxonomy, presenting the instances of a category in an established manner. DataRover is a system that automatically finds and extracts products from taxonomy-directed, online catalogs. It utilizes heuristics to 1212
turn the online catalogs into a database of categorized products (Davulcu, Koduri & Nagarajan, 2003). This system is good for structured data, but it is not effective on unstructured text data. To find domain knowledge in large databases, domain experts are queried as to the topics and subtopics of a domain, creating an expert-level taxonomy (Scime, 2000, 2003). This domain knowledge can be used to help restrict the search space. The results found are attached to the taxonomy and evaluated for validity; the validated results create and extend the searchable data repository.
WEB SEARCH FOR WAREHOUSING

Experts within a domain of knowledge are familiar with the facts and the organization of the domain. In the warehouse design process, the analyst extracts from the expert the domain organization. This organization is the foundation for the warehouse structure and specifically the dimensions that represent the characteristics of the domain. In the Web search process, the data warehouse analyst can use the warehouse dimensions as a starting point for finding more information on the World Wide Web. These dimensions are based on the needs of decision makers and the purpose of the warehouse. They represent the domain organization. The values that populate the dimensions are pieces of the knowledge about the warehouse's domain. These organizational and knowledge facets can be combined to create a dimension-value pair, which is a special case of a taxonomy tree (Kerschberg, Kim & Scime, 2003; Scime & Kerschberg, 2003). This pair is then used as keywords to search the Web for additional information about the domain and this particular dimension value. The pages retrieved as a result of dimension-value-pair-based Web searches are analyzed to determine relevancy. The meta-data of the relevant pages is added to the data warehouse as an extension of the dimension. Keeping the warehouse current with frequent Web searches keeps the knowledge fresh and allows decision makers access to the warehouse and Web knowledge in the domain.
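As a rough sketch of the pairing just described (not part of the original article), the following Python fragment builds keyword strings from dimension values; the `search_web` function is a hypothetical placeholder for whatever search engine client a real system would use.

```python
# Illustrative sketch only: builds keyword strings from dimension-value pairs.
# `search_web` is a hypothetical stand-in for a real search engine client.

def to_keyword_string(dimension, value):
    """Combine a dimension label and one of its values, e.g. 'city Buffalo'."""
    return f"{dimension} {value}"

def queries_for_dimension(dimension, values):
    """One keyword-string query per dimension value."""
    return [to_keyword_string(dimension, v) for v in values]

def search_web(keyword_string, max_hits=20):
    """Hypothetical call; plug in a concrete search engine API here."""
    raise NotImplementedError

if __name__ == "__main__":
    for query in queries_for_dimension("city", ["Buffalo", "Rochester"]):
        print(query)  # e.g. "city Buffalo", "city Rochester"
```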
WEB PAGE COLLECTION AND WAREHOUSE EXTENSION

The Data Warehouse Web Extension Architecture (Figure 1) shows the process for adding Web pages to a data warehouse.
Figure 1. Data warehouse Web extension architecture. (The original diagram shows the data warehouse, the data warehouse analyst, the dimension-value pair keyword string, the Web search engine results list, the Web page meta-data, the World Wide Web, and the retrieved Web pages, connected by arrows numbered 1 through 8 that correspond to the steps below.)
1. Select Dimensions: The data warehouse analyst selects the dimension attributes that are likely to have relevant data about their values on the Web. For example, the dimension city would be chosen, as most cities have Web sites.
2. Extract Dimension-Value Pair: As values are added to the selected dimensions, the dimension label and value are extracted as a dimension-value pair and converted into a keyword string. The value Buffalo for the dimension city becomes the keyword string city Buffalo.
3. Keyword String Query: The keyword string is sent to a search engine (for example, Google).
4. Search the World Wide Web: The keyword string is used as a search engine query and the resulting hit lists containing Web page meta-data are returned. This meta-data typically includes page URL, title, and some summary information. In our example, the first result is the Home Page for the City of Buffalo in New York State. On the second page of results is the City of Buffalo, Minnesota.
5. Review Results Lists: The data warehouse analyst reviews the resulting hit list for possible relevant pages. Given the large number of hits (over 5 million for city Buffalo), the analyst must limit consideration of pages to a reasonable amount.
6. Select Web Documents: Web pages selected are those that may add knowledge to the data warehouse. This may be new knowledge or extensional knowledge to the warehouse. Because the analyst knows that the city of interest to the data warehouse is Buffalo, New York, he only considers the appropriate pages.
7. Relevancy Review: The analyst reviews the selected pages to ensure they are relevant to the intent of the warehouse attribute. The meta-data of the relevant Web pages is extracted during this relevancy review. The meta-data includes the Web page URL, title, date retrieved, date created, and summary. This meta-data may come from the search engine results list. For the Buffalo home page, this meta-data is shown in Figure 2.
8. Add Meta-Data: The meta-data for the page is added as an extension to the data warehouse. It is attached to the city dimension, creating a snowflake-like schema for the data warehouse (a small illustrative sketch follows Figure 2).
Figure 2. Meta-data for Buffalo relevant Web page

Title: City of Buffalo Home Page -- City of Buffalo
URL: www.ci.buffalo.ny.us/
Date Retrieved: Apr 20, 2004
Date Created: Apr 18, 2004
Summary: City of Buffalo, Leadership, City Services, Our City, News/Calendar, Return to Payment Cart. ... Tourism. Buffalo My City. Architecture and Landscapes. All America City.
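To make steps 7 and 8 concrete, here is a minimal sketch (not from the article; the table and column names are invented for illustration) that stores the Figure 2 meta-data in a table hanging off the city dimension, which is the snowflake-like extension described in step 8.

```python
import sqlite3

# Illustrative schema only; 'dim_city' and 'city_web_page' are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city_name TEXT)")
conn.execute("""
    CREATE TABLE city_web_page (      -- snowflake extension of the city dimension
        page_key       INTEGER PRIMARY KEY,
        city_key       INTEGER REFERENCES dim_city(city_key),
        title          TEXT,
        url            TEXT,
        date_retrieved TEXT,
        date_created   TEXT,
        summary        TEXT
    )
""")
conn.execute("INSERT INTO dim_city VALUES (1, 'Buffalo')")
conn.execute(
    "INSERT INTO city_web_page VALUES (1, 1, ?, ?, ?, ?, ?)",
    ("City of Buffalo Home Page -- City of Buffalo",
     "www.ci.buffalo.ny.us/",
     "Apr 20, 2004",
     "Apr 18, 2004",
     "City of Buffalo, Leadership, City Services, Our City, News/Calendar ..."),
)
conn.commit()
```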
FUTURE TRENDS
There are two trends in data repositories that, when combined, will greatly enhance the ability to extend data warehouses with Web-based information. The first is the movement to object-oriented databases (Ravat, Teste & Zurfluh, 1999). The other is the movement to the semantic Web (Engels & Lech, 2003). Currently, modeling and implementation of databases uses the Entity-Relationship model. This model has difficulty in representing multidimensional data views common in today's data warehouses. The object-oriented paradigm provides increased modeling capabilities in the multidimensional environment (Trujillo, Palomar & Gómez, 2000). Furthermore, the object-oriented data warehouse can be organized as an ontology.
In the search engine of the future, the linear index of Web pages will be replaced by an ontology. This ontology will be a semantic representation of the Web. Within the ontology, the pages may be represented by keywords and will also have connections to other pages. These connections will be the relationships between the pages and may also be weighted. Investigation of an individual page's content, the inter-page hypertext links, the position of the page on its server, and search-engine-discovered relationships would create the ontology (Guha, McCool & Miller, 2003). Matches will no longer be query keyword to index keyword, but a match of the data warehouse ontology to the search engine ontology. Rather than point-to-point matching, the query is the best fit of one multi-dimensional space upon another (Doan, Madhavan, Dhamankar, Domingos, & Halevy, 2003). The returned page locations are then more specific to the information need of the data warehouse.

CONCLUSION

The use of the Web to extend data warehouse knowledge about a domain provides the decision maker with more information than may otherwise be available from the organizational data sources used to populate the data warehouse. The Web pages referenced in the warehouse are derived from the currently available data and knowledge of the data warehouse structure. The Web search process and the data warehouse analyst sift through the external, distributed Web to find relevant pages. This Web-generated knowledge is added to the data warehouse for decision-maker consideration.

REFERENCES

Amento, B., Terveen, L., Hill, W., Hix, D., & Schulman, R. (2003). Experiments in social data mining: The TopicShop system. ACM Transactions on Computer-Human Interaction, 10(1), 54-85.
Davulcu, H., Koduri, S., & Nagarajan, S. (2003). Datarover: A taxonomy-based crawler for automated data extraction from data-intensive Websites. Proceedings of the Fifth ACM International Workshop on Web Information and Data Management (pp. 9-14), New Orleans, Louisiana. Doan, A., Madhavan, J., Dhamankar, R., Domingos, P., & Halevy, A. (2003). Learning to match ontologies on the semantic Web. The International Journal on Very Large Data Bases, 12(4), 303-319. Embley, D.W., Campbell, D.M., Jiang, Y.S., Liddle, S.W., Lonsdale, D.W., Ng, Y.K., et al., (1999). Conceptualmodel-based data extraction from multiple-record Web pages. Data & Knowledge Engineering, 31(3), 227-251. Engels, R., & Lech, T. (2003). Generating ontologies for the semantic Web: OntoBuilder. In J. Davies, & F.D. Van Harmelem (Eds.), Towards the semantic Web: Ontologydriven Knowledge management (pp. 91-115). U.K.: John Wiley & Sons. Glance, N.S. (2000). Community search assistant. AAAI Workshop Technical Report of the Artificial Intelligence for Web Search Workshop (pp. 29-34), Austin, Texas. Guha, R., McCool, R., & Miller, E. (2003). Semantic search. Proceedings of the Twelfth International Conference on the World Wide Web (pp. 700-709), Budapest, Hungary. Han, H. & Elmasri, R. (2004). Learning rules for conceptual structure on the Web. Journal of Intelligent Information Systems, 22(3), 237-256. Kerschberg, L., Kim, W. & Scime, A. (2003). A personalizable agent for semantic taxonomy-based Web search. In W. Truszkowski, C. Rouff, & M. Hinchey (Eds.), Innovative concepts for agent-based systems. Lecture notes in artificial intelligence 2564 (pp. 3-31). Heidelberg: Springer. Laware, G. (2005). Metadata management: A requirement for Web warehousing and knowledge management. In A. Scime (Ed.), Web mining: Applications and techniques (pp. 1-26). Hershey: Idea Group Publishing. Leake, D. B. & Scherle, R. (2001). Towards context-based search engine selection. Proceedings of the 6th International Conference on Intelligent User Interfaces (pp. 109-112), Santa Fe, New Mexico.
Liongosari, E.S., Dempski, K.L., & Swaminathan, K.S. (1999). In search of a new generation of knowledge management applications. SIGGROUP Bulletin, 20(2), 60-63. Manco, G., Ortale, R., & Tagarelli, A. (2005). The scent of a newsgroup: Providing personalized access to usenet sites through Web mining. In A. Scime (Ed.), Web mining: Applications and techniques (pp. 393-413). Hershey: Idea Group Publishing. Meng, W., Wu, Z., Yu, C., & Li, Z. (2001). A highly scalable and effective method for metasearch. ACM Transactions on Information Systems, 19(3), 310-335. Pratt, W., Hearst, M., & Fagan, L. (1999). A knowledgebased approach to organizing retrieved documents. AAAI99: Proceedings of the Sixteenth National Conference on Artificial Intelligence (pp. 80-85), Orlando, Florida. Ravat, F., Teste, O., & Zurfluh, G. (1999). Towards data warehouse design. Proceedings of the Eighth International Conference on Information and Knowledge Management (pp. 359-366), Kansas City, Missouri. Scime, A. (2000). Learning from the World Wide Web: Using organizational profiles in information searches. Informing Science, 3(3), 135-143. Scime, A. (2003). Web mining to create a domain specific Web portal database. In D. Taniar & J. Rahayu (Eds.), Web-powered databases (pp. 36-53). Hershey: Idea Group Publishing. Scime, A. & Kerschberg, L. (2003). WebSifter: An ontological Web-mining agent for e-business. In R. Meersman, K. Aberer, & T. Dillon (Eds.), Semantic issues in ecommerce systems (pp. 187-201). The Netherlands: Kluwer Academic Publishers. Triantafillakis, A., Kanellis, P., & Martakos, D. (2004). Data warehouse interoperability for the extended enterprise. Journal of Database Management, 15(3), 73-83. Trujillo, J., Palomar, M., & Gómez, J. (2000). The GOLD definition language (GDL): An object-oriented formal specification language for multidimensional databases.
Proceedings of the 2000 ACM Symposium on Applied Computing (pp. 346-350), Como, Italy. Wu, S. & Crestani, F. (2004). Shadow document methods of results merging. Proceedings of the 2004 ACM Symposium on Applied Computing (pp. 1067-1072), Nicosia, Cyprus.
KEY TERMS

Dimension: A category of information relevant to the decision-making purpose of the data warehouse.

Domain: The area of interest for which a data warehouse was created.

Meta-Data: Data about data. In a database, the attributes, relations, files, etc. have labels or names indicating the purpose of the attribute, relation, file, etc. These labels or names are meta-data.

Search Engine: A Web service that allows a user to find Web pages matching the user's selection of keywords.

Star Schema: The typical logical topology of a data warehouse, where a fact table occupies the center of the data warehouse and dimension tables are related to most fact table attributes.

Taxonomy Tree: A collection of related concepts organized in a tree structure where higher-level concepts are decomposed into lower-level concepts.

URL: The Uniform Resource Locator (URL) is the address of all Web pages, images, and other resources on the World Wide Web.

Web Page: A file that is on the Web and is accessible by its URL.

Web Site: A collection of Web pages located together on a Web server. Typically the pages of a Web site have a common focus and are connected by hyperlinks.
Web Usage Mining
Bamshad Mobasher, DePaul University, USA
INTRODUCTION
With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of clickstream and user data collected by Web-based organizations in their daily operations have reached astronomical proportions. Analyzing such data can help these organizations determine the lifetime value of clients, design cross-marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. This type of analysis involves the automatic discovery of meaningful patterns and relationships from a large collection of primarily semi-structured data often stored in Web and applications server access logs as well as in related operational data sources.
Web usage mining (Cooley et al., 1997; Srivastava et al., 2000) refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns usually are represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests.
The overall Web usage mining process can be divided into three interdependent tasks: data preprocessing, pattern discovery, and pattern analysis or application. In the preprocessing stage, the clickstream data are cleaned and partitioned into a set of user transactions representing the activities of each user during different visits to the site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to obtain possibly hidden patterns reflecting the typical behavior of users, as well as summary statistics on Web resources, sessions, and users. In the final stage of the process, the discovered patterns and statistics are further processed, filtered, and used as input to applications, such as recommendation engines, visualization tools, and Web analytics and report generation tools. In this article, we provide a summary of the analysis and data-mining tasks most commonly used in Web usage mining and discuss some of their typical applications.

BACKGROUND
The log data collected automatically by the Web and application servers represent the fine-grained navigational behavior of visitors. Each hit against the server generates a single entry in the server access logs. Each log entry (depending on the log format) may contain fields identifying the time and date of the request, the IP address of the client, the resource requested, possible parameters used in invoking a Web application, status of the request, HTTP method used, the user agent (browser and operating system types and versions), the referring Web resource, and, if available, client-side cookies that uniquely identify repeat visitors. Depending on the goals of the analysis, these data need to be transformed and aggregated at different levels of abstraction. In Web usage mining, the most basic level of data abstraction is that of a pageview. A pageview is an aggregate representation of a collection of Web objects contributing to the display on a user's browser resulting from a single user action (such as a clickthrough). At the user level, the most basic level of behavioral abstraction is that of a session. A session is a sequence of pageviews by a single user during a single visit. The process of transforming the preprocessed clickstream data into a collection of sessions is called sessionization.
The goal of the preprocessing stage in Web usage mining is to transform the raw clickstream data into a set of user sessions, each corresponding to a delimited sequence of pageviews (Cooley et al., 1999). The sessionized data can be used as the input for a variety of data-mining algorithms. However, in many applications, data from a variety of other sources must be integrated with the preprocessed clickstream data. For example, in e-commerce applications, the integration of both customer and product data (e.g., demographics, ratings, purchase histories) from operational databases with usage data can allow for the discovery of important business intelligence metrics, such as customer conversion ratios and lifetime values (Kohavi et al., 2004). The integration of semantic knowledge from the site content or semantic attributes of products can be used by personalization systems to provide more useful recommendations (Dai & Mobasher, 2004; Ghani & Fano, 2002). A detailed discussion of the data preparation and data collection in Web usage mining can be found in the article
“Data Preparation for Web Usage Mining” in this volume (Mobasher, 2005).
MAIN THRUST

The types and levels of analysis performed on the integrated usage data depend on the ultimate goals of the analyst and the desired outcomes. This section describes the most common types of pattern discovery and analysis employed in the Web usage mining domain and discusses some of their applications.
Session and Visitor Analysis The statistical analysis of preprocessed session data constitutes the most common form of analysis. In this case, data are aggregated by predetermined units, such as days, sessions, visitors, or domains. Standard statistical techniques can be used on these data to gain knowledge about visitor behavior. This is the approach taken by most commercial tools available for Web log analysis. Reports based on this type of analysis may include information about most frequently accessed pages, average view time of a page, average length of a path through a site, common entry and exit points, and other aggregate measures. Despite a lack of depth in this type of analysis, the resulting knowledge potentially can be useful for improving system performance and providing support for marketing decisions. Furthermore, commercial Web analytics tools are increasingly incorporating a variety of datamining algorithms resulting in more sophisticated site and customer metrics. Another form of analysis on integrated usage data is Online Analytical Processing (OLAP). OLAP provides a more integrated framework for analysis with a higher degree of flexibility. The data source for OLAP analysis is usually a multidimensional data warehouse, which integrates usage, content, and e-commerce data at different levels of aggregation for each dimension. OLAP tools allow changes in aggregation levels along each dimension during the analysis. Analysis dimensions in such a structure can be based on various fields available in the log files and may include time duration, domain, requested resource, user agent, and referrers. This allows the analysis to be performed on portions of the log related to a specific time interval or at a higher level of abstraction with respect to the URL path structure. The integration of e-commerce data in the data warehouse further can enhance the ability of OLAP tools to derive important business intelligence metrics (Buchner & Mulvenna, 1999). The output from OLAP queries also can be used as the input for a variety of data-mining or data visualization tools.
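As a minimal illustration of this kind of aggregate reporting (the session data below are invented), page frequencies and average view times can be computed directly from sessionized data:

```python
from collections import Counter, defaultdict

def session_report(sessions):
    """sessions: list of [(page, seconds_or_None), ...] lists.
    Returns (page frequencies, average view time per page)."""
    page_counts = Counter()
    page_times = defaultdict(list)
    for session in sessions:
        for page, seconds in session:
            page_counts[page] += 1
            if seconds is not None:  # time on the last page of a visit is unknown
                page_times[page].append(seconds)
    avg_time = {p: sum(t) / len(t) for p, t in page_times.items()}
    return page_counts.most_common(), avg_time

sessions = [[("home", 12), ("products", 45), ("cart", None)],
            [("home", 8), ("products", 30), ("checkout", None)]]
print(session_report(sessions))
```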
Association and Correlation Analysis Association rule discovery and statistical correlation analysis on usage data result in finding groups of items or pages that are commonly accessed or purchased together. This, in turn, enables Web sites to organize the site content more efficiently or to provide effective crosssale product recommendations. Association rule discovery algorithms find groups of items (e.g., pageviews) occurring frequently together in many transactions (i.e., satisfying a pre-specified minimum support threshold). Such groups of items are referred to as frequent itemsets. Association rules that satisfy a minimum confidence threshold are then generated from the frequent itemsets. An association rule r is an expression of the form X → Y (σr, αr), where X and Y are itemsets, σ is the support of the itemset X ∪ Y representing the probability that X and Y occur together in a transaction, and α is the confidence for the rule r, representing the conditional probability that Y occurs in a transaction, given that X has occurred in that transaction. The discovery of association rules in Web transaction data has many advantages. For example, a high-confidence rule, such as {special-offers/, /products/software/ } → {shopping-cart/}, might provide some indication that a promotional campaign on software products is positively affecting online sales. Such rules also can be used to optimize the structure of the site. For example, if a site does not provide direct linkage between two pages A and B, the discovery of a rule {A} → {B} would indicate that providing a direct hyperlink from A to B might aid users in finding the intended information. Both association analysis (among products or pageviews) and statistical correlation analysis (generally among customers or visitors) have been used successfully in Web personalization and recommender systems (Herlocker et al., 2004; Mobasher et al., 2001).
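A minimal sketch of the support and confidence definitions above (the session contents are invented; the pageview names echo the example rule):

```python
def support_confidence(transactions, x, y):
    """Support and confidence of the rule x -> y, where each transaction
    and each of x, y is a set of pageviews."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    support = n_xy / n                        # P(x and y in the same transaction)
    confidence = n_xy / n_x if n_x else 0.0   # P(y | x)
    return support, confidence

sessions = [{"special-offers/", "/products/software/", "shopping-cart/"},
            {"special-offers/", "/products/software/"},
            {"home/", "/products/software/"}]
print(support_confidence(sessions,
                         {"special-offers/", "/products/software/"},
                         {"shopping-cart/"}))  # (0.333..., 0.5)
```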
Cluster Analysis and Visitor Segmentation Clustering is a data-mining technique to group together a set of items having similar characteristics. In the Web usage domain, there are two kinds of interesting clusters that can be discovered: user cluster and page clusters. Clustering of user records (sessions or transactions) is one of the most commonly used analysis tasks in Web usage mining and Web analytics. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in e-commerce applications or to
provide personalized Web content to users with similar interests. Further analysis of user groups, based on their demographic attributes (e.g., age, gender, income level, etc.), may lead to the discovery of valuable business intelligence. Usage-based clustering also has been used to create Web-based user communities, reflecting similar interests of groups of users (Paliouras et al., 2002), and to learn user models that can be used to provide dynamic recommendations in Web personalization applications (Mobasher et al., 2002). Clustering of pages (or items) can be performed, based on the usage data (i.e., starting from the users session or transaction data) or on the content features associated with pages or items (i.e., keywords or product attributes). In the case of content-based clustering, the result may be collections of pages or products related to the same topic or category. In usage-based clustering, items that are commonly accessed or purchased together can be organized automatically into groups. It also can be used to provide permanent or dynamic HTML pages that suggest related hyperlinks to the users according to their past history of navigational or purchase activity.
Analysis of Sequential and Navigational Patterns

The technique of sequential pattern discovery attempts to find intersession patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns that will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analyses that can be performed on sequential patterns include trend analysis, change point detection, and similarity analysis. In the context of Web usage data, sequential pattern mining can be used to capture frequent navigational paths among user trails.
The view of Web transactions as sequences of pageviews allows for a number of useful and well-studied models to be used in discovering or analyzing user navigation patterns. One such approach is to model the navigational activity in the Web site as a Markov model; each pageview (or category) can be represented as a state, and the transition probabilities between these states can represent the likelihood that a user will navigate from one state to another. This representation allows for the computation of a number of useful user or site metrics. For example, one might compute the probability that a user will make a purchase, given that the user has performed a search in an online catalog. Markov models have been proposed as the underlying modeling machinery for link prediction as well as for Web prefetching to minimize system latencies (Deshpande & Karypis, 2004; Sarukkai, 2000). The goal of such approaches is to predict the next user action
based on a user’s previous surfing behavior. They also have been used to discover high probability user navigational trails in a Web site (Borges & Levene, 1999). More sophisticated statistical learning techniques, such as mixtures of Markov models, also have been used to cluster navigational sequences and to perform exploratory analysis of users’ navigational behaviors in a site (Cadez et al., 2003). Another way of efficiently representing navigational trails is by inserting each trail into a trie structure. A wellknown example of this approach is the notion of aggregate tree introduced as part of the WUM (Web Utilization Miner) system (Spiliopoulou & Faulstich, 1999). Each node in the tree represents a navigational subsequence from the root (an empty node) to a page and is annotated by the frequency of occurrences of that subsequence in the session data. This approach and its extensions have proved useful in evaluating the navigational design of a Web site (Spiliopoulou, 2000).
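The first-order Markov view of navigation can be sketched as follows (a simplified illustration, not the cited models; page names are invented): transition probabilities are estimated from observed pageview sequences and then used to guess the most likely next page.

```python
from collections import defaultdict

def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from pageview sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sessions:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def most_likely_next(probs, page):
    nxt = probs.get(page)
    return max(nxt, key=nxt.get) if nxt else None

sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]]
probs = transition_probabilities(sessions)
print(most_likely_next(probs, "B"))  # "C", with estimated probability 2/3
```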
Web User Modeling and Classification Classification is the task of mapping a data item into one of several predefined classes. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised learning algorithms, such as decision tree classifiers, naive Bayesian classifiers, knearest neighbor classifiers, neural networks, and support vector machines. It is also possible to use previously discovered clusters and association rules for classification of new users. Classification techniques play an important role in Web analytics applications for modeling users according to various predefined metrics. For example, given a set of user transactions, the sum of purchases made by each user within a specified period of time can be computed. A classification model then can be built based on this enriched data in order to classify users into those with a high propensity to buy and those that do not, taking into account features such as users’ demographic attributes as well their navigational activity. Another important application of classification and user modeling in the Web domain is that of Web personalization and recommender systems. For example, most collaborative filtering applications in existing recommender systems use k-nearest neighbor classifiers to predict user ratings or purchase propensity by measuring the correlations between a target user and past user transaction (Herlocker et al., 2004). Many of the Web usage mining approaches discussed can be used to automatically discover user models and then apply these
models to provide personalized content to an active user (Mobasher et al., 2000; Pierrakos et al., 2003).
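A toy sketch of the user-classification idea (the features, labels, and data are invented; a real system would use much richer session features): a k-nearest-neighbor classifier over simple per-user features such as number of sessions, pageviews, and recent spending.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature vector.
    Returns the majority label among the k nearest neighbors (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical per-user features: (sessions, pageviews, dollars spent last month)
train = [((12, 140, 310.0), "buyer"),
         ((8, 95, 120.0), "buyer"),
         ((3, 20, 0.0), "non-buyer"),
         ((2, 12, 0.0), "non-buyer"),
         ((5, 60, 15.0), "non-buyer")]
print(knn_predict(train, (9, 100, 80.0), k=3))
```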
FUTURE TRENDS Usage patterns discovered through Web usage mining are effective in capturing item-to-item and user-to-user relationships and similarities at the level of user sessions. However, without the benefit of deeper domain knowledge, such patterns provide little insight into the underlying reasons for which such items or users are grouped together. Furthermore, the inherent and increasing heterogeneity of the Web has required Web-based applications to more effectively integrate a variety of types of data across multiple channels and from different sources. Thus, a focus on techniques and architectures for more effective integration and mining of content, usage, and structure data from different sources is likely to lead to the next generation of more useful and more intelligent applications and more sophisticated tools for Web usage mining that can derive intelligence from user transactions on the Web. It is possible to capture some of the site semantics by integrating keyword-based content-filtering approaches with usage mining techniques. However, in order to capture more complex relationships at a deeper semantic level based on the attributes associated with structured objects, it would be necessary to go beyond keyword-based representations and to automatically integrate relational structure and domain ontologies into the preprocessing and mining processes. Efforts in this direction are likely to be the most fruitful in the creation of much more effective Web usage mining, user modeling, and personalization systems that are consistent with emergence and proliferation of the semantic Web (Dai & Mobasher, 2004).
CONCLUSION Web usage mining has emerged as the essential tool in realizing more personalized, user-friendly and businessoptimal Web services. Advances in data preprocessing, modeling, and mining techniques applied to the Web data have already resulted in many successful applications in adaptive information systems, personalization services, Web analytics tools, and content management systems. As the complexity of Web applications and users’ interactions with these applications increases, the need for intelligent analysis of the Web usage data also will continue to grow.
REFERENCES Borges, J., & Levene, M. (1999). Data mining of user navigation patterns. Proceedings of Web Usage Analysis and User Profiling, WebKDD’99 Workshop, San Diego, CA. Buchner, A., & Mulvenna, M.D. (1999). Discovering Internet marketing intelligence through online analytical Web usage mining. SIGMOD Record, 4(27), 54-61. Cadez, I.V., Heckerman, D., Meek, C., Smyth, P., & White, S. (2003). Model-based clustering and visualization of navigation patterns on a Web site. Data Mining and Knowledge Discovery, 7(4), 399-424. Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the World Wide Web. Proceedings of the 9th IEEE International Conference on Tools With Artificial Intelligence (ICTAI ’97), Newport Beach, California. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5-32. Dai, H., & Mobasher, B. (2004). Integrating semantic knowledge with Web usage mining for personalization. In A. Scime (Ed.), Web mining: Applications and techniques (pp. 276-306). Hershey, PA: Idea Group Publishing. Deshpande, M., & Karypis, G. (2004). Selective Markov models for predicting Web page accesses. ACM Transactions on Internet Technology, 4(2), 163-184. Ghani, R., & Fano, A. (2002). Building recommender systems using a knowledge base of product semantics. Proceedings of the Workshop on Recommendation and Personalization in E-Commerce, International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, Malaga, Spain. Herlocker, J.L., Konstan, J., Terveen, L., & Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5-53. Kohavi, R., Mason, L., Parekh, R., & Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning, 57, 83-113. Mobasher, B. (2005). Web usage mining data preparation. In J. Wang (Ed.), Encyclopedia of data warehousing and mining . Hershey, PA: Idea Group Publishing. Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43(8), 142-151.
Mobasher, B., Dai, H., Luo, T., & Nakagawa, N. (2001). Effective personalization based on association rule discovery from Web usage data. Proceedings of the 3rd ACM Workshop on Web Information and Data Management (WIDM01), Atlanta, Georgia. Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002). Discovery and evaluation of aggregate usage profiles for Web personalization. Data Mining and Knowledge Discovery, 6, 61-82. Paliouras, G., Papatheodorou, C., Karkaletsis, V., & Spyropoulos, C.D. (2002). Discovering user communities on the Internet using unsupervised machine learning techniques. Interacting With Computers Journal, 14(6), 761-791.
Pierrakos, G., Paliouras, G., Papatheodorou, C., & Spyropoulos, C. (2003). Web usage mining as a tool for personalization: A survey. User Modeling and User-Adapted Interaction, 13, 311-372.

Sarukkai, R.R. (2000). Link prediction and path analysis using Markov chains. Proceedings of the 9th International World Wide Web Conference, Amsterdam, Netherlands.

Spiliopoulou, M. (2000). Web usage mining for Web site evaluation. Communications of the ACM, 43(8), 127-134.

Spiliopoulou, M., & Faulstich, L. (1999). WUM: A tool for Web utilization analysis. Proceedings of the EDBT Workshop at WebDB'98.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.

KEY TERMS

Navigational Pattern: A collection of pageviews or Web resources that are frequently accessed together by users during one or more sessions (usually in a particular order).

Pageview: An aggregate representation of a collection of Web objects or resources contributing to the display on a user's browser resulting from a single user action (such as a clickthrough).

Sessionization: The preprocessing task of partitioning the clickstream Web log data into sessions (i.e., delimited sequences of pageviews attributed to a single user during a single visit to a site).

User Modeling: The process of using analytical or machine learning techniques to create an aggregate characterization of groups of users with similar interests or behaviors.

Web Analytics: The study of the impact of a site on the users and their behaviors. In e-commerce, Web analytics involves the computation of a variety of site- and customer-oriented metrics (e-metrics) to determine the effectiveness of the site content and organization and to understand the online purchasing decisions of customers.

Web Personalization: The process of dynamically serving customized content (e.g., pages, products, recommendations, etc.) to Web users, based on their profiles, preferences, or expected interests.

Web Usage Mining: The automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites.
Web Usage Mining and Its Applications
Yongjian Fu, Cleveland State University, USA
INTRODUCTION

With the rapid development of the World Wide Web or the Web, many organizations now put their information on the Web and provide Web-based services such as online shopping, user feedback, technical support, and so on. Understanding Web usage through data mining techniques is recognized as an important area. Web usage mining is the process of identifying interesting patterns from Web server logs. It has shown great potential in many applications such as adaptive Web sites, Web personalization, cache management, and so on.
BACKGROUND Most commonly used Web servers maintain a server log, which consists of page requests in the form of Common Log Format. The Common Log Format specifies that a record in a log file contains, among other data, the IP address of the user, the date and time of the request, the URL of the page, the protocol, the return code of the server, and the size of the page if the request is successful (Luotonen, 1995). A few examples of log records in Common Log Format are given in Table 1. The IP addresses are modified for privacy reasons. The URLs of the pages are relative to the Web server’s home page address, in this example, www.csuohio.edu. In Web usage mining, the server logs are first preprocessed to clean and transform the data. Data mining techniques are then applied on these preprocessed data to find usage patterns. The usage patterns are employed by many applications to evaluate and improve Web sites. In preprocessing, the server log files are cleaned to filter out irrelevant information for Web usage mining,
such as background images, and transformed into a set of sessions. A session is conceptually a single visit of a user (Cooley et al., 1999). For example, when a user buys an airplane ticket from a Web site, the log records related to the transaction compose a session. In practice, a session consists of pages accessed by a user in a certain period of time. Various data mining techniques can be applied on sessions to find usage patterns, including association rules, clustering, and classification. Other techniques have also been used for Web usage analysis including data warehousing and OLAP, intelligent agent, and collaborative filtering. Web usage mining has a broad range of applications, such as adaptive Web sites, Web personalization, and cache management, to name a few. Moreover, Web usage patterns may be combined with other information such as Web page content (texts and multimedia), hyperlinks, and user registrations to provide more comprehensive understandings and solutions.
MAIN THRUST

The preprocessing of Web server logs, techniques for Web usage mining, and applications of Web usage mining are discussed below.
Preprocessing

In preprocessing, irrelevant records in a server log are thrown out and the others are put into sessions. Log records from the same user are put into a session. The IP addresses in the log records are used to identify users. Two records with the same IP address are assumed to be from the same user. A session contains a unique session ID and a set of (pid,
Table 1. Examples from a Web server log

dan.ece.csuohio.edu -- [01/Aug/2001:13:17:45 -0700] "GET /~dan/a.html" 200 34
131.39.170.27 -- [01/Aug/2001:13:17:47 -0700] "GET /~white/Home.htm HTTP/1.0" 200 2034
dan.ece.csuohio.edu -- [01/Aug/2001:13:17:48 -0700] "GET /~dan/b.html HTTP/1.0" 200 8210
131.39.170.27 -- [01/Aug/2001:13:17:50 -0700] "GET /~white/cloud.gif HTTP/1.0" 200 4489
131.39.170.27 -- [01/Aug/2001:13:17:51 -0700] "GET /~white/hobby.htm HTTP/1.0" 200 890
117.83.344.74 -- [01/Aug/2001:13:17:51 -0700] "GET /~katz/arrow.jpg HTTP/1.0" 200 2783
t) pairs, where pid is a page identifier and t is the time the user spent on that page. Generally, the preprocessing involves the following steps (Cooley et al., 1999):
1. Records about image files (.gif, .jpg, etc.) are filtered out, as are unsuccessful requests (return code not 200).
2. Requests from the same IP address are grouped into a session. A timeout threshold max_idle is used to decide the end of a session; that is, if the same IP address does not occur within a time range of max_idle minutes, the current session is closed. Subsequent requests from the same IP address will be treated as a new session.
The introduction of max_idle is for both conceptual and practical purposes. From a conceptual point, it helps to limit a session to a single visit. For instance, a user can buy a book and comes back the next day to check movies. The activities will be separated into two sessions. From a practical point, it prevents a session from running too long. The selection of max_idle is dependent on the Web site and application. Empirically, a few studies found 30 minutes to be suitable (Cooley et al., 1999; Fu et al., 1999). For example, the Web server log in Table 1 will be organized into sessions as shown in Table 2. It should be noted that session IDs are not IP addresses since they may be shared by multiple sessions. There are some difficulties in accurately identifying sessions and estimating times spent on pages, due to client or proxy caching of pages, sharing of IP addresses, and network traffic (Cooley et al., 1999). Besides, the time the user spent on the last page is unknown since there are no more requests after it.
Techniques Several data mining techniques have been successfully applied in Web usage mining, including association rules, clustering, and classification. Besides, data warehousing and OLAP techniques have also been employed.
Association rules represent correlations among objects, first proposed to capture correlations among items in transactional data. For example, an association rule “hot dog → soft drink [10%, 56%]” says that 56% of people who buy hot dogs also buy soft drinks, which constitute 10% of all customers. If a session is viewed as a transaction, association rule mining algorithms can be employed to find associative relations among pages browsed (Yang et al., 2002). For example, an association rule “Page A → Page B [5%, 80%]” says 80% of users who browse page A will also browse page B, and 5% of all users browse both. Using the same algorithms, we may find frequent paths traversed by many users, for example, 40% of users browsing page A, then pages B and C, finally page D (Frias-Martinez & Karamcheti, 2002). Clustering is the identification of classes, also called clusters or groups, for a set of objects whose classes are unknown. Using clustering techniques, we can cluster users based on their access patterns (Fu et al., 1999). In this approach, sessions are treated as objects and each page representing a dimension in the object space. Sessions containing similar pages will be grouped. For examples, if a user browses pages A, B, C, and D, and another user browses pages A, C, D, and F, they may be clustered in to a group. A more sophisticated clustering approach would use the browsing times of pages in sessions. For example, two sessions [1, (A, 15), (B, 10), (C, 1)] and [2, (A, 12), (B, 12), (D, 2)] will be clustered into one group. In classification, a classifier is developed from a training set of objects where classes are known. Given a set of sessions in different classes, a classifier can be built using classification methods. For example, a classifier may tell whether a user will be a buyer or a non-buyer based on the browsing patterns of the user for an e-commerce site (Spiliopoulou et al., 1999). Data warehouse techniques may be used to create data cubes from Web server logs for OLAP. The statistics along pages, IP domains, geographical locations of users, and browsing times are calculated from sessions. Other techniques exist for Web usage mining. For example, a hybrid method, which combines hypertext probabilistic grammar and click fact table, shows promising results (Jespersen et al., 2002).
Table 2. Sessions from the server logs

Session ID | IP Address          | Requested Pages                     | Time Spent
1          | dan.ece.csuohio.edu | /~dan/a.html, /~dan/b.html          | 3 seconds
2          | 131.39.170.27       | /~white/Home.htm, /~white/hobby.htm | 4 seconds
Applications The main purpose of Web usage mining is to discovery usage patterns that can help understanding and improving Web sites. Applications in adaptive Web sites, Web personalization, and cache management are described below. An adaptive Web site is a Web site that semi-automatically improves its organization and presentation by learning from user access patterns. There are two steps involved in building adaptive Web sites. First, we need to analyze the users and their use of a Web site. Second, the Web site should be updated semi-automatically based on the information gathered in the first step. The first step is discussed here since it is the step that is related to Web usage mining. One approach to adaptive Web site is creating an index page which links Web pages that are not directly linked but are frequently accessed together. An index page is a Web page that is used mostly for the navigation of a Web site. It normally contains little information except links. A similar approach is to cluster pages based on their occurrences in frequent paths that are found through association rule mining (Mobasher et al., 1999). Another approach to adaptive Web site is by evolving a site’s structure so that its users can get the information they want with less clicks (Fu et al., 2001). Web personalization is another big application of Web usage mining. By analyzing and understanding users’ browsing patterns, the Web can be customized to fit the users’ needs. There are two sides to Web personalization. First, on client side, Web personalization means creating a personalized view of the Web for a user (Toolan & Kusmerick, 2002). This view will be unique and customized according to the user’s preferences. Second, on the server side, a Web server can provide personalized services to users (Nasraoui & Rojas, 2003). From the client side, a user’s browsing activities can be analyzed together with server logs. It will provide base for building a user profile. A personalized Web site can be created based on the user profile. An example of this is the personalization of Web pages for mobile users. Because of limitations in bandwidth, screen size, computing capacity, and power, mobile devises such as PDAs and cell phones have difficulty to download and browse Web pages designed for desktop computers. One way to personalize Web pages for mobile users is to add extra links by predicting users’ future requests (Anderson et al., 2001). From the server side, the first step in Web personalization is clustering users; because there are usually a large number of users and it is hard to cater to individual users. Web usage mining can help to identify user groups with similar browsing patterns as mentioned above. The server
then customizes the Web site for a user based on his/her group. For example, the server could create a dynamic link for a user if other users in the group follow the link, or recommend a new product or service to a user if others in the group use the product or service (Hagen et al., 2003). Another application of Web usage mining is to improve navigation of users by optimizing browser caching. Since the cache size on a client is limited, its efficient usage will improve cache hit rate, thus reduce network traffic and avoid latency. From a user’s browsing patterns, more efficient cache management algorithms can be developed (Yang & Zhang, 2001). For example, if a user browses page A after pages B and C in 8 out of past 10 sessions, it makes sense keeping A in the cache after the user browsed pages B and C.
FUTURE TRENDS The first step in Web usage mining is to organize server logs into sessions. It is usually done by identifying users through IP addresses and imposing a session timeout threshold. However, because of client/proxy caching, network traffic, sharing of IP, and other problems, it is hard to obtain the sessions accurately. Possible solutions include cookies, user registration, client side log, and path completion. We will see more exciting techniques for preprocessing that use these methods. Although Web usage mining is able to reveal a lot of interesting patterns, it is much more interesting to mine the various data sources, such as Web pages, structure (links), and server logs, and synthesize the findings and results (Li & Zaiane, 2004). For example, by analyzing server logs and the corresponding pages we can build a user profile, which tells not only the pages a user is interested, but also the characteristics of the pages. This will let us personalize based on the user’s preferences at word or phrase level instead of page level. Most current approaches in Web usage mining find patterns in server logs. An interesting direction for future research and development is mining client side data, along with server logs. A client’s activity on the Web can be investigated to understand the individual’s interests. Based on such individual interests, personalized Web services such as searching, filtering, and recommendation can be developed (Fu & Shih, 2002).
CONCLUSION Web usage mining applies data mining and other techniques to analyze Web server logs. It can reveal patterns
in users’ browsing activities. Its applications include adaptive Web sites, Web personalization, and browser cache management. It is certainly an area with much potential.
REFERENCES

Anderson, C., Domingos, R.P., & Weld, D.S. (2001). Web site personalization for mobile devices. In IJCAI Workshop on Intelligent Techniques for Web Personalization, Seattle, USA.

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1).

Frias-Martinez, E., & Karamcheti, V. (2002, July). A prediction model for user access sequences. In Proceedings of the WEBKDD Workshop: Web Mining for Usage Patterns and User Profiles.

Fu, Y., Creado, M., & Shih, M. (2001, June). Adaptive Web site by Web usage mining. In International Conference on Internet Computing (IC'2001) (pp. 28-34), Las Vegas, NV.

Fu, Y., Sandhu, K., & Shih, M. (1999). Clustering of Web users based on access patterns. In International Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), San Diego, CA.

Fu, Y., & Shih, M. (2002). A framework for personal Web usage mining. In International Conference on Internet Computing (IC'2002) (pp. 595-600), Las Vegas, NV.

Hagen, S., Someren, M., & Hollink, V. (2003). Exploration/exploitation in adaptive recommender systems. In Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and their Implementation in Smart Adaptive Systems, Oulu, Finland.

Jespersen, S., Thorhauge, J., & Pedersen, T. (2002, September). A hybrid approach to Web usage mining. In Proceedings of the Fourth International Conference on Data Warehousing and Knowledge Discovery (pp. 73-82), Aix-en-Provence, France.

Li, J., & Zaiane, O. (2004). Combining usage, content and structure data to improve Web site recommendation. In 5th International Conference on Electronic Commerce and Web Technologies (EC-Web 04), Zaragoza, Spain.

Luotonen, A. (1995). The common log file format. Retrieved from http://www.w3.org/pub/WWW/Daemon/User/Config/Logging.html

Mobasher, B., Cooley, R., & Srivastava, J. (1999, November). Creating adaptive Web sites through usage-based clustering of URLs. In Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99).

Nasraoui, O., & Rojas, C. (2003, March). From static to dynamic Web usage mining: Towards scalable profiling and personalization with evolutionary computation. In Workshop on Information Technology, Rabat, Morocco.

Spiliopoulou, M., Faulstich, L.C., & Winkler, K. (1999, July). A data miner analyzing the navigational behaviour of Web users. In Workshop on Machine Learning in User Modeling of the ACAI'99 International Conference, Creta, Greece.

Toolan, F., & Kusmerick, N. (2002). Mining Web logs for personalized site maps. In Third International Conference on Web Information Systems Engineering (WISEw'02) (pp. 232-237), Singapore.

Yang, H., Parthasarathy, S., & Reddy, S. (2002). On the use of temporally constrained associations for Web log mining. In WEBKDD, Edmonton, Canada.

Yang, Q., & Zhang, H. (2001). Integrating Web prefetching and caching using prediction models. World Wide Web, 4(4), 299-321.
KEY TERMS

Adaptive Web Site: A Web site that semi-automatically improves its organization and presentation by learning from user access patterns. Web usage mining techniques are employed to determine the adaptation of the site.

Browser Caching: A Web browser keeps a local copy of server pages in an area called the cache on the client's computer, in order to avoid repeated requests to the server. However, this also makes server logs incomplete, because some requests are served from the cache. A related issue is the management of the cache to improve its hit rate.

Common Log Format (CLF): A W3C standard format for records in a server log. The main items in the CLF are the IP address of the user, the date and time of the request, the URL of the page, the protocol, the return code of the server, and the size of the page if the request is successful.

Server Log: A file that a Web server keeps about requests on its pages from users. It is usually in a standard format such as the common log format.

Session: A single visit of a user to a Web server. A session consists of all log records of the visit.
Web Personalization: A personalized Web view or site. From a user’s perspective, a personalized Web view is the one that is customized to the user’s preferences. From the server perspective, a personalized Web site provides services tailored to its users.
Web Usage Mining: The process of identifying interesting patterns from Web server logs. Data mining and OLAP techniques are employed to analyze the data and uncover patterns.
Web Usage Mining Data Preparation Bamshad Mobasher DePaul University, USA
INTRODUCTION Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal of Web usage mining is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. Analyzing such data can help these organizations determine the lifetime value of clients, design cross marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. An important task in any data-mining application is the creation of a suitable target dataset to which data mining and statistical algorithms are applied. This is particularly important in Web usage mining due to the characteristics of clickstream data and its relationship to other related data collected from multiple sources and across multiple channels. The data preparation process is often the most time-consuming and computationallyintensive step in the Web usage mining process and often requires the use of special algorithms and heuristics not commonly employed in other domains. This process is critical to the successful extraction of useful patterns from the data. This process may involve preprocessing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data-mining operations. Collectively, we refer to this process as data preparation. In this article, we summarize the essential tasks and requirements for the data preparation stage of the Web usage mining process.
BACKGROUND The primary data sources used in Web usage mining are the server log files, which include Web server access logs and application server logs. Additional data sources that are also essential for both data preparation and pattern discovery include the site files and meta-data, operational databases, application templates, and domain knowledge. In some cases and for some users, additional data
may be available due to client-side or proxy-level (Internet service provider) data collection, as well as from external clickstream or demographic data sources (e.g., ComScore, NetRatings, MediaMetrix, and Acxiom). Much of the research and practice in usage data preparation has focused on preprocessing and integrating these data sources for different types of analyses. Usage data preparation presents a number of unique challenges that have led to a variety of algorithms and heuristic techniques for preprocessing tasks, such as data fusion and cleaning, user and session identification, pageview identification (Cooley et al., 1999). The successful application of data-mining techniques to Web usage data is highly dependent on the correct application of the preprocessing tasks. Furthermore, in the context of e-commerce data analysis and Web analytics, these techniques have been extended to allow for the discovery of important and insightful user and site metrics (Kohavi et al., 2004). Figure 1 provides a summary of the primary tasks and elements in usage data preprocessing. We begin by providing a summary of data types commonly used in Web usage mining and then provide a brief discussion of some of the primary data preparation tasks. The data obtained through various sources can be categorized into four primary groups (Cooley et al., 1999; Srivastava et al., 2000).
Figure 1. Summary of data preparation tasks for Web usage mining: operational databases (customers, products, orders), site content and structure, domain knowledge, and Web and application server logs feed the usage preprocessing steps (data fusion, data cleaning, pageview identification, sessionization, episode identification); the preprocessed clickstream data then undergoes data transformation (data integration, data aggregation, data generalization) into a user transaction database.

Usage Data

The log data collected automatically by the Web and application servers represents the fine-grained navigational behavior of visitors. Each hit against the server, corresponding to an HTTP request, generates a single
entry in the server access logs. Each log entry (depending on the log format) may contain fields identifying the time and date of the request, the IP address of the client, the resource requested, possible parameters used in invoking a Web application, status of the request, HTTP method used, the user agent (browser and operating system type and version), the referring Web resource, and, if available, client-side cookies that uniquely identify a repeat visitor. Depending on the goals of the analysis, the data need to be transformed and aggregated at different levels of abstraction. In Web usage mining, the most basic level of data abstraction is that of a pageview. A pageview is an aggregate representation of a collection of Web objects contributing to the display on a user’s browser resulting from a single user action (such as a click-through). Conceptually, each pageview can be viewed as a collection of Web objects or resources representing a specific user event (e.g., reading an article, viewing a product page, or adding a product to the shopping cart. At the user level, the most basic level of behavioral abstraction is that of a session. A session is a sequence of pageviews by a single user during a single visit. The notion of a session can be abstracted further by selecting a subset of pageviews in the session that is significant or relevant for the analysis tasks at hand.
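The pageview and session abstractions described above can be captured by simple data structures; the following Python sketch is one possible representation (the attribute names are illustrative, not a standard schema).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class Pageview:
    """Aggregate representation of one user action (e.g., a click-through)."""
    pageview_id: str          # normally the URL uniquely representing the pageview
    timestamp: datetime
    pageview_type: str = "information page"   # e.g., product view, category view, index page
    content_attributes: List[str] = field(default_factory=list)  # e.g., keywords

@dataclass
class Session:
    """A sequence of pageviews by a single user during a single visit."""
    user_id: str
    pageviews: List[Pageview] = field(default_factory=list)

    def episode(self, relevant_types):
        """Select the subset of pageviews relevant for a given analysis task."""
        return [p for p in self.pageviews if p.pageview_type in relevant_types]
```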
Content Data The content data in a site are the collection of objects and relationships that are conveyed to the user. For the most part, these data are comprised of combinations of textual material and images. The data sources used to deliver or generate this data include static HTML/XML pages, multimedia files, dynamically generated page segments from scripts, and collections of records from the operational databases. The site content data also include semantic or structural meta-data embedded within the site or individual pages, such as descriptive keywords, document attributes, semantic tags, or HTTP variables. The underlying domain ontology for the site also is considered part of the content data. Domain ontologies may include conceptual hierarchies over page contents, such as product categories, explicit representations of semantic content and relationships via an ontology language such as RDF, or a database schema over the data contained in the operational databases.
Structure Data The structure data represent the designer’s view of the content organization within the site. This organization is captured via the inter-page linkage structure among pages, as reflected through hyperlinks. The structure data also include the intra-page structure of the content within a
page. For example, both HTML and XML documents can be represented as tree structures over the space of tags in the page. The hyperlink structure for a site normally is captured by an automatically generated site map. A sitemapping tool must have the capability to capture and represent the inter- and intra-pageview relationships. For dynamically generated pages, the site-mapping tools either must incorporate intrinsic knowledge of the underlying applications and scripts or must have the ability to generate content segments using a sampling of parameters passed to such applications or scripts.
User Data The operational database(s) for the site may include additional user profile information. Such data may include demographic information about registered users, user ratings on various objects such as products or movies, past purchase or visit histories of users, as well as other explicit or implicit representations of a user’s interests. Some of these data can be captured anonymously, as long as there is the ability to distinguish among different users. For example, anonymous information contained in clientside cookies can be considered part of the users’ profile information and can be used to identify repeat visitors to a site. Many personalization applications require the storage of prior user profile information.
MAIN THRUST As noted in Figure 1, the required high-level tasks in usage data preprocessing include the fusion and synchronization of data from multiple log files, data cleaning, pageview identification, user identification, session identification (or sessionization), episode identification, and the integration of clickstream data with other data sources, such as content or semantic information, as well as user and product information from operational databases. Data fusion refers to the merging of log files from several Web and application servers. This may require global synchronization across these servers. In the absence of shared embedded session ids, heuristic methods based on the referrer field in server logs, along with various sessionization and user identification methods (see following), can be used to perform the merging. This step is essential in inter-site Web usage mining, where the analysis of user behavior is performed over the log files for multiple related Web sites (Tanasa & Trousse, 2004). Data cleaning is usually site-specific and involves tasks such as removing extraneous references to embedded objects, style files, graphics, or sound files, and removing references due to spider navigations. The latter task can be performed by maintaining a list of known 1227
spiders, using heuristics or classification algorithms to build models of spider and Web robot navigations (Tan & Kumar, 2002). Client- or proxy-side caching often can result in missing access references to those pages or objects that have been cached. Missing references due to caching can be inferred heuristically through path completion, which relies on the knowledge of site structure and referrer information from server logs (Cooley et al., 1999). In the case of dynamically generated pages, form-based applications using the HTTP POST method result in all or part of the user input parameter not being appended to the URL accessed by the user (though, in the latter case, it is possible to recapture the user input through packet sniffers on the server side). Identification of pageviews is heavily dependent on the intra-page structure of the site as well as on the page contents and the underlying site domain knowledge. For a single frame site, each HTML file has a one-to-one correlation with a pageview. However, for multi-framed sites, several files make up a given pageview. In addition, it may be desirable to consider pageviews at a higher level of aggregation, where each pageview represents a collection of pages or objects (e.g., pages related to the same concept category). In order to provide a flexible framework for a variety of data-mining activities, a number of attributes must be recorded with each pageview. These attributes include the pageview id (normally a URL uniquely representing the pageview), static pageview type (e.g., information page, product view, category view, or index page), and other metadata, such as content attributes (e.g., keywords or product attributes). The analysis of Web usage does not require knowledge about a user’s identity. However, it is necessary to distinguish among different users. In the absence of authentication mechanisms, the most widespread approach to distinguishing among unique visitors is the use of client-side cookies. Not all sites, however, employ cookies, and, due to privacy concerns, client-side cookies sometimes are disabled by users. IP addresses alone generally are not sufficient for mapping log entries onto the set of unique visitors. This is due mainly to the proliferation of ISP proxy servers that assign rotating IP addresses to clients as they browse the Web. In such cases, it is possible to more accurately identify unique users through combinations of IP addresses and other information, such as the user agents and referrers (Cooley et al., 1999). Since a user may visit a site more than once, the server logs record multiple sessions for each user. We use the phrase user activity record to refer to the sequence of logged activities belonging to the same user. Sessionization is the process of segmenting the user activity log of each user into sessions, each representing a single visit to the site. Web sites without the benefit of additional authenti1228
cation information from users and without mechanisms such as embedded session ids must rely on heuristic methods for sessionization. The goal of a sessionization heuristic is the reconstruction from the clickstream data of the actual sequence of actions performed by one user during one visit to the site. Generally, sessionization heuristics fall into two basic categories: time-oriented or structure-oriented. Time-oriented heuristics apply either global or local timeout estimates to distinguish between consecutive sessions, while structure-oriented heuristics use either the static site structure or the implicit linkage structure captured in the referrer fields of the server logs. Various heuristics for sessionization have been identified and studied (Cooley et al., 1999). More recently, a formal framework for measuring the effectiveness of such heuristics has been proposed (Spiliopoulou et al., 2003), and the impact of different heuristics on various Web usage mining tasks has been analyzed (Berendt et al., 2002). Episode identification can be performed as a final step in preprocessing the clickstream data in order to focus on the relevant subsets of pageviews in each user session. An episode is a subset or subsequence of a session comprised of semantically or functionally related pageviews. This task may require the automatic or semi-automatic classification of pageviews into different functional types or into concept classes according to a domain ontology or concept hierarchy. In highly dynamic sites, it also may be necessary to map pageviews within each session into service-based classes according to a concept hierarchy over the space of possible parameters passed to script or database queries (Berendt & Spiliopoulou, 2000). For example, the analysis may ignore the quantity and attributes of an item added to the shopping cart and focus only on the action of adding the item to the cart. These preprocessing tasks ultimately result in a set of user sessions (episodes), each corresponding to a delimited sequence of pageviews. However, in order to provide the most effective framework for pattern discovery and analysis, data from a variety of other sources must be integrated with the preprocessed clickstream data. This is particularly the case in e-commerce applications, where the integration of both user data (e.g., demographics, ratings, purchase histories) and product attributes and categories from operational databases is critical. Such data, used in conjunction with usage data, in the mining process can allow for the discovery of important business intelligence metrics, such as customer conversion ratios and lifetime values (Kohavi et al., 2004). In addition to user and product data, e-commerce data include various product-oriented events, including shopping cart changes, order and shipping information, im-
pressions, clickthroughs, and other basic metrics, used primarily for data analysis. The successful integration of this type of data requires the creation of a site-specific event model based on which subsets of a user’s clickstream are aggregated and mapped to specific events, such as the addition of a product to the shopping cart. Generally, the integrated e-commerce data are stored in final transaction database. To enable full-featured Web analytics applications, these data are often stored in a data warehouse called an e-commerce data mart. The e-commerce data mart is a multi-dimensional database integrating data from various sources and at different levels of aggregation. It can provide pre-computed e-metrics along multiple dimensions and is used as the primary data source for OLAP (Online Analytical Processing) for data visualization and in data selection for a variety of data-mining tasks (Buchner & Mulvenna, 1999; Kimbal & Merz, 2000).
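As an illustration of a site-specific event model, the following Python sketch maps URL prefixes in a clickstream to e-commerce events such as shopping cart changes. The URL patterns and event names are hypothetical and would have to be defined per site.

```python
# Hypothetical site-specific event model: URL prefixes mapped to business events.
EVENT_MODEL = {
    '/cart/add':    'shopping_cart_change',
    '/cart/remove': 'shopping_cart_change',
    '/checkout':    'order',
    '/product/':    'product_impression',
}

def map_session_to_events(session_urls):
    """Aggregate a user's clickstream into the e-commerce events it contains."""
    events = []
    for url in session_urls:
        for prefix, event in EVENT_MODEL.items():
            if url.startswith(prefix):
                events.append(event)
                break
    return events

print(map_session_to_events(['/product/42', '/cart/add?item=42', '/checkout']))
# ['product_impression', 'shopping_cart_change', 'order']
```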
FUTURE TRENDS The integration of content, structure, and user data in various phases of the Web usage mining process may be essential in providing the ability to further analyze and reason about the discovered patterns. For example, the integration of semantic knowledge from the site content or semantic attributes of products can be used by personalization systems to provide more useful recommendations (Dai & Mobasher, 2004; Gahni & Fano, 2003). Thus, an important area of future work in Web usage preprocessing is the seamless integration of semantic and structural knowledge with the clickstream data. One direct source of semantic knowledge that can be integrated into the mining process is the collection of content features associated with items or pageviews on a Web site. These features include keywords, phrases, category names, and specific attributes associated with items or products, such as price, brand, and so forth. Content preprocessing involves the extraction of relevant features from text and meta-data. Further preprocessing on content features can be performed by applying textmining techniques. For example, classification of content features based on a concept hierarchy can be used to limit the discovered usage patterns to those containing pageviews about a certain subject or class of products. Performing clustering or association rule mining on the feature space can lead to composite features representing concept categories. In many Web sites, it may be possible and beneficial to classify pageviews into functional categories representing identifiable tasks (e.g., completing an online loan application). The mapping of pageviews onto a set of concepts or tasks allows for the analysis of user sessions at different levels of abstraction according to a concept hierarchy or
according to the types of activities performed by users (Eirinaki et al., 2003; Oberle et al., 2003).
CONCLUSION The data preparation stage is one of the most important steps in the Web usage mining process and is critical to the successful extraction of useful patterns from the data. This process may involve preprocessing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data-mining operations. Preprocessing of Web usage data often requires the use of special algorithms and heuristics not commonly employed in other domains. This includes specialized techniques for data cleaning (including the detection of Web robots), pageview identification, session identification (or sessionization), and the integration of clickstream data with other data sources, such as user and product information from operational databases. We have summarized these essential tasks and discussed the requirements for the data preparation stage of the Web usage mining process.
REFERENCES Berendt, B., Mobasher, B., Nakagawa, M., & Spiliopoulou, M. (2002). The impact of site structure and user environment on session reconstruction in Web usage analysis. Proceedings of the WebKDD 2002 Workshop at the ACM Conference on Knowledge Discovery in Databases (KDD’02), Edmonton, Alberta, Canada. Berendt, B., & Spiliopoulou, M. (2000). Analysing navigation behaviour in Web sites integrating multiple information systems. VLDB Journal, 9(1), 56-75. Buchner, A., & Mulvenna, M.D. (1999). Discovering Internet marketing intelligence through online analytical Web usage mining. SIGMOD Record, 4(27), 54-61. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1(1), 5-32. Dai, H., & Mobasher, B. (2004). Integrating semantic knowledge with Web usage mining for personalization. In A. Scime (Ed.), Web mining: Applications and techniques (pp. 276-306). Hershey, PA: Idea Group Publishing. Eirinaki, M., Vazirgiannis, M., & Varlamis, I. (2003). SEWeP: Using site semantics and a taxonomy to enhance the Web personalization process. In Proceedings of the 9th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C.

Ghani, R., & Fano, A. (2002). Building recommender systems using a knowledge base of product semantics. Proceedings of the Workshop on Recommendation and Personalization in E-Commerce, International Conference on Adaptive Hypermedia and Adaptive Web Based Systems, Malaga, Spain.

Kimball, R., & Merz, R. (2000). The data Webhouse toolkit: Building the Web-enabled data warehouse. New York: John Wiley & Sons.

Kohavi, R., Mason, L., Parekh, R., & Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning, 57, 83-113.

Oberle, D., Berendt, B., Hotho, A., & Gonzalez, J. (2003). Conceptual user tracking. Proceedings of the Atlantic Web Intelligence Conference, Madrid, Spain.

Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. (2003). A framework for the evaluation of session reconstruction heuristics in Web usage analysis. INFORMS Journal of Computing, 15(2), 171-190.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.

Tan, P.N., & Kumar, V. (2002). Discovery of Web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery, 6(1), 9-35.

Tanasa, D., & Trousse, B. (2004). Advanced data preprocessing for intersite Web usage mining. IEEE Intelligent Systems, 19(2), 59-65.

KEY TERMS

Episode: A subset or subsequence of a session comprised of semantically or functionally related pageviews.

Hit (Request): An HTTP request made by a Web client agent (e.g., a browser) for a particular Web resource. It can be explicit (user initiated) or implicit (Web client initiated). Explicit Web requests are sometimes called clickthroughs.

Pageview: An aggregate representation of a collection of Web objects contributing to the display on a user's browser resulting from a single user action (such as a clickthrough).

Session: A delimited sequence of pageviews attributed to a single user during a single visit to a site.

User Activity Record: The collection of all sessions belonging to a particular user during a specified time period.

Web Resource: A resource accessible through the HTTP protocol from a Web server. Web resources (or Web objects) may be static, such as images or existing HTML pages; or they may be dynamic, such as database-driven Web applications or Web services. Each Web resource is identified uniquely by a Uniform Resource Identifier (URI).

Web Usage Mining: The automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites.
Web Usage Mining through Associative Models Paolo Giudici University of Pavia, Italy Paola Cerchiello University of Pavia, Italy
INTRODUCTION The aim of this contribution is to show how the information, concerning the order in which the pages of a Web site are visited, can be profitably used to predict the visit behaviour at the site. Usually every click corresponds to the visualization of a Web page. Thus, a Web clickstream defines the sequence of the Web pages requested by a user. Such a sequence identifies a user session. Typically, a Web usage mining analysis only concentrates on the part of each user session concerning the access at one specific site. The set of the pages seen in a user session, on a determinate site, is usually referred to with the term server session or, more simply, visit. Our objective is to show how associative models can be used to understand the most likely paths of navigation in a Web site, with the aim of predicting, possibly online, which pages will be seen, having seen a specific path of pages in the past. Such analysis can be very useful to understand, for instance, what is the probability of seeing a page of interest (such as the buying page in an e-commerce site) coming from a specified page. Or what is the probability of entering or (exiting) the Web site from any particular page. The two most successful association models for Web usage mining are: sequence rules, which belong to the class of local data mining methods known as association rules; and Markov chain models, which can be seen, on the other hand, as (global) predictive data mining methods.
BACKGROUND We now describe what a sequence rule is. For more details the reader can consult a recent text on data mining, such as Han & Kamber (2001), Witten & Frank (1999) or, from a more statistical viewpoint, Hand et al. (2001), Hastie et al. (2001) and Giudici (2003). An association rule is a statement between two sets of binary variables (itemsets) A and B, that can be written in
the form A → B, to be interpreted as a logical statement: if A, then B. If the rule is ordered in time, we have a sequence rule and, in this case, A precedes B. In Web clickstream analysis, a sequence rule is typically indirect: namely, between the visit of page A and the visit of page B, other pages can be seen. On the other hand, in a direct sequence rule, A and B are seen consecutively. A sequence rule model is, essentially, an algorithm that searches for the most interesting rules in a database. The most common of such algorithms is the Apriori model, introduced by Agrawal et al. (1995). In order to find a set of rules, statistical measures of "interestingness" have to be specified. The measures most commonly used in Web mining to evaluate the importance of a sequence rule are the indexes of support and confidence. The support is a relative frequency that indicates the percentage of users that have visited the two pages in succession. In the presence of a high number of visits, as is usually the case, the support for the rule approximates the probability that a user session contains the two pages in sequence, while the confidence approximates the conditional probability that page B is subsequently requested in a server session in which page A has been seen. The above refers to itemsets A and B containing one page each; however, each itemset can contain more than one page, and the previous definitions carry through. The order of a sequence is the total number of pages involved in the rule. For instance, the rules discussed previously are sequences of order two. Therefore, the output of a sequence search algorithm (e.g., the Apriori algorithm) can be visualised in terms of the sequence rules with the highest interestingness, as measured, for instance, by the support and confidence of the rules that are selected.
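The support and confidence indexes described above can be estimated directly from a collection of server sessions; the following Python sketch computes both for a rule A → B, in either its indirect or direct form (the session representation is an assumption).

```python
def support_confidence(sessions, a, b, direct=False):
    """Estimate support and confidence of the sequence rule A -> B over server sessions.
    With direct=True, B must be requested immediately after A."""
    n = len(sessions)
    n_a = sum(1 for s in sessions if a in s)                      # sessions containing page A
    if direct:
        n_ab = sum(1 for s in sessions
                   if any(x == a and y == b for x, y in zip(s, s[1:])))
    else:
        n_ab = sum(1 for s in sessions
                   if a in s and b in s[s.index(a) + 1:])         # B somewhere after A
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

sessions = [['home', 'catalog', 'buy'], ['home', 'buy'], ['catalog', 'home']]
print(support_confidence(sessions, 'home', 'buy'))         # indirect rule: (0.67, 0.67)
print(support_confidence(sessions, 'home', 'buy', True))   # direct rule:   (0.33, 0.33)
```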
An important point that must be made about sequence rules is that they are typically indirect: that is, the sequence A → B means that A has been seen before B, but not necessarily immediately before; other pages might have been seen in between. From an interpretational viewpoint, we underline that indirect rules are not very useful; direct rules, for which A → B means that B is seen consecutively after A, are much more interpretable.
MAIN THRUST

A graphical model [see, for example, Edwards (1995), Jensen (1996), Lauritzen (1996), Cox & Wermuth (1996), Whittaker (1996) and Cowell et al. (1999)] is a family of probability distributions incorporating the conditional independence assumptions represented by a graph. It is specified via a graph that depicts the local relations among the variables (which are represented with nodes). Undirected graphs give rise to symmetric graphical models (such as graphical loglinear models and graphical Gaussian models). Directed acyclic graphs (DAGs) give rise to recursive graphical models, which are used in probabilistic expert systems. It is known that recursive graphical models constitute a powerful tool for predictive data mining, because of their fundamental assumption of causal dependency between variables. On the other hand, symmetric graphical models can be considered an important and valid tool in the preliminary phase of analysis, because they can show the main relevant associations, useful for constructing a subsequent recursive model. Both models have been used and compared with association rules in Web usage mining [see, e.g., Heckerman et al. (2000); Blanc & Giudici (2002) and, for a review, Giudici (2003)]. Although results are comparable, we remark that graphical models are usually built from contingency tables and, therefore, cannot simply take order into account. We now consider a different model for the analysis of Web usage transactional datasets. Association rules (of which sequence rules are a special instance) are an instance of a local model: they take into account only a portion of the data, that is, the portion that satisfies the rule being examined. We now consider a global model, for which association patterns are discovered on the basis of the whole dataset. A global model suited to analyse Web clickstream data is the Markov chain model. Precisely, here we consider discrete Markov chains. The idea is to introduce dependence between time-specific variables. In each session, to each time point i,
here corresponding to the i-th click, it corresponds a discrete random variable, with as many modalities as the number of pages (these are named states of the chain). The observed i-th page in the session is the observed realisation of the Markov chain, at time i, for that session. Time can go from i=1 to i=T, and T can be any finite number. Note that a session can stop well before T: in this case the last page seen is said an absorbing state (end_session for our data). A Markov chain model establishes a probabilistic dependence between what is seen before time i, and what will be seen at time i. In particular, a first-order Markov chain, which is the model we consider here, establishes that what is seen at time i depends only on what is seen at time i-1. This short-memory dependence can be assessed by a transition matrix that establishes what is the probability of going from any one page to any other page in one step only. For example, with 36 pages there are 36 X 36 probabilities of this kind. The conditional probabilities in the transition matrix can be estimated on the basis of the available conditional frequencies. If we add the assumption that the transition matrix is constant in time, as we shall do, we can use the frequencies of any two adjacent pairs of time-ordered clicks to estimate the conditional probabilities. Note the analogy of Markov chains with direct sequences. It can be shown that a first order Markov chain is a model for direct sequences of order two; a secondorder Markov model is a model for direct sequences of order three, and so on. The difference is that the Markov chain model is a global and not a local model. This is mainly reflected in the fact that Markov chains consider all pages and not only those with a high support. Furthermore, the Markov model is a probabilistic model and, as such, allows inferential results to be obtained. For space purposes, we now briefly consider some of the results that can be obtained from the application of Markov chain models. For more details, see Giudici (2003). For instance, we can evaluate where it is most likely to enter the site. To obtain this we have to consider the transition probabilities of the start_session row. We can also consider the most likely exit pages. To obtain this we have to consider the transition probabilities of the end_session column. We can also build up several graphical structures that correspond to paths, with an associated occurrence probability. For example, from the transition matrix we can establish a path that connects nodes through the most likely transitions.
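A minimal Python sketch of how the transition matrix can be estimated from the observed sessions, with artificial start_session and end_session states added, is shown below; the session format is an assumption.

```python
from collections import defaultdict

def transition_matrix(sessions):
    """Estimate first-order transition probabilities, adding artificial
    start_session and end_session states to every session."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        path = ['start_session'] + list(s) + ['end_session']
        for cur, nxt in zip(path, path[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        probs[cur] = {nxt: c / total for nxt, c in nxts.items()}
    return probs

P = transition_matrix([['home', 'catalog', 'buy'], ['home', 'buy'], ['catalog', 'home']])
print(P['start_session'])                  # most likely entry pages
print(P['home'].get('end_session', 0.0))   # probability of exiting the site from 'home'
```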
FUTURE TRENDS
We have considered two classes of statistical models to model Web mining data. It is quite difficult to choose between them. Here the situation is complicated by the fact that we have to compare local models (such as sequence rules) with global models (such as Markov chains). An expected future trend is the development of systematic ways to compare the two model classes. For global models, such as Markov chains, statistical evaluation can proceed in terms of classical scoring methods, such as likelihood ratio scoring, AIC or BIC; or, alternatively, by means of computationally intensive predictive evaluation, based on cross-validation and/or bootstrapping. But the real problem is how to compare them with sequence rules. A simple and natural scoring function of a sequence rule is its support, which gives the proportion of the population to which the rule applies. Another measure of interestingness of a rule, with respect to a situation of irrelevance, is the lift of the rule itself. The lift is the ratio between the confidence of the rule A → B and the support of B. Recalling the definition of the confidence index, the lift compares the observed frequency of the rule with that corresponding to independence between A and B.
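In symbols (consistent with the definitions given in the Key Terms below), the lift can be written as

\[
\mathrm{lift}\{A \rightarrow B\} \;=\; \frac{\mathrm{confidence}\{A \rightarrow B\}}{\mathrm{support}\{B\}} \;=\; \frac{\mathrm{support}\{A \rightarrow B\}}{\mathrm{support}\{A\}\,\mathrm{support}\{B\}},
\]

so that a lift close to 1 indicates that A and B occur together about as often as expected under independence, while values above 1 indicate a positive association.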
CONCLUSION To summarise, we believe that the assessment of an association pattern has to be judged by their utility for the objectives of the analysis at hand. In the present article, for instance, the informative value of the start_session → end_session rule, which in Table 1 has the largest support and confidence (100%), is, for instance, null. On the other hand, the informative value of the rules that go from start_session to other pages, and from other pages to end_session can be extremely important for the design of the Web site. We finally remark that the methodology presented here has been applied to several Web usage mining logfiles; the reader can consult, for example, Giudici & Castelo (2003), Blanc & Giudici (2002), and Castelo & Giudici (2001). It has also been applied to other data mining problems: see, for example, Brooks et al. (2003), Giudici & Green (1999) and Giudici (2001).
REFERENCES

Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A.I. (1995). Fast discovery of association rules. In Advances in knowledge discovery and data mining. Cambridge: AAAI/MIT Press.

Blanc, E., & Giudici, P. (2002). Statistical models for Web clickstream analysis. Italian Journal of Applied Statistics, 14, 123-134.

Brooks, S.P., Giudici, P., & Roberts, G.O. (2003). Efficient construction of reversible jump MCMC proposal distributions. Journal of the Royal Statistical Society, Series B, 1, 1-37.

Castelo, R., & Giudici, P. (2001). Association models for Web mining. Journal of Knowledge Discovery and Data Mining, 5, 183-196.

Cowell, R.G., Dawid, A.P., Lauritzen, S.L., & Spiegelhalter, D.J. (1999). Probabilistic networks and expert systems. New York: Springer-Verlag.

Cox, D.R., & Wermuth, N. (1996). Multivariate dependencies: Models, analysis and interpretation. London: Chapman & Hall.

Edwards, D. (1995). Introduction to graphical modelling. New York: Springer-Verlag.

Giudici, P. (2001). Bayesian data mining, with application to credit scoring and benchmarking. Applied Stochastic Models in Business and Industry, 17, 69-81.

Giudici, P. (2003). Applied data mining. London: Wiley.

Giudici, P., & Castelo, R. (2003). Improving MCMC model search for data mining. Machine Learning, 50, 127-158.

Giudici, P., & Green, P.J. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86, 785-801.

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. New York: Morgan Kaufmann.

Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of data mining. New York: MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.

Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering and data visualization. Journal of Machine Learning Research, 1, 49-75.

Whittaker, J. (1996). Graphical models in applied multivariate statistics. Chichester: Wiley.

Witten, I., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementation. New York: Morgan Kaufmann.
KEY TERMS

Association Rules: Let $X_1, \ldots, X_p$ be a collection of random variables. In general, a pattern for such variables identifies a subset of all possible observations over them. A rule is a logical statement between two patterns, say α and β, written as α → β.

Chain Graphical Models: The graph contains both undirected and directed links. They can model both symmetric and asymmetric relationships. They give rise to graphical chain models.

Conditional Independence: Consider two random variables X and Y. It will be said that X and Y are independent conditionally on a third random variable (or random vector), Z, if the joint probability distribution of X and Y, conditionally on Z, can be decomposed into the product of two factors, the conditional density of X given Z and the conditional density of Y given Z. In formal terms, X and Y are independent conditionally on Z (in symbols, $X \perp Y \mid Z$) if

$$f(x, y \mid Z = z) = f(x \mid Z = z)\, f(y \mid Z = z).$$

Confidence of a Rule: The confidence for the rule A → B is obtained by dividing the number of server sessions that satisfy the rule by the number of sessions containing page A:

$$\mathrm{confidence}\{A \rightarrow B\} = \frac{N_{A \rightarrow B}}{N_A} = \frac{N_{A \rightarrow B}/N}{N_A/N} = \frac{\mathrm{support}\{A \rightarrow B\}}{\mathrm{support}\{A\}}.$$

Directed Graphical Models: The graph contains only directed links. They are used to model asymmetric relations among the variables. They give rise to recursive graphical models, also known as probabilistic expert systems.

Graphical Models: A graphical model is a family of probability distributions incorporating the conditional independence assumptions represented by a graph. It is specified via a graph that depicts the local relations among the variables (that are represented with nodes).

Lift of a Rule: The lift of a rule relates the confidence of the rule to the support of the head of the same rule:

$$\mathrm{lift}\{A \rightarrow B\} = \frac{\mathrm{confidence}\{A \rightarrow B\}}{\mathrm{support}\{B\}} = \frac{\mathrm{support}\{A \rightarrow B\}}{\mathrm{support}\{A\}\,\mathrm{support}\{B\}}.$$

Support of a Rule: Consider a sequence A → B and indicate as $N_{A \rightarrow B}$ the number of visits in which such a sequence appears at least once. Let N be the total number of server sessions. The support for the rule A → B is obtained by dividing the number of server sessions that satisfy the rule by the total number of server sessions:

$$\mathrm{support}\{A \rightarrow B\} = \frac{N_{A \rightarrow B}}{N}.$$

Symmetric Graphical Models: The graph contains only undirected links. They are used to model symmetric relations among the variables. They give rise to the symmetric graphical models.
World Wide Web Personalization Olfa Nasraoui University of Louisville, USA
INTRODUCTION The Web information age has brought a dramatic increase in the sheer amount of information (Web content), in the access to this information (Web usage), and in the intricate complexities governing the relationships within this information (Web structure). Hence, not surprisingly, information overload when searching and browsing the World Wide Web (WWW) has become the plague du jour. One of the most promising and potent remedies against this plague comes in the form of personalization. Personalization aims to customize the interactions on a Web site, depending on the user’s explicit and/or implicit interests and desires.
BACKGROUND
The Birth of Personalization: No Longer an Option but a Necessity

The move from traditional physical stores of products or information (e.g., grocery stores or libraries) to virtual stores of products or information (e.g., e-commerce sites and digital libraries) has practically eliminated physical constraints, traditionally limiting the number and variety of products in a typical inventory. Unfortunately, the move from the physical to the virtual space has drastically limited the traditional three-dimensional layout of products, for which access is further facilitated thanks to the sales representative or librarian who knows the products and the customers, to a dismal planar interface without the sales representative or librarian. As a result, the customers are drowned by the huge number of options, most of which they may never even get to know. In the late 1990s, Jeff Bezos, CEO of Amazon, once said, "If I have 3 million customers on the Web, I should have 3 million stores on the Web" (Schafer et al., 1999). Hence, in both the e-commerce sector and digital libraries, Web personalization has become more of a necessity than an option. Personalization can be used to achieve several goals (see Table 1), ranging from increasing customer loyalty on e-commerce sites (Schafer et al., 1999) to enabling better search (Joachims, 2002).

Table 1. Possible goals of Web personalization
• Converting browsers into buyers
• Improving Web site design and usability
• Improving customer retention and loyalty
• Increasing cross-sell by recommending items related to the ones being considered
• Helping visitors to quickly find relevant information on a Web site
• Making results of information retrieval/search more aware of the context and user interests

Modes of Personalization

Personalization falls into four basic categories, ordered from the simplest to the most advanced:

1. Memorization: In this simplest and most widespread form of personalization, user information, such as name and browsing history, is stored (e.g., using cookies), to be used later to recognize and greet the returning user. It usually is implemented on the Web server. This mode depends more on Web technology than on any kind of adaptive or intelligent learning. It also can jeopardize user privacy.
2. Customization: This form of personalization takes as input a user's preferences from registration forms in order to customize the content and structure of a Web page. This process tends to be static and manual or, at best, semi-automatic. It usually is implemented on the Web server. Typical examples include personalized Web portals such as My Yahoo!.
3. Guidance or Recommender Systems: A guidance-based system tries to automatically recommend hyperlinks that are deemed to be relevant to the user's interests in order to facilitate access to the needed information on a large Web site (Mobasher et al., 2000; Nasraoui et al., 2002; Schafer et al., 1999). It usually is implemented on the Web server and relies on data that reflect the user's interest implicitly (browsing history as recorded in Web server logs) or explicitly (user profile as entered through a registration form or questionnaire). This approach will form the focus of our overview of Web personalization.
4. Task Performance Support: In these client-side personalization systems, a personal assistant executes actions on behalf of the user in order to facilitate access to relevant information. This approach requires heavy involvement on the part of the user, including access, installation, and maintenance of the personal assistant software. It also has very limited scope in the sense that it cannot use information about other users with similar interests.
In the following, we concentrate on the third mode of personalization—automatic Web personalization based on recommender systems—because they necessitate a minimum or no explicit input from the user. Also, since they are implemented on the server side, they benefit from a global view of all users’ activities and interests in order to provide an intelligent (learns user profiles automatically), and yet transparent (requiring very little or no explicit input from the user) Web personalization experience.
MAIN THRUST

Phases of Automatic Web Personalization

The Web personalization process can be divided into four distinct phases (Mobasher et al., 2000; Schafer et al., 1999):

1. Collection of Web Data: Implicit data includes past activities/clickstreams as recorded in Web server logs and/or via cookies or session tracking modules. Explicit data usually comes from registration forms and rating questionnaires. Additional data, such as demographic and application data (e.g., e-commerce transactions), also can be used. In some cases, Web content, structure, and application data can be added as additional sources of data in order to shed more light on the next stages.
2. Preprocessing of Web Data: Data is frequently preprocessed to put it into a format that is compatible with the analysis technique to be used in the next step. Preprocessing may include cleaning data of inconsistencies, filtering out irrelevant information according to the goal of analysis (e.g., automatically generated requests to embedded graphics will be recorded in Web server logs, even though they add little information about user interests), and completing the missing links (due to caching) in incomplete clickthrough paths. Most importantly, unique sessions need to be identified from the different requests, based on a heuristic, such as requests originating from an identical IP address within a given time period.
3. Analysis of Web Data: Also known as Web Usage Mining (Nasraoui et al., 1999; Spiliopoulou & Faulstich, 1999; Srivastava et al., 2000), this step applies machine learning or data-mining techniques in order to discover interesting usage patterns and statistical correlations between Web pages and user groups. This step frequently results in automatic user profiling and is typically applied off-line so that it does not add a burden to the Web server.
4. Decision-Making/Final Recommendation Phase: The last phase in personalization makes use of the results of the previous analysis step to deliver recommendations to the user. The recommendation process typically involves generating dynamic Web content on the fly, such as adding hyperlinks to the last Web page requested by the user. This can be accomplished using a variety of Web technology options, such as CGI programming.
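As an illustration of how the off-line analysis phase and the on-line recommendation phase fit together, the following minimal Python sketch matches the pages requested so far in the current session against aggregate usage profiles (assumed to have been discovered off-line, e.g., by clustering past sessions) and returns candidate links to append to the next page. The profile structure and function names are hypothetical.

```python
def recommend(current_session, profiles, top_n=3):
    """Recommend URLs from the aggregate usage profile that best matches the
    pages requested so far in the current (partial) session."""
    visited = set(current_session)

    def overlap(profile_urls):
        # fraction of the profile's URLs already visited in this session
        return len(visited & set(profile_urls)) / len(profile_urls)

    best = max(profiles, key=lambda p: overlap(p['urls']))
    # suggest the profile's URLs that the user has not seen yet
    return [u for u in best['urls'] if u not in visited][:top_n]

profiles = [
    {'name': 'sports fans',  'urls': ['/sports', '/scores', '/tickets']},
    {'name': 'researchers',  'urls': ['/papers', '/datasets', '/tools']},
]
print(recommend(['/sports'], profiles))  # ['/scores', '/tickets']
```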
Categories of Data Used in Web Personalization

The Web personalization process relies on one or more of the following data sources (Eirinaki & Vazirgiannis, 2003):

1. Content Data: Text, images, and so forth in HTML pages, as well as information in databases.
2. Structure Data: Hyperlinks connecting the pages to one another.
3. Usage Data: Records of the visits to each Web page on a Web site, including time of visit, IP address, and so forth. This data typically is recorded in Web server logs, but it also can be collected using cookies or other session tracking tools.
4. User Profile: Information about the user, including demographic attributes (age, income, etc.) and preferences that are gathered either explicitly (through registration forms) or implicitly (through Web server logs). Profiles either can be static or dynamic. They also can be individualized (one per user) or aggregate (summarize several similar users in a given group).
Different Ways to Compute Recommendations

Automatic Web personalization can analyze the data to compute recommendations in different ways, including:

1. Content-Based or Item-Based Filtering: This system recommends items deemed to be similar to the items that the user liked in the past. Item similarity typically is based on domain-specific item attributes (i.e., author and subject for book items, artist and genre for music items). This approach has worked well for Amazon (Linden et al., 2003) and has the advantage of easily including brand new items in the recommendation process, since there is no need for any previous implicit or explicit user rating or purchase data to make recommendations.
2. Collaborative Filtering: Based on the assumption that users with similar past behaviors (rating, browsing, or purchase history) have similar interests, this system recommends items that are liked by other users with similar interests (Schafer et al., 1999). This approach relies on a historic record of all user interests, as can be inferred from their ratings of the items on a Web site (products or Web pages). Rating can be explicit (explicit ratings, previous purchases, customer satisfaction questionnaires) or implicit (browsing activity on a Web site). Computing recommendations can be based on a lazy or eager learning phase in order to model user interests. In lazy learning, all previous user activities simply are stored until recommendation time, when a new user is compared against all previous users to identify those who are similar and, in turn, generate recommended items that are part of these similar users' interests. Lazy models are fast in training/learning, but they take up huge amounts of memory to store all user activities and can be slow at recommendation time because of all the required comparisons. On the other hand, eager learning relies on data-mining techniques in order to learn a summarized model of user interests (decision tree, clusters/profiles, etc.) that typically requires only a small fraction of the memory needed in lazy approaches. While eager learning can be slow and, thus, is performed off-line, using a learned model at recommendation time generally is much faster than lazy approaches.
3. Rule-Based Filtering: In this approach, which is used frequently to customize products on e-commerce sites such as Dell on Line, the user answers several questions until receiving a customized result, such as a list of products. This approach is based mostly on heavy planning and manual concoctions of a judicious set of questions, on possible answer combinations, and on customizations by an expert. It suffers from a lack of intelligence (no automatic learning) and tends to be static.
Recommender Systems One of the most successful examples of personalization comes in the form of recommender systems. Several approaches to automatically generate Web recommendations based on users’ Web navigation patterns or ratings exist. Some involve learning a usage model from Web access data or from user ratings. For example, lazy modeling is used in collaborative filtering, which simply stores all users’ information and then relies on K-Nearest-Neighbors (KNN) to provide recommendations from the previous history of similar users (Schafer et al., 1999). Frequent itemsets (Mobasher et al., 2001), a partitioning of user sessions into groups of similar sessions, called session clusters (Mobasher et al., 2000; Nasraoui et al., 1999) or user profiles (Mobasher et al., 2000; Nasraoui et al., 1999), also can form a user model obtained using data mining. Association rules can be discovered off-line and then used to provide recommendations based on Web navigation patterns. Among the most popular methods, the ones based on collaborative filtering and those based on fixed support association rule discovery may be the most difficult and expensive to use. This is because, for the case of highdimensional (i.e., too many Web pages or items) and extremely sparse (i.e., most items/Web pages tend to be unrated/unvisited) Web data, it is difficult to set suitable support and confidence thresholds in order to yield reliable and complete Web usage patterns. Similarly, collaborative models may struggle with sparse data and do not scale well to a very large number of users (Schafer et al., 1999).
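As a concrete illustration of the lazy (memory-based) collaborative filtering scheme described above, the following Python sketch scores items using the K nearest previous users under cosine similarity. It is a simplified sketch, not the method of any specific system cited here; the user-vector representation, k, and the rating values are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse user vectors (dicts of item -> rating)."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_recommend(active_user, past_users, k=2, top_n=3):
    """Lazy collaborative filtering: score items liked by the k most similar
    past users and recommend those the active user has not rated/visited."""
    neighbors = sorted(past_users, key=lambda u: cosine(active_user, u), reverse=True)[:k]
    scores = {}
    for u in neighbors:
        w = cosine(active_user, u)
        for item, rating in u.items():
            if item not in active_user:
                scores[item] = scores.get(item, 0.0) + w * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# users represented as {page_or_item: implicit/explicit rating}
past = [{'A': 1, 'B': 1, 'C': 1}, {'A': 1, 'D': 1}, {'E': 1}]
print(knn_recommend({'A': 1, 'B': 1}, past))  # ['C', 'D']
```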
Challenges in WWW Personalization
WWW personalization faces several tough challenges that distinguish it from the mainstream of data mining:

1. Scalability: In order to deal with large Web sites that have huge activity, personalization systems need to be scalable (i.e., efficient in their time and memory requirements). To this end, some researchers (Nasraoui et al., 2003) have started considering Web usage data as a special case of noisy data streams (data that arrive continuously in an environment constrained by stringent memory and computational resources). Hence, the data can only be processed and analyzed sequentially and cannot be stored.
2. Accuracy: WWW personalization poses an enormous risk of upsetting users or e-commerce customers in case the recommendations are inaccurate. One promising approach (Nasraoui & Pavuluri, 2004) in this direction is to add an additional data mining phase that is separate from the one used to discover user profiles by clustering previous user sessions, and whose main purpose is to learn an accurate recommendation model. This approach differs from existing methods that do not include adaptive learning in a separate second phase and, instead, base the recommendations on simplistic assumptions (e.g., nearest-profile recommendations or deployment of pre-discovered association rules). Based on this new approach, a new method was developed for generating simultaneously accurate and complete recommendations, called the Context Ultra-Sensitive Approach based on two-step Recommender systems (CUSA-2-step-Rec) (Nasraoui & Pavuluri, 2004). CUSA-2-step-Rec relies on a committee of profile-specific URL-predictor neural networks. This approach provides recommendations that are accurate and fast to train, because only the URLs relevant to a specific profile are used to define the architecture of each network. Similar to the task of completing the missing pieces of a puzzle, each neural network is trained to predict the missing URLs of several complete ground-truth sessions from a given profile, given as input several incomplete subsessions. This is the first approach that, in a sense, personalizes the recommendation modeling process itself, depending on the user profile.
3. Evolving User Interests: Dealing with rapidly evolving user interests and highly dynamic Web sites requires a migration of the complete Web usage mining phases from an off-line framework to one that is completely online. This can only be accomplished with scalable, single-pass, evolving stream mining techniques (Nasraoui et al., 2003). Other researchers also have studied Web usage from the perspective of evolving graphs (Desikan & Srivastava, 2004).
4. Data Collection and Preprocessing: Preprocessing Web usage data is still imperfect, mainly due to the difficulty of identifying users accurately in the absence of registration forms and cookies, and due to log requests that are missing because of caching. Some researchers (Berendt et al., 2001) have proposed clickstream path completion techniques that can correct problems of accesses that do not get recorded due to client caching.
5. Integrating Multiple Sources of Data: Taking semantics into account also can enrich the Web personalization process in all its phases. A focus on techniques and architectures for more effective integration and mining of content, usage, and structure data from different sources is likely to lead to the next generation of more useful and more intelligent applications (Li & Zaiane, 2004). In particular, there recently has been an increasing interest in integrating Web mining with ideas from the semantic Web, leading to what is known as semantic Web mining (Berendt et al., 2002).
6. Conceptual Modeling for Web Usage Mining: Conceptual modeling of the Web mining and personalization process also is receiving more attention, as Web mining becomes more mature and also more complicated. Recent efforts in this direction include Meo et al. (2004) and Maier (2004).
7. Privacy Concerns: Finally, privacy adds a whole new dimension to WWW personalization. In reality, many users dislike giving away personal information. Some also may be suspicious of Web sites that rely on cookies and may even block cookies. In fact, even if a Web user agrees to give up personal information or accept cookies, there is no guarantee that Web sites will not exchange this information without the user's consent. Recently, the W3C (World Wide Web Consortium) has proposed recommendations for a standard called the Platform for Privacy Preferences (P3P) that enables Web sites to express their privacy practices in a format that can be retrieved and interpreted by client browsers. However, legal efforts still are needed to ensure that Web sites truly comply with their published privacy practices. For this reason, several research efforts (Agrawal & Srikant, 2000; Kargupta et al., 2003) have attempted to protect privacy by masking the user data, using methods such as randomization that modify the input data without significantly altering the results of data mining (a minimal sketch of this idea follows this list). The use of these techniques within the context of Web mining is still open for future research.
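As a minimal illustration of the randomization idea mentioned under Privacy Concerns, the sketch below adds zero-mean noise to a numeric attribute before it is shared; individual values are masked while simple aggregates remain approximately recoverable. The noise scale and the toy data are assumptions; this is not the specific perturbation or reconstruction machinery of Agrawal and Srikant (2000) or Kargupta et al. (2003).

# Randomization sketch: perturb each private numeric value with
# independent zero-mean noise before releasing it for mining.
import random

def perturb(values, scale=10.0):
    return [v + random.uniform(-scale, scale) for v in values]

ages = [23, 35, 41, 29, 52, 47, 31, 38]      # toy "private" attribute
masked = perturb(ages)

true_mean = sum(ages) / len(ages)
masked_mean = sum(masked) / len(masked)      # stays close to the true mean
print(round(true_mean, 1), round(masked_mean, 1))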
FUTURE TRENDS
The Web is an incubator for a large spectrum of applications involving user interaction. User preferences and expectations, together with usage patterns, form the basis for personalization. Enabling technologies include data mining, preprocessing, sequence discovery, real-time processing, scalable warehousing, document classification, user modeling, and quality evaluation models. As Web sites become larger, more competitive, and more dynamic, as users become more numerous and more demanding, and as their interests evolve, there is a crucial need for research that targets the previously mentioned enabling technologies and leads them toward the path of scalable, real-time, online, accurate, and truly adaptive performance. From another perspective, the inherent and increasing heterogeneity of the Web has required Web-based applications to integrate a variety of types of data from a variety of channels and sources. The development and application of Web-mining techniques in the context of Web content, usage, and structure data will lead to tangible improvements in many Web applications, from search engines and Web agents to Web analytics and personalization. Future efforts, investigating architectures and algorithms that can exploit and enable a more effective integration and mining of content, usage, and structure data from different sources, promise to lead to the next generation of intelligent Web applications. Table 2 summarizes most of the active areas of future efforts that target the challenges that have been discussed in the previous section.

Table 2. Projected future focus efforts in Web personalization
• Scalability in the face of huge access volumes
• Accuracy of recommendations
• Dealing with rapidly changing usage access patterns
• Reliable data collection and preprocessing
• Taking semantics into account
• Systematic conceptual modeling of the Web usage mining and personalization process
• Adhering to privacy standards
CONCLUSION
Because of the explosive proliferation of the Web, Web personalization recently has gained a big share of attention, and significant strides already have been made toward achieving WWW personalization in the face of tough challenges. However, even in this slowly maturing area, some newly identified challenges beg for increased efforts in developing scalable and accurate Web mining and personalization models that can stand up to huge, possibly noisy, and highly dynamic Web activity data. Along with some crucial challenges, we also have pointed to some possible future directions in the area of WWW personalization.
ACKNOWLEDGMENTS
The author gratefully acknowledges the support of the National Science Foundation CAREER Award IIS-0133948.
REFERENCES

Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, Texas.

Berendt, B. et al. (2001). Measuring the accuracy of sessionizers for Web usage analysis. Proceedings of the Workshop on Web Mining at the First SIAM International Conference on Data Mining.

Berendt, B., Hotho, A., & Stumme, G. (2002). Towards semantic Web mining. Proceedings of the International Semantic Web Conference (ISWC02).

Desikan, P., & Srivastava, J. (2004). Mining temporally evolving graphs. Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis at the Knowledge Discovery and Data Mining Conference, Seattle, Washington.

Eirinaki, M., & Vazirgiannis, M. (2003). Web mining for Web personalization. ACM Transactions on Internet Technology (TOIT), 3(1), 1-27.

Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD Conference.

Kargupta, H. et al. (2003). On the privacy preserving properties of random data perturbation techniques. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida.

Li, J., & Zaiane, O. (2004). Using distinctive information channels for a mission-based Web-recommender system. Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis at the ACM KDD: Knowledge Discovery and Data Mining Conference, Seattle, Washington.

Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1), 76-80.

Maier, T. (2004). A formal model of the ETL process for OLAP-based Web usage analysis. Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis at the ACM KDD: Knowledge Discovery and Data Mining Conference, Seattle, Washington.

Meo, R. et al. (2004). Integrating Web conceptual modeling and Web usage mining. Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis at the ACM KDD: Knowledge Discovery and Data Mining Conference, Seattle, Washington.

Mobasher, B. et al. (2001). Effective personalization based on association rule discovery from Web usage data. ACM Workshop on Web Information and Data Management, Atlanta, Georgia.

Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43(8), 142-151.

Nasraoui, O. et al. (2002). Automatic Web user profiling and personalization using robust fuzzy relational clustering. In J. Segovia, P. Szczepaniak, & M. Niedzwiedzinski (Eds.), E-commerce and intelligent methods. Springer-Verlag.

Nasraoui, O. et al. (2003). Mining evolving user profiles in noisy Web clickstream data with a scalable immune system clustering algorithm. Proceedings of WebKDD 2003—KDD Workshop on Web Mining as a Premise to Effective and Intelligent Web Applications, Washington, D.C.

Nasraoui, O., Krishnapuram, R., & Joshi, A. (1999). Mining Web access logs using a relational clustering algorithm based on a robust estimator. Proceedings of the 8th International World Wide Web Conference, Toronto, Canada.

Nasraoui, O., & Pavuluri, M. (2004). Complete this puzzle: A connectionist approach to accurate Web recommendations based on a committee of predictors. Proceedings of the WebKDD-2004 Workshop on Web Mining and Web Usage Analysis at the ACM KDD: Knowledge Discovery and Data Mining Conference, Seattle, Washington.

Schafer, J.B., Konstan, J., & Riedl, J. (1999). Recommender systems in e-commerce. Proceedings of the ACM Conference on E-Commerce.

Spiliopoulou, M., & Faulstich, L.C. (1999). WUM: A Web utilization miner. Proceedings of the EDBT Workshop WebDB98, Valencia, Spain.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2), 12-23.

KEY TERMS

CGI Program: (Common Gateway Interface) A small program that handles input and output from a Web server. Often used for handling forms input or database queries, it also can be used to generate dynamic Web content. Other options include JSP (Java Server Pages) and ASP (Active Server Pages), scripting languages allowing the insertion of server-executable scripts in HTML pages, and PHP, a scripting language used to create dynamic Web pages.

Clickstream: The virtual trail left by a user's computer as the user surfs the Internet. The clickstream is a record of every Web site visited by a user, how long the user spends on each page, and in what order the pages are viewed. It is frequently recorded in Web server logs.

Collaborative Filtering: A method for making automatic predictions (filtering) about the interests of a user by collecting ratings and interest information from many users (collaborating).

Cookie: A message generated and sent by a Web server to a Web browser after a page has been requested from the server. The browser stores this cookie in a text file, and the cookie then is sent back to the server each time a Web page is requested from the server.

Frequent Itemset: A set of items (e.g., {A, B, C}) that simultaneously co-occur with high frequency in a set of transactions. This is a prerequisite to finding association rules of the form {A, B} → C. When items are URLs or products (e.g., books, movies, etc.) sold or provided on a Web site, frequent itemsets can correspond to implicit collaborative user profiles.

IP Address: (Internet Protocol address) A unique number consisting of four parts separated by dots, such as 145.223.105.5. Every machine on the Internet has a unique IP address.

Recommender System: A system that recommends certain information or suggests strategies users might follow to achieve certain goals.

Web Client: A software program (browser) that is used to contact and obtain data from a server software program on another computer (the server).

Web Server: A computer running special server software (e.g., Apache), assigned an IP address, and connected to the Internet so that it can provide documents via the World Wide Web.

Web Server Log: Each time a user looks at a page on a Web site, a request is sent from the user's client computer to the server. These requests are for files (HTML pages, graphic elements, or scripts). The log file is a record of these requests.
World Wide Web Usage Mining

Wen-Chen Hu, University of North Dakota, USA
Hung-Jen Yang, National Kaohsiung Normal University, Taiwan
Chung-wei Lee, Auburn University, USA
Jyh-haw Yeh, Boise State University, USA
INTRODUCTION
World Wide Web data mining includes content mining, hyperlink structure mining, and usage mining. All three approaches attempt to extract knowledge from the Web, produce some useful results from the knowledge extracted, and apply the results to certain real-world problems. The first two apply the data mining techniques to Web page contents and hyperlink structures, respectively. The third approach, Web usage mining (the theme of this article), is the application of data mining techniques to the usage logs of large Web data repositories in order to produce results that can be applied to many practical subjects, such as improving Web sites/pages, making additional topic or product recommendations, studying user/customer behavior, and so forth. This article provides a survey and analysis of current Web usage mining technologies and systems. A Web usage mining system must be able to perform five major functions: (i) data gathering, (ii) data preparation, (iii) navigation pattern discovery, (iv) pattern analysis and visualization, and (v) pattern applications. Many Web usage mining technologies have been proposed, and each technology employs a different approach. This article first describes a generalized Web usage mining system, which includes five individual functions. Each system function is then explained and analyzed in detail. Related surveys of Web usage mining techniques also can be found in Hu et al. (2003) and Kosala and Blockeel (2000).
BACKGROUND
A variety of implementations and realizations is employed by Web usage mining systems. This section introduces the Web usage mining background by giving a generalized structure of the systems, each of which carries out five major tasks:

• Usage Data Gathering: Web logs, which record user activities on Web sites, provide the most comprehensive, detailed Web usage data.
• Usage Data Preparation: Log data are normally too raw to be used by mining algorithms. This task restores the users' activities that are recorded in the Web server logs in a reliable and consistent way.
• Navigation Pattern Discovery: This part of a usage mining system looks for interesting usage patterns contained in the log data. Most algorithms use the method of sequential pattern generation, while the remaining methods tend to be rather ad hoc.
• Pattern Analysis and Visualization: Navigation patterns show the facts of Web usage, but these require further interpretation and analysis before they can be applied to obtain useful results.
• Pattern Applications: The navigation patterns discovered can be applied to the following major areas, among others: (i) improving the page/site design, (ii) making additional product or topic recommendations, and (iii) Web personalization.

Figure 1 shows a generalized structure of a Web usage mining system; the five components will be detailed in the next section. A usage mining system also can be divided into the following two types:

• Personal: A user is observed as a physical person for whom identifying information and personal data/properties are known. Here, a usage mining system optimizes the interaction for this specific individual user.
• Impersonal: The user is observed as a unit of unknown identity, although some properties may be accessible from demographic data. In this case, a usage mining system works for a general population.
Figure 1. A Web usage mining system structure
This article concentrates on the impersonal systems. Personal systems actually are a special case of impersonal systems, so readers can easily infer the corresponding personal systems, given the information for impersonal systems.
MAIN THRUST OF THE ARTICLE
This section details the five major functions of a Web mining system: (i) data gathering, (ii) data preparation, (iii) navigation pattern discovery, (iv) pattern analysis and visualization, and (v) pattern applications.
Data Gathering
Web usage data are usually supplied by two sources: trial runs by humans and Web logs. The first approach is impractical and rarely used because of its high time and expense costs and its bias. Most usage mining systems use log data as their data source. This section looks at how and what usage data can be collected.
• Server-Side Logs: These logs generally supply the most complete and accurate usage data.
• Proxy-Side Logs: A proxy server takes the HTTP requests from users and passes them to a Web server; the proxy server then returns to users the results passed to them by the Web server.
• Client-Side Logs: Participants remotely test a Web site by downloading special software that records Web usage or by modifying the source code of an existing browser. HTTP cookies also could be used for this purpose. These are pieces of information generated by a Web server and stored in the user's computer, ready for future access.
Web Logs
A Web log file records activity information when a Web user submits a request to a Web server. A log file can be located in three different places: (i) Web servers, (ii) Web proxy servers, and (iii) client browsers, as shown in Figure 2.

Figure 2. Three Web log file locations

Web Log Information
A Web log is a file to which the Web server writes information each time a user requests a resource from that particular site. Examples of the types of information the server preserves include the user's domain, subdomain, and host name; the resources the user requested (e.g., a page or an image map); the time of the request; and any errors returned by the server. Each log provides different information about the Web server and its usage data. Most logs use the common log file format or the extended log file format. For example, the following is an excerpt of a file recorded in the extended log format:

#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html
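A minimal Python sketch of reading such an extended-log excerpt into records is shown below; it honors only the #Fields directive seen in the example above and skips other directive lines. Real logs carry many more fields (IP address, status code, user agent, etc.) and quoting rules that are not handled here.

# Parse the tiny extended-log excerpt into (time, method, URI) records.
log_text = """#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html"""

def parse_extended_log(text):
    fields, records = [], []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]        # ['time', 'cs-method', 'cs-uri']
        elif line.startswith("#") or not line.strip():
            continue                         # other directives / blank lines
        else:
            records.append(dict(zip(fields, line.split())))
    return records

for rec in parse_extended_log(log_text):
    print(rec["time"], rec["cs-method"], rec["cs-uri"])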
Data Preparation
The information contained in a raw Web server log does not reliably represent a user session file. The Web usage data preparation phase is used to restore users' activities in the Web server log in a reliable and consistent way. At a minimum, this phase should achieve the following four major tasks: (i) removing undesirable entries, (ii) distinguishing among users, (iii) building sessions, and (iv) restoring the contents of a session (Cooley, Mobasher & Srivastava, 1999).
Removing Undesirable Entries
Web logs contain user activity information, some of which is not closely relevant to usage mining and can be removed without noticeably affecting the mining, such as image-request entries and robot accesses. As much irrelevant information as possible should be removed before applying data mining algorithms to the log data.
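A minimal sketch of this cleaning step follows; the suffix list and the robot keywords are assumptions for illustration, whereas production systems rely on much richer robot databases and site-specific rules.

# Drop log entries that contribute little to usage mining: requests for
# images/style assets and requests issued by known crawlers.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".ico")
ROBOT_HINTS = ("bot", "crawler", "spider")

def keep_entry(uri, user_agent=""):
    if uri.lower().endswith(IMAGE_SUFFIXES):
        return False
    if any(h in user_agent.lower() for h in ROBOT_HINTS):
        return False
    return True

entries = [
    ("/index.html", "Mozilla/4.0"),
    ("/logo.gif", "Mozilla/4.0"),
    ("/index.html", "Googlebot/2.1"),
]
print([e for e in entries if keep_entry(*e)])   # only the first request survives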
Distinguishing Among Users
A user is defined as a single individual that accesses files from one or more Web servers through a browser. A Web log sequentially records users' activities according to the time each occurred. In order to study actual user behavior, the users in the log must be distinguished. Figure 3 is a sample Web site where nodes are pages, edges are hyperlinks, and node A is the entry page of the site. The edges are bi-directional, because users can easily use the back button on the browser to return to the previous page. Assume the access data from an IP address recorded in the log are those given in Table 1. Two user paths are identified from the access data: (i) A-D-I-H-A-B-F and (ii) C-H-B. These two paths are found by heuristics; other possibilities also may exist.

Figure 3. A sample Web site

Table 1. Sample access data from an IP address on the Web site in Figure 3
No.  Time   Requested URL   Remote URL
1    12:05  A               –
2    12:11  D               A
3    12:22  C               –
4    12:37  I               D
5    12:45  H               C
6    12:58  B               A
7    01:11  H               D
8    02:45  A               –
9    03:16  B               A
10   03:22  F               B

Building Sessions
For logs that span long periods of time, it is very likely that individual users will visit the Web site more than once, or their browsing may be interrupted. The goal of session identification is to divide the page accesses of each user into individual sessions. A time threshold is usually used to identify sessions. For example, the previous two paths can be assigned further to three sessions: (i) A-D-I-H, (ii) A-B-F, and (iii) C-H-B, if a threshold value of thirty minutes is used.
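A minimal sketch of this threshold-based session building is shown below. It is a simplified gap-based variant: the toy timestamps (given in minutes since midnight) and the single-user path are assumptions, and real sessionizers also apply additional heuristics that are not reproduced here.

# Split one user's time-ordered page accesses into sessions whenever the
# gap between consecutive requests exceeds a threshold (30 minutes here).
def sessionize(accesses, threshold_minutes=30):
    sessions, current, last_t = [], [], None
    for t, page in accesses:                 # t = minutes since midnight
        if last_t is not None and t - last_t > threshold_minutes:
            sessions.append(current)
            current = []
        current.append(page)
        last_t = t
    if current:
        sessions.append(current)
    return sessions

# Toy path: A-D-I-H followed, after a long pause, by A-B-F.
accesses = [(725, "A"), (731, "D"), (757, "I"), (771, "H"),
            (885, "A"), (905, "B"), (912, "F")]
print(sessionize(accesses))   # [['A', 'D', 'I', 'H'], ['A', 'B', 'F']]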
Restoring the Contents of a Session
This task determines if there are important accesses that are not recorded in the access logs. For example, Web caching or using the back button of a browser will cause information discontinuance in logs. The three user sessions previously identified can be restored to obtain the complete sessions: (i) A-D-I-D-H, (ii) A-B-F, and (iii) C-H-A-B, because there are no direct links between I and H or between H and B in Figure 3.
Navigation Pattern Discovery
Many data mining algorithms are dedicated to finding navigation patterns. Among them, most algorithms use the method of sequential pattern generation, while the remaining methods tend to be rather ad hoc.
A Navigation Pattern Example
Before giving the details of various mining algorithms, the following example illustrates one procedure that may be used to find a typical navigation pattern. Assume the following list contains the visitor trails of the Web site in Figure 3:

1. A-D-I (4)
2. B-E-F-H (2)
3. A-B-F-H (3)
4. A-B-E (2)
5. B-F-C-H (3)

The number inside the parentheses is the number of visitors per trail. An aggregate tree constructed from the list is shown in Figure 4, where the number after the page is the support, that is, the number of visitors having reached the page. A Web usage mining system then looks for interesting navigation patterns in this aggregate tree. Figure 5 shows an example of navigation patterns from page B to page H in Figure 4.

Figure 4. An aggregate tree constructed from the list of visitor trails

Figure 5. The navigation patterns from page B to page H in Figure 4
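A minimal sketch of building such an aggregate tree from the trail list is given below, using a nested-dictionary trie that accumulates support counts. This is an illustrative reconstruction of the idea, not the data structure of any particular system cited in this article.

# Build an aggregate tree (a trie with support counts) from visitor trails;
# each node records how many visitors reached that page along that prefix.
def build_aggregate_tree(trails):
    root = {"support": 0, "children": {}}
    for pages, count in trails:
        root["support"] += count
        node = root
        for page in pages:
            child = node["children"].setdefault(page, {"support": 0, "children": {}})
            child["support"] += count
            node = child
    return root

trails = [("ADI", 4), ("BEFH", 2), ("ABFH", 3), ("ABE", 2), ("BFCH", 3)]
tree = build_aggregate_tree(trails)

# First-level supports mirror Figure 4: (A, 9) and (B, 5).
print({page: node["support"] for page, node in tree["children"].items()})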
Sequential Pattern Discovery
The problem of discovering sequential patterns consists of finding intertransaction patterns such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set (Agrawal & Srikant, 1995). The following two systems use a variant of sequential pattern generation to find navigation patterns:
• WUM (Web Utilization Miner) (Spiliopoulou & Faulstich, 1998) discovers navigation patterns using an aggregated materialized view of the Web log. This technique offers a mining language that experts can use to specify the types of patterns in which they are interested. Using this language, only patterns having the specified characteristics are saved, while uninteresting patterns are removed early in the process. For example, the following query generates the navigation patterns shown in Figure 5:

select glue(t)
from node as B, H
template B×H as t
where B='B' and H='H';
• MiDAS (Büchner et al., 1999) extends traditional sequence discovery by adding a wide range of Web-specific features. New domain knowledge types in the form of navigational templates and Web topologies have been incorporated, as well as syntactic constraints and concept hierarchies.
Ad Hoc Methods
Apart from the above techniques of sequential pattern generation, some ad hoc methods worth mentioning are as follows:
• Association rule discovery can be used to find unordered correlations between items found in a set of database transactions (Agrawal & Srikant, 1994). In the context of Web usage mining, association rules refer to sets of pages that are accessed together, with a support value exceeding some specified threshold (see the sketch following this list).
• OLAP (Online Analytical Processing) is a category of software tools that can be used to analyze data stored in a database. It allows users to analyze different dimensions of multidimensional data; for example, it provides time series and trend analysis views. WebLogMiner (Zaiane, Xin & Han, 1998) uses the OLAP method to analyze the Web log data cube, which is constructed from a database containing the log data. Data mining methods such as association or classification are then applied to the data cube to predict, classify, and discover interesting patterns and trends.
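As a minimal, hedged sketch of the association-rule idea applied to page sets, the code below scores only pairwise rules over a handful of assumed toy sessions and thresholds; full systems use Apriori-style candidate generation over larger itemsets (Agrawal & Srikant, 1994).

# Report page-pair rules X -> Y whose support and confidence exceed the
# chosen thresholds, counting co-occurrence within sessions.
from itertools import combinations

sessions = [{"A", "B", "F"}, {"A", "B", "E"}, {"A", "D", "I"}, {"B", "F", "H"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.8

def support(itemset):
    return sum(itemset <= s for s in sessions) / len(sessions)

for x, y in combinations(sorted(set().union(*sessions)), 2):
    for a, b in ((x, y), (y, x)):            # test both rule directions
        sup = support({a, b})
        if sup >= MIN_SUPPORT and sup / support({a}) >= MIN_CONFIDENCE:
            print(f"{a} -> {b}  support={sup:.2f}  confidence={sup / support({a}):.2f}")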
Pattern Analysis and Visualization
Navigation patterns, which show the facts of Web usage, need further analysis and interpretation before application. The analysis is not discussed here because it usually requires human intervention or is distributed to the two other tasks: navigation pattern discovery and pattern applications. Navigation patterns are normally two-dimensional paths that are difficult to perceive if a proper visualization tool is not available. A useful visualization tool may provide the following functions:
• Displays the discovered navigation patterns clearly.
• Provides essential functions for manipulating navigation patterns, such as zooming, rotation, scaling, and so forth.
WebQuilt (Hong & Landay, 2001) allows captured usage traces to be aggregated and visualized in a zooming interface. The visualization also shows the most common paths taken through the Web site for a given task, as well as the optimal path for that task, as designated by the designers of the site.
Pattern Applications
The results of navigation pattern discovery can be applied to the following major areas, among others: (i) improving site/page design, (ii) making additional topic or product recommendations, and (iii) Web personalization. Learning user/customer behavior (Adomavicius & Tuzhilin, 2001) and Web caching (Lan, Bressan & Ooi, 1999), which are less important applications for navigation patterns, are also worth studying.
Web Site/Page Improvements
The most important application of discovered navigation patterns is to improve the Web sites/pages by (re)organizing them. Other than manually (re)organizing the Web sites/pages (Ivory & Hearst, 2002), there are some other, automatic ways to achieve this. Adaptive Web sites (Perkowitz & Etzioni, 2000) automatically improve their organization and presentation by learning from visitor access patterns. They mine the data buried in Web-server logs to produce easily navigable Web sites. Clustering mining and conceptual clustering mining techniques are applied to synthesize the index pages, which are central to site organization.

Additional Topic or Product Recommendations
Electronic commerce sites use recommender systems or collaborative filtering to suggest products to their customers or to provide consumers with information to help them decide which products to purchase. For example, each account owner at Amazon.com is presented with a section of Your Recommendations, which suggests additional products based on the owner's previous purchases and browsing behavior. Various technologies have been proposed for recommender systems (Sarwar et al., 2000), and many electronic commerce sites have employed recommender systems in their sites (Schafer, Konstan & Riedl, 2000).

Web Personalization
Web personalization (re)organizes Web sites/pages based on the Web experience to fit individual users' needs (Mobasher, Cooley & Srivastava, 2000). It is a broad area that includes adaptive Web sites and recommender systems as special cases. The WebPersonalizer system (Mobasher et al., 2002) uses a subset of Web log and session clustering techniques to derive usage profiles, which are then used to generate recommendations.

FUTURE TRENDS
Though Web usage mining is a fairly new research topic, many systems and tools are already on the market (Uppsala University, n.d.). Most of them provide limited knowledge or information, such as the number of hits and the popular paths/products. Table 2 gives the latest major research systems and projects in the field. They make it possible to extract hidden knowledge from log data and apply the knowledge to certain real-world problems. This table also shows that the future trends of Web usage mining research are on sequence discovery and recommender systems. Various methods of sequence discovery have been introduced in the previous section; however, a satisfactory method has yet to be found. Recommender systems have been widely used in electronic commerce, and it is to be expected that Web usage information will play a crucial role in recommendations.

Table 2. Major research systems and projects concerning Web usage mining
No.  Title               URL                                               Major Method/Application
1    Adaptive Web Sites  http://www.cs.washington.edu/research/adaptive/   Pattern application
2    GroupLens           http://www.cs.umn.edu/Research/GroupLens/         Recommender systems
3    MiDAS               –                                                 Sequence discovery
4    WebQuilt            http://guir.berkeley.edu/projects/webquilt/       Proxy logging
5    WebLogMiner         http://www.dbminer.com/                           OLAP application
6    WebSift             http://www.cs.umn.edu/Research/webshift/          Data mining
7    WUM                 http://wum.wiwi.hu-berlin.de/                     Sequence discovery

CONCLUSION
In less than a decade, the World Wide Web has become one of the world's three major media, the other two being print and television. Electronic commerce is one of the major forces that allows the Web to flourish, but the success of electronic commerce depends on how well the site owners understand users' behaviors and needs. Web usage mining can be used to discover interesting user navigation patterns, which then can be applied to real-world problems such as Web site/page improvement, additional product/topic recommendations, user/customer behavior studies, and so forth. This article has provided a survey and analysis of current Web usage mining systems and technologies. A Web usage mining system performs five major functions: (i) data gathering, (ii) data preparation, (iii) navigation pattern discovery, (iv) pattern analysis and visualization, and (v) pattern applications. Each function requires substantial effort to fulfill its objectives, but the most crucial and complex part of this system is its navigation pattern discovery function. Many usage-mining algorithms use the method of sequential pattern generation, while the rest tend to use ad hoc methods. Sequential pattern generation does not dominate the algorithms, since navigation patterns are defined differently from one application to another, and each definition may require a unique method.

REFERENCES

Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer profiles. IEEE Computer, 34(2), 74-82.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th Very Large DataBases Conference (VLDB), Santiago, Chile.

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan.

Büchner, A.G., Baumgarten, M., Anand, S.S., Mulvenna, M.D., & Hughes, J.G. (1999). Navigation pattern discovery from Internet data. Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD), San Diego, California.

Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1(1), 5-32.

Hong, J.I., & Landay, J.A. (2001). WebQuilt: A framework for capturing and visualizing the Web experience. Proceedings of the 10th International World Wide Web Conference, Hong Kong.

Hu, W., Zong, X., Lee, C., & Yeh, J. (2003). World Wide Web usage mining systems and technologies. Journal on Systemics, Cybernetics and Informatics, 1(4).

Ivory, M.Y., & Hearst, M.A. (2002). Improving Web site design. IEEE Internet Computing, 6(2), 56-63.

Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15.

Lan, B., Bressan, S., & Ooi, B.O. (1999). Making Web servers pushier. Proceedings of the Workshop on Web Usage Analysis and User Profiling, San Diego, California.

Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Communications of the ACM, 43(8), 142-151.

Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2002). Discovery and evaluation of aggregate usage profiles for Web personalization. Data Mining and Knowledge Discovery, 6(1), 61-82.

Perkowitz, M., & Etzioni, O. (2000). Towards adaptive Web sites: Conceptual framework and case study. Artificial Intelligence, 118, 245-275.

Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000). Analysis of recommender algorithms for e-commerce. Proceedings of the ACM Electronic Commerce Conference.

Schafer, J.B., Konstan, J., & Riedl, J. (2000). Electronic commerce recommender applications. Journal of Data Mining and Knowledge Discovery, 5(1/2), 115-152.

Spiliopoulou, M., & Faulstich, L.C. (1998). WUM: A tool for Web utilization analysis. Proceedings of the Workshop on the Web and Databases (WEBDB), Valencia, Spain.

Uppsala University. (n.d.). Access log analyzers. Retrieved March 2, 2004, from http://www.uu.se/Software/Analyzers/Access-analyzers.html

Zaiane, O.R., Xin, M., & Han, J. (1998). Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. Proceedings of Advances in Digital Libraries (ADL), Santa Barbara, California.
KEY TERMS

Association Rule Discovery: A rule in the form of "if this, then that" that associates events in a database. Association rule discovery can be used to find unordered correlations between items found in a set of database transactions, such as the association between items purchased at a department store.

Sequential Pattern Generation: The problem of discovering sequential patterns consists of finding intertransaction patterns such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set.

User Navigation Patterns: Interesting usage patterns contained in the Web log data. Most algorithms use the method of sequential pattern generation, while the remaining methods tend to be rather ad hoc.

Web Logs: A Web log file records activity information when a Web user submits a request to a Web server. A log file can be located in three different places: (i) Web servers, (ii) Web proxy servers, and (iii) client browsers.
Web Proxy Servers: A proxy server takes the HTTP requests from users and passes them to a Web server; the proxy server then returns to users the results passed to them by the Web server.

World Wide Web Data Mining: The attempt to extract knowledge from the World Wide Web, produce useful results from the knowledge extracted, and apply the results to certain real-world problems.
World Wide Web Usage Mining: The application of data mining techniques to the usage logs of large Web data repositories in order to produce results that can be applied to many practical subjects, such as improving Web sites/ pages, making additional topic or product recommendations, user/customer behavior studies, and so forth.
Index of Key Terms
Symbols α-Surface 232 β-Rule 614 ∆B+ Tree 232 0-1 Distance 468
A absolute difference 497 accuracy 194 ACML 27 actionable rule 5 active disk 10 learning 16, 1118 tag 1165 adaptive boosting (AdaBoost) 357 system 27 Web site 1224 adjacency matrix 1042 adverse effects 244 aerophones 88 agent 27 ontology 27 aggregate conceptual direction 296 materialized view 425 similarity search 232 aggregation 37 queries 994 AIC Criterion 468 Akaike Information Criterion (AIC) 1129 algebraic function 201 algorithmic information theory 565 alternative document model 999 hypothesis 852 storage 22
Amortize-Scans 476 analytical data 327 information technologies (AIT) 237 query 1058 anomaly 81 detection 256 ANOVA or analysis of variance 81 antecedent 72, 1180 antimonicity constraint 1062 antimonotonic 628 API 42 application service providers 954 solution providers 463 applied ethics 458, 836 apriori 63, 73, 1032 algorithm 163 architecture 868 area of black pixels 875 of the search space 673 ARIS 123 articulation 88 artificial intelligence 1086 neural network (ANN) 194, 222, 437, 673 ASCII 98 “as-is” Business Process 123 assertion between knowledge patterns 651 association (undirected association rule) 565 rule 69, 153, 164, 513, 708, 745, 751, 799, 929, discovery 485, 1247 mining 1184 rules 48, 757, 863, 1234 attribute 143 dependency 476 discretization 818 fuzzification 818
attributed graph 809 attribute-oriented induction 77 auditing 222 authority pages 1150 auto ID technology 1165 autocorrelation 1129 automatic classification 88 automatically-defined function (ADF) 532 autonomous information system 5, 643 auxiliary view 721 average precision (P_Avg) 381
B B+-Tree 425 back-end processing 883 backpropagation 148, 868 algorithm 673 badly structured data 77 bag relation 565 bagging (bootstrap aggregating) 357 basic belief assignment 1138 basis for association rules 757 batch learning 452 Bayes factor 677 Bayes’ theorem 93 Bayesian Information Criterion 158, 726 Bayesian Network(s) 432,809, 964 , 1021 belief 1170 function 1138 benchmarking 267 biased sampling 994 bibliometrics 105 bibliomining 105, 282 BIC Criterion 468 binarization 692 bioinformatics 282, 485, 554, 637, 733, 814, 825, 1159 biometric 971 authentication 875 bitmap index 910 blurred graph body mass index (BMI) 261 Boolean Function 692, 982 boosting 140, 189 bootstrap 158, 726 bootstrapping 905 bottom-up cube computation 201 boundary element method 250 branch 1180 breadth-first 1032 browser caching 1224 bucket 53 -based histogram 53
business and scientific applications 237 intelligence 237, 343, 386, 463, 954, 1086 process 123, 900 management system 900 busy hour 919 buy-and-hold strategy 784
C cache 322 -results 476 calling path 958 campaign design 704 cancer-testis (CT) antigen 554 candidate generation 1063 capital-efficient marketing 919 car pooling 412 cardiac data fusion 592 case-based reasoning 128 categorical attribute 143 data 954 categorization 132 censored 1081 center-based clustering 140 central limit theorem 1107 centre of gravity 875 CGI Program 1240 chain graphical models 1234 changes to XML 804 characterizing function 522 chase 643 chat 762 chemoinformatics 668 chi-squareddistance 468 chordophones 88 circulation database 709 Cis-regulatory element 733 class label 206 -conditional independence 37 classification 58, 287, 437, 491, 554, 624, 688, 933, 954, 1189 error 697, 778 rule mining 778 tree 143, 357 classifier 988 system 491 “CLDS” 22 clementine 1086 clickstream 391, 1153, 1240 closed frequent graph 1144 itemset 757
Index of Key Terms
sequential pattern 1032 closure operator 153 cluster 171, 179 analysis 158, 182, 502, 585 centres (prototypes) 585 frequent item 559 prototypes (centres) 182 sampling 348 -support of feature F in cluster Ci 518 validation 179 clustering 140, 171, 179 ,222, 244, 282, 287, 311, 437, 485, 610, 624, 663, 688, 733, 954, 1189, 1205 data mining task 789 Coalesce 726 co-clustering 179 code growth (code bloat) 532 codes of ethics 836 Cognos 1086 collaboration 762 filtering 48, 128, 442, 1240 virtual environment 1194 collective dependence 940 combination 605 combinatorial optimization 697 common log format (CLF) 1224 community 772 mining 447 co-morbidity index 363 compact and generalized abstraction of the training set 905 compatibility 1027, 1165 competitive advantage 386 complementary similarity measurement 370 comprehensibility 194 computational cardiology 592 problem 412 computer aided design (CAD) 327 manufacturing (CAM) 327 vision 971 -based training 271 computerized assisted auditing techniques (CAATs) 222 computing with symbols 565 with words 565 concept 133, 153, 518, 1090 drift 206 hierarchy 513, 1189 conceptual clustering 1189 construction with incomplete data 296
graph 545 maps 133 condensed representations 211 condition number 508 conditional 767 independence 677, 767, 940, 1234 probability 93 structure 767 conditioning 1139 confidence 69, 73, 864, 370 of A rule 1234 of Rule XY 513 confidentiality 836 confluence 538 conjunctive combination 1139 consequent 1180 constraintbased data mining 211 mining 745 contentaddressable file store –(CAFS) 10 based filtering 128 image retrieval 216, 778, 841 context 153, 518 continuous auditing 222 data 396 value assumption (CVA) 53 contrast set 799 control flow graph 116, 949 group 704 conversational systems 128 cookie 1240 cooperative information systems 301 corporate information factory (CIF) 22 correct number 370 correlation 81, 1144 corrupted parties 1009 cost sensitive classification 933 crawler 1108 CRISP-DM (Cross Industry Standard Process for Data Mining) 895 CRM 745, 895 cross and edge points 875 -validation 508 crossover 480, 532 cryptographic data mining techniques 929 cube 910 cell 201 cubegrade 292 3
Index of Key Terms
cuboid 201 cumulative proportion surviving 1081 curse of dimensionality 179, 232, 688, 1194 effect 905 customer relationship management 164, 1096 customized protocols 244 cut-points 401 cyber security 569
D DARM 407 [data] Missing Completely At Random (MCAR) 598 data and model assembly 646 assimilation 237 center 407 checking 1047 ceaning 306 cube 32, 425, 476 operator 476, 883 cubes 343 dependencies 949 editing 1047 envelopment analysis (DEA) 352 extraction 338 imputation 598, 1047 integration system 301 interchange 1052 management 569 mart 22, 316 mining 93, 164, 171, 276, 287, 306, 391, 412, 437, 463, 624, 633, 673, 751, 1096, 1184 or knowledge discovery in databases 789 data table 363 model 42 technologies (DMT) 237 missing at random (MAR) 598 modeling 574 perturbation 924 preprocessing 53, 1153 processing 1052 quality 301 reduction 624 repository 651 retrieval 333 schema 301 instance 301 sequence 1032 set 206 staging area (DSA) 316 stream 206, 1124 management system (DSMS) 1124 streams 1200 4
validation 1047 visualization 244, 646, 1194 warehouse(es) 227, 261, 316, 463, 528, 633, 739, 883, 910 warehousing 98, 105 database 98 clustering 548 computer or machine 10 gap 10 item 864 management system (DBMS) 22 marketing 704 schema 628 status 713 data-driven design 227 decision making unit (DMU) 352 rule 357, 977, 988 set 988 support 463 system (DSS) 238, 327, 333, 831, 994 systems/tools 633 tree(s) 143, 194, 244, 437, 452, 1086, 1180 deductive data mining 565 deep Web mining 447 deferred maintenance 713 definable set 977 degree of abnormality 1189 demand forecasting 1129 dendogram 1189 density-biased sampling 348 dependence relation 476 depth-first 1032 design recovery 116, 949 device profile 27 differencing 1129 digital audio 858 forensics 569 library 709, 762 multimedia 1189 digitisation 123 dimension 32, 425, 528, 910, 1215 dependency 476 table 574, 911 dimensional attribute 658 data model 227 model 574 modeling 338 dimensionality reduction 847, 1124, 1194 dimensions 463 directed acyclic graph (DAG) 677
Index of Key Terms
graphical models 1234 discovery informatics 391, 709 discrepancy of a model 468 discrete data 396 fourier transformation (DFT) 1124 wavelet transformation (DWT) 1200 /continuous attributes 401 discretization 37, 396 discriminant analysis 222 variables 1159 disjunctive set of conjunctive rules 148 disk access time 10 distance measure 370 distributed chase 643 data mining 924, 929 distributive function 201 divide-and-conquer 1091 DMG 42 DNA 1159 document categorization 1112 clustering 559, 1112, 1118 cut-OFF value (DCV) 381 frequency 381 vector 559 domain 629, 1215 analysis technique 463 of rule 5 dot product queries 1200 drill down 292, 463, 528 drill-down (roll-up) 32, 425 drug discovery 244 dynamic clustering 1091 feature extraction 216 graph 545 programming 809 sampling (Adaptive Sampling) 348 time warping (DTW) 174 topic mining 610 Web pages 713, 716 weighting 140
E edit restraints 306 editing procedure 1047 efficient 352 Eigenfaces 971 Eigenvalue 502 electronic
business (EB) 352 product code (EPC) 1165 EM algorithm 1118 embedded PI submodel 940 subtree 1144 emerging pattern 799 enhanced data mining with incomplete data 296 ensemble 357, 452 learning 194 -based methods 491 enterprise architecture 98 miner 1086 performance management (EPM) 1096 resource planning /enterprise systems 633 system 267 entity-relationship model (E-R-Model) 73 entropic distance 468 entropy 767 envelopes 875 episodal association rule 1102 episode 1102, 1230 epistemology 227 Epoch 868 equal employment opportunity classification 267 error rate 890 -depth discretization 818 -width discretization 818 ERP 123 error control 697 localization 1047 e-science 638 e-service 27 estimation-maximization algorithm 158, 726 ethics 458, 836 ethnography 463 ETL 317 ETL (Extract/Transform/Load) 333 Euclidean distance 468, 1124 event 1102 sequence 1102 type 1102 evidence-based librarianship 105 evolutionary 391 algorithm 1170 computation 480, 491, 532, 663, 673 exact learning 614 rules 614 exceptional pattern 548 5
Index of Key Terms
executive information system 271 expectation maximization 809 explanation-oriented data mining 497 explicit edit 1047 knowledge 605 exploratory data analysis (EDA) 1144 tree 357 extended item 658 transaction 658 eXtensible Business Reporting Language (XBRL) 222 Markup Language (XML) 27, 716, 804, 964, 1021, 1053 extension 692 external data 333, 831 externalization 605 extraction, transformation, and loading (ETL) 98, 629, 1096
F F value 81, 620 face space 971 fact 32, 528 (Multidimensional Datum) 425 table 32, 426, 574, 911 factless tact table 574 factor score 502 factual data 458 analysis 458 failure analysis 1082 false acceptance rate 890 alarm 1124 discovery rate (FDR) 852 dismissal 1124 negative 852 positive 852 rejection rate 890 FAST (Fast Algorithm for Splitting Trees) 357 feature (or attribute) 486 extraction 110 eduction Methods 244 selection 110,164, 287, 486, 919 space 1069 vector 216, 232, 841 federal government 271 fidelity 194 field learning 614 filter technique 799 financial 6
distress 508 ratio 508 finite element method 250 first story detection 610 fitness landscape 491 flexible attribute 5 mining of association rules 513 focal element 1170 focus+context technique 580 foreign key 629 formal concept analysis 518 fourier analysis 858 fractional factorial 704 frame of discernment 1170 free tree 1145 freeblock scheduling 10 frequent itemset 1240, 153, 407, 513, 757, 793, 864, 933, 1240 pattern 69 pattern 946 subgraph 1063 mining 545 front-end processing 883 full PI model 940 functional areas 123 dependency 116, 949 genomics 733 margin 1069 functionality 1009 fuzzy association rule 819 clustering 182, 585 C-Means Algorithms 663 estimate 522 histogram 522 information 522 logic 784, 1086 membership 663 number 522 relation 1118 set 276 statistics 522 transformation 296 valued functions 522 vector 522
G galaxy structure 528 Galois Connection 153 gene 287 expression 554, 733, 1159
Index of Key Terms
microarrays 287 profile 110 microarray data 852 ontology (GO) 814 generality 497 generalization(s) 292, 868 generalized disjunction-free pattern 946 disjunctive pattern 946 generating-pruning 1032 generation 480, 532 generic algorithm (s) 148 genetic algorithm 58, 276, 486, 508, 784, 954, 1086, 1205 programming 381 genome 287 databases 739 medicine 733 geographicinformation systems (GIS) 98, 238 gestational diabetes 261 GIS 412 global baseline 875 features 890 frequent itemset 559 glycemic control 363 goals of scientific research 497 grammars 809 granular data 831 granularity 322 granulation and partition 565 graph 538, 809, 1014 grammar 545 invariants 1042 isomorphism 1063 production 538 rewriting 538 spectra 809 -based data mining 545, 1014 graphical models 1234 user interface (GUI) 327 group technology (GP) 327 growth rate 799
H H2 580 H2DV 580 hamming clustering 825 distance 982 handoff or handover 958
hard clustering 585 hazard function 1082 head pose 971 HEDIS 267 Hegelian/Kantian perspective of knowledge management 605 heuristic algorithms 412 heuristics 1086 HgbA1c 363 Hidden Markov Model (HMM) 620, 682 Random Field (HMRF) 809 neurons 868 hierarchical agglomerative clustering (HAC) 958 clustering 171, 1091 data files 98 hierarchy 528 high-vote pattern 548 HIPAA 267 histogram(s) 53, 232, 311, 994 historical XML 804 hit (request) 1230 HITS 447 Algorithm (Hypertext Induced Topic Selection) 1042 HIV 726 HMDS 580 HOBBit Distance 1184 holdout technique 799 holistic function 201 homeland security 569 homogeneity 502 homonymy 651 homoscedasticity 502 hopfield neural network 296 horizontal and vertical projections 875 partitioning 911 horizontally partitioned data 929 HSOM 580 HTML 716 HTTP 716 hub pages 1150 hubs and authorities 1206 human resource information system 267 -computer interface 646 hybrid data analysis methods 522 systems 825 technique 276 hyperbolic 7
Index of Key Terms
geometry 581 tree viewer 581 hyperlink 1150, 1206 hypertext 386, 1150 hyponymy/hypernymy 651 hypotheses 1069 test 852 testing 116
I iceberg cubes 292 query 292 ideal model 1009 idiophones 88 ill-formed XML documents 376 image data mining 1189 features 809 processing 1189 understanding 809 immediate maintenance 713 immersive IVUS images 592 implicit edit 1047 imputation 306 inapplicable responses 598 inclusion dependency 629, 949 incomplete data 296 independence 1027 independent 93 index structures 216 tree 311 indexing 322 individual 480, 532 induced subtree 1145 induction 1180 method 189 inductive database(s) 211, 745 learning 16 logic programming 37, 73, 545 reasoning 983 inexact learning 614 rules 614 information extraction 1112 information extraction 282, 682, 1150 gain 148 paralysis 895 retrieval 282, 333, 386, 858, 964, 1021, 1108, 1150 system 1003 8
visualization 1194 informational privacy 458 informed (or valid) consent 836 infrequent itemset 69 pattern 793 inputs/outputs 352 instance 206, 624 selection 624 instances 401 integrated library system 105 integration of processes 123 intelligence density 633 intelligent data analysis 638 query answering 643 interaction 77 interactive data mining 646 intercluster similarity 171 inter-cluster similarity 559 internal data 333 internalization 605 interschema properties 651 intertransactional association 658 interval set clustering algorithms 663 sets 663 intracluster similarity 171 intratransactional association 658 intrusion 256 setection 256 detection 569 invisible Web 333 inwards projection 919 IP Address 1240 island mode GP 532 itemset 63, 69, 153, 757, 799, 864, 929, 1032 support 153 IVUS data reconciliation 592 standards and semantics 592
J JDM 42, 1174 join index 911 joint probability 93 distribution 432 junction tree 432 algorithm 809
K k-anonymity 924 Karhunen-Loeve Transform (KLT) 311
Index of Key Terms
KDD 745 kernel 1091 function 668, 847, 1070, 1076 matrix 668 methods 688 Kernel-Density Estimation (KDE) 174 key 63 business process 227 k-itemset 69 k-line 784 k-means 1159 clustering 919 k-most interesting rule discovery 799 knowledge base 5, 643 creation 386 discovery 244, 386, 432 in databases (KDD) 548 process 598 domain visualization 999 extraction 673 integration 988 management 98, 391, 716, 762 spiral 605 Kolmogorov Complexity of a String 565 Kullback-Leibler Divergence 468
L labeled data 1027 LAD (logical analysis of data) 692 large dataset 1091 leaf 1180 learner 988 learning algorithm 869 legacy data 911 system 99, 271 LERS 977 leukemia 1159 levelwise discovery 629 life tables 1082 lifetime 1082 value 919 lift of a rule 1234 linear discriminant analysis 508 model 77 regression 437, 1145 linguistic term 1180 variable 1180 link analysis 569 data mining task 789
consistency 713 mining 1014 literal 946 load shedding 1124 local feature relevance 688 features 890 pattern analysis 548 Lockean/Leibnitzian perspective of knowledge management 605 log-based personalized search 442 query clustering 442 expansion 442 logic data mining 697 programming 682 synthesis 983 logical consistency 1047 data 1174 logistics support system 271 log-likelihood score 468 longest common subsequence similarity (LCSS) 174 loopy belief propagation 809 lossless representation of frequent patterns 946 lower approximation of a rough set 977 low-Quality Data 614 LPA (low prediction accuracy) problem 614
M machine learning 128, 149, 282, 452, 638, 668, 1021, 1086, 1108 malicious adversary 1009 mal-processes 123 management information system (MIS) 1096 MANET 958 margin 668 marginal independence 940 market basket 799 efficiency theory 784 marketing optimization 704 Markov chain 1042 Monte Carlo 726 random field (MRF) 809 mark-up language 386 mass dataveillance 458 value 1170 material acquisition 709 category 709 9
Index of Key Terms
materialized hypertext 713 view 716 view 32, 426, 721, 1058 mathematical program 1076 matrix 502 maximal frequent sequential pattern 1032 -profit item selection (MPIS) 69 maximum likelihood 93 measure 32, 426, 911 measurements 463 medical image mining 592 transactional database 363 medoid 559 member 528 membranophones 88 Mercer’s condition 847 merge/purge 629 metabonomics 814 meta -combiner (meta-learner) 988 -data 1215 -learning 16, 480 -recommenders 48 metadata 22, 301, 333, 386, 716, 739, 831, 858, 971 metaheuristic 58 method 1004 level operator 189 of explanation-oriented data mining 497 schema 1004 metric-driven design 227 metropolis algorithm 420 MIAME 739, 814 microarray 110, 733 databases 739 informatics 739 markup language (MAML) 739 micro-arrays 825, 1159 microarrays 554 MIDI 858 minimal occurrence 1102 support (minSup) 793 description length (MDL) Principle 545 mining 762 e-mails 772 historical XML 804 of primary Web data 789 of secondary Web data 789 question-answer pairs 772 sentences 772 MINSAT 697 10
misuse detection 256 mixture of distributions 158, 726 modal properties 250 mode shapes 250 model 206 (knowledge base) 988 checking 117 identification 825 quality 989 -based clustering 140, 726 MOFN expression 194 moment measures 875 monothetic process 363 Monte Carlo method 420 morality 836 MPEG Compression 847 multiagent system (MAS) 27 database mining 549 factorial Disease 486 method Approach 189 relational data mining 545, 1184 vertical mining 1184 resolution analysis 778 view learning 16 multidimensional context 658 OLAP (MOLAP) 884 query 426 multimedia 1021 data mining 1194 database 216 multimodality of feature data 847 multiple data source mining 549 hypothesis test 852 site link structure analysis 999 multiview learning 1027 music information retrieval 858 mutation(s) 292, 480, 532 mutual information 371
N natural frequency 250 language processing (NLP) 282, 386, 620, 1108 navigational pattern 1220 n-dimensional cube 201 nearest neighbor methods 688 -neighbor algorithm 48 near-line storage 22
negative association rules 864 border 1032 itemsets 864 (g, Ci) 518 network intrusion detection 407 training 58 neural model 869 network 93, 244, 261,276, 296, 391, 538, 784, 452, 1086, 1134 neuron 538 node 1180 noise clustering 182, 585 noisy data 638 nominal data 396 non-ignorable missing data 598 non-precise data 522 normality 502 normalization 554, 574, 814 /denormalization 338 normalized database 343 extended itemset 658 extended transaction set 658 NP-hard problems 412 null hypothesis 852 number of intervals 401 numerical attribute 143
O objective function 1170 -based clustering 585 observability 1165 ODM (organizational data mining) 895 ODS (Operational Data Stores) 322 OLAP 244, 322, 338, 343, 420, 437, 476, 895, 1086 system 426 OLE DB DM 1174 for DM 42 OLTP (Online Transaction Processing) 322 one versus all 1070 one 1070 online analytical process (OLAP) 22 processing (OLAP) 238, 317, 327, 633, 884, 1200 systems (OLAP) 463 learning 452
profiling 954 public access catalog (OPAC) 105 science applications (OSA) 1200 signature recognition 890 ontology 5, 610, 643, 762, 1014 open world assumption 1139 operational cost 721 data 831 operations research (OR) 238 optimization 58 criterion 486 problem 412 ordinal data 82, 396 OT (organizational theory) 895 outliers 182 outsourcing 99 overfitting 164, 182, 1076 over-fitting rule 933 overlapping 651 overtraining 869
P page score 1205 view 1154 PageRank 447 algorithm 1042 Pageview 1220, 1230 parallel database system 751 parameterization 88 parsimony 532 parsing 386 part of speech (POS) 620 partial cube 201 PI model 940 partitional clustering 1091 partition-based clustering 174 partitioning 1058 clustering 171 tree algorithm 357 passive tag 1165 pattern 688, 692 decomposition 793 detection 778 discovery 1154 domains 211, 778 recognition 971 problem 983 with negation 946 PDDP (principal direction divisive partitioning) 1159 Pearsonian correlation coefficient 82 peer-to-peer systems 301
performance metrics 99 permutation GA 480 phi coefficient 371 photogrammetry 809 phylogenetic tree 727 physical data 1174 pignistic probability function 1139 pitch tracking 858 pivot tables 99 plausibility 1170 function 1139 PMML (Predictive Model Markup Language) 42, 895 poincaré mapping of the H2 581 point learning 614 Poisson process 158 polynomial range-aggregate queries 1200 population level operator 189 positive pattern 946 (g, Ci) 518 possibilistic clustering 183 possible world 767 postgenome era 733 power law distribution 1042 precision 381, 620 predicate tree (p-tree) 1184 prediction 58 join 1174 predictive modeling 238, 919 data mining task 789 rules 491 prefetching 958 preprocessed data 306 primary Web data 789 principal curve 158 component analysis (PCA) 110, 175, 311, 452, 847, 972, 1159 of maximum entropy 767 privacy 569, 836 against data mining 924 preserving Data Mining 569, 924, 929 probabilistic conditional 767 graphical model 677 model 1021 neural network 539 probability density function 158, 727 estimate 158 distribution 964 probe arrays 554 procedural factors 598
process definition 901 lifecycle management 123 processor per track/head/disk 10 production rule 357 profit mining 933 program analysis 117, 949 proportion failing 1082 proposal measure 371 propositional logic 697 propositionalization 37 proteomics 814 prototype 133 selection 905 pruning 77, 357 pseudo-intent 153 PTUIE 117 publish and subscribe 301 purchase cards 271 push-caching 958 p-value 852
Q quadratic program 1076 qualitative data 396 quantitative association rule 819 data 396 structure-activity relationship (QSAR) 668 query 343 answering 32 by pictorial example 216 by sketch 841 by painting 841 evaluation cost 721 log 442 mining 442 monotonicity 292 performance 1058 rewriting 32 semantics 643 session 442 tool 646 query-by-pictorial-example 841
R radial basis function (RBF) neural network 110 random forest 357 forgery 890 variable 853 walk 175 range query 53
ranking 841 function 381 ratio data 82 of change 497 RDF 1052 real model 1009 reasoning about patterns 946 recall 381, 620 recommendation system 964 recommendations 762 recommender system(s) 48, 128, 762, 1240 recursive partitioning 363, 688 redundant array of inexpensive disks (RAID) 1096 association rule 73 refresh cycle 831 refusal of response 598 regression 140 clustering 140 tree 143, 358 relation 73 relational data 38 model 227 database 73, 629 management systems (RDBMS) 1096 learning 38 OLAP (ROLAP) 884 privacy 458 relative difference 497 relaxation labeling 809 relevance feedback 841 representation space 401 representative sample 420 requirements 463 resampling 508 reservoir sampling 348 residual 1129 response modeling 704 returns to scale (RTS) 352 reverse engineering 949 rewrite system 539 RFID 1165 robust 972 clustering 585 roll-up 292 rollup 528 rotational latency 1058 rough decision table 977 set 277, 977 Data Analysis 149 sets 261, 663, 954
r-tree 1124 rule discovery 799 extraction 194 generation 825, 983 induction 306, 391, 673, 954 mining 491 quality 989
S salient variables 825 sample 420 sampling 311, 624 distribution 853 scatter plot 158 search 58 schema (pl. schemata) 481 abstraction 651 integration 651 matching 629 Schwartz Bayesian Criterion (SBC) 1129 scientific and statistical data mining 739 research processes 497 Web intelligence 999 scientometrics 999 search engine 1205, 1215 process 1004 situations 1004 space 673, 793 of a Transaction N=X:Y 794 transitions 1004 tree pruning 1063 seasonal and cyclic variations 1124 secondary Web data 789 secure multiparty computation 924, 1009 multi-party computation 929 seek time 1058 selection 481, 533 selectivity 420 selfmaintainable view 721 organization 663 organizing map (SOM) 296, 646, 1134 semantic associations 1014 context 1014 data mining 1014 dimension 745
e-mail 772 metadata 1014 primitive 565 Web 1014 semantics 5, 643 semihonest adversary 1009 semi-structured data 713, 1021 mining 804 documents 683 semi-supervised classification 1027 clustering 1118 learning 16, 1118 sensitivity to initialization 140 sensor networks 1200 sequential pattern 1032 generation 1247 random sampling 348 server log 1154, 1224 service management 322 session 1224, 1230 sessionization 1220 shape -based similarity search 232 -from-X 809 shared everything/nothing/disks system 10 Sigmoid function 149 signature-based intrusion detection 256 similarity 216, 841 transformation 778 simple forgery 890 random sampling 348 with replacement (SRSWR) 348 without replacement (SRSWOR) 348 singular value decomposition 311, 1159 skilled forgery 890 slant 875 sliding window 175 smallest-parent 476 smart sensors 432 smart structure 250 SmartSTOR 10 SMC 407 snowflake model 338 schema 574 SNP 814 social data-mining 48 socialization 605 soft computing 277 software
agents 683 analysis 1036 component 1036 cube 1036 mining 1036 warehouse 1036 application management system 1036 architecture 1036 SOTrieIT 63 sound 88, 858 source system 317 space model 171 spam e-mail 772 sparse cube 201 specialization 292 spectral clustering 179 spectral graph theory 1042 spectrum 88 splitting predicate 143 SQL (Structured Query Language) 322, 343, 1086 /MM DM 43, 1174 SQL99 43 SST Method Schema 1004 stable attribute 5 standard error 994 standardized data 1170 standards (international) 1052 star model 338 schema 343, 911, 1215 structure 528 static sampling 348 stationary time series 1129 statistical database 924 independence 677 information system 1053 metadata 1053 stemming 386, 559 step-wise regression 82 stop words removal 559 stoplist 1108 stratified sampling 348, 994 string edit distance 959 strong rule 513 structural equations 77 structured document 964 UDT 1174 structures and structural system 250 subgraph 1063 isomorphism 1063 subject matter expert 267 subschema similarity 651
sub-sequence matching 1124 subspace clustering 179, 688 substitution probability matrix 727 super-efficiency 352 superimposition 232 supervised classification 164 graph 545 learning 93, 282, 663, 1027, 1076, 1108, 1118 learning (classification) 1070 /unsupervised 401 supplier relationship management 1097 supply chain 327 support 73, 565, 864, 1063 (Itemset) or frequency 69 (rule) 69 count of an itemset 64 of a pattern 946 of a rule 1234 of an itemset x 513, 794 or global-support of feature F 518 set 692 threshold 64 vector 1076 machine (SVM) 620, 638, 668, 1070, 1076 survival time 1082 symbiotic data miner 1086 symbolic object 1091 propagation in Bayesian networks 432 rule 194 symmetric graphical models 1234 synonymy 651 syntactic metadata 1014 synthetic data set 206 system development life cycle (SDLC) 22 systems biology 825 development life cycle 463
T “T” value, also called “Student’s t” 82 table scan 10 tabu search 58 tacit knowledge 605 target concept 206 objects 38 system 317 taxonomy 77, 407 tree 1215 Tcpdump 256 temporal
data mining 663 recommenders 48 terabyte 99 term frequency 381 termination 539 test error 491 statistic 853 testing a neural network 111 tests of Regression Model 77 text classification 772, 1118 mining 386, 610, 762, 1134 summarization 1112 TFIDF 772 thematic search engine 1205 theory 692 threats 569 time correlation 1124 series 175, 1124, 1129 “to-be” business process 123 top-down cube computation 201 topic associations 1134 detection and tracking (TDT) 610 hierarchy (taxonomy) 1119 identification and tracking 1112 maps 1134 occurrences 1134 ontology 610 tracking 610 types 1134 topics 1134 total cost of ownership 99 traced forgery 890 training a neural network 111 data 306, 688 set 905 transaction 69, 117, 794, 934 processing council 10 data 327, 458 transcription factor 733 transducer 683 transformation 831 transmutator 189 travel cards 271 treatment group 704 tree and rule induction 222 trialability 1165 Trie 64 Trie, Patricia 63
truth table 983 t-test 111 twice-learning 195 type 1 diabetes 261 2 diabetes 261 conflict 651
U uniform distance 468 random sampling 348 sampling 420, 994 spread assumption (USA) 53 univariate/multivariate 402 unlabeled data 1027 unsupervised clustering 1159 learning 282, 663, 1027, 1070, 1108 upper approximation of a rough set 977 URL 1215 user activity record 1230 modeling 27, 1220 navigation patterns 1247 profile 27 data 1154 profiling 1112 utilities (Utiles) 633
V valid XML document 376 variable precision rough set model 977 variance 502 vector 502 quantization 175 space model (VSM) 381 vectorization 1205 vertical data mining (vertical mining) 1184 vertically partitioned data 929 video data mining 1189 videotext 847 view 292, 721, 1027, 1058 maintenance 721, 1058 monotonicity 292 validation 16 virtual reality 1195 vasculature reconstruction 592 data mining 646, 739, 1195 visualization 638, 954, 1195 of text data 1112
vocabulary spectral analysis 999 voting 697
W wavelet analysis 858 transform 311 wavelets 53 Web analytics 1220 client 1240 community 1206 content mining 999, 1150, 1206 graph mining 447 image search engine 778 information extraction 447 log(s) 745, 1247 mining 778, 804, 1150, 1207 page 1215, 1205 personalization 1154, 1220, 1225 process 901 data log 901 proxy servers 1248 resource 1230 server 1241 log 1241 service 901 site 1215 personalization 128 structure mining 447, 999, 1207 structured mining 1150 usage mining 1154, 1207, 1220, 1225, 1230 -enabled e-business 789 weight 539 weighted random sampling 348 well-formed XML documents 376 whole sequence matching 1124 window 1102 wireless ad-hoc sensor network 432 workflow 901 management systems 901 workload 994 World Wide Web data mining 1248 usage Mining 1248 worms 256 wrapper 683 induction 16 XML (Extensible Markup Language) 27, 716, 804, 964,1021, 1053 content analysis mining 376 intrastructure mining 376 mining 376
structural changes 804 clarification mining 376 XPath 964
Z z-transformation 183
Index
A acceptance/rejection sampling 345 accuracy-weighted ensemble 203 actionability 1, 930 active disks 6 learning 12, 1115 AdaBoost 450 adaptive metric techniques 685 numerical value discretization method 816 sampling 346 Web sites 1223 adaptivity 23 adjacency lattice 61 administration and management 17 advising system 435 aggregate conjunctive query 422 aggregates 319 aggregation 33, 469 Akaike Information Criterion 465 algebraic 197 algorithm(s) 263, 504 alternative document models 996 analytical information technologies 233 query 1055 analytics 698 annotation-free WI systems 679 anomaly 78 detection 251, 384 ANOVA F value 80 antecedent 1098 approximate 1196 query answering 990 approximation space 659 Apriori 59, 509, 957 -based approach 1010 AQ-PM Systems 203 araneus data model (adm) 715
archiving 321 ARIMA 1125 ARIS 118 artificial intelligence 893, 1083 neural networks (ANN) 191, 273, 433 neuron 54 assertion between knowledge patterns 648 association informatics 706 mining 560 rule 272, 773, 1217 discovery 795, 1245 extraction 752 mining 59, 65, 70, 403, 1098 rules 45, 74, 150, 288, 735, 746, 763, 859, 923, 925, 930, 941, 1222 utilization 705 attribute-oriented induction 74 audio retrieval by content 854 audit fees 1175 auditing 217 auditory scene analysis 854 authority 444, 1207 automated classification 1113 automatic document classification 383 indexing and searching of audio files 86 music transcription 855 autonomous 1 averaging 355
B background knowledge 494 backpropagation learning 865 bagging 355, 449 base data 1054 Bayesian 90 belief networks 146 criteria 465 inference 92
information criterion 465 methods 266 nets 805 network 89, 92, 674, 1016 retrieval model 961 networks 427, 960 network theory 93 belief 1166 function(s) 612, 1135 network model 961 propagation 91 best practices 94 biased sampling 991 bibliometric methods 100 bibliomining 100 for library 100 billing errors 269 binary association rule 509 image 871 bioinformatics 728 biometric 885 biotechnology 734 Bit-sliced indexes 424 Bitmap indexes 424 block level HITS 444 PageRank 444 blocking 303 blurred graph 543 Bonferroni 850 Boolean algebra 563 formula 694 function 689 boosting 355, 450 bootstrapping techniques 902 bottom-up computation 198 breadth-first 151, 1029 bucket 50 budgeting 265 business 234 analytics 390 case 269 intelligence (BI) 268, 318, 382, 1152 process 118 management (BPM) 119 systems 896 processes 896 busy-hour network utilization 913
C cache management 1221, 1223 CAD/CAM 324 cancer-testis (CT) antigen 551 candidate sequence 1029 canonical model 962 capital efficient marketing 912 car pooling 408 case studies 257 CASPIAN 1163 categorical attributes 34 categorization 129 causal relation 834 causality 263 CD3 203 cell 196 censoring 1077 census 94 center-based clustering 134 characterizing function 519 chat discussions 758 chemical, biological, radiological, nuclear 427 chemoinformatics 664 chi-squared distance 685 chunks 198 circulation 705 class-conditional distributions 34 classical probability 90 classification 141, 144, 272, 448, 622, 865, 922, 930, 1064, 1071, 1113, 1171, 1222 ambiguity 1176 and regression trees 353 based on associations 774 tree 141 classifier system 487 classifying 1155 clean data 1043 cleansing 454 clickstream 1216 data 1207, 1227 client log 438 closed feature set 514 itemset(s) 150, 942 sequential pattern 1029 cluster analysis 154, 159, 729, 1217 centre 582 sampling 345 -support 516 clustering 165, 172, 176, 180, 272, 309, 477, 622, 659, 735, 820, 995, 1087, 1114, 1122, 1155, 1222 algorithms 159
and classification 263 methods 582 techniques 1201 co-training 1022 codes of ethics 832 collaboration 758 filtering (CF) 44, 124, 931, 1237 and personalized search 438 virtual environments (CVEs) 1190, 1193 collective dependence 938 collinearity 504 collocation 383 combinatorial optimization 482 committee methods 448 communities, informal 769 community mining 444 compact and generalized abstraction 902 compensation planning 262 competitive advantage 17, 382 complex data types 178 composite symbolic object 1089 compositional algorithms 191 compression 321 computational biology 482 complexity 695 computer ethics 832 vision 965 -based training 270 computer programming, the art of 59 computing 634 with words 562 concept drift 202 concepts 129, 150 conceptual clustering for video data 1187 construction 293 maps 130 condensed representations 207 condition number 504 conditional pattern bases 60 probabilities 89 probability 89 probability table (CPT) 90, 92 confidence 1098 confidentiality 832, 923, 1005 confusion matrix 466 congressional sampling 992 consent 832 consequent 1098 constraint-based mining 207 content
analysis 373 -based filtering 124 image retrieval 837 context 514 ultra-sensitive approach 1238 continuous association rule mining algorithm (CARMA) 61 auditing (CA) 217 contradictions 76 contrast sets 796 contribution function 612 control of inductive bias 530 cooperation 487 cooperative 639 information system 297 core competence 387 corporate financial distress 503 libraries 100 correlation 78, 1122 graph 957 cost of turnover 262 projections 265 -sensitive 930 crisup 510 criteria based on score functions 465 critical support 510 CRM 1093 cross correlation 1127 industry standard process (CRISP) 122 for data mining 894 cross-validation 466, 503, 1072 crossover 505 cryptographic techniques 926 Cube lattice 197 operator processing 423 -by operator 469, 877 cubegrades 288 cuboid 196 curse 178, 684, 902, 1192 curve overlap 154 customer classification 339 relationship management 59, 891, 950 customized prescriptions 239 cut point 392 CVFDT 203 cyber attacks 251 security 567
D damage detection 245 techniques 246 data 1190 acquisition 454, 886 aggregation 1054 analysis 454 and model assembly 645 cleaning 1227 cleansing 996 collection methods 1000 compression 307 cube 28, 421, 470, 877 computational approaches 469 dependency 947 envelopment analysis (DEA) 350 fusion 1227 imputation 595, 1044 integration 298 intensive Web sites 714 interoperability 94 management 566 marts 335 mining 112, 129, 159, 165, 190, 224, 239, 262, 272, 372, 388, 392, 403, 438, 443, 448, 454, 560, 621, 625, 630, 689, 728, 734, 746, 763, 785, 859, 921, 925, 947, 973, 990, 1048, 1206 algorithms 190 benchmarking association 265 by using database statistics 626 components 339 models 39, 339 technologies 233 missing at random 594 completely at random 594 modeling 239 normalization 811 perturbation 926 preprocessing 1216 quality 297, 636, 1043, 1049 reduction 52, 307, 621 repository 647 security 921 sequence 1028 staging area (DSA) 312 stream(s) 59, 172 validation 1044 visualization 644, 1190 warehouse 259, 318, 312, 523, 826, 876, 906, 1054 development 18 team 17
warehouses 17, 101 warehousing 14, 94, 223, 263, 317, 630, 1048, 1092 -driven 223 -mining 141, 483 applications 378 data table 359 technology 433 database 443 machines 7 marketing 698 partitioning 319 systems 746 datamart 263 datasets of multidimensional 1196 decision making 1135 rule 353 learning 145 rules from data 975 support 459, 990 system (DSS) 17, 233, 326, 469, 630 tree 394 algorithms 240 analysis 264 construction algorithm 141 induction 353 learning 144 trees 141, 184, 693, 735, 1175 -making 1211 decompositional algorithms 191 deductive data mining 562 deep Web mining 443 defense 94 software collaborators (DACS) 265 defining user requirements 459 degree of belief 90 dehooking 887 demand forecasting 1129 Dempster-Shafer Theory 1135, 1166 density biased sampling 345 -based clustering 160 Department of Agriculture 94 Defense 268 Transportation 94, 268 dependent variable 698 depth first 151 descriptive function 433 design 313 recovery 112 device profiles 25 diabetes 259, 359
diagnosis 259 differential evolution 1168 digital forensics 568 library 100, 758 reference 102 dilemma 454 dimensional algebra 226 attributes 654 data model 223 model 319 dimensionality reduction 843 dimensions 523 direct marketing 931 directed acyclic graph (DAG) 92, 674 edges 92 disclosure 834 discovering knowledge 689 discovery 190 informatics 387, 705 discrete attributes 397 wavelet transform –(DWT) 307, 308 discretization 392 discriminant analysis 503 disease 259 disjoint clustering 515 dissimilarity measures 1088 distributed 2 autonomous information systems 640 distributive 197 divide-and-conquer approach 1089 DNA (Deoxyribonucleic Acid) 735 chip 106 document classification 1015 cluster map (DCM) 1131 clustering 555, 607, 1147 domain organization 1212 dot reduction 887 drug design 664 discovery 239 dual problem 1074 dynamic bandwidth allocation 956 clustering algorithm 1088 graphs 543 health strategies (DHS) 265 sampling 346 time warping 172 topic mining 608
Web pages 714 weighted majority 203 weighting 134
E e-commerce 1235 e-business (EB) 349 e-government 300 e-mail filing 768 mining 768 e-metrics 1229 e-service 23 feature space 1074 eclectic algorithms 191 “factless” fact 572 efficiency 349 eigenfaces 967 electronic business (EB) 349 customer relationship management (e-CRM) 892 product code 1160 elementary granules 561 elitism 505 EM (expectation-maximization) algorithm 1115 semantic associations 1011 emerging patterns 796 empirical risk 1064, 1065 employee retention 262 encoding schemes 680 enhanced data mining 294 ensemble 448 methods 202, 355 rule based classification methods 487 enterprise resource planning (ERP) 118, 630 Environmental Protection Agency 94 episodal association rule r 1100 episode 1099 equal employment opportunity (EEO) 263 frequency discretization method 398 width interval binning 398 ERR 888 error 834 estimated error 932 ethics 454, 832 ethnographic 459 Euclidean distance 582 event 1099 type 1099 evidence-based librarianship 103 evolutionary algorithms 477, 483, 487 approaches 184
computation 482 executive information system 269 expectation maximization 134, 1023 experimental stress tolerances 736 expertise 435 explanation 190 construction 494 evaluation 495 -oriented association mining 493 explicit 599 exponential weighting scheme 686 eXtensible Business Reporting Language (XBRL) 217 external data 328, 826 segmentation 886 extraction rules 13 extraction, transformation and loading (ETL) 312 extraction-transformation-loading 95 extractors 678
F F value 80 face recognition 965 factor analysis (FA) 498 loadings 499 facts 523 failure analysis (FA) 1077 false acceptance rate (FAR or type II error) 874 discovery rate 850 negatives 848 positives 848 FAQ detecting 439 FAR 888 feature 283 construction 33, 530 element 773 based image classification (FEBIC) 775 elements 774 extraction 530, 902 matrix 872 relevance 684 selection 285, 400, 902 and extraction 477 vectors 773 Federal 94 Aviation Administration 268 Credit Union 94 Government 268 Fellegi-Holt method 1044 field learning 611 filtering 887
financial ratios 503 finite closed interval 519 fitness landscapes 487 FLORA 203 forgeries 870 formal concept analysis (FCA) 150, 514 fragments 1059 framework 493 fraud 269 detection 269 free text 382 freeblock scheduling 8 frequency 1099 frequent closed episodes 1099 feature set 515 itemsets 754 itemset 404, 555 mining 207 itemsets 150, 790, 859 pattern 941 -growth 60 -tree 60 structure mining 1140 subgraph mining 540 FRR 888 functional genomics 728 fuzzy and interval set clustering 659 c-means 661 clustering 162, 180 histogram 520 logic 272 number 519 observations 294 patterns 294 rules 272 set theory 1175 transformation 295
G Gaussian kernel 1074 gene chip 106 clustering 283 expression microarray 283 ontology 812 selection 283 methods 107 general accounting office 268 generalized additive multi-mixture models 355
disjunction-free literal set model (GD 942 projection (GP) 30, 423 generating-pruning 1029 generative model 1016 generators 942 of frequent closed itemsets 754 genes 734 genetic 504 algorithm 273, 378, 477, 779, 1201 and evolutionary computation (GEC) 477 programming (GP) 377, 529 genetically modified (GM) foods 736 genomics 482, 835 geographic information systems (GIS) 97, 409 geographical information systems 233 global features 872 search 479 -support 516 glycemic control 360 Gram-Schmidt Kernels 667 grammar-based 529 granular computing 560 graph 540, 1010 databases 1059 grammar 542 kernel 666 mining 1140 representation 526 transaction 1141 -based ARM 71 data mining 1010 graphical models 935, 1232 user interface (GUI) tools 326 greedy strategy 685 grid-based clustering 161 group differences 795 health analysis 265 management 265 group pattern discovery systems 546 Grubbs’ test 180 GT 324
H hamming clustering 978, 981 health and urban development 94 and wellness initiatives 265 plan benefits 265
employer data and information set (HEDIS) 265 Health Insurance Portability and Accountability Act 265 heterogeneous gene expression data sets 550 heuristic rating algorithm 779 heuristics 1083 hidden Markov models 616 Web 443 hierarchical clustering 159 conceptual clustering 541 methods 166 partitioning 264 high frequency patterns 560 pressure region image 872 histogram(s) 50, 309, 991 historical research 570 HITS 444 HMD 577 holistic 197 homeland security 566 homonymy 648 Hopfield neural network 295 horizontally partitioned database 927 HPR 872 HSOM 577 HTL 577 hubs 444, 1207 human resource information systems (HRIS) 262 -computer interaction 644 -computer interface 645 Human Resources Benchmarking Association™ 265 humanities data warehousing 570 research 570 hybrid approaches 185 hyperbolic multidimensional scaling 577 self-organizing map 577 space 575 tree layout 577 HyperCluster Finder 399 hyperlink analysis 1207 hyperlinked communities identification 1147 hypernym 648 hyperplane 1064 classifier 1066 Hypersoap 966 hypertext analysis 384
model 714 view 710 hyponym 648 hypothesis 1064 testing 848
I iceberg cube 198 query 198 identifier attributes 34 identity theft 264 ideological 130 if-then rules 1175 IFL method 611 ignorance 1166, 1167 illumination 966 image features and their relations 805 search engines 773 structure 805 impact factors 552 impoverish epistemology 224 imprecision 273 improving performance 190 inapplicable responses 593 inclusion 648 dependency mining 626 incomplete data 293 incremental approaches 202 hypertext view maintenance 710 learning 529 mining 1030 text mining 607 independence principle 1104 index trees 309 -term selection 439 indexing techniques 906 individualized protocols 243 inductive database 207, 1010 logic processing (ILP) 1010 programming 33, 70 reasoning 766, 978, 979 inexact fielding learning (IFL) 611 rules 611 inference 92, 962 network model 961 information 382
browsing 409 enhancing 546 extraction 278, 615, 678, 1146, 1206 integration system 678 management 1092 paralysis 891 retrieval 159, 278, 377, 382, 438, 443, 960, 1109, 1146 scientists 100 seeking 1000 system 2 technology 893 theory 763 visualization 1190 -retrieval 328 -theoretic 542 informational privacy 455 informed consent 833 instance matching 625 selection 621 -based pattern synthesis 902 integrated library systems 101 integration of data sources 625 intelligence density (ID) 630 intelligent data analysis 634 text mining 383 interaction 75 interactive visual data mining 644 interactivity 151 interdisciplinary study 634 interestingness 796 of rules 1 interface schema 445 interlibrary loan 102 Internal Revenue Service 94 Internet 328, 349 interpage structure 443 interpretability 1175 interschema properties 647 intersite schema matching 445 interstructure mining 372 intrapage structure 443 intrasite schema matching 445 intrastructure mining 372 intrusion detection 251, 568 system 251 isomorphic 561 isomorphism modulo a product 30 issues in mining digital libraries 279 item 1028 itemset 509, 795, 1028
J JDM 39, 1171 joint probability distribution (JPD) 429
K k-dimensional fuzzy vector 519 k-free sets 942 k-harmonic means 134 k-lines 779 k-means 134, 166, 660, 1155 k-medoid 166 k-nearest neighbor 167 queries 308 classifier 146 Kaplan-Meier method 1077 kernel 1068 function 1074 approach 1010 functions 842, 1064 methods 664 partial least 667 based methods 687 -density-based clustering 173 keyword 1211 -based association analysis 383 knowledge 382, 599 creation 383, 600 discovery 372, 385, 388, 390, 482, 746, 1181 methods 639 process 593 domain visualization 995 management 94, 268, 387, 758 nuggets 76 representations 184 transformation 600 -based recommenders 124 Kohonen self-organizing maps 661
L large datasets 1089 lasers 734 latent semantic indexing 807 lattice of concepts 151 theory 61 layer-based strategy 151 learning algorithms 184 theory 689, 1064, 1065 leave-one-out 503 libraries 100
lifetime value 913 lift chart 466 linear discriminant analysis 685 models 503 regression 309 linguistic variables 1175 linguistics computing 570 link analysis 383, 566, 735 mining 1012 local features 872 pattern-analysis 546 location services 965 log data 1216, 1226 files 1207 -based query clustering 438 expansion 438 logic data 693 synthesis 978, 981 logical analysis of data (LAD) 689 logistics support system 94, 268 loss function based criteria 466 low prediction accuracy (LPA) 611 lower bound 659
M machine learning 124, 234, 272, 390, 443, 448, 674, 693, 1015, 1083, 1113 magnum opus 796 maintenance algorithm 712 mal-processes 118 marked-up language 384 market basket analysis 59, 925 efficiency theory 779 research 698 segmentation 339 -basket analysis 653 Markov chain models 1231 matching parameters 303 material acquisition 705 categories 706 materialization 710 materialized hypertext view 714 view(s) 29, 717, 906, 1055 mathematical programming 1072
maximal-profit item selection (MPIS) 67 maximum a posteriori or MAP decision rule 89 entropy 763 MDLPC 398 mean filter 871 median filter 871 medical digital libraries 278 informatics 243 transactional database 359 MEDLINE 616 membership functions 1175 memorized 55 Mercer’s condition 1074 message understanding conferences 615 metaclassification 552 data 329, 826 learner 15 learning 12 recommenders 46 metabolic pathways 736 metadata 298, 382, 1048, 1211 database 1050 model 1048 standards 1049 metainformation 1048 method schema 1000 methods for gene selection 107 metric-driven 223 micro-array data 687 microarray 106, 728, 810, 848 data sets 107 image analysis 735 microarrays 734 military services 268 min-max basis for approximate association rules 755 exact association rules 755 minconf 509 mine rule 740 minimal non-redundant association rule 755 occurrence 1100 minimum cost satisfiability problem 694 description length 541 support 1028 mining 1120 biomedical literature 280 of association rules 740 question-answer pairs 769 sentences 770
MINSAT 695 minsup 509 MIS 1092 missing values 1044 misuse detection 251 mixture of curves 154 mobile user trajectories 956 modal data 247 model combiners 448 identification 820 -based clustering 161, 134 pattern synthesis 902 modeling 313, 434 models of the data 340 molecular biology 735 monotonicity of the query 198 Monte Carlo method 414 moral 454 morphology 384 mosaic-based retrieval 838 multiagent system 23 criterion optimization 478 dimensional 523 dimensional sequential pattern mining 1031 link lookahead search 938 method 185 relational 540 view learning 12 multidimensional cubes 1054 database 17 datasets 1196 on-line analytical processing (MOLAP) 877 multifactorial diseases 483 multimedia 965 metadata 842 multimodal analysis 843 multiple data source mining 546 hypothesis testing 848 relations 71 site link structure analysis 996 multivariate discretization 398 multiview learning 1022 music information retrieval 854 mutation 505
N n-dimensional cube 196 Naïve Bayes 1115 Naive
Bayesian classification 89 Bayesian classifier 145 -Bayes 394 National Committee for Quality Assurance (NCQA) 265 national security 832, 1012 natural language document 1147 processing 384, 443, 1109 navigational patterns 1218 NCR Teradata database system 746 nearest neighbor classifier 904 rule 684 and correlation-based recommender 45 negative association rules 860 itemsets 860 nested objects 715 networked data 34 neural networks 146, 245, 272, 507, 735, 779, 865 neuro-fuzzy computing 274 noise clustering 180, 582 magnitude 154 reduction 887 removal 871 noisy link 444 non-derivable itemsets 942 non-Ignorable Missing Data 594 nonlinear dynamics 234 nonparametric classifier 904 normalization 570, 871, 887 null hypothesis 849 number of intervals 397 numerical dependencies 31
O object classification 1166 identification 302 objective 492 function-based clustering 582 observability 1162 ODM Implementation 892 OLAP 17, 319, 413, 1196 systems 422 OLE DB for DM 39, 1171 on-line analytical processing (OLAP) 876, 891 online aggregation 991 analytical processing (OLAP) 233 196, 335, 469, 876, 1083 technologies 325
tools 339 discussions 759 only-one code 980 ontology 759, 1011, 1214 OPAC 102 operational cost 718 databases 826 operations research 234 topic-sensitive PageRank 440 OPTI 334 optimization 409 algorithms 719 of capital outlay 913 order-entry systems 334 ordinal data 80 organizational data mining 891 decision making 102 theory 893 orthogonal (uncorrelated) 498 osmotic stress microarray information database 736 outliers 180, 992 over-fitting 932 overfitting 1071 overlap pattern graph 903 -based pattern synthesis 903 overlapping 648 overtraining 866
P page class 714 scheme 715 vectorization 1202 PageRank 444 pageview 1227 paradigm shift 239 partial cube 199 -memory approaches 202 partition search space 791 -based pattern synthesis 903 partitioning 543 algorithm 817 and parallel processing 906 clustering 160 path mining 896 pattern classification 684 count tree 904 decomposition 790, 791
discovery 1151, 1207, 1216 finding 1186 recognition 129, 246, 689, 965 problem 979 synthesis 902 with negation 941 -projection 1029 patterns 1120 Pearsonian correlation coefficient 80 pedagogical algorithms 191 peer-to-peer 300 perceptron 55 performance 349, 1054 personal digital assistants (PDAs) 427 personalization 1151, 1235 personnel hiring records 262 pharmaceutical industry 239 piece-wise affine (PWA) identification 820 pixel 734 plant genomics 736 microarray databases 736 -made-pharmaceuticals 736 plausibility 1166 PMML 39 Poincaré disk 576 point density 154 policy learning 530 polynomial kernel 1074 pooled information 265 post-pattern-analysis 546 pragmatics 384 pre-calculation 320 predicate tree (P-tree) 1182 prediction 865, 955, 1120 join 1172 predictive function 433 model markup language (PMML) 894 preprocessing 886, 1228 primary Web data 785 principal component analysis 172, 498, 967, 1155 direction divisive partitioning 1155 priori algorithm 1231 privacy 405, 454, 566, 832, 921 issues 264 preserving data mining 922, 568, 1005 probabilistic decision table 974 graphical models 674 reasoning 935 probability distributions 696, 960 probably approximately correct (PAC) learning 346
PROC OPTEX 699 procedural factors 593 process management 18 modeling methodology 241 of discretization 775 product assortment decisions 66 professionalism 833 profit mining 930 program analysis 112, 947 mining 112 progressive 1196 sampling 346 progressively 1196 projected databases 1029 projection indexes 424 proportional hazards model 1077 propositional logic 694 propositionalization 33 protein-protein interaction graph 71 prototype(s) 129, 180 proxy log 438 server 1243 pruning 76 algorithm 142 search space 791 pseudo-independent (PI) models 936 public health 833 publish/subscribe 299 push-cache 956
Q quadratic time 816 qualitative data 392 quality of care 265 quantified beliefs 1135 quantitative association rules 815 data 392 SAR (QSAR) 664 query 645 answering system 639 evaluation 717 language engines 339 log 438 mining 438 preprocessing 438 optimization 423 parallelism 320 reformulation 439 session clustering 439 sessions 438
subsumption 31 -by-humming 855 -by-sample-image 837 -by-sketch 837 questionnaires 1045
R radio frequency identification (RFID) 326 random forest 355 range query 49 -aggregate queries 1196 ranking 838 function discovery 378 ratio 79 raw data 436 receiver operating characteristic 466 recommendation 930 recommender systems 44, 124, 758, 931, 1235, 1246 reconciliation 300 record linkage 302 recursive partitioning 354 reengineering 947 refinement 190 reflexive relationships 70 refusal of response 593 regression 141, 144, 264, 272, 735 clustering 134 statistics 1064 reinforcement learning 529 relational 540 data 70 mining 72 database systems 262 databases 33 join operation 71 learning 33 modeling 570 on-line analytical processing (ROLAP) 877 privacy 455 queries 808 relationship 1120 relative advantage 1162 frequencies 90 relevance feedback 438, 837 reporting 334 representation of frequent patterns 942 space 397 episodal association rules 1100 representative samples 413 resampling 503, 886
reservoir sampling 345 result schema 445 reuse 529 risk exposure 265 models 265 rough classification 611 decision table 974 set and genetic algorithm 272 sets 659, 973 -set genome 660 rule extraction 191 generation 821 method(s) 978 mining 509 -based classification models 773
S salient variables 820 sample 413 class prediction 283 clustering 283 sampling 310, 622, 991 satisfiability problem 694 scalability 636, 1181 and efficiency 178 scalable data-mining tools 434 scanned images 734 schema 523 abstraction 647 integration 647 matching 625 theory 477 scheme 131 science 234 scientific applications 1196 research 492 Web Intelligence 995 scientometrics 995 score function 516 search engine 328, 443, 1146, 1211 situation 1001 transition 1000 space 790 transition 1001 secondary Web data 786 secure multiparty computation 1005 security definitions 1005 management 18
segmentation 353, 444, 1217 selectivity 413, 991 self organizing map (SOM) 273 -insured 265 -organizing maps 294 semantic context 1012 data mining 1010 dimensions 740 metadata 1011 Web 1012, 1214 -based retrieval 805 semantics 384 of attributes 1 semistructured data 372 documents 678 semisupervised clustering 1114 learning 12, 1022 semiotics 129 semistructured pages 445 sensor network 965 sentence completion 770 separating set 694 sequence analysis 383 rules 1231 pattern 1028 discovery 1245 sequential random sampling 345 server log 438 session 1221 sessionization 1227 set theoretic interval 660 shared dimensions 199 knowledge 130 signature 251 significant genes 551 similarity 1120 detection 383 simple random 345 sampling 344 simplex plot 1168 singlegraph 1141 -link lookahead search 938 singular value decomposition 307 site structure 714 skeletonization 872 SMC 405
SMO 1075 smoothening 872, 887 social data mining system 46 network 1152 SODAS software 1089 soft computing 272 software analysis 1033 classification/clustering 1034 engineering 112 filtering 1034 integration 1034 mining 1034 transformation 1034 warehouse 1033 SOK 334 sound 84 analysis 84 space models 165 standardization 871 spam filtering 768 spanning tree 151 sparse data cubes 423 spherical clusters 582 split selection algorithm 142 splitting predicates 141 spurious rules 797 SQL/MM DM 39, 1171 SRM 1093, 1097 SSP 334 STAGGER 203 concepts 204 standalone information system 640 standardization 1171 star schema 877 -snowflake 526 state of the variable 674 static Features 887 statistical analysis 737 data editing 1043 hypothesis testing 464 information systems (SIS) 1048 significance 848 statistics 74, 634, 1109 utilization 705 step-wise regression 80 STICH (Space Models Identification Through Hierarc 165 stock technical analysis 779 strategic HR decisions 262
strategies 636 stratified random sampling 344 sampling 992 stream data mining 1030 streaming 991 ensemble algorithm 203 strong or unique identifiers 303 structural health monitoring system 245 risk 1064 structure-activity relationship (SAR) 664 structured data 443 documents 1015 information retrieval 960 STUCCO 796 subdue system 540 subgraph mining 1059 subjective probability 90 subschema similarity 648 substructure 541 summarization 251, 272 supervised data mining 811 discretization 398 inductive learning 477 learning 54, 144, 494,542, 865,1064, 1071, 1105 method 303 supply chain 323 management 325, 326 support 1231, 1098 threshold 60 vector machine 117, 664, 1064, 1071 vectors 1067, 1074 suppress redundancies 76 surveillance 835 survival analysis (SA) 1077 data mining (SDM) 1077 SVM 1075 SW-Cube 1034 SWAMS (software warehouse application management) 1034 symbiotic data mining 1083 data analysis 1087 symbolic descriptions 1087 objects 1087 synonymy 648 syntax 384 system LERS 974 systems analysis 234
T ‘t’ value 80 tacit 600 taxonomies 392 taxonomy 615, 1113 tree 1212 TDT (topic detection and tracking) 608 technology trends 8 temporal characteristics 870 data mining 662 recommenders 47 tenfold cross-validation 1072 termination reports 262 terrorist information awareness 268 test set 1071 text classification 13, 768 mining 278, 382, 606, 1130, 1206 -machine learning 1109 -mining 1113 textual messages 758 thematic search engine 1201 thermometer Code 980 thinning 872 threats 566 three-dimensional visualization 1193 time series 172, 991, 1125 top-down computation 197 topic drifting 444 hierarchy 1113 map authoring 1130 maps 1130 ontology 608 total information awareness 268 training set 1071 transaction record 653 -based data 325 sequences and event sequences 1098 transactional data 323 database 325 supply chain data 323 transactions 947 transferable belief model 1135 transformations 1051 tree algorithms 190 isomorphism 1143
mining 1140 pattern matching 1143 trend 1120 trialability 1162 Trie, Patricia 60 tumor classification 550 two-dimensional scatter plot 154 type a error 695 b error 695 conflict 648 I error 888 II error 888
U U.S. Army 94 Coast Guard 94, 268 Navy 94 uncertainty 834, 960, 1135 ungeneralizable patterns 263 unified theory of data 226 uniform sampling 413 units 54 univariate sampling 346 universal itemset 59 unlabeled data 1022 unstructured data 384 unsupervised data mining 811 discretization 398 learning 555, 865, 1065 mining 1155 upper bound 659 usage information 102 pattern 1151 user information 101 modeling 1218 preferences 440 profile(s) 23, 440, 1236 uses for mining medical digital libraries 280 utility(ies) 630, 930
V validation 190 value density 816 distance 816 variable precision rough set model 974 selection 735
VC-Dimension 1066 vector space model 1104 -characterizing function 519 version space 208 vertical data mining 1181 vertically partitioned data 927 Veteran administrations 268 vibration data 247 video association mining 1187 classification 1186 clustering 1186 data mining 1185 databases 966 view dependency graph 711 maintenance 718 monotonicity of a query 198 selection 717 usability problem 30 views 12 virtual environment (VE) 1193 private network 1161 reality 1190, 1193 reference 102 vision-based page 444 visual data mining 1190 visualization 995, 1190
W wavelet transformation 1196 wavelets 991, 1196 weak identifiers 303 Web analytics 1208, 1217, 1229 clickstream analysis 1231 communities 1207 content 1146 mining 800, 950, 1103, 1206 databases 443 graph mining 443 IE 443 information extraction 443 searching 1000 intelligence 995 linking 996 log 1151, 1217 file 1243 mining 950, 1103, 1146, 1151, 1201, 1206 techniques 438, 1211
pages representation 1104 personalization 1218, 1223, 1248 processes 896 robot 1228 search 438 server logs 1221 structure 1146 mining 800, 1103, 1206 usage mining 438, 800, 950, 1103, 1151,1206, 1216, 1221, 1226, 1236, 1242 -enabled e-business 785 weighted-majority 203 well-formed structures 525 wild point correction 887 window 1099 adjustment heuristic 203 wireless ad-hoc sensor grids 427 mobile networks 955 telephony 912 word cluster map (WCM) 1131 workflow management systems 896 workflows 896 workload 992 World Wide Web 59, 1211 data mining 1242 wrapper induction 13, 445, 678 wrappers 678
X XML 372, 787, 800, 965, 1015 documents 372 mining 372 sources 648 tags 384 XyDiff 801
Y y 440
Z z-score models 503