Designing a Data Warehouse: Supporting Customer Relationship Management
Chris Todman
Publisher: Prentice Hall PTR
First Edition, December 01, 2000
ISBN: 0-13-089712-4, 360 pages
Today's next-generation data warehouses are being built with a clear goal: to maximize the power of Customer Relationship Management. To make CRM-focused data warehousing work, you need new techniques and new methodologies. In Designing a Data Warehouse, Dr. Chris Todman - one of the world's leading data warehousing consultants - delivers the first start-to-finish methodology for defining, designing, and implementing CRM-focused data warehouses. Todman covers all this, and more:

- A new look at data warehouse conceptual models, logical models, and physical implementation
- Project management: deliverables, assumptions, risks, and team-building, including a full breakdown of work
- DW futures: temporal databases, OLAP SQL extensions, active decision support, integrating external and unstructured data, search agents, and more
Designing a Data Warehouse: Supporting Customer Relationship Management

List of Figures
Preface
  FIRST GENERATION DATA WAREHOUSES
  SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT
  WHO SHOULD READ THIS BOOK
Acknowledgments
1. Customer Relationship Management
  THE BUSINESS DIMENSION
  BUSINESS GOALS
  BUSINESS STRATEGY
  THE VALUE PROPOSITION
  CUSTOMER RELATIONSHIP MANAGEMENT
  SUMMARY
2. An Introduction to Data Warehousing
  INTRODUCTION
  WHAT IS A DATA WAREHOUSE?
  DIMENSIONAL ANALYSIS
  BUILDING A DATA WAREHOUSE
  PROBLEMS WHEN USING RELATIONAL DATABASES
  SUMMARY
3. Design Problems We Have to Face Up To
  DIMENSIONAL DATA MODELS
  WHAT WORKS FOR CRM
  SUMMARY
4. The Implications of Time in Data Warehousing
  THE ROLE OF TIME
  PROBLEMS INVOLVING TIME
  CAPTURING CHANGES
  FIRST-GENERATION SOLUTIONS FOR TIME
  VARIATIONS ON A THEME
  CONCLUSION TO THE REVIEW OF FIRST-GENERATION METHODS
5. The Conceptual Model
  REQUIREMENTS OF THE CONCEPTUAL MODEL
  THE IDENTIFICATION OF CHANGES TO DATA
  DOT MODELING
  DOT MODELING WORKSHOPS
  SUMMARY
6. The Logical Model
  LOGICAL MODELING
  THE IMPLEMENTATION OF RETROSPECTION
  THE USE OF THE TIME DIMENSION
  LOGICAL SCHEMA
  PERFORMANCE CONSIDERATIONS
  CHOOSING A SOLUTION
  FREQUENCY OF CHANGED DATA CAPTURE
  CONSTRAINTS
  EVALUATION AND SUMMARY OF THE LOGICAL MODEL
7. The Physical Implementation
  THE DATA WAREHOUSE ARCHITECTURE
  CRM APPLICATIONS
  BACKUP OF THE DATA
  ARCHIVAL
  EXTRACTION AND LOAD
  SUMMARY
8. Business Justification
  THE INCREMENTAL APPROACH
  THE SUBMISSION
  SUMMARY
9. Managing the Project
  INTRODUCTION
  WHAT ARE THE DELIVERABLES?
  WHAT ASSUMPTIONS AND RISKS SHOULD I INCLUDE?
  WHAT SORT OF TEAM DO I NEED?
  Summary
10. Software Products
  EXTRACTION, TRANSFORMATION, AND LOADING
  OLAP
  QUERY TOOLS
  DATA MINING
  CAMPAIGN MANAGEMENT
  PERSONALIZATION
  METADATA TOOLS
  SORTS
11. The Future
  TEMPORAL DATABASES (TEMPORAL EXTENSIONS)
  OLAP EXTENSIONS TO SQL
  ACTIVE DECISION SUPPORT
  EXTERNAL DATA
  UNSTRUCTURED DATA
  SEARCH AGENTS
  DSS AWARE APPLICATIONS
A. Wine Club Temporal Classifications
B. Dot Model for the Wine Club
C. Logical Model for the Wine Club
D. Customer Attributes
  HOUSEHOLD AND PERSONAL ATTRIBUTES
  BEHAVIORAL ATTRIBUTES
  FINANCIAL ATTRIBUTES
  EMPLOYMENT ATTRIBUTES
  INTERESTS AND HOBBY ATTRIBUTES
References
Designing a Data Warehouse: Supporting Customer Relationship Management

Library of Congress Cataloging-in-Publication Data
Todman, Chris.
Designing a data warehouse: in support of customer relationship management / Chris Todman.
p. cm.
Includes bibliographical references and index.
ISBN: 0-13-089712-4
1. Data warehousing. I. Title.
HD30.2.T498 2001
658.4/03/0285574 21  1220202534  CIP

Credits
Editorial/Production Supervisor: Kerry Reardon
Project Coordinator: Anne Trowbridge
Acquisitions Editor: Jill Pisoni
Editorial Assistant: Justin Somma
Manufacturing Buyer: Maura Zaldivar
Manufacturing Manager: Alexis Heydt
Marketing Manager: Dan DePasquale
Art Director: Gail Cocker-Bogusz
Cover Designer: Nina Scuderi
Cover Design Director: Jerry Votta
Manager HP Books: Patricia Pekary
Editorial Director, Hewlett Packard Professional
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia, Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
List of Figures

1.1 Just who are our best customers?
1.2 CRM in an organization.
1.3 The components of CRM.
1.4 The number of communication channels is growing.
2.1 Fragment of data model for the Wine Club.
2.2 Three-dimensional data cube.
2.3 Wine sales dimensional model for the Wine Club.
2.4 Data model showing multiple join paths.
2.5 The main components of a data warehouse system.
2.6 General-state transition diagram.
2.7 State transition diagram for the orders process.
2.8 Star schema showing the relationships between facts and dimensions.
2.9 Stratification of the data.
2.10 Snowflake schema for the sale of wine.
2.11 Levels of summarization in a data warehouse.
2.12 Modified data warehouse structure incorporating summary navigation and data mining.
3.1 Star schema for the Wine Club.
3.2 Third normal form version of the Wine Club dimensional model.
3.3 Confusing and intimidating hierarchy.
3.4 Common organizational hierarchy.
3.5 Star schema for the Wine Club.
3.6 Sharing information.
3.7 General model for customer details.
3.8 General model for a customer with changing circumstances.
3.9 Example model showing customer with changing circumstances.
3.10 The general model extended to include behavior.
3.11 The example model extended to include behavior.
3.12 General conceptual model for a customer-centric data warehouse.
3.13 Wine Club customer changing circumstances.
3.14 Wine Club customer behavior.
3.15 Derived segment examples for the Wine Club.
4.1 Fragment of operational data model.
4.2 Operational model with additional sales fact table.
4.3 Sales hierarchy.
4.4 Sales hierarchy with sales table attached.
4.5 Sales hierarchy showing altered relationships.
4.6 Sales hierarchy with intersection entities.
4.7 Sales hierarchy with data.
4.8 Simple general business hierarchy.
4.9 Graphical representation of temporal functions.
4.10 Types of temporal query.
4.11 Traditional resolution of m:n relationships.
4.12 Representation of temporal attributes by attaching them to the dimension.
4.13 Representation of temporal hierarchies by attaching them to the facts.
4.14 Representation of temporal attributes by attaching them to the facts.
5.1 Example of a two-dimensional report.
5.2 Example of a three-dimensional cube.
5.3 Simple multidimensional dot model.
5.4 Representation of the Wine Club using a dot model.
5.5 Customer-centric dot model.
5.6 Initial dot model for the Wine Club.
5.7 Refined dot model for the Wine Club.
5.8 Dot modeling worksheet showing Wine Club sales behavior.
5.9 Example of a hierarchy.
6.1 ER diagram showing new relationships to the time dimension.
6.2 Logical model of part of the Wine Club.
7.1 The EASI data architecture.
7.2 Metadata model for validation.
7.3 Integration layer.
7.4 Additions to the metadata model to include the source mapping layer.
7.5 Metadata model for the VIM layer.
7.6 Customer nonchanging details.
7.7 The changing circumstances part of the GCM.
7.8 Behavioral model for the Wine Club.
7.9 Data model for derived segments.
7.10 Daily partitions.
7.11 Duplicated input.
8.1 Development abstraction shown as a roadmap.
9.1 Classic waterfall approach.
9.2 Example project team structure.
10.1 Extraction, transformation, and load processing.
10.2 Typical OLAP architecture.
10.3 Descriptive field distribution.
10.4 Numeric field distribution using a histogram.
10.5 Web plot that relates gender to regions.
10.6 Rule induction for wine sales.
10.7 Example of a multiphase campaign.
Preface

The main subject of this book is data warehousing. A data warehouse is a special kind of database that, in recent years, has attracted a great deal of interest in the information technology industry. Quite a few books have been published about data warehousing generally, but very few have focused on the design of data warehouses. There are some notable exceptions, and these will be cited in this book, which concentrates, principally, on the design aspects of data warehousing.

Data warehousing is all about making information available. No one doubts the value of information, and everyone agrees that most organizations have a potential "Aladdin's Cave" of information that is locked away within their operational systems. A data warehouse can be the key that opens the door to this information.

There is strong evidence to suggest that our early foray into the field of data warehousing, what I refer to as first-generation data warehouses, has not been entirely successful. As is often the case with new ideas, especially in the information technology (IT) industry, the IT practitioners were quick to spot the potential, and they tried hard to secure the competitive advantage for their organizations that the data warehouse promised. In doing so, I believe, two points were overlooked.

The first point is that, at first sight, a data warehouse can appear to be quite a simple application. In reality it is anything but simple. Quite apart from the basic issue of sheer scale (data warehouse databases are amongst the largest on earth) and the consequent performance difficulties presented by this, the data structures are inherently more complex than the early pioneers of these systems realized. As a result, there was a tendency to over-simplify the design so that, although the database was simple to understand and use, many important questions could not be asked.

The second point is that data warehouses are unlike other operational systems in that it is not possible to define the requirements precisely. This is at odds with conventional systems, where it is the specification of requirements that drives the whole development lifecycle. Our approach to systems design is still, largely, founded on a thorough understanding of requirements - the "hard" systems approach. In data warehousing we often don't know what the problems are that we are trying to solve. Part of the role of the data warehouse should be to help organizations to understand what their problems are.

Ultimately it comes down to design and, again, there are two main points to consider. The first concerns the data warehouse itself. Just how do we ensure that the data structures will enable us to ask the difficult questions? Secondly, the hard systems approach has been shown to be too restrictive and a softer technique is required. So not only do we need to improve our design of data warehouses, we also need to improve the way in which we approach the design. It is in response to these two needs that this book has been written.
FIRST GENERATION DATA WAREHOUSES

Historically, the first-generation data warehouses were built on certain principles that were laid down by gurus in the industry. This author recognizes two great pioneers in data warehousing: Bill Inmon and Ralph Kimball. These two chaps, in my view, have done more to advance the development of data warehousing than any others. Although many claim to have been "doing data warehousing long before it was ever called data warehousing," Inmon and Kimball can realistically claim to be the founders because they alone laid down the definitions and design principles that most practitioners are aware of today. Even if their guidelines are not followed precisely, it is still common to refer to Inmon's definition of a data warehouse and Kimball's rules on slowly changing dimensions.

Chapter 2 of this book is an introduction to data warehousing. In some respects it should be regarded as a scene-setting chapter, as it introduces data warehouses from first principles by describing the following:

- Need for decision support
- How data warehouses can help
- Differences between operational systems and data warehouses
- Dimensional models
- Main components of a data warehouse

Chapter 2 lays the foundation for the evolution to the second-generation data warehouses.
SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT

Before the introduction to data warehousing, we take a look at the business issues in a kind of rough guide to customer relationship management (CRM). Data warehousing has been waiting for CRM to appear. Without it, data warehouses were still popular but, very often, the popularity was as much in the IT domain as anywhere else. The IT management was quick to see the potential of data warehouses, but the business justification was not always the main driver and this has led to the failure of some data warehouse projects. There was often a reluctance on the part of business executives to sponsor these large and expensive database development projects. Those that were sponsored by IT just didn't hit the spot. The advent of CRM changed all that. CRM cannot be practiced in business without a major source of information, which, of course, is the data warehouse raison d'être. Interest in data warehousing has been revitalized, and this time it is the business people who are firmly in the driving seat.

Having introduced the concept of CRM and described its main components, we explore, with the benefit of hindsight, the flaws in the approach to designing first-generation data warehouses and will propose a method for the next generation. We start by examining some of the design issues and pick our way carefully through the more sensitive areas in which the debate has smoldered, if not raged a little, over the past several years. One of the fundamental issues surrounds the representation of time in our design. There has been very little real support for this, which is a shame, since data warehouses are true temporal applications that have become pervasive and ubiquitous in all kinds of businesses. In formulating a solution, we reintroduce, from the mists of time, the old conceptual, logical, and physical approach to building data warehouses. There are good reasons why we should do this and, along the way, these reasons are aired.

We have a short chapter on the business justification. The message is clear. If you cannot justify the development of the data warehouse, then don't build it. No one will thank us for designing and developing a beautifully engineered, high-performing system if, ultimately, it cannot pay for itself within an appropriate time. Many data warehouses can justify themselves several times over, but some cannot. We do not want to add to the list of failed projects. Ultimately, no one benefits from this and we should be quite rigorous in the justification process.
Project management is a crucial part of a data warehouse development. The normal approach to project management doesn't work. There are many seasoned, top-drawer project managers who, in the beginning, are very uncomfortable with data warehouse projects. The uncertainty of the deliverables and the imprecise nature of the acceptance criteria send them howling for the safety net of the famous system specification. It is hoped that the chapter on project management will provide some guidance.

People who know me think I have a bit of a "down" on software products and, if I'm honest, I suppose I do. I get a little irritated when the same old query tools get dusted off and relaunched as each new thing comes along as though they are new products. Once upon a time a query tool was a query tool. Now it's a data mining product, a segmentation product, and a CRM product as well. OK, these vendors have to make a living but, as professional consultants, we have to protect our customers, particularly the gullible ones, from some of these vendors. Some of the products do add value and some, while being astronomically expensive, don't add much value at all. The chapter on software products sheds some light on the types of tools that are available, what they're good at, what they're not good at, and what the vendors won't tell you if you don't ask.
WHO SHOULD READ THIS BOOK

Although there is a significant amount of technical material in the book, the potential audience is quite wide.

For anyone wishing to learn the principles of data warehousing, Chapter 2 has been adapted from undergraduate course material. It explains, in simple terms:

- What data warehouses are
- How they are used
- The main components
- The data warehouse "jargon"

There is also a description of some of the pitfalls and problems faced in the building of data warehouses.

For consultants, the book contains a method for ensuring that the business objectives will be met. The method is a top-down approach using proven workshop techniques. There is also a chapter devoted to assisting in the building of the business justification.

For developers of data warehouses, the book contains a massive amount of material about the design, especially in the area of the data model, the treatment of time, and the conceptual, logical, and physical layers of development. The book contains a complete methodology that provides assistance at all levels in the development. The focus is on the creation of a customer-centric model that is ideal for supporting the complex requirements of customer relationship management.

For project managers there is an entire chapter that provides guidelines on the approach together with:

- Full work breakdown structure (WBS)
- Project team structure
- Skills needed
- The "gotchas"
Acknowledgments

I would like to thank my wife, Chris, for her unflagging support during the past twenty years. Chris has been a constant source of encouragement, guidance, and good counsel.

I would also like to thank Malcolm Standring and Richard Getliffe of Hewlett Packard Consulting for inviting me to join their Data Warehousing practice in 1995. Although I was already deeply involved in database systems, the role in HP has opened doors to many new and exciting possibilities. Thank you, Mike Newton and Prof. Pat Hall of the Open University, for your technical guidance over several years.

As far as the book is concerned, thanks are due to Chris, again, for helping to tighten up the grammar. Thanks especially to Jim Meyerson, of Hewlett Packard, for a rigorous technical review and helpful suggestions. Finally, I am grateful to Jill Pisoni and the guys at Prentice Hall for publishing this work.
Chapter 1. Customer Relationship Management

- THE BUSINESS DIMENSION
- BUSINESS GOALS
- BUSINESS STRATEGY
- THE VALUE PROPOSITION
- CUSTOMER RELATIONSHIP MANAGEMENT
- SUMMARY
THE BUSINESS DIMENSION

First and foremost, this book is about data warehousing. Throughout the book, we will be exploring ways of designing data warehouses with a particular emphasis on the support of a customer relationship management (CRM) strategy. This chapter provides a general introduction to CRM and its major components. Before that, however, we'll take a short detour and review what has happened in the field of data warehousing, from a business perspective.

Although data warehousing has received a somewhat mixed reception, it really has captured the imagination of business people. In fact, it has become so popular in industry that it is cited as being the highest-priority postmillennium project of more than half of Information Technology (IT) executives. It has been estimated that, as far back as 1997 (Menefee, 1998), $15 billion was spent on data warehousing worldwide. Recent forecasts (Business Wire, August 31, 1998) expect the market to grow to around $113 billion by the year 2002. A study carried out by the Meta Group (Meyer and Cannon, 1998) found that 95 percent of the companies surveyed intended to build a data warehouse.

Data warehousing is being taken so seriously by the industry that the Transaction Processing Performance Council (TPC), which has defined a set of benchmarks for general databases, introduced an additional benchmark specifically aimed at data warehousing applications known as TPC/D, followed up in 1999 by further benchmarks (TPC/H and TPC/R). As a further indication of the "coming of age of data warehousing," a consortium has developed an adjunct to the TPC benchmark called "The data warehouse challenge" as a means of assisting prospective users in the selection of products.

The benefits of building a data warehouse can be significant. For instance, increasing knowledge within an organization about customers' trends and business can provide a significant return on the investment in the warehouse. There are many documented examples of huge increases in revenue and profits as a result of decisions taken based upon information extracted from data warehouses. So if someone asked you the following question:
How many data warehouse projects ultimately are regarded as having failed?
How would you respond? Amazingly, research has shown that it's over 70 percent! This is quite staggering. Why is it happening and how will we know whether we are being successful? It's all about something we've never really been measured on in the past—business benefit.

Data warehouses are different in quite a few ways from other, let us say traditional, IT projects. In order to explain one of these differences, we have to delve back into the past a little. Another charge that has historically been leveled at the IT industry is that the solution that is finally delivered to the customer, or users, is not the solution they were expecting. This problem was caused by the methods that were used by IT departments and system integration companies. Having identified that there was a problem to be solved, they would send in a team of systems analysts to analyze the current situation, interview the users, and recommend a solution. This solution would then be built and tested by a system development team and delivered back to the users when it was finished. The system development lifecycle that was adopted consisted of a set of major steps:

1. Requirements gathering
2. Systems analysis
3. System design
4. Coding
5. System testing
6. Implementation

The problem with this approach was that each step had to be completed before the next could really begin. It has been called the waterfall approach to systems development, and its other problem was that the whole process was carried out—out of the sight of the users, who just continued with their day jobs until, one day, the systems team descended upon them with their new, completed system. This process could have taken anything from six months to two years or more.

So, when the systems team presents the new system to the users, what happens? The users say, "Oh! But this isn't what we need." And the systems project leader exclaims, "But this is what you asked for!" and it all goes a bit pear-shaped after that. Looking back, the problems and issues are clear to see. But suffice it to say there were always lots of arguments. The users were concerned that their requirements had clearly
not been understood or that they had been ignored. The systems team would get upset because they had worked hard, doing their level best to develop a quality system. Then there would be recriminations. The senior user, usually the one paying for the system, would criticize the IT manager. The systems analyst would be interrogated by the IT manager (a throat grip was sometimes involved in this) and eventually they would agree that there had been a communications misunderstanding. It is likely that the users had not fully explained their needs to the analyst.

There are two main reasons for this. The first is that the analyst may have misinterpreted the needs of the user and designed a solution that was simply inappropriate. This happened a lot and was usually the fault of the analyst, whose job it was to ensure that the needs were clearly spelled out. The second reason is more subtle and is due to the fact that businesses change over time as a kind of natural evolution. Simple things like new products in the catalog or people changing roles can cause changes in the everyday business processes. Even if the analyst and the users had not misunderstood each other, there is little hope that, after two years, the delivered system would match the needs of the people who were supposed to use it. Subsequently, the business processes would have to change again in order to accommodate the new system. After a little while, things would settle down, the "teething troubles" would be fixed, and the system would hopefully provide several years of service.

Anyway, switched-on IT managers had to figure out a way of ensuring that the users would not have any reason to complain in the future, and the problem was solved by the introduction of the now famous "system specification" or simply "system spec." Depending on the organization, this document had many different names, including system manual, design spec, design manual, system architecture, etc. The purpose of the system spec was to establish a kind of contract between the IT department, or systems integrator, and the users. The system spec contained a full and detailed description of precisely what would be delivered. Each input screen, process, and output was drawn up and included in the document. Both protagonists "signed off" the document that reflected precisely what the IT department were to deliver and, theoretically at least, also reflected what the users expected to receive. So long as the IT department delivered to the system spec, they could no longer be accused of having ignored or misunderstood the requirements of the users.

So the system spec was a document that was invented by IT as a means of protecting themselves from impossibly ungrateful users. In this respect it was successful, and it is still the cornerstone of most development methods. When data warehouses started to be developed, the developers began using their tried and trusted methodologies to help to build them. And why not? This approach of nailing down the requirements has proved to be
successful in the past, at least as far as IT departments were concerned. The problem is that data warehouses are different. Until now, IT development practitioners have been working almost exclusively on streamlining and improving business processes and business functions. These are systems that usually have predefined inputs and, to some extent at least, predefined outputs. We know that is true because the system spec said so. The traditional methods we use are sometimes referred to as "hard" systems development methods. This means that they are used to solve problems that are well defined. But, and this is where the story really starts, the requirements for a data warehouse are
never well defined! These are the softer "We think there's a problem but we're not sure what it is" type of issue. It's actually very difficult to design systems to solve this kind of problem, and our familiar "hard" systems approach is clearly inappropriate. How can we write a system specification that nails down the problem and the solution when we can't even clearly define the problem?

Unfortunately, most practitioners have not quite realized that herein lies the crux of the problem and, consequently, the users are often forced to state at least some requirements and sign the inevitable systems specification so that the "solution" can be developed. Then once the document is signed off, the development can begin as normal and the usual systems development life-cycle kicks in and away we go folks. The associated risk is that we'll finish up by contributing to the 70 percent failure statistic.

Is there a solution? Yes there is. All it takes is recognition of this soft systems issue and the development of an approach that is sympathetic to it. The original question that was posed near the beginning of this section was "How will we know whether we are being successful?" If the main reason for failure is that we didn't produce sufficient business benefit, then the answer is to focus on the business. That means what the business is trying to achieve and not the requirements of the data warehouse.
BUSINESS GOALS The phrase “Focus on what the business is trying to achieve,” refers to the overall business, or the part of the business that is engaging us in this project. This means going to the very top of the organization. Historically, it has been common for IT departments to invest in the building of data warehouses on a speculative basis, assuming that once in place the data warehouse will draw business users like bees to a honey pot. While the sentiments are laudable, this “build it and they will come” approach is generally doomed to fail from the start. The reason is that the warehouse is built around information that the IT department thinks is important, rather than the business. The main rule is quite simple. If you are hired by the CEO to solve the problem, you have to understand what the CEO is trying to achieve. If you are hired by the marketing director, then you have to find out what drives the marketing director. It all comes down to business goals. Each senior manager in an organization has goals. They may not always be written down. They may not be well known around the organization and to begin with, even the manager may not be able to articulate them clearly but they do exist. As a data warehouse practitioner, we need some extra “soft” skills and techniques to help us help our customers to express these soft system problems, and we explore this subject in detail in Chapter 5 when we build the conceptual model. So what is a business goal? Well it's usually associated with some problem that some executive has to solve. The success or failure on the part of the executive in question may be measured in terms of their ability to solve this problem. Their salary level may depend on their performance in solving this problem, and ultimately their job may depend on it as well. In short, it's the one, two, or three things that sometimes keep them awake at night. (jobwise that is). How is a business goal defined? Well, it's important to be specific. Some managers will say things like, “We need to increase market share” or “We'd like to increase our gross margin” or maybe “We have to get customer churn down.” These are not bad for a start but they aren't specific enough. Look at this one instead: “Our objective is to increase customer loyalty by 1 percent each year for the next five years.” This is a real goal from a real company and it's perfect. The properties of a good business goal are that they should be: 1. Measurable
2. Time bounded
3. Customer oriented

This helps us to answer the question of how we'll know we've been successful. The managers will know whether they have been successful if they hit their measured goal targets within the stated time scale.

Just a point about number three on the list. It is not an absolute requirement, but it is a good check. There is a kind of edict inside Hewlett Packard and it goes like this: "If you aren't doing it for a customer, don't do it!" In practice most business goals, in my experience, are customer oriented. Generally, as businesses we want to:

- Get more good customers
- Keep our better customers
- Maybe offload our worst customers
- Sell more to customers

People have started to wake up to the fact that the customer is king. It's the customer we have to identify, convince, and ultimately satisfy if we are to be really successful. It's not about products or efficient processes, although these things are important too. Without the customer we might as well stay at home. Anyway, once we know what a manager's goals are, we can start to talk strategy.
BUSINESS STRATEGY

So now we know our customer's goals. The next step in ensuring success is to find out, for each goal, exactly how they plan to achieve it. In other words, what is their business strategy? Before we begin, let's synchronize terminology. There is a risk at this point of sinking into a semantic discussion on what is strategic and what is tactical. For our purposes, a strategy is defined as one or more steps to be employed in pursuit of a business goal. After a manager has explained the business goal, it is reasonable to then follow up with the question "And what is your strategy for achieving this goal?"
THE VALUE PROPOSITION

Every organization, from the very largest down to the very smallest, has a value proposition. A company's value proposition is the thing that distinguishes its business offering from all the others in the marketplace. Most senior managers within the organization should be able to articulate their value proposition, but often they cannot. It is helpful, when dealing with these people, to discuss their business. Anything they do should be in some way relevant to the overall enhancement of the value proposition of their organization.

It is generally accepted that the value proposition of every organization falls into one of three major categories of value discipline (Treacy and Wiersema, 1993). The three categories are customer intimacy, product leadership, and operational excellence. We'll just briefly examine these.
Customer Intimacy

We call this the customer intimacy discipline because companies that operate this type of value proposition are the types of companies that really do try to understand their individual customers' needs and will try to move heaven and earth to accommodate their customers. For instance, in the retail clothing world, a bespoke tailor will know precisely how their customers like to have their clothes cut. They will specially order in the types and colors of fabric that the customer prefers and will always deal with the customer on a one-to-one, personal basis. These companies are definitely not cheap. In fact, their products are usually quite expensive compared to some, and this is because personal service is an expensive commodity. It is expensive because it usually has to be administered by highly skilled, and therefore expensive, people. However, their customers prefer to use them because they feel as though they are being properly looked after and their lives are sufficiently enriched to justify the extra cost.
Product Leadership

The product leaders are the organizations that could be described as "leading edge." Their value proposition is that they can keep you ahead of the pack. This means that they are always on the lookout for new products and new ideas that they can exploit to keep their customers interested and excited. Technology companies are an obvious example of this
type of organization, but they exist in almost every industry. Just as with the bespoke service, there is an example in the retail fashion industry. The so-called designer label clothes are a good example of the inventor type of value proposition. The people who love to buy these products appreciate the “chic-ness” bestowed upon them. Another similarity with the customer intimate service is that these products also tend to be very expensive. A great deal of research and development often goes into the production of these products and the early adopters must expect to pay a premium.
Operational Excellence

This type of organization excels at operational efficiency. They are quick, efficient, and usually cheap. Mail order companies that offer big discounts and guaranteed same-day or next-day delivery fall into this category. They have marketing slogans like "It's on time or it's on us!" If you need something in a hurry and you know what you want, these are the guys who deliver. Don't expect a tailor-made service or much in the way of after-sales support, but do expect the lowest prices in town. Is there a fashion industry equivalent? Well, there have always been mail order clothes stores. Even some of the large department stores, if they're honest with themselves, would regard themselves as operationally efficient rather than being strong on personal service or product leadership.

So are we saying that all companies must somehow be classified into one of the three groups? Well, not exactly, but all companies would tend to have a stronger affinity to one of the three categories than to the other two, and it is important for an organization to recognize where its strengths lie. The three categories have everything to do with the way in which the organization interacts routinely with its customers. It is just not possible for a company that majors on operational excellence to become a product leader or to provide a bespoke service without a major change in its internal organization and culture.

Some companies are very strong in two of the three categories, while others are working hard toward this. Marks and Spencer is a major successful retail fashion company. Traditionally its products are sold through a branch network of large department stores all over the world. It also has a growing mail order business. That would seem to place it pretty squarely in the operational excellence camp. However, recently it has opened up a completely new range of products, called "Autograph," that is aimed at providing a bespoke service to customers. Large areas of its biggest stores are being turned over to this exciting new idea. So here is one company that has already been successful with one value proposition, aiming to achieve excellence in a second. Oddly enough, the Autograph range of products has been designed, in part, by established designers, and so they might even claim to be nibbling at the edges of the product leadership category, too!
The point is this: an organization needs to understand

1. How it interacts with its customers
2. How it would like to interact with its customers

You can then start to come up with a strategy to help to improve your customer relationship management.
CUSTOMER RELATIONSHIP MANAGEMENT

The world of business is changing more rapidly than ever before. Our customers are better informed and much more demanding than ever. The loyalty of our customers is something that can no longer be taken for granted, and the loss of customers, sometimes known as customer churn, is a subject close to the heart of most business people today. It is said that it can cost up to 10 times as much to recruit a new customer as it does to retain an existing customer. The secret is to know who our customers are (you might be surprised how many organizations don't) and what it is that they need from us. If we can understand their needs, then we can offer goods and services to satisfy those needs, maybe even go a little further and start to anticipate their needs so that they feel cared for and important. We need to think about finding products for our customers instead of finding customers for our products.

Every business person is very keen to advance their share of the market and turn prospects into customers, but we must not forget that each of our customers is on someone else's list of hot prospects. If we do not satisfy their needs, there are many business people out there who will. The advent of the Internet intensifies this problem; our competitors are now just one mouse click away! And the competition is appearing from the strangest places. As an example, U.K. supermarkets traditionally sold food and household goods, maybe a little stationery and odds and ends. The banks and insurance companies were shocked when the supermarket chains started offering a very credible range of retail financial services, and it hasn't stopped there. Supermarkets now routinely sell:

- Mobile phones
- White goods
- Personal computers
- Drugs
- Designer clothes
- Hi-fi equipment

The retail supermarket chains are ideally placed to penetrate almost all markets when products or services become commodity items, which, eventually, they almost always will. They have excellent infrastructure and distribution channels, not to mention economies of scale that most organizations struggle to compete with.
They also have something else—and that is customers. The one-stop shop service, added to extremely competitive prices offered by supermarkets, is irresistible to many customers and as a result they are tending to abandon the traditional sources of these goods. The message is clear. No organization can take its customers for granted. Any business executive who feels complacent about their relationship with their customers is likely to be heading for a fall.
So What Is CRM?

Customer relationship management is a term that has become very popular. Many businesses are investing heavily in this area. What do we mean by CRM? Well, it's really a concept, a sort of cultural and attitudinal thing. However, in order to enable us to think about it in terms of systems, we need to define it. A working definition is:
CRM is a strategy for optimizing the lifetime value of customers.

Sometimes, CRM is interpreted as a soft and fluffy, cuddly sort of thing where we have to be excessively nice to all our customers and then everything will become good. This is not the case at all. Of course, at the level of the customer-facing part of our organization, courtesy, honesty, and trustworthiness are qualities that should be taken for granted. However, we are in business to make a profit. Our management and shareholders will try to see to it that we do. Studies by the First Manhattan Group have indicated that while 20 percent of a bank's customers contribute 150 percent of the profits, 40 to 50 percent of customers eliminate 50 percent of the profits. Similar studies reveal the same information in other industries, especially telecommunications. The graph in Figure 1.1 shows this. Notice too that the best (i.e., most profitable) customers are twice as likely to be tempted away to other suppliers as the average customer.

So how do we optimize the lifetime value of customers? It is all about these two things:

1. Getting to know our customers better
2. Interacting appropriately with our customers

Figure 1.1. Just who are our best customers?
During the ordinary course of business, we collect vast amounts of information about customers that, if properly analyzed, can provide a detailed insight into the circumstances and behavior of our customers. As we come to comprehend their behavior, we can begin to predict it and perhaps even influence it a little.

We are all consumers. We all know the things we like and dislike. We all know how we would like to be treated by our suppliers. Well, surprise, surprise, our customers are just like that, too! We all get annoyed by blanket mail shots that have no relevance to us. Who has not been interrupted during dinner by an indiscriminate telephone call attempting to interest us in UPVC windows or kitchen makeovers? Organizations that continue to adopt this blanket approach to marketing do not deserve to succeed and, in the future, they and their methods will disappear.

Our relationship with our customers has to be regarded more in terms of a partnership where they have a need that we can satisfy. If we show real interest in our customers and treat them as the unique creatures they are, then the likelihood is that they will be happy to continue to remain as customers. Clearly, a business that has thousands or even millions of customers cannot realistically expect to have a real personal relationship with each and every one. However, the careful interpretation of information that we routinely hold about customers can drive our behavior so that the customer feels that we understand their needs. If we can show that we
understand them and can satisfy their needs, then the likelihood is that the relationship will continue. What is needed is not blanket marketing campaigns, but targeted campaigns directed precisely at those customers who might be interested in the products on offer. The concept of personalized marketing, sometimes called "one-to-one" marketing, epitomizes the methods that are now starting to be employed generally in the marketplace, and we'll be looking at this and other components of CRM in the following sections.

It is well known that we are now firmly embedded in the age of information. As business people we have much more information about all aspects of our business than ever before. The first-generation data warehouses were born to help us capture, organize, and analyze the information to help us to make decisions about the future based on past behavior. The idea was that we would identify the data that needed to be gathered from our operational business systems, place it into the warehouse, and then ask questions of the data in order to derive valuable information. It will become clear, if indeed it is not already clear, that, in almost all cases, a data warehouse provides the foundation of a successful CRM strategy.

CRM is partly a cultural thing. It is a service that fits as a kind of layer between our ordinary products and services and our customers. It is the CRM culture within an organization that leads to our getting to know our customers better, by the accumulation of knowledge. Equally, the culture enables the appropriate interactions to be conducted between our organization and our customers. This is shown in Figure 1.2.

Figure 1.2. CRM in an organization.
Figure 1.2 shows how the various parts of an organization fit together to provide the information we need to understand our customers better and the processes to enable us to interact with our customers in an appropriate manner. The CRM culture, attitudes, and behaviors can then be built on top of this, hopefully enhancing our customers' experiences in their dealings with us.

In the remainder of this section, we will explore the various aspects of CRM. Although this book is intended to help with the design of data warehouses, it is important to understand the business dimension. This section should be enough to help you to understand the business imperatives around CRM, and you should regard it as a kind of "rough guide" to CRM. The pie diagram in Figure 1.3 shows the components of CRM.

Figure 1.3. The components of CRM.
As you can see in Figure 1.3, there are many slices in the CRM pie, so let us review some of the major ones:
Customer Loyalty and Customer Churn

Customer loyalty and customer churn are just about the most important issues facing most businesses today, especially businesses that have vast numbers of customers. This is a particular problem for:

- Telecommunications companies
- Internet service providers
- Retail financial services
- Utilities
- Retail supermarket chains
First, let us define what we mean by customer churn. In simple terms, it relates to the number of customers lost to competitors over a period of time (usually a year). These customers are said to have churned. All companies have a churn metric. This metric of churn is the number of customers lost, expressed as a percentage of the total number of active customers at the beginning of the period. So if the company had 1,000 active customers at the beginning of the year and during the year 150 of those customers defected to a competitor, then the churn metric for the company is 15 percent. Typically, the metric is calculated each month on a rolling 12-month moving average.

Some companies operate a kind of "net churn" metric. This is simply 100 minus the number of active customers at the end of the period expressed as a percentage of the number of active customers at the beginning of the period. So if the company starts the period with 1,000 customers and ends the period with 920 customers, then the calculation is:

net churn = 100 - (920 / 1,000 × 100) = 8 percent
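To make the two conventions concrete, here is a minimal sketch in Python. It is not taken from the book, and the function names are illustrative; it simply reproduces the worked figures above.

```python
def churn_rate(active_at_start: int, lost_in_period: int) -> float:
    """Gross churn: customers lost, as a percentage of active customers at the start."""
    return 100.0 * lost_in_period / active_at_start


def net_churn_rate(active_at_start: int, active_at_end: int) -> float:
    """Net churn: 100 minus the end-of-period count as a percentage of the start count.
    This goes negative when more customers are recruited than are lost."""
    return 100.0 - 100.0 * active_at_end / active_at_start


# The two worked examples from the text (treated here as separate illustrations):
print(churn_rate(1000, 150))       # prints 15.0 (percent)
print(net_churn_rate(1000, 920))   # prints 8.0 (percent)
```

A rolling 12-month figure, as mentioned above, would simply apply the same calculation to a 12-month window that moves forward each month.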
This method is sometimes favored by business executives for two reasons:

1. It's easy to calculate. All you have to do is count the number of active customers at the beginning and end of the period. You don't have to figure out how many active customers you've lost and how many you've recruited. These kinds of numbers can be hard to obtain, as we'll see in the chapters on design.

2. It hides the truth about real churn. The number presents a healthier picture than is the reality. Also, with this method it's possible to end up with negative churn if you happen to recruit more customers than you lose.

Great care must be taken when considering churn metrics. Simple counts are okay as a guide but, in themselves, they reveal nothing about your organization, your customers, or your relationship with your customers. For instance, in describing customer churn, the term active was used several times to describe customers. What does this mean? The answer might sound obvious but, astonishingly, in most organizations it's a devil of a job to establish which customers are active and which are not.

There are numerous reasons why a customer might defect to another supplier. These reasons are called churn factors. Some common churn factors are:
- The wrong value discipline. You might be providing a customer intimate style of service. However, customers who want quick delivery and low prices
are unlikely to be satisfied with the level of service you can provide in this area. Customers are unlikely to have the same value discipline requirements for all the products and services they use. For instance, one customer might prefer a customer intimacy type of supplier to do their automobile servicing but would prefer an efficient and inexpensive supplier to service their stationery orders.
- A change in circumstances. This is a very common cause of customer churn. Your customer might simply move out of your area. They may get a new job with a much bigger salary and want to trade up to a more exclusive supplier. Maybe they need to make economies and your product is one they can live without.
- Bad experience. Usually, one bad experience won't make us go elsewhere, especially if the relationship is well established. Continued bad experiences will almost certainly lead to customer churn. Bad experiences can include unkept promises, phone calls not returned, poor quality, brusque treatment, etc. It is important to monitor complaints from customers to measure the trends in bad experiences, although the behavior of customers in this respect varies from one culture to another. For instance, in the United Kingdom, it is uncommon for people to complain. They just walk.
- A better offer. This is where your competitors have you beat. These days it is easy for companies in some industries to leap-frog each other in the services they provide and, equally, it is easy for customers to move freely from one supplier to another. A good example of this is the prepay mobile phone business. When it first came out, it was attractive because there was no fixed contract, but it placed restrictions on the minimum number of calls you had to make in any one period, say, $50 per quarter. As more vendors entered this market, the restriction was driven down and down until it got to the stage where you only have to make one call in six months!

So how can you figure out which of your customers are active, which are not, and which you are at the most risk of losing? What you need is customer insight!
Customer Insight

In order to be successful at CRM, you simply have to know your customers. In fact some people define CRM as precisely that—knowing your customers. Actually it is much more
than that, but if you don't know your customers, then you cannot be successful at CRM. It's obvious really, isn't it?

In order for any relationship to be successful, both participants in the relationship have to have some level of understanding so that they can communicate successfully and in a way that is rewarding for both parties. Notice the usage of the term “both.” Ultimately, a relationship consists of two people. Although we can say that we have a relationship with our friends, what we are actually saying is that we have a relationship with each of our friends and that results in many relationships. Each one of those relationships has to be initiated and developed, sometimes with considerable care. We invest part of ourselves, sometimes a huge amount, into maintaining the relationships that we feel are the most important. In order for the relationship to be truly successful, it has to be built on a strong foundation, and that usually means knowing as much as we can about the other person in the relationship. The more we know, the better we can be at responding to the other person's needs. Each of the parties involved in a relationship has to get some return from their investment in order to justify continuing with it.

The parallels here are obvious. Customers are people, and if we are to build a sustained relationship with them, we have to make an investment. Whereas with personal relationships, the investments we make are our time and emotion and usually we do not measure these; business relationships involve time and money, and we can and should measure them. The purpose of a business relationship is profit. If we cannot profit from the business relationship, then we must consider whether the relationship is worth the investment.

So how do we know whether our relationship with a particular customer is profitable? It's all tied in with the notion of customer insight—knowing about our customers. Whereas our knowledge regarding our friends is mostly in our heads, our knowledge about customers is generally held in stored records. Every order, payment, inquiry, and complaint is a piece of a jigsaw puzzle that, collectively, describes the customer. If we can properly manage this information, then we can build a pretty good picture of our customers, their preferences and dislikes, their behavioral traits, and their personal circumstances. Once we have that picture, we can begin to develop customer insight. Let's look at some of the uses of customer insight, starting with segmentation.
Segmentation

Terms like knowing our customers and customer insight are somewhat abstract. As previously stated, a large company may have thousands or even millions of customers, and it is not possible to know each one personally in the same way as with our friends and family. A proper relationship in that sense is not a practical proposition for most
organizations. It has been tried, however. Some banks, for instance, operate “personal” account managers. These managers are responsible for a relatively small number of customers, and their mission is to build and maintain the relationships by getting to know the customers in their charge. However, this is not a service that is offered to all customers. Mostly, it applies to the more highly valued customer. Even so, this service is expensive to operate and is becoming increasingly rare.

Our customers can be divided into categorized groups called segments. This is one way of getting to know them better. We automatically divide our friends into segments, perhaps without realizing that we are doing it. For instance, friends can be segmented as:

Males or females
Work mates
Drinking buddies
Football supporters
Evening classmates
Long-standing personal friends

Clearly, there are many ways of classifying and grouping people. The way in which we do it depends on our associations and our interests. Similarly, there are many ways in which we might want to segment our customers and, as before, the way in which we choose to do it would depend on the type of business we are engaged in. There are three main types of segmentation that we can apply to customers: the customer's circumstances, their behavior, and derived information. Let's have a brief look at these three.

Circumstances
This is the term that I use to describe those aspects of the customer that relate to their personal details. Circumstances are the information that define who the customer is. Generally speaking, this type of information is customer specific and independent and has nothing to do with our relationship with the customer. It is the sort of information that any organization might wish to hold. Some obvious elements included in customer circumstances are:

Name
Date of birth
Sex
Marital status
Address
Telephone number
Occupation

A characteristic of circumstances is that they are relatively fixed. Some IT people might refer to this type of information as reference data and they would be right. It is just that
circumstances is a more accurate layperson's description. Some less obvious examples of circumstances are:

Hobbies
Ages of children
Club memberships
Political affiliations

In fact, there is almost no limit to the amount of information relating to circumstances that you can gather if you feel it would be useful. In the appendix to this book, I have included several hundred that I have encountered in the past. One retail bank, for some reason, would even like to know whether the customer's spouse keeps indoor potted plants. The fact that I have described this type of information as relatively fixed means that it may well change. The reality is that some things will change and some will not. This type of data tends to change slowly over time. For instance, official government research shows that, in the United Kingdom, approximately 10 percent of people change their address each year. A date of birth, however, will never change unless it is to correct an error. Theoretically, each of the data elements that we record about a customer could be used for segmentation purposes. It is very common to create segments of people by:

Sex
Age group
Income group
Geography

Behavioral Segmentation
Whereas a customer's circumstances tend not to relate to the relationship between us, a customer's behavior relates to their interaction with our organization. Behavior encompasses all types of interaction such as:

Purchases—the products or services that the customer has bought from us.
Payments—payments made by the customer.
Contacts—where the customer has written, telephoned, or communicated in some other way. This might be an inquiry about product, a complaint about service, perhaps a request for assistance, etc.

The kind of segmentation that could be applied to this aspect of the relationship could be:

Products purchased or groups of products. For instance, an insurance company might segment customers into major product groups such as pensions, motor insurance, household insurance, etc.
Spending category. Organizations sometimes segment customers by spending bands.
Category of complaint.

Derived Segmentation
The previous types of segmentation, relating to a customer's circumstances and their behavior, are quite straightforward to achieve because they require no interpretation of data about the customer. For instance, if you wish to segment customers depending on the amount of money they spend with you, then it is a simple matter of adding up the value of all the orders placed over the period in question, and the appropriate segment for any particular customer is immediately obvious.
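As a minimal sketch of this kind of spend-band segmentation (the Orders table, its columns, and the band boundaries here are hypothetical and are not taken from the book):

-- Assign each customer to a spending band for the period in question.
-- Orders(CustomerCode, OrderDate, OrderValue) is an assumed table.
Select CustomerCode,
       Sum(OrderValue) As AnnualSpend,
       Case
           When Sum(OrderValue) >= 1000 Then 'High'
           When Sum(OrderValue) >= 250 Then 'Medium'
           Else 'Low'
       End As SpendSegment
From Orders
Where OrderDate Between '2000-01-01' And '2000-12-31'
Group by CustomerCode

Each customer appears once, with their total spend and the band that spend places them in.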
Very often, however, we need to segment our customers in ways that require significant manipulation and interpretation of information. Such segmentation classifications may be derived from the customer's circumstances or behavior or, indeed, a combination of both. As this book is, essentially, intended to assist in the development of data warehouses for CRM, we will return to the subject of derived segmentation quite frequently in the upcoming chapters. However, some examples of derived segmentation are:
Lifetime value. Once we have recorded a representative amount of circumstantial and behavioral information about a customer, we can use it in models to assist in predicting future behavior and future value. For instance, there is a group of young people that have been given the label “young plastics” (YP). The profile of a YP is someone who has recently graduated from college and is now embarking on the first few years of their working life. They have often just landed their first job and are setting about securing the trappings of life in earnest. Their adopted lifestyle usually does not quite match their current earnings, and they are often debt laden, becoming quite dependent on credit. At first glance, these people do not look like a good proposition upon which to build a business relationship. However, their long-term prospects are, statistically, very good. Therefore, it may be more appropriate to consider the potential lifetime value of the relationship and “cut them some slack,” which means developing products and services that are designed for this particular segment of customers.
Propensity to churn. We have already talked about the problem of churn earlier in this chapter. If we can assess our customers in such a way as to grade them on a scale of, say, 1 to 10 where 1 is a safe customer and 10 is a customer who we are at risk of losing, then we would be able to modify our own behavior in our relationship with the customer so as to manage the risk.
Up-sell and cross-sell potential. By carefully analyzing customers' behavior, and possibly combining segments together, it is possible to derive models of future behavior and even potential behavior. Potential behavior is a characteristic we all have; the term describes behavior we might engage in, given the right stimulus. Advertising companies stake their very existence on this. It enables us to identify opportunities to sell more products and services (up-selling) and different products and services (cross-selling) to our customers.
Entanglement potential. This is related to up-sell and cross-sell and it applies to customers who might be encouraged to purchase a whole array of products from us. Over time, it can become increasingly difficult and bothersome for the customer to disentangle their relationship and go elsewhere. The retail financial services industry is good at this. The bank that manages our checking account encourages us to pay all our bills out of the account automatically each month. This is good because it means we don't have to remember to write the checks. Also, if the bank can persuade us to buy our house insurance, contents insurance, and maybe even our mortgage from them too, then it becomes a real big deal if we want to transfer, say, the checking account to another bank. “Householding” is another example of this and, again, it's popular with the banks. It works by enticing all the members of a family to participate in a particular product or service so that, collectively, they all benefit from special deals such as a reduced interest rate on overdrawn accounts. If any of the family members withdraws from the service, then the deal can be revoked. This of course has an effect on the remainder of the family, and so there is an inducement to remain loyal to the bank.

Sometimes there are relationships between different behavioral components that would not be spotted by analysts and knowledge workers. In order to try to uncover these relationships we have to employ different analytic techniques. The best way to do this is to employ the services of a data mining product. The technical aspects of data mining will be described later in this book, but it is important to recognize that there are other ways of defining segments. As a rather obvious example, we can all comprehend the relationship between, say, soft fruit and ice cream, but how many supermarkets place soft fruit and ice cream next to each other? If there is a statistically significant relationship such that customers who purchase soft fruit also purchase ice cream, then a data mining product would be able to detect such a relationship. As I said, this is an obvious example, but there will be others. For instance, is there a relationship between:

Vacuum cleaner bags and dog food?
Toothpaste and garlic?
Diapers and beer? (Surely not!)

As we have seen, there are many ways in which we can classify our customers into segments. Each segment provides us with opportunities that might be exploited. Of course, it is the business people that must decide whether the relationships are real or merely
coincidental. Once we have identified potential opportunities, we can devise a campaign to help us persuade our target customers of the benefits of our proposal. In order to do this, it might be helpful to employ another facet of CRM, campaign management.
Campaign Management

As I see it, there are, essentially, two approaches to marketing: defensive marketing and aggressive marketing. Defensive marketing is all about keeping what we have already. For instance, a strategy for reducing churn in our customers could be regarded as defensive because we are deploying techniques to try to keep our customers from being lured by the aggressive marketing of our competitors. Aggressive marketing, therefore, is trying to get more. By “more” we could be referring to:

Capturing more customers
Cross-selling to existing customers
Up-selling to existing customers

A well-structured strategy involving well-organized and managed campaigns is a good example of aggressive marketing. The concept of campaigns is quite simple and well known to most of us. We have all been the target of marketing campaigns at some point. There are three types of campaign:
Single-phase campaigns are usually one-off special offers. The company makes the customer, or prospect, an offer (this is often called a treatment) and if the customer accepts the offer (referred to as the response), then the campaign, in that particular case, can be regarded as having been successful. An example of this would be three months free subscription to a magazine. The publishing company is clearly hoping the customer will enjoy the magazine enough to pay for a subscription at the end of the free period.
Multi-phase campaigns, as the name suggests, involve a set of treatments instead of just one. The first treatment might be the offer of a book voucher or store voucher if the customer, say, visits a car dealer and accepts a test drive in a particular vehicle. This positive response is recorded and may be followed up by a second treatment. The second treatment could be the offer
to lend that customer the vehicle for an extended period, such as a weekend. A positive response to this would result in further treatments that, in turn, provoke further responses, the purpose ultimately being to increase the sales of a particular model.
Recurring campaigns are usually running continually. For example, if the customer is persuaded to buy the car, then, shortly after, they can expect to receive a “welcome pack” that makes them feel they have joined some exclusive club.

Campaigns can be very expensive to execute. Typically, they are operated under very tight budget constraints and are usually expected to show a profit. This means that the cost of running the campaign is expected to be recouped out of the extra profits made by the company as a result of the campaign. The problem is: How do you know whether the sale of your product was influenced by the campaign? The people who ended up buying the product might have done so without being targeted in the campaign. The approach that most organizations adopt in order to establish the efficacy of a campaign is to identify a “control” group in much the same way as is done in clinical trials and other scientific experiments. The control group is identified as a percentage of the target population. This group receives no special treatments. At the end of the campaign, the two groups are compared. If, say, 2 percent of the control group purchase the product and 5 percent of the target group purchase the product, then it is assumed that the 3 percent difference was due to the influence of the campaign. The box on the following page shows how it's figured out.

One of the big problems with campaigns, as you can see, is the minuscule responses. If the response in our example had been 4 percent instead of 5, the campaign would have shown a loss of $19,000 instead of the healthy profit we actually got. We can see that the line between profit and loss is indeed a fine line to tread. It seems obvious that careful selection of the target population is critical to success. We can't just blitz the marketplace with a carpet-bombing approach. It's all about knowing, or having a good idea, about who might be in the market to buy a car. If you think about it, this is the most important part. And the scary thing is, it has nothing to do with campaign management. It has everything to do with knowing your customers. Campaign management systems are an important component of CRM, but the most important part, identifying which customers should be targeted, is outside the scope of most campaign management systems.
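As a rough sketch of the kind of comparison involved (the CampaignMember table, its columns, and the campaign code are hypothetical; this is not the book's box):

-- Response rates for the control and target groups of one campaign.
-- Responded is assumed to hold 1 for a response and 0 otherwise.
Select GroupType,
       Count(*) As GroupSize,
       Sum(Responded) As Responses,
       100.0 * Sum(Responded) / Count(*) As ResponseRatePct
From CampaignMember
Where CampaignCode = 'SPRING01'
Group by GroupType

The incremental effect of the campaign is the target group's response rate less the control group's rate; multiplying that difference by the size of the target group and the profit per sale, and then deducting the cost of running the campaign, gives the campaign's overall contribution.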
It would be really good if we could make our campaigns a little more tailored to the individual people in our target list. If the campaign had a target of one, instead of thousands, and we could be sure that this one target is really interested in the product, our chances of success would be far greater.
Personalized Marketing

Personalized marketing is sometimes referred to as a “segment of one” or “one-to-one
marketing” and is the ultimate manifestation of having gotten to know our customers. Ideally, we know precisely:

What they need
When they need it
How much they are willing to pay for it
Then, instead of merely having a set of products that we try to sell to customers, we can begin to tailor our offerings more specifically to our customers. Some of the Internet-based companies are beginning to develop applications that recognize customers as they connect to the site and present them with “content” that is likely to be of interest to them. Different customers, therefore, will see different things on the same Web page depending on their previous visits. This means that the customer should never be presented with material that is of no interest to them. This is a major step forward in responding to customers' needs. The single biggest drawback is that this approach is currently limited to a small number of Internet-based companies. However, as other companies shift their business onto the Internet, the capability for this type of solution will grow. It is worth noting that the success of these types of application depends entirely on information. It does not matter how sophisticated the application is; it is the information that underpins it that will determine success or failure.
Customer Contact

One of the main requirements in the implementation of our CRM strategy is to get to know our customers. In order to do this we recognize the value in information. However, the information that we use tends to be collected in the routine course of daily business. For instance, we routinely store data regarding orders, invoices, and payments so that we can analyze the behavior of customers. In virtually all organizations there exists a rich source of data that is often discarded. Each time a customer, or prospective customer, contacts the organization in any way, we should be thinking about the value of the information that could be collected and used to enhance our knowledge about the customer. Let's consider for a moment the two main types of contact that we encounter every day:

1. Enquiries. Every time an existing customer or prospect makes an inquiry about our products or services, we might reasonably conclude that the customer may be interested in purchasing that product or service. How many companies keep a managed list of prospects as a result of this? If we did, we would have a ready-made list for personalized campaign purposes.
2. Complaints. Customers complain for many reasons. Usually customers complain for good reasons, and most quality companies have a purpose-built system for managing complaints. However, some customers are “serial” moaners and it would be good to know, when faced with people like this, what segments they are
classified under and whether they are profitable or loss making. If the operators dealing with the communications had access to this information, they could respond appropriately. Remember that appropriate interaction with customers is the second main requirement in the implementation of our CRM strategy. Principally, when we are considering customer contact, we tend to think automatically about telephone contact. These days, however, there is a plethora of different media that a customer might use to contact us. Figure 1.4 shows the major channels of contact information. Figure 1.4. The number of communication channels is growing.
There is quite a challenge in linking together all these channels into a cohesive system for the collection of customer contact. There is a further point to this. Each customer contact costs money for us to deal with. Remember that the overall objective for a CRM strategy is to optimize the value of a customer. Therefore, the cost of dealing with individual customers should be taken into account if we are to accurately assess the value of customers. Unfortunately, most organizations aren't able to figure out the cost of an individual customer. What tends to
happen is they sum up the total cost of customer contact and divide it by the total number of customers and use the result as customer cost. This arbitrary approach is OK for general information, but it is unsatisfactory in a CRM system where we really do want to know which are the individual customers that cost us money to service.
SUMMARY

Different people, clearly, have different perspectives of the world. Six experts will give six different views as to what is meant by customer relationship management. In this chapter I have expressed my view as to what CRM means, its definition, and why it's important to different types of businesses. Also we have explored the major components of CRM. We looked at ways in which the information that we routinely hold about customers might be used to help to support a CRM strategy. This book, essentially, is about data warehousing. Now we have to figure out how to design a data warehouse that will support the kind of questions that we need to ask about customers in order to help us be successful at CRM.
Chapter 2. An Introduction to Data Warehousing

INTRODUCTION
WHAT IS A DATA WAREHOUSE?
DIMENSIONAL ANALYSIS
BUILDING A DATA WAREHOUSE
PROBLEMS WHEN USING RELATIONAL DATABASES
SUMMARY
INTRODUCTION

In this chapter we provide an introduction to data warehousing. It is sensible as a starting point, therefore, to introduce data warehousing using “first generation” principles so that we can then go on to explore the issues in order to develop a “second generation” architecture. Data warehousing relates to a branch of a general business subject known as decision support. So in order to understand what data warehousing is all about, we must first understand the purpose of decision support systems (DSS) in general. Decision support systems have existed, in different forms, for many years. Long before the invention of any form of database management systems (DBMS), information was being extracted from applications to assist managers in the more effective running of their organizations.
So what is a decision support system? The purpose of a decision support system is to provide decision makers in organizations with information. The information advances the decision makers' knowledge in some way so as to assist them in making decisions about the organization's policies and strategy. A DSS tends to have the following characteristics:

They tend to be aimed at the less well structured, underspecified problems that more senior managers typically face.
They possess capabilities that make them easy to use by noncomputer people interactively.
They are flexible and adaptable enough to accommodate changes in the environment and decision-making approach of the user.

The job of a DSS is usually to provide a factual answer to a question phrased by the user. For instance, a sales manager would probably be concerned if her actual product sales were falling short of the target set by her boss. The question she would like to be able to ask might be:
Why are my sales not meeting my targets?

There are, as yet, no such computer systems available to answer such a question. Imagine trying to construct an SQL (Structured Query Language) query that did that! Her questioning has to be more systematic such that the DSS can give factual responses. So the first question might be:
For each product, what are the cumulative sales and targets for the year?

A DSS would respond with a list of products and the sales figures. It is likely that some of the products are ahead of target and some are behind. A well-constructed report might highlight the offending products to make them easier to see. For instance, they could be displayed in red, or flashing. She could have asked:
What are the cumulative sales and targets for the year for those products where the actual sales are less than the target?

Having discovered those products that are not achieving the target, she might ask what the company's market share is for those products, and whether the market share is decreasing. If it is, maybe it's due to a recently imposed price rise. The purpose of the DSS is to respond to ad hoc questions like these, so that the user can ultimately come to a conclusion and make a decision.

A major constraint in the development of DSS is the availability of data—that is, having access to the right data at the right time. Although the proliferation of database systems, and the proper application of the database approach, enables us to separate data from applications and provides for data independence, the provision of data represents a challenge. The introduction of sophisticated DBMSs has certainly eased the problems caused by traditional applications but, nonetheless, unavailability of data still persists as a problem for most organizations. Even today, data remains “locked away” in applications. The main reason is that most organizations evolve over time. As they do, the application systems increasingly fail to meet the functional requirements of the organization. As a result, the applications are continually being modified in order to keep up with the ever-changing business. There comes a time in the life of almost every application when it has been modified to the point where it becomes impossible or impractical to modify it further. At this point a decision is usually made to redevelop the application. When this happens, it is usual for the developers to take advantage of whatever improvements in technology have occurred during the life of the application. For instance, the original
application may have used indexed sequential files because this was the most appropriate technology of the day. Nowadays, most applications obtain their data through relational database management systems (RDBMS). However, most large organizations have dozens or even hundreds of applications. These applications reach the end of their useful lives at various times and are redeveloped on a piecemeal basis. This means that, at any point in time, an organization is running applications that use many different types of software technology. Further, large organizations usually have their systems on diverse hardware platforms. It is very common to see applications in a single company spread over the following:

Large mainframe
Several mid-range multi-processor machines
External service providers
Networked and stand-alone PCs

A DSS may need to access information from many of the applications in order to answer the questions being put to it by its users.
Introduction to the Case Study

To illustrate the issues, let us examine the operation of a (fictitious) organization that contains some of the features just described. The organization is a mail order wine club. With great originality, it is called the Wine Club. As well as its main products (wines), it also sells accessories to wines such as:

Glassware—goblets, decanters, glasses, etc.
Tableware—ice buckets, corkscrews, salvers, etc.
Literature—books and pamphlets on wine-growing regions, reviews, vintages, etc.

It has also recently branched out further into organizing trips to special events such as the Derby, the British Formula One Grand Prix, and the Boat Race. These trips generally involve the provision of a marquee in a prominent position with copious supplies of the
club's wines and a luxury buffet meal. These are mostly one-day events, but there are an increasing number of longer trips such as those that take in the French wine-growing regions by coach tour. The club's information can be modeled by an entity attribute relationship (EAR) diagram. A high-level EAR diagram of the club's data is shown in Figure 2.1.

Figure 2.1. Fragment of data model for the Wine Club.
Class(ClassCode, ClassName, Region)
Color(ColorCode, ColorDesc)
Customer(CustomerCode, CustomerName, CustomerAddress, CustomerPhone)
CustomerOrder(OrderCode, OrderDate, ShipDate, Status, TotalCost)
OrderItem(OrderCode, ItemCode, Quantity, ItemCost)
ProductGroup(GroupCode, Description)
Reservation(CustomerCode, TripCode, Date, NumberOfPeople, Price)
Shipment(ShipCode, ShipDate)
Shipper(ShipperCode, ShipperName, ShipperAddress, ShipperPhone)
Stock(LocationCode, StockOnHand)
Supplier(SupplierCode, SupplierName, SupplierAddress, SupplierPhone)
Trip(TripCode, Description, BasicCost)
TripDate(TripCode, Date, Supplement, NumberOfPlaces)
Wine(WineCode, Name, Vintage, ABV, PricePerBottle, PricePerCase)

The Wine Club has the following application systems in place:
Customer administration. This enables the club to add new customers. This is particularly important after an advertising campaign, typically in the Sunday color supplements, when many new customers join the club at the same time. It is important that the new customers' details, and their orders, are promptly dealt with in order to create a good first impression. This application also enables changes to a customer's address to be recorded, as well as removing ex-customers from the database. There are about 100,000 active customers.
Stock control. The goods inward system enables newly arrived stock to
be added to the stock records. The club carries about 2,200 different wines and 250 accessories from about 150 suppliers.
Order processing. The directors of the club place a high degree of importance on the fulfillment of customers' orders. Much emphasis is given to speed and accuracy. It is a stated policy that orders must be shipped within ten days of receipt of the order. The application systems that support order processing are designed to enable orders to be recorded swiftly so that they can be fulfilled within the required time. The club processes about 750,000 orders per year, with an average of 4.5 items per order.
Shipments. Once an order has been picked, it is packed and placed in a pre-designated part of the dispatch area. Several shipments are made every day.
Trip bookings. This is a new system that records customer bookings for planned trips. It operates quite independently of the other systems, although it shares the customer information held in the customer administration system.

The club's systems have evolved over time and have been developed using different technologies. The order processing and shipments systems are based on indexed-sequential files accessed by COBOL programs. The customer administration system is held on a relational database. All these systems are executed on the same mid-range computer. The stock control system is a software package that runs on a PC network. The trip bookings system is held on a single PC that runs a PC-based relational database system.

There is a general feeling among the directors and senior managers that the club is losing its market share. Within the past three months, two more clubs have been formed and their presence in the market is already being felt. Also, recently, more customers than usual appear to be leaving the club and new customers are being attracted in fewer numbers than before. The directors have held meetings to discuss the situation. The information upon which the discussions are based is largely anecdotal. They are all certain that a problem exists but find it impossible to quantify. They also know that helpful information passes through their systems and should be available to answer questions. In reality, however, while it is not too difficult to get answers
to the day-to-day operational questions, it is almost impossible to get answers to more strategic questions.
Strategic and Operational Information

It is very important to understand the difference between the terms strategic and
operational. In general, strategic matters deal with planning and policy making, and this is where a data warehouse can help. For instance, in the Wine Club the decision as to when a new product should be launched would be regarded as a strategic decision. Examples pertaining to other types of organization include:

When a telecommunications company decides to introduce very cheap off-peak tariffs to attract callers away from the peak times, rather than install extra equipment to cope with increasing demand.
A large supermarket chain deciding to open its stores on Sundays.
A general 20 percent price reduction for one month in order to increase market share.

Whereas strategic matters relate to planning and policy, operational matters are generally more concerned with the day-to-day running of a business or organization. Operations can be regarded as the implementation of the organization's strategy (its policies and plans). The day-to-day ordering of supplies, satisfying customers' orders, and hiring new employees are examples of operational procedures. These procedures are usually supported by computer applications and, therefore, they must be able to provide answers to operational questions such as:

How many unfulfilled orders are there?
On which items are we out of stock?
What is the position on a particular order?

Typically, operational systems are quite good at answering questions like these because
they are questions about the situation as it exists right now. You could add the words right
now to the end of each of those questions and they would still make sense. Questions such as these arise out of the normal operation of the organization. The sort of questions the directors of the Wine Club wish to ask are:

1. Which product lines are increasing in popularity and which are decreasing?
2. Which product lines are seasonal?
3. Which customers place the same orders on a regular basis?
4. Are some products more popular in different parts of the country?
5. Do customers tend to purchase a particular class of product?

These, clearly, are not “right now” types of questions and, typically, operational systems are not good at answering such questions. Why is this? The answer lies in the nature of operational systems. They are developed to support the operational requirements of the organization. Let's examine the operational systems of the Wine Club and see what they actually do. Each application's role in the organization can usually be expressed in one or two sentences. The customer administration system contains details of current customers. The stock control system contains details of the stock currently held. The order processing system holds details of unfulfilled customer orders and the shipments system records details of fulfilled orders awaiting delivery to the customers. Notice the use of words like details and current in those descriptions. They underline the “right now” nature of operational systems. You could say that the operational systems represent a “snapshot” of an organization at a point in time. The values held are constantly changing. At any point in time, dozens or even hundreds of inserts, updates and deletes may be executing on all, or any, parts of the systems. If you were to freeze the systems momentarily, then they would provide an accurate reflection of the state of the organization at precisely that moment. One second earlier, or one second later, the situation would have changed.

Now let us examine the five questions that the directors of the Wine Club need to ask in
order to reach decisions about their future strategy. What is it that the five questions have in common? If you look closely you will see that each of the five questions is concerned with sales of
products over time. Looking at the first question:
Which product lines are increasing in popularity and which are decreasing?

This is obviously a sensible strategic business question. Depending on the answer, the directors might:

Expand their range of some products and shrink their range of other products
Offer a financial incentive on some products, such as reduced prices or discounts for volume purchases
Enhance the promotional or advertising techniques for the products that are decreasing in popularity

For the moment, let's focus on sales of wine and assess whether the information required to ask such a question is available to the directors. Have a look back at the EAR diagram at the beginning of the case study. Remember we are looking for “sales of products over time.” The only way we can assess whether a product line is increasing or decreasing in popularity is to trace its demand over time. If the order processing information was held in a relational database, we could devise an SQL query such as:
Select c.Name, Sum(b.Quantity), Sum(b.ItemCost) Sales
From CustomerOrder a, OrderItem b, Wine c
Where a.OrderCode = b.OrderCode
And b.ItemCode = c.WineCode
And a.OrderDate = <today's date>
Group by c.Name
If this query were to be executed at the end of the day, it would return the value of all the orders received for the day. The ability for us to discover the value of orders received today is a good start. This is useful information, but what we would really like to do is to examine the trend over (say) the past six months, or to compare this month with the same month last year. So what is the solution? Let's say we were to execute our query every day and append the results for each day to a table. That way, over time, we would be able to build up the historical information that we need. This is the beginning of a data warehouse.
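As a sketch of that idea (the SalesHistory table is an assumption, and <today's date> is a placeholder just as in the query above, not a piece of the Wine Club's actual design):

-- Run once per day: append today's sales by wine to a history table.
Create Table SalesHistory
(
    SalesDate Date,
    Name      Varchar(50),
    Quantity  Integer,
    Sales     Decimal(12,2)
);

Insert Into SalesHistory (SalesDate, Name, Quantity, Sales)
Select a.OrderDate, c.Name, Sum(b.Quantity), Sum(b.ItemCost)
From CustomerOrder a, OrderItem b, Wine c
Where a.OrderCode = b.OrderCode
And b.ItemCode = c.WineCode
And a.OrderDate = <today's date>
Group by a.OrderDate, c.Name

Over time the SalesHistory table accumulates one row per wine per day, which is exactly the “sales of products over time” that the directors' questions require.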
WHAT IS A DATA WAREHOUSE?

We will start by giving a definition of data warehousing, and then we'll go on to explore the definition by comparing and contrasting the features of a data warehouse with the features of operational systems that we have already described. The accepted definition of a data warehouse (attributed to Bill Inmon, 1992) is a database that contains the following four characteristics:

1. Subject oriented
2. Nonvolatile
3. Integrated
4. Time variant
Subject oriented means that the data is organized around subjects (such as Sales) rather than operational applications (such as order processing). Operational databases are organized around business applications; they are
application oriented. Recall the five queries that the directors have identified as examples of the types of questions they would like to ask of their data. We concluded that they are concerned with sales of products over time. The subject area in our case study is clearly “sales.”
Nonvolatile means that the data, once placed in the warehouse, is not usually subject to change. Anyone who is using the database has confidence that a query will always produce the same result no matter how often it is run. Operational databases are extremely volatile in that they are constantly changing. A query is unlikely to produce the same result twice if it is accessing tables which are frequently updated.
Integrated means the data is consistent. For instance, dates are always stored in the same format. Integration is a problem for most organizations, particularly where there are many different types of technology in use.
Some differences are quite fundamental, such as the character set. Most systems use the ASCII (American Standard Code for Information Interchange) character set, but some do not. IBM, which is one of the largest computer manufacturers in the world, bases all of its mainframe systems and many of its midrange systems on a totally different character set called EBCDIC (Extended Binary Coded Decimal Interchange Code). So the letter “P” has a decimal value of 80 in ASCII but is 215 in EBCDIC (the character with a value of 80 in EBCDIC is “&”). The word “Pool” in ASCII translates to “&??%” in EBCDIC, and it's difficult to imagine anything less integrated than this. Other differences are more subtle, such as dates. Most DBMSs have a “Date” data type (although the storage format is different from one DBMS to another), whereas access methods such as indexed sequential have no such facility. Even more subtle differences occur within different applications within the same technology. This occurs where, for instance, one application designer decides to hold customer addresses as five columns of 25 characters each, whereas another might use a Varchar(100) format. Before data is allowed to enter the data warehouse, it must be integrated. So integration is a process through which the data passes after it leaves the application database and before it enters the warehouse database.
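As a small illustration of the kind of transformation involved (the SourceCustomer table, its five address columns, and the use of standard SQL string concatenation are assumptions for this sketch):

-- Integration step: fold a source system's five fixed-length address columns
-- into the single address column used in the warehouse.
Select CustomerCode,
       Trim(AddressLine1) || ' ' || Trim(AddressLine2) || ' ' ||
       Trim(AddressLine3) || ' ' || Trim(AddressLine4) || ' ' ||
       Trim(AddressLine5) As CustomerAddress
From SourceCustomer

Date formats and character sets are handled in the same spirit: every source value is converted to the single representation chosen for the warehouse before it is loaded.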
Time variant means that historical data is recorded. Almost all queries executed against a data warehouse have some element of time associated within them. We have already established that most operational systems do not retain historical information. It is almost impossible to predict what will happen in the future without observing what happened in the past. A data warehouse helps to address this fundamental issue by adding a historical dimension to the data taken from the operational databases. One of the most successful techniques in designing data warehouses is called “dimensional analysis and dimensional modeling.” We will now move on to examine this technique.
DIMENSIONAL ANALYSIS

One approach to data warehouse design is to develop and implement a “dimensional model.” This has given rise to dimensional analysis (sometimes generalized as multi-dimensional analysis). It was noticed quite early on when data warehouses started to be developed that, whenever decision makers were asked to describe the kinds of questions they would like to get answers to regarding their organizations, they almost always wanted the following:

Summarized information with the ability to break the summaries into more detail
Analysis of the summarized information across their own organizational components such as “departments” or “regions”
Ability to “slice and dice” the information in any way they chose
Display of the information in both graphical and tabular form
Capability to view their information over time

So as an example, they might wish to see a report showing Wine Sales by Product, or a report showing Sales by Customer, or even Sales by Product by Customer. Table 2.1 shows a typical report of sales by product.
Table 2.1. The Wine Club—Analysis of Sales by Product for July.

Product Name   Quantity Sold (Cases)   Cost Price   Selling Price   Total Revenue   Total Cost   Gross Profit
Chianti        321                     26.63        42.95           13,787          8,548        5,239
Bardolino      1,775                   15.10        31.35           55,646          26,802       28,844
Barolo         275                     46.72        70.95           19,511          12,848       6,663
Lambrusco      1,105                   23.25        41.45           45,802          25,691       20,111
Valpolicella   2,475                   12.88        32.45           80,313          31,878       48,435
This dimensional approach led Ted Codd to make the following observation:
There are typically a number of different dimensions from which a given pool of data can be analyzed. This plural perspective, or multidimensional conceptual view, appears to be the way most business persons naturally view their enterprise.
—E. F. Codd, 1993

So the concept of dimensional analysis became a method for defining data warehouses. The approach is to determine, by interviewing the appropriate decision makers in an organization, which is the subject area that they are most interested in, and which are the most important dimensions of analysis. Recall that one of the characteristics of a data warehouse is that it is subject oriented. The subject area reflects the subject-oriented nature of the warehouse. In the example above, the subject area would be Sales. The dimensions of analysis would be Customers and Products. The requirement is to analyze sales by customer and sales by product. This requirement is depicted in the following three-dimensional cube. Figure 2.2 shows Sales (the shaded area) having axes of:

1. Customer
2. Product
3. Time

Figure 2.2. Three-dimensional data cube.
Notice that time has not been examined so far. Time is regarded as a necessary dimension of analysis (recall that time variance is another characteristic of data warehouses) and so is always included as one of the dimensions of analysis. This means that Sales can be analyzed by Customer by Product over Time. So each element of the cube (each minicube) contains a value for sales to a particular customer, of a particular product, at a particular point in time. The multidimensional cube in Figure 2.2 shows sales as the subject with three dimensions of analysis. There is no real limit to the number of dimensions that can be used in a dimensional model, although there is, of course, a limit to the number of dimensions we can draw! Now let's return to the Wine Club. The directors of the Wine Club need answers to questions about sales. Looking back at the five example questions, they all concerned sales: Sales by Product, Sales by Customer, Sales by Area. As in the example above, the subject area for their data warehouse is clearly “Sales.” So what are the dimensions of analysis? Well, we've just mentioned three:
1. Product
2. Customer
3. Area

So is that it? Not quite; we must not forget the “Time” dimension. So now we have it, a subject area and four dimensions of analysis. As we cannot draw four-dimensional models, we can represent the conceptual dimensional model as shown in Figure 2.3.

Figure 2.3. Wine sales dimensional model for the Wine Club.
The diagram in Figure 2.3 is often referred to as a Star Schema because the diagram loosely resembles a star shape. The subject area is the center of the star and the dimensions of analysis form the points of the star. The subject area is often drawn long and thin because the table itself is usually long and thin in that it contains a small number of columns but a very large number of rows. The Star Schema is the most commonly used diagram for dimensional models.
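To make the shape of such a schema concrete, here is a minimal sketch of how the Wine Club star might look as relational tables (the column lists are assumptions for illustration only; the book develops the actual design in later chapters):

-- Four dimension tables and one long, thin fact table keyed on all four dimensions.
Create Table TimeDim     (TimeCode Integer Primary Key, CalendarDate Date, MonthName Varchar(10), CalendarYear Integer);
Create Table CustomerDim (CustomerCode Integer Primary Key, CustomerName Varchar(50));
Create Table WineDim     (WineCode Integer Primary Key, Name Varchar(50), Vintage Integer);
Create Table AreaDim     (AreaCode Integer Primary Key, AreaName Varchar(30));

Create Table SalesFact
(
    TimeCode     Integer References TimeDim,
    CustomerCode Integer References CustomerDim,
    WineCode     Integer References WineDim,
    AreaCode     Integer References AreaDim,
    Quantity     Integer,
    Revenue      Decimal(12,2)
);

Each row of SalesFact records the sales of one wine, to one customer, in one area, on one date, which is why the fact table grows long and thin while the dimension tables stay small.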
BUILDING A DATA WAREHOUSE

Now that the Wine Club has decided to build a data warehouse, it must further decide:

1. Where is it to be placed?
2. What technology should be used?
3. How will it be populated?

These are problems experienced by most organizations when they attempt to build a data warehouse. As we have said, the business applications have been designed using technology and methodologies that were appropriate when they were developed. At the time, no consideration was given to decision support. The applications were designed to satisfy the business operational requirements. The Wine Club must draw its warehouse data from three very different software environments:

1. Indexed sequential
2. RDBMS
3. Third-party package

Also, these applications reside on two very different hardware platforms (PCs and mid-range computers). The question is:
Are any of the current hardware/software environments appropriate for the data warehouse (in particular, could the data warehouse be kept in an existing database), or should they introduce another set of technologies?

Usually, the warehouse is developed separately from any other application, in a database of its own. One obvious reason for this is, as stated above, due to the disparate nature of the source systems. There are other reasons, however:

1. There is a conflict between operational systems and data warehouse systems in that operational systems are organized around operational requirements and are not subject oriented.
2. Operational systems are constantly changing, whereas decision support systems are “quiet” (nonvolatile).
3. Operational systems schemas are often very large and complex, where any two tables might have multiple join paths. These complex relationships exist because the business processes require them. It follows that a knowledge worker writing queries has more than one option when attempting to join two tables. Each possible join path will produce a different result. The decision support schema allows only one join path.

An example may help to explain this. Have a look at the entity relationship diagram in Figure 2.4. The example is deliberately unrealistic in order to illustrate the point. The question we have been asked by the directors is:
Which products are more popular in the Northwest? This is a perfectly reasonable question. But how do we answer it? Assuming the data is stored in a relational database, we have to construct an SQL query that joins the “Area” table to the “Product” table in some way. Figure 2.4. Data model showing multiple join paths.
Consider the following questions:

1. How many join paths are there between area and product?
2. Which is the most appropriate?
3. If six people were asked to answer this question, would they always choose the same path?
4. Will all the join paths produce the same result?

The first three questions are kind of rhetorical. The answer to question four is no. For instance, if any of the tables contains no rows, then any join path including that table will
not return any rows. So how does the decision maker know that the result is correct? Another reason for keeping the systems separate is that operational systems do not normally hold historical data. While this is a design issue and, clearly, these systems could hold such data, the effect on performance is not likely to be acceptable to the operational system users. It is usually acceptable for a strategic question to take several minutes, or even hours, to be answered. It would not usually be acceptable for operational system updates to take more than a few seconds.
In any case, most applications that use the database are not developed with historical reporting in mind. This means that, in order to introduce time variance, the applications would have to be modified, sometimes quite substantially. It is usually the case that organizations would not be prepared to tolerate the disruption this would cause. We will be exploring the issues surrounding time in very great detail in Chapter 4.

To answer the question about which is the best technology upon which to build a data warehouse, the RDBMS has emerged as a de facto standard. Part of the reason for this is that, as we have shown, data warehouses are usually designed using dimensional modeling and the relational model supports the dimensional model very well. This is something we will review in Chapter 3.

There are several components to a data warehouse. Some are concerned with getting the information out of the source systems and into the warehouse, while others are concerned with getting the information out of the warehouse and presenting it to the users. Figure 2.5 shows the main components.

Figure 2.5. The main components of a data warehouse system.
We now examine each of the components of the warehouse model, starting from the bottom.
The Extraction Component
We have established that the subject area for our Wine Club data warehouse is sales and we know that we wish to analyze sales by product, customer, area and, of course, time. The first problem is to extract information about sales from the source systems so that the sales can be inserted into the warehouse. In order to insert details about sales into the warehouse, we must have a mechanism for identifying a sale. Having identified the sale, it must be captured and stored so that it may be placed into the warehouse at the appropriate time. Care must be taken to establish what is meant by the term sale.
If you look carefully at the original EAR model for the Wine Club, you will see that there is no attribute in the system that in any way refers to sales. This is a common problem in data warehouse developments. Clearly, the Wine Club does sell things. So how do we identify when a sale has occurred? The answer is—we ask. The question “What is a sale?” has to be asked and a definition of sale must be clearly stated and agreed on by everyone before the data can be extracted. You will find that there are usually several different views and each is equally valid, such as:
A salesperson will normally consider that a sale has occurred when the order has been received from a customer.
A stock manager will record a sale upon receipt of a notice to ship an order.
An accountant doesn't usually recognize a sale until an invoice has been raised against an order, usually after it has been shipped.
Although these views are different, there is a common thread, which is that they are all related to the Order entity. What happens is that customers' orders, like all entities, have a lifecycle. At each stage in its lifecycle, the order has a status, and a typical order moves logically from one status to the next until its lifecycle is complete and it is removed from the system.
The three different views of a sale, described above, are simply focusing on different states of the same entity (the order). This problem can usually be solved by constructing a state transition diagram, which traces the lifecycle of the entity and defines the events that enable the entity to move from one logical state to the next logical state. Figure 2.6 shows the general form of a state transition diagram. Note that, although the diagram shows four states and five transitions, in practice there are no limits to the numbers of states or transitions. Figure 2.6. General state transition diagram.
The state transition diagram for an order in the Wine Club is shown in Figure 2.7. Figure 2.7. State transition diagram for the orders process.
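Although the lifecycle is usually pictured as a diagram, it can also be made explicit in the database as a small table of permitted transitions. The sketch below is illustrative only; the status names are assumptions based on the three views of a sale described above, not a transcription of Figure 2.7.

-- Each row records one permitted move through the order lifecycle.
Create Table Order_Status_Transition (
    FromStatus Varchar(12) Not Null,
    ToStatus   Varchar(12) Not Null,
    Primary Key (FromStatus, ToStatus)
);

Insert Into Order_Status_Transition Values ('Received', 'Complete');
Insert Into Order_Status_Transition Values ('Complete', 'Shipped');
Insert Into Order_Status_Transition Values ('Shipped',  'Invoiced');
Insert Into Order_Status_Transition Values ('Invoiced', 'Closed');

An update trigger, or the application itself, can then reject any status change that does not appear in this table, and the same status-change events can be used to drive the data capture described below.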
The Wine Club directors have decided that a sale occurs when an order is shipped to a customer. This means that whenever an order changes its status from complete to shipped, that is the point at which the order must be recorded for subsequent insertion into the data warehouse. The only way to achieve this is to make some change to the application so that the change in status is recognized and captured. Most organizations are not prepared to consider redeveloping their operational systems in order to introduce a data warehouse. Instead, the application software has to be modified to capture the required data into temporary stores so that they may be placed into the warehouse subsequently. In the case of third-generation language (3GL) based applications such as the Wine Club's order processing and shipment systems, the changes are implemented by altering the programs so that they create temporary data stores as, say, sequential files. As you may know, relational database management systems often provide the capability for “triggers” to be stored in the database. These triggers are table-related SQL statements that are executed (sometimes expressed as fired, hence the name trigger) each time a row is inserted, updated, or deleted. We can create triggers to examine the status of an order each time the order is updated. As soon as the status indicates that a sale has occurred, the trigger logic can record the order in a temporary table for subsequent collection into the warehouse.
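As a flavor of what such a trigger might look like, here is a minimal sketch. Trigger syntax differs between RDBMS products, and the staging table Sales_Capture, the Orders and Order_Item table names, and the status values are assumptions made for the illustration.

Create Trigger Order_Shipped_Capture
After Update On Orders
Referencing Old Row As O New Row As N
For Each Row
When (O.Status = 'Complete' And N.Status = 'Shipped')
    -- Record one row per order item in a temporary store for later loading.
    Insert Into Sales_Capture
           (CustomerCode, WineCode, OrderDate, ShipDate, Quantity, ItemCost)
    Select N.CustomerCode, I.WineCode, N.OrderDate, N.ShipDate, I.Quantity, I.ItemCost
    From   Order_Item I
    Where  I.OrderNumber = N.OrderNumber;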
The use of packaged applications sometimes requires the original developers to become involved to make the required changes so that the data is available to the data warehouse.
The order entity does have a status column. So the application programs are altered to add a record (for each order item) into a temporary sequential file each time an order changes its status from complete to shipped. The data to be stored are as follows:
Customer code
Wine code
Order date
Ship date
Quantity
Item cost
Every order is captured when, and only when, its status changes from complete to shipped. That completes the extraction of the sales information. It is not, however, the end of the extraction process. We still have to extract the information relating to the dimensions of analysis, the points of the star in the star schema.
The customer information is held in a relational database. So obtaining an initial file of customers is a relatively straightforward exercise. The tables in question would be copied to a file using the RDBMS's export utility program. Obtaining details of new customers and changes to existing customers is slightly more tricky but is probably best accomplished by the use of database triggers as described above. Each time a customer's details are updated, the update trigger is fired and we can record the change (sometimes called the delta) into a sequential file for subsequent transfer into the warehouse.
You might be wondering, at this point, why we are bothering to do this at all. You may be thinking that there is a perfectly usable customers' table sitting in the database already. Phrases like unnecessary and uncontrolled duplication of data, that is, data redundancy, may well be coming to mind. These are perfectly correct and reasonable concerns and we will address them shortly.
Similarly, we have not discussed the capture of, and actions for, deleting customers' details. See if you can work out why. If not, don't worry because we will be coming to that later as well. We also need to capture information relating to products. This is all held on PCs within a third-party software product. Extracting this information can vary from trivial to impossible depending on the quality of the software and the attitude of the software company involved. Usually, some arrangement can be made whereby the software company either alters its product or provides us with sufficient information so that we can write some additional programs to capture the data we need. Lastly we need information about “Areas.” Recall that one of the dimensions of analysis is the area in which the customers live. There is no such attribute in the system at present. Therefore, some additional work will be needed to obtain this information. There are a number of outside agencies that specialize in providing geographic and demographic information. Their customers are often market research organizations and insurance companies that make heavy use of such statistics. So the Wine Club could obtain the information from outside. Alternatively, another program could be written to process the customers' addresses and analyze them into major towns, or counties or post code groups, etc. The extraction part of the data warehouse is now complete.
The Integration Component
Some of the issues relating to integration were described in the definition of data warehouses. There are, in fact, two main aspects of integration:
1. Format integration
2. Semantic integration
Format Integration
The issue of format integration is concerned mainly with ensuring that domain integrity is restored where it has been lost. In most organizations, there are usually many cases where attributes in one system have different formats to the same, or similar attributes in other systems. For example:
Bank account numbers or telephone numbers might be stored as type “String” in one system and type “Numeric” in others.
Sex might be stored as “male”/“female”, “m”/“f”, “M”/“F”, or even 1/0.
Dates, as previously described, can be held in many formats, including “ddmmyy,” “ddmmyyyy,” “yymmdd,” “yyyymmdd,” “dd mon yy,” and “dd mon yyyy.” These are just a few examples. Some systems store dates as a time stamp that is accurate to thousandths of a second. Others use an integer that is the number of seconds from a particular time, for example, 1 Jan 1900.
Monetary attributes are also a problem. Some systems store money as integer values and expect the application to insert the decimal points. Others have embedded decimal places.
In different systems, differing sizes are used for string values such as names, addresses, and product descriptions.
Format integration mismatches are very common, and this is especially true when data is extracted from systems where any of the following is true:
1. The underlying hardware is different.
2. The operating system is different.
3. The application software is different.
The integration procedure consists of a series of rules that are designed to ensure that the data that is loaded into a data warehouse is standardized. So all the dates have the same format, monetary values are always represented the same way, and so on.
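To give a flavor of what such rules look like in practice, a load routine might standardize a source extract with expressions along these lines. This is a sketch only; the staging table, the source codings, and the target formats are assumptions, and in practice each product's own conversion functions would be used for the date.

-- Standardize sex codes, a ddmmyyyy character date, and a money-in-pence value
-- from a staging table into the agreed warehouse formats.
Select CustomerCode,
       Case
           When Upper(Sex) In ('M', 'MALE', '1')   Then 'M'
           When Upper(Sex) In ('F', 'FEMALE', '0') Then 'F'
           Else Null
       End As Sex,
       Cast(Substring(BirthDate From 5 For 4) || '-' ||
            Substring(BirthDate From 3 For 2) || '-' ||
            Substring(BirthDate From 1 For 2) As Date) As BirthDate,  -- assumes ISO strings cast to dates
       Cast(SalaryPence As Decimal(10,2)) / 100 As Salary
From   Staging_Customer;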
Why is this important? Imagine you are attempting to use the data warehouse and you want a list of all employees grouped by sex, age, and average salary where none of those attributes were properly standardized. It would be a difficult enough task for an experienced programmer to undertake. It would be next to impossible for a non-IT person. Also, a data warehouse accepts information from a variety of sources into a single table.
This is feasible only if the data is of a consistent type and format. The integration rule set is used as a kind of map that specifies how information that has been extracted from the source systems has to be converted before it is allowed into the data warehouse.
Semantic Integration
As you know, semantics concerns the meaning of data. Data warehouses draw their data from many different operational systems. Typically, different operational systems are used by different people in an organization. This is logical since financial systems are most likely to be used by the accounts department, whereas stock control systems will be used by warehouse (real warehouse, not data warehouse) staff. Think back to the discussion we had about what a sale is? This kind of ambiguity exists in all organizations and can be very confusing to a database analyst trying to understand the kinds of information being held in systems. The problem is compounded because, often, the users of the systems and the information are unaware of the problem. In everyday conversations, the fact that they are unknowingly discussing different things may not be obvious to them, and most of the time there are no serious repercussions. In building a data warehouse, we do not have the luxury of being able to ignore these usually subtle differences in semantics because the information produced by the queries that exercise the data warehouse will be used to support decision making at the highest levels in the organization. It is vital, therefore, that each item of data that is inserted into the warehouse has a precise meaning that is understood by everyone. To this end, a data warehouse must provide a catalog of information that precisely describes each attribute in the warehouse. The catalog is part of the warehouse and is, in fact, a repository of data that describes other data. This “data about data” is usually referred to as metadata. The subject of metadata is important and is described in more detail later on.
The Warehouse Database
Once the information has been extracted and integrated, it can be inserted into the data warehouse. The idea is that each day, the information for that day is added to the previous data, thereby extending the history by another day. For anyone looking at the database, it is rather like an archaeologist standing on the beach next to a cliff face. The strata that you
can see reflect layers of time covering perhaps millions of years, with each layer having its own story to tell. The data in the data warehouse is built up over time in much the same way. Each layer is a reflection of the organization at a particular point in time. So how do we actually design the data warehouse database? The dimensional data model used to describe the data requirements was the star schema. Previously I said that the dimensional model was well supported by the relational model. We can create a relational schema that directly reflects the star schema. The center of the star schema becomes a relational table, as does each of the dimensions of analysis. The center table is called the fact table because it contains all the facts (i.e., the sales) over which most of the queries will be executed. The dimensions become, simply, dimension tables. The star schema could be interpreted as several tables (the dimensions) each having a one-to-many relationship with the fact table. We can take the original star schema and attach the “crows feet” to show this, as in Figure 2.8. Figure 2.8. Star schema showing the relationships between facts and dimensions.
The star schema translates directly into a set of relations, one for the facts and one for each dimension:
Sales(CustomerCode, WineCode, AreaCode, OrderTimeCode, Quantity, ItemCost)
Customer(CustomerCode, CustomerName, CustomerAddress)
Wine(WineCode, Name, Vintage, ABV, PricePerBottle, PricePerCase)
Area(AreaCode, AreaDescription)
Time(TimeCode, Date, PeriodNumber, QuarterNumber, Year)
Before we continue, it is worth explaining a few things about the diagram.
1. There is no need to record the relationships and their descriptions. In a star schema there is an implicit relationship between the facts and each of the dimensions.
2. The logical identifier of the facts is a composite identifier comprising all the identifiers of the dimensions.
3. Notice the introduction of a time dimension table. The reasoning behind this is connected with the previous point about access to the fact table being driven from the dimension tables.
There are some special characteristics about the kind of data held in the fact table. We have described the primary-key attributes, but what about the non-primary-key attributes, Quantity and ItemCost? These are the real facts in the fact table. We'll now explore this in more detail. Figure 2.9 shows the stratified nature of the data as described previously.
Figure 2.9. Stratification of the data.
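For concreteness, this logical schema could be realized with DDL along the following lines. This is only a sketch: the column data types and sizes are assumptions, and names such as Time, Date, and Year are reserved words in several SQL dialects and may need quoting or renaming.

Create Table Customer (
    CustomerCode    Integer      Not Null Primary Key,
    CustomerName    Varchar(60),
    CustomerAddress Varchar(120)
);

Create Table Wine (
    WineCode       Integer       Not Null Primary Key,
    Name           Varchar(60),
    Vintage        Integer,
    ABV            Decimal(4,2),
    PricePerBottle Decimal(8,2),
    PricePerCase   Decimal(8,2)
);

Create Table Area (
    AreaCode        Integer      Not Null Primary Key,
    AreaDescription Varchar(60)
);

Create Table Time (
    TimeCode      Integer Not Null Primary Key,
    Date          Date,
    PeriodNumber  Integer,    -- month and year, as used in the later queries
    QuarterNumber Char(6),    -- e.g., 'Q12001'
    Year          Integer
);

-- The fact table: its primary key is the combination of all the dimension keys.
Create Table Sales (
    CustomerCode  Integer Not Null References Customer (CustomerCode),
    WineCode      Integer Not Null References Wine (WineCode),
    AreaCode      Integer Not Null References Area (AreaCode),
    OrderTimeCode Integer Not Null References Time (TimeCode),
    Quantity      Integer,
    ItemCost      Decimal(8,2),
    Primary Key (CustomerCode, WineCode, AreaCode, OrderTimeCode)
);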
You can visualize lots and lots of rows each containing the composite key, a quantity, and item cost. How many rows will there be? Well, according to the original description, the data volumes are:
Over 10 years, allowing for 10 percent growth per annum, the fact table would contain some 62 million rows! Although this is not large in data warehousing terms, it is still a sizable amount of data. The point is, individual rows in themselves are not very meaningful in a system that is designed to answer strategic questions. In order to get meaningful information about trends, etc., the database needs to be able, as we've said before, to answer questions about sales over time. For instance, to obtain the report analyzing sales by product, the individual sales for a particular product have to be added together to form a total for the product. This means that most queries will need to access hundreds, thousands, or even millions of rows. What this means, in effect, is that the data in the fact table or, more precisely, the facts in
the fact table (Quantity and ItemCost) will almost always have some kind of set function applied to them. Which of the standard column functions do you think will be the most heavily used in SQL queries against the fact table? The answer is: the sum function. Almost all queries will use the summation function. Let's look once more at those famous five questions and see whether our data warehouse can answer them. Note that questions posed by the users of the data warehouse must usually be “interpreted” before they can be translated into an SQL query. Question 1: Which product lines are increasing in popularity and which are decreasing?
Select   Name, PeriodNumber, Sum(Quantity)
From     Sales S, Wine W, Time T
Where    S.WineCode = W.WineCode
And      S.OrderTimeCode = T.TimeCode
And      T.PeriodNumber Between 012001 And 122002
Group by Name, PeriodNumber
Order by Name, PeriodNumber

This query will result in a list of products; each line will show the name, the period, and the total quantity sold in that period. If the result set were placed into a graph, the products that were increasing and decreasing in popularity would be clearly visible.

Question 2: Which product lines are seasonal?
Select   Name, QuarterNumber, Sum(Quantity), Sum(ItemCost)
From     Sales S, Wine W, Time T
Where    S.WineCode = W.WineCode
And      S.OrderTimeCode = T.TimeCode
And      T.QuarterNumber Between 'Q12001' And 'Q42002'
Group by Name, QuarterNumber
Order by Name, QuarterNumber
Question 3: Which customers place the same orders on a regular basis?
This query shows each customer and the name of the wine, where the customer has ordered the same wine more than once:
Select   CustomerName, WineName, Count(*) As "Total Orders"
From     Sales S, Customer C, Wine W
Where    S.CustomerCode = C.CustomerCode
And      S.WineCode = W.WineCode
Group by CustomerName, WineName
Having   Count(*) > 1
Order by CustomerName, WineName

Question 4: Are some products more popular in different parts of the country?
This query shows, for each wine, both the number of orders and the total number of bottles ordered by area.
Select   WineName, AreaDescription, Count(*) "Total Orders", Sum(Quantity) "Total Bottles"
From     Sales S, Wine W, Area A, Time T
Where    S.WineCode = W.WineCode
And      S.AreaCode = A.AreaCode
And      S.OrderTimeCode = T.TimeCode
And      T.PeriodNumber Between 012001 And 122002
Group by WineName, AreaDescription
Order by WineName, AreaDescription

Question 5: Do customers tend to purchase a particular class of product?
This query presents us with a problem. There is no reference to the class of wine in the data warehouse. Information relating to classes does exist in the original EAR model. So it seems that the star schema is incomplete. What we have to do is extend the schema as shown in Figure 2.10. Figure 2.10. Snowflake schema for the sale of wine.
Of course the Class information has to undergo the extraction and integration processing before it can be inserted into the database. A foreign key constraint must be included in the Wine table to refer to the Class table. The query can now be coded:
Select   CustomerName, ClassName, Sum(Quantity) "TotalBottles"
From     Sales S, Wine W, Customer Cu, Class Cl, Time T
Where    S.WineCode = W.WineCode
And      S.CustomerCode = Cu.CustomerCode
And      W.ClassCode = Cl.ClassCode
And      S.OrderTimeCode = T.TimeCode
And      T.PeriodNumber Between 012001 And 122002
Group by CustomerName, ClassName
Having   Sum(Quantity) > 2 * (Select Avg(Quantity)
                              From   Sales S, Wine W, Class C, Time T
                              Where  S.WineCode = W.WineCode
                              And    W.ClassCode = C.ClassCode
                              And    S.OrderTimeCode = T.TimeCode
                              And    T.PeriodNumber Between 012001 And 122002)
Order by CustomerName, ClassName

The query lists all customers and classes of wines where the customer has ordered that class of wine at more than twice the average quantity for all classes of wine. There are other ways that the query could be phrased. It is always a good idea to ask the directors precisely how they would define their questions in business terms before translating the question into an SQL query.
There are any number of ways the directors can question the data warehouse in order to answer their strategic business questions. We have shown that the data warehouse supports those types of questions in a way in which the operational applications could never hope to do.
The queries show very clearly that the arithmetic functions such as AVG() and particularly SUM() are used in just about every case. Therefore, a golden rule with respect to fact tables can be defined:
The nonkey columns in the fact table must be summable. Data attributes such as Quantity and ItemCost are summable, whereas text columns such as descriptions are not summable. Unfortunately, it is not as straightforward as it seems. Care must be taken to ensure that the summation is meaningful. In some attributes the summation is meaningful only across
certain dimensions. For instance, ItemCost can be summed by product, customer, area, and time with meaningful results. Quantity sold can be summed by product but might be regarded as meaningless across other dimensions. Although this problem applies to the Wine Club, it is much more easily explained in a different organization such as a supermarket. While it is reasonable to sum sales revenue across products (e.g., the revenue from sales of apples added to the revenue from sales of oranges and other fresh fruit each contribute toward the sum of revenue for fresh fruit), adding the quantity of apples sold to the quantity of oranges sold produces a meaningless
result. Attributes that are summable across some dimensions, but not all dimensions, are referred to as semisummable attributes. Clearly they have a valuable role to play in a data warehouse, but their usage must be restricted to avoid the generation of invalid results. So have we now completed the data warehouse design? Well not quite. Remember that the fact table may grow to more than 62 million rows over time. There is the possibility, therefore, that a query might have to trawl through every single row of the fact table in order to answer a particular question. In fact, it is very likely that many queries will require a large percentage of the rows, if not the whole table, to be taken into account. How long will it take to do that? The answer is - quite a long time. Some queries are quite complex, involving multiple join paths, and this will seriously increase the time taken for the result set to be presented back to the user, perhaps to several hours. The problem is exacerbated when several people are using the system at the same time, each with a complex query to run. If you were to join the 62-million row fact table to the customer table and the wine table, how many rows would the Cartesian product contain?
In principle, there is no need for rapid responses to strategic queries, as they are very different from the kind of day-to-day queries that are executed while someone is hanging on the end of the telephone waiting for a response. In fact, it could be argued that, previously, the answer was impossible to obtain, so even if the query took several days to execute, it would still be worth it. That doesn't mean we shouldn't do what we can as designers to try to speed things up as much as possible. Indexes might help, but in a great deal of cases the queries will need to access more than half the data, and indexes are much less efficient in those cases than a full sequential scan of the tables. No, the answer lies in summaries. Remember we said that almost all queries would be summing large numbers of rows together and returning a result set with a smaller number of rows. Well if we can predict, to
some degree, the types of queries the users will mostly be executing, we can prepare some summarized fact tables so that the users can access those if they happen to satisfy the requirements of the query. Where the aggregates don't supply the required data, then the user can still access the detail. If we question the users closely enough we should be able to come up with a set, maybe half a dozen or so, of summarized fact tables. The star schema and the snowflake principles still apply, but the result is that we have several fact tables instead of just one. It should be emphasized that this is a physical design consideration only. Its only purpose is to improve the performance of the queries. Some examples of summarization for the Wine Club might be:
Customers by wine for each month
Customers by wine for each quarter
Wine by area for each month
Wine by area for each quarter
Notice that the above examples are summarizing over time. There are other summaries, and you may like to try to think of some, but summarizing over time is a very common practice in data warehouses. Figure 2.11 shows the levels of summarization commonly in use.
Figure 2.11. Levels of summarization in a data warehouse.
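Summary fact tables of this kind can often be built straight from the detail with a single aggregation query. A minimal sketch follows, using the fact and dimension tables shown earlier; the summary table name is illustrative, and the CREATE TABLE ... AS SELECT syntax varies between products (some use SELECT INTO or a separate INSERT).

-- Build a "wine by area by month" summary from the detail fact table.
-- PeriodNumber is assumed to identify a calendar month, as in the earlier queries.
Create Table Wine_by_Area_by_Month As
Select   S.WineCode,
         S.AreaCode,
         T.PeriodNumber,
         Sum(S.Quantity) As TotalQuantity,
         Sum(S.ItemCost) As TotalValue
From     Sales S, Time T
Where    S.OrderTimeCode = T.TimeCode
Group by S.WineCode, S.AreaCode, T.PeriodNumber;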
One technique that is very useful to people using the data warehouse is the ability to drill
down from one summary level to a lower, more detailed level. For instance, you might observe that a certain range of products was doing particularly well or particularly badly. By drilling down to individual products, you can see whether the whole range or maybe just one isolated product is affected. Conversely, the ability to drill up would enable you to make sure, if you found one product performing badly, that the whole range is not affected. The ability to drill down and drill up is a powerful reporting capability provided by a data warehouse where summarization is used.
The usage of the data warehouse must be monitored to ensure that the summaries are being used by the queries that are exercising the database. If it is found that they are not being used, then they should be dropped and replaced by others that are of more use.
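In query terms, drilling is largely a matter of swapping the level of the time (or other) dimension that appears in the grouping. A sketch, using the schema from earlier (the quarter value is illustrative):

-- Sales of each wine by quarter (the higher, drilled-up level)...
Select   Name, QuarterNumber, Sum(Quantity) As TotalBottles
From     Sales S, Wine W, Time T
Where    S.WineCode = W.WineCode
And      S.OrderTimeCode = T.TimeCode
Group by Name, QuarterNumber;

-- ...and the same wines drilled down to monthly level for one quarter of interest.
Select   Name, PeriodNumber, Sum(Quantity) As TotalBottles
From     Sales S, Wine W, Time T
Where    S.WineCode = W.WineCode
And      S.OrderTimeCode = T.TimeCode
And      T.QuarterNumber = 'Q12001'
Group by Name, PeriodNumber;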
Summary Navigation
The introduction of summaries raises some questions:
1. How do users, especially noncomputer professionals, know which summaries are available and how to take advantage of them?
2. How do we monitor which summaries are, in fact, being used?
One solution is to use a summary navigation tool. A summary navigator is an additional layer of software, usually a third-party product, that sits between the user interface (the presentation layer) and the database. The summary navigator receives the SQL query from the user and examines it to establish which columns are required and the level of summarization needed.
How do summary navigators work? This is a prime example of the use of metadata. Remember, metadata is data about data. Summary navigators hold their own metadata within the data warehouse (or in a database separate from the warehouse). The metadata is used to provide a “mapping” between the queries formulated by the users and the data warehouse itself. Tables 2.2 and 2.3 are example metadata tables.
Table 2.2. Available Summary Tables for Aggregate Navigation (the Summary_Tables table)

Table_Name                      DW_Column
Sales_by_Customer_by_Year       Sales
Sales_by_Customer_by_Year       Customer
Sales_by_Customer_by_Year       Year
Sales_by_Customer_by_Quarter    Sales
Sales_by_Customer_by_Quarter    Customer
Sales_by_Customer_by_Quarter    Quarter

Table 2.3. Metadata Mapping Table for Aggregate Navigation (the Column_Map table)

User_Column    User_Value    DW_Column    DW_Value    Rating
Year           2001          Year         2001        100
Year           2001          Quarter      Q1_2001     80
Year           2001          Quarter      Q2_2001     80
Year           2001          Quarter      Q3_2001     80
Year           2001          Quarter      Q4_2001     80
The Summary_Tables table contains a list of all the summaries that exist in the data warehouse, together with the columns contained within them. The Column_Map table provides a mapping between the columns specified in the user's query and the columns that are available from the warehouse. Let's look at an example of how it works. We will assume that the user wants to see the sum of sales for each customer for 2001. The simple way to do this is to formulate the following query:
Select   CustomerName, Sum(Sales) "Total Sales"
From     Sales S, Customer C, Time T
Where    S.CustomerCode = C.CustomerCode
And      S.TimeCode = T.TimeCode
And      T.Year = 2001
Group By C.CustomerName

As we know, this query would look at every row in the detailed Sales fact table in order to produce the result set and would very likely take a long time to execute. If, on the other hand, the summary navigator stepped in and grabbed the query before it was passed through to the RDBMS, it could redirect the query to the summary table called "Sales_by_Customer_by_Year." It does this by:
1. Checking that all the columns needed are present in the summary table. Note that this includes columns in the "Where" clause that are not necessarily required in the result set (such as "Year" in this case).
2. Checking whether there is a translation to be done between what the user has typed and what the summary is expecting.
In this particular case, no translation was necessary, because the summary table "Sales_by_Customer_by_Year" contained all the necessary columns. So the resultant query would be:
Select   CustomerName, Sum(Sales) "Total Sales"
From     Sales_by_Customer_by_Year S, Customer C
Where    S.CustomerCode = C.CustomerCode
And      S.Year = 2001
Group By C.CustomerName

If, however, "Sales_by_Customer_by_Year" did not exist as an aggregate table (but "Sales_by_Customer_by_Quarter" did), then the summary navigator would have more work to do. It would see that Sales by Customer was available and would have to refer to the Column_Map table to see if the "Year" column could be derived. The Column_Map table shows that, when the user types "Year = 2001," this can be translated to:
Quarter in ("Q1_2001," "Q2_2001," "Q3_2001," "Q4_2001") So, in the absence of “Sales_by_Customer_by_Year,” the query would be reconstructed as follows:
Select   CustomerName, Sum(Sales) "Total Sales"
From     Sales_by_Customer_by_Quarter S, Customer C
Where    S.CustomerCode = C.CustomerCode
And      S.Quarter In ('Q1_2001', 'Q2_2001', 'Q3_2001', 'Q4_2001')
Group By C.CustomerName

Notice that the Column_Map table has a rating column. This tells the summary navigator that "Sales_by_Customer_by_Year" is summarized to a higher level than "Sales_by_Customer_by_Quarter" because it has a higher rating. This directs the summary navigator to select the most efficient path to satisfying the query.
You may think that the summary navigator itself adds an overhead to the overall processing time involved in answering queries, and you would be right. Typically, however, the added overhead is on the order of a few seconds, which is a price worth paying for the 1000-fold improvements in performance that can be achieved using this technique.
We opened this section with two questions. The first asked how users, especially noncomputer professionals, know which aggregates are available and how to take advantage of them. It is interesting to note that, where summary navigation is used, the users never know which fact table their queries are actually using. This means that they don't need to know which summaries are available or how to take advantage of them. If "Sales_by_Customer_by_Year" were dropped, the summary navigator would automatically switch to using "Sales_by_Customer_by_Quarter."
The second question asked how we monitor which summaries are being used. Again, this is simple when you have a summary navigator. As it formulates the actual queries to be executed against the data warehouse, it knows which summary tables are being used and can record that information. Not only that, it can record:
The types of queries that are being run, to provide statistics so that new summaries can be built
Response times
Which users use the system most frequently
All kinds of useful statistics can be stored. Where does the summary navigator store this information? In its metadata tables.
As a footnote to summary navigation, it is worth mentioning that several of the major RDBMS vendors have expressed the intention of building summary navigation into their products. This development has been triggered by the enormous growth in data warehousing over the past few years.
Presentation of Information
The final component of a data warehouse is the method of presentation. This is how the warehouse is presented to the users. Most data warehouse implementations adopt a client-server configuration. The concept of client-server, for our purposes, can be viewed as the separation of the users from the warehouse in that the users will normally be using a personal computer and the data warehouse will reside on a remote host. The connection between the machines is controlled by a computer network.
There are very many client products available for accessing relational databases, many of which you may already be familiar with. Most of these products help the user by using the RDBMS schema tables to generate SQL. Similarly, most have the capability to present the results in various forms such as textual reports, pie charts, scatter diagrams, and two- and three-dimensional bar charts. The choice is enormous. Most products are now available on Web servers so that all the users need is a Web browser to display their information.
There are, however, some specialized analysis techniques that have largely come about since the invention of data warehouses. The presence of large volumes of time-variant data, hitherto unavailable, has allowed the development of a new process called data
mining. In our exploration into data warehousing and the ways in which it helps with decision support, the onus has always been placed on the user of the warehouse to formulate the queries and to spot any patterns in the results. This leads to more searching questions being asked as more information is returned. Data mining is a technique where the technology does more of the work. The users
describe the data to the data mining product by identifying the data types and the ranges of valid values. The data mining product is then launched at the database and, by applying standard pattern recognition algorithms, is able to present details of patterns in the data that the user may not be aware of. Figure 2.12 shows how a data mining tool fits into the data warehouse model. Figure 2.12. Modified data warehouse structure incorporating summary navigation and data mining.
The technique has been used very successfully in the insurance industry, where a
particular insurance company wanted to decrease the number of proposals for life assurance that had to be referred to the company for approval. A data mining program was applied to the data warehouse and reported that men between the ages of 30 and 40 whose height to weight ratio was within a certain range had an increased risk probability of just 0.015. The company immediately included this profile into their automatic underwriting system, thereby increasing the level of automatic underwriting from 50 percent to 70 percent. Even with a data warehouse, it would probably have taken a human “data miner” a long time to spot that pattern because a human would follow logical paths, whereas the data mining program is simply searching for patterns.
PROBLEMS WHEN USING RELATIONAL DATABASES
It has been stated that the relational model supports the requirements of data warehousing, and it does. There are, however, a number of areas where the relational model struggles to cope. As we are coming to the end of our introduction to data warehousing, we'll conclude with a brief look at some of these issues.
Problems Involving Time
Time variance is one of the most important characteristics of data warehouses. In the section on “Building the Data Warehouse,” we commented on the fact that there appeared to be a certain amount of data redundancy in the warehouse because we were duplicating some of the information, for example, customers' details, which existed in the operational systems. The reason we have to do this is because of the need to record information over
time. As an example, when a customer changes address we would expect that change to be recorded in the operational database. When we do that we lose the old address. So when a query is next executed where that customer's details are included, any sales of wine, for that customer, will automatically be attributed to the new address. If we are investigating sales by area, the results will be incorrect (assuming the customer moved to a different area) because many of the sales were made when the customer was in another area. That's also the reason why we don't delete customers' details from the data warehouse simply because they are no longer customers. If they have placed any orders at all, then they have to remain within the system. True temporal models are very complex and are not well supported at the moment. We have to introduce techniques such as “start dates” and “end dates” to ensure that the data warehouse returns accurate results. The problems surrounding the representation of time in data warehousing are many. They are fully explored in Chapter 4.
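To make the start-date/end-date idea concrete, here is a minimal sketch. The Customer_History table and its columns are illustrative assumptions (in particular, the customer's area is carried on the customer row for the purpose of the example); the point is simply that each sale is joined to the version of the customer that was current when the order was placed.

-- Each customer row is a version that applies for a period of time.
Create Table Customer_History (
    CustomerCode Integer Not Null,
    StartDate    Date    Not Null,   -- first day this version applies
    EndDate      Date,               -- null for the current version
    CustomerName Varchar(60),
    AreaCode     Integer,
    Primary Key (CustomerCode, StartDate)
);

-- Attribute each sale to the area the customer lived in at the time of the order.
Select   C.AreaCode, Sum(S.ItemCost) As TotalValue
From     Sales S, Customer_History C, Time T
Where    S.CustomerCode = C.CustomerCode
And      S.OrderTimeCode = T.TimeCode
And      T.Date >= C.StartDate
And      (C.EndDate Is Null Or T.Date < C.EndDate)
Group by C.AreaCode;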
Problems With SQL
SQL is based on set theory. It treats tables as sets and returns its results in the form of a set. There are cases where the use of procedural logic would improve functionality and performance.
Ranking/Top (n)
While it is possible to get ranked output from SQL, it is difficult to do. It involves a correlated subquery, which is beyond the capability of most SQL users. It is also very time-consuming to execute. Some individual RDBMS vendors provide additional features to enable these types of queries to be executed, but they are not standardized. So what works on one RDBMS probably won't work on others.
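To illustrate the kind of correlated subquery involved, suppose a small summary table Sales_by_Wine(WineCode, TotalBottles) has already been built (an assumption for this sketch). The three best-selling wines could then be listed as follows, although ties can push the result past three rows and the inner query re-scans the table for every candidate row:

Select   W.WineCode, W.TotalBottles
From     Sales_by_Wine W
Where    3 > (Select Count(*)
              From   Sales_by_Wine W2
              Where  W2.TotalBottles > W.TotalBottles)  -- wines that outsell this one
Order by W.TotalBottles Desc;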
Top n Percent
It is not practically possible, for instance, to get a list of the top 10 percent of customers who place the most orders.
Running Balances
It is impossible, in practical terms, to get a report containing a running balance using standard SQL. If you are not clear what a running balance is, it's like a bank statement that lists the payments in one column, receipts in a second column, and the balance, as modified by the receipts and payments, in a third or subsequent column.
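For the record, a running balance can be simulated in pure SQL with a self-join, but the result rather proves the point: the statement below re-aggregates the whole history for every output row and is impractical on anything large. The Account_Txn table is invented for the illustration.

-- Account_Txn(TxnDate, TxnNo, Amount): positive amounts are receipts, negative are payments.
Select   T.TxnDate, T.TxnNo, T.Amount, Sum(T2.Amount) As RunningBalance
From     Account_Txn T, Account_Txn T2
Where    T2.TxnDate < T.TxnDate
Or      (T2.TxnDate = T.TxnDate And T2.TxnNo <= T.TxnNo)
Group by T.TxnDate, T.TxnNo, T.Amount
Order by T.TxnDate, T.TxnNo;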
Complex Arithmetic
Standard SQL provides basic arithmetic functions but does not support more complex functions. The different RDBMS vendors supply their own augmentations, but these vary. For instance, if it is required to raise a number by a power, in some systems the power has to be an integer, while in others it can be a decimal. Although data warehouses are used for the production of statistics, standard statistical formulas such as deviations and quartiles, as well as standard mathematical modeling techniques such as integral and differential calculus, are not available in SQL.
Variables
Variables cannot be included in a standard SQL query.
Almost all of these and other deficiencies can be resolved by writing 3GL programs in languages such as C or COBOL with embedded SQL. Also, most RDBMS vendors provide a procedural extension to their standard SQL product to assist in resolving the problems. However, the standard interface between the products that are available at the presentation layer and the RDBMS is a standard called ODBC, which stands for open database connectivity. ODBC, and the more recent JDBC (Java database connectivity), are very useful because they have forced the industry to adopt a standard approach. They do not, at the time of this writing, support the procedural extensions that the RDBMS vendors have provided. It is worth noting that some of these issues are being tackled. We explore the future in Chapter 11.
SUMMARY
Data warehouses are a special type of database built for the specific purpose of getting information out rather than putting data in, which is the purpose of most application databases. The emphasis is on supporting questions of a strategic nature, to assist the managers of organizations in planning for the future. A data warehouse is:
Subject oriented
Nonvolatile
Integrated
Time variant
Dimensional analysis is a technique used in identifying the requirements of a data warehouse, and this is often depicted using a star schema. The star schema identifies the facts and the dimensions of analysis. A fact is an attribute, such as sales value or call duration, which is analyzed across dimensions. Dimensions are things like customers and products over which the facts are analyzed. A typical query might be:
Show me the sales value of products by customer for this month and last month.
Time is always a dimension of analysis.
The data warehouse is almost always kept separate from the application databases because:
1. Application databases are optimized to execute insert and update type queries, whereas data warehouses are optimized to execute select type queries.
2. Application databases are constantly changing, whereas data warehouses are quiet (nonvolatile).
3. Application databases have large and complex schemas, whereas data warehouses are simplified, often denormalized, structures.
4. Data warehouses need historical information, and this is usually missing from application databases.
There are five main components to a first-generation data warehouse:
1. Extraction of the source data from a variety of application databases. These source applications are often using very different technology.
2. Integration of the data. There are two types of integration. First there is format integration, where logically similar data types (e.g., dates) are converted so that they have the same physical data type. Second, semantic integration, so that the meaning of the information is consistent.
3. The database itself. The data warehouse database can become enormous as a new layer of fact data is added each day. The star schema is implemented as a series of tables. The fact table (the center of the star) is long and thin in that it usually has a large number of rows and a small number of columns. The fact columns must be summable. The dimension tables (the points of the star) are joined to the fact table through foreign keys. Where a dimension participates in a hierarchy, the model is sometimes referred to as a snowflake.
4. Aggregate navigation is a technique which enables the users to have their queries automatically directed at aggregate tables without them being aware that it is happening. This is very important for query performance.
5. Presentation of information. This is how the information is presented to the users of the data warehouse. Most implementations opt for a client-server approach, which gives them the capability to view their information in a variety of tabular or graphical formats.
Data warehouses are also useful data sources for applications such as data mining, which are software products that scan large databases searching for patterns and reporting the results back to the users. We review products in Chapter 10.
There are some problems that have to be overcome, such as the use of time. Care has to be taken to ensure that the facts in the data warehouse are correctly reported with respect to time. We explore the problems surrounding time in Chapter 4.
Also, many of the queries that users typically like to ask of a data warehouse cannot easily be translated into standard SQL queries, and work-arounds have to be used, such as procedural programs with embedded SQL.
Chapter 3. Design Problems We Have to Face Up To
In this chapter, we will be reviewing the traditional approaches to designing data warehouses. During the review we will investigate whether or not these methods are still appropriate now that the business imperatives have been identified. We begin this chapter by picking up on the introduction to data warehousing.
DIMENSIONAL DATA MODELS
In Chapter 2, we introduced data warehousing and described, at a high level, how we might approach a design. The approach we adopted follows the style of some of the major luminaries in the development of data warehousing generally. This approach to design can be given the general description of dimensional. Star schemas and snowflake schemas are both examples of a dimensional data model. The dimensional approach was adopted in our introduction for the following reasons:
Dimensional data models are easy to understand. Therefore, they provide an ideal introduction to the subject.
They are unambiguous.
They reflect the way that business people perceive their businesses.
Most RDBMS products now provide direct support for dimensional models.
Research shows that almost all the literature supports the dimensional approach.
Unfortunately, although the dimensional model is generally acclaimed, there are alternative approaches, and this has tended to result in “religious” wars (but no deaths as far as I know). Even within the dimensional model camp there are differences of opinion. Some people believe that a perfect star should be used in all cases, while others prefer to see the hierarchies in the dimensions and would tend to opt for a snowflake design.
Where deep-rooted preferences exist, it is not the purpose of this book to try to make “road to Damascus” style conversions by making nonbelievers “see the light.” Instead, it is intended to present some ideas and systematic arguments so that readers of this book can make their own architectural decisions based on a sound understanding of the facts of the matter. In any case, I believe there are far more serious design issues that we have to consider once we have overcome these apparent points of principle. Principles aside, we have also to consider any additional demands that customer relationship management might place on the data architecture.
A good objective for this chapter would be to devise a general high-level data architecture for data warehousing. In doing so, we'll discuss the following issues:
1. Dimensional versus third normal form (3NF) models
2. Stars versus snowflakes
3. What works for CRM
Dimensional Versus 3NF
There are two principal arguments in favor of dimensional models:
1. They are easy for business people to understand and use.
2. Retrieval performance is generally good.
The claim about ease of understanding and ease of use is not in dispute. It is a fact that business people can understand and use dimensional models. Most business people can operate spreadsheets, and a dimensional model can be likened to a multidimensional spreadsheet. We'll be exploring this in Chapter 5 when we start to investigate the dot modeling methodology.
The issue surrounding performance is just as clear cut. The main RDBMS vendors have all tweaked their query optimizers to enable them to recognize and execute dimensional queries more efficiently, and so performance is bound to be good in most instances. Even so, where the dimension tables are massively large, as the customer dimension can be, joins between such tables and an even bigger fact table can be problematic. But this is not a problem that is peculiar to dimensional models. 3NF data structures are optimized for very quick insertion, update, and deletion of discrete data items. They are not optimized for massive extractions of data, and it is nonsensical to argue for a 3NF solution on the grounds of retrieval performance.
What Is Data Normalization?
Normalization is a process that aims to eliminate the unnecessary and uncontrolled duplication of data, often referred to as 'data redundancy'. A detailed examination of normalization is not within the scope of this book. However, a brief overview might be helpful (for more detail see Bruce, 1992, or Batini et al., 1992). Normalization enables data structures to be made to conform to a set of well-defined rules. There are several levels of normalization and these are referred to as first normal form (1NF), second normal form (2NF), third normal form (3NF), and so on. There are exceptions, such as Boyce-Codd normal form (BCNF), but we won't be covering these. Also, we won't explore 4NF and 5NF as, for most purposes, an understanding of the levels up to 3NF is sufficient. In relational theory there exists a rule called the entity integrity rule. This rule concerns the primary key of any given relation and assigns to the key the following two properties:
1. Uniqueness. This ensures that all the rows in a relational table can be uniquely identified.
2. Minimality. The key will consist of one or more attributes. The minimality property ensures that the length of the key is no longer than is necessary to ensure that the first property, uniqueness, is guaranteed.
Within any relation, there are dependencies between the key attributes and the nonkey attributes. Take the following Order relation as an example:
Order
  Order number (Primary key)
  Item number (Primary key)
  Order date
  Customer ID
  Product ID
  Product description
  Quantity
Dependencies can be expressed in the form of “determinant” rules, as follows:
1. The Order Number determines the Customer ID.
2. The Order Number and Item Number determine the Product ID.
3. The Order Number and Item Number determine the Product Description.
4. The Order Number and Item Number determine the Quantity.
5. The Order Number determines the Order Date.
Notice that some of the items are functionally dependent on Order Number (part of the key), whereas others are functionally dependent on the combination of both the Order Number and the Item Number (the entire key). Where the dependency exists on the entire key, the dependency is said to be a fully functional dependency. Where all the attributes have at least a functional dependency on the primary key, the relation is said to be in 1NF. This is the case in our example. Where all the attributes are engaged in a fully functional relationship with the primary key, the relation is said to be in 2NF. In order to change our relation to 2NF, we have to split some of the attributes into a separate relation as follows:
Order
  Order number (Primary key)
  Order date
  Customer ID
Order Item
  Order number (Primary key)
  Item number (Primary key)
  Product ID
  Product description
  Quantity
These relations are now in 2NF, since all the nonkey attributes are fully functionally dependent on their primary key. There is one relationship that we have not picked up so far: the Product ID determines the Product Description. This is known as a transitive dependency because the Product ID itself can be determined by the combination of Order Number and Item Number (see dependency 2 above). In order for a relation to be classified as a 3NF relation, all transitive dependencies must be removed. So now we have three relations, all of which are in 3NF:
Order
  Order number (Primary key)
  Order date
  Customer ID
Order Item
  Order number (Primary key)
  Item number (Primary key)
  Product ID
  Quantity
Product
  Product ID (Primary key)
  Product description
There is one major advantage in a 3NF solution and that is flexibility. Most operational systems are implemented somewhere between 2NF and 3NF, in that some tables will be in 2NF, whereas most will be in 3NF. This adherence to normalization tends to result in quite flexible data structures. We use the term flexible to describe a data structure that is quite easy to change should the need arise. The changing nature of the business requirements has already been described and, therefore, it must be advantageous to implement a data model that is adaptive in the sense that it can change as the business requirements evolve over time.
But what is the difference between dimensional and normalized? Let's have another look at the simple star schema for the Wine Club, from Figure 2.3, which we first produced in our introduction in Chapter 2 (see Figure 3.1).
Figure 3.1. Star schema for the Wine Club.
Some of the attributes of the customer dimension are:
Customer Dimension
  Customer ID (Primary key)
  Customer name
  Street address
  Town
  County
  Zip Code
  Account manager ID
  Account manager name
This dimension is currently in 2NF because, although all the nonprimary key columns are fully functionally dependent on the primary key, there is a transitive dependency in that the account manager name can also be determined from the account manager ID. So the 3NF version of the customer dimension above would look like this:
Customer Dimension
  Customer ID (Primary key)
  Customer name
  Street address
  Town
  County
  Zip Code
  Account manager ID
Account Manager Dimension
  Account Manager ID
  Account Manager Name
The diagram is shown in Figure 3.2.
Figure 3.2. Third normal form version of the Wine Club dimensional model.
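Expressed as tables, the snowflaked version of the customer dimension might look like this (a sketch; the data types are assumptions):

Create Table Account_Manager (
    AccountManagerID   Integer Not Null Primary Key,
    AccountManagerName Varchar(60)
);

Create Table Customer (
    CustomerID       Integer Not Null Primary Key,
    CustomerName     Varchar(60),
    StreetAddress    Varchar(120),
    Town             Varchar(40),
    County           Varchar(40),
    ZipCode          Varchar(10),
    AccountManagerID Integer References Account_Manager (AccountManagerID)
);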
So what we have done is convert a star into the beginnings of a snowflake and, in doing so, have started putting the model into 3NF. If this process is carried out thoroughly with all the dimensions, then we should have a complete dimensional model in 3NF. Well, not quite. We also need to look at the fact table. But first, let's go back to the original source of the Sales fact table, the Order and Order Item tables. They have the following attributes:

Order
  Order number (Primary key)
  Customer ID
  Time

Order Item
  Order number (Primary key)
  Order item (Primary key)
  Wine ID
  Depot ID
  Quantity
  Value

Both these tables are in 3NF. If we were to collapse them into one table, called Sales, by joining them on the Order Number, then the resulting table would look like this:

Sales
  Order number (Primary key part 1)
  Order item (Primary key part 2)
  Customer ID
  Wine ID
  Time
  Depot ID
  Quantity
  Value

This table is now only in 1NF because the Customer ID and Time, while being functionally dependent on the primary key, do not display the property of "full" functional dependency (i.e., they are not dependent on the Order Item). In our dimensional model, we have decided not to include the Order Number and Order Item details. If we remove them, is there another candidate key for the resulting table? The answer is, it depends! Look at this version of the Sales table:
Sales
  Customer ID (Primary key part 2)
  Wine ID (Primary key part 1)
  Time (Primary key part 3)
  Depot ID
  Quantity
  Value

The combination of Customer ID, Wine ID, and Time has emerged as a composite candidate key. Is this realistic? The answer is yes, it is. But it all depends on the Time. The granularity of time has to be sufficiently fine to ensure that the primary key has the property of uniqueness. Notice that we did not include the Depot ID as part of the primary key. The reason is that it does not add any further to the uniqueness of the key and, therefore, its inclusion would violate the other property of primary keys, that of minimality. (A SQL sketch of this fact table appears after the two-point summary below.)

The purpose of this treatise is not to try to convince you that all dimensional models are automatically 3NF models; rather, my intention is to show that it is erroneous to say that the choice is between dimensional models and 3NF models. The following sums up the discussion:
1. Star schemas are not usually in 3NF.
2. Snowflake schemas can be in 3NF.
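To make the point concrete, here is a minimal SQL sketch of the Sales fact table with the composite key just described. The names and data types are illustrative, and the customer, wine, and depot dimension tables are assumed to exist elsewhere in the schema.

create table sales (
    customer_id integer not null,    -- foreign key to the customer dimension
    wine_id     integer not null,    -- foreign key to the wine dimension
    sale_time   date    not null,    -- must be fine grained enough to keep the key unique
    depot_id    integer not null,    -- foreign key to the depot dimension; not part of the key
    quantity    integer,
    value       decimal(8,2),
    primary key (customer_id, wine_id, sale_time)
);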
Stars and Snowflakes

The second religious war is being fought inside the dimensional model camp. This is the argument about star schema versus snowflake schema. Kimball (1996) proscribes the use of snowflake schemas for two reasons. The first is the effect on performance that has already been described. The second is that users might be intimidated by complex hierarchies. My experience has shown his assertion, that users and business people are uncomfortable with hierarchies, to be quite untrue. My experience is, in fact, the opposite. Most business people are very aware of hierarchies and are confused when you leave them out or try to flatten them into a single level. Kimball (1996) uses the hierarchy in Figure 3.3 as his example.
Figure 3.3. Confusing and intimidating hierarchy.
It is possible that it is this six-layer example in Figure 3.3 that is confusing and intimidating, rather than the principle. In practice, hierarchies involving so many layers are almost unheard of. Far more common are hierarchies like the one reproduced from the Wine Club case study, shown in Figure 3.4.
Figure 3.4. Common organizational hierarchy.
These hierarchies are very natural to managers because, in real-life scenarios, customers are organized into geographic areas or market segments, and the managers' desire is to be able to ask questions about the business performance of this type of segmentation. Similarly, managers are quite used to comparing the performance of one department against other departments, or one product range against other product ranges. The whole of the business world is organized in a hierarchical fashion. We live in a structured world. It is when we try to remove it, or flatten it, that business people become perplexed. In any case, if we present the users of our data warehouse with a star schema, all we have done, in most instances, is precisely that: We have flattened the hierarchy in a kind of "denormalization" process. So I would offer a counter principle with respect to snowflake schemas: There is no such thing as a true star schema in the eyes of business people. They expect to see the hierarchies.

Where does this get us? Well, in the dimensional versus 3NF debate, I suspect a certain amount of reading-between-the-lines interpretation is necessary and that what the 3NF camp is really shooting for is the retention of the online transaction processing (OLTP) schema in preference to a dimensional schema. The reason for this is that the OLTP model will more accurately reflect the underlying business processes and is, in theory at least, more flexible and adaptable to change. While this sounds like a great idea, the introduction of history usually makes this impossible to do. This is part of a major subject that we'll be covering in detail in Chapter 4.

It is worth noting that all online analytical processing (OLAP) products also implement a dimensional data model. Therefore, the terms OLAP and dimensional are synonymous.
This brings us neatly onto the subject of data marts. The question "When is it a data warehouse and when is it a data mart?" has also been the subject of much controversy. Often, it is the commercial interests of the software vendors that carry the greatest influence. The view is held, by some, that the data warehouse is the big, perhaps enterprise-wide, repository and that data marts are much smaller, maybe departmental, extractions from the warehouse that the users get to analyze. By the way, this is very closely associated with the previous discussion on 3NF versus dimensional models.

Even the most enthusiastic supporter of the 3NF/OLTP approach is prepared to recognize the value that dimensional models bring to the party when it comes to OLAP. In a data warehouse that has an OLAP component, it is that OLAP component that the users actually get to use. Sometimes it is the only part of the warehouse that they have direct access to. This means that the bit of the warehouse that the users actually use is dimensional, irrespective of the underlying data model. In a data warehouse that implements a dimensional model, the bit that the users actually use is, obviously, also dimensional. Therefore, everyone appears to agree that the part that the users have access to should be dimensional. So it appears that the only thing that separates the two camps is some part of the data warehouse that the users do not have access to.

Returning to the issue about data marts, we decline to offer a definition on the basis that a subset of a data warehouse is still a data warehouse. There is a discussion on OLAP products generally in Chapter 10.
WHAT WORKS FOR CRM

Data warehousing is now a mature business solution. However, the evolution of business requires the evolution of data warehouses. That business people have to grasp the CRM nettle is an absolute fact. In order to do this, information is the key. It is fair to say that an organization cannot be successful at CRM without high-quality, timely, and accurate information. You cannot determine the value of a customer without information. You cannot personalize the message to your customer without information, and you cannot assess the risk of losing a customer without information. If you want to obtain such information, then you really do need a data warehouse.

In order to adopt a personalized marketing approach, we have to know as much as we can about our customers' circumstances and behavior. We described the difference between circumstances and behavior in the section on market segmentation in Chapter 1. The capability to accurately segment our customers is one of the important properties of a data warehouse that is designed to support a CRM strategy. Therefore, the distinction between circumstances and behavior, two very different types of data, is crucial in the design of the data warehouse. Let's look at the components of a "traditional" data warehouse to try to determine how the two different types of data are treated. The diagram in Figure 3.5 is our now familiar Wine Club example.
Figure 3.5. Star schema for the Wine Club.
It remains in its star schema form for the purposes of this examination, but we could just as easily be reviewing a snowflake model.
The first two questions we have to ask are whether or not it contains information about:
1. Customers' behavior
2. Customers' circumstances

Clearly it does. The Sales table (the Fact table) contains details of sales made to customers. This is behavioral information, and it is a characteristic of dimensional data warehouses and data marts that the Fact table contains behavioral information. Sales is a good example, probably the most common, but there are plenty more from all industries:

Telephone call usage
Shipments and deliveries
Insurance premiums and claims
Hotel stays
Aircraft flight bookings

These are all examples of the subject of a dimensional model, and they are all behavioral. The customer dimension is the only place where we keep information about customer circumstances. According to Ralph Kimball (1996), the principal purpose of the customer dimension, as with all dimensions in a dimensional model, is to enable constraints to be placed on queries that are run against the fact table. The dimensions merely provide a convenient way of grouping the facts and appear as row headers in the user's result set. We need to be able to slice and dice the Fact table data "any which way." A solution based on the dimensional model is absolutely ideal for this purpose. It is simply made for slicing and dicing.

Returning to the terms behavior and circumstances, a dimensional model can be described as behavior centric. It is behavior centric because its principal purpose is to enable the easy and comprehensive analysis of behavioral data. It is possible to make a physical link between Fact tables by the use of a common dimension table, as the diagram in Figure 3.6 shows.
Figure 3.6. Sharing information.
This "daisy chain" effect enables us to "drill across" from one star schema to another. This common dimension is sometimes referred to as a conformed dimension (a sketch of a drill-across query appears below).

We have seen previously how the first-generation data warehouses tended to focus on the analysis of behavioral information. Well, the second generation needs to support big business issues such as CRM and, in order to do this effectively, we have to be able to focus not only on behavior, but on circumstances as well.
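As an illustration only, a drill-across query between two hypothetical fact tables (wine_sales and trips) that share the customer dimension might be sketched like this. Each fact table is aggregated separately and the two result sets are joined on the conformed dimension attribute; all table and column names here are assumptions.

select w.region, w.wine_revenue, t.trip_revenue
from (select c.region, sum(s.value) as wine_revenue
      from wine_sales s, customer c
      where s.customer_id = c.customer_id
      group by c.region) w,
     (select c.region, sum(tr.value) as trip_revenue
      from trips tr, customer c
      where tr.customer_id = c.customer_id
      group by c.region) t
where w.region = t.region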
Customer Behavior and Customer Circumstances: The Cause-and-Effect Principle

We have explored the difference between customers' circumstances and their behavior, but why is it important? Most of the time in data warehousing, we have been analyzing behavior. The Fact table in a traditional dimensional schema usually contains information about a customer's interaction with our business, that is, the way they behave toward us. In the Wine Club example we have been using, the Fact table contained information about sales. This, as has been shown, is the normal approach toward the development of data warehouses.

Now let us look again at one of the most pressing business problems, that of customer loyalty and its direct consequence, customer churn. For the moment let us put ourselves in the place of a customer of a cellular phone company and think of some reasons why we, as a customer, may decide that we no longer wish to remain as a customer of this company:

Perhaps we have recently moved to a different area. Maybe the new area has poor reception for this particular company.

We might have moved to a new employer and have been given a mobile phone as part of the deal, making the old one surplus to requirements.

We could have a child just starting out at college. The costs involved might require economies to be made elsewhere, and the mobile phone could be the luxury we can do without.
Each of the above situations could be the cause for us, as customers, to appear in next month's churn statistics for this cellular phone company. It would be really neat if the phone company could have predicted that we are a high-risk customer. The only way to do that is to analyze the information that we have gathered and apply some kind of predictive model to the data that yields a score from, say, 1 for a very low-risk customer to 10 for a very high-risk customer.

But what type of information is likely to give us the best indication of a customer's propensity to churn? Remember that, traditionally, data warehouses tend to be organized around behavioral systems. In a mobile telephone company, the most commonly used behavioral information is the call usage. Call usage provides information about:

Types of calls made (local, long distance, collect, etc.)
Durations of calls
Amount charged for the call
Time of day
Call destinations
Call distances

If we analyze the behavior of customers in these situations, what do you think we will find? I think we can safely predict that, just before the customer churned, they stopped making telephone calls! The abrupt change in behavior is the effect of the change in circumstances. The cause-and-effect principle can be applied quite elegantly to the serious problem of customer churn and, therefore, customer loyalty. What we are seeing when we analyze behavior is the effect of some change in the customer's circumstances. The change in circumstances, either directly or indirectly, is the cause of their churning. If we analyze their behavior, it is simply going to tell us something that we already know and is blindingly obvious: the customer stopped using the phone. By this time it is usually far too late to do anything about it.

In view of the fact that most dimensional data warehouses measure behavior, it seems reasonable to conclude that such models may not be much help in predicting which customers we are at risk of losing. We need to be much more rigorous in our approach to tracking changes in circumstances, rather than behavior. Thus, the second-generation data warehouses that are being built as an aid to the development of CRM applications need to be able to model more than just behavior. So instead of being behavior centric, perhaps they should be dimension centric or even circumstances centric. The preferred term is customer centric. Our second-generation data warehouses will be classified as customer centric. Does this mean that we abandon behavioral information? Absolutely not! It's just that we need to
switch the emphasis so that some types of information that are absolutely critical to a successful CRM strategy are more accessible.

So what does this mean for the great star schema debate? Well, all dimensional schemas are, in principle, behavioral in nature. In order to develop a customer-centric model, we have to use a different approach. If we are to build a customer-centric model, then it makes sense to start with a model of the customer. We know that we have two major information types: behavior and circumstances. For the moment, let's focus on the circumstances. Some of the kinds of things we might like to record about customers are:

Customer
  Name
  Address
  Telephone number
  Date of birth
  Sex
  Marital status

Of course, there are many, many more pieces of information that we could hold (check out Appendix D to see quite a comprehensive selection), but this little list is sufficient for the sake of example. At first sight, we might decide that we need a customer dimension as shown in Figure 3.7.
Figure 3.7. General model for customer details.
The customer dimension in Figure 3.7 would have some kind of customer identifier and a set of attributes like those listed in the table above. But that won't give us what we want. In order to enable our users to implement a data warehouse that supports CRM, one of the things they must be able to do is analyze, measure, and classify the effect of changes in a customer's circumstances. As far as we, that means the data architects, are concerned, a change in circumstances simply means a change in the value of some attribute. But, ignoring error corrections, not all attributes are subject to change as part of the ordinary course of business. Some attributes change and some don't. Even if an attribute does change, it does not necessarily mean that the change is of any real interest to our business. There is a business issue to be resolved here.
We can illustrate these points if we look a little more closely at the simple list of attributes above. Ignoring error corrections, which are the attributes that can change? Well, in theory at least, with the exception of the date of birth, they can all change. Now, there are two types of change that we are interested in:
1. Changes where we need to be able to see the previous values of the attribute, as well as the new value
2. Changes where the previous values of the attribute can be lost

What we have to do is group the attributes into these two different types. So we end up with a model with two entities like the one in Figure 3.8.
Figure 3.8. General model for a customer with changing circumstances.
We are starting to build a general conceptual model for customers. For each customer, we have a set of attributes that can change as well as a set of attributes that either cannot change or whose previous values we do not need to know. Notice that the relationship has a cardinality of one to many. Please note this is not meant to show that there are many attributes that can change; it actually means that each attribute can change many times. For instance, a customer's address can change quite frequently over time.

In the Wine Club, the name, telephone number, date of birth, and sex are customer attributes where the business feels that either the attributes cannot change or the old values can be lost. This means that the address and marital status are attributes where the previous values should be preserved. So, using the example, the model should look as shown in Figure 3.9.
Figure 3.9. Example model showing customer with changing circumstances.
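A hedged SQL sketch of the Figure 3.9 structure is shown below. The table names, and the use of an effective-from date to timestamp each change, are my assumptions; the proper treatment of time is the subject of Chapter 4.

create table customer (
    customer_id      integer primary key,
    name             varchar(60),
    telephone_number varchar(20),
    date_of_birth    date,
    sex              char(1)
);

create table customer_address (
    customer_id    integer references customer,
    effective_from date,                 -- when this address became current
    address        varchar(120),
    primary key (customer_id, effective_from)
);

create table customer_marital_status (
    customer_id    integer references customer,
    effective_from date,                 -- when this status became current
    marital_status varchar(20),
    primary key (customer_id, effective_from)
);

Each change of circumstance simply adds a row to the appropriate history table, so the previous values are preserved.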
So each customer can have many changes of address and marital status, over time. Now, the other main type of data that we need to capture about customers is their behavior. As we have discussed previously, the behavioral information comes from the customers' interaction with our organization. The conceptual general model that we are trying to develop must include behavioral information. It is shown in Figure 3.10.
Figure 3.10. The general model extended to include behavior.
Again the relationship between customers and their behavior is intended to show that there are many behavioral instances over time. The actual model for the Wine Club would look something like the diagram in Figure 3.11.
Figure 3.11. The example model extended to include behavior.
Each of the behavioral entities (wine sales, accessories, and trips) probably would have previously been modeled as part of individual subject areas in separate star schemas or snowflake schemas. In our new model, guess what? They still could be! Nothing we have done so far means that we can't use some dimensional elements if we want to and, more importantly, if we can get the answers we need.

Some sharp-eyed readers at this point might be tempted into thinking, "Just hold on a second, what you're proposing for the customer is just some glorified form of common (or conformed) dimension, right?" Well, no. There is, of course, some resemblance in this model to the common dimension model that was described earlier on. But remember this: The purpose of a dimension, principally, is to constrain queries against the fact table. The main purpose of a common dimension is to provide a drill-across facility from one fact table to another. They are still behavior-centric models. It is not the same thing at all as a model that is designed to be inherently customer centric. The emphasis has shifted away from behavior, and more value is attached to the customer's personal circumstances. This enables us to classify our customers into useful and relevant segments. The difference might seem quite subtle, but it is, nevertheless, significant.

Our general model for a customer-centric data warehouse looks very simple, just three main entity types. Is it complete? Not quite. Remember that there were three main types of customer segmentation. The first two were based on circumstances and behavior. We have discussed these now at some length. The third type of segment was referred to as a derived segment. Examples of derived segments are
things like "estimated lifetime value" and "propensity to churn." Typically, the inclusion and classification of a customer in these segments is determined by some calculation process such as predictive modeling. We would not normally assign a customer a classification in a derived segment merely by assessing the value of some attribute. It is sensible, therefore, to modify our general model to incorporate derived segments, as shown in Figure 3.12.
Figure 3.12. General conceptual model for a customer-centric data warehouse.
So this is it. The diagram in Figure 3.12 is the boiled-down general model for a customer-centric data warehouse. In theory, it should be able to answer almost any question we might care to throw at it. I say "in theory" because in reality the model will be far more complex than this. We will need to be able to cope with customers whose classification changes. For example, we might have a derived segment called "lifetime value" where every customer is allocated an indicator with a value from, say, 1 to 20. Now, Lucie Jones might have a lifetime value indicator of "9." But when Lucie's salary increases, she might be allocated a lifetime value indicator of "10." It might be useful to some companies to invent a new segment called, say, "increasing lifetime values." This being the case, we may need to track Lucie's lifetime value indicator over time. When we bring time into our segmentation processes, the possibilities become endless. However, the introduction of time also brings with it some very difficult problems, and these will be discussed in the next chapter.

Our model can be described as a general conceptual model (GCM) for a customer-centric data warehouse. The GCM provides us with a template from which all our actual conceptual models in the future can be derived. While we are on the subject of conceptual models, I firmly believe it is high time that we reintroduced the conceptual, logical, and physical data model trilogy into our design process.
Whatever Happened to the Conceptual/Logical/Physical Trilogy?

In the old days there was a tradition in which we used a three-stage process for designing a database. The first model that was developed was called the conceptual data model, and it was usually represented by an entity relationship diagram (ERD). The purpose of the ERD was to provide an
abstraction that represented the data requirements of the organization. Most people with any experience of database design would be familiar with the ERD approach to designing databases. One major characteristic of the conceptual model is that it ought to be able to be implemented using any type of database. In the 1970s, the relational database was not the most widely used type of DBMS. In those days, the databases tended to be:
1. Hierarchical databases
2. Network databases

The idea was that the DBMS to be ultimately deployed should not have any influence over the way in which the requirements were expressed. So the conceptual data model should not imply the technology to be used in implementing the solution.

Once the DBMS had been chosen, then a logical model would be produced. The logical model was normally expressed as a schema in textual form. So, for instance, where the solution was to be implemented using a relational database, a relational schema would be produced. This consisted of a set of relations, the relationships expressed as foreign key constraints, and a set of domains from which the attributes of the relations would draw their values.

The physical data model consisted of the data definition language (DDL) statements that were needed to actually build, in a relational environment, the tables, indexes, and constraints. This is sometimes referred to as the implementation model. One of the strengths of the trilogy is that decisions relating to the logical and physical design of the database could be taken and implemented without affecting the abstract model that reflected the business requirements.

The astonishing dominance of relational databases since the mid-1980s has led, in practice, to a blurring of the boundaries between the three models, and it is not uncommon nowadays for a single model to be built, again in the form of an ERD. This ERD is then converted straight into a set of tables in the database. The conceptual model, logical model, and physical model are treated as the same thing. This means that any changes that are made to the design for, say, performance-enhancing reasons are reflected in the conceptual model as well as the physical model. The inescapable conclusion is that the business requirements are being changed for performance reasons.

Under normal circumstances, in OLTP-type databases for instance, we might be able to debate the pros and cons of this approach because the business users don't ever get near the data model, and it is of no interest to them. They can, therefore, be shielded from it. But data warehouses are different. There can be no debate; the users absolutely have to understand the data in the data warehouse. At least they have to understand that part of it that they use. For this reason, the conceptual data model, or something that can replace its intended role, must be reintroduced as a necessary part of the development lifecycle of data warehouses.

There is another reason why we need to reinvent the conceptual data model for data warehouse development. As we observed earlier, in the past 15 years, the relational database has emerged as a
de facto standard in most business applications. However, to use the now well worn phrase, data warehouses are different. Many of the OLAP products are nonrelational, and their logical and physical manifestations are entirely different from the relational model. So the old reasons for having a three-tier approach have returned, and we should respond to this.
The Conceptual Model and the Wine Club

Now that we have the GCM, we can apply its principles to our case study, the Wine Club. We start by defining the information about the customer that we want to store. In the Wine Club we have the following customer attributes:

Customer Information
  Title
  Name
  Address
  Telephone number
  Date of birth
  Sex
  Marital status
  Children's details
  Spouse details
  Income
  Hobbies and interests
  Trade or profession
  Employers' details

The attributes have to be divided into two types:

1. Attributes that are relatively static (or where previous values can be lost)
2. Attributes that are subject to change

Customer's Static Information
  Title
  Name
  Telephone number
  Date of birth
  Sex

Customer's Changing Information
  Address
  Marital status
  Children's details
  Spouse details
  Income
  Hobbies and interests
  Trade or profession
  Employers' details

We now construct a model like the one in Figure 3.13.
Figure 3.13. Wine Club customer changing circumstances.
This represents the customer's static and changing circumstances. The behavior model is shown in Figure 3.14.
Figure 3.14. Wine Club customer behavior.
Now, thinking about derived segments, these vary in their dynamic nature: some derived segments will change constantly over time, some will remain fairly static over time, and others still will appear for a short while and then disappear. Some examples of these, as they apply to the Wine Club, are:

Lifetime value. This is a great way of classifying customers, and every organization should try to do this. It is an example of a fairly static form of segmentation. We would not expect dramatic changes to customers' positions here. It would be good to know which customers are on the "generally increasing" and the "generally decreasing" scale.

Recently churned. This is an example of a dynamic classification that will be constantly changing. The ones that we lose that had good lifetime value classifications would appear in our "Win back" derived segment.

Special promotions. These can be good examples where a kind of "one-off" segment can be used effectively. In the Wine Club there would be occasions when, for instance, it needs to sell off a particular product quickly. The requirement would be to determine the customers most likely to buy the product. This would involve examination of previous behavior as well as circumstances (e.g., income category in the case of an expensive wine). The point is that this is a "use once" segment.

Using the three examples above, our derived segments model looks as shown in Figure 3.15.
Figure 3.15. Derived segment examples for the Wine Club.
There is a design issue with segments generally, and that is their dynamic nature. The marketing organization will constantly want to introduce new segments. Many of them will be of the fairly static and dynamic types that will have long lives in the data warehouse. What we don't want is for the data warehouse administrator to have to get involved in the creation of new tables each time a new classification is invented. This, in effect, results in frequent changes to the data warehouse structure, will lead to the marketing people getting involved in complex change control procedures, and might ultimately result in a stifling of creativity. So we need a way of allowing the marketing people to add new derived segments without involving the database administrators too much. Sure, they might need help in expressing the selection criteria, but we don't want to put too many obstacles in their path. This issue will be explored in more detail in Chapter 7, on the physical implementation.
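As a purely illustrative sketch of one possibility (the book's own physical design appears in Chapter 7), the segments themselves could be held as data rather than as table structure:

create table derived_segment (
    segment_id   integer primary key,
    segment_name varchar(60),        -- e.g. 'Lifetime value', 'Recently churned'
    description  varchar(200)
);

create table segment_membership (
    segment_id    integer,           -- identifies the derived segment
    customer_id   integer,           -- identifies the customer
    assigned_date date,              -- when the classification was made
    segment_value varchar(20),       -- e.g. a lifetime value band from 1 to 20
    primary key (segment_id, customer_id, assigned_date)
);

With a shape like this, marketing can introduce a new segment by inserting a row into derived_segment and running the selection process that populates segment_membership; no new tables, and no schema change control, are needed.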
SUMMARY

I have a theory, which is that data warehousing has always been about customer relationships. It's just that previously we didn't entirely make the connection because, in the early days, CRM had not really been recognized as any kind of discipline at the systems level. The advent of CRM has put the spotlight firmly back on data warehousing. Data warehouses provide the technology to enable us to perform customer relationship analysis. The management of the relationship is the value that is added by the business. This is the decision-making part. The warehouse is doing what it has always done, providing decision support. That is why this book is about supporting customer relationship management.

In this chapter we have been looking at some of the design issues and have tried to quantify and rationalize some aspects of data warehouse design that have been controversial in the past. There are good reasons for using dimensional schemas, but there are cases where they can work against us. The best solution for CRM is to use dimensional models where they absolutely do add value, in modeling customers' behavior, but to use more "conventional" approaches when modeling customers' circumstances.

Toward the end of the chapter we developed a general conceptual model (Figure 3.12) for a customer-centric data warehouse. We will develop this model later on in this book by applying some of the needs of our case study, the Wine Club. Firstly, however, the awfully thorny subject of time has to be examined.
Chapter 4. The Implications of Time in Data Warehousing

The principal subject of this book is the design of data warehouses. One of the least well understood issues surrounding data warehouse design is the treatment and representation of time. This chapter introduces the characteristics of time and the way that it is used in data warehousing applications. The chapter goes on to describe fully the problems encountered with time. We need to introduce some rigor into the way that time is treated in data warehousing, and this chapter lays out the groundwork to enable that to be achieved. We also examine the more prominent solutions that other major practitioners have proposed in the past and that have been used ubiquitously in first-generation data warehouses. We will see that some issues arise when these methods are adopted.

The presence of time, and the dependence upon it, is one of the things that sets data warehouse applications apart from traditional operational systems. Most business applications are suited to operating in the present, where time does not require special treatment. In many cases, dates are no more than descriptive attributes. In a data warehouse, time affects the very structure of the system. The temporal requirements of a data warehouse are very different from those of an operational system, yet it is the operational system that feeds information about changed data to the data warehouse.

In a temporal database management system, support for time would be implicit within the DBMS, and the query language would contain time-specific functions to simplify the manipulation of time. Until such systems are generally available, the data warehouse database has to be designed to take account of time. The support for time has to be explicitly built into the table structures and the queries.
THE ROLE OF TIME

In data warehousing, the addition of time enables historical data to be held and queried. This means that users of data warehouses can view aspects of their enterprise at any specific point, or over any period, of time for which the historical data is recorded. This enables the observation of patterns of behavior over time so that we can make comparisons between similar or dissimilar periods, for example, this year versus last year, or seasonal trends. Armed with this information, we can extrapolate with the use of predictive models to assist us with planning and forecasting. We are, in effect, using the past to attempt to predict the future:

If men could learn from history, what lessons it might teach us! But passion and party blind our eyes, and the light which experience gives is a lantern on the stern, which shines only on the waves behind us!
—(Coleridge, 1835)

Despite this gloomy warning from the nineteenth century, the use of information from past events and trends is commonplace in economic forecasting, social trend forecasting, and even weather forecasting. The value and importance of historical data are generally recognized. It has been observed that the ability to store historical data is one of the main advantages of data warehousing and that the absence of historical data, in operational systems, is one of the motivating factors in the development of data warehouses.

Some people argue that most operational systems do keep a limited amount of history, about 60–90 days. In fact, this is not really the case, because the data held at any one time in, say, an order processing system will be orders whose lifecycle has not been completed to the extent that the order can be removed from the system. This means that it may take, on average, 60–90 days for an order to pass through all its states from inception to deletion. Therefore, at any one time, some of the orders may be up to 90 days old with a status of "invoiced," while others will be younger, with different statuses such as "new," "delivered," "back ordered," etc. This is not the same as history in our sense.
Valid Time and Transaction Time

Throughout this chapter and the remainder of the book, we will make frequent reference to the valid times (Jensen et al., 1994) and transaction times of data warehouse records. These two times are defined in the field of temporal database research and have quite precise meanings, which are now explained.

The valid time associated with the value of, say, an attribute is the time when the value is true in modeled reality. For instance, the valid time of an order is the time that the order was taken. Such
values may be associated with:
1. A single instant. Defined to be a time point on an underlying time axis. An event is defined as an instantaneous fact that occurs at an instant.
2. Intervals (periods) of time. Defined to be the time between two instants.

The valid time is normally supplied by the user, although in some cases, such as telephone calls, the valid time can be provided by the recording equipment.

The transaction time associated with the value of, say, an attribute records the time at which the value was stored in the database and is able to be retrieved. Transaction times are system generated and may be implemented using transaction commit times. Transaction times may also be represented by single instants or time intervals. Clearly, a transaction time can provide only an approximation of the valid time. We can say, generally, that the transaction time that records an event will never be earlier than the true valid time of the same event.
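For illustration only, a call record fact might carry both times as separate columns; the table and column names are assumptions, not a prescribed design.

create table call_fact (
    customer_id integer,
    call_start  timestamp,      -- valid time: when the call actually took place
    duration    integer,
    charge      decimal(8,2),
    loaded_at   timestamp       -- transaction time: when the row was stored in the warehouse
);
-- For any one call, loaded_at can never be earlier than call_start.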
Behavioral Data

In a dimensional data warehouse, the source systems from which the behavioral data is derived are the organization's operational systems, such as order processing, supply chain, and billing. The source systems are not usually designed to record or report upon historical information. For instance, in an order processing system, once an order has satisfactorily passed through its lifecycle, it tends to be removed from the system by some archival or deletion process. After this, for all practical purposes, the order will not be visible. In any case, it will have passed beyond the state that would make it eligible to be captured for information purposes.

The task of the data warehouse manager is to capture the appropriate entities when they achieve the state that renders them eligible to be entered into the data warehouse. That is, when the appropriate event occurs, a snapshot of the entity is recorded. This is likely to be before they reach the end of their lifecycle. For instance, an order is captured into the data warehouse when the order achieves a state of, say, "invoiced." At this point the invoice becomes a "fact" in the data warehouse. Having been captured from the operational systems, the facts are usually inserted into the fact table using the bulk insertion facility that is available with most database management systems. Once loaded, the fact data is not usually subject to change at all. The recording of behavioral history in a fact table is achieved by the continual insertion of such records over time.

Usually each fact is associated with a single time attribute that records the time the event occurred. The time attribute of the event would, ideally, be the "valid time," that is, when the event occurred in the real world. In practice, valid times are not always available and transaction times (i.e., the time the data was recorded) have to be used. The actual data type used to record the time of the event will vary from one application to another depending on how precise the time has to be (the granularity of time might be day, month, and year when recording sales of wine, but would need to be more precise in the case of telephone calls and
would probably include hours, minutes, and seconds).
Circumstantial Data

Operational data, from which the facts are derived, is accompanied by supporting data, often referred to as reference data. The reference data relates to entities such as customers, products, sales regions, etc. Its primary purpose, within the operational processing systems, is to enable, for instance, the right products and documentation to be sent to the right addresses. It is this data that is used to populate the dimensions and the dimensional hierarchies in the data warehouse.

In the same way that business transactions have a lifecycle, these reference entities also have a lifecycle. The lifecycle of reference data entities is somewhat different from that of transactions. Whereas business transactions, under normal circumstances, have a predefined lifecycle that starts at inception and proceeds through a logical path to deletion, the lifecycle of reference data can be much less clear. The existence of some entities can be discontinuous. This is particularly true of customer entities, who may switch from one supplier to another and back again over time. It is also true of some other reference information, such as products (e.g., seasonal products). Also, the attributes are subject to change due to people moving, changing jobs, etc.
PROBLEMS INVOLVING TIME

There are several areas in data warehousing where time presents a problem. We'll now explore those areas.
The Effect of Time on the Data Model

Organizations wishing to build a data warehouse have often already built a data model describing their operational business data. This model is sometimes referred to as the corporate data model. The database administrator's office wall is sometimes entirely obscured by a chart depicting the corporate data model. When building a data warehouse, practitioners often encounter the requirement to utilize the customer's corporate data model as the foundation of the warehouse model. The organization has invested considerably in the development of the model, and any new application is expected to use it as the basis for development. The original motivation for the database approach was that data should be entered only once and that it should be shared by any users who were authorized to have access to it.

Figure 4.1 depicts a simple fragment of a data model for an operational system. Although the Wine Club data model could be used, the model in Figure 4.1 provides a clearer example.
Figure 4.1. Fragment of operational data model.
Figure 4.1 is typical of most operational systems in that it contains very little historical data. If we are to introduce a data warehouse into the existing data model, we might consider doing so by the addition of a time variant table that contained the history that is needed. Taking the above fragment of a corporate data model as a starting point, and assuming that the warehouse subject area is “Sales,” a dimensional warehouse might be created as in Figure 4.2.
Figure 4.2. Operational model with additional sales fact table.
Figure 4.2 shows a dimensional model with the fact table (Sales) at the center and three dimensions of analysis. These are time, customer, and salesperson. The salesperson dimension participates in a dimensional hierarchy in which a department employs salespeople and a site contains many departments. Figure 4.2 further shows that the sales fact table is populated by the data contained in the orders table, as indicated by the dotted arrow (not part of standard notation). That is, all new orders that have achieved the state required to enable them to be classified as sales are inserted into the sales table and are appended to the data already contained in the table. In this way the history of sales can be built.

At first sight, this appears to be a satisfactory incorporation of a dimensional data warehouse into an existing data model. Upon closer inspection, however, we find that the introduction of the fact table "Sales" has had interesting effects. To explain the effect, the sales dimensional hierarchy is extracted as an example, shown in Figure 4.3. This hierarchy shows that a site may contain many departments and a department may employ many salespeople. This sort of hierarchy is typical of many such hierarchies that exist in all organizations.
Figure 4.3. Sales hierarchy.
The relationships shown here imply that a salesperson is employed by one department and that a department is contained in one site. These relationships hold at any particular point in time.
A fact table, which contains history, is now attached to the hierarchy as shown in Figure 4.4.
Figure 4.4. Sales hierarchy with sales table attached.
The model now looks like a dimensional model with a fact table (sales) and a single dimension (salesperson). The salesperson dimension participates in a dimensional hierarchy involving departments and sites. Assuming that it is possible, during the course of ordinary business, for a salesperson to move from one department to another, or for a department to move from one site to another, then the cardinality (degree) of the relationships "Contains" and "Employs" no longer holds. The hierarchy, consisting of salespeople, departments, and sites, contains only the latest view of the relationships.

Because sales are recorded over time, some of the sales made by a particular salesperson may have occurred when the salesperson was employed by a different department. Whereas the model shows that a salesperson may be employed by exactly one department, this is only true where the relationship is viewed as a "snapshot" relationship. A more accurate description is that a salesperson is employed by exactly one department at a time. Over time, a salesperson may be employed by one or more departments. Similarly, a department is contained by exactly one site at a time. If it is possible for departments to move from one site to another then, over time, a department may be contained by one or more sites.

The introduction of time variance, which is one of the properties of a data warehouse, has altered the degree of the relationships within the hierarchy, and they should now be depicted as many-to-many relationships as shown in Figure 4.5. This leads to the following observation: The introduction of a time-variant entity into a time-invariant model potentially alters the degree of one or more of the relationships in the model.
A point worth noting is that it is the rules of the business, not a technical phenomenon, that cause these changes to the model. The degree to which this causes a problem will vary from application to application, but dimensions typically do contain one or more natural hierarchies. It seems reasonable to assume, therefore, that every organization intending to develop a data warehouse will have to deal with the problem of the degree of relationships being altered as a result of the introduction of time.

The above example describes the kind of problem that can occur in relationships that are able to change over time. In effect, the cardinality (degree) of the relationship has changed from "one to many" to "many to many" due to the introduction of time variance. In order to capture the altered cardinality of the relationships, intersection entities would normally be introduced, as shown in Figure 4.6.
Figure 4.6. Sales hierarchy with intersection entities.
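A minimal SQL sketch of what such intersection entities might look like is given below; the column names follow the illustrative queries later in the chapter, and the use of from/to dates is an assumption rather than part of the original model.

create table salesperson_department (
    sales_id  integer,      -- foreign key to the salesperson entity
    dept_code char(2),      -- foreign key to the department entity
    from_date date,
    to_date   date,
    primary key (sales_id, dept_code, from_date)
);

create table department_site (
    dept_code char(2),
    site_code char(2),
    from_date date,
    to_date   date,
    primary key (dept_code, site_code, from_date)
);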
This brief introduction to the problem shows that it is not really possible to combine a time-variant data warehouse model with a non-time-variant operational model without some disruption to the original model. If we compare the altered data model to the original model, it is clear that the introduction of the time-variant sales entity has had some repercussions and has forced some changes to be made. This is one of the main reasons that force data warehouses to be built separately from operational systems.

Some practitioners believe that the separation of the two is merely a performance issue, in that most database products cannot be optimized to support the highly disparate nature of operational versus decision support queries. This is not the case. The example shows that the structure of the data is actually incompatible. In the future it is likely that operational systems will be
built with more “decision support awareness,” but any attempt to integrate decision support systems into traditional operational systems will not be successful.
The Effect of Time on Query Results

As these entities change over time in operational processing systems, the new values tend to replace the existing values. This gives the impression that the old, now replaced, value never existed. For instance, in the Wine Club example, if a customer moves from one address to another and, at the same time, switches to a new region, there is no reason within the order processing system to record the previous address because, in order to service orders, the new address is all that is required. It could be argued that keeping information about the old address is potentially confusing, with the risk that orders may be inadvertently dispatched to the wrong address.

In a temporal system such as a data warehouse, which is required to record and report upon history faithfully, it may be very important to be able to distinguish between the orders placed by the customer while resident at the first address and the orders placed since moving to the new address. An example of where this information would be needed is where regional sales are measured by the organization. In the example described above, the fact that the customer, when moving, switched regions is important. The orders placed by the customer while they were at the previous address need to have that link preserved so that the previous region continues to receive the credit for those orders. Similarly, the new region should receive credit for any subsequent orders placed by the customer during their period of residence at the new address. Clearly, when designing a data warehouse in support of a CRM strategy, such information may be very important. If we recall the cause-and-effect principle and how we applied it to changing customer circumstances, this is a classic example of precisely that.

So the warehouse needs to record not only the fact that the data has changed but also when the change occurred. There is a conflict between the system supplying the data, which is not temporal, and the receiving system, which is. The practical problems surrounding this issue are dealt with in detail later on in this chapter. The consequences of the problem can be explored in more detail by the use of some data. Figure 4.7 provides a simple illustration of the problem by building on the example given. We'll start by adding some data to the entities.
Figure 4.7. Sales hierarchy with data.
The example in Figure 4.7 shows a “Relational” style of implementation where the relationships are implemented using foreign key columns. In the data warehouse, the “Salesperson” dimension would be related directly to the sales fact table. Each sales fact would include a foreign key attribute that would contain the sales identifier of the salesperson who was responsible for the sale. In order to focus on the impact of changes to these relationships, time is omitted from the following set of illustrative queries. In order to determine the value of sales by salesperson, the SQL query shown in Query Listing 4.1 could be written:
Listing 4.1 Total sales by salesperson.
Select name, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by name

In order to determine the value of sales by department, the SQL query shown in Query Listing 4.2 could be written:
Listing 4.2 Total sales by department.
Select department_name, sum(sales_value)
from sales s1, sales_person s2, department d
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
group by department_name

If the requirement was to obtain the value of sales attributable to each site, then the query in Query Listing 4.3 could be used:
Listing 4.3 Total sales by site.
Select address, sum(sales_value)
from sales s1, sales_person s2, department d, site s3
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
and d.site = s3.site_code
group by address

The result sets from these queries would contain the sum of the sales value grouped by salesperson, department, and site.
The results will always be accurate so long as there are no changes in the relationships between the entities. However, as previously shown, changes in the dimensions are quite common. As an example, if Sneezy were to transfer from department "SW" to department "NW," the relationship between the salesperson entity and the department entity will have changed. If the same three queries are executed again, the results will be altered. The results of the first query in Query Listing 4.1, which is at salesperson level, will be the same as before because the sales made by Sneezy are still attributed to him. However, in Query Listing 4.2, which is at the department level, all sales that Sneezy was responsible for when he worked in department "SW" will in future be attributed to department "NW." This is clearly an invalid result. The result from the query in Query Listing 4.3, which groups by site address, will still be valid because, although Sneezy moved from SW department to NW department, both SW and NW reside at the same address, Bristol. If Sneezy had moved from SW to SE or NE, then the Listing 4.3 results would be incorrect as well.

The example so far has focused on how time alters the cardinality of relationships. There is, equally, an effect on some attributes. If we look back at the salesperson entity in the example, there is an attribute called "Grade." This is meant to represent the sales grade of the salesperson. If we want to measure the performance of salespeople by comparing volume of sales against grades, this could be achieved by the following query:
Select grade, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by grade

If any salesperson has changed grade during the period covered by the query, the results will be inaccurate because all of their sales will be recorded against their current grade. To produce an accurate result, the periods of validity of each salesperson's grade must be kept. This might be achieved by the introduction of another intersection entity. If no action is taken, the database will produce inaccurate results. Whether the level of inaccuracy is acceptable is a matter for the directors of the organization to decide. Over time, however, the information would become less and less accurate, and its value is likely to become increasingly questionable. How do the business people know which queries return accurate results and, more importantly, which ones are suspect? Unfortunately for our users, there is no way of knowing.
The Time Dimension

The time dimension is a special dimension that contains information about times. For every possible time that may appear in the fact table, an entry must exist in the time dimension table. This time attribute is the primary key of the time dimension. The nonkey attributes are application specific and provide a method for grouping the discrete time values. The groupings can be anything that is of interest to the organization. Some examples might be:

Day of week
Weekend
Early closing day
Public holidays/bank holidays
24-hour opening day
Weather conditions
Week of year
Month name
Financial month
Financial quarter
Financial year

Some of these groupings could be derived from date manipulation functions supplied by the database management system, whereas others, clearly, cannot.
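As an illustration, a minimal sketch of such a time dimension table is shown below. The table and column names are examples only, not a prescribed design; the financial calendar columns are the kind that cannot be derived from the date itself.

-- Sketch only: a minimal time dimension; column names are illustrative.
create table time_dim (
    time_code         date not null primary key,  -- one row per discrete time value in the fact table
    day_of_week       varchar(9),                 -- derivable from the date, e.g. 'Monday'
    weekend_flag      char(1),                    -- 'Y' or 'N'
    public_holiday    char(1),
    week_of_year      smallint,
    month             integer,                    -- e.g. 200010, as used in the queries in this chapter
    month_name        varchar(9),
    financial_month   smallint,                   -- business defined, cannot be derived from the date alone
    financial_quarter smallint,
    financial_year    smallint
);

Attributes such as the day of week can be populated from the date, whereas the financial columns have to be populated from business rules supplied by the organization.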
The Effect of Causal Changes to Data

Upon examination, it appears that some changes are causal in nature, in that a change to the value of one attribute implies a change to the value of some other attribute in the schema. The extent of causality will vary from case to case, but the designer must be aware that a change to the value of a particular attribute, whose historical values have low importance to the organization, may cause a change to occur in the value of another attribute that has much greater importance. While this may be true in all systems, it is particularly relevant to data warehousing because of the disparate nature of the source systems that provide the data used to populate the warehouse. It is possible, for instance, that the source system containing customer addresses may not actually hold information about sales areas. The sales area classification may come from, say, a marketing database or some kind of demographic data. Changes to addresses, which are detected in the operational database, must be
implemented at exactly the same time as the change to the sales area codes. Acknowledgment of the causal relationship between attributes is essential if accuracy and integrity are to be maintained. In the logical model it is necessary to identify the dependencies between attributes so that the appropriate physical links can be implemented.
CAPTURING CHANGES

Let's now examine how changes are identified in the source systems, how they can subsequently be captured into the data warehouse, and the problems that can occur.

Capturing Behavior

As has been previously stated, the behavioral facts relate to the business transactions of the organization. Facts are usually derived from some entity having been "frozen" and captured at a particular status in its lifecycle. The process by which this status is achieved is normally triggered by an event.

What do we mean by the term event? There are two ways of considering the definition of an event. If the data warehouse is viewed in isolation, so that the facts it records are not perceived as related to the source systems from which they were derived, then they can be viewed purely as events that occurred at a single point in time. If, however, the data warehouse is perceived as part of the "enterprise" database systems, then the facts should be viewed within the wider context, and they become an entity preserved at a "frozen" state, having been triggered by an event. Either way, the distinguishing feature of facts is that they do not have a lifespan. They are associated with just one time attribute. For the purpose of clarity, the following definition of facts will be adopted:

A fact is a single state entity that is created by the occurrence of some event.

In principle, the processes involved in the capture of behavior are relatively straightforward. The extraction of new behavioral facts for insertion into the data warehouse is performed on a periodic, very often daily, basis. This tends to occur when the operational processing systems are not functioning, typically during the overnight "batch" processing cycle. The main benefit of this approach is that all of the previous day's data can be collected and transferred at one time. The process of identifying the facts varies from one organization to another, from being very easy to almost impossible to accomplish. For instance, the fact data may come from:

Telephonic network switches or billing systems, in the case of telecommunications companies
Order processing systems, in the case of mail order companies such as the Wine Club
Cash receipts, in the case of retail outlets

Once the facts have been identified, they are usually stored in sequential files or streams that are appended to during the day. As the data warehouse usually resides on a hardware platform that is separate from the operational system, the files have to be moved before they can be processed further. The next step is to validate and modify each record to ensure that it conforms to the format and semantic integration rules that were described in Chapter 2. The actual loading of the data is usually performed using the "bulk" load utility that most database management systems provide.
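As a simple illustration of this final step, validated facts might be moved from a staging area into the fact table either by a bulk load utility or, equivalently, by an insert-select of the following kind. The staging table sales_stage, its load_status flag, and the wine_code column are hypothetical names used only for the sketch.

-- Sketch only: move the previous day's validated sales from staging into the fact table.
insert into sales (customer_code, wine_code, time_code, quantity, value)
select customer_code, wine_code, time_code, quantity, value
from   sales_stage
where  load_status = 'VALIDATED';   -- illustrative flag set by the validation step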
Once recorded, the values of fact attributes never change, so they should be regarded as single state or stateless. There is a time element that applies to facts, but it is simply the time that the event occurred. It is usually implemented in the form of a single timestamp. The granularity of the timestamp will vary from one application to another. For instance, in the Wine Club, the timestamp of a sale records the date of the sale. In a telecommunications application, the timestamp would record not only the date but also the hour, minute, and second that the call was placed.

Capturing Circumstances

The circumstances and dimensions are derived from what have been referred to as the reference entities within the organization. This is information such as customer, product, and market segment. Unlike the facts in a data warehouse, this type of information does have a lifespan. For instance, products may pass through various states during their lifespan, from new to fast moving to slow moving to discontinued to deleted.

The identification and capture of new or changed dimensional information is usually quite different from the capture of facts. For instance, it is often the case that customer details are captured in the operational systems some time after the customer starts using the services of the organization. Also, the date at which the customer is enrolled as a customer is often not recorded in the system, and neither is the date when they cease to be a customer. Similarly, when a dimensional attribute changes, such as the address of a customer, the new address is recorded in such a way as to replace the existing address; the date of the change of address is often not recorded. The dates of changes to other dimensional attributes are also, usually, not recorded. This is only a problem if the attribute concerned is one for which there is a requirement to record the historical values faithfully. In the Wine Club, for instance, the following attributes need to have their historical values preserved:

Customers' addresses
Customers' sales areas
Wine sales prices and cost prices
Suppliers of wines
Managers of sales areas

If the time of the change is not recorded in the operational systems, then it is impossible to determine the valid time at which the change occurred. Where the valid time of a change is not available, it may be appropriate to try to ascertain the transaction time of the change event. This is the time that the change was recorded in the database, as opposed to the time the change actually occurred. However, in the same way that the valid time of changes is not recorded, the transaction time of changes is usually not recorded explicitly as part of the operational application. In order for the data warehouse to capture the time of the changes, there are methods available that can assist us in identifying the transaction times:
Make changes to the operational systems. The degree to which this is possible depends on a number of factors. If the system has been developed specifically for the organization, either by the organization's own IT staff or by some third party, then as long as the skills are available and the costs and timescales are not prohibitive, the operational systems can be changed to accommodate the requirements of the data warehouse. Where the application is a standard package product, it becomes very much more difficult to make changes to the system without violating commercial agreements covering such things as upgrades and maintenance. If the underlying database management system supporting the application is relational, then it is possible to capture the changes by the introduction of such things as database triggers. Experience shows that most organizations are reluctant to alter operational applications in order to service informational systems requirements, for reasons of cost and complexity. Also, the placing of additional processing inside operational systems is often seen as a threat to the performance of those systems.

Interrogation of audit trails. Some operational applications maintain audit trails to enable changes to be traced. Where these exist, they can be a valuable source of information to enable the capture of transaction time changes.

Interrogation of DBMS log files. Most database management systems maintain log files for system recovery purposes. It is possible, if the right skills are available, to interrogate these files to identify changes and their associated transaction times. This practice is discouraged by the DBMS vendors, as log files are intended for internal use by the DBMS. If the files are damaged by unauthorized access, the ability of the DBMS to perform a recovery may be compromised. Also, the DBMS vendors always reserve the right to alter the format of the log files without notice. If this happens, processes that have been developed to capture changes may stop working or may produce incorrect results. Obviously, this approach is not available to non-DBMS applications.

File comparison. This involves the capture of an entire file, or table, of dimensional data and the copying of the file so that it can be compared to the data already held in the data warehouse; a sketch of such a comparison appears at the end of this discussion. Any changes that are identified can then be applied to the warehouse. The time of the change is taken to be the system time of the detection of the change, that is, the time the file comparison process was executed. Experience shows that the file comparison technique is the one most frequently adopted when data warehouses are developed. It is the approach that has the least impact on the operational environment, and it is the least costly to implement.

It should also be remembered that some dimensions are created by the amalgamation of data from several operational systems and some external systems. This will certainly exacerbate an already complex problem. Where the dimensions in a dimensional model are large (some organizations have several million customers), the capture of the data, followed by the transfer to the data warehouse environment and subsequent comparison, is a process that can be very time-consuming. Consequently, most organizations place limits on the frequency with which this process can be executed. At best, the frequency is weekly. The processing can then take place over the weekend when the systems are
relatively quiet, and the extra processing required to perform this exercise can be absorbed without too much of an impact on other processing. Many organizations permit only monthly updates to the dimensional data, and some are even less frequent than that. The problem is that the only transaction time available, against which the changes can be recorded, is the date upon which the change was discovered (i.e., the file comparison date). So, for example, let us assume that the frequency of comparison is monthly and the changes are captured at the end of the month. If a customer changes address, and geographic region, at the beginning of the month, then any facts recorded for the customer during the month will be credited permanently to the old, incorrect region.

It is possible that, during a single month, more than one change will occur to the same attribute. If the data is collected by the file comparison method, the only values that will be captured are those that exist at the time of capture. All intermediate changes will be missed completely. The degree to which this is a problem will vary from application to application. It is accepted that, in general, valid time change capture for dimensions is not practically realistic. However, it is important that practitioners recognize the issue and try to keep the difference between transaction time and valid time as small as possible. The fact that some data relating to time, as well as other attributes, is found to be absent from the source systems can come to dominate data warehouse developments. As a result of these problems, the extraction of data is sometimes the longest and riskiest part of a data warehouse project.

Summary of the Problems Involving Time

So far in this chapter we have seen that maintaining accuracy in a data warehouse presents a challenging set of problems, summarized below:

1. Identifying and capturing the temporal requirements. The first problem is to identify the temporal requirements. There is currently no method for doing this; the present data warehouse modeling techniques do not provide any real support for it.

2. Capture of dimensional updates. What happens when a relationship changes (e.g., a salesperson moves from one department to another)? What happens when a relationship no longer exists (e.g., a salesperson leaves the company)? How does the warehouse handle changes in attribute values (e.g., a product was blue, now it is red)? Is there a need to be able to report on its sales performance when it was red or blue, as well as for the product throughout the whole of its lifecycle?

3. The timeliness of capture. It now seems clear that absolute accuracy in a data warehouse is not a practical objective. There is a need to be able to assess the level of inaccuracy so that a degree of confidence can be applied to the results obtained from queries.

4. Synchronization of changes. When an attribute changes, a mechanism is required for identifying dependent attributes that might also need to be changed. The absence of synchronization affects the credibility of the results.
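Returning to the file comparison technique described earlier, the following is a minimal sketch of how changed customer rows might be detected by comparing a staged snapshot against the current warehouse dimension. The table names customer_stage and customer_dim, and the column names, are hypothetical and used only for illustration; new customers would be found with a similar query using a NOT EXISTS test, and the detected changes would then be applied using whichever slowly changing dimension approach has been chosen.

-- Sketch only: customers whose address or sales area differs between the
-- latest operational snapshot and the warehouse dimension.
select s.customer_code,
       s.customer_address,
       s.sales_area_code
from   customer_stage s, customer_dim d
where  s.customer_code = d.customer_code
and   (s.customer_address <> d.customer_address
    or s.sales_area_code  <> d.sales_area_code);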
We have seen that obtaining the changed data can involve complex processing and may require sophisticated design to implement in a way that provides for both accuracy of information and reasonable performance.

Also in this chapter we have explored the various problems associated with time in data warehousing. Some of these problems are inherent in the standard dimensional model, but it is possible to overcome them by making changes to the way dimensional models are designed. Other problems relate to the way data warehouses interact with operational systems. These are more difficult, and sometimes impossible, to solve. Nevertheless, data warehouse designers need to be fully aware of the extent of the problems and familiar with the various approaches to solving them. These are discussed in the coming chapters.

The biggest set of problems lies in the capture and accurate representation of historical information. The problem is most difficult when changes occur in the lifespan of dimensions and in the relationships within dimensional hierarchies, and also where attributes change their values and there is a requirement to faithfully reflect those changes through history. Having exposed the issues and established the problems, let's have a look at some of the conventional ways of solving them.
FIRST-GENERATION SOLUTIONS FOR TIME

We now go on to describe some solutions to the problems of the representation of time that have been used in first-generation data warehouses. One of the main problems is that the business requirements with respect to time have not been systematically captured at the conceptual level. This is largely because we are unfamiliar with temporal semantics, having so far not encountered temporal applications. Logical models follow systematically from conceptual models, and so a natural consequence of the failure to define the requirements in the conceptual model is that the requirements are also absent from the logical and physical implementation. As a result, practitioners have subsequently found themselves faced with problems involving time, and some have created solutions. However, the solutions have been developed on a somewhat ad hoc basis and are by no means comprehensive. The problem is sufficiently large that we really do need a rigorous approach to solving it.

As an example of the scale of the problem, there is, as previously mentioned, evidence in a government-sponsored housing survey that, in the United Kingdom, people change their addresses, on average, every 10 years. This means that an organization can expect to have to implement address changes for about 10 percent of its customers each year. Over a 10-year period, if an organization has one million customers, it can expect to have to deal with one million changes of address. Obviously, some people will not move, but others will move more than once in that period. This covers only address changes. There are other attributes relating to customers that will also change, although perhaps not with the same frequency as addresses.

One of the major contributors to the development of solutions in this area is Ralph Kimball (1996). His collective term for changes to dimensional attributes is slowly changing dimensions. The term has become well known within the data warehouse industry and has been adopted generally by practitioners. He cites three methods of tracking changes to dimensional attributes with respect to time, which he calls simply Types 1, 2, and 3. Within the industry, practitioners are generally aware of the three types and, where any support for time is provided in dimensional models, these are the approaches that are normally used. It is common to refer to products and methods as being consistent with Kimball's Type 1, 2, or 3. In a later work, Kimball (1998) recognizes a Type 0 that represents dimensions that are not subject to change.
The Type 1 Approach

The first type of change, known as Type 1, is to replace the old data values with the new values. This means that there is no need to preserve the previous value. The advantage of this approach, from a system perspective, is that it is very easy to implement. Obviously there is no temporal support being offered in this solution. However, this method sometimes offers the most appropriate solution. We
don't need to track the historical values of every single database element and, sometimes, overwriting the old values is the right thing to do. In the Wine Club example, attributes like the customer's name are best treated in this way. This is an attribute for which there is no requirement to retain historical values; only the latest value is deemed by the organization to be useful. All data warehouse applications will have some attributes for which the correct approach is to overwrite the previous values with the new values.

It is important that the updating is effected on a per attribute basis rather than a per row basis. Each table will have a mixture of attributes, some of which will require the Type 1 replacement approach, while others will require a more sophisticated approach to the treatment of value changes over time. The worst scenario is a full table replacement approach where the dimension is, periodically, completely overwritten. The danger here is that any rows that have been deleted in the operational system may be deleted in the data warehouse. Any rows in the fact table that refer to rows in the dimension that have been deleted will cause a referential integrity violation and will place the database in an invalid state. Thus, the periodic update of dimensions in the data warehouse must involve only inserts and updates. Any logical deletions (e.g., where a customer ceases to be a customer) must be processed as updates in the data warehouse. It is important to know whether a customer still exists as a customer, but the customer record must remain in the database for the whole lifespan of the data warehouse or, at the very least, for as long as there are fact table records that refer to the dimensional record.

Because Type 1 is the simplest approach, it is often used as the default. Practitioners will sometimes adopt a Type 1 solution as a short-term expedient, where the application really requires a more complete solution, with the intention of providing proper support for time at a later stage in the project. Too often, the pressures of project budgets and implementation deadlines force changes to the scope of projects and the enhancements are abandoned. Sometimes, Type 1 is adopted due to inadequate analysis of the requirements with respect to time.
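As a minimal illustration of the Type 1 mechanism, and assuming a customer dimension keyed on customer_code as in the Wine Club examples (the new name value is purely illustrative), a change to a nonhistorical attribute is simply overwritten in place:

-- Type 1: overwrite the attribute; no history is kept.
update customer
set    customer_name = 'L. Jones-Smith'   -- illustrative new value
where  customer_code = 1136;

The update is applied to the single attribute only; the rest of the row, and any facts that reference it, are untouched.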
The Type 2 Approach

The second solution to slowly changing dimensions is called Type 2. Type 2 is a more complex solution than Type 1 and does attempt to faithfully record historical values of attributes by providing a form of version control. Type 2 changes are best explained with the use of an example. In the case study, the sales area in which a customer lives is subject to change when they move. There is a requirement to faithfully reflect regional sales performance over time. This means that the sales area prevailing at the time of the sale must be used when analyzing sales. If the Type 1 approach were used when recording changes to the sales area, the historical values would appear to have the same sales area as current values. A method is needed, therefore, that enables us to reflect history faithfully.

The Type 2 method attempts to solve this problem by the creation of new records. Every time an attribute's value changes, if faithful recording of history is required, an entirely new record is created with all the unaffected attributes unchanged. Only the affected attribute is changed to reflect its new value. The obvious problem with this approach is that it would immediately compromise the
uniqueness property of the primary key, as the new record would have the same key as the previous record. This can be turned into an advantage by the use of surrogate keys. A surrogate key is a system-generated identifier that introduces a layer of indirection into the model. It is good practice to use surrogate keys in all the customer and dimensional data. The main reason for this is that the production key is subject to change whenever the company reorganizes its customers or products, and this would cause unacceptable disruption to the data warehouse if the change had to be carried through. It is better to create an arbitrary key to provide the property of uniqueness. So each time a new record is created, following a change to the value of an attribute, a new surrogate key is assigned to the record. Sometimes, a surrogate approach is forced upon us when we are attempting to integrate data from different source systems where the identifiers are not the same. There are two main approaches to assigning the value of the surrogate:
1. The identifier is lengthened by a number of version digits. So a customer having an identifier of “1234” would subsequently have the identifier “1234001.” After the first change, a new row would be created that would have an identifier of “1234002.” The customer would now have two records in the dimension. Most of the attribute values would be the same. Only the attribute, or attributes, that had changed would be different.
2. The identifier could be truly generalized and bear no relation to the previous identifiers for the customer. So each time a new row is added, a completely new identifier is created.

In a behavioral model, the generalized key is used in both the dimension table and the fact table. Constraining queries using a descriptive attribute, such as the name of the customer, will result in all records for the customer being retrieved. Constraining or grouping the results by the name and, say, the sales area attribute will ensure that history is faithfully reflected in the results of queries, assuming of course that uniqueness of the descriptive attribute can be guaranteed. The Type 2 approach, therefore, will ensure that the fact table is correctly joined to the dimension and that the correct dimensional attributes are associated with each fact. Ensuring that the facts are matched to the correct dimensional attributes is the main concern.

An obvious variation of this approach is to construct a composite identifier by retaining the previous number "1234" and adding a new counter or "version" attribute that, initially, would be "1." This approach is similar to approach 1, above. The initial identifier is allocated when the customer is first entered into the database. Subsequent changes require the allocation of new identifiers. It is the responsibility of the data warehouse administrator to control the allocation of identifiers and to maintain the version number in order to know which version number, or generalized key, to allocate next.

In reality, the requirement would be to combine Type 1 and Type 2 solutions in the same logical row. This is where we have some attributes that we do want to track and some that we do not. An example of this occurs in the Wine Club where, in the customer's circumstances, we wish to trace the history of attributes like the address and, consequently, the sales area, but we are not interested in the history of the customer's name or their hobby code. So in a single logical row, an attribute like address would need to be treated as Type 2, whereas the name would be treated as Type 1. Therefore, if the customer's name changes, we would wish to overwrite it. However, there may be
many records in existence for this customer, due to previous changes to other attributes. Do we have to go back and overwrite the previous records? In practice, it is likely that only the latest record would be updated. This implies that, in dimensions where Type 2 is implemented, attributes for which the Type 1 approach would be preferred will be forced to adopt an approach that is nearly, but not quite, Type 2.

In describing the problems surrounding time in data warehousing, we saw how the results of a query could change due to a customer moving. The approach taken was simply to overwrite the old address and the sales area code with the new values. This is equivalent to Kimball's Type 1 approach. If we implement the same changes using the Type 2 method, the results would not be disrupted, as new records would be created with a new surrogate identifier. Future insertions to the sales fact table will be related to the new identifying customer codes, and so the segmentation will remain consistent with respect to time for the purposes of this particular query.

One potential issue here is that, by making use of generalized keys, it becomes impossible to recognize individual customers by use of the identifying attribute. As each subsequent change occurs, a new row is inserted and is identified by a key value that is in no way associated with the previous key value. For example, Lucie Jones's original value for the customer code attribute might be, say, 1136, whereas the value of the customer code for the newly inserted row could be anything, say 8876, being the next available key in the domain range. This means that, if it were required to extract information on a per customer basis, the grouping would have to be on a nonidentifying attribute, such as the customer's name, that is:
select customer_name "Name", sum(quantity) "Bottles", sum(value) "Revenue"
from sales s, customer c
where c.customer_code = s.customer_code
group by customer_name
order by customer_name

Constraining and grouping queries using descriptive attributes like names is clearly risky, since names are apt to be duplicated and erroneous results could be produced. Another potential issue with this approach is that, if the keys are truly generalized, as with key hashing, it may not be possible to identify the latest record by simply selecting the highest key. Also, the use of generalized keys means that obtaining the history of, say, a customer's details may not be as simple as ordering the keys into ascending sequence. One solution to this problem is the addition of a constant descriptive attribute, such as the original production key, that is unique to the logical row. Alternatively, a variation as previously described, in which the original key is retained but is augmented by an additional attribute to define the version,
would also provide the solution to this. The Type 2 method does not allow the use of date columns to identify when changes actually take place. This means, for instance, that it is not possible to establish with any accuracy when a customer actually moved. The only date available to provide any clue to this is the transaction date in the fact table. There are some problems associated with this. A query such as "List the names and addresses of all customers who have purchased more than twelve bottles of wine in the last three months" might be useful for campaign purposes. Such a query will, however, return incorrect addresses for those customers who have moved but have not since placed an order. The query in Query Listing 4.4 shows this.
Listing 4.4 Query to produce a campaign list.
select c.customer_code, customer_name, customer_address, sum(quantity)
from customer c, sales s, time t
where c.customer_code = s.customer_code
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c.customer_code, customer_name, customer_address
having sum(quantity) > 12

The table in Table 4.1 is a subset of the result set for the query in Listing 4.4.
Table 4.1. List of Customers to be Contacted

Customer Code   Customer Name    Customer Address                          Sum (Quantity)
1332            A.J. Gordon      82 Milton Ave, Chester, Cheshire          49
1315            P. Chamberlain   11a Mount Pleasant, Sunderland            34
2131            Q.E. McCallum    32 College Ride, Minehead, Somerset       14
1531            C.D. Jones       71 Queensway, Leeds, Yorks                31
1136            L. Jones         9 Broughton Hall Ave, Woking, Surrey      32
2141            J.K. Noble       79 Priors Croft, Torquay, Devon           58
4153            C. Smallpiece    58 Ballard Road, Bristol                  21
1321            D. Hartley       88 Ballantyne Road, Minehead, Somerset    66
The row for customer L. Jones illustrates the point. L. Jones has two entries in the database and, because Lucie has not purchased any wine since moving, the incorrect address was returned by the query. The result of a simple search is shown in Table 4.2.
Table 4.2. Multiple Records for a Single Customer

Customer Code   Customer Name   Customer Address
1136            L. Jones        9 Broughton Hall Ave, Woking, Surrey
8876            L. Jones        44 Sea View Terrace, West Bay, Bridport, Dorset
If it can be assumed that the generalized key is always ascending, then the query could be modified, as shown below, to select the highest value of the key.
select customer_code, customer_name, customer_address
from customer
where customer_code = (select max(customer_code)
                       from customer
                       where customer_name = 'L. Jones')

This query would return the second of the two rows listed in Table 4.2. Using the other technique to implement the Type 2 method, we could have altered the customer code from "1136" to "113601" for the original row and, subsequently, to "113602" for the new row containing the changed address and sales area. In order to return the correct addresses, the query in Query Listing 4.5 has to be executed.
Listing 4.5 Obtaining the latest customer's details using Type 2 with extended identifiers.
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and c1.customer_code = (select max(c2.customer_code)
                        from customer c2
                        where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4))
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The query in Listing 4.5 contains a correlated subquery, including the following line:
where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4)

The query matches the generic parts of the customer code by use of a "substring" function, provided
by the query processor. It is suspected that this type of query may be beyond the capability of most users. The approach also depends on all the codes having the same fundamental format, that is, four digits plus a suffix. If the identifiers simply ranged from 1 through 9,999, then this technique could not be adopted, because the substring function would not produce the right answer.

The obvious variation on the above approach is to add an extra attribute to distinguish versions. The identifier then becomes the composite of two attributes instead of a single attribute. In this case, the original attribute remains unaltered, and the new attribute is incremented, as shown in Table 4.3.
Table 4.3. A Modification to Type 2 Using Composite Identifiers

Customer Code   Version Number
1136            01
1136            02
1136            03
Using this technique, the following query is executed:
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and s.counter = c1.counter
and c1.counter = (select max(c2.counter)
                  from customer c2
                  where c1.customer_code = c2.customer_code)
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The structure of this query is the same as in Query Listing 4.5. However, this approach does not require the use of substrings to make the comparison, which means that the query will always produce the right answer irrespective of the consistency, or otherwise, of the encoding procedures within the organization. These solutions do not, however, resolve the problem of pinpointing when a change occurs. Due to the absence of dates in the Type 2 method, it is impossible to determine precisely when changes occur. The only way to extract any form of alignment with time is via a join to the fact table. This, at best, will give an approximate time for the change. The degree of accuracy depends on the frequency of fact table entries relating to the dimensional entry concerned. The more frequent the entries in the fact table, the more accurate will be the traceability of the history of the dimension, and vice versa. It is also not
possible to record gaps in the existence of dimensional entries. For instance, in order to track precisely the discontinuous existence of, say, a customer, there must be some kind of temporal reference to record the periods of existence.
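To summarize the mechanics before moving on, the following is a minimal sketch of a Type 2 change using the composite customer code plus version counter described above. The table and column names are illustrative, the counter is assumed to be numeric, and in practice the new version and the changed values would be assigned by the load process rather than hand-coded.

-- Sketch only: a customer moves, so a new version of the row is inserted.
-- The existing row (customer_code 1136, earlier counters) is left untouched.
insert into customer (customer_code, counter, customer_name,
                      customer_address, sales_area_code, hobby_code)
select customer_code,
       counter + 1,                                        -- next version for this customer
       customer_name,
       '44 Sea View Terrace, West Bay, Bridport, Dorset',  -- the changed attribute
       'SC',                                               -- illustrative new sales area code
       hobby_code
from   customer
where  customer_code = 1136
and    counter = (select max(counter)
                  from customer
                  where customer_code = 1136);

Subsequent fact rows would carry the new (customer_code, counter) pair so that they join to the correct version of the customer.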
Problems With Hierarchies
So far in this chapter, attention has focused on so-called slowly changing dimensions and how these might be supported using the Type 2 solution. Now we will turn our attention to what we shall call slowly changing hierarchies. As an example, we will use the dimensional hierarchy illustrated in Figure 3.4. The attributes, using a surrogate key approach, are as follows:

Sales_Area(Sales_Area_Key, Sales_Area_Code, Manager_Key, Sales_Area_Name)
Manager(Manager_Key, Manager_Code, Manager_Name)
Customer(Customer_Key, Customer_Code, Customer_Name, Customer_Address, Sales_Area_Key, Hobby_Code, Date_Joined)

Let's say the number of customers and the spread of sales areas in the case study database is as shown in Table 4.4.
Table 4.4. Customers Grouped by Sales Area

Sales Area   Count
North East   18,967
North West   11,498
South East   39,113
South West   28,697
We will assume that we have implemented the Type 2 solution to slowly changing dimensions. If sales area SW were to experience a change of manager from M9 to M12, then a new sales area record would be inserted with a new surrogate key, together with the new manager code. So if the previous record was (1, SW, M9, "South West"), the new record, with its new key assumed to be 5, would contain (5, SW, M12, "South West"). However, that is not the end of the change. Each of the customers in the "SW" sales area still has a foreign key reference pointing to the original sales area record containing the reference to the old manager (surrogate key 1). Therefore, we also have to create an entire set of new records for those customers, with each of their sales area key values set to "5". In this case, there are 28,697 new records to be created. It is not valid to simply update the foreign keys with the new value, because the old historical link would be lost.
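A minimal sketch of the cascade for this example follows. The table and column names mirror the attribute lists above, the manager key value is illustrative, and the way the new customer surrogate keys are generated is deliberately simplistic; in practice they would come from the warehouse's key allocation mechanism.

-- Sketch only: the manager of sales area SW changes, so a new sales area
-- version is inserted and every current SW customer must be re-versioned to point at it.
insert into sales_area (sales_area_key, sales_area_code, manager_key, sales_area_name)
values (5, 'SW', 12, 'South West');

insert into customer (customer_key, customer_code, customer_name,
                      customer_address, sales_area_key, hobby_code, date_joined)
select customer_key + 100000,        -- simplistic new surrogate key, for illustration only
       customer_code, customer_name,
       customer_address,
       5,                            -- the new sales area version
       hobby_code, date_joined
from   customer
where  sales_area_key = 1;           -- every current SW customer: 28,697 rows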
Where there are complex hierarchies involving more levels and more rows, it is not difficult to imagine very large volumes of inserts being generated. For example, in a four-level hierarchy where the relationship is just 1:100 at each level, a single change at the top level will cause over a million new records to be inserted (1 + 100 + 10,000 + 1,000,000 rows, or roughly 1.01 million). A relationship of 1:100 is not inordinately high when there are many data warehouses in existence with several million customers in the customer dimension alone. The number of extraneous insertions generated by this approach could cause the dimension tables to grow at a rate that, in time, becomes a threat to performance.

For the true star schema advocates, we could try flattening the hierarchy into a single dimension (denormalizing). This converts a snowflake schema into a star schema. If this approach is taken, the effect is that, in the four-level 1:100 example, the number of insertions reduces only from about 1.01 million to 1 million. So reducing the number of insertions is not a reason for flattening the hierarchy.
Browse Queries
The Type 2 approach does cause some problems when it comes to browsing. It is generally reckoned that some 80 percent of data warehouse queries are dimension-browsing queries. This means that they do not access any fact table. A typical browse query we might wish to perform is to count the number of occurrences. For instance, how many customers do we have? The standard way to do this in SQL is shown in the following query:
Select count(*) from <table>
where <predicate>

In a Type 2 scenario, this will produce the wrong result, because for each logical record there are many physical records, resulting in a number of "duplicated" rows. Take the example of a sales exec entity and a customer entity shown in Figure 4.8.
Figure 4.8. Simple general business hierarchy.
In order to count the number of customers that a sales exec is responsible for, a nonexpert user might express the query as shown in Query Listing 4.6.
Listing 4.6 Nonexpert query to count customers.
Select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

When using the Type 2 approach to allow for changes to a customer attribute, this will produce the wrong result. This is because for each customer there may be many rows, resulting from the duplicates created when an attribute value changes. With more careful consideration, it would seem that the query should instead be expressed as follows:
Select count(distinct <column>) from <table>
where <predicate>

In our example, it translates to the following query:
Select count(distinct CustNum)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

Unfortunately, this query does not give the correct result either, because the result set contains all the customers that this sales executive has ever been responsible for at any time in the past. The query includes the customers that are no longer the responsibility of Tom Sawyer. When comparing sales executives, this would result in customers being double counted. On further examination, it might appear that this problem can be resolved by selecting the latest entry for each customer to ensure that they are counted only against their current sales executive. Assuming an incremented general key, this can be expressed by the following query:
Select count(*)
from Sales_Exec S, Customer C1
where S.SalesExecNum = C1.SalesExecNum
and S.Name = 'Tom Sawyer'
and C1.Counter = (Select max(Counter)
                  from Customer C2
                  where C1.CustNum = C2.CustNum)

In fact, this query also gives invalid results because it still includes customers that are no longer the responsibility of Tom Sawyer. That is, the count includes customers who are not currently the responsibility of any sales executive, because they are no longer customers, but whose records must remain because they are referenced by rows of the fact table.
This example should be seen as typical of data warehouses, and the problem just described is a general one. That is, using the Type 2 approach, there are simple dimensional queries that cannot be answered correctly. It is reasonable to conclude that the Type 2 approach does not fully support time in the dimensions. Its purpose, simply, is to ensure that fact entries are related to the correct dimensional entries, such that each fact, when joined to any dimension, displays the correct set of attribute values.
Row Timestamping
In a much later publication, The Data Warehouse Lifecycle Toolkit, Kimball (1998) recognizes the problem of dimensional counts and appears to have changed his mind about the use of dates. His solution is that the Type 2 method is "embellished" by the addition of begin and end timestamps to each dimension record. This approach, in temporal database terms, is known as row timestamping. Using this technique, Kimball says, it is possible to determine precisely how many customers existed at any point in time. The query that achieves this is shown here:
Select count(*)
from Sales_Exec S, Customer C1
where S.SalesExecNum = C1.SalesExecNum
and S.Name = 'Tom Sawyer'
and C1.EndDate is NULL

For the sake of example, the null value in the end date is assumed to represent the latest record, but other values could be used, such as the maximum date that the system will accept, for example, 31 December 9999. In effect, this approach is similar to the previous example because it does not necessarily identify ex-customers. So instead of answering the question "How many customers is Tom Sawyer responsible for?" it may be answering "How many customers has Tom Sawyer ever been responsible for?" This method will produce the correct result only if the end date is updated when the customer becomes inactive.

The adoption of a row timestamping approach can provide a solution to queries involving counts at a point in time. However, it is important to recognize that the single pair of timestamps is being used in a multipurpose way to record:
1. Changes in the active state of the dimension
2. Changes to values in the attributes
3. Changes in relationships

Therefore, this approach cannot be used where there is a requirement to implement discontinuous existences where, say, a customer can become inactive for a period, because it is not possible to determine when they were inactive. The only way to determine inactivity is to try to identify two temporally consecutive rows where there is a time gap between the ending timestamp of the earlier
row and the starting timestamp of the succeeding row. This is not really practical using standard SQL. Even where discontinuous existences are not present, the use of row timestamping makes it difficult to express queries involving durations, because a single duration, such as the period of residence at a particular address or the period that a customer had continuously been active before closing their account, may be spread over many physical records. For example, a query to determine how many of the customers of the Wine Club had been customers for more than a year before leaving during 2001 could be expressed as follows:
Select '2001' as year, count(*)
From customer c1, customer c2
Where c1.start_date = (select min(c3.start_date)
                       from customer c3
                       where c3.customer_code = c1.customer_code)
and c2.end_date = (select max(c4.end_date)
                   from customer c4
                   where c4.customer_code = c2.customer_code)
and c2.end_date between '2001/01/01' and '2001/12/31'
and c1.customer_code = c2.customer_code
and c2.end_date - c1.start_date > 365
group by year

This query contains a self-join and two correlated subqueries, so the same table is used four times. The customer dimension is the one most likely to be subjected to this type of query, and organizations that have several million customers are therefore likely to experience poor browsing performance. This problem is exacerbated when row timestamping is used with dimensions that are engaged in hierarchies because, like the Type 2 solution, changes tend to cause cascades of extraneous rows to be inserted.

The second example mentioned above, the period of residence at a particular address, has a further requirement: the duration must be limited by the detection of change events on attributes of the dimension. This is not possible to do with absolute certainty, because circumstances, having changed, might later revert to the previous state. For instance, students leave home to attend university but might well return to the previous address later. In the Wine Club, a supplier might be reinstated as the supplier of a wine they had previously supplied. This is, in effect, another example of a discontinuous existence and cannot be detected using standard SQL.
The Type 3 Approach

The third type of change solution (Type 3) involves recording the current value, as well as the original value, of an attribute. This means that an additional column has to be created to contain the extra
value. In this case, it makes sense to add an effective date column as well. The current value column is updated each time a change occurs. The original value column does not change. Intermediate values, therefore, are lost. In terms of its support for time, Type 3 is rather quirky and does not add any real value, so we will not consider it further.
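For completeness, a minimal sketch of what a Type 3 structure might look like is shown below. The column names, codes, and dates are illustrative only.

-- Type 3: keep the original and current values side by side; intermediate values are lost.
alter table customer add original_sales_area char(2);
alter table customer add current_sales_area char(2);
alter table customer add effective_date date;

-- When a customer moves, only the current value and the effective date are updated.
update customer
set    current_sales_area = 'NW',            -- illustrative new value
       effective_date     = '2001/03/15'     -- illustrative date of the change
where  customer_code = 1136;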
TSQL2

TSQL2 has been proposed as a standard for the inclusion of temporal extensions to standard SQL but has so far not been adopted by any of the major RDBMS vendors. It introduces some functions that a temporal query language must have. In this book we are concerned with providing a practical approach that can be implemented using today's technology, so we cannot dwell on potential future possibilities. However, some of the functions are useful and can be translated into the versions of SQL that are available right now. These functions enable us to make comparisons between two periods (P1 and P2) of time (a period is any length of time, i.e., it has a start time and an end time). There are four main temporal functions:
1. Precedes. For this function to return “true,” the end time of P1 must be earlier than the start time of P2.
2. Meets. For this to be true, the end time of P1 must be one chronon earlier than the start time of P2, or vice versa. A chronon is the smallest amount of time allowed in the system. How small this is will depend on the granularity of time in use. It might be as small as a microsecond or as large as one day.
3. Overlaps. In order for this function to return "true," some part of P1 must coincide with some part of P2.
4. Contains. The start of P1 must be earlier than the start of P2 and the end of P1 must be later than the end of P2. The diagram in Figure 4.9 illustrates these points.
Figure 4.9. Graphical representation of temporal functions.
The TSQL2 functions such as Precedes, Overlaps, and so on are of interest here because they can be interpreted, and expressed, in standard (nontemporal) SQL. Upon closer examination, the requirements in a dimensional data warehouse are quite simple. The "contains" function is useful because it enables queries such as "How many customers did the Wine Club have at the end of 2000?" to be asked. What this is really asking is whether the start and end dates of a customer's period of existence contain the date 31-12-2000. This can be restated to ask whether the date 31-12-2000 is between the start and end dates. So the "contains" function from TSQL2 can be written almost as easily using the "between" construct. Other functions, such as "meets," can be useful when trying to detect dimensional attribute changes in an implementation that uses row timestamps. The technique is to perform a self-join on the dimension where the end date of one row meets the start date of another row and to check whether, say, the addresses or supplier codes are different.
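As an illustration of these translations, and assuming a row-timestamped customer dimension with start_date and end_date columns (the names are illustrative), the two patterns might be written as follows. The exact "meets" condition depends on how the timestamps are maintained: if consecutive versions share the same boundary date, the join condition becomes c2.start_date = c1.end_date, and if current rows carry a NULL end date, the between predicate needs to allow for that.

-- "Contains": how many customers existed at the end of 2000?
select count(*)
from   customer
where  '2000/12/31' between start_date and end_date;

-- "Meets": detect a change of address by joining consecutive versions of a customer.
select c1.customer_code,
       c1.customer_address as old_address,
       c2.customer_address as new_address
from   customer c1, customer c2
where  c1.customer_code = c2.customer_code
and    c2.start_date = c1.end_date + 1     -- consecutive rows, one chronon (here one day) apart
and    c1.customer_address <> c2.customer_address;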
Temporal Queries

While we are on the subject of temporal extensions to existing technology, it is worth mentioning, but not dwelling upon, the considerable research that has been undertaken into the field of temporal databases. Although more than 1,000 papers have been published on the subject, a solution is not close at hand. However, some temporal definitions have already been found to be useful in this book. For instance, valid time and transaction time illustrate the difference between the time an event occurred in real life and the time that the event becomes known to the database. Another useful result is that it is now possible to define three principal types of temporal query that are highly relevant to data warehousing (see Snodgrass, 1997):
1. State duration queries. In this type of query, the predicate contains a clause that specifies a period. An example of such a query is: “List the customers who lived in the SW sales area for a duration of at least one year.” This type of query selects particular values of the dimension, utilizing a predicate associated with a group definition that mentions the duration of the row's period.
2. Temporal selection queries. This involves a selection based on a group definition of the time dimension. So the following query would fall into this category: “List the customers who lived in SW region in 1998.” This would involve a join between the customer dimension and the time dimension, which is not encouraged. Kimball's reasoning is that the time constraints for facts and dimensions are different.
3. Transition detection queries. In this class of query, we are aiming to detect a change event such as: "List the customers who moved from one region to another region." This class of query has to be able to identify consecutive periods for the same dimension. The query is similar to the state duration query in that, in order to write it, it is necessary to compare row values for the same customer.

We'll be using these terms quite a bit, and the diagram in Figure 4.10 is designed as an aide-mémoire for the three types of temporal query.
Figure 4.10. Types of temporal query.
VARIATIONS ON A THEME

One way of ensuring that the data warehouse correctly joins fact data to dimensions and dimensional hierarchies is to detach the superordinate dimensions from the hierarchy and reattach them directly to the fact table. Pursuing the example presented earlier, in which a salesperson is able to change departments and a department is able to move from one site to another, what follows is the traditional approach of resolving many-to-many relationships by the use of intersection entities, as the diagram in Figure 4.11 shows.

Figure 4.11. Traditional resolution of m:n relationships.
This approach need not be limited to the resolution of time-related relationships within dimensional hierarchies. It can also be used to resolve the problem of time-varying attributes within a dimension. So if it were required, for instance, to track a salesperson's salary over time, a separate dimension could be created as shown in Figure 4.12.

Figure 4.12. Representation of temporal attributes by attaching them to the dimension.
The identifier for the salary dimension would be the salesperson's identifier concatenated with a date. The salary amount would be a nonidentifying attribute.
Salary(Salesperson_Id, StartDate, EndDate, Salary_Amount)

The same approach could be used with all time-varying attributes. Another approach is to disconnect the hierarchies and then attach the dimensions to the fact table directly, as is shown in Figure 4.13.

Figure 4.13. Representation of temporal hierarchies by attaching them to the facts.
This means that the date that is attached to each fact will, automatically, apply to all the levels of the dimensional hierarchy. The facts would have a foreign key referring to each of the levels of the hierarchy.
We are left with a choice as to how we treat the changing attributes. The salary attribute in Figure 4.12 could be treated in the same way as before (i.e., as a separate dimension related to the salesperson dimension). Alternatively, it could be attached directly to the fact table in the same way as the other dimensions as shown in Figure 4.14. Figure 4.14. Representation of temporal attributes by attaching them to the facts.
A further variation on this approach is to include the salary amount as a nonidentifying attribute of the fact table, as follows:
Sales(Site_Id, Dept_Id, Salesman_Id, Time_Id, Salary_Amount, Sales_Quantity, Sales_Value)

This approach eliminates dimensional hierarchies and, therefore, removes the Type 2 problem of extraneous cascaded inserts when changes occur in the hierarchy. However, this is by no means a complete solution, as it does nothing to resolve the problems of time within the hierarchical structures. The question “How many customers do we have?” is not addressed by this solution. It is presented as a means of implementing Type 2 without incurring the penalties associated with slowly changing hierarchies. There is a further drawback with this approach. The tables involved in the hierarchy are now related only via the fact table. Therefore, for any combination of dimensions for which no sales records exist, the hierarchy cannot be reconstructed.
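The following DDL is a minimal sketch of the two variations, using assumed column types and names; it is illustrative only and not a complete design. The final query shows why the hierarchy can only be rebuilt from combinations that actually appear in the fact table.

-- Variation 1 (Figure 4.12): the time-varying salary held as its own table,
-- identified by the salesperson and the start of the period.
CREATE TABLE Salary (
    Salesperson_Id  INTEGER       NOT NULL,
    Start_Date      DATE          NOT NULL,
    End_Date        DATE,
    Salary_Amount   DECIMAL(9,2),
    PRIMARY KEY (Salesperson_Id, Start_Date)
);

-- Variation 2 (flattened form): hierarchy levels and the changing attribute
-- attached directly to the fact table.
CREATE TABLE Sales (
    Site_Id         INTEGER       NOT NULL,
    Dept_Id         INTEGER       NOT NULL,
    Salesman_Id     INTEGER       NOT NULL,
    Time_Id         INTEGER       NOT NULL,
    Salary_Amount   DECIMAL(9,2),
    Sales_Quantity  INTEGER,
    Sales_Value     DECIMAL(9,2)
);

-- The drawback: which department a salesperson belonged to can now only be
-- inferred from facts, so the hierarchy is invisible wherever no sales exist.
SELECT DISTINCT Salesman_Id, Dept_Id, Site_Id
FROM   Sales
WHERE  Time_Id BETWEEN 19980101 AND 19981231;  -- assumes a numeric date key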
CONCLUSION TO THE REVIEW OF FIRST-GENERATION METHODS

In this chapter we focused on the subject of time. Time has a profound effect on data warehouses because data warehouses are temporal applications. This temporal property was never really acknowledged in first-generation data warehouses and, consequently, the representation of time was, in most cases, not adequate. In dimensional models the representation of time is restricted largely to the provision of a time dimension. This enables the fact tables to be accurately partitioned with respect to time, but it does not provide much assistance with the representation of time in the dimensional hierarchies. The advent of CRM has highlighted the problem and reinforced the need for a systematic approach to dealing with time. In order to design and build a truly customer-centric data warehouse that is capable of assisting business people in the management of serious business problems such as customer churn, we absolutely have to make sure that the problems posed by the representation of time are properly considered and appropriately addressed.

We saw how the first-generation data warehouse design struggles to answer the most basic of questions, such as “How many customers do we have?” With this kind of limited capability it is impossible to calculate churn metrics or any other measures that are influenced by changes in circumstances. We also examined how the traditional approach to solving the problem of slowly changing dimensions can force hundreds or thousands of extraneous new records to be inserted when dealing with the simple hierarchies that exist in all organizations. We then went on to explore the different types of temporal query that we need to be able to ask of our data. Temporal selection, transition detection, and state duration queries will enable us to analyze changes in the circumstances of our customers so that we might gain some insight into their subsequent behavior. None of these types of query can be accommodated in the traditional first-generation data warehouse.

It is reasonable to conclude that, in first-generation data warehouse design, the attributes of the dimensions are really regarded as further attributes of the facts. When we perform a join of the fact table to any dimension, we are really just extending the fact attributes with those of the dimension. The main purpose of the Type 2 solution appears to be to ensure that each fact table entry joins up with the correct dimension table entry. In this respect it is entirely successful. However, it cannot be regarded as the solution to the problem of the proper representation of time in data warehouses. The next generation, the customer-centric solution that supports CRM, has to be capable of far more.

This concludes our exploration of the issues surrounding the representation of time in data warehouses. We now move on to the development of our general conceptual model. In doing so we will reintroduce the traditional conceptual, logical, and physical stages. At the same time, we will attempt to overcome the problems we have so far described.
Chapter 5. The Conceptual Model

This chapter focuses on the conceptual modeling part of our data warehouse solution. We intend to adhere to the definition of the conceptual model as one that specifies the information requirements. For a data warehouse design that uses our general conceptual model (GCM; see Chapter 3), we need to be able to provide the following components:

- Customer-centric diagram
- Customer's changing and nonchanging circumstances
- Customer behavior
- Derived segments

In the previous chapter we explored, in some detail, the issues surrounding the representation of time, and so our conceptual model needs to be able to capture the temporal requirements of each data element. We also looked at the causal changes and the dependencies within the data items that we need to capture.
REQUIREMENTS OF THE CONCEPTUAL MODEL

Before proceeding, the general requirements of a conceptual data model for data warehousing are restated succinctly. The model must:
1. Be simple to understand and use by nontechnical people
2. Support the GCM
3. Support time

The most widely used data models are entity relationship (ER) or, sometimes, extended entity relationship (EER) methods. These models are widely understood by IT professionals but are not easy for non-IT people to understand. A properly produced ER model contains quite a lot of syntax when you think about it. We have rules regarding cardinality, participation conditions, inclusive and exclusive relationships, n-ary relationships, and entity supertypes and subtypes. These syntax rules are necessary for models of operational systems but, in many respects, the diagrammatic requirements for dimensional data warehouses are simpler than those of traditional ER models. Consider, for the moment, the rules for dimensional models; these are what we will use to model customer behavior in the GCM:
1. The structure of a dimensional model is predictable. There is a single fact table at the center of the model. The fact table has one or more dimension tables related to it. Each dimension will have zero, one, or more hierarchical tables related to it.
2. The relationships are not usually complex. Relationships are always “one to many” in a configuration where the dimension is at the “one” end of the relationship and the fact table is the “many” end. Where dimensional hierarchies exist, the outer entity (farthest from the fact table) is the “one” end and the inner entity (nearest to the fact table) is the “many” end.
3. “One-to-one” and “many-to-many” relationships are rare, although this changes when “time” is introduced. There is no real need to model the cardinality (degree) of the relationships.
4. The participation conditions do not need to be specified. The dimension at the “one” end of the relationship always has optional participation. The participation condition for the dimension, or fact, at the “many” end is always mandatory.
5. Entity super/subtypes do not feature in dimensional models.
6. There is no requirement to name or describe the relationships as their meaning is implicit. It is important to show how the dimensional hierarchies are structured, but that is the only information that is needed to describe relationships.
7. There is no requirement for the fact table rows to have a unique identifier.
8. There is no requirement to model inclusive or exclusive relationships.

The additional rules and notations that are required to support the features in the list above are, therefore, not appropriate for dimensional data warehouses.

There is a further consideration. We know that data warehouses are not designed to support operational applications such as order processing or stock control. They are designed to assist business people in decision making. It is important, therefore, that the data warehouse contains the right information. Often, the business people are unable to express their requirements clearly in information terms. It is frequently the case that they feel they have a problem but are unsure where the problem lies. This issue was brought out in the introduction to the Wine Club case study in Chapter 2. Most business managers have a set of business objectives. These can be formally defined key performance indicators (KPIs), against which their performance is measured, or they can be more informal, self-imposed objectives. A data warehouse can be designed to help them achieve their business objectives if they are able to express those objectives clearly and to describe the kind of information they need to help make better decisions in pursuit of them. One method that leads business managers through a process of defining objectives and subsequent information requirements is described later in this chapter.

What is needed is an abstraction that allows the business requirements to be focused upon in a participative fashion. The business people must be able to build, validate, modify, or even replace the model themselves. However, in addition, the model must be powerful enough to enable the technical requirements of each data object to be specified so that the data warehouse designers can go on to develop the logical model. Later on we will introduce the dot modeling notation as a model for capturing information requirements in a way that business people can understand.

There exists a fundamental requirement that the people who use the data warehouse must understand how it is structured. Some business applications need to be supported by complex data models. Data modeling specialists are comfortable with this complexity. Their role in the organization, to some extent, depends on their being able to understand and navigate complex models. The systems are usually highly parametric in nature, and users tend to be shielded from the underlying complexity by the human-computer interface. The users of data warehouses are business people such as marketing consultants. Their usage of the warehouse is often general and, therefore, unpredictable, and it is neither necessary nor desirable to shield them from its structure. The conceptual model should be easy for nontechnical people to understand, to the extent that, with very little training, such people could produce their own models. If it is accepted that the dimensional model, due to its simplicity, is an appropriate method to describe the information held in a data warehouse, then it would be sensible to ensure that the simplicity is maintained and that the model does not add complexity.

Another requirement of the conceptual model is that it should retain its dimensional shape. Again, having achieved a model that is readily understood, we should try to ensure that the essential radial shape is retained even in relatively complex examples.
Also, there is a need within the model to record business semantic information about each of the attributes. This means that some additional, supporting information about attributes will have to be recorded in any case. For the purpose of keeping the conceptual model simple, it seems sensible to incorporate the temporal aspects into the same supporting documents.
The Treatment of Behavior

The temporal requirements of behavioral data are very straightforward. Once recorded, they do not change. They do not, therefore, have any form of lifespan. The behavioral facts are usually derived from an entity that has attained a particular state. The attainment of the state is caused by an event, and the event will occur at a particular point in time. So one of the attributes of the fact will be a time attribute. The time attribute records the time of the occurrence of the event. The time attribute will be used in two ways:
1. As part of the selection constraint in a query
2. As an aid to joining the dimensions to higher-level groupings

For each event or change there is, as has been said, an associated time. Different applications have different requirements with respect to the grain of time. Some applications require only a fairly coarse grain, such as day. These might include:

- Companies selling car insurance
- Banks analyzing balances
- Supermarkets analyzing product-line sales

Some other applications might wish to go to a finer grain, perhaps hours. These might include:

- Supermarkets analyzing customer behavior
- Transport organizations monitoring traffic volumes
- Meteorological centers studying weather patterns

Other applications require a still finer grain of time, say, seconds. An example of this is the telecommunications industry monitoring telephone calls. Another requirement of the data model, therefore, is that it needs to show clearly the granularity of time pertaining to the model. Some applications require more than a single grain of time. Consider the examples above: the supermarket example shows a different requirement with respect to time for product analysis as distinct from customer behavior analysis. In any case, almost all organizations conducting dimensional analysis will require the ability to summarize information, from whatever the granularity of the base event, to a higher (coarser) level in order to extract trend information.
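As a simple illustration, the following query sketch uses the time attribute in both ways at once. It assumes a daily-grain Sales fact and a hypothetical Time_Dim table carrying Year and Month groupings; all names are illustrative rather than part of any particular design.

SELECT t.Year, t.Month, SUM(s.Sales_Value) AS Monthly_Sales
FROM   Sales s
JOIN   Time_Dim t ON t.Time_Id = s.Time_Id      -- (2) join to higher-level groupings
WHERE  t.Day_Date BETWEEN DATE '2000-01-01'
                      AND DATE '2000-06-30'     -- (1) selection constraint
GROUP BY t.Year, t.Month;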
The Treatment of Circumstances—Retrospection

In a first-generation dimensional data warehouse, the way in which an attribute of a dimension is treated, with respect to historical values, depends entirely upon the requirements to partition the facts in historical terms. The key role of dimensional attributes is to serve as the source of constraints in a query. The requirements for historical partitioning of dimensional attributes for, say, dimension browsing have been regarded as secondary considerations. Principally, the dimensional attributes exist to constrain queries about the facts. Even though some 80 percent of queries executed are dimension browsing, the main business purpose for this is as a process of refinement of the fact table constraints. It is only comparatively recently, since the advent of CRM, that the dimensions themselves have been shown to yield valuable information about the business, such as the growth in customers. In some industries (especially telecommunications and retail banking), this is seen as the largest business imperative at the present time.

Our GCM, being a customer-centric model, enables amazingly rich and complex queries to be asked about customers' circumstances, far beyond a traditional dimensional model. The GCM, therefore, imposes the need for a much greater emphasis on the nondimensional aspects of the customer, that is, the circumstances and derived segments. In recognition of the need to place more emphasis on the treatment of history within the GCM, we have to examine the model in detail in order to assess how each of the various elements should be classified. Each component of the model that is subject to change will be evaluated to assess the way in which past (historical) values should be treated. When we refer to a component we mean:

- Entity: a set of circumstances or a dimension (e.g., customer details or the product dimension)
- Relationship: for example, a hierarchy
- Attribute: for example, the customer's address

Each component will then be given a classification. This is called the retrospection of the component (retrospection means literally "looking back into the past"). Retrospection has three possible values:

1. True
2. False
3. Permanent

True retrospection means that the object will reflect the past faithfully. It enables queries to return temporal subsets of the data reflecting the partitioning of historical values. Each dimension, relationship, and attribute value will, in effect, have a
lifespan that describes the existence of the object. An object may have a discontinuous lifespan, that is, many periods of activity, punctuated by periods of inactivity. True retrospection is the most accurate portrayal of the life of a data warehouse object. False retrospection means that the view of history will be altered when the object's value changes. In simple terms, when changes occur, the old values will be overwritten and are, therefore, lost. It is as though the old values had never existed. Permanent retrospection means that the value of the object will not change over time. Let us now explore how the various values for retrospection apply to dimensions, relationships, and attributes.
Retrospection in Entities
So far as entities are concerned, the value for retrospection relates to the existence of the dimension. For instance, the existence of a customer starts when the customer first orders a product from the Wine Club, and the existence ends when the customer becomes inactive.

Retrospection = true for entities means that the lifespan of the entity consists of one or more time intervals. A single customer may not have a single, continuous lifespan. The same is true of other entities such as the wine dimension. A wine may be available for intervals of time spanning many years, or the entire lifespan may be punctuated by periods when the wine is not available. An example of this would be the Beaujolais Nouveau, which, for some reason, is very popular when first available in November each year but must be consumed quickly, as it soon deteriorates. As this wine is not available for 10 months out of each year, it is reasonable to say that its lifespan is discontinuous.

Retrospection = false for entities means that only the current state of the entity's existence is recorded. An example from the Wine Club would be the supplier dimension. There may be a need to distinguish between current suppliers and previous suppliers, but there is no requirement to record the intervals of time when a supplier was actually supplying wine to the Wine Club as distinct from the intervals when they were not.

Retrospection = permanent for entities means that the entity exists forever. The concept of existence does not, therefore, apply. An example from the Wine Club would be the region dimension. Regions, which represent the wine-growing areas of the world, are unlikely to disappear once created.
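One possible way of recording such a discontinuous existence, sketched here purely for illustration, is an existence table holding one row per interval of activity; the table and column names are assumptions, not part of the Wine Club design.

CREATE TABLE Wine_Existence (
    Wine_Code   INTEGER NOT NULL,
    Start_Date  DATE    NOT NULL,
    End_Date    DATE,               -- NULL while the interval is still open
    PRIMARY KEY (Wine_Code, Start_Date)
);

-- Which wines were stocked on a given date?
SELECT Wine_Code
FROM   Wine_Existence
WHERE  Start_Date <= DATE '1999-11-20'
AND   (End_Date IS NULL OR End_Date >= DATE '1999-11-20');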
Retrospection in Relationships
In a dimensional model, the degree of snapshot relationships is always “one to many.” When the relationship becomes temporal, due to the need for true retrospection on the relationship, the degree may change to “many to many.” This has been described in detail in Chapter 4. It is important that this information is recorded without introducing significant complexity into the model. The essential simplicity of the model must not be dominated by time, while at the same time it needs to be straightforward for designers to determine, quickly and precisely, the degree to which support for temporal relationships is required. The situation with relationships is similar to that of entities. It is the requirement with respect to the existence, and lifespan, of a relationship that determines the value for retrospection in relationships.

Retrospection = true for relationships means that the lifespan of each relationship must be recorded and kept so that the results from queries will faithfully reflect history. An example of this in the Wine Club is the relationship between customer and sales area. If a customer moves from one sales area to another, it is important that the previous relationships of that customer with sales areas are retained.

Retrospection = false for relationships means that only the current relationship needs to be recorded. There is no need for the system to record previous relationships; a true view of history is not required. An example of this, within the Wine Club, is the relationship between a hobby and a customer. If Lucie Jones informs the club, through the periodic data update process, that her hobby has changed from, say, horse riding to choral singing, then the new hobby replaces the old hobby and all record of Lucie's old hobby is lost.

Retrospection = permanent for relationships means that the relationship is never expected to change. No change procedures have to be considered. An example of this kind of relationship, in the Wine Club, is the relationship between a wine and a growing region. A wine is produced in a region, and a particular wine will always be produced in the same region, so it is reasonable to take the view that this relationship will not change.
Retrospection in Attributes
Each attribute in the model must be assessed to establish whether or not it needs temporal support. Therefore, one of the properties of an attribute is its value for retrospection. As with the other data objects, the recording of the temporal requirements for attributes should not introduce significant complexity into the model. The situation with respect to attributes and retrospection appears, at first, to be somewhat different to the requirements for other objects. In reality, the situation is very similar to that of relationships. If we consider an attribute to be engaged in a relationship with a set of values from a domain, it becomes easy to use precisely the same approach with attributes as with relationships.
Retrospection = true for attributes means that we need to record faithfully the values associated with the attribute over time. An example of this is the cost of a bottle of wine. As this cost changes over time, we need to record the new cost price without losing any of the previous cost prices.

Retrospection = false for attributes means that only the latest value for the attribute should be kept. When the value of the attribute changes, the new value replaces the old value such that the old value is lost permanently. An example of this in the Wine Club is the alcohol by volume (ABV) value for the wine. If the ABV changes, then the previous ABV is replaced by the new ABV.

Retrospection = permanent for attributes means that the value is not expected to change at all. Changes to the values of these types of attributes do not have to be considered. An example of this type of attribute in the Wine Club is the hobby name. Once a hobby has been given a name, it is never expected to change.

There is a rule that can be applied to identifying attributes. With great originality it is called the identifying attribute rule and simply states that all identifying attributes have a value for retrospection of permanent. This is because identifying attributes should not change.

There is an implicit inclusion of an existence attribute for entities, relationships, and attributes where the value for retrospection is true. The status of true retrospection will direct the logical database designers to provide some special treatment, with respect to time, to the affected object. The type of treatment will vary on a per-case basis and will, to some extent, depend on the type of database management system that is to be deployed. The inclusion of an existence attribute is also implicit for entities, but not relationships and attributes, where the value for retrospection is false. The provision of time support where retrospection is false is usually simpler to implement than where retrospection is true. For relationships and attributes, it is simply a case of replacing the previous value with a new value—in other words, a simple update.
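The following fragments sketch, under assumed table and column names, what the three values might translate into at the attribute level once the logical design is done; they are illustrative only, not prescriptions.

-- Retrospection = true (e.g., cost price of a wine): keep every value with its period.
CREATE TABLE Wine_Cost_History (
    Wine_Code   INTEGER      NOT NULL,
    Start_Date  DATE         NOT NULL,
    End_Date    DATE,
    Cost_Price  DECIMAL(7,2) NOT NULL,
    PRIMARY KEY (Wine_Code, Start_Date)
);

-- Retrospection = false (e.g., ABV): a simple in-place update; the old value is lost.
UPDATE Wine
SET    ABV = 13.5
WHERE  Wine_Code = 1234;

-- Retrospection = permanent (e.g., hobby name): no update procedure is ever needed.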
Retrospection and the Case Study
Table 5.1 now lists a handful of the data elements, entities, attributes, and relationships for the Wine Club. For each, the value for retrospection is given that satisfies the requirements of the Wine Club with regard to the representation of time. In accordance with the previous point about their implicit nature, there is no explicit mention of existence attributes in Table 5.1. A complete list can be found in Appendix A. So far, we have analyzed the requirements with respect to time and have identified three cases that can occur. The Wine Club data model has been examined and each object classified accordingly. The requirement that follows on from this is to develop a method that enables the classification to be incorporated into the model so that a solution can be designed. It is important that the requirements, at
this conceptual level, do not prescribe a solution. The designers will have additional points to consider such as the volumes of data, frequency of access, and overall performance. As always, some compromises are likely to have to be made.
Table 5.1. Wine Club Entities, Relationships, and Attributes

Object Name | Type | Retrospection | Reason
Customer | Entity | True | Customers may have more than one interval of activity. It is important to the Wine Club that it monitors the behavior of customers over time. There is a requirement, therefore, to record the full details of the existence of customers.
Sales_Area | Entity | False | Latest existence only is required. Sales areas may be combined or split. Only the latest structure is of interest. Note that some organizations might wish to compare old regional structures to the new one.
Hobby | Entity | Permanent | The hobby details, once entered, will exist forever.
Sales Area-Customer | Relationship | True | There is a requirement to monitor the performance of sales areas. As customers move from one area to another, therefore, we need to retain the historical record of where they lived previously, so that sales made to those customers can be attributed to the area in which they lived at the time.
Hobby-Customer | Relationship | False | A customer's hobby is of interest to the Wine Club. Only the current hobby is required to be kept.
Customer-Sales | Relationship | Permanent | The relationship of a particular sale to the customer involved in the sale will never change.
Customer.Customer_Code | Attribute | Permanent | Identifying attribute rule.
Customer.Customer_Name | Attribute | False | The latest value only is sufficient.
Customer.Customer_Address | Attribute | True | Requirement to analyze by detailed area down to town/city level.
Customer.Date_Joined | Attribute | False | The latest value only is sufficient.
THE IDENTIFICATION OF CHANGES TO DATA

Causality

It would be very helpful if, during the analysis of the kinds of changes that can occur, it is made clear whether the changes are causal in nature. It would be sufficient to identify causal changes only and to assume that all unidentified changes are noncausal. This will provide assistance to the designers. Some changes, as has been previously indicated, can occur to attributes that have the property of false retrospection but that, because they are determinants, have a “knock-on” effect on other attributes that might have the property of true retrospection.

The capture of changes has to be developed into an automated process. Some mechanism is created that enables changes that have occurred to be identified by examining the organization's operational systems, as these are, invariably, the source of data for the data warehouse. The source system will not share the data warehouse data model and will not be aware of the effect of changes. For instance, in the Wine Club, there is a relationship between the address of a customer and the sales area that contains the address. So it could be said that the address determines the sales area. This determinant relationship is identical to that used in the process of normalization. However, the purpose here is quite different and has more to do with the synchronization of the timing of changes to attribute values, to ensure temporal consistency, than with the normalization of relations. Thus the term causality has been adopted in order to distinguish this requirement, as it is unique to data warehousing.

The operational system that records customers' details may not be aware of the sales area hierarchy. When a customer moves, the fact that a change in sales area might have occurred would not normally be apparent. It becomes the responsibility of the data warehouse designer to manage this problem. Generally, there is little recognition of the fact that logically connected data may need to be extracted from different data files which, in turn, might belong to various operational systems. That there may be a need to implement a physical link between data sources, due to the causal nature of the relationship, is also not recognized. This stitching together of data from various sources is very important in data warehousing. Apart from operational systems, the sources can also be external to the organization. For instance, an external attribute relating to a customer's economic classification might be added to the customer's record. This is a good example of causality. What is the trigger that generates a change in the economic classification when a change in the customer's circumstances is implemented, so that temporal consistency is maintained? Without such a facility, the data warehouse may be recording inconsistent information.

If a customer's address changes, then the sales area code must be checked and updated, if necessary, at the same time as the address change. Where the data relating to addresses and sales areas is derived from different source systems, the temporal synchronization of these changes may be difficult to implement. If temporal synchronization is not achieved, then any subsequent query involving the history of these attributes may produce inaccurate results. The main point is to recognize the problem and ensure that the causal nature of changes is covered during the requirements analysis. A sketch of the kind of synchronized update that is needed follows.
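This sketch assumes that the address and the sales area are both held with true retrospection in hypothetical history tables; the table names, dates, and values are illustrative only. The point is that the address interval and the derived sales-area interval are closed and reopened together, carrying the same effective (valid) date.

-- Close the current address interval and the current sales-area interval together.
UPDATE Customer_Address_History
SET    End_Date = DATE '2000-03-31'
WHERE  Customer_Code = 1001 AND End_Date IS NULL;

UPDATE Customer_Area_History
SET    End_Date = DATE '2000-03-31'
WHERE  Customer_Code = 1001 AND End_Date IS NULL;

-- Open new intervals with the same valid-time start date.
INSERT INTO Customer_Address_History (Customer_Code, Start_Date, End_Date, Address)
VALUES (1001, DATE '2000-04-01', NULL, '12 New Street, Bristol');

INSERT INTO Customer_Area_History (Customer_Code, Start_Date, End_Date, Sales_Area)
VALUES (1001, DATE '2000-04-01', NULL, 'SW');   -- area looked up from the new address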
The Frequency of Capture of Changes

Associated with the identification of changes is the timing with which changes to data are captured into the data warehouse. In this respect, the behavior of “behavioral data” is different from the behavior of “circumstances.” The frequency of capture for the fact data is usually as close as possible to the granularity of the valid time event. For instance, in the Wine Club example, the granularity of time of a sale is “day” and the sales are captured into the data warehouse on a daily basis. There are exceptions, such as telecommunications, where the granularity of time is “seconds” but the frequency of capture is, typically, still daily. Nevertheless, the time assigned to the fact can usually be regarded as the valid time.

The granularity of time for recording changes to the dimensions adopts an appearance that is often misleading. The most frequently used method for identifying changes to dimensions is the file comparison approach, as outlined in the previous chapter. The only time that can be used to determine when the change occurred will be the time that the change was detected (i.e., the time the file comparison process was executed). The time recorded on the dimensional record can be at any level of grain, for example, day. In this example, the granularity of time for the changed data capture appears to be daily because the date that the change was captured will be used to record the change. However, this is a record of the transaction time so far as the data warehouse is concerned. It is not actually a record of the transaction time that the change was recorded in the originating source system. The granularity is related to the frequency with which the changed data is captured. If the changes are detected and captured into the data warehouse on a monthly basis, then the transaction time frequency should be recorded as monthly.

In practical situations, different parts of the model are usually updated at differing frequencies. Some changed data is captured daily, some weekly, and still other data monthly. The frequency of capture is often dependent on the processing cycle of the source systems. As with the previous section on causality, the valid time and transaction time should be the same, if possible. Where such synchronization is not possible, the difference between the two times should be recorded so that the potential error can be estimated. Our modeling method should provide a means of capturing the true granularity of time on a per-attribute basis.
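One simple way of making that difference visible, sketched here with assumed table and column names, is to carry both times on the dimension row so the lag between them can be measured.

CREATE TABLE Customer_Dim (
    Customer_Code  INTEGER      NOT NULL,
    Customer_Name  VARCHAR(50),
    Sales_Area     CHAR(4),
    Valid_From     DATE,        -- when the change took effect in the real world (if known)
    Load_Date      DATE         -- transaction time: when the change was captured by the warehouse
);

-- If changes are captured monthly, Load_Date may lag Valid_From by up to a month;
-- the difference between the two columns gives an estimate of that error per row.
SELECT MAX(Load_Date - Valid_From) AS Max_Lag_Days
FROM   Customer_Dim;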
DOT MODELING

The remainder of this chapter is devoted to the explanation of a methodology for the development of conceptual models for data warehouses. The methodology is called dot modeling. Dot modeling is based on the simplified requirements for dimensional models that were described in the introduction to this chapter. It is a complete methodology that enables nontechnical people to build their own conceptual model that reflects their personal perception of their organization in dimensional terms. It also provides a structured way of constructing a logical (currently relational) model from the conceptual one. The method was invented in July 1997 and has been used in real projects since then. It has received positive reviews from nontechnical people in environments where it has been deployed. The name was given by a user; dot is not an acronym. It comes from the characteristic that the center of the behavioral part of the model, the facts, is represented by a dot. The method was developed as an evolution of dimensional concepts and has since been adapted to the requirements of the customer-centric GCM.

We start by modeling behavior. Figure 5.1 represents the design of a two-dimensional tabular report. This kind of report is familiar to everyone and is a common way of displaying information, for example, as a spreadsheet.
Figure 5.1. Example of a two-dimensional report.
The intersection of the axes in this example, as shown by the dot, would yield some information about the sale of a particular product to a particular customer. The information represented by the dot is usually numeric. It could be an atomic value, such as the monetary value of the sale, or it could be complex and may include other values such as the unit quantity and profit on the sale. Where there is a requirement to include a further dimension, such as time, in the report, one might envisage the report being designed as several pages, where each page represents a different time period. This could be displayed as shown in Figure 5.2.
Figure 5.2. Example of a three-dimensional cube.
Now the dot represents some information about the sale of a particular product to a particular customer at a particular time. The information contained in the dot is still the same as before. It is either atomic or complex and is usually numeric. All that has changed is that there are more dimensions by which the information represented by the dot may be analyzed. It follows, therefore, that the dot will continue to represent the same information irrespective of how many dimensions are needed to analyze and report upon it. However, it is not possible to represent more than three dimensions diagrammatically using this approach. In effect, the dot is “trapped” inside this three-dimensional diagram. In order to enable further dimensions of analysis to be represented diagrammatically, the dot must be removed to a different kind of structure where such constraints do not apply. This is the rationale behind the development of the dot modeling methodology. In dot modeling the dot is placed in the center of the diagram and the dimensions are arranged around it as shown in Figure 5.3.
Figure 5.3. Simple multidimensional dot model.
The model readily adopts the well-understood radial symmetry of the dimensional star schema.
The Components of a Behavioral Dot Model

There are three basic components to a dot model diagram:

- Dot. The dot represents the facts. The name of the subject area of the dimensional model is applied to the facts. In the Wine Club, the facts are represented by “sales.”
- Dimension names. Each of the dimensions is shown on the model and is given a name.
- Connectors. Connectors are placed between the facts and dimensions to show first-level dimensions. Similarly, connectors are placed between dimensions and groupings to show the hierarchical structure.

Emphasis has been placed on simplicity, so there are virtually no notational rules on the main diagram. It is sensible to place the dot near the center of the diagram and for the dimensions to radiate from the dot. This encourages a readable dimensional shape to emerge. The behavioral dot model for the Wine Club is reproduced in Figure 5.4. The attributes for the facts and dimensions are not shown on the diagram. Attributes are described on supporting worksheets. Similarly, the temporal requirements are represented on supporting worksheets rather than on the diagram.

The method uses a set of worksheets. The worksheets are included in the appendices. Some of the worksheets are completed during the conceptual design stage of the development and some are completed during the logical design stage. The first worksheet is the data model worksheet itself. It
contains the following:

- Name of the application, or model (e.g., The Wine Club-Sales)
- Diagram, as shown in Figure 5.4
- List of the “fact” attributes (i.e., “quantity” and “value” in the Wine Club)

Figure 5.4. Representation of the Wine Club using a dot model.

For each fact attribute, some information describing the fact is recorded under what is commonly known as “metadata.” Its purpose is to document the business definition of the attribute. This is to solve the problem of different people, within an organization, having differing views about the semantics of particular attributes. The descriptions should be phrased in business terms.

A second worksheet, the entities worksheet, is used to record the following:

- Behavioral dimensions
- Customer circumstances
- Derived segments

This part of the method holds some of the more complex information in the model. The model name is given on each page to ensure that parts of the document set are not mistakenly mixed up with other
models' documents. The purpose of the entities worksheet is to aid the designers of the system to understand the requirements in order to assist them in the logical design. For each entity the following items of information are recorded:

- Name of the dimension as it is understood by the business people. For example, “customer.”
- Retrospection of the entity's existence.
- Existence attribute for the entity. For entities with permanent retrospection, an example of which might be “region” in the Wine Club, there is no requirement to record the existence of an entity, because, once established, the entity will exist as long as the database exists. With other entities, however, an attribute to represent existence would be needed so that, for instance, the Wine Club would be able to determine which wines were currently stocked.
- Frequency of the capture of changes to the existence of the dimension. This will help to establish whether the dimension will be subject to errors of temporal synchronization.

For each dimension, a set of attributes is also defined on a separate worksheet. The existence attribute has already been described. The following description refers to the properties of other attributes. So for each attribute, the following information is recorded:

- Name of the dimension that “owns” it.
- Name of the attribute. This is the name as the business (nontechnical) people would refer to it.
- Retrospection. Whether or not the historical values of this attribute should be faithfully recorded.
- Frequency. This is the frequency with which the data is recorded in the data warehouse. This is an important component in the determination of the accuracy of the data warehouse.
- Dependency. This relates to causality and identifies other attributes that this attribute is dependent upon.
- Identifying attribute. This indicates whether the attribute is the identifying attribute, or whether it forms part of a composite identifying attribute.
- Metadata. A business description of the attribute.
- Source. This is a mapping back to the source system. It describes where the attribute actually comes from.
- Transformations. Any processing that must be applied to the attribute before it is eligible to be brought into the data warehouse. Examples of transformations are the restructuring of dates to the same format, the substitution of default values in place of nulls or blanks.
- Data type. This is the data type and precision of the attribute.

Information about dimensional hierarchies is captured on the hierarchies worksheet. Pictorially, the worksheet shows the names of the higher and lower components of the hierarchy. The following information is also captured:

- Retrospection of the hierarchy
- Frequency of capture
- Metadata describing the nature of the hierarchy
Dot and the GCM

An interesting development occurred when working with a major telecommunications company in the United Kingdom. Their business objective is to build a customer-centric information data model that covers their entire enterprise. There were several different behavioral dot models:

- Call usage. The types of phone calls made, the duration, cost, etc.
- Payments. Whether the customer paid on time, how often they had to be chased, etc.
- Recurring revenue. Covered insurance, itemized billing, etc.
- Nonrecurring revenue. Accessories and other one-off services requested.
- Order fulfillment. How quickly orders placed by customers were satisfied by the company.
- Service events. Customers recording when a fault has occurred in their equipment or service.
- Contacts. Each contact made to a customer through offers, campaigns, etc.

In dimensional modeling terms this means several dimensional models, each having a different subject area. During a workshop session with this customer I was able to show how the whole model might look using a single customer-centric diagram, which I now refer to as joining the dots. The diagram is shown in Figure 5.5.
Figure 5.5. Customer-centric dot model.
Figure 5.5 shows seven separate dimensional models that share some dimensions. This illustrates that, even with very complex situations, it is still very easy to determine the individual dimensional models using the dot modeling notation because the radial shape of each individual model is still discernible. The use of the dot model, in conjunction with business-focused workshops, coming up next, enables the “softer” business requirements, in the form of business objectives or key performance indicators, to be expressed in information terms so that the data warehouse can be designed to provide precisely what the business managers need. A further requirement is that the model should enable the business people to build the conceptual abstraction themselves. They should be able to construct the diagrams, debate them, and replace them. My own experiences, and those of other consultants in the field, are that business people have found dot models relatively easy to construct and use.
DOT MODELING WORKSHOPS

The purpose of the remainder of this chapter is to provide assistance to data warehouse practitioners in the delivery of data warehouse applications. This practitioner's guide is subdivided into two main sections, conceptual design and logical design. The conceptual design stage consists of the development of a data model using the dot modeling method described in this chapter. The logical model is developed as an extension of the dot modeling method and this will be described in the next chapter.
The Dot Modeling Methodology

The methodology, as outlined earlier in this chapter, will now be described in detail. As you might expect, there are a fair number of processes associated with the construction of a complete dot model. Experience shows that these processes can be conducted through the use of two highly structured workshops. The use of joint application development (JAD) workshops is now a well-accepted practice, and holding one or more of these workshop sessions is a good way of gathering user requirements. The purpose of JAD workshops is to enable the development of better systems by bringing together the business people and the system developers. The idea is that the system design is a joint effort. These workshops require skilled facilitation.
Each of the two workshops can be completed in two days.
The Information Strategy Workshop

The objective of the information strategy workshop is for the business people, within the customer's organization, to develop their own dot model that reflects their own perception of their business. The session should last for approximately two days. The emphasis is on the word workshop. Every participant must expect to work and not to be lectured to.
Participants—Practitioners
There should be, at least, two consultants present at the workshop sessions. The ideal combination is to have one highly organized workshop facilitator and one business consultant who understands the methodology and the customer's business. It is very useful to have a third person who is able to record the proceedings of the sessions. Quite a substantial proportion of the work is actually done in teams. It is quite useful to have an extra
person who can assist in wandering from team to team checking that everyone understands what is required and that progress is being made. The success or failure of these sessions often has very little to do with the technical content of the workshop but has everything to do with softer issues such as:

- Ambience—are they enjoying actually being there?
- Relationships between the participants
- Friendliness of the presenters
- Comfort of the participants

It is helpful to have consultants who are sensitive to these issues and can detect and respond to body language, suggest breaks at the right time, etc. Also, where the native language of the participants is different from the native language of the consultants, great care must be taken to ensure that the message, in both directions, is clear. Ideally, at least one of the consultants should be fluent in the language of the participants. If that is not possible, then someone from the customer organization must step in to act as the interpreter. If an interpreter is required to translate every statement by the presenters, then the workshop would generally take about 50 percent longer than normal.
Participants—Customer
The information strategy workshop requires a mixed attendance of business people and IT people. The ideal proportions are two-thirds from the business and one-third from IT. It is very important that the session is not dominated by IT staff.
Getting several senior people together for two whole days can be a virtually impossible task. The only way to achieve this is to ensure that:

- You have senior executive commitment
- You book their diaries in advance

It's no good waiting until the week before; you won't get them to attend. There must be some representation from the senior levels of management. As a test of seniority, we would describe a senior person as one who is measured on the performance of the business. This is someone whose personal income is, to some extent, dependent on the business being successful. In any case, they must be able to influence the business strategy and direction. The key to being successful in data warehousing is to focus on the business, not the technology. There must be a clear
business imperative for constructing the data warehouse. It is not enough to merely state that information is strategic. This tendency to cite business requirements as fundamental to success has many followers. Any information system should exist solely to support the firm's business objectives, and it is critical that the business goals are clearly defined. One of the major goals, initially, is to build and validate the business case, and we will be covering this in Chapter 8.
The Information Strategy Workshop Process
The steps involved in running the workshop are now described. Some of the subsequent sections should be regarded as being for guidance only. Different consultants have differing approaches to facilitating workshops, and these variations in technique should be maintained. Here is one method that has been found to be successful in the past. First, the room organization:
1. The tables in the room should be organized in a horseshoe shape rather than a classroom style, as this greatly encourages interaction.
2. Plenty of writing space should be available. I personally favor white boards with at least two flip charts to record important points.
3. Try to work without an overhead projector, as it tends to introduce a lecturing or presentation tone to the proceedings, whereas the atmosphere we want to encourage is that of a workshop where everyone contributes.

We'll now go through the workshop step by step. For each step I'll give an estimate of how long it ought to take. Bear in mind that different people might prefer a faster or slower pace.
Workshop Introduction
Estimated time: 60 minutes

Explain why the team is assembled and what the objectives are.
The objective: By the end of the information strategy workshop we will have built a conceptual model of the information requirements needed to support the business direction of the organization. Do not assume that everyone in the room will know who all the others are. I have yet to attend a workshop session where everyone knew one another. Get everybody to introduce themselves. This is, obviously, a good ice-breaking method that has been used successfully for many years. Ask them for:
- Name (pronunciation is very important if the language is foreign to you)
- Position in the organization
- Brief description of their responsibilities

Also, ask them to tell us one thing about themselves that is not work related, maybe a hobby or something. Breaking the ice is very important in workshops and so it should not be rushed. There are a number of different approaches that you may already be familiar with. One way is to get people to spend 10 minutes talking to the person sitting next to them and for them then to introduce each other. A good variation on this theme is to hang a large piece of paper on the wall and get them to draw a picture of the person they have introduced. If this sounds juvenile, well I suppose it is, but it really does get people relaxed.
I also like to make a note about which of them are business people and which are IT people. It is a good idea to introduce yourself first and for your colleagues to follow you. This takes the pressure off the participants.

Next come the ground rules. These are important and here are some suggestions. I think they are just common sense, but you never know whom you are dealing with so it is best to be explicit:

- Silence means assent. If people do not voice objections to decisions, then they are deemed to have accepted them. It is of no use to anyone if, during the wrap-up session on day two, someone says something like, “Well I didn't agree with your basic ideas in the first place!”
- The sessions start on time. It is incumbent on the facilitator to state clearly and precisely when the next session is due to start. It is the responsibility of all attendees to be there at that time. So it is absolutely vital that the consultants are never late to sessions.
- Mobile telephones are not allowed.
- Personal criticisms are not allowed. It's OK for them to criticize the consultants, or the method, but not each other.
- Only one person is allowed to be speaking at a time.
- No stupid questions. This does not mean that you are not allowed to ask stupid questions. It actually means that no question should be regarded as stupid.
Always give them the collective authority to change or add rules. The next thing to do is to outline the agenda for the whole of the workshop. People like to know when their breaks are, what time lunch will be served, and roughly at what time they might expect to be leaving in the afternoon. It is always worth asking whether anyone has any time constraints (flights, etc.) on the last day. Try to accommodate these as far as you are able by manipulating the agenda. Now we move into the workshop proper. The first session is on business goals.
Business Goals
Estimated time: 30 minutes

This is the first real business session. The following question should be posed: Does your organization have business goals? This is a directed question in the sense that it should be asked of the senior business people present. You need to pick up the body language quickly because some people may feel:
1. Embarrassed that they don't know what their business goals are
2. Uncomfortable about sharing their business goals with outsiders

Experience shows that the former is more common than the latter. Most people have a vague idea about business goals and would offer things like:

Increase market share
Increase customer loyalty
Increase profitability

What we need is to be able to articulate business goals that have the following properties:
1. Measurable
2. Time bounded
3. Customer focused

The third of these is not an absolute requirement, but most business goals do seem to relate to customers in some way.
This is particularly true if the data warehouse is being built in support of a CRM initiative. If this is the case, try to guide the participants toward customer-related business goals. So a well-articulated business goal might be:

To increase customer loyalty by 5 percent per annum for the next three years

If you can get one of these types of goals from your business people, that is good. If you can get two or three, that's excellent. The most likely result is that you won't get any. That's OK too. It's just that, if the organization has business goals and is prepared to share them, then it is sensible to use them.
Thinking About Business Strategy
Estimated time: 1 hour

This is the first real workshop exercise. We split the participants into groups of three or four people. In order to get the best out of this exercise, the composition of the groups is important. This is why it is useful to know who the business people are and who the IT people are. We need one senior business person, if possible, in each group. The remaining business people should be spread as evenly as possible throughout the groups and the IT people likewise. Don't allow the IT people to become dominant in any one group. IT people have a tendency to become "joined at the hip" in these sessions if allowed.
One way of achieving an even spread of business and IT people is as follows: During the introduction session earlier, the nonspeaking facilitator should note the names and job titles of each of the participants and can form the groups independently. The attendees can then simply be informed as to which group they have been assigned to. The objectives of the exercise at this stage are:
1. Decide upon the business goals (assuming none was forthcoming previously), or use the business goals previously presented. Each team, therefore, is working toward its own set of goals, rather than the entire class having common goals. This is perfectly OK. Each of the business people in the room may have their own goals that will be different from the others' goals. One or two goals are sufficient. Each goal must possess the three properties listed above.
2. Think about business strategy to help them to achieve the goals. This is a prioritized set of initiatives that the organization might need to implement. It is unlikely that a single item, by itself, would be enough to make the goal achievable.
3. What steps would they need to take in order to put the strategy into operation? The steps will be driven by the prioritization of the strategic components. Be careful to allow enough time for this. This process invariably leads to serious discussion in the groups as almost everyone has a view on the single thing their organization needs to do in order to become more successful. These discussions are, in the main, a good thing and should be allowed to run their course. About one hour should be allowed for this part and a break should be included. It is worth recommending that each person spends the first 15 minutes thinking about the problem, so that each has a contribution to make before the group convenes. Make sure that you have sufficient room for the group discussions to be conducted without affecting each other. The groups must record their decisions as, later on, they will be expected to present them to the other groups.
The Initial Dot Model
Estimated time: 1–2 hours

The class should reconvene for this teaching session. The objective is to get them to understand how to use the dot modeling system. This is quite easy, but the IT people will understand it more quickly than the business people. It is sensible to start by focusing on customer behavior, as this usually is the easiest part for people to comprehend. The best way to explain how to make a dot model is as follows: First, on a flipchart or a whiteboard, go through the spreadsheet explanation that we saw earlier in this chapter. Everyone knows what a spreadsheet looks like and can relate to the axes, the cells, etc. Then build a model in front of them, like the one in Figure 5.6. It represents the Wine Club.
Figure 5.6. Initial dot model for the Wine Club.
It is vital that at least some of the members of the team understand how to do this. It is useful to ask questions, especially of the IT people, to convince yourself that each group is capable of developing a rudimentary model. At this stage, we do not expect perfect models and there is a refinement stage later on. However, it is important that they know the difference between a fact and a dimension. This is obvious to some people, but experience shows it is not obvious to others. It helps to draw on the star schema analogy. Most IT people will be able to relate to it straight away. The business people will pick it up, albeit more slowly.
Behavior
Estimated time: 15 minutes

We begin by explaining the meaning of the dot, that is, the behavior. The measurable facts are the attributes of the dot itself and they represent the focus of the model. Any data warehousing practitioner would be expected to be able to explain this. Typical facts are sales values and quantities. We must be careful to include derived facts such as profit or return on investment (ROI). Each fact attribute must have the properties of being:
1. Numeric
2. Summable, or semi-summable

Each measurable fact must earn its place in the warehouse. Whoever suggests a fact should demonstrate that it has the capability to contribute toward the business strategy. In addition, questions should be posed, in the form of queries (in natural language, not SQL), to show how the fact will be used.
The measurable fact name, as it will be known in the data warehouse, should be recorded. It is vital that everyone has a clear idea as to the meaning of the facts, and it makes sense to spend a little time focusing on the semantics of the information. In the “sales” scenario, the question “What is a sale?” may result in a discussion about the semantics. For instance, an accountant may say that a sale has occurred when an invoice has been produced. A salesperson might say that a sale has occurred when an order is received. A warehouse manager might say that a sale has occurred when a delivery note is signed by the customer. Everyone has to be clear about the meaning of the data before moving on. We also need to know the time granularity of the measurable fact. This should, ideally, be the lowest level of grain of time that is practical. The best possible level is known as the valid time. Valid time means the time that the event occurred in real life. This usually translates to the transaction level. In a telecommunications application where the data warehouse is to record telephone calls, the valid time relates to the time the call was placed, right down to the second. In other applications, the time granularity might be set to “day.” Usually, all the facts will share the same time granularity. However, it is important to know where this is not the case so we record the time granularity for each measurable fact attribute. For each fact attribute, metadata supporting the attribute should also be recorded. This means a precise description of the attribute in business terms.
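As a rough, hypothetical sketch of what two different valid-time granularities can look like at the physical level (the table and column names here are illustrative assumptions, not part of the Wine Club example), a call-detail fact might carry a full timestamp while a daily-grain sales fact carries only a date:

-- Hypothetical call-detail fact: valid time recorded to the second
create table call_fact (
    customer_key    integer       not null,
    call_placed_at  timestamp     not null,  -- valid time: when the call was actually placed
    duration_secs   integer       not null,
    call_value      numeric(9,2)  not null
);

-- Hypothetical sales fact where the agreed time granularity is "day"
create table sales_fact (
    customer_key    integer       not null,
    product_key     integer       not null,
    sale_date       date          not null,  -- valid time truncated to the day
    quantity        integer       not null,
    sale_value      numeric(9,2)  not null
);

The point of recording the grain on the worksheet is that it is a business decision; the physical types simply follow from it.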
The Dimensions
Estimated time: 15 minutes

We now describe the meaning of dimensions. Again, it is expected that data warehousing practitioners understand this term, and so the intention here is to give guidelines as to how to explain the term to nontechnical people. One way of achieving this is to draw a two-dimensional spreadsheet with axes of, say, customer and product and explain that this is an example of a two-dimensional view of sales. Each intersection, or cell, in the spreadsheet represents the sale of a particular product to a particular customer. I usually draw an intersection using lines and a dot where they intersect. The dot represents the measurable fact and, by using it, you reinforce the dot notation in the minds of the participants. The two-dimensional view can be extended to a three-dimensional view by transforming the diagram into a cube and labeling the third axis as, say, "time." Now each intersection of the three dimensions, the dot, represents the sale of a particular product to a particular customer at a particular time. We have now reached the limit of what is possible to draw. So we need a diagrammatic method that releases the dot from its three-dimensional limitation, and this is where the dot model helps. By describing the dimensions as axes of multidimensional spreadsheets, it is usually fairly easy to understand.
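For the IT people in the room, it can also help to point out that each "cell" in the multidimensional view corresponds to one grouped row in a simple SQL aggregation. Assuming the hypothetical sales fact sketched earlier (the names are illustrative only), something like the following produces one "dot" per customer, per product, per day:

-- Each result row is one dot: the sales value at the intersection of the three dimensions
select customer_key, product_key, sale_date, sum(sale_value) as total_value
from sales_fact
group by customer_key, product_key, sale_date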
Creating the Initial Dot Model
Estimated time: 1 hour

Taking the business goals, the strategy, and the steps they decided upon in the earlier group session, the groups now have to do the following:
1. Using their highest-priority strategy, decide what information they need to help them. They will need guidance on this at first. Lead them through an example.
2. Formulate some questions that they would like to be able to ask.
3. Create a dot model that is able to answer those questions.

You need to allow at least one hour for this exercise and about half an hour for the teaching session. It is useful to include a major break, such as the lunch break, or overnight. The facilitators must be on hand all the time to assist where required. At this stage you will always get questions like "Is this a fact or a dimension?" Also, the more "switched-on" IT people will already start making assumptions about things like:

Availability of data from source systems
Target database architectures
Data quality issues, etc.

They should be discouraged from these lines of thought. It is important that they do not constrain the innovative thinking processes. We are trying to establish the information we need, not what is currently available.
Group Presentations
Estimated time: 30 minutes per group

Each group must now present, to the other groups:
1. Business goals
2. Steps in their strategy
3. Information needed to support the steps
4. Dot model to support the information requirements
5. Some example questions that the model supports
6. How each question is relevant to their strategy

The information should be presented on a flip chart or overhead projector. The groups should each elect a spokesperson. Experience shows that some groups prefer to have two people, or even the whole group, participating in the presentation. The other groups should provide feedback. The facilitators must provide feedback. The feedback at this stage must always be positive with some suggestions as to how the model could be enhanced. Do not comment on the accuracy or completeness of the models. The refinement process will resolve these issues.
The Refinement Process
Estimated time: 1 hour

We now have a classroom session about refinement of the model. First, we have to decide whether the groups should continue to work on their own models or whether all groups should adopt the same, perhaps a composite, model. There are no real preferences either way at this time. In either case, the work is still undertaken in groups. Then we explain about:
1. Combination of dimensions
2. Dimensional hierarchies
3. Inclusion of others' suggestions

Point 1 above concerns dimensions on the original model that are not really dimensions at all. There is an instance of this in the example provided in that the "vintage" dimension is really no more than an attribute of "wine." After discussion, it is sometimes wise to leave these, apparently extraneous, dimensions alone if no agreement can be reached. The creation of dimensional hierarchies is a fairly straightforward affair. The refined example dot model is shown in Figure 5.7.
Figure 5.7. Refined dot model for the Wine Club.
The groups should now be re-formed to refine the models. The facilitators should wait for about 10 to 15 minutes and then approach the groups to see if they need help. There will usually be some discussion and disagreement about how, and whether, changes should be incorporated.
Presenting the Refined Models
Estimated time: 15 minutes per group

The refined models are re-presented to the other groups. The team has to show:
1. How the model has evolved from the original model as a result of the review process
2. How any enhancements, from the previous feedback, have been incorporated

Documenting the Models
Estimated time: 1 hour

The models are documented, in their refined state, using the data model and the first part of the entities and segmentation worksheets (the remaining parts of the entities and segmentation worksheet
are completed during the second workshop). The entities include the customers' circumstances and the dimensions from the behavioral dot models. The segmentation relates to the derived segments, which is the final part of our GCM. Sharing a worksheet for these two components of the model is a good idea because it can be difficult for business people to separate the two things in their mind. Encourage each group member to complete their own worksheets. The facilitators must make sure they take a copy of each group's model. These models provide the primary input to the component analysis workshop. Some examples of the data model and entities worksheets for the Wine Club are shown in Figure 5.8 (a fuller set of dot model worksheets is included in Appendix B).
Figure 5.8. Dot modeling worksheet showing Wine Club sales behavior.
The following are examples of the entities and dimensions worksheet. The first, predictably, is the customer entity. At this stage, we are really concerned with describing only the entity itself, not the
attributes.

Dot Modeling Entities and Segments

Entity Name: Customer     Retrospection: True     Frequency: Monthly

Metadata: The customer entity contains all customer details. We need to monitor discontinuous existences and to be able to identify active and nonactive customers. An active customer is defined to be one who has placed an order within the past 12 months. Any customer not having done so will be classified as inactive. Subsequent orders enable the customer to be reclassified as active, but the periods of inactivity must be preserved.

Attribute Name:     PK?:     Retrospection:     Frequency:     Dependency:
Metadata: Attribute details to be completed during component analysis workshop.
Source:     Transformation:     Data Type:

Attribute Name:     PK?:     Retrospection:     Frequency:     Dependency:
Metadata: Attribute details to be completed during component analysis workshop.
Source:     Transformation:     Data Type:
The second example is the wine dimension. This will be used, principally, as a dimension in one or more behavioral models.

Dot Modeling Entities and Segments

Entity Name: Wine     Retrospection: True     Frequency: Monthly

Metadata: The wine dimension holds information about the wines sold by the club. Details include prices and costs.

Attribute Name:     PK?:     Retrospection:     Frequency:     Dependency:
Metadata: Attribute details to be completed during component analysis workshop.
Source:     Transformation:     Data Type:

Attribute Name:     PK?:     Retrospection:     Frequency:     Dependency:
Metadata: Attribute details to be completed during component analysis workshop.
Source:     Transformation:     Data Type:
All that we are recording at this stage is information relating to the entity itself, not to any of the attributes belonging to the entity.
Workshop Wrap-Up
Estimated time: 30 minutes

It is important to summarize the process that has been conducted over the previous two days and to congratulate the participants on having taken the first steps toward the creation of a business-led information model for the organization:
1. Explain how the model is just the first step and how it becomes the input to the next stage. Briefly describe the main components of the component analysis workshop.
2. Try to secure commitment from as many people as possible to attend the second workshop.
3. The arrangements for the second workshop should be confirmed, if that is possible.
4. Hand out the attendee feedback form and ask each person to complete it. We all want to improve on the service we offer, so we need as much feedback as we can get.

While they are filling in the feedback forms, explain about dimensional attributes and try to encourage them to start thinking about this for the next workshop.
The Component Analysis Workshop

The principal objective of the component analysis workshop is to put some substance into the model that was created in the information strategy workshop.
Participants—Practitioners
As far as possible, the same consultancy people as were present in the first workshop should participate in the second. This exercise is more technical than the first and so some additional skills are required but continuity of personnel should be preserved.
Participants—Customer
Similarly, as far as is feasible, the same people as before should be present. However, as has been stated, this is a more technical workshop, and some of the more senior business people should be permitted to send fully authorized deputies to work in their place. The business goals and the first part of the dot model have been established, so their part of the task is complete. The proportions of business people to IT people can be reversed in this workshop, so maybe two-thirds IT people is OK this time. It is, however, vital that some business people are present. What is also important is continuity, so I would reiterate the point that we need as many people as we can get from the first workshop to attend this second workshop.
The Component Analysis Workshop Process
The organization of this second workshop is very similar to that of the first workshop. The room layout and writing facilities are the same. We do not need extra rooms for group work this time.
Review of Previous Model
Estimated time: 30 minutes

The purpose of this first exercise is to refresh the minds of the participants as to the state of the dot
model as at the end of the first workshop. Since the previous workshop, it is quite likely that the participants will have thought of other refinements to the model. Whether these refinements should be adopted depends on the authority of the persons present. As long as any refinements have the full backing of the business, and can be demonstrated to be genuinely in pursuit of business goals, they should be discussed and adopted where appropriate. The extent of refinement is difficult to assess in general terms. Ideally, the lead consultant on the project will have kept in touch with developments inside the customer organization and will be able to estimate the time needed to conduct this part of the exercise. If there are no changes required, it should take no more than 30 minutes.
Defining Attributes
Estimated time: 3–4 hours

This can be quite a time-consuming process. The objective is to build a list of attributes for each dimension in the model.
Make sure lots of breaks are included in this session. Encourage people to walk around. The supporting dot modeling entities worksheet should be used for this exercise. We started completing the entities worksheet in the first workshop. Now we can complete it by adding the details of all the attributes. It is recommended that one of the facilitators maintains these worksheets as the session proceeds. The examples of this worksheet that were previously shown are reproduced with further examples of some attributes. Note that, although the form is shown as completed, some of the information can be added later in the logical modeling stage. Right now you should be aiming to record just the names of the attributes and the business metadata. As previously, we have completed a couple of examples for the customer's details:

Dot Modeling Entities and Segments

Entity Name: Customer     Retrospection: True     Frequency: Monthly

Metadata: The customer entity contains all customer details. We need to monitor discontinuous existences and to be able to identify active and nonactive customers. An active customer is defined to be one who has placed an order within the past 12 months. Any customer not having done so will be classified as inactive. Subsequent orders enable the customer to be reclassified as active, but the periods of inactivity must be preserved.

Attribute Name: Customer Name     PK?: N     Retrospection: False     Frequency: Monthly     Dependency: None
Metadata: The name of the customer in the form of surname preceded by initials
Source: Customer Admin     Transformation: None     Data Type: Char 25

Attribute Name: Customer Address     PK?: N     Retrospection: True     Frequency: Monthly     Dependency: Sales area hierarchy
Metadata: The customer's address
Source: Customer Admin     Transformation: None     Data Type: Char 75

Attribute Name: Lifetime value indicator     PK?: N     Retrospection: True     Frequency: Monthly     Dependency: None
Metadata: The calculated lifetime value for the customer. Values are from 1 to 20 in ascending value
Source: Customer Admin     Transformation: SQL package crm_life_value.sql     Data Type: Numeric (2)
The second example again is the wine entity that is used as a dimension in one or more dimensional dot models.

Dot Modeling Entities and Segments

Entity Name: Wine     Retrospection: True     Frequency: Monthly

Metadata: The wine dimension holds information about the wines sold by the club. Details include prices and costs.

Attribute Name: Wine.Bottle_Cost     PK?: N     Retrospection: True     Frequency: Monthly     Dependency: None
Metadata: The cost of one bottle of wine, net of all discounts received and including all taxes, duties, and transportation charges.
Source: Stock (Goods Inwards)     Transformation: None     Data Type: Numeric (7,3)

Attribute Name: Wine.ABV     PK?: N     Retrospection: False     Frequency: Monthly     Dependency: None
Metadata: The alcohol content of a wine expressed as a percentage of the total volume to one decimal place.
Source: Stock (Goods Inwards)     Transformation: None     Data Type: Numeric (3,1)
One worksheet, or set of worksheets, is completed for each entity in the model. This includes the customers' details, products (e.g., wine), and market segments. The identifying attribute should be included where known. The attribute name, as it will be known in the data warehouse, should be recorded. The name must make it clear, to business people, what the attribute represents. For each attribute, metadata supporting the attribute should also be recorded. At this level we are looking for a concise but precise description of the attribute in business terms. The facilitator should call for nominations of attributes from the workshop participants. Each attribute must earn its place in the model. Whoever suggests an attribute should demonstrate that it has the capability to contribute toward the business strategy. In addition, questions should be posed, in the form of queries (in natural language, not SQL) that will show how the attribute will be used. As has been stated, this exercise can be very time-consuming and somewhat tedious. It should be punctuated with frequent breaks.
Dimensional Analysis of the Facts
Estimated time: 30 minutes

Next, for each measurable fact attribute, we examine each dimension to determine the extent to which the standard arithmetic functions can be applied. This enables us to distinguish between fully summable facts and semi-summable facts. For instance:

Supermarket quantities can be summed by product but not by customer.
Bank balances can be summed at a particular point in time but not across time.
Return on investment, and any other percentages, cannot be added at all, but maximums, minimums, and averages may be OK.
So-called "factless" facts, such as attendance, can be counted but cannot be summed or averaged, etc.

So each of the measurable facts should undergo an examination to determine which of the standard arithmetical functions can be safely applied to which dimensions. The worksheet used to perform this exercise is called the fact usage worksheet, and an example is shown below (again, a fuller model is contained in Appendix B).
Dot Modeling Fact Usage

Model Name: Wine Club Sales     Fact Name: Value     Frequency: Daily

Dimensions      Sum   Count   Ave   Min   Max
1. Customer      ?      ?      ?     ?     ?
2. Hobby         ?      ?      ?     ?     ?
3. Wine          ?      ?      ?     ?     ?
4. Supplier      ?      ?      ?     ?     ?
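To make the distinction concrete, the following hedged SQL sketch (table and column names are assumptions for illustration only) contrasts a fully summable fact with a semi-summable one. Sales value can be summed across any dimension, whereas a bank balance can be summed across accounts for a single snapshot date but only averaged, or reported as minimums and maximums, across time:

-- Fully summable: sales value can safely be added across customers, products, and time
select product_key, sum(sale_value) as total_value
from sales_fact
group by product_key

-- Semi-summable: balances may be summed across accounts at one point in time...
select sum(balance) as total_balance
from account_balance_fact
where snapshot_date = '2000/12/31'

-- ...but across time only an average (or minimum/maximum) per account is meaningful
select account_key, avg(balance) as average_balance
from account_balance_fact
group by account_key

The completed fact usage worksheet records exactly these decisions, so that the query tools, and the users, know which functions are safe against which dimensions.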
Hierarchies and Groupings
Estimated time: 1 hour

This exercise is simply to provide some metadata to describe the relationships between the different levels in the dimensional hierarchy. As before, the metadata is required to provide a precise description of the relationship in business terms. The worksheet used to capture this information is called the hierarchies and groupings worksheet, and an example is shown in Figure 5.9.
Figure 5.9. Example of a hierarchy.
The hierarchy or grouping is identified by recording the names of the dimensions in the hierarchy diagram at the top of the worksheet. The example above shows a Sales Area → Customer grouping. The higher (senior) level in the hierarchy is always placed at the top of the diagram, and the lower (junior) level is placed at the bottom.
Retrospection
Estimated time: 1–2 hours

We covered retrospection earlier in this chapter. However, we need to be able to present the concept to the workshop attendees so that they can assign values for retrospection to each of the data warehouse components. Whereas it is important for us to fully comprehend the design issues surrounding retrospection, usually it is not necessary for our customers to have the same level of understanding. We should present to them something that enables them to grasp the business significance of the subject without becoming embroiled in the technical detail. Anyway, here's a recap: Retrospection is the ability to look backward into the past. Each of the database objects above can take one of three possible values for retrospection:
1. True. True retrospection means that the data warehouse must faithfully record and report upon the changing values of an object over time. This means that true temporal support must be provided for the data warehouse object in question.
2. False. False retrospection means that, while the object can change its value, only the latest value is required to be held in the data warehouse. Any previous values are permanently lost. This means that temporal support is not required.
3. Permanent. Permanent retrospection means that the values will not change during the life
of the data warehouse. This means that temporal support is not required.

The meaning of retrospection varies slightly when applied to different types of warehouse objects. When applied to dimensions, the value for retrospection relates to the existence of the dimension in question. For instance, if we need to know how many customers we have right now, then we must be able to distinguish between current customers and previous customers. If we want to be able to ask: How many customers do we have today compared to exactly one year ago? then we need to know exactly when a customer became a customer and when they ceased being a customer. Some customers may have a discontinuous lifespan. For some periods of time they may be active customers, and during the intervening periods they may not be. If it is required to track this information faithfully, then retrospection = true would apply to the customer dimension.

Another example might be the supplier dimension. While we might be interested to know whether a supplier is a "current" or "dead" supplier, it is perhaps not so crucial to know when they became a supplier and when they ceased. So, in this case, retrospection = false may well be more appropriate. Still another example of dimensions is region. In some applications, the regions might be expected to always exist. In these cases, there is no point in tracking the existence of the dimensions, so we allow for a status of retrospection = permanent to accommodate this situation.

Insofar as hierarchies and attributes are concerned, retrospection relates to the values that are held by these objects. If a customer moves from one region to another, or changes some important attribute such as, say, number of children, then, depending on the application, it may be important to be able to trace the history of these changes to ensure that queries return accurate results. If this is the case, then retrospection = true should be applied to the attribute or hierarchy. In other cases it may be required to record only the latest value, such as the customer's spouse's name. In this case, retrospection = false would apply. In still other cases the values may never change. An example of this is date of birth. In these cases a value of retrospection = permanent would be more applicable.

The value given, to each data warehouse object, for retrospection will become very important when the project moves into the design phase. It must be remembered that the requirements relating to retrospection are business, and not technical, requirements. It is all about accuracy of results of queries submitted by the users.
The measurable fact attributes, once entered into the warehouse, never change.
These attributes have an implied value of:

Retrospection = Permanent

Therefore, we have to examine each dimension, dimensional attribute, and hierarchy to establish which value for retrospection should be applied. This means that the worksheets, previously completed, must be revisited.
How to Determine the Value for Retrospection
In order to establish the value for retrospection of each data warehouse object, the designer must investigate the use of the object within the organization. One way of doing this is to question the appropriate personnel as follows: If this object were to change, would you want to be able to track the history accurately?

Experience has shown that the response to this type of question is, almost invariably, in the affirmative. As a result, every attribute becomes a candidate for special treatment with respect to time, and every dimension, relationship, and attribute will be assigned a value for retrospection of true. In view of the fact that provision of temporal support is expensive in terms of resources and adversely affects performance, it is very important that such support is provided only where it will truly add value to the information provided by the data warehouse. So we need to adopt an approach to ascertaining the real need for temporal support on an object by object basis. The method should, as far as possible, provide an objective means of evaluation. One approach is to ask questions that do not invite a simple yes or no response. Try asking the following questions:

For entities and dimensions: How would the answer to the following question help you in making decisions regarding your business objectives: How many customers do we have today compared to the same time last year? Other questions along similar lines, such as: How many customers have we had for longer than one year, or two years? should also be phrased, and the answer should be expressed in terms that show how the ability to answer such questions would provide a clear advantage.

For relationships, the questions might be: How many customers do we have in each sales area today compared to the same time last year?
For attributes: How many of our customers have moved in the past year?

The questions need to focus on the time requirements, preferably with a time series type of answer. The responder should be able to demonstrate how the answers to the questions would be helpful in the pursuit of their goals.
Granularity and the Dot_Time Table
Estimated time: 1–2 hours

We have discussed the time granularity of the measurable facts. Now we must discuss the time granularity of the dimensions, hierarchies, and attributes. We also have to decide on the contents of the Dot_Time table. For dimensions, hierarchies, and attributes, the time granularity relates to the frequency with which changes to values, and changes to existence, are notified and applied to the data warehouse. So if, for example, customer updates are to be sent to the warehouse on a monthly basis, then the time granularity is monthly.

In order for complete accuracy to be assured, the granularity of time for the capture of changes in the dimensions must be the same as the granularity of time for the capture of the measurable fact events. In practice, this is hardly ever possible. There is an inevitable delay between the change event and the capture of that change in the source system. There is, usually, a further delay between the implementation of the change in the source system and the subsequent capture of the change in the data warehouse. The challenge for data warehouse designers is to minimize the delays in the capture of changes.

The Dot Time worksheet is simply a list of all the headings, relating to time, by which the users will need to group the results from their queries. The requirements will vary from one application to another. Below is an example of the worksheet.
Dot Modeling—Time

Model Name: Example

Name                  Description                          Data Type
Day name              Standard day names                   String
Day number            Day number 1–7 (Monday = 1)          Numeric
Bank holiday flag     Y = bank holiday                     Character (Y/N)
Month number          Standard months 01–12                Numeric
Month name            Standard month names                 String
Quarter               Standard quarters Q1, Q2, Q3, Q4     Numeric
Year                  Year numbers                         Numeric
Weekend day           Is it a Saturday or Sunday?          Character (Y/N)
Fiscal month no.      April = 01, March = 12               Numeric
Fiscal quarter        Q1 = April – June                    Numeric
24-hour opening       Y = open for 24 hours                Character (Y/N)
Weather conditions    Indicator for weather                Character
It is also important to establish how much history the application will wish to hold. This has two main benefits:
1. It tells the designers how many records will have to be created in order to populate the Dot Time table.
2. It gives the designers some indication of the ultimate size of the database when fully populated. Also, check whether future data will be needed, such as forecasts and budgets, as these will have to be built into the Dot Time table as well.
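As a minimal sketch of where the worksheet leads, the Dot_Time table might be declared along the following lines; the column names and data types are assumptions derived from the example worksheet rather than a prescribed layout:

create table dot_time (
    day_date            date         not null primary key,  -- one row per day of held history, plus any future dates for forecasts and budgets
    day_name            varchar(9)   not null,               -- standard day names
    day_number          smallint     not null,               -- 1-7, Monday = 1
    bank_holiday_flag   char(1)      not null,               -- 'Y' = bank holiday
    month_number        smallint     not null,               -- standard months 01-12
    month_name          varchar(9)   not null,               -- standard month names
    quarter             char(2)      not null,               -- Q1-Q4
    year_number         smallint     not null,               -- year number
    weekend_day         char(1)      not null,               -- 'Y' = Saturday or Sunday
    fiscal_month_no     smallint     not null,               -- April = 01, March = 12
    fiscal_quarter      smallint     not null,               -- fiscal Q1 = April-June
    open_24_hours       char(1)      not null,               -- 'Y' = open for 24 hours
    weather_conditions  char(1)                              -- indicator for weather
);

Ten years of daily history, for example, comes to roughly 3,650 rows, which is the kind of sizing estimate referred to in point 1 above.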
Workshop Wrap-Up
Estimated time: 30 minutes

It is important to summarize the process that has been conducted over the previous two days. We now have a complete, business-driven, conceptual information model that the designers can use to help to build the data warehouse. We now have all the components of our conceptual model and lots of instructive data to help us to move forward and develop the logical model, which is the next step in the process.
SUMMARY

We need a conceptual model for data warehousing in which we can include all the complex aspects of our design, without making the model too complex for users to comprehend. One of the major strengths of the first-generation solutions was their inherent simplicity. However, as we have seen, the industry has moved on and the next-generation data warehouse must evolve to support the kinds of information our users will demand. For a data warehouse to support CRM properly, we need a customer-centric model, which, while using the strengths and benefits of the old behavior-centric solutions, is much more in tune with customers' circumstances, especially their changing circumstances. As we have seen in previous chapters, often it is these changing circumstances that cause a change in customer behavior. We need to be able to analyze these underlying causes if we are to be successful at CRM.

In this chapter we introduced the concept of retrospection. Retrospection provides us with a way of classifying each component of the customers' circumstances so that we can build a full picture of each customer over time. We also introduced dot modeling. Dot modeling helps us to build a picture of the data warehouse starting with the behavior, the most familiar part, followed by the incorporation of the circumstances, retrospection, dependencies, and everything else we need to describe the business requirements correctly. The main part of the model can be built by the business people themselves under our guidance. We worked through a workshop scenario to show precisely how this might be achieved. Ultimately we end up with a customer-centric model, our GCM, that can be transformed into a logical model. This transformation is the subject of the next chapter.
Chapter 6. The Logical Model

In this chapter we explore solutions to the implementation of the general conceptual model (GCM) in data warehouses. One of the key components of our model is the concept of retrospection that was introduced in the previous chapter. Various options as to how it might be implemented are explored in this chapter. In connection with retrospection, the subject of existence, which was discussed in Chapter 3, is developed and used to show how some of the queries that were very difficult, if not impossible, to express using the Type 2 method in Chapter 4 can be written successfully.

Also in this chapter, the solution is developed further to show how it can be transformed into a relational logical model. In practice, the logical modeling stage of the conceptual–logical–physical progression is now usually omitted and designers will move from a conceptual model straight to a physical model. This involves writing the data definition language (DDL) statements directly from the conceptual model. This practice has evolved over time because relational databases have become the assumed implementation platform for virtually all database applications. This applies to operational systems as well as informational systems such as data warehouses. However, many data warehouses are implemented not on relational database systems but on dimensional database systems that are generally known as online analytical processing (OLAP) systems. There are currently no standards in place to assist designers in the production of logical models for OLAP proprietary database systems as there are for relational systems. Even so, in order to produce a practitioner's guide to developing data warehouses, the logical design process must pay some regard to nonrelational database management systems. Where the target database management system is not relational, in the absence of any agreed-on standards, it is not possible to produce a physical model unless the designer has intimate knowledge of the proprietary DDL or physical implementation processes of the DBMS in question.

In this chapter we'll also consider the performance tradeoff. The temporal solutions that are put forward will have an impact on performance. We'll briefly consider the ramifications of this and suggest some solutions. The chapter then provides recommendations as to the circumstances when each of the available temporal solutions should be adopted. Finally the chapter defines some
constraints that have to be applied to ensure that the implementation of the representation of time does not compromise the integrity of the data warehouse.
LOGICAL MODELING

When reviewing these problems with the objective of trying to formulate potential solutions, the following are the main requirements that need to be satisfied:

Accurate reporting of the facts. It is very important that, in a dimensional model, whenever a fact entry is joined to a dimension entry, the fact joins to the correct dimension with respect to time. Where a snowflake schema exists, the fact entry must join to the correct dimension entry no matter where that dimensional entry appears in the hierarchy. In reality, it is recognized that this can never be wholly achieved in all cases, as deficiencies in the capture of data from the operational systems of the organization will impinge on our ability to satisfy this requirement. We can hope only to satisfy this requirement entirely insofar as the warehouse model itself is concerned and to minimize, as far as possible, the negative effect caused by the operational systems.

Accurate recording of the changes in entities to support queries involving customer circumstances and dimension browsing. Queries such as these represent a very significant component of the data warehouse usage. It is important to ensure that the periods of existence (validity) of dimensions, relationships, and attributes are recorded accurately with respect to time where this has been identified as a business requirement. Again, the ability to do this is constrained by the accuracy and quality of data supplied by the operational systems. It is important that the warehouse does not compound the problem.
THE IMPLEMENTATION OF RETROSPECTION

Introduction

We begin this section with a general rule: Every query executed in a data warehouse must have a time constraint.

If an executed query does not have an explicit time constraint, then the inferred time period is "for all time." Queries that embrace all of time, insofar as the data warehouse is concerned, can be generally regarded as nonsensical because "for all time" simply means the arbitrary length of time that the database has been in existence. Whereas it may be sensible to aggregate across all customers or all products in order to ascertain some information about, say, total revenue for a period of time, it does not make sense to apply the same approach to time under normal circumstances.

The following query has been quoted extensively: How many customers do we have? Even this query has an implicit time constraint in that it is really asking: How many customers do we have at this precise point in time, or how many customers currently exist?

It is likely that readers of this book may be able to think of circumstances where the absence of a time constraint makes perfect sense, which is why the axiom is offered as a general rule rather than a fixed one. The principle still holds.
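As a simple illustration of the general rule, and assuming the hypothetical sales_fact table used in the earlier sketches (its names are assumptions, not part of the worked example), the first query below silently aggregates "for all time," whereas the second carries an explicit time constraint:

-- Implicitly "for all time": the result depends on how long the warehouse has existed
select sum(sale_value)
from sales_fact

-- Explicitly time constrained: revenue for the fourth quarter of 2000
select sum(sale_value)
from sales_fact
where sale_date between '2000/10/01' and '2000/12/31'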
The Use of Existence Attributes

We'll now explore the use of existence attributes as one approach to the implementation of retrospection. The concept of existence has already been raised, but its general application to all data warehouse components might not be entirely clear. The reasoning is as follows: The temporal requirement for any entity relates to its existence. The existence of an entity may, in some circumstances, be discontinuous. For instance, a wine may be sold by the Wine Club for a period of time and then be withdrawn. Thereafter, it may again come into existence for future periods of time. This discontinuity in the lifecycle of the wine may be important when queries such as "How many product lines do we sell currently?" are asked.

A further example concerns customers. The regaining of lost customers is increasingly becoming an important business objective, especially in the telecommunication and financial services industries. When customers are "won back," the companies prefer to reinstate them with their previous history intact. It is now becoming more common for customers to have discontinuous existences.
A relationship that requires temporal support can be modeled as an m:n relationship. Using the entity relationship (ER) standard method for resolving m:n relationships, a further entity, an intersection entity, would be placed between the related entities. So there now exists another entity (a weak entity). As the temporal requirement for an entity relates to the existence of the entity, the treatment of relationships is precisely the same as the treatment of entities. It is a question of existence once again.

An attribute should really be regarded as a property of an entity. The entity is engaged in a relationship with a domain of values. Over time, that relationship would necessarily be modeled as an m:n relationship. An attribute that requires temporal support, therefore, would be resolved into another intersection entity between the entity and the domain, although the domain is not actually shown on the diagram. The treatment of attributes is, therefore, the same as entities and relationships and relates to the existence of the value of the attribute for a period of time.

It is proposed that temporal support for each element, where required, within customer circumstances, segmentation, or the dimensional structure of a data warehouse should be implemented by existence attributes. Various ways of representing such an existence attribute may be considered. At the simplest level for an entity, an existence attribute could be added to record whether each occurrence is currently active or not. However, if full support for history is needed, then this requires a composite existence attribute consisting of a start time and an end time that records each period of existence. Some special value should be assigned to the end time to denote a currently active period. Each element may have such an existence attribute implemented as follows:
1. For the temporal support of an entity that may have a discontinuous existence, a separate table is required, consisting of the primary key of the entity and an existence period. If discontinuous existence is not possible, the existence period may be added to the entity type.
2. For the temporal support of a relationship between entities, a separate table is required, consisting of the primary keys of both participating entities together with an existence period.
3. For the temporal support of an attribute of an entity, a separate table is required, consisting of the primary key of the entity, an existence period, and the attribute value.
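As a rough illustration of these three cases, the tables might take the following shape. The table names CustomerExist and CustomerSalesAreaExist are the ones used later in this chapter; the key and attribute column names, and the bottle price example in the third table, are illustrative assumptions rather than the book's worked definitions:

-- 1. Existence of an entity (retrospection = true, discontinuous existences allowed)
create table CustomerExist (
    CustomerKey     integer  not null,   -- primary key of the customer (name assumed)
    ExistenceStart  date     not null,   -- start of one period of existence
    ExistenceEnd    date     not null,   -- end of the period; a special "high" value denotes a currently active period
    primary key (CustomerKey, ExistenceStart)
);

-- 2. Existence of a relationship between two entities
create table CustomerSalesAreaExist (
    CustomerKey     integer  not null,
    SalesAreaKey    integer  not null,   -- primary key of the sales area (name assumed)
    ExistenceStart  date     not null,
    ExistenceEnd    date     not null,
    primary key (CustomerKey, SalesAreaKey, ExistenceStart)
);

-- 3. Existence of the values of a single attribute (here, an assumed bottle price attribute of wine)
create table WineBottlePriceExist (
    WineKey         integer       not null,
    ExistenceStart  date          not null,
    ExistenceEnd    date          not null,
    BottlePrice     numeric(7,2)  not null,  -- the value that held during the period
    primary key (WineKey, ExistenceStart)
);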
It should be noted that the concept of existence attributes is not new. In fact, the approach could be described as a kind of "selective" attribute timestamping. However, use of attribute timestamping has largely been limited to research into temporal database management systems and has not before been associated with data warehousing. It is the selective adoption of attribute timestamps, in the form of existence attributes, for data warehousing purposes that is new.

The use of existence attributes solves the problem of cascaded extraneous inserts into the database caused by the use of the Type 2 solution with a slowly changing hierarchy that was described in Chapter 4. The reason is that there is no need to introduce a generalized key for a dimension, because changes to an attribute are kept in a separate table. However, the performance and usability of the data warehouse need to be considered. The use of existence attributes in the form of additional tables does add to the complexity, so it should be allowed only for those elements where there is a clearly identified need for temporal support. It is for the decision makers to choose which elements are sufficiently important to justify such treatment.

To explore the benefits of existence attributes, let us consider that the user wishes to know the number of customers that a particular sales executive is responsible for. The intuitive query, reproduced from Query Listing 4.6 in Chapter 4, is shown in Query Listing 6.1.
Listing 6.1 Nonexpert query to count customers (reproduced from 4.6).
Select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

If Type 2 is implemented, the result would definitely be wrong. Each customer would have one or more rows in the table depending on the number of changes that had occurred to the customer's record. The result would substantially overstate the real situation. Type 2 works by creating a new dimension record with a different generalized key. One simple solution is that the dimension is given an additional attribute that signifies whether the row is the latest row for the customer. The value of this attribute declares the existence of the dimension and remains true until the row is superseded. In its simplest form the existence of a dimension can be implemented using a Boolean attribute. This means that when a change is implemented, the previous latest row is updated and the existence attribute is set to false. The new row has its existence attribute set to true. The new query to determine the number of customers that a particular sales executive is responsible for is as follows:
Select count(*)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'
and C.Existence = TRUE

The end-user query software could be configured to add this last line to all dimension tables, so the user need not be aware of it. However, this does not entirely solve the problem, because it is not answering the question: How many customers is Tom Sawyer responsible for now? Rather, it is answering the question: How many customers has Tom Sawyer ever been responsible for? One method toward solving this problem would be to also set the existence attribute to false when the customer ceased to be a customer. Whether or not this is possible depends on the ability of the data warehouse processing to detect that a customer's status had become inactive. For instance, if the customer's record in the source system were to be deleted, the data warehouse processing could infer that the customer's existence attribute in the data warehouse should be updated to false.
Another variation is to make use of null values in the existence attribute, as follows:

Null for not existing
Not null for existing (i.e., current)

If the column were called "existence," then, to identify all customers who are still active, the query would be as shown here:
Select count(existence)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

Due to the way nulls are treated (i.e., they are ignored unless explicitly coded for), this expression of the query is almost as simple as the original, intuitive query phrased by the user in Query Listing 6.1. Furthermore, if the true value of existence was 1, then the following query would also return the correct result:
Select sum(existence)
from Sales_Exec S, Customer C
where S.SalesExecNum = C.SalesExecNum
and S.Name = 'Tom Sawyer'

This appears to effectively solve the problem of determining who the current customers are. Thus, even the simplest existence attribute can improve the original Type 2 method considerably. The existence attribute could be implemented using a single "effective date." This has the advantage that we can determine when a change occurred. However, such a method does not allow us to answer the previous query (How many customers is Tom Sawyer responsible for?), because, again, there is no means to determine inactive customers. The use of row timestamping, using a pair of dates, does enable the question to be answered so long as the end date is updated when the customer becomes inactive. However, there are many questions, such as state duration and transition detection questions, that are very difficult to express and even some that are impossible to express using this approach.

We have already explored the concept of customer churn. The loss of valuable customers is a subject close to the hearts of many, many business people. It is a problem that could equally apply to the Wine Club. It is now becoming a common practice to contact customers that have churned in sales campaigns in an attempt to attract them back. Where organizations are successful in doing this, the customers are reinstated with their previous identifiers so that their previous history is available to the customer service staff. The incidence of discontinuous existences is, therefore, becoming more common.
The need to monitor churn and to establish the reasons for it tends to create a need for queries that return results in the form of a time series. The following exemplifies the type of questions we would like to express:
1. How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? The result of such a query would be a time series containing three periods and a number attached to each period. This is an example of a temporal selection query.
2. Of the customers who were lost, how many had been customers continuously for at least one year? The loss of long-standing customers might be considered to be as a result of worsening service. The result from this question is also a time series, but the query contains an examination of durations of time. So this query is a state duration query.
3. How many of these customers experienced a change of administration because they moved in the year that they left? Perhaps they are unhappy with the level of service provided by the new area. This question is concerned with the existence of the relationship between the customer and the sales area. It is also an example of a transition detection query.
4. How many price changes did they experience in the year that they left? Perhaps they were unhappy with the number of price rises imposed. This is similar to question 3 in that it is a transition detection query, but it is applied to the value of an attribute, in this case the selling price of a bottle of wine, instead of a relationship.

These requirements cannot be satisfied using a single Boolean attribute to describe the existence of a dimension, as there is a requirement to make comparisons between dates. Neither can the queries be expressed using a single date attribute, for the reason previously explained. It seems clear that the expression of such queries requires a composite existence attribute that is, logically, a period comprising a start time and an end time. It has been shown that row timestamping can provide a solution in many cases, but not all, and the resulting queries are complex to write. A simpler solution is sought. The approach to be adopted will be to use a separate existence attribute, in the form of a composite start time and end time, for each dimension, relationship, and attribute where retrospection has been defined to be true. It is assumed that these existence attributes will be held in separate tables. So the existence attribute for customers' existence will be held in a table called "CustomerExist" and the existence attribute for the period during which a customer lives in a sales area will be held in a table called "CustomerSalesAreaExist." Using the composite existence attribute, the first query can be expressed as in Query Listing 6.2.
Listing 6.2 Count of customers who have left during Q4 year on year.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'

The query in Listing 6.2 could have been answered using row timestamping only if the end timestamp was updated to show that the customer was no longer active. The distinction is made between the existence attribute and the row timestamp because the existence attribute is a single-purpose attribute that purely records the existence of the customer. The row timestamp is, as has been stated, a multipurpose attribute that records other types of changes as well. In order to express the query using row timestamps, it would have to be written as a correlated subquery to ensure that only the latest record for the customer was evaluated. This means that discontinuous existences could not be detected.
The second query can be expressed as in Query Listing 6.3.
Listing 6.3 Count of long-standing customers lost.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
In order to express the third query, it is assumed that there is a separate existence attribute for the relationship between the customer and the sales area. This is shown in Query Listing 6.4.
Listing 6.4 Lost customers who moved.
select 'Q4 2000' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '2000/01/01' and '2000/12/31'
union
select 'Q4 1999' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '1999/01/01' and '1999/12/31'
union
select 'Q4 1998' as quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart between '1998/01/01' and '1998/12/31'

The query in Listing 6.4 is an example of a combined state duration and transition detection query.
As with the other queries, in order to express the fourth query, it is assumed that there is a separate existence attribute for the bottle price (Listing 6.5).
Listing 6.5 Lost customers affected by price increases.
select 'Q4 2000' as quarter, ce.CustomerCode, count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '2000/10/01' and '2000/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 2000
and spe.ExistenceStart between '2000/01/01' and '2000/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5
union
select 'Q4 1999' as quarter, ce.CustomerCode, count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '1999/10/01' and '1999/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 1999
and spe.ExistenceStart between '1999/01/01' and '1999/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5
union
select 'Q4 1998' as quarter, ce.CustomerCode, count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t
where ce.ExistenceEnd between '1998/10/01' and '1998/12/31'
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and s.TimeCode = t.TimeCode
and t.year = 1998
and spe.ExistenceStart between '1998/01/01' and '1998/12/31'
group by quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5

The query in Query Listing 6.5 shows customers who left the club in the last quarter of the year and who had experienced more than five price changes during the year. This is another example of a combined state duration and transition detection query.
In dealing with the issue of churn, this approach of trying to detect patterns of customer behavior is typical. It is accepted that the queries have drawbacks: they are quite complex and would prove difficult for the average data warehouse user to write. Also, each query actually consists of a set of smaller queries, and each of the smaller queries is responsible for processing a discrete point in time or a discrete duration. Any requirement to increase the overall timespan, or to break the query into smaller discrete timespans, would result in many more queries being added to the set. So the queries cannot be generalized to deal with a wide range of times. In the next section, we explore and propose an approach for making the queries much easier to express and to generalize.
In this section we have seen that the use of existence periods does provide a practical solution to the implementation of true retrospection. This enables time to be properly represented in the customer circumstances and dimensional structures of the data warehouse and, therefore, satisfies one of the major requirements in the design of data warehouses.
THE USE OF THE TIME DIMENSION

In this section we explore how the composite existence attribute introduced in the previous section, together with the time dimension, may allow the expression of some complex queries to be simplified. The purpose of the time dimension is to provide a mechanism for constraining and grouping the facts, as do the other dimensions in the model. We are examining methods for properly representing customers; this means providing support for time in the customer circumstances, dimensions, and dimensional hierarchies as well as in the facts. The four queries in Listings 6.2-6.5 show that similar time constraints apply to the dimensions as to the facts. Therefore, it seems appropriate to allow the users of the data warehouse to express time constraints on other components using the same approach as they do with the facts.

Kimball proscribes the use of the time dimension with other dimensions because he is concerned that the semantics of time in the dimensions are different from those of the facts and are potentially misleading. The view that the time dimension should not be used in dimensional browse queries is supported implicitly by the conventional star schema and snowflake schema data models, which show the time dimension as being related to the fact table alone. There is no relationship between the time dimension and any other dimension on any dimensional model that I have seen. In considering this matter, two points emerge. First, the time dimension provides a simple interface to users when formulating queries. Preventing them from using the time dimension with other entities means that users can place time constraints by selecting terms such as "2nd Quarter 2000" in queries involving the fact table, but not in queries involving other entities such as customer circumstances and dimensions. In those queries, the explicit time values have to be coded. Second, some dimensional browsing queries are much easier to express if a join is permitted between these other entities and the time dimension. Further, I have discovered that some useful but complex queries, such as those in the previous section, can be generalized if a join to the time dimension is permitted. This is described below.

Referring to the queries expressed in Listings 6.2-6.5, the first query (Listing 6.2) is: How many customers did we lose during the last quarter of 2000, compared to 1999 and 1998? Using the time dimension, it can be expressed as shown in Listing 6.6.
Listing 6.6 Count of customers lost during Q4, using the time dimension.
select t.Quarter, count(*)
from CustomerExist ce, Time t
where ce.ExistenceEnd = t.TimeCode
and t.Quarter in ('Q42000', 'Q41999', 'Q41998')
group by t.Quarter

Changes to the temporal scope of the query can be effected simply by altering one line of the predicate instead of creating additional discrete queries.
The query from Listing 6.3 is: Of the customers who were lost, how many had been customers continuously for at least one year? This can be expressed as follows:
Listing 6.7 Count of long-standing customers lost, using the time dimension.
select t.Quarter, count(*)
from CustomerExist ce, Time t
where ce.ExistenceEnd = t.TimeCode
and t.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
group by t.Quarter

The third query, from Listing 6.4, is: How many of the customers experienced a change of administration because they moved in the year that they left? Using the same assumptions as before, the query can be expressed as in Query Listing 6.8.
Listing 6.8 Lost customers who moved, using the time dimension.
select t1.Quarter, count(*)
from CustomerExist ce, CustomerSalesAreaExist csa, Time t1, Time t2
where ce.ExistenceEnd = t1.TimeCode
and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = csa.CustomerCode
and csa.ExistenceStart = t2.TimeCode
and t2.Year = t1.Year
group by t1.Quarter

Finally, the fourth query, from Query Listing 6.5, is: How many price changes did they experience in the year that they left? This can be expressed as shown in Query Listing 6.9.
Listing 6.9 Lost customers affected by price increases, using the time dimension.
select t1.Quarter, ce.CustomerCode,
count(distinct spe.WineCode)
from CustomerExist ce, SalesPriceExist spe, Sales s, Time t1, Time t2, Time t3
where ce.ExistenceEnd = t1.TimeCode
and t1.Quarter in ('Q42000', 'Q41999', 'Q41998')
and (ce.ExistenceEnd - ce.ExistenceStart) > 365
and ce.CustomerCode = s.CustomerCode
and s.WineCode = spe.WineCode
and spe.ExistenceStart = t2.TimeCode
and s.TimeCode = t3.TimeCode
and t2.Year = t1.Year
and t3.Year = t2.Year
group by t1.Quarter, ce.CustomerCode
having count(distinct spe.WineCode) > 5

Thus, we can conclude that allowing the time dimension to be joined to other dimensions, when existence attributes are used, enables a simpler expression of some temporal queries.
In order to adopt a change of approach whereby joins are allowed between the time dimension and other dimensions, we have to alter the data model. There now exists a relationship between the time dimension and some of the other entities. Only those dimensions that need true support for time will be related to the time dimension. Part of the Wine Club model has been reproduced in Figure 6.1.
Figure 6.1. ER diagram showing new relationships to the time dimension.
Figure 6.1 shows time having a relationship with sales, as before, and also with the customer and wine dimensions. In fact, as Figure 6.1 shows, there are two relationships between the time dimension
and other dimensions, one for the start time of a period and one for the end time.
A problem caused by this change to the conventional approach is that dimensional models immediately lose their simple shape, and the overall model becomes horribly complex. Simplicity is one of the requirements placed on the model, so in creating this new idea, I have introduced a further problem to be solved. It is important that the dimensional shape is not lost; clearly, we cannot present the users with such a diagram. The solution could be to remove the time dimension from the model altogether. It has been said before that the time dimension is always included as part of a dimensional model because time is always a dimension of analysis in a data warehouse that records history. In further recognition of the fact that data warehouses are temporal databases, the explicit inclusion of a time dimension could be regarded as unnecessary. So, on the assumption that the time dimension is a given requirement, I have adopted the view that its inclusion is implicit. This means that the diagram does not need to model the time dimension explicitly. However, this causes a problem with the entity relationship modeling methodology, in that it would be misleading to have implicit entities that are deemed to exist but are excluded from the diagram. The dot modeling methodology does not have this limitation and will be adapted to accommodate the new requirement.
However, the time dimension does have attributes that are specific to each application, and so it is not something that can be ignored altogether. For instance, data warehouses in some types of organization require specific information about time, such as:

Half-day closing
Prevailing weather conditions
Effect of late opening due to staff training
Whether the store was open for 24 hours

This type of information cannot be obtained through any type of derivation. So there is a need for some means of specifying the attributes for time on a per-application basis. In the dot modeling methodology we can solve this problem by introducing a table that satisfies the requirements previously handled by the explicit time dimension as well as the requirements covered in this section. The table could be given a standard name for use in all applications. The use of Time as a table name is likely to conflict with some RDBMS reserved word lists so, for our purposes, the name dot_time will be used to describe the table. Each application will have its own requirements as to the columnar content of the dot_time table, although some columns, such as the following, would almost always be required:

Date
Day name
Week number
Month name
Month number
Quarter
Year

As practitioners we could add value for our customers by bringing a "starter" dot_time table that might contain, say, 10 years of history and 10 years of future dates. This seems like a large amount of data, but in reality it is fewer than 8,000 rows where the granularity is daily. For finer levels of granularity, for example seconds, it is sensible to provide two time tables. The first contains all the days required, as before, and the other contains an entry for each second of a single day (i.e., from 00:00:00 to 23:59:59). It is then a simple matter to join to one table, in the case of dimensional changes, or to both tables in the case of, say, telephone calls. In this way, multiple grains of time can be accommodated. Practitioners could also provide standard programmed procedures, perhaps using the user-defined functions capability that is available as an extension to some RDBMS products, to add optional columns such as weekends and bank holidays, although some customization of the dot_time table is almost inevitable for every application.
The removal of the explicit time dimension from the conceptual model to the logical model is a step forward in the approach to the design of data warehouses. It also goes some way toward recognizing that data warehouses are true temporal applications and that support for time is implicit in the solution, rather than having to be made explicit on the data model.
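By way of illustration, a starter dot_time table at daily grain, together with a companion table at one-second grain, might be declared along the following lines. This is only a sketch: the column names simply mirror the list above, and the data types, column names, and the existence of a separate dot_time_of_day table are assumptions rather than a prescription of the methodology.

create table dot_time (
    time_code     date        not null,  -- one row per day
    day_name      varchar(9)  not null,  -- e.g., 'Monday'
    week_number   smallint    not null,
    month_name    varchar(9)  not null,
    month_number  smallint    not null,
    quarter       char(6)     not null,  -- e.g., 'Q42000'
    year          smallint    not null,
    primary key (time_code)
);

-- Companion table for finer grains; joined together with dot_time where,
-- say, telephone calls need to be analyzed to the second.
create table dot_time_of_day (
    second_code    time      not null,   -- 00:00:00 through 23:59:59
    hour_number    smallint  not null,
    minute_number  smallint  not null,
    primary key (second_code)
);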
LOGICAL SCHEMA

The implementation of true retrospection for any entity, attribute, or relationship requires that the lifespan of the object in question is recorded. This must be in the form of a period marking the starting and ending times. The introduction of such periods changes the structure of the entities. For instance, the customer circumstances entity from the Wine Club model has a requirement for true retrospection on its own existence as well as on the relationship with the sales area and on the customers' addresses. Each of these would be given its own existence period attributes, as illustrated in the diagram in Figure 6.2. Figure 6.2 shows how the implementation of true retrospection using existence attributes results in the creation of new relations. The relational logical schema for the diagram is also shown. A more complete logical schema for the Wine Club is shown in Appendix C.

Figure 6.2. Logical model of part of the Wine Club.
Each of the relations in the diagram is now described:

Relation Customer
    Customer_Code
    Customer_Name
    Hobby_Code
    Date_Joined
    Primary Key (Customer_Code)
Relation Sales_Area
    Sales_Area_Code
    Sales_Area_Name
    Primary Key (Sales_Area_Code)

The logical schema has been included as an aid to clarity and is not intended to prescribe a physical model. Performance issues will have to be considered when implementing the solution, and some denormalization of the above schema may be appropriate. However, the above schema does present several examples of true retrospection.
True retrospection of an entity is exemplified in the existence of the customer by providing a separate relation containing just the customer code and the existence period. True retrospection of an attribute is shown by the customer address relation. True retrospection for a relationship is shown by the customer sales area relation that records the relationship between the customer and the sales area. In each case the primary key includes the start time of the period attribute. False retrospection is supported for the relationship between customers and hobbies by the inclusion of the hobby code as a foreign key in the customer relation. The customer name, for example, also has the property of false retrospection.
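To make the shape of these additional relations concrete, the following is a minimal sketch of how they might be declared. The relation and column names follow the naming style of the schema above, but the authoritative definitions are those in Figure 6.2 and Appendix C; the data types and the use of a nullable end date for open periods are illustrative assumptions.

create table Customer_Exist (
    Customer_Code  char(6)  not null,
    Start_Date     date     not null,
    End_Date       date,                 -- open (null) while the customer is active
    primary key (Customer_Code, Start_Date)
);

create table Customer_Address (
    Customer_Code  char(6)      not null,
    Address        varchar(120) not null,
    Start_Date     date         not null,
    End_Date       date,
    primary key (Customer_Code, Start_Date)
);

create table Customer_Sales_Area (
    Customer_Code    char(6) not null,
    Sales_Area_Code  char(4) not null,
    Start_Date       date    not null,
    End_Date         date,
    primary key (Customer_Code, Start_Date)
);

In each case the start time of the period forms part of the primary key, as required by the text above.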
PERFORMANCE CONSIDERATIONS

This book is primarily concerned with accuracy of information and not with performance. However, it is recognized that performance considerations have to be made at some point and that performance may be considered to be more important than accuracy when design decisions are taken. For this reason we will briefly explore the subject of performance. A query involving the fact table, such as "the sum of sales for the year 2000 grouped by sales area," using a Type 2 approach with surrogate keys, would be expressed as follows:
Select SalesAreaCode, sum(s.value)
From sales s, customer c, time t
Where s.CustomerSurrogate = c.CustomerSurrogate
And s.timecode = t.timecode
And t.year = 2000
Group by SalesAreaCode

Using our new schema, the same query requires a more complex join, as the following query shows:
Select SalesAreaCode, sum(s.Value)
From sales s, CustomerSalesArea csa, time t
Where s.CustomerCode = csa.CustomerCode
And s.TimeCode = t.TimeCode
And t.TimeCode between csa.Start and csa.End
And t.year = 2000
Group by SalesAreaCode

The join between the time and customer sales area dimensions is not a natural join (it is known as a theta join) and is unlikely to be handled efficiently by most RDBMS query optimizers. A practical solution to the performance issue in these cases, while retaining the benefit of existence attributes, is to copy the sales area code into the fact table. As has been stated previously, the attributes of the fact table always have the property of "permanent" retrospection. So the accuracy of the results would not be compromised, and performance would be improved considerably, even beyond the original query, because a whole table is omitted from the join. This is shown by the next query:
Select SalesAreaCode, sum(s.Value)
From sales s, time t
Where s.TimeCode = t.TimeCode
And t.year = 2000
Group by SalesAreaCode

The relationship between the customer and sales area dimensions must be left intact in order to
facilitate dimensional browsing. This is because the decomposition of the hierarchy is not reversible. In other words, the hierarchy cannot be reconstructed from the fact table, since, if there is no sale, there is no relationship between a customer and a sales area. So it would not be a nonloss decomposition. The model enables dimensions to be queried and time series types of analysis to be performed, by using the time dimension to allow the sales area and customer dimensions to be grouped by any time attribute. Queries involving dimensions only, such as counting the number of customers by year, would be expressed as follows:
Select year, count(*)
From CustomerExist ce, time t
Where t.timecode between ce.start and ce.end
And t.timecode in ('2000/12/31', '1999/12/31', '1998/12/31')
Group by year

This appears to be a reasonably efficient query. At this point it is worth reiterating that browse queries represent about 80 percent of the queries executed on the data warehouse.
At this level in the model, there is no reason to distinguish between the requirements of false and permanent retrospection. The main reason for these classifications concerns the population of the data warehouse and the capture of changed data values. In any case, the method handles these requirements quite naturally.
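The copying of the sales area code into the fact table could be sketched as follows. This is an illustration only: in practice the value would normally be resolved during the load process rather than by a bulk update, and the table, column, and type choices are assumptions based on the Wine Club examples rather than a prescribed physical design.

-- Carry the prevailing sales area on each fact row so that fact queries
-- no longer need the theta join to CustomerSalesArea.
alter table Sales add SalesAreaCode char(4);

-- Resolved from the relationship existence that covers the time of the sale.
update Sales
set SalesAreaCode =
    (select csa.SalesAreaCode
     from CustomerSalesArea csa
     where csa.CustomerCode = Sales.CustomerCode
       and Sales.TimeCode between csa.Start and csa.End);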
CHOOSING A SOLUTION

The choice as to which solution is most appropriate depends upon the requirements and the type of data warehouse object under consideration. Retrospection provides a mechanism for freeing the circumstances and dimensions to be regarded as a source of information in their own right. If there is a requirement for true historical analysis to be performed on any components of the dimensional structure, then those components will have a classification for retrospection of true. If there is a requirement to perform state duration queries or transition detection queries, then the most appropriate choice is to provide an existence attribute, in the form of a start time and an end time, that is applied to the object in question.

As far as circumstances and dimensions are concerned, the need to establish discontinuous existence is the main reason for allocating true retrospection. At any point in time, a customer will or will not be active, due to the discontinuous nature of their existence. True retrospection for circumstances and dimensions results from the need for time series analysis relating to their existence. So the existence attribute approach is the only choice in this case.

False retrospection for circumstances and dimensions is similar. It enables the currently existing and no longer existing dimensions to be identified. So it is possible to determine which customers currently exist and which do not, but it is not possible to carry out any form of time-based analysis. The best approach for implementing false retrospection for circumstances and dimensions is to use an existence attribute, containing true or false values, as described earlier in this chapter. Permanent retrospection requires no support at all, as the existence of the circumstances or dimension is considered never to change.

Support for relationships that implement dimensional hierarchies is entirely missing from the Type 2 solution, and any attempt to introduce a Type 2 or a row timestamp solution into a relationship may result in very large numbers of extraneous cascaded inserts, as has been shown in Chapter 4. It is recommended that these techniques be avoided where there are implicit (as implemented by a star) or explicit (as implemented by a snowflake) hierarchies, unless all the superordinate objects in the hierarchy have the property of false or permanent retrospection.

True retrospection in relationships. When retrospection is true in a relationship, if state duration or transition detection analysis of the relationship is required, the relationship must be implemented using an existence attribute with a start time and an end time. There is a slight difference between the existence of relationships and the existence of dimensions. The subordinate customer will always be engaged in a relationship with a superordinate sales area, because the participation condition for the customer is mandatory. It is not, therefore, a question of existence versus
nonexistence, but rather a question of changing the relationship from one sales area to another and of the existences pertaining to relationship instances. The discontinuous nature still exists but applies to changing relationships rather than to gaps in durations. A row timestamp approach can be used where these requirements do not apply, as long as the cascade insertion problem is manageable.

False retrospection in relationships. False retrospection in relationships is best implemented by allowing the foreign key attribute to be overwritten in the subordinate dimension's row. No special treatment is required.

Permanent retrospection in relationships. With permanent retrospection, the foreign key attribute is not expected to change at all, so no special treatment is required.

True retrospection in attributes. The situation with regard to true retrospection in attributes is similar again. If state duration or transition detection analysis is required, then the simplest and most flexible solution is to use an existence attribute with a start time and an end time. Where these types of analysis are not needed, the use of row timestamps can be considered, as long as the problem relating to cascade insertions is manageable.

False retrospection in attributes. False retrospection in attributes requires no special support, as the attribute can be safely overwritten.

Permanent retrospection in attributes. Permanent retrospection also requires no support, as the attribute will not change.

The purpose of Table 6.1 is to provide a simplified guide to aid practitioners in selecting the most appropriate solution for the representation of time in dimensions and circumstances.
Table 6.1. Choosing a Solution

True retrospection
    Circumstances/Dimension: existence period (the need to record discontinuous existence makes this the only practical choice)
    Relationships: State duration or transition detection analysis required?
        Yes: existence period
        No: row timestamp, provided the hierarchy does not change regularly
    Attributes: State duration or transition detection analysis required?
        Yes: existence period
        No: row timestamp where the attribute is at the lowest level in the hierarchy; otherwise existence period
False retrospection
    Circumstances/Dimension: existence true/false attribute, updated using the Type 1 method
    Relationships: Type 1 (overwrite the foreign key)
    Attributes: Type 1 (overwrite the attribute)
Permanent retrospection
    Circumstances/Dimension: will not change; no support required
    Relationships: will not change; no support required
    Attributes: will not change; no support required
FREQUENCY OF CHANGED DATA CAPTURE

In the pursuit of accuracy relating to time, we need to know whether the data we are receiving for placement into the data warehouse accurately reflects the time that the change actually occurred. This was identified as another source of inaccuracy with respect to the representation of time in Chapter 4. This requirement cannot be fully satisfied by the data warehouse in isolation, as is now described.

So far as the facts are concerned, the time that is recorded against the event would be regarded, by the organization, as the valid time of the event. That means it truly reflects the time the sale occurred, or the call was made. With the circumstances and dimensions, we are interested in capturing changes. As has been discussed, changes to some attributes are more important than others with respect to time. For the most important changes, we expect to record the times that the changes occurred. Some systems are able to provide valid time changes to attributes, but most are not equipped to do this. So we are faced with the problem of deducing changes by some kind of comparison process that periodically examines current values and compares them to previous values to determine precisely what has changed and how. The only class of time available to us in this scenario is transaction time.

Under normal circumstances, the transaction time is the time that the change is recorded in the operational system. Often, however, the transaction time changes are not actually recorded anywhere by the application. Changes to, say, a customer's address simply result in the old address being replaced by the new address, with no record being kept as to when the change was implemented. Other systems attempting to detect the change, by a file comparison method, have no real way of knowing when the change occurred in the real world or when it was recorded into the system. So, in a data warehouse environment, there are two time lags to be considered. The first is the lag between the time the change occurred in the real world, the valid time, and the time the change is recorded in an operational system, the transaction time. Usually, the organization is unaware of the valid time of a change event; in any case, the valid time is rarely recorded. The second time lag is the time it takes for a change, once it has been recorded in the operational system, to find its way into the data warehouse. The solution is to try to minimize the time lags inherent in this process. Although that is often easier said than done, the objective of the designers must be to identify and process changes as quickly as possible so that the temporal aspect of the facts and dimensions can be synchronized.
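Where the source system keeps no change history, the comparison process described above can be approximated by comparing the latest extract with the warehouse's current view. The following is a minimal sketch, assuming a hypothetical daily staging table Customer_Stage and the Customer_Address existence relation used earlier; note that the timestamp applied is the load date, a transaction time, not the valid time of the change.

-- Close the open address period for any customer whose staged address differs.
update Customer_Address
set End_Date = current_date
where End_Date is null
  and exists (select 1
              from Customer_Stage st
              where st.Customer_Code = Customer_Address.Customer_Code
                and st.Address <> Customer_Address.Address);

-- Open a new period carrying the new address, stamped with the load date.
insert into Customer_Address (Customer_Code, Address, Start_Date, End_Date)
select st.Customer_Code, st.Address, current_date, null
from Customer_Stage st
where not exists (select 1
                  from Customer_Address ca
                  where ca.Customer_Code = st.Customer_Code
                    and ca.End_Date is null
                    and ca.Address = st.Address);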
CONSTRAINTS

Part of the role of the logical model is to record constraints that need to be imposed on the model. The introduction of full support for time brings with it some additional requirements for the imposition of constraints.

Double-Counting Constraints

Double-counting occurs when the joining of tables returns more rows than should be returned. This problem is usually avoided if the general rules about the structure of dimensional models are followed. However, the introduction of existence attributes into the model increases the risk of error by changing the nature of the relationships in the warehouse data model from simple (1:n) to complex (m:n). The problem is best described by the use of an example of a sale of wine. Table 6.2 shows the bottle cost existence.
Table 6.2. Example of the Existence of a Wine Cost Entity

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/31   4.36
4504        1999/03/31   Now          4.79
The bottle cost has an existence attribute. The existence is continuous, but a change in the cost price of a bottle of the wine has occurred. This has resulted in the generation of a new row. Table 6.3 shows a fragment of the sales fact table detailing a sale of the wine above.
Table 6.3. A Single Sale of Wine

Wine Code   Day          Quantity   Value
4504        1999/03/31   5          24.95
Next, the following query is executed and is intended to show the sales value and costs:
Select w.wine_name, s.value "Revenue", sum(s.quantity * bce.bottle_cost) "Cost"
from Sales s, Wine w, BottleCostExistence bce
where w.wine_code = s.wine_code
and w.wine_code = bce.wine_code
and s.day between bce.start_date and bce.end_date
group by w.wine_name

The result set in Table 6.4 is returned.
Table 6.4. Example of Double-Counting

Wine Name          Revenue   Cost
Chianti Classico   24.95     21.80
Chianti Classico   24.95     23.95
The result is that the sale has been double-counted. The problem has occurred because of an overlap of dates that caused the sale, which was made on March 31, 1999, to join successfully to two rows. The date of the change may well be right: the old cost price ceased to be effective on March 31, 1999, and the new price took effect immediately, on the same day. As far as the query processing is concerned, the multiple join is also correct; the join criteria have been met. What is wrong is the granularity of time. There is an implicit constraint that states that time overlaps in existence are not permitted in a dimensional model. Therefore, if both dates are correct, then the granularity is incorrect and a finer grain, such as time of day, must be used instead. The alternative is to ensure that the end of the old period and the start of the new period actually "meet." This means that there is no overlap, as shown in Table 6.5.
Table 6.5. Wine Cost Entity Showing No Overlaps in Existence

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/30   4.36
4504        1999/03/31   Now          4.79
It is equally important to ensure that no gaps are inadvertently introduced, as in Table 6.6.
Table 6.6. Wine Cost Entity Showing Gaps in Existence

Wine Code   Start        End          Bottle Cost
4504        1997/02/27   1999/03/30   4.36
4504        1999/04/01   Now          4.79
If the data in Table 6.6 were used, then the result set would be empty.
The query used in the example was phrased to aid clarity. It is worth remembering that in data warehousing most queries involving joins to the fact table use arithmetical functions to aggregate the results. The chances of users identifying errors from the result sets of such queries are seriously reduced when large numbers of rows are aggregated. Therefore, in order for the concept of existence to work, temporal constraints must prevent any form of overlap between periods; otherwise, there is a risk that double-counting might occur. The following query, using the temporal construct "overlaps," would detect such an occurrence.
Select R1.PK
From R1, R2
where R1.PK = R2.PK
and R1.Period <> R2.Period
and R1.Period overlaps R2.Period

R1 and R2, in the previous query, are synonyms for the same relation, and PK is the primary key. The relation is subjected to a self-join in order to identify temporal overlaps. This query can be rewritten, without using the temporal constructs, as in the following query:
Select R1.PK
from R1, R2
where R1.PK = R2.PK
and (R1.Start <> R2.Start or R1.End <> R2.End)
and R1.Start <= R2.End
and R2.Start <= R1.End

The periodic execution of such queries would enable the detection of double-counting errors.

Referential Integrity Constraints

Several axiomatic referential integrity constraints can be specified:

1. The period of existence of the subordinate object must be contained within a single period of existence of the superordinate object.
2. In a hierarchy, as long as the subordinate exists, there must also exist a relationship between the subordinate entity and its superordinate entity. This is due to the mandatory participation condition relating to the subordinate entity in data warehouse models.
3. During the period of existence of an entity, all the attributes of the entity must exist. It follows, therefore, that gaps in the existence of attributes are not allowed while the entity is in existence.

There are some constraints relating to retrospection. If an entity has true retrospection:
1. It can have attributes that have true retrospection. The lifespans of those attributes must fall within the lifespan periods of the entity. If the entity ceases to exist, then the attributes with true retrospection should cease to exist at the same time.
2. It can have attributes that have false retrospection. These attributes can change only when the existence attribute for the entity indicates that the entity exists. They cannot change when the existence attribute for the entity indicates that the entity does not exist.
3. It can have attributes that have a value, for retrospection, of permanent, as these values never change.

If an entity has false retrospection:

1. It can have attributes that have true retrospection. If the entity ceases to exist, then the attributes with true retrospection should cease to exist at the same time. When the entity changes from nonexistent to existent, these attributes should begin a new interval of existence at the same time.
2. It can have attributes that have false retrospection. These attributes can change only when the existence attribute for the entity indicates that the entity exists. They cannot change when the existence attribute for the entity indicates that the entity does not exist.
3. It can have attributes that have permanent retrospection, as these values never change.

If a dimension has a value for retrospection of permanent, there are no constraints that restrict the kind of attributes the dimension can have.

Deletion Constraints

It is for the owners of the data warehouse to determine the values of retrospection that are to be applied to each data warehouse object. The three approaches place very different requirements on the design of the warehouse, and the accuracy of the results from queries will vary depending on the choices made. Some rules can be applied. The rules governing referential integrity violations in relational databases with respect to deletions must be applied to existence. Where a dimension changes its status from existing to logically nonexisting, this is equivalent to the entity being logically deleted. The rules have to be applied selectively. Cascade delete cannot be used, because it may result in the deletion of facts, and this would have the effect of invalidating the database. Although the data warehouse would retain integrity in the sense that the references would remain intact, the database would return incorrect results. Nullifying the references (effectively deleting the relationship) cannot be used, because the participation condition of the dimensions, which would be nullified, is mandatory. If it were permitted to delete the relationships, queries that aggregated all the facts using one dimension would produce different totals from queries using other dimensions.
This would have the effect of invalidating the results. The only method for dealing with changes to existence is to use the restricted effect. This means that when a dimension's existence ceases, any referencing dimension must have its relationship existences closed and new relationship existences created that refer to an existing dimension, before the previous dimension is allowed to have its existence terminated.
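As an illustration of the restricted effect, the following sketches the steps that might be carried out when a sales area is to be logically deleted. The table names reuse the examples from this chapter, the Sales_Area_Exist relation and the area codes 'SA01' and 'SA02' are hypothetical, and the reassignment target would in practice come from the business process that triggers the change.

-- 1. Close the relationship existences that refer to the sales area being removed.
update Customer_Sales_Area
set End_Date = current_date
where Sales_Area_Code = 'SA01'
  and End_Date is null;

-- 2. Open new relationship existences that refer to a sales area that still exists.
insert into Customer_Sales_Area (Customer_Code, Sales_Area_Code, Start_Date, End_Date)
select Customer_Code, 'SA02', current_date, null
from Customer_Sales_Area
where Sales_Area_Code = 'SA01'
  and End_Date = current_date;

-- 3. Only then is the sales area's own existence terminated.
update Sales_Area_Exist
set End_Date = current_date
where Sales_Area_Code = 'SA01'
  and End_Date is null;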
EVALUATION AND SUMMARY OF THE LOGICAL MODEL

This chapter has focused on the implementation of the solutions that were introduced into the conceptual model in Chapter 5. Principally, Chapter 6 has focused on the implementation of retrospection. We have seen how true retrospection can be implemented by the use of existence attributes, and how the use of such attributes enables queries to be expressed that are otherwise very difficult to write and impossible to generalize using the approach adopted generally by practitioners. Existence attributes can be described as selective attribute timestamps that are applied only where needed. In the pursuit of accuracy, the implementation of true retrospection has enabled a very much greater level of accuracy to be achieved in queries involving circumstances and dimensions, as well as maintaining accuracy in queries involving facts.

As a result of attempting to simplify the expression of queries involving time, we have seen that the use of the time dimension can be greatly extended to enable joins to circumstances and dimensions rather than just to facts. This approach challenges standard practices and is not recommended by some authors, but the advantages that accrue, in terms of the simplification of queries, are significant enough to justify its adoption. In introducing this new idea, another problem has emerged: the new relationships tend to obliterate the essential star shape of the dimensional model, because the time dimension is now related not only to the fact table but also to one or more of the dimensions and even some attributes. Upon reflection, however, in recognition of the fact that data warehouses are temporal databases, it seems entirely appropriate to change the status of the time dimension from being an explicit component of the data model to being an implicit component that does not, therefore, need to appear on the diagram. Far from being simply a convenient way of removing the problem, this is regarded as a benefit of the method and a natural consequence of accepting the fact that data warehouses are temporal databases.
Chapter 7. The Physical Implementation

In this chapter, we will explore ways to create the physical data warehouse. There is a traditional left-to-right flow of data from the source systems to the warehouse. The operational process begins with the periodic extraction of data. The data has to be tested for quality. The quality of data is a major headache for most data warehouse developers, and we will be exploring this issue in some depth; poor data quality will threaten the success of the whole project. Also, we need to consider what happens to the warehouse when changes occur in the source systems. Changes can be very slight, having little or no impact, or they can be massive, as in the case of the introduction of a new enterprise resource planning (ERP) system that will, at a stroke, change the major source of data to the warehouse. Designing a data warehouse that is able to cope with such changes is similar to designing a 100-story building in an earthquake zone. Then there is the requirement to operate the warehouse in a "lights-out" environment. Each day, new data is extracted, quality checked, loaded, and summarized. The way in which we design and build warehouses is not usually conducive to a lights-out approach. Let's begin this chapter with an overall look at the architecture.
THE DATA WAREHOUSE ARCHITECTURE

There has been considerable discussion in the industry about the various approaches to the development of data warehouses. After some successful projects, and some more spectacular failures, the discussion has tended to focus on the benefits of data marts, or application-centric information systems, sometimes referred to as department-level data warehouses. The rationale behind these ideas is that enterprise data warehouses are high-risk ventures that often do not justify the expense and extended delivery times. A department-level data mart, on the other hand, can be implemented quickly, easily, and with relatively low risk.

The big drawback of department-level information systems is that they almost always end up as stove-pipe developments. There is no consistency in their approach, and there is usually no way to integrate the data so that anything approaching enterprise-wide questions can be asked. They are attractive because they are low risk. They are low risk because they incur relatively low cost. The benefits they bring are equally low. What we need are information systems that help the business in pursuit of its goals. As the goals evolve, the information systems evolve as well. Now that's a data warehouse! The trick is to adopt a fairly low-risk, and low-cost, approach with the ultimate result being a complete and fully integrated system. We do this by adopting an incremental approach. The benefits of an incremental approach to delivering information systems are well recognized, and I am an enthusiastic exponent of this approach, as long as it is consistent with business goals and objectives.

In order to do this we need to look back at the idea behind database systems. The popular introduction of database management systems in the 1970s heralded the so-called Copernican revolution in the perception of the value of data. There was a shift in emphasis away from an application-centric toward a more data-centric approach to developing systems. The main objectives of database management systems are the improvement of:
1. Evolvability. The ability of the database to adapt to the changing needs of the organization and the user community. This includes the ability of the database to grow in scale, in terms of data volumes, applications, and users.
2. Availability. Ensuring that the data has structure, and is able to be viewed in different ways by different applications and users with specific and nonspecific requirements.
3. Sharability. Recognition of the fact that the data belongs to the whole organization and not to single users or groups of users.
4. Integrity. Improving the quality, maintaining existence, and ensuring privacy.
The degree to which success has been achieved is a moot point. Some systems are better, in these respects, than others. It would be difficult to argue that some of the online analytical processing (OLAP) systems with proprietary database structures achieve any of the four criteria in enterprise-wide terms. The point here is that, even with sophisticated database management systems now at our disposal, we are still guilty of developing application-centric systems in which the data cannot easily be accessed by other applications.

Where operational, or transactional, systems are concerned this appears to be acceptable. In these cases the system boundaries seem to be clear, and it could be argued that there is no need for other systems to have access to the data used by the application. Where such interfaces are found to be of benefit, integration processes can be built to enable the interchange of information. With informational systems, however, the restriction is not acceptable. It is also not acceptable where transactional systems are required to provide data "feeds" to informational systems if the transactional data is structured in such a way that it cannot easily be read using the standard query language provided by the DBMS. These practices do nothing to smooth the development of informational systems and contribute to the risk of embarking on any decision support project. The objectives of evolvability, availability, sharability, and integrity are still entirely valid, and even more so with decision support systems, where the following apply:
1. The nature of the user access tends to be unstructured.
2. The information is required by a wide number of users with differing needs.
3. The system must be able to respond to changes in business direction in a timely fashion.
4. The results of queries must be reliable and consistent.

The development of an application-centric solution will not support those objectives. The only way to be sure that the database possesses these qualities is to design them in at the beginning. This requirement has given rise to the EASI (evolvability and availability, sharability, integrity) data architecture, an example of which is shown in Figure 7.1.
Figure 7.1. The EASI data architecture.
The foundation of EASI is the data pool. The data pool contains the lowest level of detail (granularity) of data that the organization is able to store. Ideally, this should be at transaction level. The data pool is fed by the organization's operational systems. The data pool supports the information and decision support systems (applications) that are able to take data from the pool, as well as store data of their own. The applications are entirely independent of one another and have their own users and processing cycles. The applications can, in turn, provide data to other systems or can even update the source systems. The most important point to note is that the data pool is a resource that is available to all applications. This is entirely consistent with the original objectives of database management systems. The various components of the model are now described in more detail.
The VIM Layer

In order for data to be allowed to enter the pool, each item of data must conform to rigorous constraints regarding its structure and integrity. Every data item must pass through a series of processing steps
known as VIM (see Figure 7.1). VIM stands for validation, integration, and mapping. Collectively the VIM processes form the basis of the data quality assurance strategy for the data warehouse. Closely associated with the VIM processing is a comprehensive metadata layer. We will be building the metadata as we go.
Data Validation
The validation layer is the first part of the data quality assurance process. The validation of data is one of the most important steps in the loading of data into the warehouse. Although this might sound obvious, it is one area that is often hopelessly underestimated in the planning of a data warehouse. Unfortunately, when we ask the question, “What is the quality of your data?” the answer is invariably that their data is nearly 100 percent perfect. This assertion is always wrong. Most organizations have no clue about the quality of their data and are always naïve in this respect. If they say it's near perfect, you can be sure that it's poor. If they admit to their data being OK then it's likely to be desperately poor. Clearly, if the quality of data is poor, then the accuracy of any information that can be derived from it will be highly suspect. We cannot allow business decisions to be based on information of questionable accuracy. Once it becomes known that the accuracy of the information cannot be relied upon, the users will lose trust in the data warehouse and will stop using it. When this happens, you might as well shut it down. There are several classes of data quality issues. Data can be:
1. Missing (or partially missing)
2. Erroneous
3. Out of date
4. Inconsistent
5. Misleading

Missing data. The main reason for data to be missing is insufficient rigor in the routine data collection processes of the operational systems. This might be fields on an input screen or document where, for the purposes of the application, the data is not classified as mandatory, whereas in the warehouse environment it is absolutely mandatory. An example of this, insofar as the Wine Club is concerned, might be the customer's favorite hobby. When a new customer is registered, the data entry operator might not appreciate the value of this piece of information and, unless they are forced to, might not capture it. In our CRM solution, we are trying hard to tailor our products and services toward our customers' preferences, and the absence of this information will make it all the harder for us to do so successfully. Resolving the issue of missing data can be a nightmare. Ideally, the missing values should be sorted
out within the source system before the data is extracted and transferred to the warehouse. Sometimes this is not possible to do and sometimes it is not desirable, and so the warehouse design has to incorporate some method of resolving this. It is very risky to allow data into the warehouse with missing values, especially if it results in null values. Here is an example. Take a look at the fragment of the customer table in Table 7.1.
Table 7.1. Table Containing Null Values

Customer ID   Customer Name      Salary
12LJ49        Lucie Jones        20,000
07EK23        Emma Kaye          null
34GH54        Ginnie Harris      30,000
45MW22        Michael Winfield   10,000
21HL11        Hannah Lowe        null
Now consider the queries shown in Table 7.2.
Table 7.2. Query Results Affected by Null Values

Query                                    Result
Select count(*) from customer            5
Select count(Salary) from customer       3
Select min(Salary) from customer         10,000
Select avg(Salary) from customer         20,000
Although these results are technically correct, in business terms, apart from the first, they are clearly wrong. While it is reasonable to say that we might be able to spot such errors in a table containing just five rows, how confident are we that such problems could be detected in a table that has several million rows? The approach to the correction of missing values is a business issue. What importance does the business place on this particular data element? Where there is a high level of value placed on it, then we have to have some defined process for dealing with records where the values are missing. There are two main approaches:
1. Use a default value.
2. Reject the record.

Use of default values means that the rejection of records can be avoided. They can be used quite
effectively on enumerated or text fields and in foreign key fields. We can simply use the term unknown as our default. In the case of foreign keys, we then need to add a record in the lookup table with a primary key of “UN” and a corresponding description. The example of the hobbies table in Table 7.3 makes this clear.
Table 7.3. The Wine Club Hobbies Table

Hobby Code    Hobby Description
GF            Golf
HR            Horse riding
MR            Motor racing
UN            Unknown
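As a minimal sketch of how the default could be applied during validation, assuming the incoming customer rows are first landed in a staging table (Customer_Stage is a hypothetical name), the processing might be no more than:

-- Add the default entry to the lookup table (done once)
Insert into Hobby (HobbyCode, HobbyDescription)
Values ('UN', 'Unknown');

-- Substitute the default wherever the hobby code is missing,
-- before the rows are allowed into the warehouse
Update Customer_Stage
Set    HobbyCode = 'UN'
Where  HobbyCode is null;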
A fragment of the customer table is shown again in Table 7.4.
Table 7.4. Part of the Customer Table

Customer ID    Customer Name       Hobby Code
12LJ49         Lucie Jones         MR
07EK23         Emma Kaye           UN
34GH54         Ginnie Harris       GF
45MW22         Michael Winfield    GF
21HL11         Hannah Lowe         UN
A query to show the distribution of customers by hobby will result in Table 7.5.
Table 7.5. Result of the Query on Customers and Their Hobbies

Hobby Description    Count(*)
Golf                 2
Motor Racing         1
Unknown              2
A benefit of default values is that we can always see the extent to which the default has been applied. The application of defaults should, however, be limited to sets of a known size. If we have, say, 50 different hobbies, it is easy to imagine adding a further “unknown” category. But in the earlier salary example, how can we possibly define a default value, other than null, to be used when the column has continuous values? We cannot apply a text value of “unknown” to a numeric field. The
only way to deal with this effectively is to prevent such a record from getting into the warehouse. So we reject them. Rejecting records means that they are not allowed into the warehouse in their current state. There are three main approaches to the rejection of records:
1. Permanent rejection. In this situation, the record is simply thrown away and never gets into the warehouse. The obvious problem with this solution is that it trades simplicity of operation against accuracy of information. Where the volumes are tiny in proportion to the whole, this might be acceptable, but extreme care should be exercised. This type of solution can be applied only to behavioral records, such as sales records, telephone calls, etc. It should not be applied to records relating to circumstances. For instance, if we were to reject a customer record permanently because the address was missing, then every subsequent sale to that customer would also have to be rejected because there would be no customer master record to match it up to.
2. Reject for fixing and resubmission. With this approach, the offending records are sidelined into a separate file that is subjected to some remedial processing. Once the problems have been fixed, the records are reentered into the warehouse. The simplest way to do this is to merge the corrected records with the next batch of records to be collected from the operational system so that they get automatically revalidated as before. This sounds like a neat solution, and it is. Beware, however. Firstly, if we take this approach, it means that we have to build a set of processes, almost a subsystem, to handle the rejections, allow them to be modified somehow, and place them back in the queue. This system should have some kind of audit control so that any changes made to data can be traced, as there is a clear security implication here. The mechanism we might adopt to correct the records could be a nice windows-style graphical user interface (GUI) system where each record is processed sequentially by one or more operators.

Sometimes the rejection is caused by an enhancement or upgrade to the source system that ever so slightly changes the records that are received by the data warehouse. When this happens, two things occur. First, we do not see the odd rejection; we see thousands or millions of rejections. Second, if the format of the records has changed, they are unlikely to fit into the neat screen layouts that we have designed to correct them. The reject file then contains some genuine rejections and a whole pile of garbage. This situation really requires a restore and a rerun. Restores and reruns are problematic in data warehousing. We will be covering this subject later in the chapter.
3. Reject with automatic resubmission. This is a kind of variation on the previous type of rejection. Sometimes, simply resubmitting the offending record is sufficient. An example of this occurs frequently in mobile telecommunication applications. The customer Lucie Jones goes into her
local telephone store, orders a new service, and 15 minutes later leaves the store with her new possession. Almost immediately the telephone is activated for use and Lucie is able to make calls. That evening, all the calls that Lucie has made will be processed through the mediation system through to billing and will then be transferred to the data warehouse. However, Lucie's circumstances (customer details), name, address, billing information, etc., are unlikely to find their way to the data warehouse in time to match up to her telephone calls. The missing data, in this example, are Lucie's customer details. The result is that Lucie's calls will be rejected by the warehouse as it does not have a customer with whom to match them up. But we don't want them to be passed to some manual remedial process to correct them. What we need is for the calls to be recycled and re-presented to the validation process until such time as Lucie's master customer record turns up in the data warehouse. When this happens, the calls will be accepted without further intervention. Clearly there has to be a limit to the number of cycles so that records that will never become valid can be dealt with in another way.

Each of the three approaches to the rejection of records can be adopted or not, depending on what is best for the business. In some organizations, all three approaches will be used in different parts of the application. The important point to note is that, although the rejection of records seems like an easy solution to the problem of missing data, a considerable amount of processing has to be built around the subsequent handling of those records.

Erroneous Data. Whereas missing data is easy to spot, data that is present but that is simply wrong can sometimes be harder to detect and deal with. Again, there are several types of erroneous data:

Values out of valid range. These are usually easy to recognize. This type of error occurs when, say, the sex of a customer, which should have a domain of “F” or “M,” contains a different value, such as “X.” It can also apply to numerical values such as age. Valid ages might be defined as being between 18 and 100 years. Where an age falls outside of this range, it is deemed to be in error.

Referential errors. This is any error that violates some referential integrity constraint. A record is presented with a foreign key that does not match up to any master record. This is similar to the problem we described in the previous section. This could occur if, say, a customer record were to be deleted and subsequent transaction records have nothing to match up to. This kind of problem should not be encountered if the guidelines on deletions, which were described in Chapter 6, are followed. However, this does not mean that referential integrity errors will not occur; they will. How they are dealt with is a matter of internal policy, but it is strongly recommended that records containing referential errors are not allowed into the warehouse because, once in, they are hard to find and will be a major cause of inconsistency in results.
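A minimal sketch of how such records can be kept out and recycled, assuming the incoming behavioral rows are landed in a staging table (Sales_Stage and Sales_Recycle are hypothetical names, and the column names are illustrative):

-- Park the rows whose customer master record has not yet arrived
Insert into Sales_Recycle
Select s.*
From   Sales_Stage s
Where  not exists
       (Select 1 From Customer c Where c.CustomerID = s.CustomerID);

-- Load only the rows that pass the referential check; the parked rows are
-- merged into a later batch and revalidated until they match or time out
Insert into Sales
Select s.*
From   Sales_Stage s
Where  exists
       (Select 1 From Customer c Where c.CustomerID = s.CustomerID);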
As an example, look at the slightly modified customer table in Table 7.6.
Table 7.6. Customer Table Fragment Containing Modified Hobby Codes

Customer ID    Customer Name       Hobby Code
12LJ49         Lucie Jones         MR
07EK23         Emma Kaye           OP
34GH54         Ginnie Harris       GF
45MW22         Michael Winfield    GF
21HL11         Hannah Lowe         OP
Now, instead of having hobby codes of “UN” (unknown), Emma and Hannah have codes relating to opera as their hobby. A simple query to group customers by hobby is as follows:
Select   HobbyDescription, Count(*)
From     Customer C, Hobby H
Where    C.HobbyCode = H.HobbyCode
Group by HobbyDescription

Table 7.7 is the result.
Table 7.7. Result Set With Missing Entries

Hobby Description    Count(*)
Golf                 2
Motor racing         1
Emma and Hannah have disappeared from the results. What may be worse is that all of their behavioral data will also be missing from any query that joins the customer table to a fact table. The best way to deal with this type of referential integrity error is to reject the records for remedial processing and resubmission.

Valid errors. These are errors in the data that are almost impossible to detect: values that have simply been entered incorrectly but that are still valid. For instance, a customer's age might be recorded as 26 when it should be 62. Typically, errors like this are made in the operational systems that supply the data warehouse, and it is in the operational systems that the corrections should be made. In many cases, corrections are ad hoc, relying on sharp-eyed operators or account managers who know their customers and can sometimes, therefore, spot the errors.
We can sometimes take a more proactive approach by, periodically, sending data sheets to our customers that contain their latest details, together with an incentive for them to return the information with any corrections.

Out-of-Date Data. Out-of-date information in a data warehouse is generally the result of too low a frequency of synchronization of customers' changing circumstances between the operational systems and the warehouse. An example is where a customer's change of address has been recorded in the operational system but not yet recorded in the data warehouse. This problem was covered in some detail in Chapter 4. The conclusion was that we should try to receive changes from the operational systems into the warehouse as frequently as possible.

Inconsistent Data. Inconsistent data in the data warehouse is a result of dependencies, or causal changes, of which the operational systems may be ignorant. For instance, we may be maintaining a derived regional segment based on customer addresses. This segmentation is held only within the data warehouse and is not known by the operational system. When a customer moves from one address to another, eventually the data warehouse has to be informed about this and will, routinely, make the appropriate adjustment to the customer's circumstances. The regional segment, and any other address-dependent segments, must also be updated. This is a change that is solely under the control of the data warehouse and its management and systems. This can be quite a headache because the segments are dynamic. New ones appear and old ones disappear. The causal effect of changes to circumstances, on the various segments that are dependent on them, is a hard nut to crack.

Misleading Data. Misleading data can occur when apparent retrospective changes are made and the representation of time is not correct. This was explained in Chapter 4. An example is where a change of address occurs and, due to the incorrect application of retrospection, the new address is erroneously applied to historical behavioral data. Oddly enough, the reverse situation also applies. When describing so-called valid errors, the example was where the customer's age was recorded as 26 but should have been 62. If we had classified the age of the customer as having true retrospection, then any correction might have been implemented such that, at a particular point in time, the customer's age jumped from 26 years to 62 years. In this case, it is the “proper” use of retrospection that causes the data to be misleading.

In order to provide a flexible approach to validation, we should build our validation rules, as far as is practical, into our metadata model. An example of a simple metadata model for validation is shown in Figure 7.2.
Figure 7.2. Metadata model for validation.
The attributes of the validation metadata tables are as follows:

Attribute Table
  Attribute ID
  Table name
  Attribute name
  Data type
  Retrospection

Attribute Default Table
  Attribute ID
  Default value

Attribute Range Table
  Attribute ID
  Lowest value
  Highest value

Attribute Value Set Table
  Attribute ID
  Value

The data that is held in the default value, lowest value, and highest value columns is fairly obvious. It is worth noting that, in order to make it implementable, all values, including numerics, must be held as character data types in the metadata tables. So the DDL for the default value table would be:
Create table Attribute_Default ( Attribute_ID number (6), Default_Value Varchar(256))
An example of some of the entries might be:

Attribute Table
  Attribute ID         212           476
  Table name           Customer      Customer
  Attribute name       Occupation    Salary
  Target data type     Char          Num
  Retrospection        False         True

Attribute Default Table
  Attribute ID         212           476
  Default value        "Unknown"     "15,000"
This shows that the default value, although numeric, is stored in character format. Obviously in the customer table the salaries will be stored as numerics, and so some format conversion, determined by examination of the “target data type” attribute, will have to be performed whenever the default value is applied. Lastly, the attribute value set is just a list of valid values that the attribute may adopt.
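As an illustration of how these tables can actively drive validation, a query along the following lines could flag staged salary values that fall outside the range recorded for attribute 476; the staging table name and the Oracle-style to_number conversion of the character-format metadata values are assumptions:

-- Report staged values that violate the range held in the metadata
Select s.CustomerID, s.Salary
From   Customer_Stage s, Attribute_Range r
Where  r.Attribute_ID = 476
And    (s.Salary < to_number(r.Lowest_Value)
    or  s.Salary > to_number(r.Highest_Value))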
Integration
The next part of the VIM processing is the integration layer. The integration layer ensures that the format of the data is correct. Format integrity means that, for instance, dates all look the same (e.g., yyyymmdd format), or maybe product codes from some systems might need to be converted to comply with a standard view of products. For many attributes there may be no problem. Where data formats are standard throughout an organization, there is little need to manipulate and transform the data so long as the standard is itself a form that enables data warehouse types of queries to be performed. In order to make the integration as flexible as possible, as much as possible should be implemented in metadata. So we need to extend our metadata model so that it can be used for integration. The extensions are shown in Figure 7.3.
Figure 7.3. Integration layer.
The attributes of the integration metadata tables are as follows:

Attribute Table
  Attribute ID
  Table name
  Attribute name
  Data type
  Data format
  Business description

Source Attribute Table
  Source attribute ID
  Source attribute name
  Data type
  Data format
  Business description

Transformation Table
  Transformation process ID
  Transformation process name
  Transformation process type
  Transformation description

Attribute Transformation Table
  Attribute ID
  Source attribute ID
  Transformation process ID

The attribute table has been extended to accommodate the additional requirements of the integration
part of the VIM processing. Of course, there are many more attributes that could be included, but this approach is quite flexible and should be able to accommodate most requirements. These tables can form an active part of the VIM process. As an example, the attribute table would contain an entry for the sex of the customer. The source system might not contain such an attribute, but we might be able to derive the sex from the customers' title (“Mr.” translates to “Male” while “Mrs.,” “Ms,” and “Miss” translate to “Female”). The source attribute would be the title of the customer in the customer database. The transformation process held in the transformation table would be the routine that made the translation from title to sex. Then the attribute transformation table brings these three things together. The VIM process can now determine:
1. Source attribute (title from customer data)
2. Process to transform it
3. Target attribute (customer's sex)

The great attraction of this approach is that it is entirely data driven. We can alter any of the three components without affecting the other two.
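If the transformation process were coded in SQL, one possible rendering of the title-to-sex routine is a standard CASE expression; the staging table and column names are illustrative:

-- Derive the customer's sex from the title held in the source data
Select CustomerID,
       Case Title
            When 'Mr.'  then 'Male'
            When 'Mrs.' then 'Female'
            When 'Ms'   then 'Female'
            When 'Miss' then 'Female'
            else 'Unknown'             -- fall back to the default value
       End as Sex
From   Customer_Stage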
Mapping
The mapping layer is a data-driven process that takes source data and maps it to the target format. The purpose of this approach is to ease the transition when we are required, for instance, to replace one source system with another or to accept altered data feeds. The aim is to develop a formal mapping layer that makes it easy to implement changes in sources where no additional functionality is being added (e.g., as a result of source system upgrades). The changes to the metadata model to accommodate this are quite small. We can simply add a “source” entity to the previous model as shown in Figure 7.4.
Figure 7.4. Additions to the metadata model to include the source mapping layer.
The attributes for the source are as follows:

Data Source Table
  Source ID
  Source name
  Source application
  File type
  Interface format
  Source platform
  Extraction process
  Extraction frequency
  Business description

Source Attribute Table
  Source ID
  Source attribute ID
  Source attribute name
  Data type
  Data format
  Business description

The complete metadata model for the VIM process is shown in Figure 7.5.
Figure 7.5. Metadata model for the VIM layer.
Margin of Error

In Chapter 4, we said our users had no real way of knowing whether their query results were adversely affected by the inadequate representation of time. In this section we'll look at a margin of error calculation that could be provided as a standard facility for data warehouses. The proposal is that information regarding retrospection and time lags is held in a metadata table so that any query can be assessed as to the likely margin of error. There are two main components to the margin of error:
1. The degree to which the dimensions are subject to change. In order to be entirely accurate, the problem must be approached on a per attribute basis. Different attributes will
change at differing rates, and so it is not sensible to view the likelihood of change at the entity level.
2. The transaction time1 to transaction time2 lag. This component covers the error caused by changes that have been recorded in the operational system but not yet captured into the data warehouse. Transaction time1 is the time when the data is recorded in the operational system. Transaction time2 is the time when the change is notified to the data warehouse.

The first of these two can be eliminated if the requirement for retrospection, for the attribute, has been implemented properly. The second can be eliminated only if the valid time to transaction time lag is eliminated. The period of the query will also affect the accuracy. If 10 percent of customers change their address each year, then the longer the period, the greater the potential for error. The following formula can be used to calculate the percentage error for an attribute in the data warehouse:
Error% = C × ((T × R) + (L ÷ 365))

where:

T is the time period, in years, covered by the query.

C is the percentage change per annum. So, if the attribute is an address, and 10 percent of people change their address each year, then C is 10.

R is the value for retrospection. It has a value of 0 (zero) where true retrospection has been implemented, or 1 (one) where true retrospection has not been implemented.

L is the average time lag, in days, between transaction time1 and transaction time2.

The total percentage margin of error in a query is calculated by summing the individual attribute percentages. It is important to understand what the error% actually provides. It is the percentage of rows accessed during the query that may be in error. It is not a percentage that should be applied to the monetary or quantitative values returned by the query.

This method is implemented by the creation of a metadata table called Error_Stats that has the following attributes:

Table name
Column name
Time lag
Retrospection
Change percent

The table is then populated with entries for each attribute, as Table 7.8 shows.
Table 7.8. Example Error_Stats Table

Table Name    Column Name      Time Lag    Retrospection    Change Percent
Customer      Address          30          0                10
Customer      Sales_Area       30          0                10
Wine          Supplier_Code    7           1                20
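A minimal sketch of the DDL for the Error_Stats table, following the column list above (the data types and sizes are assumptions):

Create table Error_Stats (
    Table_Name     varchar(30),
    Column_Name    varchar(30),
    Time_Lag       number(4),      -- days between transaction time1 and time2
    Retrospection  number(1),      -- 0 = true retrospection implemented, 1 = not
    Change_Pct     number(5,2))    -- percentage change per annum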
If it is required to estimate the margin of error that should be applied to a query involving “Address” and “Supplier_code” columns, the following query could be executed against the “Error Stats” metadata table:
select sum(change_pct*((1.5*retrospection) + (time_lag/365))) "Error Pct"
from   error_stats
where  (table_name='Customer' and column_name='Address')
or     (table_name='Wine' and column_name='Supplier_Code')

The “1.5” means that the query will embrace a period of one and a half years. The query returns the result in Table 7.9.
Table 7.9. Example Margin of Error

Error Pct
31.21

Table 7.9 indicates a 31 percent margin of error. Most of this is due to the fact that true retrospection has not been implemented on the supplier code, which, having a 20 percent change factor, is quite a volatile attribute.
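As a check on the arithmetic, the two attributes contribute to that figure as follows (with T = 1.5 years):

Address:        10 × ((1.5 × 0) + (30 ÷ 365)) ≈ 0.822 percent
Supplier_Code:  20 × ((1.5 × 1) + (7 ÷ 365))  ≈ 30.384 percent
Total                                         ≈ 31.21 percent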
It is envisaged that a software application would be developed so that users of the data warehouse would not have to code the SQL statement shown above. All that would be needed is to select an option called, perhaps, “Margin of Error” after having selected the information they want to retrieve. The software application would extract the columns being selected and determine the time period involved. It would then build the above query and inform the user of the estimated margin of error.

This is believed to be the first attempt at enabling users of data warehouses to assess the accuracy of the results they obtain. There is a limit to the degree of accuracy that can be achieved, and it must be recognized that “valid” times are often impossible to record as they are often not controlled by the organization but are under the control of external parties such as customers and suppliers. In order to minimize inaccuracy, the period of time between valid time and transaction time1, and between transaction time1 and transaction time2, should be as short as possible.

The possible range of results of the formula needs some explaining. When using the formula, a result of much greater than a 100 percent margin of error can be returned. This might sound daft, but actually it's not. Let's reexamine the example, used before in this book ad nauseam, that 10 percent of people move house each year. After one year, 10 percent of our customer records are likely to have changed. This means that if we haven't implemented true retrospection, then 10 percent of the rows returned in queries that use the address can be regarded as being of questionable accuracy. After 10 years, the figure rises to 100 percent. But what happens after 11 years, or 15 years, or more? It is fair to say that any result returned by the formula that indicates a margin of error of more than, say, 15 percent should be regarded as risky. Anything beyond 30 percent is just plain gobbledygook.
Data Pool

The data pool can be regarded as the physical manifestation of the GCM. Obviously, the physical model may be very similar or very dissimilar to the original conceptual model. Many compromises may have to be made depending on the underlying DBMS and the need to balance the business requirements with considerations regarding performance and usability. For the purposes of this book, we will assume that the DBMS will be relational and we will keep the general shape of the GCM as far as possible.

As previously stated, the data pool is independent of any decision support application. It is held at the lowest level of granularity that the organization can support. Ideally, behavioral data should be stored at transaction level rather than having been summarized before storing. Data can be allowed into the pool only after it has passed through the VIM processing just described. In order to describe the data pool fully, we will construct it using the Wine Club example. Firstly, let's look at the customers' fixed (retrospection is false or permanent) circumstances, shown in Figure 7.6.
Figure 7.6. Customer nonchanging details.
The attributes associated with this are:

Customer Table
  Title
  Name
  Date of birth
  Sex
  Telephone number

Next, to the customer data in Figure 7.6 we add the design for the changing (retrospection is true) circumstances. Not surprisingly, this is the same diagram as we produced in Chapter 3 (Figure 3.13). The result is shown in Figure 7.7.
Figure 7.7. The changing circumstances part of the GCM.
The attributes of the changing circumstances tables are as follows:
Address Table
  Customer ID
  Address
  Start date
  End date

Marital Status Table
  Customer ID
  Marital status
  Start date
  End date

Child Table
  Customer ID
  Child name
  Child date of birth

Spouse Table
  Customer ID
  Spouse name
  Spouse date of birth

Income Table
  Customer ID
  Income
  Start date
  End date

Hobby Table
  Customer ID
  Hobby
  Start date
  End date

Profession Table
  Customer ID
  Profession
  Start date
  End date

Employer Table
  Customer ID
  Employer name
  Employer address
  Employer business category
  Start date
  End date
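As a minimal sketch, and assuming simple data types, the address table could be created along the following lines; the remaining changing-circumstances tables follow the same pattern:

Create table Address (
    Customer_ID  varchar(6),
    Address      varchar(256),
    Start_Date   date,
    End_Date     date)        -- a null end date marks the current address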
Next, we assemble the behavioral model. In the Wine Club, we just want:
1. Sales of wine
2. Sales of accessories
3. Trips taken

The model for this is as shown in Figure 7.8.
Figure 7.8. Behavioral model for the Wine Club.
Finally, we need to design the last part of the model, the derived segments. The model for this is shown in Figure 7.9.
Figure 7.9. Data model for derived segments.
The attributes for these tables, together with examples, are shown in Table 7.10.
Table 7.10. Example of Segmentation

Market Segment Type Table
  Market segment type ID           LT217
  Market segment type description  "Income Segment"

Market Segment Table
  Market segment type ID           LT217              LT217
  Market segment ID                04                 05
  Market segment description       "Middle Earner"    "High Earner"

Segment Attribute Values Table
  Market segment type ID           LT217              LT217
  Market segment ID                04                 05
  Table                            "Customer Salary"  "Customer Salary"
  Attribute                        "Salary"           "Salary"
  Attribute low value              30,001             50,001
  Attribute high value             50,000             75,000

Customer Segment Table
  Customer ID                      12LJ49             12LJ49
  Market segment type ID           LT217              LT217
  Market segment ID                04                 05
  Start date                       21 Jun 1999        1 Mar 2001
  End date                         28 Feb 2001        null
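A sketch of how the income-based segments might be derived from these tables; the column names follow Table 7.10, the date handling is deliberately simplified, and the to_number conversion assumes the attribute values are held as character data:

-- Place each customer into the income segment whose range covers the
-- customer's current income
Insert into Customer_Segment
       (Customer_ID, Market_Segment_Type_ID, Market_Segment_ID, Start_Date, End_Date)
Select i.Customer_ID, v.Market_Segment_Type_ID, v.Market_Segment_ID,
       current_date, null
From   Income i, Segment_Attribute_Values v
Where  v.Market_Segment_Type_ID = 'LT217'
And    i.End_Date is null                  -- current income row only
And    i.Income between to_number(v.Attribute_Low_Value)
                    and to_number(v.Attribute_High_Value)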
The example shows Lucie Jones's entries in the customer segment table as moving from a middle earner to a high earner. You can see, from Figure 7.9 and the example in Table 7.10, that a market segment can be made up of many segment attribute values. The attribute values can be drawn from any table that contains customer information. Therefore, the customer segment table itself could be referred to in the definition of a market segment. This means that Lucie's improving status, as shown by her moving from “04” to “05,” might itself be used in positioning her in other segments, such as “Customer Lifetime Value.”

So, now we have built all the main components of the Wine Club's data pool. Firstly there is the customer, the single entity shown in Figure 7.6. Then we have the customers' changing circumstances in Figure 7.7. Next, from Figure 7.8 we add the customer behavior and, finally, we have a neat way of recording customers' derived segments, and this is shown in Figure 7.9. These four components, when fitted together, are the Wine Club's physical representation of the GCM that we have been working with throughout this book.

Of course there is more work to be done. For instance, in the behavioral model we have focused only on the customer. For each of the dimensional models we have created, we would expect to see many more dimensions than we have drawn. Each of the dimensions would be subjected to the same
rigorous analysis as the customer information part. Even so, we have achieved a lot. Consider the problems we have had to tackle. First, the requirement to support customer circumstances as well as behavior so that we can do sensible predictive modeling for things like customer churn. This model does that. Then there were the huge problems brought about by poor representation of time. This model overcomes that too; it can handle all three types of temporal query. The behavioral stuff is still in the form of dimensional models, but overall, the model can truly be described as customer centric. Now that we have dealt with the data architecture for the pool, we'll briefly mention the applications.
CRM APPLICATIONS

As informational applications are developed, the main source of data will be the data pool. The application will draw data from the pool, either to support queries directly or to store within the application's own database, perhaps in the form of aggregates. In addition, the application will, most likely, use its own data. This might be internal metadata or externally provided agency data such as geographic or demographic information. The applications will have their own processing cycles, completely independent from those of the data pool. This includes processes for:

1. Data capture
2. Application-specific transformations
3. Loading
4. End-user access
5. Backup

This separation of the data pool from the applications means that new applications can be added at any time after the data pool is established. It makes the information architecture almost infinitely scalable, as each application will, most likely, store its own additional data in its own server. It also has the advantage that, in most cases, historical data will already exist in the pool for newly implemented applications.

It is likely that most applications will be purchased as packaged systems. There are many campaign management, personalization, data mining, and general analysis products on the market. It is generally not a good idea to reinvent these products and rebuild them from the ground up. Even if they don't provide 100 percent of the functionality that is needed, they are usually still a better bet than bespoke development. With all such products, the data needed to drive them is expected to be provided from what they usually term a data mart. With our architecture, these systems will invariably draw their data from the data pool, so the pool's architecture must be able to support such systems and give them the information they need. We won't go into the applications in any detail here. These are more fully described in Chapter 10.
BACKUP OF THE DATA

Partitions should be used for the behavioral information. This is usually, by far, the largest part of the data warehouse and is the most problematic to back up because of its sheer size. By placing the behavioral information into partitions based on dates, we can switch old partitions into a read-only mode. This means that no further data is expected to be placed into the partition and, therefore, once it is backed up, no further backups will be needed. Figure 7.10 shows the gathering of data into daily partitions.

Figure 7.10. Daily partitions.
Figure 7.10 shows a typical take-on of behavioral information into the warehouse using a partitioned approach. Each day has its own partition within the same physical table. Day (0) represents the latest data to be collected. Usually, this is yesterday's data. We cannot expect all of yesterday's data to be available to us today. Delays in the transmission of data are inevitable. After three days in our example, however, all the data has been collected. Once a partition reaches the point at which all the data is deemed to have been received, it can be switched to read-only mode, which means that it is not expected to change again. At this point it can be backed up for the last time. The backup strategy for behavioral data, in this example, is to make copies of Day (0) through to Day (-3) each day. Days (-4) backward are not required to be saved further.

Backup of circumstances data is slightly different. In most cases, the approach to be taken is, periodically, a full backup of the circumstances and, daily, a backup of the changes. Usually, the best way to back up changes is to copy the RDBMS transaction log files.
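As a sketch of how the daily partitions might be declared, Oracle-style range partitioning is shown below; the syntax varies from one RDBMS to another, and the table, partition, and tablespace names are illustrative:

-- One partition per day, each in its own tablespace
Create table Sales_Behavior (
    Customer_ID  varchar(6),
    Sale_Date    date,
    Sale_Value   number(10,2))
Partition by range (Sale_Date) (
    Partition p_20010301 values less than (to_date('2001-03-02','YYYY-MM-DD'))
        tablespace ts_20010301,
    Partition p_20010302 values less than (to_date('2001-03-03','YYYY-MM-DD'))
        tablespace ts_20010302);

-- Once a day's partition is complete and has had its final backup, its
-- tablespace can be switched to read-only
Alter tablespace ts_20010301 read only;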
ARCHIVAL

The partition approach to the storage of behavioral data makes archival very straightforward. However many days of history we wish to store, we can just drop the partitions when they reach the required age. So if the behavioral data table should contain 120 days of data, we can simply drop day (-120), as part of our daily warehouse management process, and store it permanently in the archive in case we ever have to retrieve it. Then we have to create the new partition for today's data.

We don't usually want to archive our circumstances data. In terms of storage it represents a fairly small percentage of the total disk space and is not, therefore, much of an overhead. The biggest change to customer circumstances is usually address changes. If 10 percent of your customers change their address each year and you have five million customers, that's half a million new records, assuming you've implemented true retrospection for customers' addresses. That's a drop in the ocean compared to the number of behavioral records being managed.
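Continuing the same sketch (again Oracle-style syntax, with illustrative names), the daily warehouse management process might age out the oldest partition, after copying it to the archive, and add a new one for today's data:

-- Drop the partition that has passed the retention period
Alter table Sales_Behavior drop partition p_20001101;

-- Add the partition that will receive today's data
Alter table Sales_Behavior add partition p_20010303
    values less than (to_date('2001-03-04','YYYY-MM-DD'));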
EXTRACTION AND LOAD

Some of the problems associated with the extraction of data have already been described in this book. There are tools that might help with the extraction process, and the various types are described and discussed in Chapter 10. There are two types of data load to be considered when designing a data warehouse:
1. Initial load
2. Incremental loads

Initial loading generally relates to circumstances rather than behavior. It includes the creation of customers, products, promotions, and geographic data such as branches and regions. The only time we need to consider the initial loading of behavior is in cases where there is some historical behavioral data to be loaded. Initial loading can be considered as a “one-off” exercise and does not have to be conducted as rigorously as the continuing incremental loads. If the initial load does not work for any reason, it is a relatively simple matter to drop the whole lot and start over. We cannot take this approach with incremental loads, since dropping the previous data would then entail rebuilding the entire database.

Speed in loading data is a critical element in the management of a data warehouse. We are usually presented with a minimal time window within which to complete the process. This time window is invariably overnight. Also, we almost always have to obtain the data from the operational systems, and this means waiting until their overnight batch processing has completed. In reality, this means that we can consider ourselves lucky if we have more than four hours in which to extract the data, carry out the VIM processing, and then load the data. Bear in mind that loading data into the warehouse might not be the only loading that has to be done. We often have to load data out of the warehouse into the applications that use it. We may also have to build summary tables to enhance the performance of some of the applications. All these things are resource intensive. There are some things we can do about the speed:

Sorting. It is often a good idea to load the data in the same sequence as the primary key of the target table. This will significantly aid the ultimate performance of the system. It is usually 10 times quicker to sort the data outside the database using a high-performance sort utility.

Direct Loading. All RDBMS products have a utility that enables data to be bulk loaded, and it is always the fastest way to get data into the warehouse. The only issue is that these utilities still tend to use SQL “insert” commands to load the data. Some RDBMS products allow for what is known as direct loading. This effectively bypasses
SQL and allows the data to be dumped into the database files directly. Again, this can be an order of magnitude faster than standard inserts.

Index building. Best loading performance, on large volumes, is achieved without indexes. The indexes can be built after the data has been loaded. If we opt to use direct loading (above), then we can only do this with indexes switched off.

No recovery. This is another bulk-load performance-enhancing feature. What this does is switch off the transaction logging function. Transaction logging is very important as an aid to restart and recovery in transaction processing systems. It is not really much use in bulk loading, and it is a significant overhead in performance terms.

Another serious issue about which there is little information is a subject known as backout. Usually, as far as transaction data is concerned, the data is passed to the warehouse from a source application system. As an example, let's assume this happens on a daily basis. The data will probably appear as an extracted file in a certain directory or other place within the system. The data warehouse data collection process will scoop up the file and copy it into the warehouse environment for further processing (e.g., VIM) and subsequent loading. Once loaded, the data will find its way into all areas of the CRM system, into summaries, the campaign management system, the derived segments, etc. It is not unknown that, after one or two days have passed, sometimes more, someone notices that the data was, in some way, invalid. This could be that:
1. It was the previous day's data that had not been cleared down properly.

2. Someone allowed the wrong data into the operational system. This data filtered into the data warehouse.
3. Some important process or set of processes had not been applied to the data, or had been applied twice.
4. The wrong pricing tables had been used.

Any number of operational errors can occur that might not be detected before the data is loaded into the warehouse. The operational systems can often be corrected by restoring and rerunning. Whether this is possible in the data warehouse depends on scale and complexity. If we can't simply restore and rerun, then the data, in effect, needs to be “unpicked” from all the places in which it is stored. One way to do this is to attach some kind of production run-number tag to each record at the lowest level of granularity within the system. This means that we can reach into the tables and simply delete all the rows with the offending production run-number tag. Summary tables are more of a problem. By definition, a single summary record contains data from many records of lower grain. For summaries, there are two main solutions:
1. Rebuild the summary. This should be possible so long as the summary table contains no more history than the detail. It is quite common in data warehousing to adopt a data strategy whereby we keep, say, six months of detail data and three years of summary data. Clearly we
cannot rebuild a three-year summary table in this example.
2. Reverse out the offending values. This involves taking the original file that was in error and reversing the values (i.e., changing the sign). This has to be done to all monetary values, quantities, and counts. Then the changed file is reapplied to the same summary-level tables as before. So we are, in effect, subtracting the incorrect values out of the summaries.

Incidentally, there is a quick way of removing duplicated rows from a table in a relational database. Have a look at the diagram of a sales table in Figure 7.11. Figure 7.11 shows that a file of data (File 1) was inserted into the table Sales 1 and, subsequently, data from File 2 was also inserted. Then File 1 was erroneously inserted into the table again, as shown in Sales 2. In reality, several files may have been entered over several days, and simply dropping the table and rebuilding it might be an unattractive option. It is possible to write an SQL procedure with a fairly complex “where exists” clause, but the simplest approach is to make use of the fact that relational theory is closely related to “set” theory and use “the power of the union.”
Figure 7.11. Duplicated input.
Just create a temporary table (called “Temp”) with the same structure as the original sales table and execute the following:
Insert into temp
(Select * from Sales
 Union
 Select * from Sales)

This simple query will result in the removal of all duplicates as shown in Sales 3 in Figure 7.11.
For those readers who are very familiar with the capabilities of SQL, this might seem like a trivial example, but it has been used with great effect in solving real-life problems.
SUMMARY

To summarize, the EASI data architecture is founded on the original evolvability, availability, sharability, and integrity principles that formed the basis of the database approach. The purpose is to make data available to any application that requires access to it. The benefits are seen as follows:

It enables an organization to take a strategic approach to building informational systems. The strategic approach requires that the informational systems are built around business imperatives and not around tactical, departmental requirements. If the organization's aspirations lead to an enterprise-wide information system, this method will entirely support that objective.

Although this is not an application-centric method, it still supports the incremental approach to developing informational systems. So the original need to see a measurable return on investment with each increment can be achieved with this approach.

Having all the applications derive their data from a single source (the data pool) means that a consistent view of the enterprise is available to all users of the information.

Adding new applications is a relatively straightforward exercise.

Keeping the data separate from the applications, and at a low level of granularity, makes it far easier for the information system to support the business during inevitable changes of direction.

Historical data can be made available to new applications immediately, rather than having to build the history after the application is implemented. This means that the applications can begin to add value more quickly.
Chapter 8. Business Justification

In this chapter we explore ways to make a business case for the building of the CRM data warehouse. It is often said that it is not possible to cost-justify a data warehouse because we cannot tell what value the warehouse will bring. This is nonsense. The advice we should be giving to our customers and sponsors must be that, if you cannot justify the building of the data warehouse, then don't do it. Data warehouses are very expensive investments, just like new factories. We would not expect our bosses to sanction the building of a new factory simply because the manufacturing people thought it was a good idea, and we should take precisely the same attitude with a data warehouse.
THE INCREMENTAL APPROACH

There are two main approaches to building a data warehouse. The first is to take the traditional waterfall-type route that was touched upon briefly in Chapter 1. The idea is that we gather the user requirements, using a standard methodology, then we go away and build the warehouse and, when it is complete, we deliver it to the users and move on to our next project. The shortcomings of this technique are well known, and there are special problems that occur in data warehouse projects where the waterfall approach is used. This is discussed in the next chapter.

The second approach is more akin to rapid application development (RAD) and involves piecemeal development. In data warehousing, this approach is often referred to as “incremental.” Because data warehouses are different from operational systems, it is entirely feasible to adopt an incremental approach to development and implementation. If we were developing, say, an order processing system, breaking the system up into smaller pieces is quite problematic. Let's say our order processing system has the following major modules:
1. Order capture
2. Stock allocation
3. Back-order generation
4. Stock picking
5. Packing and shipping
6. Delivery note processing
7. Invoicing

It is difficult to see how we might build, say, the order capture module and release it to the users, because there is nothing in place that will handle the captured orders. Similarly, there is little point in implementing back-order processing because it is dependent on the module that checks the stock figures. The problem is that all the modules of the system are so interdependent that it makes an incremental approach to implementation an impractical proposition. We do not have this problem in data warehousing. The fact that we can build and release parts of the warehouse is another aspect that sets data warehousing apart from other systems. We can use this special feature of data warehouses to powerful advantage in building the business case because it enables the following:
Pilot implementation. This is a great way to start out in data warehousing. Developing a pilot enables businesses to “dip a toe in the water” in the sense that we can sample a data warehouse and start to get some value without making a commitment to spend vast sums of money. Once the organization can see some value coming out of it, it will be more sympathetic to the next increment.

Quick wins. Quick wins are sometimes described as “low hanging fruit.” This is where we can identify something that is relatively easy to produce but that has a very high value to the business. In a CRM data warehouse, it might be the ability to create market segments for customers and provide lists for targeted campaigns.

Prioritization. An incremental development enables us to prioritize the deliverables based on a tradeoff between business value and ease of delivery. We'll be expanding on this later on in the chapter.
Defining the Increments

If we're agreed that an incremental approach is the best way to proceed, then we need to determine just what the increments ought to be. In principle, the answer to this is fairly obvious: the increments need to be whatever the business needs them to be. In other words, the definition of what we are to build is determined by the business. If we followed an approach like the one described in Chapter 5 (Dot Modeling), then the way forward is not too hard. As a memory aid we developed a conceptual model based on the following three things:
1. Business goals
2. Business strategy
3. Business information

We know that this approach provides us with a clear indication as to the business requirements, and these were incorporated into the general conceptual model (GCM). We also know that, for the Wine Club at least, our EASI data architecture fully supports the GCM. The information that the business people will be drawing upon will ultimately be drawn from the data architecture, as this is what will ultimately be developed. So we are confident that what we propose to build will provide what the business needs, or believes that it needs.

What we now have to do is get the business people to do their bit. For each of the business goals, they have to be able to state the value to the business, in terms of dollars, of the increment should it be achieved. In other words: what is it worth? This is why we insist on the goals being measurable and time bounded. One of the goals that the Wine Club had was to reduce customer churn by 20 percent per year for the next three years. What the business people have to do now is figure out the value to the business of achieving that goal. Let us say that the churn rate is currently 15 percent of the customer base. That represents 15,000 of their 100,000 customers.
So if they achieve their targets, then they will persuade some 3,000 customers not to defect within the next 12 months. For the purposes of simplicity, let's assume that the customer churn occurs evenly throughout the year, and so the Wine Club will lose, on average, about six months of sales for each churned customer. If we further assume that an average customer's annual revenue is about $500, then the revenue to be saved in one year is $750,000 (3,000 × 50 percent × 500). If the bottom-line percentage of sales is, say, 10 percent, then such a saving means $75,000 extra profit in year one. This becomes $225,000 in the second year ($150,000 from the customers we retained in year one and $75,000 again for those we retained in year two). So that's what we get if we achieve the goal. However, we need to assess each of the business strategies that we were planning to adopt to see how the savings break down. Let's assume that the reduction of churn is to be achieved by three major new initiatives:

Loyalty bonuses. For customers who have been active for more than one year, we want to be able to reward them, on their birthday and wedding anniversary, where appropriate, with a bottle or two of their favorite wine.

Personalized campaigns. Once we have collected some information about customers' behavior, we want to be able to target them in campaigns with goods that we know will interest them.

Predictive modeling. We want to be able to determine which of our customers are susceptible to churning so that we can take some proactive steps to try to ensure that it does not happen.

These three initiatives taken together should enable the Wine Club to achieve the 20 percent reduction in churn on a year-on-year basis. The question that the business now has to address is: For each of the three initiatives, what portion of the 20 percent target will it achieve?

Be warned that the business people will not always be comfortable with answering this question. But they must answer nonetheless. Really they should be doing this sort of thing anyway, and it's amazing to think that, generally, it doesn't happen. The amount of money that gets spent on little more than a leap of faith is pretty astonishing. One way to counter their reluctance is to ask: If you don't quantify your expectations, then how will you know whether or not you've been successful? The problem is, of course, that people often don't want to put their signatures to claims that, in the future, they might be called to account for. Their fear is that a day of reckoning will arrive and their boss will haul them into her office to administer some appropriate punishment for poor estimating. This is an understandable reaction, and we have to do our best to convince them that their job is not on the line if they get it wrong and that, if they give us the figures based on their best estimates, then we, the consultants, will take “ownership” of the justification from then on. So we are the ones who get it in the
neck. Hey, that's what consultants are for!

Let us assume that we have overcome this seemingly insurmountable problem, and we can now say that the increased profit can be divided up as follows:

1. Loyalty bonuses          30 percent   ($22,500 year 1, $45,000 year 2 onward)
2. Personalized campaigns   20 percent   ($15,000 year 1, $30,000 year 2 onward)
3. Predictive modeling      50 percent   ($37,500 year 1, $75,000 year 2 onward)
This can be generalized with the simple (and crude) formula as follows:
Using the formula, we can calculate the benefits for as many years as our customer wants us to. Some organizations will want to project forward just two years, while others might want 5 or even 10 years.

For each of the business strategies, in our dot modeling workshops we asked the business people to tell us what were the main types of information that they needed in order to implement the strategy successfully. This was where they were actually developing their own information model using the dot notation. We can now see which of the information models the users have specified will bring about the biggest payback.

So far, we have looked only at one business goal, reducing customer churn. There are other goals, and we have to go through the same exercise with each of them. For instance, another goal of the Wine Club is to increase the customer base by 5 percent per annum (5,000 additional customers), the emphasis being on attracting the “right” type of customer. The strategies for this are to be as follows:
1. Customer profiling to try to ascertain the right types of customers to approach
2. Campaign management so that we can contact the right people with the right offer

Using the same figures as before, we can say that the types of customer that we are trying to attract are those who should spend more than average. Let us assume that the average expenditure for these customers is to be $700 per annum. However, as has been said previously, it is much more expensive (maybe 10 times) to attract a new customer than it is to retain an existing customer, and so we must make some adjustment for this. In this case, we might expect to have to pay $20 for each customer.
Again, assuming that we attract our new customers evenly throughout the year, the increased revenue will be $1.75 million (5,000 × 0.5 × $700). At 10 percent profit that will add $175,000 to our bottom line. From this, however, we have to subtract $100,000 for the cost of acquiring the customers, so the real net effect is $75,000 in the first year, rising to $425,000 in the second year ($350,000 from the customers we recruited in year one and another $75,000 from the year-two acquisitions). Splitting the value of these two strategies should be quite straightforward. We can't do the campaigns effectively without the customer profiling, and customer profiling, on its own, is useful but won't help us achieve the goal. So we arbitrarily split them on a 50/50 basis. Table 8.1 shows how much each strategy will be worth over a two-year period. One thing we're quite good at in IT is figuring out how much it will cost to build systems. Sometimes, of course, we get it spectacularly wrong, and those times when we do are well documented. Incidentally, we'll take a good look at estimating in the next chapter when we discuss project management.
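Not from the book: a minimal sketch of the acquisition arithmetic above, netting the cost of acquisition out of the first-year benefit. The $20-per-customer acquisition cost and the other figures come from the worked example; the variable names are illustrative assumptions.

    # Hypothetical sketch of the customer-acquisition benefit described above.
    new_customers = 5000          # additional customers recruited per year
    avg_revenue = 700             # average annual spend of a "right" customer
    margin = 0.10                 # bottom-line percentage of sales
    acquisition_cost = 20         # cost to acquire each new customer

    year1_profit = new_customers * avg_revenue * 0.5 * margin    # half a year of revenue
    year1_net = year1_profit - new_customers * acquisition_cost  # 175,000 - 100,000 = 75,000

    year2_from_year1_cohort = new_customers * avg_revenue * margin   # 350,000
    year2_net = year2_from_year1_cohort + year1_net                  # 425,000

    print(year1_net, year2_net)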
Table 8.1. Grouping Strategies

Goal                 Strategy                 Additional Profit
                                              Year 1      Year 2      Total
Reduce churn         Loyalty bonus            $22,500     $45,000     $67,500
                     Personalized campaign    15,000      30,000      45,000
                     Predictive modeling      37,500      75,000      112,500
Recruit customers    Customer profiling       37,500      212,500     250,000
                     Campaign management      37,500      212,500     250,000
What we have done here, however, is to work out the value of the various components of the data warehouse, and this is something that, hitherto, we have not usually tried to do. So instead of looking like some great black hole into which we will try to persuade our customer to throw money, we can now take a completely different tack and point to the potential lost opportunity if they don't invest. Having said that, we still have to figure out the cost. We have been building our incremental approach so far on the concept of business goals and business strategies. In doing so we've kind of taken the view that there will be a one-to-one relationship between business strategies and deliverable increments. That might sound OK in theory, but life's not really like that. We usually find that, in systems integration and development terms, there are lots of crossover points that, if we ignore them, will end up causing us a headache, with duplicated processes being built and lots of extra cost as a result. As an example, let us take another look at the five Wine Club strategies and also at the kind of data we need and the source systems that will provide the data (see Table 8.2). Don't forget that if we've properly completed the dot model pages, then the information should already be to hand.
We have to take a pragmatic view of the data sources and the increments and try to figure out how to give the business what it wants, when it wants it, without wasting development time and money. There are three things to consider when deciding in which order to develop the increments:
1. The business strategy. We have already discussed this. It's pretty clear that the business will tend to prioritize things based on the amount of profit they are expected to bring in.
2. The degree of difficulty and, therefore, relative cost. Sometimes, a high-priority business objective is very difficult to deliver quickly. In these cases it's a good idea to recommend that the first increment is something that is easier to deliver so that the business gets some real confidence about the value of information without investing too much of its cash.
Table 8.2. Correlating Strategies, Data, and Data Sources

Strategy                 Data Needed                            Source Systems
Loyalty bonus            Customer circumstances                 Customer admin
Personalized campaign    Customer circumstances                 Customer admin
                         Sales                                  Order processing
                         Trips made                             Trip bookings
Predictive modeling      Customer changing circumstances        Customer admin
Customer profiling       Customer circumstances and behavior    Customer admin, Order processing, Trip bookings
Campaign management      Bought-in lists                        External
3. The number of increments in which a data source is used. Where a data source is used in more than one increment (this is very common), it is sensible to capture the data from that source in a way that satisfies all the increments that use it. This may seem like common sense, and it is. However, it can help to influence the decision about what comes first. The customer admin system is used in four out of the five proposed increments. In a customer-centric system such as CRM this should not be surprising; some might even say that it's blindingly obvious. Nevertheless, building the business case has to be done in a reasonably scientific fashion, as the data will, ultimately, have to be presented to financial analysts for scrutiny. Let us assume that the total costs of the development and system integration have been calculated and are as shown in Table 8.3.
Table 8.3. Development Cost Summary

Source System        Cost (in dollars)
Customer admin       280,000
Order processing     320,000
Trip bookings        100,000
External             80,000
The cost shown in Table 8.3 is for fully integrated systems, covering both the initial load and fully automated incremental loading into the physical manifestation of the GCM. The costs, and their split between increments, are shown in Table 8.4.
Table 8.4. Breakdown of Development Cost

Strategy                 Source Systems       Development    Total Cost
Loyalty bonus            Customer admin       70,000         70,000
Personalized campaign    Customer admin       70,000
                         Order processing     160,000
                         Trip bookings        50,000         280,000
Predictive modeling      Customer admin       70,000         70,000
Customer profiling       Customer admin       70,000
                         Order processing     160,000
                         Trip bookings        50,000         280,000
Campaign management      External             80,000         80,000
Now that we have the additional profit and the cost, we can easily calculate an ROI for each of the strategies. This is shown in Table 8.5.
Table 8.5. Return on Investment Calculated by Business Strategy

Strategy                 Profit After Two Years ($)    Total Cost ($)    ROI After Two Years (%)
Loyalty bonus            67,500                        70,000            96.4
Personalized campaign    45,000                        280,000           16.1
Predictive modeling      112,500                       70,000            160.7
Customer profiling       250,000                       280,000           89.2
Campaign management      250,000                       80,000            312.5
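Not part of the original text: a minimal sketch of the ROI arithmetic behind Table 8.5, assuming ROI is simply two-year profit divided by total cost. The figures are taken from Tables 8.1 and 8.4.

    # Two-year profit and total cost per strategy (from Tables 8.1 and 8.4).
    strategies = {
        "Loyalty bonus":         (67_500, 70_000),
        "Personalized campaign": (45_000, 280_000),
        "Predictive modeling":   (112_500, 70_000),
        "Customer profiling":    (250_000, 280_000),
        "Campaign management":   (250_000, 80_000),
    }

    for name, (profit, cost) in strategies.items():
        roi = 100.0 * profit / cost        # ROI after two years, as a percentage
        print(f"{name:25s} {roi:6.1f}%")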
Wow, most of our strategies look quite attractive, don't they? Look at campaign management. For every $1 we spend on an externally supplied list, we get over $3 in return after just two years. You might ask whether or not we need to buy a campaign management system in order to take full advantage of the benefits of campaigns. The answer is that we can invest in a campaign management system if we want, but we absolutely do not need one in order to run campaigns. By far the most important aspect of running effective campaigns is selection of the right customers as targets for the campaign. Campaign management systems and other software are reviewed in Chapter 10. However, making sure that we target the right customers requires that we do our customer profiling properly. Both of our main business goals include strategies for the analysis of customers. In order to reduce churn we have to do some predictive modeling, and in order to recruit the best kind of new customer we have to do some customer profiling. One of the best ways to do both these things is with a data mining product. We might also want to do more straightforward analysis using a standard query product. If we decide to invest in such software products, then we need to factor the cost into our ROI calculations.
THE SUBMISSION

What we have to do next is to present our findings to management in a way that shows how we plan to implement the data warehouse. We will present an incremental approach that shows the costs and the expected benefits of each increment. The submission process might be predetermined, in that there is a package of prescribed forms already in existence and all we have to do is plug our numbers into the forms and, presto, the submission pops out of the end of the process. In the case of the data warehouse, there are several things that we want to say:

1. Show that, in developing the warehouse, we are supporting the business in the pursuit of its goals.
2. Show how the information will affect revenue and profit over two, three, or even five years, together with the return on investment, to show how quickly each increment will pay for itself.
3. Describe the incremental approach and how the value of each increment has been calculated.

Table 8.6 is a real-life example of a part of a submission made by a major cellular telephone company. We should attach the ROI table as part of the submission. The submission should show what would happen to the business if the data warehouse solution is not implemented and, subsequently, what would happen if it were. It is also useful to show a roadmap. This is just a simple illustrative abstraction of the incremental delivery. An example is shown in Figure 8.1.

Figure 8.1. Development abstraction shown as a roadmap.
The roadmap in Figure 8.1 makes for a good presentational slide because you can explain how the different pieces of the puzzle fit together. If you want, you can put something on the Y axis such as the cost, or the projected increased revenue, or the ROI.
Table 8.6. Supporting Material for a Submission

Business objective: To increase net connections from 130,000 to 150,000 and by 12,000 each subsequent year
Deliverables of data warehouse:
1. Segmentation of market according to value segments
2. Recognition of profitability of customers (revenue, cost, margin)
3. Derived market intelligence leading to marketing initiatives for acquisition and cross-selling/up-selling
4. Customer usage (number, type, when)
5. Understanding of customer behavior
Estimation of impact: 20 percent contribution to goal in a full year

Business objective: To reduce churn from 21 percent to 16 percent by April 2002 and to 13 percent by April 2003
Deliverables of data warehouse:
1. Understanding the underlying reasons for churn
2. Increased visibility of profitability by customer/segment/channel
3. Greater understanding of the customer base
4. Greater understanding of competitive market (product/service/price)
5. Understanding of market share and position
Estimation of impact: 11 percent contribution to goal in a full year

Business objective: To increase the revenue from value-added services to 2 percent of total revenue by April 2000 and to 5 percent by April 2001
Deliverables of data warehouse:
1. Product and service offerings that meet customer needs by segment
2. Product feature optimization based on buyers' values and elasticity
3. Knowledge-based sales and service strategies
Estimation of impact: 18 percent contribution to goal in a full year

Business objective: To reduce churn through targeting campaigns more effectively
Deliverables of data warehouse: Identify customers who churn after they receive a marketing offer, identify common characteristics of those people, and de-dupe these people for future mailings
Estimation of impact: 150,000 written to per annum; 1 to 2 percent churn when they receive mailings; therefore 1,500 x $200.00 (cost of new customer) = $300,000 per annum

Business objective: To encourage channels that produce high-value customers
Deliverables of data warehouse: A report detailing the source by customer type, number of handsets, and profitability
Estimation of impact: Increased connections through profitable channels; reduction in churn from nonprofitable channels; raise new connection revenue potential by 5 percent
SUMMARY

There is no doubt that, in most organizations with large numbers of customers, the development of a data warehouse in support of a CRM strategy can almost always provide positive and profitable benefits. However, we cannot expect to be allowed to take a cavalier approach to the justification of such a development. Data warehousing represents a major investment and, as such, we must be able to show how and when it will become a sound investment rather than end up, as so many have, as an expensive mistake. The main thing to remember is that the development of the warehouse should be focused on assisting customers in the achievement of their business goals. Each business goal usually has an associated value. If the data warehouse is genuinely helping to achieve the goal, then it too can be assigned a value that helps toward its justification. Also, aim for the incremental approach. This way, the customer can "dip their toe in the water" without the danger of drowning. The incremental approach is also a good way of ensuring that the data warehouse development path is flexible and we can deviate from it as the changing business direction will inevitably dictate.
Chapter 9. Managing the Project

No project is complete without the dreaded plan! In this chapter we explore how data warehouse projects are different from traditional development projects. This explains why organizations become unstuck when trying to use traditional "hard systems" approaches alone. They need to be augmented by a "soft systems" approach. Also, a complete work breakdown structure (WBS) for the design and development of data warehouse increments is given. The WBS contains more than 100 elements to be covered for each increment.
INTRODUCTION

Project management is often the key to the success of a project. Every project needs a project manager and some kind of project management methodology. It does not matter what kind of a project we are engaged in. Large physical construction projects like the building of bridges, ships, and tunnels all need to be managed. Although these sometimes go spectacularly wrong, we have much more experience at building these kinds of structures than we do at building large software systems. The building of physical edifices has one great big advantage over the building of software: you can see how much has been done at all stages of the project. Even the untrained observer can look at a bridge and make an assessment as to how far the project has progressed. Whether they are working on the foundations, the supports, or the span is fairly obvious, and any of us could hazard a guess at how much is left to do. With software this is not the case. Even a trained observer can experience great difficulty in assessing how far through the project we really are. Another problem is that the level of complexity in software systems is far greater than in most other kinds of projects. Take, for instance, a jet fighter. To design and build one of these amazing modern machines is a very daunting prospect. But the most complicated part of the airplane is not the wings, the engine, or the fuselage; it is the software that runs all the aircraft's systems and the software that manages those systems. This controls the guidance of the aircraft and its weapons and keeps the pilot aware of what is going on both inside and outside the plane. Incidentally, the wings, the engine, and the fuselage are all designed by computer software. None of this can we see, and so our approach to the management of software projects has to be more sophisticated than that of more conventional projects. One way of addressing this issue is to tie the project management methodology into the software development methodology that is adopted for the project. There are many, many different software development methods but, broadly, they fall into two general categories:

1. The waterfall approach
2. The rapid application development (RAD) approach

This subject was touched upon in Chapter 1. It is expected that most readers of this book are familiar with these, and so the explanation will be brief. There are many different representations of the waterfall approach, depending on one's view as to the major components. It is generally accepted that the phases of requirements gathering, analysis, design, coding, testing, and implementation would be included, and these are often depicted in a diagram like the one in Figure 9.1.

Figure 9.1. Classic waterfall approach.
The principles of the waterfall approach are that each of the major components has to be completed before the next can begin. So, for instance, the system requirements must be complete before the analysis can begin, and that must be concluded before the design can proceed. There is a very pronounced finish-to-start dependency on each of the major components. The other main presumption of the waterfall approach is that, generally, we do not go back more than one step. So, for instance, we might find that, during testing, the coding needs to be changed, and there is an iteration between coding and testing until the code passes its test. In real life, of course, it doesn't work like that, and we find that, actually, the testing has revealed a misunderstanding on the part of the analyst who captured the original requirement. Having to partially redo the requirements and subsequently modify the design, code, and test plans is time-consuming, demoralizing, and, of course, very expensive, and is one of the main reasons that the waterfall approach is no longer a highly regarded method. Nevertheless it still remains, in its many forms, one of the major approaches to software development today. This is particularly true in data warehouse developments where, in some respects, this approach still makes sense.

The RAD approach works on slightly different principles and can be effective for applications that demonstrate the following characteristics:

Interactive, where the functionality is clearly visible at the user interface. RAD is strongly based on incremental prototyping with close user involvement. Therefore, the users must be able to assess the functionality easily through demonstration of working prototypes, hence the need for visible functionality.

Has a clearly defined user group. If the user group is not clearly defined, there may be a danger of driving the development from a wrong viewpoint or (worse) ignoring some important aspect of the application entirely.

Not computationally complex. The level of computational complexity of an application is quite difficult to determine and will vary from one project to another. For instance, if an application requires some complex statistical modeling, a project may consider two approaches to development: fitting in existing, well-tested statistical modeling components or developing the models from scratch. The first would be acceptable in a RAD project. The second would be considered a risk, unless it is possible either to decompose the complexity into simpler components or to make it transparent at the user interface.

If large, possesses the capability of being split into smaller functional components. If the proposed system is large, it should be possible to break it down into small, manageable chunks, each delivering some clear functionality. These can then be delivered incrementally or in parallel. Indeed, some of the functionality may be delivered using traditional waterfall methods.

Time constrained. There should be a fixed critical end date by which the project must be completed. If this is not the case, it will be relatively easy to allow schedules to slip and the fundamental benefits of RAD will be lost.

Although there are clear differences between the waterfall approach and the RAD approach, there is one fundamental similarity: they are both all about the creation of applications in the traditional sense.
Applications tend to implement a set of business functions. This means that there is a lot of talk about functionality and about processes. If we think about data warehousing for just a moment, where is the functionality and just what are the processes? Sure, there are some processes, such as the VIM processing, but the raison d'être of a data warehouse is not to implement business processes; rather its purpose is to support other applications such as decision making, campaign management, etc. In this sense a data warehouse is not, in itself, a business application. This makes it rather odd. It's been said before that data warehouses are different. So we need to consider this carefully when trying to figure out how to develop one.
WHAT ARE THE DELIVERABLES?

One thing that we absolutely have to get right in any project is to ensure that the customer gets what they expect to get. In Chapter 1 we discussed the business goals approach to expectation setting, but there is still the problem of "sign off." A project manager has to get acceptance from the customer before the project can officially be closed down. This is important because, in a consultancy organization, this is when we get paid. Even in an internal project, closure is important because it signals the point at which the system is accepted. This event is always greeted with a huge sigh of relief by the members of the project team and is usually the trigger for some kind of end-of-project bash. It also enables the team members to be reassigned to the next project in line. For some, hopefully, this will be the next increment of the data warehouse. So how do we get sign off? It is certainly the case that most quality project managers will insist on some kind of acceptance criteria before they themselves will accept responsibility for the project. This is one of those intractable problems that we just have to deal with. In principle, as long as the data warehouse has been built on sound principles and firmly focused on the business goals, the fact that it physically materializes at the end of the process should be sufficient. However, usually, more is required. It's just because data warehouses are different. If it were not a data warehouse but, instead, say, an order processing system, we could get the customer to define some acceptance criteria and provide some acceptance test cases and some test data. Then we could process the order, make sure that the customer account is properly updated, that the delivery note generates an invoice, etc. This is because most systems are process oriented, and what actually gets tested is that the processes perform the way they should and that the data used in the process gets transformed properly. In a data warehouse development, the processes are all about getting the data out of the source systems and into the data warehouse in the appropriate form. We can and should provide quality control checks to ensure that all records that have been extracted from the source system can be accounted for in the data warehouse, but that can hardly fulfill the requirements of a user acceptance test, can it? No, what matters is the content. Each case, unfortunately, will probably be different, but here are a couple of suggestions:

Tie the data warehouse into one of the applications, for instance, the campaign management system. What we can do is agree on a set of segmentation criteria for the selection of customers to be targeted in a campaign and build the system so that the campaign management system, if you have one, is fed with the list of appropriate customers. The acceptance criteria could be the selection parameters. Our users can then validate the selection.

Select some of the queries that were specified in the information strategy workshop. If we can execute a selection of those queries, it will prove that the warehouse is able to provide information that is valuable to the organization. Be careful that the queries selected as acceptance criteria are in line with the first increment being delivered. It's no use asking for queries relating to customers' changing circumstances if we have only delivered wine sales behavior.

One thing to be very careful about when putting together acceptance criteria is system performance. It is common for our customers to demand blanket performance metrics such as: all queries will take no longer than 10 seconds. This can be a huge "gotcha" because acceptance criteria like these will kill us. Under no circumstances should we accept this type of condition. We have to explain to our customer that data warehouses are different from conventional applications. It is reasonable, in most cases, in an OLTP application to expect virtually instantaneous response times. All these transactions have to do, usually, is:

1. Look up a single database record (using a unique key)
2. Change some value
3. Rewrite the record

As everyone knows, this sort of thing can be done in the twinkling of an eye. A data warehouse query might have to:

1. Read several million records
2. Try to match them to thousands of other records using a sort merge join or a nested loop join
3. Calculate subtotals
4. Present the results in a readable format

This, as anyone who has experienced it will testify, can take a very long time indeed. It is a good idea to avoid performance-based acceptance criteria if at all possible. That does not mean we should ignore performance, simply that the system should not be judged on it. The attitude we should adopt is that, previously, the answers to such questions were either impossible or nearly impossible to obtain. The fact that we are now able to get an answer at all is a sign of success. Performance tuning is something that should always be addressed afterward, using standard RDBMS performance tuning techniques as well as data warehouse specific techniques such as summary tables.
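Not from the book: a minimal sketch of the first suggestion above, in which the agreed segmentation parameters become the acceptance criteria and the test is simply that the customers selected for a campaign match those parameters. All names and criteria here are illustrative assumptions.

    # Hypothetical acceptance check: the agreed segmentation criteria are the
    # acceptance criteria, and the users validate the resulting selection.
    criteria = {"min_annual_spend": 500, "active": True, "region": "South East"}

    def select_campaign_targets(customers, criteria):
        """Return customers matching the agreed segmentation criteria."""
        return [
            c for c in customers
            if c["annual_spend"] >= criteria["min_annual_spend"]
            and c["active"] == criteria["active"]
            and c["region"] == criteria["region"]
        ]

    customers = [
        {"id": 1, "annual_spend": 700, "active": True,  "region": "South East"},
        {"id": 2, "annual_spend": 300, "active": True,  "region": "South East"},
        {"id": 3, "annual_spend": 900, "active": False, "region": "North"},
    ]

    targets = select_campaign_targets(customers, criteria)
    assert [c["id"] for c in targets] == [1]   # the users sign off on this list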
WHAT ASSUMPTIONS AND RISKS SHOULD I INCLUDE?

Every project plan contains assumptions, or it should. We never know everything we need to know at the start of a project, and so we have to make a guess. These guesses we call assumptions. Different projects require different assumptions, as do different customers. The difference between assumptions and risks is not at all clear in the context of a project plan, since we can mitigate a risk by making an assumption. For instance, we might regard as a risk the possibility that our senior solution architect may get killed by a passing truck, so we mitigate this by making the assumption that it won't happen. If you prefer to use a risk register, then this section is still relevant. Here are some assumptions we should consider when planning a data warehouse project:

Data quality. If the project has a task that analyzes the data quality, all well and good. If it doesn't, then we only have the customer's word for the fact that the data quality is adequate. Experience tends to show that customers tend to exaggerate the quality of their data. This is not usually an attempt to hoodwink us but simply underlines the fact that they just don't know how bad their data is. Examples of poor data quality are missing data, duplicated data, referential integrity errors, etc. It is often the case, for instance, that there are several customer databases (23 is the most I have encountered). These have grown up over time, and each is there for a different purpose. Each database yields a little of the information that we need in the warehouse. Unfortunately, they are never consistent. Sometimes the different databases adopt different encoding systems, and the customer identifier in one system may be entirely different from that in another system. You might have thought that, if each system yields a piece of the puzzle, all we have to do is make a big table join and, presto, we've got all the information that we need. Sometimes this is true and sometimes it is not. It is not unusual to find that you have to have a mapping table like the one below to arrive at a consistent customer identifier in the data warehouse (a minimal illustrative sketch of how such a table might be used appears after this list of assumptions).

Customer Identifier Map
Warehouse customer ID             char(8)
Order processing customer ID      number(6)
Sales rep customer ID             char(6)
Accounts payable customer ID      char(6)

This kind of approach works as long as there is a one-for-one relationship between the systems but, often, there isn't. The sales representative customer database might define a customer at a different level, so that the daughter companies appear as individual customers, whereas the accounts payable system is only interested in the corporate customer, the one that pays the invoices. There is a further issue, which is that these disparate sources of information will not be consistent. Addresses will be different, information will be more up to date in some systems than it is in others, etc.
These oddities contribute to the general issue surrounding data quality. Our customer is unlikely to recognize these inconsistencies. Where we have not put aside time in the project to analyze the data, it pays to include an assumption about data quality that covers these points. Poor data quality can be a real show stopper in a data warehouse project.

Data availability. We need to be able to get at the data we need, and sometimes that can be difficult. We've discussed the big issue of how to get changed data when describing the temporal problems in Chapter 4 but, oftentimes, the behavioral data can also be problematic. It is quite common for the behavioral data to be made available at some point during the overnight batch processing. A point in the batch cycle is identified as being appropriate for the data to be placed in a file so that it can be passed on to the data warehouse for processing. If the batch processing fails to deliver the data for any reason, then the warehouse cannot be updated. Although this sounds like an operational rather than a development issue, our customer will not accept the system unless the availability of data can be relied upon. Ensuring that this kind of problem does not happen is very difficult. It is better to include an assumption in the plan that places the responsibility for timely delivery of the data onto the customer themselves.

Overnight processing window. This is a similar problem to the previous one. We have a responsibility to make the data warehouse available for a certain period each day, say, from 8:00 a.m. to 8:00 p.m. We need a certain number of processing cycles, in the form of an overnight window, to do all our loading and housekeeping between those times. Most of the time there may be no problem, but there are occasions in all operational data centers when time becomes very short. For instance, at month end, quarter end, and, especially, year end, extra data processing has to be done as part of the normal course of doing business. When this happens, we get squeezed. Also, when things go wrong in the "mission critical" overnight stuff and suites of software have to be rerun, preceded by lengthy database restores, etc., the batch processing overruns into our time window and we get squeezed again. As before, it is extremely difficult to code for this eventuality, and the best thing is to make a broad assumption at the outset which states that our overnight processing window will be kept open.

Business sponsor. It is vital that there is a prominent sponsor, or project champion, for the project within the business community. This person must "own" the project on behalf of the customer. It is not possible to overstate the importance of this role to the ultimate success of the project. This person must be enthusiastic about the project and also must be empowered to make decisions on its behalf. The sponsor must continue to be the project champion throughout the life of the project. If the project sponsor leaves the project for any reason, this poses a serious threat to the success of the project. New sponsors coming in partway through a project rarely have the same level of commitment as those who have been involved from the beginning. It is well worth making an assumption that the customer's project sponsor will remain in place throughout the life of the project.

Source system knowledge. This can be another big issue. It is especially a problem where the source systems are getting a bit old and where they were originally developed in-house. Most of the original designers and developers will have moved on. The system will have been enhanced and tweaked over the years of its service, and the documentation, such as it was, has not been rigorously kept up to date. In short, no one knows much about it anymore. There are several extract files, or places in the processing cycle where extract files may be obtained, but there is no one who can provide definitive descriptions of the data fields in the records. Incidentally, this can be just as true with quite new systems, especially where there is a large "packaged" systems component. Finding out just what is in these systems can be a nightmare. It's a good idea to make the assumption that the customer can provide people who fully understand its systems.
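Not from the book: the sketch referred to under the data quality assumption, showing how the customer identifier map might be used as a simple lookup keyed on each source system's identifier. The field names mirror the map above; everything else is an illustrative assumption.

    # Hypothetical customer identifier map: one row per warehouse customer,
    # holding the identifiers used by each source system.
    identifier_map = [
        {"warehouse_id": "WH000042", "order_processing_id": 123456,
         "sales_rep_id": "SR0042", "accounts_payable_id": "AP9931"},
    ]

    def warehouse_id_for(source_field, source_value):
        """Resolve a source-system identifier to the warehouse customer ID."""
        for row in identifier_map:
            if row.get(source_field) == source_value:
                return row["warehouse_id"]
        return None   # an unmatched identifier is itself a data quality warning

    assert warehouse_id_for("order_processing_id", 123456) == "WH000042"
    # The lookup only works while the relationship is one-for-one; where one
    # system records daughter companies and another only the corporate parent,
    # a single source identifier may correspond to several warehouse customers.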
WHAT SORT OF TEAM DO I NEED?

In order to properly plan the project, the project manager needs to know what kind of resources are needed, and when they are needed. So we can begin with an outline development team structure, as shown in Figure 9.2.
Figure 9.2. Example project team structure.
Just a brief note about some of the roles identified in Figure 9.2:

Project manager. The project manager needs to be fairly comfortable dealing with a certain amount of ambiguity, uncertainty, and change. Data warehouse projects, being different from normal development projects, tend to have to be more flexible to the changing business landscape and are not so rigidly defined as we might prefer. An experienced project manager in system development can usually adapt to the more fluid requirements of a data warehouse project. Project managers who are used to working, without wavering, from a fixed deliverable, such as the dreaded system specification, will find it more difficult.
Business consultant. This role is crucial to the success of the project. The business consultant, and by this we absolutely do not mean management consultant, is the person who can help the customer to understand the benefits of CRM and can explain how all the components fit together to help to implement the CRM strategy. Ideally, the business consultant should have a good understanding of the customer's business environment. This person is responsible for gathering the business requirements and helping to develop the conceptual model.

Solution architect. This person is also a key player in ensuring the ultimate success of the project. The solution architect specifies the components of the warehouse jigsaw puzzle, making sure that all the pieces fit together neatly. The solution architect role is the most senior technical role on the project, and it involves gaining an understanding of the requirements and converting them into a solution. It is important that the person assigned to this role is highly skilled, not only in the field of systems integration generally but also, specifically, in the field of data warehousing. In a CRM project, traditional data warehousing techniques need to be modified, as described throughout this book.

Development lead. In many respects, this can be regarded as being quite similar to a traditional systems development manager's role. What is happening in a data warehouse is that there are many little subprojects going along at the same time. Each of the subprojects consists of a small team having a team leader and one or more developers. The teams are variously developing the extraction and load process, the warehouse itself, as well as the applications, such as campaign management, as shown in Figure 9.2. The number of teams that will be needed will vary from project to project but, so long as an incremental delivery approach is being adopted, we would never need more than three or four of these little teams. As one increment ends and a new one starts, we just reassign the teams to the next deliverable.

Database administrator. Much of the work we do in a data warehouse development involves, axiomatically, databases. For this we do need a highly qualified DBA to work with us on the project. If possible we would prefer a DBA with data warehouse experience. The data warehouse database design will be undertaken by one of the design teams, but they will have to work very closely with the DBA, and a good knowledge of warehousing techniques is highly desirable. Similarly, the extraction and load team would benefit from working with a DBA who understands the loading issues surrounding data warehouses.

System administrator. This is our operating system and infrastructure expert. This person assigns access rights to the development machines and generally keeps house for the team. However, if we can get someone who can help with more technical stuff, then so much the better. This means configuring the CPU usage and memory in an optimal fashion for the warehouse, as well as the disks, mirroring, controller cache, etc. These things will have to be addressed at some point when we get into performance tuning, so it is a great advantage if we can get someone at the outset who can deal with this stuff.
The Work Breakdown Structure for Data Warehousing

In this section we introduce a general work breakdown structure (WBS) for data warehousing projects. Our WBS has well over 100 elements, and it comes complete with dependencies between elements. It's quite comprehensive, but if you want to add to it or modify it in any way, then feel free to do so. We present the WBS in logical sections so that some of the elements that may not be immediately obvious can be explained. Notice that the leftmost column in Table 9.1 is the "Project Increment." This WBS is designed to handle the incremental approach. We have to complete the WBS for each increment in our data warehouse. The first section is the conceptual model. In the conceptual model we include all the components that were identified in the conceptual modeling chapter of this book. Note that there is no absolute need to adopt the dot modeling technique if you don't wish to. This WBS has no allegiance to any particular methodology, although it fits quite snugly with dot.
Table 9.1. Conceptual Modeling Tasks
(The project increment number and person days columns are left blank, to be completed for each increment.)

WBS Number   Task Description                        Dependency
Conceptual Model
CM010        Capture business requirements
CM020        Create conceptual dimensional model     CM010
CM030        Determine retrospection requirements    CM020
CM040        Determine data sources                  CM020
CM050        Determine transformations               CM040
CM060        Determine changed data dependencies     CM040
CM070        Determine frequencies and time lags     CM040
CM080        Create metadata                         CM040
So retrospection, changed data dependencies, and frequencies and time lags should, by now, be fairly familiar terms. The next step is to convert the conceptual model to a logical model (Table 9.2). Again, this is covered in Chapter 6.
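Not part of the book's WBS: a minimal sketch of how the WBS dependencies could be represented and ordered programmatically, using the conceptual modeling tasks from Table 9.1 as the example. The structure and function names are assumptions for illustration only.

    # Hypothetical representation of WBS elements and their dependencies.
    wbs = {
        "CM010": [],
        "CM020": ["CM010"],
        "CM030": ["CM020"],
        "CM040": ["CM020"],
        "CM050": ["CM040"],
        "CM060": ["CM040"],
        "CM070": ["CM040"],
        "CM080": ["CM040"],
    }

    def schedule(tasks):
        """Return the tasks in an order that respects every dependency."""
        ordered, done = [], set()
        while len(ordered) < len(tasks):
            ready = [t for t in tasks
                     if t not in done and all(d in done for d in tasks[t])]
            if not ready:
                raise ValueError("circular dependency in the WBS")
            for task in sorted(ready):
                ordered.append(task)
                done.add(task)
        return ordered

    print(schedule(wbs))   # CM010, CM020, then CM030/CM040, and so on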
Table 9.2. Logical Modeling Tasks

WBS Number   Task Description                    Dependency
Logical Model
LM010        Create conceptual-to-logical map    CM030
LM020        Design security model               CM020
LM030        Create logical data model           LM010
The creation of the physical model is described in detail in Chapter 7. The output from the physical model stage of the project consists of the data definition language (DDL) statements required to instantiate all the data warehouse objects (tables, indexes, etc.) (Table 9.3).
Table 9.3. Physical Modeling Tasks

WBS Number   Task Description                  Dependency
Physical Model
PM010        Create logical-to-physical map    LM030
PM020        Create security model             LM020
PM030        Design physical model             PM010
PM040        Create DDL                        PM030
Much of the next part of the project can be executed in parallel with the data modeling. The logical and physical modeling will be carried out by the data warehouse team and the capture of the data from the source systems is executed by the extraction and load team. There is, to some extent, a dependency on the conceptual model in that it is the metadata that will direct the extract and load team to the appropriate source systems and associated data elements. Table 9.4 lists the tasks involved in the initial loading of the dimensional behavioral models. Remember that the initial loading is a “one off” and, therefore, is not an ongoing requirement. It is all right for this part of the system to be a bit less rigorous than the incremental extracts and loads.
Table 9.4. Initial Data Capture and Load Tasks

WBS Number   Task Description                      Dependency
Initial Data Capture and Load: Behavioral Data
IB010        Design data extraction process        CM040
IB020        Develop data extraction process       IB010
IB030        Test data extraction process          IB020
IB040        Design data VIM process               CM050
IB050        Develop data VIM process              IB040
IB060        Test data VIM process                 IB050
IB070        Design data load process              PM030
IB080        Develop data load process             IB070
IB090        Test data load process                IB080
IB100        Execute loading of fact data          IB090
IB110        Verify data accuracy and integrity    IB100
IB120        Implement metadata                    CM080
Circumstances and Dimension Data
ID010        Design data extraction process        CM040
ID020        Develop data extraction process       ID010
ID030        Test data extraction process          ID020
ID040        Design data VIM process               CM050
ID050        Develop data VIM process              ID040
ID060        Test data VIM process                 ID050
ID070        Design data load process              PM030
ID080        Develop data load process             ID070
ID090        Test data load process                ID080
ID100        Execute loading of dimension          ID090
ID110        Verify data accuracy and integrity    ID100
ID120        Implement metadata                    CM080
Notice that the processes ID010 through ID120 are to be repeated for each entity. The next set of processes appears to be very similar and, in some respects, it is. However, the significant difference is that these processes will be used as part of the live, delivered system and must be developed to industrial strength. Each process should record its progress in some kind of system log. The information written in the log should be classified as:

Informational. Informational messages include the date and time started and time finished. Elapsed time is another useful piece of information, as it helps the system administrators to figure out which are the most time-consuming processes in the system. The number of records processed and, if possible, the monetary sums involved are also very useful pieces of information. This will help to provide a kind of audit trail from source system to data warehouse and will be invaluable in error tracing. Another good idea is to write an entry into the log as the process proceeds rather than just at the end. If the process routinely handles around a million records, then it is helpful to write a log entry every 100,000 records or so. One way of making this more dynamic is to provide for a run-time argument that specifies the frequency of progress entries to be written to the log.

Warnings. A process should append a warning to a log file when something of concern has been detected but for which there is no immediate cause for alarm. The rejection of records, for instance, may result in the generation of a warning message. Sometimes, a record in the log may be informational, such as the fact that the file system is 10 percent full, 20 percent full, etc. These informational messages might be "promoted" into warning messages when the file system exceeds 60 percent capacity. In a proactive system this could cause a different kind of message to be sent to an operator so that action can be taken before the situation becomes critical.

Errors. When an error occurs, the process cannot proceed any further and has to stop. For instance, when the file system overflows, unless the process can get access to another file system, it cannot proceed and has to stop. Sometimes, errors are detected right at the start, such as when the process detects that it has been given a file of data that it has already dealt with. Errors generally imply that the system has come to a grinding halt, and some intervention is required before further progress can be made.

When an error occurs, there is a natural inclination to start over. By that we mean that the process is rerun. In relational database terms, this usually means rolling back all the work that has been done successfully and then redoing the work from the start once the problem has been solved. The problem with this approach is that, in a data warehouse, we might be trying to process many millions of transactions. Often, it takes as long to roll back the work as it did to carry out the work in the first place. Usually, rolling back all the transactions is unnecessary, especially in cases where we have simply run out of space, which is a common data warehouse problem. If we get, say, halfway through processing all our data, then have to roll it all back and then redo the work from the beginning, the task ultimately will take twice as long as originally expected. If we are operating within a tight overnight time window, this might mean that we are not able to open the warehouse in time in the morning. It is important to remember that some of these processes might take several hours to complete under normal circumstances. It is well worth considering allowing processes to be restarted. This means that, although they stop when an error is detected, they can pick up where they left off once the situation has been resolved.
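Not from the book: a minimal sketch of a load process with the kind of progress logging and restart capability described above. The checkpoint mechanism, logging frequency argument, and function names are all illustrative assumptions.

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("warehouse.load")

    def load(records, load_row, checkpoint_file="load.checkpoint", log_every=100_000):
        """Load a sequence of records, logging progress and restarting from the
        last checkpoint if a previous run stopped on an error."""
        try:
            with open(checkpoint_file) as f:
                start = int(f.read())          # restart: pick up where we left off
            log.warning("restarting from record %d", start)
        except FileNotFoundError:
            start = 0

        for i, record in enumerate(records):
            if i < start:
                continue                       # already loaded on a previous run
            try:
                load_row(record)
            except Exception:
                log.error("stopped at record %d; fix the problem and rerun", i)
                raise
            if (i + 1) % log_every == 0:
                log.info("%d records processed", i + 1)
                with open(checkpoint_file, "w") as f:
                    f.write(str(i + 1))        # checkpoint for a possible restart

        log.info("load complete: %d records", len(records))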
Table 9.5. Ongoing Data Capture and Load Tasks

WBS Number   Task Description                               Dependency
Subsequent Data Capture and Load: Behavioral Data
SB010        Identify entity lifecycle points of capture    CM040
SB020        Design data extraction process                 CM040
SB030        Develop data extraction process                SB020
SB040        Test data extraction process                   SB030
SB050        Design data VIM process                        CM050
SB060        Develop data VIM process                       SB050
SB070        Test data VIM process                          SB060
SB080        Design data load process                       PM030
SB090        Develop data load process                      SB080
SB100        Test data load process                         SB090
Circumstances and Dimension Data
SD010        Identify changed data                          CM040
SD020        Create dependency links                        CM060
SD030        Design data extraction process                 CM040
SD040        Develop data extraction process                SD030
SD050        Test data extraction process                   SD040
SD060        Design data VIM process                        CM050
SD070        Develop data VIM process                       SD060
SD080        Test data VIM process                          SD070
SD090        Design data load process                       PM030
SD100        Develop data load process                      SD090
SD110        Test data load process                         SD100
As with the initial loading of the data, the tasks associated with the subsequent loading of dimensions, SD010 through SD110 in Table 9.5, must be repeated for each entity.
The next set of tasks is associated with enabling the users. The kinds of things we need to do here are fairly self-explanatory. Also, once the user roles have been defined, these tasks can be executed in parallel with other work (Table 9.6). The next thing we have to do is organize all the warehouse administration tasks (Table 9.7). This includes the user role definitions, the user schemas (users' views of the data), as well as any summary-level tables that have been identified so far. As a general principle, it is recommended that we hold off on producing the summary tables at this stage. The reasons for this are:
Table 9.6. User Access Tasks

WBS Number   Task Description             Dependency
User Access
UA010        Assign users to user roles   WA010
UA020        Enroll users                 UA010
UA030        Configure user software
UA040        Test user access             UA030
UA050        Design user manual           UA030
UA060        Create user manual           UA050
UA070        Design help processes        UA030
UA080        Develop help processes       UA070
UA090        Test help processes          UA080
Table 9.7. Warehouse Administration Processing Tasks

WBS Number   Task Description                           Dependency
Warehouse Administration
WA010        Define user roles                          LM020
WA020        Map security model to roles                WA010
WA030        Design user schemas                        WA020
WA040        Develop user schemas                       WA030
WA050        Test user schemas                          WA040
WA060        Design data summary process                PM030
WA070        Develop data summary process               WA060
WA080        Test data summary process                  WA070
WA090        Design data summary navigation process     WA060
WA100        Develop data summary navigation process    WA090
WA110        Test data summary navigation process       WA100
1. Their creation delays the point at which we can release the system to our users. Quite often, the summary processing consumes a significant percentage of the development effort. Any question that can be answered by a summary-level table can also be answered from a detail table, since the summary tables get their data from the detail. The only issue is one of performance. It is perfectly reasonable for us to release the system to the users and work on the performance aspects afterward.
2. Allowing the users to get at the data as soon as is practical gives us another benefit. We can observe their usage and use the knowledge gleaned from these observations to determine which summary tables we should be putting in place. It is commonplace in data warehouse development projects to try to second-guess the summary-level tables. Often we are right, but sometimes we are not, and some of our carefully crafted summaries are never, or only infrequently, utilized by the users.

The next step is automation (Table 9.8). Many professional project managers will testify that this is the biggest gotcha in building data warehouses. Our incremental approach, with all its advantages, tends to exacerbate the problem. What happens is that we try very hard to get the first increment to a point where the users can actually use the data. It is only at this point that they can begin to get convinced that, hey, there really is some business benefit to be obtained from all this data! Once they have it, they can't get enough of it and they want more, and not just a tweak here and there: they want a whole lot more. Before we know where we are, some of the team have been diverted onto the second increment. However, what the customer does not know is that the first increment is far from being finished. All we have done is show them the data, but they think we're done. They don't see the fact that, in order to get the data out of the source systems, VIMmed, and loaded into the data warehouse, an awful lot of hand cranking takes place. None of the processes have been integrated and the system does not "hang together." What is needed is to go back to the first increment and put in place all the scheduling, controls, error handling, etc., that are needed to turn the system into an industrial-strength product with strong welds and nuts and bolts to replace all the string and blue tack. This is a serious issue in the ancient art of expectation setting. If we let this get away from us, we may never recover. The operations department of any company will never accept any system that cannot stand on its own two feet, especially a system that is as huge and resource hungry as a data warehouse. All the time we must keep impressing on our customer the fact that what they are seeing is not the finished article and that we do need sufficient time to complete the work. There is a case to be made for not allowing the users to have access to the data warehouse until the first increment is fully completed. While having some sympathy for this view, I am still of the opinion that the users should be given access as soon as possible. After all, it is their data, not ours. Strong project management and equally strong expectation management are the key to success here.
Table 9.8. Automation Tasks

WBS Number   Task Description                          Dependency
Automate Processing
AP010        Design automation process
AP020        Develop automation process                AP010
AP030        Test automation process                   AP020
AP040        Perform integration test                  AP030
AP050        Design operational procedures manual      AP010
AP060        Develop operational procedures manual     AP050
AP070        Obtain operations acceptance              AP060
Now let's consider support. Support exists at a number of levels (Table 9.9). Each of our customers will have specific requirements. There are two principal classes of support:
1. User support. This is where we provide a kind of hot line or customer support desk environment that assists our users when they have questions about the system or the service or when they have gotten into difficulty.
2. System support. Ordinarily, as suppliers of the system, we will have some responsibility for supporting it, at least for some period of time, after it has been implemented. Beyond that, some provision must be made for ongoing support. This includes the routine maintenance of the system, including upgrades to systems software and application software. Support of a general nature is also needed with storage space, recovery from hardware and software failures, etc.

The levels of support that are required will always vary from customer to customer, as will the way in which the support is implemented. Some organizations have their own support departments, while others outsource their support entirely. Whatever the situation, we have to design our support strategy to fit into the requirements. Another gotcha is failing to involve the support people early on. It is often the case with these guys that they've got their acts together big time, and they will be able to tell us precisely what they will need; in other words, precisely what we have to give them before they will take responsibility for the system off our hands. A lot of project time and money can get wasted by not paying attention to the support requirements. It's a bit like the automation issue: it gets left until the end of the project and tends to be considered one of the tidy-up tasks instead of being given the strategic level of importance that it deserves.
Table 9.9. User Support Tasks

  Project Incre. number   WBS Number   Task Description            Dependency   Person Days
  User Support            US010        Design support processes
                          US020        Develop support processes   US010
                          US030        Test support processes      US020
We touched on backup and backout in Chapter 7. It's worth repeating that backing up a data warehouse is a nontrivial exercise, and considerable effort has to be applied to the design of an effective solution (Table 9.10). There is also the backout problem. How do we manage to get data out of the system when it should never have found its way into the warehouse in the first place? So what is “backin”? Backin is the opposite of backout. Sometimes we find that the wrong file of data was used to load the warehouse. The first we know about it is when someone from the operations team appears and says that one of the files they gave us four days ago was wrong and, by the way, here is the right one. So we have to use the backout procedure to remove the offending data and the backin procedure to put the correct data back in (not forgetting that this new data might end up getting backed out if it turns out to be wrong–this has happened).
Table 9.10. Backup & Recovery Tasks

  Project Incre. number   WBS Number   Task Description           Dependency   Person Days
  Backup and Recovery     BR010        Design backup process
                          BR020        Develop backup process     BR010
                          BR030        Test backup process        BR020
                          BR040        Design backout process
                          BR050        Develop backout process    BR040
                          BR060        Test backout process       BR050
                          BR070        Design backin process
                          BR080        Develop backin process     BR070
                          BR090        Test backin process        BR080
In order to get the data warehouse to operate normally, we have discussed the fact that all the routine processes have to be automated as much as possible. The actual execution of the processes and their timing, dependencies, etc., will normally be placed under the control of some scheduler. Schedulers are extremely variable in the level of functionality that they provide, but most are quite sophisticated. For instance, we would not want our data warehouse extraction process to be started until the file of data that it needs to work on had materialized in the designated place. Although the file may be due to arrive by, say, seven o'clock in the evening each day, there is no real guarantee that it will actually do so. And if it is not there at the appointed hour, we don't want the extraction process to
start. Schedulers can be configured to respond to the presence or absence of files, as well as to the return codes of other processes, signifying their success or failure (a minimal sketch of this kind of file guard follows Table 9.11). The operations team will not take any responsibility for running the system if they do not know how to do so. This is a similar issue to those of support and automation, and it is another famous gotcha, not only in data warehouse projects but in all IT projects. Just like the support team, ops will have their own set of requirements, often written down, that we will absolutely have to satisfy before we can hand the system over to them. Other things we have to do include performance monitoring and tuning as well as capacity planning. Do schedule some time in the plan for this kind of activity. There is no point in delivering a system that works brilliantly on day one but that runs out of memory, CPU cycles, disk space, or network bandwidth by day five (Table 9.11).
Table 9.11. System Management Tasks

  Project Incre. number   WBS Number   Task Description          Dependency   Person Days
  System Management       SM010        Scheduling
                          SM020        Operations training
                          SM030        Performance monitoring
                          SM040        Capacity planning
Table 9.12 lists some other items that also sometimes get forgotten or that get left so late that they get rushed. Another gotcha is not having user equipment with a high enough specification. It does not matter how well configured the server is nor how much headroom exists within it: many a well-designed, well-configured, superbly performing data warehouse has been let down by attempting to "shoe-horn" the end-user component into "clapped out" desktop machines.
Table 9.12. Installation & Rollout Tasks

  Project Incre. number      WBS Number   Task Description                   Dependency   Person Days
  Installation and Rollout   IR010        Conduct user training
                             IR020        Install user equipment
                             IR030        Test user software configuration   UA030
Lastly, Table 9.13 lists some initial things we need to sort out before the development team turns up with their coding pencils sharpened and ready to go.
Table 9.13. Initial Setup Tasks

  Initial System Requirements
    Acquire and install DBMS software
    Acquire and install user access software
    Acquire and install metadata repository software
    Acquire and install aggregate navigation software
    Acquire and install high-speed sort software
    Acquire and install backup systems
    Acquire and install Web server and client software
    Acquire and install capacity planning software
Summary

We need to keep track of what we're doing, and this chapter provides a comprehensive work breakdown structure, with dependencies, that can be adapted to the needs of an individual project or project manager. We discussed the thorny issue of deliverables. This is a particular problem in data warehousing because the development is somewhat different from that of more traditional software projects. We also discussed the project team and the requisite skills needed to make a data warehouse project successful. We explored some of the gotchas around data quality and availability, as well as customer commitment. Other gotchas include the need to engage with the customer's operations team and support environment. It is hoped that this chapter will be a valuable reference for project managers embarking on the delivery of data warehouses.
Chapter 10. Software Products

There is a plethora of software products on the market. Many make extravagant claims about their ability to enable greater productivity, speed of implementation, etc. Also, the vendors may claim that their products are "CRM" products. It never ceases to amaze me just how many different labels a single product can wear during its life. I can think of one product that is, basically, an SQL query tool. It has been in existence for quite a long time, since about 1990, and in all that time it hasn't really changed all that much. There is a Web version now and an OLAP interface, and most of the sharp edges have been removed. Basically it's a good query tool, but it's still a query tool. When I first encountered the product it was being marketed as a query tool, which was appropriate, because that is what it is. Since then, however, the vendors of this product have claimed it to be:

1. A data warehousing product
2. An OLAP product (before the OLAP interface existed)
3. A data mining tool
4. A CRM tool

While this kind of constant reinvention of products is not new and, to IT people, is neither surprising nor particularly of concern, it can cause some projects to fail. The problem is that some people think that if they buy the best-of-breed toolset, then they are more likely to be successful. This has been proved, time and again, to be nonsense.

Let's open this chapter with a slightly controversial statement: There is no such thing as a CRM product. Such a remark might result in some product vendors bristling with indignation. The statement is nevertheless true. There are some products that can help us to implement our CRM strategy, but there are no CRM products per se. This chapter describes some of the types of product that can help us in implementing our CRM strategy and puts them into context. Specific products are not evaluated, nor even mentioned, but the characteristics that these products have are described and evaluated, in the hope that this will assist in the selection of products. It is worth remembering that the kinds of software products we will be describing tend to be relatively expensive, being three or four orders of magnitude more expensive than desktop products. So, where possible, it makes sense in the early days of a CRM project to focus on collecting and presenting the information that the business needs and to defocus on the products. We should only be investing in these systems where there is a clear business reason for doing so.
EXTRACTION, TRANSFORMATION, AND LOADING

The term extraction and transformation is used to describe the processes of, first, the extraction of data from source systems and, second, the modification or transformation of that same data to put it into a form that is more acceptable to the data warehouse than its native form. There are many extraction and transformation tools, and most also provide a mechanism for loading the transformed data into the warehouse. Hence, they are usually referred to as extraction, transformation, and loading (ETL) tools. In order to facilitate the loading, the ETL tool must have an understanding of the structure of the data warehouse, and so one of the components of the ETL application is a source-to-target mapping of data. The diagram in Figure 10.1 shows the major components of most ETL products.

Figure 10.1. Extraction, transformation, and load processing.
The diagram shows that the extraction, transformation, and load can be depicted as three distinct processes. Each of the three processes is supported by its own, internally held, documentation. This can form the basis of a metadata repository. We will be discussing metadata a little later on.

Extraction

It is the extraction process that reaches into the source systems and plucks the data that we need for the warehouse. The extraction routines can vary from being quite trivial, in the case of a file being delivered by the source system, to being impossibly complex, in the case of attempting to identify changes to customer circumstances. The extraction process needs to understand the layout of the source system data to a sufficient extent so that it can select out just the data that it needs. Any source system will contain much data that is of little or no interest to the data warehouse. Individual records will contain many data fields that we need in the data warehouse as well as many fields that we do not need. There is little point in extracting data that we do not need, especially in the case of behavioral data where we may be reading and extracting from as many as hundreds of millions of records per day. Each extraneous byte of data extracted translates into 10 megabytes of wasted space in such a large system. It is important, therefore, that the tool enables us to work at the "field" level when defining what it is we wish to extract.

In terms of the actual processing involved, it is quite common for the extraction process to store an intermediate version of the extracted data into what is known as a "staging area." The benefit of this is that it enables us to easily rerun the loading process from raw data in the event of an upstream problem in the data warehouse load, without having to reextract it. Also, it enables the raw data to be backed up and archived easily, should we wish to do so. The extraction part of the tool must, therefore, allow for the writing out of intermediate flat files.

Transformation

In an ETL product, the transformation processor is a very busy process and is responsible for doing quite a lot of work. The VIM processing, which was described in Chapter 7, is the responsibility of the transformation process. The validation part that determines whether a record is OK, or can be made to be OK, is done here and so, therefore, is the rejection of records that cannot be made to be OK. The integration part is also done here. The transformation process has access to lots of functions, some of which are shown in the diagram in Figure 10.1. There will be simple processes that convert numbers to characters, and vice versa, as well as date conversion routines and, sometimes, currency conversion. Also, lookup processes will be needed where, for instance, the customer identifiers are inconsistent between systems and we need to use a kind of surrogate key in the data warehouse to bring them all together. So we need to be able to specify valid values for each of the fields on our input data streams as well as what changes need to be applied. Some fields will require many transformations, such as a format conversion from numeric to character, followed by a date conversion and then a lookup to determine whether or not the date occurs, say, within the customer's period of validity. So it is important that the tool enables complex processes to be built up. The best way to achieve this, as our example showed, is by stringing simpler processes together. Obviously, we need the function library to be extensible so that we can add our own functions. Some tools provide for the construction of functions within the tool itself. These are usually quite restrictive, and so it is important that we are able to call external functions that we have written for ourselves.
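The idea of building complex transformations by stringing simpler processes together can be sketched as straightforward function composition. This is only an illustration, not any ETL product's API; the field names, date layout, and lookup table below are invented.

from datetime import datetime

# Invented reference data: maps each source system's customer key to a warehouse surrogate key.
SURROGATE_KEYS = {("BILLING", "C0042"): 1001, ("ORDERS", "42"): 1001}

def to_int(value):
    """Simple format conversion: character to numeric."""
    return int(value.strip())

def to_date(value):
    """Date conversion from the assumed source layout (DDMMYYYY)."""
    return datetime.strptime(value.strip(), "%d%m%Y").date()

def lookup_surrogate(system, customer_key):
    """Integration step: resolve inconsistent customer identifiers to one surrogate key."""
    return SURROGATE_KEYS[(system, customer_key)]

def transform(record):
    """String the simple processes together to produce a warehouse-ready record."""
    return {
        "customer_sk": lookup_surrogate(record["system"], record["customer_id"]),
        "sale_date": to_date(record["sale_date"]),
        "quantity": to_int(record["quantity"]),
    }

if __name__ == "__main__":
    raw = {"system": "BILLING", "customer_id": "C0042", "sale_date": "01122000", "quantity": " 12"}
    print(transform(raw))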
Loading

The loading of data into the data warehouse is one of the most problematic issues with ETL products. Some products operate by physically inserting each record as a row into the target table in the warehouse. This involves an SQL insert statement, usually embedded in the tool itself. Some products take this further and operate on a kind of "sausage machine" basis. They link the extraction, transformation, and loading processes together for each source system record. This means that a record is read from the source system, transformed, and written into the data warehouse; then the next record is read from the source system, transformed, and written, and so on. This type of approach is fine when we do not have large volumes of data to be transferred. When we do, this kind of product will usually become a bottleneck. Vendors of such products will point to multiprocessor hardware architectures as the salvation, but be very careful before selecting one of these types of products if you do have large volumes of data. With large volumes we need to be thinking about using the loading tool provided by the RDBMS vendor, preferably bypassing SQL and with the recovery processing switched off. Also, external sorting using a high-performance sort (see "Sorts" at the end of this chapter) can greatly assist us in the quest for performance.

In operational terms, there are two kinds of ETL product. The first is as we discussed in this section. They consist of a kind of run-time environment that has to be started. Once up and running, they often come under the control of their own internal scheduler. They look for files in source systems; initiate extract, transformation, and load processes; handle errors; and issue alerts. These are highly functional systems, but they do come with some drawbacks, notably performance. The second type, much less common, enables us to "design" our extraction, transformation, and load in much the same way as before, but this time the ETL product generates program source code. The source code can be in many languages (e.g., C, Cobol, etc.). Once this is done, we have to compile the generated programs into executable code using our standard compiler. These systems are usually much more efficient to run than the first type, but they too have their problems. Whereas the first type of product makes it relatively easy to alter the processing dynamically, such as changing input or output mappings, the second type means regeneration of source code and recompiling. In a live environment, this can be problematic, especially if there are strict rules about checking programs in and out of the version control system. Another problem is that a support programmer, when called out overnight to fix a problem, might be tempted to alter the program source code and recompile it. This is potentially disastrous, as the code has been generated from the ETL product. The correct way to make changes is within the confines of the product. Then the new source code can be generated, compiled, and executed. If the support staff do not know the ins and outs of the ETL product, they may not feel competent to do this and take the shortcut approach.
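As a sketch of the bulk-loading pattern suggested above, the transformed data is written to a flat file in the staging area and then handed to the RDBMS vendor's loader rather than inserted row by row. The table and file names are invented, and the COPY statement is shown only as a PostgreSQL-style example; other engines have their own utilities (SQL*Loader, bcp, and so on), usually with logging and recovery options that can be relaxed for speed.

import csv

# Invented example: summary rows produced by the transformation stage.
rows = [
    (1001, "2000-12-01", 12),
    (1002, "2000-12-01", 3),
]

# 1. Write the transformed data to a flat file in the staging area.
with open("daily_sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# 2. Hand the file to the RDBMS bulk loader rather than issuing row-by-row INSERTs.
print("COPY daily_sales (customer_sk, sale_date, quantity) "
      "FROM '/staging/daily_sales.csv' WITH (FORMAT csv);")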
OLAP

The term on-line analytical processing, or OLAP, was first used in the early 1990s. We could be forgiven for imagining that this was when so-called OLAP products first started to appear. Not so. All OLAP products are founded on the principle of the dimensional model that was described and discussed at length in the early chapters of this book, and analysis products based upon dimensional models have been around since the 1970s. In the mid-1980s there was quite a strong set of competing products although, at the time, they were generally being marketed under the heading of executive information systems (EIS). Some of the products that existed in those days have survived to today, through a certain amount of evolution and adaptation by new owners. OLAP is supposed to have delivered a kind of paradigm shift in that all business people should now be using sophisticated analysis tools to understand their businesses instead of relying on dedicated analysts to provide them with information. Oddly enough, that was precisely the objective of the much earlier EIS products.

A typical OLAP architecture consists of an OLAP server that sits between the data warehouse and the user. This is shown in Figure 10.2. The warehouse itself is normally dimensional in nature and can be relational, proprietary, or a combination of both, depending on the type of OLAP product chosen.

Figure 10.2. Typical OLAP architecture.
There are three main flavors of OLAP, and each OLAP product falls into one of the three:

MOLAP: Multidimensional OLAP. The term MOLAP is used to describe a set of products that utilize a proprietary database management system. By this we mean that the database does not use the relational, or any other standard, as its underlying database. These systems often have a spreadsheet basis, much the same as described in the introduction to Dot Modeling in Chapter 5. There are two main problems with these products. The first concerns their ability to scale up. Generally, we have to be able to tell the system, at the outset, how big the dimensions are likely to be. This can be done by simply entering, say, the number of customers, products, etc., that we need the system to be able to hold, or by specifying a range of identifiers and allowing the system to deduce the size of the dimension itself. Once the system knows the size of each dimension, it simply multiplies all the numbers together, and the result is the number of spreadsheet cells that it needs to make room for in its matrix. For example, if we have two million customers and 1,000 products and we want to store 10 years of daily sales, then the matrix would contain:

2,000,000 customers × 1,000 products × 3,650 days = 7,300,000,000,000 cells

The result is that the position of individual cells can be calculated from the value of the dimensional identifiers. This makes these systems very fast indeed when it comes to accessing the data and, of course, there are no expensive joins to be performed. However, there is a kind of implicit assumption that we intend to sell every product to every customer every day. As attractive a proposition as this might be, most would agree that it is a little unrealistic, and it tends to lead to what is known as sparsity. Sparsity is a measure of the unused cells in a multidimensional database. Often, the sparsity can be more than 90 percent. In the example above, that's over 20 terabytes on a 32-bit architecture for what is a fairly modest application. Most of these systems have sparsity reduction algorithms, but scalability remains an issue for these products. The second problem is that there are no standards regarding query languages. In relational databases, SQL has been the standard for a long time. It may have shortcomings, but at least we can be comfortable that SQL that works on one RDBMS can be made to work on most others with relatively little modification, so long as we have stuck to the standard features.
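To make the MOLAP addressing idea concrete, here is a toy sketch of dense multidimensional addressing. It is not any vendor's storage format; the dimension sizes simply mirror the example above.

# Toy illustration of dense multidimensional addressing (not any vendor's format).
N_CUSTOMERS, N_PRODUCTS, N_DAYS = 2_000_000, 1_000, 3_650

def cell_offset(customer_idx, product_idx, day_idx):
    """Compute a cell's position directly from the dimension identifiers - no joins needed."""
    return (customer_idx * N_PRODUCTS + product_idx) * N_DAYS + day_idx

total_cells = N_CUSTOMERS * N_PRODUCTS * N_DAYS
print(f"{total_cells:,} cells in total")                      # 7,300,000,000,000
print(cell_offset(customer_idx=41, product_idx=7, day_idx=99))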
ROLAP: Relational OLAP. The ROLAP solution is to locate the data model inside an RDBMS. There are minor variations in the data models depending on the vendor. Some allow a full snowflake schema while others restrict us to a star schema. Not surprisingly, each aggressively defends their solution against the other. We've had this discussion in Chapter 3. Basically, what we have to do is define the dimensions and the facts, and the ROLAP engine will create the underlying schema and provide a means of querying the model with a query tool. This approach appears not to suffer from the two problems associated with the standard MOLAP approach. First, because the data is stored, conventionally, in a relational database, the scaling and sparsity issues don't usually apply. The primary players in the RDBMS market are well sorted in terms of space management and, although they do tend to waste a lot of space, it is nowhere near as bad as the MOLAP approach. Second, the query language is SQL. The query tool that is integrated with the ROLAP product generates SQL so we don't have to. Don't be fooled into thinking everything is standard, because it isn't. While the underlying DBMS is relational and is, therefore, standard, the ROLAP bit is anything but standard. In the relational model, we can take the DDL (data definition language: create table statements, etc.) and use it, more or less, in another RDBMS and it will work. The same is not true of the statements and parameters used to build the ROLAP model. There is another thing, too. One of the big benefits of MOLAP is performance because, as we said, we can find records by simple calculation and there are no joins to be done. In a ROLAP solution this is not the case. We are back with a traditional-looking model with all its inherent performance issues. You might be wondering what the benefit is. If the ROLAP tool simply builds a dimensional model, whether a star or snowflake, and allows us to query it, why don't we just design our own dimensional model and just buy a query tool to analyze it (and save lots of money)? If you think of an answer, let me know.

HOLAP: Hybrid OLAP. This is the compromise solution. The idea behind HOLAP is that we can still get the benefits of a MOLAP architecture for the summary information, but when we need detail, the system will "reach back" into a relational database. So the performance gains are available, albeit only at the summary level, and the scalability issue can be resolved because all the detail is held in conventional relational tables.
There is a general issue with OLAP: its architecture is founded on the dimensional model. That means it is only useful for the behavioral part of the GCM. It cannot support a customer-centric model. This problem is somewhat exacerbated when we try to use an OLAP solution in our CRM system, because we want a fully integrated model of which we can ask questions about changing circumstances as well as behavior. If you find yourself in a situation where you need only behavioral information, then an OLAP solution, in one of its three main guises, might be suitable. If you need more, then be wary.
QUERY TOOLS

When data warehousing started to become popular, it was like a dream come true for the vendors of query tools. Until then there had been quite a few good-quality products, but they always had the feel of a solution that was desperately in search of a problem to solve. Data warehousing changed all that, probably forever. As soon as data warehouses appeared, there was an immediate requirement for good user interfaces to the database. Of all the various components of the data warehouse architecture, this is the most well served in terms of quality products. Nice-looking reports and fancy graphics are now taken for granted. There are several outstanding systems that provide the following kinds of features:

Standard reporting. We need the ability to design and retain standard reports. Although there is a huge requirement for the ad hoc style of query, the standard reports that we run day after day, month after month remain staple requirements.

Ad hoc reporting. The ability to ask fairly complex questions without having to code any SQL. This requirement to hide the code from the users is now so well embedded that it has become a standard feature of most query tools. In fact, it is now fair to say that any tool requiring knowledge of SQL cannot survive in the market.

Batch reporting. It is well worth questioning your supplier closely about this. The ability to run reports, say, overnight is as important as the requirement for standard reports. Most products do allow for this, but it's often added as an afterthought and can be quite clunky in operation.

Distribution of reports. We might want several copies of a report e-mailed to different people. The list of people might be dynamic, and so the list management needs to be easy. Some people might want paper; others might want key information delivered via an active WAP (Wireless Application Protocol) link. We might want a report to be split up (e.g., each customer account manager getting only their own customers' reports). Again, this is a feature that might not be as well developed as the glossy graphical stuff.

Drill down. This is now a fairly normal requirement that most query tools handle quite well. Drilling down is the ability to add more detail. For example, we might be presented with a single value for the sales to a particular customer. A drill-down facility would break the figure down by, say, product categories, time period, or both.

Business semantics. This is the ability to present the data to the users in a language they understand. As an example, the data in the database might be called "SL_OR_VL" whereas the query tool would present the more friendly "Sales Order Value" to the user. This approach requires a layer of metadata to be established between the user and the database that converts what the user has selected into something the RDBMS can comprehend (a small sketch of this idea appears at the end of this section). This facility was pioneered in the early 1990s but is now fairly common in query tools.

Web-based version. This is also becoming a fairly standard requirement. The "thin" client approach to deployment of end-user software means that, increasingly, we don't want to have to place large software systems on the desktops of our users. The maintenance costs are far too high. So there is an increasing tendency to develop everything such that it can be accessed through a browser.

Summary awareness. Where we have taken the trouble to build summary-level tables for our customers to aid performance, it would be nice if the tools that we use to access the data could be made aware of them. The software can then decide, at run time, which table is likely to provide the best response time.

It is very easy to be overly influenced by the really neat demonstrations that the vendors of these products are so clever at presenting. Some of the presentations are mind-blowingly impressive. I have tried to highlight some of the areas where there are questions that, after you have purchased the product, you might wish you had asked.
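Here is the small sketch of the business-semantics idea promised above: a mapping layer that translates friendly names into physical column names before SQL is generated. The table and column names are invented for illustration.

# Invented semantic layer: business names on the left, physical warehouse columns on the right.
SEMANTIC_LAYER = {
    "Sales Order Value": "sales.sl_or_vl",
    "Customer Name": "customer.cust_nm",
    "Order Date": "sales.or_dt",
}

def build_query(selected_items, table_clause):
    """Turn what the user picked into SQL the RDBMS can understand."""
    columns = ", ".join(SEMANTIC_LAYER[item] for item in selected_items)
    return f"SELECT {columns} FROM {table_clause}"

print(build_query(["Customer Name", "Sales Order Value"],
                  "customer JOIN sales ON customer.customer_id = sales.customer_id"))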
DATA MINING

Data mining has a kind of magical reputation: simply point a data mining tool at a database, and it will reveal intriguing and unlikely patterns of behavior. There is folklore building up regarding the serendipitous nature of data mining. The reality is that, while data mining can indeed uncover hitherto unknown relationships in the data, there is nothing serendipitous about it at all. It is a systematic, interactive, and iterative process. For its success it relies on the business expertise of those interpreting the results of its analysis. An apparently useless pattern in the data can be transformed into valuable information if the right business interpretation can be applied. Conversely, a seemingly significant relationship can be explained and disregarded when business experience and commercial knowledge are used. There are two principal types of data mining:

Hypothesis verification. This is where we might have some idea or hunch about the significance of a relationship between data elements. For instance, someone working in the marketing department of a supermarket retail chain might own a dog. Dogs are notorious for shedding fur in the summer months, and this takes some effort in cleaning up. Therefore, a reasonable hypothesis might be that dog owners will consume more vacuum cleaner bags than the average customer. It might be to the advantage of the company's store managers to place a second rack of vacuum cleaner bags near the dog food. The marketing executive could run some tests to prove (or maybe disprove) this hypothesis.

Knowledge discovery. This is where there may exist hitherto unknown statistically significant relationships between elements of data that no human is likely to deduce. These are the cases that can bring real windfall rewards.

Before embarking upon a mining project and, therefore, investing in an expensive data mining product, it is worth reviewing whether or not we are likely to get some value out of it. First, in order to do data mining, the one thing we absolutely do need is data. Another issue is that the data has to be of good quality and it really should be properly integrated. One of the problems encountered in data mining is poor-quality data, and the risk associated with this is that the errors in the data might affect the results of the mining process. This is known as "noise" in data mining. Some products claim to be able to handle up to 50 percent noise in data. Nevertheless, the greater the noise, the less accurate the predictions are likely to be. If the data mining exercise forms part of our CRM project, then the most sensible source of data would be the data pool. Another popular misconception about data mining is that it needs lots and lots of data. This is not true, but what is important is that the data should be truly representative and not skewed one way or another. As long as the range of possible outcomes is covered by the data, good results can be achieved with relatively small volumes. Some of the major data mining products have evolved from statistical analysis systems and work best with their own proprietary file management systems. Although most now also work with standard
ASCII files, it is worth checking that they can also access relational tables. If not, or if they approach it in a very inefficient way, then we might be faced with having to perform expensive extractions and sorts in order to get the data into a form that the data mining product can use. Also, when a vendor says their system can read ASCII files, check whether they mean fixed-length records, comma-delimited records, etc. One other thing: if we are allowed to change the field delimiter from a comma to something else, that is very useful. Comma delimiters are fine so long as our fields don't naturally contain commas (e.g., in addresses and in numeric values). Here are some of the features of data mining products:

Distributions. How data values are distributed across the value domain is an analysis that is performed very often in statistics. There is a slight difference in the ways in which numeric data is analyzed as opposed to descriptive data. A simple distribution using descriptive information would be between, say, males and females or between geographical regions. The analysis of descriptive data tends to result in a distribution based on absolute set values, as the chart in Figure 10.3 shows.

Figure 10.3. Descriptive field distribution.
Due to the wide range of numeric values, it is not appropriate to use a set-based distribution chart such as the one in Figure 10.3, because it might end up being several hundred pages long. Having said that, we could, of course, imagine an analysis based on a descriptive field that also could end up being very long. We might, if we were daft enough, conduct an analysis on, say, address. Unfortunately, these systems won't prevent us from doing silly things. Anyway, with respect to numeric fields, there are a couple of techniques that are common and should be supported by most products. The first of these is the histogram distribution, which is shown in Figure 10.4.

Figure 10.4. Numeric field distribution using a histogram.
The histogram shows the number of people enjoying salaries at a particular level. In practice, each vertical bar on the chart represents a range of salaries. The other main type of distribution for numeric data is just referred to as statistics. This gives a more detailed analysis of numeric fields such as the mean, standard deviation, etc., but it does not give as detailed a distribution as the histogram. Here is an example:

Statistics for field: Salary
  Minimum                23,998
  Maximum                67,290
  Occurrences            98,276
  Mean                   30,010
  Standard deviation      1,774
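This style of output is easy to reproduce for any numeric field; a minimal sketch using Python's standard statistics module is shown below, with made-up salary values.

import statistics

salaries = [23_998, 27_500, 29_250, 30_010, 31_400, 67_290]  # made-up sample values

print("Statistics for field: Salary")
print("  Minimum            ", min(salaries))
print("  Maximum            ", max(salaries))
print("  Occurrences        ", len(salaries))
print("  Mean               ", round(statistics.mean(salaries)))
print("  Standard deviation ", round(statistics.stdev(salaries)))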
Relationships. When searching for relationships between data, the same issues regarding descriptive and numeric fields apply. For relating descriptive fields to other descriptive fields, it is common to produce a kind of cross-tabulation matrix. Table 10.1 is an example.
Table 10.1. Example of a Field Matrix

               Males %    Females %
  North East     17.67        16.22
  North West     22.12        21.87
  South East     41.23        40.55
  South West     18.98        21.66
Notice that, in this example, the percentages add up down the column showing the spread of the genders across region. Some products enable this to be switched around so that we could see the percentages adding to 100 across the rows. This would show the spread of males versus females in each region. Another way of relating descriptive fields together is to use a web plot. A web plot example is shown in Figure 10.5. Figure 10.5. Web plot that relates gender to regions.
The density of the line between the two nodes indicates the relative strength of the relationship between the two items. The thicker the line, the more significant the relationship. Apart from histograms, numeric information can also be displayed on scatter plots. The density of the plots indicates the significance of the relationship between the data elements. Neural networks. The neural network attempts to emulate the human brain, albeit in a very simplistic fashion. In a nutshell, this is how we use them: First, we should present the system with a set of data where the outcome is already known. Let's say that we want to focus our sales effort on a particular wine in the Wine Club catalog. One way to do this is to select out of the data warehouse a set of customers' details,
behavior, and circumstances, including some customers who have already purchased this wine. Now we present this data to the data mining system. We have to tell it what it is we are trying to do, insofar as there needs to be an indicator such as "Bought the Wine," yes or no. We also have to tell the system that this indicator is our target, the field that we will be asking it to predict. The data mining system then builds a neural network. It does this by relating every input field to every other input field. By repeatedly relating fields together and calculating "weightings" for each field, the system builds up a profile and makes a prediction. The prediction is compared to the actual result and, if it is within a predefined tolerance, the process is finished; otherwise the system keeps trying. The system should provide the facility for us to intervene and set its weightings to help it along. If we think we know that a particular field is more relevant than others, then we can increase the weighting. An example might be that the wine is quite expensive, and so a customer's salary is likely to be a significant field. Similarly, we don't want the system wasting too much time on low influencing factors, such as, say, the customers' hobbies. Once the system has "learned" the relationships within the data, we present it with another set of data where we know the outcome but it doesn't. We can then compare its predictions with what happened in real life. If it is sufficiently accurate for our requirements, we can proceed to the next stage. If not, we return to the previous stage. When the system has come up with an acceptable level of successful prediction, we can apply the predictive algorithm to the unseen data. Unlike the data that is used to "train" and test the system, with unseen data the outcome is not known. If the training has been successful, this should result in a list of customers who have a high likelihood of wanting to purchase this product.

Rule induction. The rule induction technique is very much simpler than neural networks and works on the principle of a predefined set of decision points known as rules. In reality they are like a complex decision tree where you start at the top with a question and, depending on the answer, traverse one of the branches to the next level and another question, and so on. The important thing to remember here is that the rules have to be defined by people with extensive business knowledge. The purpose of the rules-based approach is to attempt to emulate, electronically, the decision behavior of business experts. A simple example of rule induction is shown in Figure 10.6.

Figure 10.6. Rule induction for wine sales.
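The train, test, and apply-to-unseen-data cycle described for neural networks above can be sketched with an off-the-shelf library. The example below assumes scikit-learn and NumPy are available and uses synthetic data with invented field meanings; it illustrates the workflow, not any particular mining product.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic training data: imagine columns [salary, bottles_per_year, years_as_member]
# per customer, and a target flag "bought the wine" (1 = yes, 0 = no).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Train on data where the outcome is known, holding some back to test the predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# Compare predictions with what actually happened; only then apply the model to unseen customers.
print("Accuracy on held-back data:", accuracy_score(y_test, model.predict(X_test)))
unseen = rng.normal(size=(5, 3))
print("Predicted buyers among unseen customers:", model.predict(unseen))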
Another thing to remember about predictive models is that, where possible, we should always try to use more than one so that, in effect, they corroborate each other's predictions. So, when choosing a data mining product, it is probably worth selecting one that has a reasonably wide range of functions.
CAMPAIGN MANAGEMENT

In discussing campaign management products, it is worth reiterating a point that was made in Chapter 1, which is that the most difficult part of campaigning is figuring out which customers, or prospects, we should be targeting in our campaign. This has absolutely nothing to do with campaign management systems. Most campaign management systems on the market do not provide much support in this area. For this we generally have to look to our query tools and data mining tools. The users have to provide the fulfillment agency with the names and addresses that the campaign applies to. Generally, the individuals are not held inside the campaign management system. Usually, it is the fulfillment agency that sends the treatments to the customers (a "treatment" can be anything: discount pricing, store voucher offers, telemarketing, mailers, etc.). Similarly, it is usually the fulfillment agency that handles the responses from customers. Most campaign management systems are good at helping us develop the campaign itself: the design (treatments, responses, etc.), budgeting, identifying the overall number of people to be contacted, size of control groups (make sure yours handles control groups), etc., but they often don't help us at all with the CRM aspects.

Campaign management systems are often sold as the cornerstone of a CRM strategy whereas, in reality, they usually offer nothing to help us in this area. They are campaign management systems; they are not customer management systems. There are some exceptions, particularly the newer systems. When talking to vendors, we should question them closely about how much customer information is held in the campaign management system, or whether they expect the customer information to be held in the data warehouse. It could be argued that the warehouse absolutely is the right place to hold the customer information, and there are advantages in this, too. For instance, it is then possible for other systems to be aware of which campaigns each customer has been targeted with. That being the case, we need to reassess precisely what value a campaign management system brings us. They don't even have a process for handling the responses from customers. If we were to send 10,000 customers a particular offer, the campaign management system would have the ability to record how many of those customers responded and which options were taken up, but only in total. Our customer-centric data warehouse will never find out how individual customers responded to the campaign. If we want to collect this priceless information about our customers, guess what? We have to build a system to capture it. So bear in mind that most campaign management systems have everything to do with campaigns and virtually nothing to do with customers and, therefore, nothing to do with CRM.

There are three main types of campaign that all campaign management systems should be able to handle:

Single-phase campaigns. These are kind of one-off, special-offer types of campaigns. They often have a single treatment. Most campaign management systems can cope with this, as it could be seen as just a standard multiphase campaign that has just one phase.

Multiphase campaigns. These are generally much more complex, with each phase potentially consisting of many different treatments and response options. Each response triggers another treatment, and the treatments might vary within a single phase.
In the example of a multiphase campaign in Figure 10.7, we see an initial offer of a free bottle of wine. The customer is able to respond by accepting either red wine or white wine. Depending on the response, the next treatment will be for a case of the same wine at a discounted price. The marketing manager is experimenting with different discounts. This is quite common because sometimes the smaller discounts are accepted with just as much enthusiasm as the larger discounts. This provides the marketing manager with valuable information for future campaign strategies. It is important when choosing a campaign management system that this flexibility in the design of campaigns can be accommodated.
Figure 10.7. Example of a multiphase campaign.
Repeating campaigns. These are campaigns that, as the name suggests, are repeated over time. An example of a repeated campaign is the registration “service” that is offered when we buy a product. These can be a good way of gathering more information about our customers that might help in constructing a profile about them. There is a significant difference between the first two types of campaign and the third. The first two are targeting a single group of customers, in other words, a segment. This is a list of customers who might exist in our derived segments component of the GCM. With multiphase campaigns, the later phases merely target a subset of the original list. So the main component of single-phase and multiphase campaigns is the list of customers. Contrast this with repeating campaigns. Each time the campaign is executed, we are targeting an entirely different list of customers. These campaigns do not generally consist of a list of customers but instead consist of a set of selection criteria, what we might call a query, and one of the major criteria is a relative time period. We want to target all customers that purchased a certain product within, say, the past two weeks. In order to do this we need to repeatedly query the behavioral part of the GCM as the following query shows:
SELECT c.name, c.address
FROM   customer c, sales s, product p
WHERE  c.customer_id = s.customer_id
AND    p.product_name = 'Case of Barolo'
AND    p.product_id = s.product_id
AND    CURRENT_DATE - s.date <= 14    -- purchased within the past two weeks

It is worth questioning the product vendor to make sure they recognize and can cope with these varying requirements.
PERSONALIZATION

Personalization, in itself, is not a new thing. Recall the value propositions we discussed in Chapter 1. Customer intimacy and personalization go hand in hand. However, there is a practical limit to the amount of real personalization that we can engage in successfully. The main reason for this, of course, is that real personalization requires people. It is reasonable to suppose that a personal account manager can expect to be able to cope with anything from 1 to, say, 50 customers and still maintain a degree of personalization. This is fine so long as we only have a small number of customers in total, or where there are only a small number within the segment whose lifetime value expectations justify their being treated in this way. If, however, we have maybe five million customers, that would take 100,000 personal account managers if we wanted to allocate one to each customer but not exceed the limit of 50 each. Even if the account managers cost no more than $50,000 per annum, that would amount to an annual bill of $5 billion. So if we want to take a personalized approach to all customers, we have to look to technology for the solution.

This is, by definition, not real personalization but a kind of virtual personalization. We are pretending to the customers that we really know who they are, what their preferences are, and how we can best serve them. However, it is the systems that know these things, not individuals. The customers are completely at the mercy of the systems, and the systems, likewise, are completely dependent on the information that we have stored about the customers. If the information is accurate and up to date, and if we have had the time to gather enough of it, then there is a good chance that we can keep the illusion going. If any of those things are not true, then our customers will become disillusioned and will leave us. It is generally better not to have a personalization system at all than it is to have one that contains flawed information.

The "personalization" layer of software provides a very thin veneer between the customer and the customer information that is held in our warehouse. It is, in many respects, similar to campaign management systems in that its purpose is to present customers with offers of goods and services. The principal application for personalization systems has been on the Internet, although there is no reason why that has to be the case. However, the Internet does provide the ideal place to utilize a personalization system, since the customer, or prospect, is already in a software-driven environment. In order to be of any real use, the system has to be able to identify the customer so that it can retrieve any information that it has previously stored. Armed with the past history, the system can actively lead, or attempt to lead, the customer along a path that has been determined by the marketing team until the user is presented with an offer to buy. Identifying users is normally achieved in one of two ways. The first way is to get them to register their name, address, etc., and to provide them with a password. Many customers, quite reasonably, object to this approach. The other way is to store information about the customer on their own machine, in the form of a cookie, and search for it when they access the Website. Once we have their identifier, we can access their records in the database and we then know all about them. The best way to get the information in the first instance is to persuade them to buy something.
That way, they have to give us their name, address, and credit card details. We can persuade them to buy by allowing a large discount on whatever it is they are showing interest in. This
does require "active" monitoring of their use of the Website. If, for instance, they inquired about a particular wine a second time, we could respond with, "Try this wine for $1.00." We can't let them have it for free because we need that all-important credit card number. Check very carefully the amount of real functionality that comes with the personalization product. These products tend to be very expensive and yet often consist of little more than a library of functions that we can call through an application programming interface (API). Also, check what comes with the database. You can bet your last dollar that the database of information, the foundation of the whole system upon which ultimate success or failure might depend, is your responsibility.
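A minimal sketch of the cookie-based identification described above, using Python's standard http.cookies module; the cookie name and the stand-in customer table are invented.

from http.cookies import SimpleCookie

# Invented stand-in for the customer information held in the warehouse.
CUSTOMER_PROFILES = {"12345": {"name": "J. Smith", "favourite": "Barolo"}}

def personalise(cookie_header):
    """Identify the visitor from their cookie and tailor the offer accordingly."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    customer_id = cookie["customer_id"].value if "customer_id" in cookie else None
    profile = CUSTOMER_PROFILES.get(customer_id)
    if profile is None:
        return "Welcome! Register or buy something so we can get to know you."
    return f"Welcome back {profile['name']} - try a case of {profile['favourite']} at 10% off."

print(personalise("customer_id=12345"))
print(personalise("customer_id=99999"))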
METADATA TOOLS

At the last count, there are three basic kinds of metadata:

Business semantic metadata. This is the kind of metadata that describes, in precise business terms, what each element of data actually means. Recall the discussion "What is a sale?" The business semantic metadata will define a sale as being, for instance: the quantity and total order value of an order for a single product by a customer, before discount. The description should leave no room for ambiguity. It must be possible to define every entity, relationship, and attribute in the system.

Transformational metadata. This describes all the properties of the data element and includes:

  Source system from which the element was extracted
  Frequency of extraction
  Changes that have been made
  Level of retrospection that applies
  Dependencies on other data elements

Again, this information should be held for every piece of information in the data warehouse.

Navigational metadata. This is the sort of metadata that comes with most products. It is metadata that enables the product to operate properly. For instance, the product might need to store information about the location of its own files and programs, etc. All products claim to have metadata. This is because it is fashionable to have metadata, and vendors feel that it somehow enhances their products if the architectural diagrams can show a layer of metadata (check out the EASI data architecture in Figure 7.1!). One major RDBMS vendor used to have schema tables in which the tables, columns, users, etc., got stored. These schema tables are no longer schema tables; they are metadata, and they've got diagrams to prove it. This is a good example of navigational metadata.

When we in data warehousing refer to metadata, we generally mean the business semantic and transformational kinds of metadata.
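As an illustration of the transformational metadata listed above, the record held for a single warehouse element might look something like the following sketch. The field names are my own invention rather than those of any repository product.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TransformationalMetadata:
    """One entry per data element held in the warehouse."""
    element: str
    source_system: str
    extraction_frequency: str
    transformations: List[str]            # changes made on the way in
    retrospection: str                    # level of retrospection that applies (see Chapter 5)
    depends_on: List[str] = field(default_factory=list)

sales_value = TransformationalMetadata(
    element="sales.sl_or_vl",
    source_system="ORDERS",
    extraction_frequency="daily",
    transformations=["numeric-to-decimal", "currency conversion to USD"],
    retrospection="true",
    depends_on=["orders.order_value", "exchange_rates.rate"],
)
print(sales_value)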
Active Versus Passive Metadata

If we are thinking of purchasing a metadata tool, then it is worth considering the difference between active and passive metadata management systems:

Active metadata management. This is where the metadata management is part of the data warehouse operation. This means that, in order for us to process any item of data, we have to register that data with the system. Some ETL products include a set of metadata functions within the product. However, these tend to be a combination of transformational and navigational metadata. Typically these products are not strong on business semantic metadata.

Passive metadata management. These are separate products that sit outside of the system. They do not participate in the actual operation of the system. Typically, these products are stronger in the area of business semantic metadata than the active metadata management products.

The active metadata management systems have the distinct advantage that they are more likely to be complete and more likely to be kept up to date, because any changes that we make to the data or to the processes will be recorded automatically in the active repository. With passive systems, there is a reliance on the users of the metadata management system to complete it initially and, subsequently, keep it up to date. Experience of passive documentation systems shows that there is usually an initial enthusiasm that tails off pretty quickly. The result is that these systems are rarely kept up to date.
SORTS

I have included a small section on sorts in this chapter on products because, in my experience, sorting is something often neglected by data warehouse designers, and yet it can have a profound effect on performance, particularly on summarizing and loading. Sorting inside the database is often very slow. Using sorting programs provided by the operating system can be up to four times faster. Using a high-performance sort utility can be 10 times as fast. There are also some other useful features that the better sorting utilities provide:

Record selection. This enables some records to be selected while others are ignored.

Summarizing. You can actually produce your daily, weekly, or monthly summaries using nothing more than the sort utility.

Reformatting. This enables required fields to be extracted, rearranged, and written out in a new format.

Not only can the utilities save us processing effort; they can also save development effort. We don't always have to design and build summarization software, for instance. So, instead of loading data and then building summaries using SQL, consider building the summary using a high-performance sort and loading the summary into the database using the RDBMS loader utility.
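The suggestion to build summaries outside the database and then bulk load them can be sketched as a sort-and-group pass over a flat file. A real implementation would hand the sort to a high-performance utility, but the shape of the job is the same; the record layout and file names are invented.

import csv
from itertools import groupby
from operator import itemgetter

# Invented input: one extracted sales record per line - customer_sk, product_id, quantity.
raw = [
    ("1002", "P01", 3),
    ("1001", "P01", 12),
    ("1001", "P01", 5),
    ("1001", "P02", 1),
]

# Sort by the summary key, then aggregate each group - no SQL involved.
raw.sort(key=itemgetter(0, 1))
with open("daily_summary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for key, group in groupby(raw, key=itemgetter(0, 1)):
        writer.writerow([key[0], key[1], sum(r[2] for r in group)])

# The resulting file can now be handed straight to the RDBMS loader utility.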
Chapter 11. The Future

So what happens next? In data warehousing, we often use history to help us to try to predict future behavior. In most other respects, we don't tend to pay much attention to the past.
What experience and history teach is this—that nations and governments have never learned anything from history, or acted upon any lessons they might have learned from it. —Hegel (1830)

The whole world is currently engaged in an Internet roller-coaster ride. It is clear that the Internet will change everything; the only question is by how much. The more developed part of the world is slowly but surely losing much of its manufacturing base to the emerging economies, and the Internet, with its tendency toward globalization, is helping to facilitate this migration. The developed world is now becoming more service oriented than manufacturing oriented. Entrepreneurs have already realized that products in themselves are no longer sufficient to satisfy customers' needs; it is products plus value-added services that are wanted in the more sophisticated economies. Those who are stuck in the old regime of manufacturing alone will find themselves competing in a market in which, increasingly, they cannot survive.

Part of our job is to help our customers to visualize the future in terms of their IT systems. This book has been about designing and building data warehouses for use in a CRM environment, and CRM is one of the value-added services that sophisticated consumers have come to expect. The question is: Just what does the future hold in this area? With the Internet changing everything, we might think that it is impossible to predict. However, if we keep focused on the data warehouse, the underlying data architecture may not have to change at all. Sure, the interfaces will continue to evolve. We can already access the data warehouse using our mobile phones, and Internet-enabled wrist watches will soon be here. However, the fundamental information structure need not change all that much. The physical implementation will almost certainly change: the Internet will facilitate the emergence of vast data centers consisting of hundreds of thousands, or even a million, linked servers operated by service providers, enabling outsourcing on a scale hitherto unimagined. But the data need not change all that much; our GCM still describes
the kind of information that we need to store in order to understand our customers. There are some things that are worth highlighting, and the rest of this chapter attempts a certain amount of crystal-ball gazing.
TEMPORAL DATABASES (TEMPORAL EXTENSIONS)

Throughout the 1990s, and perhaps even a little before that, a large community of academics spent a huge amount of time conducting research into so-called temporal databases. A temporal database supports some aspect of time. We might think that all modern RDBMSs support time because of the DATE/TIME data type, but this is not what temporal databases are about. The DATE/TIME data type is a kind of user-defined time support. A real temporal database has time support within the structure of the DBMS: the schema tables, the query processor, etc., all have to have a fundamental understanding of time. We need to be able to ask questions like: How many people move to a larger house within one year of their salary exceeding $50,000 for the first time?

The issues surrounding time are thoroughly documented in Chapter 4 of this book. The importance of time in data warehousing cannot be overstated. Data warehouses are temporal databases, but without the support of any real temporal DBMS. The research community has failed, after many years of research and thousands of papers, to come up with any meaningful solution. Part of the problem, in my view, is a failure in principle to accept that time is a fundamental attribute, another dimension. Almost all the research seems to be focused on how we might modify the relational model to accommodate time. On each occasion that some luminary presents a paper describing how the relational model might be adapted, another presents a counterargument showing how that adaptation violates one or more of the rules that comprise relational theory. There are also proposals regarding potential extensions to the SQL language. These often include the incorporation of a "WHEN" clause, so that we can phrase a query to return rows not only "where" a condition is true but also "when" it was true. So far, no major RDBMS vendor has implemented these changes. There are some third-party "packages" available for some products that purport to provide some of these facilities, but their efficacy and scalability are unproven.

Perhaps the main problem is that relational tables are two dimensional. Time adds another dimension, and any attempt to force fit a solution is bound to fail. Oddly enough, given that data warehouses are temporal, I believe there may be a link between the multidimensional nature of data warehouses and the need to incorporate time as a dimension in other types of database. It may be that the data warehouse techniques we use today provide the key to this seemingly intractable problem. What we need is not a modified version of an RDBMS. We need a new model, as different from relational databases as relational databases are from network databases or object-oriented databases. It is to be hoped that this problem will be resolved in the not too distant future, but don't hold your breath.
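To make the point concrete, here is the kind of workaround that the lack of temporal support forces on us today: the question about salaries and house moves has to be phrased as explicit period arithmetic against history tables. Everything below is hypothetical; the table names, the valid_from period columns, and the simplifying "larger house" flag are assumptions, not features of any particular product.

-- Customers whose salary first exceeded 50,000 and who moved to a larger
-- house within one year of that date, written against explicit period columns.
SELECT COUNT(DISTINCT s.customer_code)
FROM   salary_history  s
JOIN   address_history a
       ON a.customer_code = s.customer_code
WHERE  s.salary > 50000
AND    s.valid_from = (SELECT MIN(s2.valid_from)            -- first time over 50,000
                       FROM   salary_history s2
                       WHERE  s2.customer_code = s.customer_code
                       AND    s2.salary > 50000)
AND    a.valid_from >  s.valid_from                          -- moved after that date...
AND    a.valid_from <= s.valid_from + INTERVAL '1' YEAR      -- ...and within one year
AND    a.larger_than_previous = 'Y';                         -- simplification: precomputed flag
-- A temporal DBMS with a WHEN-style construct would express the period logic directly.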
OLAP EXTENSIONS TO SQL

There is a set of proposed enhancements to SQL to enable some OLAP functions to be performed. The SQL standards authority is aware of the importance of data warehousing to industry generally and is keen to rise to the challenge that it brings to the state of database technology. I describe here the major proposals in outline form. The requirements and the proposed solutions have been well thought through; the details (syntax, etc.) are still being tested.

Moving averages. It will be possible to specify the number of rows to be included in the moving average. For instance, in a result set containing monthly sales, we can create moving averages for any number of months. We can even look ahead in the data to, for instance, create a three-month moving average that contains the current month (i.e., the current row pointed at by the cursor), the previous month, and the following month.

Aggregation groups. Whereas a moving average is a kind of "bounded aggregation" function, it is possible to define unbounded aggregations to enable cumulative (e.g., year-to-date) columns to be defined. A further development of this idea will enable what are described as "uncentered aggregation groups," which means that we can specify aggregation columns that don't include the current row at all. For instance, we could compare this month's sales with the average for the preceding three months.

Ranking functions. This has long been a bugbear in data warehousing. We often need to know who are the top "n" customers, products, salespeople, or regions. The rank function that is proposed will allow these to be specified.
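Since the syntax was still being finalized when this was written, the sketch below should be read as illustrative of the windowed-aggregate style the proposals point toward, not as definitive syntax; the monthly_sales table and its columns are hypothetical.

SELECT month_code,
       sales_value,
       AVG(sales_value) OVER (ORDER BY month_code
                              ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
           AS moving_avg_3_month,          -- centered three-month moving average
       SUM(sales_value) OVER (ORDER BY month_code
                              ROWS UNBOUNDED PRECEDING)
           AS cumulative_sales,            -- unbounded aggregation (e.g., year to date)
       AVG(sales_value) OVER (ORDER BY month_code
                              ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
           AS avg_prior_3_months,          -- uncentered group: excludes the current row
       RANK() OVER (ORDER BY sales_value DESC)
           AS sales_rank                   -- ranking: top "n" questions
FROM   monthly_sales;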
ACTIVE DECISION SUPPORT

Up till now, decision support has been a passive activity. The data warehouses that have been developed so far are of the passive variety: they tend to sit and wait for someone, or some application, to ask a question, and they react and provide an answer. An active decision support system (DSS) is one that knows what it is we are interested in. It is constantly searching for new information or changes to situations and, when it finds something that may be of interest to us, it lets us know.

This is not the same thing as "alerts." Alerts have been around for a long time. The obvious application for alerts is where a stock price is monitored: when it rises or falls by a certain amount, the alert system somehow informs us of the change and we can react accordingly. Active DSS is far more than this. An active DSS is one that actively looks for information on our behalf, once it knows what we are interested in. This can vary from the quite trivial, such as wanting to be kept informed about the responses obtained from a particular campaign, to the very complex, such as the political and economic situation in a particular region of the world where we have a manufacturing plant. Notice that the latter is a fairly general request to be kept informed. The sources of information might be news broadcasts, newspapers, and journals. This type of information has two properties that, hitherto, we haven't really considered: external data and unstructured data.
EXTERNAL DATA

Some data warehouses may already have access to external data. By this we do not mean external data that is imported into the warehouse during the extraction, transformation, and load process. What we mean is that there are links to external data that our users can reach out to seamlessly. Over time, this will become more and more popular until the point is reached where access to external data becomes a fundamental requirement. There are some industries, especially pharmaceuticals, where there are hundreds of research and epidemiological databases that companies can subscribe to. Some of these databases are simply huge, and it would be highly inappropriate to attempt to copy them into our own data warehouse. Somehow, we have to figure out how to enable this access without compromising the security and integrity of our own internal data.
UNSTRUCTURED DATA

Almost all the data that is currently held in data warehouses is what we call structured data. This means that the data is organized into rows and columns, with ordered data types, etc. Unstructured data is the kind of data that exists in documents, Web pages, journals, newspapers, etc. This data can be just as valuable as structured data. For instance, a severe drop or hike in the price of oil might have an effect on any of our customers that are highly sensitive to the price of oil, and any decision we might make about our dealings with those customers may well be influenced by such a change. The attraction of unstructured data is that it tends to become available very much more quickly than the massaged, sanitized, and structured version. The challenge is to figure out just how we obtain, interpret, and present the information to our users. Such data is also just as likely to yield valuable results from data mining as structured data is; a further challenge is to design the mining software that could undertake the task effectively. Of all the technologies currently available, the Extensible Markup Language (XML) is the most promising for providing a solution to the interpretation of unstructured data.
SEARCH AGENTS

We are all familiar with search engines. These are tools that take a string of input we enter and, depending on the search algorithm they use, present us with a set of Web page URLs that match all or part of the search string. Again, these search engines are passive in nature, in that they sit and wait to be told to do something. Active search agents will be constantly looking out for things that we are interested in knowing about. Each person, in the future, will have their own intelligent search agent that will probably have some cute physical representation and can talk to us in our own language. We can make it look how we want, and we can train it to know the things we are interested in. It may live on our desktop, in which case every time we power up the desktop it will whiz off looking for information. Alternatively, it may live on a network server and so will continue to work for us even when we are not working ourselves. It can communicate with us via the desktop, by mobile phone, or indeed by any other medium that becomes available.

These advanced search agents will be able to use the data warehouse's structured information as well as external databases and even unstructured information. In effect, they will become the active DSS facilitators. They will pull together all the disparate sources of information and present us with the results, appropriately distilled. Independent software vendors will produce plug-in modules that we can purchase to give our individual agents more intelligence in, say, interpreting company financial results. Ultimately, our pet agents will be able to buy and sell on our behalf, being capable of taking financial decisions.
DSS AWARE APPLICATIONS

Another big problem we face in data warehousing is the whole extraction, transformation, and load (ETL) issue. Having to take data from lots of different places, pass it through some nontrivial (VIM) processing, and load it into the warehouse comprises the major part of our development effort. Historically, the applications from which the data has had to be extracted have been entirely blind to the requirements of decision support. The relative success of data warehousing, or at least the success of the idea of data warehousing, should have convinced application designers of the value of decision support, and they should be beginning to understand the link between the data that resides in their applications and the information that can be derived from it. Indeed, one or two of the major ERP vendors have already begun to see the value in this. Unfortunately, so far the value they have seen relates more to the dollars they can make than to the value their customers might gain. Although this seems like a pejorative view, the evidence is there: there are many dissatisfied customers of data warehouses where the warehouse was provided as a kind of "bolt-on" addition, the ERP vendor having jumped on the data warehouse bandwagon without bothering to understand the business issues surrounding data warehousing. Be that as it may, in the future it is to be hoped that independent software vendors will be sensitive to the requirements of DSS when designing their systems. Who knows? We might even get some standards.
Appendix A. Wine Club Temporal Classifications

This appendix contains a more complete list of all the components of the Wine Club and the values for retrospection that have been assigned. Each component is classified by its type, which will be one of the following:
1. Behavior
2. Circumstances
3. Dimension
4. Relationship

The reason contains the business rationale for the choice as to whether the component has been assigned true, false, or permanent retrospection. Each entry below is listed as Component Name (Type, Retrospection), followed by the reason.

Color (Dim, Permanent): The colors are assumed to exist forever.
Time (Dim, Permanent): The days, once entered, will exist forever. Each year, a new "years-worth" of dates will be added, but their existence, thereafter, is permanent.
Hobby (Circ, Permanent): The hobby details, once entered, will exist forever.
Manager (Dim, True): The Wine Club wishes to monitor the performance of managers over time. So the history of a manager's existence is needed.
Customer (Circ, True): Customers may have more than one interval of activity. It is important to the Wine Club that it monitors the behavior of customers over time. There is a requirement, therefore, to record the full details of the existence of customers.
Region (Dim, Permanent): Wine-growing regions are not expected to cease to exist.
Sales (Beh, Permanent): Fact table entries exist permanently.
Sales_Area (Dim, False): Latest existence only is required. Sales areas may be combined, or split. Only the latest structure is of interest.
Supplier (Dim, False): Latest existence only is required. The supplier details are required, but no history.
Wine (Dim, True): Wines may have a discontinuous existence as far as the Wine Club is concerned. It is important to track the history of the existence of wine. Questions such as "How many wines do we sell today, compared to a year ago?" can be answered accurately only if we keep track of each wine's existence.
Color→Wine (Rel, Permanent): The color of a wine will not change over time. This is a permanent property of the wine.
Region→Wine (Rel, Permanent): The growing region of a wine will not change over time. This is another permanent property of the wine.
Supplier→Wine (Rel, True): The suppliers of wines will vary over time. One of the objectives of the Wine Club is to monitor the performance of suppliers with respect to the popularity and quality of the wines they supply. Where necessary, the club will switch suppliers for wines. So there is a need to monitor this relationship over time.
Wine→Sales (Rel, Permanent): The relationship of a particular sale to the wine involved in the sale will never change.
Manager→Sales Area (Circ, True): Managers do move from sales area to sales area. There is a requirement to monitor the performance of managers. Therefore, it is important to keep track of the history of their involvement with sales areas.
Sales Area→Customer (Circ, True): There is a requirement to monitor the performance of sales areas. As customers move from one area to another, therefore, we need to retain the historical record of where they lived previously, so that sales made to those customers can be attributed to the area in which they lived at the time.
Hobby→Customer (Circ, False): A customer's hobby is of interest to the Wine Club. Only the current hobby is required to be kept.
Customer→Sales (Beh, Permanent): The relationship of a particular sale to the customer involved in the sale will never change.
Time→Sales (Rel, Permanent): The relationship of a particular sale to the date involved in the sale will never change.
Color.Color_Code (Att, Permanent): Identifying attribute rule.
Color.Color (Att, Permanent): The color never changes.
Time.Time_Code (Att, Permanent): Identifying attribute rule.
Time.Day_Name (Att, Permanent): The value will never change.
Time.Week_End (Att, Permanent): The value will never change.
Time.Week (Att, Permanent): The value will never change.
Time.Month (Att, Permanent): The value will never change.
Time.Month_Name (Att, Permanent): The value will never change.
Time.Season (Att, Permanent): The value will never change.
Time.Year (Att, Permanent): The value will never change.
Sales_Area.Sales_Area_Code (Att, Permanent): Identifying attribute rule.
Sales_Area.Sales_Area_Name (Att, False): The latest value only is sufficient.
Manager.Manager_Code (Att, Permanent): Identifying attribute rule.
Manager.Manager_Name (Att, False): The latest value only is sufficient.
Hobby.Hobby_Code (Att, Permanent): Identifying attribute rule.
Hobby.Hobby_Name (Att, Permanent): The value will never change.
Customer.Customer_Code (Att, Permanent): Identifying attribute rule.
Customer.Customer_Name (Att, False): The latest value only is sufficient.
Customer.Customer_Address (Att, True): Requirement to analyze by detailed area down to town/city level.
Customer.Date_Joined (Att, False): The latest value only is sufficient.
Region.Region_Code (Att, Permanent): Identifying attribute rule.
Region.Region_Name (Att, Permanent): The value will never change.
Region.Country (Att, Permanent): The value will never change.
Supplier.Supplier_Code (Att, Permanent): Identifying attribute rule.
Supplier.Supplier_Name (Att, False): The latest value only is sufficient.
Supplier.Supplier_Address (Att, False): The latest value only is sufficient.
Supplier.Supplier_Phone (Att, False): The latest value only is sufficient.
Wine.Wine_Code (Att, Permanent): Identifying attribute rule.
Wine.Wine_Name (Att, False): The latest value only is sufficient.
Wine.Vintage (Att, True): There is a requirement to analyze popularity of wine by vintage. The vintage of a wine changes from time to time, approximately yearly.
Wine.ABV (Att, False): The latest value only is sufficient.
Wine.Bottle_Price (Att, True): There is a requirement to analyze popularity by price ranges and to determine how changes in price affect popularity.
Wine.Case_Price (Att, True): There is a requirement to analyze popularity by price ranges and to determine how changes in price affect popularity.
Wine.Bottle_Cost (Att, True): Requirement to analyze changes in cost versus revenue.
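The Wine entry above motivates existence tracking with questions like "How many wines do we sell today, compared to a year ago?" As a hypothetical illustration in the style of the logical model in Appendix C (the existence column names are assumptions), such a question reduces to counting rows whose existence period covers a given date:

-- Count the wines whose existence period covers a chosen date.
SELECT COUNT(*)
FROM   Wine
WHERE  Wine_Exist_Start <= DATE '2000-06-30'
AND    (Wine_Exist_End IS NULL OR Wine_Exist_End >= DATE '2000-06-30');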
Appendix B. Dot Model for the Wine Club
This appendix contains a fuller set of dot model worksheets for the Wine Club case study database. The first worksheets contain the behavioral models. This first worksheet contains the data model for sales of wine.
This second dot model shows the customer behavior with respect to the sale of Wine Club accessories. In the main text, the number of dimensions was minimized for the purpose of illustrating the method. Here, we include some additional dimensions.
This model contains the customer behavior with respect to trips organized on behalf of the Wine Club. The model contains details about the trip, the tour operator, type of trip (e.g., coach tour) and the area visited (e.g., Burgundy region of France). Also, we record details of the party. The party members may become customer prospects! In addition, there are other products associated with trips, such as personal insurance and noninclusive activities.
Now we record details of some of the major entities—customer circumstances, dimensions, and segments that form the nonbehavioral part of the general conceptual model (GCM). We start, obviously, with the customer.
The fact usage worksheet helps us to define by which of the dimensions it is safe to sum the measurable facts. Monetary facts are generally OK, but beware of the issues with quantities: there is a risk of adding the quantity of apples to the quantity of oranges, for example.
In this example of sales of accessories, the quantities sold can be summed by product, since the product dimension groups by discrete units. But it does not make sense to sum quantities by, say, customer or supplier, because combining different units is meaningless. For instance, we might be adding "3 Promotional Sweatshirts" to "6 Goblets"; the sum of these is 9, but it is nonsensical. Averaging, minimums, and maximums are also usually not very sensible outside of the product domain, as the sketch below illustrates.
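A minimal sketch of the distinction, assuming a hypothetical accessory_sales fact table (the names are not taken from the worksheets):

-- Safe: quantities summed within a single product, so the units are comparable.
SELECT product_code, SUM(quantity) AS total_quantity
FROM   accessory_sales
GROUP BY product_code;

-- Not meaningful: summing quantities across products mixes units
-- (sweatshirts plus goblets), even though the SQL runs happily.
SELECT customer_code, SUM(quantity) AS total_quantity   -- beware: mixed units
FROM   accessory_sales
GROUP BY customer_code;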
The hierarchies and groupings worksheets record the business meaning of a relationship between two components.
Appendix C. Logical Model for the Wine Club

In this appendix we display the relational logical schema. The purpose of this is to show how the components fit together and especially to show the use of existence attributes. The domains have been omitted.

Relation Customer
    Customer_Code
    Customer_Name
    Hobby_Code
    Date_Joined
Primary Key (Customer_Code)
Foreign Key (Hobby_Code references Hobby.Hobby_Code)
Relation Wine_Supplier
    Wine_Code
    Supplier_Code
    Wine_Supplier_Exist_Start
    Wine_Supplier_Exist_End
Primary Key (Wine_Code, Supplier_Code, Wine_Supplier_Exist_Start)
Foreign Key (Wine_Code references Wine.Wine_Code)
Foreign Key (Supplier_Code references Supplier.Supplier_Code)
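To show the existence attributes in concrete form, here is one possible DDL rendering of the Wine_Supplier relation; the data types and the convention of a null end date for a current period are assumptions, not part of the logical model itself.

-- One possible physical rendering of Wine_Supplier (data types are assumed).
CREATE TABLE Wine_Supplier (
    Wine_Code                 CHAR(8) NOT NULL,
    Supplier_Code             CHAR(8) NOT NULL,
    Wine_Supplier_Exist_Start DATE    NOT NULL,  -- start of this supplier's period for the wine
    Wine_Supplier_Exist_End   DATE,              -- null while the period is still current
    PRIMARY KEY (Wine_Code, Supplier_Code, Wine_Supplier_Exist_Start),
    FOREIGN KEY (Wine_Code)     REFERENCES Wine (Wine_Code),
    FOREIGN KEY (Supplier_Code) REFERENCES Supplier (Supplier_Code)
);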
Appendix D. Customer Attributes

In this appendix we provide a comprehensive selection of customer attributes. There are five general classifications under which the attributes are listed:

1. Household and personal
2. Behavioral
3. Financial
4. Employment
5. Interests and hobbies

These attributes have been gathered from various sources over quite a long period, and it is hoped that they will prove to be useful when deciding what information is to be held about customers. In all, there are more than 750 individual attributes.
HOUSEHOLD AND PERSONAL ATTRIBUTES Address Line 1 Address Line 2 Address Line 3 Adults in household age 1 Adults in household age 2 Adults in household age 3 Adults in household age 4 Adults in household gender 1 Adults in household gender 2 Adults in household gender 3 Adults in household gender 4 Age Alternate N/A line 1 Alternate N/A line 2 Alternate N/A line 3 Alternate postal code Alternate start date Alternate stop date Apartment building Apartment in house Bank number
Behavioral type Belong to roadside assist Birth month Bought home Business trip 1 in 3 months Business trip 1 per month Business trip Europe or UK Business trip Florida Business trip in 2 to 3 months Business trip other foreign Business trip other USA Business trip rest of Canada Business trip southwest USA Business trip within province Cat Cellular phone Child 19 or older at home Children 1 female Children 1 male Children 1 unknown gender Children 2 female Children 2 male Children 2 unknown gender
Children 3 female Children 3 male Children 3 unknown gender Children in home Children under 1 female Children under 1 male Children under 1 unknown gender Children unknown Citizenship code Condominium Credit bureau ID Customer classification Customer type Date of birth Date of death Divorced or separated Dog Dwelling type Education Expecting baby no Expecting baby yes Family income Fax Number
Female Female classic Female comfort Female junior Female larger size Female misses Female none of above Female petite Female stylish trend Females 0 to 2 months Females 3 to 6 months Females 7 to 12 months Females 13 to 24 months Females 3 to 4 years Females 5 to 8 years Females 9 to 12 years Females 13 to 15 years Females 16 to 17 years Females 18 to 20 years Females 21 to 25 years Females 26 to 35 years Females 36 to 45 years Females 46 to 55 years
Females 56 to 65 years Females over 65 years First name Five people in household Four people in household Frequent flyer club no Frequent flyer club yes Gas oil card Gender Got married Grandchildren under 12 no Grandchildren under 12 yes Had baby Holiday travel package no Holiday travel package yes Home email Home mortgaged Home office Home office spouse Home owned outright Home ownership Home phone number Home rented
Home status Household size Household support child welfare Household support environmental concern Household support foreign relief Household support health relief concern Household support humanitarian Household support other Household support political Household support religious Language code Last contact date Last name Legal residence code Length of residence Life stage Loyalty indicator Mail indicator Make of primary vehicle Make of second vehicle Male Males 0 to 2 months Males 3 to 6 months
Males 7 to 12 months Males 13 to 24 months Males 3 to 4 years Males 5 to 8 years Males 9 to 12 years Males 13 to 15 years Males 16 to 17 years Males 18 to 20 years Males 21 to 25 years Males 26 to 35 years Males 36 to 45 years Males 46 to 55 years Males 56 to 65 years Males over 65 years Marital status Market segment Married Married or common law Middle name Model of primary vehicle Model of second vehicle Moved Moved in date
Never business trip Never personal trip No kids No others in household No pets in household Not belong roadside assist Number of adults Number of children Number of dependents One person only in household Other dwelling type Own a vehicle Own home Partner—college university grad Partner—education level Partner—high school graduate Partner—some college university Partner—some high school less Partner—some vocational tech Partner—vocational or tech grad Personal trip 1 in 3 months Personal trip 1 month Personal trip Europe or UK
Personal trip Florida Personal trip in 2 to 3 months Personal trip other foreign Personal trip other USA Personal trip rest of Canada Personal trip southwest USA Personal trip within province Postal code Preferred customer code Primary education level Primary source of income Real estate owned Rent home Residence length 1 to 5 years Residence length 6 to 10 years Residence > 10 years Retail dept card Salutation Single Single family house Six or more people in household Social security number Socioeconomic group
Solicitable code Survey participants indicator Three people in household Title Two people in household Want further mail no Want further mail yes Widowed Year of primary vehicle Year of second vehicle You—college university grad You—high school graduate You—high school less You—some college university You—some vocational tech You—vocational or tech grad
BEHAVIORAL ATTRIBUTES A few favorite brands Advertised specials many Advertised specials rarely Advertised specials some Another brand if favorite sold Bank by PC Internet Bank by phone Buy 35mm camera Buy CD player Buy CD ROM Buy dept store sale Buy favorite brand on sale few Buy favorite brand on sale many Buy favorite brand on sale some Buy IBM PC Buy Macintosh Buy mail order Buy nonfavorite brand few Buy nonfavorite brand many Buy nonfavorite brand most Buy nonfavorite brand some
Buy offer in mail Buy online services Buy other computer Buy other investments Buy product result TV Buy redeem prod coupon Buy satellite dish Buy sweepstakes Buy swimming pool Buy TV offer Buy used car 2001 Buy used card 2002 Buy VCR Coupon use frequency Coupon use number Customer card Customer card more than 2 Customer card 1 only Dislike bank Dislike far away branches Dislike high service charges Dislike inconvenient banking hrs Dislike long lines
Dislike noncomp interest rates Dislike slow responses Dislike unhelpful tellers Do not plan to buy vehicle Find coupons in direct mail Find coupons in newspaper Find coupons in store Find coupons in store flyers Find coupons other ways Grocery spending week Intent buy house Intent change jobs Intent get married Intent have baby Intent move Intent remodel home Intent retire Involved in investors group Involved in nothing Lease vehicle in 2001 Lease vehicle in 2002 Lowest price brand Lowest price brand few
Lowest price brand many Lowest price brand most Lowest price brand some Mail order books magazines 1 to 2 Mail order books magazines 3 Mail order books magazines none Mail order clothing Mail order clothing 1 to 2 Mail order clothing 3 Mail order gifts 1 to 2 Mail order gifts 3 Mail order gifts none Mail order other than 0 Mail order other 1 to 2 Mail order other 3 Mail order purchase Mail purchase Main bank May buy large family car May buy luxury car May buy medium family car May buy other vehicle May buy pickup truck
May buy small family car May buy sporty car May buy van station wagon Multibuyer Never used coupons New vehicle in 2001 New vehicle in 2002 New vehicle planned No coupons redeemed last month Not prefer bank via new technology Not use household cleaners One favorite brand One favorite brand few One favorite brand many One favorite brand most One favorite brand some Own 35mm camera Own boat Own CD player Own CD-ROM Own cellular phone Own exercise equipment Own fax machine
Own home security system Own Internet access Own jet ski Own large family car Own luxury car Own Macintosh Own medium family car Own motorcycle Own online services Own other computer Own other investment Own other vehicle Own PC Own pickup truck Own pool table Own recreational vehicle Own satellite dish Own small family car Own snowmobile Own sporty car Own stocks Own swimming pool Own vacation home cottage
Own van or station wagon Own VCR Plan change car in 6 month less Plan change car in 7 to 12 months Plan change car in 1 to 2 years Plan change car in over 2 years Plan to buy boat Plan to buy CD player Plan to buy CD-ROM Plan to buy cellular phone Plan to buy exercise equip Plan to buy fax machine Plan to buy home security Plan to buy Internet access Plan to buy jet ski Plan to buy Macintosh Plan to buy motorcycle Plan to buy PC Plan to buy pool table Plan to buy recreational vehicle Plan to buy snowmobile Plan to buy stocks Plan to buy swimming pool
Plan to buy vacation home cottage Plan to buy VCR Prefer bank via new tech Purchased books mail Purchased clothing mail Purchased gift mail Purchased mag mail Purchased other mail Shopping list always Shopping list many Shopping list rare Shopping list sometimes Store brand good value few Store brand good value many Store brand good value most Store brand good value some Travel frequency 12 plus coupons redeem last month Vehicle owned Visit another branch Visit nearby branch Wait for favorite brand sale Want compact car
Want large vehicle Want luxury car Want mid-sized car Want other vehicle Want pickup truck Want sport coupe Want sport utility Want station wagon Want van Watch baseball Watch basketball Watch bicycling Watch fishing Watch golf Watch hockey Watch hunting Watch motor boating Watch motorcycling Watch other sports Watch sailing Watch snow skiing Watch soccer Watch tennis
We use coupons weekly Week grocery less $50 Week grocery $50 to $74 Week grocery $75 to $100 Week grocery $100 Would not switch brand for coupon Would switch brand for coupon Would switch store if sale Would not switch store if sale
FINANCIAL ATTRIBUTES Assets—Cash Accounts Assets—other Assets—other home Assets—retirement accounts Assets—stocks and bonds Assets—vehicles Bank by new technology Car change plan Consolidated bank payment Consolidated mortgage payment Consolidated other payment Consolidated retail payment Credit cards Amex Credit cards Amex Gold Credit cards bank Credit cards Bay Credit cards Discover Credit cards Eatons Credit cards MasterCard Credit cards MasterCard Gold Credit cards none Credit cards Other Credit cards Sears
Credit cards Visa Credit cards Visa Gold Credit cards Zellers Current bank Current credit cards Diners Club Financial class Gas cards Canadian Tire Gas cards Chevron Gas cards Esso Gas cards Irving Gas cards Mohawk Gas cards petrocanada Gas cards Shell Gas cards Sunoco Gas cards Ultramar Gross annual income Gross income 20k or less Gross inc ann 20k to 39k Gross inc ann 40k to 59k Gross inc ann 60k to 79k Gross inc ann 80k to 99k Gross inc ann 100k to more High monthly income Home estimated value Home purchase amount
Home purchase date Household income less 20k Household income 20k to 40k Household income 40k to 60k Household income 60k to 80k Household income over 80k Household income no answer Income in thousands Investment Investment planned Liabilities—credit cards Liabilities—loans Liabilities—other mortgages Net worth Other income Other income frequency Other income source Primary income Primary income frequency Vehicle purchase plan Vehicle type considered
EMPLOYMENT ATTRIBUTES Blue collar Business email Business Phone Business run from home Changed jobs College student female College student male Employee code Employment status Full-time employment male Full-time employment female Full-time homemaker female Full-time homemaker male Government worker Home business Household home business no Household home business yes Household not employed female Household not employed male Household retired female Household retired male No business run from home Occupation
Occupation spouse Partner Occupation Partner—homemaker Partner—middle or upper manager Partner—office or clerical Partner—professional or technical Partner—retired Partner—sales or marketing Partner—self-employed Partner—student Partner—teacher or educator Partner—trades person Part-time employment female Part-time employment male Profession code Professional Retired Self-employment female Self-employment male Self-employed Self-employed spouse Upper management White collar Work address line 1 Work address line 2 Work address line 3
Work postal code Working woman Year employed at current job You—homemaker You—middle or upper manager You—office or clerical You—professional or technical You—retired You—sales or marketing You—self-employed You—student You—teacher or educator You—trades person
INTERESTS AND HOBBY ATTRIBUTES Antique collect Arts events Astrology Audio equipment Automotive work Avid book reader Bible reading Bicycling Boating Books on tape Bowling Cable TV Camping biking Career activities Casino gambling Catalog shopping CB radio CD player Charitable donation Collect antiques or art Collect coins
Collect die cast cars Collect dolls Collect figurines Collect plates Collect sports cards Collect stamps Collect other Community activity Cooking hobby Crafts Crossword puzzle Current affairs Daily newspaper Democratic contributor Diet weight control Dining out Electronics do-it-yourself Environmental issue Fashion designer cloth Favorite interest 1 Favorite interest 2 Fine art Fine arts antique
Fishing Fitness exercise Flower garden Gardening plants Golf Gourmet Grandchildren Grandparent Have a puppy Have horse Have kitten Have puppy Health improvement Health natural food Home decorating Home video games Home video recording Home workshop House plants Household pets Hunting shooting Interest in arts events Interest in canoeing
Interest in cross-country ski Interest in culture heritage Interest in curling Interest in downhill ski Interest in ethnic culture Interest in hockey club Interest in ice skating Interest in inground pool Interest in lottery Interest in snowmobiling Interest in travel Canada Interest in travel UK Language English Language French Long distance Mail order books Mail order clothing Mail order crafts Mail order flowers bulbs Mail order garden supply Mail order gift items Mail order magazines Mail order photo finish
Mail order other Military veteran Money making Motor biking National heritage Needlework knitting Not books on tape Outdoor gardening Own camcorder Own cat Own cellular phone Own dog Own home computer Own microwave Own vacation property Own VCR Personal computer Photography Plan to get horse Plan to get kitten Plan to get puppy Power boating Prerecorded videos
Racquetball Read art books Read art magazines Read astrology books Read astrology magazines Read best-selling fiction magazine Read best-selling fiction Read Bible devotional magazines Read Bible or devotional Read business financial book Read business financial magazines Read children book Read children magazine Read computer book Read computer magazines Read country lifestyle book Read country lifestyle magazines Read fashion books Read fashion magazines Read history books Read history magazines Read interior decorate bank Read interior decorate magazines
Read medical health book Read medical health magazines Read military books Read military magazine Read mystery books Read mystery magazines Read news current event bank Read news current event magazines Read romance books Read romance magazines Read science fiction books Read science fiction magazines Read science tech books Read science tech magazines Read sports books Read sports magazines Read urban lifestyle books Read urban lifestyle magazines Read western books Read western magazines Real estate investment Recreational vehicle Remodeled home
Volunteer activity Walking health Watching sports TV Wildlife conservation Wildlife issues Wines