Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions
Understand IBM DB2 Intelligent Miner Modeling, Scoring, and Visualization
Deploy data mining functions in today's business applications
Learn how to configure the advanced data mining functions
Corinne Baragoin Ronnie Chan Helena Gottschalk Gregor Meyer Paulo Pereira Jaap Verhees
ibm.com/redbooks
International Technical Support Organization Enhance Your Business Applications: Simple Integration of Advanced Data Mining Functions December 2002
SG24-6879-00
Note: Before using this information and the product it supports, read the information in “Notices” on page xvii.
First Edition (December 2002) This edition applies to IBM DB2 Intelligent Miner Modeling Version 8.1, IBM DB2 Intelligent Miner Scoring Version 8.1, and IBM DB2 Intelligent Miner Visualization Version 8.1.
Notices This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.
Trademarks The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Redbooks(logo)™ AIX® AS/400® Balance® DB2® DB2 Connect™ DB2 OLAP Server™
The following terms are trademarks of other companies: ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. C-bus is a trademark of Corollary, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC. Other company, product, and service names may be trademarks or service marks of others.
Preface

Today data mining is no longer thought of as a set of stand-alone techniques, far from the business applications, and used only by data mining specialists or statisticians. Integrating data mining with mainstream applications is becoming an important issue for e-business applications. To support this move to applications, data mining is now an extension of the relational databases that database administrators and IT developers use. They use data mining as they would use any other standard relational function that they manipulate.

This IBM Redbook positions the new DB2 data mining functions:
- IBM DB2 Intelligent Miner Modeling (IM Modeling in this redbook)
- IBM DB2 Intelligent Miner Scoring (IM Scoring in this redbook)
- IBM DB2 Intelligent Miner Visualization (IM Visualization in this redbook)

Part 1 of this redbook helps business analysts and implementers to understand and position these new DB2 data mining functions. Part 2 provides examples for implementers on how to easily and quickly integrate the data mining functions in business applications to enhance them. Part 3 helps database administrators and IT developers to configure these functions once to prepare them for use and integration in any application.
The team that wrote this redbook This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, San Jose Center. Corinne Baragoin is a Business Intelligence (BI) project leader at the International Technical Support Organization, San Jose Center. She has over 17 years of experience as an IT specialist on DB2 Universal Database (UDB) and related solutions. Before joining the ITSO in 2000, she worked as an IT Specialist for IBM France. There, she supported Business Intelligence technical presales activities and assisted customers on DB2 UDB, data warehouse, and Online Analytical Processing (OLAP) solutions. Ronnie Chan is a senior IT specialist with DB2 Data Management Software for IBM Australia. He is also a member of the Technical Leadership Council with IBM Software Group, Asia Pacific. He has 15 years of experience as a system engineer for large software organizations. He holds a bachelor of science degree with honors from the University of Salford in the United Kingdom (UK). His areas
of expertise include DB2 on distributed platforms, SAS, Business Intelligence, OLAP, and data mining. His most recent publication in the area of data mining is a paper entitled “Protecting rivers and streams by monitoring chemical concentrations and algae communities using neural network” for European Network for fuzzy logic and uncertainty modeling. Helena Gottschalk is a data mining consultant, working as an IBM Business Partner in Brazil. Previously she worked for a research institute, IBM, and Pricewaterhouse. She has 13 years of experience in applied mathematics, neural networks, statistics, and high performance computing, solving business needs. She has worked in many data mining projects in different industries, some in data warehouse environments. She has participated in several conferences and written many scientific and business publications. She holds a master's degree in electrical engineering from the Federal University of Rio de Janeiro (UFRJ). Gregor Meyer is a senior software engineer at the Silicon Valley Lab in San Jose. He has worked for IBM since 1997, when he joined the product development team for DB2 Intelligent Miner in Germany. Now, he is responsible for the technical integration of data mining and other Business Intelligence technologies with DB2. He represents IBM in the Data Mining Group (DMG) defining the Predictive Model Markup Language (PMML) standard for mining models. Gregor studied computer science in Brunswick and Stuttgart, Germany. He received his doctorate from the University of Hagen, Germany. Paulo Pereira is a Business Intelligence specialist at the Business Intelligence Solutions Center (BISC) in Dallas, Texas. He has over seven years of experience with customer projects in the Business Intelligence arena, in consulting, architecture design, data modeling, and implementing large data warehouse systems. He has worked with the majority of the Business Intelligence IBM Data Management and partners portfolio, specializing in parallel UNIX solutions. He holds a master's degree in electrical engineering from the Catholic University of Rio de Janeiro (PUC-RJ), Brazil. Jaap Verhees is an architecture and business/IT alignment consultant with Ordina Finance Consulting, based in the Netherlands. Prior to his five years at Ordina, he worked for IBM for six years. He has 11 years of IT experience, both in the Business Intelligence (BI) and Object Technology (OT) areas. He holds a Ph.D degree in econometrics from Groningen University, The Netherlands. His areas of expertise include data mining, OLAP, and modeling in the BI arena. He also trains clients in techniques for business process modeling, data modeling, and system analysis, in addition to application development methodologies and data mining methodologies. He has written extensively on techniques for analyzing multidimensional data.
The team (from left to right): Helena, Gregor, Jaap, Paulo, Corinne, and Ronnie
Thanks to the following people for their contributions to this project.

By providing their technical input and reviewing this book:
Leston Nay, UNICA Corp
Ute Baumbach, Inge Buecker, Toni Bollinger, Cornelius Dufft, Carsten Schulz, and Gerd Piel, IBM Development Lab in Boeblingen
Jay Bruce, IBM Silicon Valley Lab
Prasad Vishnubhotla, IBM Austin
Don Beville, IBM Dallas
Micol Trezza, IBM Italy

By reviewing this redbook:
Wout van Zeeland, Ordina
Fidel Reijerse, INTELLIDAT
Damiaan Zwietering, IBM Netherlands
Graham Bent, IBM UK
Tommy Eunice, IBM Worldwide Marketing
Become a published author Join us for a two- to seven-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners and/or customers. Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html
Comments welcome Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways: Use the online Contact us review redbook form found at: ibm.com/redbooks
Mail your comments to: IBM Corporation, International Technical Support Organization Dept. QXXE Building 80-E2 650 Harry Road San Jose, California 95120-6099
Part 1
Advanced data mining functions overview

This part provides an overview of the following data mining deployment topics. It explains:
- The next step for data mining industrialization
- The drivers for deploying data mining models and scores in any business environment
- Why data mining functions are part of the database
- Deployment examples according to typical business scenarios in the banking, retail, and telecommunications industry sectors
Chapter 1. Data mining functions in the database

This chapter discusses:
- How data mining has evolved
- The drivers for deployment instead of using data mining alone
- The effective deployment of scoring into the business data environment
- The issues that result from industrializing the mining models and the challenges during the actual deployment of the models
1.1 The evolution of data mining

Over the past decade, data mining has become useful in business to gain more information, to better understand how a business runs, and to find new ways and ideas to extend a business to other markets. Its analytical power to find new business areas and opportunities no longer needs to be proven to either business or data mining analysts. Figure 1-1 shows a historical view of data mining.
[Figure 1-1 A historical view of data mining: a timeline running from AI (1956) through query languages and DBMS (1970, first definition of a DB language), the SQL standard (1985), data warehousing and data mining with learning, pattern recognition, and rule-based reasoning (1990s), DB2 UDB Intelligent Miner (1996), up to Web mining, real-time scoring, and the DB2 UDB mining extenders (2002)]
Today, data mining is no longer thought of as a set of stand-alone techniques, far from the business applications. Over the last three years, integrating data mining with mainstream applications has become an important part of e-commerce applications. Meta Group has noted that data mining tools and workbenches are difficult to sell as stand-alone tools; they are better as part of a solution. Enterprises require more and more integration of data mining technology with relational databases and their business-oriented applications. To support this move to applications, data mining products are shifting from stand-alone technology to being integrated into the relational database. Yesterday's data mining experts had to advise how to integrate the use of data mining results within the company process. Today, technology supports inserting data mining findings directly into the company's processes and into the interactions between different business users. As shown in Figure 1-2, data mining has moved from workbenches used by power users to being embedded and integrated directly in applications. With the
traditional approach to data mining, an expert uses a workbench to run preprocessing steps, mining algorithms, and visualization of the mining models. The goal is to find new and interesting patterns in the data and somehow use the gained knowledge in business decisions. The power users, such as data mining analysts and advanced Online Analytical Processing (OLAP) users, typically require separate specialized datamarts for doing specialized analysis on preprocessed data. The integration of mining results into the operational business is usually done in an ad hoc manner. With the integration of mining, the focus shifts toward the deployment of mining in business applications (Figure 1-2). The audience includes the end users in the line of business who need an easy-to-use interface to the mining results and a tight integration with the existing environment. A power user is still required to design and build optimized and efficient data mining models. In the traditional, non-integrated way of working, the power user considers all the data from both the operational databases and the data warehouse to be potentially important. Simple predictive models, anticipation of customer behavior, and automatic optimization of business processes by means of data mining become more important than general knowledge discovery.
Figure 1-2 Shift in the use of data mining technology and audience
When data mining becomes part of a solution, the actual user of the mining functions is a developer who packages the mining results for the end-user
community. This person has the database application development skills. They provide the models and scores back to the data warehouse with aggregated data, to specialized datamarts, and to operational databases that hold data at the transactional level. A person with SQL programming skills applies scores that represent, for example, the likelihood of events in the near future or forecasts of customer behavior. This person also ensures that the creation of the model is not treated as the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the end user can use it. Depending on the business requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases, the business end user, not the data analyst, together with the person with the SQL skills, carries out the deployment steps. However, even if the data analyst does not carry out the deployment effort, the business end user must understand up front what actions need to be carried out to actually use the created models. The actual deployment of the model and scores is where the real benefits are harvested by the business.
1.2 Data mining does not stand alone anymore

As stated in the previous section, data mining is no longer thought of as a set of stand-alone techniques, far from business applications. Today many applications, such as Web analytics, Web personalization, e-commerce, and operational Customer Relationship Management (CRM), require integrated data mining functions. There are several business drivers to deploy models and scores in any business environment instead of applying data mining as it is. For example, there is the need for:
- Faster time to market and closing the loop
- Real-time analytics
- Leveraging existing IT skills
- Building repeatable processes and tasks
- Efficiency and effectiveness
- Cost reduction of mining analytics
The following sections discuss each of these business drivers.
1.2.1 Faster time to market and closing the loop

Time to market is essential to effectively use data mining. It is essential across such industries as retail, telecommunications, and finance. For example, in retail, the recency window is short for perishable products in the fast-moving goods sector. In the food retail business, for instance, it is essential to know in time when and what products to lay out, or what product to price competitively for the next days or weeks, to achieve a profitable turnover.

In the telecommunications business, time to market is more and more an essential way of doing business. It provides customers a means to select their preferred manner of communications at prices that may change from time to time, to minimize the possibility of losing them to the competition.

And in the finance industry, financial institutions provide services where real-time analytics are useful to detect fraudulent behavior. Banks offer a large set of services that customers may select or configure themselves to match their preferred customer touch point. The flexibility offered by a bank opens the possibility of fraudulent behavior, and people may use the services for just enough time before any fraudulent behavior is actually detected. Fraudulent behavior is, by nature, very agile, because customers sense they will not be easily detected if they connect to and disconnect from services in short bursts of time. Therefore it is in the interest of banks to discover such behavior as soon as possible to reduce costs and risks of loss.

Another issue where time to market is noteworthy relates to the analytical-process side more than the business-application side. Several data mining vendors have come up with the right tools and introduced a data mining process to the marketplace. Over the past years, a generic data mining method has evolved in the data mining community. This method is meant to be used by organizations that need to perform a data mining exercise, end-to-end, for the first time in their business environment. It is in such full-blown data mining projects, lasting months, that you also find the data mining workbenches, in the early stages of data mining within the organizations.

In regard to time to market, many organizations so far have not considered refining scoring models from their data warehouse environment, datamarts, or operational data stores and passing scores back to the operational databases with transaction-oriented data. The closed loop from retrieval, discovery, and actual deployment is certainly the next stage for most businesses to gain Return On Investment (ROI) by applying data mining in a business environment. Figure 1-3 displays this generic data mining method. The steps that are involved are outlined here:
1. Define the business issue in a precise statement.
2. Prepare the data:
   a. Define the data model and the data requirements.
   b. Source the data from all your available repositories and prepare the data.
   c. Evaluate the data quality.
3. Choose the mining function and run it.
4. Interpret the results and detect new information.
5. Deploy the results and the new knowledge into your business.
[Figure 1-3 The generic data mining method: a cycle of define the business issue, prepare the data, mine the data, interpret the results, and deploy the results; the mining workbench covers the traditional mining steps, while SQL and RDBMS functions for scoring integrate traditional mining with the deployment]
Effective and timely deployment of mining results becomes a problem with this data mining method if you do not place the right skills and technology at the right moment in the overall process. Typically, the time to move from step 1 to step 5
can take several weeks to several months, depending on the maturity of the data warehouse or operational data stores, if one is in place at all. Eighty percent of that time is taken by steps 1 through 4. This holds regardless of how well the business issue is defined and how well described and cleansed the data is before the process starts with step 1.

Fortunately, new application integration requirements open the door to integrating mining functions, such as modeling and scoring, with the relational database management system (RDBMS). This allows for a faster ROI and a quicker answer to the business. It also advocates applying the modeling functions to any prepared data source, and scoring new transactions, without needing in-depth knowledge of the mining techniques. The new mining functions for both modeling and scoring today are more closely integrated with the RDBMS. They shift the focus to executing the mining and deployment steps of the overall generic data mining process. This addresses the time-to-market and closed-loop driver. See Figure 1-4.

With the advent of new applications where the need for quicker time to market is more urgent than before, there is less concern with the accuracy or goodness of fit of the original data mining model to the data at that time. Here are some examples:
- Sacrificing accuracy for speed results in faster time to market of, for example, marketing campaigns or competitive pricing strategies for up-sell or cross-sell.
- Detecting fraudulent behavior occurs as soon as possible after customer touch points take place in a short span of time.
- Click-stream analysis satisfies the need where speedy analysis and targeting are more important than an accuracy of close to 100 percent.

The focus of data mining runs has evolved from trying to develop models with high accuracy, an effort that lasts for months to years. The focus has shifted to models that support CRM analytics with enough flexibility and speed, at the cost of possibly missing out on precise targeting of each individual customer. Who has not heard the phrase that timing is essential? It is not just about the contents, but also about the moment in time that we convey a message. We need to act upon information that we gather from data and day-to-day business experiences to score an opportunity in business. We do not want to wait for days to act or react.
[Figure 1-4 Focus on modeling and deployment of scores: the integration of models and deployment of scores in applications, with the steps prepare the data, mine the data, interpret the results, and deploy the results now backed by SQL and RDBMS functions for modeling and for scoring]
Visualization, scoring, and deployment have become the key words in the data mining methodology. The modeling itself, the choice of the mining technique, is less important in contrast to the need for real-time analytics. We prefer less exact preparation of the model parameters and accept a less than 100% correct selection of the data if it means success in delivering a model within a matter of days instead of weeks. The goal is to quickly produce an actual deployment in an operational business environment, to finally succeed in applying a closed-loop, real-time CRM application. As Figure 1-4 shows, scoring and deployment become the center of focus in the data mining process. Preparing the data and the data model is done while keeping time to market in mind, at an acceptable loss of accuracy. Visualization may be part of the process, but merely to allow visual inspection of the model and the initial results of the first run.
The real reward comes when the model is deployed. This is a deployment in the operational environment that integrates the outcome of the mining model, the scores, with the actual data. In the end, for example, a third-party tool that supports a specific CRM task uses the scores to address the real reward, that is, to gain overall customer satisfaction. Or the mining model is triggered by an event, such as a customer interaction through one of the customer's preferred touch points. This leads to a new score for the customer lifetime value or for some other propensity. Deployment takes many forms. Refer to 3.5, “Integrating the generic components” on page 44, which explains deployment issues in more detail.
1.2.2 Real-time analytics

Real-time analytics is a driver that is closely related to the need for faster time to market and a closed loop. But here the focus is not, for example, on delivering faster predictive scores for a campaign that is scheduled to run in the weeks after it is decided for which segment of the overall customer base the campaign will run. Real-time analytics means real-time scoring, in which a new customer initiates a transaction via their preferred channel. The channel can be a phone call to a customer sales center, a personal visit to one of the offices, or an online form at the Web site where the company promotes its products. Customer data gathered via any of these channels is fed into the front-end application. The company then gives instant feedback, information, and offerings that are suited to the customer, for whom a profile is generated within seconds. This is a common practice among companies today. Scoring data on entry is what sets this apart from traditional stand-alone data mining, which focused on mining data from tables. Trigger-based scoring, for example the so-called trigger-based marketing concept for faster targeted marketing, serves real-time analytics and scoring of data on entry. This results in faster time-to-market responses by business-to-consumer (B2C) companies. Real-time analytics, initial modeling, and iterative recalibration of the models make more sense in fast-moving markets. They make sense in highly competitive markets and in markets where customer interaction is vital to gain a high return on investment. For example, the Bank of Montreal relies on scoring to execute models. It uses real-time scoring to track changes in customer profiles, such as those who recently purchased certain products (indicating they may be interested in a new product) or those who closed several accounts in one week (indicating they are about to churn). When there is a change in a customer profile, users get the score
without waiting for a report at the end of the month. Each time something changes in a customer profile, the scoring application immediately reruns the scores. Real-time scoring also enhances applications with the ability to simulate and perform micro-testing. With Intelligent Miner (IM) Scoring, input data for scoring can come from both a database table and entry data. For example, you can score a known customer with altered values to study the effects of an action.
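As a sketch of such a what-if query, the following SQL scores a known customer while overriding one input value on entry. It assumes the IM Scoring UDFs are registered in the IDMMX schema as described in Part 3; the table, column, and model names are illustrative, and the exact signatures of the record-building functions may differ from this sketch.

   -- What-if sketch: score customer 1234 as if three accounts had been
   -- closed this week, without changing any stored data.
   SELECT IDMMX.DM_getPredClass(
            IDMMX.DM_applyClasModel(
              m.MODEL,
              IDMMX.DM_applData(
                IDMMX.DM_applData('AGE', c.AGE),    -- value from the table
                'CLOSED_THIS_WEEK', 3)))            -- altered entry value
     FROM CLASSIFMODELS m, CUSTOMER c
    WHERE m.MODELNAME = 'ChurnModel'
      AND c.CUSTOMER_ID = 1234;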
1.2.3 Leveraging existing IT skills

These days staff turnover is high. When staff with data mining skills leave your company, automating (part of) the mining process is the way to tackle this issue. If several tasks in the overall data mining process are automated as far as possible, the risk posed by the loss of skills is reduced. The actual deployment of scoring results can then be done by someone with SQL programming skills, who brings the scores back to the data warehouse, specialized datamarts, and operational databases.
1.2.4 Building repeatable processes and tasks

Building repeatable business processes and tasks to reduce redeployment and maintenance costs is another business driver for deploying a scoring model and scores in a business environment. You need to implement a process so that the essence of a successful model is captured, made operational, and guarantees repeatable success. You want to use models that have a tight integration with databases. Without the tight integration, you would still have to manually deploy the scoring results in the data warehouse or operational databases, and you would not benefit from the automation facilities that database management systems offer. With it, the deployment of scoring results is done with ease and with less intervention to make the results available. The scores must be updated at regular times, for the obvious reason that the business environment changes all the time too. Repeatable and eventually automated (parts of) processes are the way to go for deployment. This way you can schedule batch runs for automated rescoring or recalibration of models, which is cost effective. Another interesting possibility when you have a repeatable process is regenerating the model based on new data, or triggering a new model to be built when the old one no longer fits.
1.2.5 Efficiency and effectiveness

When modeling and scoring analytics must be applied in the business environment, there is the driver for effectiveness and efficiency. Technology is also needed to provide ease of use and ease of interaction close to where the data resides. To gain scoring results in an efficient manner, and to have effective scoring models to achieve these results, business applications require interaction and more integration between different modeling and scoring techniques. One technique is no longer sufficient to model a customer's behavior. Consider an example where you need to profile the customer and predict the likelihood that they are going to accept the targeted marketing campaign. Figure 1-5 shows an example of both scoring against a profile and predictive scoring. You can combine the separate profile and target models into a sequential scoring model. Possibly they will be more effective in profiling purchase habits and targeting new customers via campaigns.
[Figure 1-5 Integration between different modeling and scoring techniques: step 1, profiling (Who is our new customer?), segments customers by their type attributes; step 2, targeting (Are they likely to buy the product once we send them a specific offering in our next campaign?), attaches a yes/no prediction to each segment]
1.2.6 Cost reduction of mining analytics

Scoring of data on entry brings to mind what-if analysis, which was not possible before, even at the time that data mining was somewhat industrialized. The what-if analysis can reduce the cost of mining analytics. It is also a driver for deploying a scoring model. Why? You may need to decide whether a particular segment of customers in the overall customer base is likely to migrate or merge into another segment of customers that is more profitable to the business. In this case, you can cut costs by simulating instead of applying the model in an actual segment-targeted campaign. Simulation is carried out, for example, by a telecommunications company in cross-sell campaign programs. It would simulate whether a flat-fee offering to a region- and language-specific segment of customers in its database helps migrate the customers to another segment characterized by heavier use of flat-fee services. In a similar way, food retailers simulate different pricing strategies to evaluate the possible effect of cross-selling products to specific groups of their customers, without actually having to reprice the products on the shelves. And even banks may want to use scoring models to simulate new rules, or combinations of rules and hierarchies of their customers, to introduce different interest pricing offerings for products they service to select groups of customers. This way they can gain more profit without taking on more risk.
Chapter 2. Overview of the new data mining functions

The new modeling and scoring services directly integrate data mining technology into the relational database management system for rapid application development and deployment. This chapter describes the new data mining functions (Intelligent Miner (IM) Modeling, IM Scoring, and IM Visualization) and the philosophy behind them. It provides:
- An overview of the separate DB2 Universal Database (UDB) functions for Scoring, Modeling, and Visualization
- The principles and capabilities of these mining functions
- The whole picture where the three mining functions interact
- A positioning of these DB2 mining functions with DB2 Intelligent Miner for Data
2.1 Why relational database management system (RDBMS) functions

What makes the mining functions so appealing is their integration with DB2 UDB. Built as DB2 data mining functions, the new Modeling and Scoring services directly integrate data mining technology into DB2 UDB. This leads to faster application performance. Developers want integration and performance, as well as any facility to make their job easier.

Note: The Scoring data mining function is available on both DB2 UDB and Oracle. IM Scoring can be integrated into applications in the same manner, even in an environment where different database systems are used.

The benefits of storing a model and scores as database objects are:
- Administrative efficiency
- Ease of automation and integration
- Operational efficiency
- Performance advantages
2.1.1 Ease of automation and integration

The application programming interface (API) of the mining functions is tightly integrated into Structured Query Language (SQL). The model can be stored as a Binary Large Object (BLOB) and used within any SQL statement. This means the scoring function can be invoked with ease from any application that is SQL aware, either in batch, in real time, or as a trigger. The batch mode also facilitates automatic scheduled runs of the applications that invoke the models, so scheduled scoring is possible. Because DB2 UDB is ODBC-, OLE DB-, JDBC-, and SQLJ-enabled, the model can be easily integrated with applications.
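For instance, a trigger can keep a score column current whenever profile data changes. The following is a minimal sketch, assuming the IM Scoring UDFs are registered in the IDMMX schema as described in Part 3; the table, column, and model names are illustrative.

   -- Sketch: rescore a customer's churn class whenever the profile changes.
   CREATE TRIGGER RESCORE_ON_UPDATE
     AFTER UPDATE OF AGE, INCOME ON CUSTOMER
     REFERENCING NEW AS N
     FOR EACH ROW MODE DB2SQL
     UPDATE CUSTOMER_SCORES
        SET CHURN_CLASS =
            IDMMX.DM_getPredClass(
              IDMMX.DM_applyClasModel(
                (SELECT MODEL FROM CLASSIFMODELS
                  WHERE MODELNAME = 'ChurnModel'),
                IDMMX.DM_applData(
                  IDMMX.DM_applData('AGE', N.AGE),
                  'INCOME', N.INCOME)))
      WHERE CUSTOMER_ID = N.CUSTOMER_ID;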
2.1.2 Operational efficiency

Storing a model and scores as database objects also leads to operational efficiency. For instance, storing models in a relational table allows more efficient marketing automation. Versions of models that are calibrated at different points in time, or models for different marketing segments, can all be stored in a single table. See Figure 1-5 on page 13.
Marketers can mix and match models with different marketing campaigns with much ease and little coding. When a model is updated, there is no need to change the scoring code. The new SQL mining functions eliminate this error-prone task. Updating a model is as simple as updating a value in the database. Besides, storing a model and scores as database objects facilitates simulation. For example, the analyst can simulate the effects of actions on selected customers that were segmented together by the model. Or the analyst can study (simulate) the impact of a change in the values of key variables in the clustering (the variables are already given in a specific order by the cluster model, which helps the analysis) to define actions that migrate customers from one segment to another. These operations are efficient because the analyst has direct access to both the model and the simulated scores in one environment, the database. Another point is transparency. You can hide the application of a model on data under a database view. Also, the complexity of applying models is stored in the database itself, and any application can access it.
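To make the transparency point concrete, here is a sketch of hiding model application under a view, under the same assumptions as before (IDMMX scoring UDFs registered as in Part 3; illustrative table, column, and model names).

   -- Sketch: applications query CUSTOMER_CHURN like any table; the model
   -- lookup and the scoring call stay hidden inside the view definition.
   CREATE VIEW CUSTOMER_CHURN (CUSTOMER_ID, CHURN_CLASS) AS
     SELECT c.CUSTOMER_ID,
            IDMMX.DM_getPredClass(
              IDMMX.DM_applyClasModel(
                m.MODEL,
                IDMMX.DM_applData(
                  IDMMX.DM_applData('AGE', c.AGE),
                  'INCOME', c.INCOME)))
       FROM CUSTOMER c, CLASSIFMODELS m
      WHERE m.MODELNAME = 'ChurnModel';

Swapping in a recalibrated model is then a single-row update of the model table; the applications that read the view do not change.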
2.1.3 Performance

There is a performance advantage to invoking a user-defined function (UDF) or method. They run inside the database engine, not as part of your application. The mining functions are executed where the data is stored, so there is no overhead of copying all the data to the application. An additional performance benefit occurs when working with a Large Object (LOB). You can create a function that extracts information from a LOB right at the database server and passes only the extracted value back to the application. This strategy avoids passing the entire LOB value back to the application and then extracting the needed information.
2.1.4 Administrative efficiency

The database administrator (DBA) can administer access from a single point of control, such as the DB2 Control Center. This means more administrative efficiency: when you perform a database or table space backup, the models and scores are backed up automatically. Models and scores stored inside the database immediately inherit the benefits of a secure database. Access to the models and scores is controlled from a single point, by the DBA. This is a lot more secure than storing the models in the file system.
Installation and maintenance are also simpler. There is no need to set up additional tools or client/server interfaces. A DBA can manage the configuration using standard database tools, and every piece of information is simply an object in a database table.
2.2 Scoring: Deploying data mining models

Scoring is the use of existing data mining models based on historical data, simply applying these models to new data. For example, you may have a classification model that contains rules about estimating the churn risk for customers. Given the profile data about a particular customer, the scoring function computes the churn risk. You can apply this function in real time to a single record, for example, the customer who is currently talking to someone in the call center.

Note: Scoring does not come up with new models or new predictive rules. The actual mining models that contain the rules are computed by other modeling functions or are imported from external mining workbenches.

The guiding principle behind DB2 IM Scoring is the notion of mining versus business rules. The business environment is in flux, where frequent updates are necessary and rules age quickly. Rules are often manually identified, which is both a laborious and time-consuming process. IM Scoring identifies unsuspected targets, opportunities, and problems in an automatic way. For example, new data (a change in behavior or in the characteristics of transactions) may trigger the IM Scoring application to produce a score based on the underlying mining model. Then it matches this to a range of other scores to automatically signal whether this is an expected or unexpected result. IM Scoring is an economical and easy-to-use mining deployment capability that:
- Is a database extension
- Is implemented in batch mode or real time
- Supports the new Predictive Model Markup Language (PMML)
- Leverages existing IT skills
2.2.1 Scoring as an SQL extension

IM Scoring is an economical and easy-to-use mining deployment capability because it is implemented as a database extension. The scoring functions are simple standard extensions to SQL. This way, the actual scoring is easy to
implement through the SQL interface. You can combine the scoring functions with any SQL query, VIEW, or TRIGGER. IM Scoring uses the well-known industry standard PMML for mining models.
2.2.2 Batch mode and real time

Scoring can be implemented in batch mode or by transaction in real time. In both cases, the invocation is done either by using the SQL API in SQL statements or by using the Java API in applications built in Java. Batch mode can be used, for example, in a telecommunications industry scenario. Here the model for churn is run on a regular basis, probably monthly or quarterly, for customers who have to renew their contract. Scoring by transaction in real time is what is dubbed real-time scoring. Real-time scoring takes place in two variations. The first variation occurs in a bank scenario, for instance. In this scenario, a loan officer scores a loan application online, on the basis of the applicant's input, against the mining model for loan applications. Similarly, in a retail environment, the model runs on demand when a customer visits a touch point in the store. If the customer appears similar to previously targeted consumer groups, they receive the offering. The other variation occurs in e-business scenarios where the input data for scoring may include data that is not yet made persistent in the database. Think of personalization in WebSphere, where the current data may depend on the most recent mouse click on a Web page. In this case, a small Java API for the scoring functions allows for high-speed predictive mining as well.
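A batch run along the lines of the telecommunications scenario might look like the following sketch, under the same assumptions as the earlier examples (IDMMX scoring UDFs registered as in Part 3; illustrative names). In practice, such a statement would be scheduled to run monthly or quarterly.

   -- Sketch: batch-score customers whose contracts expire within 90 days
   -- and persist the result for downstream campaign applications.
   INSERT INTO CHURN_SCORES (CUSTOMER_ID, SCORED_ON, CHURN_CLASS)
   SELECT c.CUSTOMER_ID, CURRENT DATE,
          IDMMX.DM_getPredClass(
            IDMMX.DM_applyClasModel(
              m.MODEL,
              IDMMX.DM_applData(
                IDMMX.DM_applData('AGE', c.AGE),
                'TENURE_MONTHS', c.TENURE_MONTHS)))
     FROM CUSTOMER c, CLASSIFMODELS m
    WHERE m.MODELNAME = 'ChurnModel'
      AND c.CONTRACT_END < CURRENT DATE + 90 DAYS;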
2.2.3 Support for the new PMML 2.0 standard

IM Scoring uses the PMML standard so that it can access mining models that have been generated by DB2 Intelligent Miner for Data. It also uses this standard to apply the mining models of other data mining tool vendors, such as SAS with its data mining workbench SAS Enterprise Miner (SAS/EM).
More on PMML: PMML is an XML-based language that provides a quick and easy way for companies to exchange models between compliant vendors' applications. It provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. It allows users to develop models within one vendor's application and to use other vendors' applications to visualize, analyze, evaluate, or otherwise use the models. Previously, this was virtually impossible, but with PMML, the exchange of models between compliant applications is now seamless. Since PMML is an XML-based standard, the specification comes in the form of an XML Document Type Definition (DTD). IM Scoring provides a model conversion facility, which converts mining models from the DB2 Intelligent Miner for Data format to the PMML 2.0 format. The most important advantages of IM Scoring using PMML are listed here (a sketch of importing a PMML model follows the list):
- Computational resources are exploited efficiently.
- It allows real-time (event-triggered) processing and batch processing.
- It gives foreign software applications access to the modeling logic.
- It allows the generalization of any future model type.
- Data mining experts on-site are not necessary (see the following section).
- It enhances ease of integration between mining and core operational systems.
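IM Scoring provides import UDFs that bring an externally built PMML model into the database (they are covered in Part 3). The following sketch shows the idea; the file, table, and model names are illustrative.

   -- Sketch: import a PMML classification model, exported for example from
   -- SAS/EM or converted from DB2 Intelligent Miner for Data, into a model
   -- table from which the scoring UDFs can apply it.
   INSERT INTO CLASSIFMODELS (MODELNAME, MODEL)
     VALUES ('ChurnModel',
             IDMMX.DM_impClasFile('/models/churn.pmml'));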
2.2.4 Leveraging existing IT skills

To use the DB2 data mining function for IM Scoring effectively, the following skills are required (see Figure 2-1):
- For the IT specialist: To use the SQL API, this specialist needs basic SQL skills to apply data management procedures within the business and to actually deploy the scores in the business environment. Obviously, Java skills may be required when using the Java API.
- For the business users (for example, marketing and sales): These users can use IM Scoring through the interface or report built by the IT specialist. They must have business domain knowledge and a thorough understanding of the business issue to check the usability of the scores and to evaluate the deployment from a business perspective.
[Figure 2-1 IM Scoring and skills: the mix of IT skills, business skills, and data mining skills around the SQL and RDBMS functions for scoring]
If the IT specialist who develops the IM Scoring functionality has the Java or reporting tool (such as QMF, Business Objects, or similar) skills to build a Java interface or a report, anyone with some business skills can launch IM Scoring.
2.3 Modeling: Building a mining model using SQL

Using the DB2 data mining functions for modeling gives you an extra push when you analyze the data, helping you find an adequate data mining model easily and in a short time. The ease of use is enhanced by IM Modeling. IM Modeling provides a set of functions as an add-on service to DB2 UDB. It consists of a set of user-defined types (UDTs), UDFs, methods, and stored procedures. They extend the capabilities of DB2 UDB to include data mining modeling functions. You can easily use these functions from SQL statements. Modeling allows:
- Interoperability
- Building and using data mining models stored in DB2 UDB
- Support for the new PMML V2.0 standard

It requires both IT skills and a basic data mining understanding.
2.3.1 Interoperability

The models may have been developed by other applications and tools that support interoperability via PMML models. Or the models of DB2 Intelligent Miner for Data may have been exported as PMML models. In the modeling phase of a data mining project, various modeling techniques are selected and applied. Their parameters are calibrated to optimal values. Typically, you can choose one or more of several techniques for the same data mining problem type. For example, you may try to deduce a model that you will use to predict credit ratings for new customers of a bank. In this case, you use the neural prediction technique to generate the model. But in parallel, or subsequently, you can use the decision tree technique to state the decision rules of the model that you will use over the following months.
2.3.2 Models in DB2 UDB

You perform modeling essentially by making calls to the DB2 data mining functions. For example, you call the mining technique that you want to use in the modeling stage. The SQL API provides the means to make these calls. It enables SQL application programs to call association discovery, clustering, and classification techniques to develop models based on data that is (also) accessed by DB2 UDB. The models reside in DB2 UDB. This makes for ease of management, centralized control, and security. In addition, multiple versions of models in one central database make it easier to switch to other models and ease oversight. By this you gain in ease of management, which leads to a cost reduction in data and model management.
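The following is a deliberately simplified sketch of such a call. The build function shown stands in for the IM Modeling build call; its real invocation takes the mining-settings and data-specification objects that Part 3 describes, and the table and view names are illustrative.

   -- Simplified sketch: build a clustering model over a prepared view and
   -- store it in a model table, versioned by name and build date.
   INSERT INTO CLUSTERMODELS (MODELNAME, BUILTON, MODEL)
     VALUES ('CustomerSegments',
             CURRENT DATE,
             IDMMX.DM_buildClusModel('CUSTOMER_PROFILE_VIEW'));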
2.3.3 Support of the new PMML 2.0 standard

The move from using traditional data mining workbenches (SAS Enterprise Miner, DB2 Intelligent Miner for Data) toward more modularized tooling to perform and deploy data mining is gaining momentum. This is happening now that many leading data mining product vendors are conforming to the PMML standard format for mining models. The result is that organizations may now buy a mining model from one company and then use the visualization and application features of tooling from another company to deploy the model to business data. For example, a bank may decide to use a model developed with SAS/EM and have the model exported in PMML. Then they may use the IBM mining tooling for
scoring and visualization to deploy the results to the operational business environment. Vice versa, you can use the IBM mining tooling for modeling the data and then decide to incorporate the model in the deployment phase of the mining application. You can incorporate the prebuilt mining model, via PMML import, into a Customer Relationship Management (CRM) application such as Siebel. In this manner, the organization is not dependent on only one vendor. Nor does the organization rely on the buy-in of one mining model. Instead, it has the means to choose a package on the basis of best-of-breed solutions for the business issue at hand.
2.3.4 Required skills

You need the following skills to use the DB2 data mining functions for IM Modeling (see Figure 2-2):
- SQL programming: You need this knowledge to use the SQL API to call the mining technique you want to use in the modeling stage. You also need it to apply data management procedures that provide (read-only) access to the models, so the business can run scoring applications.
- A basic understanding of data mining and the business issue: You need this knowledge to call an appropriate mining technique.
[Figure 2-2 IM Modeling data mining functions and skills: the mix of IT skills, business skills, and data mining skills around the SQL and RDBMS functions for modeling]
2.4 Visualization: Understanding the data mining model

The DB2 data mining function IBM DB2 Intelligent Miner Visualization (IM Visualization) provides Java visualizers to present data modeling results for analysis. Visualization provides analysts with visual summaries of data from a database. It can also be used as a method for understanding the information extracted using other data mining methods. Features that are difficult to detect by scanning rows and columns of numbers in databases often become obvious when viewed graphically. Visualization is based on the following principles:
- Interoperability
- Choice in use
- Multiplatform capability
- Support of the new PMML V2.0 standard
It also requires business analyst skills.
2.4.1 Interoperability The models may have been developed by using IM Modeling or other applications and tools that support interoperability through the use of PMML models. Or the models of the DB2 Intelligent Miner for Data may have been exported as PMML models. The different visualizers interoperate with the models.
2.4.2 Choice in use Applications can call the Intelligent Miner visualizers to present model results. You can also deploy the visualizers as applets in a Web browser for ready dissemination.
2.4.3 Multiplatform capability Because the IM Visualization functionality is written in Java, you can install the visualizers and run them on different hardware and software platforms.
2.4.4 Support of the new PMML 2.0 standard You can use the Intelligent Miner Visualizers to visualize PMML-conforming mining models.
2.4.5 Required skills

As with any end-user-oriented visualization tool that has a dynamic graphical user interface, you need the following skills to use the DB2 data mining function for IM Visualization effectively (see Figure 2-3):
- A basic understanding of the graphical user interface functionality, such as mouse click-through, drill-down, and resizing
- A strong business domain knowledge
- A basic understanding of using visualization for data mining results in the form of decision trees, tree rules, clusters and patterns, and association rules
[Figure 2-3 IM Visualization and skills: the mix of IT skills, business skills, and data mining skills around the functions for visualization]
2.5 IM Modeling, Scoring, and Visualization interactions This section explains the whole picture of using the three DB2 data mining functions and how to deploy them. It presents the different packaging options. And it positions the DB2 data mining functions with Intelligent Miner for Data.
2.5.1 The whole picture

To start with data mining, you use the standard data mining process (see also Figure 1-3 on page 8). You develop an application into which you want to include some interfaces to mine data, so as to actually benefit from the mining results in your business environment. Your data mining project consists of these five data mining process phases:
1. Preparing the data
2. Building the data mining model
3. Testing the model, in the case of a classification model
4. Visualizing the results
5. Applying the model

IM Modeling enables you to work less on phase 1 and to complete phase 2 of the data mining process. Also, in the case of the classification data mining function, IM Modeling enables you to test the model. For phase 4, use IM Visualization, which enables you to display your data mining results. IM Visualization also helps you to analyze and interpret the data mining results. For phase 5, the apply phase, use IM Scoring. If the mining models are created by means of IM Modeling, then IM Scoring can directly apply these models, because IM Modeling writes the models into database tables. Mining models that are applied by the SQL API of IM Scoring must be contained in database tables.
2.5.2 Configurations

You can use the three DB2 data mining functions IM Modeling, IM Scoring, and IM Visualization together to cover all the phases in the data mining process. However, conformance to such standards as SQL and PMML offers you the choice to use the three DB2 data mining functions independently, and to use them in different package combinations with tools from other data mining product vendors who conform to the PMML standard format for mining models. For example, you can define predictive models in IM Modeling and share the models between compliant vendors' applications. Or you can buy the model from one vendor, use IM Scoring to produce scoring results, and use other vendors' applications to visualize the results. As Figure 2-4 shows, you can expect the following packages to be used from modeling until deployment into the business environment:
- IM Modeling
- IM Scoring in combination with a PMML model imported from another mining tool (for instance, DB2 Intelligent Miner for Data or SAS/EM)
- IM Modeling in combination with IM Scoring
- IM Modeling with PMML export to another third-party scoring tool
- IM Modeling with PMML export to another third-party scoring tool, plus IM Visualization
- IM Modeling, Scoring, and Visualization as a fully integrated package

All of these packaged solutions can also be combined with third-party end-user software (for instance, Business Objects, Siebel, or SAP).
Figure 2-4 Configurations
2.5.3 Positioning with Intelligent Miner for Data

DB2 Intelligent Miner for Data is a suite of statistical, preprocessing, and mining functions that you can use to analyze large databases. It also provides visualization tools for viewing and interpreting mining results. Data mining is an iterative process, as Figure 1-3 on page 8 shows, that typically involves selecting input data, transforming it, running a mining function, and interpreting the results. DB2 Intelligent Miner for Data assists you with all the steps in this process. You can apply the functions of DB2 Intelligent Miner for Data independently, iteratively, or in combination. Mining functions use elaborate mathematical techniques to discover hidden patterns in your data. After you interpret the results of your data mining process, you can modify your selection of data, data processing and statistical functions, or mining parameters to improve and extend your results.
DB2 Intelligent Miner for Data is a full data mining workbench. A workbench environment supports a data mining specialist who goes through the complete data mining process. When you start from scratch, an interactive mining workbench, such as DB2 Intelligent Miner for Data, may be the fastest way to create a model. That is because workbenches are the right tools for exploratory data mining: they involve less programming effort and assist the exploration with mining techniques available in a GUI environment. But beyond exploration, once the scenario is defined, the API of the DB2 data mining functions helps to actually close the loop with the operational applications and reduces the effort for continuous deployment. DB2 Intelligent Miner for Data does not offer, right from the start, the real-time deployment support that a business analyst, data warehouse administrator, or operational database administrator with SQL programming skills would look for. This is a major difference from what the DB2 data mining functions offer. As Figure 2-5 shows, the combination with IM Scoring offers support for the deployment of the models and scores in the operational business environment.
Figure 2-5 Positioning DB2 mining functions with DB2 Intelligent Miner for Data
For a complete overview of DB2 Intelligent Miner for Data, see:
– Intelligent Miner for Data Applications Guide, SG24-5252
– Intelligent Miner for Data V8.1 Application Programming Interface and Utility Reference, SH12-6751

In addition to the differences in use within the generic data mining method and in the audience of users, DB2 Intelligent Miner for Data offers slight differences in functionality. Refer to Appendix J, “Demographic clustering: Technical differences” on page 299, to learn about the differences that may matter to most of you who will build mining models.

IM Scoring is a separate but companion product to DB2 Intelligent Miner for Data. It supports all scoring functions offered by DB2 Intelligent Miner for Data. For its mining deployment capability, it can use the PMML format from DB2 Intelligent Miner for Data. If the mining models are created by means of DB2 Intelligent Miner for Data, they must be exported from DB2 Intelligent Miner for Data and imported into database tables so that the models can be applied by the SQL API of IM Scoring. IM Scoring provides UDFs to import the models; a sketch follows at the end of this section.

Access to predictive mining results closes the loop between the operational applications and the data warehouse analytical environment. Operational applications are all too often isolated islands where no data sharing takes place, although data sharing is essential for an optimal CRM process that faces the customers. The data warehouse analytical environment, on the other side of the spectrum, needs a means to provide a more customer-centric view, but lacks performance and time to market. Access to predictive mining results, that is, scores computed on the basis of a mining model, is the middleware between the two. For example, a call center application can automatically enrich current customer information with a predicted churn risk. This is the risk that this particular customer may leave and no longer use the services of the company. The prediction is computed by a data mining scoring function on the basis of a mining model, plus data similar to what was stored in the data warehouse (but this time fed to a mining database as input to the mining model). The result is then transferred back to the operational data, which maps to the call center customer.

This redbook provides a series of examples of how IM Scoring is used in customized applications or in partner tools such as Business Objects, UNICA, or Siebel. You can learn more about these in 3.5, “Integrating the generic components” on page 44.
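As an illustration of such an import (a hedged sketch: the file path and model name are examples, and the UDF and repository table names should be verified against your IM Scoring installation):

-- Sketch: register a PMML clustering model so that the IM Scoring
-- SQL API can apply it. The path and model name are illustrative.
INSERT INTO IDMMX.CLUSTERMODELS (MODELNAME, MODEL)
  VALUES ('CustomerSegmentation',
          IDMMX.DM_impClusFile('/models/CustomerSegmentation.pmml'));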
Interoperability between workbench and data mining functions

There are not only differences in usage and technologies between the DB2 Intelligent Miner for Data workbench and the DB2 data mining functions for modeling and scoring. The technologies also interoperate. For example, consider a business case of a CRM application for campaign management, where the technologies could come together as shown in Figure 2-6. In this scenario, the data mining expert develops a campaign management model that identifies high-potential customers to target with the product:
1. They produce this model in DB2 Intelligent Miner for Data.
2. They export this model in PMML format and transfer it to the operational environment, where it is embedded in a Web application.
3. The marketing department launches the campaign.
4. The marketers use IM Scoring to produce scores for customers. They subselect customers based on a ranking of scores to actually target:
– The high-potential customers (first), hopefully to gain a higher ROI on the campaign
– All the newly scored customers
Figure 2-6 IM Scoring: Example CRM application for campaign management
The previous example shows interoperability of the mining tools in a business scenario. Figure 2-7 shows a more technical and generic flow as an example of where the mining tools interoperate. Figure 2-7 also shows the data mining expert producing a model in DB2 Intelligent Miner for Data and exporting the model in the PMML format (an XML-based language). The scoring application uses the SQL interface of IM Scoring to apply the PMML mining model to data. Then a DB2 UDF is called to store the scored data back to the operational database.
Figure 2-7 Interoperability of DB2 Intelligent Miner for Data with IM Scoring (the figure annotations note that scoring models could be supplied by a consultant, solution provider, or central support group within an enterprise; that models can be exchanged between data mining tools from compliant vendors; and that, as added value, a consultant might merge purchased data, such as demographic or industry-specific data, with the data mined)
In summary, sourcing the data mining model can be done in several ways. If the model to be deployed is created in a data mining workbench, such as DB2 Intelligent Miner for Data, the model must be exported into a PMML file. Then you import the PMML file into DB2 UDB using an SQL script. If no mining model exists in the DB2 Intelligent Miner for Data workbench, you create an SQL script that uses the modeling API to create the model via the mining algorithms of IM Modeling. After the script is set up, another application or the batch scheduler can invoke it. Every time a mining run is invoked using the SQL API of IM Modeling, a new mining model is created. When this model is applied using the IM Scoring API, individual records are scored. For example, in the case of clustering, the individual records are scored with a cluster identifier (number). At regular intervals, you invoke the mining run and the scoring run on the data. In this way, you detect whether changed customer behavior, and possibly a change in the demographic characteristics, leads to a change of score, because the customer may be assigned to a different cluster over time.
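A sketch of such a periodic comparison follows. The tables BANK.CUSTOMERS and BANK.CUSTOMER_SEGMENTS, the model name, and the two-field input list are illustrative assumptions; a real model would be fed all of its input fields.

-- Sketch only: recompute each customer's cluster and list the customers
-- whose assignment changed since the last scoring run. All names and
-- the two-field input list are assumptions for illustration.
WITH M (MODEL) AS
  (SELECT MODEL FROM IDMMX.CLUSTERMODELS
    WHERE MODELNAME = 'CustomerSegmentation'),
NEWSCORES (CLIENT_ID, NEW_CLUSTER) AS
  (SELECT C.CLIENT_ID,
          IDMMX.DM_getClusterID(
            IDMMX.DM_applyClusModel(M.MODEL,
              IDMMX.DM_applData(
                IDMMX.DM_applData('AGE', DOUBLE(C.AGE)),
                'SALARY', DOUBLE(C.SALARY))))
     FROM M, BANK.CUSTOMERS C)
SELECT N.CLIENT_ID,
       S.CLUSTER_ID AS OLD_CLUSTER,
       N.NEW_CLUSTER
  FROM NEWSCORES N
  JOIN BANK.CUSTOMER_SEGMENTS S ON S.CLIENT_ID = N.CLIENT_ID
 WHERE S.CLUSTER_ID <> N.NEW_CLUSTER;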
2.6 Conclusion

In-database modeling and scoring leads to single point-of-entry mining and deployment of actionable analytic results within a business environment. The reason is that the following items are all stored and accessible in one place in your RDBMS environment:
– Operational and demographic data
– Settings for the schematics of the models
– The mining models themselves
– Scores as predicted by the models
– Relationship data (call center, campaign, Web traffic)
Chapter 3. Business scenario deployment examples

This chapter introduces several business scenarios in different industries that face mining deployment and integration issues:
– Customer profiling in banking, which describes conventional mining with a workbench combined with scoring in the database
– Fraud detection in telecommunications, which demonstrates how modeling can be done in the database without any workbench and describes the modeling and scoring steps in SQL in more detail
– Campaign management in telecommunications and banking, which presents further variations of modeling or scoring in the database
– Up-to-date promotion in retail, which presents a further variation of using modeling in the database only

This chapter presents an overview of these scenarios, including both the business issues and business benefits. For a more in-depth explanation of these scenarios, see Part 2, “Deploying data mining functions” on page 49. This chapter also describes the generic environment and components for these scenarios that are used to integrate the data mining functions in the business applications.
3.1 Customer profiling

The challenge that most banking organizations face is that the volumes of data that can potentially be collected are so great, and the range of customer behavior is so diverse, that it seems impossible to rationalize what is happening. You may have reached the stage where you modeled the business environment and became successful at explaining the customer behavior to one another. Even at this stage, you still have to clear the next important hurdle: ever-changing customer behavior. The question that you must address, beyond discovering new information about customers and using it to drive business forward, is how to anticipate changes in customer behavior frequently and in near real time. Traditionally, there are several banking business questions to which traditional data mining workbenches can provide answers, for example:
– What are the characteristics of my customers?
– To which customers should I target specific products or offers?
– Which of my customers is likely to leave?
– Which customers should I try to acquire?
– Which customers may not be ones that I want to encourage?
– Can I identify previously known fraudulent behavior?
In this case, consider the question of how to characterize the customers that use the bank's services and products, using data that you routinely collect about them. The issue is to figure out which customers may not be ones that you want to encourage. For example, a depositor with a $200 average balance, a lot of call center time, and a lot of teller time may not be profitable for the organization. On the other hand, that same interaction profile with trust funds, a brokerage account, or other investments may be highly profitable. The net is that you want to determine which customers you want, and treat them appropriately, as well as which customers you don't want, and treat them appropriately too.

To expand this question, you also need to look at the time factor: the need for real-time servicing and fast personalized service offerings to stay on top as a bank. A bank in today's competitive banking arena needs to profile and score new customers in a fast and easy manner. New customers who visit any channel (teller, Web site, call center, kiosk, and so on) for the first time have become spoiled by what competitors offer. Also, the assumption that a customer stays with a bank once they apply for a checking account is something from the past. There are more and more reasons why customers stay with or leave the bank, and customers change their behaviors more frequently and on short notice. The bank has to keep up to date on these changes when it applies its business models to the competitive marketplace. Therefore, you choose to restrict yourselves to one very important business issue:
What are the characteristics of the newly arrived bank customers, the customers with whom you had very few initial contacts through any bank channel in a very short or recent time period? How should you target them right now so that they stay on board?

You also explain how you can operationally deploy the results into your business and how to use the scores. This is so important because all too often data mining is seen as an analytical tool for gaining business insight, but is difficult to integrate into existing Customer Relationship Management (CRM) systems. Segmentations generated by data mining contain a large potential value for your business. However, this value is not realized unless the segmentations are integrated into your business and combined with actual scores on the basis of the segmentations.

Informational deployment, by integrating the result of your segmentation into your bank Management Information System (MIS) or data warehouse, allows management and employees to analyze and make decisions on the basis of this new information. Working with the new segmentation in the system also allows them to become comfortable with the segmentation. An important business need is to see the segmentation results in an end-user tool that already exists in the organization (for example, Business Objects) and to use reports with characteristic information on the customers. The need is not to see the results in a workbench environment. End-user tooling such as Business Objects for reporting informs bank teller personnel and sales management about the behavioral, transactional, and demographic characteristics of the bank customers. Again, this occurs in near real time, since scoring happens on the fly when the customer contact takes place. This supports the bank personnel with the latest information at the time of customer interaction.
3.1.1 Business benefits

Clustering for customer profiling is used largely in CRM (see 3.3, “Campaign management” on page 39). The business insights provided by the use of clustering for customer profiling enable the bank to offer specific, personalized services and products to its customers. In the commercial environment, clustering and profile scoring are used in the following areas, among many others:
– Cross-marketing
– Cross-selling
– Customizing marketing plans for different customer types
– Deciding which media approach to use
– Understanding shopping goals
The first task for any commercial organization is always to know and profile the existing customers in the bank's customer base. Then, from there, score newly arriving customers in near real time to target them quickly, so that you can:
– Reuse customer profiles and scores in the operational databases
– Exploit customer profiles and scores and combine them in the end-user reporting tool
– Trigger additional actions based on customer profiles and scores
– Personalize any customer access based on these customer profiles and scores
Operational use and reuse

The feedback of profile score results to the operational databases, for both long-time customers and new customers, makes for effective use of customer profiles in your business environment. Reuse by other personnel and end-user applications is enhanced once the scores are fed back to the operational databases that are accessible to many end users within the bank.
Reporting and overviews

Customer profiles are visualized in a reporting tool for the end users, to provide reports on a regular, daily basis or on demand. The end users could be bank personnel, such as the bank manager and the bank teller operators. Reports and overviews are provided via the reporting tool with ease of use. They depict typical characteristics of several customers or of one particular customer.
Multi-channel usage

There is the need to understand the customer at any touch point of the organization. There is also a need to provide equivalent offers and treatment regardless of where customers “touch”. A common database of customer information is necessary, along with consistent scoring algorithms. An example may be a customer who just closed or opened an account through the call center and then goes to the Web banking interface to move funds. You need to appropriately analyze this activity to determine whether they are increasing or decreasing their relationship or profit potential. You must also determine whether there are other services that may be appropriately offered.
Real-time personalization

Real-time personalization may address newly arrived customers via any of the customer-preferred channels to the bank. It offers more chances of addressing these new customers with better-targeted services and personalized products. While you can quickly score and categorize a new customer, you must also apply the personalization to an existing customer, both on their entrance into the touch point and based on their actions while they are interacting. Real-time scoring, right from the start of the customer's initial contact and during every subsequent customer contact, raises your chances of a prolonged customer lifetime value. This business scenario and the steps involved, with respect to implementation and deployment to the Business Objects end-user environment, are discussed in more (and technical) detail in Chapter 4, “Customer profiling example” on page 51.
3.2 Fraud detection

The case for fraud detection for large organizations in the banking, insurance, telecommunications, and government sectors has been well documented. Evaluating transactions on a weekly or monthly basis after they occur may be good enough in businesses where the potential fraud is known. This is applicable in detecting white-collar crime such as fraudulent claims by service providers of any kind. However, other types of business transactions may require much more timely intervention. In the case of detecting unauthorized usage of credit cards, mobile phone calls made from stolen mobile phones, money laundering, and insider trading, you need a near real-time or very timely detection system.

The fraud detection scenario illustrates how to map a solution to the problem of fraud as experienced by a telecommunications service provider. Specifically, the scenario provides answers to the following questions:
– What are the characteristics of fraudulent connections?
– Which of the connections in the Call Detail Records is likely to be fraudulent?
– How can I generate a list of suspicious connections so that our investigators can focus their attention on the urgent cases?
– We have millions of transactions in a day. How can we automate the process?
– Fraudulent patterns change, so how do we know our system can evolve with the changes?
3.2.1 Business benefits

The business benefits are multiple.
Up-to-date analysis on current data

This fraud detection scenario illustrates how the mining analysis can be activated at any time by a business user and how it uses the most recent data. There is no latency between the data analysis and the deployment of the results. This semi-automatic approach to fraud detection speeds up the time between learning from the past and deploying this knowledge for business advantage. There is some lead time when the business solutions are designed and configured, but once the system is set up, it returns the results almost immediately.
A system that adapts to change in undesirable behavior

The segmentation model is refreshed automatically at regular intervals. This allows you to detect new patterns of undesirable behavior. Because undesirable behavior changes over time, chances are that these behaviors are outliers in some way. This evolving approach is more capable of detecting new outliers. In addition, this semi-automatic approach ensures consistency and a repeatable process that an organization can use to gain productivity.
Capture data mining expertise

Data mining skills are scarce. It is generally expensive to keep a full-time data miner on board to maintain a fraud detection model. This approach allows an organization to use the services of the data miner only for defining the model in the first instance. For example, external consultants can set up a model and then pass on the maintenance to the database development team for implementation.
Enhance repeatable success

The entire business process can be captured in a system maintained by IT, in a database maintained by a database development team and database administrator (DBA). This provides an environment that is conducive to automation and a repeatable process.
Potential for enhanced communication

Since every component is stored and maintained by IT, the entire model is documented. That includes data settings, model settings, tasks, and results. All components can be queried and visualized by anyone who needs access. The communication process between the developers, data miners, and the business can be greatly enhanced.
Actionable result

The return on investment (ROI) of any Business Intelligence initiative depends on how an organization can turn knowledge into actionable steps. The results of this scenario are highly actionable. A list of risky transactions is generated on a daily basis. With such tools as Business Objects, the telecommunications service provider can regularly generate a list of customer transactions that warrant further investigation. This business scenario and the steps involved with respect to implementation and deployment are discussed in more (and technical) detail in Chapter 5, “Fraud detection example” on page 75.
3.3 Campaign management

Campaign management is one of the most interesting and challenging processes in the marketing arena. It needs a lot of data points to get to one-to-one (1:1) marketing (see Figure 3-1). It is impossible to analyze all of the data points without some mining or scoring capability.
Figure 3-1 Getting to 1:1 marketing (the figure plots the ability to individualize products, services, and processes against the number of market segments, moving from mass marketing, through traditional segmentation, geo-demographic marketing, and database marketing, to 1:1 relationship marketing based on individual customer models; rich customer data ultimately fuels your ability to move customers along the relationship continuum)
In a more competitive environment, many companies struggle with the important issue of how to avoid losing customers and how to retain them. To effectively create and run a marketing campaign, professionals must be able to profile customers, products, and services; create a strategy to allocate the right amount of resources to maximize ROI; and measure the results of each action. They have to determine the right combination of customer type and attrition situation, as well as the best time to present each offer. These challenges are not easily overcome. The first task is often to determine who should receive a differentiated treatment. But how do you make this determination, especially when not enough history is available (new customers and prospects) or when the lack of profitability is due to unsuccessful cross-sales strategies?
Determining the right time to act and which offer to present requires careful planning. It also requires an integrated infrastructure that can support data exchange between the data storage systems (such as a data warehouse), the analytical systems (such as data mining and reporting applications), and the various channels (such as call centers). Although companies today have an enormous amount of customer data, direct marketers still struggle to extract useful and actionable information from those large databases. To effectively create a high-ROI marketing strategy, a shift from traditional list-generation (mass) campaign management to a more targeted and complete process is necessary to deal with:
– Different profiles for different campaigns
– Dynamic profiles for customers and products
– A large number of campaigns
– Promotion cannibalizing
– Optimization for budget allocation
– Integration of analysis and channels
To achieve this goal, two key steps are necessary:
1. Develop a comprehensive analytical scenario based on a data warehouse environment. An effective marketing automation process uses a number of business intelligence tools. These tools provide quantitative and qualitative information about customers and prospects to determine the right message, offer, and channel for a given initiative.
2. Develop a strategy for campaign automation. To effectively manage a large number of campaigns that contact an ever growing number of prospects and customers, marketing specialists require an automated process to:
– Manage the channel utilization
– Deal with different types of customer interaction
– Track responses
– Analyze the actions' feedback using consolidated and detailed campaign reports
Note: To enable a comprehensive view of existing and potential customers, a centralized data repository is usually necessary to collect data from all appropriate operational systems, channels, and third-party sources. This repository should support a customer-centric, cross-product view necessary to develop effective marketing campaigns.

Refer to Chapter 6, “Campaign management solution examples” on page 97, which provides scenarios on:
– How to automatically trigger marketing actions when a customer event appears (trigger-based marketing scenario)
– How to use data mining functions to automatically launch a campaign when a customer has a high propensity to leave (retention campaign scenario)
– How to use the data mining functions to enable the analytical scenario and campaign automation based on a data warehouse solution (cross-selling campaign scenario)
3.3.1 Business benefits

The advantage of this approach is that the campaign is timely and automatic. It is also more focused because, for example, a churn score is updated every time there is a change in the customer's behavior. The campaign management system is designed to automatically run different campaigns and scripts for different combinations of churn score, customer value, and tenure. With the cross-selling option, the campaign is more precise. It can be less expensive to have an accurate real-time system that identifies and acts fast enough to contact the customers that the company really wants to keep (the focus market segment) and the ones that can cause trouble. The most likely result of this fast and accurate contact is to prevent customer churn. Therefore, the campaign must be as targeted as possible, to avoid customer dissatisfaction and to save campaign funds. This business scenario and the steps involved with respect to implementation and deployment are discussed in more (and technical) detail in Chapter 6, “Campaign management solution examples” on page 97.
3.4 Up-to-date promotion

In the retail industry, the large number of products and possible combinations of them makes the business analyst's decisions difficult. The analyst needs to keep track of the customer purchase behavior and the best mix of products in each store to decide on each product's optimal price and marketing appeal. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change the promotion.

The manager or business analyst has to design and run a different promotion every day, based on the mix of products they have in the store. For example, the business analyst has to sell perishable products in a short time frame and needs to determine the best location in each store. In the same time frame, they can also check the efficiency of their decision and change it if they want. Another example is to use the Web to perform “micro testing” on promotions and to put items up for sale on the Web, and then determine how much to produce (overseas) before ordering the merchandise for the stores.
3.4.1 Business benefits

The business benefits are multiple.
Automation

There is an automated way to find the association rules embedded in an application, without using any data mining software or skill. The business analyst (in this case the manager) can see and interpret the product combinations every time a new product is launched or a new transaction is inserted in the transactional system. They can also test the efficiency of a promotion while it is still running.
Calibration: New data = new model

The advantage of having the association technique embedded in any application is that it can be run in batch almost every night, and the business analyst keeps track of the product association patterns. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change the promotion. IM Modeling brings the ability to calibrate and recalculate a data mining model every time new transactional data is loaded. The faster the data is inserted in the transactional system, the faster the calibration is done.
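For example, such a nightly recalibration can be scheduled as a one-line batch job, in the same style as the scoring batch job shown later in Example 4-1. The script name rebuild_rule_model.db2 is illustrative only; the script itself would invoke the IM Modeling API to rebuild the rule model.

db2cmd /c /w /i db2 -tvf rebuild_rule_model.db2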
Fast and accurate

Again, the faster and more repeatable the process is, the more accurately the business analyst can decide in this industry, which is characterized by a large number of product combinations and customer profiles.
Ready for the Web environment

You can use this promotion application in a Web-based environment to perform “micro testing” on promotions. You can also use it to compare the success rates of two different promotions. This business scenario and the steps involved with respect to implementation and deployment to the Business Objects end-user environment are discussed in more (and technical) detail in Chapter 7, “Up-to-date promotion example” on page 149.
3.5 Integrating the generic components

This section describes the generic components that cover an end-to-end deployment process for integrating IM Modeling, IM Scoring, and IM Visualization in business applications and solutions.
3.5.1 Generic environment and components

The different environments and components used in the different business scenarios, as illustrated in Figure 3-2, include:
– Customer profiling
– Campaign management
– Trigger-based marketing on retention campaign
– Fraud detection
– Up-to-date promotion
Figure 3-2 Business scenarios components
There are three distinct environments:
– The business applications environment
– The modeling environment
– The scoring environment
Business applications environment

Business applications can be any applications that lend themselves to analytics. This may include real-time or batch applications that embed a fraud detection application. They may also include:
– A call center application that computes a customer's new propensity to churn, based on changed circumstances
– A call center application that promotes an up-to-the-minute “next best offer”, based on the latest customer information
– A weekly batch application, a so-called scheduler, that generates customer leads for targeted campaigns
– A credit scoring application that sits on the loan officer's desk in the branches
– A traditional workbench environment such as DB2 Intelligent Miner for Data or SAS Enterprise Miner

With the new mining and scoring extenders, any application can now embed the modeling, the scoring, or both. The customer profiling scenario and the trigger-based marketing on retention scenario only use the scoring function. The fraud detection scenario showcases an application where both the modeling and scoring extenders are used in a single business application. Therefore, from an application perspective, the applications can span both the modeling and scoring environments. The applications environment is integrated with the modeling and scoring environments using CLI, ODBC, JDBC, and SQLJ.
IM Modeling environment

The modeling environment is a DB2 database that is “enabled” for mining. The database has all the additional database objects required for executing the Modeling API. Typically, the source database is part of a data warehouse, where the data to be modeled is cleansed, joined, pre-processed, and enriched. Using SQL scripts, the application developers use the mining extenders to mine or “model” the data in a batch environment. The result of a modeling run is a model stored as a special DB2 datatype within the database. The model is exported to PMML in an external file or a DB2 CLOB type for transport and storage. For more information, refer to Chapter 10, “Building the mining models using IM Modeling functions” on page 211.
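As a shape-only sketch of such a batch run (BUILD_CLUSTER_MODEL below is a hypothetical stand-in, not the real IM Modeling procedure name; see Chapter 10 for the actual API calls), the script boils down to invoking a build that reads the prepared source table and writes the model into a repository table:

-- Hypothetical sketch of a batch modeling run. BUILD_CLUSTER_MODEL is
-- not the real IM Modeling procedure name; it stands in for the build
-- call documented in Chapter 10. The flow it shows is the real one:
-- read the prepared warehouse table, build a clustering model, and
-- store the model as a BLOB in the repository table.
CALL BUILD_CLUSTER_MODEL(
  'DWH.CUSTOMER_INPUT',      -- cleansed, joined, enriched source table
  'CustomerSegmentation',    -- name under which the model is stored
  'IDMMX.CLUSTERMODELS');    -- repository table that holds the model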
This model is then transported, by means of export or DB2 Universal Database (UDB) Connect, to the scoring environment for scoring, that is, applying the model to new customers.
IM Scoring environment

Scoring is the application of the model built in a modeling environment. The model has to exist before it can be used for scoring new customer data. From a DB2 perspective, there is no difference between a modeling and a scoring environment. In practice, most IT shops separate their modeling and scoring environments for security, operational, and efficiency purposes. The scoring environment is a DB2 database that is enabled for scoring. An enabled database has all the additional database objects required for executing the scoring extender via the SQL API. The enablement process is the same for both modeling and scoring. Typically the scoring environment is a single table within an Operational Data Store (ODS). You can think of this ODS as a halfway house between the data warehouse and the OLTP systems. Models are developed in the data warehouse and deployed to the ODS. Finally, the scores that result from the scoring run are most likely integrated into the front-of-house or operational OLTP applications, where scores are needed for different campaigns, call center, or CRM applications.
Components

The components that are used throughout this chapter include:
– Applications
– IM Modeling data mining function and API
– IM Scoring data mining function and API
– Data mining models represented in different forms

Models can be stored as datatype BLOB in DB2 UDB, as PMML in external files, or as CLOB in DB2 UDB. Refer to Chapter 10, “Building the mining models using IM Modeling functions” on page 211, for more information.
Table 3-1 outlines the basic steps for deployment of modeling capability into the database.
Table 3-1 Step-by-step modeling process (modeling environment)
Step 1 (IM Modeling SQL API): Use IM Modeling to build a model based on business requirements.
Step 1 (Workbench): Use a workbench such as DB2 Intelligent Miner for Data or SAS Enterprise Miner, based on business requirements.
Step 2: Train the model and save the model.
Step 3: Validate the model.
Step 4: If Modeling and Scoring are in separate databases, export the result or model as a PMML file for distribution.
Table 3-2 outlines the basic steps for deployment of a scoring capability into the database.

Table 3-2 Step-by-step scoring process
Scoring environment:
Step 1: If Modeling and Scoring are in different databases, import the PMML model into DB2 UDB.
Step 2: Build the SQL script that uses the imported models for scoring.
Step 3: Deploy the score script.
Application environment:
Step 4: Integrate the core scoring SQL scripts using CLI, ODBC, SQLJ, and JDBC.
3.5.2 The method

For each scenario, refer to the following chapters:
– Customer profiling: Chapter 4, “Customer profiling example” on page 51
– Fraud detection: Chapter 5, “Fraud detection example” on page 75
– Campaign management: Chapter 6, “Campaign management solution examples” on page 97
– Up-to-date promotion: Chapter 7, “Up-to-date promotion example” on page 149
For each scenario and its implementation process, the same methodology and workflow were adopted:
– The business issue:
  – What are the current problems in the business?
  – What are the possible approaches for a solution?
  – Why should the company choose a solution that involves data mining? That is, why are the other approaches not good enough?
  – Where's the ROI?
– Mapping the business issue to data mining functions: This section describes the problem at a more analytical level.
  – What is the key idea where a solution designer would say “this looks like mining function X; maybe we should try that”?
– The business application:
  – How does the application look from an end user's point of view?
  – Who is the user, are there different kinds of users, and what tools or applications do they use?
– Environment, components, and implementation flow: This is the technical counterpart of the business application. It sketches the design of the solution with reference to mining.
– The step-by-step implementation:
  – How does the input data look?
  – Optionally, how can it be derived from the source tables in the warehouse?
  – How do you invoke mining?
  – How do you show and integrate the results?
– The benefits:
  – What are the technical benefits of such integrated solutions?
Part 2. Deploying data mining functions

This part provides examples of how to easily integrate data mining functions through SQL scripts to enhance your business applications. This requires more IT or database administrator support than real data mining knowledge. This part provides several business scenarios and highlights the implementation and deployment of the DB2 data mining functions. This is based on the generic components that cover an end-to-end deployment process for integrating IM Modeling, IM Scoring, and IM Visualization in business applications and solutions (see 3.5.1, “Generic environment and components” on page 44). This part also elaborates other ways to integrate the DB2 data mining functions with different e-commerce- and Business Intelligence-related technologies.
Chapter 4. Customer profiling example

This chapter presents a basic scenario for customer profile scoring of new customers for a bank. This business scenario leans heavily on the use of scoring techniques and scoring results. It describes conventional mining with a workbench, combined with scoring in the database. In this chapter, the mining models are created with a workbench such as DB2 Intelligent Miner for Data. This chapter also discusses the steps involved with respect to IM Scoring and deployment to the Business Objects end-user environment.
4.1 The business issue

Understanding customer behavior and being able to predict customers' preferences and actions is crucial for every business. This is particularly true in the highly competitive banking industry. An advantage in addressing the customers' needs can make a big difference in the long run. Customer segmentation has been a very useful technique for guiding interaction with the customer. Traditionally, segmentation is done by market experts who somehow come up with descriptions of segments as they feel appropriate. With data mining, however, a bank can detect different patterns in the customer behavior without depending on instinct.

Even when data mining techniques are used to optimize the channels toward a “segment of one”, a bank still faces the issue of how to better enable the front-office employees to use the mining results in daily operations. If data mining becomes operationalized in the customer interactions, an institution can achieve the full return on investment (ROI) of data mining.

As an example, we address the issue of how to distinguish profitable and non-profitable customers. More specifically, we demonstrate how end users can easily access customer scores from within their usual applications. We also meet the requirement that the end users must be able to obtain results that were computed using the most recent information about a specific customer. This approach allows you to score new customers.

The other important issue here is that not only do profiles have to be discovered in a fast, timely manner; scoring against existing profiles also has to take place in a very fast and frequent way. New customers often join a bank or change banks for private or business purposes. This happens particularly in the U.S.A., where an individual still has to close one account and open a completely new account in the next place of residence when they relocate. Therefore, chances are that after a first visit, or a couple of informative visits, they will never return if the services or products are not targeted or recognizable to them at an affordable price. In our example, we consider the question of how to profile and score new arrivals at the bank, via any channel, with near real-time scoring.
4.2 Mapping the business issue to data mining functions

To generate the customer profiles, you use the traditional workbench and run clustering, a data mining technique for segmenting the customer database.
Clustering: Customer segmentation is a method where customers are grouped into a number of segments based on their behavior. In data mining, this corresponds to the clustering technique. The customer behavior is described with profile data that includes features, such as age and income, and behavior indicators, such as the aggregated monetary value of transactions during the last four weeks. Clustering groups the customers into different groups, based on the similarity of their profiles. There is no need to predefine a particular criterion for the grouping operations. The clustering technique can find the right groups automatically.

The models and scores that you produce can be used by the business users and integrated in their reports (Business Objects, for example) to identify the different types of customers and the potential niche market segments. The clustering technique typically generates three different result types:
– A graphical result for the analyst and business user
– An output table (or flat file) with scores
– The clustering model with segment characteristics

To address the customers newly arrived at your bank, you use real-time scoring to deliver fast results on a typical profile fit. With real-time scoring, you can match the business actions faster and center them more on the possible needs of the newer customers. For example, a recent visitor to the bank Web site who registers their preferences for your products and services and confirms their registration is more likely to become a steady customer than the one-click Web user who merely visits your bank Web site. Real-time scoring of the behavior of this new registrant enables you to use the scoring technique to match their profile with those of existing customers of your specific Web channel. From there, you can personalize the bank Web site pages to target this newly arrived customer. You can offer products that other customers in their segment became aware of and then decided to obtain through your bank. Similarly, scoring can automatically and instantly trigger an e-mail to this customer. The e-mail confirms the registration, and on the basis of the fit of demographic characteristics and product preferences, it targets this recent customer in a more personalized way from the start. If this same scenario occurs at a bank teller or sales account desk, rather than through the Web channel, the real-time scoring instantly provides a quick score report with which the customer is serviced in the personal interview.
4.3 The business application

The features of the bank's service and product types can lead the business analysts to choose certain segments of customers and set other segments aside. A mortgage product may be of interest only to one or two segments of customers and not to all segments. Demographic characteristics, such as family size, household income, and ownership of private property, may trigger the business analyst to select a certain segment as a target set. The customer lifetime of the bank customers may lead the bank manager to decide to target a subset of customers in a segment differently from other customers in the same segment.

The first stage is to know customers better:
– To have a better understanding of your new customers as a bank by scoring
– To score these new customers on the basis of their characteristics and their match to profiles of certain segments
– To have customers interact with the bank via different channels of choice, be it the bank teller, ATM, Web, telephone, or kiosk
– So the bank can offer diverse services and products (checking, savings, bank funds, credit card, mortgage, insurance, automatic payments), which, over time, need adjustments and additional features
– To help the business analyst in the bank to know what types of customers the bank deals with in the first place
– To act with a customer-oriented focus instead of a service or product focus

The segmentation models are produced by the business analyst who is in charge of data mining, using a mining workbench that already exists in the enterprise. The scores can be used by the business users. They can be integrated in their day-to-day reports (Business Objects, for example) to identify the different types of customers and the potential niche market segments.
4.4 Environment, components, and implementation flow

Figure 4-1 shows a typical end-to-end deployment environment with its components in this case of profile scoring.
Figure 4-1 Customer profiling using IM Scoring, deployment to Business Objects
The mining workbench mines the data using the demographic clustering technique. The results (a model with clusters), already produced by the business analyst in charge of the data mining workbench, are stored and used by the IM Scoring function to score new customers (or customers with changed demographic or transactional behavior). The visualization of a customer, with their demographic data and bank-related transactional data, plus the segment to which they belong, assists the bank manager in profiling the customer. The customer is profiled according to their most important characteristics. The visualization of this customer is done via Business Objects, an end-user reporting tool that tightly integrates with IM Scoring. Visualization is also done by embedding the graphical or textual cluster visualizations from IM Visualization in Business Objects reports. The Business Objects report then shows the individual segments to the end user when it displays the segment identifier, score, and detailed data information for a customer that the end user selected. Spatial data may also be used: address, income levels, and scoring can all determine where to open the next branch or in which geographical area to target the next campaign. Scoring can also be done on demand, for example, when a new customer who was not yet in the customer base overnight arrives in person at the bank manager's desk the following afternoon.
In any case, at regular intervals, you invoke the scoring run on the data. In this way, you detect whether the customer behavior, or possibly a change in the demographic characteristics, leads to a change of score. This is driven by the notion that a customer may be scored to a different cluster over time. Not only are the demographic characteristics of the customer transient over time, but the transactional behavior of the customer in relationship to the bank also changes. To schedule scoring, you can use the DB2 UDB scheduler so that the Business Objects report at the end user's desktop displays the most recent profile scores of bank customers on a frequent basis. Example 4-1 shows the batch job score.bat for scoring, where the SQL script score.db2 writes the scoring results into a DB2 table for customer segments.
Example 4-1 Batch job for scoring
db2cmd /c /w /i db2 -tvf score.db2
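The content of score.db2 is not listed here; a minimal sketch of what such a script could contain follows. The table, column, and model names are illustrative, and the two-field input list stands in for the model's full field list.

-- Hypothetical content of score.db2: refresh the customer segment table
-- with the latest cluster assignments. Names and the two-field input
-- list are illustrative assumptions.
DELETE FROM BANK.CUSTOMER_SEGMENTS;

INSERT INTO BANK.CUSTOMER_SEGMENTS (CLIENT_ID, CLUSTER_ID)
  WITH M (MODEL) AS
    (SELECT MODEL FROM IDMMX.CLUSTERMODELS
      WHERE MODELNAME = 'CustomerSegmentation')
  SELECT C.CLIENT_ID,
         IDMMX.DM_getClusterID(
           IDMMX.DM_applyClusModel(M.MODEL,
             IDMMX.DM_applData(
               IDMMX.DM_applData('AGE', DOUBLE(C.AGE)),
               'SALARY', DOUBLE(C.SALARY))))
    FROM M, BANK.CUSTOMERS C;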
The high-level view of the implementation flow in Figure 4-2 shows two main phases of activities:
– Design and modeling: This phase comprises mining, updates of the mining models, creating the static report, and finally linking the scoring functions, the profile data, and the report. At this application integration stage, the database administrator (DBA) and the end user interact to define the initial report setup. They also select the right business terminology for the report, the right informational fields, and the format of the report itself.
– End-user interaction: This phase is driven by the end user. A new customer arrives, the scoring function is activated, and the end user reads the results from the report.

Note: In the implementation flow shown in Figure 4-2, input may be added to trigger the generation of a new model.
Figure 4-2 Implementation flow of customer profile scores (configuration: database enablement and the table to be scored; workbench data mining: export a PMML file; scoring: import the latest data mining model if it is not yet in place and apply the scoring function on an external event, such as a new customer or the changed behavior of an existing customer; application integration: build the Business Objects report if it does not exist, then display it)
The next section discusses the implementation in more detail.
4.5 Step-by-step implementation

The detailed implementation, according to the flow shown in Figure 4-2, consists of the steps that are explained in this section. Some steps run once, some steps run each time a tune-up of the mining model occurs, and some steps run at each scoring instance. All in all, each step is short and easy. The implementation flow overall is simple because most of the steps, such as the database configuration setup, are done up front.
4.5.1 Configuration

You must do the configuration once, during the deployment stage. It consists of database enablement and table creation.
Database enablement

Make sure that both the DB2 UDB instance and the BANK database for the customer profile scoring project are enabled. The steps to set up the environment for the BANK database are:
1. Enable the DB2 UDB instance. Check whether the DB2 UDB instance is enabled. If it is not, enable it by referring to the steps in 9.3.1, “Configuring the DB2 UDB instance” on page 197, for IM Scoring.
2. Enable the working database. Check whether the DB2 UDB database for scoring is enabled. If it is not, enable the BANK database by referring to the steps in 9.3.2, “Configuring the database” on page 198, for IM Scoring. In this case, you invoke the script in Example 4-2.
Example 4-2 Script to enable the BANK database for scoring
idmenabledb BANK fenced tables
Table to be scored

Table 4-1 lists the data fields that are used for clustering. This list contains the demographic, relationship, and transactional data that is used for customer base profiling of the bank.
Note: Only use variables that are available in the online environment where you deploy. A typical mistake is to generate clusters based on everything you know about your current customers, without realizing that you do not have all that information on prospects.
Table 4-1 Table fields to use (field description: field name)
1. Customer identification: Client_id
2. Age: Age
3. Sex: Sex
4. US state of residence: State
5. Monthly income: Salary
6. Profession: Profession
7. Home owner: House_ownership
8. Car owner: Car_ownership
9. Marital status: Marital_status
10. Number of dependents: N_of_dependents
11. Has children: Has_children
12. Time as customer: Time_as_Customer
13. Monthly checking amount: Checking_amount
14. Total monthly amount of automatic payments: Total_amount_automatic_payments
15. Monthly savings amount: Savings
16. Monthly value of bank funds: Bank_funds
17. Monthly value of stocks: Stocks
18. Monthly value of government bonds: Govern_bonds
19. Amount of gain or loss of funds (bank funds, stocks, and government bonds): Gain_loss_funds
20. Number of mortgages: N_mortgages
21. Monthly amount of mortgage: Mortgage_amount
22. Amount monthly overdrawn: Money_monthly_overdrawn
23. Average value of debit transactions: AVG_debit_transact
24. Number of checks monthly written: Monthly_checks_written
25. Number of automatic payments: N_automatic_payments
26. Number of transactions using bank teller: N_trans_teller
27. Number of transactions using ATM: N_trans_ATM
28. Number of transactions using Web bank: N_trans_Web_bank
29. Number of transactions using kiosk: N_trans_kiosk
30. Number of transactions using call center: N_trans_Call_center
31. Monthly amount of credit card limit: Credit_card_limits
32. Has life insurance: Insurance
The table with all the customers that the bank wants to score is either already created or can be created by joining the diverse data fields from the different operational databases in the bank environment. For example, the customer demographic data, from the customer table in the bank data warehouse, is joined with sales transactional data from the table owned by the bank’s accounts department. The SQL script to create and load a sample input table containing customer profile data in DB2 UDB for this scenario is: Script to create and load the customer segm table for scoring.sql
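As a rough sketch of such a join (the table names CUST_DEMOGRAPHICS, ACCT_TRANSACTIONS, and the target table CUSTOMER_SEGM are assumptions for illustration; the column names come from Table 4-1):

INSERT INTO CUSTOMER_SEGM
  SELECT d.Client_id, d.Age, d.Sex, d.State, d.Salary, d.Profession,
         t.Checking_amount, t.Savings, t.N_trans_ATM, t.N_trans_teller
  FROM   CUST_DEMOGRAPHICS d, ACCT_TRANSACTIONS t
  WHERE  d.Client_id = t.Client_id;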
4.5.2 Workbench data mining
Collect all the customer demographic and transactional data and store the data in a relational database, such as DB2 UDB. Then use the DB2 Intelligent Miner for Data workbench to exercise the generic data mining method and use the demographic clustering technique. Figure 4-3 shows an example of the Summary page for this technique.
Note: For a description of the clustering algorithms that DB2 Intelligent Miner for Data offers, see Mining Your Own Business in Banking Using DB2 Intelligent Miner for Data, SG24-6272, and Mining Your Own Business in Retail Using DB2 Intelligent Miner for Data, SG24-6271. In this case, use the statistical clustering algorithm at the bank because it automatically determines the “natural” number of clusters.

The clustering mining function groups data records on the basis of how similar they are. A data record may, for example, consist of information about a customer. In this case, clustering would group similar customers together, while maximizing the differences between the different customer groups formed in this way.
The groups that are found are called clusters or segments. Each cluster tells a specific story about customer identity or behavior. For example, the cluster may tell about their demographic background or about their preferred products or product combinations. In this way, customers that are similar are grouped together in homogeneous groups that are then available for marketing or for other business processes.
Figure 4-3 Demographic clustering in DB2 Intelligent Miner for Data
Exporting the PMML model
The model to be deployed is created in DB2 Intelligent Miner for Data. Therefore, the model needs to be exported from the workbench and transferred to the database where the scoring is invoked. The model may be exported into a PMML file (see Figure 4-4) or DB2 Intelligent Miner for Data format. Both formats can be used by IM Scoring. In this example, use the PMML format. The model was exported to the CustomerSegmentation.PMML file. Refer to 9.3.3, “Exporting models from the modeling environment” on page 200, to learn how to export the model in PMML format.
Figure 4-4 Exporting the mining model in the PMML format
4.5.3 Scoring
Use IM Scoring to assign additional customers in the bank’s operational database to the segments that they fit best. Scoring the data is the most important step. By itself, it is a simple step because of the handy SQL scripts that are provided with IM Scoring and DB2 UDB.
Importing the data mining model
The model is imported into DB2 UDB using a SQL script. Refer to 9.3.4, “Importing the data mining model in the relational database management system (RDBMS)” on page 202, to learn how to import the model in PMML format. If the PMML model was exported as the CustomerSegmentation.PMML file, then the script in Example 4-3, which is part of the score.db2 file, imports the PMML model.
Example 4-3 Importing the clustering model INSERT INTO IDMMX.CLUSTERMODELS VALUES ('Demographic clustering of customer base of an US bank', IDMMX.DM_impClusFile('c:\scoring\CustomerSegmentation.PMML'));
Apply the scoring functions
You apply the scoring functions on the basis of a cluster model by invoking the DM_applyClusModel function. In this example, the view CustomerSegmentsView was created. The REC2XML function was used to
convert the data that is to be scored to XML format. In this scenario, the data to be scored is data for new customers who were saved to the bank customer base in the table named Recent_New_Customers. See the SQL script in Example 4-4.
Example 4-4 Defining the view CustomerSegmentsView CREATE VIEW CustomerSegmentsView ( Customer_id , Result ) AS SELECT data.Client_id, IDMMX.DM_applyClusModel(models.MODEL, IDMMX.DM_impApplData( REC2XML(1,'COLATTVAL','', data."CAR_OWNERSHIP", data."HAS_CHILDREN", data."HOUSE_OWNERSHIP", data."MARITAL_STATUS", data."PROFESSION", data."SEX", data."STATE", data."N_OF_DEPENDENTS", data."AGE", data."SALARY"))) FROM IDMMX.CLUSTERMODELS models, RECENT_NEW_CUSTOMERS data WHERE models.MODELNAME = 'Demographic clustering of customer base of an US bank';
This SQL script is actually generated using the IDMMKSQL command. The tip in “Constructing the record and applying the model” on page 207 explains how to use this command. The complexity of the score code is encapsulated in the view CustomerSegmentsView. You can access the scoring results by simply reading from the view. Due to the dynamic nature of views, the scoring results are computed on the fly. The mining result contains the cluster ID, the score, and further indicators.
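For example, a minimal query to read the cluster ID and score for each new customer on the fly (using the same extraction functions that appear in Example 4-5):

SELECT Customer_id,
       IDMMX.DM_getClusterID( Result ) AS Cluster_id,
       IDMMX.DM_getClusScore( Result ) AS Score
FROM   CustomerSegmentsView;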
Note: With the new IM Visualizers in V8, the names of the clusters can be modified. That is, there is an abstract integer number that represents the cluster ID and a name for a certain cluster, for example, “dead beats” instead of clusterID=5.

A table that contains the cluster ID and the score as separate entries is constructed by the SQL statement in Example 4-5.
Example 4-5 Write scoring results into table CustomerSegmentsTable --- Use view CustomerSegmentsView to score data and to --- write scoring results into a table CustomerSegmentsTable INSERT INTO CustomerSegmentsTable SELECT Customer_id, IDMMX.DM_getClusterID( Result ), IDMMX.DM_getClusScore( Result ) FROM CustomerSegmentsView;
The table CustomerSegmentsTable is defined once as shown in Example 4-6.
Example 4-6 Defining the CustomerSegmentsTable
--- DROP TABLE CustomerSegmentsTable;
CREATE TABLE CustomerSegmentsTable (
  Customer_id CHAR(8),
  Cluster_id  INTEGER,
  Score       FLOAT );
The SQL statements in both of the previous examples are listed as part of the score.db2 file. This file is available in the additional materials for this redbook. For more information, see Appendix K, “Additional material” on page 301. The reason for creating a table in addition to a view is that the bank’s account managers, who are involved in client engagements over a period of time, want to see static results in their reports, not scores that change dynamically from moment to moment. If the bank’s end users require only real-time scoring, then only the view is necessary.
4.5.4 Application integration
As part of the business requirement to integrate with an existing enterprise query and reporting tool, this section demonstrates how end users can visualize scoring results in their familiar reporting tool, in this case Business Objects.
Integrating with Business Objects reports
The purpose of profiling the bank’s customers with data mining is to discover the different behavior groups in your customer base by extracting the natural segments in your data. This allows your end users to identify different customer types and to develop the most appropriate way to service (and price) the different customer groups.
The mining results are available in plain database tables or views. This makes them readily available to a reporting tool. There is no need to configure mining functions in the report. Even better, when mining results are read from the view, the scores are computed on the fly. That is, when a user of Business Objects loads a report, the scores are evaluated in real time.
The customer profile scores are visualized with the cluster visualization that IM Visualization offers. The visualizations are embedded as applets into an HTML version of a marketing report for the end user. To embed the IM Visualizers in an HTML document as an applet that is either a graphic inside the HTML version of the report or that is initiated from a Start button on the report, see 11.3.1, “Using IM Visualizers as applets” on page 235.
Example 4-7 shows the code used to generate the report pages that are shown later in this section. This particular piece of code shows how to include the IM Visualization applet with a graphical view of clusters on a page titled “New scores of today” in a Business Objects report. The complete code is provided as an example in the additional materials for this redbook. The code is in the
0 - File0.htmlToshow10customerscoresplusgraphicalviewIMVisAppletpageNew ScoresToday.htm file. See Appendix K, “Additional material” on page 301, for more information.
Example 4-7 HTML code to show IM Visualization applet in BO report
<TITLE>New scores of today</TITLE>
<META NAME="GENERATOR" CONTENT="BusinessObjects 4.0 on Windows">
...
Scoring of new customers of the bank at: ...
Demographic clustering of customer base of an US bank
...
Customer    Segment    Score
65000001    1.00       0.49
...
Typically end users want to see the score results in an application environment to which they are accustomed. An end-user reporting tool, such as Business
Objects, would suffice. The customer profile scores and profiles are shown together in a Business Objects customers’ profiles report. The different report pages that are then portrayed in a Web environment to the end users are shown in Figure 4-5, Figure 4-6, Figure 4-7, and Figure 4-8 respectively. The report consists of three pages. The first page (Figure 4-5) is displayed graphically to the end user. It shows the scores for 10 new customers in the bank customer database.
Figure 4-5 Business Objects report first page: Scores graphical display
The scores indicate the individual match of the 10 customers to the clusters (one to six). Next to this, the end user sees a visualization of the six clusters in an applet. They can click the clusters to understand, or remind themselves of, the characteristics that a customer has according to the cluster profile they fit best. Next to this graphical presentation, the end user also has a textual representation of the same six clusters (see Figure 4-6).
Figure 4-6 Business Objects report first page: Scores textual display
In this scenario, the bank has six clusters. Several of the new customers were scored primarily to cluster 1 and cluster 5. The end user sees that cluster 1 typically characterizes these customers to be predominantly:
A single male
Mostly between 20 and 25 years of age
Resides in the state of New York
Doesn’t own a home but owns a car
Has a low income
Has a low savings account
Has been with the bank for only three years
Has made very few transactions across all channels

This may be a customer that the account managers decide not to target for cross-sell actions, but instead inform to set more money aside in the savings account to lower the base monthly fee for keeping the savings account. Or the bank account manager informs this customer to use the Web channel more often instead of writing checks, because it would be less costly to process for both parties, the customer and the bank. Likewise, cluster 5 shows a profile that is characterized as predominantly:

A divorced male
Above 45 years of age
Resides in the state of New York
Owns both a home and a car
Has a low savings account
Has a high income and high mortgage amount
Does a low number of teller transactions but a higher number of ATM transactions
Has been with the bank for only two years at present
The account managers may call this customer a “risk-taking yuppie”. Due to this information distilled from the report, the bank manager decides to treat this type of customer with care for the next two years. That way, when a change in social status happens in the near future, the customer’s financial status, apart from the already high income, may become even better. Then this customer is likely to buy more services from the bank at that point.
Note: The file for visualizing the clusters in IM Visualizer is called Demographic clustering of customer base of a US bank.vis. You can find this file in the additional materials that accompany this book. For more information, see Appendix K, “Additional material” on page 301.
The Business Objects report also displays a page to the end user where each individual customer may be analyzed. Again, the visualizer for the particular segment that most likely fits the profile of this new customer is shown in an applet (see Figure 4-7). This customer with ID ‘65000010’ fits cluster 5 with a score of 0.41. Based on the cluster details, the bank manager has another view of the customer characteristics according to their fit to cluster 5. The information that the manager distills from this particular view again validates the bank manager’s idea to treat this type of customer with care for the next two years and to contact this customer by phone at regular intervals.
Figure 4-7 Business Objects report second page: Graphics cluster
Figure 4-8 shows the third page of the HTML report. It includes the details of the 10 new individual customers.
Figure 4-8 Business Objects report third page: Detailed data of selected customer
With this view of each individual customer, the bank employee who is in contact, at some point, with the customer sees the customer’s transactional and demographic data values (used for scoring), together with the segment identifier and score. Through this view, the bank employee has an overview of which demographic, relationship, and transactional data values, aggregated or not, have led to the score fit to cluster 5. It gives the bank employee the means to use the cluster fit as a simple trigger to address the needs of the customer. By also having the basic and historical data, the employee is more effective in relating to the customer.
4.6 Benefits
Apart from the business-oriented benefits stated in 3.1.1, “Business benefits” on page 35, there are several, mostly technically oriented, benefits to addressing the business issue in the way that is proposed in the implementation flow of this business scenario.
4.6.1 End-to-end implementation
The bank’s requirement to characterize, know, and profile its customers is now achieved via the end-to-end implementation flow where:
The operational data is accessed.
Clustering is applied.
IM Scoring is invoked for repeatable scoring of new customers.
The actual scores are visualized for the end users in the Business Objects reporting tool.
Scoring on a regular basis and monitoring the changing score over a period of time can rapidly bring to the bank’s attention those customers with unusual or sudden changes in behavior. You can do this using the score itself and not simply the movement between clusters, which may occur too slowly.
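A minimal sketch of such monitoring (the history table CustomerSegmentsHistory and the 0.2 threshold are assumptions for illustration; the other names come from Examples 4-5 and 4-6):

SELECT curr.Customer_id,
       prev.Score AS previous_score,
       curr.Score AS current_score
FROM   CustomerSegmentsTable curr,
       CustomerSegmentsHistory prev
WHERE  curr.Customer_id = prev.Customer_id
  AND  ABS(curr.Score - prev.Score) > 0.2;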
4.6.2 DB2 mining functions next to the workbench
A workbench, such as DB2 Intelligent Miner for Data, is used in a separate environment from the operational database. IM Scoring, however, is set up to access the operational database. The workbench is meant to find an initial clustering model in one run. But it is the IM Scoring functions that facilitate repeated and timely profile scoring. They can profile newly arrived customers in the database in a timely manner via a trigger that stems from a business rule that, for example, says: “If the customer bought a home and is applying for a mortgage, the bank should also look at their profile characteristics again and re-score.”
4.6.3 Real-time analytics
The segmentation model and the scores can also be implemented in your operational database system. The scoring functions can be accessed via the SQL or the Java API. Then they can be invoked to score the bank customer data from within an application and display the results directly in a Business Objects report. This allows your segmentation model to work in real time. This is particularly relevant if you want to score new customers on the fly and on demand.
4.6.4 Automated and on demand for multi-channels
IM Scoring functions offer repeated and timely profile scoring, without the need for human interaction. The production of reports for each workday can be automated by scheduling batch jobs that apply the scoring function to the operational data at hand. The end users of the reports are bank personnel, such as the bank manager and the bank teller operators, at the times when they interact with a customer. The automatic scoring can also trigger new message flows at the bank’s ATMs. All of these communication channels interact with the identified customer at the customer touch point. And all of these channels, which a customer may prefer from time to time, can use the profile information of the segmented customer to target the customer via effective CRM activities.
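As a minimal sketch, a nightly batch job could simply rerun the statements from Examples 4-5 and 4-6 to refresh the static report table (the full-refresh strategy shown here is an assumption; an incremental refresh is equally possible):

-- Refresh the static report table from the scoring view.
DELETE FROM CustomerSegmentsTable;
INSERT INTO CustomerSegmentsTable
  SELECT Customer_id,
         IDMMX.DM_getClusterID( Result ),
         IDMMX.DM_getClusScore( Result )
  FROM CustomerSegmentsView;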
Chapter 5. Fraud detection example

This chapter describes a scenario on how to use the data mining functions from scratch to build a quick and reusable data model and to score data. The scenario shows how to integrate the functions into existing operational applications, for example, to detect fraud in telecommunications. This scenario demonstrates how you can perform both modeling and scoring in the database. It also describes the modeling and scoring steps in SQL in more detail.
5.1 The business issue
Fraudulent behavior is detected when a telecommunications company tries to bill a customer but never receives any payment. It is possible to manually evaluate these cases, classifying them as fraud, bankruptcy of the customer, or a technical processing error. But fraudulent customers are creative and use techniques that were not previously identified as characteristic of fraud. The ability to detect “unusual” behavior, without knowing in advance which features make a certain behavior unusual, is necessary to detect these cases as early as possible. This business scenario is based on a real-life fraud detection exercise performed by IBM data mining experts at a large telecommunications provider in Europe. This organization was aware that it was a victim of fraud in its Premium Rate services business unit. Premium Rate services provide expert hotline services on such topics as:
Advice on insurance and legal matters
Stock market tips
Adult services
The charge for these services can be up to 2 euro per minute and is billed by the telecommunications company. The business model is that the company offering such a service immediately earns quite a high share of this rate from the phone company, whereas the phone company itself charges the caller for the entire amount. Given this arrangement, fraud can be carried out in the following way:
1. The (fraudulent) company offering a service via a premium number cooperates conspiratorially with a partner.
2. The partner makes frequent and long phone calls from one or a few other phone numbers to the (expensive) premium number.
3. A high amount of call charges accumulates within a few weeks (this task can be technically facilitated by using a computer or an automatic dialing device to perform the phone calls).
4. The service provider receives their comparably high share of the call charges from the phone company.
5. Meanwhile, the phone company tries in vain to get its money from the conspiratorial partner. But the partner most likely used a false name and disappeared, so their conspiracy with the service provider cannot be proven.
Note: This kind of fraud falls into a category commonly known as ghosting in the telecommunications industry. This kind of fraudulent behavior is also common in other industries where service providers charge for services that were never carried out. The technique used in this scenario may equally apply to other industries where there is a high incidence of fraud.

In the past, fraud detection models were typically created in a workbench environment by data mining experts using algorithms that produce predictive models. There are a few practical inconveniences with this approach:

Effort to model fraud
In general, fraudulent behavior is relatively scarce, and labeled fraudulent behavior is even more so. Using a classification approach, fraud has to be identified up front and fraudulent cases must be labeled in the data. In reality, this is both time consuming and resource intensive for most organizations. Even where a business can accurately identify fraudulent behavior, fraudulent transactions are relatively scarce; in most businesses, they make up less than 1% of the total number of transactions. Analyzing rare cases, such as fraud, requires artificial injection of data for modeling purposes, which in turn requires more planning, documentation, and effort.

Timeliness of the fraud detection model
It usually takes a significant amount of time to produce the model. By the time it is deployed to the fraud detection system, the model has lost some of its predictive power. The result is fraud detection models that perform well in the workbench but fail when deployed to the business.

Shelf life of the fraud model
In addition, due to its elusive nature, improper behavior changes over time, and certainly with increasing speed. It is typically nontrivial to define known fraudulent behavior and model it in a timely manner. By the time the model is deployed, manual recalibration of model performance may be required. The fraud model takes a long time to build and has only a short useful life, or shelf life. It is similar to perishable goods that must be used by a certain date.
5.2 Mapping the business issue to data mining functions
Although fraudulent behavior can take many different patterns, a common feature of most fraudulent behaviors is that they differ from the norm in some way. As an alternative approach, we look at how to use a deviation detection technique to uncover fraudulent behavior in a more automated manner. This scenario focuses on a technique called demographic clustering in IM Modeling and shows how it was successfully applied to fraud detection in a telecommunications environment. The approach addresses the business issues that were identified previously:

The deviation technique requires less effort
Part of the problem of modeling fraud is that you don’t know what to model. The deviation detection technique overcomes this problem in some ways, because it does not require a target variable or labeled data. Fraudulent cases do not need to be defined up front. This is useful since fraud is not always easy to define with a high degree of certainty. Using a demographic clustering technique, you group the connections that are similar to each other into clusters and assign them cluster IDs. Potentially undesirable behaviors are likely to deviate from the norm; for example, they are usually found in the clusters with the smallest size.

The deviation approach results in a faster time-to-market model
In general, clustering models are faster to build because there is no need to identify previously confirmed fraudulent cases. Consequently, there is no need to oversample. In many cases, this reduces the inertia to produce a model.

Increase the life of a model by automatic recalibration
In the past, fraud detection models had a short shelf life, meaning they took a long time to build but had only a short useful life. By its nature, fraud evolves quickly, sometimes from week to week. Using the SQL API provided in IM Modeling, the model can be recalibrated at a much more frequent interval. The SQL API allows the model to be rebuilt on fresh data as a scheduled job. This approach offers an opportunity for fraud detectors to keep up with the pace at which fraud evolves.

The result of the modeling run is a set of clusters. Empirical evidence has shown that by examining the bottom n smallest outlying clusters, organizations have identified undesirable behaviors of which they previously knew. The semi-automatic approach may not produce models that are as accurate as ones that are produced by highly skilled data miners. However, the objective here
is to produce a model quickly in a production environment, rather than a perfect model in a controlled environment. When you consider the value of a model as a function of predictive power and timeliness, this is a viable approach to modeling fraud in terms of “time to market”.
5.3 The business application
The application in this scenario is fed by data generated in Call Detail Records (CDRs). Considering the changing nature of fraud, it can be difficult to come up with hard and fast business rules for fraud detection. Given the volume of data involved, some business rules were set up to ensure effective use of resources, for example:

The system only mines call connections that involve a caller_id making long calls.
Limited resources allow for detailed investigation of only the five smallest clusters in terms of rank.
Given that the clustering technique only flags a set of records as “different” from the crowd, rather than giving a propensity that a connection is fraudulent, each connection in the “interesting clusters” should always be cross-examined by business analysts or investigators.
Note: Clustering techniques are commonly used for segmentation purposes where users are interested in larger segments. But clustering also finds small clusters and marks entries that don’t really fit into any cluster. That is, you can use clustering as a method for the detection of outliers or unusual patterns. Here, the focus is only on the small clusters instead of the larger ones.

The business application was previously built in a less automated manner using Intelligent Miner for Data. The model built by using the demographic clustering technique proved to be successful in identifying clusters of transactions that were different from the norm and later proved fraudulent by investigators. The operational use of the first version of the solution started at the phone company as early as 1998. A customer uses the solution permanently and with great success in several locations. Each week about 5,000,000 new CDRs are analyzed. Fraud attempts on the scale of tens of thousands of euro are detected and prevented. The return on investment (ROI) was achieved after only six months. This scenario builds on the success of version 1 of the model. The model parameters are translated into SQL scripts, using the IM Modeling and IM Scoring SQL APIs. After the models are coded in SQL scripts, the workbench approach
in version 1 of this solution is transformed into an automated and table-driven approach to data mining. Most components of the application are objects in the database. It is scheduled to rerun and remodel at regular intervals, ensuring that the initial success is repeatable. Every time the model is refreshed (by the scheduler), a new list of suspicious transactions is generated. The investigator then uses Business Objects to look at a report on the list (generated as a DB2 Universal Database (UDB) view). Refer to Figure 5-6 on page 96 to see the end-user report.
5.4 Environment, components, and implementation flow
In this scenario, the modeling environment is the same as the scoring environment. When a new model is built, the solution immediately reads the scoring results and chooses records from the very small clusters. The environment is enabled once. This enablement creates both the mining and scoring extenders and APIs inside the database, so that the mining and scoring scripts can execute. The application environment consists of four components:

A DB2 UDB script that creates the model: The model created by the mining script is stored as a Binary Large Object (BLOB) inside a special DB2 UDB table created during the enablement process.
A DB2 UDB script that applies the model
A scheduled job
A Business Objects report that highlights the suspicious connections

A DB2 UDB job is scheduled to run every week. This job runs two DB2 UDB scripts that mine and score the data in succession. The first script sets up the mining environment and creates the mining model. The second script selects the model that was created and scores the connections. The result of the scoring is that each connection is pigeonholed into a cluster. The last component is the Business Objects report that lists the transactions that fall in the smallest clusters. Based on empirical evidence, these transactions deviate from normal behavior and are likely to be fraudulent connections.
Note: In this scenario, because the modeling and scoring environment are the same, there is no need to transport the model between modeling and scoring environments. In case the two environments are different, an extra step may be required to transport the model.
Figure 5-1 illustrates the mining and scoring environment for this fraud detection application.
Figure 5-1 Fraud detection: Deployment environment and components
This architecture allows you to turn the art of fraud detection into a science. By porting it from a workbench to the production environment, it is easier to get such a project off the ground and ensure repeatable success.
5.5 Data to be used
For each phone call, the phone company records a Call Detail Record, which is stored in a database. This database is a useful starting point for detecting fraud. A CDR consists of about 50 fields, of which the six most important with respect to the solution described here are:
Identification of the caller (CALLER_ID in Table 5-1)
Identification of the premium number called (PREMIUM_ID in Table 5-1)
Date when the phone call started
Time when the phone call started
Duration in seconds
Charges

Table 5-1 shows the fields in the CDR that are used in this scenario.

Table 5-1 CDR fields
Caller_ID   Premium_ID   Start date   Start time   Duration   Charges
87645       1800...      12/12/01     0400         6000       900
87645       1899...      12/12/01     0400         6000       900
(... 40 other fields not used)
5.5.1 Data extraction
CDR data is sourced at regular intervals from the exchanges and is loaded into a database every week. Given the large volume of data involved, the trend is to extract the data in near real time, using middleware such as MQSeries Integrator. Detailed descriptions of data extraction techniques are beyond the scope of this case study. From the CDRs, the following tasks are performed to create the connection table for modeling:

Select CDRs for long distance calls to reduce the amount of data to be analyzed
Enrich the data by creating derived variables that describe connection behavior
5.5.2 Data manipulation and enrichment
CDRs are loaded into DB2 UDB, and extra variables are derived from the original data. Based on previous work done in the generic data preparation steps, the data miner and business analyst, drawing on their intimate business knowledge, came up with seven extra derived fields. These fields encapsulate connection patterns and reveal the underlying undesirable behavior. Table 5-2 lists the seven attributes used for clustering.
Table 5-2 Attributes used for clustering

Derived column from the enrichment process   Description
SUM_DUR    Whole duration of all calls on a specific connection (that is, from a specific caller to the premium number)
NO_CALLS   Number of all calls on the connection
REL_DUR    Field indicating whether the connection has an extraordinarily high share of the turnover generated by all connections with the same premium number (exactly: the relationship between the duration of all calls on the specific connection and the average duration of all different connections with the same premium number)
SUM_COST   Call charges for the connection
MAX_DUR    Duration of the longest call on the connection
VAR_DUR    Variance of the call duration on the connection
NO_CLRS    Number of all different connections to the premium number
Figure 5-2 illustrates the transformation process to derive the seven attributes.
Figure 5-2 Data enrichment process
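As a rough sketch of this aggregation step (the raw CDR table name and its lowercase column names are assumptions; REL_DUR and NO_CLRS require a second aggregation over all connections to the same premium number and are omitted here):

INSERT INTO CONNECTION_TABLE
  (CALLER_ID, PREMIUM_ID, SUM_DUR, NO_CALLS, SUM_COST, MAX_DUR, VAR_DUR)
SELECT caller_id,
       premium_id,
       SUM(duration),      -- whole duration of all calls on the connection
       COUNT(*),           -- number of calls on the connection
       SUM(charges),       -- call charges for the connection
       MAX(duration),      -- duration of the longest call
       VARIANCE(duration)  -- variance of the call duration
FROM   CDR
GROUP BY caller_id, premium_id;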
You can find the SQL script for performing this data preprocessing in Appendix C, “SQL scripts for the fraud detection scenario” on page 255.
5.6 Implementation in DB2 UDB V8.1
DB2 UDB V8.1 or later supports extra XML functions that allow you to run the modeling step in fewer lines of code. To run IM Modeling in DB2 UDB V8.1 or later, refer to the process flow diagram in Figure 5-3.
Figure 5-3 shows the process flow: enable the database for modeling and scoring; install the extra UDFs and stored procedures; then run the stored procedure that defines the data table settings, defines the model parameter settings, and builds the mining task.
5.6.1 Enabling the database for modeling and scoring
Before you can use the modeling function and the scoring function, you must enable them by configuring the database. Refer to Table 10-2 on page 214 and Table 10-3 on page 214 for details.
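For example, assuming that the idmenabledb utility shown in Example 4-2 also applies here and that the fraud database is named premiums (as in Example 5-2), the scoring enablement would look as follows; the exact options for enabling modeling may differ, so check the tables referenced above:

idmenabledb premiums fenced tables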
5.6.2 Installing additional UDFs and stored procedures
With DB2 UDB V8.1, some extra UDFs and stored procedures are provided through this redbook to simplify the creation of a data mining model. Refer to Appendix F, “UDF to create data mining models” on page 281, to install these new database objects.
5.6.3 Model building
After you install the additional database objects, you use the new stored procedure BuildClusModel to build a clustering model. The following example uses the stored procedure to create and save a segmentation model named “Connection_Segmentation” in the table IDMMX.CLUSTERMODELS (the default). The parameters allow a maximum of 30 clusters to be defined and set the fields “CALLER_ID” and “PREMIUM_ID” to be informational fields only. You can insert additional parameters into the stored procedure by using additional methods. For a list of commonly used methods for clustering models, refer to Table 10-6 on page 217. After you invoke the stored procedure shown in Example 5-1, DB2 UDB creates a model called “Connection_Segmentation”.
Example 5-1 Stored procedure to invoke clustering run in DB2 UDB V8.1
call BuildClusModel('Connection_Segmentation', 'CONNECTION_TABLE',
     ClusSettings('CONNECTION_TABLE')..
       DM_setMaxNumClus(30)..
       DM_setFldUsageType('CALLER_ID',2)..
       DM_setFldUsageType('PREMIUM_ID',2)..
       DM_expClusSettings() );
You can now use the model to score the table as explained in 5.7.7, “Applying the scoring model” on page 90.
5.7 Implementation in DB2 UDB V7.2
Figure 5-4 illustrates the step-by-step implementation of the model in DB2 UDB V7.2.
5.7.1 Prerequisite: Initial model building
Model building was first performed in DB2 Intelligent Miner for Data. Model building with new cases of data is best performed in an interactive workbench environment, where data can be visualized and understood interactively. Initial model building is also typically performed in cycles, where the data miner may experiment with different parameter settings and configurations. For example, for the fraud detection exercise, you may want to experiment with the maximum number of clusters visually before you commit the max_clusters parameter into the scheduled job using the IM Modeling function.
5.7.2 Data settings
The first step to integrate the modeling step into DB2 UDB is to define the data settings using UDFs in the IM Modeling data mining function:

The table name
The usage of table columns for modeling

The script in Example 5-2 defines the data setting in DB2 UDB V7.2.
Example 5-2 SQL script to define data setting in DB2 UDB V7.2
connect to premiums;
delete from IDMMX.miningdata where id = 'Connection';
insert into IDMMX.miningdata values ('Connection',
  IDMMX.DM_MiningData()..DM_defMiningData('CONNECTION_TABLE')..
  DM_SetColumns(' '));
5.7.3 Model parameter settings
After the input data for the modeling step is defined, the next step is to define the modeling parameter settings. For this phase, it is assumed that the parameter settings have already been defined and optimized by a highly skilled
data miner. The IT specialist implements the model in a production environment by plugging these parameters into the SQL script that uses the IM Modeling function. For the technique that is used for this scenario, demographic clustering, the possible parameters are:
Active/supplementary fields
Field weights
Value weighting
Similarity scale
Treatment of outliers
Similarity matrix
Maximum number of clusters
Similarity threshold
Desired execution time
Minimum percentage of data to be processed
The script in Example 5-3 shows how to define cluster training parameters.
Example 5-3 SQL script to define model setting for clustering
delete from IDMMX.ClusSettings where id='Connection_Segmentation';
insert into IDMMX.ClusSettings
  select 'Connection_Segmentation',
         IDMMX.DM_ClusSettings()..
           DM_useClusDataSpec(MININGDATA..DM_genDataSpec())..
           DM_setMaxNumClus(30)..
           DM_setFldUsageType('CALLER_ID',2)..
           DM_setFldUsageType('PREMIUM_ID',2)
  from IDMMX.MiningData
  where ID='Connection';
In this scenario, mostly default parameter settings are used, except for setting the maximum number of clusters to 30. For a market segmentation exercise, 30 may be too high. For detection of undesirable behavior, however, 30 or more is appropriate, since the goal of performing the segmentation is to identify the clusters of out-of-the-norm behavior. Because fraudulent behavior is likely to be out of the norm in some way, setting a large number of clusters has a better chance of unearthing those small clusters.
5.7.4 Building the mining task
The data settings and model settings are used as input to build the mining task. The task is stored as a UDT in IDMMX.ClusTasks. The script in Example 5-4 demonstrates how to build a data mining task.
Example 5-4 SQL script to build a data mining task
delete from IDMMX.ClusTasks where id='Connection_Segmentation_Task';
insert into IDMMX.ClusTasks
  select 'Connection_Segmentation_Task',
         IDMMX.DM_clusBldTask()..DM_defClusbldTask(d.miningdata, s.settings)
  from IDMMX.MiningData d, IDMMX.ClusSettings s
  where d.id='Connection' and s.id='Connection_Segmentation';
5.7.5 Running the model by calling a stored procedure
After the task is built, you can invoke it by calling a stored procedure. This task is typically scheduled to run at regular intervals, thereby recalibrating the segmentation model much more regularly than in the past. The script in Example 5-5 demonstrates starting a clustering run by calling the procedure IDMMX.DM_BuildClusModelcmd.
Example 5-5 Calling a stored procedure to run the segmentation task call IDMMX.DM_BuildClusModelcmd('IDMMX.CLUSTASKS','TASK','ID', 'Connection_Segmentation_Task', 'IDMMX.CLUSTERMODELS','MODEL','MODELNAME', 'ConnectionSegmentationModel');
5.7.6 Scoring script generation
Scoring script generation needs to be done only once, when configuring the environment. The scoring script depends on two inputs:

The input table
The model to use
The score script in Example 5-6 is part of the SQL script generated by the IDMMKSQL command.
Example 5-6 Creating the SQL script for scoring with a data mining model CREATE VIEW Scoring_engine( premium_id, caller_Id , Result ) AS SELECT data.Premium_id, data.caller_id, IDMMX.DM_applyClusModel(models.model, IDMMX.DM_impApplData( REC2XML(1,'COLATTVAL','', data."NO_CALLS", data."NO_CLRS", data."SUM_DUR", data."REL_DUR", data."SUM_COST", data."MAX_DUR", data."VAR_DUR"))) FROM IDMMX.CLUSTERMODELS models, connection_table data WHERE models.MODELNAME= 'ConnectionSegmentationModel';
5.7.7 Applying the scoring model
The complexity of the score code is encapsulated in the view. To perform the scoring, you read from the view that contains the score script (Scoring_engine in Example 5-6). At a minimum, you want the record ID and the cluster ID. Example 5-7 illustrates the scoring script using the input data and the model.
Example 5-7 Scoring with a generated script
INSERT INTO Allocated_Cluster
  SELECT PREMIUM_ID,
         CALLER_ID,
         IDMMX.DM_getClusterID( Result ),
         IDMMX.DM_getClusScore( Result )
  FROM scoring_engine;

--The Allocated_Cluster table is defined once as below; the SQL is modified
--from the result script generated by IDMMKSQL.
CREATE TABLE Allocated_Cluster (
  premium_id CHAR(12),
  caller_id  CHAR(12),
  Clus_id    INTEGER,
  Score      FLOAT );
5.7.8 Ranking and listing the five smallest clusters
The connections are now scored with the segmentation model, and each connection is assigned to a cluster. Since the goal is to find connections that exhibit behaviors departing from the norm, rank the clusters by their size in ascending order, so that the smallest clusters receive the lowest ranks. By extracting the connections that have a rank of less than or equal to five, you obtain a subset of connections that are different from the norm. By examining these connections in detail, there is a good chance of finding connections that are potentially fraudulent or undesirable. The final step of this business scenario is to generate a list of connections in the smallest (bottom) five clusters, as shown in Example 5-8.
Example 5-8 Script to list the connections in the five smallest clusters
select scored.clus_id, attr.caller_id, attr.premium_id,
       attr.sum_dur, attr.no_calls, attr.rel_dur, attr.sum_cost,
       attr.max_dur, attr.var_dur, attr.no_CLRS
from allocated_cluster scored, connection_table attr
where scored.premium_id = attr.premium_id
  and scored.caller_id = attr.caller_id
  and scored.clus_id in
      (select clus_id
       from (select clus_id, count(*),
                    rank() over(order by count(*)) as top_N
             from allocated_cluster
             group by clus_id
            ) as temp
       where top_n <= 5
      )
order by clus_id
Alternatively, you can set up a business rule that extracts connections from clusters that make up less than 1% of all connections. This is shown in the script in Example 5-9.
Example 5-9 Script to list all clusters with less than 1% of all transactions
create view risky as
select scored.clus_id, attr.caller_id, attr.premium_id,
       attr.sum_dur, attr.no_calls, attr.rel_dur, attr.sum_cost,
       attr.max_dur, attr.var_dur, attr.no_CLRS
from allocated_cluster scored, connection_table attr
where scored.premium_id = attr.premium_id
  and scored.caller_id = attr.caller_id
  and scored.clus_id in
      (select clus_id
       from (select clus_id,
                    cast(cast(count(*) as double) /
                         cast((select count(*) from allocated_cluster)
                              as double) as decimal(6,4)) as percentage
             from allocated_cluster
             group by clus_id
            ) as temp
       where percentage < 0.01
      )
You can find the full script for this scenario in Appendix C, “SQL scripts for the fraud detection scenario” on page 255. Make sure that you read the comments at the beginning of each section. The SQL scripts that set parameter settings need to be run only once, until you need to change the settings.
5.7.9 Actionable result for investigation
Now you have a list of the connections that exhibit behaviors far from the norm. These connections are sent to the investigators to determine whether they are legitimate connections or illegitimate activities. This is the part of the fraud detection process that requires experts from a forensic and investigation department.
5.7.10 Scheduling the job to run at regular intervals
The last step of the implementation is to schedule the script to run at regular intervals. The steps are:
1. Start the DB2 Control Center.
2. Bring up the Script Center from the Tools pull-down menu.
3. Import the SQL script that performs all the previous steps into the Script Center.
4. Select the job you just created.
5. Right-click to bring up the pop-up menu.
6. Select Schedule.
7. Fill in the required parameters.
Figure 5-5 shows an example of a DB2 UDB Scheduler window.
Figure 5-5 Sample window showing a job scheduled to run every day at 01:30 a.m.
5.8 Benefits
The benefits of the approach of using both modeling and scoring functions, as demonstrated in this scenario, are multi-fold.
5.8.1 A system that adapts to changes in undesirable behavior
The segmentation model is refreshed automatically at regular intervals. This allows you to detect new patterns of undesirable behavior. Because undesirable behaviors change over time, chances are that these behaviors are outliers in some way. This evolving approach is more capable of detecting new outliers. A word of caution, however, is that the segmentation technique is not a fool-proof way to uncover all undesirable behavior. White-collar crimes are usually committed by intelligent groups or individuals. Some practitioners in fraud
detection talk of a phenomenon that the best of those crooks have hidden themselves behind the outliers. They call it the twenty-fifth record phenomenon.
5.8.2 Fast deployment of fraud detection system
This scenario illustrates a highly automated mechanism of calibrating and deploying a fraud detection model. Time to market of the model is greatly reduced by this approach.
5.8.3 Better use of data mining resources
Data mining skills are scarce. It is generally expensive to keep a full-time data miner on board to maintain a fraud detection model. This approach allows an organization to use the services of a data miner only to define the model in the first instance. For example, external consultants can set up a model and then pass on the maintenance to the database development team for implementation.
5.8.4 A repeatable data mining process in a production environment
The model in its entirety is stored in DB2 UDB, and the entire process is table driven. This provides an environment that is conducive to automation. After an acceptable segmentation model is built, the settings can be translated to an SQL script. The SQL script can be scheduled and requires little maintenance. This approach guarantees repeatable success.
5.8.5 Enhanced communication
Since every component is stored in database tables, the entire model is documented. This includes data settings, model settings, tasks, and results. All components can be queried using SQL, or the IM Visualizer, by anyone who needs access. The communication process between the developers and data miners is greatly enhanced.
5.8.6 Leveraged IT skills for advanced analytical applications
IT skill is critical to the success of such a system. This approach allows more IT involvement in an advanced analytical system.
5.8.7 Actionable result
The result of this scenario is highly actionable. A list of risky transactions is generated in the database. With tools such as Business Objects, you can
generate a list of customer transactions that warrant further investigation. Figure 5-6 shows a simple Business Objects report that includes the connections that warrant close investigation.
Figure 5-6 Report on risky transactions in Business Objects
Chapter 6. Campaign management solution examples

This chapter describes multiple alternatives to integrate IM Modeling, IM Scoring, and IM Visualization in marketing campaigns, with or without a campaign management system. It provides an overview of campaign management and describes the following scenarios:

Trigger-based marketing without any campaign management system
A retention campaign and integration using the Unica campaign management system
A cross-selling campaign and full integration in a data warehouse using the Siebel campaign management system
6.1 Campaign management overview
As competition grows in the industry, companies look more than ever toward increasing profitability from existing customers, rather than expecting new customers to account for growth. The focus is on retaining the profitable customers. The telecommunications industry has proven this extensively. Companies determine the customers and the markets that bring the most profit, and drop those that do not fit their business model. Retaining existing customers is easier and less costly than attracting new ones. Naturally, that means that companies are keenly interested in cross-selling a variety of products and services to existing customers. Cross-selling can substantially increase customer profitability, especially if an organization targets valuable customers who are likely to purchase multiple products. But identifying good potential customers and effectively cross-selling without cannibalizing other product lines are not easy tasks. It is essential to efficiently monitor and act on marketing opportunities. Changes to customer profiles occur all the time. One-to-one marketing requires a responsive system and a large number of initiatives that can be managed only by a system that provides automatic event detection, comprehensive reporting, and support for multiple customer and product profiles.

Campaign management deals with three types of campaigns:

Outbound: The enterprise contacts the customer during a campaign. Outbound campaigns are usually targeted based on the customer profile.
Inbound: The customer contacts the enterprise.
Triggered: The enterprise contacts the customer when some new and important information becomes available.

With target marketing and triggered actions, the direct marketer is in charge of who, when, and how to contact a customer or prospect. With inbound interactions, the customer decides when to make contact and the purpose of the interaction. Outbound and targeted campaigns address a target group of customers that are selected based on specific characteristics. A good example of a targeted campaign is a credit card acquisition campaign. In this example, a company selects a group of prospects based on credit history and potential customer value to offer a pre-approved, low interest rate product. Another example is a credit card company that decides to pursue customers who were not timely in paying bills (but eventually paid them). Significant profit was made on late fees with minimal losses. This chapter explores solutions for outbound treatment in the following sections.
An inbound campaign is warranted when a company wants to create different treatments for customers when they interact with its channels. For example, the company may perform real-time personalization when the customer interacts with the enterprise using the Web. The main challenge here is to detect the right time to present the right offer or service. An important example of this type of campaign is customer complaint management. When a valuable customer calls a customer service hotline dissatisfied with the company’s services or products, it is important that the call center specialist can deal with that complaint to avoid escalation of the problem or even customer attrition. Most of the time, a call center specialist is not prepared to deal with this kind of situation due to inexperience or a lack of rules of engagement.

Triggered campaigns need to deal with decisions based on new information available to the system. There are two ways to implement a triggered campaign: using schedulers or using triggers. Scheduled campaigns can systematically check for data changes in the database, looking for patterns that qualify a customer to receive a new product offer, for example, or any other kind of marketing action. The problem with this approach is determining the right interval to check for new information in the databases. Intervals that are too long can lead to missed opportunities or customer attrition. Intervals that are too short may not be viable due to performance constraints. To run a campaign using triggers, the triggers need to be enabled in the database system or in the application.

The following sections describe three scenarios:

An inbound campaign in telecommunications triggered by a customer contact that can be easily set up without any campaign management system (6.2, “Trigger-based marketing” on page 99)
An outbound campaign in telecommunications to automatically schedule and launch a campaign using the Unica campaign management system (6.3, “Retention campaign” on page 114)
An outbound campaign in banking to trigger a cross-selling campaign using the Siebel campaign management system and integrating with the enterprise data warehouse environment (6.4, “Cross-selling campaign” on page 128)
6.2 Trigger-based marketing
Trigger-based marketing is the notion that the best time to activate a marketing campaign for a customer is when and as the customer’s needs arise. This
provides the trigger to send the right offer to the customer. Such marketing annoys the customer less and fits the real needs of more customers. Trigger-based marketing is useful for inbound campaigns. When a customer contacts an organization via a channel, this offers a good opportunity for the organization to detect whether there is a change in the customer’s needs. If a change is detected and a new need identified, a campaign that triggers the right treatment for this individual customer can be activated as soon as possible. Using trigger-based marketing, an organization can improve customer satisfaction and respond to profit opportunities in real time. This scenario shows how DB2 Universal Database (UDB) and the JavaMail API can provide the environment where simple, but effective, workflows can be built to implement trigger-based marketing without any campaign management system.
6.2.1 The business issue
The business issue in this scenario is the problem of customer churn in a telecommunications company, where marketing activities are not responsive enough. To take advantage of an inbound marketing opportunity, an organization must be ready and focused whenever and wherever the opportunity arises. The organization must be able to leverage the new information volunteered by the customer, analyze the customer's propensity to respond to a promotion, and act on this new knowledge in near real time. For many organizations, the current practice is to rescore the customer's propensity by batch job on a nightly or weekly basis. In today's business environment, this is increasingly ineffective.
Trigger-based marketing is often seen as a large investment. Many organizations turn to campaign management tools to implement this concept. However, these tools represent a major investment in terms of total cost of ownership and are beyond the reach of smaller marketing organizations.
6.2.2 Mapping the business issue to data mining functions
Scoring the propensity of customers to leave, in real time and in an inbound scenario, is supported by invoking the data mining function via DB2 UDB triggers.
DB2 UDB triggers provide real-time decisions and actions Triggers in DB2 UDB are a simple and cost-effective way to document and enforce business rules. Triggers are created to wait for a specified operation on a
table. For marketing purposes, if an interaction with a customer results in a change in their profile, scoring is automatically triggered and new scores are calculated milliseconds later. Based on the score, you can assess, for instance, the customer's likelihood to churn using up-to-the-second information. If the propensity exceeds a predefined threshold, remedial action can be launched, all in near real time. Using purely DB2 UDB features, you can create a workflow based on business rules by daisy chaining a series of triggers.
DB2 UDB offers a lower cost of entry for real-time marketing
A trigger is part of DB2 UDB. After a trigger is created, it remains secure as a DB2 UDB object inside a database. No extra software is required for triggers that touch only DB2 UDB objects. A trigger is implemented in SQL, for which skills are readily available, and it is self-documenting. From a business point of view, it requires a low level of maintenance after it is created.
6.2.3 The business application
This scenario showcases an application that combines a series of triggers to provide a simple trigger-based marketing application for churn reduction in near real time. Two triggers are created for real-time decision making and action. The first trigger monitors any change in the customer's attributes. When a change happens, the trigger rescores the customer in real time, replacing the old score with a new one. A second trigger takes action based on the new score: when the churn score is higher than 0.8 (80%), the trigger fires two e-mails, one to the customer to make an offer and another to alert the customer representative about the customer's risk of churning. The customer representative can then take remedial action before the customer churns. Using the same principle, you can build more elaborate business rules and triggers to improve the sophistication of the campaign. The trigger-based marketing technique here is used in conjunction with inbound retention campaigns.
6.2.4 Environment, components, and implementation flow
The process flow for deployment of the data mining process in this case is shown in Figure 6-1.
[Figure 6-1 shows the deployment environment. In the modeling environment, a mining run using a Radial Basis Function is performed against the data warehouse analytical data mart, and the model is exported. In the scoring environment, the model (BLOB), the data, and the scores reside in the operational data store, together with a Java UDF for e-mail support. Triggering events, such as profile changes, drive schedulers, jobs, model calibration, DB2 triggers, and e-mail alerts in the workflow application, which connects through the application integration layer (CLI/JDBC/SQLJ/ODBC) and the IM Scoring API.]
Figure 6-1 Deployment environment, components in customer churn system
In this scenario, there are four environments.
The workbench environment
The churn model was developed using a Radial Basis Function algorithm in DB2 Intelligent Miner for Data. It makes good sense to perform the initial modeling in an interactive workbench environment.
The workflow
The goal of this application is real-time decisions and real-time responses. Without a full-featured campaign management system, such as Unica or Siebel, the best way to achieve this is to implement the decision and response using DB2 UDB triggers. The workflow is shown in Figure 6-2.
[Figure 6-2 depicts the workflow: (1) an existing customer's profile changes (external event); (2) a DB2 trigger scores the customer, writing new scores next to the customer profile; (3) the new score triggers a marketing decision based on business rules; if the risk is high, action is taken now, and a UDF sends an e-mail to alert Customer Service to start a retention offer.]
Figure 6-2 A simple workflow implementation using DB2 UDB triggers
The workflow occurs in the following order:
1. A customer calls in and may change any of the profile details.
2. The event triggers the customer record to be scored and the results to be sent to a churn score table.
3. The score update triggers a decision. If the new score is above a predefined threshold (in our case 80%), the trigger sends an e-mail to the customer service officer. The e-mail message indicates to call this customer and take measures to entice them to stay with the organization.
The mining environment
A database called CHURN with the customer profile in the table ALL_CUSTOMERS is set up by an ETL process. In this scenario, we used DB2 Intelligent Miner for Data to connect to and mine from this CHURN database.
The scoring environment
The scoring environment is in a separate database. It has a table that is almost identical in structure to ALL_CUSTOMERS, except that it does not contain the CHURN field and contains fresh customer data to be scored. In addition, because the workflow requires sending e-mail alerts from within DB2 UDB, the scoring environment requires that DB2 UDB be JavaMail enabled. Figure 6-3 shows the implementation steps to run this case study of trigger-based marketing for real-time churn analysis and retention campaigns.
6.2.5 Step-by-step implementation
This section explains the steps to implement the workflow.
Configuration
The configuration steps for database enablement are performed only once.
Database enablement
The first step is to enable the DB2 UDB instance and the CHURN database so that the scoring functions work. To enable the DB2 UDB instance, refer to the steps in 9.3.1, "Configuring the DB2 UDB instance" on page 197, for IM Scoring. To enable the CHURN database, refer to the steps in 9.3.2, "Configuring the database" on page 198, for IM Scoring. This must be done only once for this working database and instance.
In addition to the DB2 UDB configuration parameters, because this deployment scenario involves DB2 UDB sending e-mail by triggers, additional Java components are required. The prerequisites of the e-mail alert components are:
- JDK 1.1
- JavaMail 1.2
- JAF 1.0.2
You can download the pre-requisites from the Web at: http://java.sun.com
Table 6-1 lists the steps required to configure DB2 UDB for sending e-mail by triggers. To learn how to use triggers in combination with JavaMail, refer to the article in IBM Developer Solutions on the Web at: http://www-106.ibm.com/developerworks/ibm/library/i-email
Table 6-1 Steps to configure the automatic e-mail component in DB2 UDB
1. Enter the DB2 UDB command: update dbm cfg using JDK11_PATH <jdk-install-directory>
   This points DB2 UDB to the JDK 1.1 installation directory.
2. Enter the following commands: db2stop, then db2start.
   Bounce the database manager so it picks up the new parameter.
3. Extract mail.jar from JavaMail 1.2 to a predetermined directory, for example: c:\JavaMail1.2.
4. Extract activation.jar from JAF 1.0.2 to a predetermined directory, for example: c:\JAF102.
5. Add the two JAR files extracted in steps 3 and 4 to the CLASSPATH environment variable.
6. Enter the DB2 UDB command: call sqlj.install_jar('file:///c:/JavaMail1.2/mail.jar','MAIL')
   This installs the JAR file into DB2 UDB.
7. Enter the DB2 UDB command: call sqlj.install_jar('file:///c:/JAF102/activation.jar','ACTIVATION')
   This installs the JAR file into DB2 UDB.
8. javac MailUDF.java
   This command compiles the sample Java file for this scenario.
9. copy MailUDF.class "%DB2PATH%\function"
   This command copies the Java class into DB2 UDB.
10. db2 -tvf CreateMailUDF.db2
   This creates the UDF.
11. db2 -tvf create_email_trgger.db2
   This creates the DB2 UDB trigger that sends e-mail based on a business rule.
Java class source code
Example 6-1 shows the source code of MailUDF.java, which creates a Java class to be registered as a DB2 UDF.
Example 6-1 Sample Java code creating a MailUDF that sends e-mail within DB2
/*
 * (c) Copyright IBM Corp. 2002. All rights reserved.
 *
 * This sample program is owned by International Business Machines
 * Corporation or one of its subsidiaries ("IBM") and is copyrighted
 * and licensed, not sold.
 *
 * You may copy, modify, and distribute this sample program in any
 * form without payment to IBM, for any purpose including developing,
 * using, marketing or distributing programs that include or are
 * derivative works of the sample program.
 *
 * The sample program is provided to you on an "AS IS" basis, without
 * warranty of any kind. IBM HEREBY EXPRESSLY DISCLAIMS ALL WARRANTIES
 * EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE. Some jurisdictions do not allow for the exclusion or
 * limitation of implied warranties, so the above limitations or
 * exclusions may not apply to you. IBM shall not be liable for any
 * damages you suffer as a result of using, modifying or distributing
 * the sample program or its derivatives.
 *
 * Each copy of any portion of this sample program or any derivative
 * work must include the above copyright notice and disclaimer of
 * warranty.
 */

/******************************************************************/
/* MailUDF example incorporating JavaMail                         */
/******************************************************************/
import java.util.*;
import java.math.BigDecimal;
import javax.mail.*;
import javax.mail.internet.*;

public class MailUDF
{
    public static String trim(String input)
    {
        return input.trim();
    }

    public static String mailer(String customerNO)
    {
        try
        {
            // Specify our host, intended recipient, and sender e-mail address
            String host = "au1.ibm.com";
            InternetAddress from = new InternetAddress("[email protected]");
            InternetAddress recipient = new InternetAddress("[email protected]");

            // Get the system properties
            Properties props = new Properties();

            // Set up the default parameters: the protocol, the host, and the port number
            props.put("mail.transport.protocol", "smtp");
            props.put("mail.smtp.host", host);
            props.put("mail.smtp.port", "25");

            // Create the session
            Session mySession = Session.getInstance(props);

            // Create our message
            MimeMessage myMessage = new MimeMessage(mySession);
            myMessage.setFrom(from);
            myMessage.setSubject("Customer Churn Alert");
            String messageText = "Dear colleague\n";
            messageText += "customer : " + customerNO + " shows a high propensity to leave.\n";
            messageText += "Please follow up !\n";
            messageText += "\n";
            messageText += "From your Real time Customer Intelligence Team";
            myMessage.setText(messageText);
            myMessage.addRecipient(Message.RecipientType.TO, recipient);

            // Send the message
            javax.mail.Transport.send(myMessage);
        }
        catch (Exception e)
        {
            System.out.println("An error has occurred: " + e);
            return "An error has occurred: " + e;
        }
        return "An E-mail was sent to [email protected]";
    }
}
UDF creation in DB2 UDB
Example 6-2 provides the DB2 UDB source code for creating a DB2 UDF from an external Java class named MailUDF.
Example 6-2 DB2 UDF source code for mail
create function mailAlert(customer_no varchar(20))
  returns varchar(70)
  fenced
  variant
  no sql
  external action
  language java
  parameter style java
  external name 'MailUDF!mailer';
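Once the UDF is registered, you can smoke-test it from the DB2 command line before wiring it into any trigger. This is a minimal check, not part of the original scripts; the customer number shown is a made-up value, and a successful run still requires that the SMTP host configured in MailUDF.java is reachable:

-- Invoke the mail UDF once, outside any trigger, to verify the JavaMail setup
values mailAlert('00000000001');

If the setup is correct, the statement returns the confirmation string built by the mailer method; otherwise, it returns the error text from the catch block.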
Table to be scored
The customer data used is based on data collected from clients of a European telecommunications provider. This customer table is referred to as the customer profile table. The data contains the fields shown in Table 6-2.
Table 6-2 Customer data in telecommunications churn case
CUSTOMER_AGE (integer): The age in years of the customer. The value "-1" means that the age of the customer is unknown.
CHURN (char(1)): The categorical churn variable, used as the class/predicted field. It has the values "0" (no churn) or "1" (churn).
REVENUE_DEVELOPMENT (decimal(12,2)): The development of the revenue. A positive value indicates revenue growth, and a negative value indicates revenue decline.
PREMIUM_ACCOUNT (char(1)): It has the values "0" (no premium account) and "1" (the customer has a premium account).
GENDER (char(10)): The values for gender are "male", "female", and "unknown".
BUYING_POWER (char(10)): The buying power of the customer. The values are "low", "average", "high", "very high", and "unknown".
CUSTOMER_RATING (char(1)): The telecommunications company's internal rating, which separates customers into three categories: "A", "B", and "C". The value "?" indicates that the rating is unknown for this customer.
CONV_PHONE_CONTRACT (char(7)): The values are "yes" (the customer has a conventional phone contract), "no" (no such contract), and "unknown".
CELL_PHONE_CONTRACT (char(10)): The values are "yes" (the customer has a cellular phone contract), "domestic" (the contract is restricted to domestic communications), "no" (no such contract), and "unknown".
CUSTOMER_NO (char(11)): The number of the customer.
NETWORK (char(10)): The network of the cellular phone contract. There are three networks: "Network 1", "Network 2", and "Network 3".
LOCATION_SIZE (char(15)): The discrete number of inhabitants of the town or village where the customer lives. The values are "< 5000", "5-10000", "10000-20000", "20000-50000", "50000-200000", "200000-500000", and ">500000".
CHANGE_OF_OFFER (char(1)): It has the value "1" if the customer has changed a product and "0" otherwise.
SOCIO_DEM_GROUP (char(25)): The socio-demographic group that the customer may belong to. The values include "worker", "conservative", and "upper class".
REVENUE_CATEGORY (char(10)): The internal revenue category of the customer. It has the values "RC1", "RC2", "RC3", "RC4", "RC5", and "RC6".
NO_SIM_CHANGES (char(1)): The number of changes of the SIM card. "1" indicates that a change took place, and "0" indicates that the SIM card has not been changed.
REVENUE_OF_3MONTHS (decimal(12,2)): The revenue of the last three months.
CONTRACT_NO (char(11)): The number of the contract.
DISTRIBUTOR (char(25)): The category of distributor where the customer bought the contract.
DURATION_OF_CONTRACT (integer): The duration of the contract.
The table with the customer data that the marketing department wants to score may already exist. Otherwise, it can be created (by the script "Script to create and load the Churn table for scoring.sql") by joining the required data fields from the various operational databases in the telecommunications company's environment, grouped by CUSTOMER_NO.
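As an illustration of such a population script, the following sketch assembles a cut-down scoring table from two hypothetical operational tables, DEMOGRAPHICS and BILLING; the table and column names are stand-ins, not the ones used by the scenario's actual script:

-- Hypothetical sketch: assemble a minimal scoring table from operational sources
create table CHURN_SCORING_MIN (
  customer_no          char(11) not null,
  customer_age         integer,
  gender               char(10),
  revenue_of_3months   decimal(12,2),
  duration_of_contract integer
);

insert into CHURN_SCORING_MIN
select d.customer_no,
       d.customer_age,
       d.gender,
       sum(b.amount),               -- revenue aggregated per customer
       d.duration_of_contract
from   DEMOGRAPHICS d
join   BILLING b on b.customer_no = d.customer_no
group  by d.customer_no, d.customer_age, d.gender, d.duration_of_contract;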
Workbench data mining
In this case study, we used the tree classification technique from DB2 Intelligent Miner for Data. Refer to Mining Your Own Business in Telecommunications Using DB2 Intelligent Miner for Data, SG24-6273, to see how to build a data mining model to predict churn with tree classification. The predicted field used is CHURN, as described in Table 6-2.
Exporting the PMML model
The model to be deployed is created in DB2 Intelligent Miner for Data. Now the model must be exported into a PMML file to use the existing data mining model for scoring. Refer to 9.3.3, "Exporting models from the modeling environment" on page 200, on how to export the model in PMML format.
Importing the data mining model
The model is imported into DB2 UDB using an SQL script. Refer to 9.3.4, "Importing the data mining model in the relational database management system (RDBMS)" on page 202, on how to import the model in PMML format. The code segment in Example 6-3 shows the code used to import the PMML model into DB2 UDB.
Example 6-3 SQL code used to import the PMML model into the CHURN database
insert into IDMMX.ClassifModels values
  ( 'Churn_Demo',
    IDMMX.DM_impClasFile( 'C:\temp\retention\Tree_churn.pmml' ) );
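To confirm that the import worked, you can list the models registered in the repository; this relies only on the MODELNAME column, which the scoring view below also uses:

-- Verify that the model is now registered in the repository
select modelname from IDMMX.ClassifModels;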
Create the scoring engine
The scoring script is based on two inputs:
- The target table
- The model to use
The scoring script in Example 6-4 is modified from the skeleton code generated by IDMMKSQL.
Example 6-4 The scoring code used for this scenario
Create view scoring_engine(Id,Result) AS
SELECT data.customer_no,
       IDMMX.DM_applyClasModel(models.MODEL,
         IDMMX.DM_impApplData(
           rec2xml(1,'COLATTVAL','',
             data."CUSTOMER_AGE", data."CHURN",
             data."REVENUE_DEVELOPMENT", data."PREMIUM_ACCOUNT",
             data."GENDER", data."BUYING_POWER",
             data."CUSTOMER_RATING", data."CONV_PHONE_CONTRACT",
             data."CELL_PHONE_CONTRACT", data."CUSTOMER_NO",
             data."NETWORK", data."LOCATION_SIZE",
             data."CHANGE_OF_OFFER", data."SOCIO_DEM_GROUP",
             data."REVENUE_CATEGORY", data."NO_SIM_CHANGES",
             data."REVENUE_OF_3MONTHS", data."DISTRIBUTOR",
             data."DURATION_OF_CONTRACT")))
FROM IDMMX.CLASSIFMODELS models, Churn_SCORING data
WHERE models.MODELNAME='Churn_Demo';
Applying the model and score
To simplify the implementation, the scoring code can be encapsulated in a view. One way to perform the scoring is to query the result view scoring_engine that contains the scoring script. At a minimum, you want the record ID and the predicted class with its confidence. The script in Example 6-5 shows the scoring.
Example 6-5 Scoring with the generated script
create table churn_Score(id char(11), class char(1), confidence float);

insert into churn_Score
select Id,
       IDMMX.DM_getPredClass( result ),
       IDMMX.DM_getConfidence( result )
FROM scoring_engine;
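With the scores materialized, plain SQL already gives the marketing team a worklist. A minimal sketch, reusing the 0.8 threshold that the alert trigger applies later:

-- List the customers most at risk, highest churn confidence first
select id, class, confidence
from   churn_Score
where  confidence >= 0.8
order  by confidence desc;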
Rescoring when the customer profile changes in real time
The DB2 UDB trigger in Example 6-6 encapsulates the business rule (that the customer is rescored when any of their attributes change).
Example 6-6 Trigger to run scoring code when a customer attribute changes
CREATE TRIGGER RE_SCORE
AFTER UPDATE ON CHURN.ALL_CUSTOMERS
REFERENCING NEW AS new
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  update churn_Score cs
    set (cs.class, cs.confidence) =
      ( SELECT IDMMX.DM_getPredClass( Result ),
               IDMMX.DM_getConfidence( Result )
        from scoring_engine se
        where se.id = cs.id )
  where cs.id = new.customer_no;
END
Tip: This scenario is churn reduction, which is typically run on an existing customer base. However, to profile new customers, you could create a trigger that scores on INSERT of a new customer record:
CREATE TRIGGER NEW_SCORE
AFTER INSERT ON CHURN.ALL_CUSTOMERS
REFERENCING NEW AS new_customer
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  insert into churn_Score
  select se.id,
         IDMMX.DM_getPredClass(Result),
         IDMMX.DM_getConfidence(Result)
  from scoring_engine se
  where se.id = new_customer.customer_no;
END
Application integration
In this scenario, whenever a customer develops a churn score of 0.8 or higher, the customer service team must be alerted to the increased likelihood of customer churn. The mechanism for this alert is e-mail. We use the JavaMail API to integrate DB2 UDB with the customer's in-house e-mail system. Refer to Table 6-1 on page 105 for a step-by-step guide to setting up the DB2 UDB and JavaMail integration. The trigger shown in Example 6-7 sends an e-mail alert according to a predefined business rule, which is whenever the churn propensity (newrow.confidence) is 0.8 or above. mailAlert is the DB2 UDF defined in Example 6-2 on page 108.
Example 6-7 DB2 UDB code to create a trigger to send e-mail
Create trigger alertRep
after update on churn_Score
referencing old as oldrow new as newrow
for each row mode db2sql
when (newrow.confidence >= 0.8) and (oldrow.confidence < 0.8)
  values(mailAlert(oldrow.id));
6.2.6 Benefits
The technical benefit of using data mining functions to implement trigger-based marketing is its simplicity:
Low total cost of ownership: For a DB2 UDB site with the IM Scoring function, no extra software is required. Triggers are part of DB2 UDB, and the JavaMail package is freely downloadable. The implementation is mainly triggers and SQL, which are easy to implement and require little maintenance. Triggers are also a good way to enforce and document business rules.
Using a series of triggers, the sophistication of the campaign can be easily enhanced. For example, instead of hardcoding a threshold of 0.8, you could use a threshold stored in another table, which in turn can be varied according to the budgetary constraints of an individual campaign; a sketch of this variation follows.
The combination of DB2 UDB triggers, JavaMail, and the IM Scoring API allows real-time, focused, intelligent, closed-loop, and relatively inexpensive marketing campaigns.
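As a sketch of the table-driven threshold, assuming a hypothetical CAMPAIGN_PARAMS table that is not part of this scenario, one possible variant of the alert trigger is:

-- Hypothetical parameter table holding per-campaign thresholds
create table campaign_params (
  campaign_id    varchar(20) not null primary key,
  risk_threshold float       not null
);
insert into campaign_params values ('RETENTION', 0.8);

-- Variant of the alert trigger that reads its threshold from the table
create trigger alertRepParam
after update on churn_Score
referencing old as oldrow new as newrow
for each row mode db2sql
begin atomic
  -- Fire the mail UDF only when the score crosses the stored threshold
  values( case
            when newrow.confidence >= (select risk_threshold
                                       from campaign_params
                                       where campaign_id = 'RETENTION')
             and oldrow.confidence <  (select risk_threshold
                                       from campaign_params
                                       where campaign_id = 'RETENTION')
            then mailAlert(oldrow.id)
            else 'no alert sent'
          end );
end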
6.3 Retention campaign
This campaign shows you how to build and integrate real-time scoring into your retention campaign process to reduce customer churn. A retention effort has several goals. One goal is to reduce customer attrition, which is covered in this case study. Another goal may be to improve customer loyalty so that current customers are not tempted by new, more attractive startup offers. A change in customer behavior can cause a change in the customer churn score. Depending on the new score, the customer may fall into a high-risk customer group and should be contacted by a campaign. The important issues in this solution are the online churn score and its integration into a campaign management system.
6.3.1 The business issue
This case study deals with the business problem of customer churn in a telecommunications company. It is defined as detecting the customers who did not renew their contracts in the last 18 months and who may have a high propensity to leave. By the time the customers have gone to a competitor, it is too late; the indicators for churn need to be identified early in the cycle. Churn may not be caused by dissatisfaction; it may be caused by price, coverage, and so on. The telecommunications company has found that the more services a customer has, the less likely they are to churn. For example, adding SMS, e-mail, Web access, and bill viewing to their phone makes customers less likely to churn. Identifying the right candidates and offering these services can tie them in regardless of new prices from competitors.
Reducing churn has become the top issue in the telecommunications industry since the economic and regulatory models changed in most countries. Previously, most utility services, including telecommunications, were managed by the state or government. In that monopoly model, the customer did not have the chance to move to another provider. Now customers can choose their telecommunications provider. The main business issue has become learning customer behavior (needs) and keeping track of it fast enough to anticipate churn. In this scenario, the differentiators from other companies may be:
- To measure the customer churn probability in real time in the production environment
- To launch and schedule a retention campaign every time the customer score changes
A new customer, or a change in the customer's behavior, triggers the score calculation. If the new score calculation places the particular customer in, for example, the High Risk Customer group, then the customer should be contacted. After the company develops a customer churn probability model, the probability calculation for each customer should run online in the production environment. This way any other application can access and use the resulting calculation (churn score). With the churn score available for each customer, the customers with a higher churn probability can be contacted in time, before they decide to change their provider.
6.3.2 Mapping the business issue to data mining functions
There are many ways to prevent customer churn. The business analyst can compare customers that have already left the company with those that are still customers. The analyst should look at all the variables that describe customer behavior, such as call center, billing, mediation, and market survey variables. The idea is that once a company has a data mining model to classify and score customer churn, the calculation should be implemented in the production environment. In this customer churn case, the tree classification technique provided by any data mining workbench supplies the model, and IM Scoring gives the churn prediction score. After you have a data mining model in PMML, IM Scoring can provide fast online churn scoring. The resulting churn score table can be updated by triggers on the most important external events, such as a new customer acquisition or a behavioral change of an existing customer. The process is done in two steps:
1. External events trigger a score calculation.
2. A retention campaign in the campaign management system is scheduled to run according to the new score value. For example, if the score is larger than a certain risk threshold, then a retention campaign should be launched.
6.3.3 The business application
Retention campaigns seek to retain high value customers through offers and promotions that improve their level of satisfaction and reduce their probability of attrition. You can decrease customer churn considerably by producing a retention campaign that is focused on customer needs, is fast in processing, and keeps track of customer behavior. The application developed here has to accomplish real-time scoring and the scheduled action to reduce the customer's propensity to leave. It requires an IT analyst and a business analyst to design and implement this solution. An example is a change in a customer's profile field such that their score is now bigger than the risk threshold (for example, 0.7). If they have also been a customer for more than 18 months, they should be contacted. A campaign script is designed to contact the group of customers that is considered High Customer Risk (a score bigger than the risk threshold and a customer tenure longer than 18 months), as illustrated by the selection query sketched below. The campaigns in the system must be designed to send different scripts under different customer conditions.
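As an illustration of the High Customer Risk selection rule, assuming the score table resulttable that is built later in this section, and assuming that DURATION_OF_CONTRACT is expressed in months (the unit is not stated in Table 6-2):

-- Select the High Customer Risk group: score above threshold, tenure over 18 months
select rt.id, rt.confidence
from   resulttable rt
join   CHURN.ALL_CUSTOMERS c on c.customer_no = rt.id
where  rt.confidence > 0.7
  and  c.duration_of_contract > 18;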
6.3.4 Environment, components, and implementation flow
Figure 6-4 shows the process flow for the deployment of the data mining process in this case.
[Figure 6-4 shows the deployment environment. In the modeling environment, a mining run using tree classification is performed against the data warehouse analytical data mart, and the model is exported as a BLOB into the operational data store, which holds the data, models, and scores. External events update the score table; schedulers, jobs, model calibration, and DB2 triggers drive the campaign management system, which delivers different marketing scripts and offerings. Applications with embedded scoring connect through the application integration layer (CLI/JDBC/SQLJ/ODBC) and the IM Scoring API.]
Figure 6-4 Deployment environment, components in customer churn system
In the modeling environment, there is a data mining workbench, such as DB2 Intelligent Miner for Data, that has the functionality of exporting the final model in PMML format. You can learn more about developing the churn data mining model in Mining Your Own Business in Telecommunications Using DB2 Intelligent Miner for Data, SG24-6273. In this scenario, we already have a database called CHURN with the customer profile in the table ALL_CUSTOMERS. We will use this database from now on. The scoring environment is also the production environment. The CHURN database has the churn score table and any other auxiliary tables that are necessary in campaign management, call center, and analytical environment. The churn score table is created with IM Scoring using the PMML data mining model from the workbench applied to the customer profile table. After you deploy the score table with IM Scoring, you can build DB2 UDB triggers to update this table whenever an insert (new customer) or an update (change in the customer behavior/fields) occurs in the ALL_CUSTOMERS table.
Using a campaign management system, you must first design the retention campaign. This campaign selects from the possible segment groups (for example, High Risk Churn or Moderate Risk Churn), and different campaign scripts are written for each segment group. Each segment group has its own customized campaign script with the most effective message, the right value/benefit, the most appropriate time, and the preferred channel (direct mail, call center, Web, and so on) for the customers in that segment group. In the campaign design step, the marketing/business analyst must be aware of the customers' real needs and behavior, so as not to disturb the customer and to really build or maintain a good relationship. Perform the steps shown in Figure 6-5 to run this case study.
[Figure 6-5 outlines the implementation flow. Configuration: database enablement and creation of the table to be scored. Workbench data mining: export the PMML file and, if the latest data mining model is not yet in place, import the data mining model. Scoring: if the triggers do not yet exist, build the triggers, then apply the scoring function. Application integration: run campaign management. The flow is driven by new customers or existing customers that change behavior, including customers whose behavior changed based on the campaign.]
Figure 6-5 Retention case study implementation flow
The following section explains the implementation flow for the deployment.
6.3.5 Step-by-step implementation
The database enablement and the score table creation steps are done once, because they belong to the configuration and design phase.
Configuration
The two tasks involved in the configuration, database enablement and table creation, are identical to "Database enablement" on page 105 and "Table to be scored" on page 108. They need only be performed once, when you initialize the process.
Workbench data mining
This case study uses the tree classification technique from DB2 Intelligent Miner for Data. Refer to Mining Your Own Business in Telecommunications Using DB2 Intelligent Miner for Data, SG24-6273, to see how to build a data mining model to predict churn with tree classification.
Exporting the PMML model
The model to be deployed is created in DB2 Intelligent Miner for Data, so the model needs to be exported into a PMML file. Refer to 9.3.3, "Exporting models from the modeling environment" on page 200, to learn how to export the model in PMML format.
Scoring
In this scenario, we use IM Scoring to score the customers in the CHURN database. The score provides the churn probability measurement. There are two steps you must perform, as explained in the following sections:
1. Import the data mining model.
2. Apply the scoring function.
Importing the data mining model The model is imported into DB2 UDB using a SQL script. Refer to 9.3.4, “Importing the data mining model in the relational database management system (RDBMS)” on page 202, to learn how to import the model in PMML format. Import the data mining model with the script ChurnInsert.db2 as shown in Example 6-8.
Example 6-8 Importing the data mining model (PMML file)
insert into IDMMX.ClassifModels values
  ( 'Churn_Demo',
    IDMMX.DM_impClasFile( 'C:\temp\retention\tree_decision_pmml.dat' ) );

C:\temp\retention>db2 -tvf ChurnInsert.db2
Applying the scoring functions
In this scenario, we apply the scoring functions on the table CHURN_SCORING, which contains the customers to be scored. We create a temporary view, ResultView, and use REC2XML to convert the data to be scored into XML format. First create the view with the script ChurnApplyView.db2, as shown in Example 6-9.
Example 6-9 Applying the tree classification model
DROP VIEW Resultview;

Create view ResultView(Id,Result) AS
SELECT data.customer_no,
       IDMMX.DM_applyClasModel(models.MODEL,
         IDMMX.DM_impApplData(
           rec2xml(1,'COLATTVAL','',
             data."CUSTOMER_AGE", data."CHURN",
             data."REVENUE_DEVELOPMENT", data."PREMIUM_ACCOUNT",
             data."GENDER", data."BUYING_POWER",
             data."CUSTOMER_RATING", data."CONV_PHONE_CONTRACT",
             data."CELL_PHONE_CONTRACT", data."CUSTOMER_NO",
             data."NETWORK", data."LOCATION_SIZE",
             data."CHANGE_OF_OFFER", data."SOCIO_DEM_GROUP",
             data."REVENUE_CATEGORY", data."NO_SIM_CHANGES",
             data."REVENUE_OF_3MONTHS", data."DISTRIBUTOR",
             data."DURATION_OF_CONTRACT")))
FROM IDMMX.CLASSIFMODELS models, Churn_SCORING data
WHERE models.MODELNAME='Churn_Demo';

C:\temp\retention>db2 -stf ChurnApplyView.db2
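To see the XML representation that REC2XML hands to DM_impApplData, you can run the function on its own for a single row. A reduced, two-column illustration (the values shown are invented, and the exact spacing of the output may differ):

select rec2xml(1,'COLATTVAL','', data."CUSTOMER_AGE", data."GENDER")
from   Churn_SCORING data
fetch  first 1 row only;

This returns a fragment shaped like:

<column name="CUSTOMER_AGE">42</column><column name="GENDER">male</column>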
Note: Check carefully whether the field names and order are correct. Also verify that the model name is correct.
Then run the script ChurnApplyTable.db2 that creates the table with the resulting score as shown in Example 6-10.
Example 6-10 Creating the churn score table
drop table resulttable;

create table resulttable(id char(11), class char(1), confidence float);

insert into resulttable
select Id,
       IDMMX.DM_getPredClass( result ),
       IDMMX.DM_getConfidence( result )
FROM ResultView;

C:\temp\retention>db2 -tvf ChurnApplyTable.db2
Figure 6-6 shows a sample of the result table that has the ID (customer_id) and the confidence (churn score).
Figure 6-6 Table with the churn score (confidence)
In this small sample, you can see that all the customers have a churn score bigger than 0.7, which indicates a high churn probability. Therefore, you should pay attention to these customers and try to retain them before they decide to change providers. It is important to keep a history of this result table, so that you know whether the company's efforts to retain customers are effective. The history also indicates which customers are changing behavior in a way that makes them more likely to churn now; the cause may be an external event, such as the customer changing jobs. It is also important to take care of new customers who simply take advantage of a particular promotion, remain customers only during the promotion period, and later change providers. The challenge is to keep such customers.
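Keeping that history can be as simple as snapshotting the result table after each scoring run; a minimal sketch, with a hypothetical resulttable_hist table:

-- Hypothetical history table: one snapshot of the scores per run date
create table resulttable_hist (
  score_date date     not null,
  id         char(11) not null,
  class      char(1),
  confidence float
);

-- Append the current scores to the history
insert into resulttable_hist
select current date, id, class, confidence
from   resulttable;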
Building DB2 UDB triggers
Now create two DB2 UDB triggers, both of which update the result table (the churn score table) in response to two external events. One trigger handles the new customer acquisition event, firing each time a new customer is registered in the operational customer database, even with only demographic information. The other trigger handles a behavioral change of any existing customer.
The two scripts with the triggers are newscore_trigger.db2 and rescore_trigger.db2. See Example 6-11.
Example 6-11 New score and rescore DB2 UDB triggers
First trigger (new score):
-- This trigger runs every time a new customer is inserted into the table CHURN.ALL_CUSTOMERS.
-- The customer churn score is calculated and the result is inserted into resulttable.
-- The scoring is performed by the view ResultView, which hides the complexity of the scoring code.
CREATE TRIGGER NEW_SCORE
AFTER INSERT ON CHURN.ALL_CUSTOMERS
REFERENCING NEW AS new_customer
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  insert into ResultTable
  select rv.id,
         IDMMX.DM_getPredClass(Result),
         IDMMX.DM_getConfidence(Result)
  from ResultView rv
  where rv.id = new_customer.customer_no;
END

Second trigger (rescore):
-- This trigger runs every time a customer's record in churn.all_customers is updated.
-- It updates the table resulttable (where the score is stored) for the customer ID that has been updated.
-- The complexity of the scoring code is hidden in the view ResultView.
CREATE TRIGGER RE_SCORE
AFTER UPDATE ON CHURN.ALL_CUSTOMERS
REFERENCING NEW AS new
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  update resulttable rt
    set (rt.class, rt.confidence) =
      ( SELECT IDMMX.DM_getPredClass( Result ),
               IDMMX.DM_getConfidence( Result )
        from ResultView rv
        where rv.id = rt.id )
  where rt.id = new.customer_no;
END
Now in the scoring environment, which in our case is also the campaign production environment, every change in the customer profile table causes a new score calculation. With the score table being updated in real time, the campaign
management can be scheduled to run a campaign script every time there is a change in the score table.
Application integration
The following sections explain the scenario for the application integration with a campaign management system.
Running the campaign management application
Here we use a campaign management system that can connect to relational databases. The first phase is to design the retention campaign. The second phase is to set up an automatic way to execute the campaign (scheduler or trigger). Assume that the campaign management system can develop or deliver different scripts to different targets defined by customer profile and churn score (segment groups). In the design configuration phase, the marketing analyst must select the segment groups and the scripts. Two examples of groups are:
- High Churn Risk segment group: Customers with a churn score bigger than 0.7 are selected. This group is contacted through the call center channel.
- Moderate Churn Risk segment group: Customers with a churn score between 0.5 and 0.7 are selected. This group is contacted via the direct mail channel.
For the marketing analyst to design the campaign, they must be aware of the budget, the churn behavior, and customer-specific needs. They must then combine these with the objectives of the retention campaign. Figure 6-7 shows the Unica Affinium campaign and an example of DB2 scoring services invoked to obtain churn probability scores for segmentation and treatment.
Figure 6-7 Unica Affinium campaign: Example of DB2 scoring script integration
While the example Affinium campaign process flow represents basic retention logic, production campaigns are likely to also include offer coding, test and control group definition, response tracking, and subsequent customer follow-up activities. Part of the design phase is to select the most effective message, value and benefit, and the appropriate timeline for each segment group. In the execution phase, Figure 6-8 shows how the Unica Affinium campaign fully automates campaign execution, using its scheduling capability to trigger the DB2 scoring script.
Figure 6-8 Unica Affinium campaign: Detail of the schedule process
Schedule conditions can be specified in Unica to read the scores table, choosing from a range of a minute (near real time) to several days. At the specified time, the DB2 UDB script is invoked and the campaign is launched if there are any customers in the selected segment groups. As shown in Figure 6-9, the Unica Affinium campaign can also trigger a script to perform DB2 scoring.
Unica can also keep track of campaign status and performance, generating reports and checking whether the customers who responded to the campaign have changed their behavior (any profile field).
Additional steps
This case study shows the Unica Affinium campaign triggering a script to perform DB2 scoring services. Alternatively, the Affinium campaign could issue the SQL directly against DB2 UDB to invoke model scoring. When invoked directly via SQL, the scoring results do not necessarily have to be stored in the database, but can be used in a view for customer selection or segmentation. As a second alternative, customer records can be flagged for scoring in the database so that
the campaign can include the flag condition in its selection criteria and SQL conditional logic. Since the Affinium campaign can also receive inbound triggers from other applications, campaigns can be triggered in response.
6.3.6 Benefits
Combining IM Scoring with DB2 UDB trigger capabilities allows you to automate and secure operations by using DB2 UDB to control the score results. Applying real-time scoring in the production environment makes the retention campaign timely, automatic, and more targeted. For a telecommunications company, with the advance of services such as SMS coupons, the need to make near real-time decisions on best offers increases. Real-time scoring with DB2 UDB triggers lends itself very well to these channels. With the facility of online scoring to target a campaign using DB2 UDB capabilities, a marketing analyst can track more efficiently the customers who show this particular behavior. They can then act more quickly, even before a customer actually leaves the company (ends the business relationship or cancels a contract).
6.4 Cross-selling campaign
As explained in Chapter 4, "Customer profiling example" on page 51, customer profiling is key to determining targets for an outbound cross-sell campaign. However, having the data is not enough if you cannot act on it. The integration of the data mining results with channels and reporting is key to a successful marketing strategy. An end-to-end process that provides feedback on the campaign activities is necessary to evaluate effectiveness. In the past, campaigns ran in silos: the target lists were generated by the direct marketing department, the message was created by the design department, and the actual customer contact was made by an outsourcing provider. Getting feedback from marketing actions was a real challenge. With systems that are available today, like Siebel applications and messaging transports like WebSphere MQ, it is possible for companies to define and enforce rules for customer contacts. This enables the action results to control when not to present a new offer, or to determine the most cost-effective channel to contact the
customer. Data mining can also help companies to determine when to contact customers.
6.4.1 The business issue
Promotion cannibalization is a serious problem in direct marketing. There is more to it than just using historical data to predict future behaviors, such as the propensity to buy a certain product or the likelihood of a customer leaving the company. There is also the issue of determining the propensity of a new offer to be well accepted by a customer, or whether the customer is saturated and should not be contacted. The difficulty is navigating through the large amount of data that results from all the interactions between the company and its customers.
6.4.2 Mapping the business issue to data mining functions
To tackle the integrated campaign management process, it is necessary to use a combination of different data mining functions or algorithms. Customer profiling is a great tool to determine target lists, but it may not be enough. A customer can be ranked number one in regard to profitability; however, a controlled strategy must be careful enough not to saturate the best customers with too many offers. Prediction algorithms can be useful in determining the promotion saturation level of a client based on:
- The number of past contacts
- The time between contacts
- The type of channel used
- The type of promotion
- The number of existing products in the customer portfolio
To determine the best cross-sell opportunities, you can use association algorithms on top of the segmentation results. It is more effective to determine product bundles for groups of customers that have similar behaviors than to try to create a model based on the entire population. One underexplored concept of data mining is the notion that results can and should be combined for better analysis. You can generate segmentations based on customer and product profitability, and then apply an association algorithm to the combination of the segmentation results to determine the right mix of products for each profitability segment, as sketched below.
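A minimal sketch of preparing per-segment input for an association run follows; the TRANSACTIONS and CUST_SEGMENTS tables and their columns are hypothetical stand-ins for the warehouse tables:

-- Restrict the association input to one profitability segment, so that
-- product bundles are mined per group of customers with similar behavior
create view assoc_input_seg1 (cust_id, prod_id) as
select t.cust_id, t.prod_id
from   TRANSACTIONS t
join   CUST_SEGMENTS s on s.cust_id = t.cust_id
where  s.profit_seg_id = 1;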
6.4.3 The business application
A targeted campaign may be difficult to implement. The large volumes of data that are necessary are often difficult to access and consolidate using conventional operational-system tools. Many organizations simply lack the expertise to support complex data mining and analytical or predictive tasks, which are essential to increasing campaign effectiveness and improving budget allocation. Every interaction with the external world via a channel is an opportunity to retain an existing customer, to acquire a new customer, or to cross-sell within a certain product line. If the contact is not well executed, a good opportunity can turn into a relationship nightmare. It takes time and planning to enable the company's channels to deal with the number of different scenarios that might arise during those interactions. Outbound treatment is a broad and complex subject. For the purpose of simplicity, we limit this example to a customer that is being selected for a lower interest rate on a credit card product that is associated with a free checking account. This action is executed on profitable customers that do not have a checking account. The main goal of this campaign is to increase the customer's portfolio with the bank and increase loyalty. A Siebel call center is used as the channel to offer the new product bundle. During the call, the call center specialist can offer additional products that apply to the particular segment the customer belongs to. The results of the customer call are stored in the data warehouse to help tune the saturation prediction model. Although in this particular example we describe an outbound scenario, this method can be applied to a number of complex scenarios. At the end of this discussion, we present a comparison to other scenarios and explain how to extrapolate the proposed solution to solve different business issues.
6.4.4 Environment, components, and implementation flow
Let's begin with an overview of the deployment process required for generic campaign management. The easiest way to start this discussion is to list the necessary components of the solution. Figure 6-10 shows a generic schema for a typical campaign management flow. Notice that there are five major components to address:
- Data store
- Data analysis
- Data workflow
- Application integration
- Channels
Figure 6-10 Campaign management data flow
The information needs to flow through each of those components seamlessly. This process takes the data from the data store area, performs different analyses and, depending on the results of the analyses, decides how to contact a particular customer based on pre-defined campaign rules. After the decisions are made, the data flow control takes the results of the analysis to the channels defined by the campaign rules. The first challenge in this process is to define how to implement the campaign rules, or the data workflow. Traditional campaign management relies on static rules that depend on lists of prospects generated prior to the beginning of the campaign. This is a sound strategy when you have short-duration, deterministic marketing goals, such as new customer acquisition. For more complex and ongoing initiatives, such as retention efforts and fraud detection, that process is not effective.
In this scenario, we developed a standard architecture for a customer loyalty solution based on a DB2 UDB data warehouse, data analysis, messaging integration, and Siebel’s call center. This architecture is shown in Figure 6-11.
[Figure 6-11 shows the customer loyalty architecture. On the analytical side, data sources feed a DB2 data warehouse through a DW population system managed by DB2 Warehouse Manager; DB2 OLAP Server and Business Objects provide pre-defined and ad hoc reports and analysis; mining functions provide profiling and cross-sell models; and campaign rules are implemented as stored procedures. An integration hub based on MQ Series exchanges XML messages between the analytical campaign management environment and the operational side, where the Siebel OLTP database (DB2), Siebel workflow, and the call center run the operational campaign management.]
Figure 6-11 Customer loyalty architecture
In this implementation, you must first create a customer-centric data repository, such as an enterprise data warehouse. We do not cover the details of a data warehouse implementation here. Next, you must implement the data analysis components. In this particular example, we use online analytical processing (OLAP) and reports to analyze key indicators. We also use campaign results and data mining to generate additional data to be used in the analysis. As mentioned earlier, the mining models we use are:
- Demographic clustering for customer profiling
- Demographic clustering for product profiling
- Associations for product bundling
- Neural prediction for promotion cannibalization treatment
The OLAP data model has the traditional dimensions of Time, Scenario, Product, Demographics, Region, Channel, and Contact Type, and Measures such as profit, cost, revenue, assets, average number of products, and so on. With data mining, you have the option to add new dimensions to the model: the segmentation type and the segment number. Figure 6-12 shows the DB2 OLAP Server outline.
Figure 6-12 DB2 OLAP server outline
With this model, it is easy to examine the distribution of different segmentation results against the company's key indicators. To develop the customer profitability and product profitability segmentation models, use the mining workbench on the detailed data used to populate the company's informational systems, like the Profitability Analysis OLAP database. The resulting segmentation models are stored in the relational data store to be used by IM Scoring. There are many ways you can access the results in the process, including:
- Applying the models at OLAP load time
- Creating relational views based on those models to hide the SQL calls of the mining functions and to enable reuse
- Embedding the models in stored procedures for the campaign workflow
You can, and should, use tools such as DB2 OLAP Server and Business Objects in addition to the IM Visualizer to determine the best models for the process. You can generate a number of different models using different techniques and add the results to the existing OLAP model for comparison. Depending on the marketing initiative, a marketer can select a different combination of models and segments as a target for the promotion. For example, one strategy may be to target the most profitable customers, not from the company's overall profitability view but from a mortgage perspective. OLAP is a powerful tool to help make that type of decision. A simple report selects:
- Measure: Profit
- Scenario: Actual
- Period: 2002
- Product: Mortgage
- Segmentation Type: Product Profitability
- Mining Segments: ALL
The report also plots the distribution of Profit over time and over segments. It provides additional insights that the traditional segmentation visualizer cannot provide. You also need to deal with the cannibalization problem. You can address this problem by applying a prediction model that uses the segments defined in the previous analysis as parameters. Campaign responses are important to tune the model. For example, a customer that was selected as a target by this process may provide a negative response to the contact, like "Customer was not interested in hearing benefits". This may be an indicator that the prediction model failed for that particular customer. You also need to consider other factors, such as the customer's history of contacts through the specific channel and the number of products in their portfolio, to draw the right conclusions. To determine cross-sell opportunities, association rules are derived from the historical data based on the target segments for the promotion. These rules, in conjunction with the analysis of OLAP data to determine which products go with the most profitable segments, guide the direct marketer in determining the product bundles for the campaign.
After the models to be used are determined, the campaign rules are defined using stored procedures. The procedures are responsible for creating XML documents that are sent to Siebel via the WebSphere MQ (MQSeries) transport; a sketch of this pattern follows the list below. Siebel workflow is responsible for sending the responses back to the data warehouse. After the responses are collected in the call center system and stored back in the data store, Business Objects is used to generate campaign results reports and analyze effectiveness. IM Modeling can be used to tune the models, especially the prediction model, based on the effectiveness of contacts. The types of responses are key to this feedback. The following categories are a sample of useful information to be collected:
- Customer hung up
- Customer was not interested in hearing benefits
- Customer asked for printed material about the offer
- Customer was interested in individual products, but not in the bundle
- Customer accepted offer
- Customer asked for additional products
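As a rough sketch of the rule-to-channel handoff described above, assuming the tables from Table 6-3 and Table 6-4 under those names, and assuming that the DB2 WebSphere MQ functions (the DB2MQ schema) are enabled with a default service and policy, the body of such a stored procedure could contain something like the following; this is an illustration, not the procedure used in the scenario:

-- Build one XML contact request per targeted customer and put it on the queue
select db2mq.mqsend(
         rec2xml(1,'COLATTVAL','contact',
                 c.cust_id, c.prefered_channel, g.cmpg_id))
from   customer_information c,
       campaign_information g
where  g.cmpg_type = 'CROSS_SELL'
  and  c.profit_seg_id in (1, 3);  -- high profit segments chosen by the analyst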
Figure 6-13 shows a high-level view of the implementation process.
As you can see, the result of individual mining processes can be combined to create a more complex decision flow for the campaign. Note that one of the most striking characteristics of this flow is the closed loop provided by the campaign rules. Results of marketing actions are fed back to the reports and source tables for model adjustments.
6.4.5 Step-by-step implementation
The detailed implementation of the process shown in Figure 6-13 consists of these steps, which are explained in the following sections:
1. Configuration
2. Workbench data mining
3. Analytical environment
4. Scoring
5. Application integration
Configuration
This must be done once during the deployment. It consists of database enablement and table creation.
Database enablement
In this scenario, we made sure that both the DB2 UDB instance and the database CMPG_MGM for the customer profile scoring projects are enabled. The steps to set up the environment for the CMPG_MGM database are:
1. Enable the DB2 UDB instance. Check whether the DB2 UDB instance is enabled. If it is not, refer to the steps in 9.3.1, "Configuring the DB2 UDB instance" on page 197, for IM Scoring.
2. Enable the working database. Check whether the DB2 UDB database for scoring is enabled. If it is not, refer to the steps in 9.3.2, "Configuring the database" on page 198, for IM Scoring. In this case, we invoke the script in Example 6-12.
Example 6-12 Script to enable the CMPG_MGM database for scoring
idmenabledb CMPG_MGM fenced tables
Note: Although unfenced mode may bring slightly faster mining performance, fenced mode ensures that the operational database environment stays up and running at all times.
Tables to be scored
To enable the data repository to store and process the data needed for this scenario, some extra data elements are necessary. IDs are generated by the operational systems, and it is important that these IDs stay synchronized with the original applications. In this example, an additional workflow is necessary to synchronize the data warehouse and the Siebel OLTP database to reconcile customer_ids, product_ids, campaign_ids, and so on. A common cause of failure when applying the rules is neglecting to synchronize the systems. The three tables that are used are:

Customer information table (Table 6-3)

Table 6-3 Customer information table

  Field name         Description
  CUST_ID            Unique identifier for a customer
  PROFIT_SEG_ID      Segment ID generated by the mining algorithm; the rules are based on high-profit segments
  SEG_TYPE           Description of the segmentation
  POSS_INT_CC        International credit card user indicator
  POSS_LIFE_INS      Life insurance policy holder
  POSS_PLATCARD      Platinum card holder
  POSS_SAVE          Savings account indicator
  AMT_PAYM_BC        Amount of payments with banking cards
  NO_PAYMENT_BC      Number of payments with banking cards
  TOT_NBR_PROD       Number of products in the customer’s portfolio
  NO_WITHDRAW_BC     Number of withdrawal transactions with banking cards
  NO_DEB_TRANS_BC    Number of debit transactions with banking cards
  AVG_BALANCE        Average balance
  MARITAL_STATUS     Marital status
  NBR_YEARS_CLI      Customer relationship age
  PROFESSION         Profession
  AGE                Customer age range
  GENDER             Gender
  PREFERED_CHANNEL   Preferred channel for communications, such as telephone, letter, or fax
  ACQUISITION_COST   Primary costs of establishing the prospect as a new customer
  ATM_CARD_FLAG      Indicates whether the customer holds at least one ATM card issued by the financial institution
  TOT_CR_LMT         Sum of credit limits on all credit lines involving the customer
  CC_FLAG            Indicates whether the customer holds at least one credit card provided by the financial institution
  DC_FLAG            Indicates whether the customer holds at least one debit card provided by the financial institution
  NBR_ATM_CARD       Number of ATM cards supplied by this financial institution that are held by the customer
  NBR_CC             Total number of credit cards supplied by this financial institution that are held by the customer
  NBR_DB_CARD        Number of debit cards supplied by this financial institution that are held by the customer
  GEO_ID             Unique identifier of the geographic area in which the customer resides
Campaign information table (Table 6-4)

Table 6-4 Campaign information table

  Field name     Description
  CMPG_ID        Campaign identifier. This data comes from applications that are responsible for creating and tracking campaigns, such as Unica Affinium.
  CMPG_TYPE      Type of campaign (for example, credit card attrition or cross-sell) to which the rule applies. Data mining results, such as cluster IDs, can be used in a number of business rules and campaigns.
  CMPG_ACTIVITY  Unique identifier of the campaign activity (for example, outbound call).
  PROD_ID        Products participating in the campaign.
Campaign activity customer history table (Table 6-5)

Table 6-5 Campaign activity customer history table

  Field name    Description
  CUST_ID       Identifier for the customer participating in a certain campaign
  CMPG_ID       Identifier of the active campaign
  ACTIVITY_ID   Identifier of the current campaign activity
  LAST_CONTACT  Date of last contact
  RESPONSE      Customer response (used by the stored procedures and reporting)
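The book gives only the logical layout of these tables. As a rough sketch, the campaign activity customer history table might be created as follows; all column types, plus the schema and table names, are assumptions, since Table 6-5 does not list them:

  CREATE TABLE CMPG_MGM.CMPG_ACTIVITY_HIST (
    CUST_ID      INTEGER,      -- customer participating in the campaign
    CMPG_ID      INTEGER,      -- active campaign
    ACTIVITY_ID  INTEGER,      -- current campaign activity
    LAST_CONTACT DATE,         -- date of last contact
    RESPONSE     VARCHAR(80)   -- customer response category
  );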
Workbench data mining
As in “Workbench data mining” on page 119, we used the DB2 Intelligent Miner for Data workbench to build the data mining models that we want to integrate in the DB2 UDB database:
The model to be deployed is created in DB2 Intelligent Miner for Data, so the model needs to be exported into a PMML file. Refer to 9.3.3, “Exporting models from the modeling environment” on page 200, to learn how to export the model in PMML format.
Scoring
There are two steps for scoring:
1. Import the data mining model (prediction models).
2. Apply the scoring function.
These steps are described in “Scoring” on page 119.
Analytical environment
The reports are based on the data model that is shown in Example 6-12 on page 133.
Application integration
Figure 6-14 shows the data flow in the application integration layer for the outbound treatment example. The decision blocks are where the rules are implemented. Some rules are based on application workflows, such as call center and campaign management. Others are based on “intelligence” acquired by scoring the input data with the DB2 data mining functions. Figure 6-14 shows:
- How stored procedures invoke the XML extenders (native in DB2 UDB V8.1), using Siebel’s document type definition (DTD) to generate messages
- How the WebSphere MQ functions are used to read from and write to the queues; a sketch of this pattern follows
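The redbook does not reproduce the stored procedure source here, but the pattern can be sketched with REC2XML and the DB2 WebSphere MQ functions (available once the MQ functions are enabled for the database). The service and policy names, the message layout, and the history table name are assumptions for illustration only:

  -- Sketch only: wrap one row per pending contact as XML and put it on
  -- the queue that the Siebel workflow reads (all names illustrative)
  SELECT DB2MQ.MQSEND('DB2.DEFAULT.SERVICE', 'DB2.DEFAULT.POLICY',
           REC2XML(2, 'COLATTVAL', 'Contact',
                   CUST_ID, CMPG_ID, ACTIVITY_ID))
  FROM CMPG_MGM.CMPG_ACTIVITY_HIST
  WHERE RESPONSE IS NULL;

In practice, such a statement would live inside the stored procedure, and the generated XML would have to follow the DTD of the target Siebel integration object rather than the generic COLATTVAL layout shown here.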
[Figure 6-14 depicts the flow between the relationship management data warehouse (customer profile, DB2 IM Scoring, cross-sale activity determination, DB2 XML parser) and Siebel (business services, integration objects, Siebel MQ connector), connected through MQSeries XML queues. On the Siebel side, an activity record is created for the cross-sale specialists; an outbound call presents the offer and, depending on its success, the flow enrolls the customer in the campaign, creates a follow-up, or notifies an executive. Activity records and fulfillment messages flow back as XML over MQSeries for fulfillment processing.]
Figure 6-14 Outbound treatment process
Siebel
This book does not cover all the details of a Siebel call center installation. For more information, refer to the Siebel Web site at:
http://www.siebel.com/
This scenario was developed using Siebel V7. A Business Service provides reusable business logic. It can be used with pre-built components (such as adapters and XML format converters) or user-built components (such as a tax calculation function or format converter) written in Siebel Visual Basic or Siebel eScript. A Business Service can be invoked from user interface events, Siebel workflow actions, inbound message requests, and external events from transport mechanisms.

In this scenario, the stored procedures are responsible for converting the data warehouse contact information into an XML format that Siebel can use. After the messages are submitted to the queue, the Siebel workflow takes over and reads from the MQ queue using the Enterprise Application Integration (EAI) MQSeries Server Transport Adapter (see Figure 6-15).
[Figure 6-15 shows the topology: the Siebel client UI and workflow process, with an integration object, connect through the EAI MQSeries adapter and an MQSeries queue manager to the analytical system (reporting and mining) on one side and the Siebel database on the other.]
Figure 6-15 MQSeries Server transport adapter
To implement this process flow, follow these steps to define a workflow for an integration object:
1. Create the Business Service that will support your workflow. There is a complete list in the Tools application under Business Service in Siebel. The EAI MQSeries Transport is the service that is necessary for the MQSeries integration. You can verify the services that are available in your implementation using the Siebel Client: select Screens -> Business Service Administration -> Business Service Details. Figure 6-16 shows how the window may appear if you have all the necessary services for this scenario.
Figure 6-16 Business Service example
2. Create your workflow process. Siebel has several sample workflows that you can use as a base for activity treatment. Unfortunately, this information is not well advertised.
To find the samples that Siebel provides, use the Siebel Client to connect to the sample database. To go to the Workflow windows, select Screens -> Siebel Workflow Administration -> Workflow Processes -> All Processes. Here you can find the list of available examples. Export the ones you need from the sample database and close the client. Start the client again using the server database and import the selected flow into the Workflow display using the import button.

After you have the workflow in the Process view, highlight it and click Process Designer. This shows you the workflow diagram of the existing processes. You can modify existing processes or create new ones using the flow chart components in the left navigator. If you double-click a box in the workflow diagram, you go to the underlying Business Service. This is where you specify what you want to happen in your workflow. The MQ Transport is where you specify your queue names.

3. Invoke your workflow. There is a process simulator that takes you through the workflow and allows you to test whether it is correct. Once it is working properly, you can create an automatic workflow to pull messages from the queue. Or you can use Siebel Visual Basic or Siebel eScript to create control buttons (for example, read from queue or create activity). If you use eScript, you may need to change the scripting DLL in your configuration file; if you are using Siebel Visual Basic, you must specify sscfbas.dll. You cannot use both languages in your application; you have to choose one. According to the Siebel documentation, the EAI components prefer eScript.

With the MQ Receiver, you can manually start an MQSeries Receiver task through the Server Tasks view, with the parameters shown in Example 6-13.
Example 6-13 Starting an MQ Receiver task

MQSeries Physical Queue Name           SIEBEL_INPUT
MQSeries Queue Manager Name            HUB_QM
MQSeries Response Physical Queue Name  SIEBEL_OUTPUT
MQSeries Response Queue Manager        HUB_QM
Workflow Process Name                  Receive Activity Record
The rest of the parameters for the receiver are defaults, as shown in Example 6-14.
Example 6-14 Other parameters for MQ Receiver

Receiver Dispatch Method   RunProcess
Receiver Dispatch Service  Workflow Process Manager
Receiver Method Name       ReceiveDispatchSend
Then, in Receive Activity Record, the XML message is converted to a property set, and Create Activity Record creates the activity. After the outbound call is made, the workflow automatically receives the response data for that particular Activity_id and converts the Activity Response property set to XML. The MQSeries Receiver places the response on the queue.

4. Create the integration objects for the MQSeries Server Transport Adapter. There are several sample integration objects listed in Tools. If you highlight one and click Generate Schema, you see a DTD file that shows you the various elements. You can create new objects from the same window. Copying and modifying existing samples is faster than defining a new object. The process is:
   a. Highlight the sample you want to use.
   b. Right-click and choose Copy record.
   c. Rename the new copy.
   d. Click the Synchronize button.
If you expand that object, you see the list of components associated with your primary integration object. Select only the ones that you want to use; if you choose all of them, your integration object will be rather large. Each integration object has its own XML representation. Using Siebel Tools, it is easy to obtain the DTD definition of the objects that you want to work with. When creating the messages in the analytical part of the solution, it is important that the XML format complies with Siebel’s standards. Figure 6-17 provides a simple XML representation of an account object.
[Figure 6-17 shows a Siebel XML document: the elements that represent the integration object, wrapped in a Siebel Message element.]
Figure 6-17 Siebel integration object XML representation
6.4.6 Other considerations

The function calls described in this book to work with IM Scoring can present a challenge when working with inflexible SQL applications. It is particularly difficult to have all vendors add capabilities to use those function calls. There is a simple way to overcome this by working with relational views: you can hide the function call by embedding the SQL calls in CREATE VIEW statements. For all applications accessing IM Scoring results through the views, the results appear as regular data elements that can be queried and used in more complex processing.
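A minimal sketch of this idea, assuming a cluster model named 'CustomerProfitSegments' and the PredictClusterID scoring UDF invoked with the same call pattern as Example 8-1 in Chapter 8 (the view, table, and model names here are illustrative):

  -- Applications that query this view see SEGMENT_ID as an ordinary
  -- column; the scoring function call is hidden in the view body
  CREATE VIEW CMPG_MGM.V_SCORED_CUSTOMERS (CUST_ID, SEGMENT_ID) AS
    SELECT CUST_ID,
           Q.PredictClusterID('CustomerProfitSegments',
             REC2XML(2, 'COLATTVAL', '',
                     AGE, MARITAL_STATUS, TOT_NBR_PROD, AVG_BALANCE))
    FROM CMPG_MGM.CUSTOMER_INFO;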
Data mining functions and OLAP
Especially when working with the OLAP Integration Server, it is important for the input data to be in a well-behaved format. You can use free-hand SQL to add the function calls to the OLAP load process. However, if the results are used by a number of load and communication processes (for example, stored procedures sending messages to Siebel), it is beneficial to create a view to facilitate the process.
Data mining functions and Siebel
All data applications have their own data models and definitions, and Siebel is no different. When interacting from a data warehouse with those systems, it is important to maintain consistency among the data elements in the
involved systems. A data hub is necessary to maintain this consistency. Imagine sending a message from the data warehouse to the call center to contact a particular customer when the customer_id and the campaign_id are not available on the Siebel side: the process will fail. It is important to keep the two systems synchronized to avoid this kind of inconsistency.

In this particular example, the data hub can be implemented in the data warehouse. A series of workflow and exception treatments needs to be in place. For example, we implement an exception routine for when a message reaches Siebel from the data warehouse without the appropriate correspondence in the online transaction processing (OLTP) database: a new process starts, sends a message to the data warehouse to ask for the information, and the new data is created in the OLTP database.

On the data warehouse side, a series of tables is necessary to maintain a surrogate key for each data element; this surrogate becomes the master key for the data hub. Each master key has a key on each system associated with it. Consider customer_id, for example. When a customer is added to the data warehouse from a source other than Siebel, a Siebel key must be made available in the data warehouse. Part of the Extract, Transform, and Load (ETL) process is responsible for sending a message to the Siebel system to add the new customer and retrieve the new Siebel keys. At the end of this process, the data hub customer table should appear like the example in Table 6-6.

Table 6-6 Hub customer table example

  Data element  Hub master key  Data warehouse key  Siebel key
  Customer_ID   452156          00358234            98735
  Customer_ID   452157          00378645            89732
  Campaign_ID   452158          C874                78234
  ...
Business Objects campaign reports
Figure 6-18 provides an integration example in Business Objects, the query product used by the end users.
Figure 6-18 Campaign summary Business Objects report
Chapter 7. Up-to-date promotion example

This chapter explains how to quickly build a data mining model to answer a time-sensitive question: how to run an up-to-date promotion on products in retail. This case uses IM Modeling and IM Visualization and integrates them with the end user’s standard reporting tool.
7.1 The business issue

In the retail industry, the large number of products and possible combinations of them makes the business analyst’s decisions difficult. The analyst needs to keep track of customer purchase behavior and the best mix of products in each store. This helps them to decide the optimal product price, marketing appeal, and warehousing time. In this way, the manager of each store can control the efficiency of a promotion while it is still running and even change the promotion.

The manager or business analyst must design and run a different promotion every day based on the mix of products they have in the store. For example, the store manager has to sell perishable products in a short time frame and needs to determine the best location in each store. In the same time frame, they can also check the efficiency of their decision and change it if they want. Restrictions of storage space, budget, or the number of stores can lead the business analyst to quickly choose a precise mix of products in each store and arrange the products on the shelves so they don’t stay there too long.

This chapter describes a situation where the manager of each store has to run a promotion every week. With a fast and automatic system, they can change their decision or make another promotion every day.
7.2 Mapping the business issue to data mining functions

The data mining technique called association rules is very useful for helping the manager identify cross-selling and up-selling opportunities. Knowing this, the IT analyst can use IM Modeling to speed up the process of identifying patterns and opportunities, by running association rules embedded in another application. With IM Visualization, the manager can check whether the rules make sense. The business analyst can keep track of the new rules and decide quickly what promotion to make.

The results of the association mining run are the product combinations (rule head and body), the probability that a combination will occur again (confidence), and the percentage of all transactions in which the combination occurs (support).

For example, consider a total of 1000 transactions. In 100 of the transactions, we find the product wine. In 30 of the transactions that contain wine, we also find the product cream. For the rule Wine => Cream, the rule head is
the wine, the rule body is the cream, the support is 30/1000 (3%), and the confidence is 30/100 (30%).

The manager must decide in a short time frame how to sell a particular product that is more perishable (cream): its sell-by date expires in the next two days. Therefore, they have to sell quickly and change the selling strategy before the product expires. Finding the correlations between products using the association rules technique of IM Modeling helps the business analyst decide the best product locations and the optimal promotion. Because it can be run in a fast and automated way, the business analyst can change their decision based on the selling pattern before the end of the week (the end of the promotion time).
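The same two measures can be computed directly in SQL, which makes the definitions concrete. This is an illustration only; it assumes the Transactions_Data table described later in 7.5.2 and hypothetical item numbers for wine ('W01') and cream ('C01'):

  WITH
    all_trans  AS (SELECT COUNT(DISTINCT TRANSACTION_ID) AS n
                   FROM TRANSACTIONS_DATA),
    wine_trans AS (SELECT DISTINCT TRANSACTION_ID
                   FROM TRANSACTIONS_DATA WHERE ITEM_NO = 'W01'),
    both_trans AS (SELECT DISTINCT T.TRANSACTION_ID
                   FROM TRANSACTIONS_DATA T
                   JOIN wine_trans W
                     ON W.TRANSACTION_ID = T.TRANSACTION_ID
                   WHERE T.ITEM_NO = 'C01')
  -- support    = transactions with both items / all transactions
  -- confidence = transactions with both items / transactions with wine
  SELECT 100.0 * (SELECT COUNT(*) FROM both_trans) / A.n AS support_pct,
         100.0 * (SELECT COUNT(*) FROM both_trans)
               / (SELECT COUNT(*) FROM wine_trans)       AS confidence_pct
  FROM all_trans A;

With the numbers above (1000 transactions, 100 with wine, 30 with both), this returns a support of 3% and a confidence of 30%.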
7.3 The business application

The business application provides the store manager with information to discover which combinations of products the customers purchase. It helps the store manager plan promotions, act quickly, and change decisions to make new promotions every day. Cross-selling the product offerings to the customers is based on the associations technique, which discovers the relationships between products at all levels of the product hierarchy.
7.4 Environment, components, and implementation flow

The suggested deployment process is outlined here and shown in Figure 7-1:
1. Collect all the transaction data from each store.
2. Store the data in a relational database such as DB2 Universal Database (UDB).
3. Use IM Modeling to identify which products are sold most often and the possible product combinations. This is done using the associations technique, which generates basic descriptive statistics and defines the association rules.
4. Use IM Visualization to view the rules produced by the associations technique.
5. Extract the association rules to a DB2 UDB table.
6. Feed the rules to any analytical query tool (Business Objects) that can access DB2 UDB, reporting the resulting rules and product sales by store, time range, category, promotion type, or region.
To implement this solution, which can be easily integrated with any software that can run SQL, the skills required are those of a database administrator and a programmer. The database administrator skills allow you to schedule and manage the data mining models. The programmer skills allow you to build the application. Business analyst skills are required to interpret the rules and give proper feedback to the programmer to build a transformation or filtering step. It should be clear that IM Modeling was developed to leverage the IT data mining skills, based on the feedback of the business analyst, who will use IM Visualization.
[Figure 7-1 shows the components: applications with embedded mining and a scheduler drive model calibration over the transactional data; Business Objects and IM Visualization sit on top of an application integration layer (CLI/JDBC/SQLJ or ODBC); and the IM Modeling API mines the data in the data warehouse/analytical data mart of the modeling environment.]
Figure 7-1 Up-to-date promotion components
7.5 Step-by-step implementation

This scenario shows how, after you load new transactional data reflecting new purchase behavior, you can automatically run an association rules model. You can then help the manager of a retail store decide what kind of promotion to make today. The association rules data mining technique lets you find product combinations, ordered in a sequence, based on customer purchase behavior.
That is, if you buy products A and B, you may also buy products C and D. See Mining Your Own Business in Retail Using DB2 Intelligent Miner for Data, SG24-6271, for more information on association rules.

Some steps in this implementation (Figure 7-2) are done only once, while building the model is scheduled depending on the timing and urgency of the manager of each retail store.
[Figure 7-2 shows the implementation flow. Configuration: database enablement, database profile in IM Visualization, and the transaction table. Modeling (scheduled or operational, triggered by new products or new customer purchase behaviors): build filters, develop the model rules with the defined input data and parameter settings, check whether the rules make sense, and extract the rules into a table. Application integration: update the Business Objects reports, then select rules and design the promotion.]
Figure 7-2 Implementation flow of the up-to-date promotion
7.5.1 Configuration

The database where IM Modeling runs must first be enabled. Once enabled and configured, this database can also be used in IM Visualization and Business Objects.
Database enablement
You must first enable the DB2 UDB instance and the RETAIL database to allow the modeling functions to work. To enable the DB2 UDB instance, refer to the steps in 9.3.1, “Configuring the DB2 UDB instance” on page 197, for IM Modeling. To enable the RETAIL database, refer to the steps in 9.3.2, “Configuring the database” on page 198, for IM Modeling. You must do this only once for this working database and instance.
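By analogy with Example 6-12 in Chapter 6, enabling the RETAIL database would look like this (fenced mode assumed, as recommended there for operational environments):

  idmenabledb RETAIL fenced tables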
Database profile in IM Visualization
The database profile can be edited using IM Visualization, as explained in 11.2.2, “Loading a model from a database” on page 232.
7.5.2 Data model

The everyday purchases of every customer provide transactional data as input to be loaded into DB2 UDB. Table 7-1 describes the Transactions_Data table to be used in this scenario.

Table 7-1 Customer data in retail case

  Field name      Field description                                     Data type
  CUSTOMER_NO     Customer number                                       char(7)
  DATE            Purchase date                                         char(6)
  ITEM_NO         Purchase item number                                  char(3)
  STORE_NO        Store number where the customer purchased the items   char(3)
  TRANSACTION_ID  Transaction ID received whenever a customer buys one  char(5)
                  or a group of items
A job can easily be scheduled to run for any update or insert that occurs in this transactional table and provides up-to-date reports.
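One possible way to drive such a job, sketched here with assumed names, is a statement trigger that records when new transactions arrive, so the scheduled script rebuilds the model only when there is new data:

  CREATE TABLE REFRESH_FLAG (LAST_CHANGE TIMESTAMP);

  CREATE TRIGGER TRX_CHANGED
    AFTER INSERT ON TRANSACTIONS_DATA
    FOR EACH STATEMENT MODE DB2SQL
    INSERT INTO REFRESH_FLAG VALUES (CURRENT TIMESTAMP);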
7.5.3 Modeling

Before you build the associations model that the business analyst requires in this scenario, it may be useful to apply some filters to the data.
Building filters
Depending on the support of the products, the business analyst may need to filter out very frequent products such as soda, salt, sugar, garlic, and so on. One of the store manager’s problems is to sell perishable products in a short time frame. Therefore, a possible filter may be to select all the transactions concerning products whose expiration date falls in the following week.
Developing the model rules with input data and parameter settings
The model was developed to run only on DB2 UDB V8.1. A single command supplies the input data specification, mining technique, model parameter settings, and name mapping. Example 7-1 shows the Build_associations.db2 script, which is also included in the additional material that accompanies this redbook. For more information, see Appendix K, “Additional material” on page 301.
Example 7-1 Building the association rules model

-- build association rules model
call BuildRulesModel('Up_to_date_promotion','TRANSACTIONS_DATA',
  IDMMX.DM_RuleSettings()
    ..DM_useRuleDataSpec(MiningData('TRANSACTIONS_DATA')
      ..DM_genDataSpec()
      ..DM_addNmp('desc','PRODUCT_NAME','ITEM_NO','DESCRIPTION')
      ..DM_setFldNmp('ITEM_NO','desc'))
    ..DM_setGroup('TRANSACTION_ID')
    ..DM_setItemFld('ITEM_NO')
    ..DM_setMinSupport(3)
    ..DM_setMinConf(30)
    ..DM_expRuleSettings());
The input data table is Transactions_Data. The name mapping refers to the products in the Product_Name table (Table 7-2). The business analyst requires a minimum confidence of 30% and a minimum support of 3%.

Note: The high confidence value ensures that you produce only rules that are strong enough. The low support value gives you a variety of rules, so you are likely to find a rule for most of the products later.
The rule model generated is named Up_to_date_promotion.
Table 7-2 Product_Name table

  Field name       Field description          Data type
  ITEM_NO          Purchase item number       char(3)
  DESCRIPTION      Purchase item description  char(15)
  EXPIRATION_DATE  Item expiration date       YYYYMMDD
The rules model is loaded into IM Visualization as shown in Figure 7-3.
Figure 7-3 Rules in IM Visualization
The business analyst, in agreement with the IT specialist, can get a quick view of the number of rules generated and perform a consistency check before exploiting the rules and sending them to the store managers through daily reports.
Extracting rules into a table
For DB2 UDB V8.1, some extra UDFs and stored procedures are provided through this redbook, such as the UDF ListRules. You can find the UDF ListRules in Appendix G, “UDF to extract rules from a model to a table” on page 285. UDF ListRules is used first to extract the rules from the model into a table so the rules can be exploited by end-user tools such as Business Objects.
Next, you run a script such as extract_rules.db2 (Example 7-2), which calls the function ListRules and selects the model called Up_to_date_promotion that was created earlier. This script is also available in the additional material that accompanies this redbook.
Example 7-2 Extracting the association rules

-- list rules using item descriptions
INSERT INTO Rules_Definition
  SELECT P1.Description AS Antecedent,
         P2.Description AS Consequence,
         R.support,
         R.confidence
  FROM table( ListRules(
         (SELECT MODEL FROM IDMMX.RuleModels
          WHERE MODELNAME='Up_to_date_promotion') ) ) R,
       PRODUCT_NAME P1,
       PRODUCT_NAME P2
  WHERE R.Antecedent = P1.ITEM_NO
    AND R.Consequent = P2.ITEM_NO;
Table 7-3 shows the layout of the table that is created (Rules_Definition).

Table 7-3 Rules_Definition table

  Field name   Field description          Data type
  ANTECEDENT   Rule head (only one item)  char(15)
  CONSEQUENCE  Rule body (only one item)  char(15)
  SUPPORT      Rule support               decimal(12,2)
  CONFIDENCE   Rule confidence            decimal(12,2)
Now, with this table, any application that accesses DB2 UDB can select the rule head, body, and respective support and confidence. Using the DB2 UDB scheduler, a single DB2 UDB script that builds the association rules and inserts the results into the table can be run every day.
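For example, the daily report query might select only the rules strong enough to highlight, as in the Business Objects report shown later in Figure 7-4 (this assumes confidence is stored as a percentage, consistent with the 60% threshold used there):

  SELECT ANTECEDENT, CONSEQUENCE, SUPPORT, CONFIDENCE
  FROM Rules_Definition
  WHERE CONFIDENCE > 60
  ORDER BY CONFIDENCE DESC;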
7.5.4 Application integration

The application used in this case study is the Business Objects report that the manager of each retail store receives to help them design a promotion.
Designing a promotion with Business Objects reports
Once you have the resulting association rules in a table, the business analyst can use any query and report tool that they feel comfortable using. This example shows how the end-user Business Objects reporting tool, already in place
in the enterprise, can help the manager of a store make a decision, such as a promotion, by looking at the report they receive daily. This report shows the product combinations based on the latest customer purchase behavior. Figure 7-4 shows a report where the product combinations (association rules) are highlighted when the confidence is greater than 60%.
Figure 7-4 Business Objects report with the product combinations
With the Business Objects report, the manager can select the product combinations that are the most relevant to their everyday business. Since they must quickly sell perishable products, such as cream, they may create a promotion to sell disposable nappies together with cream. With the Business Objects scheduling feature, the report is updated and accurate on their desk every morning. If required, the manager can ask to change the scheduling and receive the information more often.
7.6 Benefits

The benefits of this solution are the ability to:
- Act and react quickly in a market that is permanently moving and changing
- Recalibrate and quickly build a new data mining model
7.6.1 Automating models: Easy to use

There is an automated way to find the association rules embedded in an application without using any separate data mining software or skills. This approach allows the business analyst (in this case, the manager) to see and interpret the rules with IM Visualization. The IT analyst can then run the SQL API to discover new rules every time a new product is launched, or test the efficiency of a promotion while it is still running.
7.6.2 Calibration: New data = new model

The advantage of embedding the association technique in any application is that it can be run in batch, potentially every night, and the business analyst can keep track of the product association patterns. In this way, the manager of each store can control the warehousing time, and the marketing analyst can control the efficiency of a promotion while it is still running and even change it.

IM Modeling brings the ability to calibrate, or recalculate, a data mining model every time new transactional data is loaded. The faster the data is inserted into the transactional store, the faster the calibration is done, and the more accurately the business analyst can decide.

This up-to-date promotion application can also be used in a Web-based environment. It can perform micro testing on promotions or compare the success rates of two different promotions prior to implementing them in the brick-and-mortar environment. For example, retailers may want to put items up for sale on the Web and then determine how much to produce (overseas) before ordering merchandise for the stores.
Chapter 8. Other possibilities of integration

Integrated mining makes mining simple and offers many more opportunities for integration into any kind of application. A recent trend for DB2 data mining is integration into analytical solutions. Web analytical solutions are in place to provide real-time scoring based on Web site traffic behavior. DB2 OLAP Server now has a feature that highlights deviations and hidden business opportunities. The DB2 Intelligent Miner for Data technology has been integrated into SAP Business Warehouse and Customer Relationship Management (CRM); for example, customer interaction is enhanced by product recommendations computed by mining. Business Objects Application Foundation, as well as QMF for Windows, integrate IM Scoring functions into traditional reporting and OLAP. WebSphere Commerce Analyzer comes with predefined mining configurations.

This chapter covers several of these examples from a business user’s point of view. A brief look under the covers tells application designers how mining functions work, or may work, and help in these cases. A business end user can benefit from mining analysis without even knowing the technology.
8.1 Real-time scoring on the Web (using Web analytics)

Marketing across traditional channels has changed forever due to the Internet explosion. It forced changes in traditional marketing techniques, including the capture and use of extensive customer data. For example, today’s online retailers are drowning in customer data. Industry experts concur that understanding customers as individuals, and leveraging every interaction to maximize those insights, is crucial to success.
8.1.1 The business issue

Understanding (inter)actions between customers and enterprises through the Web, and actually achieving them with measurable results, presents a challenge. Barriers to real customer insight are:
- Lack of a centralized information source
- Lack of coordination across channels
- Lack of deep, current customer data
- Lack of real-time response capability, especially online
Assume that in our day-to-day business, we have succeeded in addressing the first two barriers. When the channel turns out to be the World Wide Web, we are often still confronted with the last two issues. Our business issue in that case translates into: we want to better understand where traffic originates, who the visitor is, what the customer’s preferences are, and how to approach desired customers effectively in a real-time Web environment.
8.1.2 Mapping the business issue to data mining functions

To address this business issue, let’s look at customer-related data and Web-surfing behavior data. The mining model is in place to have a number of preset characteristic profiles ready to target in the business via the Web channel. To achieve this, a more personalized marketing approach is required, beyond mass marketing. On the one hand, the scoring engine underneath uses the model; on the other hand, there is the data used to score the Web visitor against a certain profile. From there, you can work toward understanding the customer’s preferences.

However, both sets of data, customer-related data and Web-surfing behavior, are often incomplete or even missing, as in the case of Web traffic. This may simply be because it is a first-time customer of the business, a first-time visit to the Web site, or an incomplete online application form. In that case, you would make
a prediction of the likelihood that this particular Web site visitor has a certain customer behavior profile. The prediction may be based on predicted values for characteristics such as age group, sex, and income class, next to predictions of product characteristic (model, color, price range) preferences. The data to be used for prediction may be the dynamically changing data during the session and the specific Web site traffic in the session (Web page, page sections, and product information traffic). This data would be combined with other data prepared in the data warehouse and linked to the operational database.

You may use modeling and scoring, as part of the Web analytics approach, to perform real-time scoring against the data warehouse or operational database. For example, the more the female gender propensity increases with further page views, the more interested in female content the viewer is assumed to be. Real-time scores are produced for up-sell and cross-sell campaigns or other CRM initiatives, depending on the business issue. Real-time scoring services for CRM initiatives run against the data warehouse or operational database. You can address the business issue in this scenario with initiatives to up-sell and cross-sell via the Web, with or without a combination of data-driven personalization. For example, if the female gender propensity increases over the Web traffic session time, the personalized recommendations focus more on female product lines. The product recommendations can come from an association modeling run (or a collaborative filtering tool) against the data warehouse. This solution is shown in Figure 8-1.
Figure 8-1 Using modeling and real-time scoring services for Web analytics
8.1.3 The business application

To better understand the new customers that visit the Web site, you can use scoring. This way, you can address your customers more on an individual basis, rather than generally via a static Web site. You can also make product recommendations that can lead to up-sell or cross-sell.
8.1.4 Integration with the application example

Both modeling and scoring are integrated in the Web application (see Figure 8-2). The features of a clustering model and an association model are reflected in the monitor and trace facility that the IT developer set up for the end users, whether that is the marketing department or the Web site manager. The features are set up as part of the Web delivery to internal viewers. The trace facility is not meant for external viewing and is, therefore, hidden from customers by the Web delivery tools.
The scoring engine is used to score several features of the profiles that are set up on the basis of the models. Both the demographic features in the cluster model and the navigational and transactional product-related features in the association model are updated continuously and in real time by the scoring engine. The scoring engine continuously monitors the visitor traffic during the life span of each individual visitor’s Web session. Based on this, you can approach customers more effectively in real time in a Web environment.
Monitor and trace customer profile behavior: This page was specially designed to gain insight into the price sensitivity of the visitor. The gender propensity was scored as part of the visitor profile on the basis of other Web page traffic during this or another session by the same visitor. The information in the customer behavior profile is derived from many sources: directly from purchase history, and indirectly via subtle questioning. This is similar to the approach that a master salesperson would take.
Figure 8-2 Tracing a customer behavior profile based on session traffic
This example of Web analytics demonstrates the focus on maximizing the relationship for the customer’s return. At several stages of the customer’s interaction with the online application, both the mining model and the online scoring principle are used. IM Modeling and IM Scoring both support this principle as integrated technology components in a Web analytics solution for a cross-sell or up-sell application.
The models, tightly integrated with the data itself in the database management system (DBMS), facilitate automating the process of dynamically evaluating and responding to individual customer preferences and behaviors. Certain product items and price offerings, preferred color combinations, and other micro campaigns can be offered to Web site visitors in a more dynamic way, based on the underlying mining model.
8.2 Business Intelligence integration

This section discusses the integration of IM Scoring with other Business Intelligence tools. In particular, it covers tools for online analytical processing (OLAP) using DB2 OLAP Server and tools for query reporting using QMF.
8.2.1 Integration with DB2 OLAP

Making data mining results available to the business analyst using multidimensional OLAP front-end tools gives new insights to solve real business problems, such as:
- Find the most profitable customer segments
- Understand customer buying behavior (find product associations using market basket analysis)
- Optimize the product portfolio
- Keep profitable customers by understanding attrition factors
Knowledge that was previously hidden in the data warehouse and data mining models, and that was only available to data mining experts, is now available for cross-dimensional browsing using both DB2 UDB and DB2 OLAP Server. Integrating IM Modeling and IM Scoring further into OLAP solutions, by automating steps previously done manually by the OLAP designer, reduces the steep learning curve for OLAP users when applying mining technology. It also brings faster time to market for marketing- and sales-related actions based on the discovered knowledge, because the automation eliminates the manual effort.
Basic understanding of an OLAP cube
An OLAP cube is a multidimensional view into a company’s data. Typical dimensions include time, products, market or geography, sales organization, scenario (plan versus actual), and a measure dimension (with such measures as revenue, cost of goods sold (COGS), profit, or ratios like margin). The structure of the dimensions that defines a multidimensional view is called the outline.
Each dimension can be hierarchical in structure. For example, the time dimension can be broken down into years, the years into quarters, and the quarters into months, and so on. Typically, the outline contains a hierarchy that business analysts have used in their reporting for a long time. For example, the customer dimension in the banking industry could be aggregated according to customer segments such as private individuals, corporate customers, public authorities, and so on. The cube typically does not contain attribute dimensions for all attributes that are known in the warehouse about the base dimension “customer”. In the banking industry, for example, the warehouse may have dozens of attributes, such as age, marital status, number of children, commute distance, number of years as a customer, and so on, for each customer.
Integrating a new dimension in the cube
The attributes described in Chapter 4, “Customer profiling example” on page 51, can be represented as an additional dimension or as an attribute dimension. An attribute dimension simply assigns the attribute as a label to the base dimension.

Defining a hierarchy for customers, such as one using geographical information, is easy to do in an OLAP system. Using market regions in an OLAP cube is common practice. However, a hierarchy that is easy to define, such as a geographical hierarchy, does not necessarily give valuable information about the business. Data mining using IM Modeling and IM Scoring, instead, can produce a segmentation of customers. It takes more information about customers into account, such as family status, size of city, estimated income, and other demographic data. Such segments, also called clusters, can then be used to define a customer dimension in OLAP, as shown in Figure 8-3. The cluster identifiers can then be added to the OLAP cube as additional attribute dimensions. A meaningful short description of each cluster can be added as a dimension.
Figure 8-3 OLAP outline with initial rough hierarchy to present customer groups
The bank’s focus group consists of private individuals, and the bank’s data warehouse contains dozens of attributes, such as age, marital status, number of children, and customer annual income segment. IM Modeling, run using the clustering technique on those attributes, found the following groups of customers:
- Seniors
- Families with children
- Yuppies
- Other
These customer segments are loaded into the OLAP cube (Figure 8-4). They allow cross-dimensional analysis on each customer segment by geography, product, and time.
Figure 8-4 Customer segments placed as dimensions in an OLAP outline
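One way to load these segment labels into the customer dimension of the cube, sketched here with assumed cluster IDs and table names, is a small mapping table joined to the scored customer records during the cube load:

  CREATE TABLE MART.SEGMENT_DIM (
    SEGMENT_ID   INTEGER,      -- cluster ID from the mining model
    SEGMENT_NAME VARCHAR(30)   -- business label used in the outline
  );
  INSERT INTO MART.SEGMENT_DIM VALUES (0, 'Seniors');
  INSERT INTO MART.SEGMENT_DIM VALUES (1, 'Families with children');
  INSERT INTO MART.SEGMENT_DIM VALUES (2, 'Yuppies');
  INSERT INTO MART.SEGMENT_DIM VALUES (3, 'Other');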
When the OLAP analyst views a slice that shows only the customer segment “families with children”, they may want to understand this segment better and invoke the clustering visualizer. The clustering visualizer shows, for a set of attributes, the distribution of the attribute values in a specific segment compared to the distribution of the attribute values across all customer records. This scenario can be enhanced with a cross-selling scenario, as described in 6.4, “Cross-selling campaign” on page 128. That scenario provides, in its implementation flow, an example of how to integrate with DB2 OLAP Server when building the cube outline.
8.2.2 Integration with QMF

On a frequent basis, business analysts need timely and easy reporting of information about their customer base. A typical business question is to understand
the customer base by itself: “What type of data do we have on our customers, and based on that, what do our customers look like?” End-user reporting to address these questions is often based on query reporting. Query reporting relies on queries that business analysts may initially have had set up by the database people in the IT department. At a later stage, these same business analysts may configure and enhance the queries themselves, since they need to administer similar queries on a more ad hoc basis. On a need-to-know basis, they may want to configure the queries at their desktop after they are accustomed to the code and format.
Mapping the business issue to data mining functions
QMF for Windows users can improve the quality and timeliness of the business intelligence available to analysts by using the IM Scoring feature of DB2 UDB Extended Edition and Enterprise Extended Edition. Using the new QMF for Windows V7.2 Expression Builder, users can easily invoke IM Scoring to apply the latest mining analytics in real time. The QMF V7.2 Expression Builder helps QMF users build the SQL expression that invokes IM Scoring, which, in turn, applies these rules to new business situations. An example of using the IM Scoring clustering functions in QMF is provided through the following business application.
The business application
A financial institution, such as a bank, typically runs weekly, monthly, and quarterly reports to monitor transactional relationships with its customer base. Apart from this, these same base reporting queries also serve as starting points for ad hoc reporting. Queries need to be run to feed other end-user applications used by sales account managers, portfolio managers, mortgage loan officers, and bank call center operators who interact with customers. These people need to interact with the customer based on the customer’s needs. They also want to communicate a sense of personalized and up-to-date service to the customer at hand. Personalized answers to customer questions and needs in near real time are key to customer satisfaction. These end users need the information right there, right now, when they interact with the customer. For this to happen, access to the data on the typical sets of customers in the database, as well as the means to make slight variations for ad hoc query reporting, are useful.
Integration with the application example
To provide access to the customer database of the bank and, at the same time, achieve near real-time response based on the database mining extenders
embedded in the relational database management system (RDBMS) of our bank, we build the SQL query shown in Example 8-1 in the Expression Builder panel of QMF V7.2. We use the scoring functions on the basis of a cluster model. The query is called QMF Query to use IM Scoring.qry.
Example 8-1 QMF V7.2 query for IM Scoring

-- DROP TABLE CBARAR3.QMFResultTable;
-- CREATE TABLE CBARAR3.QMFResultTable (
--   Customer   CHAR(8),
--   Cluster_id INTEGER,
--   Score      DOUBLE,
--   Confidence DOUBLE
-- );
INSERT INTO CBARAR3.QMFResultTable (Customer, Cluster_id, Score, Confidence)
SELECT CLIENT_ID,
  Q.PredictClusterID(
    'Demographic clustering of customer base of an US bank',
    REC2XML(2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY)),
  Q.PredictClusScore(
    'Demographic clustering of customer base of an US bank',
    REC2XML(2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY)),
  Q.PredictClusConf(
    'Demographic clustering of customer base of an US bank',
    REC2XML(2, 'COLATTVAL', '',
      CAR_OWNERSHIP, HAS_CHILDREN, HOUSE_OWNERSHIP, MARITAL_STATUS,
      PROFESSION, SEX, STATE, N_OF_DEPENDENTS, AGE, SALARY))
FROM CBARAR3.BANKING_SCORING, IDMMX.CLUSTERMODELS
WHERE IDMMX.CLUSTERMODELS.MODELNAME =
  'Demographic clustering of customer base of an US bank';
Figure 8-5 shows the result of running this query in QMF.
Figure 8-5 Results of IM Scoring run in QMF V7.2
The results can obviously be portrayed in a table or exported to a file format for use by the end user in another business application.
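For instance, a follow-up query in QMF could list the customers scored with the highest cluster confidence; this is a sketch against the result table created in Example 8-1:

  SELECT Customer, Cluster_id, Score, Confidence
  FROM CBARAR3.QMFResultTable
  ORDER BY Confidence DESC
  FETCH FIRST 20 ROWS ONLY;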
8.3 Integration with e-commerce

To be competitive in the global marketplace, businesses need to offer greater levels of customer service and support than ever before. When customers access a Web site today, they expect to browse through a product catalog, buy the products online in a secure environment, and have the products delivered to their doorstep.

Electronic commerce, or e-commerce, involves doing business online, typically via the Web. E-commerce implies that goods and services can be purchased online, whereas e-business may be used as more of an umbrella term for a total presence on the Web, which includes the e-commerce component on a Web site.

Note: Some of the key concepts of an e-commerce Web site include:
- User profile: Information that is entered and gathered during a user’s visits forms the user profile.
- Product catalog: On the Web, this is analogous to a printed catalog. Products are organized into logical groups, and the display of the products is tailored to maximize their sales. Customers can browse the catalog to search for products and then place orders.
- Shopping flow: In the e-commerce environment, this is the process in which customers browse the catalog, select products, and purchase the products.
- Shopping cart: The metaphor of a shopping cart has become widely used on the Web to represent an online order basket. Customers browse an e-commerce site and add products to their shopping carts, then proceed to the checkout to purchase the products in their carts.
The Business-to-Consumer (B2C) e-commerce store model is a publicly accessible Web site offering products for sale. It is analogous to a store on the street, where any member of the public can walk in and make a purchase. A new, unknown customer is called a guest shopper. The guest shopper has the option to make purchases, after they provide general information about themselves to fulfill the transaction (name, address, credit card, and so on). Most B2C sites encourage users to register and become members. In doing so, the business can establish a relationship with the customer, provide better service, and build customer loyalty.
The Business-to-Business (B2B) e-commerce store model refers to an e-commerce store specifically designed for organizations to conduct business over the Internet. The two entities are known to each other, and all users are registered.

In our case, where we deal with a B2C e-store, after we set up our store, we are likely to be interested in how successful it is. We may want access to specific information about the success of different campaigns and initiatives. We may also want to know more about the customers who are using the store and their responses to specific campaigns and initiatives. See the example in Figure 8-6.
[Figure 8-6 contrasts the business focal points of the two store models: for Business-to-Consumer, campaigns and initiatives, catalog, orders, payment, sales, and customers; for Business-to-Business, workflow, contracts, and negotiation.]
Figure 8-6 B2C, B2B, and business focal point in an e-store
The business issue is to leverage the information stored separately on both campaigns and customers by combining these two information sources to gain more lift in responders to campaign initiatives. Web and call center interfaces allow us to test campaigns and their effectiveness before applying them to the masses. For example, campaign A targets the first 50 eligible visitors to our Web site, and campaign B targets the next 50 eligible visitors. We can determine offer response, customer segmentation, and so on. Then we can create the appropriate batch or campaign management offer for the segment that best fits the response profile.
We address the following business issues:
- What are the characteristics of the initiatives that are most successful?
- What are the characteristics of the customers who respond most favorably to the initiatives?
Mapping the business issues to data mining functions
After the e-store is deployed, there are many activities that you must perform to manage the e-store and to examine how well the B2C e-store is performing, as shown in Figure 8-7. Both IM Modeling and IM Scoring may help you to better understand and differentiate between online campaign initiatives and between customers.
Figure 8-7 summarizes the main areas of e-commerce analysis:
- Visitor traffic: traffic volumes on visitors, repeat visitors, and registered customers; traffic trends based on hour of day and day of week; time spent by visitors on pages viewed; effectiveness of referrers (advertisements and links on partner Web sites); visitor network geography (domain, subdomain, country); search keywords
- E-commerce: products seen or selected; shopping behavior (searching or browsing); shopping cart abandon rate; campaign effectiveness; business impact measures; conversion rates (browse-to-buy, search-to-buy)
- Path analysis: popular paths through the site; content effectiveness of specific site features; customer needs (search analysis)
- Personalization: customer profiling; effectiveness of business rules
- Site operation: Web site bugs (broken links, server errors); speed of page load or response
Figure 8-7 E-commerce analysis
IM Modeling can be used to discover additional characteristics, such as why one initiative is more successful than another or why a particular cluster of customers responds more favorably to the campaign than another cluster. For the first question, the outcome is a set of mining models that describes several clusters of campaigns or initiatives. For the second question, you have a mining model that describes several clusters of online customers who are more or less favorable responders to the campaign initiatives.
IM Scoring may be used to score similar, but newly set up, online campaigns or initiatives against the customer segmentation results stored in DB2 UDB tables. Likewise, in the case of the second question, it allows you to score new customers against existing initiatives, or in combination with new campaigns to be run on short notice. These DB2 data mining functions integrate the mining results automatically back into the datamart, where they can be used to revise e-commerce marketing strategies. Using IM Modeling and IM Scoring, you can score data on online customers and, at the same time, display the history of these customers, because both are stored in the data warehouse.
8.4 Integration with WebSphere Personalization

Most B2C sites try to maintain information about users and encourage users to register. Information that is entered and gathered during the users’ visits forms the user profile. The user profile information can be used as a powerful marketing tool to personalize the Web site content for the user. The personalized content can be used to filter the product catalog to only those products that the customer is interested in. The content can also be used to implement such selling techniques as cross-selling and up-selling.

Web-based personalization is meant to better match the individual needs and preferences of the visitor to the online site. The intent is that by targeting the Web content to the customer’s needs, they will become a true customer rather than merely a Web site visitor. Online shops, such as online auction sites (eBay), camping equipment (REI), and book stores (Amazon), try their best to match Web content to visitors to get them to become frequent visitors and eventually buying customers.

In Web channel-based communication, a medium that provides no actual personal contact with the visitor, real-time response to the visitor’s needs and preferences is essential. Typically, a visitor to a Web site must have their needs recognized within split seconds. Otherwise, they will not return to see whether the site offers the information, services, and products they were looking for and considered buying. Fast delivery of matching content is important both to new visitors and to established customers who have shopped at the online site before.

We want to address the business issue of minimizing those who leave (leavers) at all costs by improving their experience with our B2C site. At the same time, we must turn navigational and information request visits into actual sales transactions at each occurrence of a Web session.
Mapping the business issue to data mining functions

Companies with online shop sites, such as Amazon, use recommendation engines to address the possible needs of their Web site visitors in a more personalized manner. In this way, they sell to new guest shoppers and up-sell or cross-sell books and other items to existing customers.

The association techniques of IM Scoring may be used to monitor the flow of Web pages that the Web site user visits during a session. The navigational clicks by visitors who search for items, either to request information or for simple one-time page visits, indicate what information needs the guest shopper or customer may have. At the same time, a series of mouse clicks to register, search, select, and pay for items in the online shopping cart further enhances the personalized services to customers with whom you start to establish a customer lifetime relationship. This Web transactional behavior by visitors, who become part of the customer base after registration, builds a useful shopping history. It allows IM Scoring to provide scores to the recommendation engine that will be more effective in future Web sessions by these customers.

Figure 8-8 shows a solution that maps the business issue to technology involving personalization and recommendation engines, relying on the data mining functions of IM Modeling and IM Scoring. Here, WebSphere Application Server is running with personalization based on the recommendation engine of WebSphere Personalization. Scoring results are plugged in by using the Java API of IM Scoring.

In certain e-business scenarios, such as first-time visits to the Web site, the input data for scoring may include data that has not yet been made persistent in the database. The current data may depend on the most recent mouse click on a Web page. A small Java API for the scoring functions allows for high-speed predictive mining in these cases as well. It also offers support for personalization in WebSphere. In addition, the IM Visualizers may be plugged in using the applet view of these visualizations, once a Web browser view of the personalizations is set up by the IT department for the content manager or managers.
Figure 8-8 WebSphere Personalization (diagram). The Personalization Workspace (included with WebSphere Personalization, V4.0) lets the business manager manage campaigns, preview the Web site, and develop business rules. For each site visitor, WebSphere Application Server (Advanced or Enterprise Edition, on AIX, Solaris, Windows NT/2000, HP-UX, or Linux) evaluates the business rules and selects the appropriate content, using profiles collected on site or retrieved from other business systems (for example, LDAP) and content from Interwoven TeamSite, IBM EIP, or other content management systems. The result is pages personalized for each site visitor. WebSphere Site Analyzer analyzes the business rules for effectiveness to improve the site's ROI.
The business application or applications

There are two business applications where personalization is quite important: Amazon and Recreational Equipment, Inc. (REI). In both, the association techniques of IM Scoring are used to monitor the flow of Web pages that the Web site user visits during a session.
Amazon

When you access the online bookstore Amazon (http://www.amazon.com), the site immediately uses its recommendation engine. Amazon invites you to become a registered guest shopper. See Figure 8-9.
Figure 8-9 Personalizing at Amazon.com when you visit their Web site
Amazon also invites you to tell them your interests, so that they will remember them and can personalize their site just for you. After you decide to have Amazon.com personalized to your interests, the site displays a recommendation wizard (Figure 8-10).
Figure 8-10 The Amazon three-step recommendation wizard
Amazon asks you to follow three steps to receive personalized recommendations:
1. Tell what your favorite stores and categories are.
2. Tell what your favorite products are.
3. Confirm your favorites to receive recommendations from now on.

Based on the purchase history with Amazon that you supplied in the recommendations wizard, you see a recommendation immediately after your registration. (If you are a first-time guest shopper at the Amazon site, Amazon may look at your purchase history with other stores.) If Amazon cannot find anything to recommend, you instead see a friendly personal message: "Hello Jaap Verhees...We're sorry. We were unable to find any titles to recommend after looking at your purchase history."
You may also see a suggestion hint if the items recommended by Amazon are not on target (right-hand side in Figure 8-11). In this way, Amazon tries to minimize the likelihood of us leaving their Web site and not returning in the near future, by improving our experience with the Amazon e-store.
Figure 8-11 Amazon has no recommendations yet, but tries to get on target
After you refine your recommendations by rating products that you have bought, either at the Amazon Web site under a previous registration or at competitors with online and offline stores, the recommendation engine treats these ratings as a purchase history. In this example, we specified that we are interested in e-stores that sell:
- Books
- Magazines
- Products for outdoor living
Next, in the Books store, we selected categories (Arts & Photography, Outdoors & Nature, and Travel) and made similar category selections in the Magazines and Outdoor Living stores. Finally, we rated books that were listed to us from our preferred categories and indicated which of these books we own. We had bought a number of books with titles such as “Discovering Data Mining”. From then on, the wizard had enough purchase history and preferences to recommend a product that may well make us behave as those who stay (stayers) instead of leavers. See Figure 8-12.
Figure 8-12 The Amazon.com recommendations wizard suggesting a book
REI

The other example is the REI online store (Figure 8-13). Formed in 1938 as a co-op to provide high-quality climbing and camping gear at reasonable prices, REI is now the nation's largest consumer cooperative with 2 million members. The respected outdoor gear retailer has 59 brick-and-mortar stores in 24 states.
Kiosks in every store allow customers to access the REI Web site at http://www.rei.com, where approximately 78,000 SKUs are listed.
Figure 8-13 REI home page
There is also a value-priced storefront, REI-OUTLET.com, as well as 800-number shopping. With 6,500 employees, REI generates approximately $700 million in sales, of which $100 million comes from the online stores. REI is known for integrating its multiple sales channels to provide its customers (an exceedingly loyal crowd) with a consistently convenient, pleasant, and informative shopping experience. REI's in-store point-of-sale (POS) terminals have been Web-enabled since 1999. They can be used, for example, to order items that are out of stock at the store. REI's multi-channel retailing strategy, moreover, has proven itself beyond a doubt. In a 24-month period, REI found that dual-channel shoppers spent 114 percent more per customer than single-channel shoppers. Tri-channel customers spent 48 percent more than dual-channel customers. With the importance of its online store firmly rooted in its overall retail strategy, REI began seeking ways to simplify its underlying technology as the site and its functionality grew. REI wanted to focus its energies on what it does best: building more personalized relationships with its customers to improve their experience with REI (see Figure 8-14).
Figure 8-14 Recommendations at the time of shopping cart entries
By scanning the current shopping cart entries, REI's personalization engine looks up the associated products in its product catalog database. Then it sets up those items as suitable recommendations. If you are a first-time guest shopper, you do not have a purchase history in the REI database, so the engine does not cross-reference recommendations back to the sales and purchases data tables. But after you become a frequent shopper, the recommended products are also filtered against your previous purchases, so that you do not receive recommendations for products that you already have. Personalization clearly has a better chance of succeeding when your previous purchase history is taken into account.

REI also tries to boost sales by dynamically linking Web content with targeted marketing information, not only with sales and purchase information. For example, if a customer is reading an REI “Learn & Share” article on backpacking, the personalized recommendation engine can drop an image of the hiking boots featured that week onto the Web page. Personalization helps REI use the Web site as a powerful marketing tool and also enhances the multichannel integration for the ultimate benefit of their customers. For instance, the recommendation engine of the Web site can refer new Web customers to nearby stores that are having sales. It can trigger an e-mail with a coupon, redeemable in stores or online, to a recent brick-and-mortar customer who has purchased a bicycle, offering discounts on helmets and other complementary products.
The integration with the application examples

By handling your item preferences or purchase history, the Web applications of Amazon and REI recommend other associated items to you. The steps for this may be as follows:
1. The IM Modeling functions for association analysis build a model of associations between the numerous products sold by an e-store, such as Amazon or REI, over time.
2. Your itemized preferences are then matched one by one to the association rules.
3. Using each itemized preference in turn, the IM Scoring function for associations then selects the association rule with the highest lift (confidence level/support) as the product item to recommend. (A minimal sketch of this selection follows the list.) Example: “If book titleA was purchased, then book titleB is often bought in combination or within a short time period afterward.”
4. The list of selected products is then presented as recommendations on the personalized Web page to the guest shopper.

With IM Modeling and IM Scoring alongside WebSphere Personalization and WebSphere Application Server powering a Web site, online stores are a lot more efficient and able to quickly make changes that enhance the way they interact with shoppers.
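As a minimal illustration of step 3, the following self-contained sketch (our own hypothetical Rule class, not the IM Scoring association API) picks, for one item preference, the rule with the highest lift, using the definition of lift given above (confidence level divided by support):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical representation of one association rule:
// "if headItem was purchased, then recommendedItem is often bought too".
class Rule {
    String headItem;
    String recommendedItem;
    double confidence; // rule confidence level
    double support;    // rule support

    Rule(String head, String rec, double conf, double sup) {
        headItem = head; recommendedItem = rec;
        confidence = conf; support = sup;
    }

    // Lift as defined in the text above: confidence level / support.
    double lift() { return confidence / support; }
}

public class Recommender {
    // For one item preference, return the recommendation from the rule
    // with the highest lift among all rules headed by that item.
    static String recommendFor(String item, List rules) {
        Rule best = null;
        for (Iterator it = rules.iterator(); it.hasNext();) {
            Rule r = (Rule) it.next();
            if (r.headItem.equals(item)
                    && (best == null || r.lift() > best.lift())) {
                best = r;
            }
        }
        return (best == null) ? null : best.recommendedItem;
    }

    public static void main(String[] args) {
        List rules = new ArrayList();
        rules.add(new Rule("book titleA", "book titleB", 0.40, 0.05));
        rules.add(new Rule("book titleA", "book titleC", 0.30, 0.02));
        System.out.println(recommendFor("book titleA", rules)); // book titleC
    }
}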
8.5 Integration using Java

The final integration capability, and one of the most worthwhile, is embedding IM Scoring right into any Web-enabled end-user business application by using Java Beans technology. This section explains the IM Scoring Java Bean concept through the business case that we present in Chapter 4, “Customer profiling example” on page 51.
8.5.1 Online scoring with IM Scoring Java Beans

IM Scoring Java Beans can be used to score single or multiple data records using a specified mining model. They are designed for applications where the online scoring of data records is the main task. IM Scoring Java Beans enable you to score a single data record in any Java application, given a PMML model. This can be used to integrate scoring into e-business applications, for example for real-time scoring in CRM systems. Basically, the IM Scoring Java Beans are a good way to integrate scoring into any Web application. The Java Beans implementation of IM Scoring is set up to provide:
- Fast deployment
- Ease of use in a Java programming environment
- Scoring available to any Web-based application
The functions of IM Scoring Java Beans are implemented as methods of the class com.ibm.iminer.scoring.RecordScorer. Note that the Java API is documented in online documentation (Javadoc) in the directory \doc\ScoringBean\index.html, relative to the IM Scoring program files. The Javadoc is shown in Figure 8-15.
Figure 8-15 IM Scoring Java Beans: JavaDoc on class RecordScorer
8.5.2 Typical business issues

A possible application area of IM Scoring Java Beans in CRM systems is the realization of an Internet-based call center scenario. In this scenario, the required business logic, namely the scoring functions, runs on a Web or application server. Clients connect to the server and send it a data record that a call-center operator specified by means of a user interface on the client. The data record is scored on the server, and the result is passed back to the client in real time. Figure 8-16 shows a simplified design of how such a scenario can be realized using IM Scoring Java Beans. Here, IM Scoring Java Beans are integrated into a Java 2 Enterprise Edition (J2EE) implementation using, for example, servlets or Enterprise JavaBeans (EJB).
Figure 8-16 Architecture sample to realize a call-center scenario
Note: For optimum performance throughput, you may decide to run each mining model in a separate process. In this case, you pass only the new records to the appropriate scoring process. This results in a considerable performance improvement, because the model-loading step, which is very time-consuming, is done only once. (A sketch of the load-once idea follows.)
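As a minimal sketch of that load-once idea within a single JVM (a simplification of the per-process setup described in the note, using our own hypothetical ModelScorer wrapper rather than the real scoring bean API), each mining model is loaded on first use and then reused for every subsequent record:

import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal interface standing in for the scoring bean
// (for example, RecordScorer); not part of the IM Scoring API.
interface ModelScorer {
    Object score(Map record);
}

public class ScorerRegistry {
    // One scorer per mining model, loaded once and then shared.
    private static final Map SCORERS = new HashMap();

    static synchronized ModelScorer forModel(String modelName) {
        ModelScorer scorer = (ModelScorer) SCORERS.get(modelName);
        if (scorer == null) {
            scorer = loadModel(modelName); // time-consuming, done only once
            SCORERS.put(modelName, scorer);
        }
        return scorer;
    }

    private static ModelScorer loadModel(String name) {
        // Placeholder: read the PMML model (from a file or a DB2 table)
        // and wrap a scoring bean initialized with it.
        return new ModelScorer() {
            public Object score(Map record) { return null; } // stub
        };
    }
}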
Another typical application area is the bank customer profile scoring case from 3.5, “Integrating the generic components” on page 44. In this case, the Internet-based part of the bank business environment uses scoring to profile new or recent guest shoppers at the bank. On the basis of the profile that the guest shopper has entered in the online form, it decides which product or service offer most likely suits them. The remainder of this section continues with this case.
8.5.3 Mapping to mining functions using IM Scoring Java Beans

To perform the scoring for a new customer, you must specify the following input:
- The mining model to be used
- The data record with the data of the customer for whom we want to compute a score value
When you have specified the necessary input, you can apply scoring and then access the result fields that were computed. Appendix I, “IM Scoring Java Bean code example” on page 293, contains the complete Java code “CustomerScore.java”. This code runs an IM Scoring Java Bean to score a new customer with a specified customer ID against the clusters defined in a clustering model that has segmented the customer base. The code example performs the following actions (a condensed sketch follows the list):
1. Takes a bank customer ID as an input.
2. Retrieves the customer record using Java Database Connectivity (JDBC).
3. Loads the ResultSet into a record.
4. Uses the scoring bean class RecordScorer to load a selected model and score the record.

Note: The Java Bean class RecordScorer is used to perform scoring on a single data record. The record is a set of (field name, value) pairs and must be defined by the user. The computed result can be accessed through a set of getXXX() methods. Before calling the score(Map) method, you must specify the following items:
- The actual values of the mining fields used by the scoring algorithm
- The connection, if the mining model being used is stored in a database
5. Displays the result of the score.
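The following condensed sketch of that flow compiles on its own. The database alias (BANKDB), table and column names (CUSTOMERS, CUSTOMER_ID), and model name are placeholders, and the RecordScorer calls other than the documented score(Map) are left as comments, because their exact signatures are in the ScoringBean Javadoc:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.HashMap;
import java.util.Map;

// Condensed sketch of the CustomerScore.java flow from Appendix I.
public class CustomerScoreSketch {
    public static void main(String[] args) throws Exception {
        String customerId = args[0]; // 1. customer ID as input

        // 2. Retrieve the customer record through JDBC.
        Class.forName("COM.ibm.db2.jdbc.app.DB2Driver");
        Connection con = DriverManager.getConnection("jdbc:db2:BANKDB");
        PreparedStatement ps = con.prepareStatement(
                "SELECT * FROM CUSTOMERS WHERE CUSTOMER_ID = ?");
        ps.setString(1, customerId);
        ResultSet rs = ps.executeQuery();
        rs.next();

        // 3. Load the ResultSet into a record of (field name, value)
        //    pairs, matching columns to the mining fields of the model.
        Map record = new HashMap();
        ResultSetMetaData md = rs.getMetaData();
        for (int i = 1; i <= md.getColumnCount(); i++) {
            record.put(md.getColumnName(i), rs.getObject(i));
        }

        // 4. Score the record with RecordScorer. Except for score(Map),
        //    the calls below are placeholders; see the Javadoc for the
        //    real method names.
        // RecordScorer scorer = new RecordScorer();
        // scorer.setConnection(con);               // model stored in DB2
        // scorer.setModelName("CustomerSegments"); // placeholder name
        // scorer.score(record);
        // System.out.println(scorer.getClusterID()); // 5. display result

        con.close();
    }
}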
Note: The code uses JDBC to retrieve a record (based on customer_id as arg[0]) instead of hardcoding it. This way you can link the scoring bean back to DB2 UDB, or to any JDBC-enabled database for that matter. For example, changing the DB2 UDB driver specification COM.ibm.db2.jdbc.app.DB2Driver to the specification of the JDBC driver for an Oracle database offers access to data records in that RDBMS.

This code also uses a method that matches the columns in the PMML model to the columns in the ResultSet, instead of hard-coding the data fields.
8.5.4 The business application

Bank customers, in particular guest shoppers who use the Internet-based part of the bank business environment, often interact with the bank in short time bursts. Consequently, data records based on information entered within the Web session must be scored in near real time. The bank can then provide an immediate response to the needs that customers state through the online customer information or request form. For both the online customers and the bank, the benefit of the bank's CRM approach to its individual customers grows once IM Scoring is done at high speed and with no need for operator interference. IM Scoring facilitates an effective CRM process toward the bank's customers who use the Web channel to interact with the bank.
8.5.5 Integration with the application example

To score new data records each time a Web channel interaction occurs between the bank customer and the bank's Internet-enabled application (online form), the integration occurs as follows (see the servlet sketch after this list):
1. The online form contents from the Web page are sent in a data record format to the servlet.
2. The servlet feeds the record to the bank's front- or back-office application, which uses the Java Bean (RecordScorer) to score the customer against one of the customer segments in the clustering model.
3. The result (score and segment ID, matched with the customer ID) from the IM Scoring Java Bean is used to select a bank service or product to up-sell or cross-sell to the online customer.
4. The offer, based on the score, is received by the servlet.
5. The servlet sends the offer in near real-time response during the Web session to the Web page of the online guest shopper, or it sends an e-mail to the registered customer in addition to a response to their Web page visit.

The Java Bean code that applies scoring to fit the new customer to already existing ones helps to achieve a near real-time response back to the customer and can easily be reused.
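A minimal servlet sketch of these five steps might look as follows; the form field names and both helper classes are hypothetical stand-ins for the RecordScorer-based logic:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Stub helpers standing in for the RecordScorer-based scoring (step 2)
// and the segment-to-offer lookup (step 3); both are hypothetical.
class ScoringService {
    static int scoreToSegment(Map record) { return 1; }
}
class OfferSelector {
    static String offerForSegment(int segment) { return "savings plan"; }
}

public class OfferServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Step 1: the online form contents arrive as a data record.
        Map record = new HashMap();
        record.put("AGE", req.getParameter("age"));       // assumed fields
        record.put("INCOME", req.getParameter("income"));

        // Steps 2-3: score to a segment, then pick an offer for it.
        int segment = ScoringService.scoreToSegment(record);
        String offer = OfferSelector.offerForSegment(segment);

        // Steps 4-5: return the offer within the same Web session.
        resp.setContentType("text/html");
        resp.getWriter().println("<p>Our offer for you: " + offer + "</p>");
    }
}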
8.6 Conclusion

IM Scoring enables users to incorporate analytic mining into Business Intelligence, e-commerce, and online transaction processing (OLTP) applications. Applications score records (segment, classify, or rank the subject of those records) based on a set of predetermined criteria expressed in a data mining model. These applications can serve business and consumer users alike. For example, they can provide more informed recommendations, alter a process based on past behavior, and build more efficiency into the online experience. In general, therefore, you can be more responsive to the specific customer relationship event at hand, which often takes place in a Web-enabled business environment.
Part 3
Configuring the DB2 functions for data mining

This part provides a more technical discussion of the different configurations and uses of the DB2 data mining functions. It includes:
- Implementing the DB2 data mining function IM Scoring for existing data mining models
- Using the DB2 data mining function IM Modeling for building the data mining model
- Using the DB2 data mining function IM Visualization for visualizing the mining results, that is, the scores assigned to the operational data on the basis of the model
Note: Consult the standard product documentation for more current information.
Chapter 9. IM Scoring functions for existing mining models

This chapter provides detailed information about the integration of existing data mining models into a DB2 Universal Database (UDB) database for the purpose of scoring. It starts with an overview of the scoring functions and then provides a step-by-step guide on:
- Enabling the DB2 UDB database for scoring
- Importing models in various formats into the selected DB2 UDB database
- Using the imported model to score and return the result
9.1 Scoring functions

The IM Scoring data mining function uses the following features of DB2 UDB extensively:
- User-defined function (UDF)
- User-defined structured type (UDT)
- Method
A user-defined function is a mechanism with which you can write your own extensions to SQL. For example, the API of IM Scoring is implemented with UDFs. User-defined types are useful for modeling objects that have a well-defined structure consisting of attributes. For example, a user-defined structured type that contains the classification identifier and confidence is useful for storing and structuring the result of a classification model. Methods, like UDFs, enable you to write your own extensions to SQL by defining the behavior of SQL objects. However, unlike UDFs, you can only associate a method with a structured type stored as a column in a table. In IM Scoring, you use methods to extract individual results from UDTs. To read more about UDFs, UDTs, and methods, refer to IBM DB2 Universal Database Application Development Guide: Programming Server Applications, SC09-4827.
9.1.1 Scoring mining models

IM Scoring has a two-level structure. The IM Scoring functions provided (clustering, classification, and regression) apply models from algorithms, as shown in Figure 9-1. Consider the example of the clustering scoring function, which applies models from the demographic and neural clustering algorithms. The results of applying a model are referred to as scoring results. To learn more about these mining models, see Discovering Data Mining, SG24-4839.
Scoring function / Applies models from algorithms:
- Clustering: demographic clustering, neural clustering
- Classification: decision tree, neural classification
- Regression: RBF, neural value prediction, polynomial regression (linear, logistic)

Figure 9-1 IM Scoring to apply models to new data
9.1.2 Scoring results

The scoring results differ in content according to the type of model applied. When a classification model is applied, the scoring results assign a class label and a confidence value to each individual record that is being scored.

Confidence value: This is a data mining term for the reliability of the fit of a record to a certain class. If the confidence value (range 0-1) is below or near 0.5, another grouping of the record may be just as plausible, or it may not be reasonable to consider the record for the respective class at all.
The predicted class that is produced when you apply a classification model identifies the class within the model that the data matches best. When a clustering model is applied, the scoring results assign a cluster ID, and a measure that indicates how well the record fits into the assigned cluster, to each individual record being scored. The cluster ID identifies the position of the cluster in the clustering model that is the best match for the data. When a prediction model is applied, the scoring results assign a predicted value. The predicted value, which is produced when you apply a regression model, is calculated according to relationships that are established by the model.
9.2 IM Scoring configuration steps

The steps required to configure a DB2 UDB database for scoring are listed in Table 9-1. They are categorized according to several actions.

Table 9-1 Step-by-step action categories and steps

Step 1. Enable the DB2 UDB instance: Update the database manager configuration parameter UDF_MEM_SZ. Restart the DB2 UDB instance.
Step 2. Enable the database: Increase the database transaction log size. Increase the application control heap size. Increase the application heap size. Create the database objects that are required for scoring.
Step 3. Export models from the modeling tool: Export the selected model or models to an external file or files, either in PMML or DB2 Intelligent Miner for Data format. (The IM Modeling SQL API exports models in PMML format only.)
Step 4. Import models: Import the models from the external files into the relational database.
Step 5. Generate the SQL script: Generate the SQL script that scores the target table using the models.
Step 6. Application: Invoke the SQL scoring scripts from the application.
Figure 9-2 graphically shows these steps in a high-level view of an application architecture involving modeling, scoring, and application layers.
Figure 9-2 Application architecture with modeling and scoring (diagram). The modeling environment contains the modeling API, schedulers, jobs, and model calibration. Models are either saved directly into DB2 (step 3) or exported as PMML to the file system (step 3) and imported into DB2 (step 4). The scoring environment consists of the operational data store and analytical data mart holding data, models, and scores; the DB2 instance and database are enabled for scoring in steps 1 and 2. The application environment (for example, campaign management, customer segmentation, or scoring inside the call center application) scores the data (step 5).
9.3 Step-by-step configuration

The main configuration steps that you perform once for scoring are:
1. Configure the DB2 UDB instance.
2. Configure the database.
3. Export models from the modeling tool.
4. Import models in DB2 UDB.
5. Generate the SQL script.
9.3.1 Configuring the DB2 UDB instance

After the scoring module is installed, you need to configure the DB2 UDB instance and the database before you can use IM Scoring. This is done by enabling the DB2 UDB instance. See the steps in Table 9-2.
Since the scoring data mining function is implemented primarily as UDFs, you must increase the default memory size allocated to UDFs. A recommended value is 60000. Table 9-2 lists the parameters and their recommended values.

Table 9-2 Steps for enabling the DB2 UDB instance

Step 1 (UNIX and Windows): Increase UDF_MEM_SZ.
db2 update dbm cfg using udf_mem_sz 60000

Step 2 (Windows only): Increase the DB2 registry parameter.
db2set DB2NTMEMSIZE=APLD:240000000

Step 3 (UNIX and Windows): Restart (bounce) the instance.
db2stop
db2start
9.3.2 Configuring the database

Once the database instance is configured for scoring, the next step is to enable the database. These steps ensure that:
- The database is configured with the appropriate database parameters.
- The required database objects are created:
  – Tables
  – UDFs
  – UDTs
  – Methods
  – Stored procedures

The steps in Table 9-3 are required for each database.

Table 9-3 Steps for enabling the DB2 UDB database

Step 1 (UNIX and Windows): Increase the log size for likely long transactions during scoring.
db2 update db cfg for <database> using logfilsiz 2000

Step 2 (UNIX and Windows): Increase the application control heap shared memory.
db2 update db cfg for <database> using APP_CTL_HEAP_SZ <value>

Step 3 (UNIX and Windows): Increase the private memory for the application.
db2 update db cfg for <database> using APPLHEAPSZ 1000

Step 4 (UNIX and Windows): Create the database objects required for scoring, including administrative tables, UDFs, and UDTs.
idmenabledb <database> fenced tables
Federated access

If the table to be scored is a remote DB2 UDB table, such as a table in DB2 UDB on a z/OS server, you can score the table on the remote server using the federated access support in DB2 UDB. Table 9-4 summarizes the prerequisites for federated access.

Table 9-4 Middleware prerequisites for federated database access

IM Scoring environment: DB2 UDB Enterprise Edition on Windows
- Remote DB2 UDB for z/OS: DB2 Connect
- Remote DB2 UDB for iSeries (OS/400): DB2 Connect
- Remote DB2 UDB Enterprise Edition for Windows: no additional software requirement
- Remote Oracle (Solaris, Linux, AIX): Relational Connect
- Remote SQL Server: Relational Connect

IM Scoring environment: DB2 UDB Enterprise Edition on UNIX
- Remote DB2 UDB for z/OS: DB2 Connect
- Remote DB2 UDB for iSeries (OS/400): DB2 Connect
- Remote DB2 UDB Enterprise Edition for Windows: no additional software requirement
- Remote Oracle (Solaris, Linux, AIX): Relational Connect
- Remote SQL Server: Relational Connect
With all the prerequisite software installed, you can now configure remote database tables for federated access. Table 9-5 summarizes the steps to achieve this. (The wrapper is created before the server definition that references it.)

Table 9-5 Configuring a remote DB2 UDB table as a target table

Step 1. Catalog the remote node:
CATALOG TCPIP NODE DB2NODE REMOTE SYSTEM42 or 9.1.150.113 SERVER DB2TCP42 or 50000

Step 2. Create the wrapper:
CREATE WRAPPER DRDA

Step 3. Define the remote server:
CREATE SERVER DB2SERVER TYPE DB2/390 VERSION 6.1 WRAPPER DRDA OPTIONS (NODE 'db2node', DBNAME 'quarter4')

Step 4. Create the user name mapping:
CREATE USER MAPPING FROM USER27 TO SERVER DB2SERVER AUTHID "TSOID27" PASSWORD "TSOIDPW"

Step 5. Create a nickname:
CREATE NICKNAME DB2SALES FOR DB2SERVER.TSOID27.MIDWEST

The nickname refers to the remote table that you want to score. Currently, this is the only way to score a DB2 UDB table on z/OS.
9.3.3 Exporting models from the modeling environment

Scoring requires a model as the basis for how to score. The models must be stored in a database table as model objects so that IM Scoring can use them via the SQL API. If the mining models are created by means of IM Modeling, they can be applied directly, since IM Modeling stores the result models in the database tables directly. The mining models may instead have been created in a workbench environment that supports model export in PMML format, such as DB2 Intelligent Miner for Data or SAS/Enterprise Miner. In this case, the models must first be exported to an intermediate file format. DB2 Intelligent Miner for Data can export the models in the native DB2 Intelligent Miner for Data model format or in the industry standard PMML format. After the models are in the PMML format and made accessible from the file system, DB2 UDB can use the SQL API to import the models into DB2 UDB tables for scoring applications. Imported models are stored as Character Large Objects (CLOBs) in DB2 UDB tables.

Exporting the data mining model from the mining tool

Figure 9-3 shows an example of the available formats for exporting a model from DB2 Intelligent Miner for Data.
Figure 9-3 Export model menu from DB2 Intelligent Miner for Data
Using the DB2 Intelligent Miner for Data format

After a model is created using DB2 Intelligent Miner for Data, it is ready to be exported to external files for exchange with other applications, such as IM Scoring. The DB2 Intelligent Miner for Data format is appropriate when the model is to be deployed using IM Scoring, since IM Scoring can import models in DB2 Intelligent Miner for Data format as well as in PMML format. One point to mention is that, when exporting to a file with the proprietary DB2 Intelligent Miner for Data format, the file is stored using the system codepage of the DB2 Intelligent Miner for Data client. Make sure that the language-specific characters appear correctly on this machine, which indicates that the system codepage is correct.

Using the PMML format

Exporting a model in PMML format is recommended when the model is to be imported into third-party tools that only support PMML. Again, it is worth mentioning that, when exporting to a file with the PMML format, the file is stored using the system codepage of the DB2 Intelligent Miner for Data client. However, the encoding written in the first line of the model (the XML declaration) specifies the codepage of the DB2 Intelligent Miner for Data server, where the conversion to PMML occurred. Therefore, the encoding can be erroneous if the DB2 Intelligent Miner for Data client and server are on different machines and systems.
Converting files from DB2 Intelligent Miner for Data to PMML

DB2 Intelligent Miner for Data models may be stored in the native DB2 Intelligent Miner for Data format from previous mining runs, so you may need to change the format. Consider a file that contains a model in the proprietary DB2 Intelligent Miner for Data format. You can create a file containing the same model in the PMML format by explicitly calling the idmxmod executable. The input file is supposed to be in the system codepage of the current machine. The PMML file is written in the system codepage of the current machine, and the corresponding encoding is written in the first line of the model.

Tip: Transfer your PMML files between machines as binary objects to prevent any implicit codepage conversion.
You can specify the encoding of the model by using an additional parameter in the import function, for example:

DM_impClasFileE('/tmp/myModel.pmml', 'windows-1252')
9.3.4 Importing the data mining model in the relational database management system (RDBMS)

Once the models are stored in export formats, the next step is to load the specified model into the database for deployment in scoring applications. There are different model types and import features for you to take into account. IM Scoring provides the SQL API that imports the mining models in their various formats.
Data mining model types

IM Scoring can read the models produced by the following DB2 Intelligent Miner for Data functions:
- Demographic and neural clustering
- Tree and neural classification
- RBF and neural prediction
- Polynomial regression

When you import a model, you must be aware of the type of model that is being used. These model types are:
- Clustering models
- Classification models
- Regression models
IM Scoring provides an SQL interface that enables the application of PMML models to data. In this way, IM Scoring supports the PMML 2.0 format for:
- Center-based clustering (neural clustering in IM Scoring)
- Distribution-based clustering (demographic clustering in IM Scoring)
- Decision trees (tree classification in IM Scoring)
- Neural network (neural prediction and neural classification in IM Scoring)
- Regression (polynomial regression in IM Scoring)
- Logistic regression
- Radial Basis Function (RBF) prediction

Note: IM Scoring also supports RBF prediction in addition to all the other algorithms that are listed. The PMML standard version 2.0 does not yet cover RBF prediction. See the following Web site for more information:
http://www.dmg.org
Importing features (DB2 Intelligent Miner for Data, PMML, CLOB)

IM Scoring provides the SQL API and a set of UDFs for importing and using various scoring functions. There are different UDFs and DB2 tables for importing and storing the different types of models. These UDFs and special tables are created when the database is enabled for scoring using the DB2 script idmenabledb. Table 9-6 cross-tabulates the different models to the UDFs, UDTs, and DB2 UDB tables to use.

Table 9-6 Matching models to UDFs, UDTs, and DB2 tables

Models produced by demographic clustering and neural clustering:
- UDF to import models: DM_impClusFile
- UDT for storing the model: DM_ClusteringModel
- DB2 table where the models are stored: ClusterModels
- Example DB2 command:
  db2 insert into IDMMX.ClusterModels values ('<model name>', IDMMX.DM_impClusFile('<file name>'))

Models produced by tree and neural classification:
- UDF to import models: DM_impClasFile
- UDT for storing the model: DM_ClassModel
- DB2 table where the models are stored: ClassifModels
- Example DB2 command:
  db2 insert into IDMMX.ClassifModels values ('<model name>', IDMMX.DM_impClasFile('<file name>'))

Models produced by polynomial regression, radial basis function, and neural prediction:
- UDF to import models: DM_impRegFile
- UDT for storing the model: DM_RegressionModel
- DB2 table where the models are stored: RegressionModels
- Example DB2 command:
  db2 insert into IDMMX.RegressionModels values ('<model name>', IDMMX.DM_impRegFile('<file name>'))
The following assumptions are made about the codepage during the import:

- Importing from a file with the proprietary DB2 Intelligent Miner for Data format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file): The file is supposed to be in the system codepage of the database server. It is transformed to PMML using this codepage.
- Importing from a file with the PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file): The encoding specified in the first line of the model (the XML declaration) is supposed to be correct. If the encoding is not correct, use the next import variant with the file and encoding option.
- Importing from a file with the PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}File(file, encoding): The encoding given as a parameter is used to convert the model. The encoding specified in the first line of the model itself is ignored.
- Importing from a database object with the PMML format, functions IDMMX.DM_imp{Clas|Clus|Reg}Model(clob): The database object implicitly has the database codepage. The encoding specified in the first line of the model itself is ignored. Note that the model is not converted to the database codepage when copied from a file into the database. However, it is assumed that the file codepage is compatible with the database codepage.

We recommend that you use the functions IDMMX.DM_imp{Clas|Clus|Reg}File(file, encoding) when you want to override the encoding in the XML declaration of the PMML model. This may be necessary if a previous file transfer changed the codepage of the PMML file without updating the XML declaration within the file.
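As an illustration, a minimal JDBC sketch of such an import might look as follows. The database alias (BANKDB) and model name ('myModel') are placeholders, and the two-argument file-import function is the variant described above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Imports a PMML classification model with an explicit encoding,
// overriding whatever the XML declaration in the file claims.
public class ImportModelWithEncoding {
    public static void main(String[] args) throws Exception {
        Class.forName("COM.ibm.db2.jdbc.app.DB2Driver");
        Connection con = DriverManager.getConnection("jdbc:db2:BANKDB");
        Statement st = con.createStatement();
        st.executeUpdate(
            "INSERT INTO IDMMX.ClassifModels VALUES ('myModel', "
          + "IDMMX.DM_impClasFile('/tmp/myModel.pmml', 'windows-1252'))");
        con.close();
    }
}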
9.3.5 Scoring the data

This section provides a sample SQL script template for scoring. After the database is enabled and the models are in place, the next step is to generate the DB2 script for the actual scoring. IM Scoring provides the SQL API and a set of UDFs to apply the scoring functions. There are different UDFs for different types of models. These UDFs and special tables are created when the database is enabled for scoring using the DB2 script idmenabledb. Table 9-7 lists the model types and the relevant UDFs.

Table 9-7 Model types and UDFs

- Tree, neural classification: DM_applyClasModel
- Demographic clustering, neural clustering: DM_applyClusModel
- Polynomial regression, neural prediction, Radial Basis Function: DM_applyRegModel
Figure 9-4 gives a conceptual overview of the scoring process by showing the elements of a SQL scoring script for each UDF. Business application-specific tables and associated models are used as input to the scoring process. Multiple versions of the same model can be stored in the model tables. The benefit of multiple versions is mainly flexibility in model execution. You can have:
- A different version of a model for each customer segment
- Different versions of the same model saved at different points of recalibration
The business scenario may be generic customer data used for segmentation, customer churn tables for churn scoring, or customer product tables for cross-sell propensity scoring.
Figure 9-4 Conceptual overview of SQL scoring elements (diagram). Inside DB2, a customer attributes table is scored with DM_applyClusModel against a ClusterModel (versions V1, V2, V3) to produce customer segments; a customer churn table is scored with DM_applyClasModel against a ClassifModel (V1, V2, V3) to produce a customer churn score; and a customer product portfolio table is scored with DM_applyRegModel against a RegressionModel (V1, V2, V3) to produce a customer propensity score.
There are four essential elements in the SQL scoring script:
- Input data: the table to be scored
- The input model to use
- The appropriate UDF to use for the model
- The output score
The pseudocode in Figure 9-5 highlights those elements.
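Although Figure 9-5 itself is not reproduced here, the following sketch shows how the four elements can come together in a single SELECT statement issued over JDBC. The apply-data construction (DM_impApplyData over REC2XML), the DM_getClusterID result extraction, and the column names of the model table are written from our reading of the IM Scoring SQL API and should be verified against the product documentation; the customer table, its columns, and the model name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of a clustering scoring script showing the four elements:
// input data, input model, apply UDF, and output score.
public class ScoreCustomers {
    public static void main(String[] args) throws Exception {
        Class.forName("COM.ibm.db2.jdbc.app.DB2Driver");
        Connection con = DriverManager.getConnection("jdbc:db2:BANKDB");
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
            "SELECT c.CUSTOMER_ID, "                       // input data
          + "  IDMMX.DM_getClusterID( "                    // output score
          + "    IDMMX.DM_applyClusModel(m.MODEL, "        // apply UDF
          + "      IDMMX.DM_impApplyData( "
          + "        REC2XML(1.0,'COLATTVAL','', c.AGE, c.INCOME)))) "
          + "FROM CUSTOMERS c, IDMMX.ClusterModels m "
          + "WHERE m.MODELNAME = 'CustomerSegments'");     // input model
        while (rs.next()) {
            System.out.println(rs.getString(1) + " -> " + rs.getString(2));
        }
        con.close();
    }
}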