Series on Innovative Intelligence - Vol. 6
Intelligent and Other Computational Techniques in Insurance: Theory and Applications

Editors
A. F. Shapiro
L. C. Jain
Series on Innovative Intelligence
Editor: L. C. Jain (University of South Australia)

Published:

Vol. 1  Virtual Environments for Teaching and Learning (eds. L. C. Jain, R. J. Howlett, N. S. Ichalkaranje & G. Tonfoni)
Vol. 2  Advances in Intelligent Systems for Defence (eds. L. C. Jain, N. S. Ichalkaranje & G. Tonfoni)
Vol. 3  Internet-Based Intelligent Information Processing Systems (eds. R. J. Howlett, N. S. Ichalkaranje, L. C. Jain & G. Tonfoni)
Vol. 4  Neural Networks for Intelligent Signal Processing (A. Zaknich)
Vol. 5  Complex Valued Neural Networks: Theories and Applications (ed. A. Hirose)

Forthcoming Titles:

Biology and Logic-Based Applied Machine Intelligence: Theory and Applications (A. Konar & L. C. Jain)
Levels of Evolutionary Adaptation for Fuzzy Agents (G. Resconi & L. C. Jain)
Series on Innovative Intelligence - Vol. 6
Intelligent and Other Computational Techniques in Insurance: Theory and Applications
Editors

A. F. Shapiro
Penn State University, USA

L. C. Jain
University of South Australia
World Scientific
NEW JERSEY • LONDON • SINGAPORE • SHANGHAI • HONG KONG • TAIPEI • BANGALORE
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
INTELLIGENT AND OTHER COMPUTATIONAL TECHNIQUES IN INSURANCE: THEORY AND APPLICATIONS Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-718-8
Printed in Singapore by World Scientific Printers (S) Pte Ltd
Foreword

"Knowledge is Power." This maxim summarizes my attitude toward intelligence as a key driver of success, in particular the long-term success of an individual or business enterprise in a competitive economy. Intelligence has a number of meanings, including the intellectual capability of an individual or a group of individuals, and the ability of an organization to gather data useful for making decisions. The data itself is also often called intelligence! Intelligent systems are those automated procedures that turn intelligence, in the sense of data, into advice for decision-makers. One of the features of recently developed intelligent systems is that they "learn," in the sense of improving decision-making through time by continually incorporating new data.

With the development of high-speed computers and the resulting ability to collect, store, and process huge amounts of data, the insurance field is one of the industries best positioned to take advantage of the intelligent systems that are being developed. Insurance companies do not need to be huge in order to take advantage of intelligent systems and other data-dependent systems. Indeed, an automobile or health insurer with a few hundred thousand policyholders can collect vast amounts of data on the attributes as well as the detailed experience of each of its policyholders. Some tools for data mining and other techniques are now commercialized and easy to use.

This book provides a series of papers that give overviews of some of these "intelligent" techniques as well as more traditional techniques that are not necessarily described as intelligent, such as statistical methods, e.g., logistic regression. These additional techniques do of course contribute to intelligence in the sense of the intellectual capability of the individual or enterprise. Finally, several papers focus on the process of making decisions in the presence of imperfect "fuzzy" information. Virtually all the papers include direct applications to insurance, risk management or other financial data.

Actuaries form a key element of the intelligence of the insurance industry. Actuarial science is a collection of paradigms and techniques, most of which are shared by other disciplines. Some methodologies, such as bonus-malus, are used exclusively by actuaries, since they were developed for use in the areas, such as insurance, of traditional actuarial practice.

This book will be successful for the insurance industry and for the actuarial profession if it stimulates interest in further, more detailed reading and research. I commend it to all individuals and insurance enterprises. "Knowledge is Power."
Harry H. Panjer President, Society of Actuaries
Preface

At the turn of this century, the premiums written by insurance companies were nearly $2.5 trillion (USD) worldwide, and growing. Many factors will influence the continued growth of the industry, including a rapidly changing risk environment and market conditions, but one of the dominant factors will be technology, and an important component of that will involve intelligent and other computational techniques. This will continue the worldwide trend to fuse these techniques into insurance-based programs.

This novel book presents recent advances in the theory and implementation of intelligent and other computational techniques in the insurance industry. The paradigms covered include artificial neural networks and fuzzy systems, including clustering versions, optimization and resampling methods, algebraic and Bayesian models, decision trees and regression splines. Thus, while the focus is largely on intelligent techniques, the book also includes other current computational paradigms that are likely to impact the industry.

The chapters are divided into two main sections: "Neural Networks, Fuzzy Systems and Genetic Algorithms" (Chapters 1-11) and "Other Computational Techniques" (Chapters 12-18). For the most part, these section headings are adhered to, but a few chapters spanned more than one of these topics and had to be arbitrarily assigned to a section. A summary of each chapter follows.

The first chapter, by Shapiro, presents an overview of insurance applications of neural networks, fuzzy logic and genetic algorithms. The purposes of the chapter are to document the unique characteristics of insurance as an application area and the extent to which these technologies have been employed. The relationships between previously published material and the chapters in this book are noted.
Chapter 2, by Francis, walks the reader through an introduction to neural networks from a statistical perspective, in a property and casualty insurance context. One of her goals is to demystify neural networks, and, to that end, she describes them and how they work by generalizing more familiar models.

Chapter 3, also by Francis, is a sequel to her first chapter. It uses fraud detection and underwriting applications to illustrate practical issues and problems associated with the implementation of neural networks. The topics addressed include goodness of fit, relevance of predictor variables, and the functional relationships between the independent and target variables.

Chapter 4, by Dugas et al., argues in favor of the use of statistical learning algorithms such as neural networks for automobile insurance ratemaking. To this end, they describe various candidate models and compare them qualitatively and numerically. Issues they address include the differences between operational and modeling objectives, the bias-variance dilemma, high order nonlinear interactions, and non-stationary data.

Chapter 5, by Yeo and Smith, combines data mining and mathematical programming to establish an optimal balance between profitability and market share when determining the premium to charge automobile insurance policyholders. They demonstrate the quantitative benefits of their approach.

Chapter 6, by Carretero, describes and implements two fuzzy models for adjusting third party liability insurance rates. The models, which represent the two main classes of linear programming models with imprecise information, are based on constraints or information that is supplemental to the statistical experience data.

Chapter 7, by Duncan and Robb, deals with population risk management, from the perspective of reducing costs and managing risk in health insurance. In this context, the authors discuss the problem of identifying high-risk members of a population and potential prediction methodologies.
Chapter 8, by Chang, applies a fuzzy set theoretical approach to asset-liability management and decision-making. It first develops fuzzy-set theoretical analogues of the classical immunization theory and the matching of assets and liabilities. Then, a fuzzy set theoretical approach is used to extend the Bayesian decision method, which accommodates inherently fuzzy new information and decision maker options.

Chapter 9, by Brockett et al., gives a review of the effectiveness of using neural network models for predicting market failure (insolvency or financial distress) at the firm level, with special emphasis on the US insurance industry. The early warning signals of the neural network models are contrasted with those of expert rating agencies and other statistical methods.

Chapter 10, by Viaene et al., addresses the problem of determining which predictor variables are most informative to the trained neural network. The study is done in the context of automobile insurance claim fraud detection, and the neural network results are compared with those of logistic regression and decision tree learning.

Chapter 11, by Shapiro, discusses the merging of neural networks, fuzzy logic and genetic algorithms within an insurance context. The topics addressed include the advantages and disadvantages of each technology, the potential merging options, and the explicit nature of the merging.

Chapter 12, by Gomez-Deniz and Vazquez-Polo, illustrates notions and techniques of robust Bayesian analysis in the context of problems that arise in Bonus-Malus Systems. They suggest the use of classes of a priori distributions to reflect the a priori opinion of experts.

Chapter 13, by Guillen et al., focuses on the issue of policyholder retention. The authors use logistic regression to investigate the conditional probability of a personal lines customer leaving a property and casualty insurer, given various explanatory variables.

Chapter 14, by Kolyshkina et al., investigates the use of data mining techniques, such as decision trees and regression splines, in workers' compensation and health insurance.
They describe the methodology used and compare the results of data mining modeling to those achievable by using more traditional techniques such as generalized linear models.

Chapter 15, by Ostaszewski and Rempala, describes resampling methodology and then shows how it can be used to enhance a parametric mortality law and to develop a nonparametric model of the interest rate process associated with an insurer's asset-liability model.

Chapter 16, by Craighead and Klemesrud, reports on a study of the selection and active trading of stocks by the use of a clustering algorithm and time series outlier analysis. The clustering is used to restrict the initial set of stocks, while the change in an outlier statistic is used to determine when to actively move in and out of various stocks. The authors comment on the advantages and limitations of their strategies and areas for further research.

Chapter 17, by Cheung and Yang, discusses some of the recent advances in optimal portfolio selection strategies for financial and insurance portfolios. Specific topics addressed include an optimal multi-period mean-variance portfolio policy, a continuous time model, and the measures value-at-risk and capital-at-risk.

The final chapter, by Hürlimann, presents some new and interesting algebraic and probabilistic aspects of deterministic cash flow analysis, which are motivated by several concrete examples. The methodology illustrates how abstract algebraic structures can be used to resolve practical problems.
Acknowledgments

The foregoing authors are leading professionals from around the world, including Asia, Australia/Oceania, Europe and North America. The editors are grateful to these authors and reviewers for their contributions and their willingness to share their knowledge and insights. We are grateful to Berend Jan van der Zwaag for his wonderful contribution.
Contents

Part 1  Neural networks, fuzzy systems, and genetic algorithms

Chapter 1. Insurance applications of neural networks, fuzzy logic, and genetic algorithms   3
1  Introduction   3
2  Neural network (NN) applications   4
2.1  An overview of NNs   5
2.1.1  Supervised NNs   5
2.1.2  Unsupervised NNs   8
2.2  Applications   10
2.2.1  Underwriting   11
2.2.2  Classification   12
2.2.3  Asset and investment models   13
2.2.4  Insolvency   14
2.2.5  Projected liabilities   17
2.2.6  Comment   19
3  Fuzzy logic (FL) applications   19
3.1  An overview of FL   19
3.1.1  Linguistic variables   19
3.1.2  Fuzzy numbers   20
3.1.3  A fuzzy inference system (FIS)   22
3.1.4  C-means algorithm   23
3.2  Applications   24
3.2.1  Underwriting   24
3.2.2  Classification   26
3.2.3  Pricing   30
3.2.4  Asset and investment models   31
3.2.5  Projected liabilities   33
4  Genetic algorithm (GA) applications   34
4.1  An overview of GAs   35
4.1.1  Population regeneration factors   35
4.2  Applications   36
4.2.1  Classification   36
4.2.2  Underwriting   37
4.2.3  Asset allocation   37
4.2.4  Competitiveness of the insurance products   38
5  Comment   39
Acknowledgments   39
References   40
Property and casualty

Chapter 2. An introduction to neural networks in insurance   51
1  Introduction   51
2  Background on neural networks   54
2.1  Structure of a feedforward neural network   55
3  Example 1: simple example of fitting a nonlinear function to claim severity   57
3.1  Severity trend models   58
3.2  A one node neural network   60
3.2.1  Fitting the curve   64
3.2.2  Fitting the neural network   65
3.2.3  The fitted curve   66
3.3  The logistic function revisited   69
4  Example 2: using neural networks to fit a complex nonlinear function   71
4.1  The chain ladder method   71
4.2  Modeling loss development using a two-variable neural network   81
4.3  Interactions   84
5  Correlated variables and dimension reduction   86
5.1  Factor analysis and principal components analysis   86
5.1.1  Factor analysis   87
5.1.2  Principal components analysis   89
5.2  Example 3: dimension reduction   91
6  Conclusion   98
Acknowledgments   99
References   100
Chapter 3. Practical applications of neural networks in property and casualty insurance   103
1  Introduction   104
2  Fraud example   105
2.1  The data   105
2.2  Testing variable importance   110
3  Underwriting example   117
3.1  Neural network analysis of simulated data   119
3.2  Goodness of fit   122
3.3  Interpreting neural network functions: visualizing neural network results   122
3.4  Applying an underwriting model   128
4  Conclusions   130
Acknowledgments   130
Appendix 1   131
Appendix 2   132
References   134
Chapter 4. Statistical learning algorithms applied to automobile insurance ratemaking   137
1  Introduction   139
2  Concepts of statistical learning theory   142
2.1  Hypothesis testing: an example   144
2.2  Parameter optimization: an example   146
3  Mathematical objectives   148
3.1  The precision criterion   149
3.2  The fairness criterion   152
4  Methodology   153
5  Models   156
5.1  Constant model   156
5.2  Linear model   157
5.3  Table-based methods   158
5.4  Greedy multiplicative model   159
5.5  Generalized linear model   160
5.6  CHAID decision trees   162
5.7  Combination of CHAID and linear model   162
5.8  Ordinary neural network   163
5.9  How can neural networks represent nonlinear interactions?   167
5.10  Softplus neural network   168
5.11  Regression support vector machine   170
5.12  Mixture models   172
6  Experimental results   174
6.1  Mean-squared error comparisons   174
6.2  Evaluating model fairness   179
6.3  Comparison with current premiums   181
7  Application to risk sharing pool facilities   183
8  Conclusion   187
Appendix: Proof of the equivalence of the fairness and precision criterions   191
References   192

Chapter 5. An integrated data mining approach to premium pricing for the automobile insurance industry   199
1  Introduction   199
2  A data mining approach   200
3  Risk classification and prediction of claim cost   201
3.1  Risk classification   201
3.1.1  K-means clustering model   203
3.1.2  Fuzzy c-means clustering model   204
3.1.3  Heuristic model   204
3.2  Prediction of claim cost   205
3.2.1  K-means clustering model   205
3.2.2  Fuzzy c-means clustering model   205
3.2.3  Heuristic model   206
3.3  Results   207
4  Prediction of retention rates and price sensitivity   209
4.1  Prediction of retention rate   209
4.1.1  Neural network model   209
4.1.2  Determining decision thresholds   210
4.1.3  Analyzing prediction accuracy   213
4.1.4  Generating more homogeneous models   213
4.1.5  Combining small clusters   215
4.1.6  Results   217
4.2  Price sensitivity analysis   218
4.2.1  Results   219
5  Determining an optimal portfolio of policy holders   220
5.1  Results   222
6  Conclusions   225
References   226
Chapter 6. Fuzzy logic techniques in the non-life insurance industry   229
1  Insurance market   230
2  Fuzzy logic in insurance   232
2.1  Basic concepts in the fuzzy decision-making processes   234
2.2  The two fuzzy non-life insurance models   236
2.2.1  Bonus-malus system   236
2.2.2  The data   237
2.2.3  The non-life insurance model based on Zimmermann's linear approach: a symmetric model   237
2.2.4  The non-life insurance model based on Verdegay's approach: a nonsymmetric model   247
2.2.5  Closing remarks about the two models   249
3  Some extensions   250
4  Classification   251
4.1  The fuzzy c-means algorithm   253
5  The future of insurance   254
Acknowledgments   256
References   256
Life and health

Chapter 7. Population risk management: reducing costs and managing risk in health insurance   261
1  Background   261
1.1  What is a high-risk member?   262
1.2  Elements of population risk management   265
2  Identification (targeting) of high-risk populations   267
2.1  Data: available sources   267
2.1.1  Medical charts   267
2.1.2  Survey data   268
2.1.3  Medical claims   269
2.1.4  Pharmacy claims   270
2.1.5  Laboratory values   271
2.1.6  Conclusions on data   271
2.2  Implementation issues and how they affect prediction methodologies   271
2.2.1  Goals   272
2.2.2  Budgets   272
2.2.3  Staffing   272
2.2.4  Computing resources   272
2.2.5  Data warehousing   273
2.3  Prediction methodologies   273
2.3.1  Clinical methods   275
2.3.2  Statistical methods   277
3  Application of interventions and other risk management techniques   293
3.1  Targeting the right members   293
3.2  Effectiveness and outcomes   294
3.3  Results, including a methodology for optimizing inputs to and estimating return on investment from intervention programs   295
3.4  Using the risk management economic model   295
4  Summary and conclusions   297
References   298
Asset-liability management and investments

Chapter 8. A fuzzy set theoretical approach to asset and liability management and decision making   301
1  Introduction   301
2  Fuzzy numbers and their arithmetic operations   303
2.1  Definitions   303
2.2  Fuzzy arithmetic operations   304
3  Fuzzy immunization theory   308
4  Fuzzy matching of assets and liabilities   310
5  Bayesian decision method   314
6  Fuzzy Bayesian decision method   324
7  Decision making under fuzzy states and fuzzy alternatives   327
8  Conclusions   331
References   332
Industry issues

Chapter 9. Using neural networks to predict failure in the marketplace   337
1  Introduction   337
2  Overview and background   338
2.1  Models for firm failure   338
2.2  Neural network and artificial intelligence background   340
2.2.1  The general neural network model   341
2.2.2  Network neural processing units and layers   343
2.2.3  The back-propagation algorithm   344
3  Neural network methods for life insurer insolvency prediction   345
4  Neural network methods for property-liability insurer insolvency prediction   356
5  Conclusion and further directions   360
References   362

Chapter 10. Illustrating the explicative capabilities of Bayesian learning neural networks for auto claim fraud detection   365
1  Introduction   365
2  Neural networks for classification   368
3  Input relevance determination   376
4  Evidence framework   378
5  PIP claims data   384
6  Empirical evaluation   388
7  Conclusion   393
References   394
Chapter 11. Merging soft computing technologies in insurance-related applications   401
1  Introduction   401
2  Advantages and disadvantages of NNs, FL and GAs   403
3  NNs controlled by FL   404
3.1  Inputs and weights   405
3.2  Learning rates and momentum coefficients   408
4  NNs generated by GAs   410
4.1  Implementing GAs   411
5  Fuzzy inference systems (FISs) tuned by GAs   412
6  FISs tuned by NNs   415
7  GAs controlled by FL   419
8  Neuro-fuzzy-genetic systems   421
9  Conclusions   422
Acknowledgments   424
References   424

Part 2  Other computational techniques

Property and casualty

Chapter 12. Robustness in Bayesian models for bonus-malus systems   435
1  Introduction   435
2  The models   437
2.1  The likelihood   437
2.2  The structure functions   438
2.3  Conjugate posterior structure functions   439
2.3.1  Negative binomial-generalized Pareto model   439
2.3.2  Poisson-generalized inverse Gaussian model   440
3  Premium calculation in a bonus-malus system   441
4  Technical results   444
4.1  Transition rules in a robustness model   447
5  Illustrations   449
6  Conclusions and further works   456
Acknowledgments   460
Appendix A: Proof of Theorem 2   461
References   462

Chapter 13. Using logistic regression models to predict and understand why customers leave an insurance company   465
1  Introduction   466
2  Qualitative dependent variable models   470
2.1  Model specification   470
2.2  Estimation and inference   473
2.3  Stages in the modeling process   475
3  Customer information   477
4  Empirical results   481
5  Conclusions   486
Acknowledgments   488
References   489

Life and health

Chapter 14. Using data mining for modeling insurance risk and comparison of data mining and linear modeling approaches   493
1  Introduction   495
2  Data mining - the new methodology for analysis of large data sets - areas of application of data mining in insurance   497
2.1  Origins and definition of data mining. Data mining and traditional statistical techniques   498
2.2  Data mining and on-line analytical processing ("OLAP")   499
2.3  The use of data mining within the insurance industry   499
3  Data mining methodologies. Decision trees (CART), MARS and hybrid models   500
3.1  Classification and regression trees (CART)   500
3.2  Multivariate adaptive regression splines (MARS)   502
3.3  Hybrid models   503
4  Case study 1. Predicting, at the outset of a claim, the likelihood of the claim becoming serious   504
4.1  Problem and background   505
4.2  The data, its description and preparation for the analysis   505
4.3  The analysis, its purposes and methodology   506
4.3.1  Analysis using CART   506
4.3.2  Brief discussion of the analysis using logistic regression and comparison of the two approaches   510
4.4  Findings and results, implementation issues and client feedback   511
4.5  Conclusion and future directions   512
5  Case study 2. Health insurer claim cost   512
5.1  Background   512
5.2  Data   513
5.3  Overall modeling approach   514
5.4  Modeling methodology   515
5.5  Model diagnostic and evaluation   517
5.5.1  Gains chart for total expected hospital claims cost   517
5.5.2  Actual versus predicted chart   518
5.6  Findings and results   519
5.6.1  Hospital cost model precision   519
5.6.2  Predictor importance for hospital cost   519
5.7  Implementation and client feedback   520
6  Neural networks   520
7  Conclusion   520
Acknowledgments   521
References   521
Chapter 15. Emerging applications of the resampling methods in actuarial models   523
1  Introduction   524
1.1  The concept   525
1.2  Bootstrap standard error and bias estimates   526
1.3  Bootstrap confidence intervals   527
1.4  Dependent data   528
2  Modeling US mortality tables   530
2.1  Carriere mortality law   530
2.2  Fitting the mortality curve   532
2.3  Statistical properties of the parameter estimates in Carriere mortality model   533
2.4  Assessment of the model accuracy with parametric bootstrap   535
3  Methodology of cash-flow analysis with resampling   540
3.1  Interest rates process   540
3.2  Modeling interest rates with nonparametric bootstrap of dependent data   541
3.3  Model company assumptions   543
3.4  Interest rates process assumptions   545
3.5  Bootstrapping the surplus-value empirical process   547
3.6  Bootstrap estimates of the surplus cumulative distribution   551
3.7  Estimates based on the yields on the long-term treasury bonds for 1953-76   553
3.8  Comparison with the parametric analysis results   555
4  Conclusions   557
Acknowledgments   558
References   558
Asset-liability management and investments

Chapter 16. System intelligence and active stock trading   563
1  Introduction   564
2  Determination of the asset universe   566
3  Implementation   568
4  Model description   571
5  Results   574
6  Conclusions and further research   577
Acknowledgments   579
Appendix A: Partitioning around medoids (PAM)   580
Appendix B: Filter/smoother model   583
References   585
Chapter 17. Asset allocation: investment strategies for financial and insurance portfolio   587
1  Introduction   587
2  Single-period Markowitz model   589
3  Multi-period mean-variance model   590
3.1  Model and problem formulation   591
3.2  Model assumption   592
3.3  A useful auxiliary problem   593
4  A brief review of Merton's model   597
5  Continuous-time VaR optimal portfolio   600
5.1  Value-at-risk   600
5.2  Model and problem formulation   601
5.3  Solution approach   604
6  Continuous-time CaR formulation   606
6.1  Model and CaR   606
6.2  Problem formulation   609
7  Optimal investment strategy for insurance portfolio   611
8  Conclusions   613
Acknowledgments   614
Appendix 1: Proof of Proposition 5   615
Appendix 2: Proof of Proposition 6   616
Appendix 3: Proof of Proposition 7   618
References   620
Chapter 18. The algebra of cash flows: theory and application   625
1  Introduction   625
2  Term structure of interest rates   627
3  The cash flow polynomial ring   630
4  The economic multiplication and division of cash flows   633
5  The probabilistic notions of duration and convexity   637
6  Convex order and immunization theory   642
7  Immunization of liability cash flow products   645
8  Examples   646
9  Conclusions   648
Acknowledgment   648
References   649
Appendix 1: Proof of Theorem 3   651
Appendix 2: Proof of Theorem 10   652

Index   653

List of contributors   657
Part 1
Neural Networks, Fuzzy Systems, and Genetic Algorithms
Chapter 1

Insurance Applications of Neural Networks, Fuzzy Logic, and Genetic Algorithms

Arnold F. Shapiro
The insurance industry has numerous areas with potential applications for neural networks, fuzzy logic and genetic algorithms. Given this potential and the impetus on these technologies during the last decade, a number of studies have focused on insurance applications. This chapter presents an overview of these studies. The specific purposes of the chapter are twofold: first, to review the insurance applications of these technologies so as to document the unique characteristics of insurance as an application area; and second, to document the extent to which these technologies have been employed. Keywords: insurance, applications, neural networks, fuzzy logic, genetic algorithms
1 Introduction
Neural networks (NNs) are used for learning and curve fitting, fuzzy logic (FL) is used to deal with imprecision and uncertainty, and genetic algorithms (GAs) are used for search and optimization. These technologies often are linked together because they are the most commonly used components of what Zadeh (1992) called soft computing (SC), which he envisioned as being "... modes of computing in which precision is traded for tractability, robustness and ease of implementation."

The insurance industry has numerous areas with potential applications for these technologies. Some of these application areas, such as liability projections and mortality and morbidity studies, are unique to the insurance field; others, such as pricing, classification, underwriting, insolvency, and asset and investment models, are common to all financial intermediaries, but have features unique to the insurance area.

Given this potential and the impetus on NNs, FL and GAs during the last decade, it is not surprising that a number of studies have focused on insurance applications. This chapter presents an overview of these studies. The specific purposes of the chapter are twofold: first, to review NN, FL and GA applications in insurance so as to document the unique characteristics of insurance as an application area; and second, to document the extent to which these technologies have been employed.

The chapter has a separate section devoted to each of these technologies. Each section begins with a brief description of the technology for the reader who is unfamiliar with the topic, and then the insurance applications of the technology are reviewed. The reviews cover much of the literature and are intended to show where each technology has made inroads. The chapter ends with a comment on the merging of the technologies.
2 Neural Network (NN) Applications
NNs, first explored by Rosenblatt (1959) and Widrow and Hoff (1960), are computational structures with learning and generalization capabilities. Conceptually, they employ a distributive technique to store knowledge acquired by learning with known samples and are used for pattern classification, prediction and analysis, and control and optimization. Operationally, they are software programs that emulate the biological structure of the human brain and its associated neural complex (Bishop, 1995).
2.1 An Overview of NNs
The core of a NN is the neural processing unit (neuron), a representation of which is shown in Figure 1.

Figure 1. Neural processing unit.
The inputs to the neuron, x_j, are multiplied by their respective weights, w_j, and aggregated. The weight w_0 serves the same function as the intercept in a regression formula. The weighted sum is then passed through an activation function, F, to produce the output of the unit. Often, the activation function takes the form of the logistic function F(z) = (1 + e^(-z))^(-1), where z = Σ_j w_j x_j, as shown in the figure.

The NN can be either supervised or unsupervised. The distinguishing feature of a supervised NN is that its input and output are known and its objective is to discover a relationship between the two. The distinguishing feature of an unsupervised NN is that only the input is known and the goal is to uncover patterns in the features of the input data. The remainder of this subsection is devoted to an overview of supervised and unsupervised NNs.
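To make the weighted-sum-plus-activation computation concrete, the following minimal sketch evaluates a single neural processing unit with a logistic activation. It is an illustration only, not code from any of the studies reviewed here; the input, weight and bias values are assumptions chosen for the example.

```python
import math

def logistic(z):
    """Logistic activation F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs (plus the bias weight w_0) passed through F."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(z)

# Illustrative values: two inputs, two weights and a bias term.
print(neuron_output([0.5, 1.2], [0.8, -0.3], bias=0.1))  # about 0.535
```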
2.1.1 Supervised NNs
This section gives a brief overview of supervised NNs. It covers the basic architecture of a supervised NN and its learning rules and operation.
A Three-Layer NN

A supervised NN, also referred to as a multilayer perceptron (MLP) network, is composed of layers of neurons, an example of which is the three-layer NN depicted in Figure 2.

Figure 2. A simple representation of an FFNN.
Extending the notation associated with Figure 1, the first layer, the input layer, has three neurons (labeled n_0j, j = 0, 1, 2), the second layer, the hidden processing layer, has three neurons (labeled n_1j, j = 0, 1, 2), and the third layer, the output layer, has one neuron (labeled n_21). There are two input signals, x_1 and x_2. The neurons are connected by the weights w_ijk, where the subscripts i, j, and k refer to the i-th layer, the j-th node of the i-th layer, and the k-th node of the (i+1)-st layer, respectively. Thus, for example, w_021 is the weight that connects node 2 of the input layer (layer 0) to node 1 of the hidden layer (layer 1). The goal is for the output of the NN to be arbitrarily close to a target output.

The Learning Rules

The weights of the network serve as its memory, and so the network "learns" when its weights are updated. The updating is done using a learning rule, a common example of which is the Delta rule (Shepard, 1997, p. 15), which is the product of a learning rate, which controls the speed of convergence, an error signal, and the value associated with the j-th node of the i-th layer. The choice of the learning rate is critical: if its value is too large, the error term may not converge at all, and if it is too small, the weight updating process may get stuck in a local minimum and/or be extremely time intensive.
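As a hedged illustration of the Delta rule just described (a sketch, not the chapter's own code), the weight change for a single unit can be written as the product of a learning rate, an error signal, and the incoming node value. The numeric values below are assumptions chosen only for the example.

```python
def delta_rule_update(weights, inputs, target, output, learning_rate=0.1):
    """Return updated weights: w_j <- w_j + eta * (target - output) * x_j."""
    error = target - output  # error signal
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]

# Illustrative update for one training pattern.
w_new = delta_rule_update(weights=[0.8, -0.3], inputs=[0.5, 1.2],
                          target=1.0, output=0.535)
print(w_new)  # [0.82325, -0.2442]
```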
The Operation of a Supervised NN

A sketch of the operation of a supervised NN is shown in Figure 3.
Figure 3. The operation of a supervised NN. (The flowchart initializes the architecture, weights, learning rate (η) and momentum (α), assigns input-output values, computes the hidden layer and output values, and adjusts the weights until a stopping rule is met.)
Since supervised learning is involved, the system will attempt to match the input with a known target, such as firms that have become insolvent or claims that appear to be fraudulent. The process begins by assigning random weights to the connection between each set of neurons in the network. These weights represent the intensity of the connection between any two neurons. Given the weights, the intermediate values (a hidden layer) and the output of the system are computed. If the output is optimal, in the sense that it is sufficiently close to the target, the process is halted; if not, the weights are adjusted and the process is continued until an optimal solution is obtained or an alternate stopping rule is reached.

If the flow of information through the network is from the input to the output, it is known as a feedforward network. The NN is said to involve back-propagation if inadequacies in the output are fed back through the network so that the algorithm can be improved. We will refer to this network as a feedforward NN with backpropagation (FFNN with BP). An instructional tour of the FFNN methodology in a casualty actuarial context can be found in Francis (2001). The ideas in that paper are expanded upon and extended in Chapters 2 and 3 of this volume.
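The loop just described, a forward pass followed by feeding the output error back to adjust the weights, can be sketched as below. This is a minimal illustration under assumed data, layer sizes and learning rate, not the implementation used in any study cited in this chapter; it trains a one-hidden-layer network with logistic activations by plain gradient-descent backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative training set: two inputs per pattern, one known target.
X = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5], [0.9, 0.7]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])

# Random starting weights for a 2-3-1 network (bias columns included).
W1 = rng.normal(scale=0.5, size=(3, 3))   # input (+bias) -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden (+bias) -> output
eta = 0.5                                  # learning rate

for _ in range(5000):
    # Forward pass: hidden layer values, then output values.
    Xb = np.hstack([np.ones((len(X), 1)), X])
    H = sigmoid(Xb @ W1)
    Hb = np.hstack([np.ones((len(H), 1)), H])
    out = sigmoid(Hb @ W2)

    # Back-propagate the output error and adjust the weights.
    err_out = (y - out) * out * (1 - out)
    err_hid = (err_out @ W2[1:].T) * H * (1 - H)
    W2 += eta * Hb.T @ err_out
    W1 += eta * Xb.T @ err_hid

print(np.round(out, 2))  # outputs approach the 0/1 targets
```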
2.1.2 Unsupervised NNs
This section discusses one of the most common unsupervised NNs, the Kohonen network (Kohonen 1988), which often is referred to as a self-organizing feature map (SOFM). The purpose of the network is to emulate our understanding of how the brain uses spatial mappings to model complex data structures. Specifically, the learning algorithm develops a mapping from the input patterns to the output units that embodies the features of the input patterns. In contrast to the supervised network, where the neurons are arranged in layers, in the Kohonen network they are arranged in a planar configuration and the inputs are connected to each unit in the network. The configuration is depicted in Figure 4.

Figure 4. Two-dimensional Kohonen network.
As indicated, the Kohonen SOFM may be represented as a two-layered network consisting of a set of input units in the input layer and a set of output units arranged in a grid called a Kohonen layer. The input and output layers are totally interconnected and there is a weight associated with each link, which is a measure of the intensity of the link. A sketch of the operation of a SOFM is shown in Figure 5.
Figure 5. Operation of a Kohonen network.
The first step in the process is to initialize the parameters and organize the data. This entails setting the iteration index, t, to zero, the interconnecting weights to small positive random values, and the learning rate to a value smaller than but close to 1. Each unit has a neighborhood of units associated with it, and empirical evidence suggests that the best approach is to have the neighborhoods fairly broad initially and then to have them decrease over time. Similarly, the learning rate is a decreasing function of time.

Each iteration begins by randomizing the training sample, which is composed of P patterns, each of which is represented by a numerical vector. For example, the patterns may be composed of solvent and insolvent insurance companies and the input variables may be financial ratios. Until the number of patterns used (p) exceeds the number available (p > P), the patterns are presented to the units on the grid, each of which is assigned the Euclidean distance between its connecting weight to the input unit and the value of the input. This distance is given by [Σ_j (x_j - w_ij)^2]^0.5, where w_ij is the connecting weight between the j-th input unit and the i-th unit on the grid and x_j is the input from unit j. The unit that is the best match to the pattern, the winning unit, is used to adjust the weights of the units in its neighborhood. For this reason the SOFM is often referred to as a competitive NN. The process continues until the number of iterations exceeds some predetermined value (T).

In the foregoing training process, the winning units in the Kohonen layer develop clusters of neighbors, which represent the class types found in the training patterns. As a result, patterns associated with each other in the input space will be mapped to output units that also are associated with each other. Since the class of each cluster is known, the network can be used to classify the inputs.
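A compact sketch of the competitive training step described above is given below. It is an illustrative implementation under simplifying assumptions (random data standing in for, say, standardized financial ratios; a small grid; exponentially decaying learning rate and neighborhood radius), not the procedure used in any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_sofm(patterns, grid_shape=(4, 4), T=500, lr0=0.9, radius0=2.0):
    """Train a Kohonen self-organizing feature map on the given patterns."""
    rows, cols = grid_shape
    n_features = patterns.shape[1]
    # Grid coordinates of each output unit and small random initial weights.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    weights = rng.uniform(0.0, 0.1, size=(rows * cols, n_features))

    for t in range(T):
        lr = lr0 * np.exp(-t / T)            # decreasing learning rate
        radius = radius0 * np.exp(-t / T)    # shrinking neighborhood
        for x in rng.permutation(patterns):  # randomized training sample
            # Winning unit: smallest Euclidean distance to the pattern.
            dists = np.linalg.norm(weights - x, axis=1)
            winner = np.argmin(dists)
            # Move the winner and its grid neighbors toward the pattern.
            grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += lr * influence[:, None] * (x - weights)
    return weights

# Illustrative input: 50 patterns with 3 standardized "financial ratios".
data = rng.normal(size=(50, 3))
trained = train_sofm(data)
print(trained.shape)  # (16, 3): one weight vector per grid unit
```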
2.2 Applications
A survey of NN applications in insurance is provided in this section. This includes hybrid applications where the NN is the dominant technology. For the most part, these areas of application include underwriting, classification, asset and investment models, insolvency studies, and projected liabilities. Since this is a relatively new area of analysis, a number of the studies also include comparisons with rival approaches, such as discriminant analysis (DA) and logistic regression (LR).¹

The overviews in this section do not include a discussion of specific steps unless the methodology of the study had some notable feature. For example, while not specifically mentioned, the methodology for each study involving a FFNN with BP generally included a feasibility stage, during which univariate and bivariate analysis was performed on the data set to determine the feasibility of the study, a training stage, a testing stage and a validation stage. Commonly, the index of discrimination for each of the NNs, as well as for the scoring system, was defined to be the area underneath their respective receiver operating characteristic (ROC) curves.²

¹ Discriminant analysis was the dominant methodology from the late 1960s to the 1980s, when logistic analysis was commonly used. See Jang (1997: 22, 24) for the assumptions underlying discriminant analysis and an explanation of why logistic regression is preferred.
² A ROC curve is a plot of the true positive rate against the false positive rate for the different possible cutpoints of a diagnostic test. See Alsing et al. (1999) and Viaene et al. (2001) for an in-depth discussion of ROC curves.
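Since the area under the ROC curve is used repeatedly below as the index of discrimination, a small, hedged sketch of that calculation follows. It computes the AUC directly from scores and binary labels via the rank (Mann-Whitney) formulation; the scores and labels are made-up values, not data from any study cited here.

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen
    positive case is scored higher than a randomly chosen negative case
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative fraud scores and true labels (1 = fraudulent claim).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # 8/9, about 0.889
```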
2.2.1 Underwriting
Underwriting is the process of selection through which an insurer determines which of the risks offered to it should be accepted, and the conditions and amounts of the accepted risks. The goal of underwriting is to obtain a safe, yet profitable, distribution of risks. This section reviews two NN applications in the area, mortgage insurance and bonding.

One of the early applications of FFNNs to the problem of underwriting was the study by Collins et al. (1988), who used a committee of NNs to replicate the decisions made by mortgage insurance underwriters. They used an architecture of nine coupled networks in a 3x3 arrangement where each of the three networks at each of the three levels focused on a non-exclusive subset of the full feature space. Every 'expert' network rendered a decision by itself, but every triad cooperated by searching for an agreement (or consensus) between networks of a level before the system yielded a classification. The system reached a high degree of agreement with human underwriters when tested on previously unseen examples, and it was found that when the system and the underwriter disagreed, the system's classifications were more consistent with the guidelines than the underwriter's judgment.

Bakheet (1995) used the FFNN with BP as the pattern classification tool in construction bond underwriting. Surety companies bond construction contractors based on the evaluation of a set of decision factors, such as character, capital, capacity, and continuity. The study used an independent pattern classification module for each of these four factors and an integrated pattern classification model. The conclusion was that the model was an efficient method for handling the evaluation. It is worth noting that Bakheet preferred to optimize the NN structure heuristically rather than use the structure suggested by a built-in genetic optimizer.

2.2.2 Classification
Classification is fundamental to insurance. On the one hand, classification is the prelude to the underwriting of potential coverage, while on the other hand, risks need to be properly classified and segregated for pricing purposes. This section discusses representative classification areas where NNs have been employed: consumer behavior analysis, the classification of life insurance applicants, and the detection of fraudulent insurance claims.

Van Wezel et al. (1996) constructed a data-mining tool based on a SOFM to investigate the consumer behavior of insurance company customers. Their goal was to determine the number of underlying dimensions influencing consumer behavior. They began with a discussion of the underlying model and problem, and the observation that they could not minimize their error measure using a gradient method, because it was not differentiable, which led them to use a competitive NN. After discussing the specifics of the NN used to solve the problem, they show the results of their method on artificially generated data and then on actual data.

Vaughn et al. (1997) used a multilayer perceptron network to classify applicants for life insurance into standard and non-standard risk. They then used their knowledge discovery method (Vaughn 1996) to identify the significant, or key, inputs that the network uses to classify applicants. The ranking of these inputs enables the knowledge learned by the network during training to be presented in the form of data relationships and induced rules. These validated that the network learned sensibly and effectively when compared with the training data set.

Brockett et al. (1998) used a SOFM to uncover automobile bodily injury claims fraud in the insurance industry and a FFNN with BP to validate the feature map approach. In the case of the latter, a GA was used to find the optimal three-layered configuration. The focus of the study was to determine whether a claim is fraudulent and the level of suspicion of fraud associated with the claim record file. They concluded that the consistency and reliability of the fraud assessment of the SOFM exceeded that of both an insurance adjuster and an insurance investigator in assessing the suspicion level of bodily injury claims. They suggested that a similar methodology could be used for detecting fraud in Medicare claims.

Viaene et al. (2001) reported on an exploratory study for benchmarking the predictive power of several prominent state-of-the-art binary classification techniques in the context of early screening for fraud in personal injury protection automobile insurance claims. To this end, the predictive power of logistic regression, C4.5 decision tree (an algorithm for inducing decision trees from data), k-nearest neighbor, Bayesian learning MLP, least squares support vector machine (a classification method that separates the training data into two classes), and Bayes classifiers were compared. The comparisons were made both analytically and visually. Overall, they found that logit and the Bayesian learning MLP both performed excellently in the baseline predictive performance benchmarking study, whereas none of the C4.5 variants attained a comparable predictive performance. Chapter 10 of this book, by Viaene et al., elaborates on this study and extends it.

2.2.3 Asset and Investment Models
The analysis of assets and investments is a major component in the management of an insurance enterprise. Of course, this is true of any financial intermediary, and many of the functions performed are uniform across financial companies. Thus, insurers are involved with market and individual stock price forecasting, the forecasting of currency futures, credit decision-making, forecasting direction and magnitude of changes in stock indexes, and so on. Numerous examples of these applications are contained in Refenes et al. (1996), which focuses on the categories of derivatives and term-structure models, foreign exchange, equities and commodities, and corporate distress and risk models. Two examples involving cash flow issues follow.

Boero and Cavalli (1996) investigated a NN model for forecasting the exchange rate between the Spanish peseta and the US dollar. Their goal was to examine whether potentially highly nonlinear NN models outperform traditional methods or give at least competitive results. To this end, they compared the performance of NNs with four linear models, a random walk process and three different specifications based on the purchasing power parity theory. They found mixed results. In experiments with quarterly data, they found no advantage in the use of NNs for forecasting the exchange rate, while the performance of the NNs clearly improved when they were trained on monthly data.

Lokmic and Smith (2000) investigated the problem of forecasting when checks issued to retirees and beneficiaries will be presented for payment, in order to forecast daily cash flow requirements. The goal was to optimize the levels of funds kept in the company's bank account to avoid both overdraft charges and an over-commitment of investment funds. In their preprocessing stage, they used SOFMs to cluster the checks into homogeneous groups. They then employed a FFNN with BP to arrive at a forecast for the date each check will be presented. They had mixed results. While they concluded that preprocessing was effective, there appeared to be no significant improvement in the cash flow forecasting over a simple heuristic approach.

2.2.4 Insolvency
A primary consideration of regulators is insurer solvency. In the past, most studies of insurance company failure prediction used statistical methods, but recently NNs have been brought to bear on the problem. This section presents an overview of five such studies and their conclusions with respect to traditional statistical methods. Three of the papers address insolvency from the perspective of a nonlife insurer while the other two articles focus on life insurers.

One of the earlier studies was conducted by Park (1993), who used a NN designed by a GA to predict the bankruptcy of insurance companies, and to compare the results with those obtained using DA, LR, Classification And Regression Trees (CART), a procedure for analyzing categorical (classification) or continuous (regression) data (Breiman et al. 1984), and ID3, which is similar to CART. Using a GA to suggest the optimal structure and network parameters was found to be superior to using randomization in the design of the NN, a finding that was confirmed in a study of bank insolvencies. The major conclusion of the study was that the robust NN model outperformed the statistical models. The study also addressed the issue of conflicts in model performance. While previous studies attributed the performance of a model to the application domain, this study argued that the performance of a model is affected by the distribution of independent variables and can be compared from the viewpoint of task characteristics.

A similar study was undertaken by Brockett et al. (1994), who used a three-layer FFNN with BP to develop an early warning system for U.S. property-liability insurers two years prior to insolvency. The results of the NN method were compared with those of DA, the rating of an insurer rating organization, and the ratings of a national insurance regulatory body. The conclusion was that the NN approach outperformed the DA and did far better than the rating organization and the regulators. Generally speaking, the NN results showed high predictability and generalizability, suggesting the usefulness of this method for predicting future insurer insolvency.

Huang et al. (1994) used a NN optimized with a GA to forecast financial distress in life insurers, and compared the forecasts to those obtained using DA, k-nearest neighbor, and logit analysis. The data was limited to Insurance Regulatory Information System Ratios for the total population of insurers. They concluded that the NN dominates the other methods for both in-sample fit and out-of-sample forecasting.

Jang (1997) undertook a similar comparative analysis. Here the focus was on a comparison of multiple DA and LR analysis with a FFNN with BP, learning vector quantization³ (LVQ) and SOFM. As with the Huang et al. (1994) study, the FFNN with BP outperformed the traditional statistical approaches for all data sets with a consistent superiority across the different evaluation criteria, as did the LVQ. The SOFM supported these findings by showing the distinct areas of bankrupt and non-bankrupt companies geographically, which suggests that the SOFM, with its easier visual interpretation, can be used as a management tool by both insurance regulators and the companies themselves. They also used a misclassification cost approach, like Huang et al.

Kramer (1997) combined an ordered logit model with a FFNN with BP and an expert system to investigate the solvency of Dutch non-life insurers. The inputs to the study were the financial indicators of solvency ratio, profit ratio, solvency margin, investment ratios, and growth of the combined ratio. Both the logit model and the NN showed good performance for weak and strong companies, but did poorly on the classification of moderate companies. The results were markedly improved, however, when the combined output of the mathematical models was used as input to the rule-based expert system. The output of the early warning system was the priority (high, medium, low) for further investigation of a specific company.
³ Learning vector quantization is a two-step method of designing examples for use in nearest-neighbor procedures, which involves finding cluster centers using the k-means clustering algorithm, and then incrementally adapting the centers in order to reduce the number of misclassifications.
Chapter 9 of this volume, by Brockett et al., gives a current perspective on solvency in the insurance industry as well as a more detailed discussion of some of the articles in this section.

2.2.5 Projected Liabilities
The evaluation of projected liabilities is fundamental to the insurance industry, so it is not surprising that we are beginning to see NNs applied in this area. The issue on the life side is mortality and morbidity rates, while on the property and casualty side it is property damage and liability. The projected liability examples discussed below relate to intensive care, dental caries, in-hospital complications, insurance reserves and claim duration.

Tu (1993) compared NN and LR models on the basis of their ability to predict length of stay in the intensive care unit following cardiac surgery. Structurally, the training set and test set each consisted of a large equal-sized sample of patients, and five unique risk strata were created using each model. The advantages and disadvantages of each modeling technique were explored, with the conclusion that either model could potentially be used as a risk stratification tool for counseling patients and scheduling operations.

Saemundsson (1996) used a NN to address the problem of identifying individuals at risk of developing dental caries. The focus of the research was to compare the predictive capabilities of clinicians, NNs and LR. The study revealed that human prediction of dental disease was at a comparable level of performance to LR, and that the best predictive performance of any method in the study was reached by eliminating some of the uncertainty introduced by dental treatment interventions into the natural process of dental caries, and by using a NN prediction model with human prediction as one of the inputs.

An example of the use of NNs for a mortality study was provided by Ismael (1999), who investigated the feasibility of using NNs to reliably predict in-hospital complications and 30-day mortality for acute myocardial infarction patients. A large database of American patients was used in this analysis and 22 history and physical variables were analyzed as predictors for 16 distinct complications. The conclusion of the study was that the NNs proved to be successful in predicting only four of the 16 specific complications: death within 30 days, shock, asystole, and congestive heart failure or pulmonary edema. The scoring system was found to have had a rather low discriminatory ability.

Magee (1999) reported on a comparative analysis of FFNNs with BP and traditional actuarial methods for estimating casualty insurance reserves. Specifically, when the NNs were compared to the chain ladder, the additive method, and the least squares regression method, the NNs performed well in terms of both absolute value error and bias introduced by the methodology. A second objective of the study was to compare the simulated data to industry data. Magee found that the NNs outperformed the traditional actuarial methods with respect to these two data sets, which may be attributed to a NN's ability to model complex nonlinear patterns. His final conclusion was that NNs are more robust across data sets with different attributes.

Speights et al. (1999) used a generalization of a FFNN with BP to predict claim duration in workers' compensation data in the presence of right censoring and covariate claim attributes. Right censoring occurs when only intermediate, but not final, values of claim duration are known for some data points, and final values of claim duration are known for all other observations. They used a connection between least squares estimation and maximum likelihood estimation to establish the generalization, and showed that their methodology could make accurate predictions in the presence of right-censored data and many covariates.

Two authors in the current volume address NNs and projected liabilities. Francis investigates the chain ladder method in Chapter 2 and claim severity and loss development in Chapter 3. Duncan focuses on future high-cost members from the perspective of population risk management in Chapter 7.
2.2.6 Comment
Notably absent from the foregoing NN articles are studies related to ratemaking and pricing. Dugas et al., Chapter 4, and Yeo and Smith, Chapter 5, focus on these issues.
3 Fuzzy Logic (FL) Applications
Fuzzy logic (FL) and fuzzy set theory⁴ (FST), which were formulated by Zadeh (1965), were developed as a response to the fact that most of the parameters we encounter in the real world are not precisely defined. As such, it gives a framework for approximate reasoning and allows qualitative knowledge about a problem to be translated into an executable set of rules. This reasoning and rule-based approach is then used to respond to new inputs. In this section, we review the literature where FL has been used to resolve insurance problems.⁵
3.1 An Overview of FL
Before proceeding to a review of the literature, this segment gives an overview of four FL concepts that are often used in articles: linguistic variables, fuzzy numbers, fuzzy inference systems, and fuzzy clustering.

3.1.1 Linguistic Variables
Linguistic variables are the building blocks of FL. They may be defined (Zadeh, 1975, 1981) as variables whose values are expressed as words or sentences. Risk capacity, for example, may be viewed both as a numerical value ranging over the interval [0, 100%] and as a linguistic variable that can take on values like high, not very high, and so on. Each of these linguistic values may be interpreted as a label of a fuzzy subset of the universe of discourse X = [0, 100%], whose base variable, x, is the generic numerical value risk capacity. Such a set, an example of which is shown in Figure 6, is characterized by a membership function (MF), μ_high(x), which assigns to each object a grade of membership ranging between zero and one.
[Figure 6 plots μ_high(x) against risk capacity in %, rising from 0 below 50% to 1 above 80%, together with its complement, NOT high.]
Figure 6. (Fuzzy) Set of clients with high risk capacity.
In this case, which represents the set of clients with a high risk capacity, individuals with a risk capacity of 50 percent or less are assigned a membership grade of zero, and those with a risk capacity of 80 percent or more are assigned a grade of one. Between those risk capacities, (50%, 80%), the grade of membership is fuzzy. Fuzzy sets are implemented by extending many of the basic identities that hold for ordinary sets. Thus, for example, the union of fuzzy sets A and B is the smallest fuzzy set containing both A and B, and the intersection of A and B is the largest fuzzy set contained in both A and B.
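To make the membership-function idea concrete, the short Python sketch below implements an illustrative version of the Figure 6 set (the exact shape of the ramp between 50% and 80% is an assumption) together with the standard Zadeh operators, under which the union and intersection described above reduce to the pointwise max and min of the membership grades.

```python
# A minimal sketch (not from the chapter): the membership function of
# "high risk capacity" from Figure 6, plus the standard (Zadeh) union
# and intersection operators, max and min.

def mu_high(x):
    """Grade of membership in 'high risk capacity' for x in [0, 100] percent.

    Zero up to 50%, one from 80% on, and a linear ramp in between
    (the exact shape of the ramp is an illustrative assumption).
    """
    if x <= 50.0:
        return 0.0
    if x >= 80.0:
        return 1.0
    return (x - 50.0) / 30.0

def fuzzy_union(mu_a, mu_b):
    """Membership function of A OR B under the max operator."""
    return lambda x: max(mu_a(x), mu_b(x))

def fuzzy_intersection(mu_a, mu_b):
    """Membership function of A AND B under the min operator."""
    return lambda x: min(mu_a(x), mu_b(x))

def mu_not(mu_a):
    """Membership function of NOT A (standard complement)."""
    return lambda x: 1.0 - mu_a(x)

if __name__ == "__main__":
    for x in (40, 60, 75, 90):
        both = fuzzy_intersection(mu_high, mu_not(mu_high))(x)
        print(f"risk capacity {x}%: mu_high = {mu_high(x):.2f}, "
              f"mu_(high AND NOT high) = {both:.2f}")
```

Note that, unlike crisp sets, a fuzzy set and its complement can overlap, which is why the "high AND NOT high" grade above is nonzero between 50% and 80%.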
3.1.2 Fuzzy Numbers
Fuzzy numbers are numbers that have fuzzy properties, examples of which are the notions of "around six percent" and "relatively high." The general characteristic of a fuzzy number (Zadeh, 1975; Dubois and Prade, 1980) is represented in Figure 7.
[Figure 7 plots the membership function μ_M(y) of a fuzzy number M, with corner points m1, m2, m3, m4 and inverse functions f1(y|M) = m1 + y(m2 - m1) and f2(y|M) = m4 - y(m4 - m3).]

Figure 7. A fuzzy number.
This shape of a fuzzy number is referred to as a "flat" fuzzy number; if m2 were equal to m3, it would be referred to as a "triangular" fuzzy number. The fuzzy number is characterized by the points mj, j = 1, 2, 3, 4, and the functions fj(y|M), j = 1, 2, where M is a fuzzy number; these inverse functions map the membership grade onto the real line. They were first used by Buckley (1987), who used fuzzy arithmetic to develop the present value for annuities-certain and other financial transactions. As indicated, a fuzzy number usually is taken to be a convex fuzzy subset of the real line.

As one would anticipate, fuzzy arithmetic can be applied to fuzzy numbers. Using the extension principle (Zadeh, 1975), the nonfuzzy arithmetic operations can be extended to incorporate fuzzy sets and fuzzy numbers. Briefly, if * is a binary operation such as addition (+) or the logical min (∧) or max (∨), the fuzzy number z, defined by z = x * y, is given as a fuzzy set by

μ_z(w) = sup_{u,v} [ μ_x(u) ∧ μ_y(v) ],   u, v, w ∈ ℝ,   (1)

subject to the constraint that w = u * v, where μ_x, μ_y, and μ_z denote the membership functions of x, y, and z, respectively, and sup_{u,v} denotes the supremum over u, v. Chang, in Chapter 8 of this volume, addresses fuzzy numbers and their arithmetic operations.
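As a hedged illustration of how this arithmetic is used in practice, the sketch below adds two trapezoidal ("flat") fuzzy numbers. For addition, the sup-min operation in Eq. (1) reduces to interval arithmetic on the α-cuts, which here amounts to adding the corresponding corner points m1, ..., m4; the numerical values are invented for the example.

```python
# A minimal sketch (not the chapter's notation made executable): addition of
# two trapezoidal ("flat") fuzzy numbers M = (m1, m2, m3, m4).  For addition,
# the sup-min extension principle in Eq. (1) reduces to interval arithmetic
# on the alpha-cuts, i.e. to adding the corresponding corner points.

from dataclasses import dataclass

@dataclass
class TrapezoidalFuzzyNumber:
    m1: float  # left foot (membership 0)
    m2: float  # left shoulder (membership 1)
    m3: float  # right shoulder (membership 1)
    m4: float  # right foot (membership 0)

    def alpha_cut(self, alpha):
        """Closed interval [f1(alpha|M), f2(alpha|M)] from the inverse functions."""
        return (self.m1 + alpha * (self.m2 - self.m1),
                self.m4 - alpha * (self.m4 - self.m3))

    def __add__(self, other):
        return TrapezoidalFuzzyNumber(self.m1 + other.m1, self.m2 + other.m2,
                                      self.m3 + other.m3, self.m4 + other.m4)

# "Around 6 percent" and a fuzzy loading, both illustrative values.
rate = TrapezoidalFuzzyNumber(0.05, 0.055, 0.065, 0.07)
load = TrapezoidalFuzzyNumber(0.01, 0.01, 0.02, 0.03)
total = rate + load
print(total)                 # corner points of the fuzzy sum
print(total.alpha_cut(0.5))  # interval supported at membership level 0.5
```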
3.1.3 A Fuzzy Inference System (FIS)

The fuzzy inference system (FIS) is a popular methodology for implementing FL. FISs are also known as fuzzy rule-based systems, fuzzy expert systems, fuzzy models, fuzzy associative memories (FAM), or fuzzy logic controllers when used as controllers (Jang et al. 1997, p. 73). The essence of the system can be represented as shown in Figure 8.
[Figure 8 depicts the processing pipeline: a crisp input passes through a fuzzification interface to become a fuzzy input; an inference engine, driven by a knowledge base consisting of a database of membership functions and a rule base, produces a fuzzy output; and a defuzzification interface converts it to a crisp output.]
Figure 8. A fuzzy inference system (FIS).
As indicated in the figure, the FIS can be envisioned as involving a knowledge base and a processing stage. The knowledge base provides MFs and fuzzy rules needed for the process. In the processing stage, numerical crisp variables are the input of the system. These variables are passed through a fuzzification stage where they are transformed to linguistic variables, which become the fuzzy input for the inference engine. This fuzzy input is transformed by the rules of the inference engine to fuzzy output. The linguistic results are then changed by a defuzzification stage into numerical values that become the output of the system.
(The numerical input can be crisp or fuzzy; in the latter event, the input does not have to be fuzzified.)
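The toy Python sketch below walks one crisp input through that pipeline for a hypothetical two-rule, Mamdani-style system relating a client's risk capacity to an equity allocation; the membership functions, rule base, and min-max/centroid choices are illustrative assumptions, not a system drawn from the literature reviewed here.

```python
# A minimal sketch of the Figure 8 pipeline (fuzzification -> rule-based
# inference -> defuzzification) for a hypothetical two-rule Mamdani-style
# system relating risk capacity (%) to an equity allocation (%).

def mu_low(x):   return max(0.0, min(1.0, (50.0 - x) / 30.0))
def mu_high(x):  return max(0.0, min(1.0, (x - 50.0) / 30.0))

# Output fuzzy sets defined on a discretized universe of equity allocations.
universe = list(range(0, 101, 5))
def mu_small(y): return max(0.0, min(1.0, (40.0 - y) / 40.0))
def mu_large(y): return max(0.0, min(1.0, (y - 40.0) / 40.0))

def infer(risk_capacity):
    # Fuzzification: degrees to which the crisp input matches each premise.
    w_low, w_high = mu_low(risk_capacity), mu_high(risk_capacity)

    # Inference: clip each consequent at its rule's firing strength and
    # aggregate with max (min-max inference).
    # Rule 1: IF risk capacity is low  THEN equity allocation is small.
    # Rule 2: IF risk capacity is high THEN equity allocation is large.
    aggregated = [max(min(w_low, mu_small(y)), min(w_high, mu_large(y)))
                  for y in universe]

    # Defuzzification: centroid of the aggregated fuzzy output.
    num = sum(y * m for y, m in zip(universe, aggregated))
    den = sum(aggregated) or 1.0
    return num / den

print(infer(30))   # low risk capacity -> smaller equity allocation
print(infer(75))   # high risk capacity -> larger equity allocation
```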
3.1.4 C-Means Algorithm
The foregoing fuzzy system allows us to convert and embed empirical qualitative knowledge into reasoning systems capable of performing approximate pattern matching and interpolation. However, these systems cannot adapt or learn because they are unable to extract knowledge from existing data. One approach for overcoming this limitation is to use a fuzzy clustering method such as the fuzzy c-means algorithm (Bezdek 1981). The essence of the c-means algorithm is that it produces reasonable centers for clusters of data, in the sense that the centers capture the essential features of the cluster, and then groups data vectors around cluster centers that are reasonably close to them. A flowchart of the c-means algorithm is depicted in Figure 9.
[Figure 9 flowchart: set the counter t = 0 and choose c (the number of clusters), m (the exponential weight), G (a positive-definite p×p shaping matrix), and ε (the tolerance); choose an initial partition U(0); calculate the fuzzy cluster centers; calculate a new partition U(t+1); stop when ||U(t+1) - U(t)|| ≤ ε, otherwise increment t and repeat.]
Figure 9. The c-means algorithm.
As indicated, the database consists of the n×p data matrix, X, where n indicates the number of patterns and p denotes the number of features. The algorithm seeks to segregate these n patterns into c clusters, 2 ≤ c ≤ n-1, where the within-cluster variances are minimized and the between-cluster variances are maximized. To this end, the algorithm is initialized by resetting the counter, t, to zero, and choosing: c, the number of clusters; m, the exponential weight,
which acts to reduce the influence of noise in the data because it limits the influence of small values of membership functions; G, a symmetric, positive-definite (all its principal minors have strictly positive determinants) p×p shaping matrix, which represents the relative importance of the elements of the data set and the correlation between them, examples of which are the identity and covariance matrices; and ε, the tolerance, which controls the stopping rule.

Given the database and the initialized values, the next step is to choose the initial partition (membership matrix), U(0), which may be based on a best guess or experience. Next, the fuzzy cluster centers are computed. Using these fuzzy cluster centers, a new (updated) partition, U(t+1), is calculated. The partitions are compared using the matrix norm ||U(t+1) - U(t)||, and if the difference exceeds ε, the counter, t, is increased and the process continues. If the difference does not exceed ε, the process stops.
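A compact NumPy sketch of this loop is given below. It assumes the shaping matrix G is the identity (so distances are Euclidean) and the exponential weight m = 2; the two-cluster data set is synthetic and purely illustrative.

```python
# A minimal NumPy sketch of the fuzzy c-means loop described above, assuming
# G is the identity (Euclidean distance) and exponential weight m = 2.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # initial partition U(0): columns sum to 1
    for _ in range(max_iter):
        # Fuzzy cluster centers: membership-weighted means of the patterns.
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        # Updated partition U(t+1) from distances to the new centers.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0)
        # Stopping rule: matrix norm of the change in the partition.
        if np.linalg.norm(U_new - U) <= eps:
            return centers, U_new
        U = U_new
    return centers, U

# Two illustrative "risk classes" in a two-feature space.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (20, 2)),
               np.random.default_rng(2).normal(5, 1, (20, 2))])
centers, U = fuzzy_c_means(X, c=2)
print(centers)            # approximate cluster centers near (0, 0) and (5, 5)
print(U[:, :3].round(2))  # fuzzy memberships of the first three patterns
```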
3.2 Applications
This section reports on five major insurance areas where FL has been implemented: underwriting, classification, pricing, asset and investment models, and projected liabilities. Readers who are also interested in potential areas of application will find a large number of them mentioned in Ostaszewski (1993).

3.2.1 Underwriting
The first attempt to use fuzzy theory to analyze the internal logic of the intuitive part of insurance underwriting was due to DeWit (1982). He made the case that an insurance contract is based on its (risk) premium estimated as exactly as possible, and also on a more intuitive kind of experience, which is expressed by the company's underwriting practice. His focus was on the use of basic fuzzy membership functions and their operations in an underwriting context.
Lemaire (1990) extended DeWit (1982) to a more extensive agenda for FL in insurance theory, most notably in the financial aspects of the business. His underwriting focus was on the definition of a preferred policyholder in life insurance. In this context, he compared the classical approach of using crisp membership functions with the use of grades of membership, modified by alpha-cuts.

Young (1993) used fuzzy sets to model the selection process in group health insurance. Applying the model of fuzzy multi-stage decision processes of Bellman and Zadeh (1970) to group health underwriting, Young defined fuzzy sets that characterize groups of good underwriting risks. First, single-plan underwriting was considered, and then the study was extended to multiple-option plans.

An example of a different underwriting dimension was provided by Kieselbach (1997), who used FL for systematic failure analysis. The role of the research was to analyze structural failures in order to prevent similar cases in the future, clarify responsibilities, add to experience, and contribute to design theory. The possible advantages of using fault tree analysis as a tool for systematic failure analysis were also discussed. The failure of the railing of a scaffold, which led to an accident, was discussed as an example of the feasibility of this kind of systematic approach and the application of FL to failure analysis.

Most reported studies of FL applications in insurance underwriting were written by academicians. Two exceptions were the studies of Erbach and Seah (1993) and Horgby et al. (1997). One of the first practical insurance applications of FL took place in Canada in 1987, and involved the development of Zeno, a prototype automated life underwriter using a mixture of fuzzy and other techniques. According to Erbach and Seah (1993), it was carried as far as the development of an underwriter (coded in C++) that ran on portable computers. With the aid of electronic links to the home office, it was intended to make turnaround a matter of minutes for cases complicated enough to require human intervention. Unfortunately, while Zeno was carried far enough to show that it could
work, when it was turned over to the system people, they "promptly abandoned it."

The other underwriting application was provided by Horgby et al. (1997), who applied FL to the medical underwriting of life insurance applicants. In a step-by-step fashion, the authors show how expert knowledge about underwriting diabetes mellitus in life insurance can be processed for a fuzzy inference system. The article was based on one of the first computer-based fuzzy underwriting systems implemented in the industry and, given the success of the application, the authors conclude that techniques of fuzzy underwriting will become standard tools for underwriters in the future.

3.2.2 Classification
One of the earliest classification studies was Ebanks et al. (1992), who discussed how measures of fuzziness can be used to classify life insurance risks. They envisioned a two-stage process. In the first stage, a risk was assigned a vector whose cell values represented the degree to which the risk satisfied the preferred risk requirement associated with that cell. In the second stage, the overall degree of membership in the preferred risk category was computed. This could be done using the fuzzy intersection operator of Lemaire (1990) or the fuzzy inference of Ostaszewski (1993). Measures of fuzziness were compared and discussed within the context of risk classification, including an entropy measure, which measures the average information content of a set of data points, the distance measure of Yager (1979), and the axiomatic product measure of Ebanks (1983).

Ostaszewski (1993, Ch. 6) observed that the lack of actuarially fair classification is economically equivalent to price discrimination in favor of high-risk individuals and suggested "... a possible precaution against [discrimination] is to create classification methods with no assumptions, but rather methods which discover patterns used in classification." To this end, he was the first to suggest the use of the c-means algorithm for classification in an insurance context.
By way of introducing the topic to actuaries, he discussed an insightful example involving the classification of four prospective insureds, two males and two females, based on the features height, gender, weight, and resting pulse. Age also was a factor, but it was irrelevant to the analysis since the applicants were all the same age. Moreover, the other feature values were intentionally exaggerated for illustrative purposes. The two initial clusters were formed on the basis of gender, and, in a step-by-step fashion through three iterations, Ostaszewski showed the existence of a more efficient classification based on all the features.

Hellman (1995) used a fuzzy expert system to identify Finnish municipalities that were of average size and well managed, but whose insurance coverage was inadequate. The steps taken included identifying and classifying the economic and insurance factors, having an expert subjectively evaluate each factor, preprocessing the conclusions of the expert, and incorporating this knowledge base into an expert system. The economic factors included population size, gross margin rating (based on funds available for capital expenditures), solidity rating, potential for growth, and whether the municipality was in a crisis situation. The insurance factors were nonlife insurance premiums written with the company and the claims ratio for three years. Hellman concluded that important features of the fuzzy expert systems were that they were easily modified, the smooth functions gave continuity to the values, and new fuzzy features could easily be added.

Cox (1995) reported on a fraud and abuse detection system for managed healthcare that integrated NNs and fuzzy models. The essence of the system was that it detected anomalous behavior by comparing individual medical providers to a peer group. To this end, the underlying behavior patterns for the peer population were developed using the experience of a fraud-detection department, an unsupervised NN that learnt the relationships inherent in the claim data, and a supervised approach that automatically generated the fuzzy model from a knowledge of the decision variables.
Cox concluded that the system was capable of detecting anomalous behaviors as well as or better than the best fraud-detection departments.

Derrig and Ostaszewski (1995) extended the previous works by showing how the c-means clustering algorithm can provide an alternative way to view risk classification. Their focus was on applying fuzzy clustering algorithms to problems of auto rating territories and fraud detection, and on the classification of insurance claims according to their suspected level of fraud. Conceptually, when the fuzzy clustering algorithm is applied to the classification of rating territories, the clusters are the risk classes, and the degree of belief that each territory belongs to a given cluster (risk class) is quantified as a real-valued number between zero and one.

McCauley-Bell and Badiru (1996a,b) discussed a two-phase research project to develop a fuzzy-linguistic expert system for quantifying and predicting the risk of occupational injury of the forearm and hand. The first phase of the research developed linguistic variables to qualify risk levels. The second phase used a multi-criteria decision making technique to assign relative weights to the identified risk factors. Using the linguistic variables obtained in the first part of the research, a fuzzy rule base was constructed with all of the potential combinations for the given factors. This study was particularly interesting because it used processed expert opinion, unlike most previous studies, which relied on unprocessed expert opinion. McCauley-Bell et al. (1999) extended the study by incorporating a fuzzy linear regression model to predict the relationship of known risk factors to the onset of occupational injury.

Jablonowski (1996, 1997) investigated the use of FST to represent uncertainty in both the definition and measurement of risk, from a risk manager's perspective. He conceptualized exposure in terms of three fuzzy concepts: risk, which he viewed as a contoured function of frequency and severity; the probability of loss, to point estimates of which he assigned membership functions; and the risk profile, which was the intersection of the first two. He concluded that FST provides a realistic approach to the formal analysis of risk.
Chen and He (1997) presented a methodology for deriving an Overall Disability Index (ODI) for measuring an individual's disability. Their approach involved the transformation of the ODI derivation problem into a multiple-criteria decision-making problem. Essentially, they used the analytic hierarchy process, a multi-criteria decision making technique that uses pairwise comparisons to estimate the relative importance of each risk factor (Saaty 1980), along with entropy theory and fuzzy set theory, to elicit the weights among the attributes and to aggregate the multiple attributes into a single ODI measurement.

Verrall and Yakoubov (1999) showed how the fuzzy c-means algorithm could be used to specify a data-based procedure for investigating age groupings in general insurance. Starting with the assumption that distortion effects had already been removed and that policyholder age was the only significant factor, they pre-processed the data by grouping the ages where the data was sparse. Then they used a model based on six clusters and an α-cut of 20 percent to analyze policyholder age groupings associated with the coverages of automobile material damage and bodily injury. They concluded that the flexibility of the fuzzy approach makes it most suitable for grouping policyholder age.

Bentley (2000) used an evolutionary-fuzzy approach to investigate suspicious home insurance claims, where genetic programming was employed to evolve FL rules that classified claims into "suspicious" and "non-suspicious" classes. Notable features of his methodology were that it used clustering to develop membership functions and committee decisions to identify the best-evolved rules. With respect to the former, the features of the claims were clustered into low, medium, and high groups, and the minimum and maximum values in each cluster were used to define the domains of the membership functions. The committee decisions were based on different versions of the system that were run in parallel on the same data set and weighted for intelligibility, which was defined as inversely proportional to the number of rules, and for accuracy.
Bentley reported that the results of his model, when applied to actual data, agreed with the results of previous analyses.

3.2.3 Pricing
In the conclusion of DeWit's (1982) article, he recognized the limitations of his analysis and speculated that "eventually we may arrive, for branches where risk theory offers insufficient possibilities, at fuzzy premium calculation." He was right, of course, and eight years later Lemaire (1990) discussed the computation of a fuzzy premium for an endowment policy, using fuzzy arithmetic. His methodology followed closely that of Buckley (1987).

Young (1996) described how FL can be used to make pricing decisions in group health insurance that consistently consider supplementary data, including vague or linguistic objectives of the insurer, which are ancillary to statistical experience data. Using group health insurance data from an insurance company, an illustrative competitive rate changing model was built that employed fuzzy constraints exclusively to adjust insurance rates. Young did not necessarily advocate using such a model without considering experience studies, but presented the simplified model to demonstrate more clearly how to represent linguistic rules.

The foregoing analysis was extended to include claim experience data in Young (1997). In this case, the author described, step-by-step, how an actuary/decision maker could use FL to adjust workers' compensation insurance rates. The supplementary data may be financial or marketing data or statements that reflect the philosophy of the insurance company or client.

Cummins and Derrig (1997) used FL to address the financial pricing of property-liability insurance contracts. Both probabilistic and nonprobabilistic types of uncertainty in the pricing and underwriting accept/reject context were incorporated. The authors focused primarily on the FL aspects needed to solve the insurance-pricing problem, and in the process fuzzified a well-known insurance financial pricing model,
provided numerical examples of fuzzy pricing, and proposed fuzzy rules for project decision-making. They concluded that FL can lead to significantly different pricing decisions than the conventional approach.

Carretero and Viejo (2000) investigated the use of fuzzy mathematical programming for insurance pricing decisions related to a bonus-malus rating system in automobile insurance. (A bonus-malus rating system rewards claim-free policyholders by awarding them bonuses or discounts, and penalizes policyholders responsible for accidents by assessing them maluses or premium surcharges.) Their assumed objective was "attractive income from premiums," while the constraints involved the spread of policies among the risk classes, the weighted sum of the absolute variation of the insured's premium, and the deviation from perfect elasticity of the policyholder's payments with respect to their claim frequency. The system was tested on a large database of third-party personal liability claims of a Spanish insurer, and they concluded that their fuzzy linear programming approach avoids unrealistic modeling and may reduce information costs. Carretero, Chapter 6, provides further commentary on this approach and on FL techniques in general.

A novel extension of the pricing issue was provided by Lu et al. (2001), who investigated the use of game theory to decide on a product line in a competitive situation, where the decision factors are fuzzy. To accommodate the fuzzy aspect of their study, they focused on the linguistic operators of importance and superiority. Their model is simplistic, in the sense that it assumed complete information and a static game, as well as only two companies with two potential product lines, but it did walk the reader through the development of a Nash equilibrium.

3.2.4 Asset and Investment Models

As mentioned in the section on NNs, a good deal of the research on allocation of assets and investment analysis can be drawn from research with respect to other financial intermediaries.
Two interesting sources in this regard are Chorafas (1994, chapters 8-10), which describes several applications of FL in finance, and Siegel et al. (1995), which describes many applications of fuzzy sets from an accounting perspective. Four interesting applications that are relevant to insurance involve cash flow matching, immunization, optimal asset allocation, and insurance company taxation. An overview of these studies follows.

Buehlmann and Berliner (1992) used fuzzy arithmetic and the fuzzy numbers associated with cash-flow matching of assets and liabilities due at different times and in different amounts to arrive at a cash-flow matching approximation. The essence of their approach, which they labeled "fuzzy zooming," is that a set of cash flows can be replaced by a single payment with an associated triangular fuzzy value. In this way, insecure future interest rates can be modeled by fuzzy numbers and used for the calculation of present values. Berliner and Buehlmann (1993) generalized the fuzzy zooming concept and showed how it can be used to materially reduce the complexity of long-term cash flows.

Chang and Wang (1995) developed fuzzy mathematical analogues of the classical immunization theory and the matching of assets and liabilities. Essentially, they reformulated concepts about immunization and the matching of assets and liabilities into fuzzy mathematics. This approach offers the advantages of flexibility, improved conformity with situations encountered in practice, and the extension of solutions. Chang, in Chapter 8 of this volume, presents further commentary on this application and extends the study to include a fuzzy set theoretical view of Bayesian decision theory.

Guo and Huang (1996) used a possibilistic linear programming method for optimal asset allocation based on simultaneously maximizing the portfolio return, minimizing the portfolio risk, and maximizing the possibility of reaching higher returns. (Brockett and Xia (1995, pp. 34-38) contains an informative discussion of fuzzy linear programming and how it can be transformed to its crisp counterpart.) This is
analogous to maximizing mean return, minimizing variance, and maximizing skewness for a random rate of return. The authors concluded that their algorithm provides maximal flexibility for decision makers to effectively balance a portfolio's return and risk.

Derrig and Ostaszewski (1997) viewed insurance liabilities, properly priced, as a management tool for the short position in the government tax option. In this context, they illustrated how FL can be used to estimate the effective tax rate and after-tax rate of return on the asset and liability portfolio of a property-liability insurance company. To accomplish this, they modeled critical parameters of underwriting and investment as fuzzy numbers, which led to a model of uncertainty in the tax rate, rate of return, and the asset-liability mix.

3.2.5 Projected Liabilities
Boissonnade (1984) used pattern recognition and FL in the evaluation of seismic intensity and damage forecasting, and for the development of models to estimate earthquake insurance premium rates and insurance strategies. The influences on the performance of structures include quantifiable factors, which can be captured by probability models, and nonquantifiable factors, such as construction quality and architectural details, which are best formulated using fuzzy set models. Accordingly, two methods of identifying earthquake intensity were presented and compared. The first method was based on the theory of pattern recognition, where a discriminative function is developed based on Bayes' criterion, and the second method applied FL.

Cummins and Derrig (1993, p. 434) studied fuzzy trends in property-liability insurance claim costs as a follow-up to their assertion that "the actuarial approach to forecasting is rudimentary." The essence of the study was that they emphasized the selection of a "good" forecast, where goodness was defined using multiple criteria that may be vague or fuzzy, rather than a good forecasting model. They began by calculating several possible trends using
accepted statistical procedures, and for each trend they determined the degree to which the estimate was good by intersecting several fuzzy goals. They suggested that one may choose the trend that has the highest degree of goodness, and proposed that a trend that accounts for all the trends can be calculated by forming a weighted average using the membership degrees as weights. They concluded that FL provides an effective method for combining statistical and judgmental criteria in insurance decision-making.

A final perspective was provided by Zhao (1996), who addressed the issues of maritime collision prevention and liability. The essence of the analytical portion of the study was the use of fuzzy programming methods to build decision-making simulation models and the development of an automatic collision avoidance decision-making system using FL and NNs. In this case, an offline FFNN was used to map input data to membership functions.
4 Genetic Algorithm (GA) Applications
Genetic algorithms (GAs) were proposed by Holland (1975) as a way to perform a randomized global search in a solution space. In this space, a population of candidate solutions, each with an associated fitness value, is evaluated by a fitness function on the basis of their performance. Then, using genetic operations, the best candidates are used to evolve a new population that not only has more of the good solutions but better solutions as well. This section reviews GA applications in insurance.
GAs are a subset of the broader category of Evolutionary Computing (EC), which comprises the evolutionary optimization methods that work by simulating evolution on a computer. The three main subcategories of EC are Genetic Algorithms (GAs), Evolutionary Programming (EP), and Evolution Strategies (ES) (Thomas, 1996). GAs are the most commonly used.
4.1 An Overview of GAs
Before proceeding, let us briefly review the GA process. This process, which can be described as an automated, intelligent approach to trial and error, based on principles of natural selection, is depicted in Figure 10.
[Figure 10 flowchart: initialize the population size (M), the regeneration factors, and the termination criterion; generate an initial population P(g=0); evaluate fitness; create new individuals by reproduction, crossover, and mutation; set P(g+1) to the new individuals; repeat until the termination criterion is satisfied.]
Figure 10. The GA process.
As indicated, the first step in the process is initialization, which involves choosing a population size (M), population regeneration factors, and a termination criterion. The next step is to randomly generate an initial population of solutions, P(g=0), where g is the generation. If this population satisfies the termination criterion, the process stops. Otherwise, the fitness of each individual in the population is evaluated, and the best solutions are "bred" with each other to form a new population, P(g+1); the poorer solutions are discarded. If the new population does not satisfy the termination criterion, the process continues.

4.1.1 Population Regeneration Factors
There are three ways to develop a new generation of solutions: reproduction, crossover, and mutation. Reproduction adds a copy of a fit individual to the next generation. Crossover emulates the process of creating children, and involves the creation of new individuals (children) from two fit parents by a recombination of their genes (parameters). Under mutation, there is a small probability that some of the gene values in the population will be replaced with randomly generated values. This has the potential effect of introducing good gene values that may not have occurred in the initial population or that were eliminated during the iterations. In Figure 10, the process is repeated until the new generation has the same number of individuals (M) as the previous one.
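The short Python sketch below is one plausible rendering of this loop: a fixed-size population is sorted by fitness, the poorer half is discarded, and the next generation is built by reproduction, single-point crossover, and mutation. The toy fitness function and all parameter settings are illustrative assumptions rather than anything drawn from the studies reviewed here.

```python
# A minimal sketch of the GA loop in Figure 10: fixed population size M,
# fitness-based selection, plus reproduction, crossover, and mutation as the
# regeneration operators.  The fitness function (a toy asset-allocation
# trade-off) and all parameter values are illustrative assumptions.
import random

M, GENES, GENERATIONS, P_MUT = 30, 8, 50, 0.02

def fitness(genes):
    # Toy objective: reward weights that sum to ~1 and are not concentrated.
    total = sum(genes)
    return -abs(total - 1.0) - max(genes)

def random_individual():
    return [random.random() for _ in range(GENES)]

def crossover(parent_a, parent_b):
    point = random.randrange(1, GENES)             # single-point recombination
    return parent_a[:point] + parent_b[point:]

def mutate(genes):
    return [random.random() if random.random() < P_MUT else g for g in genes]

population = [random_individual() for _ in range(M)]
for g in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: M // 2]                 # discard the poorer solutions
    children = [population[0]]                     # reproduction: keep the best
    while len(children) < M:                       # refill to the same size M
        a, b = random.sample(parents, 2)
        children.append(mutate(crossover(a, b)))
    population = children

best = max(population, key=fitness)
print(round(fitness(best), 4), [round(g, 2) for g in best])
```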
4.2 Applications
GAs have become a popular alternative to linear programming and other optimization routines because of their power and ease of use. Insurance applications have included classification, underwriting, asset allocation, and optimizing the competitiveness of an insurance product. This section provides a brief overview of examples of these studies. The discussion excludes the previously mentioned studies where GAs were used to determine the architecture and parameters of NNs.

4.2.1 Classification
Lee and Kim (1999) used GAs to refine the classification system of Korean private passenger automobile insurance. The study was based on a large sample of cases randomly selected from automobile policies in force at the largest auto insurer in Korea, which was divided into two sets for training and testing purposes. The research employed a hybrid learning methodology that integrated GAs and the decision tree induction algorithm, C4.5. First, the GA was used to explore the space of all possible subsets of a large set of candidate discriminatory variables; then the candidate variables subsets were evaluated using the decision tree induction algorithm to produce a classifier from the given variables and the training data. As a
benchmark, the classification performance of this technique was compared with that obtained using a Logit model. It was concluded that the hybrid GA-C4.5 approach outperforms the Logit model both in training and in testing, but is quite time consuming.

4.2.2 Underwriting
Nikolopoulos and Duvendack (1994) described the application of GAs and classification tree induction to insurance underwriting. The basis of their study was that underwriting rules were to be extracted from a set of policyholder records, each of which had 19 attributes describing the policyholder, so as to determine when a policy should be cancelled. When the results of the GA and classification tree techniques were reviewed by an underwriting domain expert, semantically incorrect rules were found in each, which, when eliminated, drastically reduced the accuracy of their results. To overcome this problem, a hybrid model was constructed whereby the rule set generated by the classification tree algorithm was used to construct the initial population of the GA. The authors concluded that the hybrid model produces results far superior to those of either technique alone.

4.2.3 Asset Allocation

Wendt (1995) used a GA to build a portfolio efficient frontier (a set of portfolios with optimal combinations of risk and return). The underlying data consisted of 250 scenarios of annual returns for eight asset classes. To evaluate the GA process, the final GA output was compared to the efficient frontier created by a sophisticated nonlinear optimizer. After about 50 cycles, the GA found portfolios very close to the efficient frontier generated by the nonlinear optimizer.

While not insurance motivated, an interesting study in this area was conducted by Frick et al. (1996), who investigated price-based heuristic trading rules for buying and selling shares.
Their methodology involved transforming the time series of share prices using a heuristic charting method that gave buy and sell signals and was based on price changes and reversals. Based on a binary representation of those charts, they used GAs to generate trading strategies from the classification of different price formations. They used two different evaluation methods: one compared the return of a trading strategy with the corresponding riskless interest rate and the average stock market return; the other used its risk-adjusted expected return as a benchmark instead of the average stock market return. Their analysis of over one million intra-day (during a single trading day) stock prices from the Frankfurt Stock Exchange (FSE) showed the extent to which different price formations could be classified by their system and the nature of the rules, but left for future research an analysis of the performance of the resulting trading strategies.

Jackson (1997) used the asset allocation problem to investigate the performance of a GA when confronted first with a well-conditioned situation and then with a more complex case. As a benchmark, he used Newton's method of optimization. In the first situation, he assumed a quadratic utility function and found that the GA and the Newton method produced essentially the same asset allocation, but the GA took much longer to converge. In the second case, he used a step utility function and found that the GA, while still slower than Newton's method, was more robust.

4.2.4 Competitiveness of the Insurance Products
The profitability, risk, and competitiveness of insurance products were investigated by Tan (1997). He began the analysis with a Monte Carlo simulation, which was used to develop a flexible measurement framework. A GA was then used to seek the optimum asset allocations that form the profitability-risk-competitiveness frontier and to examine the trade-offs. The paper showed how to select the appropriate asset allocation and crediting strategy in order to position the product at the desired location on the frontier.
5 Comment
The purpose of this chapter has been to provide the reader with an overview of where NNs, FL, and GAs have been implemented in insurance. While it is clear that these technologies have made inroads into many facets of the business, in most instances the applications focused on each technology separately and did not capitalize on potential synergies between the technologies. However, a natural evolution in soft computing has been the emergence of hybrid systems, where the technologies are used simultaneously (Shapiro and Gorman 2000). FL-based technologies can be used to design NNs or GAs, with the effect of increasing their capability to display good performance across a wide range of complex problems with imprecise data. Thus, for example, a fuzzy NN can be constructed where the NN possesses fuzzy signals and/or has fuzzy weights. Conversely, FL can use technologies from other fields, like NNs or GAs, to deduce or to tune, from observed data, the membership functions in fuzzy rules, and may also structure or learn the rules themselves. Chapter 11 explores the merging of these soft computing technologies.
Acknowledgments

This work was supported in part by the Robert G. Schwartz Faculty. The assistance of Asheesh Choudhary, Angela T. Koh, Travis J. Miller, and Laura E. Campbell is gratefully acknowledged.
References

Alsing, S.G., Bauer Jr., K.W., and Oxley, M.E. (1999), "Convergence for receiver operating characteristic curves and the performance of neural networks," in Dagli, C.H., et al. (eds.), Smart Engineering System Design: Neural Networks, Fuzzy Logic, Evolutionary Programming, Data Mining, and Complex Systems, Proceedings of the Artificial Neural Networks in Engineering Conference (ANNIE '99), ASME Press, New York, pp. 947-952.
Bakheet, M.T. (1995), "Contractors' Risk Assessment System (Surety, Insurance, Bonds, Construction, Underwriting)," Ph.D. dissertation, Georgia Institute of Technology.
Bellman, R. and Zadeh, L.A. (1970), "Decision-making in a fuzzy environment," Management Science 17, pp. 141-164.
Bentley, P.J. (2000), "Evolutionary, my dear Watson. Investigating committee-based evolution of fuzzy rules for the detection of suspicious insurance claims," Proceedings of the Second Genetic and Evolutionary Computation Conference (GECCO), 8-12 July.
Berliner, B. and Buehlmann, N. (1993), "A generalization of the fuzzy zooming of cash flows," 3rd AFIR, pp. 433-456.
Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York.
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Clarendon Press.
Boero, G. and Cavalli, E. (1996), "Forecasting the exchange rate: a comparison between econometric and neural network models," AFIR, Vol. II, pp. 981-996.
Boissonnade, A.C. (1984), "Earthquake Damage and Insurance Risk," Ph.D. dissertation, Stanford University.
Bonissone, P.P. (1998), "Soft computing applications: the advent of hybrid systems," in Bosacchi, B., Fogel, D.B., and Bezdek, J.C. (eds.), Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation, Proceedings of SPIE 3455, pp. 63-78.
Breiman, L., Friedman, J., Olshen, R., and Stone, C.J. (1984), Classification and Regression Trees, Chapman & Hall, New York.
Brockett, P.L., Cooper, W.W., Golden, L.L., and Pitaktong, U. (1994), "A neural network method for obtaining an early warning of insurer insolvency," J Risk and Insurance 61(3), pp. 402-424.
Brockett, P.L., Xia, X., and Derrig, R.A. (1998), "Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud," J Risk and Insurance 65(2), pp. 245-274.
Buckley, J.J. (1987), "The fuzzy mathematics of finance," Fuzzy Sets and Systems 21, pp. 257-273.
Buehlmann, N. and Berliner, B. (1992), "The fuzzy zooming of cash flows," Transactions of the 24th International Congress of Actuaries, Vol. 6, pp. 432-452.
Carretero, R.C. and Viejo, A.S. (2000), "A bonus-malus system in the fuzzy set theory [insurance pricing decisions]," FUZZ IEEE 2000, The Ninth IEEE International Conference on Fuzzy Systems 2, pp. 1033-1036.
Chang, C. and Wang, P. (1995), "The matching of assets and liabilities with fuzzy mathematics," 25th International Congress of Actuaries, pp. 123-137.
Chen, J.J-G. and He, Z. (1997), "Using analytic hierarchy process and fuzzy set theory to rate and rank the disability," Fuzzy Sets and Systems 88, pp. 1-22.
Chorafas, D.N. (1994), Chaos Theory in the Financial Markets: Applying Fractals, Fuzzy Logic, Genetic Algorithms, Probus Publishing Company, Chicago.
Collins, E., Ghosh, S., and Scofield, C. (1988), "An application of a multiple neural network learning system to emulation of mortgage underwriting judgments," IEEE International Conference on Neural Networks 2, pp. 459-466.
Cox, E. (1995), "A fuzzy system for detecting anomalous behaviors in healthcare provider claims," in Goonatilake, S. and Treleven, P. (eds.), Intelligent Systems for Finance and Business, John Wiley & Sons, pp. 111-135.
Cummins, J.D. and Derrig, R.A. (1993), "Fuzzy trends in property-liability insurance claim costs," J Risk and Insurance 60(3), pp. 429-465.
Cummins, J.D. and Derrig, R.A. (1997), "Fuzzy financial pricing of property-liability insurance," North American Actuarial Journal 1(4), pp. 21-44.
Derrig, R.A. and Ostaszewski, K.M. (1994), "Fuzzy techniques of pattern recognition in risk and claim classification," 4th AFIR International Colloquium 1, pp. 141-171.
Derrig, R.A. and Ostaszewski, K. (1995), "Fuzzy techniques of pattern recognition in risk and claim classification," J Risk and Insurance 62(3), pp. 447-482.
Derrig, R.A. and Ostaszewski, K. (1997), "Managing the tax liability of a property-liability insurance company," J Risk and Insurance 64(4), pp. 595-711.
Derrig, R.A. and Ostaszewski, K. (1999), "Fuzzy sets methodologies in actuarial science," Chapter 16 in Zimmerman, H.J., Practical Applications of Fuzzy Technologies, Kluwer Academic Publishers, Boston.
Dubois, D. and Prade, H. (1980), Fuzzy Sets and Systems: Theory and Applications, Academic Press, San Diego, CA.
DeWit, G.W. (1982), "Underwriting and uncertainty," Insurance: Mathematics and Economics 1, pp. 277-285.
Erbach, D.W. and Seah, E. (1993), "Discussion of: Young, V.R., The application of fuzzy sets to group health underwriting," Transactions of the Society of Actuaries 45, pp. 585-587.
Ebanks, B. (1983), "On measures of fuzziness and their representations," Journal of Mathematical Analysis and Applications 94, pp. 24-37.
Ebanks, B., Karwowski, W., and Ostaszewski, K. (1992), "Application of measures of fuzziness to risk classification in insurance," Computing and Information, Proceedings of ICCI '92, Fourth International Conference, pp. 290-291.
Francis, L. (2001), "Neural networks demystified," Casualty Actuarial Society Forum, Winter, pp. 253-319.
Frick, A., Herrmann, R., Kreidler, M., Narr, A., and Seese, D. (1996), "Genetic-based trading rules - a new tool to beat the market with? - First empirical results," AFIR 2, pp. 997-1017.
Guo, L. and Huang, Z. (1996), "A possibilistic linear programming method for asset allocation," Journal of Actuarial Practice 2, pp. 67-90.
Hellman, A. (1995), "A fuzzy expert system for evaluation of municipalities - an application," Transactions of the 25th International Congress of Actuaries 1, pp. 159-187.
Horgby, P., Lohse, R., and Sittaro, N. (1997), "Fuzzy underwriting: an application of fuzzy logic to medical underwriting," J Actuarial Practice 5(1), pp. 79-104.
Holland, J.H. (1975), Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA.
Huang, C., Dorsey, R.E., and Boose, M.A. (1994), "Life insurer financial distress prediction: a neural network model," J Insurance Regulation, Winter 13(2), pp. 131-167.
Ismael, M.B. (1999), "Prediction of Mortality and In-Hospital Complications for Acute Myocardial Infarction Patients Using Artificial Neural Networks," Ph.D. dissertation, Duke University.
Jackson, A. (1997), "Genetic algorithms for use in financial problems," AFIR 2, pp. 481-503.
Jablonowski, M. (1997), "Modeling imperfect knowledge in risk management and insurance," Risk Management and Insurance Review 1(1), pp. 98-105.
Jang, J. (1997), "Comparative Analysis of Statistical Methods and Neural Networks for Predicting Life Insurers' Insolvency (Bankruptcy)," Ph.D. dissertation, University of Texas at Austin.
Kieselbach, R. (1997), "Systematic failure analysis using fault tree and fuzzy logic technology," Law and Insurance 2, pp. 13-20.
Klir, G.J. and Yuan, B. (1996), Fuzzy Sets, Fuzzy Logic, and Fuzzy Systems: Selected Papers by Lotfi A. Zadeh, World Scientific, New Jersey.
Kohonen, T. (1988), "Self-organizing feature maps," Self-Organizing and Associative Memory, 2nd ed., Springer-Verlag, Berlin, Heidelberg, Germany.
Kramer, B. (1997), "N.E.W.S.: a model for the evaluation of non-life insurance companies," European Journal of Operational Research 98, pp. 419-430.
Lee, B. and Kim, M. (1999), "Application of genetic algorithm to automobile insurance for selection of classification variables: the case of Korea," paper presented at the 1999 Annual Meeting of the American Risk and Insurance Association.
Lemaire, J. (1990), "Fuzzy insurance," ASTIN Bulletin 20(1), pp. 33-55.
Lokmic, L. and Smith, K.A. (2000), "Cash flow forecasting using supervised and unsupervised neural networks," Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks 6, pp. 343-347.
Lu, Y., Zhang, L., Guan, Z., and Shang, H. (2001), "Fuzzy mathematics applied to insurance game," IFSA World Congress and 20th NAFIPS International Conference, Vol. 2, p. 941.
Magee, D.D. (1999), "Comparative Analysis of Neural Networks and Traditional Actuarial Methods for Estimating Casualty Insurance Reserve," Ph.D. dissertation, The University of Texas at Austin.
McCauley-Bell, P. and Badiru, A.B. (1996a,b), "Fuzzy modelling and analytic hierarchy processing to quantify risk levels associated with occupational injuries, Parts I and II," IEEE Transactions on Fuzzy Systems 4, pp. 124-131 and 132-138.
McCauley-Bell, P.R., Crumpton, L.L., and Wang, H. (1999), "Measurement of cumulative trauma disorder risk in clerical tasks using fuzzy linear regression," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 29, no. 1, February.
Nikolopoulos, C. and Duvendack, S. (1994), "A hybrid machine learning system and its application to insurance underwriting," Proceedings of the IEEE Conference on Evolutionary Computation 2, 27-29 June, pp. 692-695.
Ostaszewski, K. (1993), Fuzzy Set Methods in Actuarial Science, Society of Actuaries, Schaumburg, IL.
Park, J. (1993), "Bankruptcy Prediction of Banks and Insurance Companies: an Approach Using Inductive Methods," Ph.D. dissertation, University of Texas at Austin.
Refenes, A.-P.N., Abu-Mostafa, Y., Moody, J., and Weigend, A. (eds.) (1996), Neural Networks in Financial Engineering, Proceedings of the Third International Conference on Neural Networks in the Capital Markets, World Scientific Publishing Co., London.
Rosenblatt, F. (1959), "Two theorems of statistical separability in the perceptron," Mechanization of Thought Processes, Symposium held at the National Physical Laboratory, HM Stationery Office, pp. 421-456.
Saaty, T.L. (1990), "How to make a decision: the analytic hierarchy process," European Journal of Operational Research 48, pp. 9-26.
Saemundsson, S.R. (1996), "Dental Caries Prediction by Clinicians and Neural Networks," Ph.D. dissertation, University of North Carolina at Chapel Hill.
Shapiro, A.F. and Gorman, R.P. (2000), "Implementing adaptive nonlinear models," Insurance: Mathematics and Economics 26, pp. 289-307.
Shepherd, A.J. (1997), Second-Order Methods for Neural Networks, Springer.
Siegel, P.H., de Korvin, A., and Omer, K. (eds.) (1995), Applications of Fuzzy Sets and the Theory of Evidence to Accounting, JAI Press, Greenwich, Connecticut.
Speights, D.B., Brodsky, J.B., and Chudova, D.I. (1999), "Using neural networks to predict claim duration in the presence of right censoring and covariates," Casualty Actuarial Society Forum, Winter, pp. 256-278.
Tan, R. (1997), "Seeking the profitability-risk-competitiveness frontier using a genetic algorithm," J Actuarial Practice 5(1), pp. 49-77.
Thomas, B. (1996), Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Oxford University Press.
Tu, J.V. (1993), "A Comparison of Neural Network and Logistic Regression Models for Predicting Length of Stay in the Intensive Care Unit Following Cardiac Surgery," Ph.D. dissertation, University of Toronto.
van Wezel, M.C., Kok, J.N., and Sere, K. (1996), "Determining the number of dimensions underlying customer-choices with a competitive neural network," IEEE International Conference on Neural Networks 1, pp. 484-489.
Vaughn, M.L. (1996), "Interpretation and knowledge discovery from a multilayer perceptron network: opening the black box," Neural Computing and Applications 4(2), pp. 72-82.
Vaughn, M.L., Ong, E., and Cavill, S.J. (1997), "Interpretation and knowledge discovery from a multilayer perceptron network that performs whole life assurance risk assessment," Neural Computing and Applications 6, pp. 201-213.
Verrall, R.J. and Yakoubov, Y.H. (1999), "A fuzzy approach to grouping by policyholder age in general insurance," Journal of Actuarial Practice, Vol. 7, pp. 181-203.
Viaene, S., Derrig, R., Baesens, B., and Dedene, G. (2001), "A comparison of state-of-the-art classification techniques for expert automobile insurance fraud detection," The Journal of Risk & Insurance 69(3), pp. 373-421.
Wendt, R.Q. (1995), "Build your own GA efficient frontier," Risks and Rewards, December 1, pp. 4-5.
Widrow, B. and Hoff, M.E. (1960), "Adaptive switching circuits," IRE Western Electric Show and Convention Record, Part 4, August, pp. 96-104.
Yager, R.R., Ovchinnikov, S., Tong, R.M., and Nguyen, H.T. (1987), Fuzzy Sets and Applications: Collected Papers of Lotfi A. Zadeh, John Wiley & Sons, Inc., New York.
Yakoubov, Y.H. and Haberman, S. (1998), "Review of actuarial applications of fuzzy set theory," Actuarial Research Report, City University, London, England.
Young, V.R. (1993), "The application of fuzzy sets to group health underwriting," Transactions of the Society of Actuaries 45, pp. 551-590.
Young, V.R. (1996), "Insurance rate changing: a fuzzy logic approach," J Risk and Insurance 63, pp. 461-483.
Young, V.R. (1997), "Adjusting indicated insurance rates: fuzzy rules that consider both experience and auxiliary data," Proceedings of the Casualty Actuarial Society 84, pp. 734-765.
Zadeh, L.A. (1965), "Fuzzy sets," Information and Control, vol. 8, pp. 338-353.
Zadeh, L.A. (1975), "The concept of a linguistic variable and its application to approximate reasoning," Part I, Information Sciences 8, pp. 199-249; Part II, Information Sciences 8, pp. 301-357.
Zadeh, L.A. (1981), "Fuzzy systems theory: a framework for the analysis of humanistic systems," in Cavallo, R.E. (ed.), Recent Developments in Systems Methodology in Social Science Research, Kluwer, Boston, pp. 25-41.
Zadeh, L.A. (1992), Foreword of the Proceedings of the Second International Conference on Fuzzy Logic and Neural Networks, pp. xiii-xiv, Iizuka, Japan.
Zadeh, L.A. (1994), "The role of fuzzy logic in modeling, identification and control," Modeling Identification and Control 15(3), p. 191.
Zhao, J. (1996), "Maritime Collision and Liability," Ph.D. dissertation, University of Southampton.
Property and Casualty
Chapter 2

An Introduction to Neural Networks in Insurance

Louise A. Francis
The chapter will provide a basic introduction to the use of neural networks as an analytical modeling tool in insurance, taking the view that the neural network technique is a generalization of more familiar models such as linear regression. The reader will also be introduced to the traditional characterization of a neural network as a system modeled on the functioning of neurons in the brain. The chapter will explain how neural networks are particularly helpful in addressing the data challenges encountered in real insurance applications, where complex data structures are the rule rather than the exception. The chapter will introduce and illustrate three specific modeling challenges: 1) nonlinear functions, 2) correlated data, and 3) interactions. The chapter will explain why neural networks are effective in dealing with these challenges. The issues and concepts will be illustrated with applications that address common Property and Casualty insurance problems. Applications will include trend analysis and reserving.
1 Introduction
Artificial neural networks are the intriguing new high tech tool for finding hidden gems in data. Neural networks belong to a category of techniques for analyzing data that is known as data mining. Other widely used data mining tools include decision trees, genetic
algorithms, regression splines, and clustering. Data mining techniques are used to find patterns in data not easily identified by other means. Typically the data sets used in data mining are large, i.e., have many records and many predictor variables. The number of records is typically in the tens of thousands, or more, and the number of independent variables is often in the hundreds. Data mining techniques, including neural networks, have been applied to portfolio selection, credit scoring, fraud detection, and market research.

When data mining techniques are applied to data sets containing complex relationships, the techniques can identify the relationships and provide the analyst with insights into those relationships. An advantage data mining techniques have over classical statistical models used to analyze data, such as regression and ANOVA, is that they can fit data where the relation between independent and dependent variables is nonlinear and where the specific form of the nonlinear relationship is unknown.

Artificial neural networks (hereafter referred to as neural networks) share the advantages just described with the many other data mining tools. However, neural networks have a longer history of research and application. As a result, their value in modeling data has been extensively studied and is established in the literature (Potts 2000). Moreover, they sometimes have advantages over other data mining tools. For instance, decision trees, a method of splitting data into homogeneous clusters with similar expected values for the dependent variable, are often less effective when the predictor variables are continuous than when they are categorical (Salford Systems' course on Advanced CART, October 15, 1999). Neural networks work well with both categorical and continuous variables.

Neural networks are among the more glamorous of the data mining techniques. They originated in the artificial intelligence discipline, where they are often portrayed as a brain in a computer. Neural networks are designed to incorporate and mimic key features of the neurons in the brain and to process data in a manner analogous
to that of the human brain. As a result, much of the terminology used to describe and explain neural networks is borrowed from biology. Many other data mining techniques, such as decision trees and regression splines, were developed by statisticians and are described in the literature as computationally intensive generalizations of classical linear models.

Classical linear models assume that the functional relationship between the independent variables and the dependent variable is linear. However, they allow some nonlinear relationships to be approximated by linear functions by utilizing a transformation of the dependent or independent variables. Neural networks and other data mining techniques do not require that the relationships between predictor and dependent variables be linear (whether or not the variables are transformed).

The various data mining tools differ in their approach to approximating nonlinear functions and complex data structures. Neural networks use a series of "neurons" in what is known as the hidden layer that apply nonlinear activation functions to approximate complex functions in the data. The details are discussed in the body of this chapter. As the focus of this chapter is neural networks, the other data mining techniques will not be discussed in any further detail.

Despite the analytic advantages, many statisticians and actuaries are reluctant to embrace neural networks. One reason is that neural networks appear to be a "black box." Because of the complexity of the functions used in neural network approximations, neural network software typically does not supply the user with information about the nature of the relationship between predictor and target variables. The output of a neural network is a predicted value and some goodness-of-fit statistics. However, the functional form of the relationship between independent and dependent variables is not made explicit. In addition, the strength of the relationship between dependent and independent variables, i.e., the importance of each variable, is also often not revealed. Classical models, as well as other popular data mining techniques such as decision trees,
54
L. A. Francis
ply the user with a functional description or map of the relationships. This chapter seeks to open that "black box" and show what is happening inside the neural network. While a description of neural networks will be presented and some of the artificial intelligence terminology covered, this chapter's approach is predominantly from the statistical perspective. The similarity between neural networks and regression will be shown. This chapter will compare and contrast how neural networks and classical modeling techniques deal with three specific modeling challenges: 1) nonlinear functions, 2) interactions, and 3) correlated data. The examples in this chapter are drawn from property and casualty insurance.
2 Background on Neural Networks
A number of different kinds of neural networks exist. This chapter will discuss feedforward neural networks with one hidden layer, the most popular kinds of neural networks. A feedforward neural network is a network where the signal (i.e., information from an external source) is passed from an input layer of neurons through a hidden layer to an output layer of neurons. The function of the hidden layer is to process the information from the input layer. The hidden layer is denoted as hidden because it contains neither input nor output data and the output of the hidden layer generally remains unknown to the user. A feedforward neural network can have more than one hidden layer; however, such networks are not common. The feedforward network with one hidden layer is one of the older neural network techniques. As a result, its effectiveness has been established and software for applying it is widely available. The feedforward neural network discussed in this chapter is known as a Multilayer Perceptron (MLP). The MLP is a feedforward network that uses supervised learning. A network that is trained using supervised learning is presented with a target variable and fits a function that can be
used to predict the target variable. Alternatively, it may classify records into levels of the target variable when the target variable is categorical. This is analogous to the use of such statistical procedures as regression and logistic regression for prediction and classification. Other popular kinds of feedforward networks often incorporate unsupervised learning into the training. A network trained using unsupervised learning does not have a target variable. The network finds characteristics in the data that can be used to group similar records together. This is analogous to cluster analysis in classical statistics. This chapter will discuss only supervised learning using neural networks and, further, the discussion will be limited to a feedforward MLP neural network with one hidden layer. This chapter will primarily present applications of this model to continuous rather than discrete data.
2.1 Structure of a Feedforward Neural Network
Figure 1 displays the structure of a feedforward neural network with one hidden layer. The first layer contains the input nodes. Each node is a separate independent variable and represents the actual data used to fit a model to the dependent variable. These are connected to another layer of neurons called the hidden layer or hidden nodes, which modifies the data. The nodes in the hidden layer connect to the output layer. The output layer represents the target or dependent variable(s). It is common for networks to have only one target variable, or output node, but there can be more. An example would be a classification problem where the target variable can fall into one of a number of categories. Sometimes each of the categories is represented as a separate output node. As can be seen from Figure 1, each node in the input layer connects to each node in the hidden layer and each node in the hidden layer connects to each node in the output layer.
Figure 1. Diagram of three-layer feed forward neural network.
This structure is viewed in the artificial intelligence literature as analogous to that of the biological neurons contained in the human brain. The arrows leading to a node correspond to the dendrites leading to a neuron, and, like the dendrites, they carry a signal to the neuron or node. The arrows leading away from a node correspond to the axons of a neuron, and they carry a signal (i.e., the activation level) away from a neuron or node. The neurons of a brain have far more complex interactions than those displayed in the diagram; however, developers view neural networks as abstracting the most relevant features of the neurons in the human brain.

Neural networks "learn" by adjusting the strength of the signal coming from nodes in the previous connecting layer. As the neural network better learns how to predict the target value from the input pattern, each of the connections between the input neurons and the hidden or intermediate neurons, and between the intermediate neurons and the output neurons, increases or decreases in strength. A function called a threshold or activation function modifies the signal coming into the hidden layer nodes. In the early days of neural networks, this function produced a value of 1 or 0, depending on whether the signal from the prior layer exceeded a threshold value. Thus, the node or neuron would only "fire" (i.e., become active) if the signal exceeded the threshold, a process thought to be similar to the behavior of a neuron. It is now known that biological neurons are more complicated than previously believed and do not follow a simple all-or-none rule. Currently, activation functions are typically sigmoid in shape and can take on any value between 0.0 and 1.0 or between -1.0 and 1.0, depending on the particular function chosen. The modified signal is then output to the output layer nodes, which also apply activation functions. Thus, the information about the pattern being learned is encoded in the signals carried to and from the nodes. These signals map a relationship between the input nodes (the data) and the output nodes (the dependent variable).
3 Example 1: Simple Example of Fitting a Nonlinear Function to Claim Severity
A simple example will be used to illustrate how neural networks perform nonlinear function approximations. This example will provide details about the activation functions in the hidden and output layers to facilitate an understanding of how a neural network works. When actuaries project future losses on insurance policies using historic data containing losses for past years, they must consider the impact of inflationary factors on the future loss experience. To estimate the impact of these inflationary factors, actuaries use regression techniques to estimate "trends" or the rate of change in insurance losses across successive accident or policy periods. The
trends are often fit separately to frequencies and severities, where frequency is defined as the number of claims per policy or per other unit of exposure, and severity is defined as the average dollars of loss per claim. This example will involve estimating severity trends.
3.1 Severity Trend Models
When fitting trends to loss data, whether severity data or frequency data, an exponential regression is one of the most common approaches used. For the exponential regression, the underlying model for loss severity is:

Severity_t = Severity_{t_0} \, I^{(t - t_0)} \, e    (1)

That is, the severity at time t is equal to the severity at some starting point t_0 times an inflation factor I (which may be thought of as one plus an inflation rate) raised to the power t - t_0. This is multiplied by a random error term, e, as actual observed severities contain a random component. If logs are taken of both sides of the equation, the function is linear in the independent variable, t - t_0:

\ln(Severity_t) = \ln(Severity_{t_0}) + (t - t_0)\ln(I) + e    (2)
As a variation on the exponential trend model, we will model trend as a power function that is close to exponential in form, but increases more rapidly than an exponential trend. In this example, a curve will be fit to average severities where the "true" relationship is of the form:

Severity_t = Severity_{t_0} \, I^{(t - t_0)^p} \, e    (3)

or, for this particular example, using t as the input variable (with t_0 equal to 0) and Y as the output variable:

Y = 10,000 \, I^{\,t^{1.2}}, \qquad I \sim N(1.05, 0.004)    (4)

where Y is the dependent variable, average claim severity, t is the time period from the first observation, I is a random inflation or trend factor, and N(\mu, \sigma) denotes the normal probability distribution with mean \mu and standard deviation \sigma. Because this is a simulated example developed primarily to highlight some of the key features of neural network analysis, the values for this example are likely to be significantly less variable than the severities encountered in actual practice. A sample of forty quarters (10 years) of observations of t and Y was simulated. A scatterplot of the observations is shown in Figure 2. The scatterplot in Figure 3 displays the "true" curve for Y as well as the random Y (severity) values.
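A minimal sketch of this simulation is given below. The chapter does not state the exact time scale used for t; here it is assumed that t is measured in years, observed at quarterly intervals over the 10 years, which reproduces the general range of severities shown in Figures 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Time index: 40 quarterly observations over 10 years.
# (Assumption: t is measured in years; the chapter does not state the scale.)
t = np.arange(1, 41) / 4.0

# Equation (4): Y = 10,000 * I^(t^1.2), with I ~ N(1.05, 0.004) drawn per observation.
I = rng.normal(loc=1.05, scale=0.004, size=t.shape)
severity = 10_000 * I ** (t ** 1.2)

# "True" expected curve uses the mean inflation factor of 1.05.
expected_severity = 10_000 * 1.05 ** (t ** 1.2)

for ti, yi, ei in zip(t[:4], severity[:4], expected_severity[:4]):
    print(f"t={ti:5.2f}  simulated={yi:9.1f}  expected={ei:9.1f}")
```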
Figure 2. Scatterplot of Severity vs. Time.
Figure 3. Scatterplot of Severity and Expected Severity vs. Time.
3.2 A One Node Neural Network
A simple neural network with one hidden layer was fit to the simulated data. In order to compare neural networks to classical models, a regression curve was also fit. The result of that fit will be discussed after the presentation of the neural network results. The structure of this neural network is shown in Figure 4. As neural networks go, this is a relatively simple network with one input node. In biological neurons, electrochemical signals pass between neurons, with the strength of the signal measured by the electrical current. In neural network analysis, the signal between neurons is simulated by software, which applies weights to the input nodes (the data) and then applies an activation function to the weighted values; the node weights of the neural network thus play the role of the neuron signal in the biological neuron system.
Figure 4. One-node neural network: input layer (input data), hidden layer (processes data), output layer (predicted value).
The weights are used to compute a linear sum of the independent variables. Let Y denote the weighted sum:

Y = w_0 + w_1 X_1 + w_2 X_2 + \ldots + w_n X_n    (5)
The activation function is applied to the weighted sum and is typically a sigmoid function. The most common of the sigmoid functions is the logistic function:

f(X) = \frac{1}{1 + e^{-X}}    (6)

The logistic function takes on values in the range 0 to 1. Figure 5 displays a typical logistic curve. This curve is centered at an X value of 0 (i.e., the constant w_0 is 0). Note that this function has an inflection point at an X value of 0 and an f(X) value of 0.5, where it shifts from a convex to a concave curve. Also note that the slope is steepest at the inflection point, where small changes in the value of X can produce large changes in the value of the function. The curve becomes relatively flat at its extremes, as f(X) approaches 0.0 and 1.0.
Figure 5. Logistic function.
Another sigmoid function often used in neural networks is the hyperbolic tangent function, which takes on values between -1.0 and 1.0:

f(X) = \frac{e^{X} - e^{-X}}{e^{X} + e^{-X}}    (7)

In this chapter, the logistic function will be used as the activation function. The Multilayer Perceptron is a multilayer feedforward neural network that uses a sigmoid activation function. The logistic function is applied to the weighted inputs. In this example, there is only one input, therefore the activation function is:

h = f(t; w_0, w_1) = f(w_0 + w_1 t) = \frac{1}{1 + e^{-(w_0 + w_1 t)}}    (8)

This gives the value or activation level of the node in the hidden layer. Weights are then applied to the hidden node (i.e., w_2 + w_3 h). An activation function is then applied to this "signal" coming from the hidden layer:

o = f(h; w_2, w_3) = \frac{1}{1 + e^{-(w_2 + w_3 h)}}    (9)

The weights w_0 and w_2 are similar to the constants in a regression, and the weights w_1 and w_3 are similar to the coefficients in a regression. The output function o for this particular neural network, with one input node and one hidden node, can be represented as a double application of the logistic function:

o = f(f(t; w_0, w_1); w_2, w_3) = \frac{1}{1 + e^{-\left(w_2 + w_3 \left[1 + e^{-(w_0 + w_1 t)}\right]^{-1}\right)}}    (10)
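A short sketch of equations (8) through (10) in code may make the double application of the logistic function concrete. For illustration, the weights shown are the ones fitted later in Table 1; any other values could be substituted.

```python
import math

def logistic(x: float) -> float:
    """Logistic activation, equation (6)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_output(t: float, w0: float, w1: float, w2: float, w3: float) -> float:
    """Output of the one-input, one-hidden-node MLP, equations (8)-(10)."""
    h = logistic(w0 + w1 * t)      # hidden node activation, equation (8)
    return logistic(w2 + w3 * h)   # output node activation, equation (9)

# Example with a normalized input t* = 0.5 and the weights later shown in Table 1.
print(mlp_output(0.5, w0=2.48, w1=-3.95, w2=2.16, w3=-4.73))
```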
It will be shown later in this chapter that the use of sigmoid activation functions on the weighted input variables, along with the second application of a sigmoid function by the output node, is what gives the MLP the ability to approximate nonlinear functions. One other operation is applied to the data when fitting the curve: the data are normalized. Normalization is used in statistics to minimize the impact of the scale of the independent variables on the fitted model. Thus, a variable with values ranging from 0 to 500,000 does not prevail over variables with values ranging from 0 to 10 merely because the former variable has a much larger scale. Various software products will perform different normalization procedures. The software used to fit the networks in this chapter normalizes the data to have values in the range 0.0 to 1.0. This is accomplished by subtracting a constant from each observation and dividing by a scale factor. It is common for the constant to equal the minimum observed value for X in the data and for the scale factor to equal the range of the observed values (the maximum minus the minimum). Note also that the output function takes on values between 0 and 1, while the dependent variable takes on values between -\infty and +\infty (although the probability of negative values for the data in this particular example is nil). (The second function need not be a sigmoid function; a linear function, rather than a sigmoid, is often applied to the output of the hidden node.) When producing predicted values (i.e., the output, o), the information must be renormalized by multiplying by the target variable's scale factor (the range of Y in our example) and adding back the constant (the minimum observed Y in this example).

3.2.1 Fitting the Curve

The process of finding the best set of weights for the neural network is referred to as training or learning. The approach used by most commercial software to estimate the weights is called backpropagation. Backpropagation is performed as follows:

• Each time the network cycles through the training data, it produces a predicted value for the target variable. This value is compared to the actual value for the target variable and an error is computed for each observation;
• The errors are "fed back" through the network and new weights are computed to reduce the overall error.

Despite the neural network terminology, the training process is actually a statistical optimization procedure. Typically, the procedure minimizes the sum of the squared residuals:

\min \sum (Y - \hat{Y})^2    (11)
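The following is a minimal sketch of this squared-error minimization by gradient descent for the one-input, one-hidden-node network. The learning rate, starting weights and iteration count are illustrative choices, not the settings of the commercial software discussed in the chapter, and the inputs are assumed to have been normalized to the range 0 to 1 as described above.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_one_node_mlp(t, y, lr=0.5, n_iter=20_000):
    """Fit o = f(w2 + w3*f(w0 + w1*t)) to normalized data by gradient descent
    on the squared error of equation (11). t and y are assumed scaled to [0, 1]."""
    rng = np.random.default_rng(1)
    w0, w1, w2, w3 = rng.normal(scale=0.5, size=4)   # random starting weights
    for _ in range(n_iter):
        h = logistic(w0 + w1 * t)                    # hidden node
        o = logistic(w2 + w3 * h)                    # output node
        err = o - y                                  # residual for each observation
        # Chain rule: the derivative of the logistic function f is f*(1-f).
        d_out = err * o * (1.0 - o)
        grad_w2 = d_out.sum()
        grad_w3 = (d_out * h).sum()
        d_hid = d_out * w3 * h * (1.0 - h)           # error "fed back" to the hidden node
        grad_w0 = d_hid.sum()
        grad_w1 = (d_hid * t).sum()
        # Move each weight a small step in the direction of smaller error.
        w0 -= lr * grad_w0 / len(t)
        w1 -= lr * grad_w1 / len(t)
        w2 -= lr * grad_w2 / len(t)
        w3 -= lr * grad_w3 / len(t)
    return w0, w1, w2, w3

# Usage: weights = train_one_node_mlp(t_normalized, y_normalized)
```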
Warner and Misra (1996) point out that neural network analysis is in many ways like linear regression, which can be used to fit a curve to data. Regression coefficients are solved for by minimizing the squared deviations between actual observations on a target variable and the fitted value. In the case of linear regression, the curve is a straight line. Unlike linear regression, the relationship between the predicted and target variable in a neural network is nonlinear; therefore, a closed form solution to the minimization problem may not exist. In order to minimize the loss function, a numerical technique such as gradient descent (which is similar to backpropagation) is used. Traditional statistical procedures such as nonlinear regression or the solver in Microsoft Excel use an approach similar to that used by neural networks to estimate the parameters of nonlinear functions. A brief description of the procedure is as follows:

1. Initialize the neural network model using an initial set of weights (usually randomly chosen). Use the initialized model to compute a fitted value for an observation.
2. Use the difference between the fitted and actual value on the target variable to compute the error.
3. Change the weights by a small amount that will move them in the direction of a smaller error.
• This involves multiplying the error by the partial derivative of the function that is being minimized with respect to the weights, because the partial derivative gives the rate of change with respect to the weights. The result is then multiplied by a factor representing the "learning rate," which controls how quickly the weights change. Since the function being approximated involves logistic functions of the weighted input and hidden layers, multiple applications of the chain rule are needed. While the derivatives are a little messy to compute, it is straightforward to incorporate them into software for fitting neural networks.
4. Continue the process until no further significant reduction in the squared error can be obtained.

Further details are beyond the scope of this chapter. However, more detailed information is supplied by some authors (Warner and Misra 1996, Smith 1996). The manuals for a number of statistical packages (for example, SAS Institute 1988) provide an excellent introduction to several numerical methods used to fit nonlinear functions.

3.2.2 Fitting the Neural Network

For the more ambitious readers who wish to create their own program for fitting neural networks, Smith (1996) provides an Appendix with computer code for constructing a backpropagation neural network. Chapter 3 in this volume provides the derivatives, mentioned above, which are incorporated into the computer code. However, the assumption, for the purposes of this chapter, is that the overwhelming majority of readers will use a commercial software package when fitting neural networks. Many hours of development by advanced specialists underlie these tools.

3.2.3 The Fitted Curve

The parameters fitted by the neural network are shown in Table 1.

Table 1. Neural network weights.
                              Constant    Coefficient
Input Node to Hidden Node       2.48        -3.95
Hidden Node to Output Node      2.16        -4.73

To produce the fitted curve from these coefficients, the following procedure must be used:

1. Normalize each input by subtracting the minimum observed value and dividing by the scale coefficient, equal to the maximum observed t minus the minimum observed t. The normalized values will be denoted t*.
2. Determine the minimum observed value for Y and the scale coefficient for Y.
3. For each normalized observation t*_i compute

h(t*_i) = \frac{1}{1 + e^{-(2.48 - 3.95\, t*_i)}}    (12)

4. For each h(t*_i) compute

o(h(t*_i)) = \frac{1}{1 + e^{-(2.16 - 4.73\, h(t*_i))}}    (13)

5. Compute the estimated value for each y_i by multiplying the normalized value from the output layer in step 4 by the Y scale coefficient and adding the Y constant. This value is the neural network's predicted value for y_i.
Figure 6 provides a look under the hood at the neural network's fitted functions. The graph shows the output of the hidden layer node and the output layer node after application of the logistic function. The output of the hidden node is a declining function, while the output of the output node is an increasing curve with an exponential-like shape. Figure 7 displays the final result of the neural network fitting exercise: a graph of the fitted and "true" values of the dependent variable versus the input variable.
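The five-step procedure above can be written out directly. The minimum and range of the simulated severities are not printed in the chapter, so the values passed for them below are illustrative placeholders.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_severity(t, y_min, y_range):
    """Steps 1-5 above, using the fitted weights from Table 1.
    y_min and y_range should be the minimum and range of the observed severities;
    the values used in the call below are placeholders, not the chapter's data."""
    t = np.asarray(t, dtype=float)
    t_star = (t - t.min()) / (t.max() - t.min())   # step 1: normalize the input
    h = logistic(2.48 - 3.95 * t_star)             # step 3: equation (12)
    o = logistic(2.16 - 4.73 * h)                  # step 4: equation (13)
    return y_min + y_range * o                     # step 5: renormalize to the Y scale

t = np.arange(1, 41) / 4.0                         # the 40 simulated time points
fitted = predict_severity(t, y_min=10_000.0, y_range=12_000.0)   # placeholder scale
print(fitted[:3], fitted[-3:])
```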
Figure 6. Hidden and output nodes.

It is natural to compare this fitted value to that obtained from fitting a linear regression to the data. Two scenarios were used in fitting the linear regression. Since Y is an exponential-like function of t, the log transformation is a natural transformation for Y, as discussed above. However, because the relationship between the independent and dependent variable is not strictly exponential, but is close to an exponential relationship, applying the log transformation to Y produces a regression equation that is not strictly linear in both the independent variable and the error term. That is, in this
example, the inflation factor is raised to the power t^{1.2}, so the relationship cannot be made strictly linear. Nonetheless, the log transformation should provide a better approximation to the true curve than fitting a straight line to the data. It should be noted that the nonlinear relationship in this example could be fit using a nonlinear regression procedure, which would address the concern about the log transform not producing a relationship that is linear in the independent variable. The purpose here, however, is to keep the exposition simple and to use techniques with which the reader is familiar.
Figure 7. Neural network fitted and "true" severity.
The table below presents the goodness-of-fit results for both the regression and the neural network. In this example, the "true" value for severity is known, because the data is simulated. Thus, the average squared difference between the fitted and "true" values can be computed. In addition, it is possible to compute the percentage of the variance in the "true" severity that can be explained by the fitted model. Table 2 indicates that the fitted severity for both models was close to the "true" severity. Also note that both the neural network and the regression had an R^2 (measured against the actual, random severities rather than the "true" expected severities) of about 0.86.
Table 2. Results of fit to simulated severity.

                    Average squared error,       % of variance explained,     R^2
                    fitted vs. true severity     fitted vs. true severity
Neural network      47,107                       99.6%                        86.0%
Regression          46,260                       99.6%                        86.0%

The results of this simple example suggest that the exponential regression and the neural network with one hidden node are fairly similar in their predictive accuracy. In general, one might not use a neural network for this simple situation, where there is only one predictor variable and a simple transformation of one of the variables produces a curve that is a reasonably good approximation to the actual data. In addition, if the analyst knew the true function for the curve, a nonlinear regression technique would probably provide the best fit to the data. However, in actual applications, the functional form of the relationship between the independent and dependent variable is often not known.
3.3 The Logistic Function Revisited
The two parameters of the logistic function give it a great deal of flexibility in approximating nonlinear curves. Figure 8 presents logistic curves for various values of the coefficient w_1. The coefficient controls the steepness of the curve and how quickly it approaches its maximum and minimum values. Coefficients with absolute values less than or equal to 1.0 produce curves that are nearly straight lines. Figure 9 presents the effect of varying w_0 on logistic curves. Varying the value of w_0 while holding w_1 constant shifts the curve right or left. A great variety of shapes can be obtained by varying the constant and coefficient of the logistic function. A sample of some of the shapes is shown in Figure 10. Note that the X values on the graphs are limited to the range 0 to 1, since this is what the neural networks use. In the previous example, the combination of shifting the curve and adjusting the steepness coefficient
was used to define a simple nonlinear curve that is close to exponential in shape in the region between 0 and 1.
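A few lines of code make the effect of the constant and coefficient concrete. The parameter pairs below are illustrative choices and are not necessarily the exact curves plotted in Figures 8 through 10.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(0.0, 1.0, 5)        # the 0-to-1 input range used by the network
for w0, w1 in [(0.0, 1.0), (0.0, 10.0), (0.0, -10.0), (-1.0, 5.0), (1.0, 5.0)]:
    values = logistic(w0 + w1 * x)  # shifting (w0) and steepening (w1) the curve
    print(f"w0={w0:+.0f}, w1={w1:+.0f}:", np.round(values, 2))
```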
Figure 8. Logistic function for various values of w_1 (w_1 = -10, -5, -1, 1, 5, 10).
Figure 9. Logistic curve with varying constants (w_0 = -1, 0, 1).
Figure 10. Examples of shapes from logistic curve.
4 Example 2: Using Neural Networks to Fit a Complex Nonlinear Function
To facilitate a clear introduction to neural networks and how they work, the first example in this chapter was intentionally simple. The next example is a somewhat more complicated curve.
4.1 The Chain Ladder Method
Table 3 presents a development triangle for automobile bodily injury paid losses (in order to fit the table on the page, only a subset of the data is shown). This data has been used by property and casualty actuaries to illustrate actuarial approaches to estimating loss reserves and measuring the variability around loss reserves (Francis 1998, Hayne 2002). The data is grouped by accident year and development age. An accident year is the year in which the claim occurred. The development age is the time period after the beginning
of the accident year in which claim data are being evaluated. Thus development age 12 displays cumulative paid losses 12 months (or one year) after the beginning of the accident year and development age 24 displays cumulative paid losses 24 months (or two years) after the beginning of the accident year.

Table 3. Cumulative paid loss triangle.

Accident                  Months of Development
Year        12      24      36      48      60      72
1990      2164   11538   21549   29167   34440   36528
1991      1922   10939   21357   28488   32982   35330
1992      1962   13053   27869   38560   44461   45988
1993      2329   18086   38099   51953   58029
1994      3343   24806   52054   66203
1995      3847   34171   59232
1996      6090   33392
1997      5451
Table 4. Age-to-age factors.

Accident              Months of Development
Year           12       24       36       48       60
1990         5.332    1.868    1.354    1.181    1.061
1991         5.691    1.952    1.334    1.158    1.071
1992         6.653    2.135    1.384    1.153    1.034
1993         7.766    2.107    1.364    1.117
1994         7.420    2.098    1.272
1995         8.883    1.733
1996         5.483
All Yr Avg   6.747    1.982    1.341    1.152    1.055
5 Yr Avg     7.241    2.005    1.341    1.155    1.061
Selected     7.241    2.005    1.341    1.155    1.061
Figure 11 presents a graph of the data on the triangle. It is apparent that as time passes and as development age increases, cumulative paid losses tend to increase. This "development" is due to several factors including: 1) late reporting of claims, 2) investigation and settlement of claims by claims adjusters, and 3) the resolution of claims litigated in the courts. After the passage of many years, all losses for a given accident year have been paid and losses for the year reach an "ultimate" value. While a discussion of the use of loss development factors to evaluate future unpaid losses and estimate loss reserves is beyond the scope of this chapter (see Berquist and Sherman 1977), some introduction to this topic is necessary to understand the next application of neural networks.
Figure 11. Scatterplot of cumulative paid losses vs. development age.
The following notation will be used to describe the loss development process. This notation follows that of England and Verrall (2002). C_{ij} denotes incremental claims, where i denotes the row (in this case, accident year; the rows of triangles can also represent policy years or report years) and j denotes the column (development age):
D_{ij} = \sum_{k=1}^{j} C_{ik}    (14)

where D_{ij} is the cumulative paid losses. The amount of development that occurs between subsequent evaluations of the loss data is measured with development factors:

\lambda_{ij} = \frac{D_{i,j+1}}{D_{ij}}    (15)

Expected or average factors are typically computed using one of the following formulas:

\lambda_j = \frac{\sum_{i=1}^{n-j} D_{i,j+1}}{\sum_{i=1}^{n-j} D_{ij}} \qquad \text{or} \qquad \lambda_j = \frac{1}{n-j}\sum_{i=1}^{n-j} \frac{D_{i,j+1}}{D_{ij}}    (16)
Table 4 displays a triangle of paid loss development factors. These factors measure the increase in paid losses as accident years mature and losses are paid. Thus the factor 5.332 for the 1990 accident year is the result of dividing $11,538 by $2,164. It measures the development of the accident year losses from 12 months after the beginning of the accident year to 24 months after the beginning of the accident year. The all year average of the 12-month to 24-month factors is 6.747 and the average of the latest 5 years' factors is 7.241. If 7.241 is selected as the "expected" future development between ages 12 months and 24 months for losses aged 12 months, we can use this factor to estimate future payments for accident years at 12 months of maturity by multiplying losses at development age 12 by the factor:

\hat{C}_{24} = D_{12}\,\lambda_{12} - D_{12} \qquad \text{or} \qquad \$39,471 = \$5,451 \times 7.241    (17)
That is, the payment from 12 months to 24 months equals cumulative payments as of 12 months multiplied by the 12-month development factor, minus payments as of 12 months. Similarly, the all year average for development between 24 months and 36 months is 2.174. Thus, on average, cumulative paid losses aged 24 months will more than double in the next 12 months. The historic data can be used to measure development from 12 to 24 months, 24 to 36 months, 36 to 48 months, etc. The measure is typically an average, such as an all year weighted average, an all year simple average, an average of the last 5 years, etc. However, judgment is often applied, and the actual factors used may reflect a review of trends in the factors, as well as knowledge of the claims to which the factors are applied. The product of these "age-to-age" factors can then be computed and used to estimate "ultimate" losses:
D_{i,\mathrm{ult}} = D_{ij} \prod_{k=j}^{n} \lambda_k    (18)
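The factor calculations in equations (15) and (16) and the deterministic chain ladder projection of equation (18) can be sketched directly from the Table 3 triangle. The sketch below uses simple all-year averages and assumes no development beyond 72 months (i.e., no tail factor), which is an assumption made only for the illustration.

```python
import numpy as np

# Cumulative paid losses from Table 3 (NaN where not yet observed).
triangle = np.array([
    [2164, 11538, 21549, 29167,  34440,  36528],
    [1922, 10939, 21357, 28488,  32982,  35330],
    [1962, 13053, 27869, 38560,  44461,  45988],
    [2329, 18086, 38099, 51953,  58029, np.nan],
    [3343, 24806, 52054, 66203, np.nan, np.nan],
    [3847, 34171, 59232, np.nan, np.nan, np.nan],
    [6090, 33392, np.nan, np.nan, np.nan, np.nan],
    [5451, np.nan, np.nan, np.nan, np.nan, np.nan],
])

# Age-to-age factors (equation (15)) and their simple all-year averages (equation (16)).
factors = triangle[:, 1:] / triangle[:, :-1]
all_year_avg = np.nanmean(factors, axis=0)
print(np.round(all_year_avg, 3))   # first entry is approximately 6.747, as in Table 4

# Project each accident year to "ultimate" with the selected factors from Table 4,
# assuming development is complete at 72 months (equation (18) with no tail factor).
selected = np.array([7.241, 2.005, 1.341, 1.155, 1.061])
for i, row in enumerate(triangle):
    observed = row[~np.isnan(row)]
    d = len(observed)                                    # number of observed ages
    ultimate = observed[-1] * np.prod(selected[d - 1:])  # remaining selected factors
    print(f"AY {1990 + i}: latest {observed[-1]:,.0f} -> projected ultimate {ultimate:,.0f}")
```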
Table 4 (bottom) displays selected age-to-age factors and the ultimate factors for the automobile bodily injury data. The approach just illustrated for estimating future development on claims that have been incurred by an insurance company is known as the "chain ladder" approach. It is also referred to as the "deterministic chain ladder," as a statistical model does not underlie the calculation. It has been shown by England and Verrall (2002) that the chain ladder method can be reformulated as a generalized linear model (GLM), allowing the machinery of statistical analysis to be used to estimate parameters and assess the quality of the fit. One of the GLM formulations introduced by England and Verrall (2002) is the over-dispersed Poisson. If the chain ladder is modeled as an over-dispersed Poisson:
E(C_{ij}) = m_{ij} = x_i y_j, \qquad \mathrm{Var}(C_{ij}) = \phi\, x_i y_j, \qquad \sum_{k=1}^{n} y_k = 1    (19)
In this formulation, the x's are viewed as accident year ultimate losses, and the y's are the proportions of losses paid in each development year j and, therefore, sum to 1. The variance for the over-dispersed Poisson is proportional to the mean. GLM software, which is widely available, can be used to estimate the parameters of this model. Another model that can be fit with GLM methods is the negative binomial model:
\mathrm{Var}(C_{ij}) = \phi\, \lambda_j (\lambda_j - 1)\, D_{i,j-1}    (20)
The \lambda parameter in this model is analogous to the loss development factors in the deterministic chain ladder. The Poisson and negative binomial models are two of many statistical models for the chain ladder that have appeared in the literature. See England and Verrall (2002) for a review of chain ladder models. An alternative to the chain ladder is to fit a curve to the payment data. This is an application that is well suited to the use of neural networks, with their ability to approximate complex nonlinear functions. It is common to fit a curve to incremental as opposed to cumulative data (England and Verrall 2002, Zehnwirth 1994, Halliwell 1996). Figure 12 displays a scatterplot of the incremental paid losses. The plot indicates that the amount of paid losses tends to increase until about age three years and then decrease.
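As a hedged illustration of the GLM formulation in equation (19), an over-dispersed Poisson chain ladder could be fit roughly as follows. The chapter does not specify software; statsmodels is used here only as an example, and the tiny data frame is built from a few incremental cells derived from Table 3 purely for illustration.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Incremental payments in long format: one row per (accident year, development age) cell.
df = pd.DataFrame({
    "ay":   [1990, 1990, 1990, 1991, 1991, 1992],
    "dev":  [1, 2, 3, 1, 2, 1],
    "incr": [2164, 9374, 10011, 1922, 9017, 1962],
})

# Over-dispersed Poisson chain ladder: log link, Poisson variance function, a factor
# for each accident year and each development age (the x_i and y_j of equation (19)),
# and a dispersion parameter estimated from the Pearson chi-square.
model = smf.glm("incr ~ C(ay) + C(dev)", data=df,
                family=sm.families.Poisson()).fit(scale="X2")
print(model.params)
```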
Figure 12. Scatterplot of incremental paid losses vs. development age.
The chain ladder model requires one parameter for each accident year (the cumulative losses for the year) and one for each development age (an estimate of the development factor). By fitting a curve to the paid loss data, a more parsimonious model can be created. In much of the literature on chain ladder models, the losses are normalized by dividing by an exposure base or volume measure. The dependent variable for the curve fit becomes the normalized incremental paid losses and the independent variable is development age. For automobile insurance, the exposure base is often the number of cars insured. The more vehicles insured, the more claims the insurance company can expect to incur. The curve displayed in Figure 12 is a more complicated function than the previous simple trend function. It contains a "hump" where the curve reaches a maximum and then changes direction. To illustrate more explicitly how neural networks approximate functions, the data was fit using neural networks of different sizes. The results from fitting this curve using two hidden nodes will be described first. Table 5 displays the weights obtained from training
for the two hidden nodes. In the table, w_0 denotes the constant and w_1 denotes the coefficient applied to the input data. The result of applying these weights to the input data and then applying the logistic function is the values for the hidden nodes.

Table 5. Two-node hidden and output node weights.

                   w_0       w_1       w_2
Hidden Node 1    -3.166     7.286
Hidden Node 2     0.657     3.419
Output Node      -2.195    -6.799     4.839
Figure 13 displays the output of the two hidden nodes. The weights produce two upward sloping curves. The curves are then combined using the hidden-node-to-output weights (the weights associated with the Output Node in Table 5).
Figure 13. Hidden node output (left) and neural network fitted values (right).

Figure 13 also displays the fitted values of the neural network. The fitted neural network will have its highest values when hidden
node one has a low value relative to hidden node two. It will have its lowest values when hidden nodes one and two are close in value. The resulting curve has a hump or bell-like shape at low development ages and then declines until it is asymptotic to zero at high development ages. To further illustrate the computation, Table 6 displays the application of the neural network coefficients to the output of the two hidden nodes, followed by the application of the logistic function. The scatterplot of the actual payment data suggests that actual values at development age one are lower than those fitted by the neural network, i.e., that a steeper hump is needed. A three node neural network was fit to the data. The additional hidden node gave the neural network additional flexibility to capture the steepness of the curve at low development ages, and improved the fit of the curve (see Figure 14). The R^2 of the three node neural network was 0.62, compared to an R^2 of 0.44 for the two node network.

Table 6. Computation of neural network output from hidden nodes.

(1)        (2)       (3)       (4)                     (5)                 (6)
Dev Age    Hidden    Hidden    -2.195 - 6.799*(2)      1/(1+exp(-(4)))     0.4 + 88*(5)
(yrs)      Node 1    Node 2    + 4.839*(3)             Logistic Function   Fitted Value
  1        0.038     0.673      0.80309                0.69063             61.80
  2        0.063     0.714      0.83364                0.69713             61.34
  3        0.102     0.756      0.77163                0.68387             59.58
  4        0.160     0.796      0.57434                0.63976             55.56
  5        0.240     0.833      0.20656                0.55146             47.94
  6        0.347     0.864     -0.37093                0.40832             35.36
  7        0.471     0.890     -1.09343                0.25097             21.71
  8        0.602     0.911     -1.88083                0.13229             11.53
  9        0.718     0.929     -2.58426                0.07016              6.29
 10        0.811     0.943     -3.14552                0.04127              3.86
 11        0.879     0.955     -3.54716                0.02800              2.75
 12        0.924     0.964     -3.81083                0.02165              2.21
 13        0.953     0.972     -3.97511                0.01843              1.95
 14        0.972     0.978     -4.07302                0.01674              1.79
 15        0.983     0.983     -4.12546                0.01590              1.73
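The Table 6 calculation can be reproduced directly from the hidden node outputs and the Table 5 weights. The renormalization constants in the final step are read from the Table 6 column heading and should be treated as approximate.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hidden node outputs for development ages 1-15, columns (2) and (3) of Table 6.
h1 = np.array([0.038, 0.063, 0.102, 0.160, 0.240, 0.347, 0.471, 0.602,
               0.718, 0.811, 0.879, 0.924, 0.953, 0.972, 0.983])
h2 = np.array([0.673, 0.714, 0.756, 0.796, 0.833, 0.864, 0.890, 0.911,
               0.929, 0.943, 0.955, 0.964, 0.972, 0.978, 0.983])

# Column (4): apply the output node weights from Table 5 to the hidden node outputs.
weighted = -2.195 - 6.799 * h1 + 4.839 * h2

# Column (5): the logistic activation of the output node.
activation = logistic(weighted)

# Column (6): renormalize to the scale of the payment data (approximate constants).
fitted = 0.4 + 88.0 * activation

print(np.round(weighted[:3], 5))   # approximately 0.803, 0.834, 0.772, as in Table 6
print(np.round(fitted[:3], 2))
```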
It is clear that the three node neural network provides a considerably better fit than the two node network. One of the features of neural networks that affects the quality of the fit, and that the user must often experiment with, is the number of hidden nodes. If too many hidden nodes are used, it is possible that the model will be over-parameterized. However, an insufficient number of nodes could be responsible for a poor approximation of the function.
Figure 14. Three-node neural network fitted normalized payments.
This particular example has been used to illustrate an important feature of neural networks: the multilayer perceptron neural network with one hidden layer is a universal function approximator. Theoretically, with a sufficient number of nodes in the hidden layer, any deterministic continuous nonlinear function can be approximated. In an actual application, if data contain random noise as well as a hidden pattern, it can sometimes be difficult to accurately approximate a curve no matter how many hidden nodes there are. This is a limitation that neural networks share with all statistical procedures.
4.2 Modeling Loss Development Using a Two-Variable Neural Network
In the discussion of loss development triangles and the application of development factors above, both accident year and development age were used to estimate future paid losses. In the previous example, only development age was used to fit a curve to approximate normalized payments. While normalizing loss payments by adjusting for exposure adjusts the data for changes in the volume of business written, it may not account for all the inflationary forces that impact insurance losses over a number of accident years. Certain exposure bases, such as the number of cars insured, do not capture the effects of economic inflation. In addition, insurance inflation is often different from economic inflation due to the impact of judicial decisions, law changes, and social attitudes that affect the propensity to sue. The impact of these insurance inflationary forces can cause the level of losses at a given development age to change from year to year. In general, losses will tend to increase over time, but decreases in insurance costs have occurred and increases may not occur at a constant rate. The development factor approach uses accident year information by using cumulative losses to date as the base for a loss development factor. Similarly, the Poisson GLM uses accident year ultimates that are fitted using payments to date as a base for payout proportions. Thus, a year with higher losses (i.e., cumulative losses or estimated ultimate losses) is expected to have higher future paid losses. To account for accident year effects in the neural network model, accident year was used as an independent variable in the neural network model along with development age. The model was of the following form:

P_{ij} = \frac{C_{ij}}{E_i} = f(i, j)    (21)

where P_{ij} is the payment per unit of exposure for accident period i and development age j, E_i is the exposure for accident year i, i is the accident year variable, and j is the column or development age effect.

Figure 15 displays the neural network curve fitted to the payment data using accident year and development year information. The plot shows that the fitted curve of the payout pattern by development age varies by accident year. The normalized payments of more recent accident years tend to be higher than those of earlier years.
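One way to fit a model of this general form with standard tools is sketched below using scikit-learn. This is an illustrative assumption, not the software used by the author, and the arrays shown are small placeholders rather than the chapter's data. Note that scikit-learn's MLPRegressor applies a linear function at the output node, consistent with the earlier remark that a linear output activation is often used.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# X holds (accident year, development age) pairs and y the normalized incremental
# payments; the values below are hypothetical placeholders for illustration only.
X = np.array([[1990, 1], [1990, 2], [1991, 1], [1991, 2], [1992, 1]], dtype=float)
y = np.array([20.0, 60.0, 18.0, 55.0, 21.0])

# Scale inputs to [0, 1], then fit a one-hidden-layer MLP with logistic activations,
# analogous to P_ij = f(i, j) in equation (21).
X_scaled = MinMaxScaler().fit_transform(X)
net = MLPRegressor(hidden_layer_sizes=(3,), activation="logistic",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(X_scaled, y)
print(net.predict(X_scaled))
```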
Figure 15. Neural network fitted values, using development age and accident year as predictors.
Figure 16 displays the curve derived from fitting the chain ladder model, using a generalized linear model to estimate the parameters. The GLM chain ladder had an R^2 of 0.92, while the neural network had a very similar R^2 of 0.93. Thus, neither model seems to outperform the other using a simple goodness of fit statistic. The purpose of this example is to illustrate the neural network's capabilities in fitting nonlinear models. The fitted neural network can be used to extrapolate values beyond the range of the data, while the chain ladder model cannot. This is of consequence because the historic data on loss development triangles is often incomplete and does not contain observations for development ages in the "tail", i.e., at high development ages, where payments may still occur in a future period.
Figure 16. Chain ladder GLM fitted.
More rigorous procedures for assessing goodness of fit (e.g., holding out a portion of the data for testing, cross-validation, and the bootstrap) are not covered in this chapter. Models other than the chain ladder, such as the Hoerl curve, apply regression methods to fit a curve to loss development data that can estimate "tail" payments. In addition, non-parametric smoothing models can be used in a manner similar to neural networks to fit smooth curves to loss development data. A discussion of these smoothing methods is beyond the scope of this chapter. See England and Verrall (2002).
4.3 Interactions
Another common feature of data that complicates statistical analysis is interactions. An interaction occurs when the relationship between a predictor variable and a target variable depends on the value of another independent variable. For instance, the curve relating normalized payments to development age may depend on the accident year. An example of this is illustrated in Figure 17, which shows the neural network fitted curve for distinct groupings of accident year. The graph makes it evident that when interactions are present, the slope and shape of the curve relating the dependent variable (loss payment) to an independent variable (development age) varies based on the values of a third variable (accident year). It is clear from the graph that the curve has a sharper peak and reaches a higher amount for accident years 1993 through 1997 than for other accident years.
Figure 17. Fitted neural network development curve by accident year.
When interactions are present, the true expected payment cannot be accurately estimated by the simple product of an accident year parameter and a development age parameter. The interaction of the two terms, accident year and development age, may need to be taken into account. In linear regression, interactions are estimated by adding interaction terms to the regression. For a regression in which the effects are additive:

C_{ij} = B_0 + B_i x_i + B_j y_j + B_{ij} x_i y_j    (22)

where C_{ij} is the incremental payment for accident year i and development age j, B_0 is the regression constant, B_i and B_j are the coefficients of the accident year and development age variables, and the B_{ij} are the coefficients of the interaction variables for the accident year-development age interaction. (When effects are multiplicative, as in the chain ladder model, a model with interaction terms applies on a logarithmic scale.) The interaction terms represent the product of the accident year and development age effects. In a linear model, using interaction terms allows the level (or slope, when the independent variable is continuous) of the fitted model to vary by accident year. When using a generalized linear model such as the Poisson GLM, the interaction terms allow the payout percentage for a given development age to vary by accident year. For this application, with only one observation for each accident year and development age, a regression model with interaction terms would fit the data perfectly and would use up all the available degrees of freedom. The selection of a method for estimating future payments from loss development data will depend on the data and on the software tools available for the analysis. For certain kinds of data, such as sparse data where the volume of losses is insufficient for the use of regression or neural network techniques, neither GLMs nor neural networks would be appropriate. For other kinds of data, where the volume of claims is sufficient and trends and patterns are believed to be relatively stable over time, both approaches may be appropriate.
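A sketch of the main-effects and interaction regressions of equation (22) is given below, using the statsmodels formula interface as an illustrative choice of software and a few incremental cells derived from Table 3. With one observation per cell, the interaction model is saturated and leaves no residual degrees of freedom, as noted above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Long-format data: one incremental payment per accident-year / development-age cell
# (a few cells derived from Table 3, used here only for illustration).
df = pd.DataFrame({
    "ay":   [1990, 1990, 1991, 1991, 1992, 1992],
    "dev":  [1, 2, 1, 2, 1, 2],
    "incr": [2164, 9374, 1922, 9017, 1962, 11091],
})

# Main effects only: additive accident-year and development-age terms.
main_effects = smf.ols("incr ~ C(ay) + C(dev)", data=df).fit()

# Adding the accident-year by development-age interaction: the saturated model
# reproduces the data exactly and uses up all available degrees of freedom.
with_interaction = smf.ols("incr ~ C(ay) * C(dev)", data=df).fit()
print(main_effects.df_resid, with_interaction.df_resid)   # 2.0 and 0.0
```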
5 Correlated Variables and Dimension Reduction
A previous section discussed how neural networks approximate functions with a variety of shapes. Another task performed by the hidden layer of neural networks is dimension reduction.
5.1 Factor Analysis and Principal Components Analysis
Workers' Compensation insurance compensates a worker hurt on the job by paying for the necessary medical treatment and by providing wage replacement benefits called indemnity payments. The wage replacement benefits are keyed to the average weekly wage of the employee. Thus, the actual payment for an injury to a worker will depend on the injury (i.e., how much medical treatment is needed), the weekly wage (how much the worker is compensated for each lost work day) and the claim duration (i.e., a surrogate for the length of recovery time). The data used for financial analysis in Workers Compensation and other Property and Casualty lines of insurance contain variables that are correlated. An example would be the age of a worker and the worker's average weekly wage, as older workers tend to earn more. Education is another variable that is likely to be correlated with the worker's income. All of these variables would probably influence Workers Compensation indemnity payments. It could be difficult to isolate the effect of the individual variables because of the correlation between the variables. Another example is the economic factors that drive insurance inflation, such as inflation in wages and inflation in medical care costs. For instance, analysis of quarterly Bureau of Labor Statistics data for hourly wage inflation and medical inflation from January 1994 through May 2000 suggests these two time series have a correlation of about -0.90 (see Figure 18). Other measures of economic inflation can be expected to show similarly high correlations.
Figure 18. Scatterplot of medical vs hourly earnings inflation rates.
Suppose one wanted to combine all the demographic factors related to income level, or all the economic factors driving insurance inflation, into a single index in order to create a model that captured most of the predictive ability of the individual data series. Reducing many factors to one is referred to as dimension reduction. In classical statistics, two similar techniques for performing dimension reduction are Factor Analysis and Principal Components Analysis. Both of these techniques take a number of correlated variables and reduce them to fewer variables that retain most of the explanatory power of the original variables.

5.1.1 Factor Analysis

Assume the values on three observed variables are all "caused" by a single factor plus a factor unique to each variable. Also assume that the relationships between the factors and the variables are linear. Such a relationship is diagrammed in Figure 19, where F1 denotes the common factor, U1, U2 and U3 the unique factors, and X1, X2 and X3 the variables. The causal factor F1 is not observed. Only the variables X1, X2 and X3 are observed. Each of the unique
factors is independent of the other unique factors; thus, any observed correlations between the variables are strictly a result of their relation to the causal factor F1.
Figure 19. Diagram of relationship between factors and three variables.
(Diagrammed: a Social Inflation Factor driving Litigation Rates, Size of Jury Awards, and an Index of State Litigation Environment.)
Figure 20. The relationship between an underlying factor and three observable variables.
As an example, assume an unobserved factor, social inflation, is one of the drivers of increases in claims costs. This factor reflects the sentiments of large segments of the population towards defendants in civil litigation and towards insurance companies as intermediaries in liability claims. Although it cannot be observed or measured, some of its effects can be observed. Examples are the change over time in the percentage of claims being litigated, increases in jury awards, and perhaps an index of the litigation environment in each state created by a team of lawyers and claims adjusters. In the social sciences it is common to use Factor Analysis to measure social and psychological concepts that cannot be directly observed but which can influence the outcomes of variables that can be directly observed. Sometimes the observed variables are indices or scales obtained from survey questions. A hypothetical social inflation scenario is diagrammed in Figure 20. In scenarios such as this one, values for the observed variables might be used to obtain estimates for the unobserved factor. One feature of the data that is used to estimate the factor is the correlations between the observed variables. If there is a strong relationship between the factor and the variables, the variables will be highly correlated. If the relationship between the factor and only two of the variables is strong, but the relationship with the third variable is weak, then only the two variables will have a high correlation. The highly correlated variables will be more important in estimating the unobserved factor. A result of Factor Analysis is an estimate of the factor (F1) for each of the observations. The F1 obtained for each observation is a linear combination of the values for the observed variables for the observation. Since the values for the variables will differ from record to record, so will the values for the estimated factor.

5.1.2 Principal Components Analysis

Principal Components Analysis is in many ways similar to Factor Analysis. It assumes that a set of variables can be described by a smaller set of factors that are linear combinations of the variables.
The correlation matrix for the variables is used to estimate these factors. However, Principal Components Analysis makes no assumption about a causal relationship between the factors and the variables. It simply tries to find the factors or components that seem to explain most of the variance in the data. Thus both Factor Analysis and Principal Components Analysis produce a result of the form:

I = w_1 X_1 + w_2 X_2 + \ldots + w_n X_n    (23)

where I is an estimate of the index or factor being constructed, X_1, ..., X_n are the observed variables used to construct the index, and w_1, ..., w_n are the weights applied to the variables. An example of creating an index from observed variables is combining observations related to litigiousness and the legal environment to produce a social inflation index. Another example is combining economic inflationary variables to construct an economic inflation index for a line of business. (In fact, Masterson created such indices for the Property and Casualty lines in the 1960s.) Factor Analysis or Principal Components Analysis can be used to do this. Sometimes the values observed on variables are the result of, or "caused" by, more than one underlying factor. The Factor Analysis and Principal Components approach can be generalized to find multiple factors or indices when the observed variables are the result of more than one unobserved factor. One can then use these indices in further analyses and discard the original variables. Using this approach, the analyst achieves a reduction in the number of variables used to model the data and can construct a more parsimonious model. Factor Analysis is an example of a more general class of models known as Latent Variable Models. (Principal Components Analysis, because it does not assume an underlying causal factor, is not a latent variable model.) For instance, observed values on categorical variables may also be the result of unobserved
factors. It would be difficult to use Factor Analysis to estimate the underlying factors because it requires data from continuous variables, thus an alternative procedure is required. While a discussion of such procedures is beyond the scope of this chapter, the procedures do exist. It is informative to examine the similarities between Factor Analysis and neural networks. Figure 21 diagrams the relationship between input variables, a single unobserved factor and the dependent variable. In the scenario diagrammed, the input variables are used to derive a single predictive index (Fl) and the index is used to predict the dependent variable. Figure 22 diagrams a neural network being applied to the same data. Instead of a factor or index, the neural network has a hidden layer with a single node. The Factor Analysis index is a weighted linear combination of the input variables, while in the typical neural network, the hidden layer is a weighted nonlinear combination of the input variables. The dependent variable is a linear function of the Factor in the case of Factor Analysis and Principal Components Analysis and (possibly) a non linear function of the hidden layer in the case of the neural network. Thus, both procedures can be viewed as performing dimension reduction. In the case of neural networks, the hidden layer performs the dimension reduction. Since it is performed using nonlinear functions, it can be applied where nonlinear relationships exist.
5.2 Example 3: Dimension Reduction
Both Factor Analysis and neural networks will be fit to data where the underlying relationship between a set of independent variables and a dependent variable is driven by an unobserved factor. An underlying causal factor, Factor1, is generated from a normal distribution:

Factor1 \sim N(1.05, 0.025)    (24)
Figure 21. Using factor analysis to create an index for prediction.
Figure 22. Diagram showing analogy between factor analysis and neural networks.
On average, this factor produces a 5% inflation rate. To make this example concrete, Factor1 will represent the economic factor driving the inflationary results in a line of business, say Workers Compensation. Factor1 drives the observed values of three simulated economic variables: Wage Inflation, Medical Inflation and Benefit Level Inflation. Although unrealistic, in order to keep this example simple it was assumed that no factor other than the economic factor contributes to the value of these variables and that the relationship of the factor to the variables is approximately linear. Also, to keep the example simple, it was assumed that one economic factor drives Workers Compensation results. A more realistic scenario would separately model the indemnity and medical components of Workers Compensation claim severity. The economic variables are modeled as follows:

\ln(WageInflation) = 0.7 \ln(Factor1) + e, \qquad e \sim N(0, 0.005)    (25)
\ln(MedicalInflation) = 1.3 \ln(Factor1) + e, \qquad e \sim N(0, 0.01)    (26)
\ln(BenefitLevelTrend) = 0.5 \ln(Factor1) + e, \qquad e \sim N(0, 0.005)    (27)
Two hundred fifty records of the unobserved economic inflation factor and the observed inflation variables were simulated. Each record represented one of 50 states for one of five years; as a result, in the simulation, inflation varied by state and by year. The annual inflation rate variables were used to compute cumulative inflationary factors and indices. For each state, the cumulative product of the prior year's index and that year's observed inflation measures (the random observed independent variables) was computed. For example, the cumulative unobserved economic factor is computed as:

CumFactor1_t = ∏_{k=1}^{t} Factor1_k    (28)

10 Note that, according to Taylor's theorem, the natural log of a variable whose value is close to one is approximately equal to the variable's value minus one, i.e., ln(1 + x) ≈ x. Thus, the economic variables are, to a close approximation, linear functions of the factor.
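A minimal sketch of the simulation in equations (24)-(28) is given below. It assumes numpy only; the dispersion parameters from the equations are treated here as standard deviations, which is an assumption since the text does not say whether they are variances or standard deviations.

# Sketch of equations (24)-(28): one unobserved economic factor drives three
# observed inflation variables, and cumulative indices are built per state.
# Assumes numpy; 250 records = 50 states x 5 years, as in the text.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_years = 50, 5

# Equation (24): annual factor with mean 1.05 (about 5% inflation).
factor1 = rng.normal(1.05, 0.025, size=(n_states, n_years))

def observed(loading, sigma):
    # Equations (25)-(27): ln(variable) = loading * ln(Factor1) + N(0, sigma)
    # (sigma treated as a standard deviation here)
    return np.exp(loading * np.log(factor1) +
                  rng.normal(0.0, sigma, size=factor1.shape))

wage_inflation    = observed(0.7, 0.005)
medical_inflation = observed(1.3, 0.01)
benefit_trend     = observed(0.5, 0.005)

# Equation (28): cumulative factor per state, CumFactor1_t = prod_k Factor1_k
cum_factor1 = np.cumprod(factor1, axis=1)
cum_wage    = np.cumprod(wage_inflation, axis=1)   # observed cumulative index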
A base severity, intended to represent the average severity over all claims for the line of business for each state for each of the five years, was generated from a lognormal distribution.11 To incorporate inflation into the simulation, the severity for a given state and year was computed as the product of the simulated base severity and the cumulative value of the simulated (unobserved) inflation factor for that state. Thus, in this simplified scenario, only one factor, an economic factor, is responsible for the variation in expected severity over time and between states. The parameters for these variables were selected to make a solution using Factor Analysis or Principal Components Analysis straightforward and are not based on an analysis of real insurance data; the data therefore have significantly less variance than would be observed in actual insurance data.
Note that the correlations between the variables are very high: all pairwise correlations are at least 0.90. This means that the problem of multicollinearity exists in this data set. That is, each variable is nearly identical to the others, up to a constant multiplier, so typical regression procedures have difficulty estimating the parameters of the relationship between the independent variables and severity. Dimension reduction methods such as Factor Analysis and Principal Components Analysis address this problem by reducing the three inflation variables to one, the estimated factor or index.
Factor Analysis was performed on standardized variables. Most Factor Analysis software standardizes the variables used in the analysis by subtracting the mean and dividing by the standard deviation of each series. The coefficients linking the variables to the factor are called loadings. That is:

X1 = b1 Factor1
X2 = b2 Factor1    (29)
X3 = b3 Factor1

where X1, X2 and X3 are the three observed variables, Factor1 is the single underlying factor, and b1, b2 and b3 are the loadings. In the case of Factor Analysis, the loadings are the coefficients linking a standardized factor to the standardized observed variables, not the variables in their original scale. Also, when there is only one factor, the loadings represent the estimated correlations between the factor and each variable. The loadings produced by the Factor Analysis procedure are shown in Table 7.

11 This distribution has an average of $5,000 in the first year (after application of the inflationary factor for year 1). Also, ln(Severity) ~ N(8.47, 0.05).

Table 7. Factor analysis results.
Variable                         Loading    Weights
Wage Inflation Index             0.985      0.395
Medical Inflation Index          0.988      0.498
Benefit Level Inflation Index    0.947      0.113
Table 7 indicates that all the variables have a high loading on the factor, and thus all are likely to be important in the estimation of an economic index. An index value was estimated for each record as a weighted sum of the three economic variables; the weights used by the Factor Analysis procedure to compute the index are shown in Table 7 and (within rounding error) sum to 1.0:

Index = 0.395 (Wage Inflation) + 0.498 (Medical Inflation) + 0.113 (Benefit Level Inflation)

The resulting index was then used as an independent variable to predict each state's severity for each year. The regression model was of the form:

Severity = a + b × Index + e    (30)
where Severity is the simulated severity, Index is the estimated inflation index from the Factor Analysis procedure, and e is a random error term. The results of the regression are discussed below, where they are compared to those of the neural network. The simple neural network diagrammed in Figure 22, with three inputs and one hidden node, was used to predict a severity for each state and year. The relationship between the neural network's predicted value and the independent variables is shown in Figure 23; this relationship is linear and positively sloped. The relationship between the unobserved inflation factor driving the observed variables and the predicted values is shown in Figure 24; it is positively sloped and nearly linear. Thus, the neural network has produced a curve of approximately the same form as the "true" underlying relationship.
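A rough sketch of the factor-based approach in equations (29) and (30) follows: estimate a single factor from the three standardized inflation indices and regress severity on the resulting index. It assumes scikit-learn; the simulated arrays are illustrative stand-ins, not the chapter's data, so the estimated loadings and weights will not match Table 7 exactly.

# Sketch of the factor-index approach of equations (29)-(30).
# Assumes scikit-learn; arrays below are hypothetical placeholders.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
factor = np.cumprod(rng.normal(1.05, 0.025, size=(50, 5)), axis=1).ravel()
wage_idx, med_idx, ben_idx = (factor ** b * np.exp(rng.normal(0, 0.01, 250))
                              for b in (0.7, 1.3, 0.5))
severity = 5000 * factor * rng.lognormal(0.0, 0.05, 250)

# Standardize the observed indices and estimate one factor (loadings, weights).
X = StandardScaler().fit_transform(np.column_stack([wage_idx, med_idx, ben_idx]))
fa = FactorAnalysis(n_components=1).fit(X)
index = fa.transform(X)[:, 0]            # weighted combination, as in Table 7

# Equation (30): regress severity on the estimated index.
reg = LinearRegression().fit(index.reshape(-1, 1), severity)
print(fa.components_, reg.coef_, reg.intercept_)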
Figure 23. Plot of each of the three inflation indices vs. the neural network predicted. (Panels: BenRate, MedCPI and WageCPI, each plotted against NeuralNetworkPredicted, roughly 5000-6500.)
Figure 24. Plot of neural network predicted vs. unobserved inflation factor driving observed inflation in severity. (Scatterplot: neural network predicted, roughly 5200-6200, against the unobserved inflation factor, roughly 1.0-1.3.)
Figure 25 shows the actual and fitted values for the neural network and Factor Analysis models. The figure displays the fitted values compared to the actual randomly generated severities (on the left) and to the "true" expected severities (on the right). The x-axis of the graph is the "true" cumulative inflation factor used to generate the random severities in the simulation; it should be noted that when working with real data, information on an unobserved variable would not be available. The predicted neural network values appear to be more jagged than the Factor Analysis predicted values. This jaggedness may reflect a weakness of neural networks: overfitting. Sometimes neural networks do not generalize as well as classical linear models, and fit some of the noise or randomness in the data rather than the actual patterns. Looking at the graph on the right, which shows both sets of predicted values as well as the "true" value, the Factor Analysis model appears to be the better fit, as it has less dispersion around the "true" value. Although the neural network fit an approximately linear model to the data, the Factor Analysis model performed better on the data used in this example. Since the relationships between the
independent and dependent variables in this example are approximately linear, this is a situation in which a classical linear model would be preferred over a more complicated neural network procedure. However, if the relationships had been nonlinear, neural networks would probably have provided a better approximation.
Figure 25. Left panel displays neural network predicted, factor analysis predicted and actual severities vs. the underlying factor; right panel displays neural network predicted and factor analysis predicted vs. the "true" expected severity. (Both panels plot severities, roughly 4000-7000, against the cumulative factor, roughly 1.0-1.4.)
6 Conclusion
This chapter has gone into some detail in describing neural networks and how they work. The chapter has attempted to remove some of the mystery from the neural network "black box". The author has described neural networks as a statistical tool that minimizes the squared deviation between target and fitted values, much like more traditional statistical procedures do. Examples were provided which showed that neural networks: 1) are universal function approximators, 2) can model complicated data patterns such as interactions between variables, and 3) perform dimension reduction on correlated predictor variables. Classical techniques can be expected to outperform neural network models when the data are well behaved and the relationships are linear, or the variables can be transformed so that the relationships become linear. However, neural networks appear to have an advantage over linear models when applied to complex nonlinear data, an advantage they share with other data mining tools not discussed in detail in this chapter. Practical examples of applications of neural networks to complex data are presented in Chapter 3 of this volume, "Practical Applications of Neural Networks in Property and Casualty Insurance." Note that this chapter does not advocate abandoning classical statistical tools, but rather adding a new tool to the actuarial toolkit. Classical regression performed well in many of the examples in this chapter, and some classical statistical tools, such as Generalized Linear Models, have been applied successfully to problems similar to those presented here.
Acknowledgments
The author wishes to acknowledge Virginia Lambert and Jane Taylor for their encouragement and helpful comments on this project.
References
Berquist, J.R. and Sherman, R.E. (1977), "Loss reserve adequacy testing: a comprehensive, systematic approach," Proceedings of the Casualty Actuarial Society, pp. 10-33.
Berry, M.J.A. and Linoff, G. (1997), Data Mining Techniques, John Wiley and Sons.
Brockett, P.L., Xiaohua, X. and Derrig, R.A. (1998), "Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud," Journal of Risk and Insurance, June, 65:2.
Dhar, V. and Stein, R. (1997), Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall.
Dunteman, G.H. (1989), Principal Components Analysis, SAGE Publications.
Derrig, R.A. (1999), "Patterns, fighting fraud with data," Contingencies, pp. 40-49.
England, P.D. and Verrall, R.J. (2002), "Stochastic claims reserving in general insurance," presented at the Institute of Actuaries, January.
Francis, L. (1998), "Regression methods and loss reserving," presented at the Casualty Actuarial Society Loss Reserve Seminar.
Francis, L. (2001), "Neural networks demystified," Casualty Actuarial Society Forum, Winter, pp. 253-320.
Freedman, R.S., Klein, R.A., and Lederman, J. (1995), Artificial Intelligence in the Capital Markets, Probus Publishers.
Halliwell, L.J. (1996), "Loss prediction by generalized least squares," Proceedings of the Casualty Actuarial Society, pp. 436-489.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
Hayne, R. (2002), "Determining reserve ranges and variability of loss reserves," presented at the Casualty Actuarial Society Loss Reserve Seminar.
Hatcher, L. (1996), A Step by Step Approach to Using the SAS System for Factor Analysis, SAS Institute.
Heckman, P.E. and Meyers, G.G. (1986), "The calculation of aggregate loss distributions from claim severity and claim cost distributions," Proceedings of the Casualty Actuarial Society, pp. 22-61.
Holler, K., Somner, D., and Trahair, G. (1999), "Something old, something new in classification ratemaking with a new use of GLMs for credit insurance," Casualty Actuarial Society Forum, Winter, pp. 31-84.
Hosmer, D.W. and Lemeshow, S. (1989), Applied Logistic Regression, John Wiley and Sons.
Keefer, J. (2000), "Finding causal relationships by combining knowledge and data in data mining applications," presented at the Seminar on Data Mining, University of Delaware.
Kim, J. and Mueller, C.W. (1978), Factor Analysis: Statistical Methods and Practical Issues, SAGE Publications.
Lawrence, J. (1994), Introduction to Neural Networks: Design, Theory and Applications, California Scientific Software.
Martin, E.B. and Morris, A.J. (1999), "Artificial neural networks and multivariate statistics," in Statistics and Neural Networks: Advances at the Interface, Oxford University Press, pp. 195-292.
Masterson, N.E. (1968), "Economic factors in liability and property insurance claims cost: 1935-1967," Proceedings of the Casualty Actuarial Society, pp. 61-89.
Monaghan, J.E. (2000), "The impact of personal credit history on loss performance in personal lines," Casualty Actuarial Society Forum, Winter, pp. 79-105.
Plate, T.A., Bert, J., and Band, P. (2000), "Visualizing the function computed by a feedforward neural network," Neural Computation, June, pp. 1337-1353.
Potts, W.J.E. (2000), Neural Network Modeling: Course Notes, SAS Institute.
SAS Institute (1988), SAS/STAT User's Guide: Release 6.03.
Smith, M. (1996), Neural Networks for Statistical Modeling, International Thompson Computer Press.
Speights, D.B., Brodsky, J.B., and Chudova, D.I. (1999), "Using neural networks to predict claim duration in the presence of right censoring and covariates," Casualty Actuarial Society Forum, Winter, pp. 255-278.
Venables, W.N. and Ripley, B.D. (1999), Modern Applied Statistics with S-PLUS, 3rd ed., Springer.
Viaene, S., Derrig, R., Baesens, B., and Dedene, G. (2002), "A comparison of state-of-the-art classification techniques for expert automobile insurance fraud detection," Journal of Risk and Insurance.
Warner, B. and Misra, M. (1996), "Understanding neural networks as statistical tools," American Statistician, November, pp. 284-293.
Zehnwirth, B. (1994), "Probabilistic development factor models with applications to loss reserve variability, prediction intervals and risk based capital," Casualty Actuarial Society Forum, Spring, pp. 447-606.
Zehnwirth, B. and Barnett, G. (1998), "Best estimate for loss reserves," Casualty Actuarial Society Forum, Fall, pp. 55-102.
Chapter 3
Practical Applications of Neural Networks in Property and Casualty Insurance
Louise A. Francis
This chapter will use two property and casualty insurance applications, fraud detection and underwriting, to illustrate practical issues and problems encountered when using neural networks. Neural networks show promise as a tool for finding patterns in data with complex structures, yet many statisticians, actuaries and other insurance analysts are reluctant to embrace them because of concern about their pitfalls. One criticism is that neural networks operate as a "black box": data goes in, answers come out, but what happens inside the "box" remains a mystery. The output of a neural network reveals little about the relationship between predictor and target variables, so it can be difficult to present the results of neural network modeling to management. This chapter addresses the "black box" issue that arises when using neural networks. It shows how to derive meaningful business intelligence from the fitted model, and it explains how to avoid overparameterization.
1 Introduction
Neural networks are one of the more popular data mining techniques. Chapter 2 of this volume, "An Introduction to Neural Networks in Property and Casualty Insurance," showed that neural networks can be viewed as computationally intensive extensions of better-known statistical techniques. Some of the strengths of neural networks were presented, in particular their ability to approximate complex patterns such as those found in Property and Casualty insurance data. These complexities include nonlinear relationships between the independent and dependent variables, interactions, and correlated independent variables. Perhaps the greatest disadvantage of neural networks is the inability of users to understand or explain them. Because the neural network is a very complex function, there is no accepted way to illustrate the relationships between independent and dependent variables with functions that can be interpreted by data analysts or management. Thus, users must accept on faith the relationships between the independent and dependent variables that give rise to the predictions they get, and trust that the neural network has produced a good prediction. In the words of Berry and Linoff (1997), "Neural networks are best approached as black boxes with mysterious inner workings, as mysterious as the origins of our own consciousness." More conventional techniques such as linear regression result in simple mathematical functions where the relationship between predictor and target variables is clearly described and can be understood by audiences with modest mathematical expertise. The "black box" aspect of neural networks is a serious impediment to more widespread use.
A number of other practical issues arise when using neural networks. For instance, the analyst must choose the number of hidden nodes to include in the neural network. This number should be high enough that the fitted model approximates the data well, but small enough that the model is not overparameterized. In addition, the analyst may wish to eliminate variables from the model that do not make a significant contribution to predicting the dependent variable.
In this chapter, methods for evaluating and interpreting neural networks will be presented. These include techniques for assessing goodness of fit of the model, techniques for assessing the relevance of variables in the model, and techniques for visualizing the functional relationships between the independent and dependent variables. Two applications will be used to illustrate the evaluation and interpretation techniques. The first, the fraud model, will be used to illustrate methods of assessing goodness of fit and methods of determining the importance of predictor variables in explaining the target variable. The second, underwriting, will be used to illustrate techniques for understanding the functional relationships between the independent and target variables.
2 Fraud Example
2.1 The Data
The data for the fraud application was supplied by the Automobile Insurers Bureau of Massachusetts (AIB). The data consists of information on 1400 closed claims from personal automobile claimants. The data are a random sample of Massachusetts PIP claims that occurred in 1993. The database was assembled with the cooperation of ten large insurers. This data has been used by the AIB, the Insurance Fraud Bureau of Massachusetts (IFB), and other researchers to investigate fraudulent claims or probable fraudulent claims (Derrig and Ostaszewski 1995, Weisberg and Derrig 1995, Viaene et al. 2002). Most data mining applications would use a much larger database. However, the AIB PIP data is well suited to illustration of the use of data mining techniques in insurance. Viaene et al. (2002) used the AIB data to compare the performance of
a number of data mining and conventional classification techniques.
A measure of fraud collected in the study was an overall assessment (ASSESS) of the likelihood that the claim was fraudulent or abusive. Each record in the data was assigned a value by an expert, indicating the expert's subjective assessment as to whether the claim was legitimate or whether fraud or abuse was suspected. Experts were asked to classify suspected fraud or abuse claims into the following categories: exaggerated damages, opportunistic fraud, or planned fraud. As shown in Table 1, the assessment variable can take on five possible values. Overall, about one third of the claims were coded by adjusters as probable abuse or fraud claims. In this chapter, all categories of the assessment variable except "probably legitimate" will be treated as suspected fraud. It should be noted, however, that "soft fraud," such as exaggerating the extent of the injury, often does not meet the legal definition of fraud.
This section presents an example where neural networks are used for classification, that is, in a model developed to predict to which of two categories or classes each claim belongs. The dependent variable for the model is ASSESS, the adjuster's assessment of the likelihood that the claim was fraudulent or abusive. The values range from 1 (probably legitimate) to 5 (the various kinds of suspected fraud or abuse). This variable was then converted to a binary dependent variable: if a claim was coded as anything other than probably legitimate, it was treated as a suspected fraud.

Table 1. Assessment variable.
Value   Assessment
1       Probably legitimate
2       Excessive treatment only
3       Suspected opportunistic fraud, no injury
4       Suspected opportunistic fraud, exaggerated injury
5       Suspected planned fraud
When the dependent variable is binary, it can take on one of two possible values; for the purposes of this analysis the values are 0 (legitimate) or 1 (suspected fraud or abuse). Ordinary least squares regression can be performed by regressing a binary variable on the predictor variables, but a more common procedure when the dependent variable is binary is logistic regression. Suppose that the true target variable is the probability that a given claim is abusive, denoted p(x). The model relating p(x) to a vector of independent variables x is:

ln( p(x) / (1 − p(x)) ) = B0 + B1 x1 + ... + Bn xn
where the quantity ln(p(x)/(1 − p(x))) is known as the logit function. Logistic regression produces scores that lie between zero and one, consistent with viewing the model's score as a probability. If one does not use a logistic regression approach, but leaves the dependent variable at its original binary values when fitting a model, predicted values less than zero or greater than one can result. One solution is to truncate the predicted values at zero and one. Another is to add the extra step of fitting a logistic regression using the neural network predicted value as the independent variable and the binary ASSESS variable as the dependent variable; the fitted probabilities from the logistic regression can then be assigned as a score for the claim. In this example, logistic regression was applied to the neural network's predicted values to convert them into probabilities.
Two kinds of predictor variables were used in the analysis. The first category is red flag variables. These are subjective variables intended to capture features of the accident, injury or claimant that are believed to be predictive of fraud or abuse. Many red flag variables represent accumulated industry wisdom about which indicators are likely to be associated with fraud. The data on these variables represent an adjuster's subjective assessment of a red flag indication of fraud, such as "claimant appeared to be claim wise." These variables are binary, that is, they are either true or false. Such red flag variables are often used to target certain claims for further investigation. The data for these red flag variables are not part of the claim file; they were collected as part of the special effort undertaken in assembling the AIB database for fraud research.
The red flag variables were supplemented with claim file variables deemed to be available early in the life of a claim and therefore of practical value in predicting fraud. The variables selected for this example are the same as those used by Viaene et al. (2002) in their comparison of statistical and data mining methods; these same variables were also used by Francis (2003) to compare two data mining techniques. While a much larger number of predictor variables is available in the AIB data for modeling fraud, the red flag and objective claim variables selected by Viaene et al. were chosen because of their early availability. They are therefore likely to be useful in predicting fraud soon enough in the claim's lifespan for action to mitigate the cost of the claim to be effective. Tables 2 and 3 present the claim file variables and the red flag variables. Note that one of the claim file variables, treatment lag, was missing values on a significant number of records. For this reason, an additional dummy variable was created to indicate the presence or absence of a value for the treatment lag.
One of the objectives of this research is to investigate which variables are likely to be of greatest value in predicting fraud. To do this, procedures were needed for evaluating the importance of independent variables in predicting the target variable.
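Before turning to variable importance, the scoring approach described above (a network fit to the binary target, followed by a logistic-regression calibration of its output) can be sketched as follows. The sketch assumes scikit-learn, and the feature matrix is a hypothetical placeholder, not the AIB data.

# Sketch of the scoring approach described above.  Assumes scikit-learn;
# X (claim features) and fraud (0/1 ASSESS indicator) are placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(1400, 10))                       # stand-in for the claim data
fraud = (X[:, 0] + rng.normal(size=1400) > 0).astype(int)

# Step 1: regression-style network on the 0/1 target (raw fits may fall
# outside the [0, 1] range).
net = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000,
                   random_state=0).fit(X, fraud)
raw_score = net.predict(X)

# Step 2: logistic regression of the binary target on the network output,
# i.e., ln(p/(1-p)) = b0 + b1 * raw_score, to obtain probability scores.
calib = LogisticRegression().fit(raw_score.reshape(-1, 1), fraud)
prob_fraud = calib.predict_proba(raw_score.reshape(-1, 1))[:, 1]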
Table 2. Objective claim file variables.
Variable    Description
AGE         Age of claimant
POLLAG      Lag from inception date of policy to date reported
RPTLAG      Lag from date of accident to date reported
TREATLAG    Lag from date of accident to earliest treatment by service provider
AMBUL       Ambulance charges
PARTDIS     Is the claimant partially disabled?
TOTDIS      Is the claimant totally disabled?
LEGALREP    Is the claimant represented by an attorney?
TRTMIS      Is treatment lag missing?

Table 3. Red flag variables.
Subject       Indicator variable   Description
Accident      ACC01                No report by police officer at scene
              ACC04                Single vehicle accident
              ACC09                No plausible explanation for accident
              ACC10                Claimant in old, low valued vehicle
              ACC11                Rental vehicle involved in accident
              ACC14                Property damage was inconsistent with accident
              ACC15                Very minor impact collision
              ACC16                Claimant vehicle stopped short
              ACC19                Insured felt set up, denied fault
Claimant      CLT02                Had a history of previous claims
              CLT04                Was an out of state accident
              CLT07                Was one of three or more claimants in vehicle
Injury        INJ01                Injury consisted of strain or sprain only
              INJ02                No objective evidence of injury
              INJ03                Police report showed no injury or pain
              INJ05                No emergency treatment was given
              INJ06                Non-emergency treatment was delayed
              INJ11                Unusual injury for auto accident
Insured       INS01                Had history of previous claims
              INS03                Readily accepted fault for accident
              INS06                Was difficult to contact/uncooperative
              INS07                Accident occurred soon after effective date
Lost wages    LW01                 Claimant worked for self or a family member
              LW03                 Claimant recently started employment
2.2 Testing Variable Importance
Because of the complicated functions involved in neural network analysis, interpretation of the variables is more challenging than for classical statistical models. One approach (Potts 1999) is to examine the weights connecting the input variables to the hidden layer nodes. Weights that are closest to zero are viewed as least important; a variable is deemed unimportant only if all of its connections are near zero. Figure 1 displays the absolute values of the weights connecting the input layer (independent variables) to the hidden layer. A table listing the variables that correspond to the numbers displayed on the graph is provided in Appendix 1. Since there are five nodes in the hidden layer, there are five weights for each variable. Using this procedure, several variables appear to have low weights on all hidden layer nodes and might be deemed "unimportant." This procedure is typically used to eliminate variables from a model, not to quantify their impact on the outcome.
Potts points out that this procedure has a number of limitations. Large weights do not necessarily mean that the variables are important, and small weights do not guarantee that they are unimportant. Figure 1 indicates that all weights for the 3rd and 4th variables (TRTLAG and POLLAG) are "small," so these variables could be pruned from the model; yet in the next procedure described, these two variables rank in the top 11 (out of 33) in importance. Another limitation of examining weights is that they do not provide a means for evaluating the relative importance of variables. There is no established mechanism for combining the weights into a summary statistic that can meaningfully be used to evaluate the variables' importance; in other words, the sum of the weights (which is shown in Figure 1) has no statistical meaning.
The next procedure introduced provides a means for assessing variable importance. The measure is referred to as sensitivity and is described by Potts (1999). The sensitivity measures how much the predicted value's error increases when the variables are excluded from the model one at a time. However, instead of actually excluding variables and refitting the model, which could be very time consuming, they are fixed at a constant value. The sensitivity is computed as follows:
1. Hold one of the variables constant, say at its mean or median value.
2. Apply the fitted neural network to the data with the selected variable held constant.
3. Compute the squared error for each observation produced by these modified fitted values.
4. Compute the average of the squared errors and compare it to the average squared error of the full model.
5. Repeat this procedure for each variable used by the neural network. The sensitivity reflects how much the error increases when a variable is, in effect, excluded, relative to the error of the full model.
6. If desired, the variables can be ranked based on their sensitivities.
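A minimal sketch of this sensitivity calculation is given below. It assumes a scikit-learn-style fitted model with a predict method; the arguments are whatever model, feature matrix and target were used in fitting.

# Sketch of the sensitivity procedure listed above: hold each predictor at its
# median, re-score with the already-fitted network, and compare squared errors
# with those of the full model.  Assumes numpy arrays and a model.predict method.
import numpy as np

def sensitivities(model, X, y):
    base_mse = np.mean((y - model.predict(X)) ** 2)       # error of the full model
    ratios = {}
    for j in range(X.shape[1]):
        X_fixed = X.copy()
        X_fixed[:, j] = np.median(X[:, j])                # step 1: hold variable constant
        mse_j = np.mean((y - model.predict(X_fixed)) ** 2)  # steps 2-4
        ratios[j] = mse_j / base_mse                      # ratio > 1: variable matters
    return dict(sorted(ratios.items(), key=lambda kv: -kv[1]))  # step 6: rank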
Figure 1. Plot of weights to hidden layer nodes.
Since the same set of parameters is used to compute the sensitivities for all variables, this procedure does not require the user to refit the model each time a variable's importance is being evaluated. Table 4 presents the sensitivities of the neural network model fitted to the fraud data. The sensitivity statistic was computed as a ratio of the total squared error for the model, excluding the variable, to the total squared error for the model, including the variable. A higher ratio indicates a more significant impact. In the neural network model, the sensitivity statistic indicates that the involvement of a lawyer is the most important variable in predicting fraud; this variable has a sensitivity ratio well above that of any of the others. Table 5 provides further insight into the relationship between LEGALREP and the dependent variable. The table shows that 61% of claims with legal representation are probably fraudulent, while 89% of claims with no representation are probably legitimate.

Table 4. Neural network variable ranking and sensitivity.
Rank  Variable   Ratio      Rank  Variable   Ratio
1     LEGALREP   1.4514     18    INS06      1.0091
2     TRTMIS     1.0905     19    ACC10      1.0076
3     AMBUL      1.0872     20    INS01      1.0068
4     AGE        1.0666     21    CLT04      1.0061
5     PARTDIS    1.0605     22    INJ11      1.0054
6     RPTLAG     1.0389     23    INJ05      1.0053
7     ACC04      1.0354     24    INJ03      1.0042
8     POLLAG     1.0329     25    CLT07      1.0041
9     CLT02      1.0304     26    ACC09      1.0029
10    INJ01      1.0291     27    INS03      1.0023
11    TRTLAG     1.0286     28    ACC19      1.0020
12    ACC01      1.0206     29    ACC16      1.0019
13    ACC14      1.0162     30    LW01       1.0019
14    INJ02      1.0160     31    INS07      1.0006
15    TOTDIS     1.0143     32    ACC11      1.0003
16    INJ06      1.0137     33    LW03       1.0000
17    ACC15      1.0107
Table 5. Crosstabulation of ASSESS and LEGALREP.
LEGALREP    Probably Legitimate    Probably Fraud
No          89%                    11%
Yes         39%                    61%
The variable TRTMIS, treatment lag missing, ranks second in importance. The value on this variable is missing when the claimant has not been to an outpatient health care provider, although in over 95% of these cases the claimant has visited an emergency room. Note that both medical paid and total paid for this group are less than one third of the medical paid and total paid for claimants who have visited a provider. Table 6 is a cross-tabulation of the values of TRTMIS versus ASSESS. It indicates that 97% of claims with treatment lag missing are probably legitimate, while 51% of claims not missing on this variable are probably fraudulent. Note that the actual lag in obtaining treatment ranks far lower in importance.

Table 6. Crosstabulation of ASSESS and TRTMIS.
TRTMIS    Probably Legitimate    Probably Fraud
No        49%                    51%
Yes       97%                    3%
Table 7 presents a cross-tabulation for AMBUL, the ambulance cost variable. The table suggests that a higher proportion of claims with no ambulance costs are probably fraudulent, but the effect does not appear to be as strong as for LEGALREP or TRTMIS. Note that the variable may be related to the dependent variable in a complicated way, i.e., the function may be nonlinear and may involve interactions with other variables, and a simple cross-tabulation will not uncover the more complicated relationship. These simple descriptive statistics for several of the highest ranking variables indicate that the neural network analysis uncovered relationships between independent and dependent variables that help us understand the fitted model.

Table 7. Crosstabulation of ASSESS and AMBUL.
AMBUL      Probably Legitimate    Probably Fraud
0          60%                    40%
1-200      82%                    18%
201-300    71%                    29%
301-400    69%                    31%
>400       81%                    19%
A task that is often performed when using conventional statistical procedures is determining which variables are important enough to keep in the model. To develop a parsimonious model it is not sufficient to know the variables' importance ranking; rather, a test of each variable's significance to the goodness of fit of the model is needed. One approach is to create training and testing samples. Most neural network software allows the user to hold out a portion of the sample for testing, because most modeling procedures fit the sample data better than they fit new observations that were not in the sample. One can typically improve the fit on the sample by adding variables, even when the added variables do not improve the fit on out-of-sample data. The data are separated into training and test data.1 The model is fitted using the training data and tested using the test data. The testing helps to determine how well the dependent variable is predicted on data not used for fitting, and the test data can be used to determine whether the goodness of fit improves or deteriorates as variables are added to or subtracted from the model. As stated previously, this fraud application involves a classification problem. That is, the model can be used to classify claims into

1 A more rigorous testing procedure is to create three samples: one for training, one for validation or fine-tuning of the model (such as deciding how many variables to retain in the model), and one for testing the goodness of fit of the model.
the categories "probably legitimate" and "probably fraudulent." The classification in this example is based on the predicted values from the neural network. Whenever the predicted value exceeds 50%, claims are classified as probably fraudulent; otherwise they are classified as probably legitimate. These classifications are then compared to the actual values of the ASSESS variable to compute the percent correctly classified. This statistic was computed for the test data while varying the number of variables in the model. Variables were eliminated from the model based on their rank in the sensitivity test. In addition, the squared correlation coefficient between the neural network's predicted values for the claims and the actual values was computed. Table 8 presents the results, ranked by number of variables in the model. The table indicates that the model achieves its maximum goodness of fit at around 9 or 10 variables. Thus, it seems unnecessary to include more than 10 variables in the model, although the model's performance on out-of-sample claims does not appear to degrade when additional variables are added.
The table indicates that a good model for predicting fraud could be constructed from about 10 variables. Of the top 10 variables, based on the sensitivity test, three are red flag variables: CLT02 (claimant has a prior history of claims), ACC04 (single vehicle accident), and INJ01 (injury was a sprain or strain only). According to this test, seven of the top ten predictors are claim file variables. This may indicate that even when a company does not have the resources to collect subjective fraud-related information on claims, there may be value in using claim file information alone to build a fraud model.
Another procedure that can be used to test the fit of a model is cross-validation, which might be preferred when applying neural networks to a small database. Cross-validation involves iteratively holding out part of the sample, fitting the model to the remainder, and testing the accuracy of the fitted model on the held-out portion. For instance, the sample may be divided into ten groups: nine of the groups are used to fit the model and one is used for testing. The process is repeated ten times, and the goodness of fit statistics for the ten test samples are averaged. Though the reader should be aware of cross-validation, it was not used in this example.

Table 8. Goodness of fit on test data by number of variables.
Number of Variables in Model    R²: Coefficient of Determination    Percentage Correct
1        0.195    72.6%
2        0.253    74.1%
3        0.267    74.3%
4        0.270    75.2%
5        0.290    76.0%
6        0.282    76.4%
7        0.307    76.2%
8        0.274    74.9%
9        0.328    77.7%
10       0.331    78.2%
11       0.332    78.4%
12       0.365    78.2%
13       0.326    77.9%
15       0.338    78.2%
20       0.340    77.7%
26       0.362    78.3%
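The ten-fold cross-validation procedure described above (not actually used in this chapter) can be sketched as follows. It assumes scikit-learn; the feature matrix and binary target are hypothetical placeholders rather than the AIB data.

# Sketch of ten-fold cross-validation as described above.  Assumes scikit-learn;
# X and fraud are placeholder data, not the actual claims.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1400, 10))                      # placeholder claim features
fraud = (X[:, 0] + rng.normal(size=1400) > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
    clf.fit(X[train_idx], fraud[train_idx])
    scores.append(clf.score(X[test_idx], fraud[test_idx]))   # fraction correct per fold
print(np.mean(scores))                               # averaged goodness of fit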
A more comprehensive introduction to the techniques and uses of fraud modeling can be found in the many references on that subject found at the end of this chapter. Derrig (2002) provides an excellent overview of some of the literature on fraud modeling and discusses how analytical models can be used to sort claims for further processing by adjusters.
3 Underwriting Example
In this example, additional methods for understanding the relationship between the predictor variables and the target variable when using neural networks will be introduced. The example makes use of underwriting and ratemaking variables to model an underwriting outcome variable: claim severity.2 The data in this example were produced by simulation and contain nonlinearities, interactions, correlated variables, and missing values, i.e., the complexities that are often encountered in actual practice. The predictor variables in the simulated data are intended to represent those that would be available when underwriting or rating a policy.
A random sample of 5,000 claims was simulated. The sample represents six years of claims history. (A multiyear period was chosen so that inflation could be incorporated into the example.) Each claim represents a personal automobile claim severity developed to ultimate settlement.3 As an alternative to using claims developed to ultimate, an analyst might use a database of claims that are all at the same development age.4 Random claim values were generated from a lognormal distribution. In the simulation, the scale parameter, μ, of the lognormal varies with the characteristics of the policyholder and with factors, such as inflation, that affect the entire line of business simultaneously. The policyholder and line of business characteristics in the simulation were generated by eight variables. The μ parameter itself has a probability distribution; a graph of its distribution in the simulated sample is shown in Figure 2, and the parameter had a standard deviation of approximately 0.47. The objective of the analysis is to distinguish high severity policyholders from low severity policyholders. This translates into an estimate of μ which is as close to the "true" μ as possible.

2 Severity is the cost of the claim, given that a claim has occurred.
3 The analyst may want to use neural networks or other data mining techniques to develop the data.
4 For readers who are not familiar with Property and Casualty insurance terminology, a discussion of ultimate losses and loss development is provided in Chapter 2 of this volume.
Figure 2. Histogram of severity μ parameter.
Table 9 lists the eight variables used to generate the data (by affecting the μ parameter) in this example. These variables are not intended to serve as an exhaustive list of factors associated with loss exposure for the personal automobile line; rather, they are examples of the kinds of variables one could incorporate into an underwriting or ratemaking analysis. A ninth variable (labeled Bogus) has no causal relationship to average severity. It is included as a noise variable to test the effectiveness of the neural network procedure: an effective prediction model should be able to distinguish between meaningful variables and variables that have no relationship to the dependent variable.
Note that in the analysis of the data, two of the variables used to create the data are unavailable to the analyst, as they represent unobserved inflation factors, i.e., the auto bodily injury and auto property damage/physical damage underlying inflation factors (see the discussion of Factor Analysis in the previous chapter, "Introduction to Neural Networks in Property and Casualty Insurance"). Instead, six inflation indices that are correlated with the unobserved factors are available to the analyst for modeling. The inflation indices include a hospital cost index and a medical services index that measure medical cost inflation, and a wage cost index that measures economic damages associated with wage loss. These indices are the actual variables the neural network uses as predictors to capture the inflation effect on claim severity. The variables are further described in Appendix 2 at the end of this chapter.

Table 9. Variables and factors used to simulate severity data.
Variable                     Variable Type    Number of Categories    Missing Data
Age of Driver                Continuous       -                       No
Territory                    Categorical      45                      No
Age of Car                   Continuous       -                       No
Car Type                     Categorical      4                       No
Credit Rating                Continuous       -                       Yes
Auto BI Inflation Factor     Continuous       -                       No
Auto PD Inflation Factor     Continuous       -                       No
Law Change                   Categorical      2                       No
Bogus                        Continuous       -                       No
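As a purely illustrative sketch, and not the author's actual simulation design or parameters, the following shows how rating variables of the kind listed in Table 9 could shift the lognormal μ parameter from which a claim severity is drawn. It assumes numpy; every coefficient is a made-up placeholder.

# Illustrative only: rating variables shift the lognormal scale parameter mu,
# and the claim severity is drawn from the resulting distribution.
# Assumes numpy; all effects and the dispersion value are placeholders.
import numpy as np

rng = np.random.default_rng(5)
n = 5000
driver_age = rng.integers(16, 90, n)
car_type   = rng.integers(0, 4, n)
car_age    = rng.integers(0, 20, n)
bogus      = rng.normal(size=n)                     # noise variable, no effect on mu

mu = (8.0
      + 0.01 * np.abs(driver_age - 45)              # hypothetical driver-age effect
      + np.array([0.0, 0.1, 0.2, 0.3])[car_type]    # hypothetical car-type effect
      - 0.02 * car_age)                             # hypothetical car-age effect
severity = rng.lognormal(mean=mu, sigma=1.0)        # claim-level dispersion is a placeholder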
3.1 Neural Network Analysis of Simulated Data
The dependent variable for the model was the log of severity. A general rule in statistics is that variables showing significant skewness should be transformed to approximate normality before fitting, and the log transform is a common way of accomplishing this. In general, Property and Casualty severities are positively skewed; the data in this example have a skewness of 10, which is relatively high. Figure 3, a graph of the distribution of the log of severity, indicates that approximate normality is attained after the data are logged.
120
L. A. Francis
^m 3.6
4.6
5.5
™*\ 6.5
7.4
8.4 9.3 In(Severity)
10.3
11.2
12.2
13.1
Figure 3. Histogram of Log Severity.
The data were separated into a training database of 4,000 claims and a test database of 1,000 claims. A neural network with 6 nodes in the hidden layer was run on the 4,000 claims in the training database. As will be discussed later, this network was larger than the final fitted network; it was used to rank variables in importance and to eliminate variables. Because the amount of variance explained by the model is relatively small (13%), the sensitivities were also relatively small. Table 10 displays the results of the sensitivity test for each of the variables. These rankings indicate that the Bogus variable had no effect on the dependent variable, and it was therefore eliminated from the model. Despite their low sensitivities, the inflation variables were not removed. Their low individual sensitivities were probably a result of the high correlations of the variables with each other; when tested as a group, the inflation variables had a sensitivity of 1.014, significantly higher than their individual sensitivities. In addition, it was deemed necessary to include a measure of inflation in the model. Since the neural network's hidden layer performs dimension reduction on the inflation variables, in a manner analogous to Factor or Principal Components Analysis, it seemed appropriate to retain these variables.

Table 10. Results of sensitivity test.
Rank    Variable                     Sensitivity Ratio
1       Credit                       1.098
2       Car Age                      1.073
3       Car Type                     1.057
4       Age                          1.038
5       Territory                    1.023
6       Law Effect                   1.022
7       Hospital Cost Index          1.002
8       Bogus                        1.002
9       Medical Services Index       1.002
10      Car Part Cost Index          1.000
11      Other Services Cost Index    1.000
12      Wage Cost Index              1.000
13      Car Body Cost Index          1.000
One peril that is present with neural network models is overfitting. As more hidden layer nodes are added to the model, the fit to the training data improves and the R² of the model increases. However, the model may simply be fitting the idiosyncrasies of the training data, so its results may not generalize well to a new database. A rule of thumb for the number of intermediate nodes to include in a neural network is to use one half of the number of variables in the model. After eliminating the Bogus variable, 12 variables remained in the model, so the rule of thumb would indicate that 6 nodes should be used. The test data were used to determine how well networks with 3, 4, 5, 6 and 7 hidden nodes performed when presented with new data: each fitted model was used to predict values for the claims in the test data. Application of the fitted models to the test data indicated that a 4-node neural network provided the best model (it produced the highest R² in the test data).
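The hidden-node selection just described can be sketched as follows. It assumes scikit-learn; the training/test arrays are synthetic placeholders standing in for the simulated 4,000/1,000 split of log severities.

# Sketch of hidden-node selection: fit networks with 3-7 hidden nodes on the
# training data and keep the size with the best R-squared on the test data.
# Assumes scikit-learn; the data below are placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 12))                     # placeholder rating variables
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=5000)  # log severity stand-in
X_train, X_test, y_train, y_test = X[:4000], X[4000:], y[:4000], y[4000:]

results = {}
for nodes in (3, 4, 5, 6, 7):
    net = MLPRegressor(hidden_layer_sizes=(nodes,), max_iter=2000,
                       random_state=0).fit(X_train, y_train)
    results[nodes] = net.score(X_test, y_test)      # held-out R-squared
best = max(results, key=results.get)                # 4 nodes in the chapter's example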
3.2 Goodness of Fit
The fitted model had an R² of 13%. This is relatively low, but not out of line with what one would expect given the highly random data in this example. The "true" μ (the expected log severity) has a variance equal to 13% of the variance of the log of severity. (While these percentages happen to be equal, note that the model did not perfectly predict μ, and the measured R² is almost certainly an optimistic estimate of the goodness of fit.) Thus, even with perfect knowledge of μ, one could explain only about 13% of the variance of individual log severities. However, if one had perfect knowledge of the true mean severity for each policyholder, along with knowledge of the true mean frequency, one could charge the appropriate rate for the policy, given the particular characteristics of the policyholder. In the aggregate, with a large number of policyholders, the insurance company's actual experience should come close to the experience predicted from the expected severities and frequencies. With simulated data, the "true" μ for each record is known, so the model's accuracy in predicting the true parameter can be assessed. For this example, the correlation between the neural network's predicted values and the parameter μ (the mean of the logs of severity) is 0.87.
3.3 Interpreting Neural Network Functions: Visualizing Neural Network Results
When a model is used to fit a complex function, one can attempt to understand the fitted function by plotting the fitted value against a predictor variable of interest. However, because the fitted values for a given value of a predictor variable are influenced by the values of many other independent variables, such graphs may not provide much insight into the nature of the relationships between independent and dependent variables. Figure 4 displays the relationship between the neural network predicted value and the policyholder's age.
Figure 4. Scatterplot of predicted log severity versus age. (Vertical axis: predicted ln(severity), roughly 6 to 10; horizontal axis: age, roughly 10 to 50.)
It is difficult to discern the relationship between age and the fitted value for claim severity from this graph. There is a great deal of dispersion of predicted values at any given age, due to the effects of other variables in the model, and this disguises the fitted relationship between age and the dependent variable.
Researchers have been exploring methods for understanding the function fit by a neural network. Recently, a procedure for visualizing neural network fitted functions was published by Plate et al. (2000), who describe their plots as Generalized Additive Model style plots. Rather than attempting to describe Generalized Additive Models, an algorithm for producing the plots is presented below; Hastie et al. (2001), Venables and Ripley (1999), and Plate et al. provide descriptions of Generalized Additive Models. The procedure is implemented as follows:
1. Set all the variables except the one being visualized to a constant value. Means and medians are logical choices for the constants.
2. Apply the neural network function to this dataset to produce a predicted value for each value of the independent variable. Alternatively, one could apply the neural network to a range of values selected to represent a reasonable set of values of the variable. The other variables remain at the selected constant values.
3. Plot the relationship between the neural network predicted value and the variable.
4. Plate et al. recommend scaling all the variables onto a common scale, such as 0 to 1 (the scale of the outputs of the logistic functions in the neural network). In this chapter, variables remain in their original scale.
The result of applying the above procedure is a plot of the relationship between the dependent variable and one of the independent variables; a code sketch of the procedure is given after Figures 5 and 6 below. Multiple applications of the procedure to different variables provide the analyst with a tool for understanding the functional form of the relationships between the independent and dependent variables. As an illustration, the visualization method was applied to the data with all variables set to constants except for driver age; the result is shown in Figure 5. From this graph, we can conclude that the fitted function increases at first and then declines with driver age. Figure 6 shows a similar plot for car age: the function declines with car age, but then increases at older ages.
Figure 5. Neural network age function.
Figure 6. Neural network car age function.
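A minimal sketch of the visualization procedure listed above follows. It assumes scikit-learn-style models and matplotlib; model and X stand for a fitted network and its training matrix.

# Sketch of the GAM-style visualization: hold every predictor at its median
# except the one being examined, score the fitted network over a grid of that
# variable, and plot the result.  Assumes numpy/matplotlib and a model.predict method.
import numpy as np
import matplotlib.pyplot as plt

def partial_plot(model, X, col, label, n_grid=50):
    grid = np.linspace(X[:, col].min(), X[:, col].max(), n_grid)
    X_vis = np.tile(np.median(X, axis=0), (n_grid, 1))   # step 1: constants
    X_vis[:, col] = grid                                 # vary only one input
    plt.plot(grid, model.predict(X_vis))                 # steps 2-3
    plt.xlabel(label)
    plt.ylabel("Neural network fitted value")
    plt.show()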
One of the independent variables in the model is a dummy variable for law change. In the simulation, legislative reform affecting automobile liability costs was passed and implemented after 12 quarters of experience. The dummy variable, intended to capture an intervention effect for the law change, was set to 1 after 12 quarters and was 0 otherwise. The visualization procedure can be employed if the user wishes to obtain an estimated value for the law change. In Figure 7, the line for claims subject to the law change (a value of 1 on the graph) is about 0.27 units below the line of claims not subject to the law change. This suggests that the neural network estimates the law effect at 0.27 on a log scale or about 27%. A 0.27 impact on a log scale corresponds approximately to a multiplicative factor of 1.27, or 0.73 in the case of a negative effect.5 The "true" impact of the law change is a 22% reduction in claim severity, therefore the neural network overestimates the impact.
Figure 7. Impact of law change. (Plot: neural network fitted value by quarter, quarters 1-24, for claims subject and not subject to the law change.)

5 Actually, the effect, when converted from the log scale, is about 30% for the neural network fitted value and 22% for the actual impact.
The visualization procedure can also be used to evaluate the impact of inflation on the predicted value. All variables except the six economic inflation indices were fixed at constant values, while the inflation index variables entered the model at their actual values. This case differs from the previous visualization examples in that six variables, rather than one, were allowed to vary, and the predicted values are plotted against time. Figure 8 shows that the neural network estimated that inflation increased by about 30% during the six-year time period of the sample data, corresponding roughly to an annual inflation rate of about 4.8%. The "true" inflation underlying the model was approximately 5%.
Figure 8. Impact of inflation. (Plot: neural network fitted value, roughly 8.25 to 8.55, by quarter, quarters 0-25.)
One way to visualize two-way interactions is to allow two variables to take on their actual values in the fitting function while keeping the others constant. Figure 9 displays a panel graph for the age and car type interaction. The graph suggests that the function relating policyholder age to severity varies with the value of car type.
Figure 9. Age and car type. (Panel graph: neural network fitted value vs. age, roughly 0-80, with one panel for each of car types 1-4.)
These few examples indicate that visualization can be a powerful tool for understanding the relationships between the independent and dependent variables when using neural networks to find patterns in the data.
3.4 Applying an Underwriting Model
Many Property and Casualty insurance applications of neural networks can utilize predictions of claim severity. A company may want to devise an early warning system to screen newly reported claims for those with a high probability of developing into large settlements. A severity model utilizing only information available early in the life of a claim could be used in an early warning system. A fraud detection system could also be based on claim severity. One approach to fraud detection that was not presented in this chapter is to produce a severity prediction for each claim. The actual value of the claim is compared to the predicted value. Those
with a large positive deviation from the predicted value are candidates for further investigation.
However, many of the potential underwriting applications of neural networks require both a frequency and a severity estimate. A company may wish to prune unprofitable risks from its portfolio, pursue profitable risks, or actually use models to establish rates. For such applications, either the loss ratio or the pure premium6 will be the target variable of interest. There are two approaches to estimating the needed variable. One can develop models to separately estimate frequency and severity and combine the two estimates; an illustration of fitting neural network models to frequencies is provided in Francis (2001). Alternatively, one can estimate a pure premium or loss ratio model directly. One difficulty of modeling pure premiums or loss ratios is that in some lines of business, such as personal lines auto, most policyholders will have no losses, since the expected frequency is relatively low. Because loss ratios and pure premiums are restricted to the range [0, ∞), it is desirable to transform the data onto a scale that does not allow negative predicted values. The log transformation accomplishes this; however, since the natural log is not defined at zero, it may be necessary to add a very small constant to the data before applying the log transform. Once a pure premium is computed, it can be converted into a rate by loading for expenses and profit. Alternatively, the pure premium could be divided by the premium at current rate levels to produce an expected loss ratio, and a decision could be made as to whether the predicted loss ratio is acceptable before writing a risk. Similarly, the loss ratio prediction for a company's portfolio of risks for a line of business can be loaded for expenses and profit, and the insurance company can determine whether a rate increase is needed.

6 The pure premium is the losses divided by the exposure base for the line of business (in personal auto this might be the number of vehicles insured).
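The pure-premium and loss-ratio targets described above can be sketched as follows. It assumes numpy only; the arrays are hypothetical policy-level quantities, and the small constant added before the log transform is an arbitrary placeholder value.

# Sketch of the pure premium / loss ratio targets described above.
# Assumes numpy; all values are hypothetical.
import numpy as np

losses   = np.array([0.0, 0.0, 1200.0, 0.0, 5400.0])   # policy losses
exposure = np.array([1.0, 2.0, 1.0, 1.0, 3.0])         # e.g., insured vehicles
premium  = np.array([900.0, 1700.0, 950.0, 880.0, 2600.0])

pure_premium = losses / exposure
log_pp = np.log(pure_premium + 1e-3)    # small constant so log(0) is defined

loss_ratio = losses / premium           # alternative target for the model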
4 Conclusions
This chapter is a sequel to the previous chapter, which introduced the neural network technique. Whereas Chapter 2 sought to remove some of the "black box" stigma from neural networks by explaining how they work, this chapter addresses another "black box" challenge: understanding the relationship between independent and target variables uncovered by the neural network. Because of the complexity of the functions used in the neural network approximations, neural network software typically does not supply the user with information about the nature of the relationship between predictor and target variables. Two key tools for understanding this relationship, the sensitivity test and a visualization technique, were presented in this chapter. The sensitivity test helps the user rank the variables in importance. The visualization technique helps the user understand the functional relationships fit by the model. Of these techniques, the visualization technique is not commonly available in commercial data mining software. Incorporating the procedures into neural network software should help address the "black box" criticism of neural networks.
Acknowledgments The author wishes to acknowledge the helpful comments of Patricia Francis-Lyon.
Appendix 1

Table A1. Variable order for Figure 1.

1 AGE, 2 RPTLAG, 3 TRTLAG, 4 POLLAG, 5 AMBUL, 6 TRTMIS, 7 LEGALREP, 8 PARTDIS, 9 TOTDIS, 10 ACC01, 11 ACC04, 12 ACC09, 13 ACC10, 14 ACC11, 15 ACC14, 16 ACC15, 17 ACC16, 18 ACC19, 19 CLT02, 20 CLT04, 21 CLT07, 22 INJ01, 23 INJ02, 24 INJ03, 25 INJ05, 26 INJ06, 27 INJ11, 28 INS01, 29 INS03, 30 INS06, 31 INS07, 32 LW01, 33 LW03.
Appendix 2

This appendix is provided for readers wishing a little more detail on the structure of the data in Example 2. The predictor variables are:

Driver age: Age of the driver in years.

Car type: This is intended to represent classifications like compact, midsize, sports utility vehicle and luxury car. There are 4 categories.

Car age: Age of the car in years.

Representative parameters for the Driver age, Car type and Car age variables and their interactions were determined from the Baxter automobile claims database. (This database of automobile claims is available as an example database in S-PLUS; Venables and Ripley supply the S-PLUS data for claim severities in an S-PLUS library. See Venables and Ripley (1999), p. 467.)

Territory: Intended to represent all the territories for 1 state. There are 45 categories.

Credit: A variable using information from the insured's credit history was included in the model. Some recent research has suggested credit information may be useful in predicting personal lines loss ratios. Monaghan (2000) shows that credit history has a significant impact on personal automobile and homeowners' loss ratios. Some insurance companies develop a credit score (perhaps using neural networks) that uses the information from a number of credit history variables. For the purposes of illustrating this technique, it was assumed that the entire impact of the credit variable is on severity, although this is unlikely in practice.

Automobile Bodily Injury (ABI) inflation factor and Automobile Property Damage and Physical Damage (APD) inflation factor: These factors drive quarterly increases in the bodily injury, property damage, and physical damage components of average
severity. They are unobserved factors. The ABI factor is correlated with three observed variables: the producer price index for hospitals, the medical services component of the consumer price index, and an index of average hourly earnings. The APD factor is correlated with three observed variables: the producer price index for automobile bodies, the producer price index for automobile parts, and the other services component of the consumer price index. The ABI factor was given a 60% weight and the APD factor was given a 40% weight in computing each claim's expected severity.

Law change: A change in the law is enacted which causes average severities to decline by 22% after the third year.

Interactions: Table A2 shows the variables with interactions. Three of the variables have interactions. In addition, some of the interactions are nonlinear (or piecewise linear). An example is the interaction between driver age and car age. This is a curve that has a negative slope at older car ages and younger driver ages, but is flat for older driver ages and younger car ages. In addition to these interactions, other relationships exist in the data which affect the mix of values for the predictor variables. Young drivers (<25 years old) are more likely not to have any credit limits (a condition associated with a higher average severity on the credit variable). Younger and older (>55) drivers are more likely to have older cars.

Table A2. Interactions
Driver Age and Car Type
Driver Age and Car Age
Driver Age and Car Age and Car Type
Nonlinearities: A number of nonlinear relationships were built into the data. The relationship between age and severity for certain car types follows an exponential decay. The relationships
between some of the inflation indices and the factors generating actual claim inflation are nonlinear. The relationship between car age and severity is piecewise linear. That is, there is no effect below a threshold age; the effect then increases linearly up to a maximum and remains at that level at higher ages.

Missing data: In our real-life experience with insurance data, values are often missing on variables that have a significant impact on the dependent variable. To make the simulated data in this example more realistic, data is missing on one of the independent variables, the credit variable. For records with missing data, two dummy variables were created with a value of 0 for most of the observations, but a value of 1 for records with a missing value on car age and/or credit information. In addition, a value of -1 was recorded for car age and credit leverage where data was missing. These values were used in the neural network analysis.
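The missing-data coding described above is straightforward to reproduce. The sketch below is a minimal, hypothetical example (invented column names and values) of creating the 0/1 missing-value indicators and the -1 recodes.

```python
import numpy as np
import pandas as pd

# Hypothetical records; NaN marks a missing value.
df = pd.DataFrame({
    "car_age": [3.0, np.nan, 11.0, 6.0],
    "credit":  [0.7, 0.4, np.nan, np.nan],
})

for col in ["car_age", "credit"]:
    # Dummy variable: 1 where the value is missing, 0 elsewhere.
    df[col + "_missing"] = df[col].isna().astype(int)
    # Recode the missing value itself as -1 before feeding the model.
    df[col] = df[col].fillna(-1.0)

print(df)
```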
References

Berry, M.J.A. and Linoff, G. (1997), Data Mining Techniques, John Wiley and Sons.
Brockett, P.L., Xia, X., and Derrig, R.A. (1998), "Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud," Journal of Risk and Insurance, June, 65:2.
Brockett, P.L., Derrig, R.A., Golden, L.L., Levine, A., and Alpert, M. (2002), "Fraud classification using principal component analysis of RIDITs," Journal of Risk and Insurance, (to be published).
Dhar, V. and Stein, R. (1997), Seven Methods for Transforming Corporate Data Into Business Intelligence, Prentice Hall.
Dunteman, G.H. (1989), Principal Components Analysis, SAGE Publications.
Derrig, R.A. (1999), "Patterns, fighting fraud with data," Contingencies, pp. 40-49.
Derrig, R. (2002), "Insurance fraud," Journal of Risk and Insurance.
Derrig, R.A. and Ostaszewski, K.M. (1995), "Fuzzy techniques of pattern recognition in risk and claim classification," Journal of Risk and Insurance, September, 62:3, pp. 447-482.
Derrig, R.A., Weisberg, H., and Chen, X. (1994), "Behavioral factors and lotteries under no-fault with a monetary threshold: a study of Massachusetts automobile claims," Journal of Risk and Insurance, June, 61:2, pp. 245-275.
Derrig, R.A. and Zicko, V. (2002), "Prosecuting insurance fraud: a case study of the Massachusetts experience in the 1990s," Risk Management and Insurance Research.
Francis, L. (2001), "Neural networks demystified," Casualty Actuarial Society Forum, Winter, pp. 253-320.
Francis, L. (2003), "Martian chronicles: is MARS better than neural networks?" Casualty Actuarial Society Forum, Winter.
Freedman, R.S., Klein, R.A., and Lederman, J. (1995), Artificial Intelligence in the Capital Markets, Probus Publishers.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
Hatcher, L. (1996), A Step by Step Approach to Using the SAS System for Factor Analysis, SAS Institute.
Holler, K., Sommer, D., and Trahair, G. (1999), "Something old, something new in classification ratemaking with a new use of GLMs for credit insurance," Casualty Actuarial Society Forum, Winter, pp. 31-84.
Hosmer, D.W. and Lemeshow, S. (1989), Applied Logistic Regression, John Wiley and Sons.
Keefer, J. (2000), "Finding causal relationships by combining knowledge and data in data mining applications," presented at Seminar on Data Mining, University of Delaware.
Kim, J. and Mueller, C.W. (1978), Factor Analysis: Statistical Methods and Practical Issues, SAGE Publications.
Lawrence, J. (1994), Introduction to Neural Networks: Design, Theory and Applications, California Scientific Software.
Martin, E.B. and Morris, A.J. (1999), "Artificial neural networks and multivariate statistics," in Statistics and Neural Networks: Advances at the Interface, Oxford University Press, pp. 195-292.
Masterson, N.E. (1968), "Economic factors in liability and property insurance claims cost: 1935-1967," Proceedings of the Casualty Actuarial Society, pp. 61-89.
Monaghan, J.E. (2000), "The impact of personal credit history on loss performance in personal lines," Casualty Actuarial Society Forum, Winter, pp. 79-105.
Plate, T.A., Bert, J., and Band, P. (2000), "Visualizing the function computed by a feedforward neural network," Neural Computation, June, pp. 1337-1353.
Potts, W.J.E. (2000), Neural Network Modeling: Course Notes, SAS Institute.
SAS Institute (1988), SAS/STAT Users Guide: Release 6.03.
Smith, M. (1996), Neural Networks for Statistical Modeling, International Thompson Computer Press.
Speights, D.B., Brodsky, J.B., and Chudova, D.I. (1999), "Using neural networks to predict claim duration in the presence of right censoring and covariates," Casualty Actuarial Society Forum, Winter, pp. 255-278.
Venables, W.N. and Ripley, B.D. (1999), Modern Applied Statistics with S-PLUS, 3rd ed., Springer.
Viaene, S., Derrig, R., Baesens, B., and Dedene, G. (2002), "A comparison of state-of-the-art classification techniques for expert automobile insurance fraud detection," Journal of Risk and Insurance.
Warner, B. and Misra, M. (1996), "Understanding neural networks as statistical tools," American Statistician, November, pp. 284-293.
Chapter 4

Statistical Learning Algorithms Applied to Automobile Insurance Ratemaking

C. Dugas, Y. Bengio, N. Chapados, P. Vincent, G. Denoncourt, and C. Fournier

The chapter will start from a description of the fundamentals of statistical learning algorithms and highlight how their basic tenets and methodologies differ from those generally followed by actuaries and econometricians. The main premise is that reality is too complex to be captured with a single unifying model, although some aspects may be well approximated by models. Therefore the statistical learning approach does not presume that reality is perfectly captured by a model, or at least tries to minimize the assumptions about the true generating distribution of the data. The approach is empirical: good models will be distinguished from poor models by comparing their predictive and explanatory power on new data. At this point it is interesting to consider that choosing among models may be guided by two different objectives, which sometimes lead to different answers: an operational objective (which model will make the best decisions/predictions on new data), or a "modeling" objective (which model better describes the true underlying nature of the data). We will show an example in which the two approaches lead to different statistical tests and the operational approach makes more conservative decisions (chooses simpler models). Another example of the difference between the two approaches is illustrated by the case of ridge regression: there is a regularized (biased) regression that yields better out-of-sample expected predictions than the maximum likelihood (unbiased) estimator. This example will be used to illustrate the famous bias-variance dilemma that is so pervasive in statistical
learning algorithms. The above discussion and introduction to the principles of statistical learning will naturally bring up the issue of methodology. We will describe and justify the main methodological tools of the statistical learning approach for selecting and comparing models, either based on theoretical bounds or on resampling techniques (such as the cross-validation and bootstrap techniques). A special section on the particular (and rarely discussed) issue of non-stationary data will explain how the above resampling methods can be generalized to data whose distribution varies over time, which is the case with insurance data. In order to evaluate and compare models, one needs to build statistical tools to evaluate the uncertainty in the measurements of out-of-sample performance (due to finite data and non-stationarity). We applied the principles and methodology described above in a research contract we recently conducted for a large North American automobile insurer. This study was the most exhaustive ever undertaken by this particular insurer and lasted over an entire year. We analyzed the discriminative power of each variable used for ratemaking. We analyzed the performance of several statistical learning algorithms within five broad categories: Linear Regressions, GLMs, Decision Trees, Neural Networks and Support Vector Machines. We present the main results of this study. We qualitatively compare models and show how Neural Networks can represent high order nonlinear dependencies with a small number of parameters, each of which is estimated on a large proportion of the data and thus has low variance. We thoroughly explain the purpose of the nonlinear sigmoidal transforms which are at the very heart of Neural Networks' performance. The main numerical result is a statistically significant reduction in the out-of-sample mean-squared error using the Neural Network model. In some provinces and states, better risk discrimination, if not directly implemented because of market share concerns or legislative constraints, can also be used for the purpose of choosing the risks to be sent to "risk-sharing pools." According to these plans, insurers
choose a portion of their book of business which they cede to the pool. Losses (seldom gains) are shared by participants and/or insurers doing business in the province or state of the plan. Since the selection of risks to be sent to the pool has no effect on market share (the insured is unaware of the process) and legislation is generally looser than that of ratemaking, highly discriminative statistical learning algorithms such as Neural Networks can be very profitably used to identify those most underpriced risks that should be ceded to the pool. We compare Generalized Linear Models to our Neural Network based model with respect to their risk-sharing pool performance.
1 Introduction
Ratemaking is one of the main mathematical problems faced by actuaries. They must first estimate how much each insurance contract is expected to cost. This conditional expected claim amount is called the pure premium and it is the basis of the gross premium charged to the insured. This expected value is conditioned on information available about the insured and about the contract, which we call the input profile. Automobile insurance ratemaking is a complex task for many reasons. First of all, many factors are relevant. Taking account of each of them individually, i.e., making independence assumptions, can be harmful (Bailey and Simon 1960). Taking account of all interactions is intractable and is sometimes referred to as the curse of dimensionality (Bellman 1957). In practice, actuarial judgment is used to discover the most relevant of these interactions and feed them explicitly to the model. Neural networks, on the other hand, are well-known for their ability to represent high order nonlinear interactions with a small number of parameters, i.e., they can automatically detect the most relevant interactions between variables (Rumelhart, Hinton and Williams 1986). We explain how and why in section 5. A second difficulty comes from the distribution of claims:
asymmetric, with fat tails: a large majority of zeros and a few unreliable and very large values, i.e., an asymmetric heavy tail extending out toward high positive values. Modeling data with such a distribution is essentially difficult because outliers, which are sampled from the tail of the distribution, have a strong influence on parameter estimation. When the distribution is symmetric around the mean, the problems caused by outliers can be reduced using robust estimation techniques (Huber 1982, Hampel, Ronchetti, Rousseeuw and Stahel 1986, Rousseeuw and Leroy 1987), which basically intend to ignore or down-weight outliers. Note that these techniques do not work for an asymmetric distribution: most outliers are on the same side of the mean, so down-weighting them introduces a strong bias on the estimation of the mean: the conditional expectation would be systematically underestimated. Recent developments for dealing with asymmetric heavy-tail distributions have been made (Takeuchi, Bengio and Kanamori 2002). The third difficulty is due to the non-stationary nature of the relationship between explanatory variables and the expected claim amount. This has an important effect on the methodology to use, in particular with respect to the task of model selection. We describe our methodology in section 4. Fourth, from year to year, the general level of claims may fluctuate heavily, in particular in states and provinces where winter plays an important role in the frequency and severity of accidents. The growth of the economy and the price of gas can also affect these figures. Fifth, one needs sufficient computational power to develop models: we had access to a large database of approximately $8 \times 10^6$ records, and the training effort and numerical stability of some algorithms can be burdensome for such a large number of training examples. Sixth, the data may be of poor quality. In particular, there may be missing fields for many records. An actuary could systematically discard incomplete records, but this leads to a loss of information. Also, this strategy could induce a bias if the absence of data is not random but rather correlated to some particular feature which affects
the level of risk. Alternatively one could choose among known techniques for dealing with missing values (Dempster, Laird and Rubin 1977, Ghahramani and Jordan 1994, Bengio and Gingras 1996). Seventh, once the pure premiums have been established the actuary must properly allocate expenses and a reserve for profit among the different contracts in order to obtain the gross premium level that will be charged to the insureds. Finally, an actuary must account for competitive concerns: his company's strategic goals, other insurers' rate changes, projected renewal rates and market elasticity. In this chapter, we address the task of setting an appropriate pure premium level for each contract, i.e., difficulties one through four as described above. Our goal is to compare different models with respect to their performance in that regard, i.e., how well they are able to forecast the claim level associated to each contract. We chose several models within five broad categories: Linear Regressions, Generalized Linear Models (McCullagh and Nelder 1989), Decision Trees (Kass 1980), Neural Networks and Support Vector Machines (Vapnik 1998). The rest of the chapter is organized as follows: in section 2 we introduce the reader to some of the fundamental principles underlying statistical machine learning, compare them to those that govern more traditional statistical approaches and give some examples. Then, we describe usual candidate mathematical criteria that lead to insurance premium estimation in section 3. Statistical learning methodologies are described in section 4, with an emphasis on the one that was used within the course of the study. This is followed in section 5 by a review of the statistical learning algorithms that we considered, including our best-performing mixture of positive-output Neural Networks. We describe experimental results with respect to ratemaking in section 6. In section 7, we compare two models on the task of identifying the risks to be sent to a risk sharing pool facility. In view of these results we conclude with an examination of the prospects for applying statistical learning algorithms to insurance modeling in section 8.
2 Concepts of Statistical Learning Theory
Statistical inference is concerned with the following: Given a collection of empirical data originating from some functional dependency, provide answers to questions that could be answered if that dependency were known. Although elements of statistical inference have existed for more than 200 years (Gauss, Laplace), it is within the last century that the development of methods and their formal analysis began. Fisher (Fisher 1912, Fisher 1915, Fisher 1922, Fisher 1925, Aldrich 1995) developed the framework of parametric statistics and suggested one method of approximating the unknown parameter values of a particular model: maximum likelihood. Other researchers (Glivenko 1933, Cantelli 1933, Kolmogorov 1933) used a more general approach as they proved that the empirical distribution function converges exponentially to the actual distribution function. Most importantly, this result is independent of the unknown actual distribution function. These two fundamental results can be seen as the seeds of two philosophically diverging frameworks of statistical inference. The goal of the first approach is to identify the data generating process. For the purpose of this modeling goal, one must have sufficient knowledge of the physical laws that govern the process in order to build a corresponding model. The essence of that branch of statistical inference is therefore to estimate the unknown parameter values of a (presumably) known model, using the available collection of empirical data, then to devise statistical tests that lead to rejection or not of the model (or of some of its parameters). For the purpose of parameter estimation, one often adopts the maximum likelihood method, which enjoys attractive asymptotic properties. On the other hand, according to the second approach, one merely attempts to predict properties of future data, based on the already given observed data. The belief is that the reality of the process is too complex to be identified and captured in a single unifying model.
In particular, multivariate processes are faced with the problem of the curse of dimensionality (Bellman 1957), i.e., the number of combinations of variable values increases exponentially with the dimensionality (the number of explanatory variables) of the problem. In real-life problems, such as automobile insurance where one considers dozens of variables, the belief that one can truly identify the generating process looks naive. The goal of the second approach is therefore less ambitious: given a collection of data and a set of candidate functions, find an approximation to the observed unknown process (or a function that can answer the desired questions about the data, such as a future conditional expectation) in order to obtain the best performance on predictive tasks, on new data. In the face of this operational goal, statistical inference had to evolve. The robust approach to parametric statistics appeared in the 1960's (Huber 1964, Huber 1965, Huber 1966, Huber 1968, Huber 1981, Huber and Rieder 1996). In the 1970's, Generalized Linear Models were developed in order to widen the sets of functions to choose from (Nelder and Wedderburn 1972). These models have lately become increasingly popular in the actuarial community. The availability of such wider sets of functions led to the problem of model selection. In the 1980's, many researchers (Breiman, Friedman, Olshen and Stone 1984, Huber 1985, Rumelhart et al. 1986) started to consider special types of functions, nonlinear in their parameters and with fewer distributional assumptions (in particular Decision Trees and Neural Networks), and also developed the regularization method (Tikhonov and Arsenin 1977) as an alternative to the maximum likelihood method, one better suited to the operational goal of achieving the best out-of-sample predictive performance. A branch of statistical learning (or machine learning) is mainly concerned with the development of proper refinements of the regularization and model selection methods in order to improve the predictive ability of algorithms. This ability is often referred to as generalization, since the algorithms are allowed to generalize from the observed training data to new data. One crucial element of the
evaluation of the generalization ability of a particular model is the measurement of the predictive performance results on out-of-sample data, i.e., using a collection of data, disjoint from the in-sample data that has already been used for model parameter estimation. In the case of automobile insurance, where data is not i.i.d. but rather bears a sequential structure with potential non-stationarities (changes in the underlying generating process with respect to its explanatory variables), this requirement leads to the particular methodology of sequential validation, which we shall explain in detail in section 4.
2.1 Hypothesis Testing: an Example
Let us illustrate these concepts with a simple example where the two approaches yield different statistical tests, thus possibly different conclusions. Consider the classical statistical linear regression test for deciding to keep a coefficient. Let the relation $E[Y|x] = \alpha + \beta x$ hold, with $\mathrm{Var}[Y|x] = \sigma^2$ the output noise. The classical statistical test for rejecting the input $X$ (i.e., setting the $\beta$ coefficient to 0 in the model) is based on the null hypothesis $\beta = 0$. In this context, however, one should distinguish two questions: (1) is $\beta$ really equal to zero? (this is what the above classical test tries to determine), or (2) would choosing $\beta = 0$ give better or worse out-of-sample expected generalization than choosing the $\beta$ that minimizes the in-sample error? Let us define the generalization error as the expected out-of-sample error:

$$\mathrm{ESE} = E\big[(Y - (\alpha + \beta X))^2\big]. \qquad (1)$$
If one is more interested in generalization error, then one should not use the classical test, but rather choose an unbiased out-of-sample test (Gingras, Bengio and Nadeau 2000, Nadeau and Bengio 2000). In particular, one can show that if the true $\beta$ is such that the signal-to-noise ratio $\beta^2/\sigma^2$ is less than (greater than) some positive threshold value $\delta$, then setting $\beta$ to zero (respectively, using the in-sample estimator) will generalize better. When the data set has input values $(X_1, X_2, \ldots, X_n)$, and writing the input average $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, the threshold value is

$$\delta = \frac{E[(X - \bar{X})^2]}{\sum_{i=1}^{n} (X_i - \bar{X})^2},$$

where $X$ is an out-of-sample example (not one of the training set $X_i$). Thus the out-of-sample tests tell us to reject a parameter when the signal-to-noise ratio for that parameter is too small, even if the "true" value of that parameter is non-zero. This is because, by trying to estimate that parameter from too little data, one is bound to worsen the expected generalization error. If one is really interested in knowing whether $\beta = 0$ in the above example, then one should really use the classical statistical test rather than a generalization error statistic, because for that null hypothesis the former has greater power than the latter (Gingras et al. 2000). On the other hand, if one is really interested in out-of-sample predictive performance, one should use a generalization error test, because the classical test is liberal (i.e., it keeps a non-zero $\beta$ too often), which can be very dangerous in applications. Finally, although the difference between the two types of statistics becomes small as $n$ goes to infinity ($\delta$ above goes to zero), it should be noted that in insurance applications with many input variables, the "small-sample" effect is not negligible, for two reasons:

1. when the number of discrete variables is large, and we want to take into account their joint effects, the number of possible combinations of their values is very large; thus there is really very little data to decide whether a parameter associated to a particular combination is useful or not,

2. when the claims distribution is highly asymmetric (i.e., the mean is far from the median), the rare large claims can have a very strong effect on the parameter estimation (i.e., the noise is strong), which increases the discrepancy between the conclusions reached with the in-sample and out-of-sample statistics.
In the above analysis, there is another reason for preferring the operational approach in practical applications: the out-of-sample statistical tests do not require any assumption on the form of the underlying distribution.¹ In other words, when performing a classical parametric test, the conclusion of the test could generally be invalidated if, strictly speaking, the data was not generated from the presumed class of parametric distributions. When the livelihood of a corporation is at stake in these choices, it might be wiser to avoid relying on such assumptions.

¹ The only assumption, in ordinary tests, is that the data points are generated i.i.d., independently from the same distribution. Even this assumption can be relaxed in order to deal with sequentially dependent data (Newey and West 1987, Diebold and Mariano 1995, Campbell, Lo and MacKinlay 1997, Chapados and Bengio 2003).
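To make the distinction concrete, here is a small illustrative simulation (made-up parameters, not from the original study): with a weak true $\beta$ and strong noise, the in-sample OLS fit is frequently beaten out of sample by the model that simply sets $\beta$ to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true, sigma, n_train, n_test = 0.1, 5.0, 30, 10_000   # weak signal, strong noise
wins_for_zero, n_trials = 0, 500

for _ in range(n_trials):
    x_tr = rng.normal(size=n_train)
    y_tr = 2.0 + beta_true * x_tr + rng.normal(scale=sigma, size=n_train)
    x_te = rng.normal(size=n_test)
    y_te = 2.0 + beta_true * x_te + rng.normal(scale=sigma, size=n_test)

    # Full model: ordinary least squares fit of slope and intercept.
    slope, intercept = np.polyfit(x_tr, y_tr, 1)
    mse_ols = np.mean((y_te - (intercept + slope * x_te)) ** 2)

    # Restricted model: beta set to zero, predict the training mean.
    mse_zero = np.mean((y_te - y_tr.mean()) ** 2)

    wins_for_zero += mse_zero < mse_ols

print(f"beta = 0 gave lower out-of-sample MSE in {wins_for_zero / n_trials:.0%} of trials")
```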
2.2 Parameter Optimization: an Example
Consider the same problem of linear regression as described above, but let us now turn to the task of parameter estimation. Our objective is to minimize the expected out-of-sample squared error, which does not mean that we should minimize the in-sample mean squared error:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - (\alpha + \beta x_i))^2. \qquad (2)$$

Minimizing the MSE is what maximum likelihood calls for in the classical framework. The reason for that apparent discrepancy has to do with the statistical learning principles defined above. Instead, in order to obtain better generalization, we turn to the regularization framework and accordingly choose to minimize a penalized criterion, leading to what is often referred to as ridge regression:

$$\mathrm{MSE}_\lambda = \frac{1}{N} \sum_{i=1}^{N} (y_i - (\alpha + \beta x_i))^2 + \lambda \beta^2, \qquad (3)$$
with $\lambda > 0$ and the minimum achieved at $\hat{\beta}_\lambda$. Thus $\hat{\beta}_0$ is the ordinary least squares estimator. This minimum is always achieved with shrunken solutions, i.e., $\|\hat{\beta}_\lambda\| < \|\hat{\beta}_0\|$ for $\lambda > 0$. Note that this solution is generally biased, unlike $\hat{\beta}_0$, in the sense that if the data is generated from a multivariate normal distribution, the expected value of $\|\hat{\beta}_\lambda\|$ is smaller than the true value $\|\beta\|$ of the underlying distribution. In the case where linear regression is the proper model, it is easy to show that the optimal fixed value of $\lambda$ is

$$\lambda^* = \frac{\sigma^2}{N\beta^2}. \qquad (4)$$

Note therefore that the optimal model is biased (optimal in the sense of minimizing out-of-sample error). Considering the case of automobile insurance, with noisy observations (large $\sigma^2$) and the small-sample effect (small $N$, as argued above), we obtain large optimal values for $\lambda$. In other words, for this particular case, regularization differs significantly from maximum likelihood. This example illustrates the more general principle of the bias-variance trade-off (Geman, Bienenstock and Doursat 1992) in generalization error. Increasing $\lambda$ corresponds to "smoothing more" in non-parametric statistics (choosing a simpler function) or to the choice of a smaller capacity ("smaller" class of functions) in Vapnik's VC-theory (Vapnik 1998). A value of $\lambda$ that is too large corresponds to underfitting (too simple a model, too much bias), whereas a value that is too small corresponds to overfitting (too complex a model, too much variance). Which value of $\lambda$ should be chosen? (The above formula is not practical because it requires the true $\beta$, and is only applicable if the data is really Gaussian.) It should be the one that strikes the optimal balance between bias and variance. This is the question that model selection algorithms address. Fortunately, the expected out-of-sample error has a unique minimum as a function of $\lambda$ (or more generally of the capacity, or complexity, of the class of functions). Concerning the above formula, note that unfortunately the data is
generally not normal, and $\sigma^2$ and $\beta$ are both unknown, so the above formula can't be used directly to choose $\lambda$. However, using a separate held-out data set (also called a validation set here), and taking advantage of that unique minimum property (which is true for any data distribution), we can quickly select a good value of $\lambda$ (essentially by searching), which approximately minimizes the estimated out-of-sample error on that validation set. Note that we arrive at the conclusion that a biased model is preferable because we set as our goal to minimize out-of-sample error. If our goal was to discover the underlying "truth," and if we could make very strong assumptions about the true nature of the data distribution, then the more classical statistical approach based on minimum-variance unbiased estimators would be more appropriate. However, in the context of practical insurance premium estimation, we don't really know the form of the true data distribution, and we really care about how the model is going to perform in the future (at least for ratemaking).
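A minimal sketch of this validation-based search, under the assumption of a simple one-dimensional ridge regression and a hypothetical grid of $\lambda$ values:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ridge(x, y, lam):
    """Minimize (1/N) * sum (y - a - b*x)^2 + lam * b^2 (intercept not penalized)."""
    xc, yc = x - x.mean(), y - y.mean()
    b = np.sum(xc * yc) / (np.sum(xc ** 2) + lam * len(x))
    a = y.mean() - b * x.mean()
    return a, b

# Hypothetical noisy training and validation samples.
x_train = rng.normal(size=50);  y_train = 1.0 + 0.3 * x_train + rng.normal(scale=3.0, size=50)
x_valid = rng.normal(size=200); y_valid = 1.0 + 0.3 * x_valid + rng.normal(scale=3.0, size=200)

best_lam, best_mse = None, np.inf
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:       # hypothetical search grid
    a, b = fit_ridge(x_train, y_train, lam)
    mse = np.mean((y_valid - (a + b * x_valid)) ** 2)
    if mse < best_mse:
        best_lam, best_mse = lam, mse

print("selected lambda:", best_lam, "validation MSE:", round(best_mse, 3))
```

The grid and sample sizes are arbitrary; the point is only that $\lambda$ is chosen by out-of-sample error on the validation set rather than by in-sample fit.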
3 Mathematical Objectives
The first goal of insurance premium modeling is to estimate the expected claim amount for a given insurance contract for a future period (usually one year). Here we consider that the amount is 0 when no claim is filed. Let $X \in \mathbb{R}^n$ denote the customer and contract input profile, a vector representing all the information known about the customer and the proposed insurance policy before the beginning of the contract. Let $A \in \mathbb{R}^+$ denote the amount that the customer claims during the contract period; we shall assume that $A$ is non-negative. Our objective is to estimate this claim amount, which is the pure
premium $p_{\mathrm{pure}}$ of a given contract $x$:²

$$p_{\mathrm{pure}}(x) = E_A[A \mid X = x], \qquad (5)$$

where $E_A[\cdot]$ denotes expectation, i.e., the average over an infinite population, and $E_A[A \mid X = x]$ is a conditional expectation, i.e., the average over a subset of an infinite population, comprising only the cases satisfying the condition $X = x$.

² The pure premium is distinguished from the premium actually charged to the customer, which must account for the underwriting costs (marketing, commissions, premium tax), administrative overhead, risk and profit loadings and other costs.

3.1 The Precision Criterion

In practice, of course, we have no direct access to the quantity (5), which we must estimate. One possible criterion is to seek the most precise predictor, which minimizes the expected squared error (ESE) over the unknown distribution:

$$E_{A,X}\big[(p(X) - A)^2\big], \qquad (6)$$

where $p(X)$ is a pure premium predictor and the expectation is taken over the random variables $X$ (input profile) and $A$ (total claim amount). Since $P(A, X)$, the true joint distribution of $A$ and $X$, is unknown, we can unbiasedly estimate the ESE performance of an estimator $p$ on a data set $D_{\mathrm{test}} = \{(x_i, a_i)\}_{i=1}^{N}$, as long as this data set is not used to choose $p$. We do so by using the mean squared error on that data set:

$$\frac{1}{N} \sum_{(x_i, a_i) \in D_{\mathrm{test}}} \big(p(x_i; \theta) - a_i\big)^2, \qquad (7)$$

where $\theta$ is the vector of parameters of the model used to compute the premiums. The vector $x_i$ represents the $i$th input profile of dataset $D_{\mathrm{test}}$ and $a_i$ is the claim amount associated to that input profile.
Thus, $D_{\mathrm{test}}$ is a set of $N$ insurance policies. For each policy, $D_{\mathrm{test}}$ holds the input profile and associated incurred amount. We will call the data set $D_{\mathrm{test}}$ a test set. It is used only to independently assess the performance of a predictor $p$. To choose $p$ from a (usually infinite) set of possible predictors, one uses an estimator $L$, which obtains a predictor $p$ from a given training set $D$. Such an estimator is really a statistical learning algorithm (Hastie, Tibshirani and Friedman 2001), generating a predictor $p = L_D$ for a given data set $D$. What we call the squared bias of such an estimator is $E_X[(E_A[A \mid X] - E_D[L_D(X)])^2]$, where $E_D[L_D(X)]$ is the average predictor obtained by considering all possible training sets $D$ (sampled from $P(A, X)$). It represents how far the average estimated predictor deviates from the ideal pure premium. What we call the variance of such an estimator is $E_{X,D}[(L_D(X) - E_D[L_D(X)])^2]$. It represents how the particular predictor obtained with some data set $D$ deviates from the average of predictors over all data sets, i.e., it represents the sensitivity of the estimator to the variations in the training data and is related to the classical measure of credibility. Is the mean squared error (MSE) on a test set an appropriate criterion to evaluate the predictive power of a predictor $p$? First, one should note that if $p_1$ and $p_2$ are two predictors of $E_A[A \mid X]$, then the MSE criterion is a good indication of how close they are to $E_A[A \mid X]$, since by the law of iterated expectations,

$$E_{A,X}[(p_1(X) - A)^2] - E_{A,X}[(p_2(X) - A)^2] = E_X[(p_1(X) - E_A[A \mid X])^2] - E_X[(p_2(X) - E_A[A \mid X])^2],$$

and of course the expected MSE is minimized when $p(X) = E_A[A \mid X]$. For the more mathematically-minded readers, we show that minimizing the expected squared error optimizes simultaneously both the precision (low bias) and the variance of the estimator. The expected
squared error of an estimator $L_D$ decomposes as follows:

$$
\begin{aligned}
E_{A,X,D}[(A - L_D(X))^2]
&= E_{A,X,D}\big[\big((A - E_A[A \mid X]) + (E_A[A \mid X] - L_D(X))\big)^2\big] \\
&= \underbrace{E_{A,X}[(A - E_A[A \mid X])^2]}_{\text{noise}} + E_{X,D}[(E_A[A \mid X] - L_D(X))^2] \\
&\qquad + \underbrace{2\,E_{A,X,D}[(A - E_A[A \mid X])(E_A[A \mid X] - L_D(X))]}_{\text{zero}} \\
&= \text{noise} + E_{X,D}\big[\big((E_A[A \mid X] - E_D[L_D(X)]) + (E_D[L_D(X)] - L_D(X))\big)^2\big] \\
&= \text{noise} + \underbrace{E_X[(E_A[A \mid X] - E_D[L_D(X)])^2]}_{\text{bias}^2} + \underbrace{E_{X,D}[(E_D[L_D(X)] - L_D(X))^2]}_{\text{variance}} \\
&\qquad + \underbrace{2\,E_{X,D}[(E_A[A \mid X] - E_D[L_D(X)])(E_D[L_D(X)] - L_D(X))]}_{\text{zero}}.
\end{aligned}
$$

Thus, algorithms that try to minimize the expected squared error simultaneously reduce both the bias and the variance of the estimators, striking a tradeoff that minimizes the sum of both (since the remainder is the noise, which cannot be reduced by the choice of predictor). On the other hand, with a rule such as minimum bias used with table-based methods, cells are merged up to a point where each cell has sufficient credibility, i.e., where the variance is sufficiently low. Then, once the credibility (and variance) level is fixed, the bias is minimized. On the contrary, by targeting minimization of the expected squared error one avoids this arbitrary setting of a credibility level.
In comparison to parametric approaches, this approach avoids distributional assumptions. Furthermore, it looks for an optimal tradeoff between bias and variance, whereas parametric approaches typically focus on the unbiased estimators (within a class that is associated with a certain variance). Because of the above trade-off possibility, it is always possible (with a finite data set) to improve an unbiased estimator by trading a bit of bias increase for a lot of variance reduction.
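The decomposition above can be checked by simulation. The sketch below (an illustrative toy example with an invented severity curve, not insurance data) repeatedly draws training sets, fits a simple and a flexible estimator, and estimates the squared bias and variance terms by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean = lambda x: 100.0 * np.exp(-x)     # invented "pure premium" curve E[A | X = x]
sigma = 40.0                                  # noise of the claim amount around its mean
x_grid = np.linspace(0.0, 2.0, 50)            # points at which the estimators are compared

def draw_training_set(n=30):
    x = rng.uniform(0.0, 2.0, size=n)
    a = true_mean(x) + rng.normal(scale=sigma, size=n)
    return x, a

for degree in (1, 8):                         # simple vs. flexible estimator L_D
    preds = []
    for _ in range(300):                      # many training sets D
        x, a = draw_training_set()
        preds.append(np.polyval(np.polyfit(x, a, degree), x_grid))
    preds = np.array(preds)
    avg_pred = preds.mean(axis=0)             # approximates E_D[L_D(X)] on the grid
    bias2 = np.mean((true_mean(x_grid) - avg_pred) ** 2)   # squared-bias term
    variance = np.mean((preds - avg_pred) ** 2)            # variance term
    print(f"degree {degree}: bias^2={bias2:7.1f}  variance={variance:7.1f}  "
          f"noise={sigma**2:.0f}  sum={bias2 + variance + sigma**2:7.1f}")
```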
3.2 The Fairness Criterion
In insurance policy pricing, the precision criterion is not the sole part of the picture; just as important is that the estimated premiums do not systematically discriminate against specific segments of the population. We call this objective the fairness criterion, sometimes referred to as actuarial fairness. We define the bias of the premium $b(P)$ to be the difference between the average pure premium and the average incurred amount, in a given sub-population $P$ of dataset $D$:

$$b(P) = \frac{1}{|P|} \sum_{(x_i, a_i) \in P} \big(p(x_i) - a_i\big), \qquad (8)$$

where $|P|$ denotes the cardinality of the sub-population $P$, and $p(\cdot)$ is some premium estimation function. The vector $x_i$ represents the $i$th input profile of sub-population $P$ and $a_i$ is the claim amount associated to that input profile. A possible fairness criterion would be based on minimizing the sum, over a certain set of critical sub-populations $\{P_k\}$ of dataset $D$, of the square of the biases:

$$\sum_{k} b^2(P_k). \qquad (9)$$
In the particular case where one considers all sub-populations, both the fairness and precision criteria yield the same optimal solution, i.e., they are minimized when $p(x_i) = E[A_i \mid x_i]$, $\forall i$, i.e., for
every insurance policy, the premium is equal to the conditional expectation of the claim amount. The proof is given in appendix 8. In order to measure the fairness criterion, we used the following methodology: after training a model to minimize the MSE criterion (7), we define a finite number of disjoint subsets (sub-populations) of the test set $D$: $P_k \subset D$, $P_k \cap P_{j \neq k} = \emptyset$, and verify that the absolute bias is not significantly different from zero. The subsets $P_k$ can be chosen at convenience; in our experiments, we considered 10 subsets of equal size delimited by the deciles of the test set premium distribution. In this way, we verify that, for example, for the group of contracts with a premium between the 5th and the 6th decile, the average premium matches, within statistical significance, the average claim amount.
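A hedged sketch of this decile-based fairness check, using invented premium and claim arrays in place of the study's test set:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
premium = rng.gamma(shape=2.0, scale=250.0, size=n)        # hypothetical predicted premiums
claims = premium + rng.normal(scale=400.0, size=n)         # hypothetical incurred amounts

# Split the test set into 10 equal-size groups by premium deciles.
deciles = np.quantile(premium, np.linspace(0.0, 1.0, 11))
group = np.clip(np.searchsorted(deciles, premium, side="right") - 1, 0, 9)

for k in range(10):
    mask = group == k
    bias = premium[mask].mean() - claims[mask].mean()      # b(P_k)
    stderr = claims[mask].std(ddof=1) / np.sqrt(mask.sum())
    flag = "OK" if abs(bias) < 2.0 * stderr else "check"
    print(f"decile {k + 1:2d}: bias = {bias:8.2f}  (+/- {2.0 * stderr:6.2f})  {flag}")
```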
4 Methodology
A delicate problem to guard against when applying statistical learning algorithms is that of overfitting. It has precisely to do with striking the right trade-off between bias and variance (as introduced in the previous section), and is known in technical terms as capacity control. Figure 1 illustrates the problem: the two plots show empirical data points (black dots) that we are trying to approximate with a function (solid red curve). All points are sampled from the same underlying function (dashed blue curve), but are corrupted with noise; the dashed curve may be seen as the "true" function we are seeking to estimate. The left plot shows the result of fitting a very flexible function, i.e., a high-order polynomial in this case, to the available data points. We see that the function fits the data points perfectly: there is zero error (distance) between the red curve and each of the black dots. However, the function oscillates wildly between those points; it has not captured any of the fundamental features of the underlying function. What is happening here is that the function has mostly captured the noise in the data: it overfits.
The right plot, on the other hand, shows the fitting of a less flexible function, i.e., a 2nd-order polynomial, which exhibits a small error with respect to each data point. However, by not fitting the noise (because it does not have the necessary degrees of freedom), the fitted function far better conveys the structural essence of the matter. Capacity control lies at the heart of a sound methodology for data mining and statistical learning algorithms. The goal is simple: to choose a function class flexible enough (with enough capacity) to express a desired solution, but constrained enough that it does not fit the noise in the data points. In other words, we want to avoid overfitting and underfitting. Figure 2 illustrates the basic steps that are commonly taken to resolve this issue; these are not the only means to prevent overfitting, but they are the simplest to understand and use.

1. The full data set is randomly split into three disjoint subsets, respectively called the training, validation, and test sets.

2. The training set is used to fit a model with a chosen initial capacity.

3. The validation set is used to evaluate the performance of that fitted function, on different data points than those used for the fitting.
Figure 1. Illustration of overfitting. The solid left curve fits the noise in the data points (black dots) and has not learned the underlying structure (dashed). The right curve, with less flexibility, does not overfit.
[Figure 2 flowchart: the full database is split into training, validation and test sets; model training and model validation are iterated under capacity control, and the validated model undergoes a final test.]
Figure 2. Methodology to prevent overfitting. Model capacity is controlled via a validation set, disjoint from the training set. The generalization performance estimate is obtained by final testing on the test set, disjoint from the first two.
The key here is that a function overfitting the training set will exhibit a low performance on the validation set if it does not capture the underlying structure of the problem.

4. Depending on the validation set performance, the capacity of the model is adjusted (increased or reduced), and a new training phase (step 2) is attempted. This training-validation cycle is repeated multiple times and the capacity that provides the best validation performance is chosen.

5. Finally, the performance of the "ultimate" function (that coming out of the validation phase) is evaluated on data points never used previously (those in the test set) to give a completely unbiased measure of the performance that can be expected when the system is deployed in the field. This is called generalization performance.

In the case of automobile insurance, there is a sequential structure that must be respected. When using data from previous years
to simulate the out-of-sample performance of models, one should try to replicate as closely as possible the actual process that will be followed to deploy a model. A sound procedure is to split the data according to their policy date year into the training (year $y-3$), validation (year $y-2$) and test (year $y$) sets. The reason for skipping year $y-1$ is to recognize the fact that at time $y$, as the model is deployed, year $y-1$ is not yet available. Reducing this gap to a few months would help the insurer account for more recent data and very likely obtain better performance. An insurer having access to data dating from 1995 could obtain test performances for years 1998 and above. Assuming 2002 results are available yields 5 estimates of the test performance of the whole modeling procedure.
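A minimal sketch of this sequential split, assuming a hypothetical DataFrame with a policy_year column; the deployment year y is a parameter.

```python
import pandas as pd

def sequential_split(policies: pd.DataFrame, deployment_year: int):
    """Split policies into train (y-3), validation (y-2) and test (y) sets.

    Year y-1 is skipped because it would not yet be available when the
    model for year y is deployed.
    """
    y = deployment_year
    train = policies[policies["policy_year"] == y - 3]
    valid = policies[policies["policy_year"] == y - 2]
    test = policies[policies["policy_year"] == y]
    return train, valid, test

# Hypothetical usage: data from 1995 onward allows test years 1998..2002,
# giving five estimates of out-of-sample performance.
# for y in range(1998, 2003):
#     train, valid, test = sequential_split(all_policies, y)
#     ...fit on train, tune capacity on valid, report MSE on test...
```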
5 Models
In this section, we describe various models that have been implemented and used for the purpose of ratemaking. We begin with the simplest model: charging a flat premium to every insured. Then, we gradually move on towards more complex models.
5.1 Constant Model
For benchmark evaluation purposes, we implemented the constant model. This consists of simply charging every single insurance policy a flat premium, regardless of the associated variable values. The premium is the mean of all incurred amounts, as it is the constant value that minimizes the mean-squared error:

$$p(x) = \mu_0, \qquad (10)$$

where $\mu_0$ is the mean and the premium $p(x)$ is independent of the input profile $x$. In figure 3, the constant model is viewed as a flat line when the premium value is plotted against one of the input variables.
Figure 3. The constant model fits the best horizontal line through the training data.
5.2 Linear Model
We implemented a linear model, which consists of a set of coefficients, one for each variable plus an intercept value, that minimize the mean-squared error:

$$p(x) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i. \qquad (11)$$
Figure 4 illustrates a linear model where the resulting premiums form a line, given one input variable value. With a two-dimensional input variable space, a plane would be drawn. In higher dimensions, the corresponding geometrical form is referred to as a hyper-plane. There are two main ways to control the capacity of linear models in the presence of noisy data:

• using a subset of input variables; this directly reduces the number of coefficients (but choosing the best subset introduces another level of choice which is sometimes detrimental).
Figure 4. The linear model fits a straight line through the training data.
• penalizing the norm of the parameters (in general excluding the intercept $\beta_0$); this is called ridge regression in statistics, and weight decay in the Neural Networks community. This was the main method used to control the capacity of the linear model in our experiments (see subsection 2.2 above).

It should be noted that the premium computed with the linear model can be negative (and negative values are indeed sometimes obtained with the trained linear models). This may happen even if there are no negative amounts in the data, simply because the model has no built-in positivity constraint (unlike the GLM and the softplus Neural Network described below).
5.3 Table-Based Methods
These more traditional ratemaking methods rely mainly on a classification system, base rates and relativities. The target function is approximated by constants over regular (finite) intervals. As shown on the figure, this gives rise to a typical staircase-like function, where each level of the staircase is given by the value in the corresponding
cell in the table. A common refinement in one dimension is to perform a linear interpolation between neighboring cells, to smooth the resulting function somewhat. The table is not limited to two variables; however, when adding a new variable (dimension), the number of cells increases by a factor equal to the number of discretization steps in the new variable. In order to use table-based methods to estimate a pure premium, find a certain number of variables deemed useful for the prediction, and discretize those variables if they are continuous. To fill out the table, compute over a number of years (using historical data) the total incurred claim amount for all customers whose profiles fall within a given cell of the table, and average the total within that cell. This gives the pure premium associated with each cell of the table. Assuming that the $i$th variable of profile $x$ belongs to the $j_i$th category, we obtain

$$p(x) = \beta_0 \prod_{i=1}^{m} \beta_{i,j_i} + \sum_{i=m+1}^{n} \beta_{i,j_i}, \qquad (12)$$

where $\beta_{i,j}$ is the relativity for the $j$th category of the $i$th variable and $\beta_0$ is the standard premium. We consider the case where the first $m$ factors are multiplicative and the last $n - m$ factors are additive. The formula above assumes that all variables have been analyzed individually and independently. A great deal of effort is often put into trying to capture dependencies (or interactions) between some variables and to encode them into the premium model. An extension of the above is to multiplicatively combine multiple tables associated to different subsets of variables. This is in effect a particular form of Generalized Linear Model (see below), where each table represents the interdependence effects between some variables.
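To make equation (12) concrete, here is a small hypothetical example (invented base premium and relativity tables) with two multiplicative factors and one additive factor:

```python
# Hypothetical standard premium and relativity tables (all values are invented).
BASE_PREMIUM = 500.0
MULTIPLICATIVE = {                       # first m = 2 factors combine multiplicatively
    "driver_age_band": {"16-24": 1.60, "25-54": 1.00, "55+": 0.90},
    "territory":       {"urban": 1.25, "suburban": 1.00, "rural": 0.85},
}
ADDITIVE = {                             # remaining factors are additive loadings
    "anti_theft": {"yes": -15.0, "no": 0.0},
}

def table_premium(profile: dict) -> float:
    premium = BASE_PREMIUM
    for variable, table in MULTIPLICATIVE.items():
        premium *= table[profile[variable]]          # beta_0 times the product of relativities
    for variable, table in ADDITIVE.items():
        premium += table[profile[variable]]          # plus the additive relativities
    return premium

print(table_premium({"driver_age_band": "16-24", "territory": "urban", "anti_theft": "no"}))
```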
5.4 Greedy Multiplicative Model
Greedy learning algorithms "grow" a model by gradually adding one "piece" at a time to the model, but keeping the already chosen pieces
fixed. At each step, the "piece" that is most helpful to minimize the training criterion is "added" to the model. This is how decision trees are typically built. Using the validation set performance, we can decide when to stop adding pieces (when the estimated out-of-sample performance starts degrading). The GLM described in the next section is a multiplicative model because the final premium function can be seen as a product of coefficients associated with each input variable. The basic idea of the Greedy Multiplicative Model is to add one of these multiplicative coefficients at a time. At each step, we have to choose one among the input variables. We choose the variable which would most reduce the training MSE. The coefficient for that component is easily obtained analytically by minimizing the MSE when all the previously obtained coefficients are kept fixed. In the tables we use the short-hand name "CondMean" for this model because it estimates and combines many conditional means. Note that, like the GLM, this model provides positive premiums.
5.5 Generalized Linear Model
Generalized Linear Models (GLMs) were introduced to the actuarial community a while ago (Bailey and Simon 1960). More recently, some experiments have been conducted using such models (Brown 1988, Holler, Sommer and Trahair 1999, Murphy, Brockman and Lee 2000). GLMs, at their roots, are simple linear models that are composed with a fixed nonlinearity (the so-called link function); a commonly-used link function is simply the exponential function $e^x$. GLMs (with the exponential link) are sometimes used in actuarial modeling since they naturally represent multiplicative effects, for example, risk factors whose effects should combine multiplicatively rather than additively. They are attractive since they incorporate problem-specific knowledge directly into the model. These
models can be used to obtain a pure premium:

$$p(x) = \exp\Big(\beta_0 + \sum_{i=1}^{n} \beta_i x_i\Big), \qquad (13)$$

where the exponentiation ensures that the resulting premiums are all positive. In figure 5, we can see that the model generates an exponential function in terms of the input variable.
Figure 5. The Generalized Linear Model fits an exponential of a linear transformation of the variables.
In their favor, GLMs are quite easy to estimate³, have interpretable parameters, can be associated to parametric noise models, and are not so affected when the number of explanatory variables increases, as long as the number of observations used in the estimation remains sufficient. Unfortunately, they are fairly restricted in the shape of the functions they can estimate. The capacity of a GLM model can be controlled using the same techniques as those mentioned above (5.2) in the context of linear models. Again, note that the GLM always provides a positive premium.

³ We have estimated the parameters to minimize the mean-squared error, but other training criteria have also been proposed in the GLM literature and this could be the subject of further studies.
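As an illustration only, the sketch below fits an exponential-link model by minimizing the mean-squared error with plain gradient descent on simulated data, in the spirit of footnote 3; a production implementation would normally rely on a GLM package with a log link, and all numbers here are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 2000, 3
X = rng.normal(size=(n, d))                                # hypothetical rating variables
true_beta = np.array([0.4, -0.2, 0.1])
y = np.exp(5.0 + X @ true_beta) + rng.normal(scale=20.0, size=n)   # simulated claim amounts

beta0, beta = 0.0, np.zeros(d)
lr = 1e-6                                                  # small step size for stability
for _ in range(20_000):
    pred = np.exp(beta0 + X @ beta)                        # premiums are always positive
    grad = pred - y                                        # d(MSE)/d(pred), up to a constant factor
    beta0 -= lr * np.mean(grad * pred)                     # chain rule through exp(.)
    beta -= lr * (X * (grad * pred)[:, None]).mean(axis=0)

print("intercept:", round(beta0, 2), "coefficients:", np.round(beta, 2))
```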
5.6 CHAID Decision Trees
Decision trees split the variable space into smaller subspaces. Any input profile $x$ fits into one and only one of those subspaces, called leaves. To each leaf is associated a different premium level:

$$p(x) = \sum_{i=1}^{n_l} p_i\, I_{\{x \in l_i\}}, \qquad (14)$$

where $I_{\{x \in l_i\}}$ is an indicator function equal to 1 if and only if $x$ belongs to the $i$th leaf $l_i$. In that case, $I_{\{x \in l_i\}} = 1$ and $p(x) = p_i$. Otherwise, $I_{\{x \in l_i\}}$ is equal to zero, meaning $x$ belongs to another leaf. The number of leaves is $n_l$. The premium level $p_i$ is set equal to the average incurred amount of the policies for which the profile $x$ belongs to the $i$th leaf. In figure 6, the decision tree is viewed as generating a piecewise constant function. The task of the decision tree is to choose the "best" possible partition of the input variable space. The basic way in which capacity is controlled is through several hyper-parameters: minimum population in each leaf, minimum population to consider splitting a node, maximum height of the decision tree and the Chi-square statistic threshold value.
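For illustration, the same capacity controls map directly onto a regression tree in scikit-learn; note that this sketch uses CART (squared-error splits) rather than the CHAID chi-square criterion, and the data are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(5000, 4))                     # hypothetical rating variables
claims = np.maximum(0.0, 200.0 + 80.0 * X[:, 0] - 50.0 * X[:, 1]
                    + rng.normal(scale=150.0, size=5000))

# Capacity controls analogous to those listed for CHAID.
tree = DecisionTreeRegressor(
    min_samples_leaf=200,       # minimum population in each leaf
    min_samples_split=500,      # minimum population to consider splitting a node
    max_depth=4,                # maximum height of the tree
)
tree.fit(X, claims)

# Each leaf's premium is the average incurred amount of its training policies.
print(np.round(tree.predict(X[:5]), 2))
```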
5.7 Combination of CHAID and Linear Model
This model is similar to the previous one except that, in each leaf, we have replaced the associated constant premium value with a linear regression. Each leaf has its own set of regression coefficients. There are thus $n_l$ different linear regressions of $n+1$ coefficients each:

$$p(x) = \sum_{i=1}^{n_l} I_{\{x \in l_i\}} \Big(\beta_{i,0} + \sum_{j=1}^{n} \beta_{i,j} x_j\Big). \qquad (15)$$
Figure 6. The CHAID model fits constants to partitions of the variables. The dashed lines in the figure delimit the partitions, and are found automatically by the CHAID algorithm.
Each linear regression was fit to minimize the mean-squared error on the training cases that belong to its leaf. For reasons that are clear in the light of learning theory, a tree used in such a combination should have fewer leaves than an ordinary CHAID tree. In our experiments, we have chosen the size of the tree based on the validation set MSE. In these models, capacity is controlled with the same hyper-parameters as CHAID, and there is also the question of finding the right weight decay for the linear regression. Again, the validation set is used for this purpose.
5.8 Ordinary Neural Network
Ordinary Neural Networks consist of the clever combination and simultaneous training of a group of units or neurons that are individually quite simple. Figure 8 illustrates a typical multi-layer feedforward architecture such as the ones that were used for the current project.
Figure 7. The CHAID+Linear model fits a straight line within each of the CHAID partitions of the variable space.
[Figure 8 diagram labels: Input Variable 1 to Input Variable n; Input Layer; Hidden Layer (and hidden weights); Softplus Output Layer; Final Output.]
Figure 8. Topology of a one-hidden-layer Neural Network. In each unit of the hidden layer, the variables are linearly combined. The network then applies a non-linear transformation on those linear combinations. Finally, the resulting values of the hidden units are linearly combined in the output layer.
We describe here the steps that lead to the computation of the final output of the Neural Network. First, we compute a series of linear combinations of the input variables:

v_i = a_{i,0} + Σ_{j=1}^{n} a_{i,j} x_j,    (16)

where x_j is the j-th out of n variables, and a_{i,0} and a_{i,j} are the intercept and the weights of the i-th linear combination. The result of the linear combination, v_i, represents a projection in a preferred direction that combines information from potentially all the input variables. Then, a non-linear transform (called a transfer function) is applied to each of the linear combinations in order to obtain what are called the hidden units. We used the hyperbolic tangent function:

h_i = tanh(v_i) = (e^{v_i} − e^{−v_i}) / (e^{v_i} + e^{−v_i}),    (17)

where h_i is the i-th hidden unit. The use of such a transfer function, with an infinite expansion in its terms, has an important role in helping the Neural Network capture nonlinear interactions; this is the subject of subsection 5.9. Finally, the hidden units are linearly combined in order to compute the final output of the Neural Network:
p(x) = β_0 + Σ_{i=1}^{n_h} β_i h_i,    (18)

where p(x) is the premium computed by the Neural Network, n_h is the number of hidden units, and β_0 and β_i are the intercept and the weights of the final linear combination. Putting it all together in a single equation, we obtain:

p(x) = β_0 + Σ_{i=1}^{n_h} β_i tanh( a_{i,0} + Σ_{j=1}^{n} a_{i,j} x_j ).    (19)
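Equation (19) is just a short feed-forward computation. The sketch below implements it directly in NumPy; the weights are random placeholders (in practice they are estimated during training) and the variable counts are arbitrary.

```python
# Minimal NumPy sketch of equations (16)-(19) with placeholder weights.
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_hidden = 8, 5                     # n input variables, n_h hidden units

a0 = rng.normal(size=n_hidden)              # a_{i,0}: hidden-unit intercepts
A = rng.normal(size=(n_hidden, n_vars))     # a_{i,j}: hidden-layer weights
b0 = rng.normal()                           # beta_0
b = rng.normal(size=n_hidden)               # beta_i

def premium(x):
    v = a0 + A @ x                          # equation (16)
    h = np.tanh(v)                          # equation (17)
    return b0 + b @ h                       # equations (18)-(19)

print(premium(rng.normal(size=n_vars)))     # one synthetic input profile
```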
Figure 9 depicts a smooth non-linear function which could be generated by a Neural Network.
Figure 9. The Neural Network model learns a smooth non-linear function of the variables.
The number of hidden units (n_h above) plays a crucial role in our desire to control the capacity of the Neural Network. If we choose a value for n_h that is too large, we obtain overfitting: the number of parameters of the model increases and it becomes possible, during the parameter optimization phase, for the Neural Network to model noise or spurious dependencies. These dependencies, present in the training dataset used for optimization, might not apply to other datasets. Conversely, setting n_h to a value that is too low corresponds to underfitting: the number of parameters becomes too small and the Neural Network cannot capture all of the relevant interactions in order to properly compute the premiums. Thus, choosing the optimal number of hidden units is an important part of modeling with Neural Networks. Another technique for controlling the capacity of a Neural Network is to use weight decay, i.e., a penalized training criterion as described in subsection 2.2 that limits the size of the parameters of the Neural Network.
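A small sketch of this capacity-control recipe: candidate numbers of hidden units and weight-decay strengths are compared by their validation-set MSE. scikit-learn's MLPRegressor is only a convenient stand-in for the chapter's network (its alpha parameter is an L2 weight-decay penalty), and the data are synthetic.

```python
# Hypothetical validation-based selection of n_h and the weight decay.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 10))
y = rng.exponential(0.1, 5000) * (rng.random(5000) < 0.07)
X_tr, X_va, y_tr, y_va = X[:4000], X[4000:], y[:4000], y[4000:]

best = None
for n_h in (2, 5, 10, 20):                  # candidate numbers of hidden units
    for decay in (1e-4, 1e-2, 1.0):         # candidate weight-decay strengths
        net = MLPRegressor(hidden_layer_sizes=(n_h,), activation="tanh",
                           alpha=decay, max_iter=500, random_state=0)
        net.fit(X_tr, y_tr)
        mse = mean_squared_error(y_va, net.predict(X_va))
        if best is None or mse < best[0]:
            best = (mse, n_h, decay)

print("validation MSE %.5f with n_h=%d, weight decay=%g" % best)
```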
Choosing the optimal values for the parameters is a complex task and beyond the scope of this chapter. Many different optimization algorithms and refinements have been suggested (Bishop 1995, Orr and Müller 1998) but in practice, the simple stochastic gradient descent algorithm is still very popular and usually gives good performance. Note that like the linear regression, this model can potentially yield negative premiums in some cases. We have observed far fewer such cases than with the linear regression.
5.9
How Can Neural Networks Represent Nonlinear Interactions?
For the more mathematically-minded readers, we present a simple explanation of why Neural Networks are able to represent nonlinear interactions between the input variables. To simplify, suppose that we have only two input variables, x_1 and x_2. In classical linear regression, a common trick is to include fixed nonlinear combinations among the regressors, such as x_1², x_2², x_1 x_2, x_1² x_2, and so on. However, it is obvious that this approach adds exponentially many terms to the regression as one seeks higher powers of the input variables. In contrast, consider a single hidden unit of a Neural Network, connected to two inputs. The adjustable network parameters are named, for simplicity, α_0, α_1 and α_2. A typical function computed by this unit is given by tanh(α_0 + α_1 x_1 + α_2 x_2). Here comes the central part of the argument: performing a Taylor series expansion of tanh(y + α_0) in powers of y, and letting α_1 x_1 + α_2 x_2 stand for y, we obtain (where β = tanh α_0),

tanh(α_0 + α_1 x_1 + α_2 x_2) = β + (1 − β²)(α_1 x_1 + α_2 x_2) + (−β + β³)(α_1 x_1 + α_2 x_2)²
    + (−1/3 + (4/3)β² − β⁴)(α_1 x_1 + α_2 x_2)³ + ((2/3)β − (5/3)β³ + β⁵)(α_1 x_1 + α_2 x_2)⁴
    + O((α_1 x_1 + α_2 x_2)⁵).

In fact the number of terms is infinite: the nonlinear function computed by this single hidden unit includes all powers of the input variables, but they cannot all be independently controlled. The terms that will ultimately stand out depend on the coefficients α_0, α_1, and α_2. Adding more hidden units increases the flexibility of the overall function computed by the network: each unit is connected to the input variables with its own set of coefficients, thereby allowing the network to capture as many (nonlinear) relationships between the variables as the number of units allows. The coefficients linking the input variables to the hidden units can also be interpreted in terms of projections of the input variables. Each set of coefficients for one unit represents a direction of interest in input space. The values of the coefficients are found during the network training phase using iterative nonlinear optimization algorithms.
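The expansion above can be checked symbolically. The sketch below follows the same route as the text, expanding tanh(α_0 + y) in powers of y and then substituting α_1 x_1 + α_2 x_2 for y; the printed polynomial contains mixed terms such as x_1 x_2 and x_1² x_2. SymPy and the particular coefficient values are assumptions of this illustration.

```python
# Symbolic check (illustrative): the expansion of tanh of a linear combination
# contains interaction terms such as x1*x2 and x1**2*x2.
import sympy as sp

x1, x2, y = sp.symbols("x1 x2 y")
a0, a1, a2 = sp.Rational(1, 2), sp.Rational(1, 3), sp.Rational(1, 5)  # arbitrary values

series_in_y = sp.series(sp.tanh(a0 + y), y, 0, 5).removeO()
poly = sp.expand(series_in_y.subs(y, a1 * x1 + a2 * x2))
print(poly)   # inspect the x1*x2, x1**2*x2, ... cross terms
```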
5.10
Softplus Neural Network

This new type of model was introduced precisely to make sure that positive premiums are obtained. The softplus function was recently introduced (Dugas, Bengio, Bélisle, Nadeau and Garcia 2001) as a means to model a convex relationship between an output and one of its inputs. We modified the Neural Network architecture and included a softplus unit as a final transfer function. Figure 10 illustrates this new architecture we have introduced for the purpose of computing
Figure 10. Topology of a one-hidden-layer softplus Neural Network. The hidden layer applies a non-linear transformation of the variables, whose results are linearly combined by the output layer. The softplus output function forces the function to be positive. To avoid cluttering, some weights linking the variables to the hidden layer are omitted on the figure.
insurance premiums. The corresponding formula is:

p(x) = F( β_0 + Σ_{i=1}^{n_h} β_i tanh( a_{i,0} + Σ_{j=1}^{n} a_{i,j} x_j ) ),    (20)

where F(·) is the softplus function, which is simply the primitive (integral) function of the sigmoid function. Thus,

F(y) = log(1 + e^y).    (21)
The softplus function is convex and monotone increasing with respect to its input and always strictly positive. Thus, as can be seen in Figure 11, this proposed architecture leads to strictly positive premiums. In preliminary experiments we have also tried to use the exponential function (rather than the softplus function) as the final transfer
function. However we obtained poor results due to difficulties in the optimization (probably due to the very large gradients obtained when the argument of the exponential is large).
Figure 11. The softplus Neural Network model learns a smooth non-linear positive function of the variables. This positivity is desirable for estimating insurance premiums.
The capacity of the softplus Neural Network is tuned just like that of an ordinary Neural Network. Note that this kind of Neural Network architecture is not available in commercial Neural Network packages.
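Since the architecture is not available off the shelf, the sketch below shows the output computation of equations (20)-(21) directly in NumPy, with a numerically stable softplus. Weights are random placeholders, not trained values.

```python
# Minimal sketch of the softplus network output, equations (20)-(21).
import numpy as np

def softplus(y):
    return np.logaddexp(0.0, y)             # stable log(1 + exp(y)), always > 0

rng = np.random.default_rng(3)
n_vars, n_hidden = 8, 5
a0 = rng.normal(size=n_hidden)
A = rng.normal(size=(n_hidden, n_vars))
b0, b = rng.normal(), rng.normal(size=n_hidden)

def premium(x):
    return softplus(b0 + b @ np.tanh(a0 + A @ x))   # equation (20)

print(premium(rng.normal(size=n_vars)))     # strictly positive by construction
```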
5.11
Regression Support Vector Machine

Support Vector Machines (SVMs) have recently been introduced as a very powerful set of non-parametric statistical learning algorithms (Vapnik 1998, Schölkopf, Smola and Müller 1998). They have been very successful in classification tasks, but the framework has also been extended to perform regression. Like other kernel methods, the
class of functions used has the following form:

p(x) = Σ_i α_i K(x, x_i),    (22)

where x_i is the input profile associated with one of the training records, α_i is a scalar coefficient that is learned by the algorithm, and K is a kernel function that satisfies the Mercer condition (Cristianini and Shawe-Taylor 2000):

∫ K(x, y) g(x) g(y) dx dy ≥ 0    (23)

for any square integrable function g(x) and compact subset C of R^n. This Mercer condition ensures that the kernel function can be represented as a simple dot product:

K(x, y) = φ(x) · φ(y),    (24)

where φ(·) is a function that projects the input profile vector into a (usually very) high-dimensional "feature" space, usually in a nonlinear fashion. This leads us to a simple expression for the premium function:

p(x) = Σ_i α_i φ(x_i) · φ(x) = w · φ(x).    (25)
Thus, in order to compute the premium, one needs to project input profile x in its feature space and compute a dot product with vector w. This vector w depends only on a certain number of input profiles from the training dataset and their associated coefficients. These input profiles are referred to as the support vectors and have been selected, along with their associated coefficients by the optimization algorithm.
SVMs have several very attractive theoretical properties, including the fact that an exact solution to the optimization problem of minimizing the training criterion can be found, and the capacity of the model is automatically determined from the training data. In many applications, we also find that most of the α_i coefficients are zero. However, in the case of insurance data, an important characteristic of regression SVMs is that they are NOT trained to minimize the training MSE. Instead they minimize the following criterion:

J = ½ ‖w‖² + λ Σ_i |a_i − p(x_i)|_ε,    (26)

where |e|_ε = max(0, |e| − ε), λ and ε trade off accuracy with complexity, a_i is the observed incurred claim amount for record i, x_i is the input profile for record i, and the vector w is defined in terms of the α_i coefficients above. It can therefore be seen that this algorithm minimizes something close to the absolute value of the error rather than the squared error. As a consequence, the SVM tends to find a solution that is close to the conditional median rather than the conditional expectation, the latter being what we want to evaluate in order to set the proper value for a premium. Furthermore, note that the insurance data display a highly asymmetric distribution, so the median and the mean are very different. In fact, the conditional median is often exactly zero. Capacity is controlled through the ε and λ coefficients.
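The point about the conditional median can be illustrated numerically. On a zero-inflated, thick-tailed sample of claim amounts (synthetic here, with an arbitrary ε), the constant that minimizes absolute-error-type criteria is the median (zero), not the mean that a pure premium should estimate.

```python
# Illustration: for insurance-like data the median and mean differ sharply, and
# the eps-insensitive criterion of equation (26) favours median-like predictions.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
claims = rng.pareto(2.5, n) * 5.0 * (rng.random(n) < 0.07)   # about 93% zeros

print("mean   (squared-error minimizer):", claims.mean())
print("median (absolute-error minimizer):", np.median(claims))   # 0.0

def eps_insensitive(residuals, eps):
    """|e|_eps = max(0, |e| - eps), cf. equation (26)."""
    return np.maximum(0.0, np.abs(residuals) - eps)

for guess in (claims.mean(), 0.0):
    loss = eps_insensitive(claims - guess, eps=0.05).mean()
    print(f"predicting {guess:.3f} -> average eps-insensitive loss {loss:.4f}")
```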
5.12
Mixture Models

The mixture of experts has been proposed (Jacobs, Jordan, Nowlan and Hinton 1991) in the statistical learning literature in order to decompose the learning problem, and it can be applied to regression as well as classification. The conditional expectation is expressed as a linear combination of the predictions of expert models, with weights determined by a gater model. The experts are specialized predictors
Figure 12. Schematic representation of the mixture model. The first-stage models each make an independent decision, and these are linearly combined by a second-stage gater.
that each estimate the pure premium for insureds that belong to a certain class. The gater attempts to predict to which class each insured belongs, with an estimator of the conditional probability of the class given the insured's input profile. For a mixture model, the premium can be expressed as

p(x) = Σ_c p(c|x) p_c(x),    (27)

where p(c|x) is the probability that an insured with input profile x belongs to class c. This value is determined by the gater model. Also, p_c(x) is the premium, as computed by the expert model of class c, associated to input profile x. A trivial case occurs when the class c is deterministically found for any particular input profile x. In that case, we simply split the training database and train each expert model on a subset of the data.
Table 1. Comparison between the main models, with MSE on the training set, validation set, and test sets. The MSE is with respect to claim amounts and premiums expressed in thousands of dollars.
Model       Train MSE   Valid MSE   Test MSE
Constant    56.1108     56.5744     67.1192
Linear      56.0780     56.5463     67.0909
GLM         56.0762     56.5498     67.0926
NN          56.0706     56.5468     67.0903
Softplus    56.0704     56.5480     67.0918
CHAID       56.0917     56.5657     67.1078
CondMean    56.0827     56.5508     67.0964
Mixture     56.0743     56.5416     67.0851
The gater then simply assigns a value of pc{x) = 1 if c is the appropriate class for input profile x and zero otherwise. This is in fact fundamentally equivalent to other techniques such as decision trees or table-based methods. A more general and powerful approach is to have the learning algorithm discover a relevant decomposition of the data into different regions of the input space which then become the classes and are encoded in the gater model. In that case, both the gater and the experts are trained together. In this study both the experts and the gater are softplus Neural Networks, but any other model can be used. In Figure 12, we schematically illustrate a mixture model as the one that was used in the framework of this project.
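A minimal sketch of the combination in equation (27): a gater turns the input profile into class probabilities, the experts produce per-class premiums, and the final premium is the probability-weighted sum. The gater and experts below are placeholder functions with random parameters; in the chapter both are softplus Neural Networks.

```python
# Hypothetical mixture-of-experts premium, equation (27).
import numpy as np

rng = np.random.default_rng(5)
n_vars, n_classes = 8, 3
W_gate = rng.normal(size=(n_classes, n_vars))      # placeholder gater parameters
W_exp = rng.normal(size=(n_classes, n_vars))       # one placeholder expert per class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_premium(x):
    p_class = softmax(W_gate @ x)                  # p(c|x), sums to one
    p_expert = np.logaddexp(0.0, W_exp @ x)        # p_c(x), positive premiums
    return float(p_class @ p_expert)               # equation (27)

print(mixture_premium(rng.normal(size=n_vars)))
```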
6
Experimental Results
6.1
Mean-Squared Error Comparisons
Table 1 summarizes the main results concerning the comparison between different types of statistical machine learning algorithms. All the models have been trained using the same input profile variables.
Figure 13. MSE results (from Table 1) for eight models. Models have been sorted in ascending order of test results. The training, validation and test curves have been shifted closer together for visualization purposes. The out-of-sample test performance of the Mixture model is significantly better than any of the others. Validation-based model selection is confirmed on the test results.
For each insurance policy, a total of 33 input variables were used and the total claims for an accident came from five main coverages: bodily injury, accident benefit, property damage, collision and comprehensive. Two other minor coverages were also included: death benefit and loss of use. In the table, NN stands for Neural Network, GLM for Generalized Linear Model, and CondMean for the Greedy Multiplicative Model. The MSE on the training set, validation set and test set are shown for all models. The MSE is with respect to claim amounts and premiums expressed in thousands of dollars. The model with the lowest MSE is the "Mixture model," and it is the
model that has been selected for the comparisons with the insurer's current rules for determining insurance premiums, to which we shall refer as the Rule-Based Model. One may wonder from the previous table why the MSE values are so similar across the various models for each dataset and yet so different across the datasets. In particular, all models perform much worse on the test set (in terms of their MSE). There is a very simple explanation. The maximum incurred amount on the test set and on the validation set is around 3 million dollars. If there was one more such large claim in the test set than in the validation set, one would expect the test MSE (calculated for premiums and amounts in thousands of dollars) to be larger by about 7 (these are in units of squared thousand dollars). Thus a difference of 11 can easily be explained by a couple of large claims. This is a reflection of the very thick right-hand tail of the incurred amount distribution (whose standard deviation is only about 8 thousand dollars). Conversely, this also explains why all MSE values are very similar across models for one particular dataset. The MSE values are all mainly driven by very large claims which no model could reliably forecast (no model could lead the insurer to charge a million dollars to a particular insured!). Consequently, truly significant differences between model performances are overshadowed by the effect of very large claims on the MSE values. Although the differences between model performances are relatively small, we shall see next that careful statistical analysis allows us to discover that some of them are significant. Figure 13 illustrates graphically the results of the table, with the models ordered according to the validation set MSE. One should note that within each class of models the capacity is tuned according to the performance on the validation set. On the test and validation sets, the Mixture model dominates all the others. Then come the ordinary Neural Network, linear model, and softplus Neural Network. Only slightly worse are the GLM and CondMean (the Greedy Multiplicative model). CHAID fared poorly on this dataset. Note that the CHAID + linear model described in section 5.7 performed worse
Table 2. Statistical comparison between different learning models and the mixture model. The p-value is for the null hypothesis of no difference between Model #1 and the best mixture model. Symbols μ and σ stand for sample mean and standard deviation. Note that ALL differences are statistically significant.

Model #1    Model #2    μ          σ          Z        p-value
Constant    Mixture     3.41e-02   3.33e-03   10.24    0
Linear      Mixture     5.82e-03   1.32e-03    4.41    5.30e-06
GLM         Mixture     7.54e-03   1.15e-03    6.56    2.77e-11
NN          Mixture     5.24e-03   1.41e-03    3.71    1.03e-04
Softplus    Mixture     6.71e-03   1.09e-03    6.14    4.21e-10
CHAID       Mixture     2.36e-02   2.58e-03    9.15    0
than ordinary CHAID. Finally, the constant model is shown as a baseline (since it corresponds to assigning the same premium to every 1-year policy). It is also interesting to note from the figure that the model with the lowest training MSE is not necessarily the best out-of-sample (on the validation or test sets). The SVM performance was appalling and is not shown here; it did much worse than the constant model, because it is aiming for the conditional median rather the conditional expectation, which are very different for this kind of data. Table 2 shows a statistical analysis to determine whether the differences in MSE between the Mixture model and each of the other models are significant. The Mean column shows the difference in MSE with the Mixture model. The next column shows the Standard Error of that mean. Dividing the mean by the standard error gives Z in the next column. The last column gives the p-value of the null hypothesis according to which the true expected squared errors for both models are the same. Conventionally, a value below 5% or 1% is interpreted as indicating a significant difference between the two models. The p-values and Z corresponding to significant differences are highlighted. Therefore the differences in performance between the mixture and the other models are all statistically significant. As
Table 3. Statistical comparison between pairs of learning models. Models are ordered from worst to best. Symbols μ and σ stand for sample mean and standard deviation. The test is for comparing the sum of MSEs. The p-value is for the null hypothesis of no difference between Model #1 and Model #2.

Model #1    Model #2    μ          σ          Z       p-value
Constant    CHAID       1.05e-02   2.62e-03   3.99    3.24e-05
CHAID       GLM         1.60e-02   2.15e-03   7.46    4.23e-14
GLM         Softplus    8.29e-04   8.95e-04   0.93    1.77e-01
Softplus    Linear      8.87e-04   1.09e-03   0.82    2.07e-01
Linear      NN          5.85e-04   1.33e-03   0.44    3.30e-01
NN          Mixture     5.23e-03   1.41e-03   3.71    1.03e-04
mentioned above, the MSE values are very much affected by large claims. Does such a sensitivity to very large claims make statistical comparisons between models incorrect? No. Fortunately all the comparisons are performed on paired data (the squared error for each individual policy), which cancels out the effect of these very large claims (since, for these special cases, the squared error will be huge for all models and of very close magnitude). Table 3 has similar columns, but it provides a comparison of pairs of models, where the pairs are consecutive models in the order of validation set MSE. What can be seen is that the ordinary Neural Network (NN) is significantly better than the linear model, but the latter, the softplus Neural Network and the GLM are not statistically distinguishable. Finally, the GLM is significantly better than CHAID, which is significantly better than the constant model. Note that although the softplus Neural Network alone is not doing very well here, it is doing very well within the Mixture model (it is the most successful one as a component of the mixture). The reason may be that within the mixture, the parameter estimation for the model of the low incurred amounts is not polluted by the very large incurred amounts (which are learned in a separate model).
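The paired test behind Tables 2 and 3 can be sketched as follows: take the per-policy difference of squared errors of two models, compute its mean and standard error, and report Z together with a one-sided normal-approximation p-value (consistent with the values shown in the tables). The data and "models" below are synthetic placeholders.

```python
# Sketch of the paired comparison of two models' squared errors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 200_000
claims = rng.exponential(0.1, n) * (rng.random(n) < 0.07)
pred_a = np.full(n, claims.mean())                 # e.g. the constant model
pred_b = claims.mean() + rng.normal(0, 0.01, n)    # some competing model

diff = (pred_a - claims) ** 2 - (pred_b - claims) ** 2   # paired differences
mu = diff.mean()
sigma = diff.std(ddof=1) / np.sqrt(n)              # standard error of the mean
z = mu / sigma
p_value = stats.norm.sf(z)                         # one-sided p-value

print(f"mu = {mu:.3e}, sigma = {sigma:.3e}, Z = {z:.2f}, p-value = {p_value:.2e}")
```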
6.2
Evaluating Model Fairness
Although measuring the predictive accuracy—as done with the MSE in the previous section—is a useful first step in comparing models, it tells only part of the story. A given model could appear significantly better than its competitors when averaging over all customers, and yet perform miserably when restricting attention to a subset of customers. We consider a model to be fair if different cross-sections of the population are not significantly biased against, compared with the overall population. Model fairness implies that the average premium within each sub-group should be statistically close to the average incurred amount within that sub-group. Obviously, it is nearly impossible to correct for any imaginable bias since there are many different criteria to choose from in order to divide the population into subgroups; for instance, we could split according to any single variable (e.g., premium charged, gender, rate group, territory) but also combinations of variables (e.g., all combinations of gender and territory, etc.). Ultimately, by combining enough variables, we end up identifying individual customers, and give up any hope of statistical reliability. As a first step towards validating models and ensuring fairness, we choose the subgroups corresponding to the location of the deciles of the premium distribution. The i-th decile of a distribution is the point immediately above 10i% of the individuals of the population. For example, the 9-th decile is the point such that 90% of the population come below it. In other words, the first subgroup contains the 10% of the customers who are given the lowest premiums by the model, the second subgroup contains the range 10%-20%, and so on. The subgroups corresponding to the Mixture Model (the proposed model) differ slightly from those of the Rule-Based Model (the insurer's current rules for determining insurance premiums), since the premium distributions of the two models are not the same. The subgroups used for evaluating each model are given in Table 4.
Table 4. Subgroups used for evaluating model fairness, for the Mixture and Rule-Based Models. The lowest and highest premiums in the subgroups are given. Each subgroup contains the same number of observations, ≈ 28,000.

              Mixture Model           Rule-Based Model
              Low        High         Low        High
Subgroup 1    50.81      139.27       166.24     245.0145
Subgroup 2    139.27     214.10       245.01     297.0435
Subgroup 3    214.10     259.74       297.04     336.7524
Subgroup 4    259.74     306.26       336.75     378.4123
Subgroup 5    306.27     357.18       378.41     417.5794
Subgroup 6    357.18     415.93       417.58     460.2658
Subgroup 7    415.93     490.34       460.26     507.0753
Subgroup 8    490.35     597.14       507.07     554.2909
Subgroup 9    597.14     783.90       554.29     617.1175
Subgroup 10   783.90     4296.78      617.14     3095.7861

Since they correspond to the deciles of a distribution, all the subgroups contain approximately the same number of observations (≈ 28,000 on the 1998 test set). The bias within each subgroup appears in Figure 14. It shows the average difference between the premiums and the incurred amounts, within each subgroup (recall that the subgroups are divided according to the premiums charged by each model, as per Table 4). A positive difference implies that the average premium within a subgroup is higher than the average incurred amount within the same subgroup. 95% confidence intervals on the mean difference are also given, to assess the statistical significance of the results.
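The fairness check just described amounts to a decile-by-decile bias computation. The sketch below groups policies by premium deciles and reports the mean of (premium − incurred amount) with a 95% confidence interval per decile; premiums and claims are synthetic placeholders.

```python
# Sketch of the decile-wise bias and 95% confidence intervals.
import numpy as np

rng = np.random.default_rng(7)
n = 280_000
premium = rng.gamma(4.0, 75.0, size=n)                          # charged premiums
incurred = rng.exponential(3000.0, n) * (rng.random(n) < 0.07)  # incurred amounts

edges = np.quantile(premium, np.linspace(0, 1, 11))             # decile boundaries
group = np.clip(np.searchsorted(edges, premium, side="right") - 1, 0, 9)

for g in range(10):
    diff = premium[group == g] - incurred[group == g]
    half_width = 1.96 * diff.std(ddof=1) / np.sqrt(diff.size)
    print(f"decile {g + 1:2d}: bias = {diff.mean():8.2f} +/- {half_width:7.2f}")
```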
Since the subgroups for the two models do not exactly represent the same customers, we shall refrain from directly comparing the two models on a given subgroup. We note the following points:
• For most subgroups, the two models are being fair: the bias is usually not statistically significantly different from zero.
• More rarely, the bias is significantly positive (the models overcharge), but never significantly negative (the models undercharge).
• The only subgroup for which both models undercharge is that of the highest-paying customers, the 10-th subgroup. This can be understood, as these customers represent the highest risk; a high degree of uncertainty is associated with them. This uncertainty is reflected in the huge confidence intervals on the mean difference, wide enough not to make the bias significantly different from zero in either case. (The bias for the Rule-Based Model is nearly significant.)
From these results, we conclude that both models are usually fair to customers in all premium subgroups. A different type of analysis could also be pursued, asking a different question: "In which cases do the Mixture and the Rule-Based Models differ the most?" We address this issue in the next section.
6.3
Comparison with Current Premiums
For this comparison, we used the best (on the validation set) Mixture model and compared it on the test data of 1998 against the insurer's Rule-Based Model. Note that for legislative reasons, the Rule-Based Model did not use the same variables as the proposed Mixture Model. Histograms comparing the distribution of the premiums between the Rule-Based and the Mixture models appear in Figure 15. We observe that the premium distribution from the Mixture model is smoother and exhibits fatter tails (more probability mass in the right-hand side of the distribution, far from the mean). The Mixture model is better able to recognize risky customers and impose an appropriately-priced premium. This observation is confirmed by looking at the distribution of the premium difference between the Rule-Based and Mixture models, as shown in Figure 16. We note that this distribution is extremely skewed to the left. This
Figure 14. Average difference between premiums and incurred amounts (on the sum over all coverage groups), for the Mixture and Rule-Based models, for each decile of the models' respective premium distribution. We observe that both models are being fair to most customers, except those in the last decile, the highest-risk customers, where they appear to undercharge. The error bars represent 95% confidence intervals. (Each decile contains ≈ 28,000 observations.)
means that for some customers, the Rule-Based model considerably under-charges with respect to the Mixture model. Yet, the median of the distribution is above zero, meaning that the typical customer pays more under the Rule-Based model than under the Mixture model. At the same time, the Mixture model achieves better prediction accuracy, as measured by the Mean-Squared Error (MSE) of the respective models, all the while remaining fair to customers in all categories. Our overriding conclusion can be stated plainly: the Mixture model correctly charges less for typical customers, and correctly charges more for the "risky" ones. This may be due in part to the
Figure 15. Comparison of the premium distribution for the current Rule-Based model and the Mixture model. The distributions are normalized to the same mean (298.52). Rule-Based Model: median 286.50, standard deviation 78.74; Mixture Model: median 246.59, standard deviation 189.53. The Mixture model distribution has fatter tails and is much smoother.
use of more variables, and in part to the use of a statistical learning algorithm which is better suited to capturing the dependencies between many variables.
7
Application to Risk Sharing Pool Facilities
In some provinces and states, improved discrimination between good and bad risks can be used for the purpose of choosing the insureds to be ceded to risk-sharing pool facilities. In this section, we illustrate the performance of some algorithms when applied to this task according to the rules of Quebec's Plan de Répartition des Risques (PRR). Under this facility, an insurer can choose to cede up to 10% of its book of business to the pool by paying 75% of the gross premium that was charged to the insured. Then, in case an accident occurs, the PRR assumes all claim payments. The losses
Figure 16. Distribution of the premium difference between the Rule-Based and Mixture models, for the sum of the first three coverage groups (mean ≈ 0, median 37.55, standard deviation 154.65). The distribution is negatively skewed: the Rule-Based model severely undercharges for some customers.
(gains) in the pool are then shared among the insurers in proportion to their market share. Since automobile insurance is mandatory in Quebec, the PRR was initially created in order to compensate insurers that were forced by the legislator to insure some risks that had previously been turned down by multiple insurers. The idea was that the insurer could then send these risks to the pool and the losses would be spread among all insurers operating in the province. Of course, such extreme risks represent far less than the allowed 10% of an insurer's volume. The difference can then be used for other profitable purposes. One possibility is for an insurer to cede the risks that bring the most volatility to its book of business, so that the pool becomes a means of obtaining reinsurance. In this section, we take a different view: our interest is to use highly discriminative models to identify "holes" in the ratemaking model, i.e., to identify the risks that have been underpriced the most. Mathematically, this corresponds to identifying risks for which the expected value of the claims is higher than 75% of the gross premium, i.e., those risks with an expected loss ratio of at least
75%, a figure above the industry's average performance. For a particular insurer, the lower the loss ratio, the more difficult it becomes to identify risks that can be (statistically) profitably ceded. Still, there are a few reasons why important underpricings can be identified:
1. legislation related to ratemaking is more restrictive than that which pertains to the risk-sharing pool,
2. strategic marketing concerns may have forced the insurer to underprice a certain part of its book of business, and
3. other concerns may not allow the insurer to use highly discriminative models for the purpose of ratemaking.
The last two items can possibly be handled by rule-based systems if the insurer clearly knows which segments of its book of business are underpriced. The legislative context is of more interest to us: stringent legislators prevent insurers from using highly explanatory variables such as sex or age for the purpose of ratemaking. If the pool facility ruling is silent in that regard, then underpricings can easily be identified. But this can be done with traditional models. The interest in highly discriminative models such as Neural Networks comes from the necessity of filing ratemaking plans in a clear fashion. Often, this filing operation limits an actuary in his desire to exploit relevant dependencies between explanatory variables. A lot of insurers still analyze variables independently, in silos, in order to compute individual parameters for each one of them. In that case, no dependency can be captured unless a "home-brewed" variable, resulting from the combination of many, is added. But this is a highly informal procedure which relies on the actuary's thorough knowledge of the problem at hand and technical assistance such as visualization tools. As we have shown in subsection 5.9, Neural Networks are able to automate this procedure and capture the most relevant of these dependencies with respect to ratemaking. This is where the most important difference between Neural Networks and Generalized Linear Models comes in: automating the detection of dependencies.
The superiority of the Neural Network model is illustrated in Figure 17, where we have simulated the profits that can be generated by an insurer with a 100M$ book of business operating at a global loss ratio of 65%. We compare Neural Networks and Generalized Linear Models as they take turns as the ratemaking model and the facility model (the model used to identify underpricings in the ratemaking and to choose the risks to be ceded to the facility). We measured profits as follows: for a particular insured risk, let P_r and P_f be the premiums computed according to the ratemaking and facility models, respectively. Let C be the level of claims that occurred for that risk (usually zero). The premiums P_r and P_f are pure premiums. Since we have assumed a loss ratio of 65%, we can compute the gross premium as P_r/65% for the ratemaking model. Then, when a risk is ceded, the facility keeps 75% of that premium. So the actual profit of ceding a particular risk is

Actual Profit = C − 75% × P_r/65%.

Similarly, the facility premium P_f corresponds to the expected level of claims, so the projected profit of ceding a risk is

Projected Profit = P_f − 75% × P_r/65%.

Accordingly, the facility premium must be 15.4% higher than the corresponding ratemaking premium in order to (statistically) profitably cede a risk. The top graphic of Figure 17 shows that the Neural Network, used as a facility model, can help generate substantial profits (between 1.25M$ and 1.5M$ for the insurer with a 100M$ book of business) when a GLM is used for ratemaking. It profitably identifies underpricings on more than 10% of the insurer's book of business. Also observe that the difference between the actual and projected profits is relatively small. Since

Actual Profit − Projected Profit = C − P_f,

we conclude that the Neural Network is very precise at estimating the expected claims level for high risks.
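The cession arithmetic above is easy to simulate. The sketch below computes, for synthetic placeholder premiums and claims, the projected profit P_f − 75% × P_r/65% and the actual profit C − 75% × P_r/65%, cedes at most 10% of the book (the risks with the largest positive projected profit), and sums the result.

```python
# Sketch of the risk-sharing-pool profit computation described above.
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
p_rate = rng.gamma(4.0, 75.0, size=n)            # P_r: ratemaking pure premiums
p_fac = p_rate * rng.lognormal(0.0, 0.2, n)      # P_f: facility-model premiums
claims = rng.exponential(300.0, n) * (rng.random(n) < 0.07)   # C

ceding_cost = 0.75 * p_rate / 0.65               # 75% of the gross premium
projected = p_fac - ceding_cost                  # positive iff P_f > 1.154 * P_r
actual = claims - ceding_cost

order = np.argsort(-projected)                   # most underpriced risks first
candidates = order[: int(0.10 * n)]              # at most 10% of the book
ceded = candidates[projected[candidates] > 0]

print("risks ceded      :", ceded.size)
print("projected profit : %.0f" % projected[ceded].sum())
print("actual profit    : %.0f" % actual[ceded].sum())
```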
According to the graphic, the insurer has been able to cede

1.25M$ + 75% × 10% × 100M$ = 8.75M$

in claims to the pool. Thus, the ceded risks had an average loss ratio of 87.5%, up from the global figure of 65%. On the other hand, the second graphic of Figure 17 shows that the GLM model, when used as the facility model, mistakenly identifies underpricings in the ratemaking Neural Network model that appear in the projected profit but do not translate into real, actual profit.
8
Conclusion
Neural networks have been known to perform well in tasks where discrimination is an important aspect of the task at hand, and this has led to many commercially successful applications of these modeling tools (Keller 1997). We have shown that, when the particulars of insurance data are properly taken into account, that ability to discriminate is also revealed with insurance data. When applied to automobile insurance ratemaking, they allow us to identify more precisely the true risk associated with each insured. We have argued in favor of the use of statistical learning algorithms such as Neural networks for automobile insurance ratemaking. We have described various candidate models and compared them qualitatively and numerically. We have found that the selected model has significantly outperformed all other models, including the current premium structure. We believe that their superior performance is mainly due to their ability to capture high-order dependencies between variables and to cope with the fat-tailed distribution of the claims. Other industries have adopted statistical learning algorithms in the last decade and we have shown them to be suited for the automobile insurance industry as well. Completely changing the rate structure of an insurer can be a costly enterprise, in particular when it involves significant changes
Figure 17. Profit from the PRR facility as a function of the ceding percentage. Both the projected profit (dashed) and the actual profit (solid) are shown. These illustrations apply to an insurer with a volume of business of 100M$ and a global loss ratio of 65%. In the top (bottom) figure, the ratemaking model is the GLM (Neural Network) model and the model used to identify the underpricings is the Neural Network (GLM) model.
in the computer systems handling transactions, or the relations with brokers. We have shown that substantial profit can be obtained from the use of Neural Networks in the context of risk-sharing pools. There are still many other applications where better discrimination of risks can be used profitably, in particular target marketing, fraud detection and elasticity modeling.

Target Marketing
When an insurer sends out a mail solicitation, only a portion (5%-10%) of the whole population will be contacted. The goal here is, given this fixed portion, to reach the maximum number of people who will respond positively to the solicitation. Another possibility would be for the insurer to develop a "customer lifetime value" model that would predict, given an insured's profile, the expected present value of the future profits that will be generated by acquiring this particular insured's business. Then, by using the customer lifetime value model in conjunction with a model for the probability of positive response, an insurer could attempt to maximize the profit of its solicitation campaign instead of simply maximizing the number of new insureds.

Fraud Detection
Fraud represents 10%-15% of all claims. Usually only a portion of the claims will be looked at by an insurer's investigators. The goal of fraud detection is to develop a model that will help an insurer increase the effectiveness of its investigators by referring to them the cases that are most likely to be fraudulent. In order to do so, one needs a database of previous successful and unsuccessful investigations. Neural Networks have been applied with great success to credit card fraud detection.

Elasticity Modeling
The greatest benefit from an improved estimation of pure premium derives from its application to ratemaking. The main rea-
son for these benefits is that a more discriminant predictor will identify a group of insureds that are significantly undercharged and a (much larger) group that is significantly overcharged. Identifying the undercharged will increase profits: increasing their premiums will either directly increase revenues (if they stay) or reduce underwriting losses (if they switch to another insurer). The advantage of identifying the insured profiles which correspond to overcharged premiums can be coupled with a marketing strategy in order to attract new customers and increase market share, a very powerful engine for increased profitability of the insurer (because the fixed costs are shared by a larger number of insureds). To decide on the appropriate change in premium, one also needs to consider market effects. An elasticity model can be independently developed in order to characterize the relation between a premium change and the probability of losing current customers or acquiring new customers. A pure premium model such as the one described in this chapter can then be combined with the elasticity model, as well as pricing constraints (e.g., to prevent too much rate dislocation in premiums, or to satisfy some jurisdiction's regulations), in order to obtain a function that "optimally" chooses for each insured profile an appropriate change in gross premium, in order to maximize a financial criterion. Clearly, the insurance industry is filled with analytical challenges where better discrimination between good and bad risks can be used profitably. We hope this chapter goes a long way in convincing actuaries to include Neural networks within their set of modeling tools for ratemaking and other analytical tasks.
Appendix

Proof of the Equivalence of the Fairness and Precision Criteria

In this section, we show that, when all subpopulations are considered to evaluate fairness, the precision criterion and the fairness criterion, as they were defined in section 3, both yield the same premium function.

Theorem 1 The premium function which maximizes precision (in the sense of equation 6) also maximizes fairness (in the sense of equation 9, when all subpopulations are considered), and it is the only one that does maximize it.

Proof: Let P be a subset of the domain of input profiles. Let q be a premium predictor function. The bias in P is defined by

b_q(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} (q(x_i) − a_i).

Let F_q = −E[Σ_P b_q(P)²] be the expected "fairness" criterion using premium function q, to be maximized (by choosing q appropriately). Let p(x) = E[A | X = x] be the optimal solution to the precision criterion, i.e., the minimizer of

E[(p(X) − A)²].

Consider a particular population P. Let q̄(P) denote the average premium for that population using the premium function q(x),

q̄(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} q(x_i),

and similarly define ā(P), the average claim amount for that population,

ā(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} a_i.

Then the expected squared bias for that population, using the premium function q, is

E[b_q(P)²] = E[(q̄(P) − ā(P))²],

which is minimized for any q such that q̄(P) = E[ā(P)]. Note in particular that the optimal expected-squared-error solution, p, is such a minimizer, since

p̄(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} E[a_i | x_i] = E[ (1/|P|) Σ_{(x_i, a_i) ∈ P} a_i ] = E[ā(P)].

We know therefore that q = p is a maximizer of F_q, i.e., ∀q, F_q ≤ F_p. Are there other maximizers? Consider a function q ≠ p. Since q ≠ p, there exists an x such that q(x) ≠ p(x). Consider the particular singleton population P_x = {x}. On singleton populations, the expected squared bias is the same as the expected squared error. In fact, there is a component of F which contains only the squared biases for the singleton populations, and it is equal to the expected squared error. Therefore on that population (and any other singleton population for which q ≠ p) there is only one minimizer of the expected squared bias, and it is the conditional expectation p(x). So

E[(q(x) − A)² | X = x] > E[(p(x) − A)² | X = x]

and therefore E[b_q(P_x)²] > E[b_p(P_x)²]. Since p minimizes the expected squared bias on every population, it is enough to prove that q is sub-optimal on one population to prove that the overall fairness of q is less than that of p, which is the main statement of our theorem:

∀ q ≠ p, F_q < F_p.
References

Aldrich, J. 1995, R.A. Fisher and the making of maximum likelihood 1912-22, Technical Report 9504, University of Southampton, Department of Economics.
Bailey, R. and Simon, L. 1960, 'Two studies in automobile insurance ratemaking', ASTIN Bulletin 1(4), 192-217.

Bellman, R. 1957, Dynamic Programming, Princeton University Press, NJ.

Bengio, Y. and Gingras, F. 1996, Recurrent neural networks for missing or asynchronous data, in M. Mozer, D. Touretzky and M. Perrone, eds, 'Advances in Neural Information Processing Systems', Vol. 8, MIT Press, Cambridge, MA, pp. 395-401.

Bishop, C. 1995, Neural Networks for Pattern Recognition, Oxford University Press.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. 1984, Classification and Regression Trees, Wadsworth Int. Group.

Brown, R. 1988, Minimum bias with generalized linear models, in 'Proceedings of the Casualty Actuarial Society'.

Campbell, J., Lo, A. W. and MacKinlay, A. 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton.

Cantelli, F. 1933, 'Sulla probabilità come limite della frequenza', Rend. Accad. Lincei 26(1), 39.

Chapados, N. and Bengio, Y. 2003, 'Extensions to metric-based model selection', Journal of Machine Learning Research 3, 1209-1227. Special Issue on Feature Selection.

Cristianini, N. and Shawe-Taylor, J. 2000, An Introduction to Support Vector Machines, Cambridge University Press.

Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977, 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society B 39, 1-38.
Diebold, F. X. and Mariano, R. S. 1995, 'Comparing predictive accuracy', Journal of Business and Economic Statistics 13(3), 253-263.

Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C. and Garcia, R. 2001, A universal approximator of convex functions applied to option pricing, in 'Advances in Neural Information Processing Systems', Vol. 13, Denver, CO.

Fisher, R. A. 1912, 'On an absolute criterion for frequency curves', Messenger of Mathematics 41, 155-160.

Fisher, R. A. 1915, 'Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population', Biometrika 10, 507-521.

Fisher, R. A. 1922, 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London A222, 309-368.

Fisher, R. A. 1925, 'Theory of statistical estimation', Proceedings of the Cambridge Philosophical Society 22, 700-725.

Geman, S., Bienenstock, E. and Doursat, R. 1992, 'Neural networks and the bias/variance dilemma', Neural Computation 4(1), 1-58.

Ghahramani, Z. and Jordan, M. I. 1994, Supervised learning from incomplete data via an EM approach, in J. Cowan, G. Tesauro and J. Alspector, eds, 'Advances in Neural Information Processing Systems', Vol. 6, Morgan Kaufmann, San Mateo, CA.

Gingras, F., Bengio, Y. and Nadeau, C. 2000, On out-of-sample statistics for time-series, in 'Computational Finance 2000'.

Glivenko, V. 1933, 'Sulla determinazione empirica delle leggi di probabilità', Giornale dell'Istituto Italiano degli Attuari 4, 92.
Hampel, F., Ronchetti, E., Rousseeuw, P. and Stahel, W. 1986, Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons.

Hastie, T., Tibshirani, R. and Friedman, J. 2001, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.

Holler, K., Sommer, D. and Trahair, G. 1999, 'Something old, something new in classification ratemaking with a novel use of GLMs for credit insurance', Casualty Actuarial Society Forum pp. 31-84.

Huber, P. 1982, Robust Statistics, John Wiley & Sons Inc.

Huber, P. J. 1964, 'Robust estimation of a location parameter', Annals of Mathematical Statistics 35, 73-101.

Huber, P. J. 1965, 'A robust version of the probability ratio test', Annals of Mathematical Statistics 36, 1753-1758.

Huber, P. J. 1966, 'Strict efficiency excludes superefficiency', Annals of Mathematical Statistics 37, 14-25.

Huber, P. J. 1968, 'Robust confidence limits', Zeitschrift für Wahrscheinlichkeitstheorie 10, 269-278.

Huber, P. J. 1981, Robust Statistics, John Wiley & Sons, New York.

Huber, P. J. 1985, 'Projection pursuit', Annals of Statistics 13, 435-525.

Huber, P. J. and Rieder, H. 1996, Robust Statistics, Data Analysis, and Computer Intensive Methods, number 109 in 'Lecture Notes in Statistics', Springer, New York.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. 1991, 'Adaptive mixture of local experts', Neural Computation 3, 79-87.
Kass, G. 1980, 'An exploratory technique for investigating large quantities of categorical data', Applied Statistics 29(2), 119-127.

Keller, P. E. 1997, 'Neural networks: Commercial applications', http://www.emsl.pnl.gov:2080/proj/neuron/neural/products.

Kolmogorov, A. 1933, 'Sulla determinazione empirica di una legge di distribuzione', Giornale dell'Istituto Italiano degli Attuari 4, 33.

McCullagh, P. and Nelder, J. 1989, Generalized Linear Models, Chapman and Hall, London.

Murphy, K., Brockman, M. and Lee, P. 2000, 'Using generalized linear models to build dynamic pricing systems', Casualty Actuarial Society Forum pp. 107-139.

Nadeau, C. and Bengio, Y. 2000, Inference for the generalization error, in S. Solla, T. Leen and K.-R. Müller, eds, 'Advances in Neural Information Processing Systems', Vol. 12, MIT Press, pp. 307-313.

Nelder, J. and Wedderburn, R. 1972, 'Generalized linear models', Journal of the Royal Statistical Society 135, 370-384.

Newey, W. and West, K. 1987, 'A simple, positive semi-definite, heteroscedasticity and autocorrelation consistent covariance matrix', Econometrica 55, 703-708.

Orr, G. and Müller, K.-R. 1998, Neural Networks: Tricks of the Trade, Springer.

Rousseeuw, P. and Leroy, A. 1987, Robust Regression and Outlier Detection, John Wiley & Sons Inc.

Rumelhart, D., Hinton, G. and Williams, R. 1986, 'Learning representations by back-propagating errors', Nature 323, 533-536.
Schölkopf, B., Smola, A. and Müller, K.-R. 1998, 'Nonlinear component analysis as a kernel eigenvalue problem', Neural Computation 10, 1299-1319.

Takeuchi, I., Bengio, Y. and Kanamori, T. 2002, 'Robust regression with asymmetric heavy-tail noise distributions', Neural Computation 14, 2469-2496.

Tikhonov, A. and Arsenin, V. 1977, Solutions of Ill-posed Problems, W.H. Winston, Washington D.C.

Vapnik, V. 1998, Statistical Learning Theory, John Wiley & Sons, New York.
Chapter 5 An Integrated Data Mining Approach to Premium Pricing for the Automobile Insurance Industry A.C. Yeo and K.A. Smith
In setting the optimal combination of premiums, the insurance company has to find a balance between two conflicting objectives of market share and profitability. The premiums must cover the cost of expected claims and also ensure that a certain level of profitability is achieved. However, the premiums must not be set so high that market share is jeopardized as consumers exercise their rights to choose their insurers in a competitive market place. We used a data mining approach, which endeavours to find a balance between the two conflicting objectives in determining optimal premium, and we were able to demonstrate the quantitative benefits of the approach.
1
Introduction
To succeed in a highly competitive environment, insurance companies strive for a combination of market growth and profitability, and these two goals are at times conflicting. Premiums play a critical role in enabling insurance companies to find a balance between these two goals. The challenge is to set the premium so that expected claim costs are covered and a certain level of profitability is achieved; yet not to set premiums so high that market share is jeopardized as consumers exercise their rights to choose their insurers.
Insurance companies have traditionally determined premiums by assigning policy holders to pre-defined groups and observing the average claim behavior of each group. The groups are formed based on industry experience about the perceived risk of different groups of policy holders. With the advent of data warehouses and data mining however comes an opportunity to consider a different approach to assessing risk; one based on data-driven methods. By using data mining techniques, the aim is to determine optimal premiums that more closely reflect the genuine risk of individual policy holders as indicated by behaviors recorded in the data warehouse.
2
A Data Mining Approach
Figure 1 presents a data mining approach for determining the appropriate pricing of policies. The approach consists of three main components. The first component involves identifying risk classifications and predicting claim costs using k-means clustering. The total premiums charged must be sufficient to cover all claims made against the policies and return an acceptable cost ratio. The second component involves price sensitivity analysis using neural networks. Premiums cannot be set at too high a level as customers may terminate their policies, thus affecting market share. The third component combines the results of the first two components to provide information on the impact of premiums on profitability and market share. The optimal mix of policy holders for a given termination rate can be determined by non-linear integer programming. An Australian motor insurance company supplied the data for this study. Two data sets (training set and test set), each consisting of 12-months of comprehensive motor insurance policies and claim information were extracted. The training set consisted of 146,326 policies with due dates from 1 January to 31 December 1998 while the test set consisted of 186,658 policies with due dates from 1 July 1998 to 30 June 1999. The data was selected to enable comparison
of exposure and retention rates over a one-year period and to ensure that sample sizes are sufficiently large within the constraints of the data available at the time of collection. Forty percent of the policies in the test set were new policies. The training set was used to train the models while the test set was used to evaluate the results.
Figure 1. A data mining approach to determine optimal premium.
3
Risk Classification and Prediction of Claim Cost
3.1
Risk Classification
Insurance companies classify policy holders into various risk groups based on factors such as territory, demographic variables (such as age, gender and marital status) and other variables (such as use of vehicle, driving record and years of driving experience). These factors are considered predictors of claim costs (Dionne and
Vanasse 1992, Samson 1986, Tryfos 1980). Risk classification has traditionally been achieved using heuristic methods, both within the industry and in academic studies. For example, Samson and Thomas (1987) selected four variables: age of policy holder, area in which the policy holder lives, group rating of the insured automobile and level of no-claim discount; and categorized each variable into three levels of activity. Each policy was placed into one of the 81 (3⁴) risk groups. However, the number of factors that can be included is limited for the heuristic method. This is because the number of exposure units in each risk group must be sufficiently large to make claim costs reasonably predictable (Vaughan and Vaughan 1996). To ensure that there is a large number of exposure units in each risk group, the number of risk groups has to be kept small, which in turn limits the number of factors that can be considered. For example, adding an additional factor in Samson and Thomas' study would increase the number of risk groups to 243 (3⁵), which would significantly reduce the number of exposure units in each risk group. The classification structure is normally designed to achieve maximum homogeneity within groups and maximum heterogeneity between groups. This can be achieved through clustering, whether crisp or fuzzy. Clustering places objects into groups or clusters based on distances computed on the attributes of the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar (Anderberg 1973, Everitt 1993, Johnson 1998, Kaufman and Rousseeuw 1990). Clustering allows more factors to be included in risk classification without compromising the size of each risk group, compared to the heuristic method. Several researchers have used clustering techniques for the risk classification stage of the claim cost prediction problem. Williams and Huang (1997) used k-means clustering to identify high claiming policy holders in a motor vehicle insurance portfolio. Derrig and Ostaszewski (1995) used fuzzy c-means clustering for territo-
rial rating of Massachusetts towns. Smith, Willis and Brooks extended the use of clustering by proposing the use of k-means clustering to predict claim costs (Smith et al. 2000). Our work extends this approach by evaluating the k-means clustering model and the fuzzy c-means clustering model as techniques for predicting claims cost, and comparing the results obtained with a heuristic model (Samson and Thomas 1987) to determine the advantages of a data-driven approach.

3.1.1 K-Means Clustering Model

The k-means clustering model used to classify policies performs disjoint cluster analysis on the basis of Euclidean distances computed from variables and seeds that are generated and updated by the k-means algorithm (Anderberg 1973, MacQueen 1967). Using the least squares clustering criterion, the sum of the squared distances of observations to the cluster means is minimized. Thirteen variables were used for clustering. They were:
(1) policy holder's age,
(2) policy holder's gender,
(3) area in which the vehicle was garaged,
(4) rating of policy holder,
(5) years on current rating,
(6) years on rating one,
(7) number of years policy held,
(8) category of vehicle,
(9) sum insured,
(10) total excess,
(11) vehicle use,
(12) vehicle age,
(13) whether or not the vehicle is under finance.
A minimum cluster size of 1,000 was specified to satisfy the insurability requisite of mass. The initial clustering yielded 6 clusters, with cluster sizes ranging from 1,600 to 58,000. Two more rounds of clustering were done to reduce the cluster sizes to no more than
20,000. A total of 30 risk groups were generated through the three rounds of clustering, each containing between 1,000 and 20,000 policy holders. This process ensures the clustering algorithm finds a balance between the requisite mass and homogeneity criteria.

3.1.2 Fuzzy C-Means Clustering Model

The fuzzy c-means clustering algorithm assigns each policy holder to different clusters to varying degrees specified by a membership grade. The algorithm minimizes an objective function that represents the distance from any given data point to a cluster center weighted by that data point's membership grade (Bezdek 1981). To ensure comparability with the k-means clustering, the number of clusters specified was 30. The same thirteen variables used in the k-means clustering were used for the fuzzy c-means clustering. The training data was clustered using the MATLAB Fuzzy Logic Toolbox. A neural network was then trained to learn the fuzzy inference system using the thirteen variables as input and the membership grades of the 30 clusters as output. The number of hidden neurons was fifty, and the hyperbolic tangent activation function was used. Thirty percent of the data was reserved for a validation set, and the R-squared obtained on this set was 0.9608 (with an R-squared of 0.97 for the total data set), giving us the confidence to apply the network to the test set.

3.1.3 Heuristic Model

To determine how well the two clustering models group policy holders into various risk groups, a heuristic model based on the approach of Samson and Thomas (1987) was used for comparison. Three factors were used in the heuristic model: age of the policy holders, area in which the vehicles were garaged and the category of the vehicle. Rating area and category of vehicle were sorted by average claim cost per policy holder and age of policy holder was sorted by age. The three variables were then split into 5 classes each. An attempt was made to minimise the difference in average
claim cost per policy between the classes and to ensure that each class had at least 10,000 policies, in other words, to comply with the two requirements of mass and homogeneity. A total of 125 (5³) groups were created.
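For readers who wish to experiment with the clustering step of Section 3.1.1, the following minimal Python sketch groups policies with k-means and computes the average claim cost per risk group; the column names and the synthetic data are illustrative assumptions, not the insurer's actual variables or code.

```python
# Sketch of k-means risk classification (Section 3.1.1), assuming a pandas
# DataFrame `policies` with rating variables and a `claim_cost` column;
# the data below are synthetic stand-ins.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 5000
policies = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "sum_insured": rng.uniform(5_000, 50_000, n),
    "vehicle_age": rng.integers(0, 20, n),
    "claim_cost": rng.exponential(300, n),
})

rating_vars = ["age", "sum_insured", "vehicle_age"]   # thirteen variables in the study
X = StandardScaler().fit_transform(policies[rating_vars])

kmeans = KMeans(n_clusters=30, n_init=10, random_state=0).fit(X)
policies["risk_group"] = kmeans.labels_

# The average claim cost per risk group is later used as the claim-cost
# prediction (Section 3.2.1).
group_cost = policies.groupby("risk_group")["claim_cost"].agg(["size", "mean"])
print(group_cost.head())
```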
3.2 Prediction of Claim Cost
Having grouped the policy holders into various risk groups using one of the three methods, the next stage is to predict claim costs for each of the risk groups. For the heuristic method, we have chosen regression, firstly so that we can compare our results to Samson and Thomas (1987), and secondly because regression has been shown in the literature to be preferable to other models that have been used (Chang and Fairley 1979, Sant 1980). As for the two clustering methods, the average claim cost per policy holder within each cluster of the training set was used as the basis for prediction of the test set. The two clustering approaches (k-means and fuzzy c-means) to the risk classification problem utilize all of the available information except claim behavior to find groups of policy holders exhibiting similar characteristics (demographic and historical). The heuristic method of Samson and Thomas (1987) classifies policy holders according to a set of three pre-defined factors. These methods are described in more detail below.
3.2.1 K-Means Clustering Model
The actual average claim cost per policy for each cluster, found by clustering the training set, was used as the basis for predicting average claim cost per policy for the test set.
3.2.2 Fuzzy C-Means Clustering Model
The claim cost of a policy holder is apportioned to the clusters he/she belongs to according to his/her calculated membership grade. For example, if the claim cost of a policy holder was $1,000 and his/her
membership grade of Cluster 1 is 0.8 and that of Cluster 2 is 0.2, $800 will be apportioned to Cluster 1 and $200 will be apportioned to Cluster 2. The claim cost per policy holder in each cluster will be the total cost apportioned to that cluster divided by the total membership value for that cluster. An illustrative example is shown in Table 1.

Table 1. Illustrative example of computing claims cost per policy holder (fuzzy clustering).

Policy    Claim       Membership Grade          Apportioning of Claim Cost ($)
Holder    Cost ($)    Cluster 1   Cluster 2     (Claim Cost x Membership Grade)
                                                Cluster 1   Cluster 2
1             0          0.98        0.02            0           0
2           300          1.00        0.00          300           0
3             0          0.76        0.24            0           0
4           500          0.99        0.01          497           3
5             0          0.00        1.00            0           0
6             0          0.02        0.98            0           0
7         1,000          0.98        0.02          981          19
8             0          0.03        0.97            0           0
9           900          0.16        0.84          144         756
10            0          0.01        0.99            0           0
Total     2,700          4.93        5.07        1,921         779
Claim cost per policy holder
(total claim cost / total membership value)        389         154
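To make the apportioning rule concrete, the following sketch reproduces the arithmetic of Table 1 for two clusters in Python; the figures are those of the table, and small rounding differences from the printed values are expected.

```python
# Apportion each policy holder's claim cost to clusters in proportion to the
# membership grades, then divide the apportioned totals by the total membership
# value per cluster (the fuzzy c-means claim-cost prediction of Section 3.2.2).
import numpy as np

claim_cost = np.array([0, 300, 0, 500, 0, 0, 1000, 0, 900, 0], dtype=float)
membership = np.array([  # columns: cluster 1, cluster 2 (grades from Table 1)
    [0.98, 0.02], [1.00, 0.00], [0.76, 0.24], [0.99, 0.01], [0.00, 1.00],
    [0.02, 0.98], [0.98, 0.02], [0.03, 0.97], [0.16, 0.84], [0.01, 0.99],
])

apportioned = claim_cost[:, None] * membership        # claim cost * membership grade
cost_per_holder = apportioned.sum(axis=0) / membership.sum(axis=0)
print(cost_per_holder.round())                        # approximately [389., 154.], as in Table 1
```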
3.2.3 Heuristic Model

A linear regression model, similar to Samson's (Samson and Thomas 1987), was used to predict average claim cost per policy for the risk groups found by the heuristic model. The linear model we used is shown in the following equation:

\[
y = 123.2a_1 + 13.1459a_2 - 33.6505a_3 - 29.7609a_4 - 116b_1 - 82.6731b_2 + 7.9831b_3 - 51.317b_4 - 77.8226c_1 - 65.8959c_2 - 61.8716c_3 - 3.5125c_4 + 402.4
\]

where y is the claim cost per policy holder in a cell, and each risk factor is represented by binary variables as shown below:
a_1 = 1 for age group 1, 0 otherwise
a_2 = 1 for age group 2, 0 otherwise
a_3 = 1 for age group 3, 0 otherwise
a_4 = 1 for age group 4, 0 otherwise
b_1 = 1 for rating area 1, 0 otherwise
b_2 = 1 for rating area 2, 0 otherwise
b_3 = 1 for rating area 3, 0 otherwise
b_4 = 1 for rating area 4, 0 otherwise
c_1 = 1 for vehicle category 1, 0 otherwise
c_2 = 1 for vehicle category 2, 0 otherwise
c_3 = 1 for vehicle category 3, 0 otherwise
c_4 = 1 for vehicle category 4, 0 otherwise
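A small sketch of evaluating this fitted linear model for a given cell follows; the dummy coding assumes that the fifth class of each factor is the omitted base level, which is an assumption made here for illustration only.

```python
# Evaluate the linear claim-cost model of Section 3.2.3 for a cell defined by
# age group, rating area and vehicle category (each coded with four 0/1 dummies;
# the fifth class is assumed to be the omitted base level).
def predict_claim_cost(age_group: int, rating_area: int, vehicle_category: int) -> float:
    a = [1.0 if age_group == k else 0.0 for k in (1, 2, 3, 4)]
    b = [1.0 if rating_area == k else 0.0 for k in (1, 2, 3, 4)]
    c = [1.0 if vehicle_category == k else 0.0 for k in (1, 2, 3, 4)]
    coef_a = [123.2, 13.1459, -33.6505, -29.7609]
    coef_b = [-116.0, -82.6731, 7.9831, -51.317]
    coef_c = [-77.8226, -65.8959, -61.8716, -3.5125]
    return (sum(ca * x for ca, x in zip(coef_a, a))
            + sum(cb * x for cb, x in zip(coef_b, b))
            + sum(cc * x for cc, x in zip(coef_c, c))
            + 402.4)

# Example: age group 1, rating area 3, vehicle category 2.
print(round(predict_claim_cost(1, 3, 2), 2))   # 123.2 + 7.9831 - 65.8959 + 402.4 = 467.69
```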
3.3 Results
Figures 2, 3 and 4 show the plots of predicted claim cost per policy against the actual claim cost per policy for the k-means clustering model, fuzzy c-means clustering model and heuristic model respectively. From the graphs, it can be seen that the predicted claim costs of the two clustering models are closer to the actual claim cost compared to the heuristic model. However, the fuzzy c-means does not appear to be able to discriminate policy holders with high claim cost.
Figure 2. Prediction of claim cost (k-means clustering).
Figure 3. Prediction of claim cost (fuzzy c-means clustering).
Figure 4. Prediction of claim cost (heuristic).
Table 2 shows the various measures of performance of the three models. The weighted mean absolute deviation of predicted claim cost per policy from the actual claim cost per policy for the k-
means clustering model was 8.3%, which was significantly lower than the 15.63% for the fuzzy c-means clustering model and the 13.3% for the regression model. The k-means clustering model provided a more accurate prediction than the fuzzy c-means clustering and heuristic models.

Table 2. Measurement of performance of models.

Measurements                        K-Means            Fuzzy C-Means    Heuristic
                                    Clustering Model   Model            Model
Weighted mean absolute deviation    8.30%              15.63%           13.30%
Maximum deviation                   $111               $433             $403
Maximum % deviation                 23%                129.47%          93%
Deviation within 10%                57%                39%              43%
Deviation within 20%                90%                72%              77%
4 Prediction of Retention Rates and Price Sensitivity
Having classified the policy holders into 30 risk groups using k-means clustering, we can now examine the price sensitivity within each cluster. This is the second component of the data mining framework. Within each cluster a neural network is used to predict retention rates given demographic and policy information, including the premium change from one year to the next. Sensitivity analysis of the neural networks was then performed to determine the effect of changes in premium on retention rate.
4.1 Prediction of Retention Rate
4.1.1 Neural Network Model
A multilayered feedforward neural network was constructed for each of the clusters with 25 inputs, 20 hidden neurons and 1 output neuron (whether the policy holder renews or terminates the con-
tract). The inputs consist of the thirteen variables used for risk classification and the following premium and sum insured variables:
(1) "old" premium (premium paid in the previous period),
(2) "new" premium (premium indicated in renewal notice),
(3) "new" sum insured (sum insured indicated in the renewal notice),
(4) change in premium ("new" premium - "old" premium),
(5) change in sum insured ("new" sum insured - "old" sum insured),
(6) percentage change in premium,
(7) percentage change in sum insured,
(8) ratio of "old" premium to "old" sum insured,
(9) ratio of "new" premium to "new" sum insured,
(10) whether there is a change in rating,
(11) whether there is a change in postcode,
(12) whether there is a change in vehicle.
Several experiments were carried out on a few clusters to determine the most appropriate number of hidden neurons and the activation function. Twenty hidden neurons and the hyperbolic tangent activation function were used for the neural networks for all the clusters. A uniform approach is preferred to enable the straightforward application of the methodology to all clusters, without the need for extensive experimentation by the company in the future. Input variables which were skewed were log transformed.
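As an illustrative sketch of the per-cluster retention model (not the software used in the study), a feedforward network with 20 hidden tanh units and a single output can be fitted with scikit-learn; the feature matrix below is a random stand-in for the 25 inputs listed above.

```python
# One feedforward network per cluster: 25 inputs, 20 hidden neurons with a
# hyperbolic tangent activation, one output (terminate vs. renew).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 25))             # stand-in for the 25 inputs per policy
y_train = (rng.random(2000) < 0.15).astype(int)   # 1 = terminated, roughly 15% base rate

net = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh",
                    max_iter=500, random_state=0)
net.fit(X_train, y_train)

# predict_proba gives the probability of termination that is compared with the
# decision threshold discussed in Section 4.1.2.
p_terminate = net.predict_proba(X_train)[:, 1]
print(p_terminate[:5].round(3))
```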
4.1.2 Determining Decision Thresholds
The neural network produces output between zero and one, which is the probability that a policy holder will terminate his/her policy. Figure 5 shows the probability of termination of Cluster 11 based on the training data. A threshold value is used to decide how to categorize the output data. For example a threshold of 0.5 means that if the probability of termination is more than 0.5, then the policy will be classified as terminated. Usually a decision threshold is selected based on the classification accuracy using a confusion ma-
trix. Table 3 shows a confusion matrix for Cluster 11 with a decision threshold of 0.5. The overall classification accuracy is 88.8% (of the 11,463 policies, 569 are correctly classified as terminated and 9,615 are correctly classified as renewed), while the classification accuracy for terminated policies is 33.8% and that for renewed policies is 98.3%.
Figure 5. Determining the threshold value of the neural network output (cluster 11).
Table 3. Confusion matrix for cluster 11 with decision threshold = 0.5 (training set).

                       Classified as
                       Terminated      Renewed          Total
Actual   Terminated    569 (33.8%)     1,114 (66.2%)    1,683
         Renewed       165 (1.7%)      9,615 (98.3%)    9,780
         Total         734             10,729           11,463
Overall accuracy: 88.8%
The decision threshold is usually chosen to maximize the classification accuracy. However in our case we are more concerned with achieving a predicted termination rate that is equal to the actual termination rate. This is because we are more concerned with the performance of the portfolio (balancing market share with prof-
itability) rather than whether an individual will renew or terminate his/her policy. The actual termination rate for cluster 11 is 14.7%. A threshold of 0.5 yields a predicted termination rate of 6.4%. To obtain a predicted termination rate of 14.7%, the threshold has to be reduced to 0.204 (see Figure 5). The confusion matrix for a threshold of 0.204 is shown in Table 4. The overall classification accuracy has decreased from 88.8% to 85.3% and that of renewed policies from 98.3% to 91.4%. However, the classification accuracy for terminated policies has improved from 33.8% to 50.0%. The confusion matrices for the test set with thresholds of 0.5 and 0.204 are shown in Tables 5 and 6 respectively.

Table 4. Confusion matrix for cluster 11 with decision threshold = 0.204 (training set).

                       Classified as
                       Terminated      Renewed          Total
Actual   Terminated    841 (50.0%)     842 (50.0%)      1,683
         Renewed       845 (8.6%)      8,935 (91.4%)    9,780
         Total         1,686           9,777            11,463
Overall accuracy: 85.3%

Table 5. Confusion matrix for cluster 11 with decision threshold = 0.5 (test set).

                       Classified as
                       Terminated      Renewed           Total
Actual   Terminated    284 (10.2%)     2,510 (89.8%)     2,794
         Renewed       350 (2.6%)      13,234 (97.4%)    13,584
         Total         634             15,744            16,378
Overall accuracy: 82.5%

Table 6. Confusion matrix for cluster 11 with decision threshold = 0.204 (test set).

                       Classified as
                       Terminated      Renewed           Total
Actual   Terminated    948 (33.9%)     1,846 (66.1%)     2,794
         Renewed       1,778 (13.1%)   11,806 (86.9%)    13,584
         Total         2,726           13,652            16,378
Overall accuracy: 77.9%
4.1.3 Analyzing Prediction Accuracy
The confusion matrix provides the prediction accuracy of the whole cluster. It does not tell us how a given percentage change in premium will impact termination rate, however. To determine how well the neural networks are able to predict termination rates for varying amounts of premium changes, the policies in each cluster are grouped into various bands of premium changes as shown in Table 7.

Table 7. Dividing clusters into bands of premium change.

Band       Actual premium change
<= -25%    < -22.5%
-20%       >= -22.5% and < -17.5%
-15%       >= -17.5% and < -12.5%
-10%       >= -12.5% and < -7.5%
-5%        >= -7.5% and < -2.5%
0%         >= -2.5% and < 2.5%
5%         >= 2.5% and < 7.5%
10%        >= 7.5% and < 12.5%
15%        >= 12.5% and < 17.5%
20%        >= 17.5% and < 22.5%
25%        >= 22.5% and < 27.5%
30%        >= 27.5% and < 32.5%
35%        >= 32.5% and < 37.5%
40%        >= 37.5% and < 42.5%
45%        >= 42.5% and < 47.5%
>= 50%     >= 47.5%
The predicted termination rates of each band of policies are then compared to the actual termination rates for that band. For all the clusters, the prediction accuracy of the neural networks starts to deteriorate when premium increases are between 10% and 20%. Figure 6 shows the actual and predicted termination rates for one of the clusters (Cluster 24).
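A sketch of the banding scheme of Table 7 and of the per-band comparison of actual and predicted termination rates is given below; the premium changes and outcomes are randomly generated stand-ins.

```python
# Assign policies to the premium-change bands of Table 7 (5% wide, centred on
# -25%, -20%, ..., +45%, open-ended at the extremes) and compare actual and
# predicted termination rates within each band.
import numpy as np

def band_centre(pct_change: float) -> int:
    """Return the Table 7 band centre for a percentage premium change."""
    return int(np.clip(5 * round(pct_change / 5), -25, 50))

rng = np.random.default_rng(0)
change = rng.normal(10, 15, 3000)                          # % premium change (stand-in)
actual = rng.random(3000) < 0.10 + 0.004 * np.clip(change, 0, None)
predicted = rng.random(3000) < 0.10 + 0.003 * np.clip(change, 0, None)

centres = np.array([band_centre(c) for c in change])
for b in np.unique(centres):
    mask = centres == b
    print(f"band {int(b):+4d}%  actual {actual[mask].mean():6.1%}  "
          f"predicted {predicted[mask].mean():6.1%}")
```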
4.1.4 Generating More Homogeneous Models
In order to improve the prediction accuracy, the cluster was then split at the point when prediction accuracy starts to deteriorate. Two separate neural networks were trained for each sub-cluster.
The prediction accuracy improved significantly with two neural networks as can be seen from Figure 7. The average absolute deviation decreased from 10.3% to 2.4%. The neural network was then applied to the test set. The neural network performed reasonably well on the test set with an average absolute deviation of 4.3% (Figure 8).
Figure 6. Prediction accuracy of one neural network model for cluster 24 (training set).
Figure 7. Prediction accuracy of two networks model for cluster 24 (training set).
Figure 8. Prediction accuracy of two networks model for cluster 24 (test set).
4.1.5 Combining Small Clusters
Some of the smaller clusters had too few policy holders to train the neural networks. We grouped the small clusters that had fewer than 7,000 policies. Since the objective is to ultimately determine the optimal premiums, which reflect the risk of the policy holders, the criterion for grouping has to be similarity in risk. Risk in turn is measured by the amount of claims. Therefore the clusters were grouped according to similarity in claim cost, with the maximum difference in average claim cost per policy within a grouped cluster kept to no more than $50. Table 8 shows the grouping of the small clusters. For the combined clusters, prediction ability is also improved by having two neural networks instead of one for each cluster. Figures 9 and 10 show that the average absolute deviation of combined clusters 5, 23 and 26 decreased from 10.3% to 3.5%. The test set has an absolute deviation of 4.2% (see Figure 11).
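The grouping rule can be sketched as a greedy pass over the small clusters ordered by average claim cost; this is an assumed reading of the rule described above, not the authors' exact procedure, and the input data are a subset of the Table 8 figures.

```python
# Greedily combine small clusters (fewer than 7,000 policies) in order of
# average claim cost, keeping the claim-cost spread within a combined group
# at no more than $50 (an assumed implementation of Section 4.1.5).
def group_small_clusters(clusters, min_size=7000, max_spread=50):
    """clusters: list of (cluster_id, n_policies, avg_claim_cost)."""
    small = sorted((c for c in clusters if c[1] < min_size), key=lambda c: -c[2])
    groups, current = [], []
    for c in small:
        if current and (current[0][2] - c[2]) > max_spread:
            groups.append(current)
            current = []
        current.append(c)
    if current:
        groups.append(current)
    return groups

# A few of the small clusters from Table 8 (id, policies, average claim cost $).
small_clusters = [(13, 2726, 344), (14, 2714, 343), (28, 6441, 328),
                  (12, 1422, 285), (2, 3988, 280), (30, 2366, 278)]
for g in group_small_clusters(small_clusters):
    print([c[0] for c in g], "spread $", g[0][2] - g[-1][2])
```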
Table 8. Grouping of small clusters.

Cluster    No. of Policies    Average Claim Cost ($)    Difference within group ($)
13         2,726              344
14         2,714              343
28         6,441              328                       16
12         1,422              285
2          3,988              280
30         2,366              278                       7
5          5,606              270
26         1,460              262
23         3,374              256                       14
3          1,595              249
10         1,610              248
8          2,132              247
7          1,601              235
18         1,446              234                       15
1          4,101              231
15         1,621              217
16         1,505              194
6          2,346              183                       48
21         3,566              138
25         1,445              125
22         1,401              116
17         1,411              98                        40
Figure 9. Prediction accuracy of one network model for combined clusters 5, 23 and 26 (training set).
Figure 10. Prediction accuracy of two networks model for combined clusters 5, 23 and 26 (training set).
Figure 11. Prediction accuracy of two networks model for combined clusters 5, 23 and 26 (test set).
4.1.6 Results
These results are summarized in Table 9. Employing two neural networks per cluster rather than a single neural network significantly reduces the average absolute deviation between the actual and predicted termination rates. It appears that a single neural network is unable to simultaneously learn the characteristics of policy holders and their behaviors under different premium changes. This is perhaps due to the fact that many of the large premium increases
are due to an upgrade of vehicle. Since these policy holders may well expect an increase in premium when their vehicle is upgraded, they may have different sensitivities to premium change compared to the rest of the cluster. Attempting to isolate these policy holders and modeling their behaviors results in a better prediction ability.

Table 9. Summary of results of prediction of termination rates.

                    One network            Two networks            Two networks
                    (training set)         (training set)          (test set)
Cluster             Actual/     MAE        Split at   MAE          Actual   Predicted   MAE
                    Predicted
4                   11.2%       5.5%       20%        2.7%         9.9%     12.6%       2.9%
9                   8.8%        7.2%       20%        1.9%         8.6%     10.0%       4.5%
11                  14.7%       8.3%       10%        5.0%         17.1%    17.1%       7.4%
19                  7.5%        6.3%       15%        2.2%         8.1%     7.6%        3.6%
20                  6.8%        10.8%      15%        3.7%         5.0%     6.9%        3.4%
24                  9.5%        10.3%      10%        2.4%         8.2%     10.9%       4.3%
27                  11.6%       12.4%      15%        3.1%         10.8%    12.6%       4.2%
29                  9.5%        8.8%       15%        3.8%         8.5%     11.3%       5.9%
1, 6, 15, 16        10.9%       5.2%       20%        2.2%         11.1%    12.9%       4.0%
2, 12, 30           13.5%       9.2%       10%        2.8%         13.1%    15.4%       4.0%
3, 7, 8, 10, 18     11.7%       6.7%       20%        3.1%         12.3%    14.4%       4.4%
5, 23, 26           14.7%       10.3%      20%        3.5%         17.7%    17.3%       4.2%
13, 14, 28          15.1%       10.9%      10%        3.8%         17.9%    17.9%       4.8%
17, 21, 22, 25      8.7%        5.4%       15%        2.9%         8.4%     9.8%        3.4%
4.2 Price Sensitivity Analysis
Having trained neural networks for all the clusters, sensitivity analysis was then performed on the neural networks to determine the effect of premium changes on termination rates for each cluster. We performed the price sensitivity analysis using the systematic variation of variables approach, by varying the premium information and holding all other inputs constant. Separate data sets were created from each "half" cluster with all variables remaining unchanged except the new premium and related variables (change in premium, percentage change in premium and ratio of new premium to new sum insured). For example, cluster 24 was split into two "halves": policy holders with premium decreases or increases of less than 10%, and policy holders with premium increases of more than 10%. Data sets with varying per-
centage changes in premium were created and scored against the trained neural networks to determine the predicted termination rates.
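A sketch of this systematic-variation step follows: the records of a (half-)cluster are copied, the premium-derived inputs are overwritten with a hypothetical percentage change, and the modified records are scored with the trained network. The function assumes a fitted classifier `net` with a `predict_proba` method and illustrative column names; both are assumptions for illustration.

```python
# Score a cluster's policies under hypothetical premium changes, holding all
# other inputs constant, to trace termination rate against premium (Section 4.2).
import numpy as np
import pandas as pd

def scored_termination_rate(net, cluster_df: pd.DataFrame, pct_change: float,
                            threshold: float) -> float:
    df = cluster_df.copy()
    df["new_premium"] = df["old_premium"] * (1.0 + pct_change / 100.0)
    df["premium_change"] = df["new_premium"] - df["old_premium"]
    df["pct_premium_change"] = pct_change
    df["new_premium_to_sum_insured"] = df["new_premium"] / df["new_sum_insured"]
    # Assumes the column order of `df` matches the order used in training.
    p_terminate = net.predict_proba(df.values)[:, 1]
    return float((p_terminate > threshold).mean())

# Usage sketch (assumes `net`, `cluster24` and a calibrated `threshold` exist):
# for pct in [-8.3, -3.3, 1.7, 6.7, 11.7, 16.7, 21.7]:
#     print(pct, scored_termination_rate(net, cluster24, pct, threshold))
```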
4.2.1 Results
Table 10 and Figure 12 show the results of price sensitivity of cluster 24. From Figure 12, we can see that increasing premium leads to an increase in profit (premium minus claim costs) but it also results in a higher termination rate.

Table 10. Price sensitivity analysis - effect on termination rate (cluster 24).

Average Change    Average Premium    Termination    Scored Against
in Premium        Amount ($)         Rate
-8.3%             481                8.6%           Neural Network 1
-3.3%             507                8.3%           Neural Network 1
1.7%              533                9.1%           Neural Network 1
6.7%              559                12.2%          Neural Network 1
11.7%             585                12.1%          Neural Network 2
16.7%             612                13.2%          Neural Network 2
21.7%             638                14.6%          Neural Network 2
26.7%             664                16.3%          Neural Network 2
31.7%             690                18.2%          Neural Network 2
51.7%             795                26.9%          Neural Network 2
71.7%             900                35.1%          Neural Network 2
The company needs to find a balance between the two conflicting objectives of profitability and market share. This problem is similar to the portfolio optimization problem where capital is invested in securities in such a way that a desire to minimize risk is balanced with a desire to maximize the rate of return.
Figure 12. Price sensitivity analysis - effect on termination rate and profit (cluster 24).
5 Determining an Optimal Portfolio of Policy Holders
Markowitz, the father of modern portfolio theory, proposed the "expected returns-variance of returns" rule for portfolio diversification (Constantinides and Malliaris 1995, Markowitz 1952). The approach involves tracing out an efficient frontier, a continuous curve illustrating the tradeoff between return and risk. For a particular universe of assets, the portfolio of assets that offers the minimum risk for a given level of return forms the efficient frontier. This frontier can be found by quadratic programming. Although determining the optimal mix of premiums for policy holders in the insurance company is similar to portfolio optimization, it requires a different formulation. In portfolio theory, an asset has a given rate of return and risk. In the insurance optimization problem, the termination rate and the revenue earned for each cluster depend on the premium that is charged. Instead of using quadratic programming, we will use non-linear integer programming to
determine the premium to charge for each cluster to maximize total revenue for a given overall termination rate. The termination rates within individual clusters may vary to maximize the revenue but the overall termination rate for the portfolio will be constrained by a user-defined parameter. In our non-linear integer programming formulation, we have the decision variable x_ij defined as:

\[
x_{ij} =
\begin{cases}
1 & \text{if cluster } i \text{ has a termination rate of } a_j \\
0 & \text{otherwise}
\end{cases}
\]

where a_j has values of 9%, 11%, 15%, 20%, 25% and 30%, and x is of dimension N x r, where N is the number of clusters and r = 6 is the number of termination rates considered. The overall termination rate of the portfolio is to be fixed as a constant a_t. However, the clusters can have varying termination rates, as provided by the x_ij values, to produce the same overall termination rate. This is a combinatorial optimization problem. If we decide cluster i should have a termination rate a_j, then we can determine the premium to charge cluster i based on the neural network sensitivity analysis; p_ij is the premium to charge cluster i to achieve a termination rate a_j. The resulting expected revenue of cluster i based on a termination rate of a_j and a premium of p_ij will be (1 - a_j) P_i (p_ij - C_i), where P_i is the number of policies in cluster i and C_i is the average claim cost of cluster i. The average claim cost is calculated based on the earlier cluster analysis. Our mathematical formulation of the insurance portfolio optimization problem is thus as follows. Let

N = number of clusters
P_i = number of policies in cluster i
p_ij = premium of cluster i to achieve termination rate a_j
C_i = average claim cost of cluster i
a_j = set of r termination rates
a_t = desired termination rate for the portfolio
The objective function is to maximise expected revenue:

\[
\text{Maximise} \quad \sum_{i=1}^{N} \Bigl(1 - \sum_{j=1}^{r} a_j x_{ij}\Bigr)\, P_i \Bigl(\sum_{j=1}^{r} p_{ij} x_{ij} - C_i\Bigr) \tag{1}
\]
There are also some constraints affecting the system. They are:
\[
\sum_{j=1}^{r} x_{ij} = 1 \quad \forall i \tag{2}
\]

\[
x_{ij} \in \{0, 1\} \tag{3}
\]

\[
\sum_{i=1}^{N} P_i \Bigl(1 - \sum_{j=1}^{r} a_j x_{ij}\Bigr) \ge (1 - a_t) \sum_{i=1}^{N} P_i \tag{4}
\]
Equation 1 maximizes the total expected revenue associated with the portfolio. Equations 2 and 3 ensure that each cluster is assigned a termination rate and hence a calculated premium. Equation 4 ensures that the portfolio has a termination rate no greater than the desired termination rate a_t. The mathematical formulation was solved using Lingo (Schrage 1999). Since our model contains integer constraints, Lingo uses the branch-and-bound method, which is the most commonly used enumerative approach. The "branching" refers to the enumeration part of the solution technique and "bounding" refers to the fathoming of possible solutions by comparison to a known upper or lower bound on the solution value (Land and Doig 1960).
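Because constraints (2) and (3) force each cluster to take exactly one termination rate, the expected revenue can be evaluated cluster by cluster, and a small instance of the formulation can be solved by simple enumeration; the following sketch is a stand-in for the branch-and-bound solution obtained with Lingo, with invented data for three clusters.

```python
# Enumerate assignments of a termination rate (and its premium) to each cluster,
# keep those whose portfolio-wide termination rate does not exceed the target,
# and return the assignment with the largest expected revenue (equations 1-4).
from itertools import product

rates = [0.09, 0.11, 0.15, 0.20, 0.25, 0.30]
# Illustrative data: policies P_i, average claim cost C_i, and the premium p_ij
# (from the sensitivity analysis) needed to reach each termination rate.
P = [12000, 8000, 15000]
C = [300, 250, 420]
p = [[480, 520, 580, 650, 720, 800],
     [400, 430, 470, 520, 570, 620],
     [600, 650, 720, 800, 880, 960]]
target = 0.11

best = None
for choice in product(range(len(rates)), repeat=len(P)):
    term = sum(P[i] * rates[j] for i, j in enumerate(choice)) / sum(P)
    if term > target:                      # constraint (4): overall rate <= target
        continue
    revenue = sum((1 - rates[j]) * P[i] * (p[i][j] - C[i])
                  for i, j in enumerate(choice))
    if best is None or revenue > best[0]:
        best = (revenue, choice)

revenue, choice = best
print([f"{rates[j]:.0%}" for j in choice], f"expected revenue ${revenue:,.0f}")
```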
5.1 Results
In our case study we have used termination rates of 9%, 11%, 15%, 20%, 25% and 30% as inputs for the above mathematical formulation. Suppose the insurance company wants to maintain its current portfolio termination rate of 11%. Can adjusting the premiums of different clusters, and thereby changing the termination rate of different clusters while maintaining the same termination rate for the
portfolio improve the expected revenue? Solving the above formulation for a desired termination rate of 11%, we obtain values of x_ij as shown in Table 11 and an optimal portfolio as shown in Table 12. The expected revenue derived from the optimal portfolio was $39.11 million compared to $35.54 million for the current portfolio. This represents an increase of 10% in revenue, without affecting the market share.

Table 11. Values of x_ij for the optimal solution for an overall portfolio termination rate of 11%.

            Termination Rate
Cluster     9%    11%    15%    20%    25%    30%
1           1     0      0      0      0      0
2           1     0      0      0      0      0
3           0     0      1      0      0      0
4           1     0      0      0      0      0
5           1     0      0      0      0      0
6           1     0      0      0      0      0
7           0     0      0      0      1      0
8           0     1      0      0      0      0
9           0     1      0      0      0      0
10          0     1      0      0      0      0
11          1     0      0      0      0      0
12          1     0      0      0      0      0
13          0     1      0      0      0      0
14          1     0      0      0      0      0
The optimization problem was solved for varying values of the termination rate a_t: 15%, 20% and 25%. The results are shown in Figure 13. The curve is similar to the efficient frontier of portfolio optimization. It is a smooth non-decreasing curve that gives the best possible tradeoff between revenue and termination rate. The insurance company selects an acceptable termination rate and the model will determine a portfolio that maximizes expected revenue. We can also add other constraints to our model to better reflect practical insurance portfolio optimization. For instance, the insurance company may decide that under no circumstances should their
market share in any cluster fall below a certain level. Such a constraint could be encoded by stipulating the minimum number of policy holders in a cluster; for example, each cluster should have at least 6,000 policy holders. The following constraint could be added: P_i >= 6,000.

Table 12. Current portfolio versus optimal portfolio.

                      Current Portfolio             Optimal Portfolio
Risk Class            Premium    Termination Rate   Premium    Termination Rate
4                     424        11.2%              403        9.0%
9                     570        8.8%               598        9.0%
11                    764        14.7%              786        15.0%
19                    377        7.5%               390        9.0%
20                    463        6.8%               493        9.0%
24                    533        9.5%               533        9.0%
27                    445        11.6%              1,090      25.0%
29                    570        9.5%               605        11.0%
1, 6, 15, 16          510        10.9%              499        11.0%
2, 12, 30             685        13.5%              660        11.0%
3, 7, 8, 10, 18       508        11.7%              483        9.0%
5, 23, 26             561        14.7%              502        9.0%
13, 14, 28            688        15.1%              630        11.0%
17, 21, 22, 25        338        8.7%               345        9.0%
ALL                   532        11.1%              490        11.0%
Objective Function Value (Revenue)   $35.54m                   $39.11m
Figure 13. Optimal revenue for varying termination rates.
Such an approach may help when considering concepts such as life-time value of customers and finds analogy in portfolio theory
where cardinality constraints are included to impose limits on the proportion of the portfolio held in a given asset (Chang et al. 2000). The insurance company has two conflicting objectives of profitability and market share. How does the company decide on the tradeoff between the two objectives, or in other words, how does it pick a point on the efficient frontier? The answer to this question depends on the elected strategy of the company, whether this be to maximize profit regardless of the impact on market share, or to accept less profit for the sake of retaining a certain percentage of market share. Another common strategy would be to maximize customer life-time value (Berger and Nasr, 1998). Lifetime value is the net present value of profit that a company can earn from a typical customer over a specified number of years. This value depends on a number of factors such as retention rate, profit margin and referral rate. The chosen strategy will determine where the company prefers to lie on the efficient frontier, and the proposed mathematical programming approach is able to find the optimal mix of premiums across the portfolio in order to implement this strategy.
6 Conclusions
We have provided evidence of the benefits of an approach which combines data mining and mathematical programming in determining the premium to charge automobile insurance policy holders in order to arrive at an optimal portfolio. The approach has addressed the insurer's need to find a balance between profitability and market share when determining optimal premiums. Using our approach we have been able to increase revenue without affecting the predicted market share. This kind of quantitative evidence is essential before we can expect the insurance industry to embrace this new approach to premium pricing.
References

Anderberg, M. (1973), Cluster Analysis for Applications, Academic Press.
Berger, P. and Nasr, N. (1998), "Customer lifetime value: marketing models and applications," Journal of Interactive Marketing, 18(1), 17-30.
Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press.
Chang, L. and Fairley, W.B. (1979), "Pricing automobile insurance under multivariate classification of risks: additive versus multiplicative," Journal of Risk and Insurance, 46(1), 75-98.
Chang, T., Meade, N., Beasley, J., and Sharaiha, Y. (2000), "Heuristics for cardinality constrained portfolio optimisation," Computers & Operations Research, 27, 1271-1302.
Constantinides, G. and Malliaris, A. (1995), "Portfolio theory," in Jarrow, R., Maksimovic, V., and Ziemba, W. (eds.), Handbooks in Operations Research and Management Science, vol. 9, pp. 1-30, Elsevier Science.
Derrig, R.A. and Ostaszewski, K.M. (1995), "Fuzzy techniques of pattern recognition in risk and claim classification," The Journal of Risk and Insurance, 62(3), 447-482.
Dionne, G. and Vanasse, C. (1992), "Automobile insurance ratemaking in the presence of asymmetrical information," Journal of Applied Econometrics, 7(2), 149-165.
Everitt, B.S. (1993), Cluster Analysis, 3rd ed., London: E. Arnold.
Johnson, D.E. (1998), Applied Multivariate Methods for Data Analysts, Pacific Grove, CA: Duxbury Press.
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: an Introduction to Cluster Analysis, New York: Wiley.
Land, A.H. and Doig, A.G. (1960), "An automatic method for solving discrete programming problems," Econometrica, 28, 497-520.
MacQueen, J. (1967), "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.
Markowitz, H. (1952), "Portfolio selection," Journal of Finance, 7, 77-91.
Samson, D. (1986), "Designing an automobile insurance classification system," European Journal of Operational Research, 27, 235-241.
Samson, D. and Thomas, H. (1987), "Linear models as aids in insurance decision making: the estimation of automobile insurance claims," Journal of Business Research, 15, 247-256.
Sant, D.T. (1980), "Estimating expected losses in auto insurance," The Journal of Risk and Insurance, 47(1), 133-151.
Schrage, L. (1999), Optimization Modeling with Lingo, Lindo Systems Inc.
Smith, K.A., Willis, R.J., and Brooks, M. (2000), "An analysis of customer retention and insurance claim patterns using data mining: a case study," Journal of the Operational Research Society, 51(5), 532-541.
Tryfos, P. (1980), "On classification in automobile insurance," Journal of Risk and Insurance, 47(2), 331-337.
Vaughan, E.J. and Vaughan, T.M. (1996), Fundamentals of Risk and Insurance, 7th ed., New York: Wiley.
Williams, G.J. and Huang, Z. (1997), "Mining the knowledge mine: the hot spots methodology for mining large real world databases," Lecture Notes in Artificial Intelligence, 1342, 340-348.
Chapter 6

Fuzzy Logic Techniques in the Non-Life Insurance Industry

Raquel Caro Carretero
This chapter attempts to bring research on the insurance market and Fuzzy Logic together. It examines the current situation in the insurance industry and how fuzzy models may provide another alternative in the development of mathematical models. At the beginning of this century, non-life insurers reported higher growth rates than in the nineties. However, we have been seeing the start of a "hard" market cycle. As a result, underwriting losses are more pervasive and stress in the insurance industry has become more prevalent. Insurers are, therefore, under pressure to find ways to return value to their policyholders. This represents an opportunity for companies that can provide key solutions to help the insurer mitigate this market effect. After an introduction to the insurance industry in this chapter, an overview of non-life insurers' pricing decision making processes is given from the perspective of possible applications of fuzzy set-theoretic methods. The implementation of two "non-precise" fuzzy reasoning techniques is proposed. The Fuzzy Set Theory is developed to improve the classical bonus-malus rating system used in the automobile insurance industry. The second part proposes a fuzzy method of clustering. Classical clustering algorithms generate patterns such that each object is assigned to exactly one cluster. Often, however, objects can not appropriately be assigned to strictly one cluster, because they are, in some way, "between" clusters. Fuzzy clustering methods have
proved themselves to be a powerful tool for representing data structure in such situations. The chapter concludes with some observations about the future of the insurance industry.
1 Insurance Market
This section details the current state of the insurance market and explores its capability to embrace the modernization movement and e-business initiatives that are changing the insurance industry. The insurance industry is currently facing a period of unprecedented change. Legal regulatory changes, the need to globalize, and shifting economic conditions are just a few of the market challenges facing today's insurance leaders. Insurance companies are looking for ways to mitigate these market conditions in order to maintain and increase their market share. Exerting additional pressure is the entrance of non-traditional players such as banks and brokerages (financial intermediaries). In fact, banks are now one of the fastest-growing distribution systems for annuity sales and some banks have shown interest in life insurance and personal property-casualty lines and, to this end, have purchased insurance agencies. At the same time, insurers are experimenting with distributing insurance products through banks. The insurance market typically follows economic cycles characterized as either "hard" or "soft" market cycles, that is, high or low premium rates.¹ Underwriters offer a broad range of coverage and typically charge low rates, because the investment of the premium
¹ The phenomenon of the underwriting cycle has been analyzed extensively in the economic and financial literature. For details, see, e.g., Swiss Re (2001b). The cycle is best explained by a combination of two hypotheses. The first hypothesis of cycles is that the market is basically competitive and that participants make rational decisions. The second hypothesis, capital constraint, is that the market is susceptible to the influence of external events and inherent industry features.
revenue provides the greatest value. This typically leads to greater losses, since there are more policies in effect for greater risks. Eventually, market conditions change and the insurers' profitability decreases. This leads to a hardening of the market in which insurers look to increase prices and maintain profit goals despite the diminishing returns on their investments and their losses from claims. According to Swiss Re (2001a), worldwide, non-life insurance markets are currently in a period of declining profitability. The decline reflects the soft underwriting cycle, weakening investment performance, and the high level of capital funds. Falling interest rates and robust stock markets have increased insurers' capital funds. Insurance capacity reached historically high levels in 1999, which led to downward pressure on premium rates. As a result, underwriting results declined in all major markets. Nevertheless, non-life insurance markets are currently in a transition period between a severe soft market and a hard market. During the soft market, rising insurance capacity and fierce price and market share competition led to widely available coverage and low premium rates. Underwriting results declined in all major markets. Then in 2000, commercial lines and reinsurance premium rates started to rise, and have gained momentum in 2001. The need is now for improved underwriting results. As a result of the start of the hardening in the property and casualty market, according to METAgroup (2001), rates are predicted to reach bottom in a number of lines as underwriting losses become more pervasive and stress in the industry is more prevalent. Insurance industry analysts predict that rates will continue to decline. Concurrently, incidences of reserve charges are on the increase. All of this adds up to dramatic restrictions in coverage, fewer available limits, and significant increases in premiums/price. These conditions impact all insurance companies, including the reinsurance companies that provide support to primary insurers. After a prolonged soft market and with investment prospects diminished, premium rates and underwriting results need to improve significantly to achieve widespread profitability.
Although the insurance industry is expected to grow, insurers are not spending their information technology budgets in accordance with increasing Internet initiatives. Historically, technology spending has not been a priority for insurance carriers. Most companies use technology, but adoption has generally been slow and internally focused, with little standardization or integration of architecture. Cost often determines development, not usefulness or need. As a result, current insurance technology is generally archaic with limited functionality and information exchange between systems. Most insurance carriers utilize legacy systems, which require batch or other time-consuming processing. Data tend to be either too discrete or too aggregated to provide meaningful feedback. Additionally, integration can be limited when systems operations exist in silos. As a consequence, technology is gaining in importance within the insurance industry and is being reflected in increased annual technology-related budgets. Insurers are under pressure to find ways to return value to their policyholders and shareholders. This represents an opportunity for companies that can provide cost-cutting solutions to help insurers mitigate this market effect.
2 Fuzzy Logic in Insurance
Most actuaries rely on probabilistic models. This kind of approach assumes that the outcome of a random event can be observed and identified with precision. Any vagueness of observation is not considered significant to the construction of the theoretical model. For this reason, it is desirable to include models of a form of uncertainty or vagueness other than randomness. Fuzzy sets are, therefore, useful to describe uncertain statements where the uncertainty is due to the nature of the phenomenon, its perception by humans or when it arises from complexity. According to Ostaszewski (1993), there may be several reasons
for wanting to apply Fuzzy Set Theory in modeling uncertainty in insurance science. First and foremost is that vagueness is unavoidable. If the degree of vagueness is so small that one can ignore it in the model, disregarding it is justifiable. On the other hand, if the area studied is permanently "tainted" with imprecision or the limitations of human perception or natural language, vagueness can become a major factor in any attempt to model an event. In addition, when the phenomena observed become so complex that exact measurement would take more than a lifetime, or longer than economically feasible for the study, mathematical precision is often abandoned in favor of more workable and simple, "common sense" models. Real situations are very often not crisp,² and they can not be described precisely. Therefore, classical mathematical language is not enough to tackle all practical problems. In Fuzzy Logic "either-or" classification does not exist. Rather, the idea is that an item may be part of a class to a greater or lesser extent. The Fuzzy Set Theory is a theory of graded concepts, and Fuzzy or Multivalued Set Theory develops the basic insight that categories are not absolutely clear cut. This concept is not unique; it is also found in everyday speech. One can be young, middle-aged or old. Age could be given in years instead of using the terms "young" and "old," but such precise delineators will often be too rigid. There is much that only applies to "young" and much that only applies to "old" people, but there is little that applies only to twenty-five-year-olds and not to twenty-six-year-olds. Statements with vague terms are not only more flexible, but also more "durable." Most of the relevant distinctions we make in our life are of this kind. They can retain validity for a long time, even if their concrete significance varies with time. The Fuzzy Set Theory can help to improve oversimplified (crisp) models and provide more robust and flexible models for
² By crisp we mean dichotomous, that is, yes-or-no-type rather than more-or-less-type. Classical Logic stipulates that an object either is a member of a class or not. There is no room for gradual transitions. Most of our traditional tools for formal modeling and reasoning are crisp and precise.
real-world complex systems. The methodology developed in fuzzy sets can be successfully applied in several areas of actuarial science, based on classical probabilistic models, such as rating systems used to provide the public with a sound insurance system. This chapter describes how Fuzzy Logic can be used to make insurance pricing decisions.
2.1 Basic Concepts in the Fuzzy Decision-Making Processes
Fuzzy Sets were introduced by Zadeh (1965) with a view to reconciling mathematical modeling and human knowledge in the engineering sciences. Since its inception more than three decades ago, the theory has matured into a wide-ranging collection of concepts and techniques for dealing with complex phenomena that do not lend themselves to analysis by classical methods using probability theory and bivalent logic. The Fuzzy Set Theory was first developed for solving the inexact and vague problems in the field of artificial intelligence/expert systems, especially for imprecise reasoning and modeling linguistic terms. In solving decision-making problems, the pioneering work came from Bellman and Zadeh (1970) and Zimmermann (1987). In classical normative decision theory the components of the basic model of decision making under certainty are taken to be crisp sets or functions. Vagueness only enters the picture when considering decisions under risk or uncertainty, and then this uncertainty concerns the happening of a state or event and not the event itself. This uncertainty is generally modeled by means of probability, no matter whether the decision is to be made once only or frequently. In descriptive decision theory this precision is no longer assumed, but ambiguity and vagueness are very often modeled only verbally, which usually does not permit the use of powerful mathematical methods for the purposes of analysis and computation. The first publication in Fuzzy Set Theory by Zadeh (1965) shows the intention of the author to generalize the classical notion of a set and a
proposition to accommodate uncertainty in the non-stochastic sense. The basic traditional structure underlying any decision-making problem can be summarized as follows: the existence of limited resources generates the constraints of the problem. In other words, the values of the decision variables that satisfy the constraints define what is known as the feasible set. Once the feasible set is established, the objective function that suitably reflects the preferences of the decision maker is defined. Finally, by utilizing more or less sophisticated mathematical techniques, the "best" or "optimal" solution among all possible choices is obtained. In a fuzzy environment, the fuzzy objective function and the fuzzy constraints are defined by their corresponding membership functions. Decision-making problems can be addressed by mathematical programming approaches. Next, a function is developed in which values assigned to the elements of a set fall within a specified range. This function indicates the membership grade of the elements in the set in question. Larger values denote higher degrees of set membership. Such a function is called a membership function (μ) and the set it defines is a fuzzy set. In this way, the membership function indicates a subjective degree, within given tolerances, to which an element belongs to a set. The value of a membership function is 0 if the constraints (including the objective function) are "strongly violated," is 1 if the constraints are very "well satisfied," or can lie on a continuum from 0 to 1. We argue that the membership function is equivalent to the utility function when the membership function(s) are based on a preference concept such as the utility theory. We can classify linear programming models with imprecise information into two main classes: symmetric and nonsymmetric. The symmetric models are based on the definition of fuzzy decision proposed by Bellman and Zadeh (1970). They assumed that the objective(s) and constraints in an imprecise situation can be represented by fuzzy sets. A decision, then, may be stated as the confluence of the fuzzy objective(s) and the constraints, and may be de-
fined by a Max-Min operator. The nonsymmetric models are based on the determination of a crisp maximizing decision by aggregating the objective, after appropriate transformations, with the constraints. This interpretation is similar to the well known Lagrangian relaxation approach within the classic techniques of optimization. An application making use of these linear programming models will be given in the non-life insurance section below.
2.2 The Two Fuzzy Non-Life Insurance Models
This section covers the implementation of two "non-precise" fuzzy reasoning methods in a non-life insurance line: automobile insurance. These fuzzy models may provide alternative business models that may be very helpful for insurance carriers. Probability and statistics are basic tools used by an actuary to develop models that describe the nature of both claim frequency and claim severity. Fuzzy models may provide an alternative when creating mathematical models of phenomena that can not be adequately described solely as stochastic. The Fuzzy Set Theory is developed to improve the classical bonus-malus rating system in automobile insurance.
2.2.1 Bonus-Malus System
In most developed countries, rating systems are now in force that penalize the insured responsible for one or more accidents through premium surcharges or maluses, and that reward claim-free policyholders by awarding them discounts or bonuses. Their main purpose, besides encouraging policyholders to drive carefully, is to better assess individual risks, so that everyone will pay, in the long run, a premium corresponding to his/her own claim frequency. The adopted terminology for this system in many European and Asian countries is bonus-malus system.
2.2.2 The Data
The sample upon which this study is based includes a set of data on 48,666 policyholders from a Spanish insurance company. The database contains four types of information on each driver: the personal characteristics appearing on the driver's license itself as of the date of the sampling (age, time transpired since the license was issued); the same information for the three-year period from January 1, 1992, to December 31, 1994; the number of accidents per year; and the size or severity of each claim. This data is limited to automobile third party liability insurance. This chapter analyzes only the bonus-malus system data applied to "regular" passenger cars. Additional information came from interviews with experts, supervisory authorities and insurance companies. Classical cluster analysis has served to group policyholders into five clusters based on the characteristics they possess, with the claim frequency as the most significant variable.³ That is to say, as is done almost everywhere in the world, we work with a bonus-malus system exclusively based on the number of accidents reported to the company. It is supposed that every cluster is distributed by a Poisson model with claim frequency λ_j (j = 1, ..., 5).
2.2.3 The Non-Life Insurance Model Based on Zimmermann's Linear Approach: a Symmetric Model
If one wants to satisfy (optimize) the objective function as well as the constraints, a decision in a fuzzy environment is defined, by analogy to a non-fuzzy environment, as the selection of activities that simultaneously satisfy the objective function "and" the constraints. The logical "and" corresponds to the set-theoretic intersection, and the intersection of fuzzy sets is defined by the Min-operator. The decision in a fuzzy environment can, therefore, be viewed as the intersection of the fuzzy objective function O and the fuzzy constraints R_i. The relationship between constraints and
³ For the results of this multivariable statistical analysis, see Caro (1999).
objective function is fully symmetric; that is to say, there is no longer a difference between the former and the latter. If the decision maker wants to have a "crisp" decision proposal, it seems appropriate to choose the dividend with the highest possible degree of membership in the fuzzy set decision. Then a "maximizing decision" can be defined as

\[
\mu_D(x^*) = \max_{x \ge 0} \mu_D(x) \tag{1}
\]

where the fuzzy set "decision" is characterized by its membership function

\[
\mu_D(x) = \min\{\mu_O(x),\, \mu_{R_i}(x)\}, \qquad i = 1, \ldots, n \tag{2}
\]
If μ_D(x) has a unique maximum at x*, then the "maximizing decision" is a uniquely defined crisp decision that can be interpreted as the action that belongs to all fuzzy sets representing either the objective function or the constraints and which has the highest possible degree of membership. This becomes a fuzzy rating bonus-malus system based on the Max-Min operator following Zimmermann's linear approach. When the decision-making process is used in a fuzzy environment, Bellman and Zadeh (1970) proposed that the symmetry between goal and constraints is the most important feature. The symmetry eliminates the differences between them. In this context, the authors considered the classical model of a decision under certainty, and suggested a model for decision making in which the objective function as well as the constraint(s) are fuzzy. The fuzzy objective function and the constraints are defined by their corresponding membership functions μ_O(x) and μ_{R_i}(x), respectively. This model served as a point of departure for most of the authors in the area of fuzzy decision theory, such as Zimmermann (1987), and many have developed their own methodology. The first non-life rating system described here aims to satisfy an objective function by maximizing the income from premiums. This objective function is subject to the conditions that (1) the system is
fair, that is to say, every insured has to pay, at each renewal, a premium proportional to the estimate of his/her claim frequency, and (2) the condition that the system is financially balanced, that is to say, at each stage of this sequential process, the mean of the individual claim frequencies is equal to the overall mean. In addition, the objective function ought to satisfy the optimal conditions of three measures that reflect the severity or "toughness" of a bonus-malus system: the relative stationary average level, the coefficient of absolute variation of the insured's premiums, and the elasticity of the policyholder's payments with respect to their claim frequency.⁵ The financial balance coefficient is defined as
\[
\mathrm{FBC} = \frac{\sum_{j=1}^{5} n_j \alpha_j}{A} \tag{3}
\]
where n_j represents the number of policyholders in the risk group G_j, α_j is the coefficient applied to the premium for a new policyholder (P_0), so that the premium paid by a holder belonging to G_j is P_j = α_j P_0 (j = 1, ..., 5), and A is the total number of policyholders in the portfolio. The FBC is related to the income from premiums (I) given that

\[
I = \sum_{j=1}^{5} n_j \alpha_j P_0 \tag{4}
\]
⁴ It is supposed that the premiums are assigned correctly to every risk group, that is to say, the system is fair, after a non-fuzzy approach with the sample available. For details, see Caro (1999).
⁵ Lemaire (1995) uses similar tools in order to compare and rank thirty different bonus-malus systems from twenty-two countries. However, some variations in the definitions of this non-life rating system are introduced. For further details, see Caro (1999).

The three measures that are to be used in the model are defined as follows: the relative stationary average level is a measure of the degree of clustering of the policies in the lowest classes of a bonus-malus system; the coefficient of absolute variation of the insured's premiums is a measure of the severity of a bonus-malus system; and the elasticity of the policyholder's payments with respect to their claim frequency measures, as is well known, the financial response of the system to a change in the claim frequency. The relative stationary average level is defined as
RSAL =
(5) a5 - a , where, after a non-fuzzy approach with the sample available,6 is considered, that 0:5 - cci = 6.5
(6)
A low value of RSAL indicates a high clustering of policies in the high-discount bonus-malus classes; a high RSAL suggests a better spread of policies among the classes. The coefficient of absolute variation of the insured's premiums, if the system is financially balanced, is defined as

CAV = ( Σ_{j=1}^{5} n_j |α_j - FBC| ) / A    (7)

Given the spread of admissible values for the coefficients α_j, we consider that

|α_j - FBC| / 20 ≤ 1    (8)

To express CAV in linear terms, a binary variable Z_j is associated with every coefficient j so that

0.05 (α_j - FBC) ≤ Z_j ≤ 0.05 (α_j - FBC) + 1    (9)

With this definition, Z_j = 1 if α_j - FBC > 0, and Z_j = 0 if α_j - FBC ≤ 0.6

6 For further details, see Caro (1999).
Also, two non-negative variables, V_j and U_j, are defined for every coefficient j so that

α_j - FBC ≤ V_j ≤ α_j - FBC + (1 - Z_j)    (10)
FBC - α_j ≤ U_j ≤ FBC - α_j + Z_j    (11)

In this way, equation (7) can be expressed as

CAV = ( Σ_{j=1}^{5} n_j (V_j + U_j) ) / A    (12)
Ideally, an increment in the claim frequency should produce an equal relative change in the premium; the bonus-malus system is then called perfectly efficient, that is, perfect elasticity exists. If we suppose that a policyholder who belongs to G_j can only move to the adjacent classes G_{j-1} and G_{j+1}, it follows that

(P_k - P_j) / P_j = (λ_k - λ_j) / λ_j = c_jk    (13)

or

P_k - P_j = c_jk P_j  ⇒  P_k - (1 + c_jk) P_j = 0    (14)

where k ≠ j, k = 1, 2, ..., 5, and c_jk ≠ c_kj. The values λ_j and λ_k are estimated with the sample data available.7 As a general rule, however, there will not be perfect elasticity: the change in premium is less than the change in claim frequency. A variable is therefore defined as

E_jk = P_k - (1 + c_jk) P_j    (15)

which will sometimes be negative and other times positive. If E_jk < 0, when the policyholder moves from G_j to G_k he/she pays a premium lower than the one corresponding to him/her, and vice versa for E_jk > 0.

7 For details, see Caro (1999).
The insurer seeks the optimal income which, for financial reasons, ought to be "attractive." If we assume that the objective "attractive income from premiums" (or "good FBC") and the constraints related to RSAL, CAV and E_jk, which are imprecise and depend on the decision-makers, can be represented by fuzzy sets, then, under the symmetry assumption, we can make decisions that satisfy both the constraints "and" the goal. The objective function and the constraints are connected to each other by the operator "and," which corresponds to the intersection of fuzzy sets. In such a situation, if the decision-maker is interested not in a fuzzy set but in a crisp decision proposal, it seems appropriate to suggest to him/her the income that has the highest degree of membership in the fuzzy set "decision." Let us call this the "maximizing solution."

The fuzzy set of the objective function "good FBC" could, for instance, be defined by

μ(FBC) = 0              if FBC < 0.95
         20 FBC - 19    if 0.95 ≤ FBC ≤ 1    (16)
         1              if FBC > 1

and represented graphically as shown in Figure 1. In order to define this membership function (MF) in terms of linear constraints, which has been called MFFBC, three binary variables8 FBC1, FBC2 and FBC3 are defined so that FBC1 = 1 if FBC < 0.95 and 0 otherwise; FBC2 = 1 if 0.95 ≤ FBC ≤ 1 and 0 otherwise; and FBC3 = 1 if FBC > 1 and 0 otherwise. In this sense,

FBC1 + FBC2 + FBC3 = 1
FBC3 + 0.95 FBC2 ≤ FBC
FBC ≤ 0.95 FBC1 + FBC2 + 2 FBC3
FBC3 ≤ MFFBC ≤ 1
MFFBC ≤ 1 - FBC1
20 FBC - 19 - 21 + 21 FBC2 ≤ MFFBC
MFFBC ≤ 20 FBC - 19 + 21 - 21 FBC2    (17)

Figure 1. Membership function FBC.

8 As is usual practice in linear programming, binary variables are used when the function is defined piecewise.
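For readers who prefer code to a graph, the piecewise-linear membership in (16) can be written as a short function. This is only a sketch of the membership function itself, not of the linearized constraints (17); the same pattern applies to the RSAL, CAV and E_jk memberships defined next.

```python
def mu_fbc(fbc: float) -> float:
    """Membership of the fuzzy objective 'good FBC' from equation (16)."""
    if fbc < 0.95:
        return 0.0
    if fbc <= 1.0:
        return 20.0 * fbc - 19.0   # rises linearly from 0 at 0.95 to 1 at 1.0
    return 1.0

# A few sample points on the ramp shown in Figure 1.
for v in (0.90, 0.95, 0.975, 1.0, 1.05):
    print(v, round(mu_fbc(v), 3))
```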
The fuzzy set related to the constraint "good RSAL" could be represented, for instance, by

μ(RSAL) = 0                       if RSAL < 0.01
          (RSAL - 0.01) / 0.49    if 0.01 ≤ RSAL ≤ 0.5    (18)
          (0.8 - RSAL) / 0.3      if 0.5 ≤ RSAL ≤ 0.8
          0                       if RSAL > 0.8

and represented graphically as shown in Figure 2. In order to model this membership function in terms of linear constraints, called MFRSAL, four binary variables RSAL1, RSAL2, RSAL3 and RSAL4 are defined in a similar way to the above, so that

RSAL1 + RSAL2 + RSAL3 + RSAL4 = 1
0.01 RSAL2 + 0.5 RSAL3 + 0.8 RSAL4 ≤ RSAL
RSAL ≤ 0.01 RSAL1 + 0.5 RSAL2 + 0.8 RSAL3 + 6 RSAL4
MFRSAL ≤ 1 - RSAL1
MFRSAL ≤ 1 - RSAL4
2.04 RSAL - 0.0204 - 100 + 100 RSAL2 ≤ MFRSAL
MFRSAL ≤ 2.04 RSAL - 0.0204 + 100 - 100 RSAL2
2.66 - 3.33 RSAL - 100 + 100 RSAL3 ≤ MFRSAL
MFRSAL ≤ 2.66 - 3.33 RSAL + 100 - 100 RSAL3    (19)

Figure 2. Membership function RSAL.
Ideally, this measure takes the value 0.5, that is, 50% of the policyholders pay more than P_0 and the other 50% pay less than P_0. The fuzzy set related to the constraint "good CAV"9 is characterized, for instance, by

μ(CAV) = 0                      if CAV < 0.10
         (CAV - 0.10) / 0.10    if 0.10 ≤ CAV ≤ 0.20
         1                      if 0.20 ≤ CAV ≤ 0.45    (20)
         (0.50 - CAV) / 0.05    if 0.45 ≤ CAV ≤ 0.50
         0                      if CAV > 0.50

and is represented graphically as shown in Figure 3.

Figure 3. Membership function CAV.

9 As Lemaire (1995) indicates, in bonus-malus systems that are in force the coefficient of variation of the insured's premiums varies around 0.30.
In order to model this membership function in terms of linear constraints, called MFCAV, five binary variables CAV1, CAV2, CAV3, CAV4 and CAV5 are defined in a similar way to the two above, so that

CAV1 + CAV2 + CAV3 + CAV4 + CAV5 = 1
0.5 CAV5 + 0.1 CAV2 + 0.2 CAV3 + 0.45 CAV4 ≤ CAVAP
CAVAP ≤ 0.1 CAV1 + 0.2 CAV2 + 0.45 CAV3 + 0.5 CAV4 + 10 CAV5
CAV3 ≤ MFCAV ≤ 1
MFCAV ≤ 1 - CAV1
MFCAV ≤ 1 - CAV5
10 CAVAP - 1 - 50 + 50 CAV2 ≤ MFCAV
MFCAV ≤ 10 CAVAP - 1 + 50 - 50 CAV2
10 - 20 CAVAP - 50 + 50 CAV4 ≤ MFCAV
MFCAV ≤ 10 - 20 CAVAP + 50 - 50 CAV4    (21)

The fuzzy set related to the constraint "good E_jk" is characterized, for instance, by a trapezoidal membership function, with a graph similar to Figure 3, that is to say,

μ(E_jk) = 0                     if E_jk < -0.2
          (E_jk + 0.2) / 0.1    if -0.2 ≤ E_jk ≤ -0.1
          1                     if -0.1 ≤ E_jk ≤ 0.1    (22)
          (0.2 - E_jk) / 0.1    if 0.1 ≤ E_jk ≤ 0.2
          0                     if E_jk > 0.2

In order to model this membership function in terms of linear constraints, called MFE_jk, which serve to measure the elasticity of policyholders who belong, e.g., to G_1 and move to G_2, we define five binary variables E12_i (i = 1, ..., 5) in a similar way to the above measures, so that

E12_1 + E12_2 + E12_3 + E12_4 + E12_5 = 1
0.2 E12_5 - 0.2 E12_2 - 0.1 E12_3 + 0.1 E12_4 ≤ E12
E12 ≤ -0.2 E12_1 - 0.1 E12_2 + 0.1 E12_3 + 0.2 E12_4 + 5 E12_5
E12_3 ≤ MFE12 ≤ 1
MFE12 ≤ 1 - E12_1
MFE12 ≤ 1 - E12_5
10 E12 - 2 - 100 + 100 E12_2 ≤ MFE12
MFE12 ≤ 10 E12 - 2 + 100 - 100 E12_2
2 - 10 E12 - 100 + 100 E12_4 ≤ MFE12
MFE12 ≤ 2 - 10 E12 + 100 - 100 E12_4    (23)
The above formulae will be the same for MFE21, MFE23, MFE32, MFE34, MFE43, MFE45 and MFE54. The above equations are presented in this work to solve the linear program defined by Zimmermann while making use of Bellman and Zadeh's Max-Min operator.10 In this way, the optimal solution can be obtained by

μ_D(x*) = max { min { μ_O(FBC), μ_R1(RSAL), μ_R2(CAV), μ_R3(E_jk) } }    (24)

10 A more thorough treatment of this topic can be found in Lai and Hwang (1992).
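For readers who want to experiment, here is a minimal sketch of how a max-min program of the form (24) becomes a crisp linear program: an auxiliary satisfaction level λ is introduced and maximized subject to λ ≤ μ_i. The two linear memberships, the bounds and the use of scipy's linprog are illustrative assumptions; the chapter's actual model also carries the binary variables of (17)-(23) and therefore needs a mixed-integer solver.

```python
# Zimmermann-style max-min as a crisp LP on a toy problem (not the bonus-malus model).
import numpy as np
from scipy.optimize import linprog

# Decision vector z = [x, lam]; two hypothetical linear memberships of x:
#   mu_1(x) = 0.1 * x          (objective: larger x is better)
#   mu_2(x) = (10 - x) / 8     (constraint: smaller x is better)
c = np.array([0.0, -1.0])                      # minimize -lam  <=>  maximize lam
A_ub = np.array([[-0.1, 1.0],                  # lam - 0.1 x  <= 0
                 [1.0 / 8.0, 1.0]])            # lam + x / 8  <= 10 / 8
b_ub = np.array([0.0, 10.0 / 8.0])
bounds = [(0.0, 10.0), (0.0, 1.0)]             # 0 <= x <= 10, 0 <= lam <= 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
x_opt, lam_opt = res.x
print(f"x* = {x_opt:.3f}, overall satisfaction lambda = {lam_opt:.3f}")
```

The same construction applies to the full rating model once the MFFBC, MFRSAL, MFCAV and MFE constraints replace the two toy memberships.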
The results, using the sample data available, are presented in Table 1. With this fuzzy model we achieve a degree of overall satisfaction of 43% of the objective function as well as the constraints, and the income from premiums is fully covered.11

Table 1. Results of the bonus-malus rating system following Zimmermann's linear approach.

Objective function Min: 0.43

Coefficients applied to the premium P_0:
α_1 = 0.93,  α_2 = 1,  α_3 = 4.3,  α_4 = 4.99,  α_5 = 7.5
In proposing a linear programming model, it has been necessary to substitute, because of their non-linearity, some frequently used coefficients that measure the severity of a bonus-malus system for others that are linear. This does not represent a special limitation of the problem, since it increases the possibilities for defining the membership functions: it is easier for human beings to graduate something that varies in a linear form, and linear measures are more intuitive.

2.2.4
The Non-Life Insurance Model Based on Verdegay's Approach: a Nonsymmetric Model
In this section, a fuzzy rating bonus-malus system is proposed based on the α-level cut of the membership function concept, following Verdegay's approach.12

11 For the details of programming, see Caro (1999).
12 For details, see Verdegay (1984) and Lai and Hwang (1992).
This model aims to satisfy an objective function by maximizing the income from premiums. This objective function can be subjected to the same conditions as the first system that was proposed. The insurer is trying to obtain the optimal income (maximizing I), which, for financial reasons, ought to be "attractive." We assume that the objective "attractive income from premiums" and the constraints, which are imprecise and depend on the decision-makers, can be represented by fuzzy sets. Note that for each α we have an optimal solution, so the solution, together with its degree of membership, is actually fuzzy. Since this approach is based on the use of the α-cut, the objective function will be subject to the following additional restrictions:

a_1 ≤ RSAL ≤ b_1,   a_1 = 0.49 w_1 + 0.01,   b_1 = 0.8 - 0.3 w_1    (25)

where [a_1, b_1] is the α-cut of the membership function of the RSAL coefficient, and w_1 is the parameter of the α-cut (written w so as not to mistake it for the coefficients applied to the premium P_0).

a_2 ≤ CAV ≤ b_2,   a_2 = 0.1 w_2 + 0.1,   b_2 = 0.5 - 0.05 w_2    (26)

where [a_2, b_2] is the α-cut of the membership function of the CAV coefficient and w_2 is the parameter of the α-cut.

a_3j ≤ E[λ_j] ≤ b_3j,   a_3j = 0.3 w_3j + 0.4,   b_3j = 1    (27)

where [a_3j, b_3j] is the α-cut of the membership function of the elasticity corresponding to class j, E[λ_j], and w_3j is the parameter of the α-cut.
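A small sketch of the α-cut bounds in (25)-(27): for a chosen cut parameter w, each fuzzy restriction is replaced by a crisp interval that can then be added to the linear program. The function names are arbitrary.

```python
# Crisp interval constraints implied by the alpha-cut parameters of (25)-(27).
def rsal_bounds(w):        # eq. (25)
    return 0.49 * w + 0.01, 0.8 - 0.3 * w

def cav_bounds(w):         # eq. (26)
    return 0.1 * w + 0.1, 0.5 - 0.05 * w

def elasticity_bounds(w):  # eq. (27)
    return 0.3 * w + 0.4, 1.0

for w in (0.2, 0.3, 0.4):
    print(f"alpha-cut {w}: RSAL in {rsal_bounds(w)}, CAV in {cav_bounds(w)}, "
          f"E[lambda_j] in {elasticity_bounds(w)}")
```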
This elasticity concept is a steady-state notion, applicable only when stationarity has been reached, and it is defined as

E[λ_j] = (λ_j / P̄(λ_j)) · dP̄(λ_j)/dλ_j,   with P̄(λ_j) = Σ_k π(λ_j → λ_k) P_k    (28)

or

E[λ_j] = (λ_j / P̄(λ_j)) Σ_{k≠j} π(λ_j → λ_k) (P_k - P_j) / (λ_k - λ_j)    (29)

where π(λ_j → λ_k) is the limit value for the probability that a policyholder with claim frequency λ_j is in class G_k.13 To solve the linear program defined by these additional equations and the equations of the first rating system presented, the method proposed by Verdegay is utilized in an approximate way. The results of simulation with different parameters of the α-cut of the objective function and of the restrictions related to RSAL, CAV and E[λ_j] are presented in Table 2.14 With this model we attempt to guarantee satisfaction of the objective function and the restrictions with a minimum degree of 100α%.

13 For further details of the estimation of this limit value, see Caro (1999).
14 For the details of programming, see Caro (1999).

2.2.5
Closing Remarks about the Two Models
Actuaries and other insurance decision-makers often consider conditions that are supplementary to statistical experience data. These conditions may be technical conditions that reflect company philosophy. The Theory of Fuzzy Sets provides a way to explicitly and
consistently deal with qualitative constraints. This chapter describes how to develop a Fuzzy Logic model with which actuaries and decision-makers can adjust insurance rates by considering constraints or information that are supplementary to statistical experience data. The main advantage of using fuzzy logic is that one can begin with qualitative conditions and systematically create a mathematical model that reflects those conditions. An actuary can thus consistently act in accordance with those rules through the use of fuzzy sets.

Table 2. Results of the bonus-malus rating system following Verdegay's approach.

                                     0.2-cut     0.3-cut     0.4-cut
Objective function (maximizing I):   46,866.59   47,020.36   47,449.3
Coefficients applied to the premium P_0:
  α_1                                0.945       0.951       0.953
  α_2                                1           1           1
  α_3                                1.14        1.18        1.4
  α_4                                1.81        1.7         1.7
  α_5                                1.9         1.9         1.8

3
Some Extensions
Beyond using the above approaches, some extensions to fuzzy linear programming can be introduced. These extensions could be applied to multiple objective linear programming. Most decision makers do not make their decisions based on a single criterion but rather on several criteria. Almost all decision problems have multiple, usually conflicting, criteria. Criteria are the attributes, objectives and goals to be considered relevant for a given decision making problem. Therefore, multiple criteria decision making or multiple criteria decision analysis comprise a general theoretical framework to deal with decisional problems involving several attributes,
objectives or goals. Rigorous descriptions of these tools can be found in Ballesteros and Romero (1998). Two fuzzy linear programming models based on linear membership functions and a Max-Min operator have been introduced. Other membership functions and operators are also possible. For simplicity, it is assumed that these membership functions and this operator are consistent with the decision maker's judgement and rationality in the decision process. When the memberships of fuzzy linear programming are based on a preference concept such as Utility Theory, the membership function is assumed to be equivalent to the utility function. It cannot be forgotten that the shapes of the membership functions must depend on the meaning of the particular problem. In addition, in some practical situations, operators other than the Max-Min might be required to preserve some basic properties such as the transitivity of preference relations. Two techniques used to solve linear programming problems in a fuzzy environment have been discussed. These techniques can also be applied to some non-linear programming problems. Like a fuzzy linear program, a fuzzy non-linear program should be transformed into an equivalent crisp non-linear programming problem, which can then be solved by the conventional solution techniques of non-linear programming instead of linear programming.
4
Classification
Classification is one of the most controversial topics in today's insurance industry. When an automobile insurance market is divided into groups, with one group having significantly higher risk, uniform pricing by the insurer results in antiselection. In life insurance, distinguishing between men and women may be viewed in the same way as gender discrimination in the workplace. One can view classification based on experience as a form of classification of risk that is based solely on data, not on a preconceived perception of risk. Previously, classical statistical methods have been applied to the two fuzzy rate-making models. Generally speaking, the task is to partition a given set of data or objects into categorically homogeneous subsets, or clusters. The objects belonging to the same cluster should be as similar as possible and the objects in different clusters should be as dissimilar as possible. In the case of risk classification, the purpose is to distinguish between risks that are significantly different, but statistical significance may not be sufficient. A fuzzy classification alternative is suggested here, and the basic method of clustering is presented. For two-dimensional data sets, the intuitive, and usually commonly agreed upon, judgement of the data is inspection by eye; this often sets the standard for the quality of a cluster partition. For higher-dimensional data, where humans cannot recognize an unambiguous cluster partition, one is content with characteristic numerical measures that evaluate the quality of a partition in a strictly mathematical manner. In image recognition, there is usually no doubt about the correct partition. Unfortunately, there is no objective mathematical measure for a partition made that way, such that a program could evaluate its own partitions according to a human point of view. Therefore, we speak of the intuitive partition when we refer to the partition as it would be made by humans. The technique introduced in this section deals exclusively with the partition of data into full or solid clusters. Cluster analysis deals with the discovery of structures or groupings within data. Since it is rare that all disturbances can be completely eliminated, some inherent data uncertainty cannot be avoided. That is why fuzzy cluster analysis dispenses with an unambiguous mapping of the data to clusters and, instead, computes degrees of membership that specify to what extent data belong to clusters. For the initialization of more complicated algorithms, the following basic algorithm is often needed.
4.1
The Fuzzy c-Means Algorithm
A fuzzy partition of the data set into clusters, following Ostaszewski (1993), is described by membership functions of the elements of the clusters. An algorithm for generating fuzzy partitions is the fuzzy c-means algorithm. The fuzzy c-means algorithm recognizes spherical clouds of points in a p-dimensional space; the clusters are assumed to be of approximately the same size. Each cluster is represented by its center. This representation of a cluster is also called a prototype, since it is often regarded as a representative of all data assigned to the cluster. The letter c in the name of the algorithm stands for the number of clusters and is meant to clarify that the algorithm is intended for a fixed number of clusters, i.e., it does not determine that number. An initial partition is determined by our experience; then, in each cluster, we identify its center, which captures the essential features of the cluster. Finally, we calculate the weighted sum of the distances, given by the vector norm, of the elements of the clusters from the corresponding centers, and we modify the initial partition in order to minimize that weighted sum. A more thorough treatment of this topic can be found in Hoppner et al. (1999).

In this section, a possible way of determining the policyholder premium utilizing the fuzzy c-means algorithm is proposed. The idea is as follows (a brief numerical sketch is given below):

1. Perform a fuzzy classification into c clusters.
2. Determine the grade of membership in every cluster for every policyholder, with p_ij the grade of membership of policyholder i in cluster j.
3. Make an initial fuzzy (F) estimate of the number of policyholders in every cluster j:

   n_jF = Σ_{i=1}^{A} p_ij,   j = 1, 2, ..., c    (30)

   but frequently Σ_{j=1}^{c} n_jF ≠ A (the total number of policyholders in the portfolio); for this reason we re-estimate the number of policyholders as

   n'_jF = A · n_jF / Σ_{j=1}^{c} n_jF    (31)

   and in this way Σ_{j=1}^{c} n'_jF = A.
4. Calculate the coefficients α_j applied to the premium of every class j with any fuzzy non-life model proposed in this chapter.
5. Determine the premium of policyholder i as

   P_i = ( Σ_{j=1}^{c} p_ij P_j ) / ( Σ_{j=1}^{c} p_ij )    (32)

where the premium P_j could be calculated with any fuzzy rating system previously presented.

A disadvantage of the described algorithm is that the number of clusters has to be known in advance. In many applications, this knowledge is not available. In that case, the results of analyses with different numbers of clusters would have to be compared with each other in order to find an optimal partition.
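The brief numerical sketch announced above: given a membership matrix (step 1 is assumed to have been produced elsewhere, for example by a fuzzy c-means run) and hypothetical class premiums, it applies (30)-(32). All figures are invented for illustration.

```python
# Fuzzy class sizes (30)-(31) and membership-weighted individual premiums (32).
memberships = [
    [0.8, 0.2, 0.0],   # policyholder 1: grades of membership in clusters 1..3
    [0.1, 0.7, 0.2],   # policyholder 2
    [0.0, 0.3, 0.7],   # policyholder 3
    [0.5, 0.4, 0.1],   # policyholder 4
]
class_premiums = [400.0, 650.0, 1200.0]   # hypothetical P_j from some rating system

A = len(memberships)                                    # portfolio size
c = len(class_premiums)
n_F = [sum(row[j] for row in memberships) for j in range(c)]   # eq. (30)
scale = A / sum(n_F)
n_F_adj = [scale * nj for nj in n_F]                    # eq. (31): now sums to A

individual_premiums = [
    sum(p * P for p, P in zip(row, class_premiums)) / sum(row)  # eq. (32)
    for row in memberships
]
print("fuzzy class sizes:", [round(x, 2) for x in n_F_adj])
print("premiums:", [round(p, 2) for p in individual_premiums])
```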
The Future of Insurance

This section targets an e-business possibility that could impact the insurance clients' operations using Fuzzy Logic.
The biggest impact of the Internet and e-business technology will likely not be on direct sales. The strongest expected impact of the Internet will be the savings from migrating processes from existing paper-based systems to the more efficient Internet. However, many insurers are developing e-business capabilities without having sound e-business strategies in place. Insurers would like to combine and merge client databases that share clients with other market products in order to produce an insurance market product. However, files containing client information will most likely be stored differently from database to database, albeit with common fields (variables) such as name, address, age, city, and postal code. The information may be similar between files, but it is rarely exactly the same, and this would prevent most merge routines from matching any two records. Even switching the names around in one of the files will not ensure an exact match, owing to the way the names and addresses are stored. This is where a fuzzy match routine can help. This routine is by no means an exact science, nor is it without its own problems with mismatched records. However, it can minimize the amount of labor needed to determine which clients appear in both files or are possibly duplicated in the same file. Every variable is represented by a fuzzy set. The purpose of the fuzzy application is to find records that may be duplicated or matched in another file and to minimize manual effort. Another possible application could be the following: an insurance company creates a list of clients that are potentially ideal candidates for a special marketing campaign. There is no common key in this list that one can use to directly link these entries with the company's client file. The list has the name, address, city, state and postal code for each entry. The utility of the fuzzy application is its ability to match records from separate files that do not have any common key. These files may have common fields, but they cannot be matched exactly by the contents of these fields due to the way the data is keyed or maintained. Assume both files (master and transaction) are
matched and have been normalized. This process of standardizing is the key to making the fuzzy match process successful. For details, see Caro (2002). One of the key conclusions is that Fuzzy Set Theory provides a promising way of treating the uncertainty that is inherent in many e-business applications and would be a useful addition to the modeling tools used here. Using Fuzzy Logic, insurance companies looking to maximize their e-business initiatives could participate in the insurance Internet revolution and quickly enter Internet space. Remember, however, that Fuzzy Logic Theory is not an exact science and is prone to "fuzzy" results.
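As an illustration of the kind of fuzzy match routine discussed here, the sketch below scores the similarity of two client records field by field and aggregates the scores into a single matching degree. The fields, weights, threshold and use of Python's difflib are assumptions made for the example; this is not the routine described in Caro (2002).

```python
# Fuzzy matching of two client records that share fields but no common key.
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    a, b = a.strip().lower(), b.strip().lower()        # crude standardization step
    return SequenceMatcher(None, a, b).ratio()         # similarity in [0, 1]

def match_score(rec1: dict, rec2: dict, weights: dict) -> float:
    total = sum(weights.values())
    return sum(w * field_similarity(rec1[f], rec2[f]) for f, w in weights.items()) / total

master = {"name": "Maria C. Lopez", "address": "12 High St", "postal": "28001"}
transaction = {"name": "Lopez, Maria", "address": "12 High Street", "postal": "28001"}
weights = {"name": 0.5, "address": 0.3, "postal": 0.2}

score = match_score(master, transaction, weights)
print(f"match membership = {score:.2f}",
      "-> candidate duplicate" if score > 0.7 else "")
```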
Acknowledgments

I would like to thank Angel Sarabia Viejo for his great support of, and comments on, this chapter.
References

Ballesteros, E. and Romero, C. (1998), Multiple Criteria Decision Making and Its Applications to Economic Problems, Kluwer Academic Publishers, Dordrecht.
Bellman, R.E. and Zadeh, L.A. (1970), "Decision-making in a fuzzy environment," Management Science, vol. 17, no. 4, pp. 141-164.
Caro, R. (1999), "Modelo multivariable en la tarificación del seguro del automóvil," Tesis Doctoral, ICADE, UPCO, Madrid.
Caro, R. (2002), "Managing e-insurance using Fuzzy Logic," Proceedings of the SCI, vol. II, p. 342, Orlando, ISBN: 980-078150-1.
Hoppner, F., Klawonn, F., Kruse, R., and Runkler, T. (1999), Fuzzy Cluster Analysis, John Wiley and Sons, Ltd., England.
Lai, Y.J. and Hwang, C.L. (1992), "Interactive fuzzy linear programming," Fuzzy Sets and Systems, vol. 45, pp. 169-183.
Lemaire, J. (1995), Bonus-Malus Systems in Automobile Insurance, Kluwer Academic Publishers, Boston.
METAgroup (2001), "Insurance information strategies: where ubiquitous business becomes the norm. Future states for the insurance industry," working paper.
Ostaszewski, K.M. (1993), An Investigation into Possible Applications of Fuzzy Set Methods in Actuarial Science, Society of Actuaries, Schaumburg, IL.
Swiss Re (2001a), "Rentabilidad del sector asegurador no-vida: el retorno a los valores tradicionales," Sigma 5/2001, Swiss Reinsurance Company, Zurich.
Swiss Re (2001b), "El seguro mundial en el año 2000: otro año de auge para el seguro de vida; retorno al crecimiento normal en el seguro no-vida," Sigma 6/2001, Swiss Reinsurance Company, Zurich.
Verdegay, J.L. (1984), "A dual approach to solve the fuzzy linear programming problem," Fuzzy Sets and Systems, vol. 14, pp. 131-141.
Zadeh, L.A. (1965), "Fuzzy sets," Information and Control, vol. 8, pp. 338-353.
Zimmermann, H.J. (1987), Fuzzy Sets, Decision Making and Expert Systems, Kluwer Academic Publishers, Boston.
Life and Health
Chapter 7

Population Risk Management: Reducing Costs and Managing Risk in Health Insurance

Ian Duncan and Arthur Robb
Recently, health plans have initiated risk-management programs consisting of either disease management or case management. Many programs of this nature are based on the assumption that current patient risk equals future patient risk. This chapter discusses the use of more robust prediction methods that predict risk even on a current low-cost population. Statistical and artificial-intelligence prediction methods are discussed here. Prediction methods are considered in the context of a comprehensive, operational risk management project.
1
Background
Most health insurers have traditionally managed risk by a combination of pricing, underwriting, and reinsurance, together with strict claims management. Managed care initially managed risk by network contracting (steering members to preferred providers who offered either lower rates or lower utilization of services) and utilization management (pre-authorization or concurrent review of hospital admissions). Because of the denials that resulted from pre-authorization, managed care plans began to find other ways to manage high-risk members directly: case management and disease management programs that aim to manage high-cost insureds.
Unlike pre-authorization, which is largely implemented through a series of rules and administered at the claims level, management programs (either case management or disease management) are resource-intensive. Because staffing requires individuals with clinical skills and experience, for the most part these programs are costly to implement and manage. Therefore, successful programs that demonstrate a return on investment require that only the right, highest-risk members be selected.
1.1
What Is a High-Risk Member?
Traditional risk-management methodologies do not distinguish between "high cost" and "high risk" members. "High risk" members, in our definition, are the future high-cost members, who can be either low or high cost currently. With other insurance coverages (fire, workers' compensation, and so on), the identification of potential risk factors is an essential component of programs of risk management or risk reduction. The same techniques have been slower to penetrate health insurance, although recently the resurgence of high trend rates has resulted in interest on the part of many risk-taking plan sponsors. Health plans have different ideas of what constitutes a high-risk member: for example, members that have risk-markers for certain diseases, such as cancer or heart disease because of a family history. While such an identification method may be possible for an entire population, there are several problems with this approach:
• There is no consistent or successful method currently (of which we are aware) for obtaining the necessary data on members of a health plan. The information has to be obtained from the member by questionnaire. Acquiring data in this way is costly, and the response rate to this type of questionnaire is not high.
• The data may be low quality. Even with the best of intentions, members may not respond truthfully. Moreover, the usefulness of a prediction tool is limited by the generality of the underlying
data used to construct the tool. Survey data is notoriously prone to response bias.
• While a member may be at risk of adverse health outcomes, and report risk-factors in a questionnaire, it is difficult to predict when that outcome will occur. For many high-risk members, the adverse event may not take place for many years, if at all.
These observations lead us to propose a definition of a high-risk member for the purposes of this chapter:
• The member has a significant probability of experiencing costs higher than the average for the group to which he/she is assigned.
• The predicted costs will occur in the near future, such as the next twelve months;
• The member's costs are more likely to be concentrated or episodic than regular.
Some of these concepts are illustrated in Table 1.

Table 1. Distribution of enrollees by expense category.
Baseline Year Cost Category | Percent of Baseline Enrollment | Mean Per Capita Cost, Baseline Year | Percent of Total Baseline Year Cost | Mean Per Capita Cost, Subsequent Year | Percent of Total Cost, Subsequent Year
LOW (<$2,000) | 87% | $324 | 23% | $1,191 | 58%
MEDIUM ($2,000-$24,999) | 12% | $5,658 | 56% | $5,385 | 35%
HIGH ($25,000+) | 1% | $49,032 | 21% | $15,800 | 7%
Average | 100% | $1,451 | 100% | $1,840 | 100%
This table tracks the status of members continuously enrolled in a health plan for two years. For this discussion, members have been assigned to Medical Expense Categories in the Baseline year. The specific limits of each category are somewhat arbitrary, but they may be broadly thought of as "Healthy" (claims under $2,000 per
year), "Chronic" (claims between $2,000 and $25,000 per year) and "Catastrophic" (claims over $25,000 annually). The Chronic and Catastrophic groups together constitute about 13% of the total population in the baseline year, but account for 77% of the total cost, an observation often referred to as the "80/20 rule." There are several points of significance to note from this analysis:
• The high cost groups, in total, while they account for 77% of cost in the first year, account for only 42% of cost in the projection year.
• The high-cost population in the projection year comes from both the high cost population and the healthy population in the baseline year.
• Although not shown in this table, 20% of the Projection Year's High cost category comes from the same period's Low cost (or healthy) population and 40% from the Medium cost, or Chronic, population.
• Conversely, only about 40% of the Projection Period's High cost category comes from the Baseline period's High cost category.
What this analysis tells us is that the choice of high cost members as the cohort on which to focus medical management resources is misplaced; conversely, ignoring the Low cost cohort overlooks a significant portion of the medical management opportunity. The traditional focus has financial consequences for the health plan. Let's assume that the plan decided to focus its medical management resources on the high-cost group as identified in the baseline period. Tracking these members going forward, we would see an apparent significant reduction in cost (on an individual basis, from $49,000 to $15,800, and on a group basis from 21% of total cost to 7% of total cost). It would appear that all is well; medical management has reduced the overall cost of the plan; trend is under control. Unfortunately, this is not true, as the Average cost line in
the table above shows: overall the plan's trend was 27%. In summary, focus on future, rather than historic high-cost members is essential for successfully managing financial risk. In addition, this control needs to be exercised in a cost-efficient way, a subject to which we will return below in Section 3.
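A short sketch that recomputes the averages and the overall trend implied by Table 1 (the per-capita costs and the rounded enrollment mix are taken from the table):

```python
# Weighted averages and trend implied by the Table 1 figures.
mix = {                      # category: (share of enrollment, baseline $, subsequent $)
    "LOW":    (0.87,   324,  1191),
    "MEDIUM": (0.12,  5658,  5385),
    "HIGH":   (0.01, 49032, 15800),
}
baseline_avg = sum(p * b for p, b, _ in mix.values())     # roughly $1,451 per member
subsequent_avg = sum(p * s for p, _, s in mix.values())   # roughly $1,840 per member
print(f"baseline ${baseline_avg:,.0f}, subsequent ${subsequent_avg:,.0f}, "
      f"trend {subsequent_avg / baseline_avg - 1:.0%}")   # about 27% overall trend
```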
1.2
Elements of Population Risk Management

"Population Risk Management" consists of three components:
• Identification ("Targeting") of future high-risk populations, subgroups or individuals;
• Application of interventions, behavior change and other risk-management techniques to the high-risk members in order to reduce their resource utilization and the cost of the group or individual; and
• Application of pricing and underwriting techniques that change behavior by conveying financial signals to users of health services, and plan sponsors.1
The remainder of this chapter addresses the science of Population Risk Management, or the application of scientific techniques to identify, manage and price health insurance risks. Although this is an introductory survey, we include some detailed discussions of practical implementation issues, giving the user the ability to design and implement a population risk management program. In this chapter, we cover targeting of high-risk populations, including a discussion of data sources and methods, commonly-used

1 Many readers have expressed reservations with this statement, reading into it an implication that we support using Population Risk Management to exclude certain patients from coverage. This is not our intention; nor do we believe that exclusion from coverage based on predicted cost is legal, in most jurisdictions. We believe that Population Risk Management techniques, however, allow plan sponsors to assign prices appropriately to different risks. How plan sponsors choose to subsidize different risks is a matter for the individual plan sponsor, again, operating within the relevant regulatory environment.
techniques such as clinical groupers and statistical methods (decision trees, regressions, and neural networks). Later in the chapter, we discuss practical applications of these methods to high-risk population management. For those who wish to apply the methods, we include an optimization model for selecting candidates and the appropriate level of intervention for their management. Specifically, we cover the following issues in subsequent sections:
2. Identification (targeting) of high-risk populations
   2.1 Data: available sources.
   2.2 Implementation issues and how they affect prediction methodologies.
   2.3 Prediction methodologies
      2.3.1 Clinical methods: groupers and clinical identification methods.
      2.3.2 Statistical methods: applications of data mining and predictive techniques.
         - Introduction
         - Regressions
         - Decision Trees
         - Artificial Neural Networks
3. Application of interventions and other risk management techniques
   3.1 Targeting the right members.
   3.2 Effectiveness and outcomes.
   3.3 Results, including a methodology for estimating Return on Investment from intervention programs and determining an appropriate level of intervention for the targeted population.
4. Summary and conclusions
2
Identification (Targeting) of High-Risk Populations
This section discusses risk prediction methodologies. From the start, it should be noted that there is no single "best" methodology - different methodologies are suited to different purposes. Too often, the methodology is chosen first, as a consequence of available software, because it is the methodology that either the analyst or the analyst's manager is most familiar with, or simply because the methodology sounds impressive in a meeting or a sales presentation. The choice of a methodology is a business decision that should be based on available data, budget, staffing, computing resources, and goals. It is therefore necessary to discuss these factors as a part of a detailed discussion of prediction methodologies.
2.1
Data: Available Sources
Data is a central issue - without data, there is no identification, targeting or prediction. Unfortunately, there is no ideal source of data. Each source has its benefits and drawbacks, which must be weighed against each other. This is probably the most pessimistic section of the chapter, because we must discuss data deficiencies. However, predictive modeling, as any other data-driven endeavor, is subject to the "garbage in garbage out" principle. Therefore, this discussion is crucial. Five types of data commonly available to the health care analyst are as follows.

2.1.1
Medical Charts
Medical charts may be available if the source of the data is a physician group or a staff model HMO. Typically, however, charts are unavailable. When available, charts are the richest source of patient information. However, they have serious drawbacks. They are not
comprehensive because they do not cover out-of-network or out-of-group services. While they may record the physician's prescribing behavior, they will not record the patient's compliance (such as prescription filling). Transcribing the data and transferring it to a uniform format is time-consuming and requires highly trained staff. Moreover, there is a lack of uniformity in entries between physicians - different physicians may choose to omit or include different details and may have very different interpretations of conditions and their severities. The crucial factor is, of course, the lack of availability. For this reason, charts will not be discussed further.

2.1.2
Survey Data
Survey data is perhaps the most tantalizing type of data available. Unlike chart data, it is frequently available. Moreover, it can be much richer than other data, such as claims. For example, claims data cannot tell how often an asthmatic vacuums his/her carpet. However, survey data has serious drawbacks. First and foremost, it generally does not exist prior to initiation of a risk management program. Other sources, such as claims and laboratory values exist for other purposes, and the cost of data transfer and formatting is incremental. Surveys, on the other hand, must be commissioned, budgeted, and executed in order to generate the data. Moreover, in contrast to claims, survey datasets are not updated or refreshed as medical events occur - new surveys must be commissioned periodically as data becomes stale. Surveys are also limited by response bias - the sub-population that responds to any particular survey is far from a representative cross-section. It is dangerous to draw conclusions from survey responses. Finally, respondents may also, intentionally or unintentionally, submit untruthful answers. Professionally-written surveys are required to detect such responses - unfortunately surveys are often written by amateurs - staff whose professional training is in another field. Despite the drawbacks, surveys can be a valuable addition to transaction-based data.
2.1.3
Medical Claims
Medical claims have high ratings for availability. They are generally available in a health plan environment, except when capitation agreements are in place. Medical claims data is also continually refreshed as events occur, another major advantage. Some drawbacks are in the depth and accuracy of the medical information they contain. Additionally, there are accounting issues to be reckoned with. The quality of medical claims data varies greatly between health plans and there are many additional possibilities for error in the data-gathering stage. Therefore, diagnostic reports should be run to determine data quality before beginning any predictive analysis. Key data elements contained in medical claims include the diagnosis, the procedure, the location and type of the service, date of service, date of adjudication, amount charged and amount paid. Diagnosis codes are nearly always present on medical claims data. Sometimes there may be multiple diagnosis codes, although usually only the primary and secondary codes are populated with any regularity or reliability. Diagnosis codes nearly always follow the uniform ICD (International Classification of Diseases - Version 9) formats. Unfortunately, they can contain errors. First, claims are usually not coded by the diagnosing physician, or they may be coded vaguely - physicians may have a strong conjecture on a precise diagnosis but will often code a diagnosis in the absence of confirming tests. This is often the case for rare diagnoses as well as for emergency treatments where precision takes a backseat to swift treatment. Diagnosis codes may be selected to drive maximum reimbursement. Finally, coding may lack uniformity - different physicians may follow different coding practices. However, for all of their drawbacks, diagnosis codes are invaluable due to their availability, uniformity of format, and usefulness. Procedure codes are often present on claims data. However, unlike diagnosis codes, they come in various formats, some of which are specific to particular health plans. Like diagnosis codes,
they are subject to data problems. Nevertheless, they are useful when available. Service location and type are generally present on medical claims and can be used to determine basic types of service— emergency, hospital inpatient, and so on. Formats are sometimes standard and sometimes Plan-specific. Data quality is generally good enough to allow use in modeling. Cost of service is both highly necessary and highly problematic. It is a necessary factor because our definition of risk is financial. It is problematic due to a lack of uniformity. First, there is some question as to whether to use billed or reimbursed cost. Billed cost is more subject to non-uniformity and can be highly inflated. Reimbursed cost is altered by capitation agreements and deductibles, among other factors. Both costs are subject to variation in physician coding practices. The dates of service and date of adjudication are essential to medical claims. The importance of date of service should be apparent. The date of adjudication is essential for verification of completeness of data. There is a lag between dates of service and adjudication that can be several months or more. If data is extracted too soon after service, there will be missing claims that have not yet been adjudicated. Actuaries are accustomed to evaluating the completeness of paid claims data, using different methods such as the chain-ladder or the inventory method (using a report of claims costs cross-tabbed between months of service and adjudication). There are accounting issues, too, with claims - time-based reports change as data is refreshed and missing claims are added. 2.1.4
Pharmacy Claims
Of all data sources, pharmacy claims have the best overall quality. They are adjudicated quickly, if not immediately, resulting in a higher completion factor than medical claims. The essential fields are nearly always populated. Unfortunately, pharmacy claims do not contain enough information on their own for robust prediction.
There are some exceptions, mainly in the area of interactions. Despite the best efforts of Pharmacy Benefit Managers (PBM's), conflicting prescriptions are sometimes filled, and conflicting drugs are highly predictive of adverse outcomes. Moreover, there are enrollees with prescriptions for multiple drug classes, sometimes four or more. Such enrollees are at risk statistically regardless of any other factors. 2.1.5
Laboratory Values
At present, it can be difficult to obtain laboratory values and they require large efforts to render them useful to targeting. The authors are beginning to see regular submissions of lab values from clients, although even in Managed Care plans, fewer than 50% of tests are administered within contracted Lab vendors, due to out-of-network leakage. Few vendors code the test consistently using a standard protocol (such as the LOINC code). However, the potential for their use is very high and the availability of lab values is always worth investigating. 2.1.6
Conclusions on Data
The authors favor the integrated use of medical and pharmacy claims. Although they are not as rich as survey data, claims are always available and of sufficient quality to drive risk management programs - the same cannot be said for any other data source.
2.2
Implementation Issues and How They Affect Prediction Methodologies
Besides data, there are other practical issues that affect choice of prediction methodologies including budget, staffing, computing resources and goals. The following issues should be examined before deciding upon a methodology.
2.2.1
Goals
The project goals should be well-defined:
• Is the prediction designed to drive underwriting or an intervention program? Underwriting requires general-purpose risk stratification (here, risk equals cost) while intervention programs may have more specialized goals and require a specialized analysis.
• Is the prediction designed for experimentation or production? By an experimental design, we mean a one-time analysis, say for a research article or as a prototype for production. By production, we mean an analysis that will be repeated many times on various datasets. For example, periodic identification for the purpose of underwriting or intervention is a production function.
In practice, experimental and production projects often require different resources and methodologies. Without careful consideration, the line between experimental and production projects often becomes blurred, to the detriment of the prediction project.

2.2.2
Budgets
Budgets drive everything in the business world, including statistical methodologies. Statistical packages alone can be costly, with top-of-the-line enterprise solutions costing one hundred thousand dollars or more. However, lower-cost solutions also exist - the point is that it is necessary to check goals, staffing, and resources before proceeding too far.

2.2.3
Staffing
Staffing depends on budget; it also depends on availability of resources. Availability of skilled resources can drive the choice of methodology.

2.2.4
Computing Resources
Computing resources should be inventoried - some methodologies,
most notably neural nets, are computation-intensive. Moreover, in any production application, there should be sufficient storage space and a well-organized warehouse from the beginning.

2.2.5
Data Warehousing
A data warehouse adds three invaluable elements to the risk management process. First, it automates data collection, saving valuable staff time. Second, it stores data in a standard format, reducing the user learning curve and allowing users to streamline their analysis programs. Finally, it allows the definition of derived variables that are of considerable value in building and testing prediction models. Examples of derived variables (in order of increasing complexity) could include (a small sketch of one such variable appears below):
• Medicare-eligible member identification;
• A diabetes disease indicator;
• A member's medication possession ratio, or the percentage of possible prescriptions for a particular drug that the member filled;
• A "gap" indicator, indicating that the member's treatment protocol is not compliant with best practices.
A data warehouse may not be necessary for experimental analyses and can sometimes be omitted. In such cases, it may be preferable not to build one, for the simple reason that the warehouse design will form part of the experimental stage. However, data warehousing is almost always necessary in production analyses - failure to standardize data storage and processing practices can doom a program.
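As a sketch of one derived variable from the list above, the following computes a medication possession ratio from a simplified claim layout; the fill-date/days-supply representation is a hypothetical simplification of real pharmacy claims.

```python
# Medication possession ratio: share of days in a window covered by filled scripts.
from datetime import date

def possession_ratio(fills, window_start: date, window_end: date) -> float:
    """fills: list of (fill_date, days_supply) for one member and one drug class."""
    window_days = (window_end - window_start).days + 1
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            d = date.fromordinal(fill_date.toordinal() + offset)
            if window_start <= d <= window_end:
                covered.add(d)                 # count each covered day only once
    return len(covered) / window_days

fills = [(date(2002, 1, 10), 30), (date(2002, 2, 15), 30), (date(2002, 4, 1), 30)]
print(round(possession_ratio(fills, date(2002, 1, 1), date(2002, 6, 30)), 2))
```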
2.3
Prediction Methodologies
This section separates clinical and statistical methods of risk prediction. This is appropriate because clinical and statistical predictions represent different ends of the methodology spectrum. Nevertheless, all methods discussed here are quantitative and are there-
fore in some way statistical. Any prediction methodology, clinical groupers included, must employ concrete computer-implementable algorithms backed by sound statistical verification. Surprisingly, there are still organizations that use manual risk prediction by clinicians. Although this chapter focuses on computer implementations, there is enough manual risk prediction in the industry to warrant a brief discussion here. Clinicians predict risk by using a combination of charts, claims and pre-authorization data. Human recognition has many advantages. However, it can lack consistency, and is not scalable. The definition of risk presented in this chapter has, intentionally, financial and time components. Clinicians have a different view of risk, focusing on clinical outcomes and a longer time horizon. If used, human recognition is usually limited to the very highest-potential populations. There is, however, one use of manual risk prediction in the artificial intelligence context. It will be seen later in this chapter that manual risk prediction can be used as a means of training artificial neural nets. In a sense, this represents a translation of previouslygathered clinical knowledge into a form that is usable by a risk management program. The shortcomings of manual risk prediction highlight some of the attributes that need be present in any computer algorithm underlying a risk management program. Risk prediction algorithms must be: • Scalable - able to assess any number of enrollees. • Cost-effective and implementable subject to real-world budget and staffing constraints. • Employ consistent, rational, easily-communicated processes that can stand up to scrutiny by any one of many concerned parties within an organization. For any prediction program, it should be a "given" that at some point, non-technical managers will ask "why is Enrollee 123 high risk?" These managers will need a clear non-technical answer.
• Quantitative: the algorithm should stratify a population into risk levels, and assign each risk level an actual quantitative measure - predicted cost, probability of an event in the next 12 months, and so on. Manual algorithms are, of course, unable to generate such stratifications - however, many computerized algorithms also share this shortcoming. For example, there are binary risk groupers that designate a patient as "risky" or "not risky" but fail to distinguish whether Enrollee 123 is more or less risky than Enrollee 456. As we shall see in section 3, practical program implementation often requires the ability to stratify a population because simple identification of individuals with common characteristics (such as a particular disease) result in too many potential individuals for the limited resources of the program manager. • Stratification is essential to any program, whether for underwriting or clinical intervention. Intervention programs not only need to know which patients should receive the intervention but which patient should be first to receive the intervention - stratification is the key to prioritization. • Finally, any intervention program should have a Return on Investment (ROI) component. We will see later that stratification is a key part of building a business case for risk management, because the potential for risky events, and response to an intervention program are not the same for all individuals. 2.3.1
Clinical Methods
By clinical methods, we mean methods that are heavily weighted towards diagnosis codes as the main source of input. Such methods frequently group medical claims by diagnostic categories into clusters and by date into episodes. Based on these groupings, enrollees may be assigned risk scores. For example, a patient with diagnosis X and Co-morbidity Y might be considered more risky than a patient with diagnosis X only. Additional clinical assessment may be performed, by comparing
276
/. Duncan and A. Robb
the patient's apparent treatments as reflected in claims, with clinical best practices for that diagnosis. There are several commercially available groupers and they do, in fact, detect risk. Their algorithms, when known, can be quite complex. Some algorithms are not known - they are proprietary "black boxes". For a detailed discussion of groupers, see Cummings et al. (2002). Groupers are the most robust methods of exploiting the great range of diagnosis codes. They have a practical advantage of being understood, or at least trusted, by physicians. This is no small advantage. Implementing risk prediction programs often means selling the programs to a diverse audience within an organization. Certainly, the physicians comprise one of the parties. When associated with an implementation program, groupers have a reputation of providing actionable results. For example, intervening nurses might be given pre-summarized data based on a patient's diagnostic clusters. While they can be effective, commercial groupers do have disadvantages. One is simply the "black box" algorithms - great care must be taken in their purchase to ensure that the product fits purchaser needs and expectations. Another disadvantage is that they are "one size fits all" products. In this case, commercial groupers do not incorporate demographic or plan design differences and therefore must be recalibrated upon implementation. Groupers tend to operate on the premise that risk is absolute. However, it is not: risk is influenced by many factors exogenous to the diagnostic tables - geography is a prime example. It is possible to develop custom groupers. However, their construction is a major undertaking and requires the combined efforts of staff with wide areas of expertise. There are some general drawbacks to the implementation of groupers: one is simply that there are subject to the coding errors and imprecision that are prevalent in diagnostic codes. This is a non-trivial issue, especially because groupers tend to focus upon more complicated diagnoses that rely on two decimals of precision
Population Risk Management: Reducing Costs and Managing Risk
277
in the ICD table. Another drawback is the need for recalibration, which may be technical or financial. Groupers generally measure some sort of clinical risk, which may not be appropriate for underwriting, or planning a cost-effective intervention program. Financial risk measurement requires calibration to a particular plan's financial structure. In this case, one size does not fit all. 2.3.2
Statistical Methods
Introduction By statistical methods, we mean the statistical end of the risk prediction spectrum. However, all statistical methods have clinically valid independent variables and algorithms that should withstand clinical scrutiny. Statistical methods rely less on exploitation of the ICD table and more on pure statistical methods, often employing statistical algorithms common to other industries, such as the credit screening or direct marketing industries. Implementation of Statistical methods may be quicker, and can be developed by statisticians and biostatisticians, rather than physicians. However, staff should be conversant in both modeling and the content area. Using a modeling tool is like operating a racing car - drivers need know what is under the hood. More information about statistical methods may be found in a number of sources; see, for example, Berson et al. (1999), and Weisberg(1985). Definition The statistical prediction problem can be stated as follows. Suppose a given quantity x (an independent variable) will be used to predict another quantity y (a dependent variable). For example, patient age (= x) might be used to predict cost (=y). Historical data would be collected, consisting of age/cost pairs -xi,yr, X2, }>2', X3,ys, ... - in some population. This data would be used to construct a function, J{x), that is used to approximate y. The function
278
/. Duncan and A. Robb
f(x) is the prediction rule, implemented as the basis of population risk management. Alternatively, f(x) might not actually be an approximation but simply an index with some statistical relationship with y. For exa m p l e , / ^ might be a risk index that correlates to mean future cost, rather than being a specific approximation of future cost. It should be observed that/fa need not be precisely accurate on the individual level in order to be useful. In fact many cost predictors are inaccurate on the individual level, not due to any flaws but simply due to the volatile and stochastic nature of individual costs. The key is that these tools should at least be accurate on the aggregate level. Generally, construction of/fa entails minimization of an aggregate error, some quantity derived from the individual errors,/fa; -yi;f(x2) -y2;f(x3) -yx •••• In real world situations multiple independent variables are used to predict dependent variables. Part of the prediction problem then becomes selection of the key relevant variables that provide an optimal statistical solution. Medical and Pharmacy claims data are so rich, and contain so many elements that selecting appropriate independent variables can be a significant exercise. Data Considerations As in any statistical modeling context, the importance of preparing and studying the data prior to modeling cannot be understated. The actual modeling is one of the last steps in the process and usually only takes place after a careful preparation and study of relevant variables, their quality and their interrelationships. The first question about data is simply this - "how much is enough?" The glib answer is simply "more than you think you will need." The point is that conventional statistics and/or biostatistics seldom prepares analysts in terms of real world sample size. Clinical studies often involve a few hundred patients or fewer. Public opinion polls that divine our political future boast of a few hundred to a thousand data points. In the world of health care predictions, such sample sizes would be woefully inadequate.
Public opinion polls are a good example: although they lie outside health care modeling, they are far closer to the content of most statistics courses. A poll with a binary outcome that uses a typically sized sample of 500 respondents might, depending on the response, have a precision of about ±4 percentage points at a 95% confidence level. In other words, there is a probability of one in twenty that the "true" result differs from the published result by more than 4 points. This might be acceptable for a poll, where there is little penalty for error. A similar error could have serious consequences in a financial program. Moreover, it is usually of interest to generate statistics on sub-populations, e.g., what percent of women aged 65 and over would vote for candidate X? The margin of error on subsets is even greater and, in this example, would result in statistics of dubious value.
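The polling arithmetic quoted above can be checked with a short sketch. It assumes the usual binomial approximation, with 1.96 as the multiplier for a 95% confidence level; the sample sizes are the illustrative ones from the paragraph.

```python
# Rough check of the polling margin of error (binomial approximation).
import math

def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(0.5, 500) * 100, 1))   # roughly +/- 4 points
print(round(margin_of_error(0.5, 50) * 100, 1))    # a small sub-group: far wider
```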
The problem is that in many areas data acquisition is expensive. This is true for polls and other forms of research, such as clinical studies, another significant area of experience for health care analysts. In these familiar areas, a thousand data points is a rich study. As discussed above (Section 2.1.3), the relative abundance of health claims allows sample sizes several orders of magnitude greater, literally up to millions of data points. These possible sample sizes should be exploited at every available opportunity. There are two other reasons for large datasets. First, some health plans are interested in targeting catastrophic conditions. There are differences of opinion in the industry as to whether or not this can actually be achieved; however, the very low frequency of catastrophic events makes a large dataset a prerequisite. The second reason is that prediction methodologies require at least two datasets, one for creating the model and one for validating it. Validating a model on the dataset with which it was created leads to tautological models and over-fitting. Bootstrapping methodologies exist if there is not enough data to allow separate sets; however, a split sample is preferable. Once the data has been collected, what variables should be used? A well-constructed statistical model makes optimum use of data - efficient modeling can be based on relatively few variables. Some simple variables, such as the number of prescriptions within a time window, regardless of the type of prescription, can be effective. Variables that detect co-morbid conditions are particularly effective, sometimes more so than simple disease markers. Typical variables of this type might be the number of distinct drug classes detected or the number of distinct diagnosis types, as broken out by ICD chapters. It is also usually necessary to transform variables - to determine an optimal dependent variable and subsequently calibrate independent variables to generate optimal statistical relationships prior to the statistical modeling. The most basic example here is cost variables, which generally need either log transforms and truncation or conversion into flags or categorical variables. This step is non-trivial and can make or break a prediction project.
Regressions
The regression is perhaps the fundamental statistical prediction algorithm. Regressions are used in nearly every quantitative discipline, frequently with great effectiveness. Regressions exist in many types. The most basic is the linear least-squares regression. Using the notation of the previous section, a linear least-squares regression has two key features. First, the function f(x) is linear, that is, it has the form f(x) = ax + b. Second, the equation minimizes an aggregate error equal to the sum of the squares of the individual errors. With reasonable assumptions regarding the linear independence of the historic data x1, y1; x2, y2; x3, y3; ..., the function f(x) is unique. Moreover, the parameters of the linear function f(x) can be calculated from the historic data using a closed formula. This property is a major computational advantage over other types of models. For example, artificial neural nets use feedback loops instead of closed formulas and do not have unique optimal solutions. The formulae for linear least-squares regressions (and other types of regressions) have been documented for a very long time
(many years prior to computers) and are in the public domain. Hence, regression computation engines are plentiful and inexpensive. Regressions are widely taught in universities and hence staffing is available. Linear least-squares regressions generalize easily to the case of multiple independent variables. In practice, with creative transformations of the independent variables, they can be used to model other, non-linear curves as well. A special case is the logistic regression, which is often used to model binary dependent variables. Here, if z is the mean response for a predictor x, the function ln(z/(1 - z)) can be modeled as a linear function of x. Traditionally, regressions are evaluated using a set of diagnostic statistics known as an ANOVA (analysis of variance). The most well known of these statistics is R², which, in one dimension, equals the square of the correlation coefficient between the independent and dependent variables. R² by construction takes values between 0 and 1. Statisticians term it the percent of variation explained by the independent variables, and modelers generally attempt to maximize it. R² is useful as an evaluation tool and it is a key piece of some useful algorithms for deciding which variables to include in a multiple regression. For example, the stepwise algorithm includes variables one at a time, adding the one with the greatest incremental R². R² is, however, sometimes used inappropriately. With volatile stochastic data, R² is often fairly low, understating the true effectiveness of the model. The final outcome of predictive modeling is, in fact, a statistically stable stratification. A much better means of evaluation is a goodness-of-fit test on the stratification. In summary, regressions are highly recommended by the authors. They are simple, well understood, and effective, and the results are easily communicable. Because they employ a simple closed formula, they are easy to document and implement. An audit trail showing the development of a prediction model is invaluable and is not as easily generated by other methodologies, such as neural nets.
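The closed-form fit and the R² statistic described above can be sketched in a few lines. The data below are hypothetical placeholders; the computation itself (normal-equation least squares followed by the proportion of variation explained) is standard.

```python
# Minimal sketch of the closed-form least-squares fit and the R^2 statistic,
# on hypothetical data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # hypothetical outcome

X = np.column_stack([x, np.ones_like(x)])   # design matrix for f(x) = a*x + b
a, b = np.linalg.lstsq(X, y, rcond=None)[0] # closed-form (normal-equation) solution

y_hat = a * x + b
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot           # proportion of variation explained
print(a, b, r_squared)
```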
much more difficult to implement. In a large-scale program, these considerations should not be taken lightly.
Decision Trees
Introduction
Decision trees are quite simply a means of stratifying a population into subgroups in iterative steps. Figure 1 illustrates a simple example. In a hypothetical population, patients are subdivided into increasingly small groups according to some risk criterion. At the first step, the population is split by gender, male vs. female. Next, each sub-population is split by age. Following this, each sub-population is split by disease flag (in this case, the presence of a Cardio-Vascular (CV) condition, or Diabetes).
[Figure 1 shows the resulting tree: Population, split into Male and Female; each split into Age 65+ and Age 64-; each of those split into CV condition / No CV condition and then Diabetes / No Diabetes.]
Figure 1. Sample decision tree.
Decision trees are extremely useful in terms of their clarity and ease of implementation. Decision trees also answer the question: which variables should be included and which are redundant? In a sense, trees prioritize variables by predictive power. Decision trees have another advantage - they recognize the fact that there are many types of risk and hence there might be separate risky sub-populations. For example, a risk model might be
weighted towards older patients and ignore men under 40 with Warfarin prescriptions. Trees can solve these problems. Although trees can provide a foundation for "intuitive" prediction algorithms, they can also become over-complex and lead to long lists of disjointed rules for identifying risky patients. Care must be exercised in the creation of trees and the implementation of tree-based modeling. A number of algorithms exist by which computers can generate decision trees. A disadvantage of these algorithms is that they are often proprietary, and implementation is costly. They can also, if not implemented well, be costly in terms of computing time. However, these considerations can be balanced by the fact that they are frequently easy to use. As mentioned in previous sections, we are often interested in developing a risk index. When using a tree, an index is developed in a separate step, following the creation of the tree. Two specific tree algorithms are discussed below.
Chi-Squared Automatic Interaction Detection (CHAID)
CHAID is an algorithm for generating a tree in the case of categorical dependent and independent variables. In the world of computer modeling, it is fairly old, dating to the 1970s. Given a population, a dependent variable, and a set of independent variables, a CHAID algorithm subjects each independent variable to a Chi-Squared statistical significance test with the dependent variable. The independent variable with the highest Chi-Squared value is used to split the population, creating a "node". For example, in Figure 1, the following variables would be independent: gender; under/over 65; CV/no CV; Diabetes/No Diabetes. Patient gender has the highest overall Chi-Squared value. The Chi-Squared test is then applied to each sub-population using the remaining independent variables. In Figure 1, the Chi-Squared test chose under/over 65 as the next most powerful variable for each of the males and females. This process is continually applied until all possible Chi-Squared values fall below some predefined threshold, at which point the process stops.
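The splitting step just described - test each candidate variable against the outcome and split on the strongest one - can be sketched as follows. The variable names, the simulated data, and the use of scipy's chi-squared test are all assumptions for illustration, not the chapter's implementation.

```python
# A CHAID-style first split, sketched with scipy: compute a chi-squared
# statistic for each candidate (binary) variable against a binary outcome
# and split on the strongest one. Data and variable names are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=1000)                 # high-cost yes/no
candidates = {
    "gender": rng.integers(0, 2, size=1000),
    "age_65_plus": (outcome + rng.integers(0, 2, size=1000) > 1).astype(int),
    "cv_condition": rng.integers(0, 2, size=1000),
}

def chi_squared(var):
    table = np.zeros((2, 2))
    for v, o in zip(var, outcome):
        table[v, o] += 1
    return chi2_contingency(table)[0]

scores = {name: chi_squared(v) for name, v in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])   # variable chosen for the first node
```

The same test would then be repeated within each resulting sub-population, using the remaining variables, until no statistic exceeds the chosen threshold.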
This process yields a minimal tree, i.e., one that needs no "pruning". As we will see, not all trees share this desirable attribute. CHAID can be highly effective. It is available in statistics packages, or may be implemented manually. However, clever implementations are often necessary - poor implementations run forever. Usually, it is easier to simply buy the package.
Classification and Regression Trees (CART)
CART is a robust tree mechanism that, unlike CHAID, can incorporate continuous as well as categorical variables. At a node, CART algorithms "split" continuous variables, that is, they set a threshold that converts the continuous variable into a binary categorical variable. The threshold is selected that maximizes a "diversity" function, essentially a function that chooses a "best possible" division of the remaining population into unlike groups. A continuous variable can be split multiple times as appropriate. In addition, CART algorithms do not produce an optimal tree and then stop. CART algorithms actually produce an over-complex tree that is then "pruned" into the final output. Like CHAID, CART can be highly effective. It is available in statistical packages. However, CART algorithms are more difficult than CHAID to implement.
Artificial Neural Networks
Introduction
The topic of artificial neural networks is too broad for a section of a chapter. For simplicity, we will focus on one of the most common types of artificial neural networks, one that is very useful in the context of health care risk prediction. This type is the "feed-forward back-propagation" network. First, we briefly revisit regressions. Recall that they approximate the dependent variable with a linear function. They can also be used to approximate the dependent variable with a function that has
a simple linear parameterization, such as a logistic function. The left-hand illustration in Figure 2 depicts a scatter plot that is appropriate for regression.
[Figure 2 contains two scatter plots: the left-hand panel, labelled "appropriate for regression", shows data with a fitted curve y = f(x); the right-hand panel is labelled "inappropriate for regression".]
Figure 2. Sample scatter plots. Some datasets, such as the right-hand depiction in Figure 2, are inappropriate for linear modeling. A creative reader might be able to find a linearly parameterized function to fit to this pattern of data. However, appropriate approximation functions are not so easily visualized with multi-dimensional datasets. This is unfortunate because many curve-fitting techniques, including linear parameterizations, require an a priori notion of the approximation function. What then? One possible solution is an artificial neural network. Artificial neural networks (usually referred to simply as neural nets) are all-purpose non-linear models that essentially approximate the dependent variable with compositions and linear combinations of step functions and truncated linear functions, generically known as transfer functions (actually, continuous approximations of step functions and truncated linear functions). Because practically any function can be modeled this way, neural networks are robust and flexible.
[Figure 3 shows two transfer functions f(x): a step function and a truncated linear function.]
Figure 3. Examples of transfer functions.
There is a well-known analogy between neural nets and human brain functions (hence the name). It is believed that the human brain reconfigures its circuits in response to external stimuli, either positive or negative reinforcement. Positively reinforced pathways are strengthened, and negatively reinforced pathways are diminished. Neural nets are created in an iterative process. Initially, they are general, and hence inaccurate models. They become accurate through a process known as "training" in which a model reconfigures itself in response to prediction errors associated with a series of data points (analogous to stimuli). As previously stated, neural net models are combinations of simple functions, known as "activation functions" (analogous to pathways in the brain), each of which has approximately the same importance or weight in the initial model. As the neural net is trained, its self-reconfiguration consists of adjusting the relative importance of each activation function, strengthening or diminishing in response to its share of the prediction error associated with a data point (positive or negative reinforcement of stimulus). The Training Process For the sake of comparison, we return again, briefly, to regressions and trees. Consider the regression modeling process. Regression parameters are easily calculated using well-known closed formulas. If the data points meet mild conditions of linear independence, the formulas produce a single "answer", that is, a unique set of regression parameters.
Trees are modeled using constructive processes, not closed formulas. However, because of dataset and various construction constraints, tree algorithms produce a tree only once during the construction process, i.e., the process produces a single set of parameters and then stops. In contrast, neural nets neither use closed formulas nor produce a single set of parameters and stop. Neural nets produce a series of parameters in a series of "training cycles". The process is iterative. A set of parameters is calculated, an initial approximation is created, and the error is measured. If the error level is acceptable, the process is finished. If the error level is unacceptable, the parameters are recalculated. See Figure 4.
[Figure 4: the training cycle - calculate parameters, measure the error; if the error is unacceptable, recalculate the parameters; if the error is acceptable, the neural net is ready to be run in production.]
Figure 4. Neural net training cycle.
In practice, many training cycles comprise the development of a neural net model. Depending on methodological choices, the input of a cycle can be a single data point, a subset of the dataset, or the entire dataset. If the input is a single data point or subset of the dataset, the content of the input set is rotated at each cycle so that the entire dataset is fed into the model before data points are repeated.
In general, predictive modeling functions can be described by a set of parameters. In a regression, for example, the parameters are the regression coefficients. A particular regression equation "lives" in a multi-dimensional parameter space, i.e., the multi-dimensional space containing all possible regressions. The least-squares error function forms a surface over this space. The least-squares regression modeling process is able to select a single equation from this space because the error surface has a unique minimum over this space. Likewise, sets of neural net parameters live in some multi-dimensional parameter space. Unfortunately, neural net parameter spaces are not as well behaved with respect to error as regression parameter spaces. Neural net error surfaces do not have unique minima. They have local minima, but they also have troughs and plateaus. Geometrically, the training cycle corresponds to a series of points (parameter sets) on the error surface above the parameter space that, it is hoped, converges to a minimum error value. The traditional method of choosing a direction of travel on the error surface is the method of steepest descent, which involves calculation of the gradient of the error function in order to descend along the steepest slope in search of a minimum. Unfortunately, the point sequences on the error surfaces can get "stuck" in undesirable local minima or plateaus. See, for example, Shepherd (1997). Different neural net programs have different methods of navigating the error surface without getting "stuck".
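The steepest-descent idea described above can be shown on a toy error surface. The error function below is an invented one-parameter example with two minima of different depth, chosen only to show how the sequence of points can settle in a local minimum; it is not a neural-net error surface.

```python
# Sketch of steepest descent: follow the negative gradient of the error.
def error(w):
    return (w ** 2 - 1.0) ** 2 + 0.3 * w      # toy surface with two minima

def gradient(w, h=1e-6):
    return (error(w + h) - error(w - h)) / (2 * h)

w, rate = 1.5, 0.05                            # starting point and step size
for _ in range(200):
    w -= rate * gradient(w)                    # move down the steepest slope
print(w, error(w))                             # may settle in a local minimum
```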
Neural Net Architecture
Neural nets come in many shapes and sizes. As mentioned above, one of the basic architecture configurations is known as "feed-forward back-propagation". In many contexts, this is the basic model. Neural nets are an ongoing area of research and there are many variations of this basic configuration. The basic feed-forward back-propagation configuration is shown in Figure 5. The feed-forward portion of the algorithm is what generates the actual output. Neural nets generally assume multiple independent variables - Inputs 1-3 in Figure 5. The independent variables are fed into "activation functions", which are compositions of linear functions and what are called "transfer functions" - continuous approximations of the step functions or truncated linear functions described in Figure 3. Traditional choices for approximating step functions are the logistic function 1/(1 + exp(-ax + b)) or the hyperbolic tangent. Each transfer function is considered, diagrammatically, to be a "node". The nodes are configured in layers, with Layer 1 feeding Layer 2 and so on. Interior layers are known as "hidden layers" or "hidden nodes".
[Figure 5 shows the inputs feeding Layer 1, which feeds Layer 2, which produces the output.]
Figure 5. Feed-forward back-propagation neural net.
[Figure 6 shows the detail of a single node: the inputs feed an activation function, which produces the node's output.]
Figure 6. Neural net node detail.
The number of nodes in each layer depends heavily on the nature of the data and the number of independent variables. Similarly, the number of layers depends on the data, although in practice most
neural nets have two or fewer hidden layers. The training cycle was outlined in the previous section and illustrated in Figure 4. The adjustment of the neural net parameters is known as back-propagation. Parameter adjustments flow backward through the net - in an n-layer net, Layer n is adjusted first, Layer n-1 is adjusted next, and so on. As mentioned in the previous section, the equations governing back-propagation are derived using differential geometry techniques. See Shepherd (1997).
Neural Net Implementation Techniques
Neural nets have many applications. Perhaps the most basic is as an alternative to conventional statistical prediction techniques such as regressions. In this role, a neural net is simply a curve-fitting tool that creates a non-linear output function to approximate known output. Neural nets, in the hands of a skilled operator, can be quite effective in this application, more so than regressions in many cases. In the above case, where a neural net represents a more effective alternative to regressions, the neural net is "trained" on a dataset, presumably synthesized from claims or other electronic information. Such an application, however useful it may be, actually misses one of the main advantages of neural nets. Heuristically, neural nets perform well in pattern-recognition situations, where a human might recognize a pattern but may not be able to articulate a concrete set of rules for discerning it. For example, in the field of chemistry, Madison, Munk, and Robb (1996) successfully used neural networks to recognize molecular structures from the results of various spectral tests, something a chemist can do with a "good eye" but which was difficult to implement in a computer program using other prediction methods. In this vein, neural nets might be trained by physicians to recognize risk from a clinical point of view. This would entail giving a physician access to a patient's claims history and letting the physician determine the patient's relative risk level. In this case, the neural net is trained one patient at a time. The final model would not measure financial or actuarial risk, but would be useful as a case management tool that yields actionable results.
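The layered feed-forward computation described above (linear transform followed by a logistic transfer function at each layer) can be sketched as follows. The weights are random placeholders standing in for a trained net, and the shapes (three inputs, four hidden nodes, one output) are assumptions for illustration.

```python
# Sketch of the feed-forward pass through a small two-layer net.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 inputs -> 4 nodes
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # layer 2: 4 nodes -> 1 output

def feed_forward(x):
    hidden = logistic(W1 @ x + b1)                     # hidden layer activations
    return logistic(W2 @ hidden + b2)                  # network output

print(feed_forward(np.array([0.2, 1.5, -0.7])))
```

Back-propagation would then adjust W2 and b2 first, and W1 and b1 next, in proportion to each parameter's share of the prediction error.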
Neural Nets: Remarks and Conclusions
Neural nets are potentially more powerful than traditional statistical predictive methodologies. However, this power has a price. Five areas of concern are staffing, opacity, "bookkeeping", computer resources, and expense. Staffing is a critical issue. Neural net training is as much an art as a science. While the same can be said for other prediction techniques, this is especially true for neural nets. This is because the methodology is an order of magnitude more difficult to understand and harness than other methodologies, and, despite the claims of some neural net vendors, the user really must understand the methodology. Neural nets require a higher level of data preparation prior to the modeling than other methodologies. They are also more prone to over-fitting. For all of these reasons, an experienced modeler is necessary, and such modelers are less plentiful than regression modelers. Neural nets have opaque "recipes" and, for this reason, are often considered to be black boxes. The recipes, in fact, consist of matrices governing the linear transforms, coupled with parameters controlling the transfer functions. As one might imagine, the recipes are nearly impossible to understand intuitively by inspection. Because of this opacity, neural nets can be inferior to regressions and trees in practical applications. However, it is possible to perform sensitivity analyses on neural net models. Such an analysis consists of feeding the net a series of input points in which a single variable is varied and the remaining variables are held constant. In this way, the relative importance of individual variables can be estimated.
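The sensitivity analysis just described is simple to implement once a fitted model is available. In the sketch below, `model` is a stand-in for any fitted scoring function (the coefficients are invented placeholders); each input is perturbed in turn while the others are held at a baseline value.

```python
# Sketch of a one-variable-at-a-time sensitivity analysis.
import numpy as np

def model(x):
    # placeholder for a fitted neural net or other scoring function
    return 1.0 / (1.0 + np.exp(-(0.8 * x[0] - 0.2 * x[1] + 1.5 * x[2])))

baseline = np.array([0.5, 1.0, 0.0])
for i in range(len(baseline)):
    for delta in (-1.0, 1.0):
        probe = baseline.copy()
        probe[i] += delta                      # vary one input only
        print(i, delta, round(float(model(probe) - model(baseline)), 4))
```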
Storing the parameters and integrating the model into a production data process are serious implementation issues. As mentioned above, neural nets use matrices as well as parameters for possibly non-linear transforms. Because of the complexity and non-intuitive nature of the model (non-intuitive parameters prevent error checking by inspection), the documentation of neural nets is an order of magnitude more difficult than it is for other methods. Regarding computer resources, the main issues are runtime and the choice of an actual neural net application. Neural nets are computation-intensive and the runtime is non-trivial. There is little to be done on this front other than using fast computers and planning analyses well, so that minimal rework is necessary. There are a number of neural net applications on the market. The authors have tried several, ranging from a spreadsheet add-in to top-of-the-line integrated statistics and modeling applications. The authors have also developed their own neural net application. In comparisons on health care risk prediction, it turned out that each of the applications was equally effective. This is not surprising, as each used a similar feed-forward back-propagation method. The difference between applications was largely in the implementation details. For example, the spreadsheet add-in was worse in the bookkeeping area and was restricted to datasets of 65,536 rows or less, formatted in a spreadsheet. In practice, health care datasets are often too large for spreadsheet applications. The top-of-the-line application had the advantage of being integrated into a recognized data-mining suite, but was the most expensive. The authors' application had the advantage of fitting closely into an existing data warehouse and integrating with other analysis tools developed to fit the warehouse. In practice, the decision of whether to build or purchase depends on the level of expertise within the organization. It also depends on the "production vs. experiment" question. It may be worthwhile to build a neural net to suit production needs if production parameters and constraints are unlikely to change on a regular basis. However, if the nature of the analysis changes frequently or if the analyst is unsure of what modeling tool will be the final choice, then it may be preferable to purchase. Finally, the expense of neural nets must be mentioned. Neural nets are indeed expensive, considerably more so than simple regression packages. For example, one of the top data-mining applications costs over one hundred thousand dollars annually. Other applications are cheaper, but on average, neural net programs cost more than other prediction tools. The need for high-level staffing and the overhead of computing equipment also raise costs considerably.
This is not to say that it is not worthwhile investing in neural nets. However, the decision to follow any particular methodology, whether traditional or new, is an economic decision and, as such, costs and benefits need to be weighed.
3
Application of Interventions and Other Risk Management Techniques
3.1
Targeting the Right Members
The user who has followed our discussion to this point will be in a position to develop his or her own model, using one of the techniques discussed. Models may be trained using one set of data and applied or tested on another set; a single data set may be split; or sub-sets may be constructed randomly from a single data source ("bootstrapping"). However construction and testing are performed, the result will be a model, or set of coefficients, that will be applied to the "production" data set. Application of the model to production data will result in a set of "scored" outcomes. The highest-risk members have the highest likelihood of the presence, in their profiles, of the key variables, and these members will be assigned the highest scores by the model. The model may be validated by comparing predictions with actual outcomes for historic data, or with concurrent data (for which outcomes are not yet available, but which will be collected and compared with predictions). The scoring process results in many different members, with different risk profiles, being assigned the same risk score; testing should then be performed by cohort. (A cohort will consist of all those members with the same score, or range of scores.) For more information on prediction and the testing process, see Meenan et al. (1999) or Dove, Duncan and Robb (forthcoming, 2003).
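A minimal sketch of the scoring-and-validation step is given below. The scores and outcomes are simulated placeholders; in practice the scores would come from the fitted model and the outcomes from the validation data set. The point is only the mechanics of grouping scored members into cohorts and comparing each cohort's actual outcome rate with its predicted rate.

```python
# Sketch of cohort-level validation of a scoring model (simulated data).
import numpy as np

rng = np.random.default_rng(2)
scores = rng.uniform(0, 40, size=10000)                  # model output per member
actual = rng.uniform(0, 40, size=10000) < scores         # outcome loosely tied to score

bands = np.arange(0, 41, 10)                             # cohorts: 0-10, 10-20, ...
for lo, hi in zip(bands[:-1], bands[1:]):
    in_band = (scores >= lo) & (scores < hi)
    print(f"cohort {lo}-{hi}: n={in_band.sum()}, "
          f"actual rate={actual[in_band].mean():.2%}")
```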
3.2
Effectiveness and Outcomes
The techniques described here may be applied to any population; examples include the entire membership of a health plan, or a subset, such as all diabetics, or all those who had low costs in the historic period (but who are at risk of high costs in the future). Effectiveness will vary according to the specific model developed, the underlying data, and the methodology employed. However, an example of the type of results obtained is shown in Table 2, in which patients were ranked according to the score that resulted from the model. The highest-ranked patients (Rank 40) had, as a cohort, a probability of 51% of experiencing the target outcome (in this case, prediction-year cost in excess of $2,000). Obviously, the highest-ranked cohort is the sub-group that is most in need of intervention. Additionally, the information provided by the model may also be translated into expected costs for pricing purposes.

Table 2. Patients with claims in both 1998 and 1999; 1998 cost < $2,000.

Rank  Cumulative  Cumulative probability   % Patients  Cumulative      Average 1999
      patients    of claims > $2,000                   costs in 1999   cost per
                  in 1999                                              claimant
40    1,054       51.0%                    0.5%        $3,899,475      $7,248
38    1,713       50.2%                    0.8%        $5,963,129      $6,934
36    2,668       48.7%                    1.3%        $9,041,030      $6,965
34    4,019       47.8%                    1.9%        $13,106,399     $6,823
32    5,741       45.6%                    2.7%        $17,510,712     $6,694
30    7,952       43.7%                    3.8%        $23,255,465     $6,686
0     209,069     14.2%                    100.0%      $183,062,265    $6,168
More information about the use of these techniques in an Intervention Program is given below.
3.3
Results, Including a Methodology for Optimizing Inputs to and Estimating Return on Investment from Intervention Programs
Use of a prediction, or ranking, model, such as that described above, forms the basis of a financially driven intervention program. Unlike other, disease-focused or event-driven programs, this type of program focuses on patients according to their "scores", i.e., their likelihood of experiencing an unfavorable financial outcome in the coming year. In Table 2 the highest-risk cohort (Rank 40) had a probability of 51% of experiencing an adverse event (claims in excess of $2,000) in the coming year. The average cost for those members who had claims in excess of $2,000 was $7,248. The 1,054 members with risk rank 40 form the universe for an intervention program. The cost of such a program can be estimated, depending on the type of resources devoted to it and the type and duration of the interventions. Some percentage of the target population will enroll in the program (in the authors' experience, this percentage varies between 20% and 80%, depending on the type of population and program and the intensity of enrollment efforts). Finally, the likely savings from the intervention may also be included in the modeling process. The final result is an optimization process that will help the user determine how deeply into any population to intervene, with what type of intervention, and the likely return on investment in the program. An example of such a model developed by the authors is provided in Figure 7.
3.4
Using the Risk Management Economic Model
This example of the economic model allows the user to optimize the level of interventions in a population (stratified into four different strata according to risk) with two different types of intervention, called Automatic and Nurse-based.
[Figure 7: the risk management economic model - a spreadsheet-style layout showing, for the Automatic and Nurse-based interventions, the costs and savings of each intervention type, the net result, and program totals on a per-member-per-month (PMPM) basis.]
The costs of these two different interventions are variable, according to the number of members managed. In addition, the cost of program management is provided for as fixed costs. The stratification and prediction process ranks cohorts of risky members according to their likelihood of experiencing the predicted outcome (in this case, costs in excess of $5,000 in the coming year). Cohort 4 is predicted to experience these costs at a rate of 72%, or approximately twice the probability for the entire population of 3,273 members. In addition to predicting the probability for the cohort, the prediction process also predicts the likely average cost for the cohort. Applying some key assumptions about the cost of the different interventions and their outcomes, the expected financial outcome for each type of intervention and each cohort is predicted. The user has the option of applying different types of intervention to each cohort. Because the nurse-based intervention is relatively expensive, it is generally not economic to penetrate a population as deeply with nurse-based interventions as with automated means. In this example, we optimize our program by applying automated interventions to all target members, while intervening with nurses only on cohorts 3 and 4. This program is predicted to cost $481,000 and save a (gross) total of $681,000.
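The cohort-by-cohort trade-off behind such an optimization can be sketched as follows. All of the figures - membership, probabilities, savings per member, unit costs and the 50/50 split of savings between the two intervention types - are hypothetical placeholders, not the chapter's data or its economic model; the sketch only shows how deeper nurse-based penetration can be compared on cost, savings and return.

```python
# Sketch of comparing intervention strategies across risk cohorts.
cohorts = {
    # cohort: (members, probability of adverse outcome, savings per managed member)
    1: (1500, 0.20, 150.0),
    2: (1000, 0.35, 300.0),
    3: (500, 0.55, 700.0),
    4: (273, 0.72, 1400.0),
}
auto_cost, nurse_cost, fixed_cost = 40.0, 400.0, 50000.0

def programme(nurse_cohorts):
    cost, savings = fixed_cost, 0.0
    for c, (n, p, save) in cohorts.items():
        cost += n * auto_cost                      # automated outreach for everyone
        savings += n * p * save * 0.5              # assume half the savings via automation
        if c in nurse_cohorts:
            cost += n * nurse_cost                 # nurse management on top
            savings += n * p * save * 0.5
    return cost, savings

for choice in ([], [4], [3, 4], [2, 3, 4]):
    c, s = programme(choice)
    print(choice, round(c), round(s), round(s / c, 2))   # cost, savings, crude ROI
```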
4
Summary and Conclusions
This chapter has addressed the use of some well-known, and some less-well-known, techniques for identifying and classifying risk in a health insurance population. The basis of successful risk identification in health insurance is access to sufficient, timely and accurate data. Fortunately, in most health insurance situations, such volumes of reliable data exist. Once the risk is identified, the stratification process leads naturally to the application of interventions to reduce the risk that those members represent. The information generated through the modeling process described here can also be used to predict cost at the individual and employer-group level, for
pricing purposes. The science of Population Risk Management is in its infancy. Much work remains to be done in incorporating new data sources (Lab values and self-reported data), extending the models to multiple years (for example, taking a survival model approach), and evaluating the success of interventions. Nevertheless, these techniques will form an important component of risk management for all health plans in the future.
References
Berry, M.J.A. and Linoff, G. (1997), Data Mining Techniques for Marketing, Sales, and Customer Support, Wiley.
Berson, A., Smith, S., and Thearling, K. (1999), Building Data Mining Applications, McGraw-Hill.
Cummings, R.B., Knutson, D., Cameron, B.A., and Derrick, B. (2002), A Comparative Analysis of Claims-Based Methods of Health Risk Assessment for Commercial Populations, Society of Actuaries.
Dove, H.G., Duncan, I., and Robb, A. (2003), "A prediction model for targeting low-cost, high-risk members of managed care organizations," American Journal of Managed Care, April.
Madison, M.S., Munk, M.E., and Robb, E.W. (1996), "The neural network as a tool for multispectral interpretation," Journal of Chemical Information and Computer Sciences, vol. 36, no. 2, pp. 231-238.
Meenan, R.T., O'Keeffe-Rosetti, C., Hornbrook, M.C., Bachman, D.J., Goodman, M.J., Fishman, P.A., and Hurtado, A.V. (1999), "The sensitivity and specificity of forecasting high-cost users of medical care," Medical Care, 37(8): 815-823.
Shepherd, A.J. (1997), Second-Order Methods for Neural Networks, Springer.
Weisberg, S. (1985), Applied Linear Regression, 2nd ed., Wiley.
Asset-Liability Management and Investments
Chapter 8 A Fuzzy Set Theoretical Approach to Asset and Liability Management and Decision Making Chiu-Cheng Chang
Good management consists in showing average people how to do the work of superior people. John D. Rockefeller This chapter first uses fuzzy set theory to extend classical immunization theory and modern knowledge about the matching of assets and liabilities. The second half of the chapter uses a fuzzy set theoretical approach to develop a fuzzy Bayesian decision method, and finally extends the Bayesian decision method to include the possibility that the states of nature are fuzzy and the decision makers' alternatives are also fuzzy.
1
Introduction
The classical Redington theory of immunization was developed to structure an insurance enterprise's assets and liabilities so that the insurer will be immunized against the adverse effects from changes in the level of interest rates. Since an insurer's liabilities were largely determined by forces outside the control of the insurer, immunization was generally aimed at managing the structure of the assets.
Immunization strategy consists of structuring the assets so that the following three conditions are met: 1. The present value of the cash inflow from the assets is equal to the present value of the cash outflow from the liabilities. This condition assures that the amount of assets is sufficient to support the liabilities. 2. The duration of the assets is equal to the duration of the liabilities. This condition assures that the sensitivity to changes in interest rates is the same for both assets and liabilities. 3. The convexity of the assets is greater than the convexity of the liabilities. This condition assures that a decrease in interest rates will cause asset values to increase by more than the increase in liability values and, conversely, an increase in interest rates will cause asset values to decrease by less than the decrease in liability values. Since the introduction of immunization theory, much has been written about the matching of assets and liabilities, including both the theory and its applications. Further work has also been done on how the concepts and principles of asset and liability matching can be used to shape investment strategy. This has been applied successfully to the investment management of insurance products with an investment component. In this chapter, we will first develop fuzzy-set theoretical analogues of the classical immunization theory and the matching of assets and liabilities as presented by Tilley (1980). This means that we will "translate" the existing knowledge about immunization and the matching of assets and liabilities into fuzzy-set theory, or, equivalently, use fuzzy set theory to express the existing knowledge. In doing so, we can achieve the following: 1. Provide a new view of the existing problems and solutions. 2. Extend the solutions to the problems by applying the established knowledge about fuzzy-set theory. 3. Offer the advantages, such as flexibility and closer to real-world
situations, which result from using fuzzy mathematics instead of traditional mathematics. This chapter will then give a reasonably detailed description of the Bayesian decision method. To help readers better understand the method, we illustrate with simple examples and work out all the necessary calculations according to the formulas inherent in the method. We then use a fuzzy set theoretical approach to extend the Bayesian decision method by first assuming that the new information may be inherently fuzzy and then by defining fuzzy events on this information. The extension process develops a set of formulas corresponding to those described under the Bayesian decision method. We also continue the original examples and work out corresponding calculations to illustrate the extension. Finally, we further use the fuzzy set theoretical approach to extend the Bayesian method by making both the states of nature and the decision makers' alternatives fuzzy. This final extension also develops a set of more complex formulas corresponding to previous sets. We will then draw our conclusions at the end of the chapter.
2
Fuzzy Numbers and Their Arithmetic Operations
In classical set theory, an object "a" either belongs or does not belong to a given set A. In other words, one and only one of the following relationships is true in classical set theory: a ∈ A or a ∉ A. In many real world situations, the above clear distinction cannot be made. Fuzzy set theory as initiated by Zadeh (1965) was introduced to handle such situations.
2.1
Definitions
We will mainly follow Dubois and Prade (1980) in setting up definitions. Let A denote a collection of objects a. Fuzzy set B in A is a
set of ordered pairs B = {(a, μB(a))}, a ∈ A, where μB: A → M is a function from A to the membership space M, which is the closed interval [0, 1]. We call μB the membership function of B. For any λ ∈ [0, 1], the λ-cut of B is a non-fuzzy subset and is defined by Bλ = {x | μB(x) ≥ λ}.
Let ki, i = 1, 2, 3, 4, be real numbers with k1 < k2 < k3 < k4. A fuzzy number B is a fuzzy subset of the real line R whose membership function μB(a) = μB(a; k1, k2, k3, k4) is defined as follows:
1. μB: R → [0, 1] is continuous,
2. μB(a) = 0 for a ∈ (-∞, k1],
3. μB(a) is strictly increasing on [k1, k2]; this part is denoted μB1(a) and called the left wing,
4. μB(a) = 1 for a ∈ [k2, k3],
5. μB(a) is strictly decreasing on [k3, k4]; this part is denoted μB2(a) and called the right wing,
6. μB(a) = 0 for a ∈ [k4, ∞).
Since μB1 and μB2 are continuous and strictly monotonic, their inverse functions exist and are denoted respectively by V⁻B(y) = VB1(y) and V⁺B(y) = VB2(y). It follows from the definition of Bλ that Bλ = [V⁻B(λ), V⁺B(λ)] for any λ, 0 < λ ≤ 1.
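A minimal sketch of this definition, assuming the wings are linear (the triangular/trapezoidal form used later in the chapter), is given below. The class name and interface are illustrative only.

```python
# Sketch of a fuzzy number B = (k1, k2, k3, k4) with linear wings.
class FuzzyNumber:
    def __init__(self, k1, k2, k3, k4):
        self.k = (k1, k2, k3, k4)

    def membership(self, a):
        k1, k2, k3, k4 = self.k
        if a <= k1 or a >= k4:
            return 0.0
        if k2 <= a <= k3:
            return 1.0
        if a < k2:                       # left wing, strictly increasing
            return (a - k1) / (k2 - k1)
        return (k4 - a) / (k4 - k3)      # right wing, strictly decreasing

    def cut(self, lam):
        """The lambda-cut [V-(lam), V+(lam)] for 0 < lam <= 1."""
        k1, k2, k3, k4 = self.k
        return (k1 + lam * (k2 - k1), k4 - lam * (k4 - k3))

B = FuzzyNumber(1.0, 2.0, 2.0, 4.0)      # a triangular fuzzy number
print(B.membership(1.5), B.cut(0.5))
```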
2.2
Fuzzy Arithmetic Operations
Let * be any of the four arithmetic operations: addition, subtraction, multiplication, and division. Then fuzzy arithmetic operations are
defined as follows:
μA*B(x) = sup over x = u*v of min(μA(u), μB(v)), u, v in R.
When A and B degenerate to two intervals [a1, a2] and [b1, b2], the operations become [a1, a2] * [b1, b2] = {u * v | a1 ≤ u ≤ a2, b1 ≤ v ≤ b2}, and we have that [a1, a2] + [b1, b2] = [a1 + b1, a2 + b2] and [a1, a2] - [b1, b2] = [a1 - b2, a2 - b1]. When a2 ≥ a1 > 0 and b2 ≥ b1 > 0, we have that [a1, a2]·[b1, b2] = [a1 b1, a2 b2] and [a1, a2]/[b1, b2] = [a1/b2, a2/b1]. For any λ ∈ [0, 1], we have (A * B)λ = Aλ * Bλ. Following from the definition, we have
Theorem 1. If A and B are triangular fuzzy numbers, then we have
ki(A + B) = ki(A) + ki(B), i = 1, 2, 3, 4,   (1)
ki(A - B) = ki(A) - k5-i(B), i = 1, 2, 3, 4.   (2)
If ki(A) > 0, ki(B) > 0, then
ki(A·B) = ki(A)·ki(B), i = 1, 2, 3, 4.   (3)
If k1(A) > 0, k1(B) > 0, then
ki(A/B) = ki(A)/k5-i(B), i = 1, 2, 3, 4.   (4)
Proof: It follows from the above that (A + B)λ = Aλ + Bλ. Since (A + B)λ = [V⁻A+B(λ), V⁺A+B(λ)] for any λ > 0, and
Aλ + Bλ = [V⁻A(λ), V⁺A(λ)] + [V⁻B(λ), V⁺B(λ)] = [V⁻A(λ) + V⁻B(λ), V⁺A(λ) + V⁺B(λ)],
we have V⁻A+B(λ) = V⁻A(λ) + V⁻B(λ) and V⁺A+B(λ) = V⁺A(λ) + V⁺B(λ). Thus ki(A + B) = ki(A) + ki(B), i = 1, 2, 3, 4. Similarly, we can prove that
V⁻A-B(λ) = V⁻A(λ) - V⁺B(λ),  V⁺A-B(λ) = V⁺A(λ) - V⁻B(λ),
and hence ki(A - B) = ki(A) - k5-i(B), i = 1, 2, 3, 4. The proofs for the remaining parts are similar.
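Theorem 1 translates directly into arithmetic on the four parameters. A short sketch, with fuzzy numbers represented as (k1, k2, k3, k4) tuples and the example values chosen arbitrarily, follows; note how subtraction and division pair ki of the first operand with k5-i of the second.

```python
# Sketch of Theorem 1 on (k1, k2, k3, k4) representations.
def add(A, B):
    return tuple(a + b for a, b in zip(A, B))

def sub(A, B):
    # k_i(A - B) = k_i(A) - k_{5-i}(B): the wings of B are swapped.
    return tuple(a - b for a, b in zip(A, reversed(B)))

def mul(A, B):
    # valid when all parameters are positive
    return tuple(a * b for a, b in zip(A, B))

def div(A, B):
    # k_i(A / B) = k_i(A) / k_{5-i}(B), for positive A and B
    return tuple(a / b for a, b in zip(A, reversed(B)))

A = (1.0, 2.0, 2.0, 3.0)
B = (0.5, 1.0, 1.0, 2.0)
print(add(A, B), sub(A, B), mul(A, B), div(A, B))
```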
Note that the multiplication and the division of two triangular fuzzy numbers do not necessarily result in a triangular fuzzy number. However, given two TFNs, A and B, we will treat A·B and A/B as TFNs simply as a matter of convenience and approximation. This kind of approximation should be acceptable since the very idea of using fuzzy numbers is itself an attempt to approximate real world problems. While the advantages of treating A·B and A/B this way are obvious, we like especially the fact that fuzzy linear programming (FLP) can then be transformed into ordinary linear programming when multiplication and/or division of fuzzy numbers appear in the FLP. We will see this further in Section 4. If a fuzzy number A degenerates to a positive real number a > 0, i.e., μA(x) = μA(x; a, a, a, a), then we have
V⁻aB(y) = a·V⁻B(y),  V⁻B/a(y) = V⁻B(y)/a,  V⁺aB(y) = a·V⁺B(y),  V⁺B/a(y) = V⁺B(y)/a.
Similarly, for any sequence of positive real numbers {ai}, i = 1, 2, ..., n, and fuzzy numbers B1, B2, ..., Bn, we have
V⁻(Σ ai Bi)(y) = Σ ai V⁻Bi(y)  and  V⁺(Σ ai Bi)(y) = Σ ai V⁺Bi(y).
Definition 1. The set of solutions to the fuzzy inequality f(x) ≥ b(θ) is the fuzzy subset whose membership function is
x ↦ sup{ y | f(x) ≥ b - θy },
where 0 < θ < 1 is called the tolerance. Note that the concept of tolerance used here was introduced by Zimmermann (1980). A fuzzy mapping f: X → TFN(Y) is a function that maps x in its domain to f(x) in TFN(Y), the set of fuzzy numbers on Y.
Definition 2. The set of solutions to the inequality f(x) ≥ b involving a fuzzy mapping is the fuzzy subset whose membership function is
x ↦ sup{ y | V⁻f(x)(1 - y) ≥ b }.
Definition 3. The set of solutions to the fuzzy inequality f(x) ≥ b(θ) involving a fuzzy mapping is the fuzzy subset whose membership function is
x ↦ sup{ y | V⁻f(x)(1 - y) ≥ b - θy }.
Note that we can define f(x) ≤ b(θ), f(x) ≤ b, and f(x) < b(θ) analogously.
Definition 4. The set of solutions to the inequality f(x) ≥ g(x) involving two fuzzy mappings is the fuzzy subset whose membership function is
x ↦ sup{ y | V⁻f(x)(1 - y) ≥ V⁺g(x)(1 - y) }.
Definition 5. Two fuzzy numbers A and B are said to be ε-equal, denoted by A = B(ε), if |ki(A) - ki(B)| ≤ ε for i = 1, 2, 3, 4.
Definition 6. Let An, n = 1, 2, ..., be a sequence of triangular fuzzy numbers. The fuzzy number A is said to be the limit of An, denoted by lim An = A, if lim ki(An) = ki(A) for all i = 1, 2, 3, 4.
Theorem 2. lim An = A if and only if, for any y, 0 < y < 1, lim V⁻An(y) = V⁻A(y) and lim V⁺An(y) = V⁺A(y).
Definition 7. Let F: X → TFN(Y) be a fuzzy mapping, where X and Y are sets of real numbers. Then
dF/dx at x = x0, defined as the limit of [F(x0 + Δx) - F(x0)]/Δx,
is the derivative of F at x = x0 if the limit as Δx → 0 exists.
Theorem 3. Let Ai be fuzzy numbers and fi: X → Y non-fuzzy mappings from X to Y, where X and Y are sets of real numbers. Then (d/dx) Σ Ai fi(x) = Σ Ai fi′(x).
3
Fuzzy Immunization Theory
For the first example, to show how fuzzy set theory may be applied to the matching of assets and liabilities, we will develop the fuzzy set theoretical analogues of Redington's classic theory of immunization (1952). Let Lt be the net liability-outgo and At the asset-proceeds, both at time t years. We assume that both At and Lt are fuzzy numbers. Let δ0 be the current force of interest. Let VA(δ0) and VL(δ0) be the fuzzy values at this force of interest of the asset-proceeds and the net liability-outgo respectively. The conditions for fuzzy immunization against small changes in the rate of interest are as follows:
VA(δ0) = (VL(δ0) + τ)(ε1),   (5)
where τ is a small positive number,
V′A(δ0) = V′L(δ0)(ε2),   (6)
V″A(δ0) > V″L(δ0).   (7)
If these conditions are satisfied, it follows that the function f(δ) = VA(δ) - VL(δ) - τ equals zero (within ε1) when δ = δ0 and has a relative minimum value there. Thus, there is a neighborhood of δ0 such that if δ lies therein but is not equal to δ0, then VA(δ) ≥ VL(δ) + τ. An investor whose investments are such that the above three conditions hold is thus said to be immunized against small changes in the rate of interest. The immunization exists because any immediate small change in the rate of interest will lead to a surplus, since the present value of the assets will exceed that of the net liabilities plus τ.
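As a simplified stand-in for the fuzzy version, the three conditions can be checked numerically for crisp (non-fuzzy) cash flows. The cash-flow figures and the force of interest below are hypothetical; the sketch only shows the present-value, duration and spread calculations that conditions (5)-(7) compare.

```python
# Crisp check of the Redington-style conditions (hypothetical cash flows).
import math

assets = {1: 30.0, 5: 40.0, 12: 60.0}       # asset-proceeds A_t by time t
liabilities = {3: 50.0, 8: 70.0}            # net liability-outgo L_t by time t
delta0 = 0.05                               # current force of interest

def pv(cfs, delta, power=0):
    return sum((t ** power) * cf * math.exp(-delta * t) for t, cf in cfs.items())

va, vl = pv(assets, delta0), pv(liabilities, delta0)
duration_a = pv(assets, delta0, 1) / va
duration_l = pv(liabilities, delta0, 1) / vl
spread_a = pv(assets, delta0, 2) / va
spread_l = pv(liabilities, delta0, 2) / vl

print(va >= vl, abs(duration_a - duration_l), spread_a > spread_l)
```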
Now we proceed to discuss how the above conditions may be interpreted in practice. Since Lt = St - Pt, where St is the liabilities and Pt the receipts at time t, we can rewrite equations (5) and (6) as follows:
| ki(Σt (Pt + At) v^t) - ki(Σt St v^t) - τ | ≤ ε1,  i = 1, 2, 3, 4,
and
| ki(Σt t (Pt + At) v^t) - ki(Σt t St v^t) | ≤ ε2,  i = 1, 2, 3, 4,
both evaluated at force of interest δ0. Since the duration of the total assets and the duration of the liabilities are governed, component by component, by the ratios
ki(Σt t (Pt + At) v^t) / ki(Σt (Pt + At) v^t)  and  ki(Σt t St v^t) / ki(Σt St v^t),
combining the two conditions above shows that these ratios are equal within a tolerance ε3 determined by ε1, ε2 and τ. Therefore, we have
ki(Σt t (Pt + At) v^t) / ki(Σt (Pt + At) v^t) = ki(Σt t St v^t) / ki(Σt St v^t)  (ε3),  i = 1, 2, 3, 4,
at force of interest δ0, and
ki(Σt t² (Pt + At) v^t) > ki(Σt t² St v^t),  i = 1, 2, 3, 4.
Thus, at force of interest δ0, the duration of the total assets (i.e., the receipts Pt plus the asset-proceeds At) and of the liabilities St are equal. Let their common value be denoted by T(δ0). By similar analysis, in combination with the above equations and inequalities, we can rewrite the last inequality as follows:
ki(Σt [t - T(δ0)]² (Pt + At) v^t) > ki(Σt [t - T(δ0)]² St v^t)
at force of interest δ0, i = 1, 2, 3, 4. Hence, the spread of the total assets about their duration must, if the conditions for immunization are satisfied, exceed that of the liabilities. This means that the spread of the receipts and asset-proceeds about the duration must exceed that of the liabilities.
4
Fuzzy Matching of Assets and Liabilities
In this section, we will consider James Tilley's model for matching assets and liabilities (1980) in order to develop its fuzzy-set theo-
retical analogue. Let CFk^out denote the net cash outflow in year k from items other than investments, with cash flow during year k accumulated to the end of the year at the prevailing long-term interest rate. Assume CFk^out is a fuzzy number. Let CFk^in denote the net cash inflow in year k from assets associated with the initial portfolio only, with cash flow during year k accumulated to the end of the year at the prevailing long-term interest rate. Assume CFk^in is also a fuzzy number. Let akj denote the net interest and principal payments in year k per dollar invested in the jth instrument, j = 1, 2, ..., n, with any payments during the year accumulated to year-end at the long-term interest rate. Let Pj denote the fraction of initial funds invested in the jth instrument. Then P1 + P2 + ... + Pn = 1 and CF1^in = Σj a1j Pj.
Let ak0 be the cash flow in year k from the portfolio of assets existing at the start of the first year as a result of investments made in all prior years. For a new fund or block of business, ak0 = 0 for all k. Thus, we can also express CFk^in as follows:
CFk^in = ak0 + Σj akj Pj.
Let ri be the rollover rate for year k + i, with rollover vector r = (r1, r2, ..., rq) and Σi ri = 1. Let ik be the new money rate at the beginning of year k. In the case where CFk^out, CFk^in, and akj are non-fuzzy numbers, Tilley (1980) gives a formula to calculate the total amount of assets at the end of year N as follows:
AN = Σj Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) akj Pj - Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) CFk^out.
Assuming CFk^out, CFk^in, and akj are fuzzy numbers, we can then obtain the same formula, but with a different meaning, since akj and CFk^out are now fuzzy numbers:
AN = Σj Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) akj Pj - Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) CFk^out,
where
γkm = 0 for m > k - 1,
γkm = 1 for m = k - 1,
γkm = Π(l=m+1..k-1) (1 + il Σ(j=1..l-m) rj) for k - 1 > m ≥ 1.
Since akj and CFk^out are fuzzy numbers, AN is now a fuzzy number. Note that if the ik are non-fuzzy numbers, we have
V⁻AN(y) = Σj Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) V⁻akj(y) Pj - Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) V⁺CFk^out(y),
V⁺AN(y) = Σj Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) V⁺akj(y) Pj - Σ(k=2..N+1) (1 - Σ(i=1..N-k+1) ri) (Π(l=1..k-1) γkl) V⁻CFk^out(y).
On the other hand, if the ik are fuzzy numbers (and so, therefore, are the γkl), the wings of AN are obtained by applying the fuzzy arithmetic of Section 2.2 term by term, combining the parameters ki(γkl) with the corresponding wings of akj and CFk^out, where
ki(γkm) = 0 for m > k - 1,
ki(γkm) = 1 for m = k - 1,
ki(γkm) = Π(l=m+1..k-1) (1 + ki(il) Σ(j=1..l-m) rj) for k - 1 > m ≥ 1.
Let S be a specified set of interest rate patterns i. Let C = (C1, C2, ..., Cn) be the current market prices of the investment instruments. Assume Ci, i = 1, 2, ..., n, are fuzzy numbers. An objective of fuzzy matching of assets and liabilities is to maximize I = C1 P1 + C2 P2 + ... + Cn Pn, where Pj ≥ 0, Σj Pj = 1 and AN(il, r, P) ≥ 0, l = 1, 2, ..., m. We have thus expressed the objective in terms of a fuzzy linear programming (FLP) problem. Since the fuzzy number AN is treated as triangular, the wings of AN, i.e., V⁻AN and V⁺AN, are linear functions, and so there are ways to transform FLP problems into ordinary linear programming problems and solve them accordingly. For the above fuzzy linear programming problem, the equivalent ordinary linear programming problem can be described as follows. Given μ > 0, solve the linear programming problem: maximize Iμ = w V⁻I(μ) + (1 - w) V⁺I(μ) such that Pj ≥ 0, Σj Pj = 1 and V⁻AN(il, r, P)(1 - μ) + μθl ≥ 0, l = 1, 2, ..., m, where θl is a given set of tolerances. For further examples, see Chang and Lee (1993) and Zimmermann (1980).
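Once reduced to an ordinary linear programme, the problem can be handed to any LP solver. The sketch below uses scipy's linprog with invented coefficient values standing in for the wings of I and of AN at the chosen level μ, and for the tolerance terms; it is not the chapter's numerical example.

```python
# Sketch of the equivalent ordinary linear programme (hypothetical data).
import numpy as np
from scipy.optimize import linprog

n = 3                                   # number of investment instruments
c = np.array([1.02, 0.98, 1.05])        # objective coefficients (to maximize)

# Constraints of the form V_AN(i_l, r, P)(1 - mu) + mu*theta_l >= 0,
# rewritten as -A P <= b for the solver.
A_ub = -np.array([[0.4, -0.1, 0.2],
                  [0.1, 0.3, -0.2]])
b_ub = np.array([0.05, 0.05])           # tolerance terms mu*theta_l

res = linprog(-c,                        # linprog minimizes, so negate
              A_ub=A_ub, b_ub=b_ub,
              A_eq=np.ones((1, n)), b_eq=[1.0],
              bounds=[(0, None)] * n)
print(res.x, -res.fun)                   # optimal fractions P_j and objective
```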
The above development of Tilley's conclusions on the matching of assets and liabilities into fuzzy set theoretical analogues can be similarly extended to recent results of Shiu (1988), Kocherlakota (1988, 1990) and Reitano (1991, 1993). However, the extensions are quite tedious and the resulting formulas become very cumbersome. We thus stop our fuzzy set theoretical approach to asset and liability management here and turn our attention to fuzzy decision making.
5
Bayesian Decision Method
Classical statistical decision making involves the notion that uncertainty in the future can be characterized probabilistically. When one wants to make a decision among various alternatives, one's choice is predicated on information about the future that is normally classified into various states of nature. If one knew with certainty the future states of nature (crisp states), one would not need an analytic method to assess the likelihood of a given outcome. Unfortunately, we do not know what the future will entail, under most circumstances, so we have devised methods to make the best choices given an uncertain environment. Classical Bayesian decision methods presume that future states of nature can be characterized as probability events. For example, consider the condition of cloudiness in tomorrow's weather by classifying the state space into, say, three levels (very cloudy, cloudy, and sunny) and assessing each level probabilistically. The problem with the Bayesian method is that the events are vague and ambiguous. How can one clearly distinguish between cloudy and very cloudy? If there is one small cloud in the sky, does one classify it sunny or cloudy? In what follows, we shall first present Bayesian decision making and then start to consider ambiguity in the value of new information, in the states of nature, and in the alternatives in the decision process. Let S = {s\, S2, ..-, sn} be a set of possible states of nature and let the probabilities that these states will occur be listed in a vector as
follows:
P = {p(s1), p(s2), ..., p(sn)},   (8)
where Σi p(si) = 1. The probabilities shown in (8) are called "prior probabilities" in the Bayesian decision method, because they express prior knowledge about the true states of nature. Assume now that the decision maker can choose among m alternatives A = {a1, a2, ..., am}. For a given alternative aj we assign a utility value, uji, if the future state of nature turns out to be state si. These utility values should be determined by the decision maker, since they express value, or cost, for each alternative-state pair, i.e., for each aj-si combination. The utility values are usually arranged in a matrix as shown in Table 1.
Table 1. Utility matrix.
Alternatives \ States    s1     s2     ...    sn
a1                       u11    u12    ...    u1n
...                      ...    ...    ...    ...
am                       um1    um2    ...    umn
The expected utility associated with, say, the jth alternative would be
E(uj) = Σi uji p(si).   (9)
The most common decision criterion is the maximum expected utility among all the alternatives, i.e.,
E(u*) = max over j of E(uj),   (10)
which leads to the selection of alternative ak if E(u*) = E(uk). A simple example will illustrate the above. Suppose you are an
engineer who is asked by the CEO of a large oil company to help make a decision about whether to drill for natural gas in a particular geographic region of a foreign country. You determine that there are only two states of nature regarding the existence of natural gas in the region: s1 = there is natural gas; s2 = there is no natural gas, and you are able to find from previous drilling data that the prior probabilities for each of these states are p(s1) = 0.5 and p(s2) = 0.5. Assume that there are two alternatives in this decision: a1 = drill for gas; a2 = do not drill for gas. By way of helping you assemble a utility matrix, the CEO tells you that the best situation for him is to decide to drill for gas, and subsequently find that gas is indeed in the region. He assesses this value u11 as 5. However, he thinks that the worst possible situation would be to drill for gas and subsequently find that there is no gas at all in the area. He determines that the value for this would be u12 = -10. The other two utilities are assessed by the CEO to be u21 = -2 and u22 = 4. Hence, the utility matrix is given by
U = [  5   -10
      -2     4 ]
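As a quick check of equations (9) and (10) on this example, a short sketch using the utility matrix and priors above is given below (numpy is used purely for convenience).

```python
# Expected utilities for the drilling example.
import numpy as np

U = np.array([[5.0, -10.0],    # a1: drill
              [-2.0, 4.0]])    # a2: do not drill
p = np.array([0.5, 0.5])       # prior probabilities of s1, s2

expected = U @ p               # E(u_j) = sum_i u_ji * p(s_i)
best = int(np.argmax(expected))
print(expected, best + 1)      # [-2.5, 1.0]; alternative a2 is chosen
```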
The situation may further be shown as a decision tree in the following:
[Decision tree: alternative a1 branches to s1 (probability 0.5, utility u11 = 5) and s2 (probability 0.5, utility u12 = -10); alternative a2 branches to s1 (probability 0.5, utility u21 = -2) and s2 (probability 0.5, utility u22 = 4).]
The expected utility for each alternative a1 and a2 is: E(u1) = (0.5)(5) + (0.5)(-10) = -2.5; E(u2) = (0.5)(-2) + (0.5)(4) = 1.0. The maximum utility is 1.0, which comes from alternative a2. Hence, based on prior probabilities only, the CEO decides against drilling for natural gas. In many decision situations an intermediate issue arises: should you get more information about the true state of nature prior to making the decision? Suppose some new information regarding the true states of nature S is available from r experiments or observations and is collected in a vector, X = {x1, x2, ..., xr}. This information can be used in the Bayesian approach to update the prior probabilities, p(si), in the following manner. First, the new information is expressed in the form of conditional probabilities, where the probability of each piece of data, xk, k = 1, 2, ..., r, is assessed according to whether the true state of nature, si, is known. This means that, given that we know the true state of nature is si, the probability that the piece of new information xk confirms that the true state is si
is p(x_k|s_i). These conditional probabilities p(x_k|s_i) are also called likelihood values. The likelihood values are then used as weights on the prior probabilities p(s_i) to find updated probabilities called posterior probabilities and denoted p(s_i|x_k). The posterior probabilities are equivalent to this statement: given that the piece of new information x_k is true, the probability that the true state of nature is s_i is p(s_i|x_k). These updated probabilities are determined by Bayes' rule as follows:

    p(s_i | x_k) = \frac{p(x_k | s_i) p(s_i)}{p(x_k)}                      (11)

where the term in the denominator, p(x_k), is the marginal probability of the data x_k and is determined using the total probability theorem

    p(x_k) = \sum_{i=1}^{n} p(x_k | s_i) p(s_i)                            (12)
Now the expected utility for the jth alternative, given the data x_k, is determined from the posterior probabilities instead of the prior probabilities,

    E(u_j | x_k) = \sum_{i=1}^{n} u_{ji} p(s_i | x_k)                      (13)

and the maximum expected utility, given the new data x_k, is now given by

    E(u^* | x_k) = \max_j E(u_j | x_k)                                     (14)

To determine the unconditional maximum expected utility we need to weight each of the r conditional expected utilities given by (14) by the respective marginal probability of each datum x_k, i.e., by p(x_k), as shown below:

    E(u^*_x) = \sum_{k=1}^{r} E(u^* | x_k) p(x_k)                          (15)
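As a sketch only (the chapter gives no code), equations (11)-(15) can be written out in a few lines of Python; the function and variable names below are ours, not the chapter's.

# Minimal sketch of the Bayesian decision computations in (11)-(15).
# Names are illustrative: utilities[j][i] is u_ji, prior[i] is p(s_i),
# and likelihood_k = [p(x_k|s_1), ..., p(x_k|s_n)] for one datum x_k.

def marginal(likelihood_k, prior):                       # p(x_k), equation (12)
    return sum(l * p for l, p in zip(likelihood_k, prior))

def posterior(likelihood_k, prior):                      # p(s_i | x_k), equation (11)
    m = marginal(likelihood_k, prior)
    return [l * p / m for l, p in zip(likelihood_k, prior)]

def expected_utilities(utilities, probs):                # E(u_j | .), equations (9) and (13)
    return [sum(u * p for u, p in zip(row, probs)) for row in utilities]

def unconditional_max_utility(utilities, prior, likelihood):   # equations (14)-(15)
    total = 0.0
    for likelihood_k in likelihood:                      # one datum x_k at a time
        post = posterior(likelihood_k, prior)
        total += max(expected_utilities(utilities, post)) * marginal(likelihood_k, prior)
    return total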
We can now introduce the concept of the value of information, denoted V(x). In the case where there is some uncertainty about the new information X = {x_1, x_2, ..., x_r}, we call the information imperfect information. The value of this imperfect information, V(x), can be assessed by taking the difference between the maximum expected utility with the new information and the maximum expected utility without any new information, i.e.,

    V(x) = E(u^*_x) - E(u^*)                                               (16)
We now introduce the concept of perfect information. Perfect information can never be achieved in reality, but it can be used to assess the value of imperfect information. If information is considered to be perfect, then each new piece of information or data predicts one and only one state of nature and there is no ambivalence about which state is predicted by the data. However, if there is more than one piece of information, the probabilities for a particular state of nature have to be shared by all the data. Mathematically, perfect information is represented by posterior probabilities of 0 or 1, i.e.,

    p(s_i | x_k) = 1 or 0                                                  (17)

We call this perfect information x_p. For perfect information, the maximum expected utility becomes

    E(u^*_{x_p}) = \sum_{k=1}^{r} E(u^*_{x_p} | x_k) p(x_k)                (18)

and the value of perfect information becomes

    V(x_p) = E(u^*_{x_p}) - E(u^*)                                         (19)
Let us continue the natural gas example, where we had two states of nature (s_1 with gas and s_2 without gas) and two alternatives (a_1 to drill and a_2 not to drill). The prior probabilities were uniform: p(s_1) = 0.5, p(s_2) = 0.5.
Suppose that the CEO reconsiders his utility values and provides you with a new utility matrix as follows:

Table 2. New utility matrix.

           s_1     s_2
a_1         4      -2
a_2        -1       2
Suppose further that the CEO has asked you to collect new information by taking eight geological testing samples from the region being considered for drilling. Assume that the results of these eight tests can be expressed as conditional probabilities in the form of a matrix as shown in Table 3.

Table 3. Conditional probabilities for imperfect information.

              x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
p(x_k|s_1)    0      0.05   0.1    0.1    0.2    0.4    0.1    0.05   (row sum = 1)
p(x_k|s_2)    0.05   0.1    0.4    0.2    0.1    0.1    0.05   0      (row sum = 1)
For comparison purposes, we assume that the eight tests were capable of providing perfect information, and the conditional probabilities are thereby changed as follows:

Table 4. Conditional probabilities for perfect information.

              x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
p(x_k|s_1)    0      0      0      0      0.2    0.5    0.2    0.1    (row sum = 1)
p(x_k|s_2)    0.1    0.2    0.5    0.2    0      0      0      0      (row sum = 1)
Since the CEO changed his utility values, you have to recalculate the expected utility of making the decision on the basis of just the prior probabilities, before any new information is acquired. The
decision tree for this situation now looks as follows:

[Decision tree as before, with the new utility values: a_1 leads to u_11 = 4 and u_12 = -2; a_2 leads to u_21 = -1 and u_22 = 2, each state with probability 0.5.]
The expected utilities and maximum expected utility, based on prior probabilities only, are E(u_1) = (4)(0.5) + (-2)(0.5) = 1.0 and E(u_2) = (-1)(0.5) + (2)(0.5) = 0.5, so E(u^*) = 1.0; therefore, the choice is alternative a_1: drill for natural gas. We are now ready to assess the changes in this decision process by considering additional information, both imperfect and perfect. Let us first calculate the marginal probabilities for the new imperfect information. Using (12), the conditional probabilities for imperfect information and the prior probabilities, we obtain, for example,

    p(x_1) = (0)(0.5) + (0.05)(0.5) = 0.025
    p(x_4) = (0.1)(0.5) + (0.2)(0.5) = 0.15

Now we can calculate the posterior probabilities. Using (11), the conditional probabilities for imperfect information, the prior probabilities, and the marginal probabilities just obtained, we have, for example,
    p(s_1|x_2) = 0.05(0.5)/0.075 = 1/3,   p(s_1|x_6) = 0.4(0.5)/0.25 = 4/5,
    p(s_2|x_2) = 0.1(0.5)/0.075 = 2/3,    p(s_2|x_6) = 0.1(0.5)/0.25 = 1/5, ...

The conditional expected utilities, E(u_j|x_k), are calculated using first (13), then (14); for example,

    E(u_1|x_3) = (1/5)(4) + (4/5)(-2) = -4/5;   E(u_2|x_3) = (1/5)(-1) + (4/5)(2) = 7/5.

Hence, E(u^*|x_3) = max(-4/5, 7/5) = 7/5 (choose alternative a_2).

    E(u_1|x_8) = (1)(4) + (0)(-2) = 4;   E(u_2|x_8) = (1)(-1) + (0)(2) = -1.

Hence, E(u^*|x_8) = max(4, -1) = 4 (choose alternative a_1). All the above results are summarized in Table 5.

Table 5. Posterior probabilities based on imperfect information.

              x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
p(x_k)        0.025  0.075  0.25   0.15   0.15   0.25   0.075  0.025
p(s_1|x_k)    0      1/3    1/5    1/3    2/3    4/5    2/3    1
p(s_2|x_k)    1      2/3    4/5    2/3    1/3    1/5    1/3    0
E(u*|x_k)     2      1      7/5    1      2      14/5   2      4
a_j|x_k       a_2    a_2    a_2    a_2    a_1    a_1    a_1    a_1
We can now use (15) to calculate the overall unconditional expected utility for imperfect information, which is simply the sum of the pairwise products of the values in rows 1 and 4 of the above table:

    E(u^*_x) = (0.025)(2) + (0.075)(1) + ... + (0.025)(4) = 1.875
and the value of the new imperfect information is V(x) = E(u^*_x) - E(u^*) = 1.875 - 1 = 0.875. To decide which alternative to choose, notice from Table 5 that the total utility favoring a_1 is 10.8 (= 2 + 14/5 + 2 + 4) and the total utility favoring a_2 is 5.4 (= 2 + 1 + 7/5 + 1). Therefore, the CEO chooses alternative a_1: drill for gas. We can now use the conditional probabilities for perfect information in place of those for imperfect information and perform all the calculations leading to p(x_k), p(s_1|x_k), p(s_2|x_k), E(u^*|x_k), and a_j|x_k. The results are summarized in Table 6.

Table 6. Posterior probabilities based on perfect information.

              x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
p(x_k)        0.05   0.1    0.25   0.1    0.1    0.25   0.1    0.05
p(s_1|x_k)    0      0      0      0      1      1      1      1
p(s_2|x_k)    1      1      1      1      0      0      0      0
E(u*|x_k)     2      2      2      2      4      4      4      4
a_j|x_k       a_2    a_2    a_2    a_2    a_1    a_1    a_1    a_1
Equation (18) can be used to calculate the overall unconditional expected utility for perfect information, which again is the sum of the pairwise products of the values in rows 1 and 4 of the above table:

    E(u^*_{x_p}) = (0.05)(2) + (0.1)(2) + ... + (0.05)(4) = 3.0

and the value of the new perfect information is

    V(x_p) = E(u^*_{x_p}) - E(u^*) = 3 - 1 = 2.0.
Alternative a\ is still the choice here. Note that the hypothetical perfect information has a value of 2 and the imperfect information has a value of 0.875. This difference can be used to assess the value of the imperfect information compared to both no information (1) and perfect information (3).
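Assuming the helper functions sketched after equation (15) are available, the drilling example's figures can be checked numerically; the matrices below transcribe Tables 2-4, and the script is only an illustration.

utilities = [[4, -2],      # alternative a_1: u_11, u_12 (Table 2)
             [-1, 2]]      # alternative a_2: u_21, u_22
prior = [0.5, 0.5]

# Columns of Table 3 rearranged as rows: imperfect[k] = [p(x_k|s_1), p(x_k|s_2)]
imperfect = [[0.0, 0.05], [0.05, 0.1], [0.1, 0.4], [0.1, 0.2],
             [0.2, 0.1], [0.4, 0.1], [0.1, 0.05], [0.05, 0.0]]
# Table 4: hypothetically perfect information
perfect = [[0.0, 0.1], [0.0, 0.2], [0.0, 0.5], [0.0, 0.2],
           [0.2, 0.0], [0.5, 0.0], [0.2, 0.0], [0.1, 0.0]]

e_star = max(expected_utilities(utilities, prior))                   # 1.0
e_star_x = unconditional_max_utility(utilities, prior, imperfect)    # 1.875
e_star_xp = unconditional_max_utility(utilities, prior, perfect)     # 3.0
print(e_star_x - e_star, e_star_xp - e_star)                         # V(x) = 0.875, V(x_p) = 2.0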
6
Fuzzy Bayesian Decision Method
We now consider the situation where the new information may be inherently fuzzy. Let X = {x_1, x_2, ..., x_r} be the new information. Then we can define fuzzy events M on this information, such as "good" information, "moderate" information, and "poor" information. A fuzzy event M has a membership function \mu_M(x_k), k = 1, 2, ..., r. We can now define the concept of the "probability of a fuzzy event," i.e., the probability of M, as

    P(M) = \sum_{k=1}^{r} \mu_M(x_k) p(x_k)                                (20)
If the fuzzy event is, in fact, a crisp event M, then the probability reduces to

    P(M) = \sum_{k=1}^{r} p(x_k) \chi_M(x_k),   where \chi_M(x_k) = 1 if x_k \in M and 0 otherwise,    (21)

so that (21) describes the probability of a crisp event simply as the sum of the marginal probabilities of those data points x_k that are defined to be in the event M. Based on this, the posterior probability of s_i, given fuzzy information M, is

    p(s_i | M) = \frac{\sum_{k=1}^{r} p(x_k | s_i) \mu_M(x_k) p(s_i)}{P(M)} = \frac{P(M | s_i) p(s_i)}{P(M)}    (22)

where

    P(M | s_i) = \sum_{k=1}^{r} p(x_k | s_i) \mu_M(x_k)                    (23)
We can now define the collection of all the fuzzy events describing fuzzy information as an orthogonal fuzzy information system, Y = {M_1, M_2, ..., M_z}, where by orthogonal we mean that the sum of the membership values of the fuzzy events M_t, at every data point x_k, equals 1 (Tanaka 1976). That is,

    \sum_{t=1}^{z} \mu_{M_t}(x_k) = 1   for all x_k \in X                  (24)
If the fuzzy events on the new information are orthogonal, we can extend the Bayesian approach to consider fuzzy information. The fuzzy equivalents of (13), (14), and (15) become, for a fuzzy event M_t,

    E(u_j | M_t) = \sum_{i=1}^{n} u_{ji} p(s_i | M_t)                      (25)

    E(u^* | M_t) = \max_j E(u_j | M_t)                                     (26)

    E(u^*_Y) = \sum_{t=1}^{z} E(u^* | M_t) P(M_t)                          (27)

The value of the fuzzy information can now be determined analogously as

    V(Y) = E(u^*_Y) - E(u^*)                                               (28)
We can continue our gas example to illustrate the above by assuming that the test samples are inherently fuzzy and by defining an orthogonal fuzzy information system Y as Y = {M_1, M_2, M_3} = {poor data, moderate data, good data}, with membership functions as shown in Table 7.

Table 7. Orthogonal membership functions for orthogonal fuzzy events.

                x_1    x_2    x_3    x_4    x_5    x_6    x_7    x_8
\mu_M1(x_k)     1      1      0.5    0      0      0      0      0
\mu_M2(x_k)     0      0      0.5    1      1      0.5    0      0
\mu_M3(x_k)     0      0      0      0      0      0.5    1      1
p(x_k)          0.025  0.075  0.25   0.15   0.15   0.25   0.075  0.025
Note that the fourth row, p(x_k), is the same as the row of marginal probabilities obtained for the imperfect information (Table 5), and that the sum of the membership values in each column (the first three rows) equals 1, as required for orthogonality. We can now use (20) to determine the marginal probabilities for each fuzzy event,

    P(M_1) = 0.225,   P(M_2) = 0.55,   P(M_3) = 0.225,

and (23) to determine the fuzzy conditional probabilities,

    P(M_1|s_1) = 0.1,    P(M_2|s_1) = 0.55,   P(M_3|s_1) = 0.35;
    P(M_1|s_2) = 0.35,   P(M_2|s_2) = 0.55,   P(M_3|s_2) = 0.1,

and (22) to determine the fuzzy posterior probabilities,

    p(s_1|M_1) = 0.222,   p(s_1|M_2) = 0.5,   p(s_1|M_3) = 0.778;
    p(s_2|M_1) = 0.778,   p(s_2|M_2) = 0.5,   p(s_2|M_3) = 0.222.

The conditional fuzzy expected utilities can now be determined by using (25),
    E(u_1|M_1) = (4)(0.222) + (-2)(0.778) = -0.668;
    E(u_2|M_1) = (-1)(0.222) + (2)(0.778) = 1.334;
    E(u_1|M_2) = (4)(0.5) + (-2)(0.5) = 1.0;
    E(u_2|M_2) = (-1)(0.5) + (2)(0.5) = 0.5;
    E(u_1|M_3) = (4)(0.778) + (-2)(0.222) = 2.668;
    E(u_2|M_3) = (-1)(0.778) + (2)(0.222) = -0.334,

and the maximum expected utility from (27), using each of the foregoing three maximum conditional expected utilities,
    E(u^*_Y) = (0.225)(1.334) + (0.55)(1.0) + (0.225)(2.668) = 1.45,

and the value of the fuzzy information from (28) is V(Y) = 1.45 - 1 = 0.45. We can see here that the value of the fuzzy information is less than the value of the perfect information (2.0) and less than the value of the imperfect information (0.875). However, it may turn out that fuzzy information is far less costly to obtain than either the imperfect or the perfect information.
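Continuing the same illustrative script (again, names and structure are ours, not the chapter's), the fuzzy information value V(Y) = 0.45 follows from equations (20), (22), (23), and (25)-(27):

# Fuzzy Bayesian step for the drilling example: membership values from Table 7,
# priors and likelihoods (imperfect information, Table 3) as defined above.
membership = {                        # mu_Mt(x_k) for the three orthogonal fuzzy events
    "poor":     [1, 1, 0.5, 0, 0, 0, 0, 0],
    "moderate": [0, 0, 0.5, 1, 1, 0.5, 0, 0],
    "good":     [0, 0, 0, 0, 0, 0.5, 1, 1],
}
p_x = [marginal(lk, prior) for lk in imperfect]          # p(x_k), equation (12)

e_star_fuzzy = 0.0
for mu in membership.values():
    p_M = sum(m * p for m, p in zip(mu, p_x))                            # equation (20)
    p_M_given_s = [sum(m * imperfect[k][i] for k, m in enumerate(mu))    # equation (23)
                   for i in range(len(prior))]
    post = [pms * pr / p_M for pms, pr in zip(p_M_given_s, prior)]       # equation (22)
    e_star_fuzzy += max(expected_utilities(utilities, post)) * p_M       # (25)-(27)

print(e_star_fuzzy, e_star_fuzzy - e_star)               # about 1.45 and V(Y) = 0.45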
7
Decision Making under Fuzzy States and Fuzzy Alternatives
We will now extend the Bayesian decision method to include the possibility that the states of nature are fuzzy and that the decision maker's alternatives are also fuzzy (Tanaka 1976). As an example, suppose your company wants to expand and you are considering three fuzzy alternatives in terms of the size of a new facility:

    A_1 = small-scale facility
    A_2 = middle-scale facility
    A_3 = large-scale facility

Suppose further that the economic climate in the future is very fuzzy and you pose the following three possible fuzzy states of nature:

    F_1 = low rate of economic growth
    F_2 = medium rate of economic growth
    F_3 = high rate of economic growth

all of which are defined on a universe of numerical rates of economic growth, say S, where S = {s_1, s_2, ..., s_n} is a discrete universe of economic growth rates. The fuzzy states F_s will be required to be orthogonal, and this orthogonality condition on the fuzzy states will be
the same constraint as shown in (24), i.e.,

    \sum_{s} \mu_{F_s}(s_i) = 1,   i = 1, 2, ..., n                        (29)
Just as we needed utility values to express the worth of each crisp alternative-state pair, we now need a utility matrix to express the value of all the fuzzy alternative-state pairings. Such a matrix will have the form shown in Table 8.

Table 8. Utility values for fuzzy states and fuzzy alternatives.

           F_1     F_2     F_3
A_1        u_11    u_12    u_13
A_2        u_21    u_22    u_23
A_3        u_31    u_32    u_33
Now, with fuzzy states of nature, the expected utility of fuzzy alternative A_j is

    E(u_j) = \sum_{s} u_{js} p(F_s)                                        (30)

where

    p(F_s) = \sum_{i=1}^{n} \mu_{F_s}(s_i) p(s_i)                          (31)

and the maximum utility is

    E(u^*) = \max_j E(u_j).                                                (32)

We can have crisp or fuzzy information on a universe of information X = {x_1, x_2, ..., x_r}. Our fuzzy information will again reside on a collection of orthogonal fuzzy sets, Y = {M_1, M_2, ..., M_z}, that are defined on X. Given probabilistic information x_k or fuzzy information M_t, we can derive the posterior probabilities of the fuzzy states F_s as follows:
    p(F_s | x_k) = \frac{\sum_{i=1}^{n} \mu_{F_s}(s_i) p(x_k | s_i) p(s_i)}{p(x_k)}                             (33)

    p(F_s | M_t) = \frac{\sum_{k=1}^{r} \sum_{i=1}^{n} \mu_{F_s}(s_i) \mu_{M_t}(x_k) p(x_k | s_i) p(s_i)}{P(M_t)}    (34)
Similarly, we can derive the expected utilities as follows:

    E(u_j | x_k) = \sum_{s} u_{js} p(F_s | x_k)                            (35)

    E(u_j | M_t) = \sum_{s} u_{js} p(F_s | M_t)                            (36)

where the maximum conditional expected utilities for probabilistic and fuzzy information are, respectively,

    E(u^* | x_k) = \max_j E(u_j | x_k)                                     (37)

    E(u^* | M_t) = \max_j E(u_j | M_t)                                     (38)
Finally, the unconditional expected utilities for fuzzy states with probabilistic information or fuzzy information are

    E(u^*_x) = \sum_{k=1}^{r} E(u^* | x_k) p(x_k)                          (39)

    E(u^*_Y) = \sum_{t=1}^{z} E(u^* | M_t) P(M_t)                          (40)
The expected utilities given by (39) and (40) now enable us to compute the value of information, within the context of fuzzy states of nature, for probabilistic information, (16), and for fuzzy information, (41):
    V(Y) = E(u^*_Y) - E(u^*)                                               (41)
If the new fuzzy information is hypothetically perfect, then we can compute the maximum expected utility of fuzzy perfect information using the following:

    u(A_j | F_s) = u_{js}                                                  (42)

where the expected utility of the jth alternative A_j for fuzzy perfect information on state F_s is the utility value for fuzzy state F_s and fuzzy alternative A_j (see Table 8). Therefore, the optimum fuzzy alternative A^* is defined by

    u(A^* | F_s) = \max_j u(A_j | F_s)                                     (43)

Hence, the total expected utility for fuzzy perfect information is

    E(u^*_{Y_p}) = \sum_{s} u(A^* | F_s) p(F_s)                            (44)
where the p(F_s) are the prior probabilities of the fuzzy states of nature given by (31). The counterpart of (40) for the fuzzy perfect case will be denoted E(u^*_{Y_p}), and the value of the fuzzy perfect information will be

    V(Y_p) = E(u^*_{Y_p}) - E(u^*)                                         (45)

Tanaka et al. (1976) have proved that the various values of information conform to the following inequalities:

    V(Y_p) > V(x_p) > V(x) > V(Y) > 0                                      (46)

These inequalities are consistent with our intuition. The inequality V(x) > V(Y) is due to the fact that the information Y is characterized by both fuzziness and randomness. The inequality V(x_p) > V(x) is true because x_p is better information than x; it is perfect. The inequality V(Y_p) > V(x_p) holds because the uncertainty expressed by the probability p(F_s) still remains, even if we know the true state s_i. Hence,
our interest is not in the crisp states of nature S but rather in the fuzzy states F which are defined on S.
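The machinery of equations (29)-(31) is easy to prototype; the sketch below uses invented membership values, priors, and utilities purely for illustration.

def fuzzy_state_priors(memberships, p_s):
    # memberships[s][i] = mu_{F_s}(s_i), p_s[i] = p(s_i).  Equation (31).
    return [sum(m * p for m, p in zip(mu_F, p_s)) for mu_F in memberships]

def fuzzy_alternative_utilities(utilities, p_F):
    # utilities[j][s] = u_js for fuzzy alternative A_j and fuzzy state F_s.  Equation (30).
    return [sum(u * p for u, p in zip(row, p_F)) for row in utilities]

# Hypothetical numbers: three fuzzy growth states on five crisp growth rates.
mu_F = [[1.0, 0.5, 0.0, 0.0, 0.0],      # F_1, low growth
        [0.0, 0.5, 1.0, 0.5, 0.0],      # F_2, medium growth
        [0.0, 0.0, 0.0, 0.5, 1.0]]      # F_3, high growth (columns sum to 1, equation (29))
p_s = [0.1, 0.2, 0.4, 0.2, 0.1]
u   = [[3, 1, -1], [1, 2, 1], [-2, 1, 4]]   # small/middle/large facility versus F_1..F_3

p_F = fuzzy_state_priors(mu_F, p_s)                      # [0.2, 0.6, 0.2]
print(max(fuzzy_alternative_utilities(u, p_F)))          # 1.6: the middle-scale facility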
8
Conclusions
This chapter first develops fuzzy-set theoretical analogues of the Redington theory of immunization and of the matching of assets and liabilities. By translating the existing knowledge about immunization and the matching of assets and liabilities into fuzzy set theory, we are able to view the problems and solutions on a broader horizon. Moreover, this approach offers the advantages of fuzzy set theory, such as flexibility and closer adherence to real-world situations. It is believed that successful applications of fuzzy set theory to many other areas of finance and economics are highly likely. Based on the Bayesian decision method, this chapter then uses a fuzzy set theoretical approach to develop a fuzzy Bayesian decision method, and finally extends the Bayesian decision method to include the possibility that the states of nature are fuzzy and the decision maker's alternatives are also fuzzy. The attributes of a typical decision making problem are that there are many states of nature, feasible policy alternatives, and sources of available information. Usually, the utilities for all the states and all the alternatives cannot be formulated because of insufficient data, the high cost of obtaining the data, or time constraints. On the other hand, top managers mostly want to decide roughly which alternatives to select, simply as indicators of policy directions. Thus, an approach that is based on fuzzy states and fuzzy alternatives and that can accommodate fuzzy information is a very powerful tool for making such preliminary policy decisions.
References

Buehlmann, N. and Berliner, B., "The fuzzy zooming of cash flows," (unpublished working paper).
Chang, P.T. and Lee, E.S. (1993), "Fuzzy decision making: a survey," in Wang, P.Z. and Lee, K.F. (eds.), Between Mind and Computer, Advances in Fuzzy Systems, World Scientific, London, pp. 139-182.
Dubois, D. and Prade, H. (1980), Fuzzy Sets and Systems, Academic Press, New York.
Kocherlakota, R., Rosenbloom, E.S., and Shiu, E.S.W. (1988), "Algorithms for cash-flow matching," Transactions of the Society of Actuaries, vol. 40, pp. 477-484.
Kocherlakota, R., Rosenbloom, E.S., and Shiu, E.S.W. (1990), "Cash-flow matching and linear programming duality," Transactions of the Society of Actuaries, vol. 42, pp. 281-293.
Okuda, T., Tanaka, H., and Asai, K. (1974), "Decision making and information in fuzzy events," Bull. Univ. Osaka Prefect., Ser. A, vol. 23, no. 2, pp. 193-202.
Okuda, T., Tanaka, H., and Asai, K. (1978), "A formulation of fuzzy decision problems with fuzzy information, using probability measures of fuzzy events," Inf. Control, vol. 38, no. 2, pp. 135-147.
Redington, F.M. (1952), "Review of the principles of life-office valuations," Journal of the Institute of Actuaries, vol. 78, pp. 268-315.
Reitano, R.R. (1991), "Multivariate immunization theory," Transactions of the Society of Actuaries, vol. 43, pp. 393-428.
Reitano, R.R. (1993), "Multivariate stochastic immunization theory," Transactions of the Society of Actuaries, vol. 45, pp. 425-461.
Shiu, E.S.W. (1988), "Immunization of multiple liabilities," Insurance: Mathematics and Economics, vol. 7, pp. 219-224.
Tanaka, H., Okuda, T., and Asai, K. (1976), "A formulation of fuzzy decision problems and its application to an investment problem," Kybernetes, vol. 5, pp. 25-30.
Tilley, J.A. (1980), "The matching of assets and liabilities," Transactions of the Society of Actuaries, vol. 32, pp. 263-300.
Zadeh, L.A. (1965), "Fuzzy sets," Information and Control, vol. 8, pp. 338-353.
Zimmermann, H.J. and Zysno, P. (1980), "Latent connectives in human decision making," Fuzzy Sets and Systems, vol. 4, pp. 37-51.
Industry Issues
Chapter 9
Using Neural Networks to Predict Failure in the Marketplace
Patrick L. Brockett, Linda L. Golden, Jaeho Jang, and Chuanhou Yang
This chapter examines the indicators of marketplace failure for business enterprises, with a specific application to predicting insurance company insolvency. Both the US Property and Casualty insurance industry and the US Life Insurance industry are examined. Various approaches are contrasted including discriminant analysis, logistic regression analysis, k-nearest neighbor, expert rating agencies (A. M. Best ratings, and the National Association of Insurance Commissioners' Insurance Regulatory Information System and Financial Analysis Tracking System), and variants of neural networks. It is shown for both industries that the neural network models exhibit high success relative to competing models for predicting market failure of firms.
1
Introduction
The definition and measurement of business risk have been a central theme of the financial and actuarial literature for years. Although the works of Borch (1970), Bierman (1960), Tinsley (1970), and Quirk (1961) dealt with the issue of corporate failure, their models did not lend themselves to empirical testing. Other work, such as that of Altman (1968), Williams and Goodman (1971), Sinkey (1975), and Altman, Haldeman, and Narayanan (1977), attempted to predict
bankruptcy or the propensity toward market failure for a particular firm by using discriminant analysis. Failure of firms within certain regulated industries, such as insurance and utility providers, has become a major issue of public debate and concern. Particularly since the terrorist attacks on the World Trade Center on September 11, 2001, the identification of potentially troubled firms has become a major regulatory research objective. Previous research on the topic of insurer insolvency prediction includes Ambrose and Seward (1988), BarNiv and Hershbarger (1990), BarNiv and MacDonald (1992), Harrington and Nelson (1986), Brockett et al. (1994), Huang et al. (1994), and Brockett et al. (2002). This chapter gives a review of the effectiveness of neural network models for predicting market failure (insolvency or financial distress) at the firm level. The industry used for the application is the insurance industry, both because of the important social implications of market failure in this industry and because firm-level data are publicly available and can be analyzed without proprietary data access. The results of neural network models are compared with those of discriminant analysis, logistic regression analysis, A. M. Best ratings, and the National Association of Insurance Commissioners' (NAIC) Insurance Regulatory Information System ratings, and the neural network results show high predictability and generalizability, suggesting the usefulness of neural network approaches for predicting future insurer insolvency.
2
Overview and Background
2.1
Models for Firm Failure
In the context of investigating the financial stability of firms within the insurance industry, the consumer (and regulator) has several
sources of information. For example, the reporting and rating services from the A. M. Best Company, as well as Moody's and Standard and Poor's, provide the consumer with knowledge useful for determining whether or not the firm considered for potential business is likely to be around when a claim is filed. In addition, the NAIC has developed the Insurance Regulatory Information System (IRIS) and the Financial Analysis Tracking System (FAST) to provide an early warning system for use by regulators. The NAIC has also adopted a risk-based capital (RBC) formula for insolvency prediction. The IRIS system was designed to provide an early warning system for insurer insolvency based upon financial ratios derived from the regulatory annual statement. The IRIS system identifies insurers for further regulatory evaluation if four of the eleven (or twelve, in the case of life insurers) computed financial ratios for a particular company lie outside a given "acceptable" range of values. IRIS uses univariate tests, and the acceptable range of values is determined such that, for any given univariate ratio measure, only approximately 15 percent of all firms have results outside of the particular specified "acceptable" range. The adequacy of IRIS for predicting troubled insurers has been investigated empirically and found not to be strongly predictive. For example, one could use the IRIS ratio variables from the NAIC data with more sophisticated statistical methods to obtain substantial improvements over the IRIS insolvency prediction rates (cf. Barrese 1990, Brockett et al. 1994). One of the criticisms of IRIS is that it is too dependent on capital and surplus figures, and another is that IRIS fails to take into account the relationships between the ratios. In response to these criticisms, the NAIC developed the FAST system, which is supposed to eliminate some of the problems associated with the IRIS system. Unlike the original IRIS ratios, the FAST system assigns different point values for different ranges of ratio results. A cumulative score is derived for each company, which is used to prioritize it for further analysis. In addition, the risk-based capital systems adopted by the NAIC may enhance regulators' ability to identify problem
insurers prior to insolvency. Evaluations of the accuracy of the risk-based capital system for property-liability insurers are provided in Grace, Harrington, and Klein (1998) and Cummins, Harrington, and Klein (1994). Since the multiple discriminant analysis (MDA) method was introduced by Altman in 1968, it has become a commonly used parametric method for predicting financial distress in various industries (cf. Edmister 1972, Sinkey 1975). However, MDA has received much criticism because the data used to determine the propensity toward marketplace failure of a firm often violate the statistical assumptions of this model. Similar problems exist for the logistic regression model, another important technique used in the previous literature for insurer insolvency prediction. To overcome the problems associated with parametric techniques, nonparametric methods have become popular in recent studies. One such model is the neural network model.
2.2
Neural Network and Artificial Intelligence Background
The neural network model can be represented as a massively parallel interconnection of many simple processing units connected structurally in much the same manner as individual neurons in the brain. Just as the individual neurons in the brain provide "intelligent learning" through their constantly evolving network of interconnections and reconnections, artificial mathematical neural networks function by constantly adjusting the values of the interconnections between individual neural units. The process by which the mathematical network "learns" to improve its performance, recognize patterns, and develop generalizations is called the training rule for the network. The learning law proposed by Hebb in 1949 served as the starting point for developing the mathematical training algorithms of neural networks. The subsequent development of the back-propagation
tion "training rule" resolved computational problems outstanding for two decades and significantly enhanced the performance of neural networks in practice. This rule is based on a "feed forward" network that essentially designates that the flow of the network intelligence is from input toward output. The "back propagation" algorithm updates its interconnection weights by starting at the output, determining the error produced with the particular mathematical or logical structure and then mathematically propagating this error backward through the network to determine, in the aggregate, how to efficiently update (adjust) the mathematical structure of interconnections between individual neurons in order to improve the predictive accuracy of the network. The method of updating weights to reduce total aggregate error is one of steepest gradient decent familiar to statistical estimation procedures such as maximum likelihood estimation. 2.2.1
The General Neural Network Model
All neural networks possess certain fundamental features. For example, the basic building block of a neural network is the single "neural processing unit," or neuron, which takes the multitude of individual inputs x = (x_1, x_2, ..., x_n) to the neural unit, determines (through the learning algorithm) optimal connection weights w = (w_1, w_2, ..., w_n) to apply to these inputs, and then aggregates these weighted values in order to concatenate the multiple inputs into a single value A(w, x) = \sum_{i=1}^{n} w_i x_i. An activation function, F, is then applied, which takes the aggregated weighted value A(w, x) for the individual neural unit and produces an individual output F(A(w, x)) for the neural unit. The logistic activation function F(z) = 1/(1 + exp(-\eta z)) is the most commonly used activation function, although other sigmoid functions, such as the hyperbolic tangent function, have been used depending upon the situation. Figure 1(a) graphically displays the configuration of the single neural processing unit, or neuron, as described above.
[Figure 1(a). A single neural processing unit (neuron): the inputs x_i are weighted by w_i, aggregated as \sum_i x_i w_i, and passed through the activation F to produce the unit's output.]

[Figure 1(b). Multiple neural processing units arranged in an input layer, a hidden layer, and an output layer; each circle in the hidden layer represents a single neural processing unit.]
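In code, the single neural processing unit of Figure 1(a) amounts to a weighted sum followed by the logistic activation; the sketch below is our own illustration, with the steepness parameter eta set to 1.

import math

def neuron(inputs, weights, eta=1.0):
    # Single neural processing unit: aggregation A(w, x) = sum_i w_i x_i,
    # followed by the logistic activation F(z) = 1 / (1 + exp(-eta * z)).
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-eta * z))

# Example: three inputs (e.g., three financial ratios) with illustrative weights.
print(neuron([0.2, -0.5, 1.0], [0.8, 0.3, -0.4]))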
Until now, we have described a single neural processing unit. The same process, however, can be applied to an array of neural processing units. In practice, the neural processing units are grouped together to form composite topological structures called neural layers, and subsequently the units in sequential layers are
connected via interconnection weights to obtain an ultimate topology for the network (see Figure 1(b) above).

2.2.2
Network Neural Processing Units and Layers

Artificial neural networks are formed by modeling the interconnections between the individual neural processing units (just as the human brain functions through interconnections between the biological neurons through development of the synapses). Feed forward networks without a hidden layer are networks that have an input layer containing the input data information obtained from an external source and an output layer providing output information. It follows from the neural network topology given in Figure 1 that a neural network without a hidden layer, with a logistic activation function and a single 0-1 output node, has a mathematical structure isomorphic to the standard logistic regression model. Multilayer networks, which possess "hidden" or intermediate processing layers between the input layer and the output layer, can overcome many of the limitations of the simple network model without hidden layers in representing complex nonlinear rules. The topology of the neural network determines its ability to mathematically simulate the observed process of going between the input and output values. One aspect of this topology is the number of hidden layers to assume. It follows from a theorem of Kolmogorov (cf. Lorentz 1976, Kurkova 1992) that any continuous function of N variables can be computed using only linear summations and a nonlinear but continuously increasing function of only one variable. A consequence of this theorem is that a single hidden-layer network can approximate any continuous function of N variables. It has been proven that the multilayer architecture of the model allows for universal approximation of virtually any mapping relationship (Funahashi 1989, Hornik, Stinchcombe and White 1990). The hidden layers are used to represent non-linearities and interactions between variables (Lippmann 1987). Essentially, these results show that the class of single hidden layer neural network models is
"dense" in the space of all continuous functions of N variables, so that whatever the "true" (but unknown) functional relationship between the input variables and the output variable, it can be well approximated by a neural network model. This finding substantially encourages the use of a single hidden layer neural network for most complex behavioral decision problems. Indeed, such three-layer neural networks have been able to "predict" the failure of savings and loan companies (cf, Salchenberger, Cinar and Lash 1992). 2.2.3
The Back-Propagation Algorithm
Much like the techniques used for maximum likelihood estimation, the back-propagation algorithm can be viewed as a gradient search technique wherein the objective function is to minimize the mean squared error between the computed outputs of a multilayer feed-forward network corresponding to a given set of inputs and the actual observed outputs for those same inputs. A difference is that the back-propagation algorithm sequentially considers data records one at a time, readjusting the parameters (in a gradient search manner) after each observation. The statistical procedures (maximum likelihood, least squares), on the other hand, use an aggregated error of the estimation, almost as if in "batch" mode. In back propagation, the parameters are changed after each data point, and the process is continued with the same individual observations presented to the algorithm over and over. In statistical methods the data are presented to the algorithm only once, in a batch. Specifically, the neural network is trained by presenting an input pattern vector X to the network and computing forward through the network until an output vector O is obtained. The output error is computed by comparing the computed output O with the actual output for the input X. The network attempts to learn by adjusting the weights of each individual neural processing unit in such a fashion as to reduce this observed prediction error. Mathematically, the
effects of prediction errors are swept backward through the network, layer by layer, in order to associate a "squared error derivative" (delta) with each processing unit, to compute a gradient from each delta, and finally, to update the weights of each processing unit based upon the corresponding gradient. This process is then repeated, beginning with another input/output pattern. After all the patterns in the training set are exhausted, the algorithm examines the training set again and readjusts the weights throughout the entire network structure until either the objective function (the sum of squared prediction errors on the training sample) is sufficiently close to zero or the default number of iterations is reached. The precise computer algorithm implementing the back-propagation technique used in this study was obtained from Eberhart and Dobbins (1990); however, many commercially available programs are able to perform this analysis. To date, neural network mathematical techniques have been applied in many areas, such as pattern recognition, knowledge databases for stochastic information, robotic control, and financial decision-making. Salchenberger, Cinar, and Lash (1992), Coats and Fant (1993), Luther (1993), Huang, Dorsey, and Boose (1994), Brockett, Cooper, Golden, and Pitakong (1994), and Brockett et al. (2002) have shown that neural network models can perform better than traditional methods in the classification of financial distress.
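The training loop described above can be sketched as follows. This is a bare-bones illustration only (not the Eberhart and Dobbins code used in the study); bias terms and stopping tests are omitted, and all names are ours.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_backprop(data, n_hidden=3, lr=0.5, epochs=200, seed=0):
    # Single-hidden-layer feed-forward network trained by back-propagation on
    # squared error, with weights updated after each individual observation.
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, y in data:                      # y = 1 (insolvent) or 0 (solvent)
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
            o = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
            delta_o = (o - y) * o * (1 - o)    # output-layer error term (delta)
            for j in range(n_hidden):          # sweep the error backward, layer by layer
                delta_h = delta_o * W2[j] * h[j] * (1 - h[j])
                W2[j] -= lr * delta_o * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_h * x[i]
    return W1, W2

def predict(W1, W2, x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sigmoid(sum(w * hi for w, hi in zip(W2, h)))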
3
Neural Network Methods for Life Insurer Insolvency Prediction
Huang, Dorsey, and Boose (1994) use a feed-forward neural network method optimized with a genetic algorithm to forecast financial distress in life insurers. The data are limited to Insurance Regulatory Information System ratios for the total population of life insurers. The purpose of this study is to use the neural network on the IRIS data for life insurers to see if the use of this new tool is able to
measurably improve prediction. It does not attempt to identify an optimal set of variables for forecasting. Neural network forecasts are compared to discriminant analysis, k-nearest neighbor, and logit. The neural network is shown to produce superior estimates to the alternative methods. The IRIS variables for the life insurers are listed in Table 1.

Table 1. Description of IRIS variables.

R1   Net Change in Capital and Surplus
R2   Net Gain to Total Income
R3   Commissions and Expenses to Premiums and Deposits
R4   Investment Yield
R5   Non-Admitted Assets to Admitted Assets
R6   Real Estate to Capital and Surplus
R7   Investment in Affiliate to Capital and Surplus
R8   Surplus Relief
R9   Change in Premiums
R10  Change in Product Mix
R11  Change in Asset Mix
R12  Change in Reserving Ratio
The data for their study came from tapes provided by the National Association of Insurance Commissioners (NAIC). The first year of this study uses 1988 data because filing requirements were changed that year and data necessary to calculate the current set of ratios is not available before 1988. The identification of Financially Impaired Companies (FICs) in this study was taken from Best's Insolvency Study (1992), Life/Health Insurers 1976-1991. For the three years of this study, Best classified 42 companies as FIC in 1989, 41 in 1990, and 58 in 1991. Of those, the NAIC data had sufficient information to use 28 in 1989, 28 in 1990, and 44 in 1991. The total number of companies in the data set was 2020 in 1989, 1852 in 1990, and 1801 in 1991.
For each year, this study combines the financial classification (FIC or non-FIC) of the current year as the dependent variable with the 12 IRIS ratios of the prior year as the set of independent variables to form a year-based data set. Each year-based data set is the basis for the empirical study. For the 1989 FICs, 1988 data are used with each statistical tool to estimate the parameters of the model; then the model is used to predict FICs in-sample, in 1989, and in each of the next two years where, in each case, the prior year's data are used for the independent variables. In the Huang, Dorsey, and Boose (1994) study, the neural network (NN), k-nearest neighbor (KN), discriminant analysis (DA), and Logit methods are applied to the IRIS ratio model for predicting financially impaired insurers. Three criteria, misclassification cost, resubstitution risk, and the level of risk compared to a naive prediction, are used to evaluate the prediction efficiency of NN, KN, DA, and Logit. In general, for the in-sample prediction the misclassification cost is an increasing function of the cost weight. The exception is KN, which is insensitive to the cost weight change, and which is capable in-sample of correctly identifying all FICs as long as the cost weights are not equal. NN is the second best tool, clearly dominating DA and Logit throughout the range of weights, with the difference becoming more important as the relative weight of type I errors increases. Although KN is the best model for predicting in-sample, it deteriorates to the worst for the first year of out-of-sample prediction. NN is now clearly the best prediction tool, dominating the other methods at all cost weights. As measured by misclassification cost, the neural network has stable, efficient prediction capacity in both in-sample and out-of-sample prediction. KN has the smallest in-sample cost, but also has the highest out-of-sample cost. Logit and DA are considerably less efficient tools in both in-sample and out-of-sample prediction. The resubstitution risk of the in-sample prediction is again an
increasing function of the cost weight, except for KN, which again is best in-sample and is unaffected by changes in the cost weight. NN is again the second best prediction tool in-sample, and again its advantage over DA and Logit grows with the relative weights. As with the misclassification cost criterion, for the first year of out-of-sample prediction KN deteriorates from best predictor, and NN dominates DA and Logit. In general, when measured by resubstitution risk, NN is the only tool with reasonable risk in both in-sample and out-of-sample prediction. All traditional tools have higher risk than NN in out-of-sample prediction. All the methods are efficient in terms of risk saving relative to the naive prediction for in-sample prediction. KN has the most risk saving except for a cost weight equal to 1. NN is the second best, and Logit and DA are less efficient. In out-of-sample prediction, KN, DA, and Logit are less efficient than the naive prediction, and NN is the only tool having any prediction efficiency compared to the naive prediction. In short, NN is the only prediction method having any gains in prediction efficiency in in-sample and out-of-sample prediction compared to the naive prediction. Brockett, Golden, Jang, and Yang (2002) provide another comparative analysis of statistical methods and neural networks for predicting life insurers' insolvency. The primary purpose of this study is to examine and compare the performance of two statistical methods (multiple discriminant analysis and logistic regression analysis) and two artificial neural network methods (back-propagation and learning vector quantization) for predicting life insurer insolvency. The second purpose is to investigate and evaluate the usefulness and effectiveness of several variable sets for the two neural network models. The 22-variable, IRIS, FAST, and Texas EWIS variable sets are compared and evaluated. This study shows that back-propagation (BP) and learning vector quantization (LVQ) outperform the traditional statistical approaches for all four data sets, with a consistent superiority across the two different evaluation criteria: total
misclassification cost and resubstitution risk. The results also show that the 22-variable model and the Texas EWIS model are more efficient than the IRIS and FAST models in most comparisons. This study builds a current year model and a one-year prior model. The data are obtained from the Texas Department of Insurance (TDI), from annual statements filed for the years 1991 through 1994, as well as a list of the insurers that became "troubled" companies from 1991 to 1995. All solvent and insolvent life insurance companies whose business domiciles are in Texas and whose data are available for the entire study period (1991-1994) and for the entire variable sets are included in the samples. Any observation that does not exist throughout the entire period or has a blank data field is deleted from the sample data sets. Four sets of explanatory variables, the 22-variable model, the IRIS model, the FAST model, and the Texas EWIS model, are examined in this study. The 22-variable set is used as the benchmark to compare and validate the effectiveness of the other variable sets. An examination of previous studies is conducted to identify the variables that had been found to be more indicative of financial condition in the context of insolvency. Some of the variables are eliminated because too many of the companies have no data for them. Stepwise regression is used to reduce the variables, and the final set includes 22 variables. Table 2 presents these 22 variables. The Texas Department of Insurance (TDI) implemented an early warning information system (EWIS) in early 1992. For each company, a binary (0 or 1) variable is valued for each of a set of approximately 393 indicators, based on whether the calculated ratio or numerical value of the indicator is above some preselected threshold value. Weights are assigned to each binary indicator by the EW staff according to a subjective assessment of the importance or severity of the indicator. Each binary indicator is then multiplied by its assigned weight, and the resulting values are summed across all indicators to obtain an "EWIS company score" for each company.
The rank of the insurers for prioritization is determined based on this score.

Table 2. Variable description of the 22-variable data set.

V1   Gains/Premiums
V2   Liabilities/Surplus
V3   Net Gain from Operations after Tax & Dividends
V4   Net Investment Income
V5   Accident & Health Benefits/Total Benefits
V6   (Bonds + Stocks + Mortgages)/Cash & Investment Assets
V7   Cash Flow/Liabilities
V8   Capital & Surplus/Liabilities
V9   Change in Capital & Surplus
V10  Delinquent Mortgages/Capital & Surplus
V11  Change in Premium
V12  Insurance Leverage (Reserves/Surplus)
V13  Financial Leverage (Premiums/Surplus)
V14  Log of Growth in Assets
V15  Log of Growth in Premiums
V16  Log of Growth in Surplus
V17  Log of Cash Flow from Operations
V18  Non-Admitted Assets/Admitted Assets
V19  Reinsurance Ceded/Premium
V20  Separate Account Assets/Assets
V21  Total Benefits Paid/Capital & Surplus
V22  Real Estate/Assets
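The EWIS company score described just before Table 2 is a weighted sum of threshold indicators; the sketch below is only an illustration, with invented indicator names, thresholds, and weights (the actual TDI indicators and weights are not publicly disclosed).

def ewis_score(ratios, thresholds, weights):
    # Each indicator is 1 if the company's ratio exceeds its preselected
    # threshold and 0 otherwise; the score is the weighted sum of indicators.
    score = 0.0
    for name, value in ratios.items():
        indicator = 1 if value > thresholds[name] else 0
        score += weights[name] * indicator
    return score

# Hypothetical company with three (of the roughly 393) indicators.
ratios     = {"surplus_relief": 0.32, "change_in_premium": 0.45, "real_estate_to_assets": 0.08}
thresholds = {"surplus_relief": 0.25, "change_in_premium": 0.50, "real_estate_to_assets": 0.10}
weights    = {"surplus_relief": 3.0,  "change_in_premium": 2.0,  "real_estate_to_assets": 1.0}
print(ewis_score(ratios, thresholds, weights))   # 3.0: only the first indicator is tripped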
Some of the 393 indicators are automatically input into the EWIS system, while some of them are input manually. Correspondingly, two separate Texas EWIS variable sets are used in this analysis: an automated indicator set (EWIS-automated) and a non-automated indicator set (EWIS-significant). The variables used in the TDI EWIS are not publicly disclosed. The IRIS variables are listed in Table 1, and the FAST variables for the life insurers are listed below in Table 3.
Table 3. Description of FAST variables.

F1   Change in Capital & Surplus
F2   Surplus Relief
F3   Change in Net Prem./Ann. Cons./Dep. Type Funds
F4   A&H Bus. to Net Premiums & Annuity Cons. & Deposit Type Funds
F5   Change in Dir. & Ass. Annuities & Deposit Type Funds
F6   Stockholder's Dividends to PY Capital & Surplus
F7   Change in Net Income
F8   Trending of Net Income
F9   Surrenders to Premiums & Deposit Type Funds
F10  Grp. Surr. to Grp. Prem./Grp. Dep. Type Funds
F11  Change in Liquid Assets
F12  Aff. Investments/Receivables to Capital/Surplus
F13  Non Inv. Gr. Bonds & St. Inv. to Capital & Surplus & AVR
F14  Collateralized Mortgage Obligations to Total Bonds
F15  Problem Real Estate and Mortgages to Capital & Surplus & AVR
F16  Sch. BA Assets to Capital & Surplus & AVR
F17  Total Real Estate and Mortgages to Capital & Surplus & AVR
This study uses cross-validation tests that employ the weights or parameter estimates developed in the previous year(s) to predict the outcomes in the subsequent year. The learning, or training, samples consist of the insurance companies in 1992 and 1993. The parameter estimates and weights from the learning samples are used to test the sample that consists of the companies in 1994. The cross-validation test is practically useful in the sense that we can use the ex post information of the past to predict the ex ante outcomes for the current year. The number of companies in the training and test samples is listed in Tables 4 and 5. In evaluating the methods in terms of misclassification cost, different prior probabilities and misclassification cost ratios are used. The type I misclassification cost takes on the values (1, 10, 15, 20, 25, 30), while the type II misclassification cost is fixed at 1. The
prior probability of failure is set both to an equal prior probability and to the proportional prior probability in the sample. An evaluation of the MDA, logit, LVQ, and BP methods is performed for each of the five data sets using the current year and one-year prior models.

Table 4. Data sets used in the training samples.

Year   Data Set   Model             Insolvent   Solvent
1992   22-Var.    Current Year      64          463
                  One-Year Prior    70          463
       IRIS       Current Year      51          463
                  One-Year Prior    51          463
       FAST       Current Year      50          463
                  One-Year Prior    66          463
1993   22-Var.    Current Year      55          463
                  One-Year Prior    70          463
       IRIS       Current Year      51          463
                  One-Year Prior    51          463
       FAST       Current Year      54          463
                  One-Year Prior    66          463
Table 5. Data sets used in the test samples.

Year   Data Set   Model             Insolvent   Solvent
1994   22-Var.    Current Year      49          463
                  One-Year Prior    70          463
       IRIS       Current Year      49          463
                  One-Year Prior    51          463
       FAST       Current Year      48          463
                  One-Year Prior    66          463
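Before turning to the results, the total misclassification cost criterion just described can be made concrete with a short calculation; the predictions, counts, and cost weights below are placeholders, not figures from the study.

def total_misclassification_cost(actual, predicted, type1_cost, type2_cost=1.0):
    # actual / predicted: 1 = insolvent, 0 = solvent.
    # Type I error: an insolvent firm predicted solvent; type II: the reverse.
    type1 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    type2 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return type1_cost * type1 + type2_cost * type2

# Illustrative comparison across the cost-weight range used in the evaluations.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]           # one type I and one type II error
for w in (1, 10, 15, 20, 25, 30):
    print(w, total_misclassification_cost(actual, predicted, w))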
A consistent pattern of minimum total misclassification costs emerges. For lower cost ratios (under 3), the LVQ method tends to minimize total misclassification costs. This is consistent with its minimization of type II errors. For higher cost ratios (5-30), the back-propagation method is superior in minimizing total misclassification
costs because of the smallest numbers of type I errors. For these data sets, the MDA and logit methods (equal and proportional prior probability) consistently fail to minimize total misclassification costs. There is little difference in performance between the current year model and the one-year prior model. This lack of a significant difference between the performance of the current year and one-year prior models is the result of the use of a broader definition of financial distress. The definition of a financially troubled company used in this study includes insurers subject to the less severe regulatory supervision that invariably precedes insolvency. This results in a smoothing of the time effect in predicting bankruptcy. Given that the cost ratios used by researchers in previous studies have been no lower than 20 (and usually much higher), this study focuses on misclassification costs for cost ratios of 20, 25, and 30. For these cost ratios, MDA with proportional prior probability and logit with equal prior probability consistently yield an unacceptably high percentage of type I errors for all data sets. This leads to extremely high misclassification costs over generally accepted cost ratio ranges. The back-propagation method consistently dominates all other methods for both the current year and one-year prior models. LVQ is a close second. Logit with proportional prior probability is a distant third, but it provides marginally better results than MDA with equal prior probability, which is consistent with the findings of earlier studies. Minimization of resubstitution risk is also used as a criterion for evaluating the MDA, logit, LVQ, and BP methods. A consistent pattern of minimum resubstitution risk emerges for the 22-variable, IRIS, FAST, and Texas EWIS data sets. Assuming equal costs for type I and type II errors (cost ratio = 1), LVQ is the best for both the current year and one-year prior models. For higher cost ratios, back-propagation works best. For all data sets, MDA and logit (equal and proportional priors) are almost always inferior to the LVQ and BP methods. These results are consistent with those
obtained using the misclassification cost criterion. MDA with equal prior probability and logit with equal prior probability give unacceptably high levels of resubstitution risk for all data sets. The increased resubstitution risk is due to the additional weight placed on type I errors by using equal prior probabilities. Given the superiority of the neural network methods, this study further evaluates the usefulness of each data set using the neural network methods BP and LVQ. With the back-propagation method, and using total misclassification cost as the evaluation criterion, the 22-variable and Texas EWIS-significant variable sets yield the best results for the current year model. The FAST model is a close third. For one of the three years (1993), IRIS yields slightly better results than FAST, but only for the lowest two cost ratios (1 and 3). For 1992 data, the FAST and 22-variable data sets give identical results and dominate over the higher cost ratio range. The TDI-significant data set results are only slightly worse than these two. Using the IRIS data set yields higher misclassification costs than FAST due to the higher number of type I errors. For 1993 data, the TDI-significant set clearly gives the best results. The 22-variable and TDI-automated data sets are distant seconds. For these data, the results for the IRIS and FAST data sets are almost indistinguishable and are the worst of the five data sets considered. Using 1994 data, the TDI-significant data set again dominates clearly. The 22-variable data set is a distant second. Using these data, the IRIS results are somewhat better than the FAST results. Looking at all three years, it is difficult to see any consistent improvement in performance using FAST rather than IRIS. Both data sets generally perform worse than the TDI or 22-variable sets. For the one-year prior model, both the TDI and the 22-variable data sets yield the best results. The two TDI sets provide similar results, as is expected given the overlap of certain variables in these two sets. For 1992 data, the TDI-significant set performs best, with the
22-variable set a reasonably close second. FAST and IRIS are the worst, with FAST slightly better. Using 1993 data, the 22-variable data set is the best, and TDI-automated performs equally well. TDI-significant is a fairly close third. Once again, FAST and IRIS give the worst results, with FAST dominating slightly. For 1994 data, the 22-variable data set dominates. TDI-automated and TDI-significant give similar results and are second and third, respectively. In this year, IRIS performs better than FAST, due to a lower number of type I errors. Overall, for the current year and one-year prior models, the TDI and 22-variable data sets perform the best. The FAST and IRIS data sets are clearly inferior. Also, it is important to note that no consistent improvement in performance is obtained using FAST rather than IRIS. Using the resubstitution risk criterion results in a similar evaluation of the data sets to that under the misclassification cost criterion. With the LVQ method, and using the total misclassification cost criterion, the TDI-significant and TDI-automated data sets consistently tend to minimize costs for the current year model. For each of the three years of data examined, the 22-variable data set is a close third. Once again, the IRIS and FAST data sets produce the worst results, with IRIS dominating in two of the three years. For the one-year prior model, the TDI-significant data set again performs the best. TDI-automated comes in second, and the 22-variable data set is a somewhat distant third. It is important to note that the results for both the current year and one-year prior models are invariant to the level of the cost ratios. As with the current year model, there is consistently better performance from the IRIS data set than from the FAST data set in each of the three years. Generally, for both the current year and one-year prior models, the TDI and 22-variable data sets give the best performance using the misclassification cost criterion. It is important to note the consistent superiority of the IRIS data set over FAST, which was designed as an improvement to IRIS. Using the resubstitution risk criterion also leads to a similar evaluation of the data sets.
4
Neural Network Methods for Property-Liability Insurer Insolvency Prediction
Brockett et al. (1994) introduce a neural network artificial intelligence model as an early warning system for predicting insurer insolvency. To investigate a firm's propensity toward insolvency, the neural network approaches are applied to financial data for a sample of U.S. property-liability insurers. The results of the neural network methods are compared with those of discriminant analysis, A. M. Best ratings, and the National Association of Insurance Commissioners' Insurance Regulatory Information System ratings. The neural network results show high predictability and generalizability. This study focuses on financial variables available from the NAIC annual statement tapes. The Texas State Board of Insurance was involved in variable selection due to its interest in early warning to help firms prevent insolvency. Its rather large list of several hundred insurer financial health indicator variables was used as the first consideration set in the selection of variables. In cooperation with the Texas State Board of Insurance, this list was culled down using published research identifying variables that had failed to identify potential insolvencies in previous research. This resulted in 24 variables. These 24 variables were reduced further through a series of statistical analyses using a sample of solvent and insolvent Texas domestic property-liability insurers (using Texas domestic insurer data provided by the Texas State Board of Insurance for insolvencies during the period 1987 through 1990). The first step in the preliminary analysis was to examine each variable separately to see if a significant difference existed between the solvent and the insolvent insurers that could be detected by using that variable alone. Discriminant analysis, canonical analysis, collinearity tests, and
logistic regression were also run to check further which sets of variables might be eliminated due to multivariate considerations. The final subset of eight variables selected using the above techniques is shown in Table 6.

Table 6. Final set of variables of the property-liability insurers.

Variable Name   Description
V1              Policyholders' Surplus
V2              Capitalization Ratio
V3              Change in Invested Assets
V4              Investment Yields Based on Average Invested Assets
V5              Ratio of Significant Receivables from Parent, Subsidiaries, and Affiliates to Capital and Surplus
V6              Significant Increase in Current Year Net Underwriting Loss
V7              Surplus Aid to Surplus
V8              Liabilities to Liquid Assets
In this study, the back-propagation neural network approach is utilized to predict the insolvency of the property-liability insurers, and a model that could detect insolvency propensity using annual statement data from two years prior to insolvency is considered. Two hundred forty-three insurers were used, consisting of 60 insurers that ultimately became insolvent, and 183 insurers that remained solvent. The list of insolvent insurers was obtained by the Texas State Board of Insurance Research Division through written requests to each state insurance department for a list of all domestic insolvent companies during 1991 or 1992. Follow-up telephone calls ensured complete data records for each insolvency. Accordingly, insolvency data were obtained on firms not included in the A.M. Best list of insolvent firms. NAIC annual statement data for each company two years prior to insolvency were used in the analysis. The data were separated into three subsets. A large data set (n = 145), called set T1, represents 60 percent of the sample, and is used
for training the network. Two smaller sets, T2 and T3, each comprised 20 percent of the sample (n = 49 each). These sets were used, respectively, for determining when to cease training the neural network, and for testing the resultant trained neural network. The set T2 determined when to stop training the network according to the following rule: stop training when the predicting ability of the neural network trained on T1, as tested on T2, begins to drop. The subset T3 was then used to assess the trained network's out-of-sample predictability characteristics. The results of applying the neural networks methodology to predict financial distress based upon our selected variables show very good abilities of the network to learn the patterns corresponding to financial distress of the insurer. In all cases, the percent correctly classified in the training sample by the network technique is above 85 percent, the average percent correctly classified in the training samples is 89.7 percent, and the bootstrap estimate of the percent correctly classified on the holdout samples is an average of 86.3 percent (see Table 7). The bootstrap result shows that the calculated ability of the neural network model to predict financial distress is not merely due to upwardly biased assessments (which might occur when training and testing on the same sample as in the discriminant analysis results). The ability of the uncovered network structure to generalize to new data sets is also a very important component of the learning process for the network and is crucial to the importance of the results for managerial implementation. The bootstrap estimate of the percent correctly classified on the various testing samples shows that generalizability is obtained. Overall, when applied to the entire sample of 243 firms, the neural network method correctly identified the solvency status of 89.3 percent of the firms in the study. Of the firms that became insolvent in the next two years, the neural network correctly predicted 73 percent, and it correctly predicted the solvency of 94.5 percent of the firms that remained solvent for the two-year duration.
Table 7. Neural network training and predictive accuracy for the property-liability insurers.

Training Sample (T1) Results
    Sample Size: 145
    Percentage Correctly Classified: 89.7
Stopping Rule Sample (T2) Results
    Sample Size: 49
    Percentage Correctly Classified: 87.8
Bootstrapped Test Sample (T3)
    Sample Size: 49
    Mean Percentage Correctly Classified: 86.3
Entire Sample Results
    Sample Size: 243
    Overall Percentage Correctly Classified: 89.3
    Percentage of Insolvent Firms Correctly Classified: 73.3
    Percentage of Solvent Firms Correctly Classified: 94.5
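The T2 stopping rule described in the text above can be sketched as follows. This is only an illustrative sketch: train_one_epoch, accuracy, get_weights and set_weights are hypothetical stand-ins for whatever network implementation is used, not routines from the original study.

    def train_with_stopping_rule(network, T1, T2, max_epochs=500):
        # Keep training on T1 and stop as soon as predictive ability on T2 drops,
        # reverting to the weights that did best on T2.
        best_t2_accuracy = -1.0
        best_weights = network.get_weights()          # hypothetical accessor
        for epoch in range(max_epochs):
            train_one_epoch(network, T1)              # one back-propagation pass over T1
            t2_accuracy = accuracy(network, T2)       # percent correctly classified on T2
            if t2_accuracy < best_t2_accuracy:
                network.set_weights(best_weights)     # ability on T2 began to drop: stop
                break
            best_t2_accuracy = t2_accuracy
            best_weights = network.get_weights()
        return network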
To compare the results of the neural network model to other methods, several other classification techniques were examined. This study performed a linear discriminant analysis using the same eight variables obtained from the variable selection procedure described above. Such an approach assumes that the explanatory variables in each group form a multivariate normal distribution with the same covariance matrix for the two groups. The results of the discriminant analysis are presented in Table 8. As shown, the statistical method, although not doing as well as the neural networks methodology, showed a respectable 85 percent correct classification rate for insolvent firms and an 89.6 percent correct classification rate for solvent firms. An additional analysis was performed using the NAIC's IRIS ratios (see Table 9). The IRIS system correctly identifies only 26 insolvencies out of the 60 firms that eventually became insolvent within two years, or 43.3 percent of the insolvent companies. On the other hand, 96.7 percent of the solvent firms were correctly identified.
Table 8. Discriminant analysis classification.

                          Predicted Status
True Status     Insolvent          Solvent            Total
Insolvent       51/60 = 85%        9/60 = 15%         60
Solvent         19/183 = 10.4%     164/183 = 89.6%    183
Total           70                 173                243
Table 9. NAIC IRIS classification table.

                          Predicted Status
True Status     Insolvent          Solvent            Total
Insolvent       26/60 = 43.3%      34/60 = 56.7%      60
Solvent         6/183 = 3.3%       177/183 = 96.7%    183
Total           32                 211                243
Table 10. A.M. Best ratings.

                        A+    A    A-   B+    B    B-   C+    C    C-   Not Rated   Total
Insolvent Companies      1    3     1    2    0     1    4    1     0          47      60
Solvent Companies       56   22    17    4    3     1    1    0     0          79     183
Another source of frequently used insurer rating information is the A.M. Best Company. Best's ratings are presented in Table 10. These ratings were not very useful for predicting insolvencies, partly because of the very large percentage of nonrated companies (78.3 percent) in the insolvent group. A large subset of solvent firms (43.2 percent) also was not rated by A.M. Best.
5 Conclusion and Further Directions
The neural network artificial intelligence methods show significant promise for providing useful early warning signals of market failure (insolvency) within the insurance industry, and the results of using neural networks dominate the IRIS, FAST, and some other statistical methods. In addition, other characteristics positively differentiate neural network models from alternative techniques. For one, the neural network model can be updated (learning supplemented) without completely retraining the network. The current version of the interconnection weights can be used as a starting point for the iterations once new data become available. Accordingly, the system of learning can adapt as different economic influences predominate. In essence, the neural network model can evolve and change as the data, system, or problem changes. Other "nonlearning" or static models do not have these built-in properties.

Despite the positive classification results obtained, we still believe that further study is warranted. There are several avenues of research that might potentially enhance the predictability, the robustness, or the interpretability of the neural networks approach. These avenues include the development and inclusion of qualitative data and trend data, which could add significantly to the robustness and accuracy of the model developed. In addition, because of states' different economic climates and regulatory requirements, a comparison of the appropriate models for insolvency prediction in different states (and nationwide) should be investigated to ascertain the impact of certain state-controlled regulatory requirements. These results would then suggest public policy directives concerning these issues for the purpose of decreasing insolvency propensity.
References

Altman, E.I. (1968), "Financial ratios, discriminant analysis and the prediction of corporate bankruptcy," Journal of Finance, 23, pp. 589-609.
Altman, E.I., Haldeman, R.G., and Narayanan, P. (1977), "ZETA analysis: a new model to identify bankruptcy risk of corporations," Journal of Banking and Finance, 1, pp. 29-54.
A.M. Best Company (1992), "Best's insolvency study, life-health insurers," A.M. Best Company, Oldwick, NJ.
Ambrose, J.M. and Seward, J.A. (1988), "Best's ratings, financial ratios and prior probabilities in insolvency prediction," Journal of Risk and Insurance, 55, pp. 229-244.
BarNiv, R. and Hershbarger, R.A. (1990), "Classifying financial distress in the life insurance industry," Journal of Risk and Insurance, 57, pp. 110-136.
BarNiv, R. and McDonald, J.B. (1992), "Identifying financial distress in the insurance industry: a synthesis of methodological and empirical issues," Journal of Risk and Insurance, 59, pp. 543-573.
Barrese, J. (1990), "Assessing the financial condition of insurers," CPCU Journal, 43(1), pp. 37-46.
Bierman, H. Jr. (1960), "Measuring financial liquidity," Accounting Review, 35, pp. 628-632.
Borch, K. (1970), "The rescue of an insurance company after ruin," ASTIN Bulletin, 6, pp. 66-69.
Brockett, P.L., Cooper, W.W., Golden, L.L., and Pitaktong, U. (1994), "A neural network method for obtaining an early warning of insurer insolvency," Journal of Risk and Insurance, 61, pp. 402-424.
Brockett, P.L., Golden, L.L., Jang, J., and Yang, C. (2002), "Comparative analysis of statistical methods and neural networks for predicting life insurers' insolvency," Working Paper, The University of Texas at Austin.
Coats, P.K. and Fant, L.F. (1993), "Recognizing financial distress patterns using a neural network tool," Financial Management, 22, pp. 142-155.
Cummins, J.D., Harrington, S.E., and Klein, R. (1995), "Insolvency experience, risk-based capital, and prompt corrective action in property-liability insurance," Journal of Banking and Finance, 19, pp. 511-527.
Eberhart, R.C. and Dobbins, R.W. (1990), Neural Network PC Tools: A Practical Guide, Academic Press, New York.
Edmister, R.O. (1972), "An empirical test of financial ratio analysis for small business failure predictions," Journal of Financial and Quantitative Analysis, 7, pp. 1477-1493.
Funahashi, K. (1989), "On the approximate realization of continuous mappings by neural networks," Neural Networks, 2, pp. 183-192.
Grace, M., Harrington, S., and Klein, R. (1998), "Risk-based capital and solvency screening in property-liability insurance: hypotheses and empirical tests," Journal of Risk and Insurance, 65(2), pp. 213-243.
Harrington, S. and Nelson, J.M. (1986), "A regression-based methodology for solvency surveillance in the property-liability insurance industry," Journal of Risk and Insurance, 53, pp. 583-605.
Hebb, D.O. (1949), The Organization of Behavior, Wiley, New York.
Hornik, K., Stinchcombe, M., and White, H. (1990), "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Networks, 3, pp. 551-560.
Huang, C.S., Dorsey, R.E., and Boose, M.A. (1994), "Life insurer financial distress prediction: a neural network model," Journal of Insurance Regulation, 13, pp. 133-167.
Kurkova, V. (1992), "Kolmogorov theorem and multilayer neural networks," Neural Networks, 5, pp. 501-506.
Lippmann, R.P. (1987), "An introduction to computing with neural nets," IEEE ASSP Magazine, 4(2), pp. 4-22.
Lorentz, G.G. (1976), "The 13th problem of Hilbert," in Browder, F.E. (ed.), Mathematical Developments Arising From Hilbert's Problems, American Mathematical Society, Providence.
Luther, R.K. (1993), "Predicting the outcome of Chapter 11 bankruptcy: an artificial neural network approach," Ph.D. Dissertation, The University of Mississippi.
Odom, M.D. and Sharda, R. (1990), "A neural network model for bankruptcy prediction," Proceedings of the International Joint Conference on Neural Networks, volume 2, pp. 163-168.
Quirk, J.P. (1961), "The capital structure of firms and the risk of failure," International Economic Review, 2(2), pp. 210-228.
Salchenberger, L.M., Cinar, E.M., and Lash, N.A. (1992), "Neural networks: a new tool for predicting thrift failures," Decision Sciences, 23(4), pp. 899-915.
Sinkey, J.F. Jr. (1975), "Multivariate statistical analysis of the characteristics of problem banks," Journal of Finance, 30, pp. 21-36.
Texas Department of Insurance (1993), "Report on early warning," Texas Department of Insurance, Austin, TX.
Tinsley, P.A. (1970), "Capital structure, precautionary balances and the valuation of the firm: the problem of financial risk," Journal of Financial and Quantitative Analysis, 5(1), pp. 33-62.
Williams, W.H. and Goodman, M.L. (1971), "A statistical grouping of corporations by their financial characteristics," Journal of Financial and Quantitative Analysis, 6(4), pp. 1095-1104.
Chapter 10

Illustrating the Explicative Capabilities of Bayesian Learning Neural Networks for Auto Claim Fraud Detection

S. Viaene, R.A. Derrig, and G. Dedene

This chapter explores the explicative capabilities of neural network classifiers with automatic relevance determination weight regularization, and reports the findings from applying these networks for personal injury protection automobile insurance claim fraud detection. The automatic relevance determination objective function scheme provides us with a way to determine which inputs are most informative to the trained neural network model. An implementation based on MacKay's (1992a) evidence framework approach to Bayesian learning is proposed as a practical way of training such networks. The empirical evaluation is based on a data set of closed claims from accidents that occurred in Massachusetts, USA during 1993.
1 Introduction
In recent years, the detection of fraudulent claims has blossomed into a high-priority and technology-laden problem for insurers (Viaene 2002). Several sources speak of the increasing prevalence of insurance fraud and the sizeable proportions it has taken on (see, for example, Canadian Coalition Against Insurance Fraud 2002, Coalition Against Insurance Fraud 2002, Comite Europeen des Assurances 1996, Comite Europeen des Assurances 1997). In September 2002, a special issue of the Journal of Risk and Insurance (Derrig 2002) was
devoted to insurance fraud topics. It covers a significant part of previous and current technical research directions regarding insurance (claim) fraud prevention, detection and diagnosis. More systematic electronic collection and organization of, and company-wide access to, coherent insurance data have stimulated data-driven initiatives aimed at analyzing and modeling the formal relations between fraud indicator combinations and claim suspiciousness, in order to upgrade fraud detection with (semi-)automatic, intelligible, accountable tools. Machine learning and artificial intelligence solutions are increasingly explored for the purpose of fraud prediction and diagnosis in the insurance domain. Still, all in all, little work has been published on the latter. Most of the state-of-the-art practice and methodology on fraud detection remains well-protected behind the thick walls of insurance companies. The reasons are legion. Viaene et al. (2002) reported on the results of a predictive performance benchmarking study. The study involved the task of learning to predict expert suspicion of personal injury protection (PIP) (no-fault) automobile insurance claim fraud. The data that were used consisted of closed real-life PIP claims from accidents that occurred in Massachusetts, USA during 1993, and that had previously been investigated for suspicion of fraud by domain experts. The study contrasted several instantiations of a spectrum of state-of-the-art supervised classification techniques, that is, techniques aimed at algorithmically learning to allocate data objects (input or feature vectors) to a priori defined object classes, based on a training set of data objects with known class or target labels. Among the considered techniques were neural network classifiers trained according to MacKay's (1992a) evidence framework approach to Bayesian learning. These neural networks were shown to consistently score among the best for all evaluated scenarios. Statistical modeling techniques such as logistic regression and linear and quadratic discriminant analysis are widely used for modeling and prediction purposes. However, their predetermined functional form and restrictive (often unfounded) model assumptions limit their usefulness.
The role of neural networks is to provide general and efficiently scalable parameterized nonlinear mappings between a set of input variables and a set of output variables (Bishop 1995). Neural networks have been shown to be very promising alternatives for modeling complex nonlinear relationships (see, for example, Desai et al. 1996, Lacher et al. 1995, Lee et al. 1996, Mobley et al. 2000, Piramuthu 1999, Salchenberger et al. 1997, Sharda and Wilson 1996). This is especially true in situations where a lack of domain knowledge prevents any well-founded choice of model selection bias. Even though the modeling flexibility of neural networks makes them a very attractive and interesting alternative for pattern learning purposes, many practical problems still remain when implementing neural networks: What is the impact of the initial weight choice? How should the weight decay parameter be set? How can the neural network be prevented from fitting the noise in the training data? These and other issues are often dealt with in ad hoc ways. Nevertheless, they are crucial to the success of any neural network implementation. Another major objection to the use of neural networks for practical purposes remains their widely proclaimed lack of explanatory power. Neural networks, it is said, are black boxes. In this chapter Bayesian learning (Bishop 1995, Neal 1996) is suggested as a way to deal with these issues during neural network training in a principled, rather than an ad hoc, fashion. We set out to explore and demonstrate the explicative capabilities of neural network classifiers trained using an implementation of MacKay's (1992a) evidence framework approach to Bayesian learning for optimizing an automatic relevance determination (ARD) regularized objective function (MacKay 1994, Neal 1998). The ARD objective function scheme allows us to determine the relative importance of inputs to the trained model. The empirical evaluation in this chapter is based on the modeling work performed in the context of the baseline benchmarking study of Viaene et al. (2002).
The importance of input relevance assessment needs no underlining. It is not uncommon for domain experts to ask which inputs are relatively more important. Specifically, Which inputs contribute most to the detection of insurance claim fraud? This is a very reasonable question. As such, methods for input selection are not only capable of improving the human understanding of the problem domain, specifically, the diagnosis of insurance claim fraud, but also allow for more efficient and lower-cost solutions. In addition, penalization or elimination of (partially) redundant or irrelevant inputs may also effectively counter the curse of dimensionality. In practice, adding inputs (even relevant ones) beyond a certain point can actually lead to a reduction in the performance of a predictive model. This is because, faced with limited data availability, as we are in practice, increasing the dimensionality of the input space will eventually lead to a situation where this space is so sparsely populated that it very poorly represents the true model in the data. This phenomenon has been termed the curse of dimensionality (Bellman 1961). The ultimate objective of input selection is therefore to select a minimum number of inputs required to capture the structure in the data. This chapter is organized as follows. Section 2 revisits some basic theory on multilayer neural networks for classification. Section 3 elaborates on input relevance determination. The evidence framework approach to Bayesian learning for neural network classifiers is discussed in Section 4. The theoretical exposition in the first three sections is followed by an empirical evaluation. Section 5 describes the characteristics of the 1993 Massachusetts, USA PIP closed claims data that were used. Section 6 describes the setup of the empirical evaluation and reports its results. Section 7 concludes this chapter.
2 Neural Networks for Classification
Figure 1 shows a simple three-layer neural network. It is made up of an input layer, a hidden layer and an output layer, each consisting of a
number of processing units. The layers are interconnected by modifiable weights, represented by the links between the layers. A bias unit is connected to each unit other than the input units. The function of a processing unit is to accept signals along its incoming connections and (nonlinearly) transform a weighted sum of these signals, termed its activation, into a single output signal. In analogy with neurobiology, the units are sometimes called neurons. The discussion will be restricted to the use of neural networks for binary classification, where the input units represent individual components of an input vector, and a single output unit is responsible for emitting the values of the discriminant function used for classification. One then commonly opts for a multilayer neural network with one hidden layer. In principle, such a three-layer neural network can implement any continuous function from input to output, given a sufficient number of hidden units, proper nonlinearities and weights (Bishop 1995). We start with a description of the feedforward operation of such a neural network, given a training set D = {(x^i, t_i)}_{i=1}^N with input vectors x^i = (x_1^i, ..., x_n^i)^T ∈ ℝ^n and class labels t_i ∈ {0, 1}. Each input vector component is presented to an input unit. The output of each input unit equals the corresponding component in the input vector. The output of hidden unit j ∈ {1, ..., h}, that is, z_j(x), and the output of the output layer, that is, y(x), are then computed as follows:

    Hidden Layer:  z_j(x) = f_1( b_{1,j} + Σ_{k=1}^{n} u_{j,k} x_k ),    (1)

    Output Layer:  y(x) = f_2( b_2 + Σ_{j=1}^{h} v_j z_j(x) ),    (2)

where b_{1,j} ∈ ℝ is the bias corresponding to hidden unit j, u_{j,k} ∈ ℝ denotes the weight connecting input unit k to hidden unit j, b_2 ∈ ℝ is the output bias, and v_j ∈ ℝ denotes the weight connecting hidden unit j to the output unit. The biases and weights together make up the weight vector w.
Figure 1. Example three-layer neural network.
f_1(·) and f_2(·) are termed transfer or activation functions and essentially allow a multilayer neural network to perform complex nonlinear function mappings. Input units too have activation functions, but, since these are of the form f(a) = a, they are not explicitly represented. There are many possible choices for the (nonlinear) transfer functions of the hidden and output units. For example, neural networks of threshold transfer functions were among the first to be studied, under the name perceptrons (Bishop 1995). The antisymmetric version of the threshold transfer function takes the form of the sign function:
    sign(a) = +1 if a ≥ 0, and −1 if a < 0.    (3)

Multilayer neural networks are generally called multilayer perceptrons (MLPs), even when the activation functions are not threshold functions. Transfer functions are often conveniently chosen to be continuous and differentiable. We use a logistic sigmoid transfer function in the output layer, that is, sigm(a) = 1 / (1 + exp(−a)). The term sigmoid means S-shaped, and the logistic form of the sigmoid maps the interval (−∞, ∞) onto (0, 1). In the hidden layer we use hyperbolic tangent transfer functions, that is, tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)). The latter are S-shaped too, and differ from logistic sigmoid functions only through a linear transformation. Specifically, the activation function f(a) = tanh(a) is equivalent to the activation function f(a) = sigm(a) if we apply a linear transformation ã = a/2 to the input and a linear transformation f̃ = 2f − 1 to the output, since tanh(a/2) = 2 sigm(a) − 1. The transfer functions in the hidden and output layer are standard choices. The logistic sigmoid transfer function of the output unit allows the MLP classifier's (continuous) output y(x) to be interpreted as an estimated posterior probability of the form p(t = 1|x), that is, the probability of class t = 1, given a particular input vector x (Bishop 1995). In that way, the MLP produces a probabilistic score per input vector. These scores can then be used for scoring and ranking purposes (as, for example, in applications of customer scoring and credit scoring) and for decision making. The Bayesian posterior probability estimates produced by the MLP are used to classify input vectors into the appropriate predefined classes. This is done by choosing a classification threshold in the scoring interval, in casu [0, 1]. Optimal Bayes decision making dictates that an input vector should be assigned to the class associated with the minimum expected risk or cost (Duda et al. 2000).
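As an illustration of the feedforward pass in equations (1)-(2) with the transfer functions just described, here is a minimal numpy sketch. The dimensions and weight values are arbitrary; this is not code from the chapter.

    import numpy as np

    def sigm(a):
        # logistic sigmoid transfer function, maps activations into (0, 1)
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_forward(x, U, b1, v, b2):
        # Forward pass of the three-layer network in equations (1)-(2):
        #   hidden:  z_j = tanh(b1_j + sum_k U[j, k] * x_k)
        #   output:  y   = sigm(b2 + sum_j v_j * z_j)
        z = np.tanh(b1 + U @ x)     # hidden layer outputs, shape (h,)
        y = sigm(b2 + v @ z)        # scalar score interpreted as p(t = 1 | x)
        return y

    # Illustrative dimensions: n = 4 inputs, h = 3 hidden units.
    rng = np.random.default_rng(0)
    U = rng.normal(size=(3, 4)); b1 = np.zeros(3)
    v = rng.normal(size=3); b2 = 0.0
    print(mlp_forward(np.array([0.2, -1.0, 0.5, 1.3]), U, b1, v, b2))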
Optimal Bayes assigns classes according to the following criterion:

    argmin_j Σ_{k=0}^{1} p(k|x) L_{j,k}(x),    (4)

where p(k|x) is the conditional probability of class k, given a particular input vector x, and L_{j,k}(x) is the cost of classifying a data instance with input vector x and actual class k as class j. Note that L_{j,k}(x) > 0 represents a cost, and that L_{j,k}(x) < 0 represents a benefit. This translates into the classification rule that assigns class 1 if

    p(t = 1|x) > ( L_{1,0}(x) − L_{0,0}(x) ) / ( L_{0,1}(x) − L_{1,1}(x) + L_{1,0}(x) − L_{0,0}(x) ),    (5)

and class 0 otherwise, assuming that the cost of labeling a data instance incorrectly is always greater than the cost of labeling it correctly, and that class 0 is the default in case of equal expected costs. In case L_{j,k} is independent of x, that is, there is a fixed cost associated with assigning a data instance to class j when it in fact belongs to class k, (5) defines a fixed classification threshold in the scoring interval [0, 1].

Weight vector w needs to be estimated using the training data D = {(x^i, t_i)}_{i=1}^N. Learning works by randomly initializing and then iteratively adjusting w so as to optimize an objective function E_D, typically a sum of squared errors. That is:

    E_D = (1/2) Σ_{i=1}^{N} (t_i − y_i)²,    (6)
where y_i stands for y(x^i). The backpropagation algorithm, based on gradient descent on E_D, is one of the most popular methods for supervised learning of MLPs. While basic backpropagation is simple, flexible and general, a number of heuristic modifications to gradient descent have been proposed to improve its performance. For an overview of alternative training schemes, see Bishop (1995).
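For fixed, input-independent costs, rule (5) boils down to a single threshold on the score p(t = 1|x). A small sketch, with made-up cost values:

    def fixed_cost_threshold(L10, L00, L01, L11):
        # Threshold implied by rule (5) when the costs L_{j,k} do not depend on x:
        # assign class 1 whenever p(t = 1 | x) exceeds the returned value.
        return (L10 - L00) / (L01 - L11 + L10 - L00)

    # Illustrative costs (not taken from the chapter): missing a class-1 instance
    # (L01) costs five times as much as a false alarm (L10); correct decisions
    # cost nothing.
    threshold = fixed_cost_threshold(L10=1.0, L00=0.0, L01=5.0, L11=0.0)
    print(threshold)                          # 1/6, roughly 0.167
    print(1 if 0.30 > threshold else 0)       # a score of 0.30 is assigned class 1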
The sum of squared errors criterion is derived from the maximum likelihood principle by assuming target data generated from a smooth deterministic function with added Gaussian noise. This is clearly not a sensible starting point for binary classification problems, where the targets are categorical and the Gaussian noise model does not provide a good description of their probability density. The cross-entropy objective function is more suitable (Bishop 1995). At base, cross-entropy optimization maximizes the likelihood of the training data by minimizing its negative logarithm. Given training data D = {(x^i, t_i)}_{i=1}^N and assuming the training data instances are drawn independently from a Bernoulli distribution, the likelihood of observing D is given by:

    Π_{i=1}^{N} y_i^{t_i} (1 − y_i)^{1−t_i},    (7)

where we have used the fact that we would like the value of the MLP's output y(x) to represent the posterior probability p(t = 1|x). Maximizing (7) is equivalent to minimizing its negative logarithm, which leads to the cross-entropy error function of the form:

    E_D = − Σ_{i=1}^{N} ( t_i ln y_i + (1 − t_i) ln(1 − y_i) ).    (8)
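A direct implementation of the cross-entropy error (8); the small clipping constant is a numerical-safety detail not discussed in the text, and the outputs and labels below are made up.

    import numpy as np

    def cross_entropy_error(t, y, eps=1e-12):
        # Cross-entropy objective E_D of equation (8); eps guards against log(0).
        y = np.clip(y, eps, 1.0 - eps)
        return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    t = np.array([1, 0, 1, 1, 0])               # class labels t_i
    y = np.array([0.9, 0.2, 0.6, 0.8, 0.1])     # network outputs y(x^i)
    print(cross_entropy_error(t, y))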
The ultimate goal of learning is to produce a model that performs well on new data objects. If this is the case, we say that the model generalizes well. Performance evaluation aims at estimating how well a model is expected to perform on data objects beyond those in the training set, but from the same underlying population and representative of the same operating conditions. This is not a trivial task, given the fact that, typically, we are to design and evaluate models from a finite sample of data. For one, looking at a model's performance on the actual training data that underlie its construction is likely to overestimate its future performance. A common
option is to assess its performance on a separate test set of previously unseen data. This way we may, however, lose precious data for training. Cross-validation (see, for example, Hand 1997) provides an alternative. k-fold cross-validation is a resampling technique that randomly splits the data into k disjoint sets of approximately equal size, termed folds. Then, k classifiers are trained, each time with a different fold held out for performance assessment. This way all data instances are used for training, and each one is used exactly once for testing. Finally, performance estimates are averaged to obtain an estimate of the true generalization performance. Two popular measures for gauging classification performance are the percentage correctly classified (PCC) and the area under the receiver operating characteristic curve (AUROC). The PCC on (test) data, an estimate of a classifier's probability of a correct response, is the proportion of data instances that are correctly classified (Hand 1997). The receiver operating characteristic curve (ROC) is a two-dimensional visualization of the false alarm rate N_{1,0} / (N_{1,0} + N_{0,0}) versus the true alarm rate N_{1,1} / (N_{1,1} + N_{0,1}) for various values of the classification threshold imposed on the value range of a scoring rule or continuous-output classifier, where N_{j,k} is the number of data instances with actual class k that were classified as class j. The ROC illustrates the decision making behavior of the scoring rule for alternative operating conditions (for example, misclassification costs), in casu summarized in the classification threshold. The ROC essentially allows us to evaluate and visualize the quality of the rankings produced by a scoring rule. The AUROC is a single-figure summary measure associated with ROC performance assessment. It is equivalent to the nonparametric Wilcoxon-Mann-Whitney statistic, which estimates the probability that a randomly chosen positive data instance is correctly ranked higher than a randomly selected nonpositive data instance (Hand 1997, Hanley and McNeil 1982).
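The AUROC can be computed directly from the Wilcoxon-Mann-Whitney interpretation given above. The following sketch counts ties as one half; the scores and labels are made up.

    import numpy as np

    def auroc(scores, labels):
        # AUROC as the Wilcoxon-Mann-Whitney statistic: the estimated probability
        # that a randomly chosen positive instance receives a higher score than a
        # randomly chosen nonpositive one (ties counted as one half).
        scores, labels = np.asarray(scores, float), np.asarray(labels, int)
        pos, neg = scores[labels == 1], scores[labels == 0]
        wins = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return (wins + 0.5 * ties) / (len(pos) * len(neg))

    print(auroc([0.9, 0.8, 0.4, 0.35, 0.1], [1, 1, 0, 1, 0]))   # 5/6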
The best generalization performance is achieved by a model whose complexity is neither too small nor too large. Besides choosing an MLP architecture or functional form that offers the necessary complexity to capture the true structure in the data, during optimization we have to prevent it from fitting the noise or idiosyncrasies of the training data. The latter is known as overfitting. Countering overfitting is often realized by monitoring the predictive performance of the neural network on a separate validation set during training. When the performance measure evaluated on the latter starts to deteriorate, training is stopped, thereby preventing the neural network from fitting the noise in the training data. An alternative, eliminating the need for a separate validation set, is to add a regularization or penalty term to the objective function as follows (Bishop 1995):

    E = E_D + α E_W,    (9)

where the regularization parameter α ∈ ℝ+ and, typically, the regularizer E_W = (1/2) Σ_j w_j², with j running over all elements of the weight vector w. This simple regularizer is known as weight decay (or ridge regression in conventional curve fitting), as it penalizes large weights. The latter encourages smoother network mappings and thereby decreases the risk of overfitting. The parameter α controls the extent to which E_W influences the solution and therefore controls the complexity of the model. Several more sophisticated regularization schemes have been proposed in which weights are penalized individually or pooled and penalized as groups (see, for example, Bengio 2000, Grandvalet 1998, Tibshirani 1996). If weights are pooled by incoming and outgoing connections of a unit, unit-based penalization is performed. ARD (MacKay 1994, Neal 1998) is such a scheme. The goal here is to penalize inputs according to their relevance. The objective function then takes the following form:

    E = E_D + Σ_m α_m E_{W_m},    (10)

where α_m ∈ ℝ+ and E_{W_m} = (1/2) Σ_j (w_{m,j})², with j running over all
weights of weight class w_m. The ARD regularizer considers n + 3 weight classes within weight vector w, each associated with a regularization parameter α_m. Specifically, ARD associates a single α_m with each group of weights corresponding to the connections from an input unit to the hidden layer. Three additional regularization parameters are introduced: one associated with the hidden layer biases, one associated with the connections from the hidden layer units to the output unit, and one associated with the output bias. ARD's soft input selection mechanism is discussed next.
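Before turning to that, here is a minimal sketch of the ARD objective (10). The grouping of the weights and the α values below are purely illustrative, not taken from the chapter.

    import numpy as np

    def ard_objective(E_D, weight_groups, alphas):
        # ARD objective of equation (10): E = E_D + sum_m alpha_m * E_Wm,
        # with E_Wm = 0.5 * (sum of squared weights in weight class m).
        # weight_groups: list of 1-D arrays, one per weight class (one per input,
        # plus hidden biases, hidden-to-output weights, and the output bias).
        penalties = [alpha * 0.5 * np.sum(w ** 2)
                     for alpha, w in zip(alphas, weight_groups)]
        return E_D + sum(penalties)

    # Illustrative example: two inputs feeding three hidden units, so the first
    # two groups hold the fan-out weights of each input; the alphas are made up.
    groups = [np.array([0.4, -0.2, 0.1]),      # weights leaving input 1
              np.array([1.5, -1.1, 0.9]),      # weights leaving input 2
              np.array([0.05, 0.02, -0.01]),   # hidden biases
              np.array([0.7, -0.3, 0.2]),      # hidden-to-output weights
              np.array([0.1])]                 # output bias
    alphas = [5.0, 0.1, 1.0, 1.0, 1.0]
    print(ard_objective(E_D=12.3, weight_groups=groups, alphas=alphas))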
3 Input Relevance Determination
The ARD objective function allows us to control the size of the weights associated with each input separately. Large α_m values suppress the weights exiting from the respective input and effectively switch its contribution to the functioning of the MLP classifier to a lower level. This means that all inputs can be rank ordered according to their optimized α_m values. Inputs associated with larger α_m values are less relevant to the neural network. The most relevant inputs will have the lowest α_m. Note that, since the method is not scale-invariant, the inputs should be normalized in order to produce meaningful comparisons (Marquardt 1980). Typically, all inputs are statistically normalized by subtracting their mean over the data and dividing by their standard deviation. One of the main advantages of MLP-ARD (for training sets that are not too small) is that it allows for the inclusion of a large number of potentially relevant inputs without damaging effects (MacKay 1994, Neal 1998). This is especially interesting in domains where the number of available distinct training cases is rather limited and the dimensionality of the input vectors relatively large. Now, removing irrelevant inputs, a priori assumed at least somewhat relevant, is no longer required, since they are directly dealt with by the ARD regularization parameter scheme. This clearly constitutes a remarkable change in attitude compared to the standard way of dealing with this
situation. As Breiman (2001a) puts it in his invited talk at the 2001 Nonparametrics in Large, Multidimensional Data Mining Conference: The standard procedure when fitting data models such as logistic regression is to delete variables. This is referred to as hard input selection. Breiman (2001a) refers to Diaconis and Efron (1983) to back this up: ... statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available. However, says Breiman (2001a): Newer methods of data mining thrive on variables—the more the better. Input selection then takes a soft form. Breiman's (2001b) random forests constitute an example of this rationale, as does MLP-ARD. The philosophy of the soft input selection mechanism underlying ARD is clearly stated by Neal (1998) as follows: Variable selection seems to me to be a dubious procedure, since if we think an input might be relevant, we will usually also think that it is probably at least a little bit relevant. Situations where we think all variables are either highly relevant or not relevant at all seem uncommon (though, of course, not impossible). Accordingly, we should usually not seek to eliminate variables, but rather to adjust the degree to which each variable is considered significant. We may hope that a method of this sort for determining the importance of inputs will be able to avoid paying undue attention to the less important inputs, while still making use of the small amount of extra information that they provide.
This actually follows the ideas of Copas (1983), who suggests not completely removing the effect of any input in a regression, but rather shrinking the coefficients appropriately. This obviously does not preclude us from subsequently making use of the ARD information to look for a more compact representation. An important caveat is in order. Correlation among inputs may still cloud the assessment of an input's relevance in this setup, just as in statistical regression. This means that, if inputs are strongly correlated, then it is likely that, if say just two of them are correlated and nearly identical, ARD will pick one of them arbitrarily and switch the other one off. Since there is simply no easy way out, it still constitutes good practice to have a look at the correlation structure of the inputs prior to the modeling effort. Expert domain knowledge and prior beliefs may then well be your only guide in choosing which inputs to include in your models. The success of MLP-ARD depends strongly on finding appropriate values for the weight vector and the regularization parameters. In the next section we discuss the evidence framework approach to Bayesian learning for MLP classifiers due to MacKay (1992a).
4 Evidence Framework
The aim of Bayesian learning or Bayesian estimation (Bishop 1995, Neal 1996) is to develop probabilistic models that fit the data, and make optimal predictions using those models. The conceptual difference between Bayesian estimation and maximum likelihood estimation is that we no longer view model parameters as fixed, but rather treat them as random variables that are characterized by a joint probability model. This stresses the importance of capturing and accommodating for the inherent uncertainty about the true function mapping being learned from a finite training sample. Prior knowledge, that is, our belief of the model parameters before the data are observed, is encoded in the form of a prior probability density. Once the data are observed, the prior knowledge can be converted into a posterior probability density using Bayes' theorem. This posterior
knowledge can then be used to make predictions. There are two practical approaches to Bayesian learning for MLPs (Bishop 1995, Neal 1996). The first, known as the evidence framework, involves a local Gaussian approximation to the posterior probability density in weight space. The second is based on Monte Carlo methods. Only the former will be discussed here. Suppose we are given a set of training data D = {(x^i, t_i)}_{i=1}^N. In the Bayesian framework a trained MLP model is described in terms of the posterior probability density over the weights p(w|D). Given the posterior, inference is made by integrating over it. For example, to make a classification prediction at a given input vector x, we need the probability that x belongs to class t = 1, which is obtained as follows:

    p(t = 1|x, D) = ∫ p(t = 1|x, w) p(w|D) dw,    (11)

where p(t = 1|x, w) is given by the neural network function y(x). To compute the integral in (11), MacKay (1992a) introduces some simplifying approximations. We start by concentrating on the posterior weight density under the assumption of given regularization parameters. Let p(w|α) be the prior probability density over the weights w, given the regularization parameter vector α, that is, the vector of all ARD regularization parameters α_m (see (10)). Typically, this will be a rather broad probability density, reflecting the fact that we only have a vague belief in a range of possible parameter values before the data arrives. Once the training data D are observed, we can adjust the prior probability density to a posterior probability density p(w|D, α) using Bayes' theorem. The posterior will be more compact, reflecting the fact that we have learned something about the extent to which different weight values are consistent with the
observed data. The prior to posterior conversion works as follows:

    p(w|D, α) = p(D|w) p(w|α) / p(D|α).    (12)
In the above expression p(D|w) is the likelihood function, that is, the probability of the data D occurring, given the weights w. Note that p(D|w) = p(D|w, α), for the probability of the data D occurring is independent of α, given w. The term p(D|α) is called the evidence for α; it guarantees that the right-hand side of the equation integrates to one over the weight space. Note that, since the MLP's architecture or functional form A is assumed to be known, it should, strictly speaking, always be included as a conditioning variable in (12). We have, however, omitted it, and shall continue to do so, to simplify notation. With reference to (7), the first term in the numerator of the right-hand side of (12) can be written as follows:
    p(D|w) = Π_{i=1}^{N} y_i^{t_i} (1 − y_i)^{1−t_i} = exp(−E_D),    (13)
which introduces the cross-entropy objective function E_D. By carefully choosing the prior p(w|α), more specifically, by making Gaussian assumptions, MacKay (1994) is able to introduce ARD. Specifically, the concept of input relevance is implemented by assuming that all weights of weight class w_m (see (10)) are distributed according to a Gaussian prior with zero mean and variance σ_m² = 1/α_m. Besides introducing the requirement for small weights, this prior states that the input-dependent α_m is inversely proportional to the variance of the corresponding Gaussian. Since the parameter α_m itself controls the probability density of other parameters, it is called a hyperparameter. The design rationale underlying this choice of prior is that a weight channeling an irrelevant input to the output
should have a much tighter probability density around zero than a weight connecting a highly relevant input to the output. In other words, a small hyperparameter value means that large weights are allowed, so we conclude that the input is important. A large hyperparameter constrains the weights near zero, and hence the corresponding input is less important. Then, with reference to (10), the second term in the numerator of the right-hand side of (12) can be written as follows:

    p(w|α) = (1/Z_W) exp( − Σ_m α_m E_{W_m} ),    (14)
which introduces the ARD regularizer and yields the following expression for the posterior weight density:

    p(w|D, α) = (1/Z_M) exp( − E_D − Σ_m α_m E_{W_m} ) = (1/Z_M) exp(−E),    (15)

where Z_W and Z_M are appropriate normalizing constants. Hence, we note that the most probable weight values w^MP are found by minimizing the objective function E in (10). Standard optimization methods can be used to perform this task. We used a scaled conjugate gradient method (Bishop 1995). This concludes the first level of Bayesian inference in the evidence framework, which involved learning the most probable weights w^MP, given a setting of α. At the first level of Bayesian inference we have assumed the hyperparameter vector α to be known. So, we have not yet addressed the question of how the hyperparameters should be chosen in light of the training data D. A true Bayesian would take care of any unknown parameters, in casu α, by integrating them out of the joint posterior
probability density p(w, α|D) to make predictions, yielding:

    p(w|D) = ∫ p(w, α|D) dα = ∫ p(w|D, α) p(α|D) dα.    (16)

The approximation made in the evidence framework is to assume that the posterior probability density p(α|D) is sharply peaked around the most probable values α^MP. With this assumption, the integral in (16) reduces to:

    p(w|D) ≈ p(w|D, α^MP) ∫ p(α|D) dα ≈ p(w|D, α^MP),    (17)

which means that we should first try to find the most probable hyperparameter values α^MP and then perform the remaining calculations involving p(w|D) using the optimized hyperparameter values. The second level of Bayesian inference in the evidence framework is aimed at calculating α^MP from the posterior probability density p(α|D). We, again, make use of Bayes' theorem:

    p(α|D) = p(D|α) p(α) / p(D).    (18)

Starting from (18) and assuming a uniform prior p(α), representing the fact that we have very little idea of suitable values for the hyperparameters, we obtain α^MP by maximizing p(D|α). Optimization is discussed in detail in MacKay (1992b). This involves approximating E by a second-order Taylor series expansion around w^MP, which comes down to making a local Gaussian approximation to the posterior weight density centered at w^MP. A practical implementation involving both levels of Bayesian learning starts by choosing appropriate initial values for the
hyperparameter vector α and the weight vector w, and then trains the MLP using standard optimization to minimize E, with the novelty that training is periodically halted for the regularization parameters to be updated. Once we have trained the neural network to find the most probable weights w^MP and hyperparameters α^MP, we return to (11) for making predictions. By introducing some additional simplifying approximations (such as assuming that the MLP's output unit activation is locally a linear function of the weights), MacKay (1992a) ends up suggesting the following approximation:

    p(t = 1|x, D) ≈ sigm( κ(s) a^MP ),    (19)

where a^MP is the activation of the MLP's logistic sigmoid output unit calculated using the optimized neural network parameters, and

    κ(s) = ( 1 + π s² / 8 )^(−1/2),    (20)

where s² is the variance of a local Gaussian approximation to p(a|x, D) centered at a^MP, which is proportional to the error bars around w^MP. We observe that MacKay (1992a) argues in favor of moderating the output of the trained MLP in relation to the error bars around the most probable weights, so that these point estimates may better represent posterior probabilities of class membership. The moderated output is similar to the most probable output in regions where the data are dense. Where the data are more sparse, moderation smooths the most probable output toward a less extreme value, reflecting the uncertainty in sparse data regions. In casu smoothing is done toward 0.5. Although, from a Bayesian point of view, it is better to smooth the estimates toward the corresponding prior, the approximation proposed by MacKay (1992a) is not readily adaptable to this requirement. For further details on the exact implementation of MacKay's
(1992a) evidence framework we refer to the source code of the Netlab toolbox for Matlab (available at http://www.ncrg.aston.ac.uk/netlab/) and the accompanying documentation provided by Bishop (1995) and Nabney (2001).
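A minimal sketch of the overall training procedure just described: optimize the weights for fixed hyperparameters, periodically halt to re-estimate the α_m, and moderate the final outputs with κ(s). The hyperparameter re-estimation shown uses the common evidence-approximation update α_m ← γ_m / (2 E_{W_m}), which is an assumption not spelled out in the text above, and optimize_weights, weight_groups and effective_parameters are hypothetical stand-ins for Netlab-style routines; this is an illustration, not the chapter's actual implementation.

    import numpy as np

    def train_evidence_ard(weights, alphas, data, n_outer=10):
        # Alternate the two levels of inference described above (illustrative only):
        # level 1 finds the most probable weights for the current alphas, level 2
        # re-estimates each alpha_m from the trained weights.
        for _ in range(n_outer):
            weights = optimize_weights(weights, alphas, data)   # e.g. scaled conjugate gradients on E
            for m, group in enumerate(weight_groups(weights)):
                E_Wm = 0.5 * np.sum(group ** 2)
                gamma_m = effective_parameters(m, weights, alphas, data)
                # standard evidence-approximation update (assumed, not stated in the text):
                alphas[m] = gamma_m / (2.0 * E_Wm)
        return weights, alphas

    def moderated_output(a_mp, s2):
        # Equations (19)-(20): shrink the most probable output toward 0.5
        # according to the variance s2 of the output-unit activation.
        kappa = 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)
        return 1.0 / (1.0 + np.exp(-kappa * a_mp))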
5 PIP Claims Data
The empirical evaluation in Section 6 is based on a data set of 1,399 closed PIP automobile insurance claim files from accidents that occurred in Massachusetts, USA during 1993, and for which information was meticulously collected by the Automobile Insurers Bureau (AIB) of Massachusetts, USA. For all the claims the AIB tracked information on twenty-five binary fraud indicators (also known as red flags) and twelve nonindicator inputs, specifically, discretized continuous inputs, that are all supposed to make sense to claims adjusters and fraud investigators. Details on the exact composition and semantics of the data set and the data collection process can be found in Weisberg and Derrig (1991, 1995, 1998). Guided by the analysis of the timing of claim information in Derrig and Weisberg (1998), we retained the binary fraud indicators listed in Table 1 as inputs for the development of our first-stage claims screening models. The listed fraud indicators may all be qualified as typically available relatively early in the life of a claim. The timing of the arrival of the information on claims is crucial to its usefulness for the development of an early claims screening facility. The information pertains to characteristics of the accident (ACC), the claimant (CLT), the injury (INJ), the insured driver (INS), and lost wages (LW). The indicator names are identical to the ones used in previous work. Notice that no indicators related to the medical treatment are included in the data set. None of these fraud indicators are typically available early enough to be taken up into the data set (Derrig and Weisberg 1998).
Table 1. PIP fraud indicator inputs with values {0 = no, 1 = yes}.
Input    Description
ACC01    No report by police officer at scene.
ACC04    Single vehicle accident.
ACC09    No plausible explanation of accident.
ACC10    Claimant in an old, low-value vehicle.
ACC11    Rental vehicle involved in accident.
ACC14    Property damage was inconsistent with accident.
ACC15    Very minor impact collision.
ACC16    Claimant vehicle stopped short.
ACC18    Insured/claimant versions differ.
ACC19    Insured felt set up, denied fault.
CLT02    Had a history of previous claims.
CLT04    Was an out-of-state resident.
CLT07    Was one of three or more claimants in vehicle.
INJ01    Injury consisted of strain or sprain only.
INJ02    No objective evidence of injury.
INJ03    Police report showed no injury or pain.
INJ05    No emergency treatment was given for injury.
INJ06    Non-emergency treatment was delayed.
INJ11    Unusual injury for this auto accident.
INS01    Had a history of previous claims.
INS03    Readily accepted fault for accident.
INS06    Was difficult to contact/uncooperative.
INS07    Accident occurred soon after policy effective date.
LW01     Claimant worked for self or family member.
LW03     Claimant recently started employment.
To evaluate the additional information content of nonflag inputs we also decided to consider the inputs in Table 2 for inclusion in the classification models. The selection of the inputs was steered by discussion with domain experts within the limits of what was available
in the coded data. We therefore emphasize that this is only an initial example of adding nonflag inputs, not an attempt at a complete or efficient model. Again, information regarding the retained inputs is usually obtained relatively early in the life of a claim. We discretized continuous inputs by dividing up their continuous value ranges in bins and replacing the actual value with the respective bin numbers 1, 2, 3, and so on. Although algorithmic means of discretization could have been applied at this stage (see, for example, Fayyad and Irani 1993), we relied on prior domain expertise and inspection of the distribution of the values within the continuous value ranges. We statistically normalized all inputs by subtracting their mean over the data and dividing by their standard deviation. Each claim file was reviewed by a senior claims manager on the basis of all available information. This closed claims reviewing was summarized into a ten-point-scale expert assessment of suspicion of fraud, with zero being the lowest and ten the highest score. Each claim was also categorized in terms of the following verbal assessment hierarchy: Probably legitimate, excessive treatment only, suspected opportunistic fraud, and suspected planned fraud. In automobile insurance, a fraudulent claim is defined operationally for the Massachusetts, USA study as a claim for an injury in an accident that did not happen or an injury unrelated to a real accident. The qualification of each available claim by both a verbal expert assessment of suspicion of fraud as well as a ten-point-scale suspicion score gave rise to several alternative target encoding scenarios in Viaene et al. (2002). For instance, each definition threshold imposed on the ten-point scale then defines a specific (company) view or policy toward the investigation of claim fraud. Usually, 4+ target encoding—that is, if suspicion score > 4, then investigate, else do not investigate—is the operational domain expert choice, which makes technical, if not verbal, sense. For the latter scenario, that is, the one discussed in this study, about 28 percent of the claims contain enough suspicious elements to be further investigated for fraud.
Table 2. PIP nonflag inputs.
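The discretization and statistical normalization steps described above might look as follows. The bin edges and the AGE values are invented for illustration; in the study the binning came from domain expertise and inspection of the value distributions.

    import numpy as np

    def discretize(values, bin_edges):
        # Replace continuous values by ordinal bin numbers 1, 2, 3, ...
        return np.digitize(values, bin_edges) + 1

    def normalize(column):
        # Statistical normalization applied to all inputs: subtract the mean
        # over the data and divide by the standard deviation.
        return (column - column.mean()) / column.std()

    age = np.array([19.0, 27.0, 35.0, 52.0, 68.0])      # hypothetical claimant ages
    age_binned = discretize(age, bin_edges=[25.0, 40.0, 60.0])
    print(age_binned)                                    # [1 2 2 3 4]
    print(normalize(age_binned.astype(float)))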
Most insurance companies use lists of fraud indicators (most often per insurance business line) representing a summary of the detection expertise as a standard aid to claims adjusters for assessing suspicion of fraud. These lists form the basis for systematic, consistent and swift identification of suspicious claims. Operationally, company claims adjustment units identify those claims needing attention by noting the presence or absence of red flags in the inflow of information during the life of a claim. This has become the default modus operandi for most insurers. Claims adjusters are trained to recognize (still often on an informal, judgmental basis) those claims that have combinations of red flags that experience has shown are typically associated with suspicious claims. This assessment is embedded in the standard claims handling process that roughly is organized as follows. In a first stage, a claim is judged by a front-line adjuster, whose main task is to assess the exposure of the insurance company to payment of the claim. In the same effort the claim is scanned for claim amount padding and fraud. Claims that are modest or appear to be legitimate are then settled in a routine fashion. Claims that raise serious questions and involve a substantial payment are scheduled to pass a second reviewing phase. In case fraud is suspected, this might lead to a referral of the claim to a special investigation unit.
6 Empirical Evaluation
In this section we demonstrate the intelligible soft input selection capabilities of MLP-ARD using the 1993 Massachusetts, USA PIP automobile insurance closed claims data. The produced input importance ranking will be compared with the results from popular logistic regression and decision tree learning. For this study we have used the models that were fitted to the data for the baseline benchmarking study of Viaene et al. (2002). In the baseline benchmarking study we contrasted the predictive power of logistic regression, decision tree, nearest neighbor, MLP-
ARD, least squares support vector machine, naive Bayes and tree-augmented naive Bayes classification on the data described in Section 5. For most of these techniques or algorithm types we reported on several operationalizations using alternative, a priori sensible design choices. The algorithms were then compared in terms of mean PCC and mean AUROC using a ten-fold cross-validation experiment and Duncan's multiple range test for two-way ANOVA (Bradley 1997). We also contrasted algorithm type performance visually by means of the convex hull of the ROCs (Provost and Fawcett 2001) associated with the alternative operationalizations per algorithm type. MLP-ARD consistently scored among the best for all evaluated scenarios. The results we report in this section for MLP-ARD are based on MLPs with one hidden layer of three neurons. In light of the overall excellent performance of logistic regression reported in the baseline benchmarking study and its widespread availability and use in practice, an assessment of the relative importance of the inputs based on inspection of the (standardized) regression coefficients is taken as a point of reference for comparison. To test for multicollinearity, variance inflation factors (VIFs), specifically, calculated as in Allison (1999), were checked. The VIFs were all well below the heuristic acceptance limit of ten (Joos et al. 1998). We report on the standardized coefficients of logistic regression fitted with SAS for Windows V8 PROC LOGISTIC with the default model options TECHNIQUE=FISHER and RIDGING=RELATIVE, using hard, stepwise input selection, that is, SELECTION=STEPWISE. The latter is termed Logit in the rest of this chapter.
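The VIF screen mentioned above amounts to regressing each (normalized) input on the remaining ones and computing 1 / (1 − R²); values well below ten, the heuristic limit cited above, indicate no problematic collinearity. The sketch below is a generic VIF computation, not the exact SAS/Allison (1999) procedure used in the study.

    import numpy as np

    def variance_inflation_factors(X):
        # X: (n_samples, n_inputs) matrix of inputs.
        # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing input j
        # on all the other inputs (ordinary least squares with an intercept).
        n, p = X.shape
        vifs = []
        for j in range(p):
            y = X[:, j]
            Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            residuals = y - Z @ beta
            r2 = 1.0 - residuals.var() / y.var()
            vifs.append(1.0 / (1.0 - r2))
        return np.array(vifs)

    # Example: variance_inflation_factors(np.column_stack([x1, x2, x3]))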
its reported performance was clearly inferior to that of logistic regression and MLP-ARD. It is conceived as follows. First, a full, that is, unpruned, C4.5 decision tree is grown. Then, instead of estimating class membership for decision making purposes from the leaf nodes of the unpruned tree, we backtrack through the parents of the leaf until we find a subtree that contains k or more training data instances. That is, the estimation neighborhood of a data instance is enlarged until we find a subtree that classifies k or more training data instances. The optimal k in terms of generalization ability of the tree's predictions and reliability of the probability estimates is determined using cross-validation. Since probability estimates based on raw training data frequencies may be unreliable in sparsely populated regions, m-estimation smoothing was applied. m-Estimation smooths the class probability estimates based on the training data toward a base rate or data prior, making them less extreme, depending on the number of training data instances at a node. An input's importance to the decision tree model is quantified by the contribution it makes to the construction of the tree. Importance is determined by the input's role as a splitter in the tree. In casu the tree is constructed according to a greedy, step-wise input selection method based on an entropy gain rule (Quinlan 1993). An input's importance then amounts to the sum across all nodes in the tree of the improvement scores that the input has when it acts as a splitter. We may also take advantage of the ten-fold cross-validation setup of the baseline benchmarking study for input importance evaluation. This is done in the following way. In the ten-fold cross-validation resampling scheme, for each learner ten models are fitted to ten different (though overlapping) training sets, each covering nine-tenths of the original training data. Obviously, the resulting models are dependent on the underlying (finite) training sets, so that we may expect the ten models of the cross-validated model ensemble to differ. This essentially reflects the uncertainty in model building due to finite training data. Robust input importance assessment should prefer an ensemble-based evaluation, in casu a cross-validated com-
mittee (Parmanto et al. 1996), to a single-model evaluation (Van de Laar and Heskes 2000). For that reason, input importance—specifically, quantified by the standardized regression coefficient for Logit, the entropy improvement for C4.5, and the square root of the inverse of the corresponding ARD regularization parameter for MLP-ARD (that is, the standard deviation of the corresponding Gaussian prior)—is aggregated to yield an ensemble-based input importance assessment. Aggregation is based on unweighted averaging across all ten models of the cross-validation. Table 3 reports the ranking of inputs (with the input assigned rank 1 being the most important) for each of the learners. The ranking of inputs is the result of the aggregated input importance assessment across the models of the cross-validated ensemble. Each input's importance (in brackets) is given relative to that of the most important input. That is, the most important input is assigned a relative importance score of 100, and the importance of the other inputs is expressed relative to this score. The top-ten inputs for each learner are boldfaced. All three learners seem to agree on the most important input: SCLEGREP (Claimant represented by an attorney.). Three further inputs, specifically, ACC14 (Property damage was inconsistent with the accident.), CLT02 (Claimant had a history of previous claims.), INJ01 (Injury consisted of strain or sprain only.), appear in the top-ten of all three learners. MLP-ARD's top-ten furthermore shares both TRT_LAG1 0/1 (Lag from accident to medical provider 1 first outpatient treatment is non-zero.) and PARTDIS 0/1 (Claimant partially disabled due to accident.) with Logit's top-ten, and POL_LAG (Lag from (one year) policy effective date to accident.), TRT_LAG2 (Lag from accident to medical provider 2 first outpatient treatment.) and AGE (Age of claimant at time of accident.) with C4.5's top-ten. Hence, MLP-ARD's top-ten shares six inputs with Logit's top-ten and seven inputs with C4.5's top-ten.
Table 3. Ranking of inputs, with relative importance scores (most important input = 100), for Logit, C4.5 and MLP-ARD.
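As a rough illustration of how the ensemble-based importance assessment for MLP-ARD described above can be operationalized, the following Python sketch (not the authors' code; the arrays and input names are hypothetical) averages 1/sqrt(alpha) — the standard deviation of the Gaussian prior attached to each input — across the cross-validation folds and rescales the result so that the most important input receives a score of 100.

```python
import numpy as np

def relative_importance_from_ard(alphas, input_names):
    """Aggregate ARD-based input importance across cross-validation folds.

    alphas: array of shape (n_folds, n_inputs) holding the fitted ARD
    regularization parameter (precision) of each input in each fold.
    Importance per fold is 1/sqrt(alpha), i.e. the standard deviation of the
    corresponding Gaussian prior; fold scores are averaged (unweighted) and
    rescaled so that the most important input scores 100.
    """
    alphas = np.asarray(alphas, dtype=float)
    importance = (1.0 / np.sqrt(alphas)).mean(axis=0)   # average across folds
    relative = 100.0 * importance / importance.max()    # most important = 100
    order = np.argsort(-relative)                       # rank 1 = most important
    return [(rank + 1, input_names[i], round(float(relative[i]), 1))
            for rank, i in enumerate(order)]

# Hypothetical illustration with three inputs and two folds
print(relative_importance_from_ard(
    [[0.5, 4.0, 16.0], [0.4, 5.0, 20.0]],
    ["SCLEGREP", "CLT02", "AGE"]))
```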
For comparison purposes, we also canceled the inputs for MLP-ARD and C4.5 in Table 3 to which Logit assigned a zero-importance score. Among the inputs that were canceled is AGE, which happens to be boldfaced for both MLP-ARD and C4.5. Two other canceled inputs that belong to the top-ten for C4.5 are TRT_LAG1 (Lag from accident to medical provider 1 first outpatient treatment.) and REPT_LAG (Lag from accident to its reporting.). Note how MLP-ARD ranks the latter two inputs more toward the end of the importance spectrum and thus is more in line with Logit's assessment. Finally, note that the fact that different learners disagree on relative input importance is not necessarily a bad thing. Different learners may actually complement each other. Their models could be combined into an ensemble classifier (by using simple or weighted voting schemes) that may well prove to effectively reduce misclassification rate, bias and/or variance (see, for example, Bauer and Kohavi 1999, Dietterich 2003, Opitz and Maclin 1999). This, by the way, is completely in line with the modern Bayesian way of looking at the issue of model selection in the face of finite data availability (see, for example, Bishop 1995, Buntine 1990, Domingos 2000, Neal 1996).
7 Conclusion
Understanding the semantics that underlie the output of neural network models proves an important aspect of their acceptance by domain experts for routine analysis and decision making purposes. Hence, we explored the explicative capabilities of neural network classifiers with automatic relevance determination weight regularization, and reported the findings of applying these networks for personal injury protection automobile insurance claim fraud detection. The regularization scheme was aimed at providing us with a way to determine the relative importance of each input to the trained neural network model. We proposed to train the neural network models using MacKay's (1992) evidence framework for classification, a practical Bayesian learning approach that readily incorporates automatic
relevance determination. The intelligible soft input selection capabilities of the presented method were demonstrated for a claim fraud detection case based on a data set of closed claims from accidents that occurred in Massachusetts, USA during 1993. The neural network findings were compared to the predictor importance evaluation from popular logistic regression and decision tree classifiers.
References
Allison, P.D. (1999), Logistic regression using the SAS system: Theory and application, SAS Institute, Cary.
Bauer, E. and Kohavi, R. (1999), "An empirical comparison of voting classification algorithms: Bagging, boosting and variants," Machine Learning 1-2, vol. 36, pp. 105-139.
Bellman, R.E. (1961), Adaptive control processes, Princeton University Press, Princeton.
Bengio, Y. (2000), "Gradient-based optimization of hyper-parameters," Neural Computation 8, vol. 12, pp. 1889-1900.
Bishop, C.M. (1995), Neural networks for pattern recognition, Oxford University Press, Oxford.
Bradley, A.P. (1997), "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition 7, vol. 30, pp. 1145-1159.
Breiman, L. (2001a), "Understanding complex predictors," invited talk, Nonparametrics in Large, Multidimensional Data Mining Conference, Dallas, http://www.smu.edu/statistics/NonparaConf/Breiman-dallas-2000.pdf.
Breiman, L. (2001b), "Random forests," Machine Learning 1, vol. 45, pp. 5-32.
Buntine, W.L. (1990), "A theory of learning classification rules," Ph.D. thesis, School of Computing Science, University of Technology Sydney.
Canadian Coalition Against Insurance Fraud (2002), "Insurance fraud," http://www.fraudcoalition.org/.
Coalition Against Insurance Fraud (2002), "Insurance fraud: the crime you pay for," http://www.insurancefraud.org/fraud_backgrounder.htm.
Comite Europeen des Assurances (1996), "The European insurance anti-fraud guide," CEA Info Special Issue 4, Euro Publishing System, Paris.
Comite Europeen des Assurances (1997), "The European insurance anti-fraud guide 1997 update," CEA Info Special Issue 5, Euro Publishing System, Paris.
Copas, J.B. (1983), "Regression, prediction and shrinkage (with discussion)," Journal of the Royal Statistical Society: Methodological 3, vol. 45, pp. 311-354.
Derrig, R.A. (editor) (2002), "Special issue on insurance fraud," Journal of Risk and Insurance 3, vol. 69.
Derrig, R.A. and Weisberg, H.I. (1998), "AIB PIP screening experiment final report - Understanding and improving the claim investigation process," AIB Cost Containment/Fraud Filing DOI Docket R98-41 (IFRR-267), Automobile Insurers Bureau of Massachusetts, Boston.
Desai, V.S., Crook, J.N., and Overstreet, Jr., G.A. (1996), "A comparison of neural networks and linear scoring models in the credit union environment," European Journal of Operational Research 1, vol. 95, pp. 24-37.
Diaconis, P. and Efron, B. (1983), "Computer-intensive methods in statistics," Scientific American 5, vol. 248, pp. 116-130.
Dietterich, T.G. (to appear 2003), "Ensemble learning," Encyclopedia of Cognitive Science, Macmillan, London.
Domingos, P. (2000), "Bayesian averaging of classifiers and the overfitting problem," Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, pp. 223-230.
Duda, R.O., Hart, P.E., and Stork, D.G. (2000), Pattern classification, 2nd ed., John Wiley & Sons, New York.
Fayyad, U.M. and Irani, K.B. (1993), "Multi-interval discretization of continuous-valued attributes for classification learning," Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, pp. 1022-1027.
Goldstein, M. and Smith, A.F.M. (1974), "Ridge-type estimators for regression analysis," Journal of the Royal Statistical Society: Methodological 36, pp. 284-291.
Grandvalet, Y. (1998), "Lasso is equivalent to quadratic penalization," Proceedings of the International Conference on Artificial Neural Networks, Skovde, pp. 201-206.
Hand, D.J. (1997), Construction and assessment of classification rules, John Wiley & Sons, Chichester.
Hanley, J.A. and McNeil, B.J. (1982), "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology 1, vol. 143, pp. 29-36.
Joos, P., Vanhoof, K., Sierens, N., and Ooghe, H. (1998), "Credit classification: A comparison of logit models and decision trees," Proceedings of the Tenth European Conference on Machine Learning, Workshop on the Application of Machine Learning and Data Mining in Finance, Chemnitz, pp. 59-73.
Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," Proceedings of the Fourteenth International Joint Conference on AI, Montreal, pp. 1137-1145.
Lacher, R.C., Coats, P.K., Shanker, S.C., and Fant, L.F. (1995), "A neural network for classifying the financial health of a firm," European Journal of Operational Research 1, vol. 85, pp. 53-65.
Lee, K.C., Han, I., and Kwon, Y. (1996), "Hybrid neural network models for bankruptcy predictions," Decision Support Systems 1, vol. 18, pp. 63-72.
Lim, T.-S., Loh, W.-Y., and Shih, Y.-S. (2000), "A comparison of prediction accuracy, complexity and training time of thirty-three old and new classification algorithms," Machine Learning 3, vol. 40, pp. 203-228.
MacKay, D.J.C. (1992a), "The evidence framework applied to classification networks," Neural Computation 5, vol. 4, pp. 720-736.
MacKay, D.J.C. (1992b), "A practical Bayesian framework for backpropagation networks," Neural Computation 3, vol. 4, pp. 448-472.
MacKay, D.J.C. (1994), "Bayesian non-linear modelling for the energy prediction competition," ASHRAE Transactions 2, vol. 100, pp. 1053-1062.
Marquardt, D.W. (1980), "Comment on Smith and Campbell: You should standardize the predictor variables in your regression models," Journal of the American Statistical Association 369, vol. 75, pp. 87-91.
Mobley, B.A., Schechter, E., Moore, W.E., McKee, P.A., Eichner, J.E. (2000), "Predictions of coronary artery stenosis by artificial neural network," Artificial Intelligence in Medicine 3, vol. 18, pp. 187-203.
Nabney, I.T. (2001), Netlab: Algorithms for pattern recognition, Springer Verlag, New York.
Neal, R.M. (1996), Bayesian learning for neural networks, Springer Verlag, New York.
Neal, R.M. (1998), "Assessing relevance determination methods using Delve," in Bishop, C.M. (editor), Neural networks and machine learning, Springer Verlag, New York, pp. 97-129.
Opitz, D. and Maclin, R. (1999), "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research 11, pp. 169-198.
Parmanto, B., Munro, P.W., and Doyle, H.R. (1996), "Reducing variance of committee prediction with resampling techniques," Connection Science 3-4, vol. 8, pp. 405-426.
Penny, W.D. and Roberts, S.J. (1999), "Bayesian neural networks for classification: How useful is the evidence framework?" Neural Networks 6, vol. 12, pp. 877-892.
Piramuthu, S. (1999), "Financial credit-risk evaluation with neural and neurofuzzy systems," European Journal of Operational Research 2, vol. 112, pp. 310-321.
Provost, F.J. and Fawcett, T. (2001), "Robust classification for imprecise environments," Machine Learning 3, vol. 42, pp. 203-231.
Quinlan, J. (1993), C4.5: Programs for machine learning, Morgan Kaufmann, San Mateo.
Salchenberger, L.M., Venta, E.R., and Venta, L.A. (1997), "Using neural networks to aid the diagnosis of breast implant rupture," Computers and Operations Research 5, vol. 24, pp. 435-444.
Sharda, R. and Wilson, R. (1996), "Neural network experiments in business failures prediction: A review of predictive performance issues," International Journal of Computational Intelligence and Organizations 2, vol. 1, pp. 107-117.
Tibshirani, R.J. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Methodological 1, vol. 58, pp. 267-288.
Van de Laar, P. and Heskes, T. (2000), "Input selection based on an ensemble," http://citeseer.nj.nec.com/vandelaar00input.html.
Viaene, S. (2002), "Learning to detect fraud from enriched insurance claims data: Context, theory and applications," Ph.D. thesis, Department of Applied Economics, K.U.Leuven.
Viaene, S., Derrig, R.A., Baesens, B., and Dedene, G. (2002), "A comparison of state-of-the-art classification techniques for expert automobile insurance claim fraud detection," Journal of Risk and Insurance 3, vol. 69, pp. 373-421.
Webb, A.R. (1999), Statistical pattern recognition, Arnold, London.
Weisberg, H.I. and Derrig, R.A. (1991), "Fraud and automobile insurance: A report on the baseline study of bodily injury claims in Massachusetts," Journal of Insurance Regulation 4, vol. 9, pp. 497-541.
Weisberg, H.I. and Derrig, R.A. (1995), "Identification and investigation of suspicious claims," AIB Cost Containment/Fraud Filing DOI Docket R95-12 (IFFR-170), Automobile Insurers Bureau of Massachusetts, Boston.
Weisberg, H.I. and Derrig, R.A. (1998), "Quantitative methods for detecting fraudulent automobile bodily injury claims," Risques 35, pp. 75-101.
Zadrozny, B. and Elkan, C. (2001), "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers,"
Proceedings of the Eighteenth International Conference on Machine Learning Williams College, pp. 609-616.
Chapter 11
Merging Soft Computing Technologies in Insurance-Related Applications
Arnold F. Shapiro
While there has been increased use of the soft computing technologies neural networks, fuzzy logic and genetic algorithms in insurance-related applications, very few of the reported studies have merged these technologies. On the contrary, the focus generally has been on a single technology heuristically adapted to a problem. A plausible explanation of why most of these studies have not merged the technologies, or even mentioned the possibility of doing so, is that the authors were not sufficiently familiar with the merging opportunities. Assuming this to be the case, and that it is a continuing problem, the purpose of this chapter is to help alleviate this situation by presenting an overview of the merging of these soft computing technologies. The topics addressed include the advantages and disadvantages of each technology, the potential merging options, and the explicit nature of the merging.
1 Introduction
It has been well established that merging1 the soft computing (SC) technologies of neural networks (NNs), fuzzy logic (FL) and genetic algorithms (GAs) may significantly improve an analysis (Jain and Martin 1999, Abraham and Nath 2001).2 There are two main reasons for this. First, these technologies are for the most part complementary and synergistic. That they are complementary follows from the observations that NNs are used for learning and curve fitting, FL is used to deal with imprecision and uncertainty,3 and GAs are used for search and optimization. We will show in this chapter that they are synergistic. Second, as Zadeh (1992) pointed out, merging these technologies allows for the exploitation of a tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, and low solution cost.

1 Engineers generally refer to this process as fusion. However, given the audience of this chapter, the term "merging" seems more appropriate.
2 While NNs, FL and GAs are only a subset of the soft computing technologies, they are regarded as the three principal components (Shukla 2000 p. 406).
3 Following Zadeh (1994 p. 192), in this chapter the term fuzzy logic is used in the broad sense where it is essentially synonymous with fuzzy set theory.

Despite these advantages, even though there has been increased use of these SC technologies in insurance-related applications, as reported in Chapter one, very few of the reported studies have merged these technologies. On the contrary, the focus often has been on a single technology heuristically adapted to a problem. Examples of this heuristic adaptation are found in the discussions by Brockett et al. (1997 pp. 1156-7), regarding the number of processing units needed within a hidden layer of a NN, and by Verrall and Yakoubov (1999 p. 192) regarding the number of clusters in a fuzzy clustering application. This is not to say that none of the articles reviewed in Chapter one merged the technologies; some did. Park (1993), Huang et al. (1994), Bakheet (1995), and Brockett et al. (1998) explicitly mentioned using a GA to determine the architecture and/or parameters of their NNs; Yoo et al. (1994), Cox (1995) and Zhao (1996) integrated NNs and FL; and Bentley (2000) used genetic programming to evolve FL rules. But, by and large, there have been very few insurance-related studies that have merged the SC technologies. A plausible explanation of why most of the insurance studies have not merged the technologies, or even mentioned the possibil-
ity of doing so, is that the authors were not sufficiently familiar with the merging opportunities. Assuming this to be the case, and that it is a continuing problem, the purpose of this chapter is to help alleviate this situation by presenting an overview of the merging of NNs, FL and GAs. This topic is a continuation of the author's investigation of adaptive nonlinear models for the insurance industry (Shapiro 2000, 2002, Shapiro and Gorman 2000). The subjects addressed include the advantages and disadvantages of each technology, the potential merging options, and the explicit nature of the merging. Specific topics include: NNs controlled by FL, NNs generated by GAs, fuzzy inference systems tuned by GAs, fuzzy inference systems tuned by a NN, GAs controlled by FL, and the merging of all three technologies. The chapter concludes with a commentary on the prognosis for these hybrids.
2 Advantages and Disadvantages of NNs, FL and GAs
Before discussing the merging of the SC technologies, it is appropriate to review the anticipated payoff to be gained by their merging. To this end, Table 1 and the discussion that follows provide a brief description of the advantages and disadvantages of each of the technologies and the synergies between them.

Table 1. Advantages and disadvantages of the NNs, FL and GAs.
Technology   Advantages                                              Disadvantages
NNs          Adaptation; learning; approximation                     Slow convergence speed; "black box" structure
FL           Approximate reasoning                                   Difficult to tune; lacks effective learning capability
GAs          Systematic random search; derivative-free optimization  Difficult to tune; no convergence criterion
As indicated in Table 1, NNs have the advantages of adaptation, learning and approximation, but the disadvantages of a relatively slow convergence speed and the negative attribute of a "black box" structure, owing to their lack of an explanatory capability. The table also indicates that FL has the advantage of approximate reasoning but the disadvantages that it is difficult to construct and tune the fuzzy membership functions (MFs) and rules, and it lacks an effective learning capability. Finally, GAs have the advantages of systematic random search and derivative-free optimization, but they are difficult to tune and have no convergence criterion. In the sections that follow we will see that merging these technologies can capitalize on their strengths and compensate for their shortcomings. For example, NNs can improve the learning capacity of fuzzy systems, FL can help tune GAs, both FL and GAs can improve the convergence speed of NNs, and GAs can help construct and tune the MFs of FL; all of this can be done adaptively.
3 NNs Controlled by FL
Bakheet (1995 p. 221), at the conclusion of his study of construction bond underwriting, recommended that fuzzy set theory be utilized for the description of the linguistic variables associated with NNs. This section discusses a methodology for doing this. A system where NNs are controlled by FL is referred to as a fuzzy NN. Its history can be traced back to at least the mid 1970s, when Lee and Lee (1974) introduced MFs to the McCulloch-Pitts (1943) model. Since then, FL has been applied to many of the components and parameters of NNs, including the inputs, weights, learning rate and momentum coefficient. This section presents an overview of these methodologies.
3.1 Inputs and Weights
To begin, consider the neural processing unit (see Chapter 1, Figure 1) that is the core of the NN. Following Buckley and Hayashi (1994, p. 234), we adopt the convention that in order for a NN to be a fuzzy NN (FNN), the signal and/or the weights must be fuzzy sets. This leads to their three possibilities:
FNN1 is a FNN with real inputs but fuzzy weights;
FNN2 is a FNN with fuzzy inputs but real weights;
FNN3 is a FNN with fuzzy inputs and fuzzy weights.
Moreover, they called a FNN a regular FNN if its neural processing units use multiplication, addition and a logistic activation function, in contrast to a hybrid FNN, which uses operations like a t-norm, t-conorm,4 or some other continuous procedure, to combine the incoming signals and weights and to aggregate their products. This distinction is important because a regular FNN that uses fuzzy arithmetic based on Zadeh's extension principle5 is not a universal approximator6 while a hybrid FNN may be7 (Buckley and Hayashi 1993, 1994a). An example of the implementation of FNN1 is the approach envisioned by Yamakawa and Furukawa (1992), which can be characterized as shown in Figure 1.
4 The simplest examples of the t-norm (triangular-norm) and the t-conorm (its dual) are the min-operator and max-operator, respectively.
5 The extension principle allows nonfuzzy arithmetic operations to be extended to incorporate fuzzy sets and fuzzy numbers. See Zimmermann (1996), Ch. 5.
6 A particular form of a neural net is a universal approximator if, given a continuous function F : ℝⁿ → ℝ, it can approximate F uniformly on compact subsets of ℝⁿ to any degree of accuracy.
7 Buckley and Hayashi (1993, 1994a) show that regular FNNs are monotone increasing fuzzy functions, from which they conclude that they are not universal approximators. Hybrid FNNs, which are created by changing the operations within a FNN, need not be monotone increasing fuzzy functions.
Figure 1. Yamakawa fuzzy weights (panels (a) and (b)).
As shown in Figure 1(a), for each input, xi, instead of a single weight, wi, the neuron has an array of weights {wij, j = 1,...,m}, each of which is associated with a triangular fuzzy number, μij. Since, for any given xi, only two adjacent MFs are nonzero, the input to the neuron associated with each xi is the weighted average of the two adjacent weights (Buckley and Hayashi 1994 p. 235), as shown in Figure 1(b), that is

μij(xi)wij + μi,j+1(xi)wi,j+1     (1)

and the total input is the sum of these. In the Yamakawa studies, learning was accomplished by updating the weights using a heuristic rule. FNN2 was explored early on by Ishibuchi et al. (1992), who envisioned both the inputs, x̄i (where the bar indicates a fuzzy value), and the output, Ō, as fuzzy numbers, within the context of a NN along the lines of a 3-layer FFNN (see Chapter 1, Figure 2). Their procedure, generally speaking, was to view the j-th training instance as the paired fuzzy input and target numbers (X̄j, T̄j), where X̄j = (x̄1j,...,x̄nj), and to use interval arithmetic8 to compute the error to be minimized. They then updated the weights using a modified version of the delta rule.9
8 Interval arithmetic is an arithmetic defined on sets of intervals, rather than sets of real numbers.
9 See Ishibuchi et al. (1992), pp. 1299-1300. The delta rule is also known as the gradient (or steepest) descent method.
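A minimal Python sketch of the fuzzy-weighted input of equation (1) may help fix ideas. It assumes triangular MFs whose equally spaced peaks partition the input domain; the centers and weights used in the example are hypothetical, and this is an illustration rather than the Yamakawa implementation itself.

```python
def triangular_mf(x, left, center, right):
    """Triangular membership function with support (left, right) and peak at center."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def fuzzy_weighted_input(x, centers, weights):
    """Contribution of a single input x to the neuron, as in equation (1): for a
    given x only the two adjacent MFs are nonzero, so the result is the
    membership-weighted combination of the two adjacent weights."""
    spacing = centers[1] - centers[0]          # assumes equally spaced peaks
    total = 0.0
    for j, c in enumerate(centers):
        left = centers[j - 1] if j > 0 else c - spacing
        right = centers[j + 1] if j < len(centers) - 1 else c + spacing
        total += triangular_mf(x, left, c, right) * weights[j]
    return total

# Hypothetical example: five MFs with peaks at 0, 0.25, 0.5, 0.75 and 1.0
centers = [0.0, 0.25, 0.5, 0.75, 1.0]
weights = [0.1, -0.3, 0.8, 0.2, -0.5]
print(fuzzy_weighted_input(0.6, centers, weights))   # 0.6*0.8 + 0.4*0.2 = 0.56
```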
Finally, a simple version of FNN3 can be envisioned as depicted in Figure 2.

Figure 2. Simple type-3 FNN (input layer 0 and output layer 1).
As previously, the j-th training instance is the paired fuzzy numbers (X̄j, T̄j). If universal approximation is not an issue, standard fuzzy arithmetic can be used for the input to the neuron and the extension principle can be applied to the activation function. For computations of the error involving fuzzy subtraction, the stopping rule could be to end the iterations when the error falls within an acceptable interval about zero. The learning algorithm could then be based on a fuzzified delta rule (Hayashi et al. 1992), although Buckley and Feuring (1999 p. 83) had misgivings about doing so. Of course, the potential applications for FNN3 are limited if the methodology is restricted to regular FNNs. Buckley and Hayashi (1994, 240-42) provide an example of a hybrid FNN3. The subsequent literature in this area generally focused on specifying modeling capabilities of FNNs and changes in the architecture and/or internal operations so that the FNNs become universal approximators. Examples of recent studies of the implementation problems inherent in these systems are Feuring et al. (1998), who discuss a backpropagation-based method for adjusting the weights, Jiao et al. (1999), who focus on problems involving a crisp input and a fuzzy target, and Buckley and Feuring (1999), who provide
an interesting overview of FNNs based on a decade of their research.
3.2 Learning Rates and Momentum Coefficients
In addition to its role in modifying the inputs and weights of a NN, FL can be used to monitor and control the parameters of the NN. Consider the formula for the change in the weights

Δw(t) = -η ∇E[w(t)] + α Δw(t-1)     (2)
where Δw(t) denotes the change in the weight, η denotes the learning rate, E[w(t)] denotes the error function during the t-th iteration, ∇E denotes the gradient of E in weight space, and α is the momentum coefficient. Following Kuo et al. (1993) and Bonissone (1998), we can modify the learning rate and the momentum coefficient by incorporating MFs derived from the total training error and the change in error between two successive iterations (ΔError). In the case of the former, the universe of discourse may be small, medium and big, while the latter may represent negative, zero and positive. If, for example, the matrix of fuzzy rules were as shown in Table 2,10 a small training error and a negative ΔError would suggest that the change in the learning rate be very small and positive.

Table 2. Fuzzy rule table for the change in the learning rate (η).
                 Training Error
                 Small            Medium           Large
ΔError < 0       Very small Δ+    Very small Δ+    Small Δ+
ΔError = 0       Δ = 0            Δ = 0            Small Δ+
ΔError > 0       Small Δ-         Medium Δ-        Large Δ-
Δ+ implies positive change; Δ- implies negative change.

10 Adopted from Bonissone (1998), Table 1.
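The following Python sketch illustrates the idea behind equation (2) and Table 2. The fuzzy rule table is caricatured here by crisp thresholds (the numerical factors and the threshold values are hypothetical); an actual implementation would use overlapping MFs over the error and ΔError and defuzzify the recommended change.

```python
def update_weights(w, grad, prev_delta, eta, alpha):
    """Delta rule with momentum: dw(t) = -eta * dE/dw + alpha * dw(t-1), as in (2)."""
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    return [wi + di for wi, di in zip(w, delta)], delta

def adjust_learning_rate(eta, error, delta_error, small=0.01, large=0.1):
    """Crisp caricature of the fuzzy rule table of Table 2: a falling error
    suggests a (very) small increase in eta, a rising error a decrease whose
    size grows with the training error."""
    if delta_error < 0:                                   # error still falling
        return eta * (1.02 if error < large else 1.10)    # very small / small increase
    if delta_error > 0:                                   # error rising: back off
        return eta * (0.9 if error < small else 0.5)      # small / large decrease
    return eta                                            # no change

print(adjust_learning_rate(0.5, error=0.02, delta_error=-0.001))  # -> 0.51
```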
Even with such simple fuzzy rules, the training time will be reduced significantly (Bonissone 1998 p. 66). A similar procedure can be used to monitor and control the change in the momentum coefficient and the steepness parameter11 of the activation function. In the case of the latter, ΔError might be replaced by training time, partitioned into short, medium and long. Given the foregoing fuzzy rule table, the neural-fuzzy model might be designed along the lines shown in Figure 3.

Figure 3. Neural-fuzzy system. (The NN component is initialized with an architecture, weights, a learning rate (η) and a momentum (α); the fuzzy system adjusts η and α during training.)
In this representation, the whole system is composed of two parts, the NN component12 and the fuzzy system. Given the initialization and the input and target values, the NN computes the component values of the hidden layer and its estimate of the output values. It then evaluates its performance and if the output is not sufficiently close to the target value, it passes the error and the change in the error to the fuzzy system. The fuzzy system adaptively determines the necessary changes in the learning rate and momentum in accordance with the fuzzy rules of Table 2. Since the fuzzy system in this version of the model used the Mamdani approach (see Chapter 1, Figure 8), the conclusions of the inference engine are defuzzified to convert them to a crisp form, which is used to adjust the learning and momentum rates if necessary. BP is then used to adjust13 the weight matrix and the process continues.

11 If the activation function is conceptualized as (1 + exp{-zβ/2})⁻¹, β is referred to as the steepness parameter.
12 This representation only shows the training portion of the NN.
4 NNs Generated by GAs
As mentioned at the beginning of this chapter, one of the areas where hybrids of the SC technologies have been adopted for insurance studies has been where GAs have been used to optimize NN architecture and parameters. This is not surprising, since GAs can be viewed as modern successors to Monte Carlo search methods, and software packages now provide this facility. There is, of course, the issue that GAs are not designed to be ergodic and therefore do not cover the space in the most efficient manner, but this is more than offset by the efficiency gain resulting from the parallelization capability (Bonissone 1998 p. 68). Since Montana and Davis (1989) first proposed the use of GAs to train a FFNN with a given topology, a number of ways that GAs can be used to improve the performance of NNs have been identified.14 Options include: developing the architecture, including the number of hidden layers, nodes within the layers, and connectivity; optimizing the weights for a given architecture, thus replacing BP; and selecting the parameters of the NN, such as the learning rate and momentum coefficient (Trafalis 1997 p. 409, Bonissone 1999 p. 7). This section presents an overview of these options.
13 In the system envisioned in Figure 3, the NN uses backpropagation to update the weights. However, the weight matrix could have been updated based on the information from the fuzzy system (Thammano 1999).
14 According to Trafalis (1997 p. 409), most researchers who use GAs use them as supportive combinations to assist NNs.
4.1 Implementing GAs
The general procedure for implementing GAs was exemplified in the insurance company bankruptcy study conducted by Park (1993). An interesting portion of Park's study was his explicit discussion of the use of a GA to suggest the optimal structure for the NN. The basic strategy is depicted in Figure 4, which shows the eight digits generated by the GA: the first three digits are used to set the number of hidden units, the next three digits set the learning rate, and the last two digits determine the momentum.

Figure 4. The genetic code. (An eight-digit binary string (00000000): three digits for the hidden units, three for the learning rate, and two for the momentum.)
In Park's study, the values associated with the binary chromosome were as shown in Table 3.

Table 3. Values associated with the chromosome.
1st 3 digits     000   001   010   011   100   101   110   111
hidden cells       2     3     4     5     6     7     8     9
2nd 3 digits     000   001   010   011   100   101   110   111
learning rate    0.1   0.3   0.5   0.7   0.9   1.1   1.3   1.5
last 2 digits     00    01    10    11
momentum         0.1   0.3   0.5   0.7
Thus, for example, if the generated binary digits are (00100100), the number of hidden units is 3, the learning rate is 0.3, and the momentum is 0.1. The implementation is straightforward. A solution of the GA population, such as (00100100), is used to construct a network, which is then trained by the BP method using a training set of ob-
servations. The network is then tested with a test set of observations and its output is used to determine its fitness. In this hybrid, the fitness function often takes the form (MSE + c)⁻¹, where MSE denotes mean square (output) error and c is a small positive number used to avoid arithmetic overflows. This process is repeated for each solution in the GA population. It is common for studies involving NNs to use BP as the tuning algorithm. While this is an efficient approach where the error surface is convex, it can be problematic where the surface is multimodal because the method may get trapped in a sub-optimal local minimum. This problem can be circumvented by repeated searches with different initial conditions or by perturbing the weights when the search seems stagnated, but a more robust approach would be to use a global search method like GAs. Of course, GAs are not without their shortcomings. While they are very effective at global searches, and can quickly isolate a global minimum, they may be inefficient at actually finding that minimum. This issue can be addressed with a hybrid approach like that of Kitano (1990), whereby the GA is used to isolate the global minimum after which BP is used for the local search. Alternatively, the GA could simply be used to free the BP in the event that it gets stuck in a local minimum (McInerney and Dhawan 1993). Finally, it is worth noting that Bakheet (1995 pp. 114-15) preferred to optimize the NN structure heuristically rather than use the structure suggested by a built-in genetic optimizer, because he felt he could produce a better solution.
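A small Python sketch of Park-style chromosome decoding and fitness evaluation, assuming the mapping of Table 3; the helper names and the small constant c are illustrative, not part of Park's implementation.

```python
HIDDEN = {format(i, "03b"): i + 2 for i in range(8)}                  # 000 -> 2 ... 111 -> 9
LEARNING_RATE = {format(i, "03b"): 0.1 + 0.2 * i for i in range(8)}   # 0.1 ... 1.5
MOMENTUM = {format(i, "02b"): 0.1 + 0.2 * i for i in range(4)}        # 0.1 ... 0.7

def decode(chromosome):
    """Map an 8-bit string to (hidden units, learning rate, momentum) as in Table 3."""
    return (HIDDEN[chromosome[:3]],
            round(LEARNING_RATE[chromosome[3:6]], 1),
            round(MOMENTUM[chromosome[6:]], 1))

def fitness(mse, c=1e-6):
    """Fitness of a trained network, 1/(MSE + c); c avoids overflow when MSE = 0."""
    return 1.0 / (mse + c)

print(decode("00100100"))   # -> (3, 0.3, 0.1), matching the worked example above
```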
5 Fuzzy Inference Systems (FISs) Tuned by GAs
One of the few insurance applications where evolutionary computing was used to develop FL parameters was the study by Bentley (2000), which used genetic programming to evolve FL rules that
classified home insurance claims into "suspicious" and "nonsuspicious" classes. Most other FL studies in insurance used a heuristic approach to develop the parameters. Young (1997), for example, talked about manually perturbing the parameters in a fuzzy inference rate-making model for workers compensation insurance, and Verrall and Yakoubov (1999) used an "ad hoc" approach to determine the number of clusters in a fuzzy clustering application involving policyholder ages. As in the Bentley study, FL studies that used these heuristic approaches may have been facilitated had some form of evolutionary computing been involved. In this section, we focus on the use of GAs to tune fuzzy systems by adapting the fuzzy MFs and/or by facilitating the learning of the fuzzy if-then rules. In both cases, we take advantage of the simple operations of GAs to provide robust methods for the designing and automatic tuning of these fuzzy system parameters. Following Jang et al. (1997 p. 485), the hierarchical structure of the input and output portion of a GA that tunes a FIS can be conceptualized15 as shown in Figure 5.

Figure 5. GA hierarchical representation of a FIS. (The hierarchy runs from the FIS population, to the individual FISs, to their inputs and outputs, to the membership functions, to the MF parameters (a, b, c, d), each parameter encoded as a binary gene.)
Interpreting the figure from the top to the bottom, the GA is based on a population of FISs. Each FIS has inputs and outputs, the fuzzy portions of which are represented by MFs. Here, the MFs are portrayed as trapezoids, since they have four parameters, but they could also be portrayed as triangular or Gaussian.16 Finally, a unit of eight bits called a gene represents each binary encoded parameter. In addition to the input and output portions of the chromosome, there will be a rule base substring and a rule consequence substring, representations of which are shown in Figure 6.17

15 This simple conceptualization needs to be refined in practice because of implementation issues related to the specifics of crossover and mutation operations and structural level adaptations. See Jang et al. (1997: 484-5).
Figure 6. Rule base and rule consequence substrings.
In the figure, the xi's, i = 1,...,N, and y are the input and output variables, respectively, the r's are the rules, and M is the number of rules. The rule base substring, the first substring, which is composed of integer values, encodes the structure of each rule and assigns a unique number to each MF. A zero implies that the input variable is not involved in the rule. The rule consequence substring has a similar interpretation, except that a zero implies that the rule is deleted from the FIS rule base. The inclusion of the option to use zeros allows both the number of input variables involved in each rule and the number of rules to change dynamically during the GA's search, giving added flexibility. Before proceeding we need to define some of the terms endemic to the area, such as scaling factor, termset and ruleset: the scaling factor determines the ranges of values for the state and output variables; the termset defines the MFs associated with the values taken by each state and output variable;18 and the ruleset characterizes a syntactic mapping from a state to an output. The model parameters are the scaling factors and termsets, while the structure of the underlying model is the ruleset. A basic approach to tuning a FIS by GAs is to use GAs to learn the fuzzy set MFs only, with a fixed set of rules set by hand. This was the approach of Karr (1993), who used GAs to modify the MFs in the termsets of the variables. Following a procedure similar to that mentioned above, he used a binary encoding to represent the parameters defining a membership value in each termset and then concatenated the termsets to produce the binary chromosome. A commonly referenced study is Lee and Takagi (1993), who tuned both the rule base and the termsets. In their case, they assumed triangular MFs and used a binary encoding for the associated three-tuples. Following the Takagi-Sugeno-Kang (TSK) rule, under which a first-order polynomial, defined on the state space, is the output of each rule in the ruleset (see the next section), the chromosomes were constructed by concatenating the membership distributions with the polynomial coefficients.

16 Gaussian MFs are represented by bell-shaped functions of the form exp{-(x - c)²/2w²}, where c, the center, is the value in the domain around which the curve is built and w is the width parameter.
17 Adapted from Figure 1 of Liska and Melsheimer (1994 p. 1379) and their discussion.
18 For example, the termset of the training error in Table 2 is small, medium and large.
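As a sketch of the kind of encoding just discussed, the following Python fragment quantizes triangular MF three-tuples into 8-bit genes and decodes a concatenated chromosome back into a termset. The parameter ranges and values are hypothetical, and a real GA implementation would add selection, crossover, mutation and a fitness evaluation of the decoded FIS.

```python
def encode_param(value, lo, hi, bits=8):
    """Quantize a real-valued MF parameter into an 8-bit gene."""
    step = (hi - lo) / (2 ** bits - 1)
    return format(int(round((value - lo) / step)), f"0{bits}b")

def decode_param(gene, lo, hi):
    """Recover the (quantized) parameter value from its gene."""
    return lo + int(gene, 2) * (hi - lo) / (2 ** len(gene) - 1)

def decode_triangular_termset(chromosome, lo, hi, bits=8):
    """Split a chromosome into consecutive (left, center, right) three-tuples,
    one per triangular MF in the termset; sorting repairs any disordered tuples
    produced by the genetic operators."""
    genes = [chromosome[i:i + bits] for i in range(0, len(chromosome), bits)]
    params = [decode_param(g, lo, hi) for g in genes]
    return [tuple(sorted(params[i:i + 3])) for i in range(0, len(params), 3)]

# Hypothetical termset: two triangular MFs on [0, 1], i.e. six 8-bit genes
chrom = "".join(encode_param(v, 0.0, 1.0) for v in [0.0, 0.2, 0.5, 0.4, 0.7, 1.0])
print(decode_triangular_termset(chrom, 0.0, 1.0))
```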
6 FISs Tuned by NNs
Two insurance studies reported using NNs to develop their FL models: Cox (1995) used an unsupervised NN to learn relationships inherent in healthcare claim data, and a supervised approach to automatically generate the fuzzy model from a knowledge of the
decision variables; and Zhao (1996) used a supervised NN to map input data to membership functions in his investigation of maritime collision prevention and liability. This idea of using NNs to develop FL models has been around for some time. Ever since Lee and Lee (1974) first proposed merging FL and NNs using the paradigm of a multi-input multi-output NN, there has been considerable interest in the concept. Many have simply propagated a variation of the original paradigm under the premise that a fuzzy neural system is little more than a multilayered FFNN. Others, however, have interpreted the process as involving fuzzy reasoning and inference facilitated by a connectionist network. The most common approaches (He et al. 1999 p. 52) have been where the NNs tune the MFs, given a defined set of rules, and where the NNs are designed to autonomously generate the rules. The ANFIS (Adaptive Neural Fuzzy Inference Systems) of Jang (1993) was one of the earlier programs in this category and is often cited. The basis of the ANFIS is the TSK fuzzy inference system (Takagi and Sugeno 1983, 1985) in which the conclusion of the fuzzy rule is a weighted linear combination of the crisp inputs rather than a fuzzy set. For a first-order TSK model (Jang et al. 1997 p. 336), a common rule set with n fuzzy if-then rules is:

If x is Ai and y is Bi, then fi = pi x + qi y + ri,  i = 1,..., n,     (3)

where x and y are linguistic input variables, Ai and Bi are the corresponding fuzzy sets, fi is the output of each rule, and pi, qi, and ri are linear parameters. A simple two-input one-output representation is depicted in Figure 7.
19 It is worth noting that there is not an absolute dichotomy between NNs and FL. Li and Chen (1999), for example, demonstrate that FL systems and FFNN are equivalent in essence.
Figure 7. TSK-type fuzzy inference system. (Two rules with firing strengths w1 and w2 and outputs f1 = p1x + q1y + r1 and f2 = p2x + q2y + r2; the overall output is (w1f1 + w2f2)/(w1 + w2) = w̄1f1 + w̄2f2.)
Operationally, the first step is to find the membership grades of the IF parts of the rules, which, in the figure, are represented by the heights of the dashed lines. Then, since the pre-conditions in the IF parts are connected by AND, the firing strength of each rule is found using multiplication. For example, the firing strength for rule 1, w1, is the product of the heights of the top two dashed lines. Given w1 and w2, the overall output is computed as a weighted average of the f's. As noted by Jang (1993) and Mizutani (1997 p. 371), this TSK-type FIS is functionally equivalent to the ANFIS depicted in Figure 8.

Figure 8. Two-input ANFIS with two rules.
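A compact Python sketch of the first-order TSK inference just described, using Gaussian MFs and product AND; the two rules and their parameters are hypothetical and serve only to make the computation concrete.

```python
from math import exp

def gaussian_mf(x, c, w):
    """Bell-shaped MF exp{-(x - c)^2 / 2w^2}; any MF shape would do here."""
    return exp(-((x - c) ** 2) / (2.0 * w ** 2))

def tsk_output(x, y, rules):
    """First-order TSK inference for rules of the form
    'If x is A_i and y is B_i then f_i = p_i x + q_i y + r_i'.
    Each rule is ((cA, wA), (cB, wB), (p, q, r))."""
    firing, outputs = [], []
    for (cA, wA), (cB, wB), (p, q, r) in rules:
        w_i = gaussian_mf(x, cA, wA) * gaussian_mf(y, cB, wB)   # AND via product
        firing.append(w_i)
        outputs.append(p * x + q * y + r)
    total = sum(firing)
    # overall output = weighted average of the rule outputs
    return sum(w_i * f_i for w_i, f_i in zip(firing, outputs)) / total

# Hypothetical two-rule system ("low"/"high" labels on both inputs)
rules = [((0.0, 1.0), (0.0, 1.0), (1.0, 1.0, 0.0)),
         ((2.0, 1.0), (2.0, 1.0), (0.5, -0.5, 1.0))]
print(tsk_output(1.0, 1.5, rules))
```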
In the figure, the square nodes are adaptive nodes and have parameters while the circle nodes are fixed nodes and do not. Also, the links only indicate the flow direction of signals between the nodes. There are no weights associated with the links, unlike with a regular NN. The five layers may be characterized as follows:
Layer 1, labeled as "premise parameters,"20 has adaptive nodes, where the Ai's and Bi's are linguistic labels (such as "low" or "high"). The output of the layer is the membership grade that specifies the degree to which the inputs satisfy the quantifiers.
Layer 2, the fixed nodes of which are denoted "Π," multiplies the incoming signals and sends the product out. Any t-norm operator that performs generalized AND can be used as a node function in this layer.
Layer 3 is a fixed node, denoted "N" for normalization, which calculates the relative weight of the i-th rule's firing strength. Outputs of this layer are called normalized firing strengths and take the value w̄i = wi / Σj wj.
Layer 4, labeled "consequence parameters,"21 applies the first-order TSK fuzzy if-then rules to the output of layer 3. The output of node i of this layer is fi w̄i.
Layer 5 is a single fixed node, denoted "Σ," that computes the overall output as the summation of all incoming signals and is equal to Σi fi w̄i.
Bonissone (1999) characterizes Jang's approach as a "two-stroke" optimization process. During the forward stroke the termsets of the first layer equal their previous iteration value while the coefficients of the fourth layer are computed using a least mean square method. Then, ANFIS computes an output, which, when compared with the related training set value, gives the error. The error gradient information is then used in the backward stroke to modify the fuzzy partitions of the first layer. This process is continued until convergence is attained.

20 The premise also is referred to as the conditional or left-hand side of a rule.
21 The consequence also is referred to as the action or right-hand side of a rule.

A number of studies have extended the basic ANFIS model or a
variation of it. Examples include: Mizutani (1997), who extended the single output ANFIS model to a multiple-output ANFIS with nonlinear fuzzy rules; Juang and Lin (1998), who developed a self-constructing neural fuzzy inference network that was inherently a modified TSK-type fuzzy rule-based model possessing neural network learning ability; and Abdelrahim and Yahagi (2001), who use principal component methodology22 to uncorrelate and remove redundancy from the input space of the ANFIS. An interesting overview can be found in Abraham and Nath (2000).

22 Principal component analysis is a methodology for finding the structure of a cluster located in multidimensional space. Conceptually, it is equivalent to choosing that rotation of the cluster that best depicts its underlying structure.
7 GAs Controlled by FL
Wendt (1995 p. 4), in his report on a GA efficient frontier, advised the reader that "... there is no one "perfect" approach [to developing a GA and] trial and error should be used to determine what works for specific problems." The particular candidates he mentioned for this trial and error were alternate patterns of mating, different mutation levels, and different population sizes. Of course, trial and error is quite time consuming, which is a major limitation of his suggestion. Another major limitation associated with GAs is their potential for premature convergence to an inferior solution. Essentially, GAs are very efficient during the global search in the solution space but tend to bog down when the search becomes localized. The original attempts to overcome this problem were based on intuition and experience and while a number of strategies for choosing the GA parameters evolved (Pearl 1988), they invariably required the parameters to be computed off-line and to be kept static during the algorithm's evolution. What was needed was an adaptive approach. FL is a natural candidate for resolving these problems because it
provides a vehicle for easily translating qualitative knowledge about the problem to be solved into an executable rule set. In the case at hand, FL can be used to provide dynamic control of GA resources (such as population size, selection pressure, and probabilities of crossover and mutation) during the transition from the global search to the local search, and thereby improve the algorithms' performance. Lee and Takagi (1993) exemplify the process of run-time tuning when they discuss the use of a fuzzy system which uses the three input variables shown in Figure 9 to determine the current state of the GA evolution and to produce the three output variables.

Figure 9. Sample input-output variables. (Inputs: average fitness/best fitness, worst fitness/average fitness, and Δ fitness; outputs: Δ population size, Δ crossover rate, and Δ mutation rate.)
They also imposed a relative change limitation on the output variables, which could not change by more than half the current setting, and boundary conditions.23 Thus, for example, as far as the population is concerned, a typical fuzzy control scheme may include rules such as: if (average fitness/best fitness) is large, then population size should increase; or if (worst fitness/average fitness) is small, then population size should decrease. These could be implemented using a matrix of fuzzy rules along the lines of Table 2.
23 Population size, crossover rate and mutation rate were limited to the operational ranges of [10, 160], [0.2, 1.0], and [0.0001, 1.0], respectively.
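The following Python sketch gives a crisp caricature of such run-time control; the thresholds, the adjustment factors and the mutation-rate rule are hypothetical stand-ins for the fuzzy rules, while the output ranges follow footnote 23.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def adapt_ga_parameters(pop_size, crossover, mutation, avg_fit, best_fit, worst_fit):
    """Run-time tuning in the spirit of Lee and Takagi (1993): the ratios of
    average/best and worst/average fitness drive changes in the GA resources.
    Changes are capped at half the current setting and kept within the
    operational ranges of footnote 23."""
    if avg_fit / best_fit > 0.9:            # population converging: diversify
        pop_size = pop_size + min(pop_size // 2, 10)
        mutation = mutation * 1.5
    if worst_fit / avg_fit < 0.1:           # stragglers dominate: shrink population
        pop_size = pop_size - min(pop_size // 2, 10)
    return (clamp(pop_size, 10, 160),
            clamp(crossover, 0.2, 1.0),
            clamp(mutation, 0.0001, 1.0))

print(adapt_ga_parameters(40, 0.8, 0.01, avg_fit=9.5, best_fit=10.0, worst_fit=4.0))
```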
8 Neuro-Fuzzy-Genetic Systems
Given the foregoing discussions, it is not surprising that methodologies have evolved for merging all three of the NNs, FL and GAs technologies. Of course, the added complexity causes new layers of problems. Some of these problems are obvious: there will be an increased number of parameters that need to be tuned and an increase in the time needed to learn rules. The focus in the resolution of these types of issues will be on more efficient searches and more dynamic routines. For example, Gaussian MFs, since they have only two parameters, rather than three (triangular) or four (trapezoidal), can be used to develop more efficient search strings, and dynamic crossover and mutation probability rates can be incorporated. Other problems may only be obvious in retrospect: the augmented design stages may not be independent or if the new design involves a restricted search space, it may be more likely to result in partial or sub-optimal solutions. Here, among other things, simultaneous optimization may have to be implemented. Recent examples of studies that have merged all three of the NNs, FL and GAs technologies are: Duarte and Tome (2000), who developed a Fuzzy Adaptive Learning Control Network, in which GAs are used to find the connections between layers of the NN and the parameter values of the nodes; Kuo et al. (2001), who developed a GA based FNN (GFNN) whereby the GA is used first to obtain a "rough" solution and avoid local minima, and then the FNN is used to fine-tune the results; and Huang et al. (2001), who developed an integrated neural-fuzzy-genetic-algorithm, where the NNs were used to generate fuzzy rules and MFs and the GAs were used to optimize the defuzzification operator parameters.
9 Conclusions
The purpose of this chapter has been to explore ways in which the synergy between NNs, FL and GAs has been exploited and to survey some of the successful combinations that have led to the development of hybrid systems. Table 4 summarizes the types of merging this chapter discussed.

Table 4. Merged technologies.
                   Modified Technology
Modified by:     NN      FL      GA
NN                -       Y       -
FL                Y       -       Y
GA                Y       Y       -
As indicated, with the exception of NN modifying GAs,24 there was a discussion of each of the technologies modifying each of the other technologies. We saw how merging the technologies provided an alternative to a strictly knowledge-driven reasoning system or a purely data-driven one and learned that the merged techniques can provide a more accurate and robust solution than can be derived from any single technique. Moreover, there is relatively little redundancy because typically the methods do not try to solve the same problem in parallel but they do it in a mutually complementary fashion. There are clear avenues for insurance-related studies to explore that involve the synergies between the SC technologies. NN-based insurance studies, for example, often used linguistic-type input variables characterized by crisp MFs and they could be extended

24 This is not to say there are no examples of NN modifying GAs, but, rather, that few such articles were found. Kassicieh et al. (1998), for example, investigated the use of a NN to transform an economic series before a GA processed the series.
using the fuzzy NNs of Buckley and Hayashi (1994). Another enhancement opportunity follows from the observation that, while some researchers used GAs to choose NN parameters, such as the learning rate and momentum coefficient, their approach was static. It would be interesting to compare their results with those obtained using a dynamic fuzzy-rules approach, along the lines of Bonissone (1998: 66). Similarly, there is ample opportunity to extend the FL-based insurance studies. One possibility would be to use GAs to tune the fuzzy systems either by adapting the fuzzy MFs and/or by facilitating the learning of the fuzzy if-then rules. Following Karr (1993), GAs could be used to learn fuzzy set MFs only, with a fixed set of rules set by hand, or in the manner of Liska and Melsheimer (1994), the entire process could be automated. Another area worth exploring is the use of NNs to tune the fuzzy inference systems. The methodology to consider in this regard is the ANFIS of Jang (1993) or one of its derivatives. In addition to the foregoing opportunities to merge GAs with the other SC technologies, another important topic is the tendency of GAs to converge to inferior solutions. It would be interesting to investigate whether a methodology such as that used by Lee and Takagi (1993) could improve the results of the GA-based insurance studies. Many insurance-related problems involve imprecision, uncertainty and partial truths. Since NNs, FL and GAs can help resolve these issues, more researchers are beginning to implement them. As we improve our understanding of the strengths and weaknesses of these technologies and improve the manner by which we leverage their best features, it seems inevitable that they will become an increasingly more important component of our methodology.
Acknowledgments
This work was supported in part by the Robert G. Schwartz Faculty Fellowship and the Smeal Research Grants Program at the Penn State University. The assistance of Asheesh Choudhary, Angela T. Koh, Travis J. Miller, and Laura E. Campbell is gratefully acknowledged.
References
Abdelrahim, E.M. and Yahagi, T. (2001), "A new transformed input-domain ANFIS for highly nonlinear system modeling and prediction," 2001 Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 655-660.
Abraham, A. and Nath, B. (2001), "Hybrid intelligent systems design - a review of a decade of research," Technical Report (5/2000), Gippsland School of Computing and Information Technology, Monash University, Australia.
Bakheet, M.T. (1995), "Contractors' Risk Assessment System (Surety, Insurance, Bonds, Construction, Underwriting)," Ph.D. Dissertation, Georgia Institute of Technology.
Bentley, P.J. (2000), "'Evolutionary, my dear Watson' investigating committee-based evolution of fuzzy rules for the detection of suspicious insurance claims," Proceedings of the Second Genetic and Evolutionary Computation Conference (GECCO), pp. 702-709.
Bonissone, P.P. (1998), "Soft computing applications: the advent of hybrid systems," in Bosacchi, B., Fogel, D.B., and Bezdek, J.C. (eds.), Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation, Proceedings of SPIE, vol. 3455, pp. 63-78.
Bonissone, P.P. (1999), "Hybrid soft computing systems: where are we going," Working Paper, GE Corporate Research and Development.
Brockett, P.L., Cooper, W.W., Golden, L.L., and Xia, X. (1997), "A case study in applying neural networks to predicting insolvency for property and casualty insurers," The Journal of the Operational Research Society 48(12), pp. 1153-1162.
Brockett, P.L., Xia, X., and Derrig, R.A. (1998), "Using Kohonen's self-organizing feature map to uncover automobile bodily injury claims fraud," JRI 65(2), pp. 245-274.
Buckley, J.J. and Feuring, T. (1999), Fuzzy and Neural: Interactions and Applications, Springer-Verlag.
Buckley, J.J. and Hayashi, Y. (1993), "Are regular fuzzy neural nets universal approximators?" Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 721-724.
Buckley, J.J. and Hayashi, Y. (1994), "Fuzzy neural networks," in Yager, R.R. and Zadeh, L.A. (eds.), Fuzzy Sets, Neural Networks, and Soft Computing, Van Nostrand Reinhold, New York, pp. 233-249.
Buckley, J.J. and Hayashi, Y. (1994a), "Hybrid fuzzy neural nets are universal approximators," Proceedings of the Third IEEE Conference on Fuzzy Systems, vol. 1, pp. 238-243.
Cox, E. (1995), "A fuzzy system for detecting anomalous behaviors in healthcare provider claims," in Goonatilake, S. and Treleven, P. (eds.), Intelligent Systems for Finance and Business, John Wiley & Sons, pp. 111-135.
Duarte, C. and Tome, J.A.B. (2000), "An evolutionary strategy for learning in fuzzy neural networks," 19th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), pp. 24-28.
Feuring, T., Buckley, J.J., and Hayashi, Y. (1998), "Adjusting fuzzy weights in fuzzy neural nets," Second International Conference on Knowledge-Based Intelligent Electronic Systems, pp. 402-406.
Hayashi, Y., Buckley, J.J., and Czogala, E. (1992), "Direct fuzzification of neural network and fuzzified delta rule," Proc. 2nd Int. Conf. Fuzzy Logic Neural Networks (IIZUKA '92), Iizuka, Japan, July 17-22, pp. 73-76.
He, L., Wang, K., Jin, H., Li, G., and Gao, X.Z. (1999), "The combination and prospects of neural networks, fuzzy logic and genetic algorithms," 1999 IEEE Midnight-Sun Workshop on Soft Computing Methods in Industrial Applications, Kuusamo, Finland, June 16-18.
Huang, C., Dorsey, R.E., and Boose, M.A. (1994), "Life insurer financial distress prediction: a neural network model," J. Insurance Regulation, Winter 13(2), pp. 131-167.
Huang, Y., Gedeon, T.D., and Wong, P.M. (2001), "An integrated neural-fuzzy-genetic-algorithm using hyper-surface membership functions to predict permeability in petroleum reservoirs," Engineering Applications of Artificial Intelligence 14, pp. 15-21.
Ishibuchi, H., Fujioka, R., and Tanaka, H. (1992), "An architecture of neural networks for input vectors of fuzzy numbers," Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZ-IEEE 92), San Diego, March 8-12, pp. 1293-1300.
Jain, L.C. and Martin, N.M. (eds.) (1999), Fusion of Neural Networks, Fuzzy Sets, and Genetic Algorithms: Industrial Applications, CRC Press, New York.
Jang, J.-S.R. (1993), "ANFIS: adaptive-network-based fuzzy inference systems," IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665-685.
Jang, J.-S.R., Sun, C.-T., and Mizutani, E. (1997), Neuro-Fuzzy and Soft Computing: a Computational Approach to Learning and Machine Intelligence, Prentice Hall, Upper Saddle River, NJ.
Jiao, L.C., Liu, F., Wang, L., and Zhang, Y.N. (1999), "Fuzzy wavelet neural networks: theory and applications," in Chen, G., Ying, M., and Cai, K.-Y. (eds.), Fuzzy Logic and Soft Computing, Kluwer Academic Publishers, Boston.
Juang, C.-F. and Lin, C.-T. (1998), "An online self constructing neural fuzzy inference network and its applications," IEEE Transactions on Fuzzy Systems 6(1), pp. 12-32.
Karr, C.L. (1993), "Fuzzy control of pH using genetic algorithms," IEEE Transactions on Fuzzy Systems 1, pp. 46-53.
Kassicieh, S.K., Paez, T.L., and Vora, G. (1998), "Data transformation methods for genetic-algorithm-based investment decisions," Proceedings of the 31st Hawaii International Conference on System Sciences, 1060-3425/98 IEEE, pp. 122-127.
Kitano, H. (1990), "Empirical studies on the speed of convergence of neural network training using genetic algorithms," Proc. Eighth National Conf. on Artificial Intelligence (AAAI-90), vol. 2, pp. 789-795.
Kuo, R.J., Chen, Y.T., Cohen, P.H., and Kumara, S. (1993), "Fast convergence of error back propagation algorithm through fuzzy modeling," Intelligent Engineering Systems through Artificial Neural Networks, pp. 239-244.
Kuo, R.J., Chen, C.H., and Hwang, Y.C. (2001), "An intelligent stock trading decision support system through integration of genetic algorithm based fuzzy neural network and artificial neural network," Fuzzy Sets and Systems, vol. 118, issue 1, pp. 21-45.
Lee, M.A. and Takagi, H. (1993), "Dynamic control of genetic algorithm using fuzzy logic techniques," in Forrest, S. (ed.), Proceedings of the Fifth International Conference on Genetic Algorithms, Morgan Kaufmann, pp. 76-83.
Lee, S.C. and Lee, E.T. (1974), "Fuzzy sets and neural networks," J. Cybernet. 4, pp. 83-103.
Li, H. and Chen, C.L.P. (1999), "The equivalence between fuzzy logic systems and feed-forward neural networks," in Dagli, C.H., Buczak, A.L., Ghosh, J., Embrechts, M.J., and Ersoy, O. (eds.), Smart Engineering System Design, ASME Press, New York, pp. 535-540.
Liska, J. and Melsheimer, S.S. (1994), "Complete design of fuzzy logic systems using genetic algorithms," Proceedings of the Third IEEE International Conference on Fuzzy Systems, pp. 1377-1382.
McCulloch, W.S. and Pitts, W. (1943), "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical
428
A. F. Shapiro
Biophysics 5, pp. 115-133. Mclnerney, M. and Dhawan, A.P. (1993), "Use of genetic algorithms with backpropagation in training of feedforward neural networks," Proc. of 1993 IEEE Int. Conf. on Neural Networks. Mizutani, E. (1997), "Coactive neuro-fuzzy modeling: toward generalized ANFIS," in Jang, J.-S.R., Sun, C.-T., and Mizutani, E. (eds.), Neuro-fuzzy and Soft Computing: a Computational Approach to Learning and Machine Intelligence, Prentice Hall, Upper Saddle River, NJ, pp. 369-400. Montana, D.J. and Davis, L. (1989), "Training feedforward neural networks using genetic algorithms," Proc. of the Eleventh Int. Joint Conf. on Artificial Intelligence (IJCAI89), vol. 1, pp. 762767. Park, J. (1993), "Bankruptcy Prediction of Banks and Insurance Companies: an Approach Using Inductive Methods," Ph.D. Dissertation, University of Texas at Austin. Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan-Kaufmann, San Mateo, CA. Shapiro, A.F. (2000), "A hitchhiker's guide to the techniques of adaptive nonlinear models," Insurance: Mathematics and Economics 26, pp. 119-132. Shapiro, A.F. (2002), "The merging of neural networks, fuzzy logic, and genetic algorithms," Insurance: Mathematics and Economics 31, pp. 115-131. Shapiro, A.F. and Gorman, R.P. (2000), "Implementing adaptive nonlinear models," Insurance: Mathematics and Economics 26, pp. 289-307. Shukla, K.K. (2000), "Soft computing paradigms for artificial vision," in Sinha, N.K., and Gupta, M.M. (eds.), Soft Computing and Intelligent Systems: Theory and Applications, Academic Press, San Diego, pp. 405-417. Takagi, T. and Sugeno, M. (1983), "Derivation of fuzzy control rules from human operator's control actions," Proc. IF AC Symp.
Merging Soft Computing Technologies in Insurance
429
Fuzzy Inform., Knowledge Representation and Decision Analysis, July, pp. 55-60. Takagi, T. and Sugeno, M. (1985), "Fuzzy identification of systems and its applications to modeling and control," IEEE Transactions on Systems, Man, and Cybernetics, 15(1), pp. 116-132. Thammano, A. (1999), "Neural-fuzzy model for stock market prediction," in Dagli, C.H., Buczak, A.L., Ghosh, J., Embrechts, M.J., and Ersoy O. (eds.), Smart Engineering System Design, ASME Press, New York, pp. 587-591. Trafalis, T.B. (1997), "Genetic algorithms in neural network training and applications to breast cancer diagnosis," Smart Engineering Systems: Neural Networks, Fuzzy Logic, Data Mining, and Evolutionary Programming, Proceedings of ANNIE'97, vol. 7, pp. 409-414. Verrall, R.J. and Yakoubov, Y.H. (1999), "A fuzzy approach to grouping by policyholder age in general insurance," Journal of Actuarial Practice (7), pp. 181-203. Wendt, R.Q. (1995), "Build your own GA efficient frontier," Risks and Rewards, December: 1, 4-5. Yamakawa, T. and Furukawa, M. (1992), "A design algorithm of membership functions for a fuzzy neuron using example-based learning," Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZIEEE 92), San Diego, March 8-12, pp. 943-948. Yoo, J-H, Kang, B-H., and Choi, J-U. (1994), "A hybrid approach to auto-insurance claim processing system," IEEE, pp. 537-542. Young, V.R. (1997), "Adjusting indicated insurance rates: fuzzy rules that consider both experience and auxiliary data," Proceedings of the Casualty Actuarial Society 84, pp. 734-765. Zadeh, L.A. (1992), Foreword of the Proceedings of the Second International Conference on Fuzzy Logic and Neural Networks, Iizuka, Japan, pp. xiii-xiv. Zadeh, L.A. (1994), "The role of fuzzy logic in modeling, identification and control," Modeling Identification and Control, 15(3), p. 191.
430
A. F. Shapiro
Zhao, J. (1996), "Maritime Collision and Liability," Ph.D. Dissertation, University of Southampton. Zimmermann, H.J. (1996), Fuzzy Set Theory and Its Applications, 3rd ed., Kluwer Academic Publishers, Boston, MA.
Part 2 Other Computational Techniques
This page is intentionally left blank
Property and Casualty
This page is intentionally left blank
Chapter 12 Robustness in Bayesian Models for Bonus-Malus Systems E. Gomez-Deniz and F.J. Vazquez-Polo In implementing Bayesian models for a Bonus-Malus System (BMS) it is normal to include a parametric structure, 7r0(A), in the insurer's portfolio. According to Bayesian sensitivity analysis, the structure function can be modelled by specifying a class F of priors instead of a single prior. In this chapter, we examine the ranges of the relativities, i.e., 6* = E [A7r(A | data)] /E [A7r(A)], 7r G T. We combine standard and robust Bayesian tools to show how the choice of the prior can critically affect the relative premiums. Extending the recent paper of Gomez et al. (2002b), we develop our model to the Negative Binomial-Pareto model (Meng and Whitmore 1999) and the Poisson-Inverse Gaussian model (Tremblay 1992, Besson and Partrat 1992) and also extend the class of prior densities to ones that are more realistic in the actuarial setting, i.e., the classes of generalized moments conditions. We illustrate our method with data from Lemaire (1979). The chapter is mainly focused on showing that there exists an appropriate methodology to develop a Bayesian sensitivity analysis of the bonus-malus of loaded premiums. Keywords: Bonus-Malus, Bayesian robustness, classes of priors, econtamination class, generalized moments conditions.
1
Introduction
This chapter deals with the robustness of the bonus-malus system (BMS) premiums that are generally used in automobile insurance. It 435
436
E. Gomez-Deniz and F. J. Vazquez-Polo
is usual in BMS for premiums to be based only on the distribution of the number of claims, irrespective of their size. In implementing Bayesian models for a BMS it is normal to include a structure function (prior distribution), 7r0(A), in the insurer's portfolio. Nevertheless, the problem of which prior distribution to adopt for the Bayesian model is not easily resolved, since the values of A are not directly observable. According to Bayesian robustness analysis, the structure function can be modelled by specifying a class F of priors instead of a single prior. In this chapter, we examine the ranges of the relativities, i.e., E[A7r(A|data)] E[ATT(A)]
'
K)
as 7r varies over an e—contamination class Te = {7r(A) = (1 — £•) 7r0(A) + eq(X) | q € Q}, where e reflects the amount of probabilistic uncertainty in a base prior 7r0 and Q is a class of allowable contaminations. For Q* = {All probability distributions}, and Q** = {All unimodal distributions} we determine the range of 8n as 7r varies over r e . The variational problems involved are reduced to those of finding the extremes of the functions of one variable to be solved numerically and the explicit solutions of {inf 8*; IT e T} and {sup 8*; TT e T} to be obtained. Expression (1) is usually used for computing bonus-malus premiums and we can see how the choice of the prior can critically affect the relative premiums. Extending the recent paper of Gomez et al. (2002b), we develop our model to the negative binomial-Pareto model (Meng and Whitmore 1999) and the Poisson-inverse Gaussian model (Tremblay 1992, Besson and Partrat 1992) and also extend the class of prior densities to ones that are more realistic in an actuarial setting, i.e., the classes of generalized moments conditions. We illustrate our method with data from Lemaire (1979). The remainder of the chapter is structured as follows: to make the chapter self-contained, the classes of distributions and the Bayesian
437
Robustness in Bayesian Models for Bonus-Malus Systems
methodology that are used in the following are introduced in the second section. In section 3 we derive standard Bayesian bonus-malus premiums. Section 4 provides technical results for the problem. Upper and lower bounds for each premium principle are obtained, and transition rules are used in order to construct the bonus-malus premium table. Section 5 contains some numeric illustrations. Finally, in Section 6, some remarks and comments on related works are presented.
2
The Models
In this section all the probability distributions used below are presented. The random variables considered are assumed to be defined on a fixed probability space (£2, A, J7). For a given random variable X, the probability or density function is denoted, respectively, by p or / depending on its discrete or continuous character. E[X] is the expected value of X, and T(a) denotes the usual gamma function at a > 0. The pair likelihood and structure function chosen is termed "model".
2.1
The Likelihood
(a) Poisson distribution, Vo{\): P{x\\)
=
e
—^-, x\
i = 0,l,...
A>0.
(2)
E [X] = A. (b) Negative Binomial distribution, J\fB(r, A), r > 0, A > 0:
(3) E [X] = A.
438
2.2
E. Gomez-Deniz and F. J. Vazquez-Polo
The Structure Functions
(A) Generalized Pareto distribution, QV(a, b, v), a > 0, b > 0, v > 0: xv~l
ab
f(x | a, b, v) = B(v,b)(a ^ - — •+—xy -^j, n/ where B(v,b) =
=
(4)
rfr)r(&) r(v + b)'
E[X} = av 6-1' (B) Generalized Inverse Gaussian, QlQ(v,^',P'), 0, /3' > 0: '<* ' " ' " ' ' ^
x>0,
2 A ^ ^
V
~
l c
v e M.,/i' >
~ * ~ ^ '
* > °« (5)
where Kv(u) = \ l°°xv-le-%(x+*)dx 2 Jo is the modified type 3 Bessel function with index v. For v = —1/2, QXQ{\/2, fi', j3') is called the inverse Gaussian distribution, lQ(fi, (5). Its probability density function is given, therefore, by
h-irt*\K0)=2li-.1/2K_i/2{fi/py
(6)
The means of the QTQ{v, /J,', (3') and the TQ(ii, 0) are, respectively,
and E LY1 - a Kl'2
(/x//3)
- a
hence, it is simple to prove that Kv{u) = K-V(u),
\/v € R.
439
Robustness in Bayesian Models for Bonus-Malus Systems
2.3
Conjugate Posterior Structure Functions
Under the Bayesian point of view adopted in this chapter, the contributions of both the observational data (in the form of the likelihood function) and prior opinion or information (in the form of the structure function) are essential. The Bayes' theorem then acts as the formal instrument that transforms the observed data y, the initial information about a parameter of interest, 0, into the final opinion by means of the rule:
hf(y\e)*(e)deIn the BMS setting, and considering models for the frequency component, the portfolio is usually considered heterogeneous, with all policyholders having constant but unequal underlying risks of having an accident. Then, if ki, fc2,..., kt is a i.i.d. random sample from a distribution with a probability density function f(k \ 6) the posterior probability density function of 6 given fci, fc2,.-., fet, corresponding to the prior probability density function ir(0), is
SeI\Lif(h\eM9)d0 (?)
ocn/(fci|0M0) which results from applying the Bayes' theorem. 2.3.1 Negative Binomial-Generalized Pareto Model
The number of claims k, given the parameter A, is considered to be distributed according to BM (r,
P(k\r,\)=
u
—T
—T
, fc = 0,l,...
440
E. Gomez-Deniz and F. J. Vazquez-Polo
The parameter A > 0, denotes the different underlying risk of each policyholder making a claim. Assume for the structure function that A is distributed according to a Generalized Pareto distribution, QV(r*, (*, s*), where £* = s(, r* = r and s* = sr + 1. Therefore, 7T0(A) =
r « ) r ( s r + 1) (r + A) s C + s r + 1 '
With this probability density function E[A]= 1 J
r <
sr+ 1-1
= CS-
If k\, k2,..., kt is a random sample with a negative binomial probability denoting the number of claims made by the policyholder in the years i = 1,..., t, respectively; then the posterior structure function, according to Bayes' Theorem, is given by /
\s<-l
Mx\h,,k2,...,kt)«{r
r
\
\tr/
+ x)s(+sr+1(-T-x)
\Hki
[—)
OC
which is a Generalized Pareto distribution QV(r, £', s'), where (' = ^andS' =S+ t Now, it is straightforward to prove that
E[A I ki,k2,..., kt] =
J2ki~kt-
—, k = s
+ *
i=i
2.3.2 Poisson-Generalized Inverse Gaussian Model Suppose now that k is the objective random variable with a Poisson probability function with parameter A > 0, such that: p(k\X) = ^ - ,
A; = 0,1,...
Robustness in Bayesian Models for Bonus-Malus Systems
441
Assume that the parameter A is unknown but you believe it to be distributed according to a Generalized Inverse Gaussian as in (6); then the posterior distribution is given by A
7r 0 (A|t 1 ,,*; 2 ,...,fc l )ccA"-3- 1 e
WW
-£
This is a Generalized Inverse Gaussian with parameters v,fj/,(3', where v = y£ki-o = kt-= k--, k = kt, (8) (9) (10)
1 + 2/3* Therefore,
E\wk ko... k]-u E[X\kuk2,...,kt\-^i
3
l + 2pt
^±iMn Kv{fx>/(3/)
•
Premium Calculation in a Bonus-Malus System
This section considers bonus-malus system (BMS) premiums in automobile insurance. In general, BMS are based on the distribution of the number of accidents (or claims) K, irrespective of their size. The methodology of a BMS consists of ensuring that the premium increases with the number of claims and decreases with the period t in which the policyholder does not make a claim. This can be achieved by dividing a posterior expectation by a prior expectation according to an estimate derived by means of an appropriate loss function. The use of standard Bayesian analysis in credibility theory (viewed as the experience-rating process using a statistical decisionmaking model based on loss functions) has been considered in several actuarial applications (Biihlmann 1970, Heilmann 1989, Makov
442
E. Gomez-Deniz and F. J. Vazquez-Polo
1995, among others). In this chapter we consider the case in which the distribution of K is specified up to an unknown parameter, and where ratemaking incorporates individual claim experience. Now, therefore, we consider the case where a risk K within a given collective of risks is characterized by an unknown parameter A which is thought to be a realization of a parameter space A. Using the notation described in Section 2, the Bayes premium is defined (Heilmann 1989) to be the real number V(ki,k2,...,kt) minimizing the posterior expected loss E7r0(A|fe1,fc2,...,fct)[^(^(A),'P(A;i,A;2,...)A;t)], i.e., the posterior expected loss sustained by a practitioner who takes the action V(ki,k2,..., kt) instead of the 'P(A), the risk premium1, of which is unknown. Let the quadratic loss function be L [P(X),V(h,
k2,...,h)]
= (V(X) - V(k1}
k2,...,kt))2,
then V{k1,k2,...,kt)=
/
V(\)-K0(\\ki,k2,...,kt)d\,
is the Bayes net premium. Using the Bayes net premium, Lemaire (1979) introduces the following formula to be applied in a BMS
this is the ratio between the posterior expectation and the prior expectation multiplied by 100, where V££ (k) denotes the bonus-malus premium to be loaded when we use the prior 7r0, with observed sample result k. Observe that the premium the policyholder has to pay if the initial premium (t = 0, k = 0) is 100. 'Usually V{\) is obtained by minimizing %(fc|A) [£(&, P(A))] using the same loss function that we use to obtain the Bayes premium.
443
Robustness in Bayesian Models for Bonus-Malus Systems
Now, using (11) in the models described in Subsections 2.3.1 and 2.3.2, we obtain the Bayes bonus-malus premium by V2 (k) =
1 0
0 ^ < A ' ^ " ^ = 100 * ± * Eno(X) (s + t)C
(12)
and V* (k) = 100*"° (A ' ku **' "' kt) = 1 0 0 ^
+ 1
^/f3>)
(13)
respectively. Our approach in this chapter is based on the assumption that the practitioner is unwilling or unable to choose a functional form of the structure function, ir, but that he may be able to restrict the possible prior to a class that is suitable for quantifying the actuary's uncertainty. Therefore it is of interest to study how the premium for priors in such a class behaves. We use the classical e—contamination class of priors, T£ =
{TT(0)
= (1 - e) 7TO(0) + eq{9) \ q e Q},
where e reflects the amount of probabilistic uncertainty in a base prior 7T0 and Q is a class of allowable contaminations (Berger 1985, Gomez et al. 2000, 2002a and 2002b, Rios and Ruggeri 2000, Sivaganesan and Berger 1989, Betro et al. 1994, Goutis 1994, among others). These classes have been used in several situations to measure the sensitivity of quantities which can be expressed in terms of the posterior expectation of parameter functions. Nevertheless, when a relativity as in (11) is used, few papers have treated the problem (Gomez et al. 2002b). We present basic results to study the range of posterior ratio quantities in the form S* = (/ A ^( A ) 7 r ( A I data)dA) / ^g(X)Tr(X)d\^
,
444
E. Gomez-Deniz and F. J. Vazquez-Polo
as 7r varies over an Ts. For Q* = {All probability distributions}, and Q** = {All unimodal distributions} we determine the range of 5* as 7r varies over i \ . The variational problems involved are reduced to finding the extremes of the functions of one variable to be solved numerically and the explicit solutions of {inf Sn; re € Te} and {sup 5T; TT € T£} to be obtained. These technical results are shown in the next section.
4
Technical Results
Let K be a random variable with probability function f(k | A); given a sample k = (ki,k2,...,kt) of K, the experimental evidence about A is expressed by the likelihood function f(k \ A) = 4(A). Let Q be the space of the probability measures in the parameter space (A, J7). Let us define
Q = \q : f Hi(\)q(\)d\
< au
i=
l,2,...,n\
(14)
where Hi are given g-integrable functions and en,, i = 1 , . . . , n are fixed real numbers. Let §1 = f g(X)q(\ | k)d\ = JA
IAg(X)lk(X)g(X)d\ fAlk(X)q(X)dX
be an a posteriori functional, where g is a given function on A such that 5g is well defined. The conditions in (14) are called generalized moment conditions and our aim is to find the range of Sn when n runs over T. Theorem 1 (Winkler 1988). Let (A, J7) be a measurable space and suppose that Q is a simplex of probability measures whose extreme points are Dirac measures S\, i.e., the zero-one function taking value 1 in A, and 0, elsewhere. Fix measurable functions Hi, H2, •••, Ht and real numbers ai,a2, ...,at. Consider the set Q as defined in (14). Then
Robustness in Bayesian Models for Bonus-Malus Systems
445
(a) Q is convex and the set ex Q of its extreme points satisfies (
m
ex Q C Q' = \q e Q : q = J><5 Ai ,*i > 0,
^ *i = 1, Aj 6 A, 1 < m < n + 1 i=i
Furthermore, the vectors (Hi(Xi),..., are linearly independent.
Ht{\))',
1 < i < m
(b) If the moment conditions in Q are given by equalities, then equality of sets holds in (a). Lemma 1 (Sivaganesan and Berger 1989). Let ^ be the convex hull of a set $ / of probability measures on the real line given by $ / = {//T, r € / } , where / C Rfc is an index set. Assume that / and g are both real-valued functions on R, F-integrable for any F 6 i Assume that B + g(x) > 0 for a constant. Then A + Jf(x)dF(x) _ Fey B + / g(x)dF{x)
u TeI
A + Jf{x)fir(dx) B + / g(x)/iT(dx)'
The same result holds with sup replaced by inf. Lemma 2 (Sivaganesan and Berger 1989). For unimodal q and g such that / A g(\)lk(X) \ \)q(\)d\ < oo, then / g{\)lk(\)
| X)q(X)d\ = f"'
JA
JO
HS(z)dF(z),
where F is some distribution function and ' 9
H (z) = <
1 /-Ao+z
/
g{X)lk{X)\X)dX, g(X0)k(Xo),
where A0 is the mode value of q.
ifz^O, if2 = 0,
E. Gomez-Deniz and F. J. Vazquez-Polo
446
Theorem 2. If the moment conditions in (14) are given by equalities and the following systems of equations
m+l
^
HiiX^Pj = oii, Vi,i = 1 , . . . ,n,pj G [0,1],
(16)
where Aj = Ai for some j , j = 1 , . . . , m + 1, has a solution, with Ax the value where the suprema of the function (1 - e)p(k | TTO)^ EILi JM + £9(X) S?=i «*
(l-e)p(k\irQ)Tlsim
fc(A)
+
£
^Li
,„,
Oii
is reached, then sup
A
The same result holds with sup replaced by inf. Details of the proof of this result are in appendix A. However, such a class might be too large, containing unreasonable priors such as those giving a positive mass at an isolated point. Since the mode is a very intuitive statistical concept, the actuary who has good statistical training should have no problem in assessing the unimodality of the risk parameter and its numerical value, based on historical data. In fact, one way of expressing uncertainty about the prior is by specifying some moments of the parameter. The following corollary is a consequence of applying Theorem 2 and Lemma 2. The following result gives us a sufficient condition under which we can compute the range of 8* when n G T£ and Q = Q**, where
Q** = { : J^Hi(X)q(X)dX
= J^-Hi(X)Tr0(X)dX = aj ,
%
i,...,
n,
Robustness in Bayesian Models for Bonus—Malus Systems
447
and mode of q(X \ k) = mode of 7r0(A | k) = A0} ,
(18)
Corollary 1. If the systems of equations
yE- = 1 («,-^)) A 3n _n
^ ) "
h
m+l
2
n z
i( i)Pj
= "i, Vi, i = 1 , . . . , n,pj G [0,1],
i=i
where ^ = ^ for some j,j = 1 , . . . , m + 1, has a solution, with z\ > 0 the value where the suprema of the function /AAo°+'((l-£)p(fcko)E:=1 ^ + £ z E r = l a t ) d A
2 > 0
'
2 = 0.
is reached, then sup 5n = sup (p(z) =
4.1
Transition Rules in a Robustness Model
For the above models proposed in section 2, the bounds of V£M (kj over the classes of priors considered above, are obtained by calculating the extrema of functions (p{k,t,X,£) = 100
— ip2(k,£,n0)\
, lk{\)+e(
448
E. Gomez-Deniz
t¥>i(fc,e,To)/10Q.L
100 tp(k,t, z,e) = <
„
J^+z
x/ik(x)dX+e\
." A 0+ , Ao+2
V2(fc,£,7ro)i/ A
and F. J.
J*°+z JAn
_;
A// fc (A)dA+£C
i nnyi(fc,e,7ro)/100Ao// fc (Ao)+£Ao 1UU VP2(fc,£,7ro)A0/Zfc(A0)+£C '
Vazquez-Polo
\d\
—, _ ~
z >o r, '
for Q* and Q**, respectively, where yi(fc,e,7r0) = (1 - e)p(fe |
¥?i(fc,e,7r0)
<^2(A;,e,7ro) 4(A) oc ( ) f | , \r + XJ \r + XJ Zfc(A) oc e~XtXk,
TTO)^®,
(Neg. Bin. Gen. Pareto model),
(Poisson Gen. Inv. Gaussian model),
and roo
p(k | 7r0) = / lk(X)'Ko(\)dX. Jo We denote the lower and upper bounds obtained from the above expressions by lb and ub for each contamination class considered. Thus, for each e (fixed), we denote lbkt=
A
inf f(k,t, X (or z),e),ubkt = sup ip(k, t, X (or z),e). (° r ) z ' A (or) 2
BMS with a finite number of classes are characterized by growing premium percentages, and then the movements of the insurance takers between the classes are given by transition rules depending on the number of claims made during one year. Therefore, the analytical bounds derived by Theorem 2 and Corollary 1 must be considered jointly with these BMS restrictions in order to ensure that the bounds derived from a Bayesian BMS are fair (i.e., coefficients for lower and upper bounds for k greater than zero is less than one). These comments must be considered in the construction of bonus-malus tables for the bounds of loaded premiums and to ensure the increase of the
Robustness in Bayesian Models for Bonus-Malus Systems
449
bounds of the premiums with respect to k and their decrease with respect to t (k fixed). For a bonus-malus table, two typical movements are presented from an entry (k, t) : 1. (k, t) —> (k + l,t + 1), then the upper and lower bounds are: UBk+i,t+i = ubk,t V UBkj,
LBk+i,t+i = lbk,t V LBkjt (19)
2. (fc, t) —> (k, t + 1), then the upper and lower bounds are: UBu+l
= vbkit A UBKU
LBkit+i = lbk,t A LBktt
(20)
where x V (A)y represents max(min){x,?/} for real numbers x, y, and UB, LB denote the upper and lower bounds, respectively. With these transition rules, applied in a robust Bayesian setting, we obtain the following chain of inequalities: (LBk,t+i, UBktt+i) < (LBk>t,UBkt)
< (LBk+itt+i,UBk+i,t+i),
where (a, b) < (c, d) <=$• a < c, b < d, in a form similar to that of standard Bayesian analysis:
7>£(M +1) < KM°(M) < VZ{k + i,t + l).
5
Illustrations
In order to illustrate the above ideas, an example is considered in this section. The Negative Binomial-Generalized Pareto (Model 1) and the Poisson-Generalized Inverse Gaussian (Model 2) developed in this chapter are used with data from Lemaire (1979). The fitted distributions for both models are shown in Table 1. The frequency represents the number of policyholders in the portfolio from whom the insurer received k claims.
450
E. Gomez-Deniz and F. J. Vazquez-Polo Table 1. Observed and fitted numbers of claims.
Claim number, k Observed freq. Fitted freq. Fitted freq. (Model 1) (Model 2) 0 1 2 3 4 More than 4 Total
96978 9240 704 43 0 0 106974
96980.0 9235.9 702.1 51.80 3.90 0.30 106974
96974.97 9245.10 696.77 52.60 4.56 0 106974
Under the assumptions of Model 1, the claim distribution in Table 1 corresponds to the marginal distribution p(k) =
p(k\ A)7r0(A)dA Jo T(s( + sr + l ) r ( r + sr + l ) r « + k)T(r + k) r(s()r(sr + l)r(r)r(r + sr + s( + k + l)k\ '
that Meng and Withmore (1999) named as a negative binomialPareto distribution. This distribution was used to estimate the parameters C, T and s in Model 1. The parameters were estimated by maximum likelihood as C = 0.1011, r = 3.736 and s = 36.93. In the Poisson-Inverse Gaussian model we use the probability generating function P(z)
-0+2/3(1-2)
obtained as the Poisson mixed over the Inverse Gaussian. Tremblay (1992) called this model Poisson Inverse Gaussian. Since, from the data of Lemaire (1979) the mean and the variance are 0.1011 and 0.1074, respectively, we can use the moments method to estimate the parameters /J, = 0.1011 and /3 = 0.0623. Therefore, from expressions (9) and (10), we obtain
Robustness in Bayesian Models for Bonus-Malus Systems
_
451
0.0623
0.1246*+ 1 From (12) and (13), the relative premiums are shown in Tables 2 and 3 (note that these premiums are those obtained from standard Bayesian analysis, corresponding to e = 0 in our robust Bayesian scenario). Table 2. Bonus-Malus premium (Model 1). £= 0 t k 0 0 100 1 97.363 2 94.862 3 92.486 4 90.227 5 88.075
1 123.441 120.270 117.258 114.393 111.665
149.519 145.678 142.030 138.559 135.255
175.596 171.086 166.801 162.726 158.845
201.674 196.493 191.572 186.892 182.435
227.751 221.901 216.344 211.058 206.024
253.829 247.308 241.115 235.224 229.614
Table 3. Bonus-Malus premium (Model 2). £= 0
t k 0 0 100 1 94.297 2 89.471 3 85.317 4 81.693 5 78.494
1
2
3
4
5
6
149.092 138.801 130.173 122.819 116.463
224.025 205.661 190.484 177.714 166.809
313.666 285.570 262.490 243.180 226.777
411.912 373.337 341.718 315.321 292.946
514.740 465.406 424.999 391.293 362.745
620.017 559.823 510.535 469.434 434.634
Using Theorem 2 and Corollary 1 and expressions (19) and (20), we obtained the variation range of the relative premium. As a measure of robustness (or lack of robustness), we use a magnitude that does not depend on the premium measurement units. The sensitivity of the Bayesian bonus-malus premium is analyzed by the factor of relative sensitivity (RS) introduced by Sivaganesan (1991), and which is given by RS
=
Range of P*M (k) * J^-W 2KM° (k)
x 100%_
452
E. Gomez-Deniz and F. J. Vazquez-Polo
Table 4. Range (infimum, supremum and RS factor) with the mean incorporated (Model 1). 0.1 t k 0 1 2 4 3 5 6 0 100 1
109.23 144.39 171.95 196.31 220.78 0 244.18 103.55 159.19 441.64 1998.45 6550.37 15776.50 32721.10 53.17 20.23 99.40 520.37 1575.32 3415.07 6397.40
2
104.88 140.41 166.48 0 99.890 139.27 246.90 688.36 52.65 14.29 36.55 152.51
191.33 1998.45 459.84
215.35 6550.37 1427.44
238.40 15776.50 3141.44
3
0 96.85 52.35
100.64 136.60 162.22 130.27 195.21 396.04 12.63 20.63 70.08
186.58 975.43 205.88
210.14 2069.14 429.63
232.84 6550.37 1310.06
4
0 94.14 52.17
96.51 132.95 158.15 124.34 172.26 288.30 12.16 14.18 39.99
182.03 602.08 112.37
205.16 1234.33 243.81
227.49 2155.42 409.80
5
0 91.67 52.04
92.49 129.45 154.26 119.76 159.03 237.15 12.20 10.93 26.09
177.69 427.91 68.57
200.38 826.73 152.00
222.35 1440.00 265.15
0
100
0.2
1
235.89 100.00 138.81 165.97 190.82 214.13 0 110.94 200.71 717.97 3000.18 8670.87 19759.90 39980.40 56.97 40.79 193.67 807.02 2102.41 4291.03 7828.99
2
90.41 134.69 161.54 186.00 0 105.82 161.23 349.14 1037.51 3000.18 55.77 29.44 73.60 256.00 716.10
3
208.99 8670.87 1906.67
230.55 19759.90 3948.38
0 85.49 130.73 157.29 101.96 145.27 250.88 576.17 55.12 25.48 42.29 125.56
204.05 181.38 1340.95 3000.180 646.22 302.64
225.38 8670.87 1751.34
4
0 98.70 54.69
80.74 126.92 135.79 208.31 24.06 29.36
153.23 397.08 74.92
176.95 836.85 176.54
199.30 1577.25 326.43
220.39 3000.18 590.88
5
0 95.83 54.40
76.15 123.24 149.34 129.07 184.82 309.46 23.69 22.76 50.40
172.71 589.21 114.15
194.73 1077.96 214.35
215.57 1749.45 334.01
Robustness in Bayesian Models for Bonus-Malus Systems
453
Table 5. Range (infimum, supremum and RS factor) with the mean incorporated (Model 2). £ = 0.1 t k 0 1 100 0 131.943 214.964 300.808 389.428 474.249 550.889 100.441 188.194 564.253 2240.860 5582.290 9613.050 13888.600 53.25 18.86 77.95 309.25 630.33 887.71 1075.59 0 94.534 52.82
120.663 196.969 274.313 355.064 434.092 507.498 158.644 308.179 754.399 2240.860 5582.290 9613.050 13.68 27.03 84.05 252.55 553.08 813.25
0 89.779 52.61
111.051 182.013 252.337 143.680 242.601 457.498 12.53 15.90 39.07
326.284 399.798 923.504 2240.860 87.38 216.59
0 85.762 52.49
102.720 169.365 233.818 133.205 210.662 348.374 12.41 11.61 23.55
301.888 609.594 48.79
370.331 436.353 1049.610 2240.860 86.79 192.20
0 82.285 52.41
95.398 158.511 217.997 125.005 190.280 292.884 12.71 9.52 16.51 e = 0.2
280.975 464.542 31.33
344.828 744.046 55.02
469.563 5582.290 500.72
407.197 1138.650 84.14
100 0 115.356 205.337 288.051 370.064 446.120 514.098 107.754 233.162 864.512 3058.160 6599.010 8552.730 14944.400 57.13 39.50 147.12 441.56 756.10 787.44 1163.70 0 103.721 187.742 262.940 338.538 410.268 100.485 181.438 410.024 1063.770 3058.160 6599.010 56.15 27.99 54.04 140.21 364.23 664.87
475.708 8552.730 721.39
0 94.976 55.66
93.823 173.050 241.961 159.170 296.769 613.888 25.10 32.47 70.84
311.842 379.313 441.998 1217.470 3058.160 6599.010 132.51 315.15 602.99
0 90.466 55.36
85.276 160.575 224.181 145.091 245.708 441.707 24.35 23.95 44.72
289.010 352.437 796.268 1316.870 80.43 123.23
412.278 3058.160 281.81
0 86.637 55.18
77.807 149.828 208.920 134.758 215.588 355.106 24.45 19.71 32.23
269.296 590.232 54.77
385.975 1378.960 114.23
328.962 938.715 84.04
454
E. Gomez-Deniz and F. J. Vazquez-Polo
Tables 4 and 5 show the range of the BM premium for various degrees of certainty in the base prior 7r0 (e = 0.1 and 0.2). In these tables, each cell contains the infimum, the supremum and the RS factor, in this order. Reading the tables in cases of Bayesian robustness is analogous to reading a standard bonus-malus table, but taking into account that instead of a single premium, we obtain a range of premiums over the class Te of distributions that a priori are considered close to ir0 and therefore are plausible and compatible with the actuary's a priori information. The elements represented for each given value of e, t and k have the following meaning. Three values are shown in each cell; the first and second of these correspond to the lower and upper bounds obtained over the class being considered; the third and final value in each cell corresponds to the value obtained for the RS factor. Let us examine a specific case. When uncertainty is low in 7r0 (of the order of 10%, e = 0.1) we see that in the abovecommented case (k = 2,t = 2), we have a variation range, (5,5) = (140.406,246.897). Obviously, this interval contains the value S0 = 145.678 discussed above (Table 2) that was obtained under standard Bayesian analysis. However, this value is significantly closer to the lower bound 5 = 140.406, thus indicating the existence of structure functions that are very close to 7r0 (deduced with the actuary's a priori information) which would require considerably higher premiums and which would almost certainly contribute to reducing the strong degree of instability that can occur with BMS (Lemaire, 1995). The RS factor for this situation has a value of 36.55 which clearly indicates the asymmetry between the degree of uncertainty (10%) with respect to the a priori 7r0 for a standard Bayesian development. Let us now assume that the insurance taker makes a claim during the period we are considering, which in notation form is (2,2) —> (3,3), the premium to be charged presents a variation range (6, ~5) = (162.219,396.035) with an RS factor of 70.08%. Again, this interval contains the value 5Q = 166.801 (Table 2), but provides much more interesting discussion possibilities. The actuary's a priori in-
455
Robustness in Bayesian Models for Bonus-Malus Systems
formation, together with the sample information (the claim) produce an interval of possible premiums to be charged. This interval could be used in conjunction with the quantity of the claim in order to introduce criteria to correct the final quantities payable, depending on the size of the claims made. One very interesting property is the behaviour of the RS factor within a single degree of security in the structure function (fixed value of e) when the transition is made from one state (k, t) to another (k + l,t+l). The sensitivity increases significantly. In a sense, this indicates that the premiums become less robust as more claims are presented. Similarly, when the transition is (k, t) —> (k, t + 1), that is, when no claims are made, the RS factor decreases. The non-incorporation of new claims means that the actuary has a narrower range of possible premiums and therefore these are closer to 80. In general, Table 4 shows the model to be not very robust; the values of the RS factor exceed 1000% for the class in which only the mean value is incorporated. The situation is analogous with regard to the reading of Table 5. Therefore, it seems appropriate to fine-tune the class, incorporating additional information with respect to the structure function. As commented above, shape properties such as unimodality are strongly intuitive for the actuary, as well as being easily attributable to the structure function. Tables 6 and 7 show the modes considered for each of the models developed in this section. Table 6. Mode of the Pareto posterior distribution (Model 1). t
0 1 2 3 4 5
k
0 —
1 -
2 -
3 —
4 —
5 -
6 —
.0710671 .0692665 .0675547 .0659255 .0643731
.0970646 .0946054 .0922672 .0900420 .0879223
.123062 .119944 .116980 .114158 .111470
.149059 .145283 .146920 .138275 .135019
.175057 .170621 .166405 .162392 .158568
.201054 .195960 .191117 .186508 .182116
.227052 .221299 .215830 .210625 .205665
Tables 8 and 9 show the ranges obtained when unimodality is incorporated by means of the contamination class Q**. In general, we
456
E. Gomez'Deniz and F. J. Vazquez-Polo Table 7. Mode of the QXQ posterior distribution (Model 2).
t
0 1 2 3 4 5
k
0 —
1 —
2 —
3 —
4 -
5 —
6 —
.0433694 .0425709 .0418296 .0411284 .0404674
.0715785 .0688908 .0665131 .0643794 .0624544
.126975 .118766 .111861 .105957 .100839
.209563 .192189 .177874 .165860 .155624
.306628 .278716 .255826 .236706 .224930
.409953 .371149 .339366 .312848 .290387
.516189 .466391 .425619 .391618 .362829
see that the RS values decrease considerably in comparison with the results shown in Tables 4 and 5. Therefore, it can be concluded that the incorporation of this type of unimodality makes the model somewhat more robust.
6
Conclusions and Further Works
Calculating the Bayes premium requires the actuary to specify a prior structure function for the risk parameter. The incorporation of this prior distribution into the model has been criticised by some authors, and so a class of plausible distributions is used. This is the basis of robust Bayesian analysis, the methodology of which provides a range of variations of the Bayes premium that the insurance company can employ in order to achieve suitable tariffs for a given risk. The aim of the present paper is to illustrate notions and techniques of robust Bayesian analysis in the context of problems that arise in Bonus-Malus Systems. A BMS is a tariff method that is widely used by automobile insurance companies and is characterized by the fact that it only considers an increase in the number of claims in calculating the premium payable, thus provoking "unfair" situations, in that the same premium increase is charged to a policy-holder who makes a small claim than to one who makes a large one. The mostcommonly used model in BMS consists of assuming that the individual risk for the number of claims has a Poisson-type distribution, and that the mean is distributed (a priori) as a Gamma distribution,
Robustness in Bayesian Models for Bonus-Malus Systems
t
0
457
Table 8. Range (infimum, supremum and RS factor) with mean and mode (Model 1). £ = 0.1 1 2 3 4 5 6 k 0
100
1
245.983 93.579 120.607 146.745 172.394 197.570 222.155 103.189 155.277 387.489 1579.930 5059.670 11943.800 24355.000 14.04 4749.06 4.93 80.50 400.78 1205.43 2573.34
2
91.003 99.387 4.41
117.436 142.964 168.017 136.915 227.004 551.115 8.09 28.84 111.96
192.637 216.731 1579.930 5059.670 353.01 1091.23
240.150 11943.800 2366.20
3
88.549 96.336 4.20
114.419 139.365 164.307 128.540 184.817 335.489 6.02 16.00 51.31
187.932 736.155 143.08
211.543 1573.930 314.86
234.553 5059.670 1000.58
4
86.207 93.628 4.11
111.544 135.937 159.875 122.948 165.759 255.425 4.98 10.76 29.35
183.440 472.793 77.41
206.579 905.141 165.49
229.18 1573.930 285.84
5
83.970 91.155 4.07
108.802 132.667 156.083 118.579 154.498 217.277 4.37 8.07 19.26
179.149 352.120 47.40
201.827 621.860 101.93
224.024 1041.520 178.01
0
100
0.2
1
90.115 117.794 143.971 169.288 193.848 217.615 240.609 109.893 192.021 615.050 2393.330 6699.910 14891.300 29641.100 10.15 30.06 157.53 633.28 1613.01 3221.43 5791.39
2
87.523 114.641 140.251 104.700 156.018 309.479 17.20 9.05 58.08
3
85.052 111.638 136.708 161.900 100.820 141.455 229.603 473.581 8.52 12.71 93.42 32.70
4
82.695 97.567 8.24
5
80.444 94.703 8.09
165.030 189.102 212.445 823.694 2393.330 6699.910 192.49 560.89 1461.79
235.059 14891.300 2963.15
184.564 207.488 1018.780 2393.330 217.72 505.17
229.729 6699.910 1341.72
108.773 133.328 157.087 132.726 194.777 338.020 10.46 22.17 55.59
180.222 650.798 125.89
202.734 1167.230 228.49
224.608 2393.330 460.99
106.038 130.099 153.378 126.468 175.284 272.195 9.14 16.70 37.45
176.064 472.468 81.23
198.172 813.151 149.24
219.685 1271.240 228.98
458
E. Gomez-Deniz and F. J. Vazquez-Polo
Table 9. Range (infimum, supremum and RS factor) with mean and mode (Model 2). 0.1 t k 0 1 2 4 3 5 6
0
100
1
83.255 99.871 8.81
140.641 215.368 300.963 389.861 476.412 560.058 182.666 476.084 1601.74 3807.040 6312.140 8787.570 14.09 58.18 414.79 207.35 566.86 663.49
2
78.729 93.899 8.47
130.610 197.621 274.619 155.538 282.933 602.664 20.74 8.97 57.43
355.645 435.991 513.926 1601.740 3807.040 6312.140 362.16 166.88 517.86
3
74.837 89.118 8.36
122.215 182.922 252.800 141.392 230.121 397.357 7.36 12.38 27.53
327.002 723.526 58.01
401.591 474.550 1601.740 3807.040 141.19 326.37
4
71.440 85.088 8.35
115.049 170.532 234.430 131.331 202.853 317.244 6.62 9.09 17.02
302.729 511.117 33.04
372.083 821.649 57.44
440.516 1601.740 123.68
5
68.441 81.603 8.38
108.856 159.934 218.759 123.382 184.728 273.922 12.16 7.43 6.23
282.235 409.610 21.74
346.568 613.699 36.82
410.852 896.315 55.84
0
100
0.2
1
74.712 132.338 206.475 288.797 372.339 454.792 539.944 106.474 220.959 703.426 2212.810 4511.330 6946.830 9370.540 16.84 29.72 306.69 502.41 630.61 712.12 110.91
2
70.640 99.077 15.89
122.718 189.421 264.020 340.880 417.240 494.563 174.614 360.390 828.014 2212.810 4511.330 6946.830 98.74 439.84 18.69 41.56 250.70 576.27
3
67.152 93.523 15.45
114.663 175.270 243.365 154.163 271.204 510.204 15.17 25.18 50.82
314.296 385.327 456.641 936.918 2212.810 4511.330 214.99 91.10 397.10
4
64.117 88.994 15.22
107.803 163.325 225.890 141.003 229.574 384.628 13.51 18.63 32.63
291.590 644.559 55.96
357.874 424.263 1014.190 2212.810 83.86 190.50
5
61.448 85.157 15.10
101.882 153.097 210.918 131.229 204.007 318.984 15.25 12.59 23.82
272.005 498.133 38.59
334.035 752.726 57.71
396.205 1066.380 80.20
Robustness in Bayesian Models for Bonus-Malus Systems
459
thus taking advantage of the good analytical properties of the conjugation. Nevertheless, alternative models such as the one presented here are also employed in actuarial contexts. Assuming that it is really difficult, and perhaps impossible, to quantify an expert's a priori opinion in a single structure function, we suggest the use of classes of a priori distributions that are compatible with this information. In other words, the proposal consists of assuming that the actuary is not prepared to choose a single structure function for the behaviour of the parameter of interest, but prefers to use a whole class of structure functions and thus reflect his uncertainty about this aspect. As a second improvement, we used a contamination class with a given mean, and with the mean and the mode, because these appear normally in the actuarial process. The present study clearly reflects the advantages of using robust Bayesian analysis as a prior step to introducing severities into a BMS, unlike the study by Frangos and Vrontos (2001) in which a inclusion is made by calculating the product of the premiums under the hypothesis of independence. We have combined the tools of standard and robust Bayesian analysis in order to show how the choice of the structure function can have a crucial effect on the bonus-malus premium. The Bayesian analysis carried out in this exercise provides the actuary with a range of variations of the relative premium that can be used to resolve competitive problems of the insurance company. The present study leaves some aspects open to question, which could be the subject of future study. These are of two types, the short-term objectives, and the longer-term goals that require indepth investigation. Among the former are: 1. To incorporate new data, readily available to actuaries, into the contamination classes. For example, in Eichenauer et al. (1988), although with other intentions, a class of structure function with a given variance was used. These authors claimed that the actuary was willing to assign such a value.
460
E. Gomez-Deniz and F. J. Vazquez-Polo
2. Another interesting line of investigation concerns the consideration of a hierarchical model, which normally reduces the expert's uncertainty (see Cano 1993). Basically, this consists of considering that one (or several) of the parameters of the a priori distribution is in turn randomly distributed; thus, a new a priori distribution could be assigned to it. Logically, such a study is complicated from a computational standpoint, and it would be necessary to use computational Bayesian analysis such as MCMC (see Scollnik 1996).
Acknowledgments Research partially supported by grants from DGUI (Direction General de Universidades e Investigation, Gobierno de Canarias, project PI2000-061) and MCyT (Ministerio de Ciencia y Tecnologfa, Spain, project BEC2001-3774).
461
Robustness in Bayesian Models for Bonus-Malus Systems
Appendix A Proof of Theorem 2 If / Hi(\)q{\)d\
= aj, i — 1 , . . . , n, then
p{k\q)±
j^q{\\k)dX^±au
where p(k \ q), the predictive, is given by p(k [ lk(X)q(X)d\.
\ q) —
JA
In this case, for ir G Te it is simple to show that 57r has the expression: Wi(A) J (1 - e)p(fe | 7ro)5- 2 JA ^q(X \ k)dX+ n
..
eE"« t=i
7 A
\
(
n
9(X)q(X\k)dX\x{sY:
J
I t=i
(l-eMfc|7ro)E/A^g(A|fc)dAJ Now, applying Lemma 1 the suprema of 5* is obtained by maximizing ip(X) . From Theorem 1, the extrema are over probability measures, pj, giving a positive mass to at most m + 1 points. Therefore, if the value of Ai obtained by maximizing (17) satisfies the system of equations (15) (with this q(X \ k) is a probability measure) and (16) (the general moment condition), the suprema is reached. Proof for the infimum is similar, v
462
E. Gomez-Deniz and F. J. Vazquez-Polo
References Berger, J. (1985), Statistical Decision Theory and Bayesian Analysis, 2nd edition, Springer-Verlag, New York. Besson, J.L. and Partrat, C. (1992), "Trend et systemes de bonusmalus," ASTINBulletin, vol. 22, l,pp. 11-31. Betro, B., Ruggeri, F. and Meczarski, M. (1994), "Robust Bayesian analysis under generalized moments conditions," Journal of Statistical Planning and Inference, vol. 41, pp. 257-266. Buhlmann, H. (1970), Mathematical methods in Risk theory, Springer-Verlag, Berlin. Cano, J. (1993), "Robustness of the posterior mean in normal hierarchical models," Commun. Statist. Theory & Meth., vol. 22, 7, pp. 1999-2014. Eichenauer, J., Lehn, J. Retting, S. (1988), "A gamma-minimax result in credibility theory," Insurance: Mathematics and Economics, vol. 7, 1, pp. 49-57. Frangos, N. and Vrontos, S. (2001), "Design of optimal bonus-malus systems with a frequency and severity component on an individual basis in automobile insurance," ASTIN Bulletin, vol. 31,1, pp. 1-22. Gomez, E., Hernandez, A. and Vazquez-Polo, F. (2000), "Robust Bayesian premium principles in Actuarial Science," J. Royal Stat. Soc, Series D, vol. 49, 2, pp. 241-252. Gomez, E., Hernandez, A. and Vazquez-Polo, F. (2002a), "Bounds for ratios of posterior expectations: Applications in the collective risk model," Scandinavian Actuarial Journal, vol. 1, pp. 37^44. Gomez, E., Perez, J., Hernandez, A. and Vazquez-Polo, F. (2002b), "Measuring sensitivity in a bonus-malus system," Insurance .-Mathematics & Economics, vol. 31, 1, pp. 105-113. Goutis, C. (1994), "Ranges of posterior measures for some classes of priors with specified moments," International Statistical Review, vol. 62, 2, pp. 245-256. Heilmann, W. (1989), "Decision theoretic foundations of credibility
Robustness in Bayesian Models for Bonus-Malus Systems
463
theory," Insurance: Mathematics & Economics, vol. 8, pp. 77-95. Lemaire, J. (1979), "How to define a bonus-malus system with an exponential utility function," Astin Bulletin, vol. 10, 3, pp. 274282. Lemaire, J. (1995), Bonus-Malus Systems in automobile insurance, Kluwer Academic Publishers, London. Makov, U. (1995), "Loss robustness via Fisher-weighted squarederror loss function," Insurance: Mathematics & Economics, vol. 16, pp. 1-6. Meng, Y.W. and Whitmore, G.A. (1999), "Accounting for individual over-dispersion in a bonus-malus automobile insurance system," Astin Bulletin, vol. 29, 2, pp. 327-337. Rios-Insua, D. and Ruggeri, F. (2000), Robust Bayesian Statistics, Springer-Verlag, New York. Scollnik, D. (1996), "An introduction to Markov Chain Monte Carlo methods and their actuarial applications," in Proceeding of the Casualty Actuarial Society, LXXXIII, pp. 114-165. Sivaganesan,S. (1991), "Sensitivity of some posterior summaries when the prior is unimodal with specified quantiles," The Canadian Journal of Statistics, vol. 19, 1, pp. 57-65. Sivaganesan, S. and Berger, J. (1989), "Ranges of posterior measures for priors with unimodal contaminations," Annals of Statistics, vol. 17, 2, pp. 868-889. Tremblay, L. (1992), "Using the Poisson inverse Gaussian in BonusMalus Systems," Astin Bulletin, vol. 22, 1, pp. 97-106. Winkler, G. (1988), "Extreme points of moment sets," Mathematics of Operations Research, vol. 13, 4, pp. 581-587.
This page is intentionally left blank
Chapter 13 Using Logistic Regression Models to Predict and Understand Why Customers Leave an Insurance Company Montserrat Guillen, Jan Parner, Chresten Densgsoe, and Ana M. Perez-Marin
In this chapter we want to focus on the insurance industry customer. So, we start defining a customer as someone having one or several insurance policies in several lines of insurance. Then, the main question of explaining customer lapses is stated. We review some of the marketing literature concerning repurchase behavior. In the second section, we present the basic of logistic regression modeling, which has also been used in the context of insurance fraud detection. The presentation is self-contained, so that the reader can understand why we undertake this approach and which other approaches are possible. In the following section we show what information is usually available in insurance companies regarding the customers. We define the variables that are going to be used in the models presented afterwards. The results show how the methods can be used to predict the probability of a customer leaving the company. We focus on the explanatory variables and all the results derived from the data analysis. Finally, we give some conclusions and we specially mention why survival analysis methods are used to predict the time from the first policy cancelation to the customer completely leaving the company, i.e., when the last policy is ended. This chapter methodology mimics some of the methods currently used in fraud detection. The connections to marketing models are also discussed and an application to one major European insurance 465
466
M. Guillen et al.
company is presented. The results were developed at Codan's Business Intelligence Unit, Copenhagen, Denmark.
1
Introduction
For an insurance company a customer can be defined as a person, firm or any other organizations having one or more insurance policies in the same or different lines of insurance. Once a customer has bought his first policy, a lot of statistical information is collected by the insurer over time, such as features of the coverage, renewals, new policies undertaken, claims, cancelations, complaints and even statistical information obtained in surveys about his level of satisfaction. Such information is useful to plan the company's marketing strategies. In order to optimize the quality of the portfolio, the effort should be orientated to good customers to promote loyalty, even if much less care is given to bad clients who leave. A lot has been written about the definition of good/bad customers and loyalty in the marketing literature, but few specific references can be mentioned related to the insurance industry. Roughly speaking, for an insurance company good clients are those paying expensive premiums but having few claims over time, that is to say, those conveying the company increasing profits. The target is to acquire, develop and retain core customer relationships the way described in Figure 1. The objectives are represented by the three arrows: • Achieving profit margins on new customers within short time • Achieving higher profit margins on existing customers • Retaining profitable customers for a longer period of time. Regarding the concept of loyalty, some authors have provided several definitions from different approaches. Brown (1952) introduced the behavioral loyalty, that is to say, loyal customers are those buying the same product or brand many times. From this point of view
Using Logistic Regression Models to Predict and Understand A
467
Profit margin
J. /
-=
». Time
Figure 1. Business Intelligence objectives.
loyalty is easy to measure but the reasons why people buy are not taken into account. In our framework a loyal customer would have all the policies in the same insurance company. He would also renew them. Recently, the meaning of loyalty has become more complex to include an increasing number of psychological elements, such as the attitude and emotions. Jacoby and Chesnut (1978) define loyalty as a non-random behavior which includes a stronger positive attitude to some specific product or brand than the attitude to the others. Kapferer and Laurent (1983) present loyalty as the emotional union to some brand or product. Mowen (1995) considers the attitude, but also the compromise the customer has to the product and his purpose to buy it again. Emotions and its evolution is also included in the definition by Fournier and Yao (1997). And finally, Uncles and Laurent (1997) consider that loyalty is somewhat stochastic and not deterministic, that is to say, loyalty is just a tendency to buy some product, so it is really difficult to find a 100% loyal client. According to these definitions, the statistical information that should be collected by the insurer when assessing loyalty should include qualitative aspects as well as quantitative information. When addressing the study of loyalty within the insurance industry, one is obliged to consider the econometric literature. Economet-
468
M. Guillen et al.
ric models are widely used in the actuarial field, specially in the car line insurance. One of the most important antecedents in that sense is the contribution by Dionne and Vanasse (1992) in which they predict the policy holder's pure premium as a function of his risk characteristics. Later on, Dionne, Gourieroux and Vanasse (1999) used an ordered probit model to predict the number of claims in individual policies. Purchasing decisions were also studied and a dichotomous probit was used to analyze what makes the client choose a policy with a deductible. Pinquet (1997 and 1999) used generalized Poisson and log-linear models to explain the frequency and the magnitude of the cost of claims. Abrahamse and Carroll (1999) used a multiple linear regression model to analyze the proportion of serious claims in several zones in the USA. Dionne, Laberge-Nadeau, Desjardins, Messier and Maag (1999) used a logistic regression model to analyze the probability that there were at least one claim in a one-month period to evaluate the impact of a new regulation on the way to obtain the driving license carried out in Canada in 1991. All of the previous contributions were aimed at evaluating risk in terms of number of claims and/or their severity. The main objective is to provide an appropriate estimate for pricing and selecting risks. In the recent years, other interesting questions have come into the scene of automobile insurance. One example is the application of qualitative dependent variable models to the detection of fraud in the car line of insurance, see Artis, Ayuso and Guillen (1999, 2002). The analysis of the scope of the fraud and its detection is essential for making the customer believe in the contract fairness (Picard, 2000). What makes a claim more likely to be dishonest as well as the possibility to improve the auditing process of suspicious claims is the final objective. Models provide an estimation mechanism for the probability to be fraudulent given the features of the customer and the claim itself. As a preliminary approach we will apply the same methodology to predict and understand why customers leave an insurance company, but it is necessary to remark some particular
features that make our problem different and maybe more complex than the analysis of fraud. In the fraud case there are two possible situations: a claim is either honest or fraudulent. The amount of fraud and the way to discover it (usually by means of fraud suspicion indicators) are also part of this process. When considering the situation of a customer in terms of the type and number of policies underwritten or canceled, there are many possible states and many possible evolutions from a given state. A customer who first buys, for example, three policies can renew all of them in due time, and even buy new policies or cancel some of them (partial cancelation) or all of them (total cancelation). In fact we are interested in each policy renewal, but we believe that there is some positive contagion (as statisticians like to call it), so that other policies held by the same customer will follow similar patterns. We want to find out and predict if and when a policy is likely to be canceled. Moreover, we would like to see the aggregate behavior of a customer, especially if he starts the process of cancelation by ending one of his contracts. In the fraud case there is a misclassification problem. Claims classified as fraudulent contain only claims for which the insured finally admitted that he had committed fraud. Honest claims contain claims with no fraud, but they could also include some claims with undetected fraud, that is to say, fraudulent claims that the existing methodology was unable to identify. This type of error is called omission error. In the cancelation analysis there is a problem due to the fact that the cancelation is the result of something that happened prior to the moment in which the customer communicates that he wishes to cancel his policy. Then, when a cancelation is communicated, the insurer should look at the most recent information collected about the insured to find the reason why he wants to leave. In some sense, there is also a measurement error, because the decision of canceling is already taken before the observed cancelation date. So, measurement error may be present not in the outcome of the customer's decision but in the time scale.
2 Qualitative Dependent Variable Models
Our target is to predict the conditional probability of a customer leaving the insurance company given some explanatory variables. Logistic regression models are widely used for this type of analysis, in which the dependent variable Y is qualitative and, in this case, dichotomous. Some basic references can be found in Snell and Cox (1989) and Agresti (1990).
2.1 Model Specification
The logistic regression model can be motivated in two ways: either through the subject-specific propensity of lapsing or through the final observed response, i.e., lapse/no lapse. In the first formulation the outcome is observed as a lapse if the customer's propensity exceeds a certain threshold level, whereas in the latter only the outcome of the individual decision process is registered. It should be stressed that the assumption of an underlying threshold model is unidentifiable from the observed data, and thus both formulations may lead to the same logistic regression model. It should also be emphasized that if misclassification is not present, the only random variation present is due to the binomial nature of the experiment. Let Y_i* be the unobservable latent variable which indicates the utility of the ith individual to embrace one of the two possible outcomes (staying in the company or leaving the company). For any customer i = 1, ..., n, where n is the number of individuals, we assume a regression model for Y_i* so that:
Y_i^* = \beta_0 + \beta' X_i + \varepsilon_i \qquad (1)

where X_i is a k-dimensional column vector of observed explanatory variables, β_0 is the intercept, β is the vector of unknown parameters and ε_i is a disturbance term with zero mean and constant variance.
Since Y_i* cannot be observed, model (1) leads to a qualitative choice model, where only the outcome is observed. Let Y_i be a dichotomous variable indicating the true outcome, so that:

Y_i = 1 \quad \text{if } Y_i^* > 0 \qquad (2)
Y_i = 0 \quad \text{otherwise} \qquad (3)

If there is no misclassification in the response, the observed binary variable is equal to Y_i and one can calculate the following probabilities

P(Y_i = 1 \mid X_i) = P(Y_i^* > 0 \mid X_i) = P(\varepsilon_i > -(\beta_0 + \beta' X_i)) = F(\beta_0 + \beta' X_i) \qquad (4)
P(Y_i = 0 \mid X_i) = P(Y_i^* \le 0 \mid X_i) = P(\varepsilon_i \le -(\beta_0 + \beta' X_i)) = 1 - F(\beta_0 + \beta' X_i) \qquad (5)
where F(·) is the (symmetric) cumulative probability function of ε_i. Assuming F(·) to be the normal distribution function leads to the Probit Model

P(Y_i = 1 \mid X_i) = \int_{-\infty}^{\beta_0 + \beta' X_i} \phi(t)\, dt = \Phi(\beta_0 + \beta' X_i) \qquad (6)

in which φ(·) is the normal density function and Φ(·) its corresponding distribution function. The approach that we will consider instead is to assume F(·) to be the Logistic distribution function, which leads us to the Logit Model,

P(Y_i = 1 \mid X_i) = \frac{e^{\beta_0 + \beta' X_i}}{1 + e^{\beta_0 + \beta' X_i}} = \Lambda(\beta_0 + \beta' X_i) \qquad (7)

where Λ(·) represents the Logistic distribution function

\Lambda(\beta_0 + \beta' X_i) = \frac{\exp(\beta_0 + \beta' X_i)}{1 + \exp(\beta_0 + \beta' X_i)} \qquad (8)

It is important to notice that the Logistic set up ensures that whatever estimate of the conditional probability we get, it will always be
some number between 0 and 1. This is not always true for other possible models. Modeling the conditional probability with a multiple linear regression model may lead to estimates outside the interval (0,1) and to a disturbance term which presents heteroscedasticity depending on β, so the assumption of constant variance for the disturbance term breaks down. For all these reasons the logistic model is widely used in practice. The conditional expectation of the response given a particular value of the explanatory variables is

E[Y_i \mid X_i] = 0 \cdot [1 - F(\beta_0 + \beta' X_i)] + 1 \cdot F(\beta_0 + \beta' X_i) = F(\beta_0 + \beta' X_i). \qquad (9)
In that case, it is important to notice that the parameters of the model are not equal to the marginal effect of the corresponding explanatory variable on the response, because the model is not linear. We have

\frac{\partial E[Y_i \mid X_i]}{\partial X_i} = f(\beta_0 + \beta' X_i)\, \beta \qquad (10)

where f(·) is the corresponding density function. For the logit model

\frac{\partial E[Y_i \mid X_i]}{\partial X_i} = \Lambda(\beta_0 + \beta' X_i)\, [1 - \Lambda(\beta_0 + \beta' X_i)]\, \beta. \qquad (11)
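As a brief numerical illustration of (7) and (11), the following Python sketch evaluates the predicted lapse probability and the marginal effects for the logit model. The coefficient and covariate values are purely hypothetical and are not taken from the data analyzed in this chapter.

    import numpy as np

    def logit_probability(beta0, beta, x):
        """Lapse probability Lambda(beta0 + beta'x) for the logit model, eq. (7)."""
        z = beta0 + np.dot(beta, x)
        return 1.0 / (1.0 + np.exp(-z))

    def logit_marginal_effects(beta0, beta, x):
        """Marginal effects dE[Y|X]/dX = Lambda(z)(1 - Lambda(z)) beta, eq. (11)."""
        p = logit_probability(beta0, beta, x)
        return p * (1.0 - p) * np.asarray(beta)

    # Hypothetical coefficients and covariates, for illustration only.
    beta0 = -3.0
    beta = np.array([0.9, 0.6, -1.8])   # e.g. motor policy, claim, core customer
    x = np.array([1.0, 1.0, 0.0])       # has motor and a claim, not a core customer

    print(logit_probability(beta0, beta, x))       # predicted probability of lapse
    print(logit_marginal_effects(beta0, beta, x))  # effect of a small change in each covariate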
An important measure used in the context of binary choice models is the Risk Ratio. Let the risk be the conditional probability of Y_i = 1. In our case, we can identify the conditional probability that a particular customer leaves the company as the risk. Let X_ij be some explanatory dichotomous variable indicating a characteristic the individual presents, for example X_ij = 1 representing male and X_ij = 0 female. Then the Risk Ratio is defined as

RR_{ij} = \frac{P(Y_i = 1 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 1, \ldots, X_{ik} = x_{ik})}{P(Y_i = 1 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 0, \ldots, X_{ik} = x_{ik})} \qquad (12)
In that case, both individuals being compared have exactly the same characteristics but a different gender. It is important to notice that the value of RR_ij depends on the chosen values of all other covariates. The Odds Ratio is defined as

OR_{ij} = \frac{P(Y_i = 1 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 1, \ldots, X_{ik} = x_{ik}) \,/\, P(Y_i = 0 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 1, \ldots, X_{ik} = x_{ik})}{P(Y_i = 1 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 0, \ldots, X_{ik} = x_{ik}) \,/\, P(Y_i = 0 \mid X_{i1} = x_{i1}, \ldots, X_{ij} = 0, \ldots, X_{ik} = x_{ik})} \qquad (13)

Therefore, OR_ij = exp(β_j), which does not depend on the chosen values of all other covariates. When using logistic regression for cross-sectional data, the parameter β_0 cannot be validly estimated if the sampling fraction of the population is not known. So without a good estimate of β_0 we cannot obtain a good estimate of the predicted population risk. Since the OR's are independent of the population prevalence, one cannot estimate the actual risk, only a relative one. β_0 can be validly estimated even in a cross-sectional study, but the interpretation as a population feature is not achievable. Since we have population data in our example, this issue is not going to be a problem.
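The contrast between (12) and (13) can be checked numerically. The sketch below, again with hypothetical coefficients rather than the chapter's estimates, computes both measures for a dichotomous covariate and shows that the odds ratio equals exp(β_j) whatever the values of the remaining covariates, while the risk ratio does not.

    import numpy as np

    def prob(beta0, beta, x):
        """P(Y = 1 | X = x) under the logit model."""
        return 1.0 / (1.0 + np.exp(-(beta0 + np.dot(beta, x))))

    def risk_and_odds_ratio(beta0, beta, x, j):
        """Risk ratio and odds ratio for switching covariate j from 0 to 1,
        holding the remaining covariates fixed at the values in x."""
        x1, x0 = x.copy(), x.copy()
        x1[j], x0[j] = 1.0, 0.0
        p1, p0 = prob(beta0, beta, x1), prob(beta0, beta, x0)
        return p1 / p0, (p1 / (1 - p1)) / (p0 / (1 - p0))

    beta0, beta = -3.0, np.array([0.9, 0.15])        # hypothetical values
    for other in (0.0, 1.0):                         # vary the other covariate
        rr, odds = risk_and_odds_ratio(beta0, beta, np.array([0.0, other]), j=0)
        print(f"other covariate = {other}: RR = {rr:.3f}, OR = {odds:.3f}")
    print("exp(beta_j) =", round(float(np.exp(0.9)), 3))   # equals the OR in both cases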
2.2 Estimation and Inference
Logistic models are usually estimated using the Maximum Likelihood Method. The likelihood function is obtained by calculating the joint probability density function for the observed responses. Each individual random variable follows a Bernoulli distribution function, with F(β_0 + β'X_i) being the response probability. For a sample of n observations we have

L = P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n)
  = \prod_{y_i = 0} [1 - F(\beta_0 + \beta' X_i)] \prod_{y_i = 1} F(\beta_0 + \beta' X_i)
  = \prod_{i=1}^{n} F(\beta_0 + \beta' X_i)^{y_i} [1 - F(\beta_0 + \beta' X_i)]^{1 - y_i}, \qquad (14)
where individuals are assumed independent. So the log-likelihood function is

\ln L = \sum_{i=1}^{n} \left[ y_i \ln F(\beta_0 + \beta' X_i) + (1 - y_i) \ln (1 - F(\beta_0 + \beta' X_i)) \right] \qquad (15)

and then the first order condition for the maximum is

\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} \left[ \frac{y_i f(\beta_0 + \beta' X_i)}{F(\beta_0 + \beta' X_i)} - \frac{(1 - y_i) f(\beta_0 + \beta' X_i)}{1 - F(\beta_0 + \beta' X_i)} \right] X_i = 0. \qquad (16)

In the general case, the last expression has to be solved using iterative methods. For the logit model the first and second derivatives can be easily obtained

\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} \left( y_i - \Lambda(\beta_0 + \beta' X_i) \right) X_i = 0 \qquad (17)

\frac{\partial^2 \ln L}{\partial \beta \partial \beta'} = - \sum_{i=1}^{n} \Lambda(\beta_0 + \beta' X_i) \left( 1 - \Lambda(\beta_0 + \beta' X_i) \right) X_i X_i' \qquad (18)
and the parameters can be estimated using the Newton-Raphson method. Note that the Hessian is a negative definite matrix, so the log-likelihood is a concave function for the whole range of possible values of the explanatory variables. The asymptotic covariance matrix of the maximum likelihood estimator can be estimated by the inverse of the negative Hessian. Once the maximum likelihood estimates have been obtained, they can be used to make statistical inferences concerning the relationship between the canceling behavior and the independent variables. This step includes testing hypotheses and obtaining confidence intervals for the parameters in the model. The likelihood ratio test is approximately a chi-square test which makes use of maximized likelihood values to evaluate the model reduction.
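A minimal sketch of this estimation step, built directly from the score (17) and Hessian (18), is given below. The data are simulated purely for illustration; in practice the estimation would of course be carried out with standard statistical software.

    import numpy as np

    def fit_logit_newton_raphson(X, y, n_iter=25):
        """Maximum likelihood estimation of (beta0, beta) for the logit model.
        X is an n x k covariate matrix, y an n-vector of 0/1 responses."""
        Z = np.column_stack([np.ones(len(y)), X])          # prepend the intercept
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))           # Lambda(beta0 + beta'X_i)
            score = Z.T @ (y - p)                          # eq. (17)
            hessian = -(Z * (p * (1 - p))[:, None]).T @ Z  # eq. (18)
            theta = theta - np.linalg.solve(hessian, score)  # Newton-Raphson step
        return theta, np.linalg.inv(-hessian)              # estimates and asymptotic covariance

    # Simulated data, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 2))
    true_theta = np.array([-2.0, 0.8, -0.5])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(true_theta[0] + X @ true_theta[1:]))))

    theta_hat, cov_hat = fit_logit_newton_raphson(X, y)
    print(theta_hat)                  # close to (-2.0, 0.8, -0.5)
    print(np.sqrt(np.diag(cov_hat)))  # estimated standard errors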
In qualitative dependent variable models, there have been several proposals to assess the goodness of fit. At least the maximum of the log-likelihood function ln L for the whole model should be provided, as well as the maximum of the log-likelihood function for a model in which the only explanatory variable is the intercept term. The latter is called ln L_0. With these two measures it is possible to calculate the Likelihood Ratio Index, which is analogous to the R² coefficient in a multiple linear regression model,

LRI = 1 - \frac{\ln L}{\ln L_0}. \qquad (19)

This index is always between 0 and 1. If all the slope estimates are equal to zero, then LRI is equal to zero, and it should be associated with a bad fit. It has been suggested that LRI increases as the fit of the model improves and that a value of LRI very close to 1 indicates something like a "perfect fit", which is really difficult to obtain in practice. An intuitive way to summarize the model's predictive efficiency is a 2 × 2 table with the successes and failures of the model, but first it is necessary to choose a binary prediction rule. The most typical choice is Y = 1 if the prediction for the conditional probability is greater than a given threshold value and Y = 0 otherwise. If the percentage of customers that lapse is 10%, then the average predicted probability in the sample would be 0.1, so a suitable threshold is 0.1. Sensitivity and specificity are standard measures when assessing the prediction value of a discriminating function and they will also be used here.
2.3 Stages in the Modeling Process
The modeling strategy should involve several stages. The first one is called variable selection. In this stage the researcher restricts his attention to those meaningful independent variables that could be used to explain the dependent one. These variables should be chosen to provide the largest possible meaningful model to be initially considered. In the second stage, interaction assessment is carried out.
Interaction can be defined as the situation in which the effect of some pair of explanatory variables acting together is different from the effect of them acting separately. For binary variables, the effect of both variables acting together is given by the odds ratio obtained when both variables are equal to 1, and the effect of one of these variables acting separately is given by the odds ratio for this variable equal to 1 and the other variable equal to 0. It holds that when there is no interaction on a multiplicative scale, the effect of both variables acting together is equal to the product of the effects of the variables acting separately. In a general framework, the last stage is confounding assessment. When analyzing the relationship between some explanatory variable and the outcome, one can be misled and find no association between them due to the confounding effect of another explanatory variable. In this case, it is necessary to assess this relationship controlling for confounding variables. It is important to remark that confounding assessment is an issue if the analysis addresses the customer's behavior. Strictly speaking, our model is addressing prediction, i.e., scoring future lapses but not the customer's behavior, so confounding is not an issue in that case. We would like to mention some statistical issues which are beyond the scope of this section but need special attention. These issues are multicollinearity and influential observations.
• Multicollinearity occurs when one or more independent variables in the model can be approximately determined by other independent variables. In this situation, estimated regression coefficients can be highly unreliable. To avoid this problem it is necessary to check for possible multicollinearity at various steps when selecting the independent variables.
• Influential observations are data on individuals that may have a large influence on the estimated regression coefficients. So if such an individual is dropped from the data, the estimated regression coefficients may change greatly from the coefficients obtained when that person is retained in the data. Methods for
detecting influential observations should be considered when determining the best model.
3 Customer Information
For the Danish non-life insurance market the main policies in personal lines are building insurance for homeowners, household insurance, motor insurance and accident insurance, where the latter can be considered as lying in between non-life and life insurance. The majority of insurance companies employ their own agents who sell insurance to customers, and insurance brokers have only a limited market share. Traditionally, Danish insurers have targeted the market through their different products rather than addressing the customer as a whole. As a result, companies are usually organized according to the insurance lines, with an emphasis on claim handling, which also determines the data the insurance company gathers and stores. Typical policy data are coverage, sums, deductible, premiums and initial date of coverage, whereas typical claim data are time of notification of the claim, type of claim, cause of claim and compensation received. Each policy is linked to the customer that it covers, on whom information such as age, gender, address and occupation is known. To address the customer as a whole it has to be taken into account that several policyholders may share the same household, e.g., spouse and children. Therefore a so-called single view is needed, linking a household to several policyholders that again may hold several different insurance policies. Due to special conditions for taking out accident insurance related to the customer's health status, we shall narrow our description of customer lapses to three types of insurance contracts. The first one is building insurance for homeowners, which covers the building itself (roughly speaking, the bricks) from damage generally caused by phenomena such as fire, storm or water. The second is household insurance, which covers the contents, i.e., the goods stored inside the house.
That is, for example, furniture, gold and silver, paintings, clothes, television or stereo, i.e., objects that are subject to theft. The third is motor insurance, which may cover both bodily injury and property liability caused by a vehicle. The empirical information used here to study customer behavior refers to the time frame described in Figure 2.
Figure 2. Time frame: covariates are measured up to the time of cross section (January 1, 2001) and lapses are observed between January 1, 2001 and March 31, 2001.
The time of cross section is January 1, 2001. Some covariates are measured at this time point (for example, age or customer loyalty). Other covariates refer to the occurrence of some event in the 12 months prior to this time point, that is to say, between January 1, 2000 and January 1, 2001. We are interested in lapses occurring within the 3 months following the cross section, that is, lapses that took place during the period between January 1, 2001 and March 31, 2001. The variables related to customers are presented in Table 1. All the variables included in the model are binary. The value 1 indicates the presence of the feature described by the corresponding label. So, buildings, household and motor are dichotomous variables indicating whether or not the household has a buildings, household or motor insurance policy, respectively. The variables measuring new business indicate that the customer has taken out the first policy of the corresponding type; for example, new business household means that the customer did not have any household policy before. The claim variable indicates whether or not the customer has had some claim in the past 12 months, even if it is not his/her fault.
Table 1. Covariates.
Variable name
Buildings
Household
Motor
New business buildings within past 12 months
New business household within past 12 months
New business motor within past 12 months
Claims within past 12 months
Pruning within past 12 months
Core customer
Gender (male)
Age < 40 years
40 years < age < 50 years
50 years < age < 70 years
70 years < age
Customer loyalty < 2 years
2 years < customer loyalty < 4 years
4 years < customer loyalty < 20 years
20 years < customer loyalty

Pruning means that the premium on at least one policy has been raised substantially (20-50%) by the insurer due to a very bad claim history, but the coverage is unchanged. A core customer is a customer that has a household insurance and at least two other types of policies, for example motor and buildings (it can also be accident insurance or life insurance). We consider the age of the policyholder calculated at the time of cross section, which is January 1, 2001. Four categories of age have been considered: younger than 40, between 40 and 50, between 50 and 70, and older than 70. Customer loyalty is measured by the time since the first policy issue within the group of policies that are being considered here. For this variable four levels have been considered: less than 2 years, between 2 and 4 years, between 4 and
20 years, and more than 20 years.¹ Other and more sophisticated covariates are of course possible, but the above information is sufficient to illustrate the concept of loyalty scoring.

¹ Variables loyalty and age were discretized into groups of 2 and 5 years, respectively, and additional model reduction leads to the groups given here.

Table 2. Descriptive statistics.
Variable label                                   Percentage
Buildings                                        46.48%
Household                                        77.19%
Motor                                            54.78%
New business buildings within past 12 months      1.56%
New business household within past 12 months      2.63%
New business motor within past 12 months          2.33%
Claims within past 12 months                     23.44%
Pruning within past 12 months                     0.21%
Core customer                                    35.96%
Gender (male)                                    66.58%
Age < 40 years                                   27.71%
40 years < age < 50 years                        16.15%
50 years < age < 70 years                        33.55%
70 years < age                                   22.59%
Customer loyalty < 2 years                       26.20%
2 years < customer loyalty < 4 years             13.02%
4 years < customer loyalty < 20 years            39.76%
20 years < customer loyalty                      21.02%
The response variable in our model is dichotomous. It indicates whether or not at least one of the customer's contracts suffers a lapse. It is important to remark that even though a policy can be canceled due to many possible causes, not all of them are regarded as a lapse. The most important rule for whether a termination is regarded
as a lapse is whether the risk still exists at the time the insurance is ended or transferred to another insurer. Hence, the customer's death or a risk being sold is not regarded as a lapse, but a transfer to another insurer is considered to be a lapse.
4 Empirical Results
The whole data set consists of 232,043 households (11,890 observed lapses). The data set has been divided into two groups by random sampling. One group of 116,271 households is used for model estimation and the other group (115,772 households) is used for model validation. The subsample used for estimation purposes has 5,979 lapses (5.14%) and the one used for validation has 5,911 lapses (5.11%). If modeling only the intercept term, the value of −2 ln L_0 becomes 47,132.54, whereas −2 ln L equals 43,268.95 if regressing on both the intercept and the covariates. The overall test of no covariate effect takes the value 3,863.59 and is chi-square distributed with 16 degrees of freedom. The model estimates are shown in Table 3. Looking at the odds ratios, motor insurance, claims, new business household and pruning within the past 12 months are the most relevant factors influencing the probability of lapse. No core status, male gender, age lower than 40 years and customer loyalty between 2 and 4 years characterize customers with a large predicted probability of lapse. For illustration purposes, let us estimate the probability of lapse for a young male customer with no core status, with motor insurance, claims and pruning within the past 12 months and an intermediate level of loyalty (between 4 and 20 years). The result is 30%, which is much higher than the estimated probability of lapse for a female core customer older than 70 years, with building and household insurances (and, e.g., accident insurance), neither claims nor pruning within the past 12 months and more than 20 years of loyalty, which is 0.3%.
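The overall test of no covariate effect quoted above can be reproduced directly from the two reported values of −2 ln L; a quick check, assuming scipy is available for the chi-square tail probability:

    from scipy.stats import chi2

    minus2lnL0 = 47132.54   # intercept-only model, as reported in the text
    minus2lnL = 43268.95    # model with intercept and covariates
    df = 16                 # number of covariates in Table 3

    lr_statistic = minus2lnL0 - minus2lnL      # 3,863.59
    p_value = chi2.sf(lr_statistic, df)        # essentially zero
    lri = 1.0 - minus2lnL / minus2lnL0         # Likelihood Ratio Index, eq. (19)
    print(lr_statistic, p_value, round(lri, 3))

The resulting Likelihood Ratio Index is modest (about 0.08), which is consistent with the later remark that the explanatory variables explain only part of the risk of lapse.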
Table 3. Model estimates.

Parameter            Estimate    Std. Dev.    OR       95% CI            p-value (*)
Intercept            -3.6568     0.0514                                  <.0001
Buildings             0.3731     0.0330       1.452    [1.361, 1.549]    <.0001
Household             0.2891     0.0334       1.335    [1.251, 1.426]    <.0001
Motor                 0.8855     0.0324       2.424    [2.275, 2.583]    <.0001
New bus. build.       0.3705     0.0942       1.449    [1.204, 1.742]    <.0001
New bus. hous.        0.5288     0.0687       1.697    [1.483, 1.942]    <.0001
New bus. mot.         0.2356     0.0742       1.266    [1.094, 1.464]    0.0020
Claim                 0.6126     0.0311       1.845    [1.736, 1.961]    <.0001
Pruning               0.4665     0.2131       1.594    [1.050, 2.421]    0.038
Core customer        -1.8526     0.0443       0.157    [0.144, 0.171]    <.0001
Gender (male)         0.1554     0.0315       1.168    [1.098, 1.243]    <.0001
Age < 40 years        0.4609     0.0354       1.586    [1.479, 1.700]    <.0001
40 < age < 50         0.1886     0.0406       1.208    [1.115, 1.308]    <.0001
50 < age < 70                                 1.0
70 < age             -0.6483     0.0519       0.523    [0.472, 0.579]    <.0001
Cust. loy. < 2       -0.2478     0.0365       0.781    [0.727, 0.838]    <.0001
2 < cust. loy. < 4    0.2646     0.0386       1.303    [1.208, 1.405]    <.0001
4 < cust. loy. < 20                           1.0
20 < cust. loy.      -0.1762     0.0474       0.838    [0.764, 0.920]    0.0002

(*) p-values refer to associated likelihood ratio tests.
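The odds ratios and confidence intervals in Table 3 follow directly from the estimates and standard deviations: OR = exp(β̂) and, assuming Wald-type intervals, the 95% limits are exp(β̂ ± 1.96·SE). For instance, the motor insurance row can be reproduced as follows.

    import numpy as np

    def odds_ratio_with_ci(estimate, std_err, z=1.96):
        """Odds ratio and 95% Wald confidence limits from a logit coefficient."""
        return np.exp(estimate), np.exp(estimate - z * std_err), np.exp(estimate + z * std_err)

    # Motor insurance row of Table 3: estimate 0.8855, standard deviation 0.0324.
    print(odds_ratio_with_ci(0.8855, 0.0324))   # approximately (2.424, 2.275, 2.583)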
We will now address the description of the model's ability to discriminate between situations frequently leading to lapse and situations hardly ever leading to lapse by considering the predicted probability calculated from (7). In Figures 3 and 4 the empirical distribution of the predicted probability for observed lapses and non lapses on the validation data set is shown. We may consider customers with a predicted probability greater than a given threshold value p as customers for whom the model predicts a lapse, while the other customers are not predicted to lapse. By doing so, the portfolio is divided into two sets, one in which the model predicts lapse and another one in which the model predicts no lapse. It is possible to compare what the model predicts with the real observed result, that is to say, whether or not customers in fact lapsed. Then one has that, for a given threshold
Figure 3. Empirical distribution of the predicted probability for lapses (horizontal axis: probability of lapse within three months).
Figure 4. Empirical distribution of the predicted probability for non lapses (horizontal axis: probability of lapse within three months).
p customers fall into one of the four groups that are shown in Table 4.
Table 4. Observed vs. predicted crosstab.

                      Lapse predicted by model
Observed       No lapse                 Lapse                    Total
No lapse       True negative (TN)       False positive (FP)      TN+FP
Lapse          False negative (FN)      True positive (TP)       FN+TP
Total          TN+FN                    FP+TP
Four probabilities can be calculated to describe the model's ability to discriminate between situations frequently leading to lapse and situations rarely leading to lapse. The sensitivity is the conditional probability of predicting that a customer will lapse among those customers actually lapsing

\text{Sensitivity} = \frac{TP}{FN + TP} \qquad (20)

The specificity is the conditional probability of predicting that a customer will not lapse among those customers actually not lapsing

\text{Specificity} = \frac{TN}{TN + FP} \qquad (21)

The predictive positive value, PV_pos, is the frequency of customers that are predicted to lapse and really were observed as lapses

PV_{pos} = \frac{TP}{FP + TP} \qquad (22)

And finally, the predictive negative value, PV_neg, is the frequency of customers that are predicted not to lapse and really were observed as non lapses

PV_{neg} = \frac{TN}{TN + FN} \qquad (23)
Table 5. Classification results for different threshold probabilities.

Thr. prob.    Pred. lap.    Obs. lap.    Sensitivity    Specificity    PV_pos    PV_neg
0.0%          232,043       11,890       100.0%         0.0%           5.1%      100.0%
2.5%          160,082       10,713       90.1%          31.8%          6.7%      98.3%
5.0%          84,781        8,036        67.6%          65.1%          9.5%      97.4%
7.5%          46,814        5,803        48.8%          81.4%          12.4%     96.7%
10.0%         23,496        4,107        34.5%          89.8%          15.5%     96.2%
12.5%         16,134        2,959        24.9%          94.0%          18.3%     95.9%
15.0%         9,122         2,049        17.2%          96.8%          22.5%     95.6%
The whole data set consisting of 232,043 customers is scored. The results are shown in Table 5 for several threshold levels. Looking at Table 5, the optimal overall results correspond to a 5% threshold probability when all costs are equal. If we elaborate a cost model, then the best threshold value would be the one giving the minimum overall cost. When using this level, the predicted frequencies are the ones shown in Table 6.

Table 6. Observed vs. predicted crosstab for a 5% threshold probability.
               Lapse predicted by model
Observed       No lapse       Lapse        Total
No lapse       143,408        76,745       220,153
Lapse          3,854          8,036        11,890
Total          147,262        84,781       232,043
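Plugging the counts of Table 6 into formulas (20)-(23) reproduces the 5% row of Table 5; a small sketch:

    def classification_measures(tn, fp, fn, tp):
        """Sensitivity, specificity and predictive values, eqs. (20)-(23)."""
        return {
            "sensitivity": tp / (fn + tp),
            "specificity": tn / (tn + fp),
            "PV_pos": tp / (fp + tp),
            "PV_neg": tn / (tn + fn),
        }

    # Counts from Table 6 (5% threshold): TN = 143,408, FP = 76,745, FN = 3,854, TP = 8,036.
    for name, value in classification_measures(143408, 76745, 3854, 8036).items():
        print(f"{name}: {value:.1%}")   # approximately 67.6%, 65.1%, 9.5% and 97.4%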
In Figure 5 the Receiver Operating Characteristic (ROC) curve is shown. This curve plots Sensitivity versus (1 − Specificity) for all possible threshold levels. The solid curve is the ROC curve corresponding to the model data set, while the dashed curve is the ROC curve corresponding to the validation data set. As the discrimination ability of the model improves, the ROC curve shows a much sharper rise immediately after 0; the dotted curve is the diagonal on which sensitivity equals 1 − specificity, which would correspond to a model with no discrimination ability in terms of lapse. The two ROC curves (for the estimation sample and the validation sample) are quite close together. This
means that the total model discrimination performance is still good when using the validation data set.

Figure 5. ROC curve (sensitivity versus 1 − specificity).
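An ROC curve of the kind shown in Figure 5 is obtained by sweeping the threshold over the scored probabilities and collecting the (1 − specificity, sensitivity) pairs. The sketch below uses simulated scores purely to illustrate the mechanics; it is not a reproduction of the chapter's curves.

    import numpy as np

    def roc_points(y, p, thresholds):
        """(1 - specificity, sensitivity) for each threshold, given observed
        outcomes y (0/1) and predicted lapse probabilities p."""
        points = []
        for t in thresholds:
            pred = (p >= t).astype(int)
            tp = np.sum((pred == 1) & (y == 1))
            fp = np.sum((pred == 1) & (y == 0))
            fn = np.sum((pred == 0) & (y == 1))
            tn = np.sum((pred == 0) & (y == 0))
            points.append((fp / (fp + tn), tp / (tp + fn)))
        return points

    # Simulated scores for illustration only.
    rng = np.random.default_rng(1)
    y = rng.binomial(1, 0.05, size=20000)
    p = np.clip(0.05 + 0.08 * y + rng.normal(0.0, 0.03, size=20000), 0.0, 1.0)

    for fpr, tpr in roc_points(y, p, np.linspace(0.0, 0.2, 9)):
        print(f"1 - specificity = {fpr:.3f}, sensitivity = {tpr:.3f}")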
Figure 6 shows sensitivity, specificity, PV_pos and PV_neg for increasing thresholds. As we can see, sensitivity falls very sharply while PV_pos only increases slowly. That means that the model does not identify customers who will lapse within a three-month period with absolute certainty. That is to say, the explanatory variables involved only explain a part of the risk of lapse. Additional explanatory factors should be included to improve that.

Figure 6. Sensitivity, specificity, positive predictive value and negative predictive value as a function of the threshold probability.
5 Conclusions
In this chapter the concept of customer lapse is defined as a downward revision of the customer relationship where the risk remains, while the policy is taken out with another insurer. This means that
a downward revision due to termination of risk on death is not regarded as customer lapse. The concept is suitable for analyzing the customer portfolio when the intention is to provide options to reduce customer lapses. However, the concept is not suitable for monitoring absolute lapses in the portfolio over calendar time, neither in terms of customers nor premiums, since major causes, e.g., death, are left out. From our model we conclude that customers having a motor insurance are those showing the greatest probability of lapse. Customers having this type of contract have a much higher risk of canceling one of the policies they hold. On the other hand, core customers (those having at least two more types of policies besides household insurance) are very unlikely to lapse. One interesting result is that having a claim also influences the probability of lapse. This should alert companies to the need to react to such events by trying to reduce the odds of the customer deciding to leave or cancel the policy.
As expected, a substantial increase in premium also increases the estimated probability of lapse. The age effect indicates that older customers tend to have a lower propensity to lapse. The highest risk period (when a customer is most likely to lapse) is between 2 and 4 years after the first policy issue. In any case, it is important to remark that we should be careful with the interpretation due to possible unmeasured confounding. The methodology described in this chapter is suitable for constructing a lapse score, which can provide some indication about the customer's expected behavior before he/she is actually gone. It seems to us a simple and practical tool for practitioners. The following tasks are the next obvious steps in terms of analysis on account of the results in the chapter:
• a more detailed analysis of the effect of "events in the customer's life" (marriage, divorce, death of spouse, change of job, children, changes in income, retirement, voluntary early retirement...) on customer lapses
• a more detailed analysis of lapses after a claim
• a link between lapses in respect of life products and lapses in respect of general insurance products
• a time continuous approach to loyalty scoring, which could provide a dynamic model and would consider time between events occurring in the life of an insurance policy
• an application of survival analysis techniques to predict time to lapse, as we are not only interested in predicting if a lapse is likely to occur but also when it is likely to occur.
Acknowledgments

We thank Spanish grant CICYT SEC2001-3672 for providing support.
References

Abrahamse, A. F. and Carroll, S. J. (1999), "The frequency of excess claims for automobile personal injuries". In: Dionne, G. and Laberge-Nadeau, C. (eds.) Automobile insurance: road safety, new drivers, risks, insurance fraud and regulation, Kluwer Academic Publishers, Boston, pp. 131-150.

Agresti, A. (1990), Categorical data analysis, Wiley, New York.

Artis, M., Ayuso, M. and Guillen, M. (1999), "Modelling Different Types of Automobile Insurance Fraud Behaviour in the Spanish Market", Insurance: Mathematics and Economics, vol. 24, pp. 67-81.

Artis, M., Ayuso, M. and Guillen, M. (2002), "Detection of Automobile Insurance Fraud with Discrete Choice Models and Misclassified Claims", Journal of Risk and Insurance, vol. 69, 3, pp. 325-340.

Brown, G. H. (1952), "Brand loyalty - Fact or fiction?", Advertising Age, vol. 9, pp. 53-55.

Dionne, G., Gourieroux, C. and Vanasse, C. (1999), "Evidence of adverse selection in automobile insurance markets". In: Dionne, G. and Laberge-Nadeau, C. (eds.) Automobile insurance: road safety, new drivers, risks, insurance fraud and regulation, Kluwer Academic Publishers, Boston, pp. 13-46.

Dionne, G., Laberge-Nadeau, C., Desjardins, D., Messier, S. and Maag, U. (1999), "Analysis of the economic impact of medical and optometric driving standards on costs incurred by trucking firms and on the social costs of traffic accidents". In: Dionne, G. and Laberge-Nadeau, C. (eds.) Automobile insurance: road safety, new drivers, risks, insurance fraud and regulation, Kluwer Academic Publishers, Boston, pp. 323-351.
Dionne, G. and Vanasse, C. (1992), "Automobile insurance ratemaking in the presence of asymmetrical information", Journal of Applied Econometrics, vol. 7, pp. 149-165.

Fournier, S. and Yao, J. L. (1997), "Reviving brand loyalty: a reconceptualization within the framework of customer-brand relationships", International Journal of Research in Marketing, vol. 14, 5, pp. 451-472.

Jacoby, J. and Chesnut, R. (1978), Brand loyalty: Measurement and Management, Wiley, New York.

Kapferer, J. N. and Laurent, G. (1983), La sensibilite aux marques, Foundation Jours de France, Paris.

Maddala, G. S. (1989), Limited-dependent and qualitative variables in econometrics, Cambridge University Press, Cambridge.

Mowen, J. C. (1995), Customer behaviour, Prentice Hall, New York.

Pinquet, J. (1997), "Allowance for cost of claims in Bonus-Malus Systems", ASTIN Bulletin, vol. 27, 1, pp. 33-57.

Pinquet, J. (1999), "Allowance for hidden information by heterogeneous models and applications to insurance ratings". In: Dionne, G. and Laberge-Nadeau, C. (eds.) Automobile insurance: road safety, new drivers, risks, insurance fraud and regulation, Kluwer Academic Publishers, Boston, pp. 47-78.

Snell, E. J. and Cox, D. R. (1989), Analysis of binary data, Chapman and Hall, London.

Uncles, M. and Laurent, G. (1997), "Editorial", International Journal of Research in Marketing, vol. 14, 5, pp. 399-404.
Life and Health
Chapter 14

Using Data Mining for Modeling Insurance Risk and Comparison of Data Mining and Linear Modeling Approaches

Inna Kolyshkina, Dan Steinberg, and N. Scott Cardell
Interest in data mining techniques has been increasing recently among actuaries and statisticians involved in analyzing the large data sets common in many areas of insurance. This chapter discusses the use of data mining in insurance and presents several case studies illustrating the application of such data mining techniques as decision trees, multivariate adaptive regression splines and hybrid models in general insurance and health insurance. Data mining is based on modern, computer-intensive methods that are both statistically reliable and fast. These new technologies make intense use of the "brute force" of today's fast processors, so that the computer performs far more computations than in traditional statistical modeling. Thus, data mining methodologies are quicker and more reliable than linear models. Several reasons exist for the increasing attractiveness of data mining vs traditional methods in the actuarial community. First, data sets used in general and health insurance for modeling the probability or cost of a claim typically have a large number of cases as well as many variables. It is often difficult, using traditional linear methods, such as the Generalized Linear Model ("GLM"), to select the most important predictors from a large list of available
variables. In addition, linear methods can be quite time consuming, especially if the modeler has not been familiar with the data prior to the modeling (for example, when the data come from a new client or a new source). Data mining methodologies, on the other hand, can easily handle such large amounts of data, quickly selecting a few statistically significant predictors by scanning for the predictive potential of every variable of the hundreds available. This ensures that no important predictive relationship has been omitted from the model and allows the model to be built and refined quickly. A second advantage of data mining compared to traditional methods is that most data mining modeling methodologies involve automatic "self-testing" of the model. The model-building process first creates a model on a randomly selected portion of the data and then tests the model for performance, and refines it further, on the remaining data. This feature further increases the reliability and robustness of the model. Other advantages of data mining are the ability to develop results in the presence of missing values, while keeping all the useful information in the data, and the ability to model without imposing any assumed structure. Data miners typically use methods such as decision trees, Classification and Regression Trees ("CART®"), neural networks, Multivariate Adaptive Regression Splines ("MARS®"), boosting/ARCing and various hybrid methodologies (Hastie et al. 2001). It is important to note that data mining is different from other "buzz words" in data analysis such as On-Line Analytical Processing ("OLAP"), a tool that allows data to be summarized rather than modeled (Thomsen, 1997). Each data mining method has advantages and disadvantages. We found that CART and MARS models, as well as their hybrids (models that combine CART and MARS as described below), are relatively easy to interpret and explain to a client, as opposed to other methods, in particular neural networks, that are notorious "black boxes", and thus can be a useful tool for analyzing insurance data.
This chapter discusses the use of such models in projects that required modeling risk in workers' compensation and health insurance by PricewaterhouseCoopers Actuarial (Sydney) ("PwC"), the clients being large insurance companies. The chapter describes the methodology used and introduces innovative model performance measures used in data mining, such as gain or lift. The results of data mining modeling are compared to the results achievable by using more traditional techniques such as generalized linear models, currently widely used in the analysis of insurance data. Comparisons are made in terms of time taken, predictive power, selection of most important predictors, handling predictors with many categories (such as postcode or occupation code), interpretability of the model, missing values, etc. The more practical and nonstatistical issues of implementation and client feedback are also discussed.
1 Introduction
Recently a number of publications have examined the use of data mining methods in the insurance and actuarial environment (Francis 2001, WorkCover NSW News 2001), and this can be contrasted with the use of "classical" approaches such as generalized linear models (Haberman and Renshaw 1998; McCullagh and Nelder 1989; Smyth 2002). The main reasons for the increasing attractiveness of the data mining approach are as follows:
• It overcomes the shortcomings of traditional methods that operate under the assumption that data are distributed normally (as is the case in linear regression) or according to another distribution in the exponential family, such as binomial, Poisson or Gamma (as is required for a generalized linear model). Classical linear methods are based on such assumptions, which are often incorrect.
• It relies more than traditional models on the intense use of
computing power. This results in analyses that are less time consuming and more flexible in terms of the selection of predictors than those carried out by classical methods. Classical methods applied to large data sets take longer to develop models, and have particular trouble selecting important interactions between predictors.
• It is able to handle categorical variables with a large number of categories (for example, claimant's occupation, industry or postcode). Classical methods have trouble dealing with such variables: as a result, they are either left out of the model, or have to be grouped by hand prior to inclusion.
• Such data mining methods as decision trees can easily handle incomplete or noisy data, which can often be a problem for traditional linear methods.
Vapnik (1996) and Hastie et al. (2001) give more detail on these points. This chapter gives two examples of the application of data mining methods to modeling risk in insurance, based on recent projects completed by the PwC data mining team for two large insurance company clients, and provides comparisons of the data mining approach with traditional modeling methodologies in terms of predictive power, precision, computational speed and some other aspects. In Section 2 we briefly comment on the definition of data mining, why it became widely available only recently and its origins, and outline the areas in insurance where it is used. We will also emphasize the contrast between data mining and on-line analytical processing (OLAP). In Section 3 we briefly discuss CART (Breiman et al. 1984), the data mining technique based on the decision tree methodology, and multivariate adaptive regression splines (MARS) (Friedman, 1991), a computer-intense method of flexible nonparametric regression. We will also discuss the newest, cutting-edge hybrid model methodology that allows us to combine decision trees, MARS and
traditional linear techniques such as logistic regression, which results in achieving even higher levels of prediction accuracy. The discussion of CART and MARS in this section of the chapter is an introduction that is only complete enough to make clear the use of these techniques in Sections 4 and 5. Further details on CART, MARS, their use and extensions and hybrid modeling technology can be found in the literature (Breiman et al. 1984, Friedman 1991, Lewis and Stevens 1991, 1992, Lewis et al. 1993, Hastie et al. 2001, Han and Camber 2001, Steinberg and Cardell 1998a, 1998b). Section 4 describes a case study based on a project completed by PwC for an insurance company client, where the analysis of a large data set in workers' compensation insurance was conducted using decision tree-based data mining methodology. The section also provides a brief comparison of the modeling results of the CART decision tree model with a logistic regression model for these data. Prediction results are also discussed. Section 5 describes a case study based on a project completed by PwC for a major health insurance company client, where the analysis of a large health insurance data set was conducted using hybrid (Steinberg and Cardell, 1998a, 1998b) data mining methodology. The hybrid model combined CART decision trees, MARS and linear modeling techniques.
2 Data Mining - the New Methodology for Analysis of Large Data Sets. Areas of Application of Data Mining in Insurance
In insurance, as in many other industries (health, telecommunications and banking, to name a few), databases today often reach more
than 1,000,000,000,000 bytes of data. In a data set like this, with millions of cases and hundreds of variables, finding important information is like finding the proverbial needle in a haystack. However, the need to analyze such data sets is very real, and data mining is the technique that can address that need effectively and efficiently.
2.1 Origins and Definition of Data Mining. Data Mining and Traditional Statistical Techniques
The advent of data mining was stimulated, firstly, by the availability of large amounts of data found in modern commercial enterprises and the need to analyze these enormous data sets. It was important to be able to uncover some very complex relationships between large numbers of variables that would be difficult, if not impossible, to discover using manual exploratory techniques. Secondly, the increasing power of computers, such features as high-speed servers and workstations, and their lower cost have allowed the development of new techniques based on an exhaustive exploration of all possible solutions. Data mining can be defined as a process that uses a variety of innovative data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Data mining tools are based on highly automated search procedures: the machine does far more of the work than is the norm in classical statistical modeling. This may require millions of times more computation than is needed for conventional statistical analysis, but given high-speed algorithms and today's fast processors it usually takes at most a few hours to run a data mining model. The best data mining methods can automatically select data to use in pattern recognition, are capable of dealing with noisy and incomplete data, include self-testing to assure that findings are genuine
and provide clear presentation of results and useful feedback to analysts. Some of the best-known data mining methodologies are decision trees, neural networks, multivariate adaptive regression splines (MARS) and boosting. These methods can be used separately or in combination with each other. Data mining, however, does not replace traditional statistical techniques. Rather, it is an extension of existing statistical methods and, as we discuss and illustrate in Section 3, can be effectively used in combination with them.
2.2 Data Mining and On-line Analytical Processing ("OLAP")
It is important to point out the difference between data mining and OLAP, as the names of these two processes are sometimes incorrectly used interchangeably. A detailed description of OLAP is outside the scope of this chapter; however, we want to point out that while OLAP tools provide exploratory data analysis and support multidimensional analysis and decision making, data mining is mainly a tool for in-depth analysis such as predictive modeling, data classification, clustering and the characterization of data changes over time (Han and Camber, 2001).
2.3 The Use of Data Mining within the Insurance Industry
In the insurance industry various data mining methodologies have been applied to analyze large data sets. For example, such methods have been used for risk prediction and assessment, premium setting, fraud detection, health cost prediction, treatment management optimization, investment management optimization, and member retention and acquisition strategies (Francis 2001; WorkCover NSW News 2001).
3 Data Mining Methodologies. Decision Trees (CART), MARS and Hybrid Models
The PwC data mining team has performed an extended study of the available data mining techniques and made the decision to select decision trees and MARS for everyday modeling of insurance data. The reasons for this selection are as follows. Decision trees and MARS share the usual features of data mining methodologies, such as the ability to deal with large data sets, a high degree of precision, the ability to handle categorical variables with many categories, and the ability to select the important predictors out of many potential predictors. Also, unlike some other predictive modeling techniques such as, for example, neural nets, they are very interpretable and, which is very important for a consultant, easy to explain to the client. They are faster than neural nets, require less data preparation and are minimally affected by outliers. Another important feature of CART and MARS is that they are easy to implement in SAS, which is the main data analysis software package used by PwC as well as by the majority of our clients.
3.1 Classification and Regression Trees (CART)
The CART methodology is technically known as binary recursive partitioning (Breiman et al. 1984). It is binary because the process of modeling involves dividing the data set into exactly two subgroups (or "nodes") that are more homogeneous with respect to the response variable than is the initial data set. It is recursive because the process is repeated for each of the resulting nodes. The resulting model is called a decision tree or simply a tree. To split the data into two nodes, CART asks questions that have a yes/no answer. For example, questions might be: "Is age greater than 55?" or "Is the postcode equal to one of 1052, 1530 or 2300?".
CART then compares all possible splits for all values of all variables included in the analysis, conducting an exhaustive search through them all and selecting the split that divides the data into two nodes with the highest degree of homogeneity. Once this best split is found, CART repeats the search process for each "child" node, continuing recursively until further splitting is impossible or is stopped. When the data set is sufficiently large, CART builds the model on a randomly selected part (usually two-thirds) of the data (the "learning sample") and then tests and refines it on the remaining data (the "testing sample"). The learning sample is used to grow an overly large tree. The testing sample is then used to estimate the rate at which cases are misclassified (possibly adjusted by misclassification costs) and to "prune" the large tree accordingly. The nature of this self-testing process of model building further enhances the stability of the model. The resulting model is usually represented visually as a tree diagram. It divides all data into a set of several non-overlapping subgroups or nodes so that the estimate of the response is "close" to the actual value of the response within each node (Lewis et al. 1993). Important features of CART are that it is unaffected by outliers and that it deals with missing values easily. An obvious disadvantage of any decision tree, including CART, is the fact that the predicted value of the response is discontinuous, which means that sometimes a small change in the value of a predictor x could lead to a large change in the predicted value of the response y. Also, the decision tree model is coarse-grained in the sense that a model with n nodes can only predict n different probabilities, which can be an issue, particularly if the tree produces only a small number of nodes (Steinberg and Cardell, 1998a, 1998b). Another disadvantage is weakness at capturing strong linear structure: a very large tree can be produced in an attempt to represent very simple linear relationships. The algorithm recognizes the structure but cannot represent it effectively.
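To make the split search concrete, here is a toy sketch of a single binary split chosen by exhaustive search with the Gini impurity criterion. It is only an outline of the idea; the actual CART software also handles categorical splits, misclassification costs, surrogate splits for missing values and the pruning step described above.

    import numpy as np

    def gini(y):
        """Gini impurity of a set of 0/1 class labels."""
        if len(y) == 0:
            return 0.0
        p = np.mean(y)
        return 2.0 * p * (1.0 - p)

    def best_split(x, y):
        """Exhaustive search for the split 'x <= c' that minimizes the
        weighted Gini impurity of the two child nodes."""
        best_cut, best_impurity = None, np.inf
        for c in np.unique(x)[:-1]:          # candidate split points
            left, right = y[x <= c], y[x > c]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if impurity < best_impurity:
                best_cut, best_impurity = c, impurity
        return best_cut, best_impurity

    # Toy data: claimant age and a 0/1 indicator of a "serious" claim.
    age = np.array([19, 23, 31, 38, 44, 52, 57, 63, 66, 70])
    serious = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
    print(best_split(age, serious))          # a cut-off at age 38 for these data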
3.2 Multivariate Adaptive Regression Splines (MARS)
MARS is an adaptive procedure for regression, and can be viewed as a generalization of stepwise linear regression or a generalization of the recursive partitioning method (Breiman et al. 1984) to improve the latter's performance in the regression setting. The central idea in MARS is to formulate a modified recursive partitioning model as an additive model of functions from overlapping, instead of disjoint as in recursive partitioning, subregions (Lewis et al. 1993). The MARS procedure builds flexible regression models by fitting separate splines (or basis functions) to distinct intervals of the predictor variables. MARS uses expansions in piecewise linear basis functions of the form (x − t)_+ and (t − x)_+. The "+" means positive part, so

(x - t)_+ = \begin{cases} x - t, & \text{if } x > t, \\ 0, & \text{otherwise,} \end{cases} \qquad (t - x)_+ = \begin{cases} t - x, & \text{if } x < t, \\ 0, & \text{otherwise.} \end{cases}
For example, in one of our MARS models we used these two basis functions based on the variable "age": (age − 45)_+ and (45 − age)_+. Each function is piecewise linear, with a knot at the value t. They can be described as linear splines. The collection of basis functions is

C = \{ (X_j - t)_+, (t - X_j)_+ : t \in \{ x_{1j}, x_{2j}, \ldots, x_{Nj} \}, \; j = 1, 2, \ldots, p \}.
If all of the input values are distinct, there are 2Np basis functions altogether. The model-building strategy is like a forward stepwise linear regression, but instead of using the original inputs, we are allowed to use functions from the set C and their products. Thus the model has the form
f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X),
where each h_m(X) is a function in C, or a product of two or more such functions (Hastie et al. 2001). Both the variables to use and the end points of the intervals for each variable, referred to as knots, are found via an exhaustive search procedure. Variables, knots and interactions are optimized simultaneously by evaluating a "loss of fit" (LOF) criterion: at each step MARS chooses the addition that most improves the LOF. In addition to searching variables one by one, MARS also searches for interactions between variables, allowing any degree of interaction to be considered. The "optimal" MARS model is selected in a two-phase process. In the first phase, a model is grown by adding basis functions (new main effects, knots, or interactions) until an overly large model is found. In the second phase, basis functions are deleted in order of least contribution to the model until an optimal balance of bias and variance is found. By allowing for any arbitrary shape for the response function as well as for interactions, and by using the two-phase model selection method, MARS is capable of reliably tracking very complex data structures that often hide in high-dimensional data (Salford Systems, 2002).
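As a small illustration of the piecewise linear basis, the sketch below builds the two hinge functions (age − 45)_+ and (45 − age)_+ mentioned above and fits an ordinary least squares regression on them. This is only a toy stand-in for the MARS forward/backward search, which chooses the variables, knots and interactions automatically; the data are simulated.

    import numpy as np

    def hinge(x, t):
        """Positive-part basis functions (x - t)+ and (t - x)+ with knot t."""
        return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

    # Simulated data: a V-shaped relationship between age and expected claim cost.
    rng = np.random.default_rng(2)
    age = rng.uniform(18, 75, size=400)
    cost = 200 + 12 * np.abs(age - 45) + rng.normal(0, 30, size=400)

    h1, h2 = hinge(age, 45.0)                         # (age - 45)+ and (45 - age)+
    design = np.column_stack([np.ones_like(age), h1, h2])
    coef, *_ = np.linalg.lstsq(design, cost, rcond=None)
    print(coef)   # roughly (200, 12, 12): one slope on each side of the knot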
3.3 Hybrid Models
It is typically pointed out in the literature (see, for example, Hastie et al. 2001, Steinberg and Cardell 1998a,b) that decision trees, in particular CART trees, have such strengths as simple structure, ease of interpretation, the ability to identify the most important predictors out of many potential variables, a high level of computational speed, the ability to handle missing values effectively and robustness to outliers in the data, but at the same time these models have a number of weaknesses. As mentioned above, the main weaknesses of decision trees are the lack of smoothness in the
predicted response and difficulty in modeling linear structure. On the other hand, other data analysis techniques, including linear models (in particular logistic regression), neural networks and MARS, provide a continuous smooth response and a unique predicted value for every record, but some of them, especially the linear methods, are sensitive to outliers, exclude records with missing values from the analysis, require extensive data pre-processing and are considerably slower than tree models. It is natural, then, to try to combine these "smooth" modeling techniques with decision trees in such a way that their strengths are combined effectively. Steinberg and Cardell (1998a, 1998b) describe the methodology of such combining. They suggest that the model is built in several consecutive steps. Firstly, a CART tree is built on the data. Then the output of the tree model, in the form of a terminal node indicator, of the predicted values or of the complete set of indicator dummies (the latter form is preferred by the authors), is included among the other inputs in the "smooth" model. The authors point out that the effects identified by such a hybrid model in addition to the underlying tree model are likely to be weak, as all strong effects have already been detected by the tree; nevertheless, a collection of weak effects can be very significant.
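A minimal sketch of this hybrid idea is given below, using scikit-learn's CART-style tree and logistic regression as stand-ins for the software actually used, and simulated data in place of claim records. A small tree is grown first, its terminal-node indicator dummies are added to the original inputs, and a logistic regression is then fitted on the augmented predictor set; this is an outline of the approach described by Steinberg and Cardell, not their implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    # Simulated data standing in for claim records (illustration only).
    rng = np.random.default_rng(3)
    X = rng.normal(size=(5000, 4))
    signal = -1.5 + 1.2 * ((X[:, 0] > 0.5) & (X[:, 1] > 0)) + 0.6 * X[:, 2]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-signal)))

    # Step 1: grow a small CART-style tree on the data.
    tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)

    # Step 2: terminal-node membership, turned into indicator dummies.
    leaves = tree.apply(X)
    dummies = (leaves[:, None] == np.unique(leaves)[None, :]).astype(float)

    # Step 3: logistic regression on the original inputs plus the node dummies.
    X_hybrid = np.hstack([X, dummies])
    hybrid = LogisticRegression(max_iter=1000).fit(X_hybrid, y)
    print(hybrid.score(X_hybrid, y))   # in-sample accuracy of the hybrid model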
4 Case Study 1. Predicting, at the Outset of a Claim, the Likelihood of the Claim Becoming Serious
This case study provides a brief comparison of the data mining (CART decision tree) and logistic regression modeling approaches. Some aspects of this case study, in particular the details referring to logistic regression modeling, are described in more detail in Kolyshkina et al. (2002). In this chapter we focus on the aspects of CART modeling of the data.
4.1 Problem and Background
In workers' compensation insurance, serious claims (defined here as claims that were litigated or where the claimant stayed off work for a long period of time) comprise only about 15% of all claims by number, but create some 90% of the incurred cost. This means that in order to reduce the cost in a maximally effective way, an insurer would need to concentrate its attention on such claims. From a practical point of view, the insurer must ensure that the management of claimant injuries is carried out in such a way that the injured person receives the most effective medical treatment at appropriate points in time, to prevent his or her injury from becoming chronic and to enable the claimant to return to work in an optimal manner. To do so, the insurer ideally would need to know, at the time a claim is received, whether the claim is likely to become serious. But in most cases this is not obvious, as there are many factors contributing to the result. Therefore, it would be useful to have a model that would account for all such factors and would be able to predict at the outset of a claim the likelihood of this claim becoming serious.
4.2 The Data, Its Description and Preparation for the Analysis
The data available for modeling consisted of several years' worth of information about a large number of claims from the New South Wales ("NSW") workers' compensation scheme. The data contained information:

• About the claim itself. Data on the claim included the date the claim was registered, the policy renewal year of the claim, the date when the claim was closed, the date of any claim reopening, whether the claim was litigated, various liability estimates, payments made on the claim, reporting delay, etc.

• About the claimant. Data on the claimant included demographic characteristics of the client such as sex, age at the time of injury, family situation and whether the claimant had dependants, and the claimant's cultural background. There were also variables related to the claimant's occupation, type of employment and work duties, such as industry and occupation codes, nature of employment (permanent, casual, part- or full-time), wages, etc.

• About the injury or disease, such as time and place of injury, injury type, body location of the injury, cause or mechanism of injury, nature of injury, etc.

Overall there were 83 variables that might have been considered as potential predictors, most of them categorical with many categories; for example, the variable "occupation code" had 285 categories and "injury location code" had 85 categories.
4.3 The Analysis, Its Purposes and Methodology
The two major purposes of the analysis were to identify the most important predictors from a large number of available variables containing information about a claim at the time when the claim is registered, and to build a model based on such predictors that would classify a claim as "likely to become serious" or "not likely to become serious". Another objective of the analysis was to compare the performance of the data mining and traditional statistical modeling approaches in terms of prediction accuracy, computational speed, and ability to handle issues common in insurance data, such as the presence of missing values and predictors with many categories, with a view to selecting one of the methods for future use with this type of data.

4.3.1 Analysis Using CART
We ran a CART model on the full data set using the two splitting methods offered by the CART software for a binary response variable, Gini and Twoing (introduced by Breiman et al. (1984)). Salford Systems (2000) note that sometimes the differences between the models obtained by selecting the splitting criterion will be modest and at other times profound, and mention examples where a judicious choice of splitting rule in CART reduces the error rate by 5 to 10 percent. They add that the value of a decision tree may also be judged by whether its best nodes have a relatively high concentration of the response value of interest, and that although there are certain rule-of-thumb recommendations about which of the splitting rules offered by the CART software for classification-type modeling is best suited to which type of problem, it is good practice to always try a few different splitting rules and compare the results. In addition, certain splitting rules may be better suited to specific types of data. Experimenting with the choice of splitting rule can therefore produce different results, and such experimentation also provides a better understanding of data-specific issues (Salford Systems, 2000).

We provide here a brief description of the two splitting criteria; a more detailed description can be found in Breiman et al. (1984) and Steinberg and Colla (1995). The Gini criterion is based on class heterogeneity. It will always select the split that maximizes the concentration of class 1 ("serious" claims) in one node. Repeated application will tend towards isolating the serious claims in groups separate from the non-serious claims. In situations where there are more than two classes, Gini will focus on the most important class (on the basis of size or cost) first, and then proceed to the class next in importance; it is the default criterion in CART due to its generally superior performance. Twoing concentrates on class separation rather than class heterogeneity and selects the split that maximizes the separation between the classes (the "serious" and "non-serious" claims). An important variation of the Twoing rule is power-modified Twoing, which places a heavier weight on splitting the data in a node into two equal-sized partitions and is more likely to generate a near 50-50 partition of the data than is simple Twoing. Twoing was used in this case in its power-modified version, with power equal to one.

We compared the models built using both criteria, and it appeared that the models gave very similar results in terms of the decision tree structure, the richness of nodes in serious claims, and the selection of the important predictors. This finding further confirmed the stability of the model and that the modeling methodology was correctly identifying the underlying patterns that characterize this type of data. The Gini criterion-based model, however, produced slightly better results than the Twoing criterion-based model in terms of prediction accuracy and ranking of the cases.

The CART decision tree offers two main tools of model evaluation: the gains chart and the classification tables for the learning and testing samples. These tools allow us to appreciate three aspects of the model. Firstly, we can conduct specificity and sensitivity analysis from the classification table. Secondly, we can investigate model stability by comparing the classification tables for the test and learning samples (Table 1). Finally, we can appreciate how well the model performs in terms of the ranking of the cases by examining the gains chart (Figure 1). Both evaluation methods are easy to interpret and to explain to a client.

Table 1. Misclassification tables for the learning and testing data (for the model based on the Gini criterion).

  Misclassification for Learn Data
  Class         N Cases    N Misclassed   Percent Error   Cost
  Serious        16,922        3,891          22.99       0.23
  Non-Serious   105,358       25,744          24.43       0.24

  Misclassification for Test Data
  Class         N Cases    N Misclassed   Percent Error   Cost
  Serious         8,558        2,275          26.58       0.27
  Non-Serious    52,866       12,923          24.44       0.24
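As an illustration of the two splitting criteria discussed above (a generic sketch following Breiman et al. (1984), not the CART software itself), the code below evaluates the Gini impurity reduction and the Twoing criterion for a candidate binary split of a node; the class labels and the split are toy data, and the treatment of the power argument is our assumption about how the power-modified variant re-weights the p_L p_R term.

import numpy as np

def gini_impurity(y):
    # y is a 0/1 array of class labels in a node
    p = y.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def gini_gain(y, go_left):
    # decrease in Gini impurity produced by the candidate split 'go_left'
    n = len(y)
    yl, yr = y[go_left], y[~go_left]
    return (gini_impurity(y)
            - (len(yl) / n) * gini_impurity(yl)
            - (len(yr) / n) * gini_impurity(yr))

def twoing(y, go_left, power=1.0):
    # Twoing criterion; power = 1 gives the standard rule, larger values
    # (assumed form of the power-modified variant) favour 50-50 partitions
    n = len(y)
    pl, pr = go_left.mean(), (~go_left).mean()
    sep = sum(abs((y[go_left] == c).mean() - (y[~go_left] == c).mean())
              for c in np.unique(y))
    return ((pl * pr) ** power) / 4.0 * sep**2

y = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])      # 1 = "serious" claim
split = np.array([True, True, False, True, True,
                  False, False, False, True, False])
print(gini_gain(y, split), twoing(y, split), twoing(y, split, power=2.0))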
We tried different ways of randomly dividing the data into the learning and testing samples, for example 50 percent in each sample, or 60 percent in the learning sample and 40 percent in the testing sample, etc., but the results were very similar each time. We present in Table 1 results based on the default split suggested by the CART software, which randomly selects two thirds of the data as the learning sample and uses the remaining data as the testing sample. Table 1 suggests that the model performs very similarly on the learning and testing samples (as is usually the case, the model performs slightly better on the learning sample), which indicates robustness of the model and no evidence of any significant overfitting.

Figure 1. Gains charts for the CART model and the logistic regression model, together with the baseline.
A gains chart is a popular model evaluation tool in data mining. The data are ordered from nodes with the highest concentration of cases in class 1 (in this case "serious" claims) to the lowest. The horizontal axis represents the percentage of the population examined and the vertical axis represents the percentage of "serious"
claims identified. The 45° line represents the situation in which the tree gives no useful information about "serious" claims, i.e., each node is a random sample of the population. The curved line represents the cumulative percentage of "serious" claims versus the cumulative percentage of the total population. The vertical difference between these two lines depicts the gain, or lift, at each point along the horizontal axis. Gains charts help the modeler to appreciate how well the model predicts and how well it ranks the predicted cases: the higher the curve is above the line, the better the model. For example, consider the gains chart for the CART tree shown in Figure 1. In the first 20% of the population we find 60% of the "serious" claims, a gain or lift of 40%.
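The gains chart itself is straightforward to compute from any model's scores. The sketch below is generic (simulated scores and outcomes rather than the NSW data): it sorts cases by predicted probability of a serious claim and accumulates the proportion of actual serious claims captured.

import numpy as np

def gains_curve(y_true, score):
    """Return (% of population examined, cumulative % of positives captured)."""
    order = np.argsort(-score)                  # highest predicted risk first
    captured = np.cumsum(y_true[order]) / y_true.sum()
    examined = np.arange(1, len(y_true) + 1) / len(y_true)
    return examined, captured

rng = np.random.default_rng(1)
score = rng.uniform(size=10_000)                       # hypothetical model scores
y = (rng.uniform(size=10_000) < score * 0.3).astype(int)  # simulated outcomes

examined, captured = gains_curve(y, score)
k = np.searchsorted(examined, 0.20)
print(f"top 20% of cases capture {captured[k]:.0%} of serious claims "
      f"(lift over baseline: {captured[k] - 0.20:.0%})")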
4.3.2 Brief Discussion of the Analysis Using Logistic Regression and Comparison of the Two Approaches
Our first step was to attempt building the model using logistic regression, the traditional statistical modeling approach for the analysis of data with a binary response. Logistic regression is a well-known classical technique that is easily implemented in SAS, the software package mainly used for statistical analysis within our practice as well as by the client, and it is a familiar and reliable data analysis tool. We identified differences between the two approaches in the following areas: computational speed and time requirements; selection of significant predictors; handling categorical predictors with many categories and picking up interactions of predictors; missing values; model interpretability; individual scoring versus segmenting (the decision tree segments all cases into several groups that are homogeneous in terms of the likelihood of a serious claim, and all cases within such a group are assigned the same estimate of probability, whereas the logistic regression model provides each case with an individual estimate of the probability); and model evaluation and accuracy. A detailed report on the differences found can be found in Kolyshkina et al. (2002). We plotted the results from the logistic regression on the gains chart (Figure 1), showing that in this case the logistic regression model performs almost as well as the CART model for the first 15% of the population, but significantly worse thereafter.

To summarize, the comparison of the two methods showed that for this large data set the data mining approach (the CART decision tree model) performed slightly better in terms of accuracy and significantly better in ranking the cases, as shown by the gains chart (Figure 1). We want to add, however, that in some situations, for instance when the number of predictors is relatively small, most of them are numeric rather than categorical, and the assumptions for logistic regression are clearly valid, logistic regression might be preferred to a tree model. The best approach, however, is to combine the two modeling techniques using the hybrid methodology (Steinberg and Cardell, 1998a) described above. PwC have used this approach in one of our projects, as described in detail in Section 5.
4.4 Findings and Results, Implementation Issues and Client Feedback
The two results of the modeling that were most important for us were the selection of important predictors and the prediction, at the outset of the claim, of the likelihood that a claim becomes serious. The CART decision tree selected 19 significant predictors out of the 83 potential predictors. It is interesting to note that some variables that turned out to be important predictors were expected to be so on the basis of previous experience and analysis, for example, injury details (nature, location and mechanism of injury), while some others, such as the language skills of the claimant, were unexpected. The CART decision tree classifies correctly about 80% of all serious claims. The model targets 30% of all claims as "likely to be serious"; of these, about half turn out to be serious. The model has been incorporated into the insurer's computer system and automatically generates a "priority score" for each claim. This score is used to determine the level of staff who will be allocated the claim.
4.5 Conclusion and Future Directions
In the case study discussed, the data mining (CART decision tree) methodology performed well for prediction of serious claims and proved superior to the 'classical' methodology of logistic regression. Some of the most significant predictors were the nature, location and mechanism of injury, age of the claimant, claimant's language skills and claimant's occupation. For future improvement of our model, we plan to apply hybrid models using the strengths of both logit and CART approaches (Steinberg and Cardell, 1998a,b).
5 Case Study 2. Health Insurer Claim Cost
5.1 Background
In a recent project completed by PwC for a major health insurance company client, data mining methodology was applied to creating a model of total claim cost, including models for:

• Extras claims frequency and cost for the next year
• Hospital claims frequency and cost for the next year
• Transitions from one type of product to another
• Births, marriages, deaths and divorce
• Lapses.
These were then combined in a model of overall projected lifetime member value. In this chapter we discuss in detail the model focused on predicting hospital claim cost for the next year.
5.2 Data
PwC data analysis specialists used data relating to each member for three years, on a member (rather than a policy) basis. We only included in the analysis the data for members who had joined the health insurance company before 1 January 1998. This restriction allowed us to avoid possible issues related to the need to account for waiting periods and enabled us to use two years of full historical data for building our predictive models. The model variables can be grouped as follows.

• Demographic variables, such as the age of the member, their gender, and the family status of the policy holder (the member who is responsible for making policy payments)

• Socio-economic and geographic variables, such as the geographic location of the member's residence; socio-economic indices related to the geographic area of the member's residence, such as the index of education and occupation, the index of economic resources, and the index of relative socio-economic advantage and disadvantage; the index of relative supply of hospital beds, depending on the member's region; and various indicators of the member's socio-economic status

• Membership and product details, such as the duration of the membership and details of the product held under the membership

• Variables related to the claim history and medical diagnosis details of the member, such as the code of medical diagnosis, the number of hospital and extras claims made by the member in previous years, the number of hospital episodes and other services provided to the member in previous years, the number of claims in a particular calendar year, and the number of hospital bed days in previous years

• Other variables, such as distribution channel, most common transaction channel, and payment method.

Overall there were about 100 variables.
5.3 Overall Modeling Approach
Our overall approach has been to use the methodology described in detail below to analyze the two years of historical data, to find which factors best explain the claims experience, and to predict the cost of hospital claims in the year 2000. The hospital claim cost over twelve months has been taken as the sum of the following benefits paid by our client: accommodation benefit, theatre benefit, prosthesis benefit, gap benefit and some other benefits. Following the suggestion of health management experts, we divided hospital events into three separate types:

• Where a member has made at least one hospital claim in the year but stayed in hospital for not more than one day, as shown by the difference between admission and separation dates. We have called this a "same day claim" for that year

• Where a member has made at least one hospital claim in the year, has stayed in hospital for more than one day, and has made a claim for operating theatre services. We have called this a "surgical claim" for that year

• Where a member has made at least one hospital claim in the year, has stayed in hospital for more than one day, but has not made a claim for theatre services. We have called this a "medical claim" for that year.

The experts expected each of these hospital events to have different risk drivers, which has indeed turned out to be the case. For each type of hospital event described above we have modeled separately the probability of at least one claim in the next 12 months and the expected claim cost over the next 12 months, given at least one claim. Multiplying these two quantities together gave the expected hospital claim cost for each of the client's members, based on their data history over the past two years.
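The way the pieces are combined can be sketched as follows; the probabilities and conditional costs below are invented, whereas in the project they would come from the six fitted sub-models.

import numpy as np

# Hypothetical per-member outputs of the sub-models: for each of the three
# event types, a probability of at least one claim next year and an expected
# cost given at least one claim (three members shown).
prob = {"same_day": np.array([0.05, 0.20, 0.01]),
        "surgical": np.array([0.02, 0.10, 0.01]),
        "medical":  np.array([0.03, 0.15, 0.02])}
cost_given_claim = {"same_day": np.array([800.0, 900.0, 700.0]),
                    "surgical": np.array([9000.0, 11000.0, 8000.0]),
                    "medical":  np.array([4000.0, 5000.0, 3500.0])}

# Expected cost per member = sum over event types of P(claim) * E[cost | claim].
expected_cost = sum(prob[t] * cost_given_claim[t] for t in prob)
print(expected_cost)   # one expected hospital cost per member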
5.4 Modeling Methodology
We first applied decision tree analysis to the data, using the CART software, for the purposes of exploratory analysis as well as preliminary modeling. The preliminary tree model suggested that the member base needed to be broken down into four large segments, as it appeared that the claim experience was significantly different for each of the segments. This finding suggested that for optimal modeling results it was best to conduct further analysis and modeling separately for each of the main segments. The segments identified by CART were based on previous health history, hospital claim experience and age of the member, and were defined as follows.

• Members who have had a hospital claim in the last two years (we have called this group the "sick")
• Members who have not had a hospital claim in the last two years and are aged younger than 25 years (the "young healthy")
• Members who have not had a hospital claim in the last two years and are aged 25 to 64 years (the "middle healthy")
• Members who have not had a hospital claim in the last two years and are aged 65 years or older (the "old healthy").

We then built a separate decision tree model for each of the groups above. The risk drivers for these models were sometimes similar (for example, age and the type of hospital cover were among the important predictors for practically all groups), but in many cases there were significant differences. For example, one of the most important predictors of hospital claims cost for the "young healthy" was the index of relative supply of hospital beds, depending on the member's region of residence, which was not the case for any of the other groups.

The response (hospital claim cost) was continuous, and it was important in this case to provide a continuous predicted value and an individual estimate for each record rather than simply segment the data into groups that are homogeneous in terms of the average predicted hospital cost. Also, we expected some of the potentially important predictors selected by the CART decision tree, such as age, to be related to the response in a linear way, and decision trees are known for not being able to effectively model a strong linear structure. For these reasons, to further enhance the predictive accuracy of the model, we then built a hybrid model of CART and MARS using the hybrid modeling methodology (Steinberg and Cardell, 1998a, 1998b) discussed above. This was achieved by including the decision tree output (in the form of a categorical variable that assigned each record to one of the segments according to the CART tree model) as one of the input variables in a MARS model. Other inputs into the MARS model were the variables selected by CART as important predictors.

MARS, like CART, ranks variables in order of their importance as predictors, from the highest to the lowest. It ranked the predictor representing the CART decision tree output as the most important, which provided further confirmation of the adequacy and good performance of the CART tree model. However, as we had expected, other, mostly continuous, variables were high in the importance list, which showed that the MARS model was picking up minor linear effects that had not been captured by the decision tree. In some cases, where we wanted to achieve an even higher degree of precision, we applied further hybridization techniques, combining CART, MARS and a linear model such as logistic regression or a generalized linear model. This was done by feeding the MARS output (in the form of the basis functions created by MARS) as inputs into a linear model. Such a "three-way hybrid" model was easy to implement in SAS, as the MARS basis functions are already produced in SAS format, so a user can directly copy and paste them into a SAS program. To get a total predicted hospital cost, as previously discussed, we simply multiplied our probabilities and "costs given at least one claim" together, and then summed over the various claim types.
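To illustrate the "three-way hybrid" idea in a self-contained way (this is our sketch, not the SAS implementation used in the project), the code below builds MARS-style hinge basis functions at assumed knots, adds the CART segment indicator as dummies, and fits an ordinary linear model to the combined design matrix; the knots, segment labels and data are all invented for illustration.

import numpy as np

def hinge(x, knot):
    # MARS-style basis function max(0, x - knot)
    return np.maximum(0.0, x - knot)

rng = np.random.default_rng(2)
age = rng.uniform(18, 90, size=500)
segment = rng.integers(0, 4, size=500)            # hypothetical CART segment (0-3)
y = 200 + 15 * hinge(age, 50) + 300 * (segment == 0) + rng.normal(0, 50, 500)

# Design matrix: intercept, hinge terms at assumed knots, CART segment dummies
# (segment 0 is the baseline).
X = np.column_stack([np.ones_like(age),
                     hinge(age, 40), hinge(age, 50), hinge(age, 65),
                     (segment[:, None] == np.arange(1, 4)).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))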
5.5 Model Diagnostics and Evaluation
The main tools we used for summarizing model diagnostics were the gains chart and the analysis of actual versus predicted values for hospital cost.

5.5.1 Gains Chart for Total Expected Hospital Claims Cost
The gains chart for the overall hospital claims cost model, presented in Figure 2, shows that we are able to predict the high-cost claimants with a good degree of accuracy. As a rough guide, the overall claim frequency is 15%. Taking the 15% of members predicted by the model as having the highest cost, we capture 56% of the total actual cost. Taking the top 30% of members predicted as having the highest cost, we capture almost 80% of the total actual cost. Some of this performance is generated simply by the dependence of cost on age, but we estimate that the model improves the gains chart by around 10% of total cost in the high-cost part of the population, over a simple model depending on age alone.

Figure 2. The gains chart for overall annual hospital cost, showing the percentage of actual cost created by claimants in the highest predicted category against the percentage created by randomly selected claimants.
5.5.2 Actual versus Predicted Chart
A further diagnostic of model performance is the analysis of actual versus expected values of the probability of claim or the claim cost. Such an analysis can be pictorially represented by a bar chart of averaged actual and predicted values for overall annual hospital cost. This chart is shown in Figure 3. To create this chart, the members were ranked from highest to lowest in terms of predicted cost, and then segmented into 20 equally sized groups. The average predicted and actual values of hospital cost for each group were then calculated and graphed.

Figure 3. The bar chart of averaged actual and predicted values for overall annual hospital cost, by percentile of predicted cost.
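This diagnostic can be reproduced generically as follows (simulated predicted and actual costs stand in for the client data): members are ranked by predicted cost, cut into 20 equally sized groups, and the group averages of predicted and actual cost are compared.

import numpy as np

rng = np.random.default_rng(3)
predicted = rng.gamma(shape=1.2, scale=400.0, size=20_000)
actual = predicted * rng.lognormal(mean=-0.05, sigma=0.8, size=20_000)  # noisy "truth"

order = np.argsort(-predicted)                 # highest predicted cost first
groups = np.array_split(order, 20)             # 20 equally sized groups
for i, g in enumerate(groups, start=1):
    print(f"group {i:2d}: mean predicted {predicted[g].mean():8.0f}  "
          f"mean actual {actual[g].mean():8.0f}")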
If the model is performing well, such a chart will demonstrate a high degree of differentiation, in other words, the highest average predicted value will be much higher than the lowest average predicted value, and the actual and predicted average values will match closely. If we consider the chart presented in Figure 3, we will see that the shape of the chart suggests a good degree of differentiation
as well as good model fit. The model slightly over-predicts for the lower expected costs but this was of little business importance for the client.
5.6 Findings and Results
5.6.1 Hospital Cost Model Precision
The hospital cost model achieved a high degree of precision, as demonstrated by the actual versus predicted graph (Figure 3) and the gains chart (Figure 2) above.

5.6.2 Predictor Importance for Hospital Cost
Predictors of the highest importance for overall hospital cost were age of the member, gender, number of hospital episodes and hospital days in the previous years, the type of cover and socio-economic characteristics of the member. Other important predictors included duration of membership, family status of the member, the type of cover that the member had in the previous year, previous medical history and the number of physiotherapy services received by the member in the previous year. The fact that the number of ancillary services (physiotherapy) affected hospital claims cost was a particularly interesting finding. Less important predictors included hospital supply index for the geographic area, socio-economic indices of the geographic area and number of optical and pharmaceutical services received by the member in the previous year. A number of factors such as, for example, number of chiropractic and dental services received by the member in the previous year seemed to have minimal or no effect on hospital cost. It is important to note that while some health insurance specialists argue that the only main risk driver for hospital claim cost is age of the member, our results have demonstrated clearly that although age is among important predictors of hospital claims cost,
there are other significant contributing factors such as, for example, socio-economic indicators and, for some age groups, the factors related to availability of hospitals in the geographic location of the member's residence.
5.7 Implementation and Client Feedback
The deliverables for the model included a SAS algorithm that takes the required input data and produces a cost score for a given member, so the client could easily implement the model directly in the SAS environment. Client feedback was very positive.
6 Neural Networks
We have not directly compared our results to the results achieved by the use of neural network techniques. The reason for this was the fact that neural networks are known to be "black boxes" that are hard to interpret and explain to a client. We did not use neural networks in combination with decision trees and MARS, though building such hybrid models would seem to be a fruitful question for further research.
7 Conclusion
In conclusion, we would like to reiterate that the data mining methodology of CART decision trees and MARS effectively addresses the need for analysis of large data sets: it appears to perform faster, to be unaffected by such data issues as missing values and categorical predictors with many categories, to be more accurate, and to require less data preparation than the traditional linear methods. A number of projects completed by PwC for large insurer clients have demonstrated that these methods can be very useful for the analysis of insurance data.
Acknowledgments

We would like to thank Dr Richard Brookes (PwC) for support, advice and thoughtful comments on the manuscript.
References

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth, Pacific Grove, CA.

Francis, L. (2001), "Neural networks demystified," Casualty Actuarial Society Forum, pp. 252-319.

Haberman, S. and Renshaw, A.E. (1998), "Actuarial applications of generalized linear models," in Hand, D.J. and Jacka, S.D. (eds.), Statistics in Finance, Arnold, London.

Han, J. and Kamber, M. (2001), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York.

Kolyshkina, I., Petocz, P., and Rylander, I. (2002), "Modeling insurance risk: a comparison of data mining and logistic regression approaches," submitted to The Australian and New Zealand Journal of Statistics, October 2002.

Lewis, P.A.W. and Stevens, J.G. (1991), "Non-linear modeling of time series using multivariate adaptive regression splines," Journal of the American Statistical Association, vol. 86, no. 416, pp. 864-867.

Lewis, P.A.W., Stevens, J., and Ray, B.K. (1993), "Modeling time series using multivariate adaptive regression splines (MARS)," in Weigend, A. and Gershenfeld, N. (eds.), Time Series Prediction: Forecasting the Future and Understanding the Past, Santa Fe Institute: Addison-Wesley, pp. 297-318.

McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd ed., Chapman and Hall, London.

Salford Systems (2000), CART for Windows User's Guide, Salford Systems.

Salford Systems (2002), "MARS (Multivariate Adaptive Regression Splines)," http://www.salford-systems.com (accessed 08/10/2002).

Smyth, G. (2002), "Generalised linear modeling," http://www.statsci.org/glm/index.html (accessed 25/09/2002).

Steinberg, D. and Cardell, N.S. (1998a), "Improving data mining with new hybrid methods," presented at DCI Database and Client Server World, Boston, MA.

Steinberg, D. and Cardell, N.S. (1998b), "The hybrid CART-Logit model in classification and data mining," Eighth Annual Advanced Research Techniques Forum, American Marketing Association, Keystone, CO.

Steinberg, D. and Colla, P.L. (1995), CART: Tree-Structured Nonparametric Data Analysis, Salford Systems, San Diego, CA.

Thomsen, E. (1997), OLAP Solutions: Building Multidimensional Information Systems, John Wiley & Sons, Inc.

Vapnik, V. (1996), The Nature of Statistical Learning Theory, Springer-Verlag, New York.

WorkCover NSW News (2001), "Technology catches insurance fraud," http://www.workcover.nsw.gov.au/pdf/wca46.pdf (accessed 08/10/02).
Chapter 15

Emerging Applications of the Resampling Methods in Actuarial Models

Krzysztof M. Ostaszewski and Grzegorz A. Rempala

Uncertainty of insurance liabilities has always been the key issue in actuarial theory and practice. This is represented, for instance, by the study and modeling of mortality in life insurance and of loss distributions in traditional actuarial science. These models have evolved from early simple deterministic calculations to more sophisticated, probabilistic ones. Such probabilistic models have traditionally been built around parameters characterizing certain probability laws, e.g., Gompertz's model of the force of mortality, or parametric models of the yield curve process. In this chapter we describe the methodology of the bootstrap and, more generally, resampling, and show some of its possible advantages in describing the underlying probability distributions. We provide two detailed examples of application of the resampling methods. First, we show how the bootstrap can be used successfully to enhance a parametric mortality law suggested by Carriere (1992). Next, we develop a whole company asset-liability model to study the nonparametric bootstrap alternative to the lognormal and stable Paretian models of the interest rate process proposed by Klein (1993). Our results indicate that the bootstrap can be instrumental in understanding the rich structure of random variables on the asset and liability sides of an insurance firm's balance sheet, and in error estimation in both the parametric and the non-parametric setting.
1 Introduction
Actuarial science has always been concerned with uncertainty and the process of modeling it. In fact, the uncertainty of insurance liabilities may be considered the very reason for the creation of actuarial science. This is clearly represented in a study of mortality and other decrements in life, disability, health insurance, and loss distribution in property/casualty insurance. The models used in such studies began as tabular ones, and then developed into a study of probability distributions, traditionally characterized by some numerical parameters. Examples of such probability models include Makeham and Gompertz laws of mortality, or the use of gamma distribution family in modeling losses, as well as the omnipresent use of normal approximations (including the lognormal distribution modeling of interest rates).

Over the last decade a score of new methodologies for statistical inference both in parametric and non-parametric setting have been based on the concept of the bootstrap, also known under a somewhat broader term of resampling. In this work, we present the basic ideas of the bootstrap/resampling, and point out its promising applications in actuarial modeling.

The chapter is organized as follows. This section gives a brief overview of the basic ideas of the bootstrap methodology in the context of parametric and non-parametric estimation as well as time series-based inferences. The subsequent two sections of the chapter are devoted to the somewhat detailed discussion of two particular examples, arising from actuarial modeling problems. In Section 2 we look at mortality estimation, studying a mortality law introduced by Carriere (1992), generalizing on the classical Gompertz mortality law. In Section 3, we proceed to a whole company asset-liability model, comparing cash flow testing results for an annuity company under two parametric assumptions of interest rate process and the method based on the nonparametric, bootstrap-based approach to analyzing time series data. Section 4 offers some concluding remarks.
1.1 The Concept
The concept of the bootstrap was first introduced in the seminal piece of Efron (1979) as an attempt to give some new perspective to an old and established statistical procedure known as jackknifing. Unlike jackknifing, which is mostly concerned with calculating standard errors of statistics of interest, Efron's bootstrap achieves the more ambitious goal of estimating not only the standard error but also the distribution of a statistic. In his paper Efron considered two types of bootstrap procedures useful, respectively, for nonparametric and parametric inference. The nonparametric bootstrap relies on the consideration of the discrete empirical distribution generated by a random sample of size n from an unknown distribution F. This empirical distribution F_n assigns equal probability to each sample item. In the parametric bootstrap setting, we consider F to be a member of some prescribed parametric family and obtain F_n by estimating the family parameter(s) from the data. In each case, by generating an independent identically distributed (iid) random sequence, called a resample or pseudo-sequence, from the distribution F_n, or its appropriately smoothed version, we can arrive at new estimates of various parameters or nonparametric characteristics of the original distribution F. This simple idea is at the root of the bootstrap methodology.

The conditions for consistency of the bootstrap method were formulated in the paper of Bickel and Freedman (1981). This resulted in further extensions of Efron's methodology to a broad range of standard applications, including quantile processes, multiple regression and stratified sampling. Singh (1981) made the further point that the bootstrap estimator of the sampling distribution of a given statistic may be more accurate than the traditional normal approximation. In fact, it turns out that for many commonly used statistics the use of the bootstrap is asymptotically equivalent to the use of a one-term Edgeworth correction, usually with the same convergence rate, which is faster than the normal approximation. In many recent statistical texts, the bootstrap is recommended for
estimating sampling distributions, finding standard errors, and constructing confidence sets. Although the bootstrap methods can be applied to both parametric and non-parametric models, most of the published research in the area has been concerned with the non-parametric case, since that is where the most immediate practical gains might be expected. Bootstrap as well as some other resampling procedures are available in recent editions of some popular statistical software packages, including SAS and S-Plus. Readers interested in gaining some background in resampling are referred to Efron and Tibshirani (1993). For a more mathematically advanced treatment of the subject, we recommend the monograph by Shao and Tu (1995), which also contains a chapter on bootstrapping time series and other dependent data sets.
1.2 Bootstrap Standard Error and Bias Estimates
Arguably, one of the most important practical applications of the bootstrap is in providing conceptually simple estimates of the standard error and bias for a statistic of interest. Let θ̂_n be a statistic based on the observed sample, arising from some unknown distribution function F. Assume that θ̂_n is to estimate some (real-valued) parameter of interest θ, and let us denote its standard error and bias by se_F(θ̂_n) and bias_F(θ̂_n). Since the form of the statistic θ̂_n may be very complicated, the exact formulas for the corresponding bootstrap estimates of standard error (BESE) and bias (BEB) may be quite difficult to derive. Therefore, one usually approximates both these quantities with the help of multiple resamples. The approximation to the bootstrap estimate of the standard error of θ̂_n suggested by Efron (1979) is given by

    se_B = \Big\{ \sum_{b=1}^{B} \big[\hat\theta_n^*(b) - \hat\theta_n^*(\cdot)\big]^2 / (B-1) \Big\}^{1/2}     (1)

where θ̂*_n(b) is the original statistic θ̂_n calculated from the b-th resample (b = 1, ..., B), θ̂*_n(·) = Σ_{b=1}^{B} θ̂*_n(b)/B, and B is the total number of resamples (each of size n) collected with replacement from the empirical estimate of F (in either the parametric or the non-parametric setting). By the law of large numbers

    \lim_{B \to \infty} se_B = BESE(\hat\theta_n),

and for sufficiently large n we expect BESE(θ̂_n) ≈ se_F(θ̂_n). Similarly, for BEB one can use its approximation bias_B based on B resamples,

    bias_B = \sum_{b=1}^{B} \hat\theta_n^*(b)/B - \hat\theta_n.     (2)

Let us note that B, the total number of resamples, may be taken as large as we wish. For instance, it has been shown that for estimating BESE, B equal to about 250 typically already gives a satisfactory approximation, whereas for BEB this number may have to be significantly increased in order to reach the desired accuracy (see Efron and Tibshirani 1993 for a discussion of these issues). Let us also note that BESE and BEB can be applied to dependent data after some modifications of the resampling procedure (see Section 1.4 below).
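A minimal sketch of the approximations (1) and (2) for the nonparametric case is given below; the statistic (a sample median) and the simulated claim sizes are illustrative only.

import numpy as np

def bootstrap_se_bias(x, stat, B=1000, rng=None):
    """Approximate BESE and BEB for the statistic 'stat' using B resamples."""
    rng = rng or np.random.default_rng()
    n = len(x)
    theta_hat = stat(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    se_B = reps.std(ddof=1)                 # formula (1)
    bias_B = reps.mean() - theta_hat        # formula (2)
    return se_B, bias_B

rng = np.random.default_rng(4)
claims = rng.lognormal(mean=8.0, sigma=1.2, size=200)   # simulated claim sizes
print(bootstrap_se_bias(claims, np.median, B=1000, rng=rng))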
1.3 Bootstrap Confidence Intervals
Let us now turn to the problem of using the bootstrap methodology to construct confidence intervals. This area has been a major focus of theoretical work on the bootstrap, and in fact the procedure described below is not the most efficient one and can be significantly improved in both rate of convergence and accuracy. It is, however, intuitively appealing and easy to justify, and seems to work well enough for the cases considered here. For a complete review of the available approaches to bootstrap confidence intervals for iid data, see Efron and Tibshirani (1992). As in the case of the standard error and bias estimates, the methodology for bootstrap interval estimation also applies to time-dependent observations, as long as we modify our resampling procedure to either block or circular bootstrap sampling (see Section 1.4 below).

Let us consider θ̂*_n, a bootstrap estimate of θ based on a resample of size n (or, in the case of dependent observations, on k blocks of length l; see Section 1.4 below) from the original data sequence X_1, ..., X_n, and let G_* be its distribution function given the observed series values,

    G_*(x) = \mathrm{Prob}\{\hat\theta_n^* \le x \mid X_1 = x_1, \ldots, X_n = x_n\}.     (3)

Recall that for any distribution function F and r ∈ (0, 1) we define the r-th quantile of F (sometimes also called the r-th percentile) as F^{-1}(r) = \inf\{x : F(x) \ge r\}. The bootstrap percentile method gives G_*^{-1}(α) and G_*^{-1}(1-α) as, respectively, the lower and upper bounds of a 1 - 2α confidence interval for θ̂_n. Let us note that for most statistics θ̂_n the form of the distribution function of the bootstrap estimator θ̂*_n is not available. In practice, G_*^{-1}(α) and G_*^{-1}(1-α) are approximated by generating B pseudo-sequences (X*_1, ..., X*_n), calculating the corresponding values of θ̂*_n(b) for b = 1, ..., B, and then finding the empirical percentiles. In this case the number of resamples B usually needs to be quite large (B > 1000).
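Continuing the same toy example, the percentile interval is obtained directly from the empirical quantiles of the bootstrap replications:

import numpy as np

def percentile_ci(x, stat, alpha=0.025, B=2000, rng=None):
    """1 - 2*alpha bootstrap percentile confidence interval for stat(x)."""
    rng = rng or np.random.default_rng()
    n = len(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    # empirical versions of G*^{-1}(alpha) and G*^{-1}(1 - alpha)
    return np.quantile(reps, [alpha, 1.0 - alpha])

rng = np.random.default_rng(5)
claims = rng.lognormal(mean=8.0, sigma=1.2, size=200)
print(percentile_ci(claims, np.median, alpha=0.025, B=2000, rng=rng))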
1.4 Dependent Data
Efron's basic bootstrap procedure fails when the observed sample points are not independent. The extension of the bootstrap method to the case of dependent data was first considered by Künsch (1989), who suggested a moving block bootstrap procedure, which takes into account the dependence structure of the data by resampling blocks of adjacent observations rather than individual data points. The method was shown to work reasonably well. Its main drawback is related to the fact that for fixed block and sample sizes, observations in the middle part of the series typically have a greater chance of being selected into a resample than the observations close to the ends. This happens because the first or last block is often shorter. To remedy this deficiency, Politis and Romano (1992) suggested a method based on circular blocks, wrapping the observed time series values around a circle and then generating consecutive blocks of bootstrap data from the circle. In the case of the sample mean this method, known as the circular bootstrap, was again shown to accomplish the Edgeworth correction for dependent data.

Let X_1, ..., X_n be a (strictly) stationary time series, and as before let θ̂_n = θ̂_n(X_1, ..., X_n) denote a real-valued statistic estimating some unknown parameter of interest θ. Given the observations X_1, ..., X_n and an integer l (1 ≤ l ≤ n), we form the l-blocks B_t = (X_t, ..., X_{t+l-1}) for t = 1, ..., n - l + 1. In the moving blocks procedure of Künsch (1989), the bootstrap data (pseudo-time series) (X*_1, ..., X*_n) are obtained by generating a resample of k blocks (k = [n/l]) and then taking all the individual (inside-block) pseudo-values of the resulting sequence B*_1, ..., B*_k. The circular bootstrap approach is somewhat similar, except that here we generate a resample B*_1, ..., B*_k from the l-blocks B_t = (X̃_t, ..., X̃_{t+l-1}) for t = 1, ..., n, where

    \tilde{X}_t = \begin{cases} X_t, & t = 1, \ldots, n, \\ X_{t-n}, & t = n+1, \ldots, n+l-1. \end{cases}

In either case, the bootstrap version of θ̂_n is θ̂*_n = θ̂_n(X*_1, ..., X*_n). The idea behind both the block and the circular bootstrap is simple: by resampling blocks rather than original observations we preserve the original short-term dependence structure between the observations (although not necessarily the long-term one). For data series for which the long-term dependence is asymptotically negligible in some sense (e.g., α-mixing sequences with an appropriately fast mixing rate), the method will produce consistent estimators of the sampling distribution of θ̂_n - θ as long as l, n → ∞ at the appropriate rate (see, e.g., Shao and Tu (1995, Chapter 9) for a discussion of the appropriate assumptions). In particular, it follows that the formulas for the approximate BESE and BEB given by (1) and (2), respectively, as well as the method of percentiles for confidence estimation, are still valid with the block and circular bootstrap. Let us also note that for l = 1 both methods reduce to Efron's bootstrap procedure for iid random variables. In the next two sections we demonstrate the use of the above bootstrap techniques in both the parametric and the non-parametric setting.
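A minimal sketch of the circular block bootstrap is given below; the AR(1) series, the block length l = 25 and the number of resamples are illustrative choices, not recommendations.

import numpy as np

def circular_block_resample(x, block_len, rng=None):
    """One circular-block bootstrap pseudo-series built from k = [n/l] blocks."""
    rng = rng or np.random.default_rng()
    n = len(x)
    k = n // block_len                          # number of blocks
    starts = rng.integers(0, n, size=k)         # blocks may wrap around the circle
    idx = (starts[:, None] + np.arange(block_len)[None, :]) % n
    return x[idx].ravel()

rng = np.random.default_rng(6)
# AR(1) series as a stand-in for an observed interest-rate or claims series.
eps = rng.normal(size=500)
x = np.empty(500)
x[0] = eps[0]
for t in range(1, 500):
    x[t] = 0.7 * x[t - 1] + eps[t]

reps = np.array([circular_block_resample(x, block_len=25, rng=rng).mean()
                 for _ in range(1000)])
print("block-bootstrap SE of the series mean:", reps.std(ddof=1))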
2 Modeling US Mortality Tables
In this section we consider an application of the bootstrap techniques to estimating the variability of the parameters of a general mortality law proposed by Carriere (1992). We first describe the mortality model.
2.1 Carriere Mortality Law
The research into a law of mortality has had a long history, going back to the simple formula of De Moivre. Probably the most popular parametric model of mortality (proposed by Benjamin Gompertz) has been the one that models the force of mortality

    \mu_x = B c^x     (4)

as an exponential function. Whereas Gompertz's model typically fits observed mortality rates fairly well at the adult ages, it exhibits some deficiency in modeling the early ages. This is mostly due to the fact
that in many populations the exponential growth in mortality at adult ages is preceded by a sharp fall in mortality in early childhood and a hump at about age 23 (cf. e.g., Tenenbein and Vanderhoof 1980). This early-years behavior of the force of mortality is not easily modeled with formula (4). There have been many attempts in the actuarial literature to modify Gompertz's law in order to obtain a better approximation for the early mortality patterns. It has been observed in Heligman and Pollard (1980) that by adding additional components to the Gompertz model one can obtain a mortality law that fits early mortality patterns much better. This idea has been further extended by Carriere (1992) [referred to in the sequel as CR], who has suggested using a mixture of the extreme-value distributions Gompertz, Inverse Gompertz, Weibull and Inverse Weibull to model the survival curve of a population:

    s(x) = \sum_{i=1}^{4} \psi_i s_i(x).     (5)

Here ψ_i ≥ 0 for i = 1, ..., 4 are the mixture coefficients satisfying Σ_{i=1}^{4} ψ_i = 1, and s_i(x) for i = 1, ..., 4 are the survival functions based on the Gompertz, Inverse Gompertz, Weibull, and Inverse Weibull distribution functions, respectively, with corresponding location and scale parameters m_i and σ_i:

    s_1(x) = \exp\{ e^{-m_1/\sigma_1} - e^{(x-m_1)/\sigma_1} \}     (6a)

    s_2(x) = \frac{1 - \exp(-e^{-(x-m_2)/\sigma_2})}{1 - \exp(-e^{m_2/\sigma_2})}     (6b)

    s_3(x) = \exp\{ -(x/m_3)^{m_3/\sigma_3} \}     (6c)

    s_4(x) = 1 - \exp\{ -(x/m_4)^{-m_4/\sigma_4} \}     (6d)
In the Carriere mortality model, the Gompertz (6a) and Inverse Gompertz (6b) curves, both derived from (4), are used to model the mortality of adulthood, whereas the Weibull (6c) and Inverse Weibull (6d) curves are used to model the mortality of childhood and early adolescence. The introduction of the last two survival functions allows the model to correctly fit the early-years decrease in mortality as well as the hump around age 20, both of which are present in many populations' mortality patterns. As has been demonstrated empirically in CR, the model (5) fits quite well the patterns of mortality for the entire US population as well as, separately, the US male and female mortality laws. Due to the interpretation of the survival components of the curve (5), the Carriere model, unlike that of Heligman and Pollard (1980), also gives estimates of childhood, adolescence and adulthood mortality by means of the mixture coefficients ψ_i. In the next section we discuss a way of arriving at these estimates.
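For concreteness, the sketch below evaluates the mixture survival curve (5) and the implied one-year death probabilities q_x = 1 - s(x+1)/s(x), using the component forms (6a)-(6d) as written above and, for illustration, the parameter values fitted later in Section 2.2; it is our own sketch, not the authors' code.

import numpy as np

def carriere_survival(x, m, sig, psi):
    """Mixture survival curve s(x) = sum_i psi_i s_i(x); components (6a)-(6d)."""
    x = np.asarray(x, dtype=float)
    s1 = np.exp(np.exp(-m[0] / sig[0]) - np.exp((x - m[0]) / sig[0]))       # Gompertz
    s2 = ((1 - np.exp(-np.exp(-(x - m[1]) / sig[1])))
          / (1 - np.exp(-np.exp(m[1] / sig[1]))))                           # Inverse Gompertz
    with np.errstate(divide="ignore"):            # x = 0 gives s3 = s4 = 1 anyway
        s3 = np.exp(-(x / m[2]) ** (m[2] / sig[2]))                         # Weibull
        s4 = 1 - np.exp(-(x / m[3]) ** (-m[3] / sig[3]))                    # Inverse Weibull
    return psi[0] * s1 + psi[1] * s2 + psi[2] * s3 + psi[3] * s4

# Illustrative parameter values (the estimates of Table 1 in Section 2.2).
m   = [80.73, 42.60, 0.30, 23.65]
sig = [11.26, 14.44, 1.30, 7.91]
psi = [0.944, 0.020, 0.013, 0.022]

ages = np.arange(0, 101, dtype=float)
surv = carriere_survival(ages, m, sig, psi)
q = 1.0 - carriere_survival(ages + 1.0, m, sig, psi) / surv
print(np.round(q[[0, 20, 40, 65, 80]], 5))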
2.2 Fitting the Mortality Curve
We will illustrate the use of the bootstrap by fitting the general mortality model (5) to the Aggregate 1990 US Male Lives Mortality Table (US Health Dept. 1990). This particular table is of special actuarial interest because it was a data source in developing the 1994 GAM and 1994 UP tables. In order to fit the model (5) to the observed mortality pattern, some fitting criterion (a loss function) is needed. In his paper, Carriere (CR) considered several different loss functions, but here, for simplicity, we will restrict ourselves to the one given by

    \sum_{x=0}^{100} \left( \frac{\hat q_x}{q_x} - 1 \right)^2     (7)

where q_x is the probability of death within a year for a life aged x, as obtained from the US table, and q̂_x is the corresponding probability calculated using the survival function (5). The ages above x = 100 were not included in (7) since, similarly to CR, we have found that the mortality pattern for the late ages in the table had been based on Medicare data and hence might not be representative of the entire US male population. The parameters of the survival functions (6), as well as the corresponding mixing coefficients ψ_i, were calculated by minimizing the loss function (7). The calculations were performed with the help of the SOLVER add-in function in the Microsoft Office 97 Excel software. The estimated values of the parameters m_i, σ_i, and ψ_i for i = 1, ..., 4 are given in Table 1.

Table 1. Estimates of the Carriere survival model parameters for the 1990 US Male Mortality Table (ψ_4 = 1 - Σ_{i=1}^{3} ψ_i).

  Component   Location (m_i)   Scale (σ_i)   Mixture (ψ_i)
  s_1(·)           80.73          11.26          0.944
  s_2(·)           42.60          14.44          0.020
  s_3(·)            0.30           1.30          0.013
  s_4(·)           23.65           7.91          0.022

As we can see from the table, the modes m_i of the survival components in the Gompertz and the Inverse Gompertz case are equal to 80.73 and 42.60, respectively, whereas in the Weibull and the Inverse Weibull case they are equal to 0.30 and 23.65, respectively. This indicates that, as intended, both Gompertz components model the later-age mortality, whereas both Weibull components model the mortality of childhood and adolescence. Moreover, since ψ_1 = 0.94, it follows that most of the deaths in the considered population are due to the Gompertz component. A plot of the fitted curve of ln(q̂_x) using the estimated values of the parameters given in Table 1, along with the values of ln(q_x) for x = 0, ..., 100, is presented in Figure 1. As we can see from the plot, the fit seems to be quite good except perhaps between the ages of 6 and 12. Let us also note the resemblance of our plot in Figure 1 to that presented in Figure 4 of CR, which was based on the 1980 US Population Mortality Table.

Figure 1. A plot of ln(q_x) from the US Life Table (dashed line) and of ln(q̂_x) fitted using the parameter estimates in Table 1, for ages x = 0, ..., 100.
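As an alternative to the Excel SOLVER route described above, the criterion (7) can be minimized with a general-purpose optimizer. The sketch below is illustrative only: it reuses the carriere_survival function from the sketch in Section 2.1, generates a stand-in for the table rates from the model itself, and uses hypothetical starting values.

import numpy as np
from scipy.optimize import minimize

def model_qx(theta, ages):
    # theta = (m_1..m_4, sigma_1..sigma_4, psi_1..psi_3); psi_4 = 1 - sum(psi_1..3)
    m, sig, psi123 = theta[0:4], theta[4:8], theta[8:11]
    psi = np.append(psi123, 1.0 - psi123.sum())
    s_x = carriere_survival(ages, m, sig, psi)
    return 1.0 - carriere_survival(ages + 1.0, m, sig, psi) / s_x

def loss(theta, ages, q_table):
    # criterion (7): sum over x of (q_hat_x / q_x - 1)^2
    return np.sum((model_qx(theta, ages) / q_table - 1.0) ** 2)

ages = np.arange(0, 101, dtype=float)
true_theta = np.array([80.73, 42.60, 0.30, 23.65,     # m_1..m_4 (Table 1)
                       11.26, 14.44, 1.30, 7.91,       # sigma_1..sigma_4
                       0.944, 0.020, 0.013])           # psi_1..psi_3
q_table = model_qx(true_theta, ages)   # stand-in for the 1990 US Male table rates

start = true_theta * 1.10              # hypothetical starting values
fit = minimize(loss, start, args=(ages, q_table), method="Nelder-Mead",
               options={"maxiter": 50000, "xatol": 1e-8, "fatol": 1e-12})
print(np.round(fit.x, 3))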
2.3 Statistical Properties of the Parameter Estimates in the Carriere Mortality Model

The statistical properties of the estimators of the unknown parameters ψ_i, i = 1, ..., 3, and m_i, σ_i, i = 1, ..., 4, obtained by minimizing the
function (7) (we refer to them in the sequel as the Carriere or CR estimators) are not immediately obvious. For instance, it is not clear whether or not they are consistent and what their asymptotic efficiency is. Since these questions are related to the problem of consistency of the bootstrap estimation, we will discuss them briefly here. In order to put our discussion in the context of statistical inference theory, we need to note an important difference between the loss function (7) and the usual weighted least-squares loss. Namely, in our setting the US mortality rates q_x for x = 1, ..., 100 are not independent observations but rather are themselves estimated from the large number n of deaths observed in the population.[1] Since the q_x's are estimated from the same number n of observed deaths, they are negatively correlated, and hence the loss function (7), treated as a random function, is not a sum of independent identically distributed random variables. In view of the above, the usual statistical methodology of M-estimation does not apply, and in order to obtain asymptotics for the CR estimators we need to develop a different approach, which is briefly outlined below. For further references and a general discussion of the method, including the model identifiability issue, see the forthcoming paper of Rempala and Szatzschneider (2002).

Under the assumption that the true mortality curve of the population indeed follows the model (5) with some unknown parameters ψ_i^0 (i = 1, ..., 3) and m_i^0, σ_i^0 (i = 1, ..., 4), one may consider the CR estimators associated with the i-th survival component (i = 1, ..., 4) as functions of the vector of mortality rates q_x for x = 1, ..., 100 and expand them around the true parameters ψ_i^0, m_i^0, σ_i^0 using the multivariate Taylor's Theorem. The expansion is valid in view of the Implicit Function Theorem, which guarantees that all the local minimizers of (7) are locally continuously differentiable functions of the q_x's as long as the first derivatives of q̂_x = q̂_x(ψ_i, m_i, σ_i) are locally continuous and do not vanish (that is, the appropriate Jacobian is not equal to zero, at least for one age x) around the point ψ_i^0, m_i^0, σ_i^0. This method reveals that the CR estimates are consistent and that their joint distribution is asymptotically multivariate normal, with a covariance matrix whose entries depend upon the values of the true parameters ψ_i^0 (i = 1, ..., 3), m_i^0, σ_i^0 (i = 1, ..., 4) and the first two derivatives of the q̂_x's with respect to ψ_i, m_i, σ_i.

[1] The US table used as an example in this section has, in fact, the values of q_x obtained by interpolation with the help of Beer's 4th-degree osculatory formula.
2.4 Assessment of the Model Accuracy with Parametric Bootstrap
One way of assessing the variability of the CR estimates would be to use the normal approximation outlined in the previous section. However, since the asymptotic variance of this approximation depends upon the unknown parameters, which would have to be estimated from the data, the derivation of an exact formula for the covariance matrix as well as an assessment of the performance of the variance estimators would be required. In addition, the normal approximation would not account for the possible skewness of the marginal distributions of the ψ_i, m_i, and σ_i. Therefore, as an alternative to this traditional approach, we propose to apply the bootstrap method. Since the bootstrap typically corrects for the skewness effect, we would hope that it might provide better approximations.

Let us note that the methods of the nonparametric bootstrap cannot be applied here, since we do not know the empirical distribution F_n of the observed deaths. However, we do know the estimated values of the parameters in the model (5), since these are obtained by minimizing the expression (7), which is concerned only with the quantities q_x. Hence, assuming that the true population death distribution follows the model (5), we may use the values given in Table 1 to obtain our resampling distribution F_n. Once we have identified the resampling distribution, the bootstrap algorithm for obtaining approximate variances, biases, and confidence intervals for the CR estimators is quite simple. We use a computer to generate a large (in our example 20,000 points) random sample from F_n and apply interpolation[2] to obtain the sequence of pseudo mortality rates q*_x for x = 1, ..., 100. In order to find the corresponding bootstrap replication ψ*_i, i = 1, ..., 3, and m*_i, σ*_i, i = 1, ..., 4, we minimize the loss function (7), where now the q_x's are replaced by the q*_x's. The above procedure is repeated a large number (B) of times, and the resulting sequence of bootstrap replications ψ*_i(b), m*_i(b), σ*_i(b) for b = 1, ..., B is recorded. The plot of a single replication of the log of the q*_x curve and the corresponding log of the q̂*_x curve, i.e., the log of the fitted survival curve (5) with the parameters ψ*_i, m*_i, σ*_i, is presented in Figure 2.

[2] For the sake of efficiency one should use here the same method of interpolation as the one used to create the original values of the q_x's in the table. However, in our example we simply applied linear interpolation to obtain the values of the pseudo-empirical distribution at the integer ages and then to calculate the q*_x's.

Figure 2. A plot of ln(q*_x) (rough solid line) from a single 20,000-point resample of the parametric model (dashed line) and of ln(q̂*_x) (smooth solid line).

In order to find the corresponding bootstrap estimates of standard error and bias for each
one of the 11 CR estimators in Table 1, we use the formulas (1) and (2). The corresponding bootstrap confidence intervals are calculated as well, with the help of the bootstrap percentile method. The general appropriateness of the bootstrap procedure outlined above is not immediately obvious and in fact requires some theoretical justification, as it does not follow immediately from the general theory. Since the mathematical arguments needed to do so are quite technical and beyond the scope of the present work, let us only briefly comment that, generally speaking, for the mere consistency of our bootstrap algorithm an argument similar to that used to argue the consistency of the CR estimators may be mimicked. However, in order to show that our bootstrap estimators in fact have an edge over the normal approximation method, a more refined argument based on Edgeworth expansion theory is needed. For some insight into this and related issues see, for instance, Rempala and Szatzschneider (2002).
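Continuing the earlier sketches (and again only to illustrate the algorithm, not to reproduce the authors' implementation), one bootstrap replication simulates 20,000 ages at death from the fitted model, recomputes pseudo mortality rates q*_x, and refits the parameters by minimizing (7); repeating this B times supplies the inputs to formulas (1) and (2) and to the percentile intervals. The functions carriere_survival, model_qx and loss and the vectors ages and true_theta are those defined in the previous sketches.

import numpy as np
from scipy.optimize import minimize

def simulate_qx(theta, ages, n_deaths, rng):
    """Draw n_deaths ages at death from the fitted model and return pseudo q_x."""
    grid = np.linspace(0.0, 110.0, 2201)
    m, sig, psi123 = theta[0:4], theta[4:8], theta[8:11]
    psi = np.append(psi123, 1.0 - psi123.sum())
    cdf = 1.0 - carriere_survival(grid, m, sig, psi)
    u = rng.uniform(cdf[0], cdf[-1], size=n_deaths)
    deaths = np.interp(u, cdf, grid)             # inverse-transform sampling
    # Pseudo-empirical survival at integer ages, then pseudo q_x.
    surv = np.array([(deaths >= a).mean() for a in np.append(ages, ages[-1] + 1)])
    return 1.0 - surv[1:] / np.maximum(surv[:-1], 1e-12)

rng = np.random.default_rng(7)
B, boot = 50, []                                 # B would be much larger in practice
for _ in range(B):
    q_star = simulate_qx(true_theta, ages, n_deaths=20_000, rng=rng)
    res = minimize(loss, true_theta * 1.05, args=(ages, np.maximum(q_star, 1e-6)),
                   method="Nelder-Mead", options={"maxiter": 20000})
    boot.append(res.x)
boot = np.array(boot)
print("bootstrap SE of m_1:", boot[:, 0].std(ddof=1))
print("95% percentile CI for m_1:", np.quantile(boot[:, 0], [0.025, 0.975]))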
Table 2. Bootstrap estimates of the standard error, bias and 95% CI for the estimators given in Table 1 (ψ_4 = 1 - Σ_{i=1}^{3} ψ_i).

                            m_1       σ_1      ψ_1       m_2       σ_2      ψ_2
  Value (from Table 1)     80.73     11.26    0.944     42.60     14.44    0.020
  SE                        0.245     0.278   0.015      8.257     4.788   0.018
  Bias                      0.096    -0.120  -0.003     -1.266    -1.653   0.005
  Lower Bd 95% CI          80.414    10.521   0.900     26.222     3.007   0.000
  Upper Bd 95% CI          81.336    11.608   0.965     54.531    19.932   0.056

                            m_3       σ_3      ψ_3       m_4       σ_4      ψ_4
  Value (from Table 1)      0.30      1.30    0.013     23.65      7.91    0.022
  SE                        0.303     2.268   0.001      3.122     3.011    N/A
  Bias                      0.037     0.254   0.000     -0.253    -0.179    N/A
  Lower Bd 95% CI           0.053     0.359   0.011     17.769     3.048    N/A
  Upper Bd 95% CI           1.290    10.019   0.015     30.000    13.504    N/A
The final results based on our bootstrap approximations to the standard error, bias, and confidence bounds for all 11 CR estimators given in Table 1 are presented in Table 2. As we can see from these results, it seems that only the estimates of the Gompertz component (m_1, σ_1, ψ_1) of the mortality model enjoy a reasonable degree of precision, in the sense that their standard errors and biases are small (relative to the corresponding numerical values) and their confidence intervals short. For all the other components, both the values of the standard errors and the lengths of the confidence intervals indicate an apparent lack of precision. In particular, the lower bound for the mixture coefficient ψ_2 corresponding to the Inverse Gompertz component (6b) equals zero, which could indicate that the Inverse Gompertz component is simply not present in the considered mortality pattern. In fact, CR, fitting the model (5) to the US 1980 mortality data, which has a pattern similar to the one we have considered here, did take ψ_2 = 0. Hence, by providing the 95% CI's for the estimates of the ψ_i's, our bootstrap method also suggests the general form of the mixture in the Carriere law.³
For a formal test of the significance of ψ_i = 0, in a fashion similar to the usual tests of the coefficients in regression models, see, for instance, Rempala and Szatzschneider (2002).
For illustration purposes, the distribution of B = 1000 bootstrap replications for the parameter m_1, along with its 95% confidence interval, is presented in Figure 3. Let us note that the histogram of the replication values seems to approximately follow a symmetric unimodal curve, which is consistent with our theoretical derivations.
Figure 3. The distribution of B = 1000 bootstrap replications for the parameter m_1 along with its 95% confidence interval. The vertical line in the center shows the actual fitted value of m_1 as given in Table 1.
As we have seen above, the parametric bootstrap algorithm can be helpful in obtaining more complete information about the fitted mortality model and its parameters. The drawback of the algorithm presented above lies mainly in its very high computational cost, since for every resample of the sequence of 11 parameter values we need to first obtain a resample of the fitted mortality rates q_x, x = 1, ..., 100. It would be interesting to develop better methods of bootstrapping in problems involving loss functions of type (7), either with the help of the residual bootstrap method (Efron and Tibshirani 1993) or perhaps some techniques aimed at increasing the efficiency of resampling, such as importance resampling (Shao and Tu 1995).
3 Methodology of Cash-Flow Analysis with Resampling
In our next example we illustrate the use of the nonparametric bootstrap for dependent data by developing a whole-company asset-liability model with a nonparametric model of the interest rate process. The results are compared with those based on the lognormal and stable Paretian models considered by Klein (1993).
3.1 Interest Rates Process
While there exists a consensus as to the necessity of using interest rate scenarios, and of accounting for some form of their randomness, in the process of long-term modeling of an insurance enterprise, one can hardly argue that such a consensus has formed concerning the actual stochastic process governing interest rates. Becker (1991) analyzes one common model derived from the process

dI_t = I_t \mu \, dt + I_t \sigma \, dW            (8)
with I_t denoting the interest rate prevailing at time t, W denoting the standard Wiener process, and μ and σ being the drift and volatility parameters. Becker tests the hypothesis, implied by setting μ = 0 and σ = const, that the distribution of J_t = ln(I_{t+1}/I_t) is normal with mean zero and constant variance. However, Becker (1991) rejects the hypothesis of normality, and even of independence of the J_t's, based on the empirical evidence. Of course, (8) is not the only stochastic differential equation proposed to describe the interest rate process. Many other processes have been put forth in this very active area of modern mathematical finance. Hull (1997) provides an overview of various stochastic differential equations studied in this context. De La Grandville (1998) provides a straightforward argument for approximate lognormality of the variable (1 + rate of return), based on the Central Limit Theorem. But Klein (1993)
discusses various reasonings underlying the lognormality assumption, and points out that mutual independence of consecutive returns, finite variance of the one-period return distribution, invariance under addition, and the functional form of the distribution are very convenient but often unreasonable assumptions. While Klein (1993) argues convincingly against the finite variance assumption, the work of Cerrito, Olson and Ostaszewski (1998), among others (notably Fama and French, 1988), provides evidence against independence of returns in consecutive periods. Both take away key assumptions of the Central Limit Theorem. Klein (1993) [referenced as KL in the sequel] proposes to replace the lognormal distribution with the stable Paretian, and provides a comparison of a model company under these two hypotheses. As these differing assumptions produce drastically different results, KL argues that a valuation actuary must give proper consideration to the possibility that the stochastic dynamics of the interest rate process differ significantly from the simple assumption of lognormality. In this section we propose a bootstrap-based alternative to his model as an example of the use of resampling in the nonparametric setting.
3.2 Modeling Interest Rates with Nonparametric Bootstrap of Dependent Data
Let us note that, unlike in the case of the US Mortality Tables, historical, empirical realizations of interest rate processes are typically available. This allows us to use nonparametric bootstrap methods. In the following sections we investigate a nonparametric model of a company cash flow as a function of the interest rates process {J_t}. We do not assume independence of the J_t's, relaxing that assumption to m-dependence and stationarity of the interest rates process (see Section 3.4). We propose that it may be unreasonable to assume a single functional form of the probability distribution, and the choice of the nonparametric model here is not merely for the sake
of example. In our opinion, this allows us to proceed to deeper issues concerning the actual forms of uncertainty underlying the interest rate process, and other variables studied in the dynamic financial analysis of an insurance firm. In our view, one cannot justify fitting convenient distributions to data and expect to easily survive the next significant change in the marketplace. If one cannot provide a justification for the use of a specific parametric distribution, then a nonparametric alternative should be studied, at least for the purpose of understanding the firm's exposures. In this section we show the possible applicability of resampling in studying such nonparametric alternatives. While there are many variables involved in a company model that could possibly be investigated using our methods, in this work we shall focus exclusively on the nonparametric estimate of the distribution of the 10-th year surplus value treated as a function of the underlying stochastic interest rates process. For simplicity, we assume a simple, one-parameter structure of the yield curve. Further work will be needed to extend our methodology to nonparametric estimates of the entire yield curve, as well as the liability-side variables. Some recent research in this direction (Stanton 1997) provides one possible nonparametric model of term structure dynamics and the market price of interest rate risk. We also make a simplifying assumption of parallel shifts of all interest rates derived from the basic rate (the long-term Treasury rate). We follow the design of the model company in KL. That model could be generalized by extending the bootstrap techniques described here to several dependent time series (e.g., interest rate series at various maturities). Let us note that the KL model assumes a parametric functional form for lapses, withdrawals, and mortality, as well as prepayments for the mortgage securities in the asset portfolio. Further research into nonparametric alternatives to the lapse, mortality, and prepayment models would be very beneficial to our understanding of the probability distribution structure of those phenomena. It should be noted, however, that any such research will require extensive data, which, unlike the interest rates and capital asset
returns, is not always easily available.
3.3 Model Company Assumptions
The model company was studied effective December 30, 1990. It offered a single product: a deferred annuity. The liability characteristics were as follows:

• Number of policies: 1,000
• Fund Value of Each Policy: 10,000
• Total Reserve: 10,000,000
• Surrender Charge: None
• Minimum Interest Rate Guarantee: 4%
• Reserve Method: Reserve Equals Account Balance
The assets backing this product were:

• 8,000,000 30-year 9.5% GNMA mortgage pools
• 2,000,000 1 Year Treasury Bills

for a total asset balance of 10,000,000, effectively making the initial surplus equal to zero. The following initial yields, in effect as of December 31, 1990, are assumed:

• 1 Year Treasury Bills: 7.00%
• 5 Year Treasury Notes: 7.50%
• 30 Year Treasury Bonds: 8.25%
• Current coupon GNMAs: 9.50%
Treasury yields are stated on a bond-equivalent basis, while GNMA yields are nominal rates, compounded monthly. The following interest rates are given on 5 Year Treasury Notes at the indicated dates:

• December 31, 1989: 7.75%
• December 31, 1988: 9.09%
• December 31, 1987: 8.45%
• December 31, 1986: 6.67%

The following assumptions hold about the model company:

• Lapse rate formula (based on competition and credited interest rates, expressed as percentages) is given by

q^{(w)} = \begin{cases} 0.05 + 0.05\,[100(i_{comp} - i_{cred})]^2, & \text{if } i_{comp} - i_{cred} > 0,\\ 0.05, & \text{otherwise,}\end{cases}
with an overall maximum of 0.5 (a code sketch of this rule and the prepayment rule below follows this list of assumptions).
• Credited interest rate (i_cred) is the currently anticipated portfolio yield rate for the coming year on a book basis, less 150 basis points. In addition to always crediting no less than the minimum interest rate guarantee, the company will always credit a minimum of i_comp and 2%.
• Competition interest rate (i_comp) is the greater of the 1 Year Treasury Bill nominal yield to maturity and the rate 50 basis points below the 5 year rolling average of the 5 Year Treasury Bond nominal yield to maturity.
• Expenses and taxes are ignored.
• There is no annuitization.
• Mortgage prepayment rate (rate_t) for the 30 year GNMA pools, calculated separately for each year's purchases of GNMAs, is

rate_t = \begin{cases} 0.05 + 0.03\,[100(i_{coup,t} - i_{curr})] + 0.02\,[100(i_{coup,t} - i_{curr})]^2, & \text{if } i_{coup,t} - i_{curr} > 0,\\ 0.05, & \text{otherwise,}\end{cases}
with an overall maximum of 0.40. Here, i_coup,t denotes the coupon on newly issued 30 year GNMAs, issued in year t, and
i_curr denotes the coupon rate on currently issued 30 year GNMAs, which is assumed to shift in parallel with the yields on 30 year Treasury Bonds. We assume the same, i.e., parallel shifts, about the 1 Year Treasury Bill rates and the 5 Year Treasury Bond rates.
• Reinvestment strategy is as follows. If, in a given year, the net cash flow is positive, any loans are paid off first, then 1-year Treasury Bills are purchased until their book value is equal to 20% of the total book value of assets, and then newly issued current coupon GNMAs are purchased.
• Loans, if needed, can be obtained at the rate equal to the current 1 Year Treasury Bill yield.
• Book Value of GNMAs is equal to the present value of the future principal and interest payments, discounted at the coupon interest rate at which they were purchased, assuming no prepayments.
• Market Value of GNMAs is equal to the present value of the future principal and interest payments, discounted at the current coupon interest rate and using the prepayment schedule based on the same rate, assuming that the current coupon rate will be in effect for all future periods as well.
• Market Value of Annuities is equal to the account value.

All cash flows are assumed to occur at the end of the year. The projection period is 10 years. We run 100 scenarios. At the end of 10 years, the market values of assets and liabilities are calculated. The distribution of the final surplus is then compared.
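The following sketch implements the lapse and prepayment rules as we have reconstructed them above. It is an illustration only: the rates are assumed to be supplied as decimals (e.g., 0.07 for 7%), the caps follow the stated maxima, and the quadratic term in the prepayment rule reflects our reading of the formula rather than KL's original implementation.

```python
def lapse_rate(i_comp, i_cred):
    """Lapse rate: 0.05 + 0.05*[100*(i_comp - i_cred)]**2 when the competition
    rate exceeds the credited rate, 0.05 otherwise, capped at 0.5.
    Rates are assumed to be decimals, so 100*(i_comp - i_cred) is the spread
    in percentage points."""
    if i_comp - i_cred > 0:
        rate = 0.05 + 0.05 * (100.0 * (i_comp - i_cred)) ** 2
    else:
        rate = 0.05
    return min(rate, 0.5)

def prepayment_rate(i_coup, i_curr):
    """Prepayment rate for a GNMA pool with coupon i_coup when the current
    coupon is i_curr, capped at 0.40 (our reconstruction of the rule)."""
    spread = 100.0 * (i_coup - i_curr)
    if spread > 0:
        rate = 0.05 + 0.03 * spread + 0.02 * spread ** 2
    else:
        rate = 0.05
    return min(rate, 0.40)
```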
3.4 Interest Rates Process Assumptions
In order to make a comparison of our nonparametric cash-flow analysis with the one presented in KL, we have considered the same set of yield rates on 30-year Treasury bonds from the years 1977-1990,
namely the average annualized yield to maturity as given in Table 2 of KL. As indicated in KL, the plot of the empirical quantiles of J_t versus those of a normal random variable indicates that the J_t's possibly come from a distribution other than normal. In his approach KL addressed this problem by considering a different, more general form of the marginal distribution of J_t, namely that of the stable Paretian family, still assuming independence of the J_t's. However, as was pointed out in the discussion of his paper by Beda Chan, there is clear statistical evidence that, in fact, the J_t's are not independent. The simplest argument to this end can be obtained by applying the runs test (as offered, for instance, by the Minitab 8.0 statistical package). There are also more sophisticated general tests for a random walk in time series data under an ARMA(p, q) model, as offered for instance by the SAS procedure ARIMA (SAS Institute 1996). The results of these tests conducted by us for the data in Table 2 of KL give strong evidence against the hypothesis of independence (both P-values in the above tests were less than .001). In view of the above, it seems to be of interest to consider the validity of the stationarity assumption for J_t. Under an ARMA(p, q) model the SAS procedure ARIMA may be used again to perform a formal test of non-stationarity, either by means of the Phillips-Perron or Dickey-Fuller tests. However, a simpler graphical method based on a plot of the autocorrelation function (ACF) and the values of the appropriate t-statistics (cf. e.g., Bowerman and O'Connell 1987) may also be used (if we assume that the marginal distribution of J_t has a finite variance). In our analysis we have used both the formal tests and the graphical methods and found no statistical evidence for non-stationarity of J_t (all P-values < .001). In view of the above results, for the purpose of our nonparametric cash-flow analysis, we have made the following assumptions about the time series J_t: (A1) the series is a (strictly) stationary time series, and (A2) it has the m-dependence structure, i.e., the t-th and the (s + t + m)-th el-
ements in the series are independent for s, t = 1, 2, ... and some fixed integer m > 1. The assumption (A1) implies that all the J_t's have the same marginal distribution, but with no conditions regarding the existence of any of its moments. Thus the random variable J_t could possibly have an infinite variance or even an infinite expectation (which is the case for some distributions from the stable Paretian family). As far as the assumption (A2) is concerned, the ACF plot in Figure 2 indicates that in the case of the long-term Treasury Bonds from the years 1953-1976, an appropriate value of m is at least 50. The assumption of m-dependence for J_t may seem somewhat arbitrary. In fact, from a theoretical viewpoint the condition (A2) may indeed be substantially weakened by imposing some much more general but also much more complicated technical conditions on the dependence structure of the J_t's. These conditions, in essence, require only that J_t and J_{t+s} are "almost" independent as s increases to infinity at an appropriate rate, and are known in the literature as mixing conditions. Several types of mixing conditions under which our method of analysis is valid are given by Shao and Tu (1995, p. 410). For the sake of simplicity and clarity of the presentation, however, we have not considered these technical issues here.
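For readers who wish to reproduce this kind of diagnostic, the checks described above (an autocorrelation plot and a unit-root test) can be run with standard open-source time-series tooling instead of the SAS and Minitab procedures cited in the text. The sketch below assumes the increments J_t are already available in an array; it is a rough analogue of the diagnostics, not the authors' procedure.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, adfuller

def dependence_and_stationarity_checks(J, nlags=50):
    """Rough analogues of the diagnostics discussed above for the series J_t."""
    J = np.asarray(J, dtype=float)
    # sample autocorrelations out to `nlags`, used to judge the m-dependence range
    autocorr = acf(J, nlags=nlags, fft=True)
    # approximate 95% band under a white-noise null
    band = 1.96 / np.sqrt(len(J))
    significant_lags = np.where(np.abs(autocorr[1:]) > band)[0] + 1
    # augmented Dickey-Fuller test: the null hypothesis is a unit root
    # (non-stationarity), so a small p-value argues against non-stationarity
    adf_stat, adf_pvalue, *_ = adfuller(J)
    return significant_lags, adf_stat, adf_pvalue
```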
3.5 Bootstrapping the Surplus-Value Empirical Process
In the sequel, let M denote the 10-th year surplus value of our company described in Section 3.3. In general, as indicated above, M is a complicated function of a variety of parameters, some deterministic and some stochastic in nature, as well as of the investment strategies. However, since in this work we are concerned with modeling cash flow as a function of the underlying interest rates process, for our purpose we consider M to be a function of the time series {J_t} satisfying the assumptions (A1) and (A2) above, assuming all other parameters to
be fixed as described in Section 3.3. Under these conditions, we may consider M to be a random variable with some unknown probability distribution H(x) = Prob(M < x). Our goal is then to obtain an estimate of H(x). To this end, we consider a set of n = 167 values of the J_t process obtained from the annualized 30-year Treasury bond yield rates as given in Table 2 of KL. If we make the assumption that for the next 10 years the future values of the increments of ln I_t will follow the same stationary process J_t, we may then consider calculating the values of M based on these available 167 data values. In fact, since the J's in Table 2 of KL are based on monthly rates, every sequence of 120 consecutive values of that series gives us a distinct value of M, which results in n − 120 + 1 = 48 empirical values of M. However, by adopting the "circular series" idea, i.e., by wrapping the 167 values around a circle and then evaluating M for all sequences of 120 "circular consecutive" J_t's, we obtain an additional 119 empirical values of M. Thus a total of n = 167 empirical values of M is available. The cumulative distribution of these values is plotted in Figure 4 as the thick curve. It is important to note that the obtained empirical M's are not independent but rather form a time series which inherits the properties (A1) and (A2) of the original time series J_t.⁴ A natural approximation for H(x) is now the empirical process H_n(x) based on the values, say, m_1, ..., m_n of M obtained above (n = 167). That is, we put

H_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(m_i < x),            (9)

where I(m_i < x) = 1 when m_i < x and 0 otherwise. By a standard result from probability theory, under the assumption that the
Note that this statement is not entirely correct here, as there is, perhaps, a small contamination due to the use of the "wrapping" procedure for calculating the empirical values of M. However, it can be shown that the effect of the circular wrapping quickly disappears as the number of available J_t's grows large. For the purpose of our analysis we have assumed the wrapping effect to be negligible.
m_i's are realizations of a time series satisfying (A1) and (A2), we have that, for any x, H_n(x) → H(x)
as n → ∞.
As we can see from the above considerations, the empirical process H_n(·) could be used to approximate the distribution of the random variable M and, in fact, the construction of such an estimator would typically be the first step in any nonparametric analysis of this type. In our case, however, the estimate based on the empirical values of the m_i's suffers from a drawback, namely that it uses only the available empirical values of the surplus value, which in turn rely on very long realizations of the time series J_t while in fact explicitly using only very few of them. Indeed, for each m_i we need to have 120 consecutive values of J_t, but in the actual calculation of the surplus value we use only 10. This deficiency of H_n(·) may be especially apparent for small and moderate values of the dependence constant m in (A2), i.e., when we deal only with a relatively short dependence structure of the underlying J_t process (and this seems to be the case, for instance, with our 30 year Treasury yields data). In view of the above, it seems reasonable to consider a bootstrapped version of the empirical process (9), say H*, treated as a function of the underlying process J_t with some appropriately chosen block length l. Thus, our bootstrap estimator of H(x) at any fixed point x (i.e., θ̂* in our notation of Section 1.4, with H_n being θ̂_n) is

\hat{H}^*(x) = \frac{1}{B} \sum_{b=1}^{B} H^*_{(b)}(x),            (10)

where H^*_{(b)}(x) is the value of H(x) calculated for each particular realization of the pseudo-values J*_1, ..., J*_n generated via the circular bootstrap procedure using k blocks of length l as described in Section 1.4, and B is the number of generated pseudo-sequences. In our case we have taken B = 3000.
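The sketch below shows one way the circular block bootstrap behind (10) could be organized. The function `surplus_value` is a stand-in for the full cash-flow model of Section 3.3 (here it only needs to map 120 consecutive monthly increments to a 10-th year surplus), and the block mechanics follow the circular scheme of Section 1.4 as we read it; none of this is the authors' code.

```python
import numpy as np

def circular_block_resample(J, block_length, rng):
    """One pseudo-series J*_1,...,J*_n: wrap J on a circle and concatenate
    k randomly started blocks of the given length, truncated to length n."""
    n = len(J)
    k = int(np.ceil(n / block_length))
    starts = rng.integers(0, n, size=k)
    idx = np.concatenate([(s + np.arange(block_length)) % n for s in starts])[:n]
    return J[idx]

def bootstrap_surplus_cdf(J, surplus_value, x_grid, block_length=6, B=3000, seed=0):
    """Bootstrap estimate of H(x) = Prob(M < x) on a grid of x values,
    with pointwise percentile confidence bounds."""
    rng = np.random.default_rng(seed)
    J = np.asarray(J, dtype=float)
    n = len(J)
    H_reps = np.zeros((B, len(x_grid)))
    for b in range(B):
        J_star = circular_block_resample(J, block_length, rng)
        # circular windows of 120 consecutive pseudo-increments -> surplus values
        m_star = np.array([surplus_value(J_star[(i + np.arange(120)) % n])
                           for i in range(n)])
        H_reps[b] = np.array([(m_star < x).mean() for x in x_grid])
    estimate = H_reps.mean(axis=0)                         # formula (10)
    lower, upper = np.percentile(H_reps, [2.5, 97.5], axis=0)
    return estimate, lower, upper
```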
The first four realizations⁵ of the process H*(x) are presented in Figure 4. Let us note that the above estimate may also be viewed as a
Figure 4. Empirical surplus value process H_n(x) (thick line) and four realizations of the bootstrapped surplus value process H^*_{(b)}(x).
function of the bootstrapped surplus values of M*, say, m*_1, ..., m*_n, obtained by evaluating the surplus value for the pseudo-values of the series J*_1, ..., J*_n. For the purpose of the interval estimation of H(x), we consider bootstrap confidence intervals based on the method of percentiles, as described in Section 1.3, where now the bootstrap distribution G* denotes the distribution of H*(x) at any fixed point x, given the observed values of J_1, ..., J_n. In the next two sections, using the cash flow model outlined in Section 3.3, we analyze the distribution of the 10-th year surplus using the data for the average annualized yields to maturity for 30-year
In order to achieve the desired accuracy of the bootstrap estimator for m-dependent data, the number of generated resamples is typically taken to be fairly large (at least 1000). Of course, the number of all possible distinct resamples is limited by the number of available distinct series values (n) and the block length (l).
Treasury Bonds. For the sake of comparison we also present the results of a similar analysis based on a larger set of data, namely on the average yield to maturity on long-term Treasury Bonds for the period 1953-1976 as given in Table 4 of KL.
3.6 Bootstrap Estimates of the Surplus Cumulative Distribution
The bootstrap estimates H*(·) of the 10-th year surplus value distribution H(x) = Prob(M < x), for a grid of x values ranging between −3 and 4 (millions of dollars), along with their 95% confidence intervals, were calculated using the method of the circular bootstrap as described in Section 1.4 with a circular block length of six months (i.e., in the notation of Section 1.4, l = 6). Other values of l were also considered, and the obtained results were more or less comparable for 1 < l ≤ 6. In contrast, the results obtained for l ≥ 7 were quite different. However, the latter were deemed to be unreliable in view of the relatively small sample size (n = 167), since the theoretical results on the block bootstrap state that l should be smaller than √n in order to guarantee the consistency of the bootstrap estimates (cf. Shao and Tu, 1995, Chapter 9). The case l = 1 corresponds to an assumption of independence among the J_t's and was not considered here. The plot of the bootstrap estimate of H(x) = Prob(M < x) given by (10), along with the 95% confidence bounds, is presented in Figure 5. In order to achieve a reasonable accuracy, the number of generated resamples for each calculated point x was taken to be B = 3000. The values of the estimators, together with their 95% confidence intervals for some of the values between −2.5 and 4 (in millions of dollars), are also presented below in Table 3. Let us note that often, for purely practical reasons, we would be more interested in estimating 1 − H(x), i.e., the probability that the surplus value will exceed a certain threshold, rather than H(x) itself. However, these estimates can easily be obtained from Table 3 as well. For in-
Figure 5. The plot of H*(x) along with its 95% confidence bounds.
stance, in order to obtain an estimate of the probability of a positive 10-th year surplus (Prob(M > 0)), we subtract the estimate of H(0) given in Table 3 from one. The resulting estimate is then 1 − H*(0) = 1 − 0.06 = 0.94. In a similar fashion the upper and lower confidence bounds for 1 − H(0) can be obtained as well. The estimates of the probability of exceeding several different threshold values x are given in Table 4. From the values reported in either table it is easy to see that our bootstrap estimates based on the average yields on 30-year Treasury Bonds for 1977-1990 are not precise for x values in the interval from −2 to 2 million, in the sense that they have very wide confidence intervals. For instance, the estimated probability of a negative 10-th year surplus equals 0.06, but its interval estimate with 95% confidence is in fact a value between 0.01 and 0.57. Hence, in reality, the true value of this probability could be as high as 0.57. It seems, however, that this lack of precision was not the bootstrap's fault, but rather an effect of the relatively small sample size for the observed values of the
Table 3. Bootstrap estimates of H(x) for selected values of x along with their 95% confidence bounds.

Surplus Value x   Bootstrap Estimator H*(x)   95% CI
-2.50             0.01                        (0.00, 0.10)
-1.75             0.01                        (0.00, 0.17)
-1.00             0.02                        (0.00, 0.29)
-0.50             0.04                        (0.00, 0.40)
0.00              0.06                        (0.01, 0.57)
0.50              0.10                        (0.01, 0.67)
1.00              0.18                        (0.01, 0.77)
1.75              0.42                        (0.02, 0.96)
2.00              0.53                        (0.10, 0.98)
2.50              0.75                        (0.42, 1.00)
3.00              0.91                        (0.74, 1.00)
3.50              0.97                        (0.90, 1.00)
4.00              1.00                        (0.96, 1.00)

Table 4. Bootstrap estimates of 1 − H(x) along with their confidence bounds obtained from the estimates given in Table 3.

x     1 − H*(x)   95% CI
0     0.94        (0.43, 0.99)
1     0.82        (0.23, 0.99)
2     0.47        (0.02, 0.90)
3     0.09        (0.00, 0.26)
time series. In order to verify whether or not this was truly the case, we have also conducted our bootstrap analysis using a different set of data, namely the data on the average yields of the long-term Treasury bonds from the period 1953-76 as given in Table 4 of KL.
3.7 Estimates Based on the Yields on the Long-Term Treasury Bonds for 1953-76
Applying the same methodology as above we have re-analyzed the Klein company model using the data for the average yield to maturity on the long-term Treasury bonds for years 1953-1976. For selected
Figure 6. The plot of H*(·), along with its 95% confidence bounds, using the long-term Treasury bonds data for the years 1953-1976.
values of x, the numerical values of the bootstrap estimates of H(x) based on this data set are presented in Table 5. The plot of H*(x), calculated for sufficiently many x values so as to give it the appearance of a continuous curve, along with its 95% confidence bounds, is presented in Figure 6. Let us note that in this case the bootstrap estimate is indeed more precise (in the sense of generally shorter confidence intervals) than the one presented in Figure 5 and based on the 30-year Treasury Bond data for 1977-1990, especially in the range between −1 and 1. In particular, the probability of a negative 10-th year surplus is now estimated as 0.03 with a 95% confidence interval between 0 and 0.24. Hence, it seems that by using time series data on the interest rates spanning 24 years (as opposed to 14 years for the 30-year Treasury data) we were able to make our bootstrap estimates more precise, especially in the middle part of the 10-th year surplus distribution. It would be useful to further investigate this phenomenon by considering even longer periods of time and studying the effect of long time series data sets on the accuracy
of our bootstrap estimators. Of course, a danger of considering data sets that are too long is that over a very long period of time the key assumptions (A1) and (A2) of Section 3.4 may no longer be valid.

Table 5. Bootstrap estimates of H(x) along with their 95% confidence bounds for the long-term Treasury bonds data for 1953-1976.

Surplus in mlns (x)   Bootstrap Estimator H*(x)   95% CI
-1.75                 0.00                        (0.00, 0.01)
-1.00                 0.00                        (0.00, 0.08)
-0.50                 0.01                        (0.00, 0.11)
0.00                  0.03                        (0.00, 0.24)
0.50                  0.05                        (0.00, 0.43)
1.00                  0.12                        (0.00, 0.61)
1.75                  0.44                        (0.03, 0.93)
2.00                  0.59                        (0.12, 1.00)
2.50                  0.86                        (0.57, 1.00)
3.00                  0.98                        (0.90, 1.00)
3.50                  1.00                        (0.99, 1.00)

3.8 Comparison with the Parametric Analysis Results
In KL an analysis of the 10-th year surplus for the company model described in Section 3.3 was presented under the assumptions that (1) the random variables J_t are independent and identically distributed, and (2) they follow some prescribed probability distribution. In his paper KL considered two competing parametric classes: lognormal and stable Paretian. By estimating the parameters in both models with the help of the data for the average yield to maturity on 30-year Treasury Bonds for the years 1977-1990, and then generating a large number of interest rate scenarios via Monte Carlo simulations (techniques similar to the parametric bootstrap described in Section 2), KL arrived at empirical estimates of the distributions of the 10-th year surplus under the two different models.
Figure 7. The plot of H*(·) from Figure 5 (lowest curve) along with the empirical c.d.f.'s obtained by KL. The upper curve corresponds to the stable Paretian model and the middle one to the lognormal model.
KL pointed out that the probability of a negative surplus under the stable Paretian model seems to be much higher than under the lognormal one. In Figure 7 we present the empirical distributions of the 10-th year surplus based on 100 random scenarios as reported in KL, along with our bootstrap estimate. Based on a naive comparison of the three curves, it seems that our bootstrap estimate more closely resembles the distribution obtained by KL under the lognormal interest rate model, and in particular does not indicate the probability of a negative surplus to be as high as indicated under the stable Paretian model for the underlying interest rate process. However, keeping in mind the confidence bounds on our bootstrap estimate presented in Figure 5, one would have to be very cautious about drawing such conclusions. In fact, since KL in his method of approximating the true distribution of the 10-th year surplus did not consider any measure of error for his estimators, a simple comparison of the point values on the three curves presented in Figure 7 may be quite misleading. It
would certainly be more beneficial for the validity of our comparisons if an error estimate were produced for the surplus distributions of KL, for instance via the methods described in Section 2.
4 Conclusions
The assessment of the fitted model variability is one of the most important aspects of statistical modeling. In this work we have shown that the bootstrap, in both the parametric and the nonparametric setting, can be quite useful in accomplishing this goal for actuarial models. In mortality modeling this allows for a better understanding of the fitted model and its sensitivity to changes in the underlying mortality patterns. The benefits of our approach are even more apparent in the cash flow testing model, since the stochastic structure there is somewhat richer, and probably still less understood. The remarkable feature of the bootstrap method applied to the interest rates process was that under quite minimal assumptions we were able to obtain a model for cash flow testing which seems to perform equally well as, if not better than, the parametric ones. The models of this work are, hopefully, a first step towards the development of new actuarial methodologies in the modeling of insurance enterprises. Some additional inroads into the modeling of loss distributions, inferences about the severity of loss, loss distribution percentiles and other related quantities based on data smoothing, as well as bootstrap estimates of standard error and bootstrap confidence intervals for losses, have been given, for instance, by Derrig, Ostaszewski and Rempala (2000). We believe that this area holds great promise for dynamic future research, which we would like to promote and encourage with this work.
Acknowledgments

The authors gratefully acknowledge the support of their research received from the Actuarial Education and Research Fund. They also wish to thank Ms Kim Holland for her expert typesetting help.
References

Actuarial Standards Board (1991), "Performing cash flow testing for insurers," Actuarial Standard of Practice No. 7, Washington, D.C.
Becker, D.N. (1991), "Statistical tests of the lognormal distribution as a basis for interest rate changes," Transactions of the Society of Actuaries, 43, 7-72.
Bickel, P.J. and Freedman, D.A. (1981), "Some asymptotic theory for the bootstrap," Ann. Statist., 9, no. 6, 1196-1217.
Black, F. and Scholes, M. (1973), "The pricing of options and corporate liabilities," Journal of Political Economy, 81, no. 3, 637-654.
Bowerman, B. and O'Connell, T. (1987), Time Series Forecasting, Duxbury Press, Boston.
Carriere, J. (1992), "Parametric models for life tables," Transactions of the Society of Actuaries, 44, 77-100.
Cerrito, P., Olson, D., and Ostaszewski, K. (1998), "Nonparametric statistical tests for the random walk in stock prices," Advances in Quantitative Analysis of Finance and Accounting, 6, 27-36.
de La Grandville, O. (1998), "The long-term expected rate of return: setting it right," Financial Analysts Journal, November-December, 75-80.
Derrig, R., Ostaszewski, K., and Rempala, G. (2000), "Applications of resampling methods in actuarial practice," Proceedings of the Casualty Actuarial Society, 87, 322-364.
Efron, B. (1979), "Bootstrap methods: another look at the jackknife,"
Ann. Statist., 7, no. 1, 1-26.
Efron, B. and Tibshirani, R. (1993), An Introduction to the Bootstrap, Chapman and Hall, New York.
Fama, E. and French, K. (1988), "Permanent and temporary components of stock market returns," Journal of Political Economy, 96, 246-273.
Hsu, D., Miller, R., and Wichern, D. (1974), "On the stable Paretian behavior of stock-market prices," Journal of the American Statistical Association, 69, 108-113.
Hull, J. (1997), Options, Futures, and Other Derivative Securities, 3rd ed., Prentice-Hall, Simon & Schuster.
Klein, G. (1993), "The sensitivity of cash-flow analysis to the choice of statistical model for interest rate changes," Transactions of the Society of Actuaries, 45, 9-124. Discussions 125-186.
Künsch, H. (1989), "The jackknife and the bootstrap for general stationary observations," Ann. Statist., 17, no. 3, 1217-1241.
Ostaszewski, K. (2002), Asset-Liability Integration, Society of Actuaries Monograph M-F102-1, Actuarial Education and Research Fund, Schaumburg, Illinois.
Panjer, H. (ed.) (1998), Financial Economics with Applications to Investments, Insurance and Pensions, Actuarial Education and Research Fund, Schaumburg, Illinois.
Politis, D. and Romano, J. (1992), "A circular block-resampling procedure for stationary data," in LePage, R. (ed.), Exploring the Limits of Bootstrap, 263-270, Wiley, New York.
Rempala, G. and Szatzschneider, K. (2002), "Bootstrapping parametric models of mortality," manuscript, to appear in Scandinavian Actuarial Journal.
Ross, S. (1976), "The arbitrage theory of capital asset pricing," Journal of Economic Theory, 13, 341-360.
SAS 8.2 Statistical Package (2001), SAS Institute, Cary, NC.
Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, Springer-
Verlag, New York.
Sharpe, W.F. (1964), "Capital asset prices: a theory of market equilibrium under conditions of risk," Journal of Finance, 19, 425-442.
Singh, K. (1981), "On the asymptotic accuracy of Efron's bootstrap," Annals of Statistics, 9, no. 6, 1187-1195.
Stanton, R. (1997), "A nonparametric model of term structure dynamics and the market price of interest rate risk," The Journal of Finance, 52, no. 5, 1973-2002.
Tilley, J.A. (1992), "An actuarial layman's guide to building stochastic interest rate generators," Transactions of the Society of Actuaries, 44, 509-538. Discussions 539-564.
Vanderhoof, I.T. and Altman, E. (eds.) (1998), Fair Value of Insurance Liabilities, Kluwer Academic Publishers.
Asset-Liability Management and Investments
Chapter 16

System Intelligence and Active Stock Trading

Steve Craighead and Bruce Klemesrud

In this chapter, we study the selection and active trading of stocks by the use of a clustering algorithm and time series outlier analysis. The Partitioning Around Medoids (PAM) clustering algorithm of Kaufman and Rousseeuw (1990) is used to restrict the initial set of stocks. We find that PAM is effective in its ability to identify nonuniform stock series within the entire universe. We are pleasantly surprised that the algorithm eliminated the bankrupt Enron and Federal Mogul stock series without our intervention. We use outlier analysis to define two separate active trading strategies. The outliers within a time series are determined by the use of a Kalman Filter/Smoother model developed by de Jong and Penzer (1998). Using our strategies and trading weekly in stocks with an initial $30,000 in a closed stock portfolio from 1993 to 2001, we obtained a 17.8% annual return on a cash surrogate passive strategy, 18.1% on a passive strategy using all the stocks in our restricted asset universe, 20.2% on a combined cash protected and outlier active strategy, and 23.3% using the outlier active strategy only. Comparing these results to a passive strategy entirely invested in the S&P 500 Large Cap index, with a 9.9% return, we find that under this stock portfolio any of our strategies is superior to a purely passive index strategy. We also extend our analysis to include data separately to July 2002 and to August 2002 to examine our strategies as the market has continued to decline during 2002.
1 Introduction
The process of actively managing a stock portfolio is more an art than a science. The industry irritation is that elementary school children pick stocks with better performance than those of the professionals. Also, to add insult to injury, it is reputed that stock portfolios chosen randomly from Rolodexes by monkeys perform better than the students'. Even though we might be competing with our youth and various other simians, we believe that our experience and two newer statistical tools may still allow us to make some well-reasoned decisions in active stock management.

Primarily, investing is a two-step process. The first step is the development of an asset universe (or asset list), which falls to an investment committee. By adding to this list they provide strategic options for the portfolio. In a sense they are creating resources from which the portfolio manager will draw. When they choose an asset, the committee reviews qualitative issues such as market niche, company management style, consistency of operations, asset availability, diversification and risk appetite. This list should provide a broad range of acceptable securities, which are expected to perform well over time regardless of current market activity. An asset committee will purposefully select stocks from various sectors/industries and different indices, since each sector/industry and index behaves differently, and in aggregate they may create a synergy.

The second step involves the portfolio manager and the selection of assets (from the asset list) to purchase or sell in regard to the portfolio. The manager is the tactical overseer of the portfolio. The manager draws from the resources created by the investment committee, but makes decisions based upon current market activity. This activity is frequently measured in quantitative changes: a bad earnings report, unfavorable management comments, or quick deviations from historical norms. These quantitative changes are reported in the financial press each day, and the experienced manager will attempt to wait until all information is known before buying or selling a
security. However, it is difficult for the manager to follow all news regarding the securities in the portfolio, especially if it contains a large number of securities. Besieged by this volume of information, the manager must respond and, in a sense, place bets by weighting specific investments more than others based on his or her personal judgement.

Using classical portfolio theory, the initial choice of assets is frequently based on a risk/return tradeoff using quadratic programming (or, from a CAPM approach, comparing various β values). However, in this chapter we are also interested in the stock price series, and we believe that the change in the level of the stock price is masked if we only use the stock return series. This leads us to use the Partitioning Around Medoids (PAM) algorithm. This algorithm is introduced by Kaufman and Rousseeuw (1990). PAM is designed to take a collection of vectors and obtain the best representatives for a specific number of clusters. We use the algorithm only in the design of the initial asset universe.

Classical portfolio theory is a short-period decision process, and though it can be used to determine which assets best optimize the current portfolio, one must deal with issues of portfolio drift and rebalancing. However, we want an investment process that is able to monitor the market and make specific movement recommendations on specific assets. This leads us to use a time series outlier algorithm developed by de Jong and Penzer (1998). Their work is based on using a single pass of a Kalman Filter/Smoother to produce an outlier statistic they call τ². We use the change in this statistic to determine when to actively move in and out of various stocks.

In the next section, we will discuss the data collection and asset selection process. In Section 3, we will discuss the use of τ² to indicate the change of a market paradigm. In Section 4, we describe the strategies that we use to make our investment decisions. In Section 5, we examine the results of our strategies. In Section 6, we discuss our conclusions, model limitations, and possible future research. In Appendix A, we give an outline of the PAM cluster algorithm. In
566
S. Craighead and B. Klemesrud
Appendix B, we briefly outline the formulation of the τ² statistic.
2 Determination of the Asset Universe
We develop our initial asset universe by using an incomplete preliminary price history of 138 stocks from many separate sectors and indices. The asset prices reflect 54 varying dates from February 1998 to December 2001. Obtaining the average and the standard deviation of the prices for each of the series, we detrend each price series by subtracting the mean and dividing by the standard deviation. We treat these time series as vectors of length 54. We use the PAM algorithm (as outlined in Appendix A) to determine five representative clusters. We examine each cluster to determine if there is only one asset in that cluster, assuming that such assets are aberrations. This analysis eliminates Enron, a security that was entering bankruptcy. The PAM algorithm reveals behavior across all of the data. We only had one datum after the Enron downfall, but the algorithm pointed out that the overall stock price series had no peers among the other 137 stocks. Eliminating Enron and reprocessing the remaining 137 stocks in the same way, we eliminate Federal Mogul, which is also bankrupt. Once Federal Mogul is eliminated, the PAM algorithm returns five clusters with several assets in each cluster. Note: we used the L1 norm (the sum of the absolute values of the differences of the values) to define the distances in the algorithm, in order to reduce the influence of outliers upon the selection process. The PAM algorithm provides a defensive benefit in that it allows us to avoid stocks that may be so extreme in their behavior that they have no true peers.

We obtain a complete price history from Yahoo! Finance for the remaining assets and remove stocks that do not have a price history longer than nine years. We used nine years for two reasons. First, we develop the trading strategies on the middle third of the data and use the other thirds to back- and forward-validate the strategy. Second, we realize that a company is constantly changing, and
our view is that a company in existence over a long period is not the same as when it began. Changes in the economy, in management philosophy and in management itself cause market data to become stale after a given period, and there are not many companies maintaining a specific market paradigm for a long period of time. However, we decided that nine years is a good compromise between the historical statistics and the current market paradigm. Finally, we rely upon our investment experience to reduce the list to the final asset universe displayed in Table 1. Notice in Table 2 how the asset universe is diversified by sector and in Table 3 by size.

We use two stocks (specifically JNJ and XOM) as cash surrogates. We define a cash surrogate stock to be a stock that will replace the use of a highly secure asset such as a Treasury Bill in portfolio selection and analysis. A cash surrogate stock is usually a Blue Chip which is large, well diversified, highly liquid, and has minimal price volatility when compared to the overall market. We collect the prior twenty years (if available) of weekly price data (from January 1, 1982 to December 31, 2001) from Yahoo! Finance (chart.yahoo.com). These prices are adjusted for stock splits and dividends. The outlier statistics are then determined upon these prices (the entire history). Note: these prices are not detrended as above in the use of PAM.

When using historical trading strategies, we realize that our mixed methods of choosing our asset universe may be considered as manipulating the results (i.e., the elimination of Enron or Federal Mogul). This accusation may also be leveled due to the fact that our asset history only spans the prior nine years. This period, which contains the longest bull market period in history, may inflate our results. We address these issues in three ways. First, our asset universe is determined once and is never again modified in any of the subsequent work. Second, as mentioned before, our strategies are created using only the middle third of the data. Once the strategy is created, the first third of the data is used to back-validate our results and the final third is used to forward-validate them. Third, since 1998, there has been great
market instability and most of the large market gains in the early to mid 1990s have been dramatically reduced by this period.
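The screening step above can be prototyped with a simple k-medoids routine. The sketch below standardizes each price series and runs a basic alternating k-medoids heuristic under the L1 distance; this is a simplification of the full PAM build-and-swap procedure of Kaufman and Rousseeuw (1990), and the function names and cluster count are illustrative assumptions. Singleton clusters are flagged in the way Enron and Federal Mogul were flagged above.

```python
import numpy as np

def kmedoids_l1(X, k=5, n_iter=100, seed=0):
    """A simplified alternating k-medoids heuristic under the L1 distance.
    X has one row per standardized price series."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)   # pairwise L1 distances
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)           # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                # new medoid: the member minimizing total distance within the cluster
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

def screen_singletons(prices):
    """Standardize each series, cluster, and return indices of singleton clusters."""
    X = (prices - prices.mean(axis=1, keepdims=True)) / prices.std(axis=1, keepdims=True)
    labels, _ = kmedoids_l1(X, k=5)
    counts = np.bincount(labels, minlength=5)
    return [i for i, lab in enumerate(labels) if counts[lab] == 1]
```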
3 Implementation
Using the Kalman Filter/Smoother method briefly described in Appendix B, we obtain the outlier τ² statistic at each time point for each series. Examples of τ² are plotted in Figures 1 and 2. In Table 4, statistics of τ² for each stock are listed. The τ² statistics are approximately chi-square, and this can provide a means to judge the significance of the values.
Figure 1. Time series of outlier statistics for cash surrogates.
We believe that each stock price series contains specific information that is both market and company specific. We assume that the market is fairly efficient and that the price of a stock changes to reflect new information. However, we also believe that there are complex interchanges between the market and a stock's value, not least that of market psychology.
Table 1. The final asset universe.
Table 2. Stock diversification by sector.

Sector                    Count
Healthcare                5
Energy                    1
Utilities                 2
Financial                 2
Technology                3
Consumer Cyclical         1
Capital Goods             1
Consumer Non-Cyclical     2
Services                  2
Basic Materials           2

Table 3. Stock diversification by size.

Size                      Count
Dow Industrials           8
Dow Utilities             2
S&P 500                   6
NASDAQ 100                2
S&P 400 (MidCap)          2
S&P 600 (SmallCap)        1
Aggregate Growth          1
This leads us to contemplate the possibility that there is additional information contained within the series that has not yet been reflected by the market. Since high values of τ² imply that the fundamentals underlying the stock price are changing, leading to an outlier, we believe that the statistic can be a good indicator of any and all new information. We may not know the specific reason for the paradigm change; however, we assume that the outlier statistic reveals that a change is occurring. In the next section, we assume that new information is strengthening while the statistic is increasing. However, we assume
Figure 2. Time series of outlier statistics: comparison (JNJ, XOM, AEP, AIG).
that as the statistic falls, the majority of the new information has already registered, and the price series begins to revert to the status quo. In the next section, we construct two active strategies that sell when τ² is falling.
4 Model Description
We examine five historical trading strategies. The first we call the "S&P 500" strategy, which is a passive index strategy where we invest the initial amount in the S&P 500 index and make no changes in the investment for the entire investment period. The second we call the "Cash Surrogate" strategy. Here, we place the initial amount, split equally, in our two cash surrogates, and we do not make any other changes in the investment over the investment period. The third we call the "Passive" strategy. Under this strategy we
Table 4. Basic statistics on the stock outlier statistic.

Stock Symbol   Minimum   First Quartile   Median     Mean       Third Quartile   Maximum
JNJ            0.0015    0.3085           1.1220     3.0788     5.4090           13.9371
XOM            4.90E-6   0.09437          1.1091     2.1219     3.0110           13.295
AEP            0.00059   0.06599          0.3933     1.4001     1.66202          10.3763
AIG            3.06E-6   1.7712           9.6778     16.163     15.8791          80.177
AMAT           0.00102   2.07258          6.36806    20.9459    11.33510         176.3947
BAC            0.00029   1.47846          5.22012    17.3243    22.44295         85.1243
CAH            0.00023   1.82930          5.42177    8.1376     11.31191         46.93333
D              0.00096   0.14277          0.88906    4.37229    5.27433          39.54997
EK             3.07E-6   1.08909          10.41518   29.7587    50.461           122.615
HWP            8.47E-5   0.21473          3.79574    11.33841   16.45391         103.3401
ITW            0.00016   4.37575          9.67202    15.71565   21.47767         64.40019
IVC            2.83E-7   0.29823          1.59093    2.87665    4.65674          10.11641
LANC           7.75E-5   0.47648          1.74143    3.64980    4.79727          14.59669
MCD            0.00031   0.09142          1.23688    3.71250    5.88821          22.90411
MDT            0.00006   0.27945          2.25513    5.10800    6.57462          25.67335
MO             0.00006   0.27604          0.95651    3.61067    5.13000          19.80898
MRK            0.00020   0.71763          2.58939    13.73230   20.42116         77.65404
MSFT           0.00286   7.60743          21.54652   49.58955   55.25148         252.13553
RPM            0.00142   0.13271          0.22657    0.51394    0.36017          5.96583
SBC            0.00030   0.40280          1.40575    4.64279    4.89330          32.38622
USAUX          0.00093   0.45461          3.05818    5.94627    5.79714          46.52636
WOR            0.00089   0.39012          1.64675    2.04949    3.71074          5.76535
place two thirds of the initial amount evenly in the cash surrogates and distribute the remaining third equally among the other twenty stocks. No other changes are made in the investment over the investment period.

Before introducing the fourth and fifth strategies, we want to examine Figure 3. Here we have two time series. The lower series is a hypothetical price series and the upper series is the corresponding τ² series. The two vertical bands in the figure are regions where both series are decreasing. In active trading, we would like to enter a stock position when the price is low and exit when the price is high, before it turns around. However, we might give up the desire to enter low if we can preserve the value of the portfolio in the event of a downturn.
We use the τ² statistic to indicate the strength of the information entering the series. We assume that when both the price series and the τ² series are falling, the stock has entered a downturn and will begin to seek the status quo. Using this, we now develop our two active strategies.
Figure 3. A hypothetical price series (lower) and the corresponding τ² series (upper).
The fourth strategy we call "Active"; we distribute the initial investment to all 22 stocks as in the "Passive" strategy, but we use the above sell rule to move between the various stocks. Specifically, if τ²_t − τ²_{t−1} for a stock S is negative and the price of S at time t minus its price at time t − 1 is negative, we execute an order to sell one half of the position in S into the cash surrogates and make the cash available for other investments. Otherwise, if cash is available, we execute a buy order of stock S by the bitesize. We define the bitesize as the acceptable trading size that we wish to enter as an initial commitment, and we consistently use $500.
574
5
S. Craighead and B. Klemesrud
Results
We will start with an initial investment of $30,000. We will not add any additional moneys to the portfolios. In the final three strategies, $10,000 is initially invested in each cash surrogate and $500 in the other stocks. We initially conducted the study from 1996 to 1998, and found that our active strategies were sound. Over the nine-year period we see in Figures 4 and 5 the performance of each of the strategies. The basic statistics on these portfolios are in Table 5. In Figure 4, we see that all of the strategies exceed that of the "S&P 500" passive strategy, and that in the long term the buy and hold positions of the "Cash Surrogate" and "Passive" approach each other even though most of the time the "Passive" strategy exceeds that of the "Cash Surrogate". Note how the "Active" and "Restricted" exceed that of the "Cash Surrogate" in Figure 5. Except for some high volatility in the second and third week of March 2001, the "Active" and "Restricted" portfolios seem to have reasonable volatility and performance. In Figures 6 and 7 and Table 5 one can observe the cash surrogate portion of the "Active" strategy with that of the overall portfolio performance. Notice how quickly the strategy moves out of the cash surrogates in the early years and dramatically moves into the safer cash surrogates after the 2000 market downturn in Tech stocks. From 1998 to 2000, the cash surrogates are somewhat volatile, because of the market shifts due in part to difficulties in Russia and the Pacific Rim. Note that the same results are in the "Restricted" strategy except the post 2000 shift into the cash surrogates are not as steep, and the cash surrogate volatility is not as extreme from 1998 into 2000. The only difficulty is the March 2001 volatility in both of these strategies. In March 2001, we saw the beginning of an increase in unemployment, and a severe drop in the DJIA and the NASDAQ. The perfect active strategy would have moved more into cash surrogates when the market value moved down so drastically. Possibly if we
575
System Intelligence and Active Stock Trading
S&P500 Index, Cash Surrogate and Passive Portfolios
o o _ OJ o
&$< <)' » ' -fc.l'
S&P50Q Cash Surrogate
.; :(vn
£. J*?<
o o " o
^
*
00G
liar
o
^(Af Jriyf*
o
I1-
— 1
1
01/11/1993
10/24/1994
1 OB/05/1996
\
05/18/1998
i 02/28/2000
1 12/10/2001
Time in weeks
Figure 4. Time series of the portfolios.
3000
Cash Surrogate, Active and Restricted Portfolios
o OJ
i
— 000
o
Cash Surrogate Active Restricted
t*.'x * \ *> 1
100000
'
••-4 DO
a
o o
-
j£y - ^ W — H * ^ - ^ I
I 10/24/1994
i 08/05/1996
i 05/18/1998
i 02/28/2000
Time in weeks
Figure 5. Time series of the portfolios.
12/10/2001
576
S. Craighead and B. Klemesrud Active Portfolio and its Cash Surrogate Component
"i 01/11/1998
r
10/24/1994
06/05/1995
05/1S/1998
02/26/2000
12/102001
Figure 6. Time series of active portfolios with their cash surrogates.
150 000
Restricted Active Portfolio and its Cash Surrogate Component
1 ,1
W^
Restricted Cash Component
il"
H
50000
J .'•'-':
I 01/11/1993
I 10/24/1994
l 08/05/1996
' : /
1 05/16/1998
1
1
Time in weeks
Figure 7. Time series of active portfolios with their cash surrogates.
577
System Intelligence and Active Stock Trading
used these strategies on a daily basis, we might have eliminated the volatility, since we only move 50% of a stock's position in one week, but daily trading would have increased the overall volatility of the strategies. Table 5. Basic statistics on Portfolio Minimum First Median Quartile S&P500 $ 29,799 $34,860 $61,773 Cash Surrogate 27,546 37,514 72,098 Passive 29,199 39,822 77,263 Active 29,381 38,715 92,503 Restricted 29,502 37,572 83,202 Passive Cash Active Cash Restricted Cash
6
18,364 1,004 9,768
25,009 1,259 13,074
48,065 1,443 28,929
portfolios. Mean Third Maximum Quartile $61,311 $ 84,997 $ 104,824 72,098 105,460 130,984 77,978 118,468 134,501 99,832 160,380 230,961 84,728 128,558 190,692 48,066 21,879 35,547
70,307 17,888 46,937
87,322 171,706 132,560
Conclusions and Further Research
We found that the PAM algorithm was extremely valuable when determining which stocks should be included in the portfolio from the original 138 stocks. It was fascinating that the clustering algorithm specified that Enron and Federal Mogul were unique. Our successful use of PAM in restricting the asset universe leads us to believe that this data discovery tool (or its large dataset cousin CLARA also described by de Jong and Penzer (1998)) may be of use in asset management. With the initial $30,000, we obtained a 17.8% annual return on the cash surrogate passive strategy, 18.1% on the passive strategy, 20.2% on the restricted strategy, and 23.3% using the active strategy only. Comparing these results to the passive strategy being entirely invested in the S&P 500 Large Cap index with a 9.9% return, we find that all of our strategies are superior to that of a purely passive index strategy.
578
5. Craighead and B. Klemesrud
Updating our data to reflect the stock performance in 2002, we have obtained the results in Table 6. We have duplicated our analysis through June 30, 2002 and separately through July 31, 2002. The separate July 2002 analysis was conducted because the July 2002 stock market performance has been one of the worst in the past twelve years. Here we see that the cash surrogate passive strategy continues to maintain its return where the passive strategy, the restricted strategy and the active strategy have dramatically dropped in value. However, these strategies continue to dramatically exceed the return results for the S&P 500. We need to examine these results further to determine other possible modifications to our active and restricted strategies. Table 6. Updated Results.
Strategy S&P 500 Cash Surrogate Passive Restricted Active Active
1/ 1/1993 12/31/2001 9.9% 17.8% 18.1% 20.2% 23.3%
111/1993 6/30/2002 9.1% 17.8% 17.6% 18.8% 20.1%
8/ 1/199 7/31/20C 8.0% 17.8% 16.9% 17.2% 18.4%
Our analysis assumes that there are no transaction costs. We realize this does not reflect reality and we need to examine their effects. This may demonstrate that a portion of our 5% pickup (over the passive cash strategy) in our initial study may very well disappear. Our model assumes that there are no taxes on capital gains. This is valid for tax qualified structures such as endowment funds, 401(k), IRA accounts, qualified pension funds, etc. However, the impact of taxes needs to be considered for more general active portfolio management since active trading trigger capital gains and losses with increased inefficiencies. The model also assumes that no new money is available for investment. Consideration for this needs addressing.
System Intelligence and Active Stock Trading
579
r 2 is determined over the entire history of the price series. It is possible that decisions made at time t are influenced by the stock prices used in the r 2 calculation at times greater than t. The asset selection process is also problematic in that when we use the PAM algorithm, we use only the most current history and restrict our asset universe to only those who have stock histories for nine years. The entire study needs to be conducted assuming that PAM is used and T 2 is found only up to time t. Methods to determine the type of outliers within the time series are also discussed in de Jong and Penzer (1998). We use the r 2 results only to specify a paradigm shift, no matter the type of outlier. Using their other techniques, we could add additional strategies based on the outlier types as well. The active strategy executes a sell if the change in r 2 and price are both negative, without regard to the magnitude of the change. Another strategy might be to consider movement only if the price change is above a certain threshold. Other strategies may be developed from the structure and behavior of the r statistic. Here, we might examine the height, breadth and width of the outlier graph. Other areas of future research would be to examine the sensitivity of the strategies to varying trading frequencies, bitesizes, trading limits, cash protection limits and active asset selection. An examination of the use of our strategies over longer (or earlier) historical periods should also be considered.
Acknowledgments Steve Craighead: I would like to thank my wife Patricia and my children Sam, Michelle, Bradley, Evan and Carl for their patience. I would also like to thank Stephen Sedlak, Vice President of Corporate Actuarial at Nationwide for his ongoing support and encouragement in our various research endeavors. Bruce Klemesrud: I would like to thank my wife and children for bearing with my investing.
580
S. Craighead and B. Klemesrud
Appendix A Partitioning around Mediods (PAM) The purpose for the partitioning of a data set of objects into k separate clusters is to find clusters whose members show a high degree of similarity among themselves but dissimilarity with the members of other clusters. The PAM algorithm searches for k representative objects among the data set. These k objects represent the varying aspect within the structure of the data. These representatives are called the mediods. The k clusters are then constructed by assigning each member of the data set to one of the mediods. Using the notation of Kaufman and Rousseeuw (1990) we denote the distance between the objects i and j as d(i,j). This can be any acceptable metric, such as Euclidean or the L\ Norm (or the sum of the absolute value of differences of the values) distance. Denote a dissimilarity as a nonnegative number D(i,j) which is near zero when objects i and j are "near" and is large when i and j are "different." Usually D(i,j) meets all of the metric requirements except the triangular inequality. Various candidates are discussed in Kaufman and Rousseeuw (1990). The PAM algorithm consists of two parts. The first, the build phase, follows the following algorithm: 1. Consider an object i as a candidate. 2. Consider another object j that has not been selected as a prior candidate. Obtain its dissimilarity Dj with the most similar previously selected candidates. Obtain its dissimilarity with the new candidate i. Call this D(j, i). Take the difference of these two dissimilarities. 3. If the difference is positive, then object j contributes to the possible selection of i. Calculate Cji = m&x(Dj — D(j, i),0).
System Intelligence and Active Stock Trading
581
4. Sum Cji over all possible j , ]T\ Cji. This gives the total gain obtained by selecting i. 5. Choose the object i that maximizes the sum of Cji over all possible j . Repeat the process until k objects have been found. The second step attempts to improve the set of representative objects for all pairs of objects (i, h) in which i has been chosen but h has not been chosen as a representative, it is determined if the clustering results improve if object i and h are exchanged. To determine the effect of a possible swap between i and h we use the following algorithm: 1. Consider an object j that has not been previously selected. We calculate its swap contribution Cjih'. (a) If j is further from i and h than from one of the other representatives, set Cjih to zero. (b) If j is not further from i than any other representatives (d(j, i) = Dj), consider one of two situations: i. j is closer to h than the second closest representative and d(j, h) < Ej where Ej is the dissimilarity of j and the second most similarly representative. Then Cjih = d(j, h) — d(j, i). Note: Cjih can be either negative or positive depending on the positions of j , i and h. However, if j is closer to i than to h a positive influence is introduced, which implies that swapping object i and h is a disadvantage in regards to j . ii. j is at least as distant from h than the second closest representative, or d(j,h) > Ej. Let Cjih = Ej — Dj. The measure is always positive, because it not wise to swap i with a h further away from j than with the second closest representative.
582
S. Craighead and B. Klemesrud
(c) If j is further away from i than from at least one of the other representatives, but closer to h than to any other representative, Cjih = d(i,h) — Dj will be the contribution of j to the swap. 2. Sum the contributions over all j . Tih = HjCjih- This indicates the total result of the swap. 3. Select the ordered pair (i, h) which minimizes Tih. 4. If the minimum T^ is negative, the swap is carried out and the algorithm returns to the first step in the swap algorithm. If the minimum is positive or 0, the objective value cannot be reduced by swapping and the algorithm ends.
583
System Intelligence and Active Stock Trading
Appendix B Filter/Smoother Model The outlier statistic r 2 in de Jong and Penzer (1998) is obtained by creating two hypothetical models of the data. The r 2 is a measurement at a specific time of how the data does not match the null hypothesis model. We will use their notation and set up the necessary notation to obtain their r 2 statistic. Using their notation, let the data be represented by y = (y'i, y2, • • •, y'n)' for time t = 1 , . . . , n. Assume that the null model of y has mean 0 and has a covariance matrix cr2E. The £ gives the serial correlation of the data series y. We will represent the null model as y ~ (0, cr2£). We want to determine if there are departures from the null model and this is modeled by the addition of an intervention variable D = (D[, D'2,..., D'n)', and the alternative hypothesis model will be denoted as y ~ (D5,
= Ztat + Gtet = Ttat + Htet,
and t = l,...,n,
(1) (2)
where et ~ ./V(0, a21), at ~ (cui, a2Pi), and et and ax are mutually uncorrected. The matrices Zt, Tt, Gt, and Ht are deterministic but
584
S. Craighead and B. Klemesrud
could vary over time. For t = 1,2,..., n let vt Ft Kt a 4+ i Fm
= = = = =
yt- Ztat ZtPtZ't + GtG't, (TtPtZ't + HtG't)F-\ Ttat + A > t , and TtPtL't + Htft,
(3) (4) (5) (6) (7)
where Lt = Tt — KtZt and Jt = Ht — KtGt. Now using the Kalman Smoother, we take the results of the Kalman Filter and initialize the Smoother with rn = 0 and Nn = 0. Then for t = n,..., 1 ut Mt r t _! iVt-i
= = = =
F;xvt-K'tru F-l + K'tNtKu Z ^ + ^rt, and l Z'tF^ Zt + i:tNtLt.
(8) (9) (10) (11)
They set up the alternative model as yt
= XtS + Ztat + Gtet = Wt<J + T t a t +
at+1
and fl-tet,
(12) (13)
where Xt and W4 are called the shock design matrices and 5 is the shock magnitude. They go on to state that for a given time t and null state-space model the maximum of p\ = s'^^St, with respect to the Xt and Wt is pf
= v'tF^vt
+ r'tNrlrt,
(14)
where vt, Ft, rt, and Nt are computed with the Kalman Filter Smoother applied to the null model. The maximum is attained when Xt = vt and Wt = Ktvt + N^n. Finally de Jong and Penzer show that r 2 has a maximum value at 2 rt* = a~~2p*2 and that a plot of r^*2 against t reveals when the shock design is significant at time t. We use this r*2 as our outlier statistic in our active strategies.
System Intelligence and Active Stock Trading
585
References de Jong, P. and Penzer, J. (1998), "Diagnosing shocks in time series," JASA 93, no. 442, 796-806. Kaufman, L. and Rousseeuw, PJ. (1990), Finding Groups in Data: an Introduction to Cluster Analysis, Wiley, New York.
This page is intentionally left blank
Chapter 17 Asset Allocation: Investment Strategies for Financial and Insurance Portfolio K.C. Cheung and H. Yang
This chapter surveys optimal portfolio selection strategy for financial and insurance portfolios. In particular, we will examine some of the recent advances in optimal portfolio theory. Multi-period generalization of the classical single-period Markowitz model will be discussed in details. Optimal portfolio problems using Value-at-Risk (VaR) and Capital-at-Risk (CaR) as risk measures, together with the mathematics techniques, are presented to demonstrate the variety of both formulations and solution techniques in optimal portfolio theory.
1
Introduction
The financial market is full of uncertainty. When we invest, our primary objective is to earn a high return and at the same time maintain a low risk. These conflicting objectives lead naturally to the optimal portfolio problem: to seek an optimal way to allocate our wealth among different financial securities. The mean-variance approach proposed by Markowitz (1952) was the first systematic treatment in this direction. In this classical model, the objective is to maximize the expected return and minimize the variance (variance is used as a measure of risk) in a single-period situation. Since then, the Markowitz model has been generalized tremendously in different directions: from singleperiod to multi-period, from variance to other risk measures, from 587
588
K. C. Cheung and H. Yang
discrete-time to continuous-time, to mention a few. Risk management is becoming more important in recent years among financial institutes. A key component in risk management is risk measure. A risk measure is a summary statistics which measures the risk exposure of a financial institute. Risk measure quantifies risk such that risk management team of a firm can take systematic and scientific action. Risk measure also plays a significant role when financial institutes, such as banks or insurance companies, formulate investment strategies. A typical situation is that while aiming at achieving a high investment return, the risk of the investment portfolio, which is measured by a particular risk measure, is not allowed to exceed a certain level. Therefore, one of the most important aspects in formulating optimal portfolio problem is the choice of risk measure. Different risk measures may require different solution techniques, and lead to different solutions. We will examine three risk measures in this chapter, namely variance, Value-at-Risk (VaR), and Capital-at-Risk (CaR). Traditionally, insurance companies concern about their liability side more than the asset side. Much attention is paid to the pricing of insurance products. In recently years, more emphasis is focused on the relationship between asset and liability. Insurance companies need to formulate investment strategies which on one hand can provide them with high investment return, and on the other hand has limited and controllable risk, to ensure that the future liability can be met, and regulatory requirment be satisfied. A well-formulated optimal portfolio problem may help insurance companies to manage its asset better. In this chapter, we will select some optimal portfolio models which are proposed and analyzed in the literature. More specifically, in Section 2, we will briefly describe the classical single-period Markowitz model. In Section 3, we will generalize this single-period model to a multi-period model. We will demonstrate how to obtain the optimal multi-period mean-variance portfolio policy through an auxiliary problem. In Section 4, we will briefly review the classical
Investment Strategies for Financial and Insurance Portfolio
589
Merton's continuous time model. In Section 5, we will use Value-atRisk (VaR) as a measure of risk, and formulate an optimal portfolio problem with a VaR constraint. In Section 6, we will study the optimal portfolio problem using Capital-at-Risk (CaR) as a risk measure. In Section 7, we discuss some optimal portfolio models which are more related to insurance. Finally, we have the Conclusion section.
2
Single-Period Markowitz Model
Suppose there are n + 1 risky securities traded in the capital market. For security i, 0 < i < n, it has a random return Ri over a single period. An investor having an initial wealth x0 have to decide how much to be invested in each security. If Ui is the amount invested in asset i, then we have the constraints ]C"=0 ui — xo and Ui > 0 for 0 < i < n. One period later, the wealth of the investor becomes n
xl = Y^uiRh
(1)
with mean Hxi) = l^Ui^Ki)
= u-ti[K)
V)
i=0
and variance Var(xi) = it'Eu,
(3)
where u
=
[u0,iti,...,wn]',
(4)
R E
= =
[Ro,Ri, • • • ,Rn}', E{RR'}-E{R)E(R').
(5) (6)
A rational investor would generally prefer high return to low return, and prefer low risk to high risk. In the Markowitz model, the variance of a portfolio is used as a measure of risk. Thus, investors
590
K. C. Cheung and H. Yang
will choose a portfolio that has a maximum expected return while keeping the portfolio variance below a certain tolerable level; and a portfolio that has a minimum variance while maintaining a desirable level of expected return. To translate the above idea to mathematics, we have the following mathematical formulations: Pl(
maximize subject to
u'E(R)
u'T,u < a n
1=0
P2(e)
m
> 0
minimize
u'Hu
subject to
u'E(R) n U
Yl i i=0
i = 0 , 1 , . . . ,n
> e =
X
0
Ui > 0
i = 0 , 1 , . . . ,n
In formulation Pl(cr), a represents the highest level of risk we can tolerate. Similarly, in the formulation P2(e), we want to achieve an expected return which is not less than e. Despite the two formulations given above, there are numerous variations and modifications. Interested readers may refer to Steinbach (2001) and Wang and Xia (2002).
3
Multi-Period Mean-Variance Model
In this section, we will examine how to generalize the single-period Markowitz model to multi-period case. The treatment given here follows closely the paper by Li and Ng (2000).
Investment Strategies for Financial and Insurance Portfolio
3.1
591
Model and Problem Formulation
Again, we assume that there are n + 1 risky securities in the market. Let e\, i = 0 , 1 , . . . , n , t = 0 , 1 , . . . , T — 1, be the random return of security i over the time period t to t + 1. An investor can alter the composition of his/her portfolio at the beginning of each time period. Let u\, i = 1,2,..., n , t = 0 , 1 , . . . , T — 1, be the amount invested in security i at time t, then the amount invested in security 0 at time t would be xt — SILi u\, where xt is the total wealth of the investor at time t. Clearly, we have
( =
n
\
n
i=l
/
i=l
e°txt + P'tut
t = 0,l,...,T-l,
(7)
where ut = K \ ^ , . . . , < ] ' , Pt = [ e t 1 - e t 0 , C t 2 - e ? > . . . , e ? - e t 0 ] ' .
(8) (9)
Lete 4 = [e°,ej,..., e"]'. It is assumed that et, t = 0 , 1 , . . . , T —1, are independent and have known mean and known covariance. We can then formulate the multi-period versions of Pl(cr) and P2(e): M-Pl (cr) subject to
M-P2(e) subject to
maximize
E(xT)
Var(x r ) < a xt+x = e®xt + P[ut t = o,i,...,r-i minimize E(xT) xt+i
Var(xT) > e = e\xt + P'tut £= 0,1,...,T-1
K. C. Cheung and H. Yang
592
Note that we do not prohibit short sale in the multi-period setting to make our life easier. Equivalently, we can have the following formulation: ~EP(u) subject to
maximize xt+i
E(xT) - co Var(x r )
= e°xt + P'tut,
t = 0,1,..., T — 1
In all the above formulations, the maximization and minimization are performed over all multi-period portfolio policy, which is a sequence of maps that map the wealth at time t, xt, to the vector ut. Proposition llfn* is a portfolio policy that solves problem EP(u), then (a) ir* solves the problem M-Pl(a) with a = Var(er) U* (b) 7r* solves the problem M-P2(e) with e = E(e T ) 1^* Based on this proposition, we can solve problem M-Pl(cr) and M-P2(e) once we can solve problem EP(a>). Thus we concentrate on how to solve problem EP(CJ) from now on.
3.2
Model Assumption
The following assumptions are made to simplify our model and to guarantee existence of closed form solution. (a) et, t = 0 , 1 , . . . , T — 1 are independent; (b) et, t = 0 , 1 , . . . , T — 1 have known mean and covariance; (c) E(ete't) is positive definite for t = 0 , 1 , . . . , T — 1. From Assumption (c), it is not difficult to deduce that (d) E(P t P/) is positive definite for t = 0 , 1 , . . . , T - 1; (e) E[(e°)2] -E(e°tP;)E-l(PtP;)E(e°tPt) > 0,for£ = 0 , 1 , . . . ,
T-1.
Investment Strategies for Financial and Insurance Portfolio
3.3
593
A Useful Auxiliary Problem
The multi-stage optimization problem EP(u;) is best tackled by the method of dynamic programming. Unfortunately, directly applying dynamic programming method is extremely difficult because of the nonseparability nature of the problem. Li and Ng (2000) embeded the problem EP(w) into an auxiliary problem which is solvable by dynamic programming. The auxiliary problem is A(A,u;) subject to
maximize
— LOE(X^) +
xt+i = e°txt + P[ut,
XE(XT)
t = 0,1,..., T — 1
Denote the solution set of problem EP(tj) and A(A, u>) by and 71^4 (A, u) respectively, that is TVEP(U) TTA(\,UJ)
= =
| 7T solves EP(w)}, {TT I 7r solves A ( A , C J ) } . {TT
TTEP(U)
(10) (11)
Proposition 2 Ifn* € nEP(uj), then TT* G7T A (l + 2 w E ( x r ) | 7 r .
,U).
Proposition 2 means that if a portfolio policy IT* is optimal for problem EP(cu), then this policy is also optimal for problem A(A, u) for some particular A. This enables us to solve problem EP(u>) in the following way: 1. Given fixed u, we first solve problem A(A,u;) for all A. The optimal portfolio policy 7r*, and hence E(xT)|7r» and Var(xT)|w» can be expressed in terms of A: 7r*(A),E*(zr)(A),Var*(a;T)(A). 2. In problem EP(a>), we maximize E(xT) — u;Var(a;T), which is then equivalent to max{E*(x r )(A) - wVar*(xr)(A)}.
(12)
K. C. Cheung and H. Yang
594
This unconstrained 1-dimensional optimization problem can then be solved, either analytically or numerically, to obtain an optimal A*. 3. The optimal portfolio policy for problem EP(o;) is then
TV*(X*).
Now, it remains to solve problem A(A, u). It turns out that A(A, u>) can be solved in a straightforward manner using standard dynamic programming method. The following proposition is due to Li, Chan, and Ng (1998): Proposition 3 The optimal portfolio policy ir* for the problem A(X,UJ) is given by: ut(xtn)
= -Ktxt + vti-y)
t=
0,l,...,T-l,
where 7 Kt
Mi)
=
A
(13)
U>
(14)
= E-\PtP[)E(e°tPt)
=
I (n1 §) v-\ptpi)E{Pt) t =
Q,l,...,T-l
1
= ^E" (PT_1P^1)E(Pr_1)
VT-M
A] = A\
=
E(e°t)-E(Pl)E-l(PtP;)E(e°tPt) t = 0,l,...,T-l 2 E[(e°) ]-E(e°P;)E- 1 (P t P/)E(e°P t ) *= 0,1,...,T-1.
(15) (16) (17) (18)
Using the portfolio policy IT* given above, it can be deduced that EMU = Var(x r )| „ =
A^o + ^7 0(7 - bx0)2 + cx\,
(19) (20)
595
Investment Strategies for Financial and Insurance Portfolio
where T-l
(21)
»
= £
(22)
nT_1
B
B\
\2Ylk=t+i
A1
t =
0,l,...,T-l
BT-I
B
T-l
(24)
=
Bt
=
a
=
b = c =
E ^ E ^ P ^ E ^ )
* = 0,1,
T - l
v v 2 ~
(25) (26) (27)
a T
(23)
At,
(28)
— fi — ab
T-l
r = t=on ^
(29)
It must be stressed that given u > 0, optimal portfolio policy TT*, E(xT)\n, and Var(x T )| 7 r , can be expressed in terms of 7, and hence A. Armed with the solution to problem A(A, u), we can now return to the problem we are interested in, EP(w). As we explained before, problem EP(u;) is equivalent to
=
maxAeR
{E*(x T )(A)-cuVar*(x T )(A)}
maxAeR
| / i x 0 + wy — ua(^y — bxo)2 — ix>cx\ \ A (j,Xo + v
maxAeR
A uia{~
LO
bx0f
UJCX,0
(30)
LO
The function inside the bracket is concave in A, hence A* can be obtained from the first order condition: 2a LO
— \ LO
bx0
= 0.
(31)
596
K. C. Cheung and H. Yang
This gives us A* = ui ( bx0 +
(32)
2atu
Substituting this A* into the optimal portfolio policy for problem A(A, to) yields optimal portfolio policy for problem EP(uO, that is Proposition 4 The optimal portfolio policy, n*(u>) , for problem EP(LO) is given by u*t(xt;u) = -Ktxt
+vt (bx0 + -—J
t = 0 , 1 , . . . ,T - 1
where Kt, ut are the same as in Proposition 3. Employing the above portfolio policy n*(to), the mean and variance of the terminal wealth will be given by
V
tix0 + u[bx0 +
= x0(n + bu)+ Var(x T )| ir*(u>) =
0(7* - bx0)2 bx0 + V 4:OU2
V
2au> 2
,
(33)
2au> + cxl
v 2au>
bx0
+ cx2Q.
CXo
(34)
The significance of the above expressions is that they express the optimal mean and variance of the terminal wealth in terms of the parameter u, which allow us to solve problem M-Pl(cr) and M-P2(e) based on Proposition 1. M-Pl(cr): Given a, we solve the equation a=Var(x T )| ) r . ( w ) = £ ^ + cx;
(35)
Investment Strategies for Financial and Insurance Portfolio v
for u, which gives u* = —,
597
„ . Then the optimal portfolio
policy for problem M-Pl(cr) will be given by 7r*(u;*), where 7r*(-) is the policy stated in Proposition 4. M-P2(e): Given e, we solve the equation £ = E(x r )| T . ( w ) = sofa + &!/) + —
(36)
2
1
for CJ, which gives u* — 2a<£, ''+bl/\x i • Then the optimal portfolio policy for problem M-P2(e) will be given by n*(ui*), where 7r*(-) is the policy stated in Proposition 4. A continuous-time version of the mean-variance model is also been developed. See Li and Zhou (1999) and Yong and Zhou (1999).
4
A Brief Review of Merton's Model
Samuelson (1969) extended the work of Markowitz to a dynamic model and considered a discrete time consumption investment model with objective of maximizing the overall expected consumption. He advocated a dynamic stochastic programming approach and succeeded in obtaining the optimal decision for a consumption investment model. In 1969, Merton first used the stochastic optimal control method in continuous finance. He extended Samuelson's model to a continuous set up. He was able to obtain closed form solution to the problem of optimal portfolio strategy under specific assumptions about asset returns and investor preferences. He showed that under the assumptions of geometric Brownian motion to the stock returns and HARA utility that the optimal proportion invested in the risky asset portfolio is constant through time. We will briefly review Merton's work below in an informal fashion, one may refer to Duffie (2001), Merton (1969, 1971, 1990), Korn and Korn (2001) for technical details and the derivations of results.
K. C. Cheung and H. Yang
598
Assume that there are two assets in the market: the stock (or stock portfolio) and bond (cash). Let S(t) and P(t) denote the stock price and bond price at time t respectively. Assume that St follows the geometric Brownian motion model: dS(t) = pS(t)dt + aS{t)dW(t).
(37)
where W(t) is standard Brownian motion. We assume that the bond earns the fixed interest rate r, thus follows the dynamics dP(t) = (rP(t) - C(t))dt,
(38)
where C(t) is the consumption process. Define the wealth process as X(t) = P(t) + S(t), then dX(t) = [rP(t) - C(t) + fiS(t)]dt + aS(t)dW(t).
(39)
Let 7r(£) = Y§) De m e proportion of amount of wealth invested in stock at time t, then 1 — n(t) be the proportion of amount of wealth invested in the risk free bond. We have the dynamics for the wealth process dX(t)
= [rX(t)-nr(t)X(t)-C(t) + fj,ir{t)X(t)]dt +an(t)X(t)dW(t) = {[r + (fi-r)7r(t)]X(t)-C{t)}dt +an{t)X(t)dW{t). (40)
We first consider the infinity time horizon. Assume that the total expected discounted utility function be given by: /•oo
e-psU(C(s))ds}
(41) Jo where U(-) is a utility function (refer to Definition 2), p is the discount rate. In the special case of power utility, J(C,TT)
= E{
U(x) = —, 7
0 < 7 < 1.
(42)
Investment Strategies for Financial and Insurance Portfolio
599
Our primary objective is to maximize J with respect to all admissible consumption-investment pair (C, ir). Merton (1969) obtained the closed form solution to this optimization problem:
<™ = { r V ^ f ^ + T ^ H *•<«> = J ^ r y
«*>
We now assume that the investor's time horizon ends at time T, where T is a fixed positive number. We assume that the investor is risk averse with utility function for consumption U(C). Let the total expected discounted utility function be given by: J(C,7r) = E{ [T e-'wC/1(C(s))ds + U2(X{T))}, (45) Jo where Ui(-) and U2(-) are two utility functions. Again our objective is to maximize J over all admissible pair (C.TT).
From Merton (1971), if we assume U^x)
— 7 C/2(x) = e - ^ e i - 7 ^ I 7 for 0 < 7 < 1, then there is a closed form solution: C*(t)
=
= [f(t)]^X(t)
™ =^ ?
(46) (47)
(48)
(49)
where
/W a
= [1+(^g-1)e;p{^-T)}]^ =
* =
/9-7[(^-r)
T^-.
1-7
2
/ 2 a 2 ( l - 7 ) + r]
(50 (51) (52)
(43)
600
K. C. Cheung and H. Yang
Later, Grauer and Hakansson (1982, 1985) used a discrete time approach to determine optimal asset allocations. They updated the joint distribution of asset returns every period and were able to incorporate time variation in the return distribution. Their conclusion was that active rebalancing among the major asset classes can substantially improve investment performance. Cox and Huang (1985, 1989) and Pliska (1986) introduced the martingale representation technique to deal with the consumption portfolio problem in continuous time under uncertainty. For a quite general class of utility functions, Cox and Huang showed that, in order to prove the existence of optimal controls, it suffices to check whether the parameters of a system of stochastic differential equations, derived completely from the price system, satisfy a local Lipschitz and uniform growth condition. This approach take cares of the non-negativity constraint on consumption in a simple and direct way. Brennan et al. (1997) have developed a continuous time model of strategic asset allocation that incorporates time variation in the expected returns of three major asset classes. These asset classes are a portfolio of stocks, long term bonds and cash. The authors use stochastic optimal control to solve the investor's problem and illustrate their solution with numerical examples. They find that the results are quite different from those obtained under a myopic investment policy (so called Tactical Asset Allocation).
5
Continuous-Time VaR Optimal Portfolio
5.1
Value-at-Risk
While variance is often used as a risk measure, it has certain obvious defect: variance measures only the variability of the wealth, while ignoring whether it is downside or upside. A more popular risk measure in recent years in finance industry is Value-at-Risk (VaR),
Investment Strategies for Financial and Insurance Portfolio
601
which focuses more on the downside risk 1. Roughly speaking, VaR measures the portfolio loss that will occur over a certain period of time, that is exceed with a small probability (Duffie and Pan (1997)). Definition 1 lfx(0) is the initial portfolio value (at time 0) and x (T) is the value of the portfolio at time T, which is a random variable. The profit/loss random variable is then x(T) — x(0). Value-at-Risk at level a, a € [0,1] is defined as VaR a (x(T)) =' - inf { x G R | P(X(T) - x(0) < x) > a}. In particular, if x(T) is continuous, then P(z(0) - x(T) < VaRa{xT))
= 1 - a.
(53) (54)
Usually, a is chosen to be small, say 0.01. The VaR associated with such a small a is positive. A large VaR signifies high risk. Degree of risk aversion is represented by the value of a: smaller a means more risk averse.
5.2
Model and Problem Formulation
Let T, T > 0, be our investment horizon. Suppose that there are n + 1 securities traded in the capital market whose prices Pi, i = 0 , 1 , . . . , n, are assumed to follow dP0(t)
= P0(t)r(t)dt
dPi(t)
= Pi{t) biW +
(55) f^CTijMdWjit)
i = l,2,...,n Pi(0) = Pi « = 0 , l , . - - , n
(56) (57)
'Recently, VaR has been criticized on theoretical grounds. The first authors to clearly articulate the theoretical problems with VaR were Artzner et al. (1999). They showed that the traditional VaR measure violated one of the four desirable axioms they developed for a coherent risk measure.
602
K. C. Cheung and H. Yang
where W(t) = [Wi(t),.. .,Wn(t)]', t £ [0,T], are n-dimensional standard Brownian motion defined on a filtered probability space (£), T, {Ft}, P). It is assumed that the processes r(t), 6(t) = [61 ( t ) , . . . , 6„(*)]' and
Vx e R n , Vt € [0, T].
> Kx'x
(58)
If we use TTi(t),i = 1, 2 , . . . , n, to denote the proportion of the total wealth of an investor invested in asset i at time t, it is required that the portfolio process ir(t) = [ni(t),..., nn(t)]', t € [0, T], has to be measurable, adapted and the corresponding wealth process xn(t), t G [0, T] satisties V [T(x*{t)TTi(t))2dt
a.s.
(59)
Vie[0,T].
(60)
i=iJo
xw(t)>0
a.s.
Such portfolio process is called admissible and we use the notation A{x) to denote the set of all admissible portfolio processes with an initial portfolio value x. Based on the price dynamics of the securities and the definition of portfolio process 7r, the wealth process xn would follow = xn(t)[n(t)'(b(t)-r(t)l) +xn(t)7r(t)'a(t)dW{t) 0^(0) = x(0)
dx*(t)
+ r(t)]dt (61) (62)
where 1 = ( 1 , 1 , . . . , 1)' is a constant column of 1, and x(0) is the initial wealth of the investor. In the previous sections, the performance of a portfolio policy is simply measured by the expected value of the terminal wealth. A popular alternative is to use the expected utility of the terminal wealth.
Investment
Strategies for Financial and Insurance
Portfolio
603
Definition 2 A utility function U : [0, oo) —-> H is a continuously differentiable function such that U is strictly increasing, strictly concave, and satisfies lim U'(w) = 0, (63) w—>oo
x
lim U'(w)
= oo.
(64)
w—>0
Using VaR as a portfolio policy constraint and expected utility of terminal wealth as a measure of performance, we can formulate a optimal portfolio problem as follows: U-VaR(5,a) subject to
maximize
VaR Q (^(T)) dxn(t)
E[U(x*{T))]
< x(0) - x = x*(t)[Tr{t)'(b(t)-r{t)l) +xn(t)ir{t)'a{t)dW(t)
TT G
+ r(t)]dt
A'(X(0)),
where A'{x) = {TT G A(X) I E[U(x*{T))-} < oo}.
(65)
In this formulation, we restrict the VaR to be not greater than a certain ceiling, namely x(0) — x, where x is specified exogenously. Combining with the definition of VaR, the VaR constraint is equivalent to P{x*{T)>x)>l-a, (66) which is operationally more convenient. It should be noted that the maximization in problem U-VaR(x, a) is performed over all admissible portfolio policies which satisfy the integrable condition E[U(xv(T))~] < oo. This condition guaranteed the existence of maximum while allowing +oo. It is interesting to observe that when a = 0, we are then requiring the terminal wealth to be greater than the lower bound x with probability 1, which can be regarded as an insurance. On the other extreme, if a is chosen to be 1, then we are indeed imposing no VaR restriction. Thus a reflects our degree of risk aversion towards VaR.
K. C. Cheung and H. Yang
604
5.3
Solution Approach
The continuous-time, utility-based optimal portfolio problem can be solved by adopting the martingale approach (e.g., Karatzas et al. (1987), Korn (1997)), which decomposes the original problem into a static maximization problem and a portfolio representation problem. It is the static maximization problem that interests us: U-VaR'(x, a)
maxx(T)eB
subject to
E[U(x(T))}
E[H(T)x(T)] P(x(T) >x)
< >
x(0) 1-a,
where H = {H{t) | 0 < t < T} is the state price density process that has the dynamics dH(t) = -H(t) [r{t)dt + {a{t)'(b(t) - r{t)\))'dW{t)]
(67)
with H(Q) = 1, and B is the set of all positive, FT - measurable random variables B with E[U(B)-] < oo.
(68)
Once problem U-VaR'(5, a) is solved (as indicated by the next proposition), that is the optimal terminal wealth B is decided, then there is always a portfolio process n G ^4(x(0)) such that x*(T) = B
P-a.s..
(69)
This is a property of our market model, known as completeness (see, e.g., Korn and Korn (2001)). Before attacking problem U-VaR'(5, a), it may be instructive to consider the following problem first: U-max
maximize subject to
E[[/(x7r(T))] TV e
A'(x(0)).
Investment Strategies for Financial and Insurance Portfolio
605
The main difference between problem U-max and U-VaR(x, a) is that this problem does not have the VaR constraint. The reason for investigating this problem first is that its solution may give us a hint on what the solution to problem U-VaR(£, a) looks like. By the completeness of the market model, problem U-max can again be decomposed into a static maximization problem (finding the optimal terminal wealth), and the portfolio representation problem. Parallel to problem U-VaR'(:r, a), we have the static maximization problem: U-max'
max^Des subject to
E[U(x(T))}
E[H{T)x(T)] < x(0)
The solution to problem U-max' is well known (e.g., Karatzas et al. (1987), Korn and Korn (2001)), which is given by x*(T) = I(yH(T)),
(70)
where /(•) is the inverse function of U'(-), and y is a positive number that solves E(I(yH(T))) = i(0). (71) Returning to problem U-VaR'(x,o;), if the optimal terminal wealth of problem U-max', that is I(yH(T)), satisfies the VaR constraint in problem U-VaR'(£, a), then we are done. However, it may happen that I(yH(T)) falls below the bound x too often (that means with probability greater than a), then one may guess that we can "lift up" I(yH{T)) to x on part of the ranges of H(T) with I(yH(T)) < x such that the resulting altered I(yH(T)) can satisfies the VaR constraint. It turns out that our plan works, as indicated by the next result, which is due to Basak and Shapiro (2001). It states the solution to problem U-VaR'(x, a), assuming existence: Proposition5 Denoted as x*(x,a), the optimal terminal wealth that solves problem U-VaR'(x, a) is given by I(yH(T)) x*(x, a) = { x I(yH(T))
ifH(T)
K. C. Cheung and H. Yang
606
where /(•) is the inverse function of U'(-), H_ = ^1, H is such that P(H(T) > H) = a, and y is a positive real number that solves E[H(T)x*(x,a)} = x(0). A sketch the of proof is given in Appendix 1. For the details, see the paper by Basak and Shapiro (2001). As we noted before, there should exist an admissible portfolio process corresponding to the above optimal terminal wealth x*. However, such portfolio process is difficult to obtain analytically for general utility functions. Nevertheless, explicit solution for the optimal portfolio process can be obtained if we assumed the utility funcw1 0 < 7 < 1. Details can tion U to be a power function: U(w) = —, be found in Basak and Shapiro (2001)
6
Continuous-Time CaR Formulation
In addition to variance and Value-at-Risk, Capital-at-Risk (CaR) is another popular risk measure. In this section, we will discuss two optimal portfolio problems concerning CaR. Roughly speaking, CaR measures how much will be lost under adverse market situation. A precise definition can be found in the next subsection.
6.1
Model and CaR
The market model is very similar to that in the previous section. Let T be our investment horizon and there are n + 1 securities traded in the market whose prices Pi(t), i = 0 , 1 , . . . , n follow the dynamics: dP0(t)
= P0(t)rdt
(72) n
dPi(t)
= Pi{t) bi +
Y^VijdWjit) 3=1
i = l,2,...,n Pi{0) = Pi i = 0,l,...,n
(73) (74)
607
Investment Strategies for Financial and Insurance Portfolio
where W(t) = [Wi(t),.. .,Wn(t)}', t G [0,T], are n-dimensional standard Brownian motion defined on a filtered probability space (Q, J7, {Ft}, P). It is assumed that the n x n matrix a = {crij}i r for i = 1, 2 , . . . , n. The main difference between this model and the model used in the previous section is that the riskless interest rate r, the stock appreciation rate b, and the stock volatility a no longer vary with time. This simplifies the model and makes it easier to deal with. Also, we will restrict ourselves to those admissible portfolio policy that are constant over time, that is 7r(£) = 7r for all t G [0, T]. The wealth process corresponding to a constant portfolio policy will then follow dxn(t) x*(0)
= x*(t)[ir'(b-rl) = x(0).
+ r]dt + x*(tW
(75) (76)
Solving the above stochastic differential equation (SDE), we obtain for t e [0, T] II
'
I [2
x*(t) = x(0) exp{(7r'(6 - rl) + r - ^-p-)t
+ 7r'aW{t)},
(77)
where || • || is the Euclidean norm in R n . Capital-at-Risk (CaR) at level a, a G [0,1], of a constant portfolio policy 7r is defined by CaR a (x(0), 7T, T) d= x(0)erT - ha(x{0), TT, T),
(78)
where ha{x{0), TT, T) d= mi{z G R | P(xn{T) < z) > a}
(79)
is the a-quantile of xw(T). If we invest all of our wealth in the riskless security (security 0), then we will obtain a sure amount of x(0)e rT at time T. Thus CaR somehow measures the potential loss of an investment strategy relative to the riskless investment strategy.
K. C. Cheung and H. Yang
608
Observe that 7r'crW(r) = iV(0,T||7rV||2),
(80)
and hence we have x*{T)
= z(0)exp{(7r'(&
rl) + r
" 7 r V " )T
+Vf\\n'a\\Z}.
(81)
Here, Z is a standard normal random variable. One immediate consequence of this observation is that E(e^w^)
= e3ll*'"llar
(82)
and hence E(x*(T)) = x(0) exp{(7r'(6 - r l ) + r)T}.
(83)
Let za be the a-quantile of a standard normal random variable, we have the following expressions ha{x(0),ir,T)
=
x(0)exp{(7r'(fe-rl)+r- "
" )T
+y/T\\-K'a\\za},
(84)
and CaR Q (x(0),7r,T)
=
x(0)erT -x(0) exp{(7r'(6 - r l ) + r -
~ ^ ) T
+VT\\n'a\\za} =
x(0)e rT [l - exp{(7r'(6 - r l ) -
=
+VT||7r'a||z a }] x(0)erT[l - efM],
where II
f
112
/(TT) = ^ ( 6 - r i ) T - H I L ^ T + Vr||7rV||^ a .
^^-)T
(85)
609
Investment Strategies for Financial and Insurance Portfolio
6.2
Problem Formulation
It is obvious that CaR a (x(0),7r,T) is bounded above by x(0)erT. Using CaR as a risk measure, we may formulate two optimal portfolio problems. The first problem is to minimize CaR a (x(0), n, T) over all constant portfolio policy TV, without any constraint; the second problem is to maximize the expected value of the terminal wealth while placing a upper bound on CaR. More precisely, we formulate Min-CaR CaR(C)
min*ew» max i e B « subject to
CaR a (x(0),7r,T) E(x7r(T))
CaR a (x(0), TT, T) < C
where the ranges of C and a are given by: 0 < a < 0.5,
C < x(0)erT.
(86)
Requiring a to be less than 0.5 is quite natural, since CaR is supposed to measure potential loss rather than potential gain. It is typical to choose a small a, say 0.01. Regarding the range of C, since CaR a (x(0), 7r, T) < x(0)erT, if C is greater than x(0)e rT , the CaR constraint is always fulfilled no matter which policy IT is adopted. To avoid such triviality, we have the stated range for C. The following two propositions are due to Emmer, Kliippelberg and Korn (2001). Proposition 6 Let 9 = \\cr~1 (b—rl) || andir* be the optimal constant portfolio policy that solves problem Min-CaR. Assume a < 0.5. (a)
If hi — r for i = 1,2,... ,n, then TT* = 0 and
(b)
CaRQ(z(0),7r*,T) = 0.
(87)
If hi ^ r for at least one i, and 8y/T < \za\, then TT* = 0 and
CaRa(x(0),ir*,T)
= 0.
(88)
K. C. Cheung and H. Yang
610
(c)
If hi 7^ r for at least one i, and Q\[T > \za\, then \z I \ n*={l--^Lj(ao-rL(b-rl), CaRQ(x(0),7r*,T) = x(0)erT {l - eW^-\*«\)2}
(89) .
(90)
Proof of Proposition 6 is given in Appendix 2. It can be observed that if our investment horizon T is short enough such that 9y/T < \za\, and if our objective is to minimize risk (measured by CaR), then the result in part (b) of Proposition 6 tells us that we should investment all the money in the riskless security. Proposition 7 Let 6 = ||cr-1(6 — r l ) | | and assume that bi ^ r for at least one i. Further assume that C satisfies 0
x{0)e
2
if 0 VT <\za\, rT
(l - e K ^ - M ) ) < C < x(0)e
(91)
ifOVT > \za\. (92)
Then the optimal constant portfolio policy for problem CaR(C) is given by: TV =— [aa')~\b-rl),
(93)
1
where l\
2
-2cT,
in 1
T
f - wr ) •
The proof of Proposition 7 is given in Appendix 3.
(94)
(95)
Investment Strategies for Financial and Insurance Portfolio
7
611
Optimal Investment Strategy for Insurance Portfolio
In this section, we will briefly discuss some optimal insurance portfolio problems. Browne (1995) studied the portfolio selection problem for an insurance firm, modeling the reserve process and the risky asset by a Brownian motion with drift and a geometric Brownian motion respectively. The result of Browne showed that the optimal investment strategy was to invest a constant amount of money in the risky asset regardless of the surplus level. Asmussen and Taksar, (1997) examined a similar problem. They assumed that the surplus process follows Brownian motion with drift where the control variable is the dividends. They were able to obtain closed form solutions. All these results were implied by the modelling of the reserve process using a Brownian motion with drift instead of the traditional compound Poisson model. The more intuitive compound Poisson model, widely used in classical models of risk theory as a model of aggregate claim, may be more appropriate for modelling the aggregate claim size. Recently, a new continuous-time model of portfolio selection was developed. In this model, the stock price is modeled by a geometric Brownian motion and the aggregate claim is modeled by a compound Poisson process. The standard stochastic control theory including the use of the Hamilton-Jacobi-Bellman (HJB) equation could then be used for the derivation of the optimal investment strategy. Hipp and Plum (2000) proved that using this model, optimal investment in stock is not constant and that optimal investment would become infinity for large surplus if the claim size distribution was fat-tailed. The works of Hipp (2000) and Schmidli(2001) generalized the model further. We assume there are only two assets in the financial market: a risk-free bond and a risky asset, namely stock. Although it is possible for the interest rate in the real world to go negative, it seldom persists for a long time. Therefore, the risk-free interest rate is assumed to
612
K. C. Cheung and H. Yang
be non-negative in the model. As in the usual set-up of the financial literature, the price of the stocks is modeled as a geometric Brownian motion. To express it mathematically, the price of the risk-free bonds P{t) follows dP{t) = rP(t)dt (96) where P(t) is the price of the risk-free bonds at time t, r is the riskfree interest rate (r > 0). Stock price S(t) is assumed to follow dS{t) = nS(t)dt + aS(t)dW(t)
(97)
where fj, is the instantaneous rate of return of the stock, a (a > 0) is the volatility of the stock, and { W(t) : t > 0} is a standard Brownian Motion defined on a filtered complete probability space (fi,^,^i,P). Next we assume the risk process follows the Cramer-Lundberg model, i.e., the aggregate claim is modeled as a compound Poisson process A(t) with constant Poisson intensity A. In mathematical terms, the risk process follows dR(t) = cdt - dA(t), R(0) = s
(98)
where c is the premium rate and Nt
A(t) = Y/Yi
(99)
i=i
where Nt is the number of claims from time 0 up to time t, which follows a Poisson process with intensity A and V^'s are individual claim sizes. The claim size distributions are usually modeled as Exponential, Gamma, Lognormal or Pareto in risk theory. As in the risk theory literature, c is set to equal (1 + 9)XE[Y], where 9(> 0) is the security loading. This assumption may be relaxed if other forms of premium are more appropriate. Finally all these are put together to specify the stochastic differential equation governing the real-valued surplus process (or the process for net worth) X(t). Let 7r(t) be the total amount of money
613
Investment Strategies for Financial and Insurance Portfolio
invested in the stock at time t by the investor. We have to assume that 7r(t) is locally bounded so that the corresponding surplus process X"(t) is well-defined. We must also restrict ir(t) to the set of admissible policies, i.e., {ir(t), t > 0} is a progressively measurable, with ir(t)]2dt < oo (100) o a.s. for every T < oo. We can then specify the dynamics of X(t) by
L
dX(t) = ? r ( t ) ~ y + (X(t) - * ( t ) ) ^ + dR(t),
(101)
or, more clearly, dX(t) X(0)
= [(n-r)n(t)+rX(t) +air(t)dW(t)-dA(t), = s.
+ c]dt (102) (103)
If our objective is to minimize the ruin probability (or any other sensible objectives), by using the stochastic control method, first the HJB equation associated with the problem can be obtained. Solving the HJB equation (most of the time by numerical methods), we can obtain the optimal investment strategy. For detailed discussion on this subject, see, for example, Hojgaard and Taksar (2000) and Taksar (2000). Optimal portfolio selection problem in life annuity have also been discussed in the literature, for example in Charupat and Milevsky (2002).
8
Conclusions
The asset allocation problem is of both of theoretical interest and practical importance. In this chapter, the classical single-period Markowitz model and Merton's model are briefly reviewed. The formulations and solutions of some other optimal portfolio problems,
614
K. C. Cheung and H. Yang
including multi-period Mean-Variance model, VaR-based model, and CaR-based model are discussed. Optimal portfolio model related to insurance industry is also briefly discussed.
Acknowledgments The work described in this chapter was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU 7139/01H).
615
Investment Strategies for Financial and Insurance Portfolio
Appendix 1 Proof of Proposition 5 Step 1: It is easy to check that x*(x, a) £ B and satisfies the two constraints stated in problem U-VaR'(5, a) with equality, that is E(H(T)x*(x,a)) P{x*{x,a) >x) Step 2:
= x(0),
(104)
=
(105)
I-a.
For given fixed H(T), x*(x, a) solves the problem: max{f/(x) - yH(T)x + yil{x>i}},
(106)
where d yi
=fU(I(yH))-yHI(yH)-U(3:)+yHx
(> 0).
(107)
The proof of step 2 to is easy and involves only elementary calculus, so we omit it here. Step 3: For a general x G B satisfying the two constraints in problem U-VaR'(x, a), we have E(H{T)x) P(x>x)
< x(0), > 1 - a.
(108) (109)
By the results in steps 1 and 2 and these two inequalities, E(U(x*(x,a)))-E(U(x)) = E(U(x*(x,a)))-E(U(x)) -yx(0) + yx(0) + T/I(1 -a)-yi(l> E(U{x*{x,a)))-E{U{x)) -E(yH(T)x*(x,a)) + E(yH(T)x)
a)
+ E(2/il{i»(i, a )>i}) - E(yil{ x >x})
>
0.
This completes the proof of Proposition 5.
(110)
616
K. C. Cheung and H. Yang
Appendix 2 Proof of Proposition 6 Because CaR a (z(0),7r,T) = x(0)erT[l - e^% minimising CaRa(a;(0), 7r, T) is equivalent to maximising f(ir). Part (a) is trivial. Parts (b) and (c): For each 7r e R n , there must be some nonnegative e such that |TT'CT11 = e. Hence we first maximize f(n) over all n with ||7r'a|| = e. For fixed e, on the ellipsoid ||7rV|| = e, f(7r) = Tr'(b-rl)T-jT+VTeza,
(111)
which is linear in TX. To maximize such linear function with constraint ||7r'a|| = e, we can employ the Lagrange-Multiplier method. It is not difficult to derive the solution: £
-(oa')-l(b-rl).
7r*£ =
(112)
Thus the maximum value of / over all 7r with ||7rV|| = e is given by
/«)
= « ) ' ( & - rl)T - S\ + VTeza =
£
-{b-rl)'{oor\b-rl)T-£^T+jTsza
= £-{a-\b - rl))\a-\b - rl))T - ^ T + Vfeza = £-92T -£-^T+ =
Vfsza
- — T + e{6T - \za\y/T)
since za<0.
(113)
Note that f(ir*) is expressed in terms of s. To obtain the optimal constant portfolio that maximize f{ir), all we need to do is to maximize /(TT*) over all nonnegative e, which is easy since /(7r*) is quadratic
Investment Strategies for Financial and Insurance Portfolio
617
in e. Depending on the sign of the coefficient of e in f(ir*), f{ir*) attains its maximum value at different e*, as stated in the proposition. This completes the proof of Proposition 6.
618
K. C. Cheung and H. Yang
Appendix 3 Proof of Proposition 7 The proof is similar to that of Proposition 6. Step 1: From Proposition 6, the minimum value of CaRa(a:(0), IT, T) under situations Oy/T < \za\ and Q\fT > \zQ\ are 0 and x(0)erT (l — ez^^ - ! 1 "!) 2 ] respectively. Thus we have the stated ranges of C to ensure that the CaR constraint is effective. Step 2: Since E(xn(T)) = x(0) exp{(7r'(6 - r l ) + r)T}, maxE(a;*(T))<=^inax7r'(&-rl). 7r€R n
(114)
7rGR n
Step 3: We have the following equivalent statements of the CaR constraint in problem CaR(C): CaR a (o;(0),7r,T)
~
/w s ln
T
c
(' - wf ) -
<=> ir\b-rl)T>c+l-^-p-T-za\\7r'a\\VT.
(115)
Step 4: From the previous proposition, over the ellipsoid ||7rV|| = e, £
-(aa')-l{b-rl)
K*e =
(116)
maximize ir'(b — r l ) . Adopting this portfolio policy, E(x^(T)) e{e6T+r)T^ w n j c [ 1 j s a n increasing function in e.
=
Step 5: Let 7re be any constant portfolio policy on the ellipsoid ||7r'cr|| = e such that it satisfies the CaR constraint (that is f(ne) > c), then the constant portfolio policy 7r* also satisfies CaR constraint, for /«)
> fM
> c,
(117)
619
Investment Strategies for Financial and Insurance Portfolio
since 7r* maximize f(n) on the ellipsoid ||7r'cr|| = e. Step 6: By the result of step 5, the optimal constant portfolio policy must be of the form of IT* for some nonnegative e. In step 4, we have shown that E(xn*(T)) is an increasing function of e. Thus we want to choose e as large as possible while satisfying the CaR constraint (that is the last inequality stated in step 3). Therefore, the optimal constant portfolio policy is £
< >w,
*i. = 1(
(118)
where e* is the largest possible e that satisfies « ) ' ( & - r\)T >c+£-T-
zaeVT.
(119)
Using the expression of TT* given in step 4, this inequality is equivalent to -T-e(9T-\za\Vf)
+ c<0,
(120)
which is quadratic in e. The largest possible value of e that satisfies this inequality is thus 2
e* = 8 - i^L +
9 - —^ ) - 2cT.
(121)
Step 7: Using the assumption on the ranges of C, it is trivial to verify that e* given above is real and is nonnegative. This completes the proof of Proposition 7.
620
K. C. Cheung and H. Yang
References Artzner, P., Delbaen, R, Eber, J., and Heath D. (1999), "Coherent measures of risk" Mathematical Finance, vol. 9, pp. 203-228. Asmussen, S. and Taksar, M. (1997), "Controlled diffusion models for optimal dividend pay-out" Insurance: Mathematics and Economics, vol. 20, pp. 1-15. Basak, S. and Shapiro, A. (2001), "Value-at-risk-based risk management: optimal policies and asset prices," The Review of Financial Studies, vol. 14, pp. 371-405. Brennan, M.J., Schwartz, E.S., and Lagnado, R. (1994), "Strategic asset allocation," Journal of Economic Dynamics and Control, vol. 21, pp. 1377-1403. Browne, S. (1995), "Optimal investment policies for a firm with a random risk process: exponential utility and minimizing the probability of ruin," Mathematics of Operations Research, vol. 20, pp.937-958. Browne, S. (1997), "Survival and growth with liability: optimal portfolio strategies in continuous time," Mathematics of Operations Research, vol. 22, pp. 468-493. Chan, T.F., Li, D., and Ng, W.L. (1998), "Safety-first dynamic portfolio selection," Dynamics of Continuous, Discrete and Impulsive Systems, vol. 4, pp. 585-600. Charupat, N. and Milevsky M.A. (2002), "Optimal asset allocation in life annuities: a note," Insurance: Mathematics and Economics, vol. 30, pp. 199-209. Cox, J.C. and Huang, C.F. (1985), A variational problem arising in financial economics, Mimeo, Sloan School of Management, M.I.T..
Investment Strategies for Financial and Insurance Portfolio
621
Cox, J.C. and Huang, C.F. (1989), "Optimal consumption and portfolio policies when asset prices follow a diffusion process," Journal of Economic Theory, vol. 49, pp. 33-83. Duffie, D. (2001), Dynamic Asset Pricing Theory, 3rd ed., Princeton University Press. Duffie, D. and Pan, J. (1997), "An overview of value at risk," Journal of Derivatives, vol. 4, pp. 7-49. Emmer, S., Kltippelberg, C , and Korn, R. (2001), "Optimal portfolios with bounded capital at risk," Mathematical Finance, vol. 11, pp. 365-384. Grauer, R.R. and Hakansson, N.H. (1982), "Higher return, lower risk: historical returns on long run actively managed portfolios of stocks bonds and bills: 1936-1978," Financial Analysts Journal, March-April. Graue, R.R. and Hakansson, N.H. (1985) "Returns on levered, actively managed long run portfolios of stocks, bonds and bills: 1934-1983," Financial Analysts Journal, September-October. Hipp, C. (2000), "Optimal investment for investors with state dependent income, and for insurers," Working paper No. 82, Centre for Actuarial Studies, The University of Melbourne. Hipp, C. and Plum M. (2000), "Optimal investment for insurers," Insurance: Mathematics and Economics, vol. 27, pp. 215-228. Hojgaard, B. and Taksar M. (2000), "Optimal dynamic portfolio selection for a corporation with controllable risk and dividend distribution policy," submitted to Mathematical Finance. Karatzas, I., Lehoczky, J.P., and Shreve, S.E. (1987), "Optimal portfolio and consumption decisions for a 'small investor' on a finite horizon," SI AM Journal of Control and Optimization, vol. 25, pp. 1557-1586.
622
K. C. Cheung and H. Yang
Korn, E. and Korn, R. (2001), Option pricing and portfolio optimization: modern methods of financial mathematics, American Mathematical Society. Korn, R. (1997), Optimal Portfolios, World Scientific, Singapore. Li, D. and Ng, W.L. (2000), "Optimal dynamic portfolio selection: multiperiod mean-variance formulation," Mathematical Finance, vol. 10, pp. 387-406. Li, D. and Zhou, X.Y. (1999), "Explicit efficient frontier of a continuous-time mean-variance portfolio selection problem," in Control of Distributed Parameter and Stochastc Systems, Chen, S. et al. (eds.). Markowitz, H. (1952), "Portfolio selection," Journal of Finance, vol. 8, pp. 77-91. Merton, R.C. (1969), "Lifetime portfolio selection under uncertainty: the continuous - time case," Review of Economics and Statistics, vol. 51, pp. 247-257. Merton, R.C. (1971), "Optimum consumption and portfolio rules in a continuous-time model," Journal of Economic Theory, vol. 3, pp. 373-413. Merton, R.C. (1990), Continuous Time Finance, Basil Blackwell, Cambridge, MA. Pliska, R. (1986), "A stochastic calculus model of continuous trading: optimal portfolios," Math. Operations Research, vol. 11, pp. 371-382. Samuelson, RA. (1969), "Lifetime portfolio selection by dynamic stochastic programming," Review of Economics and Statistics, vol. 51, pp. 239-246.
Investment Strategies for Financial and Insurance Portfolio
623
Schmidli, H. (2001), "On minimizing the ruin probability by investment and reinsurance," Annals of Applied Probability, to appear. Steinbach, M.C. (2001), "Markowitz revisited: mean-variance models in financial portfolio analysis," SI AM Review, vol. 43, pp. 3185. Taksar, M. (2000), "Optimal risk and dividend distribution control models for an insurance company," Mathematical Methods of Operations Research, vol. 1, pp. 1-42. Wang, S. and Xia, Y. (2002), Portfolio Selection in Asset Pricing, Springer, New York. Yong, J. and Zhou, X.Y (1999), Stochastic Controls: Hamiltonian Systems and HJB Equations, Springer, New York.
This page is intentionally left blank
Chapter 18 The Algebra of Cash Flows: Theory and Application W. Hurlimann
The analysis of deterministic discrete-time cash flows is reduced to the analysis of the properties of a non-commutative real polynomial ring. Composition formulas for the probabilistic measures of duration and convexity of cash flow products are derived. To immunize a liability cash flow product against convex shift factors of the term structure of interest rates, it suffices to immunize its factors serially. Moreover, the cash flow risk of a product identifies with the convolution of its serial cash flow risk components. Several concrete examples illustrate the discussed results.
1
Introduction
The financial success of a business is often reduced to the analysis of its present and future cash flows. In the special situation of deterministic discrete-time cash flows, this analysis can be done using polynomial algebra. We discuss some new and interesting algebraic and probabilistic aspects of deterministic cash flow analysis, which are motivated by several concrete examples. Following the modern pricing methodology, the current arbitrage-free price of a cash flow is uniquely determined by the current prices of all zero coupon bonds in a frictionless, competitive and discrete trading economy. This insight uses in part the fundamental term structure of interest rates (abbreviated TSIR), whose
625
626
W. Hurlimann
basic properties are summarized in Section 2. In this framework, a cash flow is uniquely determined by its price-discount function (Theorem 1), which turns out to be a finite degree polynomial in the real polynomial ring R[x], where the variable x can be interpreted as the one-period discount factor of the cash flow. Since R[x] is algebraically a unique factorization domain, every discretetime cash flow is a product of one- and two-period cash flows (Theorem 2). However, the multiplication of cash flows can be given different alternative economic contents. Following a recent proposal (Costa 1997), we show in Section 4 that unless the term structure is flat, the considered multiplication of cash flows is not commutative. This leads in general to a non-commutative polynomial ring {R[x];+,*}, whose properties should be exploited in the practical analysis of cash flows. A main subject of considerable interest, to which the above concept applies, is immunization theory. Two key notions are the duration and convexity of a cash flow, which are usually defined analytically. In the often encountered special situation of non-negative cash flows, for example in asset and liability management, these notions can be defined in a probabilistic way, as shown in Section 5. The analytical and probabilistic measures of duration and convexity are equal for flat term structures, but differ slightly in general. In several simulations of the approximations to the price change of a cash flow under a shift of the term structure, the probabilistic approximations are more accurate than the analytical ones. For this reason, we prefer the probabilistic notions. The composition formulas for the analytical measures of duration and convexity of cash flow sums and products are formulated for the probabilistic measures (Theorems 4 to 6). The alternative approach allows for a very precise probabilistic interpretation of the cash flow risk associated to a cash flow product. As shown later in Theorem 10, the cash flow risk of a product identifies with the convolution of its serial cash flow risk components. Sections 6 recalls the required main facts from deterministic immunization theory, which are based on the simplifying notion of
The Algebra of Cash Flows: Theory and Application
627
stochastic convex order. Then, in Section 7, we show how to immunize simply a liability cash flow product against convex shift factors of the term structure. It suffices to immunize its factors serially at the delivery dates of associated forward contracts. Finally, Section 8 illustrates the concepts with some examples.
2
Term Structure of Interest Rates
Recall some basics of fixed-income securities and arbitrage-free pricing methodology as exposed in many excellent textbooks (Jarrow 1996, Pliska 1997, Panjer et al. 1998). One considers a frictionless, competitive and discrete trading economy with trading dates {0,1,2,...,T}, where T is the time horizon. The risky traded securities in the economy are zero coupon bonds of all maturities T e{\,2,...,T} and a bank account. The zero coupon bond with maturity T is the security that pays one unit at time r. Its price at time / is a stochastic process denoted P{t,r), t = 0,1,....r. For a fixed time t the collection {P(t, t + 1), P(t, t + 2), ..., P{t, T)} of all zero coupon bond prices is called term structure of zero coupon bond prices. Later on in the chapter, only the current time t = 0 is of interest, in which case we write P(T) instead of P(0,T). Figure 1 is a price/time diagram. Time T t+2 t+\ t
| P(t,t+l)
P(t,t+2)
P(t,T)
Figure 1. Term structure of zero coupon bond prices.
Price
628
W. Hiirlimann
Another important notion is the time s forward interest rate for the period [t, t + 1) defined and denoted by
f(s,t) =
P(s,t) -1 P(s,t + l)
(1)
This corresponds to the rate contracted at time s for a riskless loan over the time period [t, t + 1), sometimes denoted ist+l. Figure 2 is an interest/time diagram. Time f+1
t I
2 1 0
At)
Ah t)
fat)
A2,t)
At,i)
Interest
Figure 2. Forward interest rates.
For fixed s the function P(s, t) is strictly decreasing in t, hence one has/fo 0 > 0- The special case s = 0, often encountered in applications, is abbreviated P(t) 1 (2)
'w-ixTM)- -
Substituting (1) one sees that the zero coupon bond prices are determined by the forward rates as follows
P(s,t)=tl k=s+\
P(s,k-1) P(s,k)
= fl[l+ /(*,*-!)]"
(3)
k=s+l
Recall that a forward contract is a financial security obligating
629
The Algebra of Cash Flows: Theory and Application
the purchaser to buy a commodity at a prespecified price, called forward price, and at a prespecified date, called delivery or expiration date. Since no money changes hands until delivery, the contract has zero value at the time it is initiated. The time s forward price of a forward contract issued on a zero coupon bond with maturity T and delivery date t is denoted F(s, t, T), where necessarily s
F(s,t,r) = ^ F(S,t,T) = flF(s,k,k
+ l) = fl[l + As,k)}-'
k=t
(4) .
(5)
k=t
For the current time s = 0 the forward price is simply denoted F(t, T). If further t = 0, one has F(0, r) = P{T). Furthermore, one considers the yield rate at time s on a zero coupon bond with maturity r, which is defined and denoted by
y(s,T) = -—^--\n{P(s,T)}.
(6)
(T-S)
It can be viewed as an instantaneous interest rate, often denoted 5. Similarly, the forward yield rate at time s on a zero coupon bond forward contract with maturity T and delivery date t is defined and denoted by
For practical purposes, if s = 0, one writes simpler y(r) = y(0, r) and y(t, r) = y(0, t, r). Then the fundamental relations (4) and (5) are equivalent to the relations (t-s)y(s,t) + (T-t)y(s,t,T) = (T-s)y(s,T), (8)
630
W. Hurlimann x-\
(r-0X*,>,r) = X.K's>*>* + 1 )-
(9)
*=/
If s = 0, one obtains the relations ty{t) + (T-t)y(t,T)
= Ty{T),
(10)
= YJy(k,k + \).
(11)
r-l
(T-t)y(t,T)
k=t
3
The Cash Flow Polynomial Ring
A discrete-time cash flow is defined to be a collection of current and future payments c = (co, c\, ..., cj), where c, may be strictly positive (long position), strictly negative (short position) or zero (no cash payment). The current time t = 0 arbitrage-free price of a cash flow is denoted by Pc and is uniquely given by (Jarrow 1996, formula (8.2)):
Pc=2>rP(i).
(12)
;=0
This equation expresses the equivalence of cash flows with portfolios of zero coupon bonds. The yield of a cash-flow c is the value y = y(c), which solves the implicit equation Yjci-QxV{-yi)
= Pc.
(13)
i=0
For variable y the expression on the left of (13) defines a function fly), called price-yield function of the cash flow. Setting x = e~y, which is the one-period discount factor of the cash flow, the function fly) transforms into the polynomial function T
£(*) = ! > , • * ' , ;=o
(14)
The Algebra of Cash Flows: Theory and Application
631
which is called price-discount function. Our first result characterizes a cash flow in terms of this polynomial function. Theorem 1. A cash flow is uniquely determined by its pricediscount function. Proof (Costa 1997). If two cash flows have identical pricediscount functions, then they must be identical. To see this, take their difference (go long one and short the other). The resulting cash flow satisfies g(x) = £,-=0 c,-x' = 0. Since the functions 1, x, ..., xT are linearly independent over the real numbers, one has c, = 0 for 0 < i < T, which proves the result. 0 This result allows one to identify a cash flow c with the corresponding price-discount polynomial function g(x). In modern algebra, the set of all real-valued polynomial functions, endowed with an addition and a product operation, is called a real polynomial ring and is denoted by R[x]. Therefore, the set of all discrete-time cash flows coincides with the real polynomial ring R[x], For the limiting case of an infinite time horizon T = oo, one obtains perpetual cash flows, whose price-discount functions are formal power series g{x) = £ ^ 0 ct-xl. These, too, form a ring of power series denoted i?[[x]], which contains R[x] as sub-ring. Note that one is only interested in power series g{x), which converge for |x| < 1. Example 1 (a) The monomial c-xs represents a zero-coupon bond with maturity date s and face value c, whose arbitrage-free price is clearly Pc = c • P(s).
(b) The polynomial 1 - xT can be interpreted as an interest-free loan of one unit over the time interval [0, T\. Indeed, it is equal to the difference of the present value of the amount given in loan without interest and the present value of the debt with due interest represented by this loan. (c) The geometric power series I + x + ... + x' + ... represents a perpetual annuity paying one unit at the beginning of each period.
W. Hiirlimann
632
The polynomial ring R[x] of discrete-time cash flows is, in the language of modern algebra, a unique factorization domain. Therefore, every polynomial g(x) can be factored uniquely up to scalar factors into a finite product of irreducible polynomials g,(x) such that
g(x) = c-Y\gl(x).
(15)
There are two types of irreducible polynomials, the linear polynomials a + bx, b # 0, and the irreducible quadratic polynomials a + bx + ex2, b2 - Aac < 0. Theorem 2. Every discrete-time cash flow can be expressed as a product of one- and two-period cash flows. Proof. This is the unique factorization property of R[x]. 0 The sum and product operations in the polynomial ring R[x] satisfy simple properties. For example, since g(x)-h(x) = h(x)-g(x) for g(x), h(x) e R[x], the polynomial ring is commutative with respect to the product operation. Though the sum of cash flows has an immediate and unique economic interpretation, the multiplication and factorization of cash flows can be given different alternative economic contents. We follow the proposal by Costa (1997). Denoting by (g*h)(x) the economic product of the two cash flows g(x) and h(x), we show in Section 4 that unless the term structure of interest rates is flat, the economic product is not commutative, that is (g*h)(x) * (h*g)(x), and different in general from the ordinary cash flow product (g-h)(x) = g(x)-h(x). The main aim of the present chapter is to introduce new readers to this interesting new topic and continue the study of the non-commutative polynomial ring of cash flows initiated by Costa (1997).
The Algebra of Cash Flows: Theory and Application
4
633
The Economic Multiplication and Division of Cash Flows
Consider a cash flow g(x) = 5T=0 crx' whose current arbitrage-free price (12) is from now on denoted by Pg instead of Pc. Given such a cash flow the following assumption is always made: (A) At time 0 one enters into zero coupon bond forward contracts with maturity n and delivery date / to reinvest (or borrow if ct < 0) the cash payment c; at the forward yield rate y(i, n) from time / to time n. Using the relations (10) and (11), one shows that, under this assumption, the value of g{x) grows to Pg-eny^"' at time n. Consider now two cash flows g(x) = S"^ Cj-x' and h(x) = 2™ dj-x1 such that n + m < T. The (economic) cash flow product, denoted (g*h)(x), is defined to be the finance instrument constructed in three steps as follows: Step 1: Compute the arbitrage-free price Pg = Z"=0 ct-P{i) of g(x), and the forward price Fh(n) of a forward contract on h(x) with maturity n + m and delivery date n such that
where the last equality follows using (4). Step 2: Purchase Fh{n) units of cash flow g(x) at time 0 with all the forward contracts prescribed under assumption (A). Step 3: Enter into Pg-eny^"' forward contracts at price Fh(n) on h(x) as described in Step 1, along with all prescribed reinvestment forward contracts. Since forward contracts are no-cost, the current arbitrage-free price of the defined investment is identical to the cost of the purchase in step 2, namely
W. Hiirlimann
634
P^=Ps-Fh{n) = t t c r d r P { i ) - ^ ^ - .
(16)
r n
1=0 ; = o
\)
The cash flow polynomial of this finance instrument must by the uniqueness of the price-discount function in Theorem 1 and (12) be equal to (use again (4)):
(g.h)(x) = ±±Crdi.^±A.^. /'C'
i=o j=o
(17)
+ iJ
By construction, the first cash flow matures at time n at the accumulated value Pg-eny^n\ Through the contract in step 3, this amount is reinvested into Pg-e"^ units of h(x), which at time n + m mature at the value Pg-e"m-Fh(n)-emy{n'n+m) = [Pg-Fh(n)]-e(n+m)y{n+m\ as should be. This interpretation shows that a cash flow product is defined economically by investing in cash flows serially with all proceeds at maturity of one reinvested in the next (Figure 3 is a diagram representing this construction). 0 Fh(n)-Pg
g(x)
n__ Fh(n)-Pg^y(n) n -Fh(n)
ny{n)
Pg Q
n+m
h(x) n+m PgFh{n) e ^ " ^ " ^
Figure 3. Economic cash flow product.
The above definition extends obviously to a finite cash flow product (gi*g2*—*gr)(x) with arbitrage-free price Pg*gz*...*gr = Pg,-Fg£ni)-Fgs(ni + n2)-...-Fg(jii + ... + nr-\), where gfe) is a polynomial of degree «/. The order of the serial investment is in general not arbitrary. A similar definition of the cash flow product (h *g)(x) yields the arbitrage-free price
635
The Algebra of Cash Flows: Theory and Application
r^WW-tt'^j-Z^-PU).
(18)
r m
1=07=0
\ )
which corresponds to the cash flow polynomial ,1
w N v ^ v 1 J F(m,m + i) ,+ / c rf x
(**«)W = ZZ /- y-^77—f 1=0 y=o
/ i m
•
(19)
rKJS + J)
Besides these economic cash flow products, the cash flow (g-h)(x) = (h-g)(x) = g(x)-h(x) associated to the ordinary commutative multiplication in the polynomial ring R[x] has arbitrage-free price n
p
p
m
c
sh= hg=YL
rdrni
+ J)
(20)
i=0 7=0
and polynomial representation n
m
(g • h)(x) = (h • g)(x) = YLcrdr
x,+J .
(21)
1=0 7=0
The following result provides conditions on the forward rates of interest under which the above alternative cash flow products define the same investment. Theorem 3. Let g(x) = Z"=0 ct-x' and h(x) = S™,, df-x1' be two cash flows such that n + m
o
f(n + j) = f(j),J = 0,...,m-\, f(m + i) = f(i),i = 0,...,n-\ (g-h)(x) = (g*h)(x) o (g-h)(x) = (h*g)(x) » f(k) = f(0),k = \,...,n + m-l. Proof. This is shown in Appendix 1. 0
(22)
(23)
636
W. Hiirlimann
In modern algebraic terms, the set of all polynomial cash flows g(x) e R[x], endowed with the ordinary addition (g + h)(x) = g(x) + h(x) of cash flows and the economic product (g*h)(x) of cash flows, generates an algebraic structure, denoted {J?[x];+,*} and called cash flow polynomial ring. Theorem 3 says that this structure differs from the polynomial ring structure {i?[jc];+,-} endowed with the ordinary commutative product (g-h)(x) = g(x)-h(x). In general, the cash flow polynomial ring defines a non-commutative ring. It is only in the very special case of aflat term structure of interest rates that both rings are commutative and identical. However, in real-life economy, term structures ar never perfectly flat. Therefore, the cash flow polynomial ring is always a non-commutative ring. Corollary 1. The cash flow polynomial ring {i?[x];+,*} is a commutative ring if, and only if, the TSIR is flat, that isf(k) =./(0)> k = 1, 2, ... . Furthermore, if this holds, the rings {i?[x];+,*} and {i?[x];+,-} are isomorphic. Proof. The ring will be commutative if, and only if, the conditions (22) hold for all g(x), h(x) e R[x]. Obviously, this implies the flat TSlRflk) =flO), k = 1, 2,.... By (23) this is sufficient to ensure that (g-h)(x) = (g*h)(x) for all g(x), h(x) e R[x], which implies the second assertion. 0 To conclude the present section, let us illustrate the (economic) cash flow product construction with some examples, which are perceived to be of some practical relevance. Example 2 (a) The ^-period discount bond x" leads to the factorization (Id * Id * ... * Id)(x), where Id(x) = x is a one-period discount bond. This economic cash flow product says that a future payment of one unit at time n can be realized as a current payment of x" units invested and reinvested in a series of one-period forward contracts at the forward yield ratesy(l),y(l, 2), ...,y(n—\, n). (b) The tt-period interest-free loan of one unit, the polynomial l-x",
The Algebra of Cash Flows: Theory and Application
637
leads to the product (g*h)(x) withg(x) = \-x and h(x) = 1 + x + ... + x"_1. It can be realized as a one-period interest-free loan of 1 + x + ... + xn~l units. It will mature at time 1 to have value l - (1 + x +... + x"'1) - (1 + x +... + x"_1) = (; - 1)(1 + x +... + x""1), which is then used to buy an annuity paying (- - 1) units at time 1, 2, ..., n. Since~x-1 is the one-period interest on one unit, this exactly tracks the cash flow produced by the n-period interest free loan. (c) In the limit as n —» oo in example (b), one has 1 = (1 - x)(l + x + ... + x" + ...). A future stream of payments of \ -1 units every period, which can be generated by a current unit, can alternatively be financed by a one-period interest-free loan of (1 + x + ... + x" + ...) units with the proceeds invested in a perpetual annuity. This example illustrates further the fact that cash flows g(x) = CQ + c„x" with Co * 0 have an inverse in i?[[x]]. This suggests that an (economic) division of cashflows is possible in /?[[*]]. However, division by zeros is not allowed. (d) An ft-period coupon bond with coupon payments c and face value \-c generates the cash fiowX*) = c + ex + ... + ex"'1 + x". In the special case n = 3, one has the simple factorization y(x) = g(x)-h(x) with g(x) = a + x, a2 \-a a3 2 h{x) = 7~a' Y-X + X , c = r. \-a+a l-a+a \-a+a The bondholder receives in each period the interest / = c/(l-c). For example, if a = -, the interest is i = 5% and fix) = ^ ( l + 3 x > ( l - 2 x + 7x2).
5
The Probabilistic Notions of Duration and Convexity
In traditional immunization theory, the (modified) duration and convexity of a cash f l o w ^ ) = g(x) = S"=0 crx', with y = -ln(x) the yield, are defined analytically by
638
W. Hiirlimann
f'(y)-_x- x •g'(*) Dug = —-^-^-^-7-7-, the duration, and
(24)
Co. = ^ - ^ = DM. + x2 • ^ - ^ , the convexity.
(25)
f(y)
g
g(x)
g
/O0
g(x)
Given a shift in the term structure of interest rates, that is, the zero-coupon bond price curve changes from P(t, f) to, say, P'(t, f), one is interested in approximations to the current shifted arbitragefree price Pg = E"^ Ci-P'(i), which only depend on the initial term structure and the change in cash flow yield Ay = / - y, where / = ln(x') is the shifted yield (theoretical solution of the equation fly') = g(x') = Pg). One considers the following first and second order approximations to Pg defined and denoted by Plg:=[l-Dug.Ay].Pg, P2-.= \-Dug-Ay
+ ^Cog-(Ay)2 •P.
^
Mathematically, these formulas are just the first and second order Taylor approximations of the price-yield function fly') =fly + Ay) = P' g
'
In contrast to this, the modern probabilistic approach to immunization theory (Shiu 1988, Uberti 1997, Panjer et al. 1998, Section 3.5) considers slightly different risk measures of duration and convexity. Given a discrete-time non-negative cash flow g(x) = S"=0 Cj-x', Cj > 0, define the mathematical object, called cash flow risk, as the discrete arithmetic random variable Rg with support {0, 1,..., n) and probabilities {go, g\,..., g„) such that ^
=
c L _P0) ;
i = oX...,n,
(27)
g
represents the normalized cash flow at time /'. In this framework, the counterparts of the traditional measures (24) and (25) are just the first and second order moments of the cash flow risk, namely
639
The Algebra of Cash Flows: Theory and Application
Dg = £[^ g ], t n e duration, and
(28)
Cg = E[R2g ], the convexity.
(29)
The approximations (26) are replaced by pV:=[\-D g.Ay].Pg, g p(2) ._ l-D -Ay g
+
±Cg-(Ayf
(30) P„.
It is not difficult to see that in general Dg & Dug, Cg * Cog, hence P g * Pg, Pg(2) * Pg, but the differences are usually negligible. Note that equality holds for flat term structures. Some theoretical advantages of the probabilistic approach to immunization theory will be emphasized in Section 6. Furthermore, simulation examples suggest that the approximations (30) outperform in accuracy the traditional ones (26). For these reasons, only the probabilistic approach is retained. Costa has obtained simple composition formulas for the analytical measures (24) and (25) applied to (economic) cash flow sums and products. We develop here their probabilistic counterparts, which are in accord with Costa's formulas and mathematically even more precise in their interpretation. Besides the measures (28) and (29), which refer to the current time of valuation, we need notions of forward duration and convexity. Given a non-negative cash flow g(x) = E"=0 c,-x', c, > 0, the time s forward cash flow risk is the random variable Rg(s) with support {0, 1,..., n) and probabilities {go, g\s,..., g„s} such that (1)
*f = C ' - ^ + ° , * = 0,...,„,
(31)
which represents the time s normalized forward cashflow at time /. Similarly to (28) and (29), define Dg(s) = E[Rg(s)\, the time s forward duration, and
(32)
Cg(s) = E[R^(s)\, the time s forward convexity.
(33)
640
W. Hiirlimann
The construction of the cash flow product (g\*---*gr)(x) as a serial investment suggests that the total duration should be related to the sum of the forward durations of the factors, taken at the times when the corresponding forward contracts are delivered. Theorem 4. Let gt(x), i = 1, ..., r, be the price-discount functions of r non-negative cash flows, with nt - deg[g,(x)] the degree of the polynomial g,(x). Then one has
^ W
=D
+D
*
+ DSM +n2) + Dg(n]+...+nr_l) (34)
SM)
Proof. Using induction it suffices to show the case r = 2. Let g(x) = S"=0 Cj-x' and h(x) = £™0 dyx' be two non-negative cash flows. By (26) and the definition of duration in (28), one has 1 r
r g
h \n)
n
m
Pin + j) Pin)
i=o j=o
(35)
On the other side, using (32) one gets
n + Dh(n) = —tic,P(i) + —fjd,J PgU ' Fh(n)U
P(n + j) J
P(n)
,
(36)
which after rearrangement is seen to be identical to (35). 0 The duration of a sum of cash flows is related to the durations of the cash flow components through the following well-known linear weighted formula (Buhlmann and Berliner 1992, Section 6.7). Theorem 5. Let g,(x), / = 1, ..., r, be the price-discount functions of r cash flows. Then one has the formula D
Sl+...+,,
= t^Dgt
, with wk =Pgj£Pgi,
* = l,...,r . (37)
Proof. The immediate verification is left to the reader. 0 The convexity satisfies similar composition formulas.
641
The Algebra of Cash Flows: Theory and Application
Theorem 6. Let gi(x), i = 1, ..., r, be the price-discount functions of r non-negative cash flows, with nt = deg[g,(x)] the degree of the polynomial gt(x). Then one has
^•..^E^.™*"*^*'!^. k=\ C
+c
=D
g\*-*gr
g\*-*gr
k
= l>->r
(38>
D + c
gi-
l] [ s^-DlM)
(39)
+...+ C. («, +.. .+/ir_,) - D* («! +.. .+«r_,) Proof. The formula (38) is immediate. Using induction it suffices to show the case r = 2 of (39). Let g(x) and h(x) be as in the proof of Theorem 4. Then one has
Fh{n)n
J
P(n) PgFh{n)ti
'
'U
J P
(n)
= Cg+Ch (n) + 2DgDh (n) = D2gth + [Cg - D2g ] + [Ch (n) - D2h (n)\ where the last equality follows by using (34). 0 Expressed in terms of (forward) cash flow risks, (34) and (39) are equivalent to the following mean and variance formulas :
^K,..,gJ=^K1J+^K(«2)J+-+4^("1+-+"r-i)J5 M V
> J = VarK
J+ VarK
("2)J+ - + Var[Rgr (n, +... + nr_x)\.
The forward cash flow risk of the factors are uncorrected. In fact Theorem 10 shows that the cash flow risk of the product is the independent sum of these forward cash flow risks.
642
6
XV.
Hiirlimann
Convex Order and Immunization Theory
In classical immunization theory (Panjer et al. 1998, Section 3), one assumes that assets and liabilities are independent of interest rate movements. In the special situation studied in Hiirlimann (2002a), he consider a portfolio with non-negative asset inflows {A\, ..., Am), occurring at dates {1, ..., m), and non-negative liability outflows {L\, ..., L„), due at dates {1, ..., «}, with time t = 0 value m
n
r=^-^=S«*-X^=0,
(40)
where ak = AkP(k) and £ . = LjP(j) are the current prices of the asset and liability flows. One is interested in the possible changes of the value of a portfolio at a time immediately following the current time t = 0, under a change of the term structure of interest rates (TSIR) from P(s) to F(s) such fhatj(s) = P'(s)/P(s) is the shift factor. Immediately following the initial time, the post-shift change in value is given by
AV = V-V = Zl&fik) - Yj-AfW •
(41>
The classical immunization problem consists of finding conditions under which (41) is non-negative, and give precise bounds on this change of value in case it is negative. In the probabilistic setting of Section 5, the cash flow risks of the asset and liability cash flows Z™, AkXk and Z" Lpj are called asset and liability risks. These random variables are denoted A and L and take the probabilities
643
The Algebra of Cash Flows: Theory and Application
pk=Pr(A = k) = 4j^-,
k = l,...,m,
A
<7,=Pr(Z = j ) = - ^ - ^ ,
(42) j = \,...,n.
"L
In view of (40) the normalization assumption PA = Pi = 1 will be made throughout. The durations, convexities and the new Msquare indices of the asset and liability cash flows are according to Section 5 defined by DA = E[A], DL = E[L], CA = E[A\ CL = E[L2], MA2 = Var[A], Mi2 = Var[L]. In this setting, the change in portfolio value (41) identifies with the mean difference AV = E[f(A)]- E[f(L)].
(43)
Under duration matching, that is DA = DL, assumed in immunization theory, the sign of the difference (43) is best analyzed within the context of stochastic orders. The notion of convex order, as first propagated by Rothschild and Stiglitz (1970, 1971, 1972), yields the simplest and most useful results. Definition 1. A random variable X precedes Y in convex order, written X
E[f(X)] < E[f[T)] for all convex real functions fix) for which the expectations exist; (CX2) E[(X- d)+] < E[(Y- d)+] for all real numbers d; (CX3) E[\X- d\] <E[\Y- d\] for all real numbers d; (CX4) There exists a random variable Y =d Y (equality in distribution) such that £[F|Z] = X with probability one. The equivalence of (CXI), (CX2) and (CX4) is well-known from the literature (Kaas et al. 1994, Shaked and Shanthikumar 1994). The equivalence of (CX2) and (CX3) follows immediately from the identity (X- d)+ = \ [\X - d\ + X - d] using that the means are equal. The partial order induced by (CX3) has sometimes been called dilation order (Shaked 1980/82).
644
W. Hurlimann
Recall three main results in immunization theory. Theorem 7. Let A and L be asset and liability risks with equal duration DA = DL, and a convex shift factor J(s) of the TSIR. If V= 0, the portfolio of assets and liabilities is immunized, that is, AV= E[f(A)] - E[f(L)] > 0 if, and only if, one has L £[ji.-£|],
k = \,...,m,
m>n.
(44)
Proof. This generalization of earlier results by Fong and Vasicek (1983a,b) (sufficient condition under constant shift factors) and Shiu (1988) (necessary condition under convex shift factors) has been first derived in Hurlimann (2002a). 0 In case the shift factor of the TSIR is not convex, immunization results can be obtained through generalization of the notion of convex function (see Hurlimann (2002b) for another extension). Definition 2. Given are real numbers a and /?, and an interval / c R. A real function fix) is called a-convex on I if fix) - \ax2 is convex on /. It is called convex-jB on / if 2/3x -f(x) is convex on /.
Theorem 9. Let A and L be asset and liability risks such that L
-a{M\ -M2L)
= E[f(A)]-E[f(L)]<^(M2A
-M2L).
(45)
Proof. This result by Uberti (1997) is a simple consequence of (CXI) in Definition 1. By assumption, one has the inequalities
645
The Algebra of Cash Flows: Theory and Application
E[f(L)]--aE[L2]<E[f(A)]--aE[A2\ 1
(46)
1
-f3E[L2]-E[f{L)\<-(3E[A2]-
E[f(A)\
But £^ 2 j- J E[l 2 J=Far[^]-Far[l]=M^ -M2L, hence (45). 0
7
Immunization of Liability Cash Flow Products
We discuss the immunization of liability cash flow products (g*h)(x), where g(x) = S"=0 crx' and h(x) = IT 0 dj-x! are two nonnegative liability cash flows. We are looking for asset cash flow products (v*w)(x), where v(x) = 2"=0 at-x' and w(x) = 2 ^ bj-x* are non-negative asset cash flows, which immunize the product liability (g*h)(x) under convex shift factors. Denote by Rv and Rg the cash flow risks of v(x) and g(x), and by Rw(n) and Rhin) the forward cash flow risks of w(x) and h(x). Theorem 10. Given are the above liability cash flows g(x) and h(x), and the asset cash flows v(x) and w(x). If Rg
(47)
where © denotes the sum of independent random variables. Proof This is shown in Appendix 2. 0 This result reveals two interesting properties. Firstly, to immunize a liability cash flow product under convex shift factors, it suffices by Theorems 7 and 10 to immunize the factors serially at the delivery dates of the associated forward contracts. Secondly, the instructive proof of the relationship (47) identifies the asset and liability cash flow product risks as convolutions of the risks of the asset and liability factors. This new probabilistic representation of
646
W. Hurlimann
non-negative cash flow products explains mathematically the validity of the mean and variance formulas following Theorem 6.
8
Examples
In the framework of Section 6, let L be a given liability risk with support {1, ..., n) and probabilities gj,j = 1, ..., n. We look for an asset risk A with support {1, ..., m},m>n, and probabilities^, k = 1, ..., m, such that the portfolio of asset and liabilities, with initial value zero, is immunized against convex shift factors of the TSIR. By Theorem 7, this holds if, and only if, one has L
= max{i(MJ -M2L)}= m^{\(Var[A}-
Var[L\)} (48)
over all finite arithmetic random variables A satisfying the restriction L 3, this maximum equals
^axQ^-E^-iMrc-O-A,
(49)
and is attained at the asset risk A with two atoms and probabilities
I<'-^T,>-')P, qj=0,
j =
?.~I:,('-DA.
(50)
2,...,n-\. n
Expressed in terms of the duration DL = DA = T]/p,, one has n-DL
ax=
DL-\
f-, qn=-^-r. n-\ n-\
It follows that the asset convexity equals
....
(51)
647
The Algebra of Cash Flows: Theory and Application
CA=E[A2]=(n + \)DL-n.
(52)
Therefore, the maximum Shiu measure is alternatively described in terms of the liability duration and convexity by max{\(M2A-M2L)}=±[(n
+
l)DL-n-CL],
(53)
which is quite handy for practical use. In the setting of Section 7, consider further the immunization of a liability product (g*h)(x), where g(x) = 2"_0 ct-xl and h(x) = X™0 dyx1 are non-negative cash flow polynomials. According to Theorem 10, it suffices to immunize the cash flow risks Rg and Rh(n) separately. For this, choose simply the diatomic cash flow risks with maximum Shiu risk Rg
l
'
(54)
A(»)"l
m-\ — ,' wm m-\ m =• Theorem 10 guarantees that the cash flow liability product (g*h)(x) is immunized against convex shift factors by the cash flow asset product (v*w)(x). To state Uberti's bounds, one needs the Shiu measure of the portfolio cash flow. A calculation using the Theorems 4 and 6, as well as (52), yields the formulas w, =
Dv,w = Dgth,
Cgth = D2gth +Cg-D2g+Ch
Cvtw = D2g,h +(n + \)Dg-n-D2g+(m
(») - D2h {n), + \)Dh(n)-m-
D2 (»). ( 5 5 )
The resulting Shiu measure turns out to be by (53) equal to the sum of the maximum Shiu measures of the component cash flow risks. Therefore, the immunization of a liability product is controlled by the immunization of its liability components.
648
9
W. Hiirlimann
Conclusions
Deterministic discrete-time cash flows have been identified with price-discount polynomial cash flows. The set of all polynomials, endowed with the ordinary addition and the economic product of cash flows, generates the abstract structure of a non-commutative polynomial ring. Therefore, cash flow analysis is reduced to the analysis of properties of an algebraic structure. As a single application, the immunization of liability cash flow products has been discussed. In particular, it has been shown that the immunization of a liability product can be controlled in a simple way by the immunization of its liability components using cash flow risks with maximum Shiu measures. The above new type of cash flow analysis deserves special attention in future work. It is an example of how abstract algebraic structures may become relevant in real-life economics. The future potential of "general algebra" in applications is in my opinion enormous. As a pillar of modern mathematics, its application in the insurance industry should not be neglected.
Acknowledgment This chapter is dedicated to the 60 birthday of my algebra teacher and thesis adviser M.-A. Knus on April 14, 2002.
The Algebra of Cash Flows: Theory and Application
649
References Buhlmann, N. and Berliner, B. (1992), Einfuhrung in die Finanzmathematik, Band 1, Uni-Taschenbiicher 1668, Verlag Paul Haupt, Bern-Stuttgart-Wien. Costa, D.L. (1997), "Factorization of bonds and other cash flows," in Anderson, D.D. (ed.), Factorization in Integral Domains, Lecture Notes in Pure and Applied Mathematics, vol. 189, Marcel Dekker. Fong, H.G. and Vasicek, O.A. (1983a), "A risk minimizing strategy for multiple liability immunization," unpublished manuscript. Fong, H.G. and Vasicek, O.A. (1983b), "Return maximization for immunized portfolios," in Kaufmann, G.G., Bierwag, G.O., and Toevs, A. (eds.), Innovation in Bond Portfolio management: Duration Analysis and Immunization, JAI Press, 227-238. Hurlimann, W. (2002a), "On immunization, stop-loss order and the maximum Shiu measure," Appears in Insurance: Mathematics and Economics. Hurlimann, W. (2002b), "On immunization, s-convex orders and the maximum skewness increase," Manuscript, available at http ://www.mathpreprints. com. Jarrow, R.A. (1996), Modelling Fixed Income Securities and Interest Rate Options, McGraw Hill Series in Finance. Kaas, R., van Heerwaarden, A.E., and Goovaerts, M.J. (1994), Ordering of Actuarial Risks, CAIRE Education Series 1, Brussels. Panjer, H.H. (ed.) (1998), Financial Economics, with Applications to Investments, Insurance and Pensions, The Society of Actuaries. Pliska, S.R. (1997), Introduction to Mathematical Finance, Discrete Time Models, Blackwell. Rothschild, M. and Stiglitz, J.E. (1970), "Increasing risk: I. A definition," Journal of Economic Theory 2, 225-243. Rothschild, M. and Stiglitz, J.E. (1971), "Increasing risk. II. Its economic consequences," Journal of Economic Theory 3, 66-84.
650
W. Hiirlimann
Rothschild, M. and Stiglitz, J.E. (1972), Addendum to "Increasing risk: I. A definition," Journal of Economic Theory 5, 306. Shaked, M. (1980), "On mixtures from exponential families," Journal of the Royal Statistical Society B 42, 192-198. Shaked, M. (1982), "Ordering distributions by dispersion," in Johnson, N.L., and Kotz, S. (eds.), Encyclopedia of Statistical Sciences, vol. 6, 485-490. Shaked, M. and Shanthikumar, J.G. (1994), Stochastic Orders and Their Applications, Academic Press, New York. Shiu, E.S.W. (1988), "Immunization of multiple liabilities," Insurance: Mathematics and Economics 7, 219-224. Uberti, M. (1997), "A note on Shiu's immunization results," Insurance: Mathematics and Economics 21, 195-200.
651
The Algebra of Cash Flows: Theory and Application
Appendix 1 Proof of Theorem 3 From the uniqueness representation Theorem 1 one has, using (17) and (19), that (g*h)(x) = (h*g)(x) if, and only if, P(n + j) P(m + i) — — •P(m) = — --Pin), P(J) P(i)
. . i = 0,...,n,j = 0,...,m.
(56)
In the special case / = 0, one has the conditions P(n + j) = P(n)-P(J),
j = 0,...,m.
(57)
Using (1.1) and setting successivelyy = 1,2,..., m, one sees that f(n + j) = /UX j = 0,...,m-\. (58) Similarly, setting j = 0 in (56), or inserting (57) into (56), one has the conditions P(m + i) = P(m)-P(i), i = 0,...,n, (59) which lead to f(m + i) = f(i), i = 0,...,n-l. (60) This shows (22). Similarly (g-h)(x) = (g*h)(x) if, and only if, Pin + j)
P(i + j)
—rr^— = —r—-,
n
i = 0,...,n, j =0,...,m.
.,n (61)
Pin) Pi}) For /' = 0 this is (57), which implies (58). Inserting (57) into (61) implies the further conditions P(i + j) = P(i)-P(j),
i = 0,...,n,j = 0,...,m,
(62)
from which one deduces without difficulty that j(k) =J(Q), k=\, ..., n + m - 1. This shows the first equivalence in (23). The second one is shown similarly. 0
W. Hurlimann
652
Appendix 2 Proof of Theorem 10 By Section 6 assume Pg = Pv = Ph = Pw = 1. By the relations (27) and (31) one has P(0
'
P(n + j) ( 6 3 )
Pf \
a; = ——, b, = — . J P(i) P(n + j) By construction of the cash flow product, the coefficients of &**)(*) = 2 ^ e*x* are v^ to
,
F(n,n + k-i) F(i,k)
1 ^ P(k) to
,
,rA.
It follows that the price of the product is one unit: n+m
n+m(
k
\
Using again (27) the cash flow risk Rg*h has probabilities r g*A
= Pr(Rg®Rh(n)
= k),
/=o
(66)
k = 0,...,n + m.
A similar representation holds for the cash flow risk Rv*w. Since the convex order is preserved under convolution (Shaked and Shanthikumar 1994, Theorem 2.A.6), the result follows. 0
Index Bootstrap Bias estimates, 526 Concept, 525 Confidence intervals, 527 Dependent data, 528 Introduction, 524 Standard error, 526 Bootstrap applications Cash-flow analysis, 540 Analysis, 555 Company Assumptions, 543 Surplus Cumulative Distribution, 551 Surplus-value, 547 Interest rates process, 540 Assumptions, 545 Nonparametric Bootstrap, 541 Mortality tables, 530 Carriere's law, 530 Fitting, 532 Model accuracy, 535 Properties, 533
A Algebra of cash flows, 625 ANFIS, 416 Artificial intelligence Background, 340 Asset allocation, 587 Automatic relevance determination (ARD), 367 MLP-ARD, 376
B Backpropagation algorithm, 64, 344 Bayesian decision method, 314 Fuzzy, 324 Bayesian learning neural networks, 365 Bayesian model applications Bonus-malus systems Illustrations, 449 Technical results, 444 Bayesian models Likelihood, 437 Models, 437 Robustness Transition rules, 447 Structure functions, 438 Conjugate posterior, 439 Bias-variance trade-off, 147 Bonus-malus system, 236 Premium calculation, 441
c Capital-at-Risk (CaR), 606 Model and CaR, 606 Problem formulation, 609 Cash flow Duration and convexity, 637 Economic multiplication and division, 633
653
Index
654 Immunization, 645 Polynomial ring, 630 Term structure of interest rates, 627 CHAID, 162, 283 and Linear Model, 162 Chain ladder method, 71 Classification and regression trees (CART), 284, 500 Hybrid models, 503 Clinical methods, 275 Confusion matrix, 213 Convex order and immunization theory, 642 Curse of dimensionality, 139 Customer retention Background, 466 Customer information, 477
D Data mining, 493 and on-line analytical processing, 499 and traditional statistical techniques, 498 Application areas, 497 Integrated approach, 199 Introduction, 495 Methodologies, 500 CART, 500 Hybrids, 503 MARS, 502 and logistic regression, 510 Data mining applications Health insurer claim cost, 512 Projected severity, 504 Data warehousing, 273 Decision making
Fuzzy states and alternatives, 328, 331 Dimension reduction, 86
E Evaluating results Empirical evaluation, 388 Mean-squared error comparisons, 174 Model fairness, 179 Evidence framework, 378
F Filter/smoother, 583 Fuzzy immunization theory, 308 Fuzzy Inference System (FIS), 22 Fuzzy logic, 19 Advantages and disadvantages, 404 C-means algorithm, 23 Decision making, 234 Fuzzy arithmetic operations, 304 Fuzzy inference system, 22 Fuzzy numbers, 20, 303 Linguistic variables, 19 Fuzzy logic applications Asset and investment models, 31, 290 Asset and liability management, 304, 626 Bonus-malus system, 236 Classification, 26, 251 Decision making, 327 Immunization theory, 308 Pricing, 30 Projected liabilities, 33 Some extensions, 250
Index
The future, 254 Underwriting, 24 Fuzzy set theoretical approach, 301
G Gains chart, 509 Generalized linear model (GLM), 75 Genetic algorithm applications Asset allocation, 37 Classification, 36 Underwriting, 37 Genetic algorithms Advantages and disadvantages, 404 Implementing GAs, 411 Overview, 35 Population regeneration factors, 35
H Health insurance claims Modeling, 514 Evaluation, 517 High-risk populations Data sources, 267 Identification (targeting), 267 Implementation issues, 271 Prediction methodologies, 266, 271,273 Clinical methods, 266, 275 Statistical methods, 43, 101, 136,266,277
I Implementation issues, 271 Input relevance determination, 376 Insurance market, 230
655
K Kohonen network, 8
L Linear model, 164, 227 Link function, 160 Logistic function, 69 Logistic regression applications Customer retention, 465 Empirical results, 481 Logistic regression models, 465 Estimation and inference, 473 Model specification, 470 Modeling process, 475
M Mathematical objectives Fairness criterion, 152 Evaluating, 179 Precision criterion, 149 Proof of the equivalence, 191 MATLAB Clustered using, 204 Mean squared error (MSE), 146 Mean-variance Multi-period, 590 Single-period Markowitz, 589 Merging soft computing technologies, 401 FISs Tuned by NNs, 415 GAs controlled by FL, 403 GAs Controlled by FL, 419 Neuro-fuzzy-genetic systems, 421 NNs controlled by FL, 403 NNs Generated by GAs, 410 Merton model, 597 Mixture models, 181, 184
Index
656 Models (see also ind. listings) CHAID decision trees, 162 Classification and regression trees (CART), 15, 284, 500 Filter/smoother, 583 Generalized linear model, 99, 160, 493 Greedy multiplicative model, 159, 160, 175 Linear model, 157 Linear models, 493 Mean-variance Multi-period, 590 Single-period Markowitz, 589 Merton, 597 Mixture models, 172, 181, 184 Multivariate adaptive regression splines (MARS), 502 Negative binomial-generalized pareto, 439 Negative binomial-generalized pareto, 439 Ordinary neural network, 163 Partitioning around mediods (PAM), 580 Poisson-generalized inverse Gaussian, 440,449 Softplus neural network, 168 Support vector machine, 170 Table-based methods, 158 Models for firm failure, 338 Multilayer perceptron (MLP), 6,54 Multi-period mean-variance model, 590 Multivariate adaptive regression splines (MARS), 502
N Negative binomial-generalized pareto, 439 Neural network Advantages and disadvantages, 404 Background, 54, 284, 340 Bayesian learning, 365 General model, 341 Goodness of fit, 122 Hidden units, 166 Inputs and weights, 405 Interactions, 84 Learning rates, 408 Momentum coefficients, 408 Overfitting, 153 Processing units and layers, 343 Structure, 55 Supervised NNs, 5 Testing variable importance, 110 Unsupervised NNs, 8 Visualizing results, 122 Weight decay, 158 Neural network applications Asset and investment models, 13 Auto claim fraud, 365 Automobile insurance ratemaking, 137 Classification, 12, 368 Fraud, 105 Insolvency, 14 Life insurer insolvency, 345 P&L insurer insolvency, 356 Price sensitivity, 218 Projected liabilities, 17 Risk sharing pool facilities, 183 Simulated data, 119 Underwriting, 11, 117
657
Index
Neural network examples Modeling loss development, 81 Nonlinear function Complex, 71 Simple, 57 One node neural network, 60 Neural processing unit (neuron), 5, 342 Neural-fuzzy model, 409 Non-life insurance model Fuzzy logic applications, 236
o Optimal investment strategy for insurance portfolio, 611 Optimal portfolio of policy holders, 220
P Parameter optimization, 146 Partitioning around mediods (PAM), 580 Penalized criterion, 146 PIP claims data, 384 Poisson-generalized inverse Gaussian, 440, 449 Population risk management High-risk member, 262 Prediction of claim cost Fuzzy c-means clustering model, 205 Heuristic model, 206 K-means clustering model, 205 Prediction of retention rates Combining small clusters, 215 Decision thresholds, 210 More homogeneous models, 213
Neural network model, 101, 136, 186,209 Preprocessing Factor analysis, 86 Principal components analysis, 86 Price sensitivity analysis, 218
Q Qualitative dependent variable models, 470
R Receiver operating curve (ROC), 374,485 Resampling methods See bootstrap, 523 Retention Rate Prediction of, 209 Ridge regression, 146 Risk classification, 201 Fuzzy c-means clustering model, 204 Heuristic model, 204 K-means clustering model, 203 Risk management economic model Effectiveness, 250 Using, 295 Risk management techniques, 293 Estimating return, 295 Interventions, 293 Optimizing inputs, 295 Rule-based model, 176
s Severity trend models, 58 Soft computing technologies
658 Advantages and disadvantages, 403 Statistical learning theory Hypothesis testing, 144 Parameter optimization, 146 Stock trading, 563 Asset universe, 566 Model description, 571 Trading strategies, 571 Two step process, 564 Support vector machines (SVM), 170 Support vectors, 171
Index
T Testing variable importance, 110 Transfer function, 165
V Value-at-risk (VaR), 600 Model and problem formulation, 601 Solution approach, 604
List of Contributors
Yoshua Bengio Department of Computer Science and Operations Research Universite of Montreal CP 6128 succ Centre Ville Montreal Quebec H3C 317 Canada Patrick L. Brockett Management Science and Information Systems Department College of Business Administration The University of Texas Austin TX 78712 N. Scott Cardell Salford Systems 8880 Rio San Diego Drive Suite 1045 San Diego CA 92108 Raquel Caro Carretero Departamento de Metodos Cuantitativos Universidad Pontificia Comillas Facultad de Ciencias Economicas C/ Alberto Aguilera 23 28015 Madrid Spain Chiu-Cheng Chang Graduate Institute of Management and Department of Business Administration Chang Gung University 259 Wen-Hwa 1st Road Kwei-Shan Tao-Yuan 333 R.O.C Taiwan
659
660
List of Contributors
Nicolas Chapados Department of Computer Science and Operations Research Universite of Montreal CP 6128 succ Centre Ville Montreal Quebec H3C 3J7 Canada Ka Chun Cheung University of Hong Kong Pokfulam Road Hong Kong Steve Craighead Nationwide Financial Services One Nationwide Plaza Columbus OH 43215 Guido Dedene Applied Economic Sciences Katholieke Universiteit Leuven Dept. TEW Naamsestraat 69 B-3000 Leuven Belgium Germain Denoncourt Alpha Compagnie d'Assurances Inc 430 St-Georges Drummondville QC J2C 4H4 Canada Chresten Densgsoe Department of d'Econometria Universitat de Barcelona Diagonal 690 08034 Barcelona Spain Richard A. Derrig Automobile Insurers Bureau of Massachusetts Insurance Fraud Bureau of Massachusetts 7th Floor 101 Arch Street Boston MA 02110
List of Contributors
Charles Dugas Department of Computer Science and Operations Research Universite of Montreal CP 6128 succ Centre Ville Montreal Quebec H3C 3J7 Canada Ian Duncan Lotter Actuarial Partners 915 Broadway Suite 607 New York NY 10010 Christian Fournier Insurance Corporation of British Columbia 151 West Esplanade North Vancouver British Columbia V7M 3H9 Canada Louise A Francis Francis Analytics and Actuarial Data Mining, Inc. 706 Lombard St. Philadelphia PA 19147 Linda L. Golden University of Texas at Austin's McCombs School of Business Marketing Dept Campus Mail Code: B6700 Austin TX 78712 Emilio Gomez-Deniz Department of Metodos Cuantitativos en Economia y Gestion Titular de Universidad Campus Universitario de Tafira 35017 Las Palmas de Gran Canaria Spain Montserrat Guillen Department of d'Econometria Universitat de Barcelona Diagonal 690 08034 Barcelona Spain
661
662
Werner Hurlimann Winterthur Life & Pensions General Guisan-Strasse 40 Postfach 300 8401 Winterthur Jaeho Jang Center for Management of Operations & Logistics The University of Texas at Austin Mail code: B6501 1 University Station Austin Texas 78712 Bruce Klemesrud Nationwide Financial PO Box 2399 Columbus OH 43216 Inna Kolyshkina Global Risk Management Solutions (Actuarial) Pricewaterhouse Coopers Level 14, 201 Sussex Street PO Box 2650 Sydney 1171 Krzysztof M. Ostaszewski Actuarial Program 313G Adlai Stevenson Hall Campus Box 4520 Illinois State University Normal IL 61790-4520 Jan Parner Business Intelligence Unit Codan Forsikring Denmark
List of Contributors
List of Contributors
Ana M. Perez-Marin Departamento de Econometna, Estadistica y Economia Espanola Facultad de Ciencias Economicas y Empresariales de la Universidad de Barcelona Av. Diagonal, 690 08034 Barcelona Spain Grzegorz A. Rempala Associate Professor of Statistics Department of Mathematics 226A Natural Sciences Bldg. University of Louisville Louisville, KY 40292 USA Arthur Robb Landacorp Inc. 4151 Ashford Dunwoody Road Suite 505 Atlanta GA 30319 Arnold F. Shapiro Actuarial Science Program Smeal College of Business The Pennsylvania State University University Park PA 16802 Kate A. Smith School of Business Systems Faculty of Information Technology PO Box 63B Monash University Victoria 3800 Australia Dan Steinberg Salford Systems 8880 Rio San Diego Drive Suite 1045 San Diego CA 92108
663
664
List of Contributors
Francisco Vazquez-Polo Department of Quantitative Methods University of Las Palmas de G.C. Campus Universitario de Tafira 35017 Las Palmas de Gran Canaria Spain Stijn Viaene Leuven Institute for Research in Information Systems (LIRIS) K.U. Leuven Departement Toegepaste Economische Wetenschappen (TEW) Naamsestraat 69 B-3000 Leuven Belgium Pascal Vincent Department of Computer Science and Operations Research Universite of Montreal CP 6128 succ Centre Ville Montreal Quebec H3C 3J7 Canada Chuanhou Yang Risk Management and Insurance Program Red McCombs School of Business University of Texas at Austin Austin TX 78712 Hailiang Yang Department of Statistics and Actuarial Science University of Hong Kong Pokfulam Road Hong Kong Ai Cheo Yeo Monash University School of Business Systems PO Box 63B Victoria 3800 Australia